AI & Machine Learning

Running Large Language Models on Your Own Hardware

Last year I spent $4,200 building a local inference rig that could run a 70B parameter model at reasonable speed. A colleague looked at my setup and asked the obvious question: 'Why not just use the API? That's like eight years of API credits.' He wasn't wrong about the math. But he was wrong about the calculus.

Running large language models locally used to be the domain of research labs with rack-mounted GPU clusters. That's changed fast. Consumer hardware, quantization techniques, and optimized inference engines have brought 70B+ models within reach of dedicated hobbyists and small teams. Devices like the Tinybox are pushing even further — purpose-built hardware designed to run 120B parameter models offline, no cloud required.

But 'possible' and 'practical' are different things. Let's talk about what it actually takes to run large models locally, when it makes sense, and when you're better off paying for an API.

Why Run Models Locally at All?

The API model — sending your data to a cloud provider, getting results back — works fine for most use cases. It's simpler, cheaper per query at low volumes, and always gives you access to the latest models. So why would anyone bother with local inference?

  • Privacy and compliance. Some data can't leave your network. Medical records, legal documents, proprietary code, financial data — regulated industries often have hard requirements about data residency. Sending patient records to an API endpoint, even an encrypted one, may violate HIPAA. Local inference keeps everything on-premises.
  • Latency. API calls involve network round trips, potential queue wait times, and rate limits. Local inference has zero network latency and you're never in a queue. For interactive applications — real-time coding assistants, on-device translation, voice interfaces — the difference between 50ms and 500ms is the difference between 'responsive' and 'sluggish.'
  • Cost at scale. API pricing is per-token. At low volumes, it's negligible. At high volumes, it compounds brutally. A team doing heavy code review, document analysis, or batch processing can burn through thousands of dollars per month in API costs. Local hardware has a fixed cost — once you've paid for it, inference is effectively free.
  • Availability. Cloud APIs go down. Rate limits get imposed. Pricing changes without notice. Models get deprecated. If your product depends on a third-party API, you're at the mercy of their business decisions. Local inference means your capability doesn't evaporate because someone else's server has a bad day.
  • Experimentation freedom. API providers have usage policies. They decide what you can and can't do with the model. Local models have no such restrictions — you can fine-tune them, modify them, use them for any purpose, and run them as many times as you want.

The Hardware Reality

The fundamental constraint for LLM inference is memory, not compute. A model's parameters need to fit in memory (GPU VRAM or system RAM) before you can do anything with them. A 70B parameter model in 16-bit floating point requires approximately 140 GB of memory. That's more than any single consumer GPU offers.

This is where quantization changes the game. By reducing the precision of model weights from 16-bit to 8-bit, 4-bit, or even 2-bit, you can dramatically reduce memory requirements:

Memory requirements for a 70B parameter model:
FP16 (full precision):  ~140 GB  → requires multiple A100s
INT8 (8-bit quant):     ~70 GB   → requires 3x RTX 4090 (72 GB total)
Q4_K_M (4-bit quant):   ~40 GB   → fits on 2x RTX 3090 or 1x A6000
Q2_K (2-bit quant):     ~25 GB   → 1x RTX 4090 (24 GB) with a few layers offloaded to CPU

For a 120B parameter model:
FP16:                   ~240 GB  → enterprise GPU territory
INT8:                   ~120 GB  → 5x RTX 4090 or purpose-built device
Q4_K_M:                 ~70 GB   → 3x RTX 4090
Q2_K:                   ~40 GB   → 2x RTX 4090

4-bit quantization (Q4_K_M in the llama.cpp ecosystem) is the current sweet spot. Quality degradation is measurable but often acceptable for practical use — most people can't distinguish Q4 output from full precision in blind tests. 2-bit quantization noticeably impacts quality, especially on reasoning-heavy tasks, but still works for simpler applications like text classification or summarization.
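The figures above follow from a simple rule of thumb: weight memory ≈ parameter count × bits per weight ÷ 8, with the KV cache and activations adding more on top. A quick sketch of that arithmetic — the effective bits-per-weight values for the k-quant formats are approximations, not exact llama.cpp constants:

```python
# Rough weight-memory estimate per quantization level.
# The k-quant bits-per-weight values are approximate effective
# averages (k-quants mix precisions), not exact format constants.
BITS_PER_WEIGHT = {
    "fp16": 16,
    "int8": 8,
    "q4_k_m": 4.5,
    "q2_k": 2.6,
}

def memory_gb(params_billions: float, quant: str) -> float:
    """Weight memory only; KV cache and activations add overhead
    that grows with context length and batch size."""
    bytes_per_weight = BITS_PER_WEIGHT[quant] / 8
    return round(params_billions * bytes_per_weight, 1)

for quant in BITS_PER_WEIGHT:
    print(f"70B @ {quant}: ~{memory_gb(70, quant)} GB")
```

Running this reproduces the table's 70B column to within rounding: ~140 GB at FP16, ~70 GB at INT8, and roughly 40 GB at 4-bit.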

Consumer Hardware Builds

If you're building your own local inference setup, you have three basic paths, each with different price-performance profiles.

The Single GPU Path

One RTX 4090 (24 GB VRAM, ~$1,600) can run 4-bit quantized models up to about 30B parameters comfortably, or 70B models at aggressive 2-bit quantization. For most 7B–13B models, it's overkill — you'll get 40+ tokens per second, which is faster than most people can read. This is the easiest path: buy a GPU, install llama.cpp or Ollama, and go.

The Multi-GPU Path

Two or more GPUs let you split a model across devices (tensor parallelism). Two RTX 3090s (48 GB total VRAM, ~$2,200 used) can comfortably run 4-bit 70B models. The catch: you need a motherboard with enough PCIe lanes and physical space for multiple full-size GPUs. Airflow becomes a real concern — two 350W GPUs in a case generate serious heat.

The Unified Memory Path

Apple Silicon Macs with large unified memory offer a surprisingly viable option. An M2 Ultra with 192 GB of unified memory can fit a full-precision 70B model entirely in memory. The inference speed is slower than dedicated GPUs — maybe 10-15 tokens per second for a 70B model — but the simplicity is hard to beat. No driver issues, no multi-GPU configuration, no thermal management. Just a Mac Studio sitting on your desk running a 70B model.

The M-series chips achieve this through unified memory architecture: the CPU and GPU share the same memory pool, so there's no bottleneck from copying data between CPU RAM and GPU VRAM. The memory bandwidth is lower than a dedicated GPU setup, which is why inference is slower, but having 192 GB of addressable memory in a device that draws 60 watts is genuinely impressive.
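Why bandwidth sets the speed limit: generating each token requires streaming every weight through the processor once, so decode speed is bounded above by memory bandwidth divided by model size. A back-of-the-envelope sketch — the 800 GB/s figure is the M2 Ultra's advertised bandwidth, and real throughput lands well below this ceiling:

```python
def max_tokens_per_sec(bandwidth_gb_s: float, model_size_gb: float) -> float:
    """Theoretical decode-speed ceiling for bandwidth-bound inference:
    each generated token reads the full set of weights once."""
    return round(bandwidth_gb_s / model_size_gb, 1)

# M2 Ultra (~800 GB/s advertised) running a 4-bit 70B model (~40 GB):
print(max_tokens_per_sec(800, 40))   # ceiling of ~20 tokens/s
```

A ceiling of ~20 tokens per second is consistent with the 10-15 observed in practice, since real workloads never achieve full bandwidth utilization.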

Purpose-Built Inference Devices

The newest category is purpose-built local inference hardware — dedicated devices designed specifically to run large models efficiently. These aim to solve the multi-GPU hassle: instead of cobbling together consumer GPUs with cable management nightmares, you get an appliance that's designed from the ground up for inference.

The appeal is obvious. You plug it in, point your application at it, and it runs your model. No driver conflicts, no CUDA version management, no thermal throttling because someone put three GPUs too close together. The trade-off is cost — purpose-built devices typically cost more per FLOP than equivalent consumer GPUs. You're paying for integration, reliability, and not having to debug PCIe lane allocation.

For small companies that need local inference but don't have hardware engineers on staff, these devices make sense. For hobbyists who enjoy building things, DIY multi-GPU rigs remain cheaper and more flexible.

The Software Stack

Hardware is only half the story. The inference software stack has evolved rapidly, and the right software choice can double your throughput on the same hardware.

  • llama.cpp — The Swiss army knife of local inference. Written in C/C++, runs on everything from Raspberry Pis to multi-GPU servers. Supports dozens of model architectures and quantization formats. Not always the fastest, but the most portable and actively maintained.
  • vLLM — Optimized for throughput on NVIDIA GPUs. Uses PagedAttention to efficiently manage GPU memory, which dramatically improves batched inference. If you're serving multiple users from a single machine, vLLM is typically the best choice.
  • Ollama — The 'Docker for LLMs' approach. Wraps llama.cpp in a user-friendly interface with a model registry. Run ollama run llama3:70b and it downloads the model, configures quantization, and starts serving. Excellent for getting started, but less configurable than raw llama.cpp.
  • MLX — Apple's machine learning framework, optimized for Apple Silicon. If you're on an M-series Mac, MLX typically gives better performance than llama.cpp by leveraging the unified memory architecture more effectively.
# Getting started with Ollama (the easiest path)
# Install: https://ollama.ai
# Run a 7B model (downloads automatically, ~4GB)
$ ollama run mistral
# Run a 70B model (needs ~40GB RAM/VRAM)
$ ollama run llama3:70b-instruct-q4_K_M
# Serve as an API endpoint (OpenAI-compatible)
$ ollama serve
$ curl http://localhost:11434/v1/chat/completions \
    -H 'Content-Type: application/json' \
    -d '{
      "model": "mistral",
      "messages": [{"role": "user", "content": "Explain TCP handshakes"}]
    }'
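Once the curl call works, any HTTP client can drive the same endpoint. A minimal Python sketch using only the standard library — it assumes an Ollama server is already listening on the default port 11434:

```python
import json
import urllib.request

OLLAMA_URL = "http://localhost:11434/v1/chat/completions"  # default Ollama port

def build_payload(prompt: str, model: str = "mistral") -> dict:
    """Assemble an OpenAI-style chat completion request body."""
    return {
        "model": model,
        "messages": [{"role": "user", "content": prompt}],
    }

def chat(prompt: str, model: str = "mistral") -> str:
    """POST one chat turn to the local server and return the reply text."""
    req = urllib.request.Request(
        OLLAMA_URL,
        data=json.dumps(build_payload(prompt, model)).encode("utf-8"),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req) as resp:
        body = json.load(resp)
    return body["choices"][0]["message"]["content"]

# With a server running:
# chat("Explain TCP handshakes")  -> the model's answer as a string
```

Because the endpoint speaks the OpenAI wire format, existing OpenAI client libraries can usually be pointed at it by changing only the base URL.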

The Honest Cost Comparison

Let's do the math that actually matters. Assume you're running a 70B model and processing about 1 million tokens per day (roughly equivalent to analyzing 50-100 documents or handling a few hundred chat conversations).

Cloud API (approximate pricing for 70B-class model):
Input:  $0.50 per million tokens
Output: $1.50 per million tokens
Daily cost: ~$2.00
Monthly cost: ~$60
Yearly cost: ~$720
Local inference (2x RTX 4090 build):
Hardware: $4,200 (one-time)
Electricity: ~$30/month (assuming 700W, 8h/day, $0.15/kWh)
Break-even: ~12 years (net of electricity)
Local inference (Mac Studio M2 Ultra 192GB):
Hardware: $5,800 (one-time)
Electricity: ~$3/month (60W)
Break-even: ~8.5 years (net of electricity)

At 1 million tokens per day, the cloud API is cheaper for years. But that calculation changes dramatically if your volume increases. At 10 million tokens per day, the API costs $7,200 per year and local hardware breaks even in about 7 months. At 50 million tokens per day, local inference pays for itself in weeks.
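The break-even arithmetic is worth making explicit, since it's so sensitive to volume. A small sketch that nets electricity against the API bill — the dollar figures mirror the rough estimates above and are assumptions, not quoted prices:

```python
def break_even_months(hardware_cost: float,
                      api_cost_per_month: float,
                      electricity_per_month: float) -> float:
    """Months until local hardware pays for itself, counting
    electricity against the API savings."""
    monthly_savings = api_cost_per_month - electricity_per_month
    if monthly_savings <= 0:
        return float("inf")  # local never pays off at this volume
    return round(hardware_cost / monthly_savings, 1)

# 1M tokens/day (~$60/mo API) vs the 2x RTX 4090 build:
print(break_even_months(4200, 60, 30))    # 140.0 months, roughly 12 years
# 10M tokens/day (~$600/mo API):
print(break_even_months(4200, 600, 30))   # 7.4 months
```

The tenfold volume increase cuts break-even by nearly twentyfold, because electricity stays roughly fixed while API savings scale linearly.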

The cost comparison also ignores the non-financial factors: privacy, latency, availability, and experimentation freedom. If any of those are requirements (not just nice-to-haves), the financial comparison is secondary.

What Gets Lost in Quantization

Quantization is what makes local inference possible for large models, but it's not free. Reducing precision means some information is lost, and the degradation isn't uniform across tasks.

In my testing, 4-bit quantized models perform nearly identically to full precision on: text generation, summarization, simple Q&A, translation, and code generation for common patterns. The degradation shows up on: complex multi-step reasoning, mathematical computation, tasks requiring precise recall of training data, and nuanced instruction following.

The practical implication: if you're using a local model for code completion, document summarization, or conversational AI, 4-bit quantization is perfectly fine. If you're using it for complex analytical reasoning or tasks where subtle accuracy differences matter, you'll want to test carefully and possibly use higher precision at the cost of requiring more memory or a smaller model.

Making the Decision

After a year of running models locally, here's my framework for deciding between local and cloud inference:

Use cloud APIs when: you need the absolute best model quality, your volume is low to moderate, you don't have hardware expertise, you need to switch models frequently, or latency isn't critical (a few hundred milliseconds is fine).

Run locally when: your data can't leave your network, you need consistent sub-100ms latency, your token volume is high enough to justify hardware costs, you want to experiment freely without per-query costs, or you need inference availability independent of third-party uptime.

The landscape is shifting fast. Models are getting smaller and more efficient. Quantization techniques are getting better. Hardware is getting cheaper. The volume threshold at which local inference makes financial sense is dropping every year. If it doesn't make sense for you today, it might in eighteen months — and the software stack will only get easier to use in the meantime.