Self-Hosting an LLM on Homelab Hardware

Archive note: this is a backdated post, written years later while rebuilding this site. It’s dated to the moment it covers, but the hindsight is real.

Last month I promised to get Llama 2 running on my own hardware. Report: it runs. A language model — the same category of thing I’ve been renting from OpenAI by the token — is answering questions on equipment I own, with the network cable conceptually unplugged. For a self-hoster, this is the most satisfying software moment since the first time Jellyfin replaced a streaming subscription.

The stack, for fellow travelers

The enabling miracle is llama.cpp — Georgi Gerganov’s C++ inference engine that started as a “can it run on a MacBook” stunt in March and became the foundation of the entire local-LLM movement. The pieces:

Quantization is the unlock. At 16-bit precision, the weights of even the 7B model take about 14GB, and the bigger models need server GPUs. Quantized to 4-bit, which stores each weight at lower precision with surprisingly mild quality loss, Llama 2 7B fits in ~4GB and 13B in ~8GB. That’s “decent gaming GPU” or even “plain CPU with enough RAM” territory. The ex-mining cards flooding the market have found their second life.
The new GGUF format (it replaced GGML just this month) is the packaging standard: one file, metadata included. Download it and point the runtime at it.
Ollama is the ergonomic layer: type “ollama run llama2” and you’re chatting, with an HTTP API on localhost (port 11434 by default) thrown in for free. It’s the Docker moment for local models: the hard parts wrapped in a one-liner. For UI, the web frontends are multiplying weekly; pick any.
Hardware reality: more RAM beats most other upgrades; GPU offloading helps but a modern CPU alone delivers usable tokens-per-second on 7B models. Usable, not fast.
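
The memory figures above come down to simple arithmetic: parameter count times bits per weight. Here is a rough sketch in Python. It assumes about 4.5 bits per weight for a 4-bit block format (4-bit values plus a shared 16-bit scale per 32 weights, as in llama.cpp’s Q4_0 layout), uses nominal parameter counts, and ignores the context cache and runtime overhead. The function name is made up for illustration.

```python
def weight_gb(params_billion: float, bits_per_weight: float) -> float:
    """Approximate size of the weights alone, in decimal gigabytes."""
    # billions of parameters * bits per parameter / 8 bits per byte = GB
    return params_billion * bits_per_weight / 8


if __name__ == "__main__":
    # Nominal parameter counts; 4.5 bits/weight is an assumed figure for a
    # 4-bit block format with per-block scales, not a measured file size.
    for name, params in [("7B", 7.0), ("13B", 13.0)]:
        fp16 = weight_gb(params, 16)
        q4 = weight_gb(params, 4.5)
        print(f"Llama 2 {name}: ~{fp16:.1f} GB at 16-bit, ~{q4:.1f} GB at 4-bit")
```

Under those assumptions the weights come to roughly 3.9 GB for 7B and 7.3 GB for 13B, which is where the ~4GB and ~8GB figures land once you leave some room for context.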

The honest verdict

Calibration matters, so: a quantized 13B is nowhere near GPT-4. It’s somewhere in the GPT-3.5 neighborhood on a good day — fine for summarization, drafting, classification, code explanation; shaky on multi-step reasoning; hallucinates with enthusiasm. If your daily driver is GPT-4, the local model feels like trading a sedan for a bicycle.

But the bicycle is yours, and that changes which trips you take:

Privacy-sensitive workloads. Client code, personal documents, anything under NDA: with local inference the data never leaves the machine. There are tasks I simply wouldn’t paste into a cloud chatbot that the local model now handles.
Zero marginal cost. Batch jobs that would be irresponsible at API prices — classify every email, summarize every document in a folder — are free after electricity. Different economics produce different ideas.
No deprecations, no policy drift. The model on my disk behaves the same tomorrow. Nobody can acquire it, rug it, or rate-limit it.
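
For the “summarize every document in a folder” kind of job, a minimal sketch against Ollama’s local HTTP API might look like the following. It assumes Ollama is running on its default port with the llama2 model already pulled. The folder name, the prompt wording, and the 6000-character truncation are illustrative choices of mine, not anything Ollama prescribes.

```python
import json
import urllib.request
from pathlib import Path
from typing import Callable, Dict

# Ollama's default local endpoint; change it if your install listens elsewhere.
OLLAMA_URL = "http://localhost:11434/api/generate"


def build_request(prompt: str, model: str = "llama2") -> urllib.request.Request:
    """Build a non-streaming generate request for the local Ollama server."""
    payload = {"model": model, "prompt": prompt, "stream": False}
    return urllib.request.Request(
        OLLAMA_URL,
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def generate(prompt: str) -> str:
    """Send one prompt to the local model and return its full reply."""
    with urllib.request.urlopen(build_request(prompt), timeout=600) as resp:
        return json.loads(resp.read().decode("utf-8"))["response"]


def summarize_folder(folder: Path, ask: Callable[[str], str] = generate) -> Dict[str, str]:
    """Summarize every .txt file in a folder, keyed by file name."""
    summaries = {}
    for path in sorted(folder.glob("*.txt")):
        text = path.read_text(encoding="utf-8", errors="replace")
        # Crude truncation (an assumption) to stay inside the context window.
        prompt = "Summarize the following document in three sentences:\n\n" + text[:6000]
        summaries[path.name] = ask(prompt)
    return summaries


if __name__ == "__main__":
    # "notes" is a hypothetical folder name.
    for name, summary in summarize_folder(Path("notes")).items():
        print(f"{name}: {summary}\n")
```

Passing the model call in as the ask parameter keeps the loop testable without a running server. On CPU-only hardware, expect a job like this to run for a while; it costs time and electricity, not tokens.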

Where this is going

The trajectory writes itself: models improve, quantization improves, and capable hardware keeps getting cheaper. Today local models trail the frontier by maybe 18 months of capability; if that gap merely holds, the local tier crosses “genuinely good” soon. The NAS-and-mini-PC crowd should also note that inference is a service like any other. It containerizes, it sits behind a reverse proxy, it joins the homelab stack next to Pi-hole and Jellyfin. A small box that runs your media, your DNS, and your AI is no longer a weird sentence.

The frontier will stay rented for a while. But the floor — private, free, owned — just got real. The most important git clone I’ve run all year.