Self-Hosting an LLM on Homelab Hardware
· Jerwin Arnado
Archive note: this is a backdated post, written years later while rebuilding this site. It’s dated to the moment it covers, but the hindsight is real.
Last month I promised to get Llama 2 running on my own hardware. Report: it runs. A language model — the same category of thing I’ve been renting from OpenAI by the token — is answering questions on equipment I own, with the network cable conceptually unplugged. For a self-hoster, this is the most satisfying software moment since the first time Jellyfin replaced a streaming subscription.
The stack, for fellow travelers
The enabling miracle is llama.cpp — Georgi Gerganov’s C++ inference engine that started as a “can it run on a MacBook” stunt in March and became the foundation of the entire local-LLM movement. The pieces:
- Quantization is the unlock. Full-precision models need server GPUs. Quantized to 4-bit — squeezing weights into less precision with surprisingly mild quality loss — Llama 2 7B fits in ~4GB, 13B in ~8GB. That’s “decent gaming GPU” or even “plain CPU with enough RAM” territory. The ex-mining cards flooding the market have found their second life.
- The new GGUF format (just replaced GGML this month) is the packaging standard — one file, metadata included, download and point the runtime at it.
- Ollama is the ergonomic layer —
ollama run llama2and you’re chatting, with an API endpoint on localhost for free. It’s the Docker moment for local models: the hard parts wrapped in a one-liner. For UI, the web frontends are multiplying weekly; pick any. - Hardware reality: more RAM beats most other upgrades; GPU offloading helps but a modern CPU alone delivers usable tokens-per-second on 7B models. Usable, not fast.
The honest verdict
Calibration matters, so: a quantized 13B is nowhere near GPT-4. It’s somewhere in the GPT-3.5 neighborhood on a good day — fine for summarization, drafting, classification, code explanation; shaky on multi-step reasoning; hallucinates with enthusiasm. If your daily driver is GPT-4, the local model feels like trading a sedan for a bicycle.
But the bicycle is yours, and that changes which trips you take:
- Privacy-complete workloads. Client code, personal documents, anything under NDA — local inference means the data provably never leaves. There are tasks I simply wouldn’t paste into a cloud chatbot that the local model now handles.
- Zero marginal cost. Batch jobs that would be irresponsible at API prices — classify every email, summarize every document in a folder — are free after electricity. Different economics produce different ideas.
- No deprecations, no policy drift. The model on my disk behaves the same tomorrow. Nobody can acquire it, rug it, or rate-limit it.
Where this is going
The trajectory writes itself: models improve, quantization improves, hardware ages forward. Today local models trail the frontier by maybe 18 months of capability; if that gap merely holds, the local tier crosses “genuinely good” soon. And the NAS-and-mini-PC crowd should note — inference is a service like any other. It containerizes, it sits behind a reverse proxy, it joins the homelab stack next to Pi-hole and Jellyfin. A small box that runs your media, your DNS, and your AI is no longer a weird sentence.
The frontier will stay rented for a while. But the floor — private, free, owned — just got real. The most important git clone I’ve run all year.