Local AI on the NAS: Practical Homelab Inference

Archive note: this is a backdated post, written years later while rebuilding this site. It’s dated to the moment it covers, but the hindsight is real.

Two years ago, running a language model at home was an event — a weekend project producing a slow, charming bicycle of a model. The update for mid-2025: local inference is now just a service in the stack, sitting in the compose file between the reverse proxy and the media server, and the interesting questions have shifted from “can it run?” to “what’s it for?” A practical state-of-play:

What changed underneath

Small models crossed the usefulness line. The efficiency wave — distilled reasoning models, relentlessly better small bases — means today’s 7–14B class delivers what required a frontier API in 2023 for a wide band of tasks: summarization, classification, extraction, drafting, code explanation. The bicycle grew a motor.
The tooling went boring (the highest compliment). Ollama or llama.cpp’s server as the runtime, an OpenAI-compatible API by convention, a web UI in front, an MCP server beside it — every piece dockerized, documented, and dull. The era of compiling inference engines from git main is over; it’s Pi-hole-grade plumbing now.
Hardware reality stayed honest. A pure-CPU NAS runs small models acceptably for background jobs; a modest GPU (or one of the NPU-equipped mini-PCs maturing fast) makes them interactive. The right mental model: the NAS is the host, inference is a workload — schedule it like one. Nobody needs the heater that is a 70B model at home; the floor models are the point.

What it’s actually for

The honest answer, from a year of settled use — not chat. The local model earns its electricity as infrastructure:

The privacy-complete tier. Documents, finances, client material under NDA, family photos’ metadata — the entire category of “would never paste into a cloud box” gets AI features with provable data custody. This was the original thesis and it holds.
Zero-marginal-cost batch work. Tag and describe the photo library overnight. Summarize every document in the archive. Classify years of email. Jobs that are irresponsible at per-token prices are free after sunk hardware — different economics produce different software, and batch-everything is the local tier’s native genre.
The house’s quiet brain. Wired through MCP to the homelab itself: ask about server state, summarize the logs, draft tonight’s media suggestions from the Jellyfin history. Small, scoped, local — the agent pattern with a LAN-sized blast radius.

The division of labor, settled

The two-tier shape this blog predicted has stabilized into practice: frontier rented for depth, floor owned for volume and privacy. The terminal agents and hard reasoning stay on frontier APIs; everything private, repetitive, or ambient runs at home. The gap between tiers keeps narrowing on a four-month fuse, which means the owned tier’s territory only grows.

Two years from “look, it talks!” to “it’s in the compose file.” That’s the whole local-AI story in one sentence — and exactly the trajectory the homelab sermon promised: the platforms demo the future; the rack in the corner gets to keep it.