What's the difference between 'hosting' and 'serving' a local LLM?
The answer
One paragraph. No hedging beyond what the data actually warrants.
Hosting is "the model is reachable at an HTTP endpoint." Serving is "the model handles N concurrent users without falling over." The gap between them is where local-AI deployments most often break.
Hosting (the easy mode):
- Ollama at `localhost:11434` is the canonical example
- Single-user / low-concurrency
- One request at a time; subsequent requests queue
- 30-100ms scheduling overhead per request
- Memory is loaded once and stays loaded
- Failure mode: the 3rd concurrent user waits ~30 seconds in the queue (a quick way to see the queueing for yourself is sketched right after this list)
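A minimal sketch of that check, assuming Ollama is running locally with the one-request-at-a-time behaviour described above and that `llama3.1:8b` (a placeholder; use whatever `ollama list` shows) is already pulled. It fires two requests at Ollama's native `/api/generate` endpoint concurrently and compares wall-clock times:

```python
import concurrent.futures
import time

import requests

OLLAMA_URL = "http://localhost:11434/api/generate"
MODEL = "llama3.1:8b"  # placeholder; any locally pulled model works

def timed_generate(tag: str) -> tuple[str, float]:
    start = time.perf_counter()
    resp = requests.post(
        OLLAMA_URL,
        json={"model": MODEL, "prompt": "Write one sentence about GPUs.", "stream": False},
        timeout=300,
    )
    resp.raise_for_status()
    return tag, time.perf_counter() - start

with concurrent.futures.ThreadPoolExecutor(max_workers=2) as pool:
    # If the server handles one request at a time, the second request's wall
    # time is roughly its own generation time plus the first request's:
    # it sat in Ollama's queue until the model freed up.
    for tag, seconds in pool.map(timed_generate, ["first", "second"]):
        print(f"{tag}: {seconds:.1f}s")
```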
Serving (the production mode):
- vLLM, TGI, TensorRT-LLM, SGLang are the production runtimes
- Continuous batching — multiple requests share the model's forward pass
- Paged attention — KV cache memory is sliced like virtual memory, not pre-allocated
- Prefix caching — repeated prompt prefixes (system messages) are computed once
- Failure mode: misconfigured GPU memory (over-allocation or fragmentation) drops throughput 50%+ (a minimal engine sketch follows this list)
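For contrast, a minimal vLLM sketch of those mechanisms in action. The model name, memory fraction, and prompts are placeholders; `gpu_memory_utilization`, `enable_prefix_caching`, and `SamplingParams` are standard vLLM knobs, but check your installed version's defaults before copying this anywhere.

```python
from vllm import LLM, SamplingParams

llm = LLM(
    model="meta-llama/Llama-3.1-8B-Instruct",  # placeholder model
    gpu_memory_utilization=0.90,               # the knob that bites when misconfigured
    enable_prefix_caching=True,                # shared prompt prefixes computed once
)

sampling = SamplingParams(temperature=0.7, max_tokens=128)

# All 32 prompts hit the scheduler together: continuous batching interleaves
# them in shared forward passes, and paged attention allocates each request's
# KV cache in small blocks instead of reserving worst-case memory up front.
prompts = [f"Explain continuous batching to user {i}." for i in range(32)]
for output in llm.generate(prompts, sampling):
    print(output.outputs[0].text[:80])
```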
Operator-grade specifics:
| Metric | Ollama (hosting) | vLLM (serving) |
|---|---|---|
| Single-user latency | comparable to vLLM (within ~10%) | comparable to Ollama |
| Concurrent throughput | flat — queues form fast | scales materially with concurrency |
| Multi-GPU tensor parallel | limited | first-class |
| Setup | one binary, ~5 min | Python env, ~30 min, more knobs |
| Hardware coverage | NVIDIA, Apple Silicon, AMD ROCm | NVIDIA primary; ROCm 6.4+ partial |
We deliberately don't list req/s headlines in this table. Published vLLM continuous-batching benchmarks are model-, batch-, and hardware-specific; pretending a single number works for everyone is exactly what the audit caught us doing elsewhere. The shape of the curve is the takeaway: Ollama plateaus quickly under concurrency; vLLM scales until you hit GPU memory.
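If you'd rather see that curve on your own hardware than trust anyone's table, a rough sweep is easy to script. The sketch below assumes an OpenAI-compatible endpoint (vLLM serves one on port 8000 by default; Ollama exposes one under /v1 on 11434) and that the server reports token usage in non-streaming responses; the URL, model name, and concurrency levels are placeholders to adjust.

```python
import asyncio
import time

import httpx

BASE_URL = "http://localhost:8000/v1"        # vLLM default; Ollama: http://localhost:11434/v1
MODEL = "meta-llama/Llama-3.1-8B-Instruct"   # whatever GET /v1/models reports on your stack
PROMPT = "Summarize the tradeoffs between hosting and serving a local LLM."

async def one_request(client: httpx.AsyncClient) -> int:
    resp = await client.post(
        f"{BASE_URL}/chat/completions",
        json={
            "model": MODEL,
            "messages": [{"role": "user", "content": PROMPT}],
            "max_tokens": 128,
        },
        timeout=300,
    )
    resp.raise_for_status()
    # Assumes the server reports usage (vLLM and Ollama's OpenAI-compatible
    # mode both do for non-streaming requests).
    return resp.json()["usage"]["completion_tokens"]

async def sweep(concurrency: int) -> None:
    async with httpx.AsyncClient() as client:
        start = time.perf_counter()
        tokens = await asyncio.gather(*(one_request(client) for _ in range(concurrency)))
        elapsed = time.perf_counter() - start
        print(f"concurrency={concurrency:3d}  "
              f"completion_tokens={sum(tokens):5d}  "
              f"throughput={sum(tokens) / elapsed:7.1f} tok/s")

if __name__ == "__main__":
    # Watch where throughput stops climbing: that plateau is the shape the
    # table above is describing.
    for n in (1, 2, 4, 8, 16, 32):
        asyncio.run(sweep(n))
```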
The decision rule:
- Solo operator, 1-3 daily users (yourself + friends): Ollama. The throughput gap doesn't matter because you're not concurrency-bound.
- Small team, 5-15 daily users: Still Ollama if you have GPU headroom. Switch to vLLM only when you observe queue waits (the client-side switch is sketched after this list).
- Production serving, 20+ concurrent or paying users: vLLM. Continuous batching is the difference between "this scales" and "this falls over."
- Per-team self-hosted Copilot replacement: Tabby self-hosted server (which wraps vLLM internally + adds SSO + audit logs).
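One practical consolation in that decision rule: both backends speak the OpenAI API, so moving from Ollama to vLLM (or running them side by side during a migration) is a base_url change in your client, not a rewrite. A sketch using the official openai Python client; the URLs and model names are placeholders.

```python
from openai import OpenAI

# Ollama's OpenAI-compatible endpoint ignores the API key; vLLM's does too
# unless the server was launched with an --api-key.
ollama = OpenAI(base_url="http://localhost:11434/v1", api_key="unused")
vllm = OpenAI(base_url="http://localhost:8000/v1", api_key="unused")

for client, model in [(ollama, "llama3.1:8b"), (vllm, "meta-llama/Llama-3.1-8B-Instruct")]:
    reply = client.chat.completions.create(
        model=model,
        messages=[{"role": "user", "content": "One sentence: hosting vs serving?"}],
        max_tokens=64,
    )
    print(reply.choices[0].message.content)
```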
The misconception to avoid: "vLLM is faster than Ollama, so I should always use vLLM." False for single-user workloads. In a one-request-at-a-time scenario, Ollama and vLLM are comparable — community results put them within tens of percent of each other, varying by model and runtime build. vLLM's actual advantage is concurrency. If you have no concurrent users, its operational complexity is overhead.
The other misconception: "Ollama can't scale at all." Also false — Ollama with proper sizing handles 5-10 concurrent users acceptably. The "Ollama doesn't scale" framing exists because production-grade teams (Hugging Face, vLLM team, etc.) compare benchmarks at 50+ concurrency where Ollama isn't the right tool.
Explore the numbers for your specific stack
Where we got the numbers
Throughput numbers: vLLM continuous-batching paper + community benchmarks (r/LocalLLaMA, 2026). ROCm 6.4 vLLM parity: ROCm release notes. Ollama scheduling overhead: ollama/ollama issue threads.
Also see
- Ollama vs llama.cpp vs vLLM — the 30-second decision rule + honest tradeoffs.
- Configuration, continuous batching, paged attention, tensor parallelism.
- Setup, common gotchas, when it's the right answer.
- The team-friendly server that wraps vLLM + adds SSO + audit logs.
Other questions in this thread
Other /q/ landings on the same topic — same editorial discipline.
Found this via a forum search? Bookmark the URL — we update these pages as new data lands. Have a question that should live here? Open a GitHub issue.