What's the difference between 'hosting' and 'serving' a local LLM?
The answer
One paragraph. No hedging beyond what the data actually warrants.
Hosting is "the model is reachable at an HTTP endpoint." Serving is "the model handles N concurrent users without falling over." The gap between them is where local-AI deployments most often break.
Hosting (the easy mode):
- Ollama at `localhost:11434` is the canonical example
- Single-user / low-concurrency
- One request at a time; subsequent requests queue
- 30-100ms scheduling overhead per request
- Memory is loaded once and stays loaded
- Failure mode: the 3rd concurrent user waits ~30 seconds in the queue (a quick way to see the queueing for yourself is sketched right after this list)
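A minimal sketch of that check, assuming Ollama is running locally with the one-request-at-a-time behaviour described above and that `llama3.1:8b` (a placeholder; use whatever `ollama list` shows) is already pulled. It fires two requests at Ollama's native `/api/generate` endpoint concurrently and compares wall-clock times:

```python
import concurrent.futures
import time

import requests

OLLAMA_URL = "http://localhost:11434/api/generate"
MODEL = "llama3.1:8b"  # placeholder; any locally pulled model works

def timed_generate(tag: str) -> tuple[str, float]:
    start = time.perf_counter()
    resp = requests.post(
        OLLAMA_URL,
        json={"model": MODEL, "prompt": "Write one sentence about GPUs.", "stream": False},
        timeout=300,
    )
    resp.raise_for_status()
    return tag, time.perf_counter() - start

with concurrent.futures.ThreadPoolExecutor(max_workers=2) as pool:
    # If the server handles one request at a time, the second request's wall
    # time is roughly its own generation time plus the first request's:
    # it sat in Ollama's queue until the model freed up.
    for tag, seconds in pool.map(timed_generate, ["first", "second"]):
        print(f"{tag}: {seconds:.1f}s")
```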
Serving (the production mode):
- vLLM, TGI, TensorRT-LLM, SGLang are the production runtimes
- Continuous batching — multiple requests share the model's forward pass
- Paged attention — KV cache memory is sliced like virtual memory, not pre-allocated
- Prefix caching — repeated prompt prefixes (system messages) are computed once
- Failure mode: misconfigured GPU memory (over-allocation or fragmentation) drops throughput 50%+ (a minimal engine sketch follows this list)
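For contrast, a minimal vLLM sketch of those mechanisms in action. The model name, memory fraction, and prompts are placeholders; `gpu_memory_utilization`, `enable_prefix_caching`, and `SamplingParams` are standard vLLM knobs, but check your installed version's defaults before copying this anywhere.

```python
from vllm import LLM, SamplingParams

llm = LLM(
    model="meta-llama/Llama-3.1-8B-Instruct",  # placeholder model
    gpu_memory_utilization=0.90,               # the knob that bites when misconfigured
    enable_prefix_caching=True,                # shared prompt prefixes computed once
)

sampling = SamplingParams(temperature=0.7, max_tokens=128)

# All 32 prompts hit the scheduler together: continuous batching interleaves
# them in shared forward passes, and paged attention allocates each request's
# KV cache in small blocks instead of reserving worst-case memory up front.
prompts = [f"Explain continuous batching to user {i}." for i in range(32)]
for output in llm.generate(prompts, sampling):
    print(output.outputs[0].text[:80])
```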
Operator-grade specifics:
| Metric | Ollama (hosting) | vLLM (serving) |
|---|---|---|
| Single-user latency | comparable to vLLM (within ~10%) | comparable to Ollama |
| Concurrent throughput | flat — queues form fast | scales materially with concurrency |
| Multi-GPU tensor parallel | limited | first-class |
| Setup | one binary, ~5 min | Python env, ~30 min, more knobs |
| Hardware coverage | NVIDIA, Apple Silicon, AMD ROCm | NVIDIA primary; ROCm 6.4+ partial |
We deliberately don't list req/s headlines in this table. Published vLLM continuous-batching benchmarks are model-, batch-, and hardware-specific; pretending a single number works for everyone is exactly what the audit caught us doing elsewhere. The shape of the curve is the takeaway: Ollama plateaus quickly under concurrency; vLLM scales until you hit GPU memory.
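If you'd rather see that curve on your own hardware than trust anyone's table, a rough sweep is easy to script. The sketch below assumes an OpenAI-compatible endpoint (vLLM serves one on port 8000 by default; Ollama exposes one under /v1 on 11434) and that the server reports token usage in non-streaming responses; the URL, model name, and concurrency levels are placeholders to adjust.

```python
import asyncio
import time

import httpx

BASE_URL = "http://localhost:8000/v1"        # vLLM default; Ollama: http://localhost:11434/v1
MODEL = "meta-llama/Llama-3.1-8B-Instruct"   # whatever GET /v1/models reports on your stack
PROMPT = "Summarize the tradeoffs between hosting and serving a local LLM."

async def one_request(client: httpx.AsyncClient) -> int:
    resp = await client.post(
        f"{BASE_URL}/chat/completions",
        json={
            "model": MODEL,
            "messages": [{"role": "user", "content": PROMPT}],
            "max_tokens": 128,
        },
        timeout=300,
    )
    resp.raise_for_status()
    # Assumes the server reports usage (vLLM and Ollama's OpenAI-compatible
    # mode both do for non-streaming requests).
    return resp.json()["usage"]["completion_tokens"]

async def sweep(concurrency: int) -> None:
    async with httpx.AsyncClient() as client:
        start = time.perf_counter()
        tokens = await asyncio.gather(*(one_request(client) for _ in range(concurrency)))
        elapsed = time.perf_counter() - start
        print(f"concurrency={concurrency:3d}  "
              f"completion_tokens={sum(tokens):5d}  "
              f"throughput={sum(tokens) / elapsed:7.1f} tok/s")

if __name__ == "__main__":
    # Watch where throughput stops climbing: that plateau is the shape the
    # table above is describing.
    for n in (1, 2, 4, 8, 16, 32):
        asyncio.run(sweep(n))
```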
The decision rule:
- Solo operator, 1-3 daily users (yourself + friends): Ollama. The throughput gap doesn't matter because you're not concurrency-bound.
- Small team, 5-15 daily users: Still Ollama if you have GPU headroom. Switch to vLLM only when you observe queue waits (the client-side switch is sketched after this list).
- Production serving, 20+ concurrent or paying users: vLLM. Continuous batching is the difference between "this scales" and "this falls over."
- Per-team self-hosted Copilot replacement: Tabby self-hosted server (which wraps vLLM internally + adds SSO + audit logs).
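One practical consolation in that decision rule: both backends speak the OpenAI API, so moving from Ollama to vLLM (or running them side by side during a migration) is a base_url change in your client, not a rewrite. A sketch using the official openai Python client; the URLs and model names are placeholders.

```python
from openai import OpenAI

# Ollama's OpenAI-compatible endpoint ignores the API key; vLLM's does too
# unless the server was launched with an --api-key.
ollama = OpenAI(base_url="http://localhost:11434/v1", api_key="unused")
vllm = OpenAI(base_url="http://localhost:8000/v1", api_key="unused")

for client, model in [(ollama, "llama3.1:8b"), (vllm, "meta-llama/Llama-3.1-8B-Instruct")]:
    reply = client.chat.completions.create(
        model=model,
        messages=[{"role": "user", "content": "One sentence: hosting vs serving?"}],
        max_tokens=64,
    )
    print(reply.choices[0].message.content)
```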
The misconception to avoid: "vLLM is faster than Ollama, so I should always use vLLM." False for single-user workloads. In a one-request-at-a-time scenario, Ollama and vLLM are comparable — community results put them within tens of percent of each other, varying by model and runtime build. vLLM's actual advantage is concurrency. If you have no concurrent users, its operational complexity is overhead.
The other misconception: "Ollama can't scale at all." Also false — Ollama with proper sizing handles 5-10 concurrent users acceptably. The "Ollama doesn't scale" framing exists because production-grade teams (Hugging Face, vLLM team, etc.) compare benchmarks at 50+ concurrency where Ollama isn't the right tool.
Explore the numbers for your specific stack
Where we got the numbers
Throughput numbers: vLLM continuous-batching paper + community benchmarks (r/LocalLLaMA, 2026). ROCm 6.4 vLLM parity: ROCm release notes. Ollama scheduling overhead: ollama/ollama issue threads.
Also see
- Ollama vs llama.cpp vs vLLM — the 30-second decision rule + honest tradeoffs.
- Configuration, continuous batching, paged attention, tensor parallelism.
- Setup, common gotchas, when it's the right answer.
- The team-friendly server that wraps vLLM + adds SSO + audit logs.
Other questions in this thread
Other /q/ landings on the same topic — same editorial discipline.
Found this via a forum search? Bookmark the URL — we update these pages as new data lands. Have a question that should live here? Open a GitHub issue.