What's the difference between 'hosting' and 'serving' a local LLM?

Reviewed May 15, 2026 · 2 min read
hosting · serving · ollama · vllm · production

The answer

One paragraph. No hedging beyond what the data actually warrants.

Hosting is "the model is reachable at an HTTP endpoint." Serving is "the model handles N concurrent users without falling over." The gap between them is where local-AI deployments most often break.

Hosting (the easy mode):

  • Ollama at localhost:11434 is the canonical example (see the sketch after this list)
  • Single-user / low-concurrency
  • One request at a time; subsequent requests queue
  • 30-100ms scheduling overhead per request
  • Memory is loaded once and stays loaded
  • Failure mode: 3rd concurrent user waits ~30 seconds in queue
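
To make "reachable at an HTTP endpoint" concrete, here is a minimal hosting sketch in Python: one blocking call against Ollama's default local API. The model name is a placeholder for whatever you have actually pulled.

```python
# A minimal sketch of "hosting": a single HTTP call to a locally running
# Ollama instance on its default port. Assumes the "llama3" model (a
# placeholder name) has already been pulled.
import requests

resp = requests.post(
    "http://localhost:11434/api/generate",
    json={
        "model": "llama3",                       # placeholder model name
        "prompt": "Say hello in one sentence.",
        "stream": False,                         # one JSON body instead of a token stream
    },
    timeout=120,
)
resp.raise_for_status()
print(resp.json()["response"])
```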

Serving (the production mode):

  • vLLM, TGI, TensorRT-LLM, SGLang are the production runtimes (a vLLM sketch follows this list)
  • Continuous batching — multiple requests share the model's forward pass
  • Paged attention — KV cache memory is sliced like virtual memory, not pre-allocated
  • Prefix caching — repeated prompt prefixes (system messages) are computed once
  • Failure mode: misconfigured GPU memory fragmentation drops throughput 50%+
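
And a minimal sketch of what a serving engine is doing, using vLLM's offline Python API (in production you would expose its OpenAI-compatible server instead). The model name, GPU memory fraction, and prompt count are illustrative assumptions, not recommendations.

```python
# A minimal sketch of serving-style batching with vLLM's offline API.
# Assumes vLLM is installed and a supported GPU is available.
from vllm import LLM, SamplingParams

llm = LLM(
    model="meta-llama/Llama-3.1-8B-Instruct",  # placeholder; any model ID you can actually run
    gpu_memory_utilization=0.90,               # fraction of VRAM vLLM may claim (weights + paged KV cache)
)
params = SamplingParams(temperature=0.7, max_tokens=128)

# Passing many prompts at once lets the engine batch them across forward
# passes instead of running them strictly one after another.
prompts = [f"Summarize request #{i} in one line." for i in range(32)]
outputs = llm.generate(prompts, params)
for out in outputs:
    print(out.outputs[0].text.strip())
```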

Operator-grade specifics:

Metric                       Ollama (hosting)                   vLLM (serving)
Single-user latency          comparable to vLLM (within ~10%)   comparable to Ollama
Concurrent throughput        flat; queues form fast             scales materially with concurrency
Multi-GPU tensor parallel    limited                            first-class
Setup                        one binary, ~5 min                 Python env, ~30 min, more knobs
Hardware coverage            NVIDIA, Apple Silicon, AMD ROCm    NVIDIA primary; ROCm 6.4+ partial

We deliberately don't list req/s headlines in this table. Published vLLM continuous-batching benchmarks are model-, batch-, and hardware-specific; pretending a single number works for everyone is exactly what the audit caught us doing elsewhere. The shape of the curve is the takeaway: Ollama plateaus quickly under concurrency; vLLM scales until you hit GPU memory.

The decision rule:

  • Solo operator, 1-3 daily users (yourself + friends): Ollama. The throughput gap doesn't matter because you're not concurrency-bound.
  • Small team, 5-15 daily users: Still Ollama if you have GPU headroom. Switch to vLLM only when you observe queue waits (a quick way to check follows this list).
  • Production serving, 20+ concurrent or paying users: vLLM. Continuous batching is the difference between "this scales" and "this falls over."
  • Per-team self-hosted Copilot replacement: Tabby self-hosted server (which wraps vLLM internally + adds SSO + audit logs).
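
What "observe queue waits" can look like in practice: a rough probe that fires a few concurrent chat requests at an OpenAI-compatible endpoint and reports the latency spread. Both Ollama and vLLM expose this API; the base URL, model name, and concurrency level below are assumptions to adjust for your setup.

```python
# Rough queue-wait probe: N concurrent chat requests against an
# OpenAI-compatible endpoint. Ollama serves this API at
# http://localhost:11434/v1; vLLM typically at http://localhost:8000/v1.
import concurrent.futures
import time

import requests

BASE_URL = "http://localhost:11434/v1"    # point at your own endpoint
MODEL = "llama3"                          # placeholder model name
CONCURRENCY = 8

def one_request(i: int) -> float:
    start = time.perf_counter()
    r = requests.post(
        f"{BASE_URL}/chat/completions",
        json={
            "model": MODEL,
            "messages": [{"role": "user", "content": f"Reply 'ok' to probe {i}."}],
            "max_tokens": 16,
        },
        timeout=300,
    )
    r.raise_for_status()
    return time.perf_counter() - start

with concurrent.futures.ThreadPoolExecutor(max_workers=CONCURRENCY) as pool:
    latencies = sorted(pool.map(one_request, range(CONCURRENCY)))

# If the slowest request takes several times longer than the fastest, requests
# are queueing behind each other rather than being batched together.
print(f"fastest {latencies[0]:.1f}s, slowest {latencies[-1]:.1f}s")
```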

The misconception to avoid: "vLLM is faster than Ollama, so I should always use vLLM." False for single-user workloads. In a one-request-at-a-time scenario, Ollama and vLLM are comparable; community results put them within tens of percent of each other, varying by model and runtime build. vLLM's actual advantage is concurrency. If you have no concurrent users, its operational complexity is overhead.

The other misconception: "Ollama can't scale at all." Also false: a properly sized Ollama instance handles 5-10 concurrent users acceptably. The "Ollama doesn't scale" framing exists because production-focused benchmarks (Hugging Face, the vLLM team, etc.) are run at 50+ concurrency, where Ollama isn't the right tool.

Where we got the numbers

Throughput numbers come from the vLLM continuous-batching paper and community benchmarks on r/LocalLLaMA (2026). ROCm 6.4 vLLM parity comes from the ROCm release notes. Ollama scheduling overhead comes from ollama/ollama issue threads.

Found this via a forum search? Bookmark the URL — we update these pages as new data lands. Have a question that should live here? Open a GitHub issue.