Engine vs engine
Editorial

vLLM vs llama.cpp — server vs portable inference

vLLM

Production serving runtime — continuous batching + paged attention.

llama.cpp

Cross-platform CPU+GPU inference; the reference portable runtime.


vLLM and llama.cpp solve different problems. vLLM is a production-grade LLM serving runtime; llama.cpp is a portable inference engine that runs almost anywhere. Their single-stream tok/s on a single GPU is comparable, but they diverge on every other axis.

If you're serving multiple concurrent users, vLLM's continuous batching + paged attention will outperform llama.cpp by orders of magnitude. If you're running one model on one machine for one user — or if that machine isn't a Linux box with an NVIDIA GPU — llama.cpp wins on portability + simplicity.
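Here is what that batching looks like from vLLM's offline Python API. This is a minimal sketch, assuming vLLM is installed on a Linux + NVIDIA box; the model id is a placeholder you would swap for whatever you actually serve.

```python
# Minimal sketch of vLLM's offline batch API (assumes vLLM on Linux + NVIDIA;
# the model id below is a placeholder, not a recommendation).
from vllm import LLM, SamplingParams

prompts = [f"Summarize ticket #{i} in one sentence:" for i in range(64)]
params = SamplingParams(temperature=0.7, max_tokens=64)

llm = LLM(model="meta-llama/Llama-3.1-8B-Instruct")  # placeholder model id

# All 64 prompts are scheduled together; continuous batching keeps the GPU
# busy instead of running them one after another.
outputs = llm.generate(prompts, params)
for out in outputs:
    print(out.prompt, "->", out.outputs[0].text.strip()[:80])
```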

Most operators end up using both: llama.cpp on the laptop / Mac / homelab, vLLM on the production rack.

Quick decision rules

  • Production multi-user serving → choose vLLM. llama.cpp can't compete on concurrent throughput.
  • macOS, Windows native, or non-NVIDIA hardware → choose llama.cpp. vLLM is Linux+NVIDIA-first; everything else is second-class.
  • Single-user, single-machine, simplicity matters → choose llama.cpp.
  • Mobile / embedded / edge deployment → choose llama.cpp. vLLM is server-class; edge is not in scope.
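Whichever rule applies, the client side doesn't have to change: vLLM's OpenAI-compatible server and llama.cpp's llama-server (its bundled HTTP server) both expose a /v1 chat endpoint. A sketch like the following, with a placeholder base URL, port, and model name, talks to either backend.

```python
# Same client against either engine: point base_url at vLLM's OpenAI server
# or at llama.cpp's llama-server; both speak the /v1 chat API.
# (Model name, port, and API key are placeholders.)
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="unused")

resp = client.chat.completions.create(
    model="local-model",  # vLLM: the served model id; llama-server: often ignored
    messages=[{"role": "user", "content": "One-line status check, please."}],
    max_tokens=32,
)
print(resp.choices[0].message.content)
```

Keeping the client on the OpenAI wire format is also what makes the "use both" pattern cheap: swapping engines is a server-side change.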

Operational matrix

| Dimension | vLLM | llama.cpp |
|---|---|---|
| Concurrent serving (10+ users on one rig) | Excellent: built for it. | Limited: sequential by default; LocalAI/llama-swap to multiplex. |
| Single-stream tok/s (one user at a time) | Excellent: fastest in the category. | Strong: within a few % on the same GPU. |
| OS portability (realistic stable platforms) | Limited: Linux first-class; Windows via WSL2; no macOS. | Excellent: Linux + macOS + Windows + iOS + Android. |
| Hardware portability (card types supported) | Strong: NVIDIA + AMD ROCm; CUDA-first. | Excellent: CUDA + Metal + Vulkan + OpenCL + CPU-only. |
| Reproducibility (stand up the same setup six months later) | Acceptable: multi-knob; pin Python + CUDA + flash-attention + vLLM. | Strong: pin a commit + GGUF; that's it. |
| Multi-GPU (tensor-parallel across cards) | Excellent: tensor + pipeline parallel; first-class. | Strong: layer-split; functional but slower than vLLM TP. |
| Mobile / embedded (phones, RPi, Jetson) | Out of scope: server runtime. | Excellent: the reference mobile inference runtime. |
| Maintenance burden (operator hours per month) | Limited: 5-10 h/mo on driver / runtime / pin updates. | Strong: <1 h/mo; self-contained binary. |
| Observability (logs + metrics) | Strong: native Prometheus endpoint. | Acceptable: verbose stderr; you write your own metrics. |
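A note on the observability row: vLLM serves Prometheus metrics from the same port as its OpenAI API, so a sanity check is a plain HTTP GET against /metrics. The port below is a placeholder and the exact metric names vary by version.

```python
# Quick sanity check of vLLM's built-in Prometheus endpoint.
# Assumes a vLLM OpenAI server is already running on localhost:8000.
import urllib.request

with urllib.request.urlopen("http://localhost:8000/metrics") as resp:
    text = resp.read().decode()

# Print the vLLM-specific series (names vary across versions).
for line in text.splitlines():
    if line.startswith("vllm") and not line.startswith("#"):
        print(line)
```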

Failure modes — what breaks first

vLLM

  • Flash-attention pinning incompatibilities after a CUDA upgrade
  • Pip dependency conflicts when the runtime ships a major release
  • OOM on long contexts when the KV cache isn't pre-sized (see the sketch after this list)
  • WSL2 GPU passthrough breaks on Windows kernel updates
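The KV-cache OOM is mostly a sizing problem: if max_model_len is left at the model default, a long-context model may simply not fit in the cache budget. A minimal sketch of the two knobs that matter, with illustrative values only:

```python
# Pre-size the KV cache instead of discovering the limit in production.
# max_model_len and gpu_memory_utilization are real vLLM knobs; the values
# below are illustrative and depend on your card and model.
from vllm import LLM

llm = LLM(
    model="meta-llama/Llama-3.1-8B-Instruct",  # placeholder model id
    max_model_len=8192,           # cap context so the cache fits
    gpu_memory_utilization=0.90,  # fraction of VRAM vLLM may claim
)
```

The --max-model-len and --gpu-memory-utilization flags do the same job when launching the OpenAI server.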

llama.cpp

  • Outdated GGUF format after a major schema change (rare but happens)
  • Metal kernel issues on macOS major-version transitions
  • Vulkan support varies by driver — Intel/AMD inconsistent
  • Older quants (Q4_0 / Q5_0) deprecated in favor of K-quants (see the sketch below)
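For the deprecated-quants item, the usual fix is to regenerate the file as a K-quant rather than patch the old one. A sketch, assuming a recent llama.cpp build where the tool is named llama-quantize (older builds call it quantize) and a local full-precision GGUF to start from; all paths are placeholders.

```python
# Re-quantize a legacy GGUF to a K-quant with llama.cpp's quantize tool.
# Assumes a recent build where the binary is named llama-quantize
# (older builds call it `quantize`); paths are placeholders.
import subprocess

subprocess.run(
    [
        "./llama-quantize",
        "models/my-model-f16.gguf",     # source: full-precision GGUF
        "models/my-model-Q4_K_M.gguf",  # destination
        "Q4_K_M",                       # target quant type
    ],
    check=True,
)
```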

Editorial verdict

If your workload is single-user single-machine, llama.cpp is almost always the right answer. The maintenance burden is dramatically lower, the OS coverage is dramatically wider, and the throughput gap on single-stream is small enough not to matter day-to-day.
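For scale, here is roughly what the whole single-user stack looks like through the llama-cpp-python bindings: one pip install, one GGUF file, no server process. Model path and settings are placeholders.

```python
# Single-user, single-machine: llama.cpp via the llama-cpp-python bindings.
# One GGUF file on disk, no server process. Path and settings are placeholders.
from llama_cpp import Llama

llm = Llama(
    model_path="./models/my-model-Q4_K_M.gguf",
    n_ctx=4096,        # context window
    n_gpu_layers=-1,   # offload everything the backend (CUDA/Metal) can take
)

resp = llm.create_chat_completion(
    messages=[{"role": "user", "content": "Give me one reason to keep this local."}],
    max_tokens=64,
)
print(resp["choices"][0]["message"]["content"])
```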

If you're serving anyone other than yourself — paying users, a small team, even a few colleagues — switch to vLLM the moment concurrent throughput matters. llama.cpp + a multiplexer (LocalAI, llama-swap) gets you 80% there but vLLM's continuous batching is the structural answer.
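The difference shows up as soon as requests overlap. A crude way to see it, assuming a vLLM OpenAI server on localhost:8000 serving a placeholder model: fire a few dozen requests at once and watch aggregate completion tokens per second, then run the same probe against a single sequential llama.cpp process.

```python
# Crude concurrency probe against an OpenAI-compatible endpoint.
# Assumes a server on localhost:8000 serving "local-model" (placeholders).
from concurrent.futures import ThreadPoolExecutor
import time

from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="unused")

def one_request(i: int) -> int:
    resp = client.chat.completions.create(
        model="local-model",
        messages=[{"role": "user", "content": f"Write two sentences about item {i}."}],
        max_tokens=64,
    )
    return resp.usage.completion_tokens

start = time.time()
with ThreadPoolExecutor(max_workers=32) as pool:
    tokens = sum(pool.map(one_request, range(32)))
print(f"{tokens} completion tokens in {time.time() - start:.1f}s")
```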

Use both. Operators we trust run llama.cpp on every laptop/desktop they touch and vLLM only on the production rack. They're different tools.
