Engine vs engine
Editorial

vLLM vs llama.cpp — server vs portable inference

vLLM

Production serving runtime — continuous batching + paged attention.

llama.cpp

Cross-platform CPU+GPU inference; the reference portable runtime.


vLLM and llama.cpp solve different problems. vLLM is a production-grade LLM serving runtime; llama.cpp is a portable inference engine that runs almost anywhere. Their single-stream tok/s on a single GPU is comparable, but they diverge on every other axis.

If you're serving multiple concurrent users, vLLM's continuous batching + paged attention will outperform llama.cpp by orders of magnitude. If you're running one model on one machine for one user — or if that machine isn't a Linux box with an NVIDIA GPU — llama.cpp wins on portability + simplicity.
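Here is what that batching looks like from vLLM's offline Python API. This is a minimal sketch, assuming vLLM is installed on a Linux + NVIDIA box; the model id is a placeholder you would swap for whatever you actually serve.

```python
# Minimal sketch of vLLM's offline batch API (assumes vLLM on Linux + NVIDIA;
# the model id below is a placeholder, not a recommendation).
from vllm import LLM, SamplingParams

prompts = [f"Summarize ticket #{i} in one sentence:" for i in range(64)]
params = SamplingParams(temperature=0.7, max_tokens=64)

llm = LLM(model="meta-llama/Llama-3.1-8B-Instruct")  # placeholder model id

# All 64 prompts are scheduled together; continuous batching keeps the GPU
# busy instead of running them one after another.
outputs = llm.generate(prompts, params)
for out in outputs:
    print(out.prompt, "->", out.outputs[0].text.strip()[:80])
```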

Most operators end up using both: llama.cpp on the laptop / Mac / homelab, vLLM on the production rack.

Quick decision rules

  • Production multi-user serving → choose vLLM. llama.cpp can't compete on concurrent throughput.
  • macOS, Windows native, or non-NVIDIA hardware → choose llama.cpp. vLLM is Linux+NVIDIA-first; everything else is second-class.
  • Single-user, single-machine, simplicity matters → choose llama.cpp.
  • Mobile / embedded / edge deployment → choose llama.cpp. vLLM is server-class; edge is not in scope.
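Whichever rule applies, the client side doesn't have to change: vLLM's OpenAI-compatible server and llama.cpp's llama-server (its bundled HTTP server) both expose a /v1 chat endpoint. A sketch like the following, with a placeholder base URL, port, and model name, talks to either backend.

```python
# Same client against either engine: point base_url at vLLM's OpenAI server
# or at llama.cpp's llama-server; both speak the /v1 chat API.
# (Model name, port, and API key are placeholders.)
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="unused")

resp = client.chat.completions.create(
    model="local-model",  # vLLM: the served model id; llama-server: often ignored
    messages=[{"role": "user", "content": "One-line status check, please."}],
    max_tokens=32,
)
print(resp.choices[0].message.content)
```

Keeping the client on the OpenAI wire format is also what makes the "use both" pattern cheap: swapping engines is a server-side change.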

Operational matrix

| Dimension | vLLM | llama.cpp |
|---|---|---|
| Concurrent serving (10+ users on one rig) | Excellent: built for it. | Limited: sequential by default; LocalAI/llama-swap to multiplex. |
| Single-stream tok/s (one user at a time) | Excellent: fastest in the category. | Strong: within a few % on the same GPU. |
| OS portability (realistic stable platforms) | Limited: Linux first-class; Windows via WSL2; no macOS. | Excellent: Linux + macOS + Windows + iOS + Android. |
| Hardware portability (card types supported) | Strong: NVIDIA + AMD ROCm; CUDA-first. | Excellent: CUDA + Metal + Vulkan + OpenCL + CPU-only. |
| Reproducibility (stand up the same setup six months later) | Acceptable: multi-knob; pin Python + CUDA + flash-attention + vLLM. | Strong: pin a commit + GGUF; that's it. |
| Multi-GPU (tensor-parallel across cards) | Excellent: tensor + pipeline parallel; first-class. | Strong: layer-split; functional but slower than vLLM TP. |
| Mobile / embedded (phones, RPi, Jetson) | Out of scope: server runtime. | Excellent: the reference mobile inference runtime. |
| Maintenance burden (operator hours per month) | Limited: 5-10 h/mo on driver / runtime / pin updates. | Strong: <1 h/mo; self-contained binary. |
| Observability (logs + metrics) | Strong: native Prometheus endpoint. | Acceptable: verbose stderr; you write your own metrics. |
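A note on the observability row: vLLM serves Prometheus metrics from the same port as its OpenAI API, so a sanity check is a plain HTTP GET against /metrics. The port below is a placeholder and the exact metric names vary by version.

```python
# Quick sanity check of vLLM's built-in Prometheus endpoint.
# Assumes a vLLM OpenAI server is already running on localhost:8000.
import urllib.request

with urllib.request.urlopen("http://localhost:8000/metrics") as resp:
    text = resp.read().decode()

# Print the vLLM-specific series (names vary across versions).
for line in text.splitlines():
    if line.startswith("vllm") and not line.startswith("#"):
        print(line)
```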

Failure modes — what breaks first

vLLM

  • Flash-attention pinning incompatibilities after a CUDA upgrade
  • Pip dependency conflicts when the runtime ships a major release
  • OOM on long contexts when the KV cache isn't pre-sized (see the sketch after this list)
  • WSL2 GPU passthrough breaks on Windows kernel updates
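The KV-cache OOM is mostly a sizing problem: if max_model_len is left at the model default, a long-context model may simply not fit in the cache budget. A minimal sketch of the two knobs that matter, with illustrative values only:

```python
# Pre-size the KV cache instead of discovering the limit in production.
# max_model_len and gpu_memory_utilization are real vLLM knobs; the values
# below are illustrative and depend on your card and model.
from vllm import LLM

llm = LLM(
    model="meta-llama/Llama-3.1-8B-Instruct",  # placeholder model id
    max_model_len=8192,           # cap context so the cache fits
    gpu_memory_utilization=0.90,  # fraction of VRAM vLLM may claim
)
```

The --max-model-len and --gpu-memory-utilization flags do the same job when launching the OpenAI server.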

llama.cpp

  • Outdated GGUF format after a major schema change (rare but happens)
  • Metal kernel issues on macOS major-version transitions
  • Vulkan support varies by driver — Intel/AMD inconsistent
  • Older quants (Q4_0 / Q5_0) deprecated in favor of K-quants (see the sketch below)
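For the deprecated-quants item, the usual fix is to regenerate the file as a K-quant rather than patch the old one. A sketch, assuming a recent llama.cpp build where the tool is named llama-quantize (older builds call it quantize) and a local full-precision GGUF to start from; all paths are placeholders.

```python
# Re-quantize a legacy GGUF to a K-quant with llama.cpp's quantize tool.
# Assumes a recent build where the binary is named llama-quantize
# (older builds call it `quantize`); paths are placeholders.
import subprocess

subprocess.run(
    [
        "./llama-quantize",
        "models/my-model-f16.gguf",     # source: full-precision GGUF
        "models/my-model-Q4_K_M.gguf",  # destination
        "Q4_K_M",                       # target quant type
    ],
    check=True,
)
```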

Editorial verdict

If your workload is single-user single-machine, llama.cpp is almost always the right answer. The maintenance burden is dramatically lower, the OS coverage is dramatically wider, and the throughput gap on single-stream is small enough not to matter day-to-day.
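For scale, here is roughly what the whole single-user stack looks like through the llama-cpp-python bindings: one pip install, one GGUF file, no server process. Model path and settings are placeholders.

```python
# Single-user, single-machine: llama.cpp via the llama-cpp-python bindings.
# One GGUF file on disk, no server process. Path and settings are placeholders.
from llama_cpp import Llama

llm = Llama(
    model_path="./models/my-model-Q4_K_M.gguf",
    n_ctx=4096,        # context window
    n_gpu_layers=-1,   # offload everything the backend (CUDA/Metal) can take
)

resp = llm.create_chat_completion(
    messages=[{"role": "user", "content": "Give me one reason to keep this local."}],
    max_tokens=64,
)
print(resp["choices"][0]["message"]["content"])
```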

If you're serving anyone other than yourself — paying users, a small team, even a few colleagues — switch to vLLM the moment concurrent throughput matters. llama.cpp + a multiplexer (LocalAI, llama-swap) gets you 80% there but vLLM's continuous batching is the structural answer.
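The difference shows up as soon as requests overlap. A crude way to see it, assuming a vLLM OpenAI server on localhost:8000 serving a placeholder model: fire a few dozen requests at once and watch aggregate completion tokens per second, then run the same probe against a single sequential llama.cpp process.

```python
# Crude concurrency probe against an OpenAI-compatible endpoint.
# Assumes a server on localhost:8000 serving "local-model" (placeholders).
from concurrent.futures import ThreadPoolExecutor
import time

from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="unused")

def one_request(i: int) -> int:
    resp = client.chat.completions.create(
        model="local-model",
        messages=[{"role": "user", "content": f"Write two sentences about item {i}."}],
        max_tokens=64,
    )
    return resp.usage.completion_tokens

start = time.time()
with ThreadPoolExecutor(max_workers=32) as pool:
    tokens = sum(pool.map(one_request, range(32)))
print(f"{tokens} completion tokens in {time.time() - start:.1f}s")
```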

Use both. Operators we trust run llama.cpp on every laptop/desktop they touch and vLLM only on the production rack. They're different tools.
