RUNLOCALAI · v38

Independently operated catalog for local-AI hardware and software. Hand-written verdicts. Source-cited claims. Reproducible commands when we have them.

DISCLOSURE

Some links on this site are affiliate links (Amazon Associates and other first-class retailers). When you buy through them, we earn a small commission at no extra cost to you. Affiliate links do not influence our verdicts: there are cards we rate highly that we don't have affiliate relationships with, and cards that sell well that we refuse to recommend.

© 2026 runlocalai.co · Independently operated
server
Open source
free
4.8/5
Operational review

vLLM

High-throughput inference engine with PagedAttention, continuous batching, and tensor + pipeline parallelism. The reference deployment runtime when you've outgrown llama.cpp / Ollama for production serving. Backed by Anyscale + UC Berkeley.

By Fredoline Eruo·Reviewed May 6, 2026·50,000 GitHub stars

What this tool actually is

vLLM is the high-throughput inference engine that turned self-hosted LLM serving from a research project into a production category. Calling it "an OpenAI-compatible server" — which is how vendor docs and most listings frame it — undersells it by one whole architectural layer. The OpenAI compatibility is the wrapper; the engine underneath is a paged-memory KV cache scheduler with continuous batching that gets ~5-24x more throughput than naïve HuggingFace generate() on the same hardware.

The layer it occupies in the stack:

  • Below: the model weights (HuggingFace format, AWQ, GPTQ, FP8, BNB) on one or more GPUs. CUDA / ROCm / TPU / Gaudi backends.
  • Above: any HTTP client that speaks the OpenAI Chat / Completions API — or, increasingly, a Kubernetes-orchestrated serving layer like Ray Serve or KServe.

What it replaces: HuggingFace TGI for teams that want raw throughput over polish, hand-rolled FastAPI + transformers wrappers, and the "we'll just call OpenAI" line item once the math stops working. The 2025-2026 cycle made vLLM the assumed answer when an engineering team says "we need an LLM endpoint."

Who it is for. Anyone serving more than ~1 request/second sustained, anyone who needs multi-tenancy on a single endpoint, anyone running 70B+ models that need tensor parallelism. Who it is not for. Single-user laptop chat (use Ollama), Apple Silicon (use MLX-LM), edge / mobile deployment (use llama.cpp), or NVIDIA-only shops that need every last token-per-second on Hopper / Blackwell (compile to TensorRT-LLM instead).

Architecture

The mental model that makes vLLM make sense — and that explains why its throughput numbers look impossible until you understand the memory layout:

┌────────────────────────────────────────────────────────────────┐
│  vLLM AsyncLLMEngine                                           │
│                                                                │
│  ┌──────────────────────────────────────────────────────────┐  │
│  │  Scheduler                                               │  │
│  │   - prefill queue, decode queue                          │  │
│  │   - continuous batching: new request joins on next step  │  │
│  │   - chunked prefill: long prompts split, interleaved     │  │
│  └─────────────────────────┬────────────────────────────────┘  │
│                            │                                   │
│  ┌─────────────────────────▼────────────────────────────────┐  │
│  │  PagedAttention KV-cache manager                         │  │
│  │   - GPU memory split into fixed-size blocks (e.g. 16t)   │  │
│  │   - per-request "page table" maps logical→physical block │  │
│  │   - prefix cache: shared system prompts stay resident    │  │
│  └─────────────────────────┬────────────────────────────────┘  │
│                            │                                   │
│  ┌─────────────────────────▼────────────────────────────────┐  │
│  │  Worker(s) — one per GPU                                 │  │
│  │   - tensor parallel within a node (NCCL all-reduce)      │  │
│  │   - pipeline parallel across nodes (Ray transports)      │  │
│  │   - speculative decoding (draft + target model)          │  │
│  └──────────────────────────────────────────────────────────┘  │
└────────────────────────────────────────────────────────────────┘

Three things to understand:

  1. PagedAttention is the headline innovation. Naïve KV-cache allocates contiguous memory per request sized to the max sequence length — wasting 60-80% of cache space when most requests are short. PagedAttention chops cache into fixed blocks (default 16 tokens) and assigns blocks dynamically. Memory efficiency goes from ~30% to >95%, and that headroom turns directly into batch size, which turns into throughput.
  2. Continuous batching runs requests through prefill and decode independently. New requests join the running batch on the next decode step rather than waiting for the current batch to finish. This is what closes the latency gap with single-request inference at high QPS.
  3. Prefix caching matters more than benchmarks suggest. When 100 requests share the same system prompt, vLLM keeps that prefix's KV cache resident across requests — first-token latency drops from 200ms to 10ms for cache hits. Multi-tenant chat applications get most of vLLM's wall-clock win from this single feature.
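
The block-table idea in point 1 can be sketched in a few lines. This is a toy model, not vLLM's implementation: ToyBlockAllocator, contiguous_utilization, and the numbers used are illustrative assumptions, and only the 16-token block size comes from the text above.

```python
import math

BLOCK_SIZE = 16  # tokens per KV block, the default quoted above


class ToyBlockAllocator:
    """Toy page table: maps each request to a list of physical block ids."""

    def __init__(self, num_blocks: int) -> None:
        self.free_blocks = list(range(num_blocks))
        self.page_tables: dict[str, list[int]] = {}
        self.token_counts: dict[str, int] = {}

    def append_tokens(self, request_id: str, num_tokens: int) -> None:
        """Grow a request, allocating new blocks only when the last one is full."""
        table = self.page_tables.setdefault(request_id, [])
        new_count = self.token_counts.get(request_id, 0) + num_tokens
        needed = math.ceil(new_count / BLOCK_SIZE) - len(table)
        if needed > len(self.free_blocks):
            raise MemoryError("no free KV blocks left")
        for _ in range(needed):
            table.append(self.free_blocks.pop())
        self.token_counts[request_id] = new_count

    def release(self, request_id: str) -> None:
        """Return a finished request's blocks to the free pool."""
        self.free_blocks.extend(self.page_tables.pop(request_id, []))
        self.token_counts.pop(request_id, None)

    def utilization(self) -> float:
        """Fraction of allocated token slots that actually hold a token."""
        slots = sum(len(t) for t in self.page_tables.values()) * BLOCK_SIZE
        return sum(self.token_counts.values()) / slots if slots else 1.0


def contiguous_utilization(token_counts: list[int], max_seq_len: int) -> float:
    """Utilization when every request reserves max_seq_len slots up front."""
    return sum(token_counts) / (len(token_counts) * max_seq_len)
```

With two short requests (20 and 5 tokens), the paged allocator wastes only the tail of each request's last block, while a contiguous reservation sized for a hypothetical 2048-token maximum leaves almost all of its slots empty. The gap is larger than the 60-80% figure above because this toy workload is deliberately extreme.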

The serving layer on top is a thin FastAPI wrapper that exposes /v1/chat/completions, /v1/completions, and /v1/embeddings. The same client SDKs that talk to OpenAI work once you point their base URL at the vLLM server.
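
A minimal sketch of what "speaks the OpenAI API" means in practice, using only the Python standard library. The helper names build_chat_request and send_chat_request are hypothetical, and the base URL and model name are assumed to match a local vllm serve on port 8000.

```python
import json
import urllib.request


def build_chat_request(base_url: str, model: str, user_text: str,
                       max_tokens: int = 64) -> urllib.request.Request:
    """Build a POST request for the OpenAI-style chat completions route."""
    payload = {
        "model": model,
        "messages": [{"role": "user", "content": user_text}],
        "max_tokens": max_tokens,  # keep this tight so abandoned requests stay cheap
    }
    return urllib.request.Request(
        url=base_url.rstrip("/") + "/chat/completions",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def send_chat_request(request: urllib.request.Request, timeout: float = 30.0) -> str:
    """Send the request and return the first choice's text (needs a running server)."""
    with urllib.request.urlopen(request, timeout=timeout) as response:
        body = json.loads(response.read())
    return body["choices"][0]["message"]["content"]
```

Calling send_chat_request(build_chat_request("http://localhost:8000/v1", "meta-llama/Llama-3.1-8B-Instruct", "Hello")) should return the model's reply once the server is up. The official OpenAI SDKs do the same thing once their base URL is overridden.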

Local stack compatibility

vLLM is GPU-and-Linux-biased by design. CUDA is the mature backend; ROCm is good and getting better; Gaudi / TPU / Metal are partial-to-experimental. The compatibility matrix on this page lists nine backends with the operational notes that matter when wiring each. The short version: NVIDIA H100/A100/L40S are reference targets, RTX 4090/5090 are fine for single-card homelab, MI300X is the credible AMD alternative for workloads that need its 192GB of HBM, and everything else needs verification on your specific model. Apple Silicon and CPU paths exist but you'd be using vLLM wrong on either.

Real deployment paths

The four ways teams actually run vLLM in 2026, ordered by operator skill required. (The deployment cards on this page show hardware + complexity at a glance; the prose here is operator-grade detail.)

The single-GPU homelab path is where most readers start. pip install vllm, then vllm serve meta-llama/Llama-3.1-8B-Instruct, then point any OpenAI client at http://localhost:8000/v1. Total time from zero to working endpoint is under 10 minutes if CUDA is already set up. The constraint is KV cache headroom: an 8B FP16 model needs ~16GB for weights, so a 24GB card at the default 0.9 utilization leaves ~5-6GB for KV, which caps you at roughly 32K-40K total tokens in flight. (A 13B FP16 model is ~26GB of weights and does not fit on a 24GB card at all.) Drop to AWQ-INT4 to roughly triple the KV budget.
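
The tokens-in-flight ceiling follows from simple arithmetic. The sketch below assumes Llama-3.1-8B's published shape (32 layers, 8 KV heads, head dimension 128) and an FP16 KV cache; the function names are hypothetical, and real deployments lose some of the headroom to activations and runtime overhead.

```python
def kv_bytes_per_token(num_layers: int, num_kv_heads: int, head_dim: int,
                       dtype_bytes: int = 2) -> int:
    """Bytes of KV cache one token occupies: a key and a value per layer per KV head."""
    return 2 * num_layers * num_kv_heads * head_dim * dtype_bytes


def max_tokens_in_flight(kv_headroom_gb: float, bytes_per_token: int) -> int:
    """Upper bound on tokens (summed over all live requests) the headroom can hold."""
    return int(kv_headroom_gb * 1e9 // bytes_per_token)


# Assumed shape for Llama-3.1-8B: 32 layers, 8 KV heads (GQA), head_dim 128, FP16 cache.
LLAMA_8B_KV_BYTES = kv_bytes_per_token(num_layers=32, num_kv_heads=8, head_dim=128)

if __name__ == "__main__":
    for headroom_gb in (4.0, 6.0, 8.0):
        tokens = max_tokens_in_flight(headroom_gb, LLAMA_8B_KV_BYTES)
        print(f"{headroom_gb:.0f} GB of KV headroom -> about {tokens:,} tokens in flight")
```

Under these assumptions each token costs 128 KiB of cache, so a few GB of headroom translates to tens of thousands of tokens. Models without grouped-query attention cost several times more per token, which is why the same card holds far fewer tokens for them.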

The multi-GPU server path is the production middle. tensor_parallel_size=4 shards a 70B model across 4xA100 80GB, with NCCL all-reduce on every layer. The key constraint people miss: NVLink matters. PCIe-only multi-GPU works but loses 30-40% of throughput to interconnect bandwidth. Pipeline parallelism is rarely worth it within a node — use TP unless the model literally won't fit.
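
A pre-flight check along these lines catches the common tensor-parallel misconfigurations before launch. It is a hedged sketch: validate_tensor_parallel is a hypothetical helper, the 64-head figure is Llama-3.1-70B's published attention head count, and the divisibility rule reflects how vLLM rejects configurations where heads cannot be split evenly (confirm against your version's error messages).

```python
def validate_tensor_parallel(num_attention_heads: int, tensor_parallel_size: int,
                             visible_gpus: int) -> list[str]:
    """Return human-readable problems with a planned TP configuration (empty if none)."""
    problems = []
    if tensor_parallel_size < 1:
        problems.append("tensor_parallel_size must be at least 1")
        return problems
    if tensor_parallel_size > visible_gpus:
        problems.append(
            f"TP={tensor_parallel_size} needs more GPUs than the {visible_gpus} visible"
        )
    if num_attention_heads % tensor_parallel_size != 0:
        problems.append(
            f"{num_attention_heads} attention heads do not split evenly across "
            f"TP={tensor_parallel_size}"
        )
    idle = visible_gpus - tensor_parallel_size
    if idle > 0:
        problems.append(
            f"{idle} GPU(s) would sit idle; raise TP or run more replicas"
        )
    return problems


if __name__ == "__main__":
    # Assumed: Llama-3.1-70B has 64 attention heads.
    for tp, gpus in ((4, 4), (3, 4), (4, 8)):
        print(tp, gpus, validate_tensor_parallel(64, tp, gpus) or "ok")
```

The idle-GPU warning is advisory rather than an error: running two TP=4 replicas on an 8-GPU box is a legitimate layout, but a single TP=4 replica on that box is usually a mistake.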

The multi-node distributed path is where vLLM's design starts to look like Pathways or DeepSpeed-Inference. Ray orchestrates one head node + N worker nodes; tensor parallel within each node, pipeline parallel across nodes. Required for 405B / 671B-class models. InfiniBand or RoCE is non-negotiable; plain TCP over Ethernet kills you on inter-stage activations. Expect to spend a week on Ray cluster sizing before throughput numbers stabilize.

The Kubernetes production path is where serious self-hosted inference lives. KServe or Ray Serve in front of vLLM, GPU operator below, an autoscaler that spins up replicas based on QPS or queue depth. This is what teams build when they replace an OpenAI bill — the engineering investment is real (1-2 platform engineers' worth of attention) but the ROI math works at $50K+/month OpenAI spend.

Resource usage and performance

Numbers to plan around (single-card unless noted):

  • VRAM = model weights + KV cache + activations + overhead. Rough budget: 70B FP16 ≈ 140GB weights, so it doesn't fit on a single 80GB card. TP across 2x H100 80GB is too tight: at the default 0.9 utilization only ~2GB per card is left for KV (144GB usable minus 140GB of weights), so plan on 4 cards for FP16. 70B AWQ-INT4 ≈ 35-40GB weights and fits on one H100 with 30GB+ KV headroom.
  • gpu_memory_utilization (default 0.9) is the most important knob you'll touch. Lower it to 0.85 if you OOM on the first request; raise it to 0.95 if nothing else is using the GPU and you want bigger batches. Do not touch max_num_seqs until you've understood this one.
  • Throughput math: on H100, Llama-3.1-70B AWQ at concurrent batch ~32 sustains ~3500 tok/s aggregate. On 4xA100 with TP, the same workload runs ~2800 tok/s — the H100 wins on memory bandwidth.
  • TTFT (time to first token): ~50ms cold prefix on a short prompt; <10ms warm prefix cache. Long-context (32K+ prompt) prefills take 1-3 seconds — chunked prefill smooths this out by interleaving prefill and decode.
  • Prefix cache hit rate is the metric that predicts your wall-clock cost. Workloads with shared system prompts (chat apps, RAG with stable instructions) hit 70-90%; ad-hoc generation hits ~5%. Plan capacity at the workload's actual hit rate.
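
The VRAM budget in the first bullet can be checked with a small calculator. This is a rough sketch under stated assumptions: weights are split evenly across tensor-parallel ranks, activations and runtime overhead are ignored (so real headroom is lower), and the function names are hypothetical.

```python
def weights_gb(params_billion: float, bits_per_weight: float) -> float:
    """Approximate weight footprint in GB (1B params at 8 bits is about 1 GB)."""
    return params_billion * bits_per_weight / 8


def kv_headroom_per_gpu_gb(params_billion: float, bits_per_weight: float,
                           gpu_gb: float, num_gpus: int,
                           gpu_memory_utilization: float = 0.9) -> float:
    """VRAM left per GPU for KV cache, before activations and overhead.

    A negative result means the weights alone do not fit.
    """
    usable = gpu_gb * gpu_memory_utilization
    per_gpu_weights = weights_gb(params_billion, bits_per_weight) / num_gpus
    return usable - per_gpu_weights


if __name__ == "__main__":
    plans = [
        ("70B FP16, 1x 80GB", 70, 16, 80, 1),
        ("70B FP16, 2x 80GB", 70, 16, 80, 2),
        ("70B FP16, 4x 80GB", 70, 16, 80, 4),
        ("70B INT4, 1x 80GB", 70, 4, 80, 1),
    ]
    for label, params, bits, gpu_gb, n in plans:
        headroom = kv_headroom_per_gpu_gb(params, bits, gpu_gb, n)
        print(f"{label}: {headroom:+.1f} GB per GPU for KV cache")
```

Real AWQ checkpoints land a little above the pure 4-bit figure because of scales and unquantized layers, which is why the bullet above quotes 35-40GB rather than 35GB.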

The honest scaling limit on a single replica: ~50-100 concurrent active requests before scheduler tail latency degrades. Past that, scale horizontally — more replicas behind a router beats one mega-replica.

Failure modes

The list of things that will go wrong in production, in rough order of how often we've seen them:

  1. OOM on first inference (not on load). Model loads fine, first request crashes with CUDA OOM. The KV cache wasn't sized correctly. Lower gpu_memory_utilization from 0.9 to 0.85, or set max_model_len lower than the model's native context. vLLM doesn't reserve KV up front aggressively enough for some configs.
  2. tensor_parallel_size mismatched to GPU count. Setting TP=4 on an 8-GPU box leaves 4 cards idle and silently halves throughput. Always verify with nvidia-smi that all expected GPUs see traffic.
  3. Long-prompt prefill spike stalls all decode requests. A 32K-token prompt without chunked prefill enabled blocks the whole engine for 1-3 seconds. Enable --enable-chunked-prefill in any deployment that accepts long contexts.
  4. Prefix cache invalidation on system prompt drift. Changing the system prompt by one token invalidates the cached prefix. Apps that template variable user data into the system prompt get 0% cache hit rate; move variable parts to the user message.
  5. Driver / CUDA / PyTorch version mismatch. vLLM ships against specific CUDA + PyTorch versions per release. Mismatch causes silent kernel selection failures (you don't get errors, you get 30% throughput). Always pin the Docker image rather than pip install vllm on the host.
  6. Streaming connection drop with no client cleanup. If a client disconnects mid-stream, vLLM keeps generating until max_tokens then drops the result. Set tight max_tokens and use abort signals; otherwise GPU time burns on dead requests.
  7. Multi-LoRA loading silent failure. Loading 8 LoRAs against a base model — vLLM applies them but the wrong adapter binds to the wrong request under high concurrency. Fixed in 0.6+ but worth verifying with synthetic traffic.
  8. NCCL hang on multi-node TP. Inter-node tensor parallel with non-uniform NIC settings deadlocks at startup. NCCL_DEBUG=INFO and NCCL_IB_HCA configuration are mandatory; assume a half-day of cluster debugging the first time.
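
Failure mode 4 is easier to see with a toy model of block-level prefix hashing. The sketch below is an illustration, not vLLM's code: it assumes each cache block is keyed by a hash of its own tokens chained with the previous block's hash, uses whitespace splitting as a stand-in tokenizer, and uses a 4-token block so the example stays readable.

```python
import hashlib

TOY_BLOCK_SIZE = 4  # tokens per block; tiny on purpose for readability


def block_hashes(tokens: list[str], block_size: int = TOY_BLOCK_SIZE) -> list[str]:
    """Hash each full block, chaining in the previous hash so position matters."""
    hashes = []
    previous = ""
    full_length = len(tokens) - len(tokens) % block_size
    for start in range(0, full_length, block_size):
        chunk = " ".join(tokens[start:start + block_size])
        previous = hashlib.sha256((previous + "|" + chunk).encode("utf-8")).hexdigest()
        hashes.append(previous)
    return hashes


def shared_prefix_blocks(a: list[str], b: list[str]) -> int:
    """Number of leading blocks two prompts could share in the cache."""
    shared = 0
    for hash_a, hash_b in zip(block_hashes(a), block_hashes(b)):
        if hash_a != hash_b:
            break
        shared += 1
    return shared


# Stable system prompt, variable data in the user turn: the prefix is reusable.
STABLE = "You are a helpful assistant for the billing team".split()
stable_a = STABLE + "alice asks about refunds".split()
stable_b = STABLE + "bob asks about invoices".split()

# Variable data templated into the system prompt: the first block already differs.
templated_a = "You are helping alice with the billing team today".split()
templated_b = "You are helping bob with the billing team today".split()

if __name__ == "__main__":
    print("stable system prompt:", shared_prefix_blocks(stable_a, stable_b), "shared blocks")
    print("templated system prompt:", shared_prefix_blocks(templated_a, templated_b), "shared blocks")
```

Because each hash depends on everything before it, a single changed token early in the prompt invalidates every later block, even when the remaining text is identical.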

How it compares

vs SGLang. SGLang is the credible challenger as of 2026. RadixAttention vs PagedAttention is the architectural difference: SGLang's tree-structured cache is faster on heavily-shared prefix workloads (think structured generation, agent loops with stable system prompts); vLLM's flat block model is faster on diverse-prompt workloads. Pick SGLang if your traffic is structurally repetitive; vLLM if it isn't.

vs TensorRT-LLM. TensorRT-LLM compiles a model to a fixed engine for a specific GPU; vLLM runs PyTorch kernels with dynamic batching. TensorRT-LLM wins on raw single-request latency on Hopper / Blackwell (15-30% lower TTFT). vLLM wins on iteration speed (no recompile to test a config change), backend coverage, and quant flexibility. Use TensorRT-LLM when you've committed to one SKU and need every microsecond.

vs HuggingFace TGI. TGI was the production default in 2023-2024; vLLM ate that lunch through 2024-2025. TGI still has tighter HF Hub integration and slightly nicer ops surface; vLLM has the throughput lead and the ecosystem momentum. New deployments default to vLLM unless HF integration matters.

vs llama.cpp server mode. llama.cpp is the right answer for CPU, Apple Silicon, edge, and "I want it to work on a Raspberry Pi." vLLM is the right answer for GPU production scale. Different categories — they barely overlap.

vs Ray Serve. Not an alternative — a layer above. Ray Serve orchestrates vLLM (or other engines) into multi-replica autoscaling deployments. The right pattern is Ray Serve + vLLM, not Ray Serve vs vLLM.

Best use cases

Where vLLM is genuinely the right answer:

  • Self-hosted OpenAI-API replacement at SaaS scale. Same client code, different bill.
  • Multi-tenant inference behind a single endpoint — prefix caching makes the per-request cost collapse on shared system prompts.
  • 70B+ model serving that needs tensor parallelism. The TP path is mature and predictable.
  • High-throughput batch inference — process millions of prompts overnight, vLLM's continuous batching turns it into hours instead of days.
  • Multi-LoRA serving where one base model + many adapters serves multiple fine-tuned variants from the same memory.

Where vLLM is the wrong answer:

  • Single-user laptop chat (Ollama is simpler).
  • Apple Silicon (MLX-LM).
  • Edge / mobile / Raspberry Pi (llama.cpp).
  • Hard real-time, single-request, NVIDIA-only (TensorRT-LLM).
  • Models with attention variants vLLM hasn't merged kernels for (verify before committing).

Verdict

vLLM is the production-default inference engine for self-hosted LLM serving in 2026. PagedAttention turned KV-cache memory efficiency from a research footnote into a 5-24x throughput delta against naïve baselines, and the project's discipline through 2024-2025 — continuous batching, prefix caching, chunked prefill, multi-LoRA, speculative decoding — turned that single innovation into a complete production stack. The OpenAI-compatible API on top makes it a drop-in for any team running an OpenAI bill they'd rather not pay.

The honest tradeoffs: it's GPU-and-Linux-biased; the operations skill required is real (NCCL, gpu_memory_utilization, prefix cache invalidation are not "configure once and forget" knobs); and on heavily structured workloads SGLang now has a credible architectural advantage. None of those are reasons to default away from vLLM — they're the conditions under which a different tool might win.

Buy / use this if you're serving more than ~1 request/second sustained on GPU and you don't have a hard requirement that pulls you to TensorRT-LLM or SGLang. Skip it if you're at single-user scale, on Apple Silicon, or running inference on anything other than NVIDIA or AMD GPUs.

Rating math: 4.8/5. The 0.2 points lost are for the operational skill ramp (the engine rewards expertise; punishes "we'll figure it out as we go") and for the kernel coverage gap on niche model architectures. We've recommended vLLM as the production default for two consecutive cycles and the recommendation hasn't aged.

Sources

  • vLLM GitHub — release notes, kernel coverage matrix, supported architectures.
  • vLLM documentation — operator reference for tensor parallelism, prefix caching, chunked prefill.
  • PagedAttention paper (SOSP 2023) — the original architectural argument.

Related

  • SGLang — closest credible alternative engine
  • TensorRT-LLM — when you've committed to NVIDIA and need the last microsecond
  • Ray Serve — the orchestration layer above vLLM in production
  • llama.cpp, Ollama — the local-first / single-user side of inference
  • /maps/local-ai-agents-2026 — where vLLM sits in the runtime zone
  • /authors/fred-oline — about the author
Local stack compatibility

Status | Runtime / Stack | Notes
Excellent | NVIDIA H100 / H200 | Reference target. FP8 first-class via Hopper Transformer Engine; speculative decoding + chunked prefill all stable. Production sweet spot for 70B-class models.
Excellent | NVIDIA A100 (80GB / 40GB) | Workhorse. No FP8 native (use FP16 / BF16 / AWQ). Tensor parallel scales linearly to 8x; the 40GB SKU is tight for 70B + long context.
Good | NVIDIA RTX 4090 / 5090 | Consumer path. Single-card 13B-class fine; 70B AWQ runs but KV-cache headroom shrinks fast at long context. NVLink absent on consumer SKUs limits TP.
Excellent | NVIDIA L40S / L4 | Enterprise sweet spot for cost-per-token. Same 48GB / 24GB headroom as datacenter Ada parts; FP8 on L40S, FP16 on L4.
Good | AMD MI300X / MI250 | ROCm path matured through 2025. MI300X's 192GB HBM is genuinely useful for 405B-class models. Kernel coverage trails CUDA; verify your model's attention variant is supported before committing.
Partial | Intel Gaudi 2 / 3 | Habana fork merged upstream; works for the headline LLaMA / Mistral / Qwen architectures. Niche models still hit unsupported-op errors. Intel-backed roadmap; trajectory is positive.
Partial | Google TPU v5e / v5p | TPU backend lives but lags features by 3-6 months. Use vLLM-on-TPU when GCP economics force the choice; otherwise prefer JAX-native paths.
Limited | Apple Silicon (Metal) | An experimental Metal path exists; treat it as a tech demo. For Apple Silicon serving use MLX-LM or llama.cpp instead.
Limited | CPU-only | An x86 backend exists but the engine's value (paged KV + continuous batching at GPU memory bandwidth) doesn't translate. Use llama.cpp for CPU.

Real deployment paths

Single-GPU homelab

moderate

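One 24-48GB consumer GPU. An 8B FP16 model or a quantized 13B-32B model fits in 24GB; 13B FP16 (~26GB of weights) and 70B AWQ-INT4 (~35-40GB) need the 48GB end of the range. The fastest path from zero to an OpenAI-compatible endpoint on your own hardware. Most readers start here.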

Hardware: RTX 4090 / 5090 / L4 24GB · 32GB+ system RAM · Linux + recent CUDA 12.x

Multi-GPU server (tensor parallel)

involved

2-8 GPUs in one box. tensor_parallel_size = N to shard model weights across cards. Required path for 70B+ at FP16/BF16, or for 405B at any quant. NVLink helps; PCIe-only works but is bandwidth-bound.

Hardware: 2-8x A100 80GB / H100 / MI300X · NVLink ideal · 256GB+ system RAM

Multi-node distributed (TP + PP via Ray)

expert

Splitting a model across machines. tensor parallel within a node, pipeline parallel across nodes, all orchestrated by Ray. The path for 405B / 671B-class models that genuinely need multi-host inference.

Hardware: 2-4x DGX-class nodes · InfiniBand / RoCE · dedicated Ray head node

Kubernetes production (KServe / Ray Serve)

expert

Autoscaling endpoints, traffic splitting, canary deploys. vLLM as the engine, KServe or Ray Serve as the orchestrator. The path for 'we replaced our OpenAI bill with self-hosted inference at SaaS scale.'

Hardware: Multi-node GPU cluster · Kubernetes 1.28+ · GPU operator + NVIDIA runtime

Setup guidance

Install via pip in a Python 3.10+ venv with CUDA 12.4+: pip install vllm. For production, use the Docker image: docker pull vllm/vllm-openai:latest is fine for a first try, but pin a specific release tag once you deploy.

  • Start the server: vllm serve meta-llama/Llama-3.1-8B-Instruct --port 8000
  • Verify it: curl http://localhost:8000/v1/chat/completions -H "Content-Type: application/json" -d '{"model":"meta-llama/Llama-3.1-8B-Instruct","messages":[{"role":"user","content":"Hello"}]}'
  • If you hit CUDA OOM on the first inference request (not on model load), the --gpu-memory-utilization default of 0.9 is too aggressive; drop to 0.85.

First run downloads the model checkpoint from HuggingFace; budget 5–20 minutes for a 70B model on a decent connection. Time-to-first-useful-response from zero: ~10 minutes with Docker and a cached model. vLLM exposes an OpenAI-compatible REST API at /v1/chat/completions and /v1/completions. The Prometheus metrics endpoint is at /metrics by default.
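
For repeatable launches it can help to build the vllm serve command from explicit settings instead of retyping flags. The sketch below is one possible wrapper: build_serve_command is a hypothetical helper, and the flags it emits are the ones named on this page (check vllm serve --help on your installed version before relying on them).

```python
from typing import Optional


def build_serve_command(model: str, port: int = 8000,
                        gpu_memory_utilization: float = 0.9,
                        max_model_len: Optional[int] = None,
                        tensor_parallel_size: int = 1,
                        chunked_prefill: bool = False,
                        prefix_caching: bool = False) -> list[str]:
    """Assemble an argv list for `vllm serve`, suitable for subprocess.run."""
    command = [
        "vllm", "serve", model,
        "--port", str(port),
        "--gpu-memory-utilization", str(gpu_memory_utilization),
    ]
    if max_model_len is not None:
        # Capping context length shrinks the worst-case KV reservation.
        command += ["--max-model-len", str(max_model_len)]
    if tensor_parallel_size > 1:
        command += ["--tensor-parallel-size", str(tensor_parallel_size)]
    if chunked_prefill:
        command.append("--enable-chunked-prefill")
    if prefix_caching:
        command.append("--enable-prefix-caching")
    return command


if __name__ == "__main__":
    argv = build_serve_command(
        "meta-llama/Llama-3.1-8B-Instruct",
        gpu_memory_utilization=0.85,
        max_model_len=8192,
        chunked_prefill=True,
        prefix_caching=True,
    )
    print(" ".join(argv))
```

Passing the returned list to subprocess.run(argv, check=True) starts the server; keeping the settings in code makes the OOM fix (0.9 to 0.85) a reviewed change rather than a shell-history detail.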

Workload fit

Best for:

  • multi-tenant LLM serving on NVIDIA datacenter GPUs (H100, A100, H200, B200)
  • production API backends with >1 concurrent request sustained
  • LoRA-swapping serving of many fine-tuned adapters from one base model
  • speculative decoding pipelines where 1.5–2.5× throughput matters

Not suited for:

  • single-user local chat (use Ollama)
  • CPU-only or Apple Silicon inference (use llama.cpp or MLX-LM)
  • GGUF-centric model ecosystems without conversion
  • embedding-only workloads (use Text Embeddings Inference)

The migration point from Ollama to vLLM: sustained >1 concurrent request or throughput requirements above ~100 tok/s aggregate.

Alternatives

Use vLLM when your workload has sustained multi-request concurrency — its continuous batching and PagedAttention deliver 5–24× throughput over naive HuggingFace generate(). Switch to SGLang when prefix cache hit rate exceeds ~50% (agent loops, stable system prompts) — RadixAttention wins 15–40% there. Use TensorRT-LLM when locked to a single NVIDIA GPU SKU (H100/H200/B200) and you need every microsecond of single-request latency; it wins 15–30% but requires a 2-hour engine compile for large models. Pick Ollama or LM Studio for single-user desktop use — they are simpler and cover CPU/Apple Silicon. Use llama.cpp when you need CPU inference, Apple Silicon, Vulkan, or the broadest GGUF quantization coverage.
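
Why continuous batching helps can be shown with a toy step counter. This is a simplified model with loud assumptions: every request is already queued at time zero, each decode step produces one token per active request, prefill cost is ignored, and the function names are hypothetical.

```python
from collections import deque


def static_batching_steps(output_lengths: list[int], batch_size: int) -> int:
    """Steps when each batch must fully finish before the next batch starts."""
    total = 0
    for start in range(0, len(output_lengths), batch_size):
        # The whole batch waits for its longest member.
        total += max(output_lengths[start:start + batch_size])
    return total


def continuous_batching_steps(output_lengths: list[int], batch_size: int) -> int:
    """Steps when a finished request's slot is refilled on the very next step.

    Assumes every length is at least 1.
    """
    queue = deque(output_lengths)
    active: list[int] = []  # remaining tokens for each running request
    steps = 0
    while queue or active:
        while queue and len(active) < batch_size:
            active.append(queue.popleft())
        steps += 1
        active = [remaining - 1 for remaining in active if remaining > 1]
    return steps


if __name__ == "__main__":
    lengths = [8, 2, 2, 2]  # one long request mixed with three short ones
    print("static:", static_batching_steps(lengths, batch_size=2))
    print("continuous:", continuous_batching_steps(lengths, batch_size=2))
```

In this example the short requests slip through the free slot while the long one keeps decoding, so the queue drains in fewer steps. Real gains depend on how much output lengths vary across the workload.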

Troubleshooting + when to switch

  • Problem: CUDA OOM on first inference request but model loaded fine. Fix: Lower --gpu-memory-utilization from 0.9 to 0.85. The KV cache needs headroom beyond weight memory. Check actual VRAM usage with nvidia-smi and tune down until stable.
  • Problem: Requests stall at high concurrency. Fix: Enable --enable-chunked-prefill. Long prompts without chunked prefill block the scheduler from interleaving decode steps.
  • Problem: Prefix cache never hits despite identical system prompts. Fix: Any change to the system prompt, including one trailing space, invalidates the cache. Template variable data into user messages instead of modifying the system prompt. Use --enable-prefix-caching explicitly and monitor cache hit rate via Prometheus metrics.
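
Monitoring via Prometheus usually starts with reading the /metrics text. The sketch below parses the plain-text exposition format with the standard library only; parse_vllm_metrics is a hypothetical helper, and the metric names in SAMPLE are illustrative, since exact names vary between vLLM versions (check your own /metrics output).

```python
def parse_vllm_metrics(text: str, prefix: str = "vllm:") -> dict[str, float]:
    """Extract `name -> value` for samples whose metric name starts with prefix.

    Labels are dropped, so if a metric appears with several label sets the
    last one wins. Good enough for a single-model server.
    """
    values: dict[str, float] = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # skip blanks plus HELP / TYPE comments
        name_and_labels, _, raw_value = line.rpartition(" ")
        name = name_and_labels.split("{", 1)[0]
        if not name.startswith(prefix):
            continue
        try:
            values[name] = float(raw_value)
        except ValueError:
            continue  # ignore samples we cannot parse
    return values


# Illustrative sample only; real output has many more series.
SAMPLE = """\
# HELP vllm:num_requests_running Number of requests currently running.
# TYPE vllm:num_requests_running gauge
vllm:num_requests_running{model_name="demo"} 3.0
vllm:num_requests_waiting{model_name="demo"} 12.0
python_gc_objects_collected_total{generation="0"} 100.0
"""

if __name__ == "__main__":
    for name, value in parse_vllm_metrics(SAMPLE).items():
        print(name, value)
```

A growing waiting count with a flat running count is the stall pattern described above; for cache hit rate, look up the prefix-cache series your version exports rather than assuming a name.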

Stack & relationships

How vLLM relates to other entries in the catalog — recommended pairings, alternatives, dependencies, and edges to avoid. Each edge carries a one-line operator note from our editorial team.

vLLM ↔ ecosystem

Recommended stack

  • Commonly deployed with
    Ray Serve

    Ray Serve in front of vLLM is the canonical K8s production pattern — autoscaling replicas, traffic splitting, canary deploys.

  • Commonly deployed with
    OpenHands

    vLLM is the default inference engine in the canonical OpenHands stack. Continuous batching pays for itself when the agent makes 10+ tool calls per task.

  • Commonly deployed with
    OpenClaw

    Production OpenClaw deployments default to vLLM. Same logic as OpenHands — continuous batching matters for autonomous agent loops.

  • Commonly deployed with
    AnythingLLM

    AnythingLLM points at any OpenAI-compatible endpoint; vLLM is the production runtime when team size grows past 5 users.

  • Pairs with
    Letta (memory framework)

    Letta drives an inference engine via OpenAI-compatible API. vLLM's continuous batching matters because Letta makes 5-15 retrieval-then-generate calls per task. Same wiring pattern as Mem0.

  • Commonly deployed with
    Goose

    vLLM is the production runtime pairing for Goose. OpenAI-compatible plug-in with no adapter.

  • Pairs with
    Phoenix (Arize AI)

    vLLM exposes OpenInference-compatible spans; Phoenix consumes them directly. The default OSS observability pairing for self-hosted vLLM deployments.

Works with

  • Works with
    AnythingLLM

    Treats vLLM as a generic OpenAI endpoint. The throughput upgrade once you've moved past single-laptop deployment.

  • Works with
    Open WebUI

    Talks to vLLM's OpenAI-compatible endpoint with no adapter. Pairs naturally with the /stacks/rtx-4090-workstation deployment.

  • Works with
    LocalAI

    LocalAI can route to a vLLM backend for production-throughput LLM inference while still serving image/audio/TTS through other backends behind the same endpoint.

Alternatives

  • Competes with
    SGLang

    RadixAttention vs PagedAttention. SGLang wins on heavily-shared prefix workloads (structured generation, agent loops); vLLM wins on diverse prompts. Pick by traffic shape.

  • Competes with
    TensorRT-LLM

    TensorRT-LLM compiles to a fixed engine for one GPU SKU; vLLM runs PyTorch kernels with dynamic batching. Pick TensorRT-LLM if you need every microsecond on Hopper/Blackwell.

  • Competes with
    Text Generation Inference (TGI)

    TGI was the 2023-2024 production default; vLLM ate that lunch through 2024-2025. New deployments default to vLLM unless HF Hub integration matters.

  • Alternative to
    Ollama

    Different category, common confusion. Ollama is for single-user laptops; vLLM is for production GPU serving. They barely overlap.

  • Alternative to
    Petals

    Different category, common confusion. Petals is for 'I cannot fit this model anywhere and don't have a GPU cluster'; vLLM is for 'I have a GPU cluster and need throughput.' Surface the boundary explicitly.

  • Alternative to
    Exo

    Different hardware target. vLLM = NVIDIA/Linux datacenter; Exo = Apple Silicon LAN cluster. Pick by which hardware you already own.

Depends on

  • Requires
    llama.cpp

    Not a runtime dependency — but vLLM does NOT replace llama.cpp for CPU / Apple Silicon / edge. Different categories; if your hardware is outside vLLM's wheelhouse use llama.cpp.

Lifecycle

  • Succeeds
    Text Generation Inference (TGI)

    TGI was the 2023-2024 production default; vLLM ate that lunch through 2024-2025. New deployments default to vLLM unless HuggingFace Hub integration matters specifically.

Avoid pairing with

  • Incompatible with
    MLX-LM

    Different ecosystems entirely — vLLM is GPU/Linux/CUDA, MLX-LM is Apple Silicon/Metal. They don't compete; they don't pair. Listed here so the page graph makes the boundary explicit.

Featured in these stacks

The L3 execution stacks that pick this tool as a recommended component, with the one-line note explaining the role it plays in each.

  • Stack · L3·Workstation tier·Role: Inference engine (production-grade serving)
    Build a local coding-agent stack (May 2026)

    vLLM over Ollama for this stack: continuous batching means an agent making 5-10 concurrent tool calls per task doesn't queue, prefix caching keeps the system prompt resident across iterations, and the OpenAI-compatible API plugs into OpenHands with zero adapter code. Use Ollama only for single-user laptop chat.

  • Stack · L3·Workstation tier·Role: Inference engine (production-grade serving)
    Build an RTX 4090 AI workstation stack (May 2026)

    vLLM over Ollama for the production-serving role on this box — continuous batching matters when 3-5 users hit the same model concurrently, and the OpenAI-compatible endpoint makes Open WebUI / AnythingLLM / OpenHands plug in without adapter code. Keep Ollama installed alongside for ad-hoc model swaps.

  • Stack · L3·Production tier·Role: Inference engine (TP within node, PP across nodes)
    Build a distributed inference homelab stack (May 2026)

    vLLM over SGLang for distributed homelab: better-tested multi-node TP+PP path, broader kernel coverage on Hopper / Blackwell, and the Ray integration is first-class. SGLang's RadixAttention advantage applies but the multi-node story is younger; pick vLLM unless your traffic is heavily prefix-shared agent loops.

  • Stack · L3·Workstation tier·Role: Inference engine
    Build a memory-enabled local agent stack (May 2026)

    vLLM continuous batching matters here: a memory-enabled agent makes 10-30 retrieval-then-generate calls per task. Prefix caching keeps the memory-injection prompt resident across iterations. Use Ollama only if the agent runs at single-user pace.

  • Stack · L3·Workstation tier·Role: Inference engine
    Build a local reasoning-model stack (May 2026)

    vLLM over Ollama for reasoning models: continuous batching matters because reasoning-token emissions are long (a single query can emit 5000+ tokens). Prefix caching helps when batch reasoning-mode queries share system prompts. KV-cache management matters more here than on chat models.

  • Stack · L3·Workstation tier·Role: Inference engine (vision-aware)
    Build a local vision-model stack (May 2026)

    vLLM has first-class vision-language model support as of v0.7+. Image preprocessing happens server-side; the OAI endpoint accepts image URLs and base64 images. Continuous batching matters for vision because image tokenization is more expensive than text.

  • Stack · L3·Workstation tier·Role: Inference engine
    Build a fully offline coding stack (May 2026)

    vLLM with pre-pulled Docker image + pre-staged HuggingFace cache runs entirely offline. Continuous batching matters because the agent makes 5-15 tool calls per task. The OpenAI-compatible API plugs into OpenHands with no adapter.

  • Stack · L3·Workstation tier·Role: Inference engine (production tensor-parallel-2)
    Dual RTX 3090 workstation stack — 70B-class on $1,800 of used GPUs

    vLLM tensor-parallel-2 with --tensor-parallel-size 2 is the canonical configuration. AWQ-INT4 fits 70B in the 48 GB envelope with ~6 GB headroom for KV cache at 8K context. Continuous batching extracts throughput at 4-8 concurrent agent loops; at this scale SGLang's prefix-cache wins are also significant — pick by workload.

  • Stack · L3·Production tier·Role: Inference engine (PCIe-aware tensor-parallel-2)
    Dual RTX 4090 workstation stack — newer-architecture 70B serving without NVLink

    vLLM tensor-parallel-2 with --tensor-parallel-size 2. Verify via nvidia-smi that NVLink is NOT engaged; that's expected on the 4090. The performance gap vs an NVLink-equipped pair is ~10-15% on tensor-parallel decode. SGLang is the better pick when prefix-cache hit rate is high (agent loops with stable system prompts).

  • Stack · L3·Homelab tier·Role: Inference engine (tensor-parallel-4)
    Quad RTX 3090 workstation stack — the prosumer 100B-class ceiling

    vLLM --tensor-parallel-size 4 is the canonical configuration for quad-GPU. NVLink between paired cards (0-1, 2-3); cross-pair traffic goes over PCIe. For the lowest single-stream latency, run 2× tensor-parallel-2 replicas instead. It is counter-intuitive, but the cross-pair PCIe overhead at TP-4 dominates.

  • Stack · L3·Production tier·Role: Inference engine (TP-4 + FP8 + MTP)
    4× H100 SXM tensor-parallel workstation — frontier MoE serving reference

    vLLM is the production reference. --tensor-parallel-size 4 with FP8 quants exploits the H100's Transformer Engine; the multi-token-prediction (MTP) head for V4 Pro gives ~1.8× decode throughput. Set --gpu-memory-utilization 0.95, since H100s have generous memory capacity headroom.
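
The dual RTX 3090 note above cites ~6 GB of KV-cache headroom at 8K context. The sketch below is a rough estimate of what that headroom buys, not a measurement. It assumes a Llama-style 70B layout (80 layers, 8 KV heads via grouped-query attention, head dimension 128) and an FP16 cache; those architecture numbers are assumptions and do not come from this page. It also treats "6 GB" as 6 GiB for simplicity.

```python
# Rough KV-cache sizing sketch. All model-shape numbers are assumptions
# (Llama-style 70B with GQA), not values stated on this page.

GIB = 1024 ** 3


def kv_bytes_per_token(layers: int, kv_heads: int, head_dim: int,
                       dtype_bytes: int = 2) -> int:
    """Bytes of KV cache one token occupies across all layers.

    The leading factor 2 covers the separate K and V tensors.
    """
    return 2 * layers * kv_heads * head_dim * dtype_bytes


def full_sequences(headroom_bytes: int, context_len: int,
                   per_token_bytes: int) -> int:
    """How many sequences of full context length fit in the headroom."""
    return headroom_bytes // (context_len * per_token_bytes)


per_token = kv_bytes_per_token(layers=80, kv_heads=8, head_dim=128)
per_seq_gib = 8192 * per_token / GIB
fits = full_sequences(6 * GIB, 8192, per_token)

print(f"KV cache per token: {per_token / 1024:.0f} KiB")
print(f"KV cache per 8K sequence: {per_seq_gib:.2f} GiB")
print(f"Full-length 8K sequences in 6 GiB: {fits}")
```

Under these assumptions a full 8K sequence costs about 2.5 GiB, so the headroom holds roughly two sequences at full length. Shorter requests use proportionally less, so real concurrency at typical agent prompt lengths can be higher.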

Featured in these workflows

Full-system workflows that include this tool as part of their service ledger — with the one-line operator note for each.

  • Workflow · System·homelab·Role: Inference engine
    Local coding-agent system

    AWQ-INT4 path lets Qwen 2.5 Coder 32B fit a single 4090 with 32K context plus headroom. Continuous batching handles the agent's tool-call burst pattern.

  • Workflow · System·edge·Role: Inference engine
    Offline RAG pipeline

    Continuous batching scales to 5-15 users on a single 4090. Ollama works for solo use but caps out at one stream.

  • Workflow · System·homelab·Role: Primary inference
    Homelab AI API gateway

    Continuous batching makes this the right backend when N small clients are firing 1-shot requests on overlapping schedules.

  • Workflow · System·research·Role: Inference engine
    Local evaluation lab

    Continuous batching makes harness runs ~3-5× faster than single-stream inference. Reproducibility is solid (deterministic seeds work).

  • Workflow · System·research·Role: Eval inference
    Local fine-tuning workstation

    After training, merge LoRA → load in vLLM → run lm-eval-harness. The same engine that serves production also evaluates fine-tuned models.
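
Several of the notes above depend on many small clients hitting the OpenAI-compatible API at once, which is the pattern continuous batching is built for. The sketch below fires concurrent chat requests using only the Python standard library. The base URL, model name, and helper names (`build_chat_request`, `send`, `fan_out`) are hypothetical; the `/v1/chat/completions` route and response shape follow the OpenAI chat API convention that such servers mirror.

```python
import json
from concurrent.futures import ThreadPoolExecutor
from urllib import request


def build_chat_request(base_url: str, model: str, prompt: str,
                       max_tokens: int = 64) -> request.Request:
    """Build a POST for the OpenAI-style chat completions route."""
    payload = {
        "model": model,
        "messages": [{"role": "user", "content": prompt}],
        "max_tokens": max_tokens,
    }
    return request.Request(
        url=base_url.rstrip("/") + "/v1/chat/completions",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def send(req: request.Request, timeout: float = 120.0) -> str:
    """Send one request and return the first choice's text."""
    with request.urlopen(req, timeout=timeout) as resp:
        body = json.load(resp)
    return body["choices"][0]["message"]["content"]


def fan_out(prompts, base_url: str, model: str, sender=send,
            workers: int = 8):
    """Issue all prompts concurrently; results keep the input order."""
    reqs = [build_chat_request(base_url, model, p) for p in prompts]
    with ThreadPoolExecutor(max_workers=workers) as pool:
        return list(pool.map(sender, reqs))


if __name__ == "__main__":
    # Hypothetical endpoint and model name; adjust to your deployment.
    answers = fan_out(
        ["Summarize PagedAttention in one sentence."] * 4,
        base_url="http://localhost:8000",
        model="my-served-model",
    )
    for text in answers:
        print(text)
```

The `sender` parameter is injectable so the fan-out logic can be exercised without a running server. With a live server, the requests arrive on overlapping schedules and the engine batches them on its side; the client needs no batching logic of its own.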

Pros

  • Best throughput in class
  • OpenAI-compatible API
  • Tensor parallelism
  • Speculative decoding

Cons

  • Linux-only
  • GPU-first (CPU-only operation is possible but typically slower)
  • Steeper learning curve than Ollama

Compatibility

Operating systems
Linux
GPU backends
NVIDIA CUDA
AMD ROCm
Intel Gaudi
TPU
License
Open source · free

Runtime health

Operator-grade signals on how actively vLLM is being maintained, how fresh its measurements are, and what failure classes operators have flagged. Every label below is anchored to a real date or count — we never infer maintainer activity we can't show.

Release cadence

Derived from the most recent editorial signal on this row.

Active
Updated Jun 12, 2026

8 days since last refresh · source: lastUpdated

Benchmark freshness

How recent the editorial measurements on this runtime are.

0 editorial benchmarks

No editorial benchmarks for this runtime yet.

Community reproduction

Submissions that match an editorial measurement on similar hardware.

0 reproduced reports

No community reproductions on file yet.

Ecosystem stability

Editorial rating from RunLocalAI — qualitative, not measured.

4.8/5 ✓ Editorial

Get vLLM

Official site
https://docs.vllm.ai
GitHub
https://github.com/vllm-project/vllm

Frequently asked

Is vLLM free?

Yes — vLLM is free to use and open-source.

What operating systems does vLLM support?

vLLM supports Linux.

Which GPUs work with vLLM?

vLLM supports NVIDIA CUDA, AMD ROCm, Intel Gaudi, and TPU backends. CPU-only operation is also possible but typically slower.

Reviewed by RunLocalAI Editorial. See our editorial policy for how we evaluate tools.

Related — keep moving

Compare hardware
  • RTX 4090 vs RTX 5090 →
  • Dual 3090 vs RTX 5090 (tensor-parallel) →
  • RTX 5090 vs H100 →
Buyer guides
  • Best GPU for local AI →
  • Best AI PC build under $2,000 →
When it doesn't work
  • vLLM CUDA version mismatch →
  • Tensor parallelism crash →
  • CUDA driver too old →
  • CUDA out of memory →
Recommended hardware
  • RTX 4090 (24 GB) →
  • RTX 5090 (32 GB) →
  • H100 PCIe (datacenter) →
Alternatives
  • SGLang
  • Text Generation Inference (TGI)
  • Qdrant
  • Weaviate
  • Graphiti (Zep)
  • LanceDB
  • Redis (vector search)
  • Milvus