How to configure vLLM GPU memory allocation
Prerequisites
vLLM installed, single or multi-GPU setup
What this does
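Controls how much of each GPU's VRAM vLLM may claim and how that budget is split between model weights and the KV cache during inference. A larger KV cache raises throughput by allowing more concurrent sequences, but an overly aggressive setting causes out-of-memory errors, so the goal is the highest setting that starts and runs reliably.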
Steps
Determine total GPU VRAM. Knowing available memory guides every subsequent setting.
nvidia-smi --query-gpu=memory.total,memory.free --format=csv
Expected output: one CSV row per GPU listing total and free VRAM in MiB.
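If the numbers need to feed a script rather than be read by eye, the same query can be parsed programmatically. The following is a minimal sketch, assuming the `csv,noheader,nounits` format variant so that each row is just two integers; the helper names `parse_gpu_memory` and `query_gpu_memory` are hypothetical, and the sample values are illustrative rather than real measurements.

```python
import subprocess

# Same query as above, with header and units stripped so each row is "total, free".
QUERY = [
    "nvidia-smi",
    "--query-gpu=memory.total,memory.free",
    "--format=csv,noheader,nounits",
]


def parse_gpu_memory(csv_text):
    """Parse 'total, free' rows (MiB) into a list of dicts, one per GPU."""
    gpus = []
    for index, line in enumerate(csv_text.strip().splitlines()):
        total, free = (int(field.strip()) for field in line.split(","))
        gpus.append({"gpu": index, "total_mib": total, "free_mib": free})
    return gpus


def query_gpu_memory():
    """Run nvidia-smi and parse its output (requires an NVIDIA driver)."""
    result = subprocess.run(QUERY, capture_output=True, text=True, check=True)
    return parse_gpu_memory(result.stdout)


# Illustrative sample in the shape nvidia-smi prints for two GPUs.
sample = "81920, 81000\n81920, 40500\n"
for gpu in parse_gpu_memory(sample):
    print(gpu)
```

On a machine with GPUs, `query_gpu_memory()` returns the live values; a large gap between total and free usually means another process is already holding VRAM.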
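Set the GPU memory utilization fraction. The --gpu-memory-utilization flag sets the fraction of each GPU's total VRAM that vLLM may use for model weights, activations, and the KV cache combined. Whatever remains of that budget after the weights and activation workspace are accounted for goes to the KV cache. The default is 0.9.
vllm serve <model> \
  --gpu-memory-utilization 0.85 \
  --max-model-len 8192
Expected output: the server starts and sizes the KV cache to fit within the target utilization.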
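A back-of-envelope estimate can show how the utilization fraction translates into KV cache space. This is a rough sketch only: vLLM measures the real figures by profiling at startup, and the function name `kv_cache_budget_gib`, the 13.5 GiB weight size, and the 2 GiB activation peak are all assumptions chosen for illustration.

```python
def kv_cache_budget_gib(total_gib, utilization, weights_gib, activation_gib):
    """Rough KV cache budget: utilization budget minus weights and activations."""
    if not 0.0 < utilization <= 1.0:
        raise ValueError("utilization must be in (0, 1]")
    budget = total_gib * utilization - weights_gib - activation_gib
    # A negative result means the model does not fit at this setting.
    return max(budget, 0.0)


# Illustrative numbers: an 80 GiB GPU, about 13.5 GiB of fp16 weights,
# and an assumed 2 GiB activation peak.
for utilization in (0.70, 0.85, 0.90):
    estimate = kv_cache_budget_gib(80, utilization, 13.5, 2.0)
    print(f"utilization={utilization:.2f} -> ~{estimate:.1f} GiB for KV cache")
```

The estimate is only useful for comparing settings against each other; the startup log of the running server remains the authoritative figure.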
Tune max model length separately. The --max-model-len flag (max_model_len in the Python API) caps the context window, meaning prompt plus generated tokens, and so bounds the KV cache memory a single sequence can consume.
vllm serve <model> \
  --max-model-len 4096 \
  --gpu-memory-utilization 0.85
Expected output: context window capped at 4096 tokens.
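The link between context length and per-sequence memory can be made concrete with the standard KV cache size formula: two tensors (key and value) per layer, each holding `num_kv_heads * head_dim` elements per token. The sketch below assumes an unquantized fp16 cache and a hypothetical model shape (32 layers, 32 KV heads, head dimension 128); the real values should be read from the model's own configuration.

```python
def kv_bytes_per_token(num_layers, num_kv_heads, head_dim, dtype_bytes=2):
    """Bytes of KV cache one token occupies across all layers."""
    # The factor 2 covers the separate key and value tensors in every layer.
    return 2 * num_layers * num_kv_heads * head_dim * dtype_bytes


def kv_gib_per_sequence(max_model_len, **shape):
    """KV cache needed by one sequence that fills the whole context window."""
    return max_model_len * kv_bytes_per_token(**shape) / 2**30


# Hypothetical model shape, fp16 cache (2 bytes per element).
shape = dict(num_layers=32, num_kv_heads=32, head_dim=128, dtype_bytes=2)
for max_len in (4096, 8192):
    print(max_len, kv_gib_per_sequence(max_len, **shape), "GiB")
# → 4096 2.0 GiB
# → 8192 4.0 GiB
```

Under these assumptions, halving the context window halves the worst-case footprint of each sequence, which is why lowering it is an effective fix when startup fails.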
Set block size for KV cache granularity. The default of 16 works for most cases.
vllm serve <model> \
  --block-size 16 \
  --gpu-memory-utilization 0.85
Expected output: server logs display the block size as configured.
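To see what granularity means in practice, the sketch below models a sequence as occupying whole blocks, so the last block may be partly empty. This is a simplified illustration of block-based allocation, not vLLM's internal allocator, and the helper names are hypothetical.

```python
import math


def blocks_for_sequence(num_tokens, block_size=16):
    """Whole blocks needed to hold num_tokens tokens."""
    return math.ceil(num_tokens / block_size)


def unused_slots(num_tokens, block_size=16):
    """Token slots left empty in the final, partly filled block."""
    return blocks_for_sequence(num_tokens, block_size) * block_size - num_tokens


# A 1000-token sequence at three block sizes: blocks used and slots wasted.
for block_size in (8, 16, 32):
    print(block_size, blocks_for_sequence(1000, block_size), unused_slots(1000, block_size))
# → 8 125 0
# → 16 63 8
# → 32 32 24
```

Larger blocks mean fewer blocks to track but more unused slots at the tail of each sequence, which is the trade-off behind leaving the default alone in most cases.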
Verification
curl -s http://localhost:8000/v1/models | python -m json.tool
# Expected: model listed, confirming server started with configured memory budget
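The same check can be scripted, for example as part of a deployment test. This is a minimal sketch, assuming the server listens on `localhost:8000` as in the command above and returns the OpenAI-style list shape with a `data` array; the function names and the sample model id `my-model` are hypothetical.

```python
import json
import urllib.request


def model_ids(payload):
    """Return model ids from an OpenAI-style /v1/models response body."""
    return [entry["id"] for entry in payload.get("data", [])]


def fetch_model_ids(base_url="http://localhost:8000"):
    """Query a running server; raises URLError if it is not reachable."""
    with urllib.request.urlopen(f"{base_url}/v1/models", timeout=10) as response:
        return model_ids(json.load(response))


# Offline demonstration with a sample response body.
sample = json.loads('{"object": "list", "data": [{"id": "my-model", "object": "model"}]}')
print(model_ids(sample))  # → ['my-model']
```

With the server running, `fetch_model_ids()` should return a non-empty list; an empty list or a connection error means startup did not complete.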
Common failures
- CUDA out of memory at startup: --gpu-memory-utilization is set too high. Reduce it to 0.75 or 0.7.
- max_model_len exceeds available memory: Lower max_model_len until startup succeeds without OOM.
- GPU utilization below 50%: The utilization fraction is too low. Raise --gpu-memory-utilization in steps of 0.05.
- Multi-GPU memory imbalance: CUDA_VISIBLE_DEVICES ordering may not match NCCL expectations. Setting CUDA_DEVICE_ORDER=PCI_BUS_ID makes CUDA enumerate devices in the same order that nvidia-smi lists them.
Operator checkpoint
Before treating this as solved, write down the local runtime, model or package version, hardware/backend if relevant, and the verification output. This keeps the guide useful as a Will-It-Run style decision instead of a one-off command transcript.