What this does

Controls how much of each GPU's VRAM vLLM may claim and how much of that budget is left for the KV cache once the model weights are loaded. A larger KV cache lets more sequences run concurrently and raises throughput; a budget that is too large causes out-of-memory errors at startup.

Steps

Determine total GPU VRAM. Knowing available memory guides every subsequent setting.
```
nvidia-smi --query-gpu=memory.total,memory.free --format=csv
```
Expected output: a CSV header followed by one row per GPU with total and free VRAM in MiB.
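For scripting, the same query can be turned into numbers. The following is a minimal sketch, assuming `nvidia-smi` is on the PATH; `parse_gpu_memory` and `query_gpu_memory` are hypothetical helper names, and the `noheader,nounits` format options drop the header row and the `MiB` suffix so the fields parse as integers.

```python
import subprocess

def parse_gpu_memory(csv_text):
    """Parse 'total, free' MiB rows (one per GPU) into a list of dicts."""
    gpus = []
    for index, line in enumerate(csv_text.strip().splitlines()):
        total_mib, free_mib = (int(field.strip()) for field in line.split(","))
        gpus.append({"index": index, "total_mib": total_mib, "free_mib": free_mib})
    return gpus

def query_gpu_memory():
    """Run nvidia-smi and return per-GPU memory; requires an NVIDIA driver."""
    out = subprocess.run(
        ["nvidia-smi", "--query-gpu=memory.total,memory.free",
         "--format=csv,noheader,nounits"],
        capture_output=True, text=True, check=True,
    ).stdout
    return parse_gpu_memory(out)

# Illustrative sample text, not real measurements.
sample = "24576, 24100\n24576, 23950\n"
print(parse_gpu_memory(sample))
```

On a machine with GPUs, `query_gpu_memory()` returns the live values in the same shape; the index is simply the row order of the `nvidia-smi` output.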
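Set the GPU memory utilization fraction. The --gpu-memory-utilization flag sets the fraction of each GPU's total VRAM (default 0.9) that this vLLM instance may use for weights, activations, and the KV cache. The KV cache receives whatever remains after the weights and the activation workspace are accounted for.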
```
vllm serve <model> \
  --gpu-memory-utilization 0.85 \
  --max-model-len 8192
```
Expected output: server starts, and the startup logs report the KV cache size that fit within the configured budget.
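A back-of-the-envelope estimate shows how the fraction translates into KV cache space. This is a rough sketch, assuming the fraction caps the whole vLLM instance (weights, activations, and KV cache) as the vLLM documentation describes; vLLM measures the real figures by profiling at startup, so the function name, the 14 GiB weight size, and the 2 GiB overhead below are illustrative assumptions only.

```python
def estimate_kv_cache_budget_gib(total_gib, utilization, weights_gib, overhead_gib):
    """Rough KV cache budget: the instance budget minus weights and overhead."""
    instance_budget = total_gib * utilization
    return instance_budget - weights_gib - overhead_gib

# Hypothetical numbers: 24 GiB GPU, ~14 GiB of fp16 weights, and 2 GiB assumed
# for activations and runtime overhead.
for utilization in (0.75, 0.85, 0.90):
    budget = estimate_kv_cache_budget_gib(24.0, utilization, 14.0, 2.0)
    print(f"utilization={utilization:.2f} -> ~{budget:.1f} GiB for KV cache")
```

With these assumed numbers, a small change in the fraction has a large relative effect on the KV cache, because the weights are a fixed cost inside the budget.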
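Tune max model length separately. The --max-model-len flag caps the context window (prompt plus generated tokens) of each sequence, which bounds the KV cache a single sequence can occupy. Startup fails if the KV cache cannot hold at least one sequence of this length.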
```
vllm serve <model> \
  --max-model-len 4096 \
  --gpu-memory-utilization 0.85
```
Expected output: context window capped at 4096 tokens.
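The per-sequence footprint can be estimated with the usual KV cache formula: one key and one value vector per layer, per KV head, per token. This is a sketch under assumptions; the model shape below (32 layers, 8 KV heads, head dimension 128, 2-byte elements) is hypothetical, and the real values come from the model's config.

```python
def kv_bytes_per_token(num_layers, num_kv_heads, head_dim, dtype_bytes=2):
    """Bytes of KV cache per token: a key and a value vector in every layer."""
    return 2 * num_layers * num_kv_heads * head_dim * dtype_bytes

def kv_mib_per_sequence(max_model_len, **model_shape):
    """MiB of KV cache one sequence needs when it fills the whole context."""
    return kv_bytes_per_token(**model_shape) * max_model_len / 2**20

# Hypothetical model shape; read the real numbers from the model's config.
shape = dict(num_layers=32, num_kv_heads=8, head_dim=128, dtype_bytes=2)
for max_len in (4096, 8192):
    print(max_len, "tokens ->", kv_mib_per_sequence(max_len, **shape), "MiB")
```

For this assumed shape, halving the context window from 8192 to 4096 tokens halves the worst-case footprint of a sequence from 1024 MiB to 512 MiB.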
Set block size for KV cache granularity. The --block-size flag sets how many tokens each KV cache block holds. The default of 16 works for most cases.
```
vllm serve <model> \
  --block-size 16 \
  --gpu-memory-utilization 0.85
```
Expected output: server logs display the block size as configured.
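Because the KV cache is handed out in fixed-size blocks, the block size determines how many blocks a sequence needs and how many token slots sit unused in its last block. A small sketch of that arithmetic follows; the helper names are hypothetical and this is not vLLM's allocator.

```python
import math

def blocks_per_sequence(sequence_len, block_size=16):
    """Blocks needed to hold sequence_len tokens."""
    return math.ceil(sequence_len / block_size)

def wasted_slots(sequence_len, block_size=16):
    """Token slots left unused in the last, partially filled block."""
    return blocks_per_sequence(sequence_len, block_size) * block_size - sequence_len

print(blocks_per_sequence(4096, 16))  # blocks for a full 4096-token context
print(wasted_slots(1000, 16))         # unused slots for a 1000-token sequence
print(wasted_slots(1000, 32))         # same sequence with larger blocks
```

Larger blocks mean fewer blocks to track but more unused slots per sequence, which is one reason the default is a reasonable starting point.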

Verification

curl -s http://localhost:8000/v1/models | python -m json.tool
# Expected: JSON listing the served model, confirming the server started. The memory budget itself shows up in the startup logs and in nvidia-smi, not in this response.
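The same check can be scripted. This is a minimal sketch, assuming the server listens on `localhost:8000` and returns the OpenAI-style list shape (`{"data": [{"id": ...}]}`); `served_model_ids` and `fetch_models` are hypothetical helper names.

```python
import json
import urllib.request

def served_model_ids(payload):
    """Extract model IDs from an OpenAI-style /v1/models response body."""
    return [entry["id"] for entry in payload.get("data", [])]

def fetch_models(base_url="http://localhost:8000"):
    """GET /v1/models from a running server and return the parsed JSON."""
    with urllib.request.urlopen(f"{base_url}/v1/models", timeout=5) as response:
        return json.load(response)

# Illustrative payload; "example-model" is a placeholder, not a real model ID.
sample = {"object": "list", "data": [{"id": "example-model", "object": "model"}]}
print(served_model_ids(sample))
# Against a live server: print(served_model_ids(fetch_models()))
```

An empty list or a connection error means the server did not come up, in which case the startup logs are the next place to look.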

Common failures

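CUDA out of memory at startup: --gpu-memory-utilization is set higher than the memory actually available on the GPU. Reduce it to 0.75 or 0.7, or stop other processes that hold VRAM.
max_model_len exceeds available memory: the KV cache cannot hold one full-length sequence. Lower --max-model-len, or raise --gpu-memory-utilization, until startup succeeds without OOM.
GPU memory use below 50%: the utilization fraction is too low. Raise --gpu-memory-utilization in steps of 0.05.
Multi-GPU memory imbalance: CUDA_VISIBLE_DEVICES ordering may not match NCCL expectations. Compare per-GPU free memory with nvidia-smi, and set CUDA_DEVICE_ORDER=PCI_BUS_ID so CUDA device indices match the nvidia-smi listing.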
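Several of these failures can be caught before launch by comparing each GPU's free memory with the budget the fraction would request. The following is a heuristic sketch, not vLLM's own check; `preflight` is a hypothetical helper and the readings are made up.

```python
def preflight(gpus, utilization):
    """Return a warning per GPU whose free memory is below the requested budget.

    gpus: list of (total_mib, free_mib) tuples, one per GPU.
    """
    warnings = []
    for index, (total_mib, free_mib) in enumerate(gpus):
        requested_mib = total_mib * utilization
        if free_mib < requested_mib:
            warnings.append(
                f"GPU {index}: {free_mib} MiB free < {requested_mib:.0f} MiB requested"
            )
    return warnings

# Hypothetical readings: GPU 1 already hosts another process.
print(preflight([(24576, 24100), (24576, 15000)], utilization=0.85))
```

A warning for only one GPU points to another process on that device rather than to a fraction that is too high overall.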

Operator checkpoint

Before treating this as solved, write down the local runtime, model or package version, hardware/backend if relevant, and the verification output. This keeps the guide useful as a Will-It-Run style decision instead of a one-off command transcript.
