HOW-TO · SET

How to configure vLLM GPU memory allocation

Intermediate · 15 min · By Fredoline Eruo
Target environment
Ubuntu 24.04 · Ollama 0.4.x
Windows 11 · Ollama 0.4.x
macOS 15 · Ollama 0.4.x
PREREQUISITES

vLLM installed, single or multi-GPU setup

What this does

Controls how much of each GPU's VRAM a vLLM instance may claim, and how that budget is split between model weights, activations, and the KV cache during inference. A larger KV cache raises throughput by holding more concurrent tokens, while too large a budget risks out-of-memory errors.

Steps

  1. Determine total GPU VRAM. Knowing available memory guides every subsequent setting.

    nvidia-smi --query-gpu=memory.total,memory.free --format=csv
    

    Expected output: a table listing total and free VRAM per GPU in MiB.

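  2. Set the GPU memory utilization fraction. The --gpu-memory-utilization flag sets the fraction of each GPU's memory that this vLLM instance may use (default 0.9). Model weights and activation memory come out of that budget first, and what remains becomes the KV cache.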

    vllm serve <model> \
      --gpu-memory-utilization 0.85 \
      --max-model-len 8192
    

    Expected output: the server starts, with the KV cache sized to fill what remains of the configured budget after the weights are loaded.
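To make the fraction concrete, the sketch below turns per-GPU totals into the memory budget implied by a given --gpu-memory-utilization value. The function name budget_mib and the sample totals are illustrative assumptions, not part of vLLM, and the result is only an upper bound on what the instance may claim.

```shell
# Print the per-GPU memory budget (MiB) implied by a utilization fraction.
# Reads one total-VRAM value in MiB per line on stdin.
# budget_mib is a hypothetical helper name, not a vLLM command.
budget_mib() {
  local fraction="$1"
  awk -v f="$fraction" \
    '{ printf "GPU %d: %d MiB total, %d MiB budget\n", NR - 1, $1, int($1 * f) }'
}

# Sample input only: made-up totals for a 24 GiB and a 16 GiB card.
printf '24576\n16384\n' | budget_mib 0.85

# On a real machine, feed it from nvidia-smi instead:
#   nvidia-smi --query-gpu=memory.total --format=csv,noheader,nounits | budget_mib 0.85
```

Weights and activations are taken out of this budget before the KV cache, so the cache itself is always smaller than the figure printed here.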

  3. Tune max model length separately. The --max-model-len flag caps the context window (prompt plus generated tokens), which bounds the KV cache a single sequence can occupy.

    vllm serve <model> \
      --max-model-len 4096 \
      --gpu-memory-utilization 0.85
    

    Expected output: context window capped at 4096 tokens.
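A rough way to see why the context cap matters is to estimate the KV cache one full-length sequence would need. The sketch below uses the common estimate for standard attention, 2 (keys and values) × layers × KV heads × head dimension × bytes per element × tokens. The function name and the model shape are assumptions for illustration; read the real values from your model's config, and expect the actual footprint to differ for architectures or cache dtypes that deviate from this layout.

```shell
# Estimate KV cache bytes for one sequence at full context length.
# Args: layers kv_heads head_dim bytes_per_element max_model_len
# kv_bytes_per_sequence is a hypothetical helper, not a vLLM command.
kv_bytes_per_sequence() {
  local layers="$1" kv_heads="$2" head_dim="$3" dtype_bytes="$4" max_len="$5"
  # The factor 2 covers the key tensor and the value tensor.
  echo $(( 2 * layers * kv_heads * head_dim * dtype_bytes * max_len ))
}

# Assumed model shape: 32 layers, 8 KV heads, head_dim 128, 2-byte (fp16) cache.
for len in 4096 8192; do
  bytes=$(kv_bytes_per_sequence 32 8 128 2 "$len")
  echo "max-model-len $len: $(( bytes / 1024 / 1024 )) MiB per full-length sequence"
done
```

Under these assumptions, halving the context cap from 8192 to 4096 halves the worst-case per-sequence footprint from 1024 MiB to 512 MiB, which leaves room for more concurrent sequences in the same budget.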

  4. Set block size for KV cache granularity. The block size is the number of tokens stored per KV cache block. The default of 16 works for most cases.

    vllm serve <model> \
      --block-size 16 \
      --gpu-memory-utilization 0.85
    

    Expected output: server logs display the block size as configured.
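Because the KV cache is handed out in whole blocks, a sequence occupies its token count rounded up to the next block boundary. A minimal sketch of that rounding, assuming a hypothetical helper named blocks_needed:

```shell
# Number of KV cache blocks a sequence of a given length occupies.
# Args: tokens block_size
# blocks_needed is a hypothetical helper, not a vLLM command.
blocks_needed() {
  local tokens="$1" block_size="$2"
  # Integer ceiling division: a partly filled block still counts as one block.
  echo $(( (tokens + block_size - 1) / block_size ))
}

blocks_needed 4096 16   # prints 256
blocks_needed 4100 16   # prints 257 (the last block holds only 4 tokens)
```

Smaller blocks waste less space in the final, partly filled block, while larger blocks mean fewer blocks to track. This is the trade-off the block size setting adjusts.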

Verification

curl -s http://localhost:8000/v1/models | python3 -m json.tool
# Expected: the model is listed, which confirms the server started without running out of memory

Common failures

  • CUDA out of memory at startup: --gpu-memory-utilization set too high. Reduce to 0.75 or 0.7.
  • max_model_len exceeds available memory: Lower max_model_len until startup succeeds without OOM.
  • GPU utilization below 50%: Utilization fraction is too low. Raise --gpu-memory-utilization in steps of 0.05.
  • Multi-GPU memory imbalance: CUDA_VISIBLE_DEVICES ordering may not match NCCL expectations.
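For the multi-GPU case, one thing worth ruling out is a mismatch between the device indices shown by nvidia-smi and the order CUDA enumerates them in. The fragment below is a sketch rather than a guaranteed fix: it pins CUDA enumeration to PCI bus order before launch. The device indices and the tensor-parallel size of 2 are assumptions for a two-GPU machine.

```shell
# Enumerate GPUs in PCI bus order so CUDA indices match what nvidia-smi shows.
export CUDA_DEVICE_ORDER=PCI_BUS_ID
# Expose the GPUs to use, in the intended order (indices 0,1 are an assumption).
export CUDA_VISIBLE_DEVICES=0,1

# Then launch as usual, for example:
#   vllm serve <model> --tensor-parallel-size 2 --gpu-memory-utilization 0.85
```

If the imbalance persists, compare the per-GPU figures from nvidia-smi after startup to see which device holds the extra allocation.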

Operator checkpoint

Before treating this as solved, write down the local runtime, model or package version, hardware/backend if relevant, and the verification output. This keeps the guide useful as a Will-It-Run style decision instead of a one-off command transcript.
