CUDA OOM that only happens at long context (KV cache blowup)
Cause
Model loads fine and runs short prompts, then OOMs partway into a long conversation or once the prompt grows past a threshold. This is KV cache pressure: KV memory grows linearly with context length, and depending on the runner the cache is either pre-allocated for the configured max_model_len at startup or grown on demand as the context fills.
Quick check: KV bytes per token = 2 (K and V) × num_layers × num_kv_heads × head_dim × bytes_per_element. Llama 3.1 8B (32 layers, 8 KV heads, head_dim 128) at FP16 works out to ~128 KB per token, i.e. ~128 MB per 1K tokens. At 128K context that's ~16 GB just for KV — more than the Q4 model weights themselves.
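The same arithmetic works for any model. Here is a minimal sketch — the architecture numbers are Llama 3.1 8B's and the token counts are examples, so substitute your own:

# kv_estimate.py — rough KV cache size; plug in your model's architecture numbers
def kv_cache_bytes(num_layers, num_kv_heads, head_dim, num_tokens, bytes_per_elem=2):
    # the leading 2 accounts for storing both K and V at every layer
    return 2 * num_layers * num_kv_heads * head_dim * bytes_per_elem * num_tokens

# Llama 3.1 8B: 32 layers, 8 KV heads, head_dim 128, FP16 KV (2 bytes/element)
print(kv_cache_bytes(32, 8, 128, 131_072) / 2**30)  # ~16 GiB at 128K tokens
print(kv_cache_bytes(32, 8, 128, 16_384) / 2**30)   # ~2 GiB at 16K tokens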
Solution
1. Lower the served context length to something realistic for your VRAM:
# vLLM
vllm serve meta-llama/Llama-3.1-8B-Instruct --max-model-len 16384
# llama.cpp / llama-server
./llama-server -m model.gguf -c 16384
# Ollama (Modelfile)
PARAMETER num_ctx 16384
2. Quantize the KV cache. vLLM, llama.cpp, and SGLang support FP8 or 4-bit KV caches, which roughly halves or quarters cache memory, usually with minimal quality impact (llama.cpp may require flash attention, -fa, for a quantized V cache):
# vLLM
vllm serve ... --kv-cache-dtype fp8
# llama.cpp
./llama-server -m model.gguf -c 32768 --cache-type-k q8_0 --cache-type-v q8_0
3. Pick a model with GQA. Models with grouped-query attention (num_kv_heads << num_attention_heads) have a 4-8× smaller KV cache. Llama 3.1, Qwen 2.5, Mistral 7B, and Mistral Nemo all use GQA; older Llama 2 7B/13B do not.
4. Pre-flight with /will-it-run to compute the max context that fits before you start the server; a rough offline version of that check (and the GQA check from step 3) is sketched after this list.
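A minimal sketch of both checks, assuming a local Hugging Face-style config.json (num_hidden_layers, num_attention_heads, num_key_value_heads, and hidden_size are the standard keys) and an example budget of 20 GiB of VRAM left over for KV after weights and activations — substitute your own figure:

# gqa_and_max_context.py — check for GQA and estimate the largest context that fits
import json

with open("config.json") as f:          # the model's Hugging Face config.json
    cfg = json.load(f)

layers   = cfg["num_hidden_layers"]
heads    = cfg["num_attention_heads"]
kv_heads = cfg.get("num_key_value_heads", heads)    # equal to heads => plain MHA, no GQA
head_dim = cfg.get("head_dim", cfg["hidden_size"] // heads)

print(f"GQA: {kv_heads < heads} ({kv_heads} KV heads vs {heads} attention heads)")

# Largest context that fits a given KV budget, assuming FP16 KV (2 bytes per element)
kv_bytes_per_token = 2 * layers * kv_heads * head_dim * 2
free_vram_for_kv   = 20 * 2**30          # example: 20 GiB free after weights + activations
print(f"max tokens that fit: {free_vram_for_kv // kv_bytes_per_token:,}")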
Alternative solutions
If you must keep long context: rent an H100 80 GB by the hour, run the long job, and terminate. vllm serve --enable-prefix-caching plus sticky sessions (routing a conversation back to the same replica) helps amortize the shared prefix across requests.
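As a sketch of that amortization: with prefix caching enabled, the server can reuse KV blocks for an identical leading prefix across requests over the OpenAI-compatible API. The base URL, model name, and report.txt path below are placeholders for your deployment:

# prefix_reuse.py — amortize a long shared prefix across requests via prefix caching
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="not-needed-locally")

long_document = open("report.txt").read()   # the expensive shared prefix

for question in ["Summarize the findings.", "List the open risks."]:
    resp = client.chat.completions.create(
        model="meta-llama/Llama-3.1-8B-Instruct",
        messages=[
            # Identical system + document prefix on every request -> cached KV blocks are reused
            {"role": "system", "content": "Answer using only the document."},
            {"role": "user", "content": f"{long_document}\n\n{question}"},
        ],
    )
    print(resp.choices[0].message.content)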
Did this fix it?
If your case was different, email support@runlocalai.co with what you saw and we'll update the page. If it worked but took different commands on your platform, we want to know that too.