CUDA OOM that only happens at long context (KV cache blowup)
Cause
Model loads fine and runs short prompts, then OOMs partway into a long conversation or once the prompt grows past a threshold. This is KV cache pressure: KV memory grows linearly with context length, and depending on the runner the cache is either pre-allocated for the configured max_model_len at startup or grown on demand as the context fills.
Quick check: KV bytes per token = 2 (K and V) × num_layers × num_kv_heads × head_dim × bytes_per_element. Llama 3.1 8B (32 layers, 8 KV heads, head_dim 128) at FP16 works out to ~128 KB per token, i.e. ~128 MB per 1K tokens. At 128K context that's ~16 GB just for KV — more than the Q4 model weights themselves.
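The same arithmetic works for any model. Here is a minimal sketch — the architecture numbers are Llama 3.1 8B's and the token counts are examples, so substitute your own:

# kv_estimate.py — rough KV cache size; plug in your model's architecture numbers
def kv_cache_bytes(num_layers, num_kv_heads, head_dim, num_tokens, bytes_per_elem=2):
    # the leading 2 accounts for storing both K and V at every layer
    return 2 * num_layers * num_kv_heads * head_dim * bytes_per_elem * num_tokens

# Llama 3.1 8B: 32 layers, 8 KV heads, head_dim 128, FP16 KV (2 bytes/element)
print(kv_cache_bytes(32, 8, 128, 131_072) / 2**30)  # ~16 GiB at 128K tokens
print(kv_cache_bytes(32, 8, 128, 16_384) / 2**30)   # ~2 GiB at 16K tokens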
Solution
1. Lower the served context length to something realistic for your VRAM:
# vLLM
vllm serve meta-llama/Llama-3.1-8B-Instruct --max-model-len 16384
# llama.cpp / llama-server
./llama-server -m model.gguf -c 16384
# Ollama (Modelfile)
PARAMETER num_ctx 16384
2. Quantize the KV cache. vLLM, llama.cpp, and SGLang support FP8 or 4-bit KV caches, which roughly halves or quarters cache memory, usually with minimal quality impact (llama.cpp may require flash attention, -fa, for a quantized V cache):
# vLLM
vllm serve ... --kv-cache-dtype fp8
# llama.cpp
./llama-server -m model.gguf -c 32768 --cache-type-k q8_0 --cache-type-v q8_0
3. Pick a model with GQA. Models with grouped-query attention (num_kv_heads << num_attention_heads) have a 4-8× smaller KV cache. Llama 3.1, Qwen 2.5, Mistral 7B, and Mistral Nemo all use GQA; older Llama 2 7B/13B do not.
4. Pre-flight with /will-it-run to compute the max context that fits before you start the server; a rough offline version of that check (and the GQA check from step 3) is sketched after this list.
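A minimal sketch of both checks, assuming a local Hugging Face-style config.json (num_hidden_layers, num_attention_heads, num_key_value_heads, and hidden_size are the standard keys) and an example budget of 20 GiB of VRAM left over for KV after weights and activations — substitute your own figure:

# gqa_and_max_context.py — check for GQA and estimate the largest context that fits
import json

with open("config.json") as f:          # the model's Hugging Face config.json
    cfg = json.load(f)

layers   = cfg["num_hidden_layers"]
heads    = cfg["num_attention_heads"]
kv_heads = cfg.get("num_key_value_heads", heads)    # equal to heads => plain MHA, no GQA
head_dim = cfg.get("head_dim", cfg["hidden_size"] // heads)

print(f"GQA: {kv_heads < heads} ({kv_heads} KV heads vs {heads} attention heads)")

# Largest context that fits a given KV budget, assuming FP16 KV (2 bytes per element)
kv_bytes_per_token = 2 * layers * kv_heads * head_dim * 2
free_vram_for_kv   = 20 * 2**30          # example: 20 GiB free after weights + activations
print(f"max tokens that fit: {free_vram_for_kv // kv_bytes_per_token:,}")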
Alternative solutions
If you must keep long context: rent an H100 80 GB by the hour, run the long job, and terminate. vllm serve --enable-prefix-caching plus sticky sessions (routing a conversation back to the same replica) helps amortize the shared prefix across requests.
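As a sketch of that amortization: with prefix caching enabled, the server can reuse KV blocks for an identical leading prefix across requests over the OpenAI-compatible API. The base URL, model name, and report.txt path below are placeholders for your deployment:

# prefix_reuse.py — amortize a long shared prefix across requests via prefix caching
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="not-needed-locally")

long_document = open("report.txt").read()   # the expensive shared prefix

for question in ["Summarize the findings.", "List the open risks."]:
    resp = client.chat.completions.create(
        model="meta-llama/Llama-3.1-8B-Instruct",
        messages=[
            # Identical system + document prefix on every request -> cached KV blocks are reused
            {"role": "system", "content": "Answer using only the document."},
            {"role": "user", "content": f"{long_document}\n\n{question}"},
        ],
    )
    print(resp.choices[0].message.content)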
Did this fix it?
If your case was different, email support@runlocalai.co with what you saw and we'll update the page. If it worked but took different commands on your platform, we want to know that too.