Out of memory specifically at long context lengths


torch.cuda.OutOfMemoryError or 'cannot allocate KV cache' at >32K tokens

By Fredoline Eruo · Last verified May 6, 2026

Cause

KV cache memory grows linearly with context length, so a model that runs comfortably at 4K context can run out of memory at 32K. Eight times the context means eight times the cache, for example 1 GB growing to 8 GB.

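The math: KV cache bytes = 2 × num_layers × num_kv_heads × head_dim × context × bytes_per_element, where the leading 2 counts the separate key and value tensors. Llama 3.1 8B (32 layers, 8 KV heads, head_dim 128) with an FP16 cache (2 bytes per element) at 32K context needs ~4 GB for the KV cache alone, on top of the weights.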
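The formula can be checked with plain shell arithmetic. This is a minimal sketch, assuming Llama 3.1 8B's published configuration (32 layers, 8 KV heads, head dimension 128) and an FP16 cache at 2 bytes per element. `kv_cache_bytes` is a hypothetical helper name, not part of any tool:

```bash
#!/usr/bin/env bash
# Hypothetical helper: KV cache size in bytes for a given context length.
# Args: num_layers num_kv_heads head_dim context bytes_per_element
kv_cache_bytes() {
  local layers=$1 kv_heads=$2 head_dim=$3 context=$4 bytes_per_elem=$5
  # The leading 2 counts one K tensor and one V tensor per layer.
  echo $(( 2 * layers * kv_heads * head_dim * context * bytes_per_elem ))
}

# Assumed config: Llama 3.1 8B, FP16 cache
for ctx in 4096 32768 131072; do
  bytes=$(kv_cache_bytes 32 8 128 "$ctx" 2)
  echo "context=$ctx kv_cache=$(( bytes / 1024 / 1024 )) MiB"
done
```

Under these assumptions the cache is 512 MiB at 4K tokens, 4096 MiB at 32K, and 16384 MiB at 128K. The figure excludes weights and compute buffers.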

Solution

Quantize the KV cache — biggest single win:

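# llama.cpp: 8-bit (q8_0) KV cache, roughly half the size of FP16
# (the binary is named llama-cli in newer builds; ./main is the older name)
./main --cache-type-k q8_0 --cache-type-v q8_0

# Or 4-bit (q4_0) KV cache: more aggressive, slight quality cost
./main --cache-type-k q4_0 --cache-type-v q4_0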
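To see roughly what each cache type saves, the sketch below compares sizes at 32K context. It assumes Llama 3.1 8B's configuration (32 layers, 8 KV heads, head dimension 128) and ggml's block layouts, where every 32 elements take 64 bytes in f16, 34 bytes in q8_0 (32 int8 values plus one FP16 scale), and 18 bytes in q4_0 (16 bytes of packed 4-bit values plus one FP16 scale). `kv_cache_mib` is a hypothetical helper name:

```bash
#!/usr/bin/env bash
# Hypothetical helper: KV cache size in MiB for a ggml cache type.
# Args: bytes_per_32_element_block context
kv_cache_mib() {
  local block_bytes=$1 context=$2
  # Assumed config: 32 layers, 8 KV heads, head_dim 128; 2 = K and V.
  local elements=$(( 2 * 32 * 8 * 128 * context ))
  echo $(( elements / 32 * block_bytes / 1024 / 1024 ))
}

# name:bytes-per-block pairs (assumed from the ggml block layouts)
for fmt in f16:64 q8_0:34 q4_0:18; do
  echo "${fmt%%:*}: $(kv_cache_mib "${fmt##*:}" 32768) MiB"
done
```

With these assumptions the 32K cache drops from 4096 MiB (f16) to 2176 MiB (q8_0) or 1152 MiB (q4_0). The per-block scale is why q8_0 lands slightly above an exact half.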

Enable Flash Attention if it is not already on (some runners leave it off by default). In llama.cpp, quantizing the V cache has also depended on it:

./main --flash-attn

Use a smaller working context. That a model "supports 128K" doesn't mean you have to allocate all of it; set the context to what the task actually needs.
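One way to pick a working context is to invert the KV cache formula: divide the memory left after loading weights by the per-token cache cost. This is a rough sketch, assuming Llama 3.1 8B's configuration (32 layers, 8 KV heads, head dimension 128), an FP16 cache, and an illustrative 2 GiB budget. `max_context_tokens` is a hypothetical helper name:

```bash
#!/usr/bin/env bash
# Hypothetical helper: largest context whose KV cache fits a memory budget.
# Args: budget_mib num_layers num_kv_heads head_dim bytes_per_element
max_context_tokens() {
  local budget_mib=$1 layers=$2 kv_heads=$3 head_dim=$4 bytes_per_elem=$5
  # Bytes of KV cache added by each token (2 = K and V).
  local per_token=$(( 2 * layers * kv_heads * head_dim * bytes_per_elem ))
  echo $(( budget_mib * 1024 * 1024 / per_token ))
}

# Illustrative budget: 2048 MiB left for the cache after weights
ctx=$(max_context_tokens 2048 32 8 128 2)
echo "largest context: $ctx tokens"
```

Under these assumptions the result is 16384 tokens, which could then be passed to llama.cpp's `-c` / `--ctx-size` option. The estimate ignores compute buffers, so leave some margin below the computed value.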

Move to a model designed for long context efficiency — Mistral Small 3, Llama 4 Scout (10M context with native efficiency), or Qwen 3 with its sliding window mode.

More VRAM is the only real fix for very-long-context workloads. Calculate your specific scenario at /will-it-run — pick a context where the prediction shows reasonable headroom, not the model's maximum.
