Out of memory specifically at long context lengths
Cause
KV cache memory grows linearly with context length. A model that runs comfortably at 4K context can OOM at 32K because the cache is 8× larger: a cache that took 1 GB at 4K takes 8 GB at 32K.
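The math: KV cache bytes = 2 × num_layers × num_kv_heads × head_dim × context × bytes_per_element, where the leading 2 counts the separate K and V tensors. Llama 3.1 8B (32 layers, 8 KV heads, head_dim 128) at 32K context with an FP16 cache (2 bytes per element) needs about 4 GB for the KV cache alone, on top of the weights.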
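The formula can be checked with a few lines of Python. This is a minimal sketch: the function name `kv_cache_bytes` is made up for illustration, the Llama 3.1 8B shape (32 layers, 8 KV heads, head dimension 128) is assumed from the published model configuration, and an FP16 cache (2 bytes per element) is assumed as the default.

```python
def kv_cache_bytes(num_layers, num_kv_heads, head_dim, context, bytes_per_element=2):
    """Estimate KV cache size in bytes for a single sequence.

    The leading 2 accounts for storing both the K and the V tensors.
    """
    return 2 * num_layers * num_kv_heads * head_dim * context * bytes_per_element


# Llama 3.1 8B shape (assumed from the published config), FP16 cache.
for context in (4096, 32768, 131072):
    size = kv_cache_bytes(num_layers=32, num_kv_heads=8, head_dim=128, context=context)
    print(f"{context:>7} tokens: {size / 2**30:.1f} GiB")
```

Under these assumptions the loop reports 0.5 GiB at 4K, 4.0 GiB at 32K, and 16.0 GiB at 128K: an 8× longer context costs 8× the cache.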
Solution
Quantize the KV cache — biggest single win:
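# llama.cpp: 8-bit (q8_0) KV cache, roughly half the memory of the default FP16 cache
# (newer llama.cpp builds name this binary llama-cli instead of main)
./main --cache-type-k q8_0 --cache-type-v q8_0
# Or a 4-bit (q4_0) KV cache: more aggressive, with a slight quality cost
./main --cache-type-k q4_0 --cache-type-v q4_0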
Enable Flash Attention if it is not already on (some runners leave it off by default). In llama.cpp, quantizing the V cache generally requires it:
./main --flash-attn
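Use a smaller working context. A model that "supports 128K" does not require you to allocate all of it; set the context to what your workload actually needs (in llama.cpp, with -c / --ctx-size).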
Move to a model designed for long context efficiency — Mistral Small 3, Llama 4 Scout (10M context with native efficiency), or Qwen 3 with its sliding window mode.
More VRAM is the only real fix for very-long-context workloads. Calculate your specific scenario at /will-it-run — pick a context where the prediction shows reasonable headroom, not the model's maximum.
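To pick a working context with headroom, the KV cache formula can be inverted: given the memory left over after the weights are loaded, solve for the largest context that fits. A rough sketch follows. The function name `max_context`, the 10% headroom default, and the 6 GiB budget are illustrative assumptions, not output from /will-it-run. The bytes-per-element values follow llama.cpp's block layouts (q8_0 stores 32 values in 34 bytes, q4_0 in 18 bytes), and the estimate ignores compute buffers and other runtime overhead.

```python
# Approximate bytes per cached element for llama.cpp cache types.
# q8_0 and q4_0 pack 32 values per block plus a 2-byte FP16 scale.
BYTES_PER_ELEMENT = {
    "f16": 2.0,
    "q8_0": 34 / 32,
    "q4_0": 18 / 32,
}


def max_context(free_bytes, num_layers, num_kv_heads, head_dim,
                cache_type="f16", headroom=0.10):
    """Largest context (in tokens) whose KV cache fits in free_bytes.

    headroom is the fraction of free memory left unused (an assumption;
    tune it for your runner). Compute buffers are not modeled.
    """
    per_token = 2 * num_layers * num_kv_heads * head_dim * BYTES_PER_ELEMENT[cache_type]
    usable = free_bytes * (1.0 - headroom)
    return int(usable // per_token)


# Hypothetical scenario: 6 GiB free after weights, Llama 3.1 8B shape.
free = 6 * 2**30
for cache_type in ("f16", "q8_0", "q4_0"):
    tokens = max_context(free, num_layers=32, num_kv_heads=8, head_dim=128,
                         cache_type=cache_type)
    print(f"{cache_type:>5}: about {tokens} tokens")
```

In this sketch a q8_0 cache fits nearly twice the context of FP16 in the same memory, and q4_0 fits about three and a half times as much, which is why cache quantization is usually tried before shrinking the context.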