Out of memory specifically at long context lengths
Cause
KV cache memory grows linearly with context length. A model that comfortably runs at 4K context can OOM at 32K because the cache went from 1 GB to 8 GB.
The math: KV cache bytes = 2 × num_layers × num_kv_heads × head_dim × context × bytes_per_element. Llama 3.1 8B at 32K context with an FP16 cache = ~4 GB just for KV cache, on top of weights.
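A quick sanity check of that figure, assuming the commonly published Llama 3.1 8B attention dimensions (32 layers, 8 KV heads, head dim 128) and 2 bytes per element for an FP16 cache:
# 2 × num_layers × num_kv_heads × head_dim × context × bytes_per_element
echo $(( 2 * 32 * 8 * 128 * 32768 * 2 ))   # 4294967296 bytes ≈ 4 GiB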
Solution
Quantize the KV cache — biggest single win:
# llama.cpp — INT8 KV cache halves memory
./main --cache-type-k q8_0 --cache-type-v q8_0
# Or INT4 KV (more aggressive, slight quality cost)
./main --cache-type-k q4_0 --cache-type-v q4_0
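A minimal combined invocation as a sketch, with model.gguf as a placeholder path; note that a quantized V cache typically requires Flash Attention to be enabled in llama.cpp, so the flags are combined here:
# Hypothetical combined run: 32K context, INT8 KV cache, Flash Attention on
# (model.gguf is a placeholder; substitute your own model file)
./main -m model.gguf -c 32768 --flash-attn --cache-type-k q8_0 --cache-type-v q8_0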
Enable Flash Attention if it is not already on (some runners leave it off by default):
./main --flash-attn
Use a smaller working context. Just because a model "supports 128K" doesn't mean you have to run it at 128K.
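A sketch of capping the working context, again assuming a placeholder model.gguf; in llama.cpp, -c sets the context size the runner actually allocates KV cache for, regardless of the model's advertised maximum:
# Allocate KV cache for 8K tokens instead of the model's full window
./main -m model.gguf -c 8192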
Move to a model designed for long-context efficiency: Mistral Small 3, Llama 4 Scout (10M context with native efficiency), or Qwen 3 with its sliding window mode.
More VRAM is the only real fix for very-long-context workloads. Calculate your specific scenario at /will-it-run — pick a context where the prediction shows reasonable headroom, not the model's maximum.
Did this fix it?
If your case was different, email hello@runlocalai.co with what you saw and we'll update the page. If it worked but took different commands on your platform, we want to know that too.