RUNLOCALAIv38

Out of memory

Out of memory specifically at long context lengths

torch.cuda.OutOfMemoryError or 'cannot allocate KV cache' at >32K tokens
By Fredoline Eruo · Last verified Jun 12, 2026

Cause

KV cache memory grows linearly with context length. A model that runs comfortably at 4K context can OOM at 32K because an 8x longer context means an 8x larger cache, for example growing from 1 GB to 8 GB.

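The math: KV cache bytes = 2 × num_layers × num_kv_heads × head_dim × context × bytes_per_element. The leading 2 counts the separate K and V tensors. Llama 3.1 8B (32 layers, 8 KV heads, head_dim 128) at 32K context in FP16 (2 bytes per element) needs ~4 GB just for the KV cache, on top of weights.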
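The formula above can be checked with a few lines of Python. This is a minimal sketch, not a full memory predictor: the function name `kv_cache_bytes` is made up for illustration, it assumes batch size 1, and it ignores weights, activations, and runtime buffers.

```python
def kv_cache_bytes(num_layers, num_kv_heads, head_dim, context, bytes_per_element=2):
    """KV cache size in bytes for one sequence (batch size 1 assumed)."""
    # Factor 2: each layer stores one K tensor and one V tensor.
    return 2 * num_layers * num_kv_heads * head_dim * context * bytes_per_element


# Llama 3.1 8B architecture: 32 layers, 8 KV heads (GQA), head_dim 128.
LLAMA_3_1_8B = dict(num_layers=32, num_kv_heads=8, head_dim=128)

for context in (4096, 32768, 131072):
    gib = kv_cache_bytes(context=context, **LLAMA_3_1_8B) / 2**30
    print(f"{context:>7} tokens: {gib:.2f} GiB (FP16)")
```

For Llama 3.1 8B in FP16 this prints 0.50, 4.00, and 16.00 GiB, which shows the linear growth: every 8x increase in context is an 8x increase in cache.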

Solution

Quantize the KV cache — biggest single win:

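# llama.cpp: 8-bit (q8_0) KV cache, roughly half the memory of the default f16
# (recent builds name the binary llama-cli instead of main; model.gguf is a placeholder path)
./main -m model.gguf --cache-type-k q8_0 --cache-type-v q8_0

# Or 4-bit (q4_0) KV cache: more aggressive, slight quality cost
./main -m model.gguf --cache-type-k q4_0 --cache-type-v q4_0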
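To see roughly what each cache type buys, the sketch below repeats the KV cache arithmetic with per-element sizes for the ggml block formats: q8_0 stores 32 values in 34 bytes and q4_0 stores 32 values in 18 bytes, each including one FP16 scale. The function name `kv_cache_gib` and the Llama 3.1 8B defaults are assumptions for illustration, and the same type is assumed for both K and V.

```python
# Bytes per cached element. Quantized ggml types pack 32 values per block
# together with one FP16 scale: q8_0 = 34 bytes/block, q4_0 = 18 bytes/block.
BYTES_PER_ELEMENT = {"f16": 2.0, "q8_0": 34 / 32, "q4_0": 18 / 32}


def kv_cache_gib(cache_type, num_layers=32, num_kv_heads=8, head_dim=128, context=32768):
    """KV cache size in GiB, using the same cache type for K and V (illustrative)."""
    elements = 2 * num_layers * num_kv_heads * head_dim * context
    return elements * BYTES_PER_ELEMENT[cache_type] / 2**30


for cache_type in BYTES_PER_ELEMENT:
    print(f"{cache_type:>5}: {kv_cache_gib(cache_type):.3f} GiB")
```

With the default Llama 3.1 8B shape at 32K context this prints 4.000, 2.125, and 1.125 GiB, so q8_0 comes in slightly above half of FP16 because of the per-block scale.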

Enable Flash Attention if not already on (some runners default it off). In llama.cpp, a quantized V cache also requires it:

./main --flash-attn

Use a smaller working context. A model that "supports 128K" does not oblige you to allocate all of it; in llama.cpp, set the context size with -c (--ctx-size).

Move to a model designed for long context efficiency — Mistral Small 3, Llama 4 Scout (10M context with native efficiency), or Qwen 3 with its sliding window mode.

More VRAM is the only real fix for very-long-context workloads. Calculate your specific scenario at /will-it-run — pick a context where the prediction shows reasonable headroom, not the model's maximum.
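For a rough idea of where "reasonable headroom" lands, the sketch below inverts the KV cache formula to find the largest context that fits a VRAM budget. It is not the method used by /will-it-run: the function name `max_context_tokens`, the 1.5 GiB headroom reserved for activations and runtime buffers, and the 5 GiB weight size in the example are all assumptions chosen for illustration.

```python
def max_context_tokens(vram_gib, weights_gib, headroom_gib=1.5,
                       num_layers=32, num_kv_heads=8, head_dim=128,
                       bytes_per_element=2.0):
    """Largest context whose KV cache fits after weights and headroom are set aside."""
    budget_bytes = (vram_gib - weights_gib - headroom_gib) * 2**30
    if budget_bytes <= 0:
        return 0  # nothing left for the cache at all
    # K and V for a single token across all layers.
    bytes_per_token = 2 * num_layers * num_kv_heads * head_dim * bytes_per_element
    return int(budget_bytes // bytes_per_token)


# Hypothetical scenario: 12 GiB card, ~5 GiB of quantized weights, FP16 cache.
print(max_context_tokens(vram_gib=12, weights_gib=5))
# Same card with a q8_0 cache (34 bytes per 32 elements).
print(max_context_tokens(vram_gib=12, weights_gib=5, bytes_per_element=34 / 32))
```

The defaults describe Llama 3.1 8B (32 layers, 8 KV heads, head_dim 128); swap in your model's values from its config before trusting the result.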

Related errors

  • Ollama: model requires more system memory than is available
  • SGLang: RadixAttention KV cache overflow / out of memory
  • CUDA OOM that only happens at long context (KV cache blowup)
  • vLLM AsyncEngineDeadError after large batch / OOM
  • Process killed (OOM killer) when loading large model

Did this fix it?

If your case was different, contact support by email with what you saw and we'll update the page. If it worked but took different commands on your platform, we want to know that too.