RUNLOCALAI · v38

Independently operated catalog for local-AI hardware and software. Hand-written verdicts. Source-cited claims. Reproducible commands when we have them.

Out of memory

SGLang: RadixAttention KV cache overflow / out of memory

RuntimeError: KV cache pool full (RadixAttention) — increase --mem-fraction-static or reduce --max-running-requests
By Fredoline Eruo · Last verified Jun 12, 2026

Cause

Environment: SGLang production serving — typical on H100/A100 fleets running large models with many concurrent users.

Severity: high. Newly arriving requests fail until the pool frees up.

  • --mem-fraction-static set too low for the model size (default 0.88; large models often need 0.92+), leaving too little room for the KV cache pool once weights are loaded
  • --max-running-requests permits more concurrent requests than the KV cache pool can hold
  • RadixAttention's prefix-sharing pool fills with long shared prefixes that are not evicted
  • Context length × concurrent requests × per-token KV bytes exceeds VRAM minus model weights
  • Cold prefixes that are never reused linger in the pool; they should be evicted, but are not if the pool is misconfigured

Solution

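1. Raise the static memory fraction (the single biggest lever). Leave some headroom: memory outside this fraction is what serving uses for activations and CUDA graph buffers, so pushing the value too high can trade this error for a plain CUDA OOM: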

python -m sglang.launch_server \
  --model meta-llama/Llama-3.1-8B-Instruct \
  --mem-fraction-static 0.92
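
To get a feel for what this flag buys, here is a rough sketch. It assumes the static fraction covers model weights plus the KV cache pool, and that everything outside it is left for activations and CUDA graph buffers. The 80 GB GPU and 16 GB of weights are illustrative assumptions, and the function names are hypothetical helpers, not SGLang APIs.

```python
def kv_pool_gb(total_vram_gb: float, mem_fraction_static: float, weights_gb: float) -> float:
    """Rough KV pool size: the static budget minus model weights, floored at zero."""
    static_budget_gb = total_vram_gb * mem_fraction_static
    return max(static_budget_gb - weights_gb, 0.0)


def headroom_gb(total_vram_gb: float, mem_fraction_static: float) -> float:
    """Memory left outside the static budget (activations, CUDA graphs, fragmentation)."""
    return total_vram_gb * (1.0 - mem_fraction_static)


# Assumed setup: one 80 GB GPU and roughly 16 GB of 16-bit weights for an 8B model.
for fraction in (0.88, 0.92):
    pool = kv_pool_gb(80.0, fraction, 16.0)
    spare = headroom_gb(80.0, fraction)
    print(f"fraction={fraction:.2f} pool={pool:.1f} GB headroom={spare:.1f} GB")
# fraction=0.88 pool=54.4 GB headroom=9.6 GB
# fraction=0.92 pool=57.6 GB headroom=6.4 GB
```

In this simplified model, going from 0.88 to 0.92 adds about 3 GB of cache and removes the same amount of headroom, which is why the value should be raised in small steps.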

2. Cap concurrent requests so the pool isn't oversubscribed:

python -m sglang.launch_server --model ... \
  --max-running-requests 32 \
  --max-total-tokens 65536
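
One way to sanity-check the pair of values above is to divide the token budget by a worst-case per-request footprint. The sketch below is a conservative estimate that ignores prefix sharing. The 2048-token footprint (for example a 1536-token prompt plus 512 generated tokens) is an assumption about the workload, and the helper name is hypothetical.

```python
def max_running_requests(max_total_tokens: int, tokens_per_request: int) -> int:
    """Largest request count whose worst-case footprint still fits the token pool."""
    if tokens_per_request <= 0:
        raise ValueError("tokens_per_request must be positive")
    return max_total_tokens // tokens_per_request


# Assumed workload: each request may hold up to 2048 tokens (prompt + output).
print(max_running_requests(65536, 2048))  # 32
```

If your requests run longer than the assumed footprint, lower --max-running-requests or raise --max-total-tokens accordingly.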

3. Trim model max-context to what your workload actually needs:

python -m sglang.launch_server --model ... \
  --context-length 16384

4. Quantize the KV cache (FP8 halves cache memory relative to FP16/BF16, usually at a minor quality cost):

python -m sglang.launch_server --model ... \
  --kv-cache-dtype fp8_e5m2

5. Disable RadixAttention prefix-sharing if your workload has unique prompts (no benefit and the pool just thrashes):

python -m sglang.launch_server --model ... \
  --disable-radix-cache

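6. Calculate the budget: KV-cache GB per sequence ≈ 2 × layers × kv_heads × head_dim × ctx × bytes_per_value / 1e9, multiplied by the number of concurrent sequences. For Llama 3.1 70B (80 layers, 8 KV heads, head_dim 128) at 16K ctx in FP16, that is about 5.4 GB per sequence, so three concurrent full-length requests already need roughly 16 GB just for cache.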

Related errors

  • Ollama: model requires more system memory than is available
  • CUDA OOM that only happens at long context (KV cache blowup)
  • vLLM AsyncEngineDeadError after large batch / OOM
  • Process killed (OOM killer) when loading large model
  • Out of memory specifically at long context lengths

Did this fix it?

If your case was different, contact support by email with what you saw and we'll update the page. If it worked but took different commands on your platform, we want to know that too.