SGLang: RadixAttention KV cache overflow / out of memory
Cause
Environment: SGLang production serving — typical on H100/A100 fleets running large models with many concurrent users.
Severity: high — newly-arriving requests fail.
- --mem-fraction-static set too low for the model size (default 0.88; large models need 0.92+)
- --max-running-requests permits more concurrent requests than KV cache can hold
- RadixAttention's prefix-sharing pool fills with long shared prefixes that don't evict
- Model context-length × batch × KV-precision exceeds VRAM minus model weights
- Cold prefixes that are never reused stay resident; they should be evicted but are not if the pool is misconfigured
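To judge whether the prefix-related causes apply, it helps to measure how much prefix reuse the workload actually has. The following is a minimal sketch, not SGLang's implementation: it uses a plain dictionary trie instead of a radix tree, ignores eviction, and assumes you have already tokenized a sample of prompts into lists of token IDs. The function name prefix_hit_ratio is hypothetical.

```python
def prefix_hit_ratio(prompts: list[list[int]]) -> float:
    """Fraction of tokens that extend a prefix already seen in an earlier prompt."""
    root: dict = {}
    hit = 0
    total = 0
    for tokens in prompts:
        node = root
        matching = True  # still walking a prefix that an earlier prompt created
        for tok in tokens:
            total += 1
            if matching and tok in node:
                hit += 1
            else:
                matching = False
            # Record this prompt's path so later prompts can match against it.
            node = node.setdefault(tok, {})
    return hit / total if total else 0.0


# Toy sample: two prompts share a 3-token prefix, the third is unrelated.
sample = [[1, 2, 3, 4], [1, 2, 3, 5], [9, 9]]
print(f"prefix hit ratio: {prefix_hit_ratio(sample):.2f}")
```

A ratio near zero on a representative sample suggests the prefix pool brings little benefit for that workload; a high ratio suggests prefix sharing is worth keeping and the pool size is the thing to tune.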
Solution
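1. Raise the static memory fraction to enlarge the KV cache pool (the single biggest lever). If the failure is a CUDA out-of-memory error during prefill or decode rather than a full KV pool, lower it instead to leave headroom for activations: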
python -m sglang.launch_server \
--model meta-llama/Llama-3.1-8B-Instruct \
--mem-fraction-static 0.92
2. Cap concurrent requests so the pool isn't oversubscribed:
python -m sglang.launch_server --model ... \
--max-running-requests 32 \
--max-total-tokens 65536
3. Trim model max-context to what your workload actually needs:
python -m sglang.launch_server --model ... \
--context-length 16384
4. Quantize the KV cache to FP8 (roughly halves KV memory versus FP16, at a minor quality cost):
python -m sglang.launch_server --model ... \
--kv-cache-dtype fp8_e5m2
5. Disable RadixAttention prefix-sharing if your workload has unique prompts (no benefit and the pool just thrashes):
python -m sglang.launch_server --model ... \
--disable-radix-cache
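6. Calculate the budget: KV-cache GB per sequence ≈ 2 × layers × kv_heads × head_dim × ctx × bytes / 1e9, where the factor 2 covers keys and values. For Llama 3.1 70B (80 layers, 8 KV heads, head_dim 128) at 16K ctx in FP16, that is ~5.4 GB of cache per full-length sequence; multiply by the number of concurrent sequences to get the total.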
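The budget formula in step 6 can be turned into a small calculator. This is a rough sketch under stated assumptions: it treats the static memory fraction as covering model weights plus the KV pool, pools all GPUs into one aggregate VRAM figure (with tensor parallelism the fraction really applies per GPU), and uses a hypothetical deployment of 8 x 80 GB GPUs with about 141 GB of 16-bit weights. The function names are illustrative and not part of SGLang.

```python
def kv_bytes_per_token(layers: int, kv_heads: int, head_dim: int,
                       bytes_per_elem: int) -> int:
    """KV-cache bytes for one token; the factor 2 covers keys and values."""
    return 2 * layers * kv_heads * head_dim * bytes_per_elem


def max_full_context_requests(vram_gb: float, mem_fraction_static: float,
                              weights_gb: float, per_token_bytes: int,
                              ctx: int) -> int:
    """Requests that fit if every request fills the whole context window.

    Assumes pool = vram * mem_fraction_static - weights (an approximation).
    """
    pool_bytes = (vram_gb * mem_fraction_static - weights_gb) * 1e9
    if pool_bytes <= 0:
        return 0
    return int(pool_bytes // (per_token_bytes * ctx))


# Llama 3.1 70B shapes: 80 layers, 8 KV heads (GQA), head_dim 128, FP16 cache.
per_token = kv_bytes_per_token(80, 8, 128, bytes_per_elem=2)
per_seq_gb = per_token * 16_384 / 1e9
print(f"{per_token} bytes/token, {per_seq_gb:.2f} GB per 16K-token request")

# Hypothetical deployment (assumed numbers, adjust to your hardware).
fits = max_full_context_requests(640, 0.92, 141, per_token, 16_384)
print(f"about {fits} full-length requests fit in the pool")
```

The worst-case request count gives a conservative starting point for --max-running-requests; real workloads with shorter requests or shared prefixes can usually run more.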