Q: What causes "SGLang: RadixAttention KV cache overflow / out of memory"?

**Environment:** [SGLang](/tools/sglang) production serving, typically on H100/A100 fleets running large models with many concurrent users.

**Severity: high.** Newly arriving requests fail.

- `--mem-fraction-static` set too low for the model size (default 0.88; large models need 0.92+)
- `--max-running-requests` permits more concurrent requests than the KV cache can hold
- RadixAttention's prefix-sharing pool fills with long shared prefixes that don't evict
- Model context length × batch size × KV precision exceeds VRAM minus model weights
- Long-lived cold prefixes that are never reused should evict, but don't if the pool is misconfigured
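The two capacity-related causes (too many running requests, and context × batch × precision exceeding free VRAM) can be checked with a rough back-of-the-envelope calculation. The sketch below is a simplified single-GPU model, not SGLang's actual allocator: it assumes `--mem-fraction-static` covers model weights plus the KV pool, ignores tensor parallelism and prefix sharing, and uses hypothetical helper names. The Llama 3.1 8B figures (32 layers, 8 KV heads, head dim 128, about 16 GB of 16-bit weights) are approximate inputs, so treat the result as an estimate.

```python
def kv_bytes_per_token(layers: int, kv_heads: int, head_dim: int, bytes_per_elem: int) -> int:
    """Bytes of KV cache one token occupies (factor 2 = one key plus one value)."""
    return 2 * layers * kv_heads * head_dim * bytes_per_elem


def pool_capacity_tokens(vram_gb: float, mem_fraction_static: float,
                         weights_gb: float, bytes_per_token: int) -> int:
    """Tokens that fit in the KV pool under the simplified model described above."""
    # Assumption: the static fraction is shared by weights and the KV pool.
    pool_bytes = (vram_gb * mem_fraction_static - weights_gb) * 1e9
    return max(0, int(pool_bytes // bytes_per_token))


def worst_case_demand_tokens(max_running_requests: int, context_length: int) -> int:
    """Upper bound: every running request fills its whole context window."""
    return max_running_requests * context_length


if __name__ == "__main__":
    # Approximate Llama 3.1 8B shape with a 16-bit KV cache on an 80 GB GPU.
    per_token = kv_bytes_per_token(layers=32, kv_heads=8, head_dim=128, bytes_per_elem=2)
    capacity = pool_capacity_tokens(vram_gb=80, mem_fraction_static=0.88,
                                    weights_gb=16, bytes_per_token=per_token)
    demand = worst_case_demand_tokens(max_running_requests=32, context_length=16384)
    print(f"pool capacity: {capacity} tokens, worst-case demand: {demand} tokens")
    print("oversubscribed" if demand > capacity else "fits")
```

With these inputs the worst-case demand (32 requests × 16,384 tokens) exceeds the estimated pool, which is the situation the second and fourth causes describe. Real workloads rarely fill every context window, so this bound is deliberately conservative.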

SGLang: RadixAttention KV cache overflow / out of memory — fix and explanation

Q: How do you fix "SGLang: RadixAttention KV cache overflow / out of memory"?

**1. Raise the static memory fraction** (the single biggest lever when the KV pool is too small):

```bash
python -m sglang.launch_server \
  --model meta-llama/Llama-3.1-8B-Instruct \
  --mem-fraction-static 0.92
```

This enlarges the KV pool but leaves less headroom for activations and CUDA graphs. If the failure is a CUDA out-of-memory error during a forward pass rather than a full KV pool, lower the value instead.

**2. Cap concurrent requests so the pool isn't oversubscribed:**

```bash
python -m sglang.launch_server --model ... \
  --max-running-requests 32 \
  --max-total-tokens 65536
```

**3. Trim model max-context to what your workload actually needs:**

```bash
python -m sglang.launch_server --model ... \
  --context-length 16384
```

**4. Quantize the KV cache** (FP8 roughly halves KV memory versus FP16, at a minor quality cost):

```bash
python -m sglang.launch_server --model ... \
  --kv-cache-dtype fp8_e5m2
```

**5. Disable RadixAttention prefix-sharing** if your workload has unique prompts (no benefit and the pool just thrashes):

```bash
python -m sglang.launch_server --model ... \
  --disable-radix-cache
```

**6. Calculate the budget**: `KV-cache GB ≈ 2 × layers × kv_heads × head_dim × ctx × bytes / 1e9` per sequence, where the leading 2 covers keys and values and `bytes` is the size of one cache element. For Llama 3.1 70B (80 layers, 8 KV heads, head_dim 128) at 16K ctx, FP16: about 5.4 GB per sequence just for cache. Multiply by the number of concurrent full-length sequences to get the total.
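The budget formula in step 6 can be turned into a small calculator, which also shows what the FP8 cache in step 4 buys. This is a minimal sketch: `kv_cache_gb` and `LLAMA_70B` are hypothetical names, and the shape values (80 layers, 8 KV heads, head dim 128) are taken from the published Llama 3.1 70B configuration as an assumption you should verify against your model's `config.json`.

```python
def kv_cache_gb(layers: int, kv_heads: int, head_dim: int, ctx: int,
                bytes_per_elem: int, num_seqs: int = 1) -> float:
    """KV cache size in GB: 2 (K and V) x layers x kv_heads x head_dim x ctx x bytes."""
    return 2 * layers * kv_heads * head_dim * ctx * bytes_per_elem * num_seqs / 1e9


# Assumed Llama 3.1 70B attention shape (grouped-query attention, 8 KV heads).
LLAMA_70B = dict(layers=80, kv_heads=8, head_dim=128)

if __name__ == "__main__":
    fp16 = kv_cache_gb(**LLAMA_70B, ctx=16384, bytes_per_elem=2)
    fp8 = kv_cache_gb(**LLAMA_70B, ctx=16384, bytes_per_elem=1)
    batch = kv_cache_gb(**LLAMA_70B, ctx=16384, bytes_per_elem=2, num_seqs=32)
    print(f"FP16, one 16K sequence: {fp16:.2f} GB")   # 5.37 GB
    print(f"FP8,  one 16K sequence: {fp8:.2f} GB")    # 2.68 GB
    print(f"FP16, 32 x 16K sequences: {batch:.1f} GB")  # 171.8 GB
```

The per-sequence figure looks small, but it scales linearly with concurrency, which is why capping `--max-running-requests` and trimming `--context-length` matter as much as the cache dtype.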
