
vLLM: No available KV cache blocks

RuntimeError: No available KV cache blocks

By Fredoline Eruo · Last verified May 6, 2026

Cause

vLLM pre-allocates KV cache blocks at startup based on gpu_memory_utilization (default 0.9), the fraction of GPU memory it may use for weights, activations, and cache combined. The pool is fixed after that: vLLM doesn't grow it at runtime, so requests with long prompts or long generations can exhaust it.

A common scenario: a 14B model on 24 GB VRAM at 90% utilization leaves only enough KV cache for ~8K combined tokens across all concurrent requests. Two 4K-prompt requests fill that pool, so the third one errors.
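To see where a figure like ~8K tokens can come from, here is a rough sizing sketch. It assumes the usual layout of one key and one value vector per layer per token, FP16 cache entries, and 16-token blocks. The layer count, KV head count, head dimension, and the 1.25 GiB budget are illustrative assumptions, not values read from any particular checkpoint or vLLM log.

```python
def kv_bytes_per_token(num_layers: int, num_kv_heads: int,
                       head_dim: int, dtype_bytes: int = 2) -> int:
    """Bytes of KV cache one token occupies: a key and a value vector per layer."""
    return 2 * num_layers * num_kv_heads * head_dim * dtype_bytes


def pool_capacity_tokens(kv_budget_bytes: int, bytes_per_token: int,
                         block_size: int = 16) -> int:
    """Tokens that fit in the pool, rounded down to whole blocks."""
    num_blocks = kv_budget_bytes // (bytes_per_token * block_size)
    return num_blocks * block_size


# Illustrative shapes only (assumed, not taken from a real model config).
per_token = kv_bytes_per_token(num_layers=40, num_kv_heads=8, head_dim=128)

# Assumed: 1.25 GiB is what remains for the cache after weights and activations.
capacity = pool_capacity_tokens(kv_budget_bytes=int(1.25 * 1024**3),
                                bytes_per_token=per_token)

print(f"{per_token} bytes per token, pool holds {capacity} tokens")
```

Under these assumptions each token costs 160 KiB, so a budget that looks generous in gigabytes is only a few thousand tokens shared by every running request.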

Solution

Lower the model max length so a single request cannot claim as many blocks:

vllm serve qwen2.5-7b --max-model-len 16384
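The effect of the length cap can be sketched as block arithmetic: the fewer blocks one maximum-length request can occupy, the more such requests the fixed pool holds. The pool size of 4096 blocks and the 16-token block size below are assumptions for illustration.

```python
import math


def blocks_for(tokens: int, block_size: int = 16) -> int:
    """Blocks needed to hold a sequence of the given length."""
    return math.ceil(tokens / block_size)


def worst_case_requests(pool_blocks: int, max_model_len: int,
                        block_size: int = 16) -> int:
    """How many requests fit if every one of them grows to max_model_len."""
    return pool_blocks // blocks_for(max_model_len, block_size)


POOL_BLOCKS = 4096  # hypothetical pool: 4096 blocks of 16 tokens = 65,536 tokens

for max_len in (32768, 16384, 8192):
    fits = worst_case_requests(POOL_BLOCKS, max_len)
    print(f"max_model_len={max_len}: {fits} full-length requests fit")
```

Halving the cap doubles the number of worst-case requests the same pool can hold; it does not add memory, it only bounds what each request may take.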

Increase gpu_memory_utilization (more KV cache, less safety margin):

vllm serve qwen2.5-7b --gpu-memory-utilization 0.95

Risk: leaves little room for activation memory spikes or for anything else using the GPU; can OOM on bursty load.
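The trade-off can be made concrete with a rough budget model: the cache gets what is left of vLLM's share after weights and peak activations, and whatever sits outside that share is the safety margin. The 15 GiB weight size and 2 GiB activation peak are hypothetical numbers; measure your own.

```python
def kv_budget_gib(total_gib: float, utilization: float,
                  weights_gib: float, activation_peak_gib: float) -> float:
    """Rough KV cache budget: vLLM's share of the GPU minus weights and peak activations."""
    return max(0.0, total_gib * utilization - weights_gib - activation_peak_gib)


def outside_headroom_gib(total_gib: float, utilization: float) -> float:
    """Memory left outside vLLM's budget for spikes and other processes."""
    return total_gib * (1.0 - utilization)


# Hypothetical numbers for a 24 GiB card.
TOTAL, WEIGHTS, ACTIVATIONS = 24.0, 15.0, 2.0

for util in (0.90, 0.95):
    budget = kv_budget_gib(TOTAL, util, WEIGHTS, ACTIVATIONS)
    headroom = outside_headroom_gib(TOTAL, util)
    print(f"util={util:.2f}  kv_budget={budget:.1f} GiB  headroom={headroom:.1f} GiB")
```

With these inputs, going from 0.90 to 0.95 moves 1.2 GiB from the safety margin into the cache, which is exactly the exchange the risk note describes.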

Add swap_space for CPU offload of cache:

vllm serve qwen2.5-7b --swap-space 8  # 8 GB

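The value is in GiB per GPU. Blocks belonging to preempted requests are moved to system RAM and swapped back in when those requests resume, while running requests keep their blocks in VRAM. Whether swapping is used depends on the vLLM version and its preemption mode. Slight latency hit, more capacity.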

Reduce max_num_seqs to limit concurrency:

vllm serve qwen2.5-7b --max-num-seqs 16
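One way to pick the cap is to divide the pool by the tokens a typical request can grow to, keeping some slack. This is a sizing heuristic, not something vLLM computes for you; the function name, the 10% headroom, and the pool size are all assumptions.

```python
def suggest_max_num_seqs(pool_tokens: int, prompt_tokens: int,
                         max_new_tokens: int, headroom: float = 0.1) -> int:
    """Concurrent sequences that fit if each one reaches prompt + max output length."""
    usable = int(pool_tokens * (1.0 - headroom))   # keep a slice of the pool free
    per_request = prompt_tokens + max_new_tokens   # worst-case tokens per sequence
    return max(1, usable // per_request)


# Hypothetical workload: 65,536-token pool, 4K prompts, up to 512 generated tokens.
cap = suggest_max_num_seqs(pool_tokens=65536, prompt_tokens=4096, max_new_tokens=512)
print(f"suggested --max-num-seqs: {cap}")
```

If the suggested value is below the concurrency you need, the cap alone will not help and one of the other options has to supply more cache.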

Use a smaller model if you genuinely need to serve many concurrent users. A 7B model at Q4 on 24 GB VRAM comfortably serves 32 concurrent 4K-context users; a 14B model at FP16 won't, since its weights alone take about 28 GB.
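The comparison follows from simple weight arithmetic: parameters times bits per parameter. The sketch below ignores quantization metadata and runtime overhead, so real footprints run somewhat higher than these estimates.

```python
def weight_memory_gb(params_billions: float, bits_per_param: float) -> float:
    """Approximate weight size in GB (10^9 bytes), ignoring quantization metadata."""
    return params_billions * bits_per_param / 8


VRAM_GB = 24.0

for name, params, bits in (("7B at Q4", 7, 4), ("14B at FP16", 14, 16)):
    weights = weight_memory_gb(params, bits)
    left = VRAM_GB - weights
    print(f"{name}: weights ~{weights:.1f} GB, "
          f"left for KV cache and activations: {left:.1f} GB")
```

A negative remainder means the model does not fit at that precision at all, before any cache is allocated.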
