vLLM: No available KV cache blocks
Cause
vLLM pre-allocates KV cache blocks at startup based on gpu_memory_utilization (default 0.9). The pool is fixed for the life of the process: long prompts or many concurrent requests can exhaust it, and vLLM does not grow it at runtime.
A common scenario: a 14B model on 24 GB of VRAM at 90% utilization can leave only enough KV cache for ~8K tokens combined across all concurrent requests. Two 4K-prompt requests fill that pool, and the third triggers this error.
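To see where a figure like ~8K tokens can come from, here is a rough back-of-the-envelope sketch. All inputs are assumptions chosen for illustration: the model shape (48 layers, 8 KV heads, head dimension 128, FP16 cache), the 18 GiB weight footprint, and the 2 GiB of activation and runtime overhead are hypothetical, and the function names are not part of vLLM. The startup log of your own deployment reports the real number of cache blocks.

```python
GIB = 1024 ** 3


def kv_bytes_per_token(num_layers, num_kv_heads, head_dim, dtype_bytes=2):
    """Bytes of KV cache that one token occupies across all layers."""
    # Factor 2: every layer stores one key and one value vector per KV head.
    return 2 * num_layers * num_kv_heads * head_dim * dtype_bytes


def kv_pool_tokens(total_vram, gpu_memory_utilization, weight_bytes,
                   overhead_bytes, per_token_bytes):
    """Tokens that fit in the budget left after weights and overhead."""
    budget = total_vram * gpu_memory_utilization
    free_for_kv = budget - weight_bytes - overhead_bytes
    return max(0, int(free_for_kv // per_token_bytes))


# Hypothetical 14B-class model on a 24 GiB card (all values are assumptions).
per_token = kv_bytes_per_token(num_layers=48, num_kv_heads=8, head_dim=128)
pool_tokens = kv_pool_tokens(
    total_vram=24 * GIB,
    gpu_memory_utilization=0.9,
    weight_bytes=18 * GIB,
    overhead_bytes=2 * GIB,
    per_token_bytes=per_token,
)
prompts_that_fit = pool_tokens // 4096  # whole 4K prompts the pool can hold

print(per_token, pool_tokens, prompts_that_fit)
```

With these assumed inputs the pool holds a little under 9K tokens, which is only enough for two 4K prompts at once.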
Solution
Lower the maximum model length so that a single request cannot claim as much of the cache:
vllm serve qwen2.5-7b --max-model-len 16384
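The effect of a lower maximum length can be sketched in terms of blocks. This example assumes a block size of 16 tokens and a hypothetical pool of 4096 blocks; neither number is taken from this page, and the helper names are illustrative rather than vLLM functions.

```python
def blocks_needed(num_tokens, block_size=16):
    """Cache blocks required to hold num_tokens (rounded up to whole blocks)."""
    return -(-num_tokens // block_size)


def full_length_requests(pool_blocks, max_model_len, block_size=16):
    """How many requests at the maximum length the pool can hold at once."""
    return pool_blocks // blocks_needed(max_model_len, block_size)


POOL_BLOCKS = 4096  # hypothetical pool size: 65,536 tokens at 16 tokens per block

for max_model_len in (32768, 16384, 8192):
    print(
        max_model_len,
        blocks_needed(max_model_len),
        full_length_requests(POOL_BLOCKS, max_model_len),
    )
```

Halving the maximum length halves the worst-case block demand per request, so twice as many worst-case requests fit in the same pool.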
Increase gpu_memory_utilization (more KV cache, less safety margin):
vllm serve qwen2.5-7b --gpu-memory-utilization 0.95
Risk: leaves less headroom for activation memory spikes and can cause out-of-memory errors under bursty load.
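Add swap_space so preempted requests can offload their cache to CPU memory:
vllm serve qwen2.5-7b --swap-space 8 # 8 GiB of CPU swap space per GPU
When the scheduler preempts a request, its KV blocks can be swapped out to system RAM instead of being discarded, then swapped back in when the request resumes. This costs some latency and does not enlarge the GPU pool itself. Swap behavior varies between vLLM versions, so check the documentation for your release.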
Reduce max_num_seqs to limit concurrency:
vllm serve qwen2.5-7b --max-num-seqs 16
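One way to pick a value for max_num_seqs is to divide the cache pool by the number of tokens you expect each request to hold. The sketch below uses a hypothetical pool of 65,536 tokens and an assumed 4096-token prompt plus 512 generated tokens per request; the function names are illustrative and not part of vLLM.

```python
def worst_case_tokens(max_num_seqs, tokens_per_seq):
    """Cache tokens needed if every running sequence reaches tokens_per_seq."""
    return max_num_seqs * tokens_per_seq


def largest_safe_max_num_seqs(pool_tokens, tokens_per_seq):
    """Largest concurrency whose worst case still fits in the pool."""
    return pool_tokens // tokens_per_seq


POOL_TOKENS = 65536          # hypothetical KV cache capacity in tokens
TOKENS_PER_SEQ = 4096 + 512  # assumed prompt length plus generated tokens

safe_seqs = largest_safe_max_num_seqs(POOL_TOKENS, TOKENS_PER_SEQ)
demand_at_16 = worst_case_tokens(16, TOKENS_PER_SEQ)

print(safe_seqs, demand_at_16, demand_at_16 <= POOL_TOKENS)
```

If the worst-case demand at your chosen concurrency exceeds the pool, either lower max_num_seqs further or accept that some requests will wait or be preempted under load.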
Use a smaller model if you genuinely need to serve many concurrent users. A 7B model at Q4 on 24 GB of VRAM can serve 32 concurrent 4K-context users; a 14B model at FP16 cannot, because its weights alone need about 28 GB.
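The comparison can be checked with rough arithmetic. This sketch assumes a 7B model with 28 layers, 4 KV heads, head dimension 128 and an FP16 KV cache, and it ignores quantization scales, activation memory, and runtime overhead, so treat the result as an optimistic estimate. The helper names are hypothetical.

```python
GB = 10 ** 9


def weight_bytes(num_params, bits_per_param):
    """Approximate size of the model weights in bytes."""
    return num_params * bits_per_param // 8


def kv_bytes(num_tokens, num_layers, num_kv_heads, head_dim, dtype_bytes=2):
    """KV cache bytes for num_tokens (keys and values, all layers)."""
    return num_tokens * 2 * num_layers * num_kv_heads * head_dim * dtype_bytes


budget = 24 * GB * 9 // 10  # 24 GB card at gpu_memory_utilization 0.9

# 7B at 4 bits per parameter, 32 users with 4096 tokens of context each.
weights_7b = weight_bytes(7 * 10 ** 9, 4)
cache_7b = kv_bytes(32 * 4096, num_layers=28, num_kv_heads=4, head_dim=128)
fits_7b = weights_7b + cache_7b <= budget

# 14B at 16 bits per parameter: check the weights alone.
weights_14b = weight_bytes(14 * 10 ** 9, 16)
fits_14b = weights_14b <= budget

print(weights_7b / GB, cache_7b / GB, fits_7b)
print(weights_14b / GB, fits_14b)
```

Under these assumptions the 7B configuration uses roughly half of the budget, while the 14B FP16 weights exceed it before any cache is allocated.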