How to optimize vLLM for throughput
vLLM deployed with a model, benchmarking tool
What this does
Adjusts batching, parallelism, and memory settings to maximize tokens per second for a given model and GPU configuration. The goal is to keep all GPU compute units active by feeding a full batch of requests continuously.
Steps
Increase max concurrent sequences. The
--max-num-batched-tokensparameter sets how many tokens across all active sequences can be processed in one forward pass.vllm serve <model> \ --max-num-batched-tokens 8192 \ --gpu-memory-utilization 0.85Expected output: larger batches processed per iteration.
Enable prefix caching for repeated contexts. When requests share a common system prompt, caching avoids recomputing attention.
vllm serve <model> \ --enable-chunked-prefill \ --max-num-batched-tokens 4096Expected output: subsequent requests with identical prefixes show reduced prefill latency.
Set appropriate tensor-parallel degree.
vllm serve <model> \ --tensor-parallel-size 2 \ --num-speculative-decoder-samples 4Expected output: throughput scales linearly up to the bottleneck.
Tune block size for KV cache. Smaller blocks increase memory utilization but raise CPU overhead.
vllm serve <model> \ --block-size 32 \ --gpu-memory-utilization 0.88Expected output: fewer CPU overhead events per batch.
Benchmark with a synthetic load.
ab -n 1000 -c 32 -p requests.json -T application/json \ http://localhost:8000/v1/chat/completionsExpected output: requests-per-second figure; compare across configuration changes.
Verification
curl -s http://localhost:8000/metrics | grep vllm:num_generate_tokens_total
# Expected: an increasing counter confirming tokens are being generated continuously
Common failures
- GPU utilization spikes then drops to zero — Memory overflow triggers OOM eviction. Lower
--gpu-memory-utilizationto 0.75. - Throughput lower than single-sequence baseline — Over-parallel configuration. Reduce
--tensor-parallel-sizeto 1. - Chunked prefill causing output stalls — Increase
--max-num-batched-tokensto at least 2x the longest prefill length. - Low tokens/second despite high GPU utilization — The bottleneck is network I/O. Move the client closer to the server.
- Speculative decoding degradation — Draft model quality poor for the target domain. Disable with
--num-speculative-decoder-samples 0.