What this does

Adjusts batching, parallelism, and memory settings to maximize tokens per second for a given model and GPU configuration. The goal is to keep all GPU compute units active by feeding a full batch of requests continuously.

Steps

Increase max concurrent sequences. The --max-num-batched-tokens parameter sets how many tokens across all active sequences can be processed in one forward pass.
```
vllm serve <model> \
  --max-num-batched-tokens 8192 \
  --gpu-memory-utilization 0.85
```
Expected output: larger batches processed per iteration.
Enable prefix caching for repeated contexts. When requests share a common system prompt, caching avoids recomputing attention.
```
vllm serve <model> \
  --enable-chunked-prefill \
  --max-num-batched-tokens 4096
```
Expected output: subsequent requests with identical prefixes show reduced prefill latency.
Set appropriate tensor-parallel degree.
```
vllm serve <model> \
  --tensor-parallel-size 2 \
  --num-speculative-decoder-samples 4
```
Expected output: throughput scales linearly up to the bottleneck.
Tune block size for KV cache. Smaller blocks increase memory utilization but raise CPU overhead.
```
vllm serve <model> \
  --block-size 32 \
  --gpu-memory-utilization 0.88
```
Expected output: fewer CPU overhead events per batch.
Benchmark with a synthetic load.
```
ab -n 1000 -c 32 -p requests.json -T application/json \
  http://localhost:8000/v1/chat/completions
```
Expected output: requests-per-second figure; compare across configuration changes.

Verification

curl -s http://localhost:8000/metrics | grep vllm:num_generate_tokens_total
# Expected: an increasing counter confirming tokens are being generated continuously

Common failures

GPU utilization spikes then drops to zero — Memory overflow triggers OOM eviction. Lower --gpu-memory-utilization to 0.75.
Throughput lower than single-sequence baseline — Over-parallel configuration. Reduce --tensor-parallel-size to 1.
Chunked prefill causing output stalls — Increase --max-num-batched-tokens to at least 2x the longest prefill length.
Low tokens/second despite high GPU utilization — The bottleneck is network I/O. Move the client closer to the server.
Speculative decoding degradation — Draft model quality poor for the target domain. Disable with --num-speculative-decoder-samples 0.

How to optimize vLLM for throughput

What this does

Steps

Verification

Common failures

Related guides