How to optimize batch inference throughput
vLLM with batch inference running
What this does
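Batch inference throughput depends on prompt lengths, batch size, and scheduling strategy. This guide tunes vLLM's continuous batching through chunked prefill, the concurrent sequence limit (max-num-seqs), tensor parallelism, and prefix caching, and uses a small benchmark harness to compare configurations.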
Steps
Enable chunked prefill to reduce peak memory.

    python -m vllm.entrypoints.openai.api_server \
        --model meta-llama/Llama-3.2-3B \
        --enable-chunked-prefill \
        --max-num-batched-tokens 8192

Chunked prefill processes long prompts in segments, preventing VRAM spikes.
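To make the segmenting idea concrete, here is a minimal sketch in plain Python. It is not vLLM's scheduler: the helper name split_prefill is hypothetical, and the sketch assumes the whole per-step token budget goes to one prompt, whereas a real scheduler shares that budget with decode tokens from other sequences.

```python
def split_prefill(prompt_tokens: int, token_budget: int) -> list[int]:
    """Split a prompt into prefill chunks of at most `token_budget` tokens.

    Hypothetical helper for illustration only; not part of vLLM.
    """
    if prompt_tokens < 0 or token_budget <= 0:
        raise ValueError("prompt_tokens must be >= 0 and token_budget > 0")
    chunks = []
    remaining = prompt_tokens
    while remaining > 0:
        step = min(remaining, token_budget)  # never exceed the per-step budget
        chunks.append(step)
        remaining -= step
    return chunks


# A 20,000-token prompt with an 8192-token budget takes three prefill steps.
print(split_prefill(20_000, 8192))  # [8192, 8192, 3616]
```

The largest chunk, not the full prompt length, bounds the work done in a single step, which is why a smaller budget lowers peak memory at the cost of more steps.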
Tune max-num-seqs for your hardware. This controls how many sequences are batched together.

    python -m vllm.entrypoints.openai.api_server \
        --model meta-llama/Llama-3.2-3B \
        --max-num-seqs 256 \
        --gpu-memory-utilization 0.90

Apply tensor parallelism (multi-GPU).

    python -m vllm.entrypoints.openai.api_server \
        --model meta-llama/Llama-3.1-70B \
        --tensor-parallel-size 4 \
        --pipeline-parallel-size 1

Use a benchmark harness to iterate configurations.

    import json
    import subprocess
    import time

    NUM_REQUESTS = 100  # benchmark requests sent per configuration

    configs = [
        {"enable-chunked-prefill": True, "max-num-batched-tokens": 4096},
        {"enable-chunked-prefill": True, "max-num-batched-tokens": 8192},
        # Baseline: chunked prefill off, token budget left at the server default
        {"enable-chunked-prefill": False},
    ]

    for cfg in configs:
        # Start server with cfg, send NUM_REQUESTS benchmark prompts, record throughput
        start = time.perf_counter()
        # ... run benchmark ...
        elapsed = time.perf_counter() - start
        print(f"{cfg}: {NUM_REQUESTS / elapsed:.1f} req/s")

Enable prefix caching for repeated prompt prefixes.

    --enable-prefix-caching

This caches KV for shared prefixes across requests, saving compute on system prompts.
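The effect of prefix caching can be illustrated with a toy model in plain Python. This is only a sketch under stated assumptions: the block size of 4, the function name count_cached_blocks, and the use of whole-prefix tuples as cache keys are all illustrative choices, and vLLM's actual implementation differs in its details.

```python
BLOCK_SIZE = 4  # assumed block size for this toy example


def count_cached_blocks(requests: list[list[int]]) -> list[int]:
    """For each request (a list of token ids), count the full blocks already cached.

    Hypothetical helper for illustration only; not part of vLLM.
    """
    cache: set[tuple[int, ...]] = set()
    hits_per_request = []
    for tokens in requests:
        hits = 0
        for end in range(BLOCK_SIZE, len(tokens) + 1, BLOCK_SIZE):
            key = tuple(tokens[:end])  # a block is identified by its entire prefix
            if key in cache:
                hits += 1
            else:
                cache.add(key)
        hits_per_request.append(hits)
    return hits_per_request


system_prompt = [1, 2, 3, 4, 5, 6, 7, 8]
request_a = system_prompt + [10, 11, 12, 13]
request_b = system_prompt + [20, 21, 22, 23]        # same prefix: reuses 2 blocks
request_c = [9] + system_prompt + [20, 21, 22]      # shifted prefix: reuses none

print(count_cached_blocks([request_a, request_b, request_c]))  # [0, 2, 0]
```

The third request shows why the shared text has to sit at the very start of the prompt: a single extra leading token changes every block key.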
Verification
# Expected: 2-4x throughput improvement with chunked prefill and optimal batch settings
# Monitor GPU utilization
nvidia-smi --query-gpu=utilization.gpu --format=csv,noheader -l 1
# Target: GPU util > 85%
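The utilization target can also be checked programmatically. The sketch below assumes each line of nvidia-smi output looks like "87 %" (the csv,noheader format without nounits); the function names parse_utilization and meets_target are hypothetical, and the sample readings are made up for the example rather than measured.

```python
def parse_utilization(lines: list[str]) -> list[int]:
    """Parse lines such as '87 %' into integer percentages, skipping blank lines."""
    values = []
    for line in lines:
        text = line.strip().rstrip("%").strip()
        if text:
            values.append(int(text))
    return values


def meets_target(samples: list[int], target: int = 85) -> bool:
    """Return True if the mean utilization is strictly above `target` percent."""
    if not samples:
        return False
    return sum(samples) / len(samples) > target


# Made-up sample lines in the assumed nvidia-smi output format.
sample_output = ["91 %", "88 %", "79 %", "93 %"]
samples = parse_utilization(sample_output)
print(samples, meets_target(samples))  # [91, 88, 79, 93] True
```

Averaging over the run, rather than reading a single sample, avoids being misled by brief dips between batches.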
Common failures
- Chunked prefill slows short prompts: It helps long prompts but adds overhead for short ones. Test both.
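- Tensor parallelism overhead for small models: TP adds inter-GPU communication at every layer, so it generally pays off only for models too large to serve comfortably on one GPU (roughly > 13B). For smaller models, use a single GPU.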
- Prefix caching ineffective: Requires identical prompt prefixes. Structure prompts with a shared system prompt at the start.
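The tensor-parallelism point above can be sanity-checked with a back-of-envelope estimate of weight memory per GPU. This is a rough sketch only: the function name weight_gib_per_gpu is hypothetical, 16-bit weights (2 bytes per parameter) and an even split across TP ranks are assumptions, and KV cache, activations, and runtime overhead are ignored.

```python
def weight_gib_per_gpu(params_billion: float, bytes_per_param: int = 2, tp_size: int = 1) -> float:
    """Approximate weight memory per GPU in GiB, assuming an even split across TP ranks.

    Hypothetical helper for illustration only; ignores KV cache and activations.
    """
    if tp_size < 1:
        raise ValueError("tp_size must be >= 1")
    total_bytes = params_billion * 1e9 * bytes_per_param
    return total_bytes / tp_size / 2**30


# A 3B model fits on one GPU with room to spare; a 70B model needs sharding.
print(f"3B,  TP=1: {weight_gib_per_gpu(3):.1f} GiB")              # about 5.6 GiB
print(f"70B, TP=1: {weight_gib_per_gpu(70):.1f} GiB")             # about 130.4 GiB
print(f"70B, TP=4: {weight_gib_per_gpu(70, tp_size=4):.1f} GiB")  # about 32.6 GiB
```

When the single-GPU figure already leaves ample room for the KV cache, sharding buys little and the communication cost dominates.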
Operator checkpoint
Before treating this as solved, write down the local runtime, model or package version, hardware/backend if relevant, and the verification output. This keeps the guide useful as a Will-It-Run style decision instead of a one-off command transcript.
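One way to capture that record is a small script that gathers what can be detected automatically and leaves the rest as fields to fill in. This is a sketch, not a required format: the function names and dictionary keys are hypothetical, and the hardware and verification fields are placeholders to be replaced with your own observations.

```python
import json
import platform
import sys
from importlib import metadata


def package_version(name: str) -> str:
    """Return the installed version of `name`, or 'not installed' if it is absent."""
    try:
        return metadata.version(name)
    except metadata.PackageNotFoundError:
        return "not installed"


def build_checkpoint(model: str, hardware: str, verification_output: str) -> dict:
    """Assemble a checkpoint record; the key names are illustrative, not a standard."""
    return {
        "python": sys.version.split()[0],
        "platform": platform.platform(),
        "vllm": package_version("vllm"),
        "model": model,
        "hardware": hardware,
        "verification_output": verification_output,
    }


record = build_checkpoint(
    model="meta-llama/Llama-3.2-3B",
    hardware="fill in: GPU model and count",
    verification_output="fill in: measured req/s and GPU utilization",
)
print(json.dumps(record, indent=2))
```

Saving the printed JSON next to the benchmark results makes it possible to compare runs across machines and package versions later.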