How to optimize batch inference throughput

What this does

Batch inference throughput depends on prompt lengths, batch size, and scheduling strategy. This guide walks through the vLLM server settings that affect it: chunked prefill, continuous batching limits (max-num-seqs), tensor parallelism, and prefix caching.

Steps

Enable chunked prefill to reduce peak memory.

python -m vllm.entrypoints.openai.api_server \
    --model meta-llama/Llama-3.2-3B \
    --enable-chunked-prefill \
    --max-num-batched-tokens 8192

Chunked prefill splits the prefill of a long prompt into segments that fit within --max-num-batched-tokens, so a single long prompt cannot cause a VRAM spike.
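
To make the segmenting concrete, here is a minimal sketch of how a prompt could be split under a per-step token budget. The function name `chunk_prefill` is hypothetical and this is not vLLM's scheduler; a real scheduler also has to share the same budget with decode steps from other requests.

```python
def chunk_prefill(prompt_len: int, token_budget: int) -> list[int]:
    """Split a prompt of prompt_len tokens into chunks of at most token_budget tokens.

    Illustrative only: models the per-step token cap, nothing else.
    """
    if prompt_len < 0 or token_budget <= 0:
        raise ValueError("prompt_len must be >= 0 and token_budget must be > 0")
    chunks = []
    remaining = prompt_len
    while remaining > 0:
        step = min(remaining, token_budget)
        chunks.append(step)
        remaining -= step
    return chunks


# A 20,000-token prompt under an 8192-token budget is prefilled in three steps.
print(chunk_prefill(20_000, 8192))  # [8192, 8192, 3616]
```

A smaller budget means more steps per long prompt but a lower peak per step, which is the trade-off to test when choosing --max-num-batched-tokens.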

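Tune max-num-seqs for your hardware. This caps how many sequences the scheduler runs in one batch; raising it helps throughput only while the KV cache for all of those sequences still fits in the memory allowed by --gpu-memory-utilization.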

python -m vllm.entrypoints.openai.api_server \
    --model meta-llama/Llama-3.2-3B \
    --max-num-seqs 256 \
    --gpu-memory-utilization 0.90
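
A rough way to pick a starting value is to estimate how many sequences fit in the KV cache. The sketch below uses the standard per-token KV size (keys plus values, per layer, per KV head). The layer and head counts, the 10 GiB budget, and the average sequence length are example assumptions; read the real values from your model's config.json and your own memory measurements.

```python
def kv_bytes_per_token(num_layers: int, num_kv_heads: int, head_dim: int,
                       dtype_bytes: int = 2) -> int:
    """Bytes of KV cache one token occupies across all layers."""
    # Factor of 2: one key vector and one value vector per layer and KV head.
    return 2 * num_layers * num_kv_heads * head_dim * dtype_bytes


def max_concurrent_seqs(kv_budget_bytes: int, avg_seq_len: int,
                        per_token_bytes: int) -> int:
    """How many sequences of avg_seq_len tokens fit in the KV budget."""
    return kv_budget_bytes // (avg_seq_len * per_token_bytes)


# Example values only (assumed, not read from a real checkpoint).
per_token = kv_bytes_per_token(num_layers=28, num_kv_heads=8, head_dim=128)
budget = 10 * 1024**3  # assume 10 GiB is left for KV cache after weights

print(per_token)  # 114688
print(max_concurrent_seqs(budget, avg_seq_len=2048, per_token_bytes=per_token))  # 45
```

If the estimate is far below the --max-num-seqs you pass, the extra slots will mostly sit unused or cause preemption, so treat the estimate as an upper bound to test around.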

Apply tensor parallelism (multi-GPU).

python -m vllm.entrypoints.openai.api_server \
    --model meta-llama/Llama-3.1-70B \
    --tensor-parallel-size 4 \
    --pipeline-parallel-size 1
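
Before choosing a tensor-parallel size, it helps to check two things: how much weight memory lands on each GPU, and whether the attention heads split evenly across GPUs (tensor-parallel implementations generally require this). The sketch below is back-of-the-envelope arithmetic; the helper names are hypothetical, and the parameter count and head count are example assumptions to replace with your model's values.

```python
def weight_gib_per_gpu(num_params: float, tensor_parallel_size: int,
                       dtype_bytes: int = 2) -> float:
    """Approximate weight memory per GPU in GiB, ignoring activations and KV cache."""
    if tensor_parallel_size < 1:
        raise ValueError("tensor_parallel_size must be >= 1")
    return num_params * dtype_bytes / tensor_parallel_size / 1024**3


def heads_divide_evenly(num_attention_heads: int, tensor_parallel_size: int) -> bool:
    """True if every GPU would get the same number of attention heads."""
    return num_attention_heads % tensor_parallel_size == 0


# Example: a 70B-parameter model in 16-bit weights across 4 GPUs (assumed values).
print(f"{weight_gib_per_gpu(70e9, 4):.1f} GiB per GPU")  # 32.6 GiB per GPU
print(heads_divide_evenly(64, 4))  # True
```

Whatever is left on each GPU after weights is what the KV cache can use, so a larger tensor-parallel size also raises the batch sizes that fit.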

Use a benchmark harness to iterate configurations.

import time

NUM_REQUESTS = 100  # prompts sent per configuration

configs = [
    {"enable-chunked-prefill": True, "max-num-batched-tokens": 4096},
    {"enable-chunked-prefill": True, "max-num-batched-tokens": 8192},
    # Baseline: chunked prefill off; None means "do not pass the flag"
    {"enable-chunked-prefill": False, "max-num-batched-tokens": None},
]
for cfg in configs:
    # Start the server with cfg and wait until it is ready (not timed)
    # ... start server ...
    start = time.perf_counter()
    # Send NUM_REQUESTS benchmark prompts and wait for every response
    # ... run benchmark ...
    elapsed = time.perf_counter() - start
    print(f"{cfg}: {NUM_REQUESTS / elapsed:.1f} req/s")
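
One piece the harness needs is a way to turn each config dict into a server command line. The sketch below is one possible mapping, with a hypothetical helper name `build_server_args`: `True` becomes a bare flag, `False` and `None` are omitted, and anything else becomes a flag with a value. Omitting a flag leaves the server default in place, which may differ between vLLM versions, so check the defaults for the version you run.

```python
import shlex


def build_server_args(model: str, cfg: dict) -> list[str]:
    """Build an argv list for the OpenAI-compatible server from a config dict."""
    args = ["python", "-m", "vllm.entrypoints.openai.api_server", "--model", model]
    for key, value in cfg.items():
        if value is None or value is False:
            continue  # leave the server default in place
        args.append(f"--{key}")
        if value is not True:
            args.append(str(value))  # flags with a value, e.g. a token budget
    return args


cmd = build_server_args(
    "meta-llama/Llama-3.2-3B",
    {"enable-chunked-prefill": True, "max-num-batched-tokens": 4096},
)
print(shlex.join(cmd))
```

The resulting list can be passed directly to `subprocess.Popen`, which avoids shell quoting problems when a value contains spaces.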

Enable prefix caching for repeated prompt prefixes.
```
--enable-prefix-caching
```
This caches KV for shared prefixes across requests, saving compute on system prompts.
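
To check whether a workload can benefit, measure how much of each prompt is shared from the very first token. The sketch below uses whitespace-separated words as a stand-in for model tokens, which is an assumption made to keep the example dependency-free; use your model's tokenizer for real measurements.

```python
def shared_prefix_len(a: list[str], b: list[str]) -> int:
    """Number of leading tokens that are identical in both sequences."""
    count = 0
    for x, y in zip(a, b):
        if x != y:
            break
        count += 1
    return count


system = "You are a support assistant. Answer briefly."

# System prompt first: the whole system prompt is a shared prefix.
p1 = f"{system} User: reset my password".split()
p2 = f"{system} User: cancel my order".split()
# System prompt last: nothing is shared from the start of the prompt.
p3 = f"User: cancel my order {system}".split()

print(shared_prefix_len(p1, p2))  # 8
print(shared_prefix_len(p1, p3))  # 0
```

The second result shows why prompt layout matters: the same text placed after the user message contributes nothing to a prefix cache.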

Verification

# Expected: 2-4x throughput improvement over a baseline run without chunked prefill and tuned batch settings (workload dependent)
# Monitor GPU utilization
nvidia-smi --query-gpu=utilization.gpu --format=csv,noheader -l 1
# Target: GPU util > 85%
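
To compare samples against the 85% target without reading them by eye, the output of the command above can be summarized with a short script. This is a sketch that assumes each line looks like `92 %`, which is how nvidia-smi prints utilization with `--format=csv,noheader`; the function names are hypothetical.

```python
def parse_utilization(lines: list[str]) -> list[int]:
    """Extract integer utilization percentages from lines such as '92 %'."""
    values = []
    for line in lines:
        text = line.strip().rstrip("%").strip()
        if text.isdigit():
            values.append(int(text))  # skip blank or malformed lines
    return values


def fraction_above(values: list[int], threshold: int = 85) -> float:
    """Share of samples strictly above the threshold."""
    if not values:
        return 0.0
    return sum(v > threshold for v in values) / len(values)


sample = ["92 %", "88 %", "61 %", "90 %"]
vals = parse_utilization(sample)
print(vals, fraction_above(vals))  # [92, 88, 61, 90] 0.75
```

In practice, redirect the nvidia-smi output to a file during the benchmark and pass the file's lines to `parse_utilization` afterwards.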

Common failures

Chunked prefill slows short prompts: It helps long prompts but adds overhead for short ones. Test both.
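Tensor parallelism overhead for small models: TP adds inter-GPU communication to every layer, so it mainly pays off when the model does not fit comfortably on one GPU (as a rule of thumb, models > 13B). For smaller models, use a single GPU.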
Prefix caching ineffective: Requires identical prompt prefixes. Structure prompts with a shared system prompt at the start.

Operator checkpoint

Before treating this as solved, write down the local runtime, model or package version, hardware/backend if relevant, and the verification output. This keeps the guide useful as a Will-It-Run style decision instead of a one-off command transcript.
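
The checkpoint can be captured as a small JSON record so it is easy to compare between machines. The sketch below collects what Python can see on its own; the function names and the fields for hardware and verification output are hypothetical placeholders to fill in by hand.

```python
import json
import platform
import sys
from importlib import metadata


def package_version(name: str) -> str:
    """Installed version of a package, or 'not installed' if it is missing."""
    try:
        return metadata.version(name)
    except metadata.PackageNotFoundError:
        return "not installed"


def checkpoint_record(model: str, hardware: str, verification_output: str) -> dict:
    """Bundle runtime, package version, hardware, and verification output."""
    return {
        "python": sys.version.split()[0],
        "platform": platform.platform(),
        "vllm": package_version("vllm"),
        "model": model,
        "hardware": hardware,
        "verification_output": verification_output,
    }


record = checkpoint_record(
    "meta-llama/Llama-3.2-3B",
    "fill in: GPU model and count",
    "fill in: req/s and GPU utilization samples",
)
print(json.dumps(record, indent=2))
```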
