How to configure batch size limits to prevent memory overflow
PREREQUISITES
vLLM or similar batch inference setup
What this does
Setting appropriate batch size limits prevents out-of-memory (OOM) errors during batch inference. This guide provides formulas and tests to determine safe batch sizes for your hardware.
Steps
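Calculate baseline per-request memory. A 7B model quantized to 4 bits (Q4) needs about 3.5 GB for weights, which are loaded once and shared by every request in the batch. The KV cache is the per-request cost: the conservative rule of thumb used here budgets about 3.5 MB per token, or roughly 14 GB for a full 4096-token context.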
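model_params_b = 7        # parameters, in billions
quantization_bits = 4     # Q4 weights
ctx_length = 4096         # tokens of context per request
# Weights are loaded once and shared by every request in the batch.
weight_mem_gb = model_params_b * quantization_bits / 8
# Rule-of-thumb KV cache cost per token, in MB (key + value).
kv_cache_mb_per_token = model_params_b * 2 / 8 * 2
# KV cache for one request at full context length.
kv_cache_gb = kv_cache_mb_per_token * ctx_length / 1024
# Footprint of a single request: weights plus one full-length KV cache.
total_per_request = weight_mem_gb + kv_cache_gb
print(f"Per request: ~{total_per_request:.1f} GB")
Calculate max batch size for your VRAM.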
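vram_gb = 24       # RTX 4090
overhead_gb = 2    # Reserve for system
# Weights are resident only once, so subtract them from the budget and
# divide what is left by the KV cache each request needs.
max_batch = int((vram_gb - overhead_gb - weight_mem_gb) // kv_cache_gb)
print(f"Max batch size: {max_batch}")
Set the limit in vLLM configuration. The value 8 here is an example; use the number you calculated.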
python -m vllm.entrypoints.openai.api_server \
  --model meta-llama/Llama-3.2-7B \
  --max-num-seqs 8 \
  --gpu-memory-utilization 0.85
Test with a gradually increasing workload.
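# Each iteration starts a fresh Python process, so the model is reloaded
# for every batch size. The loop stops at the first size that fails.
for batch in 1 2 4 8 16 32; do
  echo "Testing batch size $batch..."
  python -c "
from vllm import LLM
llm = LLM(model='meta-llama/Llama-3.2-7B', max_num_seqs=$batch)
prompts = ['Test prompt'] * $batch
outputs = llm.generate(prompts)
print('Success with batch $batch')
" || { echo "Failed at batch size $batch"; break; }
done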
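The rule-of-thumb estimate in the first two steps can be cross-checked against the model's architecture. The sketch below is a rough estimate, not vLLM's own accounting. It assumes the KV cache stores one key and one value vector per layer for every token. The layer count, KV head count, head dimension, and 16-bit cache type are illustrative values for a Llama-2-style 7B model, so replace them with the numbers from your model's config. The function names are hypothetical.

```python
def kv_bytes_per_token(num_layers: int, num_kv_heads: int, head_dim: int,
                       dtype_bytes: int = 2) -> int:
    """Bytes of KV cache one token occupies: key + value for every layer."""
    return 2 * num_layers * num_kv_heads * head_dim * dtype_bytes


def max_concurrent_requests(vram_gb: float, overhead_gb: float,
                            weight_gb: float, bytes_per_token: int,
                            ctx_length: int) -> int:
    """Requests that fit when each one fills its whole context window."""
    budget_bytes = (vram_gb - overhead_gb - weight_gb) * 1024**3
    per_request_bytes = bytes_per_token * ctx_length
    return max(int(budget_bytes // per_request_bytes), 0)


# Assumed architecture values (Llama-2-style 7B, 16-bit KV cache).
per_token = kv_bytes_per_token(num_layers=32, num_kv_heads=32, head_dim=128)
limit = max_concurrent_requests(vram_gb=24, overhead_gb=2, weight_gb=3.5,
                                bytes_per_token=per_token, ctx_length=4096)
print(f"KV cache per token: {per_token / 1024**2:.2f} MiB")
print(f"Estimated max concurrent requests: {limit}")
```

Under these assumed values the architecture-based estimate comes out higher than the rule of thumb, which is consistent with treating the formula as a safe starting point. Neither estimate counts activation memory or fragmentation, so confirm the result with the incremental test.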
Verification
nvidia-smi --query-gpu=memory.used --format=csv,noheader
# Expected: memory usage plateaus below VRAM capacity as batch size increases
# OOM occurs at the first batch size above your limit — note the value
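To log the reading while a test run is in progress, the command's output can be parsed in a small script. This is a minimal sketch: the helper names `parse_memory_csv`, `read_gpu_memory`, and `headroom_fraction` are hypothetical. It assumes `nvidia-smi` is on the PATH and that adding `memory.total` and `nounits` to the query prints one `used, total` pair in MiB per GPU. The sample string is made up for illustration and is not a real measurement.

```python
import subprocess


def parse_memory_csv(text: str) -> list[tuple[int, int]]:
    """Parse 'used, total' MiB pairs, one line per GPU."""
    readings = []
    for line in text.strip().splitlines():
        used, total = (int(field.strip()) for field in line.split(","))
        readings.append((used, total))
    return readings


def read_gpu_memory() -> list[tuple[int, int]]:
    """Query nvidia-smi once; requires an NVIDIA driver to be installed."""
    result = subprocess.run(
        ["nvidia-smi", "--query-gpu=memory.used,memory.total",
         "--format=csv,noheader,nounits"],
        capture_output=True, text=True, check=True,
    )
    return parse_memory_csv(result.stdout)


def headroom_fraction(used_mib: int, total_mib: int) -> float:
    """Fraction of VRAM still free on one GPU."""
    return 1 - used_mib / total_mib


# Made-up sample output so the parsing logic can be tried without a GPU.
sample = "20480, 24564\n"
for index, (used, total) in enumerate(parse_memory_csv(sample)):
    free = headroom_fraction(used, total)
    print(f"GPU {index}: {used}/{total} MiB used, {free:.0%} free")
```

On a machine with a GPU, replace `parse_memory_csv(sample)` with `read_gpu_memory()` and call it between batch sizes to record where usage levels off.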
Common failures
- Conservative estimate wastes capacity: The formula gives a safe starting point. Increase the limit incrementally while monitoring nvidia-smi.
- OOM at a smaller batch than calculated: Other processes are consuming VRAM. Close browsers and IDEs, or use nvidia-smi to find competing processes.
- Batch size is not the only factor: Long individual prompts consume more KV cache than short ones. Account for this with --max-model-len.
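The effect of prompt length can be made concrete by budgeting in tokens instead of requests. The following is a rough sketch under stated assumptions: `kv_tokens_available` and `admit_requests` are hypothetical helpers, and the 18.5 GB budget and 0.5 MiB per token are assumed values for illustration. Real schedulers such as vLLM's allocate cache in blocks, so actual capacity will differ.

```python
def kv_tokens_available(kv_budget_gb: float, kv_mib_per_token: float) -> int:
    """Tokens of KV cache that fit in the memory left after weights."""
    return int(kv_budget_gb * 1024 / kv_mib_per_token)


def admit_requests(prompt_lengths: list[int], max_new_tokens: int,
                   token_budget: int) -> int:
    """Count how many requests fit, admitting them in order.

    Each request reserves its prompt length plus its generation allowance.
    """
    used = 0
    admitted = 0
    for length in prompt_lengths:
        needed = length + max_new_tokens
        if used + needed > token_budget:
            break
        used += needed
        admitted += 1
    return admitted


# Assumed: 18.5 GB left for KV cache, 0.5 MiB of cache per token.
budget = kv_tokens_available(kv_budget_gb=18.5, kv_mib_per_token=0.5)
short = admit_requests([128] * 64, max_new_tokens=256, token_budget=budget)
long = admit_requests([3840] * 64, max_new_tokens=256, token_budget=budget)
print(f"Token budget: {budget}")
print(f"Short prompts admitted: {short}, long prompts admitted: {long}")
```

With the same memory budget, far fewer long prompts fit than short ones, which is why a batch limit tuned on short test prompts can still fail under realistic traffic.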
Related guides