What this does

Setting appropriate batch size limits prevents out-of-memory (OOM) errors during batch inference. This guide provides formulas and tests to determine safe batch sizes for your hardware.

Steps

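Calculate baseline memory. Weights are loaded once and shared by every request: a 7B model at 4-bit quantization needs about 3.5 GB. Each request then adds KV cache, which costs 2 (key + value) × layers × KV heads × head dimension × bytes per element for every token. For a Llama-2-7B-style architecture (32 layers, 32 KV heads, head dimension 128) with an FP16 cache, that is 0.5 MB per token, or 2 GB for a full 4096-token context. Read the actual values from your model's config.json; models that use grouped-query attention have fewer KV heads and a smaller cache.

model_params_b = 7       # parameters, in billions
quantization_bits = 4    # weight precision
ctx_length = 4096        # tokens of KV cache reserved per request
# Assumed architecture (Llama-2-7B-style); take the real values from config.json
n_layers, n_kv_heads, head_dim = 32, 32, 128
kv_bytes = 2             # FP16 KV cache
weight_mem_gb = model_params_b * quantization_bits / 8  # loaded once, shared by all requests
kv_cache_mb_per_token = 2 * n_layers * n_kv_heads * head_dim * kv_bytes / 1024**2  # key + value
kv_cache_gb = kv_cache_mb_per_token * ctx_length / 1024  # per request at full context
total_per_request = weight_mem_gb + kv_cache_gb  # footprint at batch size 1
print(f"Weights: ~{weight_mem_gb:.1f} GB, KV cache per request: ~{kv_cache_gb:.1f} GB")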

Calculate max batch size for your VRAM.

vram_gb = 24  # RTX 4090
overhead_gb = 2  # Reserve for system
# Weights are loaded once, so subtract them from the budget; only the KV cache scales with batch size
max_batch = int((vram_gb - overhead_gb - weight_mem_gb) // kv_cache_gb)
print(f"Max batch size: {max_batch}")
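The two calculations above can be folded into one helper for comparing cards. This is a minimal sketch, not a definitive sizing tool: the function name `max_batch_size` and the list of VRAM sizes are illustrative, and the 3.5 GB weight and 2 GB per-request KV cache figures assume a 4-bit 7B model with a Llama-2-7B-style architecture, an FP16 KV cache, and a 4096-token context.

```python
def max_batch_size(vram_gb, weight_gb, kv_gb_per_request, overhead_gb=2.0):
    """Return how many requests fit once weights and overhead are subtracted."""
    free_gb = vram_gb - overhead_gb - weight_gb
    if free_gb <= 0 or kv_gb_per_request <= 0:
        return 0  # weights alone do not fit, or the inputs are invalid
    return int(free_gb // kv_gb_per_request)


# Assumed figures: 3.5 GB of 4-bit weights, 2 GB of KV cache per 4096-token request
for vram_gb in (8, 16, 24, 48):
    print(f"{vram_gb} GB VRAM -> max batch {max_batch_size(vram_gb, 3.5, 2.0)}")
```

Returning 0 when the weights do not fit keeps the result usable as a guard before launching a server.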

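Set the limit in vLLM configuration. --max-num-seqs caps how many sequences are processed concurrently, so keep it at or below the calculated maximum, and set --max-model-len to the context length used in the calculation. Replace the model ID with the checkpoint you actually sized: the numbers above assume 4-bit weights, while an FP16 7B checkpoint needs roughly 14 GB for weights and leaves room for far fewer sequences.

python -m vllm.entrypoints.openai.api_server \
    --model meta-llama/Llama-2-7b-hf \
    --max-num-seqs 8 \
    --max-model-len 4096 \
    --gpu-memory-utilization 0.85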
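If the batch limit is computed in Python, the launch command can be assembled from the same values so the two never drift apart. The sketch below is one way to do that, under stated assumptions: `build_server_command` is a hypothetical helper, and `your-org/your-7b-model` is a placeholder model ID, not a real checkpoint. It only builds and prints the argument list; it does not start a server.

```python
import shlex


def build_server_command(model, max_num_seqs, max_model_len, gpu_memory_utilization=0.85):
    """Assemble the api_server command line from the sizing results."""
    if max_num_seqs < 1:
        raise ValueError("max_num_seqs must be at least 1")
    return [
        "python", "-m", "vllm.entrypoints.openai.api_server",
        "--model", model,
        "--max-num-seqs", str(max_num_seqs),
        "--max-model-len", str(max_model_len),
        "--gpu-memory-utilization", str(gpu_memory_utilization),
    ]


# Placeholder model ID; substitute the checkpoint you sized
cmd = build_server_command("your-org/your-7b-model", max_num_seqs=8, max_model_len=4096)
print(shlex.join(cmd))
```

The returned list can be passed directly to `subprocess.run` when you are ready to launch.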

Test with a gradually increasing workload.

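for batch in 1 2 4 8 16 32; do
    echo "Testing batch size $batch..."
    python -c "
from vllm import LLM, SamplingParams
llm = LLM(model='meta-llama/Llama-2-7b-hf', max_num_seqs=$batch, max_model_len=4096)
prompts = ['Test prompt'] * $batch
# Generate enough tokens per request that the KV cache actually fills
outputs = llm.generate(prompts, SamplingParams(max_tokens=512))
print('Success with batch $batch')
" || { echo "Failed at batch size $batch"; break; }
done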

Verification

nvidia-smi --query-gpu=memory.used --format=csv,noheader
# Expected: memory usage plateaus below VRAM capacity as batch size increases,
# because vLLM preallocates its KV cache pool at startup
# If a run fails with OOM, the largest batch size that succeeded is your practical limit; note the value
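To log memory use alongside the batch test instead of reading it by eye, the query above can be wrapped in a small script. This is a sketch under assumptions: the helper names `parse_memory_used` and `read_memory_used` are hypothetical, and the sample value is made up for illustration. The parser is kept separate from the subprocess call so it can be checked on a machine without a GPU.

```python
import subprocess


def parse_memory_used(csv_text):
    """Parse memory.used rows such as '20480 MiB' into MiB integers, one per GPU."""
    values = []
    for line in csv_text.strip().splitlines():
        line = line.strip()
        if line:
            values.append(int(line.split()[0]))
    return values


def read_memory_used():
    """Run the nvidia-smi query and return MiB used per GPU (requires an NVIDIA driver)."""
    result = subprocess.run(
        ["nvidia-smi", "--query-gpu=memory.used", "--format=csv,noheader"],
        capture_output=True, text=True, check=True,
    )
    return parse_memory_used(result.stdout)


# Parsing a captured sample line; the number is illustrative only
print(parse_memory_used("20480 MiB\n"))
```

Calling `read_memory_used()` before and after each batch size gives a per-GPU record to compare against the plateau described above.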

Common failures

Conservative estimate wastes capacity: The formula assumes every request fills its full context, so it gives a safe starting point rather than a ceiling. Raise the limit in small steps while monitoring nvidia-smi.
OOM at a smaller batch than calculated: Other processes are consuming VRAM. Run nvidia-smi to list competing processes, and close browsers, IDEs, or other GPU applications.
Batch size is not the only factor: KV cache grows with sequence length, so long prompts and long generations use more memory than short ones. Size for the --max-model-len you actually serve.
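The last point can be made concrete by treating the KV cache as a pool of tokens shared by all requests. The sketch below assumes 0.5 MB of KV cache per token (a Llama-2-7B-style architecture with an FP16 cache) and an 18.5 GB pool (24 GB of VRAM minus 2 GB of overhead and 3.5 GB of 4-bit weights); the function name `max_concurrent_requests` is illustrative.

```python
def max_concurrent_requests(kv_pool_gb, kv_mb_per_token, tokens_per_request):
    """Requests that fit if each one holds tokens_per_request tokens of KV cache."""
    pool_tokens = int(kv_pool_gb * 1024 / kv_mb_per_token)
    return pool_tokens // tokens_per_request


KV_POOL_GB = 18.5       # assumed: 24 GB VRAM - 2 GB overhead - 3.5 GB weights
KV_MB_PER_TOKEN = 0.5   # assumed: 32 layers, 32 KV heads, head dim 128, FP16

for tokens in (512, 2048, 4096, 8192):
    n = max_concurrent_requests(KV_POOL_GB, KV_MB_PER_TOKEN, tokens)
    print(f"{tokens:>5} tokens per request -> {n} concurrent requests")
```

Under these assumptions, doubling the sequence length roughly halves the number of requests that fit, which is why a batch limit tuned on short prompts can fail on long ones.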

How to configure batch size limits to prevent memory overflow
