RUNLOCALAI · v38

Independently operated catalog for local-AI hardware and software. Hand-written verdicts. Source-cited claims. Reproducible commands when we have them.

OP·Fredoline Eruo
© 2026 runlocalai.co · Independently operated
HOW-TO · INF

How to configure batch size limits to prevent memory overflow

Intermediate · 10 min · By Fredoline Eruo
PREREQUISITES

vLLM or similar batch inference setup

What this does

Setting appropriate batch size limits prevents out-of-memory (OOM) errors during batch inference. This guide provides formulas and tests to determine safe batch sizes for your hardware.

Steps

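  1. Calculate baseline memory. A 7B model in Q4 needs roughly 3.5 to 4 GB for weights, which are loaded once and shared by every request. Each request then adds its own KV cache: about 0.5 MB per token for a Llama-2-7B-style model with an FP16 cache (32 layers, 32 KV heads, head dimension 128). Models that use grouped-query attention need less.

    model_params_b = 7
    quantization_bits = 4
    ctx_length = 4096
    weight_mem_gb = model_params_b * quantization_bits / 8  # ~3.5 GB, shared across requests
    # KV cache per token = 2 (key + value) * layers * KV heads * head dim * bytes per value
    n_layers, n_kv_heads, head_dim, kv_bytes = 32, 32, 128, 2  # Llama-2-7B-style, FP16 cache
    kv_cache_mb_per_token = 2 * n_layers * n_kv_heads * head_dim * kv_bytes / 1024**2  # 0.5 MB
    kv_cache_gb = kv_cache_mb_per_token * ctx_length / 1024  # 2 GB at full context
    total_per_request = weight_mem_gb + kv_cache_gb  # a single request, weights included
    print(f"Weights: ~{weight_mem_gb:.1f} GB, KV cache per request: ~{kv_cache_gb:.1f} GB")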
    
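  2. Calculate max batch size for your VRAM. Weights occupy VRAM once, so subtract them from the budget and divide what remains by the KV cache that one request needs.

    vram_gb = 24  # RTX 4090
    overhead_gb = 2  # Reserve for CUDA context, activations, and the desktop
    kv_per_request_gb = total_per_request - weight_mem_gb
    max_batch = int((vram_gb - overhead_gb - weight_mem_gb) // kv_per_request_gb)
    print(f"Max batch size: {max_batch}")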
    
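  3. Set the limit in vLLM configuration. The model ID below is an unquantized 7B checkpoint used as a placeholder; point --model at the quantized checkpoint you actually serve so the weight estimate from step 1 applies.

    python -m vllm.entrypoints.openai.api_server \
        --model meta-llama/Llama-2-7b-hf \
        --max-num-seqs 8 \
        --max-model-len 4096 \
        --gpu-memory-utilization 0.85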
    
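  4. Test with a gradually increasing workload. Each run starts a fresh engine, so a failure at startup (while vLLM profiles memory) or during generation marks the first batch size that does not fit. Short prompts exercise very little KV cache, so repeat the test with prompts close to your real length.

    for batch in 1 2 4 8 16 32; do
        echo "Testing batch size $batch..."
        python -c "
    from vllm import LLM
    llm = LLM(model='meta-llama/Llama-2-7b-hf', max_num_seqs=$batch, gpu_memory_utilization=0.85)
    prompts = ['Test prompt'] * $batch
    outputs = llm.generate(prompts)
    print('Success with batch $batch')
    " || { echo "Failed at batch size $batch"; break; }
    done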
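
The arithmetic in steps 1 and 2 can be wrapped in a small helper so it is easy to rerun for other models. This is a minimal sketch, not a vLLM API: `ModelShape`, `kv_bytes_per_token`, and `max_batch_size` are hypothetical names, and the architecture values are assumptions you would read from your model's `config.json`. It assumes an FP16 KV cache and that every sequence grows to the full context length, which is the worst case.

```python
from dataclasses import dataclass


@dataclass
class ModelShape:
    """Architecture values (assumed; read them from the model's config.json)."""
    n_layers: int
    n_kv_heads: int
    head_dim: int
    kv_bytes: int = 2  # bytes per cached value; 2 for an FP16/BF16 cache


def kv_bytes_per_token(shape: ModelShape) -> int:
    # One key vector and one value vector per layer and per KV head.
    return 2 * shape.n_layers * shape.n_kv_heads * shape.head_dim * shape.kv_bytes


def max_batch_size(vram_gb: float, overhead_gb: float, weight_gb: float,
                   shape: ModelShape, ctx_length: int) -> int:
    """Number of full-length sequences whose KV cache fits beside the weights."""
    gib = 1024 ** 3
    free_bytes = (vram_gb - overhead_gb - weight_gb) * gib
    per_request = kv_bytes_per_token(shape) * ctx_length
    return max(0, int(free_bytes // per_request))


# Llama-2-7B-style shape: full multi-head attention, 32 KV heads.
mha_7b = ModelShape(n_layers=32, n_kv_heads=32, head_dim=128)
# Hypothetical grouped-query variant with 8 KV heads, for comparison.
gqa_variant = ModelShape(n_layers=32, n_kv_heads=8, head_dim=128)

if __name__ == "__main__":
    for name, shape in [("MHA", mha_7b), ("GQA", gqa_variant)]:
        limit = max_batch_size(24, 2, 3.5, shape, ctx_length=4096)
        print(f"{name}: {kv_bytes_per_token(shape)} bytes/token, max batch {limit}")
```

Treat the result as a starting value for `--max-num-seqs`. Real workloads rarely fill every sequence to the full context, so the practical limit is often higher.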
    

Verification

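nvidia-smi --query-gpu=memory.used,memory.total --format=csv,noheader
# vLLM reserves its memory budget at startup, so expect usage near gpu-memory-utilization x VRAM at every batch size
# With engines that allocate on demand, usage should plateau below VRAM capacity as batch size increases
# Note the first batch size that fails with OOM; your limit is the largest size below it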
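
To log memory during a test run rather than checking by hand, the same query can be polled from Python. This is a rough sketch under stated assumptions: `parse_memory_line`, `gpu_memory_mib`, and `headroom_ok` are hypothetical helpers, the 5% free-memory threshold is an arbitrary illustration, and `gpu_memory_mib` needs `nvidia-smi` on the PATH. It adds the `nounits` format option so the values parse as plain integers in MiB.

```python
import subprocess


def parse_memory_line(line: str) -> tuple[int, int]:
    """Parse one 'used, total' CSV line such as '20480, 24564' (values in MiB)."""
    used, total = (int(field.strip()) for field in line.split(","))
    return used, total


def gpu_memory_mib() -> list[tuple[int, int]]:
    """Return (used, total) in MiB for each visible GPU."""
    result = subprocess.run(
        ["nvidia-smi", "--query-gpu=memory.used,memory.total",
         "--format=csv,noheader,nounits"],
        capture_output=True, text=True, check=True,
    )
    return [parse_memory_line(line)
            for line in result.stdout.splitlines() if line.strip()]


def headroom_ok(used_mib: int, total_mib: int,
                min_free_fraction: float = 0.05) -> bool:
    """True when at least min_free_fraction of VRAM is still free."""
    return (total_mib - used_mib) / total_mib >= min_free_fraction
```

Calling `gpu_memory_mib()` before the engine starts is also a quick way to spot VRAM already held by other processes.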

Common failures

  • Conservative estimate wastes capacity: The formula gives a safe starting point. Increase incrementally while monitoring nvidia-smi.
  • OOM at smaller batch than calculated: Other processes consume VRAM. Close browsers, IDEs, or use nvidia-smi to check competing processes.
  • Batch size is not the only factor: KV cache grows with token count, so long prompts consume more than short ones. Cap sequence length with --max-model-len and size the batch for that worst case.
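
The effect of prompt length can be estimated by counting cached tokens instead of requests. The sketch below is illustrative only: `kv_tokens_needed` and `fits` are hypothetical helpers, and the default of 0.5 MB per token is an assumption for a Llama-2-7B-style model with an FP16 cache, so substitute the figure for your own model.

```python
def kv_tokens_needed(prompt_lens, max_new_tokens, max_model_len):
    """Worst-case cached tokens for a batch.

    Each sequence holds its prompt plus generated tokens, capped at max_model_len.
    """
    return sum(min(p + max_new_tokens, max_model_len) for p in prompt_lens)


def fits(prompt_lens, max_new_tokens, max_model_len,
         kv_budget_gb, mb_per_token=0.5):
    """True when the batch's worst-case KV cache fits in kv_budget_gb."""
    budget_tokens = int(kv_budget_gb * 1024 / mb_per_token)
    needed = kv_tokens_needed(prompt_lens, max_new_tokens, max_model_len)
    return needed <= budget_tokens


if __name__ == "__main__":
    short_batch = [100] * 8    # eight short prompts
    long_batch = [3900] * 8    # eight prompts near the context limit
    for name, batch in [("short", short_batch), ("long", long_batch)]:
        ok = fits(batch, max_new_tokens=256, max_model_len=4096, kv_budget_gb=10)
        print(f"{name}: {kv_tokens_needed(batch, 256, 4096)} tokens, fits={ok}")
```

With the same batch size of 8 and a 10 GB KV budget, the short prompts fit and the long ones do not, which is why a limit tuned on short test prompts can still fail in production.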

Related guides

  • How to run batch inference for processing multiple prompts
  • How to optimize batch inference throughput