RUNLOCALAI · v38

Independently operated catalog for local-AI hardware and software. Hand-written verdicts. Source-cited claims. Reproducible commands when we have them.

OP·Fredoline Eruo
© 2026 runlocalai.co · Independently operated
HOW-TO · INF

How to optimize batch inference throughput

Advanced · 20 min · By Fredoline Eruo
PREREQUISITES

A working vLLM installation that already serves batch inference

What this does

Batch inference throughput depends on prompt lengths, batch size, and scheduling strategy. vLLM already batches continuously by default; this guide tunes chunked prefill, batch-size limits, tensor parallelism, and prefix caching on top of that, and benchmarks each change.

Steps

  1. Enable chunked prefill to reduce peak memory.

    python -m vllm.entrypoints.openai.api_server \
        --model meta-llama/Llama-3.2-3B \
        --enable-chunked-prefill \
        --max-num-batched-tokens 8192
    

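    Chunked prefill splits the prefill of a long prompt into chunks that fit within the --max-num-batched-tokens budget and schedules them alongside decode steps. This caps the memory needed per step, which limits VRAM spikes from long prompts.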

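  2. Tune max-num-seqs for your hardware. It caps how many sequences the scheduler processes in a single iteration, while --gpu-memory-utilization sets the fraction of GPU memory vLLM may claim. Raise max-num-seqs until throughput stops improving or the KV cache runs out of space.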

    python -m vllm.entrypoints.openai.api_server \
        --model meta-llama/Llama-3.2-3B \
        --max-num-seqs 256 \
        --gpu-memory-utilization 0.90
    
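  3. Apply tensor parallelism to shard a large model across GPUs. Set --tensor-parallel-size to the number of GPUs on the node (4 in this example).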

    python -m vllm.entrypoints.openai.api_server \
        --model meta-llama/Llama-3.1-70B \
        --tensor-parallel-size 4 \
        --pipeline-parallel-size 1
    
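  4. Use a benchmark harness to iterate over configurations.

    import shlex, time

    NUM_REQUESTS = 100  # prompts sent per configuration
    BASE_CMD = "python -m vllm.entrypoints.openai.api_server --model meta-llama/Llama-3.2-3B"
    configs = [
        {"enable-chunked-prefill": True, "max-num-batched-tokens": 4096},
        {"enable-chunked-prefill": True, "max-num-batched-tokens": 8192},
        {"enable-chunked-prefill": False},  # flag omitted; check your vLLM version's default
    ]

    def to_args(cfg):
        # True -> bare flag, False -> omitted, other values -> "--key value"
        args = []
        for key, value in cfg.items():
            if value is not False:
                args += [f"--{key}"] if value is True else [f"--{key}", str(value)]
        return args

    for cfg in configs:
        print("launch:", BASE_CMD, shlex.join(to_args(cfg)))
        # Start the server with that command, then send NUM_REQUESTS prompts
        start = time.perf_counter()
        # ... run benchmark ...
        elapsed = time.perf_counter() - start
        print(f"{cfg}: {NUM_REQUESTS / elapsed:.1f} req/s")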
    
  5. Enable prefix caching for repeated prompt prefixes.

    --enable-prefix-caching
    

    This reuses the KV cache for prefixes shared across requests, so a common system prompt does not have to be recomputed for every request.
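
Prefix caching in step 5 only helps when requests really do start with the same text. A minimal sketch of building prompts with the shared part first and checking how much they have in common. `SYSTEM_PROMPT`, `build_prompt`, and `shared_prefix_chars` are hypothetical names, not part of vLLM, and the check counts characters rather than tokens, so treat the ratio as a rough proxy.

```python
import os

# Hypothetical system prompt shared by every request in the batch.
SYSTEM_PROMPT = "You are a concise assistant. Answer in one sentence.\n\n"


def build_prompt(user_text: str) -> str:
    """Put the shared text first so every prompt starts identically."""
    return SYSTEM_PROMPT + "User: " + user_text


def shared_prefix_chars(prompts: list[str]) -> int:
    """Length in characters of the longest prefix common to all prompts."""
    return len(os.path.commonprefix(prompts))


prompts = [build_prompt(q) for q in ("What is vLLM?", "Define KV cache.")]
common = shared_prefix_chars(prompts)
shortest = min(len(p) for p in prompts)
print(f"shared prefix: {common} chars, {common / shortest:.0%} of the shortest prompt")
```

If the shared prefix comes out near zero, per-request content is probably being placed ahead of the system prompt.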

Verification

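# Expected: roughly 2-4x higher throughput than the untuned baseline with chunked prefill and tuned batch settings; the gain depends on prompt lengths and hardware
# Monitor GPU utilization, sampled once per second
nvidia-smi --query-gpu=utilization.gpu --format=csv,noheader -l 1
# Target: sustained GPU utilization above 85% while the benchmark runs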
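
The utilization readings can also be collected and averaged in a script instead of read by eye. A minimal sketch, assuming `nvidia-smi` is on the PATH and prints one line per GPU such as `93 %`. The function names and the `TARGET_UTIL` constant are illustrative and not part of any tool.

```python
import subprocess

TARGET_UTIL = 85.0  # percent, the target stated in the verification notes


def parse_utilization(csv_text: str) -> list[float]:
    """Parse lines such as '93 %' from nvidia-smi CSV output into numbers."""
    values = []
    for line in csv_text.splitlines():
        field = line.strip().rstrip("%").strip()
        try:
            values.append(float(field))
        except ValueError:
            continue  # skip blank lines and non-numeric readings
    return values


def sample_gpu_utilization() -> list[float]:
    """Take one utilization reading per GPU (requires nvidia-smi)."""
    result = subprocess.run(
        ["nvidia-smi", "--query-gpu=utilization.gpu", "--format=csv,noheader"],
        capture_output=True, text=True, check=True,
    )
    return parse_utilization(result.stdout)


def meets_target(samples: list[float], target: float = TARGET_UTIL) -> bool:
    """True when the mean of the samples exceeds the target."""
    return bool(samples) and sum(samples) / len(samples) > target


# Demonstration on canned output, so this runs without a GPU.
demo = parse_utilization("93 %\n88 %\n")
print(demo, "meets target:", meets_target(demo))
```

During a real run, call `sample_gpu_utilization()` in a loop while the benchmark is sending requests and pass the accumulated readings to `meets_target`.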

Common failures

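  • Chunked prefill slows short prompts: It helps long prompts but adds scheduling overhead for short ones. Benchmark both with your own prompt mix.
  • Tensor parallelism overhead for small models: TP adds inter-GPU communication in every layer, so it mainly pays off for models that do not fit comfortably on one GPU (as a rule of thumb, above 13B parameters). For smaller models, use a single GPU.
  • Prefix caching ineffective: Cache hits require identical prompt prefixes. Put the shared system prompt at the start and the per-request content after it.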
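
The first failure above calls for testing short and long prompts separately. A minimal sketch of a throughput measurement that could stand in for the elided benchmark in step 4. `send_request` is a hypothetical callable supplied by you (for a real run it would post one prompt to the server and wait for the reply), and `fake_send` is a stand-in that only sleeps, so the printed numbers say nothing about vLLM.

```python
import time
from concurrent.futures import ThreadPoolExecutor
from typing import Callable, Sequence


def measure_throughput(
    send_request: Callable[[str], object],
    prompts: Sequence[str],
    concurrency: int = 32,
) -> float:
    """Send all prompts through a thread pool and return requests per second."""
    start = time.perf_counter()
    with ThreadPoolExecutor(max_workers=concurrency) as pool:
        # list() forces every request to finish before the clock stops.
        list(pool.map(send_request, prompts))
    elapsed = time.perf_counter() - start
    return len(prompts) / elapsed


def fake_send(prompt: str) -> int:
    """Stand-in for an HTTP call: pretend longer prompts take longer."""
    time.sleep(0.001 + len(prompt) * 1e-6)
    return len(prompt)


short_prompts = ["Summarize: ok"] * 50
long_prompts = ["Summarize: " + "lorem ipsum " * 500] * 50
for name, batch in (("short", short_prompts), ("long", long_prompts)):
    print(f"{name}: {measure_throughput(fake_send, batch):.1f} req/s")
```

Running the same two batches with and without `--enable-chunked-prefill` shows whether the flag helps or hurts for each prompt length.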

Operator checkpoint

Before treating this as solved, write down the local runtime, model or package version, hardware/backend if relevant, and the verification output. This keeps the guide useful as a Will-It-Run style decision instead of a one-off command transcript.
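
One way to make that record repeatable is to generate it. A minimal sketch, assuming the fields named above are enough. `build_run_record` and `package_version` are hypothetical helpers, and the hardware and verification strings are placeholders to fill in by hand.

```python
import json
import platform
from datetime import datetime, timezone
from importlib import metadata


def package_version(name: str) -> str:
    """Installed version of a package, or 'not installed'."""
    try:
        return metadata.version(name)
    except metadata.PackageNotFoundError:
        return "not installed"


def build_run_record(model: str, hardware: str, verification: str) -> dict:
    """Collect the fields the checkpoint asks for into one dictionary."""
    return {
        "recorded_at": datetime.now(timezone.utc).isoformat(),
        "python": platform.python_version(),
        "vllm": package_version("vllm"),
        "model": model,
        "hardware": hardware,
        "verification": verification,
    }


record = build_run_record(
    model="meta-llama/Llama-3.2-3B",
    hardware="fill in GPU model and count",
    verification="paste throughput and GPU utilization readings here",
)
print(json.dumps(record, indent=2))
```

Saving one such record per configuration makes later comparisons possible without rerunning the benchmark.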

Related guides

  • How to run batch inference for processing multiple prompts
  • How to configure batch size limits to prevent memory overflow