RUNLOCALAI · v38

Independently operated catalog for local-AI hardware and software. Hand-written verdicts. Source-cited claims. Reproducible commands when we have them.

OP·Fredoline Eruo

© 2026 runlocalai.co · Independently operated
Model Optimization for Local Inference

17. Batch Optimization

Chapter 17 of 18 · 20 min
KEY INSIGHT

Batching buys aggregate throughput at the cost of higher and less predictable per-request latency. Whether latency or throughput matters more for your workload determines the right batching strategy.

Efficient batching raises throughput by amortizing each forward pass over many requests. The challenge is batching variable-length sequences without wasting compute on padding.

Static batching pre-pads sequences to maximum length:

# Inefficient static batching
batch = [
    prompt_10_tokens,
    prompt_1000_tokens,
    prompt_50_tokens,
]
# All padded to 1000 tokens = massive waste on short prompts
# Useful work per row: ~1% for prompt_10 (10/1000), ~5% for prompt_50 (50/1000)
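
To make the padding cost concrete, here is a minimal sketch that counts wasted token slots for the three prompts in the static batching example. The helper name `padding_stats` is ours, for illustration only; it models padding arithmetic, not measured GPU utilization.

```python
def padding_stats(lengths):
    """Return (wasted_slots, useful_fraction) for a batch padded to its longest sequence."""
    padded_len = max(lengths)
    total_slots = padded_len * len(lengths)
    useful = sum(lengths)
    return total_slots - useful, useful / total_slots


lengths = [10, 1000, 50]  # token counts of the three example prompts
wasted, useful_fraction = padding_stats(lengths)

# Fraction of each padded row that holds real tokens.
per_prompt = [n / max(lengths) for n in lengths]

print(f"wasted slots: {wasted}, useful fraction: {useful_fraction:.3f}")
print(f"per-prompt useful fraction: {per_prompt}")
```

Under this model, roughly two thirds of the padded batch is padding, almost all of it caused by the single long prompt.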

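Dynamic batching assembles each batch at request time from whatever is queued, bounded by a sequence cap and a token budget. In engines that pad, grouping sequences of similar length keeps padding low:

# vLLM batch limits
from vllm import LLM

llm = LLM(
    model="meta-llama/Llama-2-70b-hf",
    max_num_seqs=256,           # Maximum sequences per batch
    max_num_batched_tokens=8192,  # Token budget per batch
    enable_chunked_prefill=True,  # Split long prefill
)

# A length-bucketing batcher prefers [10, 11, 12, 13] over [10, 1000].
# Note: to our knowledge vLLM does not sort requests by length. Its default
# scheduler is first-come-first-served and it batches at the token level,
# so mixed lengths do not incur row padding the way a static batch does.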
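
The idea of grouping by length can be sketched independently of any serving engine. The following is a generic greedy bucketing routine, assuming a padded-token budget per batch; `bucket_by_length` and `max_padded_tokens` are hypothetical names for illustration and do not describe vLLM's scheduler.

```python
def bucket_by_length(lengths, max_padded_tokens):
    """Greedily group sorted lengths so each batch's padded size stays within budget.

    A sequence longer than the budget still gets a batch of its own.
    """
    batches, current = [], []
    for n in sorted(lengths):
        # Lengths are ascending, so n would be the longest row if added,
        # and the padded size would be n * (rows after adding).
        if current and n * (len(current) + 1) > max_padded_tokens:
            batches.append(current)
            current = []
        current.append(n)
    if current:
        batches.append(current)
    return batches


batches = bucket_by_length([10, 1000, 12, 11, 50, 13], max_padded_tokens=1024)
print(batches)
```

The short prompts end up together and the 1000-token prompt is isolated, so no short prompt is padded out to 1000 tokens.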

Continuous batching (iteration-level scheduling) retires completed sequences and admits new ones at every forward pass:

# Continuous batching behavior
# Forward pass 1: [seq1, seq2, seq3, seq4] - all generating
# Forward pass 2: [seq1, seq2, seq3, seq5] - seq4 completed, new seq5 added
# Forward pass 3: [seq2, seq3, seq5, seq6] - seq1 completed, new seq6 added
# GPU utilization: ~80-95% with sufficient queue depth
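
The slot-refill behavior can be reproduced with a toy simulation. This is a minimal sketch, assuming each sequence needs a fixed number of decode steps and ignoring prefill cost and memory limits; the function name and the step counts are made up for illustration.

```python
from collections import deque


def simulate_continuous_batching(steps_needed, max_num_seqs):
    """Simulate iteration-level scheduling; return the running set at each forward pass.

    steps_needed maps a sequence name to the number of decode steps it requires.
    """
    waiting = deque(steps_needed.items())
    running = {}
    history = []
    while waiting or running:
        # Fill any free slots from the queue before the forward pass.
        while waiting and len(running) < max_num_seqs:
            name, steps = waiting.popleft()
            running[name] = steps
        history.append(sorted(running))
        # One forward pass: every running sequence emits one token.
        for name in list(running):
            running[name] -= 1
            if running[name] == 0:
                del running[name]  # finished sequences free their slot immediately
    return history


history = simulate_continuous_batching(
    {"seq1": 2, "seq2": 4, "seq3": 4, "seq4": 1, "seq5": 3, "seq6": 2},
    max_num_seqs=4,
)
for i, batch in enumerate(history, start=1):
    print(f"Forward pass {i}: {batch}")
```

Here seq4 finishes after the first pass and seq5 takes its slot, then seq1 finishes and seq6 is admitted, so the batch stays full as long as the queue is non-empty.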

Batch size tuning depends on workload characteristics:

# For throughput-optimized serving (many concurrent users)
llm = LLM(
    model="meta-llama/Llama-2-70b-hf",
    max_num_seqs=512,          # Many concurrent sequences
    max_num_batched_tokens=16384,  # Larger token budget
)

# For latency-optimized serving (few users, expect fast response)
llm = LLM(
    model="meta-llama/Llama-2-70b-hf",
    max_num_seqs=16,           # Few concurrent sequences
    max_num_batched_tokens=2048,  # Smaller token budget
    num_scheduler_steps=1,    # Reschedule every step (the default); only some older vLLM releases accept this argument
)

Prefill vs decode balancing:

# Prefill-heavy workload (many new requests)
llm = LLM(
    model="meta-llama/Llama-2-70b-hf",
    max_num_batched_tokens=8192,  # Larger prefill budget
    preemption_mode="swap",    # Swap preempted KV cache to CPU instead of recomputing; only some older vLLM releases accept this argument
)

# Decode-heavy workload (long generations)
llm = LLM(
    model="meta-llama/Llama-2-70b-hf",
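    max_num_seqs=256,          # More decode sequences
    enable_chunked_prefill=True,  # Split prefills so they interleave with decodes
    max_num_batched_tokens=2048,  # Smaller token budget limits prefill work per step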
)

Monitoring batch efficiency:

# Key metrics to track
metrics = {
    "batch_utilization": "tokens_used / max_num_batched_tokens",
    "queue_length": "waiting requests (look for spikes)",
    "prefill_decode_ratio": "prefill tokens / decode tokens",
    "gpu_utilization": "should be > 80% under load",
}

# Calculate batch efficiency
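# Example values: tokens generated by each sequence in one padded batch
gen_lengths = [120, 480, 512, 64]
batch_size = len(gen_lengths)
batch_efficiency = (
    sum(gen_lengths) /
    (batch_size * max(gen_lengths))
)
# High efficiency: > 0.7
# Low efficiency: < 0.3 (indicates padding waste)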
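
The first and third metrics can be computed from per-step token counts. The sketch below assumes you can log a `(prefill_tokens, decode_tokens)` pair for each forward pass; that log format and the `summarize_steps` helper are hypothetical, not something a particular engine emits in this shape.

```python
def summarize_steps(steps, max_num_batched_tokens):
    """Summarize a list of (prefill_tokens, decode_tokens) pairs, one per forward pass."""
    total_prefill = sum(p for p, _ in steps)
    total_decode = sum(d for _, d in steps)
    # Average share of the token budget that was actually used per step.
    utilization = (total_prefill + total_decode) / (len(steps) * max_num_batched_tokens)
    # Guard against a pure-prefill window with no decode tokens.
    ratio = total_prefill / total_decode if total_decode else float("inf")
    return {"batch_utilization": utilization, "prefill_decode_ratio": ratio}


# Made-up sample: three forward passes under an 8192-token budget.
stats = summarize_steps([(6000, 192), (0, 256), (4096, 0)], max_num_batched_tokens=8192)
print(stats)
```

A low `batch_utilization` alongside a long queue suggests the limits are too tight; a low value with an empty queue just means there is not enough traffic to fill the batch.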
EXERCISE

Generate traffic patterns with varying concurrency levels. Measure throughput (tokens/second) and latency (p50, p99) at each concurrency level. Identify the concurrency threshold where latency begins degrading.
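
One possible starting point for the exercise is a small asyncio harness. This is a sketch under stated assumptions: `fake_generate` is a hypothetical stand-in that sleeps for 10 ms and reports 32 tokens, so swap in a real client call to your server before drawing conclusions. The percentile uses the nearest-rank method.

```python
import asyncio
import math
import time


def percentile(samples, pct):
    """Nearest-rank percentile of a non-empty list (pct is an integer 0..100)."""
    ordered = sorted(samples)
    rank = max(1, math.ceil(pct * len(ordered) / 100))
    return ordered[rank - 1]


async def fake_generate(prompt):
    """Hypothetical stand-in for a real inference call; returns a token count."""
    await asyncio.sleep(0.01)
    return 32


async def run_level(concurrency, num_requests, generate=fake_generate):
    """Send num_requests with at most `concurrency` in flight; return summary stats."""
    limiter = asyncio.Semaphore(concurrency)
    latencies = []
    token_counts = []

    async def one_request(i):
        async with limiter:
            start = time.perf_counter()
            n = await generate(f"prompt {i}")
            latencies.append(time.perf_counter() - start)
            token_counts.append(n)

    wall_start = time.perf_counter()
    await asyncio.gather(*(one_request(i) for i in range(num_requests)))
    wall = time.perf_counter() - wall_start
    return {
        "concurrency": concurrency,
        "tokens_per_s": sum(token_counts) / wall,
        "p50_s": percentile(latencies, 50),
        "p99_s": percentile(latencies, 99),
    }


results = [asyncio.run(run_level(c, num_requests=20)) for c in (1, 4, 16)]
for row in results:
    print(row)
```

Against a real server, plot `tokens_per_s` and `p99_s` against concurrency; the threshold the exercise asks for is where p99 starts climbing faster than throughput.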
