Batch Optimization — Model Optimization for Local Inference (Chapter 17)

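Efficient batching trades a small amount of per-request latency for much higher throughput, because each pass over the model weights serves many sequences instead of one. The challenge lies in batching variable-length sequences without wasting compute on padding.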

Static batching pads every sequence to the length of the longest one in the batch:

# Inefficient static batching
batch = [
    prompt_10_tokens,
    prompt_1000_tokens,
    prompt_50_tokens,
]
# All padded to 1000 tokens = massive waste on short prompts
# Useful (non-padding) tokens: ~1% for prompt_10, ~5% for prompt_50
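The padding cost of a static batch can be worked out directly from the prompt lengths. The following is a minimal sketch; `padding_stats` is a hypothetical helper written for this illustration, not part of any library.

```python
def padding_stats(prompt_lengths):
    """Return (useful fraction per prompt, overall wasted fraction)
    for a batch padded to its longest prompt."""
    if not prompt_lengths:
        raise ValueError("prompt_lengths must not be empty")
    padded_len = max(prompt_lengths)
    # Share of each padded row that holds real tokens.
    useful = [n / padded_len for n in prompt_lengths]
    # Share of all token slots in the batch that hold padding.
    total_slots = padded_len * len(prompt_lengths)
    waste = 1 - sum(prompt_lengths) / total_slots
    return useful, waste


useful, waste = padding_stats([10, 1000, 50])
print([f"{u:.1%}" for u in useful])  # ['1.0%', '100.0%', '5.0%']
print(f"{waste:.1%}")                # 64.7%
```

For the three prompts above, roughly two thirds of the token slots in the batch are padding.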

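Dynamic batching assembles each batch at runtime from the requests that are waiting. Stacks that pad can group sequences of similar length to limit waste; vLLM is configured with a sequence limit and a token budget per batch: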

# vLLM dynamic batching
from vllm import LLM

llm = LLM(
    model="meta-llama/Llama-2-70b-hf",
    max_num_seqs=256,           # Maximum sequences per batch
    max_num_batched_tokens=8192,  # Token budget per batch
    enable_chunked_prefill=True,  # Split long prefill
)

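# vLLM fills each batch from the waiting queue (first-come, first-served by
# default) until one of the limits above is reached; it does not bucket by length.
# Sequences are not padded to a common length, so mixing [10, 1000]
# does not waste compute on the short prompt.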
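For a serving stack that does pad each batch to its longest member, grouping by length can be sketched as a greedy bucketing pass. This is an illustrative assumption about how such a scheduler might work, not vLLM's implementation; `bucket_by_length`, `max_padded_tokens`, and `max_seqs` are hypothetical names.

```python
def bucket_by_length(prompt_lengths, max_padded_tokens, max_seqs):
    """Greedy length bucketing: sort ascending, then start a new batch
    whenever the padded cost or the sequence limit would be exceeded."""
    batches, current = [], []
    for n in sorted(prompt_lengths):
        if n > max_padded_tokens:
            raise ValueError(f"prompt of {n} tokens exceeds the budget")
        # Lengths are ascending, so n would be the padded length of the batch.
        padded_cost = n * (len(current) + 1)
        if current and (padded_cost > max_padded_tokens or len(current) >= max_seqs):
            batches.append(current)
            current = []
        current.append(n)
    if current:
        batches.append(current)
    return batches


batches = bucket_by_length(
    [10, 1000, 12, 50, 11, 13, 980], max_padded_tokens=2048, max_seqs=4
)
print(batches)  # [[10, 11, 12, 13], [50, 980], [1000]]
```

The short prompts land together, but the greedy cut still pairs 50 with 980, which shows why a padded-token budget alone does not guarantee tight groups.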

Continuous batching (iteration-level scheduling) removes completed sequences from the batch and admits waiting ones at every forward pass, instead of holding slots until the whole batch finishes:

# Continuous batching behavior
# Forward pass 1: [seq1, seq2, seq3, seq4] - all generating
# Forward pass 2: [seq1, seq2, seq3, seq5] - seq4 completed, new seq5 added
# Forward pass 3: [seq2, seq3, seq5, seq6] - seq1 completed, new seq6 added
# GPU utilization: ~80-95% with sufficient queue depth
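The slot turnover can be reproduced with a toy iteration-level scheduler. This is a simplified sketch that ignores prefill cost and memory limits; `simulate_continuous_batching` is a hypothetical function and the token counts are made-up inputs.

```python
from collections import deque


def simulate_continuous_batching(remaining_tokens, max_num_seqs):
    """Toy iteration-level scheduler.

    remaining_tokens maps sequence name -> tokens still to generate,
    in arrival order. Returns the running batch for each forward pass.
    """
    if any(left <= 0 for left in remaining_tokens.values()):
        raise ValueError("every sequence must have at least one token to generate")
    waiting = deque(remaining_tokens.items())
    running = {}
    trace = []
    while waiting or running:
        # Admit waiting sequences into free slots before each forward pass.
        while waiting and len(running) < max_num_seqs:
            name, left = waiting.popleft()
            running[name] = left
        trace.append(list(running))
        # One forward pass: every running sequence emits one token.
        for name in list(running):
            running[name] -= 1
            if running[name] <= 0:
                del running[name]  # Finished sequences free their slot at once.
    return trace


trace = simulate_continuous_batching(
    {"seq1": 2, "seq2": 3, "seq3": 3, "seq4": 1, "seq5": 2, "seq6": 1},
    max_num_seqs=4,
)
for i, batch in enumerate(trace, start=1):
    print(f"Forward pass {i}: {batch}")
```

With these inputs every pass runs a full batch of four, because a freed slot is refilled before the next pass.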

Batch size tuning depends on workload characteristics:

# For throughput-optimized serving (many concurrent users)
llm = LLM(
    model="meta-llama/Llama-2-70b-hf",
    max_num_seqs=512,          # Many concurrent sequences
    max_num_batched_tokens=16384,  # Larger token budget per step
)

# For latency-optimized serving (few users, expect fast response)
llm = LLM(
    model="meta-llama/Llama-2-70b-hf",
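    max_num_seqs=16,           # Few concurrent sequences
    max_num_batched_tokens=2048,  # Smaller token budget
    enable_chunked_prefill=True,  # Lets the budget sit below the model's context length
    num_scheduler_steps=1,    # One scheduling pass per forward step (the default; only some vLLM releases accept this argument)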
)

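Prefill vs. decode balancing. Prefill processes a whole prompt in one pass, while decode emits one token per sequence per step, so the two phases compete for the same per-step token budget: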

# Prefill-heavy workload (many new requests)
llm = LLM(
    model="meta-llama/Llama-2-70b-hf",
    max_num_batched_tokens=8192,  # Larger prefill budget
    preemption_mode="swap",    # On preemption, move KV cache to CPU instead of recomputing (check that your vLLM version accepts this argument)
)

# Decode-heavy workload (long generations)
llm = LLM(
    model="meta-llama/Llama-2-70b-hf",
    max_num_seqs=256,          # More decode sequences
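    enable_chunked_prefill=True,   # Decodes are scheduled first; prefill fills what is left
    max_num_batched_tokens=2048,   # Smaller budget limits prefill work per step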
)
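How a token budget divides between the two phases can be sketched with a small planning function. It assumes a decode-first policy in which each running sequence takes one token of budget and prefill chunks take the remainder; that is a simplified model, not vLLM's scheduler code, and `plan_step` is a hypothetical name.

```python
def plan_step(num_decoding, pending_prefills, max_num_batched_tokens):
    """Split one step's token budget: decodes first, then prefill chunks.

    pending_prefills lists the remaining prompt tokens of each waiting
    request. Returns (decode_tokens, prefill_chunks, leftover_prefills).
    """
    decode_tokens = min(num_decoding, max_num_batched_tokens)
    budget = max_num_batched_tokens - decode_tokens
    chunks, leftover = [], []
    for remaining in pending_prefills:
        take = min(remaining, budget)  # Chunk the prompt if it does not fit.
        budget -= take
        if take:
            chunks.append(take)
        if remaining - take:
            leftover.append(remaining - take)
    return decode_tokens, chunks, leftover


# Small budget: the 3000-token prompt is split and the next prompt waits.
print(plan_step(256, [3000, 500], max_num_batched_tokens=2048))
# (256, [1792], [1208, 500])

# Large budget: both prompts are prefilled in a single step.
print(plan_step(256, [3000, 500], max_num_batched_tokens=8192))
# (256, [3000, 500], [])
```

A smaller budget keeps each step short, which helps decode latency, at the cost of spreading prefill over more steps.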

Monitoring batch efficiency:

# Key metrics to track
metrics = {
    "batch_utilization": "tokens_used / max_num_batched_tokens",
    "queue_length": "waiting requests (look for spikes)",
    "prefill_decode_ratio": "prefill tokens / decode tokens",
    "gpu_utilization": "should be > 80% under load",
}

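# Calculate batch efficiency for a batch that runs until its longest sequence finishes
gen_lengths = [120, 450, 512, 300]  # Example values: tokens generated per sequence
batch_size = len(gen_lengths)
batch_efficiency = sum(gen_lengths) / (batch_size * max(gen_lengths))  # ~0.67 here
# High efficiency: > 0.7
# Low efficiency: < 0.3 (indicates padding waste)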
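The tracked metrics can be derived from per-step counters. The sketch below assumes you can log prefill tokens, decode tokens, and queue length for each scheduler step; `StepRecord` and `summarize` are hypothetical names and the sample values are made up.

```python
from dataclasses import dataclass


@dataclass
class StepRecord:
    prefill_tokens: int    # Prompt tokens processed in this step
    decode_tokens: int     # Generated tokens in this step
    waiting_requests: int  # Queue length when the step was scheduled


def summarize(steps, max_num_batched_tokens):
    """Aggregate per-step counters into batch-efficiency metrics."""
    if not steps:
        raise ValueError("steps must not be empty")
    total_prefill = sum(s.prefill_tokens for s in steps)
    total_decode = sum(s.decode_tokens for s in steps)
    used = total_prefill + total_decode
    return {
        "batch_utilization": used / (len(steps) * max_num_batched_tokens),
        "max_queue_length": max(s.waiting_requests for s in steps),
        "prefill_decode_ratio": (
            total_prefill / total_decode if total_decode else float("inf")
        ),
    }


steps = [
    StepRecord(prefill_tokens=1024, decode_tokens=0, waiting_requests=5),
    StepRecord(prefill_tokens=512, decode_tokens=512, waiting_requests=8),
    StepRecord(prefill_tokens=0, decode_tokens=512, waiting_requests=2),
]
summary = summarize(steps, max_num_batched_tokens=2048)
print(summary)
```

Here the budget is a little over 40% used and the queue peaks at 8, a pattern that suggests the token budget is not the bottleneck.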