17. Batch Optimization
Chapter 17 of 18 · 20 min
Batching trades a small amount of per-request latency for much higher throughput, because each forward pass is shared across many requests. The challenge is batching variable-length sequences without wasting compute on padding.
Static batching pads every sequence to the length of the longest one in the batch:
# Inefficient static batching
batch = [
    prompt_10_tokens,
    prompt_1000_tokens,
    prompt_50_tokens,
]
# All padded to 1000 tokens = massive waste on short prompts
# Useful (non-padding) share of each row: ~1% for prompt_10, ~5% for prompt_50
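The padding cost of the batch above can be worked out directly. The following is a minimal sketch; `padding_stats` is a hypothetical helper written for this lesson, not part of any library.

```python
def padding_stats(lengths):
    """Return (padded_tokens, useful_tokens, waste_fraction) for one static batch."""
    if not lengths:
        raise ValueError("lengths must be non-empty")
    padded_len = max(lengths)           # every row is padded to the longest prompt
    total = padded_len * len(lengths)   # tokens the GPU actually processes
    useful = sum(lengths)               # tokens that carry real input
    return total, useful, 1 - useful / total


total, useful, waste = padding_stats([10, 1000, 50])
print(f"{useful}/{total} useful tokens, {waste:.1%} padding")
# prints: 1060/3000 useful tokens, 64.7% padding
```

Almost two thirds of the work in this batch is spent on padding, which is the waste the remaining techniques in this chapter try to avoid.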
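Dynamic batching forms batches at request time from whatever is queued, subject to a sequence limit and a token budget. On backends that pad, grouping sequences of similar length reduces the waste further:
# vLLM dynamic batching
from vllm import LLM
llm = LLM(
    model="meta-llama/Llama-2-70b-hf",
    max_num_seqs=256,             # Maximum sequences per batch
    max_num_batched_tokens=8192,  # Token budget per batch
    enable_chunked_prefill=True,  # Split long prefill
)
# vLLM admits queued requests until max_num_seqs or max_num_batched_tokens is hit.
# It does not pad to the longest sequence: PagedAttention stores the KV cache in blocks.
# On padded backends, prefer grouping [10, 11, 12, 13] over [10, 1000]
# GPU utilization: much higher when little compute goes to padding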
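The grouping idea ([10, 11, 12, 13] rather than [10, 1000]) can be sketched in plain Python for a backend that pads each batch. This is an illustration only and not vLLM's scheduler; `bucket_by_length` and its parameters are hypothetical names.

```python
def bucket_by_length(lengths, max_seqs, max_padded_tokens):
    """Greedily pack length-sorted sequences into padded batches.

    A batch is closed when adding the next sequence would exceed max_seqs,
    or would push (batch size * longest length) past max_padded_tokens.
    """
    batches, current = [], []
    for n in sorted(lengths):
        if n > max_padded_tokens:
            raise ValueError(f"sequence of {n} tokens exceeds the token budget")
        new_size = len(current) + 1
        # Lengths are ascending, so n would be the padded length of the batch.
        if current and (new_size > max_seqs or new_size * n > max_padded_tokens):
            batches.append(current)
            current = []
        current.append(n)
    if current:
        batches.append(current)
    return batches


print(bucket_by_length([10, 1000, 12, 11, 950, 13], max_seqs=4, max_padded_tokens=2048))
# prints: [[10, 11, 12, 13], [950, 1000]]
```

Sorting first keeps short prompts away from long ones, so each batch pads to a length close to its own members.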
Continuous batching (iteration-level scheduling) removes finished sequences from the batch and admits waiting ones at every forward pass, so a slot never sits idle waiting for the longest sequence to finish:
# Continuous batching behavior
# Forward pass 1: [seq1, seq2, seq3, seq4] - all generating
# Forward pass 2: [seq1, seq2, seq3, seq5] - seq4 completed, new seq5 added
# Forward pass 3: [seq2, seq3, seq5, seq6] - seq1 completed, new seq6 added
# GPU utilization: ~80-95% with sufficient queue depth
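The slot-refill behavior can be reproduced with a toy simulation. This sketch assumes one generated token per running sequence per forward pass and ignores prefill cost and memory limits; `simulate_continuous_batching` is a hypothetical helper, not an API of any serving engine.

```python
from collections import deque


def simulate_continuous_batching(remaining_tokens, max_num_seqs):
    """Return the running batch at each forward pass.

    remaining_tokens maps sequence name -> tokens still to generate,
    listed in arrival order.
    """
    waiting = deque(remaining_tokens.items())
    running = {}
    passes = []
    while waiting or running:
        # Refill free slots from the queue before every forward pass.
        while waiting and len(running) < max_num_seqs:
            name, tokens = waiting.popleft()
            running[name] = tokens
        passes.append(list(running))
        # One forward pass produces one token for each running sequence.
        for name in list(running):
            running[name] -= 1
            if running[name] <= 0:
                del running[name]  # finished: its slot is free next pass
    return passes


trace = simulate_continuous_batching(
    {"seq1": 2, "seq2": 3, "seq3": 3, "seq4": 1, "seq5": 2, "seq6": 1},
    max_num_seqs=4,
)
for i, batch in enumerate(trace, start=1):
    print(f"Forward pass {i}: {batch}")
```

With these example lengths the batch stays full for all three passes, because a queued sequence is always available to take a freed slot.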
Batch size tuning depends on workload characteristics:
# For throughput-optimized serving (many concurrent users)
llm = LLM(
    model="meta-llama/Llama-2-70b-hf",
    max_num_seqs=512,              # Many concurrent sequences
    max_num_batched_tokens=16384,  # Larger token budget
    gpu_memory_utilization=0.90,   # Memory share for weights + KV cache; more cache allows larger batches
)
# For latency-optimized serving (few users, expect fast response)
llm = LLM(
    model="meta-llama/Llama-2-70b-hf",
    max_num_seqs=16,              # Few concurrent sequences
    max_num_batched_tokens=2048,  # Smaller token budget
    num_scheduler_steps=1,        # Schedule every step (the default); only some vLLM releases accept this option
)
Prefill and decode compete for the same token budget, so the balance between them should follow the workload:
# Prefill-heavy workload (many new requests)
llm = LLM(
    model="meta-llama/Llama-2-70b-hf",
    max_num_batched_tokens=8192,  # Larger prefill budget
    preemption_mode="swap",       # Swap preempted KV blocks to CPU instead of recomputing (version-dependent option)
)
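# Decode-heavy workload (long generations)
llm = LLM(
    model="meta-llama/Llama-2-70b-hf",
    max_num_seqs=256,             # More decode sequences
    enable_chunked_prefill=True,  # Let prefill chunks share steps with decodes
    max_num_batched_tokens=2048,  # Smaller budget limits how much prefill interrupts decode
)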
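How a token budget splits a long prefill can be sketched as follows. The sketch assumes each step first reserves one token per decoding sequence and then gives the rest of the budget to the pending prompt, which is a simplification of decode-first chunked prefill; `plan_prefill_chunks` is a hypothetical helper.

```python
def plan_prefill_chunks(prompt_len, num_decoding, token_budget):
    """Return the prefill chunk size scheduled in each step for one long prompt."""
    room = token_budget - num_decoding  # budget left after the decode tokens
    if room <= 0:
        raise ValueError("token_budget must exceed the number of decoding sequences")
    chunks = []
    left = prompt_len
    while left > 0:
        chunk = min(room, left)
        chunks.append(chunk)
        left -= chunk
    return chunks


print(plan_prefill_chunks(prompt_len=5000, num_decoding=48, token_budget=2048))
# prints: [2000, 2000, 1000]
```

A smaller budget produces more, shorter chunks: the new request waits longer for its first token, but running decodes are delayed less in each step.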
Monitoring batch efficiency:
# Key metrics to track
metrics = {
"batch_utilization": "tokens_used / max_num_batched_tokens",
"queue_length": "waiting requests (look for spikes)",
"prefill_decode_ratio": "prefill tokens / decode tokens",
"gpu_utilization": "should be > 80% under load",
}
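# Calculate batch efficiency for a padded batch
# gen_lengths: tokens generated by each sequence in the batch (example values)
gen_lengths = [120, 95, 400, 30]
batch_efficiency = sum(gen_lengths) / (len(gen_lengths) * max(gen_lengths))
# Here: 645 / (4 * 400) ≈ 0.40
# High efficiency: > 0.7
# Low efficiency: < 0.3 (indicates padding waste)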
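The two ratio metrics listed above can be computed per scheduler step from token counts. This is a minimal sketch that assumes you already log prefill and decode tokens per step; `step_metrics` and its argument names are hypothetical.

```python
def step_metrics(prefill_tokens, decode_tokens, max_num_batched_tokens):
    """Compute batch utilization and prefill/decode ratio for one step."""
    used = prefill_tokens + decode_tokens
    ratio = prefill_tokens / decode_tokens if decode_tokens else float("inf")
    return {
        "batch_utilization": used / max_num_batched_tokens,
        "prefill_decode_ratio": ratio,
    }


print(step_metrics(prefill_tokens=1536, decode_tokens=512, max_num_batched_tokens=8192))
# prints: {'batch_utilization': 0.25, 'prefill_decode_ratio': 3.0}
```

A utilization that stays low while the queue is non-empty suggests that the sequence limit, not the token budget, is the binding constraint.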
EXERCISE
Generate traffic patterns with varying concurrency levels. Measure throughput (tokens/second) and latency (p50, p99) at each concurrency level. Identify the concurrency threshold where latency begins degrading.
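One way to summarize each concurrency level is sketched below. It assumes you have already recorded per-request latencies, generated-token counts, and the wall-clock duration of the run; `percentile` and `summarize` are hypothetical helpers, and the percentile uses the nearest-rank method.

```python
import math


def percentile(values, pct):
    """Nearest-rank percentile for pct in (0, 100]."""
    if not values:
        raise ValueError("values must be non-empty")
    ordered = sorted(values)
    rank = max(1, math.ceil(pct / 100 * len(ordered)))
    return ordered[rank - 1]


def summarize(latencies_s, tokens_generated, wall_time_s):
    """Throughput and latency percentiles for one concurrency level."""
    return {
        "throughput_tok_s": sum(tokens_generated) / wall_time_s,
        "p50_s": percentile(latencies_s, 50),
        "p99_s": percentile(latencies_s, 99),
    }


# Example values only, not measurements.
print(summarize([0.8, 1.0, 1.2, 1.1, 4.0], [100] * 5, wall_time_s=5.0))
# prints: {'throughput_tok_s': 100.0, 'p50_s': 1.1, 'p99_s': 4.0}
```

Run it once per concurrency level and compare the results: the threshold you are looking for is where p99 climbs sharply while throughput has stopped growing.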