11. vLLM Optimization
Chapter 11 of 18 · 20 min
vLLM provides production-grade inference serving built around PagedAttention, which manages the KV cache in fixed-size blocks. Knowing what each configuration option controls lets you get the most performance out of the available hardware.
Memory allocation tuning:
from vllm import LLM

# Note: some options below (for example preemption_mode) exist only in certain vLLM versions.
llm = LLM(
    model="meta-llama/Llama-2-70b-hf",
    # Memory configuration
    gpu_memory_utilization=0.92,   # Fraction of each GPU vLLM may use (weights, activations, KV cache); the other 8% stays free
    max_model_len=4096,            # Context limit; Llama 2 natively supports 4096 tokens
    swap_space=4,                  # GiB of CPU swap space per GPU for preempted KV cache blocks
    # Parallelism
    tensor_parallel_size=2,        # Number of GPUs the model is sharded across
    pipeline_parallel_size=1,      # Pipeline stages (usually 1 on a single node)
    # Batching
    max_num_seqs=256,              # Maximum concurrent sequences
    max_num_batched_tokens=8192,   # Token budget per scheduler step
    preemption_mode="swap",        # "swap" or "recompute" on preemption
)
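Throughput optimization requires understanding the critical path. Prefill is typically compute-bound, decode is typically bound by memory bandwidth, and the CPU-side scheduler adds overhead between GPU steps. A setting that relieves one of these often shifts the bottleneck to another.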
# Maximize throughput for long contexts
llm = LLM(
    model="meta-llama/Llama-2-70b-hf",
    gpu_memory_utilization=0.85,   # Leaves more headroom outside vLLM's budget, at the cost of a smaller KV cache
    max_model_len=32768,           # Very long contexts; must not exceed what the model supports (Llama 2 is natively 4096)
    block_size=32,                 # Larger blocks mean fewer block-table entries per long sequence
    # Scheduler tuning
    enable_prefix_caching=True,    # Reuse KV cache blocks across requests with identical prefixes
    disable_sliding_window=False,  # False keeps sliding-window attention on for models that use it (e.g. Mistral)
    num_scheduler_steps=10,        # Decode steps per scheduler call: less CPU overhead, burstier token delivery
)
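To make the `block_size` trade-off concrete, the sketch below counts how many KV cache blocks a set of sequences occupies and how many token slots sit unused in each sequence's final block. It is plain arithmetic, not vLLM code; `blocks_needed` and `wasted_slots` are hypothetical helper names, and the sequence lengths are illustrative.

```python
import math


def blocks_needed(seq_len: int, block_size: int) -> int:
    """Fixed-size KV cache blocks occupied by a sequence of seq_len tokens."""
    return math.ceil(seq_len / block_size)


def wasted_slots(seq_lens: list[int], block_size: int) -> int:
    """Token slots allocated but unused in the last block of each sequence."""
    return sum(blocks_needed(n, block_size) * block_size - n for n in seq_lens)


if __name__ == "__main__":
    lengths = [100, 1000, 5000]  # illustrative sequence lengths, in tokens
    for bs in (16, 32):
        total_blocks = sum(blocks_needed(n, bs) for n in lengths)
        print(f"block_size={bs}: {total_blocks} blocks, "
              f"{wasted_slots(lengths, bs)} unused slots")
```

With these lengths, doubling the block size roughly halves the number of blocks to track but increases the unused slots, which is the overhead-versus-fragmentation trade-off behind the setting.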
Latency optimization for interactive use:
# Minimize latency for single-user workloads
llm = LLM(
    model="meta-llama/Llama-2-7b-hf",  # Smaller model for lower latency
    gpu_memory_utilization=0.95,       # Give vLLM most of the GPU
    max_model_len=4096,                # Caps prompt length, which bounds prefill time
    block_size=16,
    num_scheduler_steps=1,             # Schedule every step so new tokens are returned immediately
    enable_chunked_prefill=True,       # Split long prefills into chunks so they do not stall decoding
)
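Chunked prefill can be pictured as slicing a long prompt into pieces that each fit a per-step token budget, so that decode steps for other requests can run between pieces. The sketch below shows only the slicing; `chunk_prefill` and `token_budget` are hypothetical names, and vLLM's actual scheduler also mixes decode tokens into each step.

```python
def chunk_prefill(prompt_len: int, token_budget: int) -> list[tuple[int, int]]:
    """Split a prompt of prompt_len tokens into [start, end) ranges of at most token_budget tokens."""
    if token_budget <= 0:
        raise ValueError("token_budget must be positive")
    return [
        (start, min(start + token_budget, prompt_len))
        for start in range(0, prompt_len, token_budget)
    ]


if __name__ == "__main__":
    # A 5000-token prompt with a 2048-token budget per step
    for start, end in chunk_prefill(5000, 2048):
        print(f"prefill tokens {start}..{end - 1}")
```

A smaller budget shortens each prefill step, which helps inter-token latency for requests already decoding, but the prompt then needs more steps before its first token appears.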
Metrics monitoring reveals bottlenecks:
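# Query the vLLM metrics endpoint
import requests

# The vLLM server exposes Prometheus metrics at /metrics
response = requests.get("http://localhost:8000/metrics", timeout=5)
response.raise_for_status()
print(response.text[:2000])

# Key metrics (names can differ between vLLM versions; check the endpoint output):
# - vllm:num_requests_running: requests currently being processed
# - vllm:num_requests_waiting: requests queued for scheduling
# - vllm:prompt_tokens_total, vllm:generation_tokens_total: cumulative token counters; their rate gives throughput
# - vllm:gpu_cache_usage_perc: KV cache utilization (reported as a fraction, 1.0 = full)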
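Printing raw text is fine for a first look, but a bottleneck check usually needs the numbers. The following is a minimal sketch of a parser for the Prometheus text format, assuming one model per server so that labels can be ignored; `parse_prometheus` is a hypothetical helper and the sample values are made up for illustration.

```python
def parse_prometheus(text: str) -> dict[str, float]:
    """Map each metric name to its value, ignoring labels (last sample wins)."""
    samples: dict[str, float] = {}
    for raw in text.splitlines():
        line = raw.strip()
        if not line or line.startswith("#"):  # skip blanks, HELP and TYPE lines
            continue
        close = line.rfind("}")
        if close != -1:
            # Form: name{label="x",...} value [timestamp]
            name = line.partition("{")[0]
            fields = line[close + 1:].split()
        else:
            # Form: name value [timestamp]
            name, *fields = line.split()
        if fields:
            samples[name] = float(fields[0])
    return samples


SAMPLE = """\
# HELP vllm:num_requests_running Number of requests currently running on GPU.
# TYPE vllm:num_requests_running gauge
vllm:num_requests_running{model_name="demo"} 3.0
vllm:num_requests_waiting{model_name="demo"} 12.0
vllm:gpu_cache_usage_perc{model_name="demo"} 0.87
"""

if __name__ == "__main__":
    metrics = parse_prometheus(SAMPLE)
    # A long queue together with a nearly full cache suggests the KV cache is the limit
    if metrics["vllm:num_requests_waiting"] > 0 and metrics["vllm:gpu_cache_usage_perc"] > 0.8:
        print("Requests are queuing while the KV cache is nearly full")
```

In practice, pass `response.text` from the request above instead of `SAMPLE`; for anything beyond a quick check, a Prometheus server scraping the endpoint is the more usual setup.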
Common configuration failures:
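# Failure: max_model_len exceeds available memory
llm = LLM(
    model="meta-llama/Llama-2-70b-hf",
    max_model_len=131072,  # 128K context: about 40 GiB of fp16 KV cache for one sequence, on top of roughly 130 GiB of weights
)
# Engine startup fails: the KV cache that fits in the remaining memory holds fewer tokens than max_model_len
# (128K also exceeds Llama 2's native 4096-token context)
# Solution: calculate the maximum feasible context
# Memory available for KV cache = VRAM * gpu_memory_utilization - model_weights - activation overhead
# max_model_len = available_memory / (2 * layers * kv_heads * head_dim * bytes_per_element)
# Count KV heads, not attention heads: with grouped-query attention (Llama-2-70B has 8 KV heads) the cache is much smaller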
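The formula above can be turned into a small calculator. This is a rough estimate under stated assumptions, not what vLLM computes internally: memory is treated as one pool across tensor-parallel GPUs, `overhead_gib` is a hypothetical allowance for activations and runtime buffers, and the function names are illustrative. It counts KV heads, which under grouped-query attention are fewer than attention heads.

```python
GIB = 1024 ** 3


def kv_bytes_per_token(num_layers: int, num_kv_heads: int, head_dim: int,
                       bytes_per_element: int = 2) -> int:
    """KV cache bytes for one token: a key and a value vector in every layer."""
    return 2 * num_layers * num_kv_heads * head_dim * bytes_per_element


def max_cached_tokens(vram_gib: float, gpu_memory_utilization: float,
                      weight_gib: float, num_layers: int, num_kv_heads: int,
                      head_dim: int, bytes_per_element: int = 2,
                      overhead_gib: float = 0.0) -> int:
    """Estimate how many tokens of KV cache fit after weights and overhead."""
    budget_gib = vram_gib * gpu_memory_utilization - weight_gib - overhead_gib
    if budget_gib <= 0:
        return 0  # the weights alone do not fit in the budget
    per_token = kv_bytes_per_token(num_layers, num_kv_heads, head_dim,
                                   bytes_per_element)
    return int(budget_gib * GIB // per_token)


if __name__ == "__main__":
    # Llama-2-70B: 80 layers, 8 KV heads, head_dim 128, fp16 cache.
    # 2 x 80 GiB GPUs and ~130 GiB of weights are assumptions for illustration.
    tokens = max_cached_tokens(vram_gib=160, gpu_memory_utilization=0.92,
                               weight_gib=130, num_layers=80, num_kv_heads=8,
                               head_dim=128, overhead_gib=4)
    print(f"KV cache holds roughly {tokens} tokens in total")
```

The result is the total token capacity of the cache, shared by all concurrent sequences, so `max_model_len` has to stay below it, and well below it if many long sequences should run at once.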
EXERCISE
Run a load test against vLLM with varying request patterns (bursty, steady, varied length). Plot throughput and latency curves. Identify the configuration that maximizes throughput while keeping p50 latency under your SLA.
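As a starting point for the analysis part of the exercise, the sketch below summarizes per-request timings into throughput and latency percentiles. It assumes your load generator records a start time, an end time, and an output token count for each request; `RequestRecord`, `percentile`, and `summarize` are hypothetical names, and the percentile uses the nearest-rank method.

```python
import math
from dataclasses import dataclass


@dataclass
class RequestRecord:
    start_s: float      # when the request was sent
    end_s: float        # when the last token arrived
    output_tokens: int  # tokens generated for this request


def percentile(values: list[float], pct: float) -> float:
    """Nearest-rank percentile of a non-empty list."""
    ordered = sorted(values)
    rank = max(1, math.ceil(pct / 100 * len(ordered)))
    return ordered[rank - 1]


def summarize(records: list[RequestRecord]) -> dict[str, float]:
    """Aggregate output throughput plus p50 and p95 end-to-end latency."""
    if not records:
        raise ValueError("no records to summarize")
    latencies = [r.end_s - r.start_s for r in records]
    wall_s = max(r.end_s for r in records) - min(r.start_s for r in records)
    total_tokens = sum(r.output_tokens for r in records)
    return {
        "throughput_tok_s": total_tokens / wall_s if wall_s > 0 else float("inf"),
        "p50_s": percentile(latencies, 50),
        "p95_s": percentile(latencies, 95),
    }


if __name__ == "__main__":
    demo = [  # made-up timings for illustration
        RequestRecord(0.0, 1.0, 100),
        RequestRecord(0.0, 2.0, 100),
        RequestRecord(1.0, 4.0, 200),
        RequestRecord(2.0, 4.0, 100),
    ]
    print(summarize(demo))
```

Running one summary per configuration and request pattern gives the points for the throughput and latency curves; comparing p50 against p95 also shows how much a configuration hurts the slowest requests.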