vLLM Optimization — Model Optimization for Local Inference (Chapter 11)

vLLM provides production-grade inference serving with PagedAttention at its core. Understanding its configuration knobs lets you get the most performance out of the hardware you have.

Memory allocation tuning:

from vllm import LLM

llm = LLM(
    model="meta-llama/Llama-2-70b-hf",
    
    # Memory configuration
    gpu_memory_utilization=0.92,      # Fraction of each GPU's memory vLLM may use (weights, activations, KV cache)
    max_model_len=4096,              # Cap context to fit VRAM; Llama 2's native window is 4096 tokens
    swap_space=4,                    # GB of CPU swap for KV cache overflow
    
    # Parallelism
    tensor_parallel_size=2,          # GPUs for model parallelism
    pipeline_parallel_size=1,        # Pipeline stages (usually 1 on a single node)
    
    # Batching
    max_num_seqs=256,                # Maximum concurrent sequences
    max_num_batched_tokens=8192,     # Tokens per forward pass
    preemption_mode="swap",          # "swap" to CPU or "recompute" on preemption
)

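Throughput optimization requires understanding the critical path. Prefill is typically compute-bound, decode is typically limited by memory bandwidth, and scheduling overhead adds CPU time between GPU steps; the settings below trade these against one another.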
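One way to see how memory bandwidth enters the critical path: each decode step has to read every model weight once, so weight bytes divided by memory bandwidth gives a floor on step time. The sketch below computes that floor. The function name and the hardware figures (70B parameters in FP16, 2 TB/s per GPU, two GPUs) are illustrative assumptions, not measurements, and the estimate ignores KV cache reads and communication.

```python
def decode_step_floor_ms(num_params: float, bytes_per_param: float,
                         bandwidth_gb_s: float, num_gpus: int = 1) -> float:
    """Lower bound on one decode step in milliseconds.

    Assumes weights are read exactly once per step and are sharded
    evenly across GPUs (as with tensor parallelism).
    """
    weight_bytes = num_params * bytes_per_param
    bytes_per_gpu = weight_bytes / num_gpus
    seconds = bytes_per_gpu / (bandwidth_gb_s * 1e9)
    return seconds * 1e3


# Assumed figures: 70e9 params, 2 bytes each, 2000 GB/s per GPU, 2 GPUs.
floor_ms = decode_step_floor_ms(70e9, 2, 2000, num_gpus=2)
print(f"decode step floor: {floor_ms:.1f} ms")
```

Because the weight read is shared by every sequence in the batch, this floor is roughly independent of batch size until compute becomes the limit, which is why larger batches raise throughput.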

# Maximize throughput for long contexts
llm = LLM(
    model="meta-llama/Llama-2-70b-hf",
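    gpu_memory_utilization=0.85,     # Leaves headroom for other GPU users; a lower value shrinks the KV cache
    max_model_len=32768,             # Very long contexts; the model must support this length (Llama 2 natively stops at 4096)
    block_size=32,                   # Larger blocks mean fewer blocks to track for long sequences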
    
    # Scheduler tuning
    enable_prefix_caching=True,      # Share KV cache for identical prefixes
    disable_sliding_window=False,    # Keep sliding-window attention for models that use it (e.g. Mistral)
    num_scheduler_steps=10,          # Decode steps per scheduler call: less CPU overhead, coarser scheduling
)
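To make the block_size trade-off concrete: PagedAttention stores the KV cache in fixed-size blocks, so a sequence of n tokens needs ceil(n / block_size) blocks and leaves at most block_size - 1 unused slots in its last block. The helper below is a hypothetical illustration of that arithmetic and is not part of vLLM's API.

```python
import math


def kv_blocks_needed(seq_len: int, block_size: int) -> tuple[int, int]:
    """Return (blocks allocated, unused token slots in the last block)."""
    blocks = math.ceil(seq_len / block_size)
    wasted = blocks * block_size - seq_len
    return blocks, wasted


for block_size in (16, 32):
    blocks, wasted = kv_blocks_needed(30000, block_size)
    print(f"block_size={block_size}: {blocks} blocks, {wasted} slots unused")
```

Doubling the block size roughly halves the number of block-table entries per sequence, at the cost of slightly more wasted space in the final block.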

Latency optimization for interactive use:

# Minimize latency for single-user workloads
llm = LLM(
    model="meta-llama/Llama-2-7b-hf",  # Smaller model for lower latency
    gpu_memory_utilization=0.95,       # Maximize memory usage
    max_model_len=4096,                # Shorter context = faster
    block_size=16,
    num_scheduler_steps=1,             # Single step = minimal latency
    enable_chunked_prefill=True,       # Split long prefill into chunks
)
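The effect of chunked prefill can be sketched with simple arithmetic: a long prompt is processed in pieces no larger than a per-step token budget, so decode steps for other requests can run between pieces. The function below is a simplified illustration; `chunk_tokens` is a hypothetical stand-in for that budget, and the real scheduler also packs decode tokens into the same budget, so actual chunk sizes vary.

```python
def prefill_chunks(prompt_len: int, chunk_tokens: int) -> list[int]:
    """Split a prompt of prompt_len tokens into chunks of at most chunk_tokens."""
    if chunk_tokens <= 0:
        raise ValueError("chunk_tokens must be positive")
    full, rest = divmod(prompt_len, chunk_tokens)
    return [chunk_tokens] * full + ([rest] if rest else [])


chunks = prefill_chunks(3000, 1024)
print(chunks)  # three chunks instead of one 3000-token prefill
```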

Metrics monitoring reveals bottlenecks:

# Query the metrics endpoint of a running vLLM server
import requests

# vLLM exposes Prometheus metrics at /metrics
response = requests.get("http://localhost:8000/metrics", timeout=5)
response.raise_for_status()
print(response.text[:2000])
# Key metrics:
# - vllm:num_requests_running: requests currently being processed
# - vllm:num_requests_waiting: requests queued for scheduling
# - vllm:gpu_cache_usage_perc: KV cache utilization (1.0 = full)
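Printing the raw response is hard to scan, so it can help to pull out individual values. The sketch below parses the Prometheus text format with plain string handling. The `sample` text is hand-written for illustration and is not real server output; the parser assumes one value per metric name (later label sets overwrite earlier ones) and no trailing timestamps.

```python
def parse_gauges(metrics_text: str, prefix: str = "vllm:") -> dict[str, float]:
    """Extract {metric_name: value} for sample lines whose name starts with prefix."""
    values: dict[str, float] = {}
    for line in metrics_text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):  # skip blanks, HELP and TYPE lines
            continue
        name_and_labels, _, value = line.rpartition(" ")
        name = name_and_labels.split("{", 1)[0]  # drop the {label="..."} part
        if name.startswith(prefix):
            values[name] = float(value)
    return values


# Hand-written sample in Prometheus text format (assumed label name and values).
sample = """\
# TYPE vllm:num_requests_running gauge
vllm:num_requests_running{model_name="demo"} 3.0
vllm:num_requests_waiting{model_name="demo"} 12.0
vllm:gpu_cache_usage_perc{model_name="demo"} 0.97
"""

metrics = parse_gauges(sample)
print(metrics)
```

A queue of waiting requests combined with cache usage near 1.0, as in this sample, suggests the KV cache rather than compute is the limiting resource.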

Common configuration failures:

# Failure: max_model_len exceeds available memory
llm = LLM(
    model="meta-llama/Llama-2-70b-hf",
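    max_model_len=131072,  # 128K context: about 40 GiB of FP16 KV cache per sequence, on top of the weights
)
# Engine startup fails with a ValueError: the requested max sequence length is larger
# than the number of tokens the KV cache can hold (it also exceeds Llama 2's 4096-token window)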

# Solution: calculate the maximum feasible context
# KV cache budget = VRAM * gpu_memory_utilization - model_weights - activation memory
# bytes per token = 2 (K and V) * layers * kv_heads * head_dim * bytes_per_element
# max_model_len <= KV cache budget / bytes per token
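The calculation above can be turned into a small estimator. This is a rough sketch under stated assumptions: it counts KV heads rather than attention heads (they differ for grouped-query models such as Llama-2-70B, which has 8 KV heads), it ignores activation memory, and the 130 GiB weight figure and 2 x 80 GiB GPU setup are assumed values for illustration.

```python
def max_context_tokens(vram_gib_per_gpu: float, num_gpus: int,
                       gpu_memory_utilization: float, weight_gib: float,
                       layers: int, kv_heads: int, head_dim: int,
                       bytes_per_elem: int = 2) -> int:
    """Estimate how many tokens of KV cache fit after loading the weights."""
    budget_gib = vram_gib_per_gpu * num_gpus * gpu_memory_utilization - weight_gib
    if budget_gib <= 0:
        return 0  # weights alone exceed the allowed memory
    # Factor 2 covers the separate key and value tensors in every layer.
    bytes_per_token = 2 * layers * kv_heads * head_dim * bytes_per_elem
    return int(budget_gib * 2**30 // bytes_per_token)


# Llama-2-70B shape: 80 layers, 8 KV heads, head_dim 128, FP16 cache.
tokens = max_context_tokens(80, 2, 0.92, weight_gib=130,
                            layers=80, kv_heads=8, head_dim=128)
print(f"KV cache holds about {tokens} tokens in total")
```

The result is a budget shared by all concurrent sequences, so the feasible max_model_len for a single request is at most this number and usually well below it once batching is taken into account.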