vLLM Distributed Serving — Local AI Clusters (Chapter 7)

vLLM combines PagedAttention, which manages the KV cache in fixed-size blocks, with tensor parallelism for efficient multi-GPU LLM serving. Understanding its distributed architecture is what makes a correct cluster deployment possible when a model does not fit in a single GPU's memory.
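To make the memory constraint concrete, the following sketch estimates the weight memory each GPU must hold. It assumes a nominal 70 billion parameters stored at 2 bytes each (FP16/BF16) and an even split across GPUs. The function name is hypothetical, and the estimate ignores KV cache, activations, and runtime overhead, so real requirements are higher.

```python
def weight_gb_per_gpu(num_params, bytes_per_param, num_gpus):
    """Rough weight memory per GPU in GB (10**9 bytes), assuming an even split.

    Illustrative only: ignores KV cache, activations, and framework overhead.
    """
    return num_params * bytes_per_param / num_gpus / 1e9


# Assumed figures: 70e9 parameters, 2 bytes per parameter (FP16/BF16).
single_gpu = weight_gb_per_gpu(70e9, 2, 1)
two_gpus = weight_gb_per_gpu(70e9, 2, 2)
eight_gpus = weight_gb_per_gpu(70e9, 2, 8)

print(f"1 GPU:  {single_gpu} GB")
print(f"2 GPUs: {two_gpus} GB")
print(f"8 GPUs: {eight_gpus} GB")
```

Under these assumptions the weights alone need 140 GB on one GPU, 70 GB per GPU on two, and 17.5 GB per GPU on eight, which is why a model of this size has to be sharded.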

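vLLM's tensor parallelism follows the Megatron pattern for linear layers: each layer's weight matrices are sharded across GPUs, and the partial results are combined with an all-reduce. The tensor_parallel_size parameter sets the number of shards, so setting it to 2 splits every layer across 2 GPUs. The pipeline_parallel_size parameter instead assigns contiguous groups of layers to different GPUs, so each GPU holds whole layers for one stage of the model. The total number of GPUs used is tensor_parallel_size times pipeline_parallel_size.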
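The Megatron pattern can be sketched in plain Python without any GPU. This is a toy illustration, not vLLM's implementation: the first weight matrix is split by columns, the second by rows, each simulated "GPU" computes a partial output, and a sum stands in for the all-reduce. All function names are hypothetical, and ReLU replaces the model's real activation to keep the arithmetic exact.

```python
def matmul(a, b):
    """Multiply two matrices given as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def relu(m):
    """Elementwise ReLU (stand-in for the model's real activation)."""
    return [[max(0, v) for v in row] for row in m]


def split_columns(m, parts):
    """Shard a matrix into `parts` equal column blocks."""
    width = len(m[0]) // parts
    return [[row[i * width:(i + 1) * width] for row in m] for i in range(parts)]


def split_rows(m, parts):
    """Shard a matrix into `parts` equal row blocks."""
    height = len(m) // parts
    return [m[i * height:(i + 1) * height] for i in range(parts)]


def all_reduce_sum(partials):
    """Elementwise sum of per-shard outputs, simulating an all-reduce."""
    return [[sum(vals) for vals in zip(*rows)] for rows in zip(*partials)]


def mlp_reference(x, a, b):
    """Unsharded two-layer MLP: relu(x @ a) @ b."""
    return matmul(relu(matmul(x, a)), b)


def mlp_tensor_parallel(x, a, b, tp_size):
    """Same MLP with `a` column-sharded and `b` row-sharded across tp_size shards."""
    a_shards = split_columns(a, tp_size)
    b_shards = split_rows(b, tp_size)
    # Each shard works independently until the final all-reduce.
    partials = [matmul(relu(matmul(x, a_i)), b_i) for a_i, b_i in zip(a_shards, b_shards)]
    return all_reduce_sum(partials)


X = [[1, -2, 3, 0], [2, 1, -1, 4]]
A = [[1, 0, -1, 2], [0, 1, 2, -1], [3, -2, 0, 1], [1, 1, 1, 1]]
B = [[1, 2], [0, -1], [2, 0], [-1, 3]]

assert mlp_tensor_parallel(X, A, B, 2) == mlp_reference(X, A, B)
```

Because the activation is elementwise, each column shard of the first layer can be processed without communication; only the output of the second layer needs the all-reduce, which is why this split keeps synchronization to one collective per block.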

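Starting a distributed vLLM server on a single node does not require torchrun or another external launcher: vLLM spawns one worker process per GPU itself. Pass the tensor-parallel size to the server entrypoint (on older releases without the vllm command, python -m vllm.entrypoints.openai.api_server --model <model> accepts the same flags):

vllm serve meta-llama/Llama-2-70b-hf \
    --tensor-parallel-size 2 \
    --host 0.0.0.0 \
    --port 8000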
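Once the server is listening, a quick request confirms that all workers came up. The sketch below uses only the standard library and assumes the server exposes the OpenAI-compatible /v1/completions route on localhost:8000, which is the case for the OpenAI-compatible entrypoint; the demo api_server.py entrypoint uses a different /generate route. The helper names are hypothetical.

```python
import json
import urllib.request


def build_completion_request(base_url, model, prompt, max_tokens=32):
    """Build (but do not send) a POST request for the /v1/completions route."""
    payload = {"model": model, "prompt": prompt, "max_tokens": max_tokens}
    return urllib.request.Request(
        url=base_url.rstrip("/") + "/v1/completions",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def send(request, timeout=60):
    """Send the request and decode the JSON response (needs a running server)."""
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return json.loads(response.read().decode("utf-8"))


# Assumed address and model name; adjust to match the running server.
request = build_completion_request(
    "http://localhost:8000",
    "meta-llama/Llama-2-70b-hf",
    "The capital of France is",
)
```

With the server running, send(request)["choices"][0]["text"] would contain the generated continuation. Building the request separately from sending it makes the payload easy to inspect before any GPU time is spent.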

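Multi-node deployment uses a Ray cluster that spans the nodes rather than torchrun. Start Ray on every node, then launch the server once from the head node. With 2 nodes of 8 GPUs each, tensor parallelism stays within a node and pipeline parallelism spans the nodes:

# Node 0 (head)
ray start --head --port=6379

# Node 1 (worker)
ray start --address=<head-node-ip>:6379

# Node 0 (head), after the worker has joined
vllm serve meta-llama/Llama-2-70b-hf \
    --tensor-parallel-size 8 \
    --pipeline-parallel-size 2 \
    --distributed-executor-backend ray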
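To see how 16 GPUs on 2 nodes can be divided, the sketch below maps global ranks to nodes, pipeline stages, and tensor-parallel ranks. It assumes tensor_parallel_size 8 combined with pipeline_parallel_size 2, and a rank ordering in which tensor-parallel ranks are contiguous. The function is hypothetical and purely illustrative; the actual placement is decided by vLLM and its executor.

```python
def rank_layout(num_nodes, gpus_per_node, tp_size, pp_size):
    """Map each global rank to a node, local GPU, pipeline stage, and TP rank.

    Assumes tensor-parallel ranks are contiguous (an illustrative convention).
    """
    world_size = tp_size * pp_size
    if world_size != num_nodes * gpus_per_node:
        raise ValueError(
            f"tp_size * pp_size = {world_size}, "
            f"but the cluster has {num_nodes * gpus_per_node} GPUs"
        )
    layout = []
    for rank in range(world_size):
        layout.append({
            "rank": rank,
            "node": rank // gpus_per_node,
            "local_gpu": rank % gpus_per_node,
            "pp_stage": rank // tp_size,
            "tp_rank": rank % tp_size,
        })
    return layout


layout = rank_layout(num_nodes=2, gpus_per_node=8, tp_size=8, pp_size=2)

# Collect which nodes each pipeline stage touches.
stage_nodes = {}
for entry in layout:
    stage_nodes.setdefault(entry["pp_stage"], set()).add(entry["node"])

for stage, nodes in sorted(stage_nodes.items()):
    print(f"pipeline stage {stage} runs on node(s) {sorted(nodes)}")
```

Under these assumptions each pipeline stage lands on exactly one node, so the frequent tensor-parallel all-reduce traffic stays on the intra-node interconnect and only stage-to-stage activations cross the network.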

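Common failure: a mismatch between the parallel sizes and the available GPUs causes initialization errors. The product of tensor_parallel_size and pipeline_parallel_size must not exceed the number of GPUs in the cluster, and the model's attention head count must be divisible by tensor_parallel_size; vLLM refuses to start otherwise. Another frequent issue is a quantized checkpoint that does not match the base model or the configured quantization method. Verify model compatibility before serving.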
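A small pre-flight check can catch these configuration errors before a long model load begins. The helper below is a hypothetical sketch, not part of vLLM: it checks that enough GPUs exist and, as an additional rule, that the attention head count is divisible by the tensor-parallel size. The figure of 64 attention heads for Llama-2-70B is taken as an assumption; read the real value from the model's config.

```python
def check_parallel_config(num_gpus, num_attention_heads, tp_size, pp_size=1):
    """Return a list of human-readable problems; an empty list means no issue found.

    Hypothetical helper: mirrors two common constraints, not vLLM's full validation.
    """
    problems = []
    if tp_size < 1 or pp_size < 1:
        problems.append("parallel sizes must be positive integers")
        return problems
    required = tp_size * pp_size
    if required > num_gpus:
        problems.append(
            f"needs {required} GPUs but only {num_gpus} are available"
        )
    if num_attention_heads % tp_size != 0:
        problems.append(
            f"{num_attention_heads} attention heads are not divisible "
            f"by tensor_parallel_size={tp_size}"
        )
    return problems


# Assumed: 64 attention heads (Llama-2-70B), 2 GPUs on the machine.
for tp in (2, 3, 4):
    issues = check_parallel_config(num_gpus=2, num_attention_heads=64, tp_size=tp)
    print(f"tensor_parallel_size={tp}:", issues or "ok")
```

Returning a list rather than raising on the first error lets a deployment script report every problem at once.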

Local verification checkpoint

Run the smallest example from this chapter in a local workspace and record the package version, runtime, data path, and observed output. If the result depends on model size, vector count, CPU/GPU backend, or available memory, note that constraint beside the exercise so the lesson remains reproducible.