How to enable tensor parallelism in vLLM
Prerequisites: two or more GPUs and a working vLLM installation.
What this does
Distributes a model's weight matrices across two or more GPUs so that inference fits in aggregate VRAM without sacrificing batch size or context length. Tensor parallelism shards individual matrix multiplications across GPUs in a single forward pass.
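To make the sharding idea concrete, here is a toy sketch in plain Python lists. It is not vLLM code and all function names are hypothetical. It splits a weight matrix into column blocks, lets each simulated device multiply by its own block, and checks that the concatenated result matches the unsharded product.

```python
# Toy illustration of sharding one matrix multiplication across devices.
# Nothing here is vLLM code; all names are hypothetical.

def matmul(a, b):
    """Multiply matrix a (m x k) by matrix b (k x n)."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]

def split_columns(w, parts):
    """Split matrix w into `parts` equal-width column blocks, one per device."""
    width = len(w[0]) // parts
    return [[row[i * width:(i + 1) * width] for row in w] for i in range(parts)]

def column_parallel_matmul(x, w, parts):
    """Each simulated device multiplies x by its column block of w."""
    partials = [matmul(x, shard) for shard in split_columns(w, parts)]
    # Concatenate the partial outputs row by row (the gather step).
    return [sum((p[r] for p in partials), []) for r in range(len(x))]

x = [[1, 2], [3, 4]]                # activations: 2 x 2
w = [[1, 0, 2, 1], [0, 1, 1, 3]]    # weights: 2 x 4

full = matmul(x, w)
sharded = column_parallel_matmul(x, w, parts=2)
print(full == sharded)
```

Each simulated device holds only half of `w`, which is why the per-GPU weight footprint shrinks as the GPU count grows. Real implementations typically pair a column split like this with a row split followed by an all-reduce; the sketch shows only the column split.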
Steps
Check GPU topology. NVLink enables faster cross-GPU communication. PCIe is functional but slower for tensor-parallel all-reduce operations.
nvidia-smi topo -m
Expected output: a topology matrix showing connections between GPU pairs.
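If you want to read the topology matrix programmatically, the following is a minimal sketch. It assumes the usual layout, with a header row of GPU names followed by one row per GPU, and NVLink connections labelled NV followed by a link count. The sample text is illustrative, not captured from a real machine, and the function names are hypothetical.

```python
import re

ANSI = re.compile(r"\x1b\[[0-9;]*m")  # styling codes, if the header has any

def parse_gpu_links(topo_text):
    """Return {(gpu_a, gpu_b): link_label} from `nvidia-smi topo -m` style text."""
    lines = [ln for ln in ANSI.sub("", topo_text).splitlines() if ln.strip()]
    # GPU column names (GPU0, GPU1, ...) come from the header row.
    gpus = [t for t in lines[0].split() if t.startswith("GPU") and t[3:].isdigit()]
    links = {}
    for line in lines[1:]:
        tokens = line.split()
        if not tokens or tokens[0] not in gpus:
            continue  # skip NIC rows, legend lines, and similar
        row_gpu = tokens[0]
        for col_gpu, label in zip(gpus, tokens[1:1 + len(gpus)]):
            if col_gpu != row_gpu:
                links[(row_gpu, col_gpu)] = label
    return links

def uses_nvlink(label):
    """True for labels such as NV4; PCIe-style labels (PIX, PHB, SYS) return False."""
    return label.startswith("NV") and label[2:].isdigit()

# Illustrative sample only; real output has more columns and a legend.
SAMPLE = """\
        GPU0    GPU1    CPU Affinity
GPU0     X      NV4     0-15
GPU1    NV4      X      0-15
"""

for pair, label in sorted(parse_gpu_links(SAMPLE).items()):
    print(pair, label, "NVLink" if uses_nvlink(label) else "not NVLink")
```

To use it on a real machine, pass the captured output of `nvidia-smi topo -m` to `parse_gpu_links` instead of `SAMPLE`.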
Set the CUDA visible devices mask. To ensure NCCL uses the correct GPUs in order, expose them explicitly.
export CUDA_VISIBLE_DEVICES=0,1
Expected output: no output; the environment variable is set.
Launch vLLM with tensor-parallel size. Replace 2 with the desired GPU count.
vllm serve <model-repo> \
  --task generate \
  --tensor-parallel-size 2 \
  --gpu-memory-utilization 0.85
Expected output: INFO: Application startup complete. The server listens on port 8000.
Confirm the server is up and serving the model.
curl -s http://localhost:8000/v1/models
Expected output: the model name listed. The response does not show device placement; to confirm that the weights span both GPUs, check that nvidia-smi reports memory in use on each of them.
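The two numeric flags in the launch command can be sanity-checked with rough arithmetic. The sketch below assumes weights split evenly across GPUs and ignores replicated parameters, the CUDA context, and communication buffers, so treat the result as an estimate. The model size and GPU capacity are hypothetical inputs, not measurements.

```python
def per_gpu_weight_gib(num_params, bytes_per_param, tp_size):
    """Rough weight memory per GPU when weights are split evenly across tp_size GPUs."""
    return num_params * bytes_per_param / tp_size / 2**30

def per_gpu_budget_gib(vram_gib, gpu_memory_utilization):
    """Memory each vLLM worker may use under --gpu-memory-utilization."""
    return vram_gib * gpu_memory_utilization

# Hypothetical inputs: 13e9 parameters in 16-bit weights, two 24 GiB GPUs.
weights = per_gpu_weight_gib(13e9, 2, tp_size=2)
budget = per_gpu_budget_gib(24, 0.85)

print(f"weights per GPU: {weights:.1f} GiB")
print(f"budget per GPU:  {budget:.1f} GiB")
print(f"left for KV cache and activations: {budget - weights:.1f} GiB")
```

With `tp_size=1` the same weights would need about twice as much memory on one device, which exceeds a 24 GiB card before any KV cache is allocated. That is the situation tensor parallelism is meant to address.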
Verification
python -c "
import vllm
# A successful import only shows that vLLM is installed; it does not exercise tensor parallelism.
print('vLLM', vllm.__version__, 'imported successfully')
"
# Expected: no error, vLLM imports successfully
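For a check against the running server rather than the installed package, the sketch below reads the model list from the `/v1/models` endpoint used in the steps. It assumes the OpenAI-style response shape, a JSON object with a `data` list whose entries carry an `id`. The sample body and its model id are made up for an offline demonstration, and the function names are hypothetical.

```python
import json
from urllib.request import urlopen

def extract_model_ids(payload):
    """Return model ids from an OpenAI-style /v1/models JSON response body."""
    body = json.loads(payload)
    return [entry["id"] for entry in body.get("data", [])]

def served_model_ids(base_url="http://localhost:8000", timeout=5):
    """Query a running server; requires the server from the Steps section to be up."""
    with urlopen(f"{base_url}/v1/models", timeout=timeout) as resp:
        return extract_model_ids(resp.read().decode("utf-8"))

# Offline example with a made-up response body (the id is a placeholder).
sample = '{"object": "list", "data": [{"id": "example-org/example-model", "object": "model"}]}'
print(extract_model_ids(sample))
```

Call `served_model_ids()` once the server reports startup complete. An empty list or a connection error means the model is not being served yet.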
Common failures
- NCCL timeout errors: GPUs cannot communicate quickly enough, often caused by PCIe-only links. Lower --gpu-memory-utilization to 0.75.
- tensor_parallel_size exceeds available GPUs: The CUDA mask exposes fewer devices than requested. Reset CUDA_VISIBLE_DEVICES to list all N GPUs.
- NCCL vs. CUDA version mismatch: Reinstall vLLM targeting the host CUDA version.
- Model architecture lacks tensor-parallel support: Check the vLLM supported models list before loading.
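The device-count failure above can be caught before launch. The following pre-flight sketch compares a requested tensor-parallel size with the number of entries in `CUDA_VISIBLE_DEVICES`. The helper names are hypothetical, and the check only inspects the mask; it does not query the driver for how many GPUs actually exist.

```python
import os

def visible_device_ids(mask):
    """Split a CUDA_VISIBLE_DEVICES value such as '0,1' into its entries."""
    if mask is None:
        return None  # variable unset: every GPU on the host is visible
    return [part.strip() for part in mask.split(",") if part.strip()]

def check_tp_size(tp_size, mask):
    """Return an error message if tp_size exceeds the masked device count, else None."""
    devices = visible_device_ids(mask)
    if devices is None:
        return None  # nothing to compare against from the mask alone
    if tp_size > len(devices):
        return (f"--tensor-parallel-size {tp_size} requested but "
                f"CUDA_VISIBLE_DEVICES exposes {len(devices)} device(s)")
    return None

# Prints None when the mask is unset or exposes at least two devices.
print(check_tp_size(2, os.environ.get("CUDA_VISIBLE_DEVICES")))
```

Running this in the same shell that will launch the server checks the mask that the server process would inherit.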
Operator checkpoint
Before treating this as solved, write down the local runtime, model or package version, hardware/backend if relevant, and the verification output. This keeps the guide useful as a Will-It-Run style decision instead of a one-off command transcript.