What this does

Distributes a model's weight matrices across two or more GPUs so that a model too large for one GPU fits in their combined VRAM, without shrinking batch size or context length to make room. Tensor parallelism shards each large matrix multiplication across the GPUs within a single forward pass, and the GPUs exchange partial results through all-reduce operations.
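
To make the sharding idea concrete, here is a toy sketch in plain Python. It is not vLLM code: the "devices" are just list slices, and the function names are made up for illustration. It shows the two usual ways to split one matrix multiplication, by columns (outputs are concatenated) and by rows (partial outputs are summed, which is the job the all-reduce does).

```python
# Toy illustration of tensor parallelism using plain Python lists.
# "Devices" here are just list slices; no GPU or vLLM code is involved.

def matmul(x, w):
    """Multiply a vector x (length k) by a matrix w (k rows, n columns)."""
    return [sum(x[i] * w[i][j] for i in range(len(x))) for j in range(len(w[0]))]

def column_parallel(x, w, num_devices):
    """Give each device a slice of w's columns, then concatenate the outputs."""
    step = len(w[0]) // num_devices  # assumes the column count divides evenly
    out = []
    for d in range(num_devices):
        shard = [row[d * step:(d + 1) * step] for row in w]
        out.extend(matmul(x, shard))  # each device produces part of the output
    return out

def row_parallel(x, w, num_devices):
    """Give each device a slice of w's rows, then sum the partial outputs."""
    step = len(w) // num_devices  # assumes the row count divides evenly
    partials = []
    for d in range(num_devices):
        x_shard = x[d * step:(d + 1) * step]
        w_shard = w[d * step:(d + 1) * step]
        partials.append(matmul(x_shard, w_shard))
    # The element-wise sum across devices is the role an all-reduce plays.
    return [sum(vals) for vals in zip(*partials)]

x = [1, 2, 3, 4]
w = [[1, 0], [0, 1], [2, 2], [1, 3]]
full = matmul(x, w)
print(full, column_parallel(x, w, 2), row_parallel(x, w, 2))
```

All three results agree, which is the point: sharding changes where the work happens, not the answer. The row-parallel variant is the one that needs cross-GPU communication on every layer, which is why link speed matters in the steps below.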

Steps

Check GPU topology. NVLink enables faster cross-GPU communication. PCIe is functional but slower for tensor-parallel all-reduce operations.
```
nvidia-smi topo -m
```
Expected output: a topology matrix with one cell per GPU pair. NV# cells indicate NVLink; PIX, PXB, PHB, NODE, and SYS indicate paths that go over PCIe.
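If you want to check the matrix programmatically, a small helper along these lines could classify each cell. This is a sketch: the function name and the coarse categories are illustrative, and the labels are the ones printed in the legend of nvidia-smi topo -m.

```python
import re

# Link labels as printed in the legend of `nvidia-smi topo -m`.
PCIE_LABELS = {"PIX", "PXB", "PHB", "NODE", "SYS"}

def classify_link(cell):
    """Map one cell of the topology matrix to a coarse link category."""
    cell = cell.strip().upper()
    if cell == "X":
        return "self"      # a GPU compared with itself
    if re.fullmatch(r"NV\d+", cell):
        return "nvlink"    # NV# means a bonded set of # NVLinks
    if cell in PCIE_LABELS:
        return "pcie"      # traffic crosses PCIe (and possibly the CPU)
    return "unknown"

for label in ["X", "NV2", "PHB", "SYS"]:
    print(label, "->", classify_link(label))
```

A pair classified as pcie still works for tensor parallelism; it only signals that all-reduce traffic will be slower than over NVLink.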
Set the CUDA visible devices mask. Expose the intended GPUs explicitly so that vLLM and NCCL see only those devices, in the order listed.
```
export CUDA_VISIBLE_DEVICES=0,1
```
Expected output: no output; the environment variable is set.
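A quick preflight check can catch a mask that exposes fewer GPUs than the tensor-parallel size you plan to request. The sketch below only inspects the environment variable string; the helper names are hypothetical, and it does not query the driver.

```python
import os

def visible_gpu_ids(mask):
    """Split a CUDA_VISIBLE_DEVICES-style string into its device entries."""
    if mask is None:
        return None  # variable unset: the mask imposes no restriction
    return [part.strip() for part in mask.split(",") if part.strip()]

def check_tp_size(tp_size, mask):
    """Return True if the mask exposes at least tp_size devices."""
    ids = visible_gpu_ids(mask)
    if ids is None:
        return True  # cannot tell from the mask alone
    return tp_size <= len(ids)

current_mask = os.environ.get("CUDA_VISIBLE_DEVICES")
print("mask:", current_mask, "| ok for TP=2:", check_tp_size(2, current_mask))
```

When the variable is unset the check passes, because the mask itself is not the limiting factor; the physical GPU count still is.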
Launch vLLM with a tensor-parallel size. Replace 2 with the desired GPU count, which must not exceed the number of GPUs exposed in the previous step.
```
vllm serve <model-repo> \
  --task generate \
  --tensor-parallel-size 2 \
  --gpu-memory-utilization 0.85
```
Expected output: the log line INFO: Application startup complete., with the server listening on port 8000 (the default).
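The two numeric flags above interact, and a rough calculation helps when picking them. The sketch below assumes weights split evenly across GPUs and treats --gpu-memory-utilization as a per-GPU fraction; it ignores activations and other runtime overhead. The model size and GPU size are hypothetical numbers chosen for illustration.

```python
GIB = 1024 ** 3

def per_gpu_budget_gib(gpu_mem_gib, utilization):
    """Memory vLLM may claim on each GPU under --gpu-memory-utilization."""
    return gpu_mem_gib * utilization

def per_gpu_weights_gib(num_params, bytes_per_param, tp_size):
    """Approximate weight footprint per GPU when weights are split evenly."""
    return num_params * bytes_per_param / tp_size / GIB

# Hypothetical example: 13e9 parameters at 2 bytes each on two 24 GiB GPUs.
budget = per_gpu_budget_gib(24, 0.85)
weights = per_gpu_weights_gib(13e9, 2, 2)
headroom = budget - weights  # what remains for KV cache and activations

print(f"budget {budget:.1f} GiB, weights {weights:.1f} GiB, headroom {headroom:.1f} GiB")
```

If the headroom comes out negative, the model will not fit at that tensor-parallel size, and lowering the utilization fraction only makes the budget smaller.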
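Confirm that the server is up and the model is registered.
```
curl -s http://localhost:8000/v1/models
```
Expected output: a JSON list that includes the model name. This endpoint does not report device placement; to confirm that the weights span both GPUs, check that nvidia-smi shows memory in use on each of them.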
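One way to see whether both GPUs actually hold part of the model is to compare per-GPU memory use. The sketch below parses the CSV output of an nvidia-smi query; the 1024 MiB threshold is an arbitrary assumption for "this GPU is doing something", and the helper names are illustrative.

```python
import subprocess

QUERY = ["nvidia-smi", "--query-gpu=index,memory.used",
         "--format=csv,noheader,nounits"]

def parse_memory_used(csv_text):
    """Parse 'index, MiB' lines into {gpu_index: used_mib}."""
    usage = {}
    for line in csv_text.strip().splitlines():
        index, used = (field.strip() for field in line.split(","))
        usage[int(index)] = int(used)
    return usage

def gpus_holding_weights(usage, min_mib=1024):
    """Return GPU indices whose used memory meets an assumed threshold."""
    return sorted(i for i, used in usage.items() if used >= min_mib)

def read_memory_used():
    """Run nvidia-smi and parse its output; requires an NVIDIA driver."""
    result = subprocess.run(QUERY, capture_output=True, text=True, check=True)
    return parse_memory_used(result.stdout)
```

With the server running, gpus_holding_weights(read_memory_used()) should list every GPU in the tensor-parallel group. Note that nvidia-smi reports physical indices, which may differ from the positions in CUDA_VISIBLE_DEVICES.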

Verification

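```
python -c "
import vllm
print('vLLM version:', vllm.__version__)
"
# Expected: no error; the installed vLLM version is printed
```
This import check confirms only that vLLM is installed. The Steps above are what exercise tensor parallelism itself.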
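For a somewhat stronger check than an import, you could ask the running server which models it is serving. The following is a sketch using only the standard library; it assumes the server from the Steps section is listening on localhost:8000 and that /v1/models returns an OpenAI-style list with a data array of objects carrying an id field. The function names are illustrative.

```python
import json
import urllib.request

def fetch_json(url, timeout_s=5):
    """GET a URL and decode the response body as JSON."""
    with urllib.request.urlopen(url, timeout=timeout_s) as resp:
        return json.loads(resp.read().decode("utf-8"))

def served_model_ids(payload):
    """Extract model ids from an OpenAI-style model list response."""
    return [entry["id"] for entry in payload.get("data", [])]

def is_serving(expected_id, base_url="http://localhost:8000", fetch=fetch_json):
    """Return True if the server answers and lists expected_id."""
    try:
        payload = fetch(base_url + "/v1/models")
    except OSError:  # connection refused, timeout, and URLError all land here
        return False
    return expected_id in served_model_ids(payload)
```

Call is_serving with the same model name you passed to vllm serve. The fetch parameter exists so the logic can be exercised without a live server.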

Common failures

NCCL timeout errors: the GPUs cannot communicate quickly enough, often because they are linked only over PCIe. Lower --gpu-memory-utilization to 0.75; if the timeouts persist, rerun with NCCL_DEBUG=INFO set to see where communication stalls.
tensor_parallel_size exceeds available GPUs: the CUDA mask exposes fewer devices than requested. Set CUDA_VISIBLE_DEVICES to list at least as many GPUs as --tensor-parallel-size.
NCCL vs. CUDA version mismatch: reinstall vLLM using a build that targets the host's CUDA version.
Model architecture lacks tensor-parallel support: check the vLLM supported models list before loading.

Operator checkpoint

Before treating this as solved, write down the local runtime, model or package version, hardware/backend if relevant, and the verification output. This keeps the guide useful as a Will-It-Run style decision instead of a one-off command transcript.

How to enable tensor parallelism in vLLM
