
How to enable tensor parallelism in vLLM

advanced · 15 min · By Fredoline Eruo
Target environment
Ubuntu 24.04 · Ollama 0.4.x
Windows 11 · Ollama 0.4.x
macOS 15 · Ollama 0.4.x
PREREQUISITES

2+ GPUs, vLLM installed
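
The two prerequisites can be checked programmatically. The sketch below is one possible approach, not part of vLLM: the helper names `count_gpus` and `vllm_installed` are hypothetical, and the sample text only imitates the general shape of `nvidia-smi -L` output (one "GPU N: ..." line per device).

```python
import importlib.util


def count_gpus(nvidia_smi_list_output: str) -> int:
    """Count GPUs in text printed by `nvidia-smi -L` (one 'GPU N: ...' line each)."""
    return sum(
        1
        for line in nvidia_smi_list_output.splitlines()
        if line.strip().startswith("GPU ")
    )


def vllm_installed() -> bool:
    """True if the vllm package can be located, without importing it."""
    return importlib.util.find_spec("vllm") is not None


# Illustrative text only; real device names and UUIDs will differ.
sample = "GPU 0: Example GPU (UUID: GPU-aaaa)\nGPU 1: Example GPU (UUID: GPU-bbbb)\n"
meets_gpu_prerequisite = count_gpus(sample) >= 2
```

On a real host, the text would come from something like `subprocess.run(["nvidia-smi", "-L"], capture_output=True, text=True).stdout`.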

What this does

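Distributes a model's weight matrices across two or more GPUs so that a model too large for one GPU's VRAM can run in the GPUs' aggregate VRAM, instead of shrinking batch size or context length to make it fit on a single device. Tensor parallelism shards individual matrix multiplications across GPUs within a single forward pass: every GPU works on every layer, and the partial results are combined through collective operations such as all-reduce.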
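
To make the sharding idea concrete, here is a toy illustration in plain Python. It is not how vLLM implements tensor parallelism; it only shows that splitting a weight matrix by columns, multiplying each shard separately, and concatenating the partial outputs gives the same result as the unsharded multiplication. All function names are made up for this sketch, and each "GPU" is just a list slice.

```python
def matmul(x, w):
    """Multiply x (m x k) by w (k x n), both given as lists of rows."""
    return [[sum(a * b for a, b in zip(row, col)) for col in zip(*w)] for row in x]


def split_columns(w, parts):
    """Split w column-wise into `parts` equal shards, one per simulated GPU."""
    n = len(w[0])
    if n % parts != 0:
        raise ValueError("column count must be divisible by the number of shards")
    step = n // parts
    return [[row[i * step:(i + 1) * step] for row in w] for i in range(parts)]


def column_parallel_matmul(x, w, parts):
    """Each shard computes its slice of the output; slices are then concatenated."""
    partials = [matmul(x, shard) for shard in split_columns(w, parts)]
    return [sum((p[r] for p in partials), []) for r in range(len(x))]


x = [[1, 2], [3, 4]]
w = [[1, 0, 2, 1], [0, 1, 1, 3]]
full = matmul(x, w)
sharded = column_parallel_matmul(x, w, parts=2)
```

In a real deployment the concatenation (or, for row-wise sharding, the summation) step is a cross-GPU collective, which is why interconnect speed matters.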

Steps

  1. Check GPU topology. NVLink enables faster cross-GPU communication. PCIe is functional but slower for tensor-parallel all-reduce operations.

    nvidia-smi topo -m
    

    Expected output: a topology matrix showing connections between GPU pairs.

  2. Set the CUDA visible devices mask. To ensure NCCL uses the correct GPUs in order, expose them explicitly.

    export CUDA_VISIBLE_DEVICES=0,1
    

    Expected output: no output; the environment variable is set.

  3. Launch vLLM with a tensor-parallel size. Replace 2 with the number of GPUs to use; it must not exceed the number of devices exposed in step 2.

    vllm serve <model-repo> \
      --task generate \
      --tensor-parallel-size 2 \
      --gpu-memory-utilization 0.85
    

    Expected output: a log line reading "INFO: Application startup complete." once the server is listening on port 8000.

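  4. Confirm the server is up and the weights are distributed across devices.

    curl -s http://localhost:8000/v1/models
    

    Expected output: a JSON list containing the model name. This confirms that the server is serving the model, not that the weights are sharded; to check sharding, run nvidia-smi while the server is up and confirm that each exposed GPU shows memory in use.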
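
The steps above can be sketched as a small launcher helper. This is an illustrative outline under stated assumptions, not an official vLLM utility: the function names are hypothetical, the sample topology text only imitates the general layout of `nvidia-smi topo -m` (NVLink cells appear as `NV` followed by a number), and the command uses only a subset of the flags shown in step 3.

```python
import re


def has_nvlink(topo_text: str) -> bool:
    """Return True if any GPU row of `nvidia-smi topo -m` output has an NV<n> cell."""
    for line in topo_text.splitlines():
        if line.strip().startswith("GPU") and re.search(r"\bNV\d+\b", line):
            return True
    return False


def launch_env(gpu_ids):
    """Environment fragment for step 2: expose exactly these GPUs, in this order."""
    return {"CUDA_VISIBLE_DEVICES": ",".join(str(i) for i in gpu_ids)}


def serve_command(model, gpu_ids, gpu_memory_utilization=0.85):
    """Argument list for step 3, with the tensor-parallel size tied to the GPU count."""
    return [
        "vllm", "serve", model,
        "--tensor-parallel-size", str(len(gpu_ids)),
        "--gpu-memory-utilization", str(gpu_memory_utilization),
    ]


# Illustrative topology text only; real output has more columns and a legend.
sample_topo = "\tGPU0\tGPU1\nGPU0\t X \tNV4\nGPU1\tNV4\t X \n"
env = launch_env([0, 1])
cmd = serve_command("<model-repo>", [0, 1])
```

To actually launch, the pieces could be passed to `subprocess.run(cmd, env={**os.environ, **env})`. Deriving the tensor-parallel size from the GPU list keeps the two values from drifting apart.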

Verification

python -c "
import vllm
print('vLLM', vllm.__version__, 'imported successfully')
"
# Expected: no error; the installed vLLM version is printed.
# This checks the installation only; step 4 checks the running server.
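
A check against the running server gives stronger evidence than an import. The sketch below assumes the server from step 3 is listening on `http://localhost:8000` and that `/v1/models` returns an OpenAI-style list with a `data` array of objects carrying an `id` field; the helper names and the sample payload are illustrative.

```python
import json
import urllib.request


def model_ids(payload: dict) -> list:
    """Extract model ids from an OpenAI-style model list response."""
    return [entry["id"] for entry in payload.get("data", [])]


def fetch_model_ids(base_url: str = "http://localhost:8000", timeout: float = 5.0) -> list:
    """GET /v1/models from a running server and return the served model ids."""
    with urllib.request.urlopen(f"{base_url}/v1/models", timeout=timeout) as resp:
        return model_ids(json.load(resp))


# Hand-written sample payload for illustration; a real response has more fields.
sample_payload = {"object": "list", "data": [{"id": "<model-repo>", "object": "model"}]}
ids = model_ids(sample_payload)
```

Calling `fetch_model_ids()` while the server is up should return a list containing the model name passed to `vllm serve`.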

Common failures

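  • NCCL timeout errors: GPUs cannot communicate quickly enough, often because of PCIe-only links. Set NCCL_DEBUG=INFO before launching to see where communication stalls. Lowering --gpu-memory-utilization to 0.75 frees memory headroom but does not by itself speed up cross-GPU communication.
  • tensor_parallel_size exceeds available GPUs: the CUDA mask exposes fewer devices than requested. Set CUDA_VISIBLE_DEVICES to list at least as many GPUs as --tensor-parallel-size, or lower --tensor-parallel-size.
  • NCCL vs. CUDA version mismatch: reinstall vLLM with a build that targets the host's CUDA version.
  • Model architecture lacks tensor-parallel support: check the vLLM supported models list before loading.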
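
The device-count failure in the list above can be caught before launch. The following is a minimal preflight sketch, assuming the only constraint checked is the one named in that list (requested size versus the CUDA mask); the function names are hypothetical and nothing here queries the GPUs themselves.

```python
def visible_devices(env):
    """Parse CUDA_VISIBLE_DEVICES; None means the variable is unset (no mask)."""
    raw = env.get("CUDA_VISIBLE_DEVICES")
    if raw is None:
        return None
    return [d.strip() for d in raw.split(",") if d.strip()]


def check_tp_size(tp_size, env):
    """Compare the requested tensor-parallel size with the devices the mask exposes."""
    devices = visible_devices(env)
    if devices is None:
        return "unknown: CUDA_VISIBLE_DEVICES is unset, so no mask limits the GPU count"
    if tp_size > len(devices):
        return f"too large: requested {tp_size}, mask exposes {len(devices)}"
    return "ok"


result_masked = check_tp_size(2, {"CUDA_VISIBLE_DEVICES": "0"})
result_ok = check_tp_size(2, {"CUDA_VISIBLE_DEVICES": "0,1"})
```

In practice the second argument would be `os.environ`, checked in the same shell session that will run `vllm serve`.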

Operator checkpoint

Before treating this as solved, write down the local runtime, model or package version, hardware/backend if relevant, and the verification output. This keeps the guide useful as a Will-It-Run style decision instead of a one-off command transcript.
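
One way to capture that record is a small script that gathers the values into JSON. This is only a sketch: the field names are arbitrary choices, the hardware and verification strings are placeholders to fill in by hand, and the vLLM version is read from installed package metadata if present.

```python
import json
import platform
from importlib import metadata


def package_version(name: str) -> str:
    """Installed version of a package, or a marker string if it is absent."""
    try:
        return metadata.version(name)
    except metadata.PackageNotFoundError:
        return "not installed"


def checkpoint_record(model: str, hardware: str, verification_output: str) -> dict:
    """Collect the facts the operator checkpoint asks for into one dictionary."""
    return {
        "os": platform.platform(),
        "python": platform.python_version(),
        "vllm": package_version("vllm"),
        "model": model,
        "hardware": hardware,
        "verification_output": verification_output,
    }


record = checkpoint_record(
    "<model-repo>",
    "2 GPUs (fill in model and link type)",
    "paste verification output here",
)
record_json = json.dumps(record, indent=2)
```

Saving `record_json` next to the launch command makes it easier to compare runs across machines or vLLM versions.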
