Tensor parallelism crash — fix multi-GPU NCCL + topology issues
Multi-GPU tensor-parallel crashes trace to NCCL backend issues (PCIe topology, missing peer access), insufficient combined VRAM across the GPUs, or a `--tensor-parallel-size` that doesn't match the visible GPU count. Diagnose with NCCL_DEBUG=INFO.
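A minimal sketch of a diagnostic launch, using the vLLM OpenAI-compatible server shown later in this article; the model name is a placeholder, substitute your own:

```bash
# Launch with full NCCL logging; the startup log shows which transport NCCL picked.
# Lines mentioning P2P indicate direct peer access; SHM or NET mean it fell back
# to host memory or the network.
NCCL_DEBUG=INFO CUDA_VISIBLE_DEVICES=0,1 \
  python -m vllm.entrypoints.openai.api_server \
  --model <your-model> \
  --tensor-parallel-size 2 2>&1 | tee nccl_debug.log
```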
Diagnostic order — most likely first
TP-size doesn't match available GPU count
vLLM crashes with 'tensor-parallel-size 4 but only 2 GPUs visible.' `nvidia-smi` shows the count.
Set `--tensor-parallel-size` to match `nvidia-smi` count. For dual-GPU: `--tensor-parallel-size 2`. Verify with `CUDA_VISIBLE_DEVICES=0,1` to explicitly pin.
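A quick way to confirm what the runtime will actually see before launching (a sketch; assumes PyTorch is installed in the same environment as vLLM):

```bash
# Driver-level view: one line per physical GPU.
nvidia-smi -L
# Runtime view: the count PyTorch (and therefore vLLM) will use.
CUDA_VISIBLE_DEVICES=0,1 python -c "import torch; print(torch.cuda.device_count())"
```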
PCIe topology blocks peer access (NCCL falls back to host memory)
Multi-GPU works but is slow. `nvidia-smi topo -m` shows `SYS` between GPUs (no direct peer access). NCCL traffic is routed through host memory via the CPU.
If the GPUs are on different CPU sockets or hang off separate PCIe root complexes, NCCL can't use direct peer access. Move both GPUs onto the same CPU's PCIe lanes. For consumer dual-GPU, make sure both slots run PCIe 4.0 x8 or x16 from the same CPU.
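To check whether peer access is actually possible between a GPU pair, something like this works (a sketch; the second command assumes PyTorch is installed):

```bash
# Link type between every GPU pair: PIX = single PCIe bridge, PXB = multiple
# bridges, SYS = across CPU sockets (no direct peer access).
nvidia-smi topo -m
# Ask the CUDA runtime directly whether GPU 0 can peer-access GPU 1.
python -c "import torch; print(torch.cuda.can_device_access_peer(0, 1))"
```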
Combined VRAM insufficient for the model
TP works on smaller models but fails on 70B+. Each GPU needs to hold its slice of the weights + KV cache + activations. Two 16 GB cards can't run 70B Q4: the weights alone are roughly 35-40 GB at 4-bit, before any KV cache, which already exceeds the 32 GB combined.
Smaller model. Smaller quant. Or upgrade to 24+ GB GPUs. For 70B Q4: dual 3090 (48 GB combined) is the minimum. For FP16 70B, the weights alone are about 140 GB: you need multiple 80 GB cards (e.g. 2× H100), or drop to 8-bit to fit across 4× 24 GB cards.
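A rough capacity check before launching, using the usual back-of-envelope estimate of bytes per parameter (weights only; KV cache and activations come on top):

```bash
# Total VRAM per GPU, as reported by the driver.
nvidia-smi --query-gpu=name,memory.total --format=csv
# Back-of-envelope weight size: params * bytes-per-param.
#   70B at 4-bit  -> 70e9 * 0.5 B ≈ 35 GB  (plus KV cache and activations)
#   70B at FP16   -> 70e9 * 2 B   ≈ 140 GB
```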
NCCL version mismatch with PyTorch / runtime
Crash with 'NCCL version mismatch' or 'function not found.' NCCL bundled with PyTorch differs from the system one.
Reinstall PyTorch from official wheels (which bundle the right NCCL): `pip install --upgrade --force-reinstall torch --index-url https://download.pytorch.org/whl/cu124`. Or unset `NCCL_HOME` to use bundled.
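To see which NCCL your PyTorch wheel actually bundles (and therefore what vLLM will load), a quick check using the standard PyTorch API:

```bash
# Version of the NCCL library shipped inside the installed PyTorch wheel.
python -c "import torch; print(torch.cuda.nccl.version())"
# CUDA version the wheel was built against, for comparison with nvidia-smi.
python -c "import torch; print(torch.version.cuda)"
```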
GPU 0 not used (CUDA_VISIBLE_DEVICES misconfigured)
vLLM expects to use GPU 0 by default. If only GPUs 1 and 2 are exported, the process renumbers them as 0 and 1; when the exported set doesn't match what the launch command (or a multi-process launcher) expects, you get device-ID errors.
Be explicit: `CUDA_VISIBLE_DEVICES=0,1 python -m vllm.entrypoints.openai.api_server --tensor-parallel-size 2 ...`. The process then sees exactly two GPUs, numbered 0 and 1.
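The renumbering is easy to verify: whatever you export shows up as devices 0..N-1 inside the process (a sketch, assuming PyTorch is installed):

```bash
# Export physical GPUs 1 and 2; inside the process they appear as 0 and 1.
CUDA_VISIBLE_DEVICES=1,2 python -c "
import torch
for i in range(torch.cuda.device_count()):
    print(i, torch.cuda.get_device_name(i))
"
```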
P2P (peer-to-peer) disabled in BIOS
Multi-GPU works but throughput is degraded. `nvidia-smi topo -m` shows `PXB` instead of `PIX` between GPUs. A BIOS PCIe setting prevents direct P2P.
Enable 'Above 4G Decoding' in the BIOS. Some boards also have 'Re-Size BAR Support', which helps. Reboot and verify the topology improves.
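After changing the BIOS settings and rebooting, verify from the OS side (the BAR1 check assumes a reasonably recent `nvidia-smi`):

```bash
# Link types between GPUs should improve (ideally PIX rather than PXB/SYS).
nvidia-smi topo -m
# With Resizable BAR active, BAR1 should cover (nearly) all of VRAM instead of 256 MiB.
nvidia-smi -q -d MEMORY | grep -A 3 "BAR1"
```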
Frequently asked questions
How much faster is tensor parallelism vs single-GPU?
vLLM / ExLlamaV2 typically scale 1.7-1.9x on dual-GPU for inference (a memory-bound workload). Training scales closer to 1.95x. The sub-linear scaling comes from communication overhead: tensor parallelism exchanges activations across GPUs at every layer.
Do I need NVLink for tensor parallelism?
No — works on PCIe 4.0 x16. NVLink helps but isn't required. Most consumer dual-GPU rigs (4090, 5090, 3090) use PCIe and work fine. NVLink was discontinued on 4090+ consumer cards anyway.
Can I tensor-parallel different GPU models?
Technically yes, but the slow card bottlenecks the fast one. Mixing 4090 + 3090 means 3090 throughput on tensor-parallel workloads. Match cards for production setups.
Related troubleshooting
Why CUDA OOM happens during local LLM inference and image gen, how to confirm the real cause, and the four real fixes (smaller quant, shorter context, gradient checkpointing, or more VRAM).
vLLM ships pre-built wheels against specific CUDA versions. When your system CUDA differs, you get cryptic kernel-image errors. Here's the version matrix and the fix order.
Mid-inference crashes (segfault, illegal memory access, kernel panic) usually mean VRAM ECC, thermal throttling, PSU instability, or a bad model file. Here's the diagnostic order.
When the fix is hardware
A surprising fraction of troubleshooting tickets resolve to: this card doesn't have enough VRAM for what you're asking it to do. If you're hitting OOM after every reasonable fix, or your GPU genuinely can't fit the model you need, it's upgrade time: