What causes "NCCL error: peer to peer not supported (multi-GPU)"?

**Environment:** Multi-GPU NVIDIA hosts running tensor-parallel inference ([vLLM](/tools/vllm), [TGI](/tools/tgi), [SGLang](/tools/sglang)) on consumer chipsets or hosts with strict IOMMU/ACS.

**Severity: high.** Tensor-parallel jobs won't start.

- Consumer chipsets (X670, Z790, B650) often lack PCIe peer-to-peer between GPUs by default.
- IOMMU + ACS enabled in BIOS isolates each PCIe slot, blocking GPU-to-GPU DMA.
- Boards that fit 4× consumer GPUs by bifurcating a slot to x8/x8 can break P2P for some GPU pairs.
- Older NCCL versions that detect no P2P path abort instead of falling back to staging through host memory.
- Mixed slot topology: one GPU on CPU lanes and another on chipset lanes.
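The causes above can usually be narrowed down from a shell before touching the BIOS. The following is a minimal sketch, assuming a Linux host with `nvidia-smi` and `lspci` installed; the helper name `acs_enabled_count` is made up for illustration. It prints the GPU topology matrix and counts PCIe bridges whose ACS control register shows `SrcValid+`, which is the marker NVIDIA's NCCL troubleshooting notes suggest looking for.

```bash
# Count ACS control registers with source validation switched on.
# Reads `lspci -vvv` output on stdin, so it also works on saved output.
acs_enabled_count() {
  grep -c 'ACSCtl:.*SrcValid+' || true
}

# Show how each GPU pair is connected, if the NVIDIA driver tools are installed.
if command -v nvidia-smi >/dev/null 2>&1; then
  nvidia-smi topo -m
fi

# Report how many bridges have ACS active.
# Run as root: without it, lspci hides the capability registers and the count is 0.
if command -v lspci >/dev/null 2>&1; then
  echo "ACS-enabled bridges: $(lspci -vvv 2>/dev/null | acs_enabled_count)"
fi
```

In the `nvidia-smi topo -m` matrix, pairs marked `PHB` or `SYS` cross a host bridge or CPU interconnect, and those are the pairs most likely to lack P2P. A non-zero ACS count points at the IOMMU/ACS cause.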

How do you fix "NCCL error: peer to peer not supported (multi-GPU)"?

**1. Disable P2P and force NCCL through host memory** (works everywhere, roughly 10-30% slower than true P2P):

```bash
export NCCL_P2P_DISABLE=1
export NCCL_IB_DISABLE=1   # also disable InfiniBand if no fabric
```

Add these to your launch script or systemd unit. Most consumer multi-GPU setups need this.

**2. Verify the P2P matrix to see which pairs are blocked:**

```bash
# Run NVIDIA's prebuilt demo, or build p2pBandwidthLatencyTest from the CUDA samples:
cd /usr/local/cuda/extras/demo_suite
./p2pBandwidthLatencyTest
```

In the "P2P Connectivity Matrix" section of the output, a 1 means the pair supports P2P; pairs marked 0 are your blocked links.

**3. BIOS settings (if you have datacenter-grade boards):**

- Disable IOMMU / VT-d
- Disable ACS (Access Control Services)
- Enable "Above 4G Decoding" + "Re-Size BAR" (Resizable BAR)
- Set PCIe slots to x16/x16 mode (not x8/x8)

**4. Use a workstation or server platform** if you genuinely need P2P at scale: Threadripper PRO on WRX80/WRX90 or EPYC on SP5 expose full PCIe lanes with native P2P. Consumer boards are a hard ceiling for tensor-parallel scaling.

**5. For 2 GPUs only, skip tensor parallelism entirely** and use pipeline parallelism, which splits layers across GPUs with no all-reduce, no P2P needed:

```bash
vllm serve Qwen/Qwen2.5-72B-Instruct --pipeline-parallel-size 2 --tensor-parallel-size 1
```
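For step 1, one way to keep the variables in a launch script is a small wrapper. This is a sketch only: the function names `nccl_fallback_env` and `run_with_fallback` are hypothetical, and `NCCL_DEBUG=INFO` is added here as an assumption so the NCCL log shows which transport was actually selected.

```bash
# Export the host-memory fallback settings from step 1.
nccl_fallback_env() {
  export NCCL_P2P_DISABLE=1                # stage GPU-to-GPU traffic through host memory
  export NCCL_IB_DISABLE=1                 # skip InfiniBand on hosts without a fabric
  export NCCL_DEBUG="${NCCL_DEBUG:-INFO}"  # log transport selection unless already set
}

# Hypothetical wrapper: apply the settings, then run the command passed as arguments.
run_with_fallback() {
  nccl_fallback_env
  "$@"
}
```

A launch line would then look like `run_with_fallback vllm serve Qwen/Qwen2.5-72B-Instruct --tensor-parallel-size 2`. In a systemd unit, the same variables would go in `Environment=` lines instead.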

Alternative solutions

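Platform-specific note: NCCL_P2P_DISABLE=1 only matters on multi-GPU hosts. On a single-GPU rig the variable does nothing, so don't set it cargo-cult style; it's noise. On AMD ROCm, RCCL is a port of NCCL and generally reads the same NCCL_-prefixed variables, so NCCL_P2P_DISABLE=1 is the setting to try there as well (check the RCCL documentation for your ROCm version). HSA_FORCE_FINE_GRAIN_PCIE=1 is a separate ROCm setting, typically used to enable P2P over PCIe rather than to disable it.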
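To avoid setting the variable on single-GPU machines, a launch script can check the GPU count first. The sketch below assumes `nvidia-smi -L` prints one `GPU N: ...` line per device; `gpu_count` and `needs_p2p_workaround` are illustrative names, not part of any tool.

```bash
# Succeeds only when more than one GPU is present, i.e. when the variable can matter.
needs_p2p_workaround() {
  [ "${1:-0}" -gt 1 ]
}

# Count GPUs via `nvidia-smi -L`; prints 0 when the tool is missing or fails.
gpu_count() {
  if command -v nvidia-smi >/dev/null 2>&1; then
    nvidia-smi -L 2>/dev/null | grep -c '^GPU ' || true
  else
    echo 0
  fi
}

# Apply the workaround only on multi-GPU hosts.
if needs_p2p_workaround "$(gpu_count)"; then
  export NCCL_P2P_DISABLE=1
fi
```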
