RUNLOCALAIv38

Independently operated catalog for local-AI hardware and software. Hand-written verdicts. Source-cited claims. Reproducible commands when we have them.

OP·Fredoline Eruo
DISCLOSURE

Some links on this site are affiliate links (Amazon Associates and other first-class retailers). When you buy through them, we earn a small commission at no extra cost to you. Affiliate links do not influence our verdicts: there are cards we rate highly that we don't have affiliate relationships with, and cards that sell well that we refuse to recommend.

© 2026 runlocalai.co · Independently operated
fatal · ✓ Editorial · Reviewed May 2026

Tensor parallelism crash — fix multi-GPU NCCL + topology issues

Multi-GPU tensor-parallel crashes usually trace to one of three causes: NCCL backend issues (PCIe topology, missing peer access), too little combined VRAM across the GPUs, or a `--tensor-parallel-size` that doesn't match the visible GPU count. Start diagnosis by setting `NCCL_DEBUG=INFO`.

vLLM · ExLlamaV2 · TensorRT-LLM · DeepSpeed · PyTorch DDP · NCCL
By Fredoline Eruo · Last verified 2026-05-08
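The summary above points to `NCCL_DEBUG=INFO` as the first diagnostic. A minimal sketch of one way to apply it is a small Python wrapper that launches your existing command with the variable set. The helper names `nccl_debug_env` and `run_with_nccl_debug` are hypothetical, and the command you pass in is whatever you already run.

```python
import os
import subprocess
import sys


def nccl_debug_env(base_env=None, level="INFO"):
    """Return a copy of the environment with NCCL_DEBUG set to `level`."""
    env = dict(os.environ if base_env is None else base_env)
    env["NCCL_DEBUG"] = level
    return env


def run_with_nccl_debug(cmd):
    """Run `cmd` (a list of arguments) with NCCL_DEBUG=INFO; return its exit code."""
    return subprocess.run(cmd, env=nccl_debug_env()).returncode


if __name__ == "__main__":
    # Hypothetical usage: python nccl_debug.py <your usual launch command...>
    if len(sys.argv) > 1:
        sys.exit(run_with_nccl_debug(sys.argv[1:]))
```

Exporting `NCCL_DEBUG=INFO` in the shell before launching has the same effect. The wrapper only keeps the setting scoped to one run.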

Diagnostic order — most likely first

#1

TP-size doesn't match available GPU count

Diagnose

vLLM fails at startup with an error along the lines of 'tensor-parallel-size 4 but only 2 GPUs visible.' Run `nvidia-smi` to see how many GPUs the driver actually exposes.

Fix

Set `--tensor-parallel-size` to the number of GPUs `nvidia-smi` reports. For a dual-GPU rig that is `--tensor-parallel-size 2`. To pin the devices explicitly, prefix the launch command with `CUDA_VISIBLE_DEVICES=0,1`.
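As a rough pre-flight check, the comparison between the requested TP size and the visible GPU count can be sketched in Python. The function names are hypothetical, and the parsing is a simplification: it assumes `CUDA_VISIBLE_DEVICES` is a plain comma-separated list and that every listed ID is valid.

```python
import os


def visible_gpu_count(cuda_visible_devices, physical_count):
    """Estimate how many GPUs a process will see.

    cuda_visible_devices: value of the env var, or None if it is unset.
    physical_count: number of GPUs that nvidia-smi lists.
    """
    if cuda_visible_devices is None:
        return physical_count
    ids = [part.strip() for part in cuda_visible_devices.split(",")]
    return len([part for part in ids if part])


def check_tp_size(tp_size, gpu_count):
    """Return (ok, message) for a requested tensor-parallel size."""
    if tp_size < 1:
        return False, "tensor-parallel-size must be at least 1"
    if tp_size > gpu_count:
        return False, f"tensor-parallel-size {tp_size} exceeds {gpu_count} visible GPU(s)"
    return True, f"tensor-parallel-size {tp_size} fits {gpu_count} visible GPU(s)"


if __name__ == "__main__":
    # physical_count=2 is a placeholder; use the count nvidia-smi reports.
    count = visible_gpu_count(os.environ.get("CUDA_VISIBLE_DEVICES"), physical_count=2)
    print(check_tp_size(2, count))
```

The check only flags a TP size larger than the visible count, which is the failure described above.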

#2

PCIe topology blocks peer access (NCCL falls back to host memory)

Diagnose

Multi-GPU runs, but slowly. `nvidia-smi topo -m` shows `SYS` between the GPUs, meaning the path crosses the inter-socket CPU interconnect and there is no direct peer access. NCCL traffic goes through the CPU and host memory.

Fix

If the GPUs sit on different CPU sockets or behind different PCIe switches, NCCL may not be able to use direct peer access. Move both GPUs to PCIe slots wired to the same CPU. For consumer dual-GPU builds, make sure both cards run at PCIe 4.0 x8 or x16 from the same CPU.
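To spot `SYS` links without reading the matrix by eye, a small parser over the text of `nvidia-smi topo -m` can help. This is a sketch under assumptions: the sample matrix below is illustrative rather than captured from a real machine, the function names are hypothetical, and real output may include extra columns that this parser simply ignores.

```python
import re

GPU_NAME = re.compile(r"^GPU\d+$")
ANSI_CODES = re.compile(r"\x1b\[[0-9;]*m")  # some versions underline the header row


def parse_topo_links(topo_text):
    """Return {(gpu_a, gpu_b): link_type} for each GPU pair in the matrix text."""
    text = ANSI_CODES.sub("", topo_text)
    lines = [line for line in text.splitlines() if line.strip()]
    if not lines:
        return {}
    # The header row names the GPU columns first; other columns are skipped.
    columns = [tok for tok in lines[0].split() if GPU_NAME.match(tok)]
    links = {}
    for line in lines[1:]:
        tokens = line.split()
        if not tokens or tokens[0] not in columns:
            continue
        row_index = columns.index(tokens[0])
        cells = tokens[1:1 + len(columns)]
        for col_index, cell in enumerate(cells):
            if col_index > row_index:  # keep each unordered pair once
                links[(tokens[0], columns[col_index])] = cell
    return links


def slow_pairs(links, slow_types=("SYS",)):
    """Return the GPU pairs whose link type is in `slow_types`."""
    return [pair for pair, link in links.items() if link in slow_types]


# Illustrative sample, not real output.
SAMPLE = """\
        GPU0    GPU1    CPU Affinity    NUMA Affinity
GPU0     X      SYS     0-15            0
GPU1    SYS      X      16-31           1
"""

if __name__ == "__main__":
    print(slow_pairs(parse_topo_links(SAMPLE)))
```

To use it on a real machine, pass in the captured text of `nvidia-smi topo -m` instead of `SAMPLE`.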

#3

Combined VRAM insufficient for the model

Diagnose

TP works on smaller models but fails on 70B+. Each GPU has to hold its slice of the weights plus KV cache and activations. Two 16 GB cards (32 GB combined) can't run 70B Q4, because the Q4 weights alone come to roughly 35 GB or more.

Fix

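Use a smaller model or a smaller quant, or upgrade to GPUs with 24 GB or more. For 70B Q4, dual 3090 (48 GB combined) is the minimum. FP16 70B needs about 140 GB for the weights alone (70B parameters × 2 bytes), so plan on at least two 80 GB cards such as the H100, or around eight 24 GB cards.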
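The sizing logic above can be approximated with back-of-the-envelope arithmetic. The sketch below rests on two assumptions that are not from this guide: about 4.5 bits per weight for a typical Q4 quant, and 15% of VRAM reserved for KV cache and activations. Both are adjustable parameters, and the function names are hypothetical.

```python
def weights_gb(params_billion, bits_per_weight):
    """Approximate weight size in GB (1 GB = 1e9 bytes)."""
    return params_billion * bits_per_weight / 8


def fits(params_billion, bits_per_weight, gpu_count, vram_gb_per_gpu,
         headroom_fraction=0.15):
    """True if the weights fit in combined VRAM after reserving headroom.

    headroom_fraction is an assumed reserve for KV cache and activations.
    """
    usable_gb = gpu_count * vram_gb_per_gpu * (1 - headroom_fraction)
    return weights_gb(params_billion, bits_per_weight) <= usable_gb


if __name__ == "__main__":
    # 70B at an assumed 4.5 bits per weight, on two 16 GB and two 24 GB cards.
    print(weights_gb(70, 4.5))
    print(fits(70, 4.5, gpu_count=2, vram_gb_per_gpu=16))
    print(fits(70, 4.5, gpu_count=2, vram_gb_per_gpu=24))
```

Under these assumptions the two-card 16 GB case fails and the dual 24 GB case only just fits, which matches the "minimum" framing above. Long contexts need more headroom than the default here.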

Best GPU for local AI →
#4

NCCL version mismatch with PyTorch / runtime

Diagnose

The process crashes with 'NCCL version mismatch' or 'function not found.' The NCCL bundled with PyTorch differs from the system-installed one.

Fix

Reinstall PyTorch from the official wheels, which bundle the right NCCL: `pip install --upgrade --force-reinstall torch --index-url https://download.pytorch.org/whl/cu124`. Alternatively, unset `NCCL_HOME` so the bundled library is used.
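One way to confirm a mismatch is to compare the version NCCL prints in the failing run's debug log with the version PyTorch reports in your Python environment. The sketch below assumes the log contains a line of the form `NCCL version X.Y.Z`, and it uses `torch.cuda.nccl.version()`, which returns a tuple in recent PyTorch releases. The function names are hypothetical.

```python
import re


def parse_nccl_log_version(log_text):
    """Extract (major, minor, patch) from an 'NCCL version X.Y.Z' log line."""
    match = re.search(r"NCCL version (\d+)\.(\d+)\.(\d+)", log_text)
    if match is None:
        return None
    return tuple(int(group) for group in match.groups())


def torch_nccl_version():
    """Return the NCCL version PyTorch reports, or None if unavailable."""
    try:
        import torch
        version = torch.cuda.nccl.version()
    except Exception:
        return None
    # Older PyTorch releases returned an integer; this sketch skips that case.
    return tuple(version) if isinstance(version, tuple) else None


if __name__ == "__main__":
    # Illustrative log line, assumed format.
    log_line = "NCCL INFO NCCL version 2.21.5+cuda12.4"
    print("log:", parse_nccl_log_version(log_line))
    print("torch:", torch_nccl_version())
```

If the two versions differ, the reinstall above is the likely fix.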

#5

GPU 0 not used (CUDA_VISIBLE_DEVICES misconfigured)

Diagnose

vLLM expects to use GPU 0 by default. If only GPUs 1 and 2 are exported, CUDA renumbers them, so the framework sees them as 0 and 1, and any setting that still assumes the original IDs errors out.

Fix

Set it explicitly: `CUDA_VISIBLE_DEVICES=0,1 python -m vllm.entrypoints.openai.api_server --tensor-parallel-size 2 ...`. The framework then sees the two GPUs as devices 0 and 1.
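The renumbering is easier to reason about when written out. This sketch maps the logical device index a framework sees to the ID listed in `CUDA_VISIBLE_DEVICES`. The function name is hypothetical and it assumes a plain comma-separated value.

```python
def logical_to_physical(cuda_visible_devices):
    """Map logical device index -> ID as listed in CUDA_VISIBLE_DEVICES.

    IDs are kept as strings because entries may also be GPU UUIDs.
    """
    ids = [part.strip() for part in cuda_visible_devices.split(",") if part.strip()]
    return dict(enumerate(ids))


if __name__ == "__main__":
    # With GPUs 1 and 2 exported, the framework addresses them as 0 and 1.
    print(logical_to_physical("1,2"))
```

One caveat: CUDA's default device ordering can differ from the order `nvidia-smi` shows. Setting `CUDA_DEVICE_ORDER=PCI_BUS_ID` makes the two agree.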

#6

P2P (peer-to-peer) disabled in BIOS

Diagnose

Multi-GPU runs, but with degraded throughput. `nvidia-smi topo -m` shows `PXB` (the path crosses multiple PCIe bridges) instead of `PIX` (at most one PCIe bridge) between the GPUs. A BIOS PCIe setting can prevent direct P2P.

Fix

Enable 'Above 4G Decoding' in BIOS. Some boards also offer 'Re-Size BAR Support', which helps. Reboot and verify that the `nvidia-smi topo -m` output improves.
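Besides the topology matrix, peer access can be checked at runtime. The sketch below uses PyTorch's `torch.cuda.can_device_access_peer` when PyTorch and CUDA are available. The helper `blocked_peer_pairs` is a hypothetical name, and it takes the check as a callable so the logic can be exercised without GPUs.

```python
def blocked_peer_pairs(device_count, can_access):
    """Return (src, dst) pairs for which can_access(src, dst) is False."""
    blocked = []
    for src in range(device_count):
        for dst in range(device_count):
            if src != dst and not can_access(src, dst):
                blocked.append((src, dst))
    return blocked


if __name__ == "__main__":
    try:
        import torch
    except ImportError:
        torch = None
    if torch is not None and torch.cuda.is_available():
        count = torch.cuda.device_count()
        print(blocked_peer_pairs(count, torch.cuda.can_device_access_peer))
    else:
        print("PyTorch with CUDA is not available; nothing to check")
```

An empty list after the BIOS change suggests P2P is now available between every pair. A non-empty list names the pairs that still fall back.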

Frequently asked questions

How much faster is tensor parallelism vs single-GPU?

vLLM and ExLlamaV2 typically scale 1.7-1.9x on dual-GPU for inference (a memory-bound workload). Training scales closer to 1.95x. The scaling is sub-linear because of communication overhead: tensor parallelism exchanges activations across GPUs at every layer.
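To put those speedups in perspective, here is a sketch under a deliberately simple model that is an assumption, not a measurement: per-step time on n GPUs is the single-GPU time divided by n, plus a fixed communication cost. The function names are hypothetical.

```python
def parallel_efficiency(speedup, gpu_count):
    """Fraction of ideal linear scaling actually achieved."""
    return speedup / gpu_count


def overhead_fraction(speedup, gpu_count):
    """Communication cost as a fraction of single-GPU step time.

    Assumed model: time_n = time_1 / n + overhead, so
    overhead / time_1 = 1 / speedup - 1 / n.
    """
    return 1 / speedup - 1 / gpu_count


if __name__ == "__main__":
    for speedup in (1.7, 1.9):
        print(speedup,
              round(parallel_efficiency(speedup, 2), 3),
              round(overhead_fraction(speedup, 2), 3))
```

Under this model, the 1.7-1.9x range corresponds to 85-95% efficiency, with communication costing roughly 3-9% of the single-GPU step time.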

Do I need NVLink for tensor parallelism?

No. Tensor parallelism works over PCIe 4.0 x16. NVLink helps but isn't required. Most consumer dual-GPU rigs (4090, 5090, 3090) use PCIe and work fine, and NVIDIA dropped NVLink from consumer cards starting with the 4090 anyway.

Can I tensor-parallel different GPU models?

Technically yes, but the slow card bottlenecks the fast one. Mixing 4090 + 3090 means 3090 throughput on tensor-parallel workloads. Match cards for production setups.

Related troubleshooting

CUDA out of memory

Why CUDA OOM happens during local LLM inference and image gen, how to confirm the real cause, and the four real fixes (smaller quant, shorter context, gradient checkpointing, or more VRAM).

vLLM: CUDA version mismatch / 'no kernel image is available for execution'

vLLM ships pre-built wheels against specific CUDA versions. When your system CUDA differs, you get cryptic kernel-image errors. Here's the version matrix and the fix order.

Model keeps crashing / segfault during inference

Mid-inference crashes (segfault, illegal memory access, kernel panic) usually mean VRAM ECC, thermal throttling, PSU instability, or a bad model file. Here's the diagnostic order.

When the fix is hardware

A surprising fraction of troubleshooting tickets resolve to: this card doesn't have enough VRAM for what you're asking it to do. If you're hitting OOM after every reasonable fix, or your GPU genuinely can't fit the model you need, it's upgrade time:

  • Best GPU for local AI
  • Best laptop for local AI
  • Best Mac for local AI

Where next?

All troubleshooting guides
Or: Best GPU for local AI · Will it run on my hardware?