RUNLOCALAIv38
Independently operated catalog for local-AI hardware and software. Hand-written verdicts. Source-cited claims. Reproducible commands when we have them.

OP·Fredoline Eruo
© 2026 runlocalai.co · Independently operated
Hardware & infrastructure

Pipeline Parallelism

Pipeline parallelism (a.k.a. "layer split" in llama.cpp parlance) places whole layers on different GPUs. For an 80-layer model on two cards, card 0 handles layers 0-39 and card 1 handles layers 40-79. On every forward pass, the activation tensor crosses the bus once per card boundary, rather than at every layer as in tensor parallelism.
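To get a feel for the size of that difference, here is a rough back-of-the-envelope sketch. The function names, the hidden size of 8192, the 2-byte (fp16) activations, and the figure of two synchronisation steps per layer for tensor parallelism are all illustrative assumptions, not measurements from any particular runtime.

```python
def pipeline_bytes_per_token(hidden_size, n_gpus, bytes_per_value=2):
    """Activation bytes crossing the bus per generated token with a layer split.

    Assumes one hidden-state vector crosses each of the (n_gpus - 1) card boundaries.
    """
    return hidden_size * bytes_per_value * (n_gpus - 1)


def tensor_bytes_per_token(hidden_size, n_layers, syncs_per_layer=2, bytes_per_value=2):
    """Rough payload synchronised per generated token under tensor parallelism.

    Assumes `syncs_per_layer` synchronisation steps per layer, each on one
    hidden-state vector; ignores the overhead of the all-reduce algorithm itself.
    """
    return hidden_size * bytes_per_value * syncs_per_layer * n_layers


# Hypothetical 80-layer model with hidden size 8192 on two cards.
pp = pipeline_bytes_per_token(hidden_size=8192, n_gpus=2)
tp = tensor_bytes_per_token(hidden_size=8192, n_layers=80)
print(f"layer split: {pp} bytes/token, tensor parallel: {tp} bytes/token")
```

Under these assumptions the tensor-parallel payload is larger by a factor of 2 x 80 = 160, and it is also split into many small transfers, each of which pays interconnect latency.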

The trade-off: pipeline parallelism is friendlier than tensor parallelism to slow interconnects (PCIe-only multi-GPU, Thunderbolt clusters, multi-machine over Ethernet) because cross-card traffic occurs once per token at each card boundary instead of at every layer. The downside is that it is inherently sequential: card 1 sits idle while card 0 computes the first half of the layers, so single-stream throughput is limited to roughly what one card delivers. The parallelism pays off only as aggregate throughput across concurrent requests, not as lower latency for a single stream.
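The idle-card effect can be sketched with a toy steady-state model. It assumes every stage takes the same time, transfer time is negligible, and each request can occupy only one stage at a time; `stage_utilization` is a hypothetical helper, not part of any inference engine.

```python
def stage_utilization(n_stages, n_streams):
    """Fraction of time each card is busy in steady state.

    Toy model: equal stage times, no transfer cost, and a stream can be
    in only one stage at any moment, so at most `n_streams` stages are busy.
    """
    if n_stages < 1 or n_streams < 1:
        raise ValueError("n_stages and n_streams must be at least 1")
    return min(n_streams, n_stages) / n_stages


for streams in (1, 2, 4):
    print(f"2 cards, {streams} concurrent stream(s): "
          f"{stage_utilization(2, streams):.0%} busy")
```

In this model a single stream keeps each of two cards busy half the time, while two or more concurrent streams can keep both busy. Real schedulers add batching and transfer overhead, so treat this as an upper bound on the intuition rather than a prediction.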

Pipeline parallelism is the right answer for: asymmetric GPU pairs (a mixed RTX 4090 + RTX 3090, where the split ratio compensates for the throughput difference), PCIe-only multi-GPU (the lack of NVLink matters less), and multi-machine clusters (Exo, Petals, Hyperspace pods). vLLM supports it via --pipeline-parallel-size; llama.cpp via --split-mode layer (the default), with --tensor-split setting the per-GPU ratio. It is often combined with tensor parallelism (TP) in hybrid configurations on large datacenter clusters.
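The ratio mentioned above can be pictured as a proportional assignment of contiguous layer ranges to cards. The sketch below is a minimal illustration of that idea; `split_layers` is a hypothetical helper, not llama.cpp's actual allocation logic, and the 3:2 ratio is an arbitrary example rather than a measured 4090-to-3090 figure.

```python
def split_layers(n_layers, ratios):
    """Assign contiguous layer ranges to cards in proportion to `ratios`.

    Returns one `range` per card. Illustrative only: real runtimes also
    account for VRAM limits, embeddings, and the output layer.
    """
    total = sum(ratios)
    if n_layers < 1 or total <= 0 or any(r < 0 for r in ratios):
        raise ValueError("need n_layers >= 1 and non-negative ratios with a positive sum")

    assignments = []
    start = 0
    cumulative = 0.0
    for i, ratio in enumerate(ratios):
        cumulative += ratio
        # The last card always takes whatever layers remain.
        if i == len(ratios) - 1:
            end = n_layers
        else:
            end = round(n_layers * cumulative / total)
        assignments.append(range(start, end))
        start = end
    return assignments


# Even split of an 80-layer model, then a hypothetical 3:2 split.
for ratios in ([1, 1], [3, 2]):
    parts = split_layers(80, ratios)
    print(ratios, [(r.start, r.stop - 1) for r in parts])
```

An even ratio reproduces the 0-39 / 40-79 split described earlier; a skewed ratio hands more layers to the faster or larger card.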

Related terms

  • VRAM (Video RAM)
  • KV Cache
  • Tensor Parallelism

See also

  • hardware: rtx-3090
  • hardware: rtx-4090
  • tool: vllm
  • tool: llama-cpp
  • tool: exo