Pipeline Parallelism

Pipeline parallelism (called "layer split" in llama.cpp) places whole layers on different GPUs. For an 80-layer model on two cards, card 0 holds layers 0-39 and card 1 holds layers 40-79. On each forward pass, the activation tensor crosses the bus once per stage boundary, rather than at every layer as in tensor parallelism.
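The layer assignment described above can be sketched in a few lines of Python. This is only an illustration of a contiguous, near-even split; the function name split_layers is hypothetical and is not taken from llama.cpp or vLLM, which may place layers differently.

```python
def split_layers(n_layers: int, n_stages: int) -> list[range]:
    """Assign contiguous, near-equal blocks of layers to each pipeline stage."""
    base, extra = divmod(n_layers, n_stages)
    stages, start = [], 0
    for s in range(n_stages):
        # The first `extra` stages take one additional layer each.
        size = base + (1 if s < extra else 0)
        stages.append(range(start, start + size))
        start += size
    return stages


stages = split_layers(80, 2)
for gpu, layers in enumerate(stages):
    print(f"card {gpu}: layers {layers.start}-{layers.stop - 1}")

# Activations cross the interconnect once per stage boundary per forward pass.
boundary_transfers = len(stages) - 1
print(f"transfers per forward pass: {boundary_transfers}")
```

With 80 layers and two stages, this reproduces the 0-39 / 40-79 split and a single boundary transfer per forward pass.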

The trade-off: pipeline parallelism is friendlier than tensor parallelism to slow interconnects (PCIe-only multi-GPU, Thunderbolt clusters, multi-machine setups over Ethernet) because cross-card traffic happens once per stage boundary per token instead of at every layer. The downside is that it is inherently sequential: card 1 sits idle while card 0 computes the first half of the layers, and vice versa, so a single stream runs no faster than one card working through every layer in turn. The parallelism only pays off with concurrent requests, which keep both cards busy at once and raise aggregate throughput; it does not reduce the latency of any single request.
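A rough timing model makes the latency-versus-throughput point concrete. The sketch below assumes an idealized pipeline: each stage works on one request at a time, transfer time is ignored, and the requests are independent. The 10 ms per-stage figure is made up for illustration, and pipeline_timing is a hypothetical helper, not part of any library.

```python
def pipeline_timing(stage_times_ms: list[int], n_requests: int) -> tuple[int, int]:
    """Return (latency of one request, total time to finish n_requests).

    Idealized model: no transfer cost, one request per stage at a time.
    """
    latency = sum(stage_times_ms)      # a request must visit every stage in order
    bottleneck = max(stage_times_ms)   # the slowest stage sets the steady-state rate
    total = latency + (n_requests - 1) * bottleneck
    return latency, total


# Two stages of 10 ms each (illustrative numbers only).
single_latency, single_total = pipeline_timing([10, 10], n_requests=1)
batch_latency, batch_total = pipeline_timing([10, 10], n_requests=8)

print(f"1 request:  latency {single_latency} ms, total {single_total} ms")
print(f"8 requests: latency {batch_latency} ms, total {batch_total} ms")
```

Under these assumptions a single request takes 20 ms whether or not the pipeline is otherwise busy, while eight overlapping requests finish in 90 ms rather than the 160 ms they would need one after another.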

Pipeline parallelism is the right answer for asymmetric GPU pairs (for example a mixed RTX 4090 + RTX 3090, where an uneven split ratio compensates for the throughput difference), PCIe-only multi-GPU setups (where the lack of NVLink matters less), and multi-machine clusters (Exo, Petals, Hyperspace pods). vLLM supports it via --pipeline-parallel-size; llama.cpp supports it via --split-mode layer, with --tensor-split setting the proportion of the model placed on each GPU. It is often combined with TP in hybrid configurations on large datacenter clusters.
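For an asymmetric pair, the split ratio can be worked out before launch. The sketch below divides layers in proportion to per-card weights and formats the result as a comma-separated list of the kind --tensor-split takes. The 3:2 weighting is a made-up example, not a measured 4090/3090 ratio, and weighted_layer_split is a hypothetical helper.

```python
def weighted_layer_split(n_layers: int, weights: list[int]) -> list[int]:
    """Split n_layers into per-card counts proportional to integer weights."""
    total = sum(weights)
    counts = [n_layers * w // total for w in weights]
    # Rounding down can leave a few layers unassigned; give them to the
    # highest-weighted cards first.
    leftover = n_layers - sum(counts)
    order = sorted(range(len(weights)), key=lambda i: weights[i], reverse=True)
    for i in order[:leftover]:
        counts[i] += 1
    return counts


# Assumed 3:2 weighting between a faster and a slower card (illustrative only).
counts = weighted_layer_split(80, [3, 2])
split_value = ",".join(str(c) for c in counts)
print(f"layers per card: {counts}")
print(f"proportion list: {split_value}")
```

In practice the weights would come from measured per-card throughput or available VRAM, whichever is the tighter constraint.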
