What distributed inference actually is
Distributed inference at the depth engineers need before betting a deployment on it. Tensor parallelism vs pipeline parallelism vs CPU offload, why “VRAM pooling” is the wrong mental model, the latency math that makes consumer networking the real bottleneck, and the conditions under which one bigger GPU still wins.
What distributed inference actually is
Distributed inference is running a single model whose weights or computation graph are partitioned across more than one accelerator, in a way that requires accelerators to communicate during every inference step. The defining property is the communication during inference. Running ten copies of the same model behind a load balancer is not distributed inference — that is replicated inference, which scales throughput but doesn't share model state between machines.
The two real partitioning strategies — tensor parallelism (TP) and pipeline parallelism (PP) — are the engineering content of this page. Everything else in the “distributed inference” space (offloading, swarms, federated training) inherits its tradeoffs from how it answers two questions: what is being partitioned and how often do partitions have to talk during a single forward pass.
What it does not mean
Three patterns get called “distributed inference” in marketing copy and aren't:
- Replicated inference — N copies of the same model, load-balanced. Scales QPS, doesn't share state. The right answer at SaaS scale; not what this page is about.
- CPU/disk offloading — model weights live in host RAM or NVMe and stream to a single GPU on demand. One machine, slow inference, no inter-machine traffic. Useful when you can't fit the weights in VRAM; not architecturally distributed.
- Federated training — different shape entirely. Multiple parties train a shared model on private data; weights are aggregated periodically. Inference is downstream and usually single-node.
Architecture: TP, PP, and CPU offload
Distributed inference assembles three primitives:
Tensor parallelism (TP) shards the matrices inside a layer across N GPUs. Every GPU stores a slice of the weight matrix; on every layer, every GPU does its slice of the matmul, then all GPUs all-reduce to share the partial result. The all-reduce is the expensive part — it happens twice per transformer block, every token. TP is the path that lets a 70B model fit across 2 cards, each holding roughly half the weights (~35GB each at 8-bit precision), and it's also the path that makes interconnect bandwidth a first-order performance concern.
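A minimal sketch of that per-layer pattern, written against raw torch.distributed rather than any real runtime's kernels; the function and tensor names are illustrative, and it assumes one process per GPU with the NCCL process group already initialized (e.g. via torchrun).

```python
# Sketch of the TP pattern: local slice of the matmul, then an all-reduce to
# sum the partial results across ranks. Assumes torch.distributed has been
# initialized with one process per GPU:
#   dist.init_process_group(backend="nccl")
import torch
import torch.distributed as dist

def row_parallel_linear(x_shard: torch.Tensor, w_shard: torch.Tensor) -> torch.Tensor:
    partial = x_shard @ w_shard                     # this GPU's slice of the matmul
    dist.all_reduce(partial, op=dist.ReduceOp.SUM)  # the expensive step: sum partials across all TP ranks
    return partial                                  # every rank now holds the full layer output
```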
Pipeline parallelism (PP) shards entire layers across N GPUs. GPU 0 holds layers 1-20, GPU 1 holds layers 21-40, GPU 2 holds 41-60, etc. Activations stream forward through the pipeline; the GPU holding layer K hands off to the GPU holding layer K+1. Communication per token is much smaller than TP — only the activation tensor between layer boundaries — but you pay in pipeline-bubble idle time unless you keep many requests in flight.
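The same idea for PP, again as a raw torch.distributed sketch rather than a real scheduler: the only cross-GPU traffic is the activation tensor at each stage boundary. `embed_inputs` and the layer list are hypothetical stand-ins, and no micro-batching is shown, so this sketch would suffer exactly the pipeline bubbles described above.

```python
# Sketch of one pipeline stage: receive activations from the previous stage,
# run only the layers this GPU owns, then hand off to the next stage.
import torch
import torch.distributed as dist

def run_pipeline_stage(my_layers, rank: int, world_size: int, hidden: int, device: str):
    if rank == 0:
        acts = embed_inputs()                  # hypothetical: stage 0 produces the first activations
    else:
        acts = torch.empty(1, hidden, device=device)
        dist.recv(acts, src=rank - 1)          # activation tensor arrives from the previous stage
    for layer in my_layers:
        acts = layer(acts)                     # compute this GPU's contiguous slice of layers
    if rank < world_size - 1:
        dist.send(acts, dst=rank + 1)          # only the activations cross the GPU boundary
    return acts
```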
CPU/disk offload is the single-machine fallback. Weights of the layers you're not currently computing live in host RAM or NVMe; the runtime swaps slabs of weights into GPU memory just before they're needed. PCIe bandwidth becomes the bottleneck: roughly 16-32 GB/s for a PCIe 3.0/4.0 x16 link, orders of magnitude slower than HBM, so per-token latency stretches by 5-50x. Useful when your only constraint is “does it run at all?”
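For comparison, the offload path is a runtime flag rather than a topology decision. A sketch using Hugging Face Accelerate's device_map offloading, which is one real way to get this behavior; the model id and offload path are placeholders.

```python
# Single-machine offload sketch: fill GPU VRAM first, spill remaining layers
# to host RAM, then to disk. Weights stream back over PCIe as layers are needed.
from transformers import AutoModelForCausalLM

model = AutoModelForCausalLM.from_pretrained(
    "meta-llama/Llama-3.1-70B-Instruct",   # placeholder model id
    device_map="auto",                      # Accelerate decides the GPU / CPU / disk split
    offload_folder="/nvme/offload",         # placeholder path for the disk tier
    torch_dtype="auto",
)
```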
Real systems combine these. A 405B-class model on 8x A100 80GB uses TP=8 within a node (NCCL all-reduce over NVLink). A 671B-class model across two such nodes uses TP=8 within each node and PP=2 across nodes (Ray transports activations between nodes over InfiniBand). An Apple-Silicon cluster running DeepSeek V3 uses pipeline parallel across boxes connected by Thunderbolt 5 with macOS 26.2 RDMA cutting the inter-box latency by ~99% — the same pattern, different transport.
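In configuration terms, that two-node layout is a handful of parameters. A sketch using vLLM's offline LLM API: the parameter names are vLLM's own, the model id is illustrative, and the Ray cluster (`ray start` on each node) is assumed to already be running.

```python
# TP=8 inside each node, PP=2 across nodes, Ray moving activations between them.
from vllm import LLM, SamplingParams

llm = LLM(
    model="deepseek-ai/DeepSeek-V3",        # illustrative 671B-class checkpoint
    tensor_parallel_size=8,                  # shard every layer across the 8 GPUs of a node
    pipeline_parallel_size=2,                # split the layer stack across the two nodes
    distributed_executor_backend="ray",      # Ray carries activations over the inter-node fabric
)
print(llm.generate(["Hello"], SamplingParams(max_tokens=8))[0].outputs[0].text)
```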
Why “VRAM pooling” is the wrong mental model
The most common misconception in distributed-inference framing is that connecting multiple GPUs “pools their VRAM” into a single addressable buffer. It does not. Each GPU still has its own VRAM, its own memory hierarchy, its own kernels. What distributed inference does is partition the work so each GPU only needs the slice of the model it owns. The total memory across the cluster is additive only because the runtime ensures no two GPUs need the same weights at the same time.
The corollary that bites operators: you cannot run a 70B model on two 24GB consumer GPUs by “pooling 48GB of VRAM” unless your runtime supports the partitioning. With TP, both cards need to hold matching slices of every layer's weights — possible, but the inter-GPU bandwidth on consumer motherboards (PCIe 4.0 x16 = ~32 GB/s, x8 = ~16 GB/s) makes the all-reduce a serious tax. With PP, GPU 0 holds the first half of the model and GPU 1 holds the second; bandwidth is much lower per token, but you need at least 2 concurrent requests to avoid pipeline bubbles.
The practical version of this rule: multi-GPU inference is partitioning, not pooling. Plan against the partitioning your runtime supports, and plan against the bandwidth your interconnect provides.
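A back-of-envelope version of that rule for the 70B-on-two-24GB example: the split is of the weights, not of a shared address space, and whether it fits depends entirely on precision. KV cache and activation overhead are ignored here.

```python
# Per-GPU weight share for a 70B model split across two 24 GB cards.
PARAMS = 70e9
for precision, bytes_per_param in [("FP16", 2), ("INT8", 1), ("INT4", 0.5)]:
    total_gb = PARAMS * bytes_per_param / 1e9
    per_gpu_gb = total_gb / 2                  # TP and PP both give each card ~half the weights
    print(f"{precision}: {total_gb:.0f} GB total, {per_gpu_gb:.1f} GB per GPU, "
          f"fits in 24 GB: {per_gpu_gb < 24}")
```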
The latency math — why network is the bottleneck
A 70B model running TP=2 with FP16 weights does roughly two all-reduces per transformer block. With 80 transformer blocks, that's 160 all-reduces per token. Each all-reduce moves tens of kilobytes of activation data at batch=1, rising into the megabytes at larger batch sizes, depending on hidden size.
Plug in real numbers. A typical 70B all-reduce on Llama is ~32 KB per token at batch=1, scaled up at higher batch. Across NVLink (600 GB/s practical bidirectional), 32 KB completes in well under a microsecond. Across PCIe 4.0 x16 (32 GB/s), the same all-reduce takes ~1 microsecond. Across 100 Gbps Ethernet (12.5 GB/s practical), it takes ~3 microseconds. Across 1 Gbps Ethernet (100 MB/s), ~300 microseconds.
Multiply by 160 all-reduces per token, remembering that these are bandwidth-only figures that ignore per-call launch and network latency. NVLink: ~10 µs of wire time per token (negligible vs decode compute). PCIe 4.0 x16: ~160 µs (still negligible). 100 Gbps Ethernet: ~400-500 µs of wire time, but per-call latency over an Ethernet stack runs to tens of microseconds, so the real per-token cost lands in the millisecond range and becomes a meaningful fraction of 70B decode time. 1 Gbps Ethernet: ~50 ms (interconnect dominates everything; it alone caps you at roughly 20 tok/s before any GPU compute).
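The arithmetic behind those per-token figures, as a few lines of Python. These are bandwidth-only wire times; per-call all-reduce latency comes on top, and on Ethernet it is the larger term.

```python
# Bandwidth-only interconnect time per token for the TP=2 / 70B example above.
ALL_REDUCE_BYTES = 32 * 1024        # ~32 KB per all-reduce at batch=1
ALL_REDUCES_PER_TOKEN = 160         # 2 per block x 80 transformer blocks

links_gb_per_s = {"NVLink": 600, "PCIe 4.0 x16": 32, "100 GbE": 12.5, "1 GbE": 0.1}
for name, bw in links_gb_per_s.items():
    per_token_us = ALL_REDUCES_PER_TOKEN * ALL_REDUCE_BYTES / (bw * 1e9) * 1e6
    print(f"{name:>13}: {per_token_us:>9,.0f} us of wire time per token")
```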
The conclusion isn't that consumer Ethernet doesn't work — Petals demonstrably runs Llama-2 70B at ~6 tok/s across internet-WAN volunteer hosts. The conclusion is that interconnect quality determines whether tensor parallelism buys you anything compared to a single bigger card. With NVLink or InfiniBand, TP scales nearly linearly. With 1 Gbps Ethernet, the interconnect tax is so large that two cards under TP typically deliver less than a single card running alone.
Workflow: how a request flows across the cluster
A concrete example — Llama-3.1 70B on 4x A100 80GB with TP=4, orchestrated by Ray, served via vLLM:
- HTTP client posts to the vLLM head node's OpenAI-compat endpoint. Head node parses the request and adds it to the scheduler queue.
- On the next decode step, the scheduler assigns the request a slot. All four GPUs are doing prefill / decode in lockstep (TP), so the request becomes part of the running batch.
- For every transformer block, every GPU computes its slice of the matmul, then NCCL all-reduce over NVLink combines partial results across all four cards. Sub-microsecond per all-reduce on this hardware.
- After the final layer, the rank-0 worker gathers the final logits, samples the output token, and the head node streams the detokenized text back to the client.
- On the next decode step, repeat. The KV cache stays partitioned across the four cards and never moves between them; each card appends the new K/V entries for its own attention-head shard locally.
Cross-node deployment adds a pipeline-parallel boundary between nodes, with Ray transporting the activations across it. The lifecycle stays the same; the latency budget grows by the inter-node interconnect cost (microseconds on InfiniBand, milliseconds on Ethernet). None of this is visible to the client, as the sketch below shows.
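A minimal streaming call against vLLM's OpenAI-compatible server, corresponding to step 1 of the flow above; the host, port, and model id are placeholders for this deployment.

```python
# Step 1 of the flow above, seen from the client: the four-GPU TP group is
# invisible behind the OpenAI-compatible endpoint on the head node.
from openai import OpenAI

client = OpenAI(base_url="http://head-node:8000/v1", api_key="EMPTY")  # placeholder host
stream = client.chat.completions.create(
    model="meta-llama/Llama-3.1-70B-Instruct",
    messages=[{"role": "user", "content": "Explain tensor parallelism in one sentence."}],
    stream=True,                    # tokens arrive as each lockstep decode step finishes
)
for chunk in stream:
    print(chunk.choices[0].delta.content or "", end="", flush=True)
```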
Multi-PC inference reality
Most readers asking about distributed inference are asking about consumer multi-PC setups: two or three workstations on a home network, maybe a Mac and a Linux box, each with one GPU or M-series chip. Three honest observations:
Consumer Ethernet is the bottleneck when the model crosses machine boundaries. 1 Gbps (~0.12 GB/s) is more than 200x slower than PCIe 4.0 x16; even 10 Gbps (~1.25 GB/s) is roughly 25x slower. Tensor parallel across a switch is almost always worse than running on a single card with offload. Pipeline parallel can work because per-token communication is smaller, but you still pay microseconds-to-milliseconds per layer boundary.
Apple Silicon clusters are the 2026 exception. Thunderbolt 5 + macOS 26.2 RDMA cut device-to-device latency to roughly InfiniBand levels — the Exo reference benchmark shows DeepSeek V3 671B at 5.37 tok/s on 8x M4 Pro Mac Minis, genuinely competitive, tier for tier, with the same workload on datacenter hardware. That result lifts consumer Apple multi-PC setups from “novelty” to “real serving option” for personal use.
Petals is the “internet is the cluster” extreme. Layers shard across the global volunteer swarm; activations stream over WAN. ~6 tok/s on Llama-2 70B is slow, but it lets a laptop user run a model they could never fit locally. Because activations leave your machine it's unsuitable for sensitive workloads, but for “I want to play with a frontier-class model and can't afford the GPU”, it's a real option.
Reference stacks (Hyperspace / Petals / Exo / vLLM+Ray / SGLang)
The five categories of distributed inference deployment in May 2026, with the canonical tool for each:
Datacenter multi-GPU, single-node TP → vLLM or SGLang. `tensor_parallel_size=N`, NVLink, NCCL all-reduce, the whole production playbook. The canonical pattern for serving 70B-class models behind an OpenAI-compatible endpoint.
Datacenter multi-node TP+PP → vLLM or SGLang orchestrated by Ray Serve. Required for 405B / 671B-class models. InfiniBand or RoCE non-negotiable.
Apple Silicon home/office cluster → Exo. Auto-discovers nearby devices over the LAN; pipeline-parallels via MLX. Thunderbolt 5 + RDMA on M4 Pro+ hardware is the unlock that makes this credible for serious models.
Internet-scale volunteer swarm → Petals. Pipeline-parallel across WAN-connected volunteer hosts. Slow but possible for models you couldn't otherwise run.
Consumer P2P inference (experimental) → Hyperspace and a handful of newer entrants. The category that still has no undisputed winner; watch the next 6-12 months.
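For the volunteer-swarm category above, the Petals client API deliberately mirrors transformers. A hedged sketch of the documented Petals pattern; the model id is one the public swarm has hosted, but availability depends on which peers are online.

```python
# Petals sketch: the model object looks local, but the transformer blocks
# execute on remote swarm peers, so every forward pass crosses the WAN.
from transformers import AutoTokenizer
from petals import AutoDistributedModelForCausalLM

model_id = "meta-llama/Llama-2-70b-chat-hf"        # example swarm-hosted checkpoint
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoDistributedModelForCausalLM.from_pretrained(model_id)

input_ids = tokenizer("Distributed inference is", return_tensors="pt")["input_ids"]
output = model.generate(input_ids, max_new_tokens=32)
print(tokenizer.decode(output[0]))
```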
When distributed inference actually wins
The conditions under which the engineering tax pays for itself:
- The model literally won't fit on the biggest single card you can buy. 405B / 671B-class models need an H100/H200-class cluster; TP+PP is the only path.
- You have NVLink or InfiniBand interconnect. Without it, the all-reduce tax usually exceeds the parallel speedup. With it, TP scales nearly linearly to 8x within a node.
- You have an Apple Silicon cluster on Thunderbolt 5 with macOS 26.2 RDMA. The new exception that makes consumer-Mac multi-PC competitive.
- You can't buy a single bigger card and you can tolerate the throughput hit. Petals fits here: 6 tok/s on a model you couldn't otherwise run.
- You're already running an autoscaling Ray cluster. The marginal cost of a multi-node serving deployment is low when the orchestration layer is already paid for.
When one bigger GPU is the better answer
The underappreciated default. If the model fits on one card you can buy, you should almost always run it that way:
- Llama 3.1 70B AWQ-INT4 fits in ~40GB. One H100 80GB or one MI300X 192GB beats two A100 40GB on TP for both throughput and ops complexity.
- 13B-class models fit on a 24GB consumer card. Distributed inference is pure overhead at this scale.
- Latency-sensitive single-request workloads often prefer a single fast card to multi-card TP, even when both fit, because the all-reduce tax adds latency that single-card decode doesn't.
- You don't have NVLink/InfiniBand. Probably the most common mistake we see in homelab setups — wiring two GPUs across PCIe x8 and assuming TP will scale. It usually won't.
- Your operations team has one platform engineer. Ray + vLLM + multi-node tensor parallel is a real cluster engineering job. Single-node is one process.
The honest version of the rule: distributed inference is a hardware-and-ops investment, not a software trick. Without the underlying interconnect and platform-engineering maturity, “just add another GPU” rarely turns into the throughput it suggests.
Failure modes you'll hit in production
The list of things that will go wrong in distributed-inference deployments, in rough order of how often we've seen them:
- NCCL hang on cluster startup. Inter-node tensor parallel with non-uniform NIC settings deadlocks at init. `NCCL_DEBUG=INFO` and `NCCL_IB_HCA` configuration are mandatory for any multi-node deployment (a preflight sketch follows this list). Plan a half-day of cluster debugging the first time.
- PCIe topology asymmetry. Two GPUs on the same root complex have different bandwidth than two GPUs on different root complexes. `nvidia-smi topo -m` reveals which pairs share NVLink vs PCIe. Misaligned tensor-parallel ranks silently halve throughput.
- Pipeline bubble starvation. PP with single-request traffic leaves most pipeline stages idle most of the time. Either keep many requests in flight or accept that PP is a memory-fitting strategy, not a throughput strategy at low concurrency.
- Ray head node single-point-of-failure. Every multi-node Ray deployment starts with a head node that schedules everything. Lose it, lose the cluster. Plan for HA or accept the failure mode.
- Network MTU mismatch. Ethernet MTU defaults to 1500; high-throughput RDMA wants 9000 (jumbo frames). Mismatched MTUs across switches produce silent throughput regression on multi-node TP.
- Petals / Exo node churn. Volunteer/peer inference depends on which nodes are online. Latency spikes and partial failures are routine; client code has to handle them gracefully.
- Activation precision drift. Different GPUs in a cluster running mixed FP16/BF16 silently produce slightly different activations. Reproducibility breaks; debugging is painful. Pin precision globally.
- RDMA driver / firmware mismatch. Apple Silicon clusters on Thunderbolt 5 RDMA require macOS 26.2+ on every node and matching Thunderbolt firmware. One node on macOS 26.1 silently downgrades the cluster to non-RDMA path.
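A minimal preflight sketch for the first two items on this list. The NCCL environment variables are real knobs, while the HCA name is a placeholder you'd replace after checking your own nodes; set the variables before the serving process starts, not after NCCL has initialized.

```python
# Preflight: turn on NCCL init logging, pin the InfiniBand HCA explicitly,
# and dump the GPU topology so misaligned TP ranks are visible before launch.
import os
import subprocess

os.environ["NCCL_DEBUG"] = "INFO"       # verbose NCCL init / transport selection logs
os.environ["NCCL_IB_HCA"] = "mlx5_0"    # placeholder device name; confirm with ibstat

# Which GPU pairs share NVLink vs PCIe vs different root complexes:
topo = subprocess.run(["nvidia-smi", "topo", "-m"], capture_output=True, text=True)
print(topo.stdout)
```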
Benchmark targets RunLocalAI should run
The measurements that would let readers actually decide whether distributed inference is worth it for them. These are the targets we plan to add to the benchmarks dataset over the next two cycles:
- 70B FP16 on 1x H100 vs 2x A100 80GB TP=2 with NVLink — does the extra card buy throughput, or is single-card HBM bandwidth winning?
- 70B AWQ on 1x RTX 4090 24GB (offload) vs 2x RTX 4090 TP=2 over PCIe 4.0 x16 — the “is multi-GPU consumer worth it” question, with real numbers.
- DeepSeek V3 671B on 8x M4 Pro Mac Mini cluster with Thunderbolt 5 RDMA — the Exo reference benchmark independently verified.
- Petals public swarm Llama-3.1 70B latency distribution — TTFT and per-token latency distributions over a week of hourly probes.
- vLLM TP=4 on 4x A100 80GB across two nodes (PP=2) via Ray on InfiniBand vs 100 Gbps Ethernet — the interconnect-quality vs throughput curve at production scale.
Comparison: distributed vs single-node alternatives
The five reference stacks above against the obvious single-node alternatives:
vs single big GPU + offload. Offload to host RAM beats distributed inference on hardware-availability axes (one machine, no networking) but loses on throughput (PCIe-bound at ~16 GB/s, vs HBM at 1-3 TB/s for the resident model). Pick offload when you can't fit the model and can't add a second machine; pick TP when the interconnect is fast enough.
vs replicated single-node serving. Replication scales QPS linearly, doesn't share state, and is vastly simpler operationally. Pick replication for serving a model that fits in one box at SaaS scale; pick distributed inference only when the model itself doesn't fit.
vs cloud API call. The honest comparison most readers should run before any distributed inference deployment. At spot-instance pricing, cloud H100 inference per token is often cheaper than amortising a self-hosted cluster, until your traffic clears ~$50K/month of API spend. Distributed inference becomes the answer when the per-token economics flip or when you have a hard data-residency requirement that rules out cloud calls.
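A toy version of that break-even check. Every number is a placeholder assumption to be replaced with your own API pricing, traffic, and amortized cluster cost.

```python
# Break-even sketch: API bill vs amortized self-hosted cluster cost per month.
api_usd_per_mtok = 3.00            # placeholder blended API price per million tokens
cluster_usd_per_month = 25_000     # placeholder: hardware amortization + power + ops time
tokens_per_month = 10e9            # placeholder traffic volume

api_bill = tokens_per_month / 1e6 * api_usd_per_mtok
print(f"API: ${api_bill:,.0f}/mo  vs  self-hosted: ${cluster_usd_per_month:,}/mo")
print("self-hosting wins" if cluster_usd_per_month < api_bill else "the API wins")
```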
vs quantizing harder. Most often forgotten: dropping from FP16 to AWQ-INT4 cuts memory needs ~4x, often turning “needs distributed inference” into “fits on one card.” Quality loss is usually 1-2% on benchmark scores; the operational simplification is enormous. Always check the quantization path before investing in cluster engineering.
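What the quantize-first path looks like in practice: a single-card vLLM launch of an AWQ-INT4 checkpoint instead of any multi-GPU configuration. The model id is illustrative; the quantization and max_model_len parameters are vLLM's own.

```python
# One card, one process: the AWQ path that often removes the need for TP at all.
from vllm import LLM

llm = LLM(
    model="hugging-quants/Meta-Llama-3.1-70B-Instruct-AWQ-INT4",  # illustrative AWQ checkpoint
    quantization="awq",       # INT4 weights land around the ~40 GB figure cited above
    max_model_len=8192,       # cap context so the KV cache also fits on the single card
)
```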
Companion reading: inference-runtime ecosystem map for where each runtime sits in the landscape; vLLM operational review and SGLang operational review for the runtime-specific operator detail; /systems/mcp for the protocol layer most distributed inference deployments expose to clients.