Local AI compute clusters.
When a single GPU stops being enough — 70B at Q8, 200B+ MoE, long-context agent workflows — you're shopping for a cluster. Three credible paths in 2026: multi-GPU CUDA via vLLM tensor parallelism, multi-Mac via Exo + Thunderbolt 5, and multi-node CUDA. Honest tradeoffs, real numbers, sourced.
TL;DR
Three paths actually work in 2026:
- Multi-GPU on one machine via vLLM tensor parallelism — dual / quad 3090 / 4090 / A6000. The serious-work default. NVLink optional but recommended for ≥4 GPUs.
- Multi-Mac via Exo + Thunderbolt 5 — 2-8 Mac Studios sharded peer-to-peer. The 2026 surprise: DeepSeek V3 (671B) ran at 5.37 tok/s on 8× M4 Pro Mac Minis C (per Virge.io, community-reported — see confidence ladder).
- Multi-node CUDA via vLLM TP + PP — workstation-scale or small datacenter, when one rig can't hold even with multi-GPU. Ray Serve or Kubernetes for orchestration.
The honest middle ground: most readers don't need a cluster. A single 4090 24GB or M3 Ultra 192GB handles 99% of solo local-AI work. Clusters are for the 1% that's actually frontier-class — production multi-user inference, 70B+ at Q8, or 200B+ MoE.
Do you need a cluster?
Three honest tests before you spend cluster money:
- VRAM math. Does your target model + KV cache exceed a single consumer GPU? Llama 3.3 70B at Q4 is 40GB weights + ~10GB KV at 32K context — won't fit on a 4090 (24GB). But it DOES fit on an M3 Ultra (192GB unified) — so “won't fit on one GPU” ≠ “needs a cluster.”
- Concurrency. Single-user or multi-user? vLLM's continuous batching + PagedAttention is built for multi-user, but a single operator hitting a single GPU usually doesn't need the throughput a cluster unlocks.
- Throughput vs latency. Clusters shine on throughput. Latency (time-to-first-token, total response time) often gets WORSE — inter-node coordination adds ms or seconds. If your workflow is “ask one question, wait for the answer,” a single fast rig usually wins.
Decision shortcut: if a single 4090 24GB / M3 Ultra 192GB / dual-3090 48GB rig can't hold your target model + your typical context size + 2× headroom, you're cluster-shopping. Otherwise probably not.
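A back-of-envelope version of that VRAM math, as a rough sketch — the layer count, KV-head count, and head dimension below are the published Llama 3.3 70B figures; swap in your target model's config:

```bash
# Rough "will it fit" estimate in whole GB (bash integer math; real runtimes add overhead).
PARAMS_B=70      # parameters, in billions
BITS=4           # Q4 quantization -> 0.5 bytes per parameter
LAYERS=80        # Llama 3.3 70B
KV_HEADS=8       # grouped-query attention KV heads
HEAD_DIM=128
CTX=32768        # target context length
KV_BYTES=2       # fp16 KV cache

WEIGHTS_GB=$(( PARAMS_B * BITS / 8 ))   # ~35 GB raw; ~40 GB with embeddings + runtime overhead
KV_GB=$(( 2 * LAYERS * KV_HEADS * HEAD_DIM * KV_BYTES * CTX / 1000000000 ))   # ~10 GB at 32K
echo "weights ~${WEIGHTS_GB} GB + KV ~${KV_GB} GB at ${CTX} ctx — compare against your card(s), then add the 2x context headroom"
```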
The three paths
Each path has a different sweet spot. Match the path to the workload, not the other way around:
| Path | Best at | Worst at |
|---|---|---|
| vLLM TP (single node) | multi-user inference, throughput per dollar | needs identical GPUs; CUDA-only |
| Mac Studio cluster (Exo) | huge unified memory, low power, quiet | single-user; latency-sensitive workflows |
| Multi-node CUDA | frontier-class throughput; horizontal scale | ops complexity; needs DC-grade networking |
Path 1 — vLLM tensor parallel (single node)
The serious-work default in 2026. vLLM is now maintained by 2000+ contributors (per GitHub Insights, May 2026) and supports tensor parallelism (TP), pipeline parallelism (PP), data parallelism (DP), expert parallelism (EP), and context parallelism out of the box. On a single node with N matching GPUs, the setup is a single command:
```bash
# 2x or 4x identical GPUs, single host
vllm serve meta-llama/Llama-3.3-70B-Instruct \
  --tensor-parallel-size 4 \
  --dtype auto \
  --max-model-len 32768
```
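Once it's up, the server speaks the OpenAI API (port 8000 by default), so any OpenAI-compatible client works; a quick smoke test:

```bash
# Minimal request against vLLM's OpenAI-compatible endpoint (default: localhost:8000).
curl -s http://localhost:8000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
        "model": "meta-llama/Llama-3.3-70B-Instruct",
        "messages": [{"role": "user", "content": "Summarize tensor parallelism in one sentence."}],
        "max_tokens": 64
      }'
```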
Practical notes from operators in 2026:
- Match the GPUs. Mixing a 4090 + 3090 works but tensor parallelism aligns to the slowest card — no benefit from the faster one.
- PCIe lanes matter. For 2× GPUs, x8+x8 PCIe 4.0 is fine. For 4×, you want x4 PCIe 5.0 per card at minimum, or NVLink. Consumer motherboards bottleneck at ~3 GPUs without a HEDT / Threadripper board (see the topology check below).
- Power + thermals. 4× 4090 ≈ 1800W of GPU TDP alone at load — plan on a 1600W+ PSU with the cards power-limited, or dual PSUs, plus serious airflow. Used 3090s at 350W each are kinder.
- FP8 + Marlin kernels give big throughput wins on Ada / Hopper. Check vLLM release notes for the latest quant support before buying — landscape shifts every minor release.
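Before committing to a board, check what the GPUs actually negotiate; two stock nvidia-smi queries tell you most of it:

```bash
# How are the GPUs wired to each other? (NVLink, shared PCIe switch, or across the CPU)
nvidia-smi topo -m

# What PCIe generation and lane width did each card actually negotiate?
nvidia-smi --query-gpu=index,name,pcie.link.gen.current,pcie.link.width.current --format=csv
```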
Concrete builds in the catalog: dual-3090, quad-3090, dual-4090, H100 tensor-parallel workstation.
Path 2 — Mac Studio chain (Exo + Thunderbolt 5)
The 2026 surprise. Exo (exo-explore/exo) treats a chain of Mac Studios as a single inference target — peer-to-peer topology, automatic device discovery, dynamic layer-wise partitioning. Thunderbolt 5 carries activations between machines; RDMA support added in 2026 cuts inter-node latency by ~99% vs older TCP path.
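A minimal bring-up sketch for a two-Studio chain, based on the exo README at the time of writing — the CLI, default port (52415 here), and short model names shift between releases, so treat the specifics as assumptions and check the repo:

```bash
# On EACH Mac Studio (Thunderbolt bridge or same LAN): install exo from source and start it.
# There is no master node -- every machine runs the same command and peers auto-discover.
git clone https://github.com/exo-explore/exo.git
cd exo && pip install -e .
exo

# From any machine: exo exposes a ChatGPT-compatible API (port 52415 in recent releases;
# verify against your installed version). The model name below is illustrative.
curl -s http://localhost:52415/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{"model": "llama-3.1-70b", "messages": [{"role": "user", "content": "hello"}]}'
```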
Headline numbers operators reported in early 2026:
- 2× Mac Studio M4 Max 128GB: 256GB pooled unified memory. Llama 3.1 70B Q8 + 32K context fits comfortably. ~$7,400 total. Two Studios on a desk, single Thunderbolt 5 cable.
- 8× M4 Pro Mac Mini: DeepSeek V3 671B at 5.37 tok/s C. Less per-machine memory but more parallelism. ~$12k. The “frontier-on-a-shelf” demo of 2026.
Honest caveats:
- Single-user only in practice. Exo's peer-to-peer topology doesn't batch across concurrent users the way vLLM does.
- Latency hurts interactivity. Even with Thunderbolt 5 RDMA, inter-machine traffic during decode adds noticeable latency. Long-context batch workflows (RAG, code generation) work better than live chat.
- MLX multi-machine still beta. Apple's official multi-machine MLX support (ml-explore/mlx#1046) is landing through 2026. Until it ships GA, Exo is the practical path.
Catalog reference: multi-machine Apple cluster stack.
Path 3 — multi-node CUDA
When one node isn't enough. vLLM supports multi-node inference by combining tensor parallelism (within a node) with pipeline parallelism (across nodes):
```bash
# 2 nodes, 8 GPUs per node, 16 total
vllm serve <model> \
  --tensor-parallel-size 8 \
  --pipeline-parallel-size 2 \
  --max-model-len 65536
```
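The command above assumes the nodes are already joined into one Ray cluster — vLLM's default multi-node backend. A minimal bring-up, with placeholder addresses:

```bash
# On the head node: start Ray and note its address.
ray start --head --port=6379

# On each additional node: join the cluster (replace with the head node's real IP).
ray start --address=<head-node-ip>:6379

# Confirm all 16 GPUs are visible before launching `vllm serve` from the head node.
ray status
```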
This isn't homelab territory. You need:
- High-bandwidth interconnect. InfiniBand HDR/NDR (200-400 Gbps) or RoCE for low-latency cross-node activation traffic. 10G Ethernet works but pipeline-parallel performance tanks (there's a quick fabric check after this list).
- Orchestration. Ray Serve for the simple case, Kubernetes (KServe, vLLM Production Stack) for production. Single-node vLLM + Ray cluster is the lightest path.
- Identical hardware. Like single-node TP, mixed-GPU nodes hurt — match everything.
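When pipeline-parallel throughput disappoints, benchmark the fabric before blaming vLLM. A sketch using NVIDIA's nccl-tests — the host list and rank counts are placeholders for a 2-node × 8-GPU setup:

```bash
# Build nccl-tests with MPI support so one run can span nodes
# (set MPI_HOME=<path> if your MPI install isn't in the default location).
git clone https://github.com/NVIDIA/nccl-tests.git
cd nccl-tests && make MPI=1

# All-reduce bandwidth across 2 nodes x 8 GPUs, one rank per GPU.
# Bus bandwidth far below the link's rated speed points at the interconnect, not vLLM.
mpirun -np 16 -H node1:8,node2:8 ./build/all_reduce_perf -b 8 -e 1G -f 2 -g 1
```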
Multi-node is the right answer for production inference APIs serving real traffic, not for a single operator wanting more capacity on their desk. Distributed-inference homelab stack has a smaller-scale walkthrough.
Decentralized & Kubernetes-native paths
Beyond the three primary paths above, four 2026-current frameworks deserve a place in the decision space — they don't replace vLLM / Exo / multi-node, but they fill specific gaps:
Hyperspace AI — fully decentralized P2P inference
Hyperspace is a peer-to-peer network for AI inference built on libp2p (same stack that powers IPFS). No central servers: your node joins a global mesh, queries route via DHT + gossip to whichever peer has the best model loaded for the request, and the network maintains a three-layer distributed cache where the first node to answer a question pays the compute cost — every subsequent caller gets the verified result for free. As of April 2026, Hyperspace self-reports the network at ~2M nodes and ~3.6M downloads C (per project README — a vendor-published metric with no independent audit; treat as self-reported scale, not verified active devices). The platform supports any GGUF model (Qwen 3.5 32B, GLM-5 Turbo, etc.) across the mesh. Browser / CLI / tray-app clients.
The fit: distributing compute across willing peers rather than owning the cluster yourself. Trade-off is shared throughput and unpredictable latency vs zero hardware spend.
Parallax — heterogeneous GPU pool serving
Parallax (from Gradient HQ) turns a pool of mismatched GPUs into a single inference target. Its two-phase scheduler does model allocation (placing layers across diverse GPUs) and request-time pipeline selection. The specific gap it fills: you have a 3090 + a 4090 + a 5080 in different rooms, and you want them to act as one cluster despite the speed mismatch that breaks vanilla vLLM tensor parallelism.
llm-d — Kubernetes-native distributed inference
llm-d joined CNCF as a Sandbox project in March 2026, founded by Red Hat, Google Cloud, IBM Research, CoreWeave, and NVIDIA. The headline architectural choice: disaggregated prefill + decode — prefill and decode phases run in separate pods on separate accelerator types, with KV-cache routing between them. The v0.5 release adds hierarchical KV offloading, cache-aware LoRA routing, active-active HA, UCCL-based transport, and scale-to-zero. This is what a production inference platform looks like at the Kubernetes layer in 2026 — overkill for a homelab, the right answer for an org serving real traffic.
Ray Serve — orchestration layer (still the default)
Ray Serve remains the lightest-weight orchestration layer for composing multiple models + business logic. Pairs naturally with vLLM (vLLM ships a Ray-backed multi-node option). Pick this when the production-Kubernetes stack is too heavy and you just want “run multiple vLLM instances across a couple of boxes.”
GPU-rental networks (Hyperbolic, io.net, Akash)
Different category, but worth naming because operators ask: these are commercial GPU-rental marketplaces rather than self-hosted clusters. Hyperbolic focuses on low-latency model runs; io.net aggregates idle enterprise GPUs; Akash is a decentralized cloud marketplace. None of these are “local AI” in the operator-grade sense — but they fill the same gap as a cluster (run a model bigger than your box can hold) without the capex.
What doesn't work (yet)
- Petals. The volunteer-compute peer-to-peer model was promising in 2023-2024; public-network latency remains prohibitive for interactive workflows and observed activity has declined since the 2023 peak. Niche use survives in a few research projects.
- Heterogeneous mixed-vendor clusters. Trying to combine an NVIDIA box with an AMD box with a Mac in one cluster is technically possible (Exo supports it) but practically painful — different quant formats, different kernels, different perf envelopes. Stick to one vendor per cluster.
- Consumer-WiFi distributed inference. Without wired interconnect (Thunderbolt 5 / InfiniBand / NVLink), every link is a bottleneck. Don't try to chain machines over WiFi.
- Cluster for a model that fits on one GPU. If your model fits on a single 24GB or 48GB card, clustering it costs throughput and adds complexity for near-zero gain. The cluster overhead is real.
Cluster software stack
| Layer | CUDA cluster | Mac cluster |
|---|---|---|
| Inference runner | vLLM | Exo (mlx-lm beta for multi-machine) |
| Parallelism | Tensor + pipeline parallel | Layer-wise sharding (Exo auto) |
| Interconnect | NVLink / InfiniBand / RoCE / PCIe 5.0 | Thunderbolt 5 (with RDMA) |
| Orchestration | Ray Serve / Kubernetes / KServe | Exo auto-discovery (no extra layer) |
| OpenAI-API frontend | vLLM OpenAI server | Exo built-in |
Pick your path
The honest decision tree:
- Need to run a model that won't fit on a single 24GB / 48GB / 192GB rig? If no, no cluster needed — try /will-it-run first. If yes, continue.
- Quiet + low power + huge unified memory? Mac Studio chain (Path 2). 2× M4 Max 128GB is the ~$7,400 entry.
- Multi-user concurrent inference? vLLM tensor parallel (Path 1) on a CUDA workstation. Continuous batching + PagedAttention is built for this.
- Frontier-class throughput, willing to ops? Multi-node CUDA (Path 3). Ray Serve + InfiniBand + identical H100/H200 hardware.
- None of the above + you want to try anyway? You probably don't need a cluster. Re-read § 2 or run /cost-calculator — clusters look very different against cloud APIs until you hit serious volume.
The cheapest credible 48GB pooled-VRAM cluster.
The Mac-Studio-chain build walkthrough.
Multi-node CUDA at a homelab scale (no datacenter networking).
Cluster TCO vs cloud at your usage pattern.