Hardware combinations for local AI
Dual GPUs, quad GPUs, mixed cards, Apple unified memory, Exo clusters, distributed serving. The honest answer to “what hardware combination should I build to run this model well?” — with effective-VRAM math, runtime compatibility, failure modes, and who should avoid each setup.
Combinations (8)
Each combo links to operator-grade detail with topology diagram, runtime compatibility matrix, failure modes, and recommended models.
vLLM tensor-parallel 4× H100 80GB workstation
Datacenter-tier serving rig: 4× H100 80GB SXM with NVLink-Switch fabric. 320 GB total / ~300 GB effective. The reference vLLM tensor-parallel deployment for production.
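A minimal sketch of the launch this rig exists for, using vLLM's offline `LLM` API. The model name is a placeholder and the memory fraction is a typical starting point, not a tuned value:

```python
from vllm import LLM, SamplingParams

# Tensor parallelism shards every weight matrix across the four H100s;
# activations are all-reduced over the NVLink fabric at each layer.
llm = LLM(
    model="meta-llama/Llama-3.1-70B-Instruct",  # placeholder; any TP-friendly model
    tensor_parallel_size=4,
    gpu_memory_utilization=0.90,  # per-card headroom is where 320 GB -> ~300 GB goes
)

out = llm.generate(
    ["Explain tensor parallelism in one paragraph."],
    SamplingParams(max_tokens=256, temperature=0.7),
)
print(out[0].outputs[0].text)
```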
4× Mac Mini M4 Pro Exo cluster (256 GB total)
Four Mac Mini M4 Pro nodes with 64 GB unified memory each, connected via Thunderbolt 5. Exo distributes layers across machines. 256 GB total / ~180 GB effective for inference.
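Once exo is running on each Mini, any node answers OpenAI-compatible requests on behalf of the whole cluster. A client sketch, assuming exo's default API port (52415 in recent versions; older builds differ) and a placeholder hostname and model id:

```python
from openai import OpenAI

# Point an OpenAI-compatible client at any node in the exo cluster.
# ASSUMPTIONS: port 52415 and the model id vary by exo version;
# "mac-mini-1.local" is a placeholder hostname.
client = OpenAI(base_url="http://mac-mini-1.local:52415/v1", api_key="exo")

resp = client.chat.completions.create(
    model="llama-3.1-70b",  # placeholder id; exo maps it to the layer-sharded model
    messages=[{"role": "user", "content": "Summarize how your layers are split."}],
)
print(resp.choices[0].message.content)
```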
Mac Studio M3 Ultra 192GB
Apple Silicon flagship with 192 GB unified memory. Genuinely pools: one memory space, no sharding. 192 GB total / ~144 GB GPU-usable at the default macOS wired limit (raisable via sysctl). Trades NVIDIA throughput for the largest model envelope at any reasonable power budget.
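The envelope math, as a sketch: the ~75% default GPU wired limit is an assumption that varies by macOS version and can be raised with `sysctl iogpu.wired_limit_mb`:

```python
# Fit check under unified memory. ASSUMPTION: macOS wires roughly 75% of
# RAM for the GPU by default; raising the limit recovers most of the rest.
total_gb = 192
effective_gb = total_gb * 0.75          # ~144 GB GPU-usable out of the box

weights_gb = 70 * 1.0                   # e.g. a 70B model at 8-bit, ~1 byte/weight
kv_budget_gb = effective_gb - weights_gb
print(f"~{effective_gb:.0f} GB effective, ~{kv_budget_gb:.0f} GB left for KV cache")
```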
Quad RTX 3090 (24 GB × 4)
Four used 3090s in a homelab chassis. 96 GB total / ~88 GB effective. The cheapest path to 100B+ class models and high-concurrency 70B serving.
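The total-versus-effective pattern running through these listings is per-card overhead. A toy helper, assuming ~2 GB per card for CUDA context, NCCL buffers, and runtime overhead (the real reserve varies by stack and driver):

```python
def effective_vram_gb(cards: int, vram_per_card_gb: float,
                      per_card_reserve_gb: float = 2.0) -> float:
    """Aggregate VRAM usable for weights + KV cache under tensor parallelism.

    ASSUMPTION: ~2 GB per card is lost to CUDA context, NCCL buffers,
    and framework overhead; measure your own stack before trusting this.
    """
    return cards * (vram_per_card_gb - per_card_reserve_gb)

print(effective_vram_gb(4, 24))  # quad 3090 -> 88.0, matching the figure above
```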
Ray Serve multi-node distributed inference (4 nodes × 2× RTX 4090)
Distributed serving across 4 machines, each with 2× RTX 4090. Ray Serve orchestrates replicas. 192 GB total / ~45 GB effective per two-GPU replica. Built for high-concurrency request routing, not single-large-model deployment.
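A minimal Ray Serve sketch of the replica layout, with generation stubbed out; in practice each replica would wrap an engine such as vLLM with `tensor_parallel_size=2`:

```python
from ray import serve

@serve.deployment(num_replicas=4, ray_actor_options={"num_gpus": 2})
class Chat:
    def __init__(self):
        # Placeholder: each replica would load a model sharded across
        # its node's two 4090s (e.g. a vLLM engine with tensor_parallel_size=2).
        self.engine = None

    async def __call__(self, request):
        prompt = (await request.json())["prompt"]
        return {"echo": prompt}  # stand-in for real generation

serve.run(Chat.bind())  # Ray schedules one 2-GPU replica per node
```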
Dual RTX 3090 (24 GB × 2)
The reference dual-GPU local-AI rig. NVLink optional. 48 GB total / ~46 GB effective with tensor parallelism. The cheapest path to 70B-class models at 2025-2026 prices.
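What ~46 GB effective buys in context length, as back-of-envelope math assuming a Llama-70B-like shape (80 layers, 8 GQA KV heads, head dim 128), a ~40 GB 4-bit checkpoint, and fp16 KV cache:

```python
# KV cache cost per token: K and V, per layer, per KV head, per head dim.
layers, kv_heads, head_dim, bytes_fp16 = 80, 8, 128, 2
kv_bytes_per_token = 2 * layers * kv_heads * head_dim * bytes_fp16  # 320 KiB

effective_gb, weights_gb = 46, 40
budget = (effective_gb - weights_gb) * 1024**3  # ~6 GiB left after weights
print(f"{kv_bytes_per_token / 1024:.0f} KiB/token -> "
      f"~{budget // kv_bytes_per_token:,} tokens of context")  # ~19,660
```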
Dual RTX 4090 (24 GB × 2)
Two consumer-flagship cards. PCIe 4.0 only — no NVLink on 4090. 48 GB total / ~45 GB effective with tensor parallelism. ~30% faster decode than dual 3090 at 2× the cost.
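Without NVLink, and with PCIe P2P disabled on stock GeForce drivers, tensor-parallel traffic stages through host memory. A quick check of what your driver actually exposes (expect "no" on an unmodified 4090 pair):

```python
import torch

# Peer access lets one GPU read another's memory directly over the bus;
# when absent, all-reduce traffic bounces through system RAM instead.
n = torch.cuda.device_count()
for a in range(n):
    for b in range(n):
        if a != b:
            ok = torch.cuda.can_device_access_peer(a, b)
            print(f"GPU {a} -> GPU {b}: peer access {'yes' if ok else 'no'}")
```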
RTX 4090 + RTX 3090 (asymmetric 24+24 GB)
Asymmetric multi-GPU: a 4090 paired with a 3090. PCIe 4.0 only, with mismatched SM counts and memory bandwidth. Capacity matches at 24 GB per card, but tensor-parallel throughput is pinned to the slower 3090; layer splits tolerate the mismatch better.
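A layer-split sketch using llama-cpp-python, which lets mismatched cards each run their own layers at their own pace. The model path is a placeholder, the 60/40 split assumes the 4090 enumerates as device 0 and is a starting point to tune, and constant names can shift across llama-cpp-python versions:

```python
from llama_cpp import Llama, LLAMA_SPLIT_MODE_LAYER

# Layer (pipeline) split: each GPU owns a contiguous block of layers,
# so the faster 4090 can simply be given more of them.
llm = Llama(
    model_path="models/qwen2.5-72b-q4_k_m.gguf",  # placeholder GGUF path
    n_gpu_layers=-1,                    # offload every layer to the GPUs
    split_mode=LLAMA_SPLIT_MODE_LAYER,
    tensor_split=[0.6, 0.4],            # share of layers per GPU: [4090, 3090]
)
print(llm("Say hi in five words.", max_tokens=16)["choices"][0]["text"])
```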
Going deeper
- Running local AI on multiple GPUs in 2026 — the flagship buying / deployment guide.
- Distributed inference systems — architectural depth on tensor / pipeline / expert routing.
- Execution stacks — full deployment recipes that pair combos with runtimes and models.
- Hardware catalog — single-GPU baselines that the combos here build on.