Hardware combinations for local AI
Dual GPUs, quad GPUs, mixed cards, Apple unified memory, Exo clusters, distributed serving. The honest answer to “what hardware combination should I build to run this model well?” — with effective-VRAM math, runtime compatibility, failure modes, and who should avoid each setup.
Combinations (8)
Each combo links to operator-grade detail with topology diagram, runtime compatibility matrix, failure modes, and recommended models.
vLLM tensor-parallel 4× H100 80GB workstation
Datacenter-tier serving rig: 4× H100 80GB SXM with NVLink-Switch fabric. 320 GB total / ~300 GB effective. The reference vLLM tensor-parallel deployment for production.
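A minimal sketch of the launch this rig exists for, using vLLM's offline `LLM` API. The model name is a placeholder and the memory fraction is a typical starting point, not a tuned value:

```python
from vllm import LLM, SamplingParams

# Tensor parallelism shards every weight matrix across the four H100s;
# activations are all-reduced over the NVLink fabric at each layer.
llm = LLM(
    model="meta-llama/Llama-3.1-70B-Instruct",  # placeholder; any TP-friendly model
    tensor_parallel_size=4,
    gpu_memory_utilization=0.90,  # per-card headroom is where 320 GB -> ~300 GB goes
)

out = llm.generate(
    ["Explain tensor parallelism in one paragraph."],
    SamplingParams(max_tokens=256, temperature=0.7),
)
print(out[0].outputs[0].text)
```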
4× Mac Mini M4 Pro Exo cluster (256 GB total)
Four Mac Mini M4 Pro nodes with 64 GB unified memory each, connected via Thunderbolt 5. Exo distributes layers across machines. 256 GB total / ~180 GB effective for inference.
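Once exo is running on each Mini, any node answers OpenAI-compatible requests on behalf of the whole cluster. A client sketch, assuming exo's default API port (52415 in recent versions; older builds differ) and a placeholder hostname and model id:

```python
from openai import OpenAI

# Point an OpenAI-compatible client at any node in the exo cluster.
# ASSUMPTIONS: port 52415 and the model id vary by exo version;
# "mac-mini-1.local" is a placeholder hostname.
client = OpenAI(base_url="http://mac-mini-1.local:52415/v1", api_key="exo")

resp = client.chat.completions.create(
    model="llama-3.1-70b",  # placeholder id; exo maps it to the layer-sharded model
    messages=[{"role": "user", "content": "Summarize how your layers are split."}],
)
print(resp.choices[0].message.content)
```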
Mac Studio M3 Ultra 192GB
Apple Silicon flagship with 192 GB unified memory. Genuinely pools: one memory space, no sharding. 192 GB total / ~144 GB GPU-usable at the default macOS wired limit (raisable via sysctl). Trades NVIDIA throughput for the largest model envelope at any reasonable power budget.
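The envelope math, as a sketch: the ~75% default GPU wired limit is an assumption that varies by macOS version and can be raised with `sysctl iogpu.wired_limit_mb`:

```python
# Fit check under unified memory. ASSUMPTION: macOS wires roughly 75% of
# RAM for the GPU by default; raising the limit recovers most of the rest.
total_gb = 192
effective_gb = total_gb * 0.75          # ~144 GB GPU-usable out of the box

weights_gb = 70 * 1.0                   # e.g. a 70B model at 8-bit, ~1 byte/weight
kv_budget_gb = effective_gb - weights_gb
print(f"~{effective_gb:.0f} GB effective, ~{kv_budget_gb:.0f} GB left for KV cache")
```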
Quad RTX 3090 (24 GB × 4)
Four used 3090s in a homelab chassis. 96 GB total / ~88 GB effective. The cheapest path to 100B+ class models and high-concurrency 70B serving.
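The total-versus-effective pattern running through these listings is per-card overhead. A toy helper, assuming ~2 GB per card for CUDA context, NCCL buffers, and runtime overhead (the real reserve varies by stack and driver):

```python
def effective_vram_gb(cards: int, vram_per_card_gb: float,
                      per_card_reserve_gb: float = 2.0) -> float:
    """Aggregate VRAM usable for weights + KV cache under tensor parallelism.

    ASSUMPTION: ~2 GB per card is lost to CUDA context, NCCL buffers,
    and framework overhead; measure your own stack before trusting this.
    """
    return cards * (vram_per_card_gb - per_card_reserve_gb)

print(effective_vram_gb(4, 24))  # quad 3090 -> 88.0, matching the figure above
```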
Ray Serve multi-node distributed inference (4 nodes × 2× RTX 4090)
Distributed serving across 4 machines, each with 2× RTX 4090. Ray Serve orchestrates replicas. 192 GB total / ~45 GB effective per two-GPU replica. Built for high-concurrency request routing, not single-large-model deployment.
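A minimal Ray Serve sketch of the replica layout, with generation stubbed out; in practice each replica would wrap an engine such as vLLM with `tensor_parallel_size=2`:

```python
from ray import serve

@serve.deployment(num_replicas=4, ray_actor_options={"num_gpus": 2})
class Chat:
    def __init__(self):
        # Placeholder: each replica would load a model sharded across
        # its node's two 4090s (e.g. a vLLM engine with tensor_parallel_size=2).
        self.engine = None

    async def __call__(self, request):
        prompt = (await request.json())["prompt"]
        return {"echo": prompt}  # stand-in for real generation

serve.run(Chat.bind())  # Ray schedules one 2-GPU replica per node
```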
Dual RTX 3090 (24 GB × 2)
The reference dual-GPU local-AI rig. NVLink optional. 48 GB total / ~46 GB effective with tensor parallelism. The cheapest path to 70B-class models at 2025-2026 prices.
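What ~46 GB effective buys in context length, as back-of-envelope math assuming a Llama-70B-like shape (80 layers, 8 GQA KV heads, head dim 128), a ~40 GB 4-bit checkpoint, and fp16 KV cache:

```python
# KV cache cost per token: K and V, per layer, per KV head, per head dim.
layers, kv_heads, head_dim, bytes_fp16 = 80, 8, 128, 2
kv_bytes_per_token = 2 * layers * kv_heads * head_dim * bytes_fp16  # 320 KiB

effective_gb, weights_gb = 46, 40
budget = (effective_gb - weights_gb) * 1024**3  # ~6 GiB left after weights
print(f"{kv_bytes_per_token / 1024:.0f} KiB/token -> "
      f"~{budget // kv_bytes_per_token:,} tokens of context")  # ~19,660
```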
Dual RTX 4090 (24 GB × 2)
Two consumer-flagship cards. PCIe 4.0 only — no NVLink on 4090. 48 GB total / ~45 GB effective with tensor parallelism. ~30% faster decode than dual 3090 at 2× the cost.
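Without NVLink, and with PCIe P2P disabled on stock GeForce drivers, tensor-parallel traffic stages through host memory. A quick check of what your driver actually exposes (expect "no" on an unmodified 4090 pair):

```python
import torch

# Peer access lets one GPU read another's memory directly over the bus;
# when absent, all-reduce traffic bounces through system RAM instead.
n = torch.cuda.device_count()
for a in range(n):
    for b in range(n):
        if a != b:
            ok = torch.cuda.can_device_access_peer(a, b)
            print(f"GPU {a} -> GPU {b}: peer access {'yes' if ok else 'no'}")
```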
RTX 4090 + RTX 3090 (asymmetric 24+24 GB)
Asymmetric multi-GPU: a 4090 paired with a 3090. PCIe 4.0 only, with mismatched SM counts and memory bandwidth. Capacity matches at 24 GB per card, but tensor-parallel throughput is pinned to the slower 3090; layer splits tolerate the mismatch better.
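A layer-split sketch using llama-cpp-python, which lets mismatched cards each run their own layers at their own pace. The model path is a placeholder, the 60/40 split assumes the 4090 enumerates as device 0 and is a starting point to tune, and constant names can shift across llama-cpp-python versions:

```python
from llama_cpp import Llama, LLAMA_SPLIT_MODE_LAYER

# Layer (pipeline) split: each GPU owns a contiguous block of layers,
# so the faster 4090 can simply be given more of them.
llm = Llama(
    model_path="models/qwen2.5-72b-q4_k_m.gguf",  # placeholder GGUF path
    n_gpu_layers=-1,                    # offload every layer to the GPUs
    split_mode=LLAMA_SPLIT_MODE_LAYER,
    tensor_split=[0.6, 0.4],            # share of layers per GPU: [4090, 3090]
)
print(llm("Say hi in five words.", max_tokens=16)["choices"][0]["text"])
```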
Going deeper
- Running local AI on multiple GPUs in 2026 — the flagship buying / deployment guide.
- Distributed inference systems — architectural depth on tensor / pipeline / expert routing.
- Execution stacks — full deployment recipes that pair combos with runtimes and models.
- Hardware catalog — single-GPU baselines that the combos here build on.