Hardware combinations for local AI
Dual GPUs, quad GPUs, mixed cards, Apple unified memory, Exo clusters, distributed serving. The honest answer to “what hardware combination should I build to run this model well?” — with effective-VRAM math, runtime compatibility, failure modes, and who should avoid each setup.
Combinations (2)
Each combo links to operator-grade detail with topology diagram, runtime compatibility matrix, failure modes, and recommended models.
vLLM tensor-parallel 4× H100 80GB workstation
Datacenter-tier serving rig: 4× H100 80GB SXM with NVLink-Switch fabric. 320 GB total / ~300 GB effective. The reference vLLM tensor-parallel deployment for production.
4× Mac Mini M4 Pro Exo cluster (256 GB total)
Four Mac Mini M4 Pro nodes with 64 GB unified memory each, connected via Thunderbolt 5. Exo distributes layers across machines. 256 GB total / ~180 GB effective for inference.
Going deeper
- Running local AI on multiple GPUs in 2026 — the flagship buying / deployment guide.
- Distributed inference systems — architectural depth on tensor / pipeline / expert routing.
- Execution stacks — full deployment recipes that pair combos with runtimes and models.
- Hardware catalog — single-GPU baselines that the combos here build on.