Multi-GPU decision intelligence

Hardware combinations for local AI

Dual GPUs, quad GPUs, mixed cards, Apple unified memory, Exo clusters, distributed serving. The honest answer to “what hardware combination should I build to run this model well?” — with effective-VRAM math, runtime compatibility, failure modes, and who should avoid each setup.

By Fredoline Eruo · Updated continuously

Combinations (8)

Each combo links to operator-grade detail: a topology diagram, a runtime compatibility matrix, failure modes, and recommended models.

vLLM tensor-parallel 4× H100 80GB workstation

Datacenter-tier serving rig: 4× H100 80GB SXM with NVLink-Switch fabric. 320 GB total / ~300 GB effective. The reference vLLM tensor-parallel deployment for production.

Single-node multi-GPU · NVLink-Switch · expert
VRAM 300/320 GB
Power 2800W
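As a rough illustration of how the tensor-parallel degree maps to the GPU count on a rig like this, the sketch below builds a `vllm serve` command line. The helper name `vllm_serve_command` and the model id are hypothetical placeholders, not recommendations from this page; only the `--tensor-parallel-size` flag is taken from vLLM itself.

```python
import shlex


def vllm_serve_command(model: str, num_gpus: int) -> str:
    """Build a `vllm serve` command that shards one model across num_gpus GPUs."""
    if num_gpus < 1:
        raise ValueError("num_gpus must be at least 1")
    # Tensor parallelism splits each layer across the GPUs, so the degree
    # is normally set equal to the number of cards in the node.
    args = ["vllm", "serve", model, "--tensor-parallel-size", str(num_gpus)]
    return shlex.join(args)


# Placeholder model id (assumption); substitute the model you actually serve.
print(vllm_serve_command("your-org/your-model", 4))
```

On the four-GPU rig above the degree would be 4; the same helper with `num_gpus=2` covers the dual-card setups further down the list.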

4× Mac Mini M4 Pro Exo cluster (256 GB total)

Four Mac Mini M4 Pro nodes with 64 GB unified memory each, connected via Thunderbolt 5. Exo distributes layers across machines. 256 GB total / ~180 GB effective for inference.

Apple cluster · Thunderbolt · expert
VRAM 180/256 GB
Power 600W

Mac Studio M3 Ultra 192GB

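Apple Silicon flagship with 192 GB unified memory. The memory genuinely pools: CPU and GPU share one pool, so there is no per-card split, though only about 140 GB of the 192 GB is listed as usable for inference. Trades NVIDIA throughput for the largest model envelope at any reasonable power budget.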

Apple unified · Unified memory · beginner
VRAM 140/192 GB
Power 370W
Pooled

Quad RTX 3090 (24 GB × 4)

Four used 3090s in a homelab chassis. 96 GB total / ~88 GB effective. The cheapest path to 100B+ class models and high-concurrency 70B serving.

Single-node multi-GPU · NVLink · advanced
VRAM 88/96 GB
Power 1400W

Ray Serve multi-node distributed inference (4 nodes × 2× RTX 4090)

Distributed serving across 4 machines, each with 2× RTX 4090. Ray Serve orchestrates replicas. 192 GB total / ~80 GB per replica. Built for high-concurrency request routing, not single-large-model deployment.

Distributed · 10 GbE · expert
VRAM 80/192 GB (per replica / total)
Power 3600W

Dual RTX 3090 (24 GB × 2)

The reference dual-GPU local-AI rig. NVLink optional. 48 GB total / ~46 GB effective with tensor parallelism. The cheapest path to 70B-class models at 2025-2026 prices.

Single-node multi-GPU · NVLink · intermediate
VRAM 46/48 GB
Power 700W
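To make the "70B-class" claim concrete, here is a minimal sketch of the weight-footprint arithmetic against the ~46 GB effective figure. The function names and the 4 GB headroom for KV cache and runtime buffers are assumptions chosen for illustration, not measured values; real headroom depends on context length and runtime.

```python
def weight_gb(params_billion: float, bits_per_weight: float) -> float:
    """Approximate weight footprint in GB (1 GB = 1e9 bytes)."""
    return params_billion * bits_per_weight / 8


def fits(params_billion: float, bits_per_weight: float,
         effective_vram_gb: float, headroom_gb: float = 4.0) -> bool:
    """True if weights plus an assumed headroom fit in effective VRAM."""
    return weight_gb(params_billion, bits_per_weight) + headroom_gb <= effective_vram_gb


# 70B parameters against the ~46 GB effective figure listed for dual 3090s.
for bits in (16, 8, 4):
    print(f"{bits}-bit: {weight_gb(70, bits):.0f} GB weights, fits: {fits(70, bits, 46)}")
```

Under these assumptions a 70B model needs about 140 GB at 16-bit and 70 GB at 8-bit, so only the 4-bit variant (about 35 GB of weights) fits in 46 GB.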

Dual RTX 4090 (24 GB × 2)

Two consumer-flagship cards. PCIe 4.0 only — no NVLink on 4090. 48 GB total / ~45 GB effective with tensor parallelism. ~30% faster decode than dual 3090 at 2× the cost.

Single-node multi-GPU · PCIe · intermediate
VRAM 45/48 GB
Power 900W

RTX 4090 + RTX 3090 (asymmetric 24+24 GB)

Asymmetric multi-GPU: a 4090 paired with a 3090. PCIe 4.0 only — different SM counts, different memory bandwidth. Effective VRAM is bottlenecked by the slower card on most split strategies.

Mixed GPU · PCIe · advanced
VRAM 42/48 GB
Power 800W
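The effective-VRAM and power figures on the eight cards above can be compared side by side with a little arithmetic. The sketch below is illustrative only: the numbers are copied from the cards, the names `COMBOS` and `summarize` are hypothetical, and watts per effective GB is an assumed comparison metric rather than one this page defines.

```python
# (name, effective GB, total GB, listed power in watts), copied from the cards above.
COMBOS = [
    ("4x H100 80GB", 300, 320, 2800),
    ("4x Mac Mini M4 Pro (Exo)", 180, 256, 600),
    ("Mac Studio M3 Ultra", 140, 192, 370),
    ("Quad RTX 3090", 88, 96, 1400),
    # The 80 GB figure here is per replica, so this row is not directly comparable.
    ("Ray Serve 4 nodes x 2x RTX 4090", 80, 192, 3600),
    ("Dual RTX 3090", 46, 48, 700),
    ("Dual RTX 4090", 45, 48, 900),
    ("RTX 4090 + RTX 3090", 42, 48, 800),
]


def summarize(combos):
    """Return (name, usable fraction, watts per effective GB) for each combo."""
    return [(name, eff / total, watts / eff) for name, eff, total, watts in combos]


# Sort by watts per effective GB, lowest (most power-efficient capacity) first.
for name, usable, w_per_gb in sorted(summarize(COMBOS), key=lambda row: row[2]):
    print(f"{name:34s} usable {usable:6.1%}  {w_per_gb:5.1f} W per effective GB")
```

Sorting by a different column, such as usable fraction, only requires changing the `key` index.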

Going deeper