Hardware combinations for local AI
Dual GPUs, quad GPUs, mixed cards, Apple unified memory, Exo clusters, distributed serving. The honest answer to “what hardware combination should I build to run this model well?” — with effective-VRAM math, runtime compatibility, failure modes, and who should avoid each setup.
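Every combo below is judged with the same effective-VRAM math: raw capacity minus per-GPU runtime overhead, summed across cards. A minimal sketch of that arithmetic, with overhead figures that are illustrative assumptions rather than measurements:

```python
# Illustrative effective-VRAM estimate for a multi-GPU rig.
# Per-GPU overhead (CUDA context, runtime buffers, fragmentation) is an
# assumed figure here; real numbers vary by runtime and driver.

def effective_vram_gb(per_gpu_gb: float, num_gpus: int,
                      overhead_gb_per_gpu: float = 1.0) -> float:
    """Usable pool after per-GPU overhead, assuming tensor parallelism
    shards weights and KV cache evenly across cards."""
    return num_gpus * (per_gpu_gb - overhead_gb_per_gpu)

# Rough sanity checks against the combos below (overheads are guesses):
print(effective_vram_gb(24, 2))        # ~46 GB  -- dual 3090
print(effective_vram_gb(24, 4, 2.0))   # ~88 GB  -- quad 3090
print(effective_vram_gb(80, 4, 5.0))   # ~300 GB -- 4x H100
```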
Combinations (4)
Each combo links to operator-grade detail with topology diagram, runtime compatibility matrix, failure modes, and recommended models.
vLLM tensor-parallel 4× H100 80GB workstation
Datacenter-tier serving rig: 4× H100 80GB SXM with a fully connected NVLink fabric. 320 GB total / ~300 GB effective. The reference vLLM tensor-parallel deployment for production.
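A minimal vLLM launch for this box, sketched with the Python API; the model ID is a placeholder and every flag beyond the two shown stays at its default:

```python
# Sketch of a vLLM tensor-parallel deployment on a 4x H100 node.
# Model choice is a placeholder; pick anything that fits ~300 GB effective.
from vllm import LLM, SamplingParams

llm = LLM(
    model="meta-llama/Llama-3.1-70B-Instruct",  # placeholder model ID
    tensor_parallel_size=4,        # shard weights across all four H100s
    gpu_memory_utilization=0.90,   # leave headroom for CUDA context
)

params = SamplingParams(max_tokens=128, temperature=0.7)
out = llm.generate(["Explain tensor parallelism in one paragraph."], params)
print(out[0].outputs[0].text)
```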
Quad RTX 3090 (24 GB × 4)
Four used 3090s in a homelab chassis. 96 GB total / ~88 GB effective. The cheapest path to 100B+ class models and high-concurrency 70B serving.
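Where does the high-concurrency claim come from? A rough KV-cache budget, assuming Llama-2-70B-style shapes (80 layers, 8 GQA KV heads, head dim 128, fp16 KV) and ~35 GB of ~4-bit weights; all figures are back-of-envelope:

```python
# KV-cache concurrency estimate for 70B serving on ~88 GB effective.
# Architecture shapes and the 4-bit weight footprint are assumptions.

layers, kv_heads, head_dim, kv_bytes = 80, 8, 128, 2  # fp16 K/V

kv_per_token  = 2 * layers * kv_heads * head_dim * kv_bytes  # K + V, ~0.33 MB
kv_per_seq_gb = kv_per_token * 8192 / 1e9                    # 8k ctx, ~2.7 GB

effective_gb = 88.0   # quad 3090 pool (see the sketch above)
weights_gb   = 35.0   # 70B at ~4-bit quantization (assumed)

print(f"concurrent 8k sequences: ~{int((effective_gb - weights_gb) / kv_per_seq_gb)}")
# ~19 -- enough batch depth for genuine multi-user serving.
```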
Dual RTX 3090 (24 GB × 2)
The reference dual-GPU local-AI rig. NVLink optional. 48 GB total / ~46 GB effective with tensor parallelism. The cheapest path to 70B-class models at 2025-2026 prices.
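And the 70B fit itself: the bytes-per-weight figure below is a rough assumption for 4-bit quantization, ignoring quant metadata and activation buffers:

```python
# Back-of-envelope fit check: 70B at ~4-bit on ~46 GB effective.
params_b     = 70e9   # parameter count
q4_bytes     = 0.5    # ~4-bit weights (assumed; metadata ignored)
weights_gb   = params_b * q4_bytes / 1e9   # ~35 GB

effective_gb = 46.0   # dual 3090 with tensor parallelism
print(f"KV-cache headroom: ~{effective_gb - weights_gb:.0f} GB")
# ~11 GB of KV cache: fine for a single user at long context,
# but thin for concurrency -- hence the quad-3090 step up.
```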
Dual RTX 4090 (24 GB × 2)
Two consumer-flagship cards. PCIe 4.0 only — no NVLink on 4090. 48 GB total / ~45 GB effective with tensor parallelism. ~30% faster decode than dual 3090 at 2× the cost.
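With no NVLink, dual-4090 tensor parallelism rides on PCIe, and stock NVIDIA drivers also disable peer-to-peer transfers on the 4090, so communication falls back to staging through host memory. A quick PyTorch probe (a sketch, not a full topology report) shows what your box actually allows:

```python
# Probe GPU peer-to-peer access, which PCIe tensor parallelism leans on.
# On stock drivers a 4090 pair typically reports no peer access, and
# NCCL falls back to copies staged through host memory.
import torch

n = torch.cuda.device_count()
for i in range(n):
    for j in range(n):
        if i != j:
            ok = torch.cuda.can_device_access_peer(i, j)
            print(f"GPU {i} -> GPU {j}: peer access {'yes' if ok else 'no'}")
```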
Going deeper
- Running local AI on multiple GPUs in 2026 — the flagship buying / deployment guide.
- Distributed inference systems — architectural depth on tensor / pipeline / expert routing.
- Execution stacks — full deployment recipes that pair combos with runtimes and models.
- Hardware catalog — single-GPU baselines that the combos here build on.