Multi-GPU decision intelligence

Hardware combinations for local AI

Dual GPUs, quad GPUs, mixed cards, Apple unified memory, Exo clusters, distributed serving. The honest answer to “what hardware combination should I build to run this model well?” — with effective-VRAM math, runtime compatibility, failure modes, and who should avoid each setup.

By Fredoline Eruo · Updated continuously
⚠ Total VRAM ≠ usable VRAM

The single most important rule when reading multi-GPU specs: total VRAM is not pooled VRAM. Two 24 GB cards do NOT give you 48 GB to load a single model into. Each card holds its share of the model via tensor or pipeline parallelism, and runtime overhead eats per-card VRAM. Only Apple unified memory and NVLink-Switch fabrics genuinely pool. Every combo below shows total vs effective with the honest explanation.
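
A back-of-envelope version of that math, as a sketch — the per-card overhead and usable fraction below are illustrative assumptions, not measurements from our benchmark runs:

```python
# Rough effective-VRAM estimate for a model sharded across GPUs.
# Overhead and usable-fraction values are illustrative assumptions.

def effective_vram_gb(cards_gb: list[float],
                      per_card_overhead_gb: float = 1.5,
                      usable_fraction: float = 0.90) -> float:
    """Usable VRAM when weights are split via tensor/pipeline parallelism.

    Each card pays its own runtime overhead (CUDA context, activations,
    KV cache), and with equal sharding the smallest card caps every shard.
    """
    shard_cap = (min(cards_gb) - per_card_overhead_gb) * usable_fraction
    return shard_cap * len(cards_gb)

print(f"{effective_vram_gb([24, 24]):.1f} GB")  # ~40.5 GB, not the 48 GB on the box
```

Once your runtime is actually loaded, nvidia-smi shows the real per-card overhead to plug in instead of the defaults.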

Filter
  • Topology: Any · Single-node multi-GPU · Apple unified · Apple cluster · Mixed GPU · Distributed
  • Difficulty: Any · Beginner · Intermediate · Advanced · Expert
  • Interconnect: Any · PCIe · NVLink · NVLink-Switch · Thunderbolt · Unified
  • Effective VRAM: Any · 40+ GB · 80+ GB · 140+ GB

Combinations (2)

Each combo links to operator-grade detail with topology diagram, runtime compatibility matrix, failure modes, and recommended models.

vLLM tensor-parallel 4× H100 80GB workstation

Datacenter-tier serving rig: 4× H100 80GB SXM with NVLink-Switch fabric. 320 GB total / ~300 GB effective. The reference vLLM tensor-parallel deployment for production.

Single-node multi-GPU · NVLink-Switch · Expert
VRAM 300 / 320 GB (effective / total)
Power 2800W
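
For orientation, a minimal sketch of driving a rig like this through vLLM's Python API. The model name is a placeholder and the memory-utilization value is an assumption; tensor_parallel_size=4 puts one weight shard on each H100.

```python
# Minimal vLLM tensor-parallel sketch for a 4x H100 node.
# The model name is a placeholder; pick anything that fits ~300 GB effective.
from vllm import LLM, SamplingParams

llm = LLM(
    model="meta-llama/Llama-3.1-70B-Instruct",  # placeholder model
    tensor_parallel_size=4,        # one weight shard per H100
    gpu_memory_utilization=0.90,   # assumed headroom for CUDA context + KV cache
)

outputs = llm.generate(
    ["Summarize why NVLink-Switch beats PCIe for tensor parallelism."],
    SamplingParams(max_tokens=64),
)
print(outputs[0].outputs[0].text)
```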

4× Mac Mini M4 Pro Exo cluster (256 GB total)

Four Mac Mini M4 Pro nodes with 64 GB unified memory each, connected via Thunderbolt 5. Exo distributes layers across machines. 256 GB total / ~180 GB effective for inference.

Apple cluster · Thunderbolt · Expert
VRAM 180 / 256 GB (effective / total)
Power 600W
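
The gap between 256 GB total and ~180 GB effective falls out of per-node headroom. A sketch, assuming a ~70% usable fraction per node — an illustrative number chosen to match the figure above; real headroom depends on macOS memory pressure and whatever else each Mini is running:

```python
# Why 4 x 64 GB of unified memory lands near 180 GB effective, not 256 GB.
# USABLE_FRACTION is an illustrative assumption: macOS, the GPU driver, and
# Exo's own runtime each claim a slice of every node's memory.

NODES = 4
UNIFIED_GB = 64
USABLE_FRACTION = 0.70  # assumed; check real memory pressure per node

effective = NODES * UNIFIED_GB * USABLE_FRACTION
per_node = UNIFIED_GB * USABLE_FRACTION
print(f"effective ~{effective:.0f} GB of {NODES * UNIFIED_GB} GB total")
print(f"Exo can place ~{per_node:.0f} GB of layers on each Mac Mini")
```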

Going deeper

  • Running local AI on multiple GPUs in 2026 — the flagship buying / deployment guide.
  • Distributed inference systems — architectural depth on tensor / pipeline / expert routing.
  • Execution stacks — full deployment recipes that pair combos with runtimes and models.
  • Hardware catalog — single-GPU baselines that the combos here build on.