RUNLOCALAI · v38

Operator-grade instrument for local-AI hardware intelligence. Hand-written verdicts. Real benchmarks. Reproducible commands.


Single-node multi-GPU · NVLink · Advanced

Quad RTX 3090 (24 GB × 4)

Four used 3090s in a homelab chassis. 96 GB total / ~88 GB effective. The cheapest path to 100B+ class models and high-concurrency 70B serving.

By Fredoline Eruo · Reviewed 2026-05-06
Try this build in the custom builder

Tweak GPU count, mix in another card, switch OS / runtime — see which models still fit.

Open in builder →
Memory budget

  • Total VRAM: 96 GB
  • Effective for inference: 88 GB (92% of total)
  • Pooling: none (memory is not pooled across cards)

Four 3090s in a single chassis with PCIe + NVLink (paired bridges between cards 0-1 and 2-3) does not produce 96 GB of pooled VRAM. Tensor parallelism across 4 ranks with vLLM yields ~88 GB effective for model weights: total minus ~2 GB per card for activations, KV cache, and runtime overhead. That envelope fits 100B-class dense models at Q4, but not the largest MoE flagships; DeepSeek V2.5 (236B total / 21B active) needs ~134 GB at Q4 and does not fit. The 88 GB envelope is the realistic ceiling for prosumer multi-GPU before you pay for datacenter hardware.
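The fit arithmetic above can be sketched in a few lines. The ~4.55 bits-per-weight figure for a Q4_K_M-style quant and the 2 GB/card overhead are rough assumptions, not measurements:

```python
# Sketch: does a model fit the quad-3090 envelope? Illustrative numbers only.
GPUS = 4
VRAM_PER_GPU_GB = 24
OVERHEAD_PER_GPU_GB = 2          # activations, KV cache, runtime overhead (rough)

def effective_vram_gb(gpus=GPUS):
    return gpus * (VRAM_PER_GPU_GB - OVERHEAD_PER_GPU_GB)

def weights_gb(params_b, bits_per_weight):
    """Approximate weight footprint: params (billions) * bits / 8 -> GB."""
    return params_b * bits_per_weight / 8

def fits(params_b, bits_per_weight=4.55):   # ~Q4_K_M average, an assumption
    return weights_gb(params_b, bits_per_weight) <= effective_vram_gb()

print(effective_vram_gb())            # 88
print(round(weights_gb(236, 4.55)))   # ~134 -> DeepSeek V2.5 at Q4 does not fit
print(fits(100), fits(236))           # True False
```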

Topology

  • Topology: single-node multi-GPU
  • Interconnect: NVLink (~112.5 GB/s)
  • Component count: 4 units
  • Components: 4× RTX 3090

Recommended runtimes

Runtimes that are operationally viable for this combo. Each links to the runtime’s operational review.

vLLM · SGLang · ExLlamaV2

Supported split strategies

How the model is partitioned across the components. The right strategy depends on model architecture, runtime, and interconnect bandwidth.

Tensor parallel · Pipeline parallel · Expert routing

Why this combo

Quad RTX 3090 is the prosumer ceiling for local AI. Beyond this you're either paying datacenter prices (H100 / A100 cluster) or distributing across multiple machines.

The use case sweet spot:

  • 70B serving at high concurrency (16+ users)
  • 100B-class dense or 50-100B MoE models
  • Long-context (32K-128K) inference where KV cache budget matters
  • Research / experimentation with model splits and serving topologies
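
The long-context bullet is worth quantifying. As a sketch of why the KV-cache budget dominates (the shape below assumes a Llama-3-70B-style GQA model with an fp16 cache; real runtimes and quantized caches vary):

```python
# Per-token KV-cache footprint for a GQA model (assumed Llama-3-70B-ish shape).
layers, kv_heads, head_dim, bytes_per = 80, 8, 128, 2   # fp16 cache
kv_per_token = 2 * layers * kv_heads * head_dim * bytes_per   # K and V planes
ctx = 128 * 1024
gib = kv_per_token * ctx / 2**30
print(kv_per_token)   # 327680 bytes (~320 KiB per token)
print(round(gib))     # ~40 GiB for a single 128K-token stream
```

A single 128K stream eats roughly 40 GiB of the 88 GB envelope on top of the weights, which is why the quad configuration, not dual, is the entry point for serious long-context work.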

What it's NOT for:

  • Single-user latency (dual-3090 is faster per-stream for most models)
  • Production at scale (datacenter hardware wins on reliability per dollar over a 3-year horizon)
  • Anyone whose answer to "where will this live" is "my desk"

Runtime compatibility

  • vLLM ✓ excellent. --tensor-parallel-size 4 is the canonical configuration. Production-default for the combo.
  • SGLang ✓ excellent. RadixAttention compounds harder at 4 ranks because the prefix cache is shared across all 4 GPUs.
  • ExLlamaV2 ✓ good. EXL2 quants are great but rank-4 tensor parallel scales less cleanly than rank-2.
  • Ray Serve ✓ excellent for serving multiple replicas — you can run 2× tensor-parallel-2 instead of 1× tensor-parallel-4 for higher per-stream throughput.
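
A minimal launch sketch for the two configurations above. The model name is a placeholder for whatever fits the 88 GB envelope, and the flags assume a recent vLLM release:

```shell
# Canonical quad-3090 setup: one vLLM server, tensor parallel across all 4 cards.
vllm serve meta-llama/Llama-3.1-70B-Instruct \
  --tensor-parallel-size 4 \
  --gpu-memory-utilization 0.92

# Alternative: two TP-2 replicas for better per-stream latency.
# Front them with Ray Serve or any load balancer.
CUDA_VISIBLE_DEVICES=0,1 vllm serve meta-llama/Llama-3.1-70B-Instruct \
  --tensor-parallel-size 2 --port 8000 &
CUDA_VISIBLE_DEVICES=2,3 vllm serve meta-llama/Llama-3.1-70B-Instruct \
  --tensor-parallel-size 2 --port 8001 &
```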

Split strategy

For 100B-class dense, tensor parallelism rank 4 is the path. Per-stream decode is ~20-25 tok/s; aggregate cluster throughput is high.

For MoE, expert routing across 4 cards lets the router put experts on different GPUs. Mixtral 8x7B / 8x22B / DeepSeek MoE all benefit.

For maximum single-stream latency, 2× tensor-parallel-2 replicas via Ray Serve beats 1× tensor-parallel-4 — counter-intuitive but the cross-card overhead at rank-4 dominates.

Comparison vs alternatives

Metric                  | Quad 3090     | Single H100 80GB | Mac Studio M3 Ultra 192GB
Effective VRAM          | 88 GB         | 80 GB            | ~140 GB usable
Power                   | 1400W         | 700W             | ~370W
Total cost (used / new) | $3,000-4,500  | $20,000-30,000   | $7,000-10,000
Largest model (Q4)      | 100B dense    | 80B dense        | 200B+ dense
Setup difficulty        | Advanced      | Beginner         | Beginner

The Mac Studio M3 Ultra is genuinely competitive for "largest model that runs at all" but trails on tokens-per-second by 3-5×.

Related

  • /hardware-combos/dual-rtx-3090 — entry point to multi-GPU
  • /stacks/distributed-inference-homelab — when even quad-3090 isn't enough
  • /guides/running-local-ai-on-multiple-gpus-2026 — buying guide

Best model classes

  • 100B-class dense models at Q4 — anything that fits in 88 GB. The closest practical targets are smaller MoE flagships and 70B at higher quants.
  • 70B at Q5_K_M / Q6 with extended context — quality lift over dual-3090 Q4 at the cost of throughput per card.
  • High-concurrency 70B serving — 4 cards = 4 tensor-parallel ranks; vLLM serves 16+ concurrent requests at ~30 tok/s each.
  • MoE 50-100B with expert routing — Mixtral 8x22B fits; DeepSeek Coder V2 236B does not fit at Q4 (~134 GB), and only very aggressive sub-Q3 quants approach the 88 GB envelope.

What this combo is bad at

  • 236B+ MoE models — DeepSeek V3 / V4 Pro / Qwen 3.5 235B all exceed 88 GB at any practical quant. Forces CPU offload or cluster.
  • Single-user latency — 4-rank tensor parallelism doesn't outperform 2-rank for single-stream decode; you only win on concurrency.
  • Power-cost-sensitive deployment — 1400W under load × 24/7 = ~12,300 kWh/year = $1,500-2,500/year in electricity at 2026 US rates.
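
The electricity estimate in the last bullet is simple arithmetic; the $/kWh range is an assumed 2026 US residential band, not a quoted rate:

```python
# Back-of-envelope power cost for 24/7 load at full draw.
load_kw = 1.4
hours_per_year = 24 * 365
kwh_year = load_kw * hours_per_year                   # ~12,264 kWh/year
low, high = 0.12, 0.20                                # $/kWh, assumed range
print(round(kwh_year))                                # 12264
print(round(kwh_year * low), round(kwh_year * high))  # ~1472 to ~2453 per year
```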

Who should avoid this

  • Anyone without a basement or server room — thermal + acoustic envelope is unacceptable in living spaces.
  • First-time multi-GPU builders — start with dual before quad.
  • Anyone considering a single H100 80GB — at used-market $20k pricing, H100 makes sense if you can swing it; quad 3090 is the path when budget is hard-capped at $4k.
  • Production users needing reliability — datacenter hardware exists for a reason; quad used consumer cards is a hobby-lab choice.
Power & thermal
~1400W peak

1400W of GPU draw under load demands a server chassis or open-frame mining rig. Consumer tower cases cannot handle this thermally. Plan for a 1600W+ Platinum PSU (or dual PSUs synced via Add2PSU adapter), industrial fans, and either basement / server-room placement or accept the heat output of a small space heater. Open-frame rigs are louder but thermally easier than enclosed chassis.

Reliability

Four used GPUs = four points of failure with unknown service histories. Plan thermal-paste refresh + memory-junction monitoring on all 4 cards before committing to production. Power supply derating: a 1600W PSU under sustained 1400W load runs at 87% capacity — acceptable but at the upper bound. Better to use dual 1000W units.
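
A minimal monitoring loop for all four cards, assuming a standard `nvidia-smi`. Note that it reports core temperature only; GDDR6X memory-junction temps on the 3090 need a third-party reader, so treat these numbers as a lower bound:

```shell
# Poll index, core temp, power draw, and utilization every 5 seconds.
nvidia-smi --query-gpu=index,temperature.gpu,power.draw,utilization.gpu \
  --format=csv -l 5
```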

Recommended OS

Ubuntu 22.04 LTS or 24.04 LTS — Linux is mandatory for this scale.

Operator warning — failure modes

Failure modes specific to quad 3090

  1. Power-supply transient overload. Four 3090s have correlated transient spikes during model load — a 1600W PSU can trip OCP even at "normal" load when the spike compounds. Size for 1800W+ headroom or use dual PSUs.
  2. PCIe lane allocation collapse. Most consumer X670/B650 boards can't deliver x16/x16/x8/x8 to four cards. WRX80 / W790 workstation boards are required for full lane allocation. Without it, tensor-parallel performance suffers.
  3. NVLink pairing constraints. NVLink bridges only work between adjacent slot pairs — you typically get two NVLink pairs (cards 0-1 and 2-3), not a 4-card mesh. Tensor parallelism with rank 4 still works but the cross-pair links go over PCIe.
  4. Chassis thermal compounding. Even with open-frame mounting, one card's exhaust preheating the next card's intake is real. Stagger card placement and add inter-card fans.
  5. Driver-NUMA mismatches. On dual-socket builds, GPUs need to be on the same NUMA node as the CPU running the inference process. Mismatches cost 30%+ throughput silently.
  6. Memory-junction temperature on the inner cards. Cards in slots 2-3 (sandwiched between others) routinely hit memory-junction temps over 110°C. Active inter-card cooling is mandatory.
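
A quick pre-flight sketch for failure modes 3 and 5 above. The NUMA node number and model name are placeholders; read the real affinity from the `topo -m` output:

```shell
# Verify NVLink pairing and NUMA affinity before serving.
nvidia-smi topo -m            # NV# entries = NVLink pairs; PHB/NODE = PCIe-only paths
nvidia-smi nvlink --status    # per-link NVLink state on each card

# Pin the inference process to the NUMA node that owns the GPUs
# (node 0 is an assumption; use the NUMA column from topo -m).
numactl --cpunodebind=0 --membind=0 \
  vllm serve <model> --tensor-parallel-size 4
```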
Closest alternative

vLLM Tensor-Parallel H100 Workstation →

Quad-3090 is the prosumer ceiling at ~$4k; 4× H100 is the production reference at $200k+. The H100 cluster wins on reliability per dollar over a 3-year horizon for orgs.

Featured in stack

Quad RTX 3090 workstation →

Step-by-step setup with WRX80/W790 motherboard, NVLink pair verification, vLLM tensor-parallel-4 + power/thermal warnings.

Benchmark opportunities

Pending measurement targets for this combo. These are estimates, not measurements — actual benchmarks land in the catalog when run.

4× RTX 3090 + DeepSeek R1 Distill Llama 70B (vLLM TP-4)

deepseek-r1-distill-llama-70b
pending
Estimate: 20-28 tok/s decode (with thinking-mode bloat)

Reasoning workload on quad-3090. R1 distill produces 5-15× more tokens per query, so per-query latency balloons, and decode slows as the longer context fills the KV cache, versus a same-size non-reasoning model.

Going deeper

  • All hardware combinations — browse other multi-GPU and multi-machine setups.
  • Running local AI on multiple GPUs in 2026 — the flagship buying / deployment guide.
  • Distributed inference systems — architectural depth.