Dual RTX 4090 (24 GB × 2)
Two consumer-flagship cards. PCIe 4.0 only — no NVLink on 4090. 48 GB total / ~45 GB effective with tensor parallelism. ~30% faster decode than dual 3090 at 2× the cost.
Critical: the RTX 4090 has no NVLink; NVIDIA removed the connector. Two 4090s communicate only over PCIe, typically PCIe 4.0 x8 each on a consumer board or x16 each on a workstation board. Cross-card bandwidth is therefore ~32 GB/s per direction at x16 (about half that at x8), versus 112 GB/s on dual 3090 with NVLink. For tensor parallelism this matters: expect a ~10-20% throughput penalty versus an NVLink-equipped pair. Two 4090s also do not pool to 48 GB usable. Effective VRAM is the total minus ~2-3 GB per card for runtime overhead, activations, and KV cache; concretely, 70B Q4 fits with marginal headroom.
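The ~32 GB/s figure can be reproduced from the link parameters. This is a minimal sketch, assuming PCIe 4.0's 16 GT/s per lane and 128b/130b encoding; it reports theoretical per-direction bandwidth before protocol overhead, and the function name is illustrative.

```python
# Giga-transfers per second per lane for each PCIe generation.
GT_PER_S = {3: 8.0, 4: 16.0, 5: 32.0}


def pcie_gb_per_s(gen: int, lanes: int) -> float:
    """Theoretical one-direction bandwidth in GB/s, before protocol overhead."""
    encoding = 128 / 130  # 128b/130b line code used by PCIe 3.0 and later
    bits_per_s = GT_PER_S[gen] * 1e9 * encoding * lanes
    return bits_per_s / 8 / 1e9


for lanes in (8, 16):
    print(f"PCIe 4.0 x{lanes}: {pcie_gb_per_s(4, lanes):.1f} GB/s per direction")
```

Real transfers land below these numbers, so treat them as ceilings when comparing against the NVLink figure.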
Topology
- 2×rtx-4090
Recommended runtimes
Runtimes that are operationally viable for this combo. Each links to the runtime’s operational review.
Supported split strategies
How the model is partitioned across the components. The right strategy depends on model architecture, runtime, and interconnect bandwidth.
Why this combo
Dual RTX 4090 is the consumer-flagship multi-GPU build for users who want newer-architecture compute and don't mind the NVLink absence. The honest case for it:
- ~30% faster decode than dual 3090 on the same model (newer SMs, faster GDDR6X)
- Lower power per unit of throughput (better performance per watt)
- New-card warranty + reliability
- FP8 support for some quantization formats unavailable on Ampere
The honest case against:
- 2× or more the cost of dual 3090 (about $4,000+ vs $1,500 used at 2025-2026 prices)
- No NVLink — tensor parallelism over PCIe leaves performance on the table
- Doesn't materially expand the model size envelope (still 48 GB)
Runtime compatibility
- vLLM ✓ excellent. PCIe-aware tensor-parallel works, but verify with nvidia-smi nvlink --status (it will show "Disabled"; that's expected, no panic).
- SGLang ✓ excellent. RadixAttention prefix-cache compounds nicely; pick this over vLLM when concurrent request prefix overlap is high.
- ExLlamaV2 ✓ excellent. The most-tuned single-/dual-card runtime for consumer NVIDIA.
- Ollama ✓ good. Layer-split is functional; doesn't extract maximum throughput.
- TensorRT-LLM ✓ supported. The 4090's FP8 support unlocks slightly better TRT-LLM performance vs 3090.
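Whichever runtime you pick, it helps to confirm how the two cards are actually connected. `nvidia-smi topo -m` prints a matrix whose GPU0/GPU1 cell is a short code; the sketch below maps those legend codes to a description. The helper name is hypothetical, and it only classifies a code you paste in; it does not call nvidia-smi itself.

```python
# Legend codes printed by `nvidia-smi topo -m` for a GPU-to-GPU cell.
TOPO_LEGEND = {
    "X": "self",
    "PIX": "PCIe, at most one bridge",
    "PXB": "PCIe, multiple bridges (no host bridge)",
    "PHB": "PCIe via a host bridge (typically the CPU)",
    "NODE": "PCIe across host bridges within one NUMA node",
    "SYS": "PCIe plus the interconnect between NUMA nodes",
}


def describe_link(code: str) -> tuple[bool, str]:
    """Return (is_nvlink, description) for one topology matrix cell."""
    code = code.strip().upper()
    if code.startswith("NV") and code[2:].isdigit():
        return True, f"NVLink, {code[2:]} bonded link(s)"
    if code in TOPO_LEGEND:
        return False, TOPO_LEGEND[code]
    raise ValueError(f"unrecognized topology code: {code!r}")


# On dual 4090 the cell should be one of the PCIe codes, never NV#.
print(describe_link("PHB"))
```

Seeing PHB or PIX on this combo is the expected result, not a misconfiguration.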
Split strategy
For 70B-class dense, tensor parallelism with vLLM --tensor-parallel-size 2 is the path. Expect 25-35 tok/s decode vs 45+ tok/s on dual H100. The NVLink absence costs the most here — every layer's all-reduce has to go over PCIe.
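A launch along these lines can be sketched as follows. The snippet only builds and prints a `vllm serve` command; it does not start a server. The model ID is a placeholder rather than a real repository, and the context length and memory-utilization values are assumptions to tune for your setup.

```python
import shlex


def vllm_serve_argv(model: str, tp_size: int = 2, max_model_len: int = 8192,
                    gpu_mem_util: float = 0.90) -> list[str]:
    """Build the argv for a tensor-parallel vLLM server across both cards."""
    return [
        "vllm", "serve", model,
        "--tensor-parallel-size", str(tp_size),       # one shard per 4090
        "--max-model-len", str(max_model_len),        # assumed context cap
        "--gpu-memory-utilization", str(gpu_mem_util),
    ]


# Placeholder model ID: substitute the 4-bit 70B checkpoint you actually use.
argv = vllm_serve_argv("your-org/llama-3.3-70b-instruct-4bit")
print(shlex.join(argv))
```

Capping the context length is one way to keep KV cache inside the marginal headroom a 70B Q4 model leaves on 2×24 GB.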
For MoE, cross-card traffic matters less, because the routing tensors that cross the link are tiny relative to the expert weights. Mixtral, Qwen MoE, and DeepSeek MoE all run efficiently on dual 4090.
When dual 3090 with NVLink is better
If your priority is "biggest model that fits, lowest cost, don't care about peak benchmarks": dual 3090 is the answer. NVLink + 48 GB + $1,500 total beats no-NVLink + 48 GB + $4,000 total for hobby workloads.
When single 5090 is better
If your peak model fits in 32 GB Q4 and you'd otherwise leave the second 4090 idle most of the time: single 5090 wins on power, simplicity, and per-GPU benchmark scores.
Related
- /hardware-combos/dual-rtx-3090 — cheaper alternative
- /stacks/dual-4090-workstation — full deployment recipe
- /guides/running-local-ai-on-multiple-gpus-2026 — multi-GPU buying guide
- /systems/distributed-inference — architectural depth
Best model classes
- 70B-class dense at Q4 — same envelope as dual 3090 but ~30% faster decode. Llama 3.3 70B, Qwen 2.5 72B.
- MoE 30-50B with high concurrency — Qwen 3 30B-A3B, Mixtral 8x7B. Two 4090s + vLLM serves 8-16 concurrent agent loops at ~40 tok/s each.
- Reasoning workloads at 32B with extended context — DeepSeek R1 Distill Qwen 32B at Q5_K_M with 32K context.
The 4090's compute density (FP16 + FP8 throughput) is what makes this combo attractive over dual-3090 — for reasoning workloads with high token-per-query budgets, the throughput advantage compounds.
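To sanity-check whether a model class fits the envelope, a rough weight-footprint estimate is enough. This is a sketch under stated assumptions: ~4.5 and ~5.7 effective bits per weight for the Q4 and Q5_K_M cases (real quant formats vary), and 2.5 GB per card reserved, the midpoint of the 2-3 GB overhead quoted earlier. All names are illustrative.

```python
def weights_gb(params_billions: float, bits_per_weight: float) -> float:
    """Approximate weight footprint in GB (1 GB = 1e9 bytes)."""
    return params_billions * bits_per_weight / 8


def effective_vram_gb(cards: int = 2, per_card_gb: float = 24.0,
                      overhead_per_card_gb: float = 2.5) -> float:
    """VRAM left for weights after the assumed per-card overhead."""
    return cards * (per_card_gb - overhead_per_card_gb)


budget = effective_vram_gb()
# (label, parameters in billions, assumed effective bits per weight)
for name, params_b, bpw in [("70B dense, Q4", 70.0, 4.5),
                            ("32B dense, Q5_K_M", 32.0, 5.7)]:
    need = weights_gb(params_b, bpw)
    print(f"{name}: ~{need:.1f} GB of weights, "
          f"{budget - need:+.1f} GB against a {budget:.0f} GB budget")
```

Under these assumptions the 70B case leaves only a few GB spare, which is why it fits "with marginal headroom", while the 32B case leaves room for long context.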
What this combo is bad at
- Tensor parallelism over PCIe. Without NVLink, large models see real cross-card overhead: a ~10-20% throughput penalty relative to an NVLink-equipped pair on the same model.
- 120B+ models — same ceiling as dual-3090.
- Cost-efficiency — at 2025-2026 prices, dual 4090 costs $4,000+. Used dual 3090 is $1,500. The throughput uplift doesn't justify the price for most workloads.
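The cost-efficiency point can be put in numbers using this page's own estimates. The sketch assumes 30 tok/s for dual 4090 (the midpoint of the 25-35 tok/s range quoted for 70B decode) and derives the dual-3090 figure from the ~30% uplift claim. These are estimates, not measurements.

```python
def dollars_per_tok_s(price_usd: float, tok_s: float) -> float:
    """Purchase price divided by decode throughput."""
    return price_usd / tok_s


dual_4090_tok_s = 30.0                    # assumed midpoint of 25-35 tok/s
dual_3090_tok_s = dual_4090_tok_s / 1.30  # implied by "~30% faster"

cost_4090 = dollars_per_tok_s(4000, dual_4090_tok_s)
cost_3090 = dollars_per_tok_s(1500, dual_3090_tok_s)

print(f"dual 4090: ${cost_4090:.0f} per tok/s")
print(f"dual 3090: ${cost_3090:.0f} per tok/s")
print(f"ratio: {cost_4090 / cost_3090:.1f}x")
```

On these inputs each unit of decode throughput costs roughly twice as much on dual 4090.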
Who should avoid this
- Cost-sensitive builders — dual-3090 NVLink is 50-65% cheaper for similar VRAM envelope.
- Anyone hoping NVIDIA put NVLink on 4090 — they didn't. Don't buy this expecting NVLink performance.
- Workstation users who could just buy an H100 80GB — at $25k+ used pricing in 2026, H100 makes sense for orgs but not individuals.
Running two 450W cards in a single chassis is at the edge of what consumer cooling tolerates. Use an ATX 3.0 PSU of 1200W minimum; 1500W gives headroom for transient spikes. Most consumer cases (Lian Li O11, Fractal Torrent) handle this with re-tuned fan curves, but the chassis becomes a small space heater. Server chassis (Supermicro, ASRock Rack) handle it more cleanly.
RTX 4090 is the most reliable consumer GPU NVIDIA has shipped in years — failure rates from data centers are below 0.5% per year. Power-connector melt issues from launch were largely fixed by 2024 production runs; verify the 16-pin connector is fully seated and avoid the early 12VHPWR adapters.
Recommended OS: Ubuntu 22.04 LTS or 24.04 LTS. Windows 11 also works, but Linux extracts more performance.
Failure modes specific to dual 4090
- No NVLink — the silent throughput killer. Many tutorials assume NVLink; following them on dual-4090 produces 30%+ slower benchmarks than expected. Always verify your runtime is using PCIe-aware tensor-parallel paths.
- PCIe lane starvation. Consumer boards typically run both slots at x8 when both are populated. Workstation boards (TRX50, W790) maintain x16/x16 but cost $1,000+. The x8 drop costs another ~10% throughput vs full x16.
- 12VHPWR connector wear. Two cards = two connectors = two failure points. Don't reuse old PSU adapters; use cables rated for 600W+ each.
- Power transient spikes. RTX 4090 transient spikes can exceed 600W per card briefly. A 1200W PSU sized for "2× 450W = 900W" can trigger over-current protection during boost transients. Size for 1500W+ to avoid mysterious shutdowns.
- PCIe slot spacing. Two triple-slot (or worse, quad-slot) 4090s require either riser cables or a workstation chassis. Most ATX cases choke the lower card's intake; thermal throttling on the second card is the typical result.
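The PSU sizing argument above comes down to simple arithmetic. This sketch assumes 250W for the rest of the system (CPU, drives, fans), which is a placeholder to replace with your own figure, and it ignores any transient tolerance a given PSU advertises.

```python
def psu_load_w(cards: int, per_card_w: float,
               rest_of_system_w: float = 250.0) -> float:
    """Total draw in watts: GPUs plus an assumed rest-of-system budget."""
    return cards * per_card_w + rest_of_system_w


sustained = psu_load_w(2, 450)  # rated board power per card
transient = psu_load_w(2, 600)  # brief spikes per card, as described above

for psu_w in (1200, 1500):
    verdict = "ok" if transient <= psu_w else "may trip over-current protection"
    print(f"{psu_w}W PSU: sustained {sustained:.0f}W, "
          f"transient {transient:.0f}W -> {verdict}")
```

With these inputs a 1200W unit covers sustained load but not simultaneous spikes on both cards, which matches the "mysterious shutdowns" failure mode.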
Dual RTX 3090 →
Half the cost, NVLink-equipped (better tensor-parallel), used cards. Pick 3090 unless you specifically need newer architecture or FP8.
Dual RTX 4090 workstation →
PCIe peer-to-peer verification (no NVLink), FP8 path, vLLM tensor-parallel-2 over PCIe.
Benchmark opportunities
Pending measurement targets for this combo. These are estimates, not measurements — actual benchmarks land in the catalog when run.
Dual RTX 4090 + Llama 3.3 70B Q4 (vLLM tensor-parallel)
llama-3.3-70b-instruct
Reference benchmark for dual-4090 PCIe (no NVLink). Same model + quant as the dual-3090 entry; the comparison reveals NVLink vs PCIe impact at tensor-parallel-2.
Going deeper
- All hardware combinations — browse other multi-GPU and multi-machine setups.
- Running local AI on multiple GPUs in 2026 — the flagship buying / deployment guide.
- Distributed inference systems — architectural depth.