Single-node multi-GPU · PCIe · intermediate

Dual RTX 4090 (24 GB × 2)

Two consumer-flagship cards. PCIe 4.0 only — no NVLink on 4090. 48 GB total / ~45 GB effective with tensor parallelism. ~30% faster decode than dual 3090 at 2× the cost.

By Fredoline Eruo · Reviewed 2026-05-06

Memory budget

Total VRAM

48 GB

Effective for inference

45 GB

94% of total

Not pooled

Critical: the RTX 4090 has no NVLink; NVIDIA removed the connector. Two 4090s communicate only over PCIe, typically PCIe 4.0 x8 each on a consumer board and x16 each on a workstation board. Cross-card bandwidth is therefore ~32 GB/s at best (x16; about half that at x8), versus 112 GB/s over NVLink on dual 3090. For tensor parallelism this matters: expect a ~10-20% throughput penalty versus an NVLink-equipped pair.

Two 4090s do not pool into 48 GB of usable memory. Runtime overhead and per-card activations cost real VRAM: the 45 GB effective figure implies about 3 GB reserved across the pair, or roughly 1.5 GB per card. Concretely, 70B Q4 fits with marginal headroom.
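The "fits with marginal headroom" claim can be checked with back-of-envelope arithmetic. The sketch below is illustrative only: the 4.5 bits-per-weight figure is an assumption (Q4-class formats store scale metadata on top of the 4-bit values, and the exact overhead varies by format), and the 45 GB budget is the effective figure quoted on this page.

```python
def weights_gb(params_billions: float, bits_per_weight: float) -> float:
    """Approximate weight footprint in GB (1 GB = 1e9 bytes)."""
    return params_billions * bits_per_weight / 8


EFFECTIVE_GB = 45.0     # effective budget quoted on this page for the pair
BITS_PER_WEIGHT = 4.5   # assumption: Q4-class format including scale metadata

needed = weights_gb(70, BITS_PER_WEIGHT)
headroom = EFFECTIVE_GB - needed  # what is left for KV cache and batching

print(f"70B weights: {needed:.1f} GB, headroom: {headroom:.1f} GB")
```

Under these assumptions the weights take a little over 39 GB, leaving only a few GB for KV cache. A heavier Q4 variant or a long context window eats that margin quickly.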

Topology

Single-node multi-GPU

Interconnect

PCIe · ~32 GB/s

Component count

2 units

Components

2 × RTX 4090

Recommended runtimes

Runtimes that are operationally viable for this combo. Each links to the runtime’s operational review.

vLLM SGLang ExLlamaV2

Supported split strategies

How the model is partitioned across the components. The right strategy depends on model architecture, runtime, and interconnect bandwidth.

Tensor parallel · Pipeline parallel

Why this combo

Dual RTX 4090 is the consumer-flagship multi-GPU build for users who want newer-architecture compute and don't mind the NVLink absence. The honest case for it:

~30% faster decode than dual 3090 on the same model (newer SMs, faster GDDR6X)
Better throughput per watt
New-card warranty + reliability
FP8 support for some quantization formats unavailable on Ampere

The honest case against:

Roughly 2-3× the cost of dual 3090 ($4,000+ vs ~$1,500 used)
No NVLink — tensor parallelism over PCIe leaves performance on the table
Doesn't materially expand the model size envelope (still 48 GB)

Runtime compatibility

vLLM ✓ excellent. Tensor parallelism over PCIe works. If you check with nvidia-smi nvlink --status, expect it to show "Disabled"; that is normal on the 4090 and not a fault.
SGLang ✓ excellent. RadixAttention prefix-cache compounds nicely; pick this over vLLM when concurrent request prefix overlap is high.
ExLlamaV2 ✓ excellent. The most-tuned single-/dual-card runtime for consumer NVIDIA.
Ollama ✓ good. Layer-split is functional; doesn't extract maximum throughput.
TensorRT-LLM ✓ supported. The 4090's FP8 support unlocks slightly better TRT-LLM performance vs 3090.

Split strategy

For 70B-class dense, tensor parallelism with vLLM --tensor-parallel-size 2 is the path. Expect 25-35 tok/s decode vs 45+ tok/s on dual H100. The NVLink absence costs the most here — every layer's all-reduce has to go over PCIe.
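As a rough illustration of the tensor-parallel launch described above, the sketch below assembles a `vllm serve` command line. The model ID is a hypothetical placeholder (substitute a quantized 70B checkpoint that you have verified fits), and the context length and memory-utilization values are assumptions chosen for illustration, not tuned recommendations.

```python
import shlex


def vllm_serve_command(model: str,
                       tensor_parallel_size: int = 2,
                       max_model_len: int = 8192,
                       gpu_memory_utilization: float = 0.90) -> list[str]:
    """Build an argv list for `vllm serve` split across both cards."""
    return [
        "vllm", "serve", model,
        "--tensor-parallel-size", str(tensor_parallel_size),
        "--max-model-len", str(max_model_len),
        "--gpu-memory-utilization", str(gpu_memory_utilization),
    ]


# Hypothetical model ID, used only as a placeholder.
cmd = vllm_serve_command("your-org/llama-3.3-70b-instruct-awq")
print(shlex.join(cmd))
```

Building the argv as a list keeps quoting correct if you later pass it to `subprocess.run`. Capping `--max-model-len` is one way to keep KV cache inside the limited headroom on this combo.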

For MoE, the PCIe penalty is much smaller: cross-card traffic is mostly expert routing, and the routing tensors are tiny relative to the expert weights. Mixtral / Qwen MoE / DeepSeek MoE all run efficiently on dual 4090.

When dual 3090 with NVLink is better

If your priority is "biggest model that fits, lowest cost, don't care about peak benchmarks": dual 3090 is the answer. NVLink + 48 GB + $1,500 total beats no-NVLink + 48 GB + $4,000 total for hobby workloads.
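To put numbers on that trade-off, the sketch below compares cost per unit of decode throughput using only figures quoted on this page ($1,500 vs $4,000, ~30% faster decode). The prices are the page's rough estimates, so treat the result as indicative rather than exact.

```python
DUAL_3090_USD = 1500    # used pair, as quoted on this page
DUAL_4090_USD = 4000    # as quoted on this page
DECODE_SPEEDUP = 1.30   # dual 4090 vs dual 3090, as quoted on this page

price_ratio = DUAL_4090_USD / DUAL_3090_USD
# Dollars per unit of decode throughput, relative to dual 3090.
cost_per_throughput_ratio = price_ratio / DECODE_SPEEDUP

print(f"price ratio: {price_ratio:.2f}x")
print(f"cost per unit of throughput: {cost_per_throughput_ratio:.2f}x")
```

On these inputs the 4090 pair costs about twice as much per token per second of decode, which is the arithmetic behind the hobby-workload recommendation.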

When single 5090 is better

If your peak model fits in 32 GB Q4 and you'd otherwise leave the second 4090 idle most of the time: single 5090 wins on power, simplicity, and per-GPU benchmark scores.

/hardware-combos/dual-rtx-3090 — cheaper alternative
/stacks/dual-4090-workstation — full deployment recipe
/guides/running-local-ai-on-multiple-gpus-2026 — multi-GPU buying guide
/systems/distributed-inference — architectural depth

Best model classes

70B-class dense at Q4 — same envelope as dual 3090 but ~30% faster decode. Llama 3.3 70B, Qwen 2.5 72B.
MoE 30-50B with high concurrency — Qwen 3 30B-A3B, Mixtral 8x7B. Two 4090s + vLLM serves 8-16 concurrent agent loops at ~40 tok/s each.
Reasoning workloads at 32B with extended context — DeepSeek R1 Distill Qwen 32B at Q5_K_M with 32K context.

The 4090's compute density (FP16 + FP8 throughput) is what makes this combo attractive over dual-3090 — for reasoning workloads with high token-per-query budgets, the throughput advantage compounds.
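For the extended-context case, KV cache size is worth estimating before committing to a quantization level. The sketch below uses the standard formula for grouped-query attention. The layer count, KV-head count, and head dimension are assumptions believed to match a Qwen 2.5 32B-class model; confirm them against the model's config.json before relying on the result.

```python
def kv_cache_gib(num_layers: int, num_kv_heads: int, head_dim: int,
                 context_tokens: int, bytes_per_elem: int = 2) -> float:
    """KV cache for one sequence in GiB; the factor of 2 covers keys and values."""
    per_token_bytes = 2 * num_layers * num_kv_heads * head_dim * bytes_per_elem
    return per_token_bytes * context_tokens / 2**30


# Assumed architecture values (check config.json): 64 layers, 8 KV heads, head_dim 128.
size = kv_cache_gib(num_layers=64, num_kv_heads=8, head_dim=128,
                    context_tokens=32 * 1024)
print(f"FP16 KV cache at 32K context: {size:.1f} GiB per sequence")
```

With these assumed values, one 32K-token sequence needs about 8 GiB of FP16 KV cache, split across the two cards under tensor parallelism. That cost applies per concurrent sequence.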

What this combo is bad at

Tensor parallelism over PCIe: without NVLink, large models see real cross-card overhead, roughly 10-20% slower than an NVLink-equipped pair would manage on the same model.
120B+ models — same ceiling as dual-3090.
Cost-efficiency — at 2025-2026 prices, dual 4090 costs $4,000+. Used dual 3090 is $1,500. The throughput uplift doesn't justify the price for most workloads.
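To get a feel for the tensor-parallel overhead in the first item, the sketch below estimates per-token all-reduce cost during single-stream decode. It assumes Megatron-style tensor parallelism (two all-reduces per transformer layer) and Llama 70B-class dimensions (80 layers, hidden size 8192). The 10 µs per-operation latency is a placeholder assumption, not a measurement; the real figure depends on the driver, the collective library, and whether PCIe peer-to-peer is active.

```python
def decode_allreduce_cost(hidden_size: int, num_layers: int,
                          bytes_per_elem: int = 2,
                          link_gb_s: float = 32.0,
                          latency_us: float = 10.0):
    """Rough per-token all-reduce cost for batch size 1 decode.

    Returns (all-reduce count, bytes per all-reduce, microseconds per token).
    """
    allreduces = 2 * num_layers                 # one after attention, one after MLP
    bytes_each = hidden_size * bytes_per_elem   # one token's hidden state
    transfer_us = bytes_each / (link_gb_s * 1e9) * 1e6
    per_token_us = allreduces * (latency_us + transfer_us)
    return allreduces, bytes_each, per_token_us


count, size_bytes, cost_us = decode_allreduce_cost(hidden_size=8192, num_layers=80)
print(f"{count} all-reduces of {size_bytes} bytes, ~{cost_us / 1000:.2f} ms per token")
```

The payloads are tiny, so in this model the cost is dominated by per-operation latency rather than bandwidth. The bandwidth term grows with batch size and during prefill, which is when the ~32 GB/s link starts to matter.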

Who should avoid this

Cost-sensitive builders — dual-3090 NVLink is 50-65% cheaper for similar VRAM envelope.
Anyone hoping NVIDIA put NVLink on 4090 — they didn't. Don't buy this expecting NVLink performance.
Workstation users who could just buy an H100 80GB — at $25k+ used pricing in 2026, H100 makes sense for orgs but not individuals.

Power & thermal

~900W peak

Two 450W cards in a single chassis is at the edge of what consumer cooling tolerates. ATX 3.0 PSU at 1200W minimum; expect 1500W for headroom under transient spikes. Most consumer cases (Lian Li O11, Fractal Torrent) handle this with re-tuned fan curves, but the chassis becomes a small space heater. Server chassis (Supermicro, ASRock Rack) handle it more cleanly.
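The PSU guidance can be sketched as simple worst-case arithmetic. The 600 W transient figure comes from the failure-modes section of this page. The 250 W allowance for CPU, drives, and fans is a hypothetical value for illustration, and assuming that both cards spike at the same moment is deliberately pessimistic.

```python
import math


def recommend_psu_watts(n_gpus: int, transient_w_per_gpu: int,
                        rest_of_system_w: int, step: int = 100) -> int:
    """Worst-case simultaneous transient load, rounded up to a PSU size step."""
    worst_case = n_gpus * transient_w_per_gpu + rest_of_system_w
    return math.ceil(worst_case / step) * step


sustained_gpu_w = 2 * 450  # the ~900W peak figure above
psu_w = recommend_psu_watts(n_gpus=2, transient_w_per_gpu=600,
                            rest_of_system_w=250)  # 250 W is an assumed allowance
print(f"sustained GPU load: {sustained_gpu_w} W, suggested PSU: {psu_w} W")
```

Under these assumptions the result lands on the 1500W class, in line with the headroom recommendation above.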

Reliability

RTX 4090 is the most reliable consumer GPU NVIDIA has shipped in years — failure rates from data centers are below 0.5% per year. Power-connector melt issues from launch were largely fixed by 2024 production runs; verify the 16-pin connector is fully seated and avoid the early 12VHPWR adapters.

Recommended OS

Ubuntu 22.04 LTS or 24.04 LTS; Windows 11 also works but Linux extracts more performance.

Operator warning — failure modes

Failure modes specific to dual 4090

No NVLink — the silent throughput killer. Many tutorials assume NVLink; following them on dual-4090 produces 30%+ slower benchmarks than expected. Always verify your runtime is using PCIe-aware tensor-parallel paths.
PCIe lane starvation. Consumer boards typically drop both slots to x8 when two cards are installed. Workstation boards (TRX50, W790) maintain x16/x16 but cost $1,000+. The x8 drop costs roughly another 10% of throughput versus full x16.
12VHPWR connector wear. Two cards = two connectors = two failure points. Don't reuse old PSU adapters; use cables rated for 600W+ each.
Power transient spikes. RTX 4090 transient spikes can exceed 600W per card briefly. A 1200W PSU sized for "2× 450W = 900W" can trigger over-current protection during boost transients. Size for 1500W+ to avoid mysterious shutdowns.
PCIe slot spacing. Two triple-slot (or worse, quad-slot) 4090s require either riser cables or a workstation chassis. Most ATX cases choke the lower card's intake; thermal throttling on the second card is the typical result.
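The lane-starvation item above can be checked on a running system. The sketch below is a minimal example that parses the CSV output of an `nvidia-smi --query-gpu` call and flags cards negotiating fewer than 16 lanes. The sample text is illustrative, not captured from real hardware, and the helper names are hypothetical.

```python
import subprocess

QUERY = [
    "nvidia-smi",
    "--query-gpu=index,name,pcie.link.gen.current,pcie.link.width.current",
    "--format=csv,noheader",
]


def parse_links(csv_text: str) -> list[dict]:
    """Parse 'index, name, gen, width' rows into dictionaries."""
    links = []
    for line in csv_text.strip().splitlines():
        index, name, gen, width = [field.strip() for field in line.split(",")]
        links.append({"index": int(index), "name": name,
                      "gen": int(gen), "width": int(width)})
    return links


def narrow_links(links: list[dict], expected_width: int = 16) -> list[dict]:
    """Return the cards running below the expected lane count."""
    return [link for link in links if link["width"] < expected_width]


def query_links() -> list[dict]:
    """Run nvidia-smi on the local machine (requires the NVIDIA driver)."""
    out = subprocess.run(QUERY, capture_output=True, text=True, check=True).stdout
    return parse_links(out)


# Illustrative sample of what an x8/x8 consumer board might report.
SAMPLE = "0, NVIDIA GeForce RTX 4090, 4, 8\n1, NVIDIA GeForce RTX 4090, 4, 8\n"
for link in narrow_links(parse_links(SAMPLE)):
    print(f"GPU {link['index']}: PCIe gen {link['gen']} x{link['width']}")
```

On a live system, call `query_links()` in place of the sample. The reported link generation can drop at idle for power saving, so check it while the cards are under load.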

Closest alternative

Dual RTX 3090 →

Half the cost, NVLink-equipped (better tensor-parallel), used cards. Pick 3090 unless you specifically need newer architecture or FP8.

Featured in stack

Dual RTX 4090 workstation →

PCIe peer-to-peer verification (no NVLink), FP8 path, vLLM tensor-parallel-2 over PCIe.

Benchmark opportunities

Pending measurement targets for this combo. These are estimates, not measurements — actual benchmarks land in the catalog when run.

Dual RTX 4090 + Llama 3.3 70B Q4 (vLLM tensor-parallel)

llama-3.3-70b-instruct

pending

Estimate: 28-36 tok/s decode (PCIe only)

Reference benchmark for dual-4090 PCIe (no NVLink). Same model + quant as dual-3090 entry; the comparison reveals NVLink vs PCIe impact at tensor-parallel-2.

Going deeper

All hardware combinations — browse other multi-GPU and multi-machine setups.
Running local AI on multiple GPUs in 2026 — the flagship buying / deployment guide.
Distributed inference systems — architectural depth.