Dual RTX 3090 (24 GB × 2)
The reference dual-GPU local-AI rig. NVLink optional. 48 GB total / ~46 GB effective with tensor parallelism. The cheapest path to 70B-class models at 2025-2026 prices.
PCIe plus an optional NVLink bridge between two RTX 3090s does NOT pool VRAM the way Apple unified memory does. Each card holds its share of the model: half of every layer under tensor parallelism (vLLM / SGLang), or half of the layers under pipeline parallelism (llama.cpp layer split). Effective VRAM is roughly the 48 GB total minus ~1 GB per card for the CUDA context and runtime overhead, or about 46 GB. Concretely: a 70B Q4 model (~40 GB weights) fits with ~6 GB of headroom for context, activations, and KV cache. Anything claiming 48 GB pooled is wrong.
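The fit arithmetic above can be sketched in a few lines of Python. This is an illustration, not a measurement: the function names are hypothetical, the 4.5 bits per weight for a Q4-style quant is an assumption (it folds in quantization scales), and the 46 GB default is the effective figure quoted at the top of this page.

```python
def weights_gb(params_billion: float, bits_per_weight: float) -> float:
    """Approximate weight footprint in GB (1 GB = 1e9 bytes)."""
    return params_billion * bits_per_weight / 8.0


def headroom_gb(params_billion: float, bits_per_weight: float,
                effective_vram_gb: float = 46.0) -> float:
    """VRAM left for context and KV cache; a negative value means no fit."""
    return effective_vram_gb - weights_gb(params_billion, bits_per_weight)


# Assumption: ~4.5 bits/weight for a Q4-style quant, 16 bits/weight for FP16.
print(f"70B Q4 weights:    {weights_gb(70, 4.5):.1f} GB")
print(f"70B Q4 headroom:   {headroom_gb(70, 4.5):.1f} GB")
print(f"70B FP16 headroom: {headroom_gb(70, 16.0):.1f} GB")
```

The same two functions give a quick first check for any parameter count and quant before downloading a model.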
Topology
- 2× RTX 3090
Recommended runtimes
Runtimes that are operationally viable for this combo. Each links to the runtime’s operational review.
Supported split strategies
How the model is partitioned across the components. The right strategy depends on model architecture, runtime, and interconnect bandwidth.
Why this combo
Dual RTX 3090 is the prosumer multi-GPU local-AI baseline in 2025-2026. The pricing math: two used 3090s typically run $1,200-1,800 total vs a new RTX 5090 at $2,000-2,500. You trade newer architecture for 1.5× the VRAM (48 GB vs 32 GB), which is the difference between "70B at Q4" and "32B at Q4."
The choice between dual-3090 and single-5090 is one of the most-asked, least-honestly-answered questions in local AI. The honest framing:
- If your largest target model fits in 32 GB at Q4: single 5090 wins on power, noise, simplicity, support life, and likely benchmarks per dollar of total system cost.
- If you need 70B class: dual 3090 is the cheapest path. Single 5090 + CPU offload is theoretically possible but throughput is ~5-10× worse than dual-card.
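The pricing math can be made concrete as cost per GB of VRAM. A minimal sketch using only the price ranges and capacities quoted in this section; the helper name is hypothetical and the figures ignore the rest of the system (PSU, case, motherboard).

```python
def cost_per_gb(price_low: float, price_high: float, vram_gb: float) -> tuple:
    """Return the (low, high) price per GB of VRAM in dollars."""
    return price_low / vram_gb, price_high / vram_gb


dual_3090 = cost_per_gb(1200, 1800, 48)    # two used cards, 48 GB total
single_5090 = cost_per_gb(2000, 2500, 32)  # one new card, 32 GB

print(f"Dual 3090:   ${dual_3090[0]:.2f}-{dual_3090[1]:.2f} per GB")
print(f"Single 5090: ${single_5090[0]:.2f}-{single_5090[1]:.2f} per GB")
```

On these numbers the dual-3090 route costs roughly half as much per GB, which is the whole case for it when the target model needs the capacity.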
Runtime compatibility
- vLLM ✓ excellent. `--tensor-parallel-size 2` is the canonical configuration. AWQ-INT4 + 8K context is the production-default path.
- SGLang ✓ excellent. Particularly strong when serving multiple agent loops with stable prefix caches.
- ExLlamaV2 ✓ excellent. EXL2 quants are sharper than GGUF at equivalent size; the dual-3090 ExLlamaV2 path is the throughput leader.
- Ollama ✓ good but not optimal. Ollama's GGUF layer-split is functional but doesn't extract NVLink bandwidth the way vLLM does.
- llama.cpp ✓ good. Layer split via `-ngl` + `--tensor-split` works; performance trails vLLM by ~30%.
- TensorRT-LLM ✓ supported but the recompile-per-config friction makes it impractical for prosumer use.
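As a rough illustration of the vLLM configuration named in the list above (tensor-parallel size 2, AWQ-INT4, 8K context), the sketch below assembles a `vllm serve` command line. The helper function and the model ID are hypothetical placeholders; the three flags are standard vLLM engine arguments, but check them against the version you have installed.

```python
import shlex


def vllm_serve_command(model: str, tp_size: int = 2,
                       max_model_len: int = 8192,
                       quantization: str = "awq") -> list:
    """Build the argument list for a two-GPU tensor-parallel vLLM server."""
    return [
        "vllm", "serve", model,
        "--tensor-parallel-size", str(tp_size),  # one rank per RTX 3090
        "--quantization", quantization,          # AWQ-INT4 checkpoint
        "--max-model-len", str(max_model_len),   # 8K context window
    ]


# Hypothetical model ID: substitute the AWQ checkpoint you actually use.
cmd = vllm_serve_command("your-org/llama-3.3-70b-instruct-awq")
print(shlex.join(cmd))
```

Building the command as a list keeps it safe to pass to `subprocess.run` without shell quoting issues.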
Split strategy
For 70B-class dense models, tensor parallelism is the right default. Each card holds half the weights of every layer; activations cross the NVLink (or PCIe) on every forward pass. With NVLink the bandwidth penalty is minimal; without NVLink, expect ~10-15% throughput loss vs theoretical.
For MoE models like Qwen 3 30B-A3B, expert parallelism is the alternative: different experts live on different cards, so only the routed token activations cross the bus. In vLLM this is opt-in (the `--enable-expert-parallel` flag); without it, a tensor-parallel run shards each expert's weights across both cards like any other layer.
For Mixtral 8x7B, the ~47B total parameters need roughly 94 GB at FP16, about double the 48 GB available; Q4_K_M fits comfortably.
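To make the tensor-parallel idea for dense models concrete, here is a toy sketch in plain Python. It is not how vLLM or any other runtime implements it: the two "cards" are just list slices, and the helper names are hypothetical. It shows a column-wise split of one weight matrix, with each half computed independently and the results joined in a single cross-card step.

```python
def matvec(x, w):
    """Compute x @ w for a vector x and a matrix w given as a list of rows."""
    n_out = len(w[0])
    return [sum(x[i] * w[i][j] for i in range(len(x))) for j in range(n_out)]


def split_columns(w, parts=2):
    """Split w column-wise into equal shards, one per simulated card."""
    step = len(w[0]) // parts
    return [[row[p * step:(p + 1) * step] for row in w] for p in range(parts)]


x = [1.0, 2.0]                      # activations entering the layer
w = [[1.0, 2.0, 3.0, 4.0],
     [5.0, 6.0, 7.0, 8.0]]          # full weight matrix (2 in, 4 out)

shards = split_columns(w)                   # each card stores half the columns
partials = [matvec(x, s) for s in shards]   # computed with no communication
gathered = partials[0] + partials[1]        # the step that crosses NVLink/PCIe

assert gathered == matvec(x, w)
print(gathered)
```

The join on the last line happens for every sharded layer on every forward pass, which is why interconnect bandwidth matters more for tensor parallelism than for a layer split.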
Comparison: dual-3090 vs single-4090 vs single-5090
| Metric | Dual 3090 | Single 4090 | Single 5090 |
|---|---|---|---|
| VRAM | 48 GB | 24 GB | 32 GB |
| Power | 700W | 450W | 575W |
| Largest model (Q4) | 70B | 32B | 35B |
| Setup difficulty | Intermediate | Beginner | Beginner |
| Total cost (used 3090 / new 5090) | $1,200-1,800 | $1,500-2,000 | $2,000-2,500 |
If your workload tops out at 32B Q4: single 5090 wins. If you want 70B: dual 3090 wins on cost, single H100 80GB wins on simplicity and price-no-object reliability.
Related
- /stacks/local-coding-agent — single-card alternative
- /systems/distributed-inference — when you need >48 GB
- /guides/running-local-ai-on-multiple-gpus-2026 — the multi-GPU buying guide
Best model classes
- 70B-class dense models at Q4_K_M / AWQ-INT4: the primary use case. Llama 3.3 70B, Qwen 2.5 72B, and DeepSeek R1 Distill Llama 70B all fit with 8K context; pushing toward 32K generally means an 8-bit KV cache to stay inside the ~6 GB of headroom.
- 30-35B class with extended context — Qwen 3 32B at Q5_K_M with 64K context.
- MoE 30B / A3B — Qwen 3 30B-A3B is the throughput sweet spot for prosumer dual-3090 rigs.
The combo's ceiling is 70B Q4. 100B+ models force CPU offload, which collapses throughput. The combo's floor is 32B at Q8: it fits (at FP16 a 32B model needs ~64 GB and does not), but you're wasting capacity vs running a 70B Q4.
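How much context fits alongside the weights depends mostly on the KV cache. The sketch below estimates its size for a single sequence. The layer count, KV-head count, and head dimension are assumptions matching a Llama 3 70B-style architecture with grouped-query attention; check the model's own config before relying on them. The function name is hypothetical.

```python
def kv_cache_gib(n_layers: int, n_kv_heads: int, head_dim: int,
                 context_tokens: int, bytes_per_elem: float = 2.0) -> float:
    """KV-cache size in GiB for one sequence (keys and values, all layers)."""
    per_token_bytes = 2 * n_layers * n_kv_heads * head_dim * bytes_per_elem
    return per_token_bytes * context_tokens / 2**30


# Assumed shape: 80 layers, 8 KV heads, head_dim 128.
for ctx in (8192, 32768):
    fp16 = kv_cache_gib(80, 8, 128, ctx)                      # 16-bit cache
    int8 = kv_cache_gib(80, 8, 128, ctx, bytes_per_elem=1.0)  # 8-bit cache
    print(f"{ctx:>6} tokens: {fp16:.2f} GiB at 16-bit, {int8:.2f} GiB at 8-bit")
```

Under these assumptions a 32K context at 16-bit precision needs about 10 GiB for one sequence, more than the ~6 GB left over by 40 GB of weights, so longer contexts on 70B models tend to rely on an 8-bit KV cache.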
What this combo is bad at
- 120B+ MoE models — DeepSeek V3 / V4, Llama 4 Maverick, Qwen 3.5 235B all exceed 48 GB at any practical quant. CPU offload kills the value proposition.
- High-concurrency serving: 2 cards means at most 2 tensor-parallel ranks. Per-request throughput scales, but the number of concurrent requests is capped by the KV-cache headroom left after the weights are loaded, which is far smaller than on 4× setups.
- Latency-critical workloads — NVLink between two 3090s is faster than PCIe-only but still adds cross-card communication overhead vs a single 5090.
Who should avoid this
- First-time local-AI builders — multi-GPU is intermediate-tier. Start with a single RTX 4090 or 5090 if you've never set up a local AI workstation.
- Anyone who needs production reliability — used 3090s have unknown service histories; mean-time-to-failure is meaningfully worse than a new 5090.
- Quiet-environment users — two 350W GPUs are loud regardless of cooling solution.
Two 350W GPUs in a single chassis are a thermal challenge: most consumer cases need re-tuned fan curves and extra intake fans. The NVLink bridge requires specific motherboard slot spacing (a 3-slot bridge for most boards). Expect noticeable noise under sustained load; consider a dedicated server-style chassis if 24/7 deployment is the goal.
RTX 3090s are 5+ years old at this point, and used-market units come with unknown thermal-paste history. Plan for a re-paste and memory-junction temperature monitoring before committing to production workloads. EVGA and MSI Gaming X Trio variants tend to age better than the reference Founders Edition.
Recommended OS: Ubuntu 22.04 LTS or 24.04 LTS
Failure modes specific to dual 3090
- NVLink bridge misseating. NVLink 3.0 bridges are mechanically fragile — a bridge that "looks" seated but isn't fully clicked in produces silent fallback to PCIe. nvidia-smi nvlink --status will show "OFF" when this happens. Always verify before benchmarking.
- PCIe lane starvation on consumer boards. Many consumer motherboards drop the second PCIe slot from x16 to x8 when both are populated. For pipeline (layer-split) parallelism this is acceptable, since only the activations at the split point cross the bus; for tensor parallelism without NVLink, which exchanges activations on every layer, it costs ~15-25% throughput.
- Memory-junction overheating. RTX 3090 GDDR6X memory junctions run hotter than the GPU core. Long-context inference workloads can push memory-junction temps over 105°C, triggering thermal throttling that's invisible in the default nvidia-smi view (which reports core temp). Monitor via `nvidia-smi --query-gpu=temperature.memory --format=csv`; on GeForce cards this field often reports N/A, in which case a third-party memory-junction tool is needed.
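- NVLink-unaware runtime configurations. vLLM picks up NVLink through NCCL without extra flags, but llama.cpp's default layer split moves little data between cards and gains little from the bridge; `--tensor-split` only sets how much of the model each card holds. Row split (`--split-mode row`) is the mode that exercises the interconnect, and leaving it off silently forgoes the high-bandwidth path.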
- Used-market thermal paste degradation. ~5-year-old cards routinely benefit from a thermal-paste refresh, dropping core temps 8-15°C and unlocking sustainable boost clocks.
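Several of the checks above can be scripted. The sketch below parses `nvidia-smi --query-gpu` CSV output and flags narrow PCIe links and hot or unreadable memory-junction sensors. The query field names are real nvidia-smi fields; the function names, the x8 and 100°C thresholds, and the sample output are illustrative assumptions. NVLink state is not covered here and still needs `nvidia-smi nvlink --status`.

```python
import subprocess

# Real nvidia-smi query fields; the thresholds below are assumptions.
FIELDS = "index,pcie.link.width.current,temperature.gpu,temperature.memory"


def _to_int(value: str):
    """Return an int, or None for non-numeric fields such as N/A."""
    value = value.strip().strip("[]")
    return int(value) if value.isdigit() else None


def parse_gpu_csv(text: str) -> list:
    """Parse `--format=csv,noheader,nounits` output into one dict per GPU."""
    rows = []
    for line in text.strip().splitlines():
        index, width, core, mem = (_to_int(f) for f in line.split(","))
        rows.append({"index": index, "pcie_width": width,
                     "core_c": core, "mem_c": mem})
    return rows


def find_issues(rows, min_pcie_width=8, mem_limit_c=100):
    """Flag narrow PCIe links and hot or unreported memory temperatures."""
    issues = []
    for r in rows:
        if r["pcie_width"] is not None and r["pcie_width"] < min_pcie_width:
            issues.append(f"GPU {r['index']}: PCIe x{r['pcie_width']}")
        if r["mem_c"] is None:
            issues.append(f"GPU {r['index']}: memory temperature not reported")
        elif r["mem_c"] >= mem_limit_c:
            issues.append(f"GPU {r['index']}: memory junction at {r['mem_c']} C")
    return issues


def read_gpus() -> list:
    """Query the local driver; requires nvidia-smi on PATH."""
    out = subprocess.run(
        ["nvidia-smi", f"--query-gpu={FIELDS}", "--format=csv,noheader,nounits"],
        capture_output=True, text=True, check=True,
    ).stdout
    return parse_gpu_csv(out)


# Made-up sample output, used so the sketch runs without a GPU.
sample = "0, 16, 64, 98\n1, 4, 66, N/A\n"
for issue in find_issues(parse_gpu_csv(sample)):
    print(issue)
```

On a real rig, replace the sample with `find_issues(read_gpus())` and run it under load, since idle readings say little about sustained inference.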
Dual RTX 4090 →
Newer architecture, ~30% faster decode, 2× the cost, no NVLink. Pick 4090 for new-card warranty + FP8 support; pick 3090 for cost-efficiency.
Dual RTX 3090 workstation →
Step-by-step setup with NVLink bridge verification, vLLM tensor-parallel-2 configuration, and operator-grade failure modes.
Benchmark opportunities
Pending measurement targets for this combo. These are estimates, not measurements — actual benchmarks land in the catalog when run.
Dual RTX 3090 + Llama 3.3 70B Q4 (vLLM tensor-parallel)
llama-3.3-70b-instruct
Reference benchmark for the dual-3090 NVLink prosumer build. vLLM tensor-parallel-2, AWQ-INT4, 8K context. Compare against dual-4090 PCIe (no NVLink) to isolate interconnect impact.
Going deeper
- All hardware combinations — browse other multi-GPU and multi-machine setups.
- Running local AI on multiple GPUs in 2026 — the flagship buying / deployment guide.
- Distributed inference systems — architectural depth.