RUNLOCALAIv38
->Will it run?Best GPUCompareTroubleshootStartLearnPulseModelsHardwareToolsBench
Run check
RUNLOCALAI

Independently operated catalog for local-AI hardware and software. Hand-written verdicts. Source-cited claims. Reproducible commands when we have them.

OP·Fredoline Eruo
DIR
  • Models
  • Hardware
  • Tools
  • Benchmarks
TOOLS
  • Will it run?
  • Compare hardware
  • Cost vs cloud
  • Choose my GPU
  • Prompting kits
  • Quick answers
REF
  • All buyer guides
  • Learn local AI
  • Methodology
  • Glossary
  • Errors KB
  • Trust
EDITOR
  • About
  • Author
  • How we make money
  • Editorial policy
  • Contact
LEGAL
  • Privacy
  • Terms
  • Sitemap
MAIL · MONTHLY DIGEST
Get monthly local AI changes
Monthly recap. No spam.
DISCLOSURE

Some links on this site are affiliate links (Amazon Associates and other first-class retailers). When you buy through them, we earn a small commission at no extra cost to you. Affiliate links do not influence our verdicts — there are cards we rate highly that we don't have affiliate relationships with, and cards that sell well that we refuse to recommend. Read more →

© 2026 runlocalai.coIndependently operated
RUNLOCALAI · v38
  1. >
  2. Home
  3. /Will it run?
  4. /Dual RTX 4090 (24 GB × 2)
Single-node multi-GPUPCIeintermediate

What runs on Dual RTX 4090 (24 GB × 2)?

Two consumer-flagship cards. PCIe 4.0 only — no NVLink on 4090. 48 GB total / ~45 GB effective with tensor parallelism. ~30% faster decode than dual 3090 at 2× the cost.

At a glance
Effective VRAM
45 / 48 GB
Not pooled
Speed penalty
~20%
vs ideal single-card
Recommended runtime
vllm
tensor parallel
Setup difficulty
intermediate
~900W peak
24
Models fit
12
Borderline
8
Not practical
Deployment recipe
Dual RTX 4090 workstation →

PCIe peer-to-peer verification (no NVLink), FP8 path, vLLM tensor-parallel-2 over PCIe.

Memory budget
Total VRAM
48 GB
Effective for inference
45 GB
94% of total
Not pooled

Critical: RTX 4090 has NO NVLink. NVIDIA removed the connector. Two 4090s communicate ONLY via PCIe — typically PCIe 4.0 x8 each on a consumer board, x16 each on a workstation board. This means the cross-card bandwidth is ~32 GB/s, vs 112 GB/s on dual 3090 NVLink. For tensor parallelism, this matters — expect ~10-20% throughput penalty vs an NVLink-equipped pair. Effective VRAM is total minus ~2-3 GB per card for activations and KV cache; concretely, 70B Q4 fits with marginal headroom. Two 4090s do NOT pool to 48 GB usable — runtime overhead and per-card activations cost real VRAM.

Why total VRAM is not the whole story

PCIe-only multi-GPU. No NVLink means cross-card bandwidth is 32 GB/s — 3-4× slower than NVLink. Tensor parallelism still works but with ~10-20% throughput penalty. Effective 45 GB of total 48 GB.

See the multi-GPU guide for topology tradeoffs, and the RunLocalAI Will-It-Run Framework for the citable fit-tier method.

Topology

Topology
single-node-multi-gpu
Interconnect
pcie~32 GB/s
Component count
2 units
Components
  • 2×rtx-4090
Recommended runtime
vllm
Also: sglang, exllamav2
Recommended split strategy
tensor-parallel
Also: pipeline-parallel
Setup difficulty
intermediate
~900W peak

Models that fit comfortably (24)

Effective VRAM utilization ≤ 85% at the smallest production quant. Comfortable headroom for KV cache.

Jamba 1.5 Mini
Fits
52B·Q4_K_M → 36 GB·80% of effective VRAM·~20% speed penalty vs ideal
Nemotron 3 Super 49B
Fits
49B·AWQ-INT4 → 32 GB·71% of effective VRAM·~20% speed penalty vs ideal
Mixtral 8x7B Instruct
Fits
47B·Q4_K_M → 32 GB·71% of effective VRAM·~20% speed penalty vs ideal
Mixtral 8X7B Instruct v0.1 GPTQ
Fits
46.7B·Q4_K_M → 33 GB·73% of effective VRAM·~20% speed penalty vs ideal
Falcon 40B Instruct
Fits
40B·Q4_K_M → 28 GB·62% of effective VRAM·~20% speed penalty vs ideal
ALIA 40b instruct 2601
Fits
40B·Q4_K_M → 28 GB·62% of effective VRAM·~20% speed penalty vs ideal
Mihenk LLM v2 35B (Turkish Financial)
Fits
35B·Q4_K_M → 25 GB·56% of effective VRAM·~20% speed penalty vs ideal
Command R 35B
Fits
35B·Q4_K_M → 26 GB·58% of effective VRAM·~20% speed penalty vs ideal
Aya 23 35B
Fits
35B·Q4_K_M → 24 GB·53% of effective VRAM·~20% speed penalty vs ideal
Phind CodeLlama 34B v2
Fits
34B·Q4_K_M → 24 GB·53% of effective VRAM·~20% speed penalty vs ideal
Yi 1.5 34B
Fits
34B·Q4_K_M → 24 GB·53% of effective VRAM·~20% speed penalty vs ideal
DeepSeek Coder V3
Fits
33B·AWQ-INT4 → 22 GB·49% of effective VRAM·~20% speed penalty vs ideal
Magistral 32B
Fits
32B·AWQ-INT4 → 22 GB·49% of effective VRAM·~20% speed penalty vs ideal
Qwen 2.5 32B Instruct
Fits
32B·Q4_K_M → 24 GB·53% of effective VRAM·~20% speed penalty vs ideal
Qwen3 Swallow 32B RL v0.2
Fits
32B·Q4_K_M → 23 GB·51% of effective VRAM·~20% speed penalty vs ideal
EXAONE 3.5 32B
Fits
32B·AWQ-INT4 → 22 GB·49% of effective VRAM·~20% speed penalty vs ideal
EXAONE 4.0.1 32B
Fits
32B·Q4_K_M → 23 GB·51% of effective VRAM·~20% speed penalty vs ideal
Aya Expanse 32B
Fits
32B·AWQ-INT4 → 22 GB·49% of effective VRAM·~20% speed penalty vs ideal
llm-jp 4 32B A3B Thinking
Fits
32B·Q4_K_M → 23 GB·51% of effective VRAM·~20% speed penalty vs ideal
EXAONE 3.5 32B Instruct AWQ
Fits
32B·Q4_K_M → 23 GB·51% of effective VRAM·~20% speed penalty vs ideal
OLMo 2 32B
Fits
32B·Q4_K_M → 24 GB·53% of effective VRAM·~20% speed penalty vs ideal
Qwen 3 Coder 32B
Fits
32B·AWQ-INT4 → 22 GB·49% of effective VRAM·~20% speed penalty vs ideal
EXAONE 3.5 32B Instruct
Fits
32B·Q4_K_M → 23 GB·51% of effective VRAM·~20% speed penalty vs ideal
Qwen 3 32B
Fits
32B·Q4_K_M → 24 GB·53% of effective VRAM·~20% speed penalty vs ideal

Borderline (12)

Fits but with little headroom. KV cache for long context may not fit; verify before deployment.

Molmo 72B
Borderline
72B·Q4_K_M → 48 GB·107% of effective VRAM·~20% speed penalty vs ideal

Effective VRAM utilization >107% — KV cache for long context will not fit. Cap context at ~4-8K or move to a larger combo.

Qwen 2.5 Math 72B
Borderline
72B·Q4_K_M → 48 GB·107% of effective VRAM·~20% speed penalty vs ideal

Effective VRAM utilization >107% — KV cache for long context will not fit. Cap context at ~4-8K or move to a larger combo.

Qwen 2.5 72B Instruct
Borderline
72B·Q4_K_M → 48 GB·107% of effective VRAM·~20% speed penalty vs ideal

Effective VRAM utilization >107% — KV cache for long context will not fit. Cap context at ~4-8K or move to a larger combo.

Qwen 2.5-VL 72B
Borderline
72B·AWQ-INT4 → 48 GB·107% of effective VRAM·~20% speed penalty vs ideal

Effective VRAM utilization >107% — KV cache for long context will not fit. Cap context at ~4-8K or move to a larger combo.

Llama 4 70B
Borderline
70B·AWQ-INT4 → 48 GB·107% of effective VRAM·~20% speed penalty vs ideal

Effective VRAM utilization >107% — KV cache for long context will not fit. Cap context at ~4-8K or move to a larger combo.

Tulu 3 70B
Borderline
70B·Q4_K_M → 48 GB·107% of effective VRAM·~20% speed penalty vs ideal

Effective VRAM utilization >107% — KV cache for long context will not fit. Cap context at ~4-8K or move to a larger combo.

Dolphin 3 Llama 3.3 70B
Borderline
70B·AWQ-INT4 → 48 GB·107% of effective VRAM·~20% speed penalty vs ideal

Effective VRAM utilization >107% — KV cache for long context will not fit. Cap context at ~4-8K or move to a larger combo.

EVA Llama 3.3 70B
Borderline
70B·AWQ-INT4 → 48 GB·107% of effective VRAM·~20% speed penalty vs ideal

Effective VRAM utilization >107% — KV cache for long context will not fit. Cap context at ~4-8K or move to a larger combo.

OpenBioLLM Llama 3 70B
Borderline
70B·Q4_K_M → 48 GB·107% of effective VRAM·~20% speed penalty vs ideal

Effective VRAM utilization >107% — KV cache for long context will not fit. Cap context at ~4-8K or move to a larger combo.

Llama 3.1 70B Instruct
Borderline
70B·Q4_K_M → 48 GB·107% of effective VRAM·~20% speed penalty vs ideal

Effective VRAM utilization >107% — KV cache for long context will not fit. Cap context at ~4-8K or move to a larger combo.

DeepSeek R1 Distill Llama 70B
Borderline
70B·Q4_K_M → 48 GB·107% of effective VRAM·~20% speed penalty vs ideal

Effective VRAM utilization >107% — KV cache for long context will not fit. Cap context at ~4-8K or move to a larger combo.

Hermes 4 70B FP8
Borderline
70B·Q4_K_M → 49 GB·109% of effective VRAM·~20% speed penalty vs ideal

Effective VRAM utilization >109% — KV cache for long context will not fit. Cap context at ~4-8K or move to a larger combo.

Not practical (8)

Model weights exceed effective combo VRAM. Even with the recommended split strategy, this configuration won't run cleanly. Drop to a smaller quant or move to a larger combo.

DeepSeek V4 Pro (1.6T MoE)
Not practical
1600B·Q4_K_M → 1024 GB·2276% of effective VRAM·~20% speed penalty vs ideal

Model weights exceed effective combo VRAM. Even with the recommended split strategy, this configuration won't run cleanly.

Step-3
Not practical
1000B·AWQ-INT4 → 640 GB·1422% of effective VRAM·~20% speed penalty vs ideal

Model weights exceed effective combo VRAM. Even with the recommended split strategy, this configuration won't run cleanly.

Kimi K2.6
Not practical
1000B·Q4_K_M → 700 GB·1556% of effective VRAM·~20% speed penalty vs ideal

Model weights exceed effective combo VRAM. Even with the recommended split strategy, this configuration won't run cleanly.

DeepSeek V4
Not practical
745B·AWQ-INT4 → 480 GB·1067% of effective VRAM·~20% speed penalty vs ideal

Model weights exceed effective combo VRAM. Even with the recommended split strategy, this configuration won't run cleanly.

Mistral Medium 3.5 (675B MoE)
Not practical
675B·Q4_K_M → 448 GB·996% of effective VRAM·~20% speed penalty vs ideal

Model weights exceed effective combo VRAM. Even with the recommended split strategy, this configuration won't run cleanly.

DeepSeek R1 (671B reasoning)
Not practical
671B·Q4_K_M → 420 GB·933% of effective VRAM·~20% speed penalty vs ideal

Model weights exceed effective combo VRAM. Even with the recommended split strategy, this configuration won't run cleanly.

DeepSeek V3 (671B MoE)
Not practical
671B·Q4_K_M → 420 GB·933% of effective VRAM·~20% speed penalty vs ideal

Model weights exceed effective combo VRAM. Even with the recommended split strategy, this configuration won't run cleanly.

Llama 4 405B
Not practical
405B·AWQ-INT4 → 280 GB·622% of effective VRAM·~20% speed penalty vs ideal

Model weights exceed effective combo VRAM. Even with the recommended split strategy, this configuration won't run cleanly.

Benchmark opportunities

estimates, not measurements

Pending benchmark targets for this combo. Once measured, results land in the catalog as benchmarks.

Dual RTX 4090 + Llama 3.3 70B Q4 (vLLM tensor-parallel)
pending
Estimate: 28-36 tok/s decode (PCIe only)

Reference benchmark for dual-4090 PCIe (no NVLink). Same model + quant as dual-3090 entry; the comparison reveals NVLink vs PCIe impact at tensor-parallel-2.

Going deeper

  • Full combo detail page — operational review with failure modes and runtime matrix.
  • Multi-GPU buying guide — when multi-GPU is worth it and when it isn't.
  • RunLocalAI Will-It-Run Framework — citable effective-VRAM, working-set, fit-tier, and evidence-tier method.
  • Will-it-run home — single-card check + custom builds.