Single-node multi-GPU · PCIe · intermediate

What runs on Dual RTX 4090 (24 GB × 2)?

Two consumer-flagship cards. PCIe 4.0 only — no NVLink on 4090. 48 GB total / ~45 GB effective with tensor parallelism. ~30% faster decode than dual 3090 at 2× the cost.

At a glance

Effective VRAM: 45 / 48 GB (not pooled)

Speed penalty: ~20% vs ideal single-card

Recommended runtime: vllm (tensor parallel)

Setup difficulty: intermediate (~900W peak)

Models fit

Borderline

Not practical

Deployment recipe

Dual RTX 4090 workstation →

PCIe peer-to-peer verification (no NVLink), FP8 path, vLLM tensor-parallel-2 over PCIe.
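The two concrete steps named here, verifying peer-to-peer access and launching vLLM with tensor parallelism across both cards, might be sketched as follows. This is a minimal sketch rather than the linked recipe itself: the model name is a placeholder, the 0.90 memory fraction is an assumed starting value, the FP8 switch is optional, and the peer check assumes PyTorch with CUDA is installed.

```python
import shlex


def peer_access_ok(dev_a: int = 0, dev_b: int = 1) -> bool:
    """Return True if both GPUs report peer-to-peer access to each other.

    Assumes PyTorch with CUDA support; imported lazily so the rest of this
    sketch runs without it.
    """
    import torch

    if torch.cuda.device_count() < 2:
        return False
    return (torch.cuda.can_device_access_peer(dev_a, dev_b)
            and torch.cuda.can_device_access_peer(dev_b, dev_a))


def vllm_serve_args(model: str, gpu_mem_util: float = 0.90,
                    fp8: bool = False) -> list[str]:
    """Build a `vllm serve` command line for tensor-parallel-2 over PCIe."""
    args = [
        "vllm", "serve", model,
        "--tensor-parallel-size", "2",                  # one shard per 4090
        "--gpu-memory-utilization", str(gpu_mem_util),  # assumed starting value
    ]
    if fp8:
        args += ["--quantization", "fp8"]               # optional FP8 path
    return args


if __name__ == "__main__":
    # "your-org/your-model" is a placeholder, not a real checkpoint.
    print(shlex.join(vllm_serve_args("your-org/your-model")))
```

Run `peer_access_ok()` on the target machine before launching; if it returns False, tensor parallelism may still work but traffic is likely to be staged through host memory.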

Memory budget

Total VRAM: 48 GB

Effective for inference: 45 GB (94% of total, not pooled)

Critical: the RTX 4090 has no NVLink; NVIDIA removed the connector. Two 4090s communicate only over PCIe 4.0, typically x8 per card on a consumer board and x16 per card on a workstation board. Cross-card bandwidth is therefore ~32 GB/s at best (an x16 link; an x8 link gives roughly half that), versus 112 GB/s on a dual 3090 NVLink pair. For tensor parallelism this matters: expect a ~10-20% throughput penalty versus an NVLink-equipped pair.

Two 4090s do not pool to 48 GB usable, because runtime overhead and per-card activations cost real VRAM. The 45 GB effective figure used here corresponds to about 3 GB reserved across the pair; heavier workloads can reserve 2-3 GB per card. Concretely, 70B Q4 fits with marginal headroom.
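The memory arithmetic can be checked with a few lines. This is a rough sketch: the per-card overhead values are illustrative inputs, not measurements, and the function name is made up for this example.

```python
def effective_vram_gb(cards: int, per_card_gb: float,
                      overhead_per_card_gb: float) -> float:
    """Usable VRAM after subtracting a fixed overhead from each card."""
    return cards * (per_card_gb - overhead_per_card_gb)


if __name__ == "__main__":
    total = 2 * 24
    # Illustrative per-card overheads in GB; 1.5 reproduces the 45 GB figure.
    for overhead in (1.5, 2.0, 3.0):
        eff = effective_vram_gb(2, 24, overhead)
        print(f"{overhead} GB/card overhead -> {eff:.0f} GB effective "
              f"({eff / total:.0%} of {total} GB)")
```

Reserving 1.5 GB per card gives the 45 GB (94%) figure quoted on this page; reserving 2 to 3 GB per card leaves 44 to 42 GB.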

Why total VRAM is not the whole story

PCIe-only multi-GPU. With no NVLink, cross-card bandwidth is ~32 GB/s, about 3.5× lower than the 112 GB/s of a dual 3090 NVLink pair. Tensor parallelism still works, but with a ~10-20% throughput penalty. Effective VRAM is 45 GB of the 48 GB total.
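To put the bandwidth gap in concrete terms, the sketch below compares how long a single cross-card transfer takes at each link rate. It uses the peak figures quoted above and ignores latency and protocol overhead; the 64 MB payload is an arbitrary illustration, not a measured tensor-parallel message size.

```python
PCIE_GB_S = 32.0     # PCIe 4.0 x16 figure quoted above
NVLINK_GB_S = 112.0  # dual 3090 NVLink figure quoted above


def transfer_ms(payload_mb: float, bandwidth_gb_s: float) -> float:
    """Milliseconds to move payload_mb at bandwidth_gb_s (1 GB = 1000 MB)."""
    # MB divided by GB/s is numerically equal to milliseconds.
    return payload_mb / bandwidth_gb_s


if __name__ == "__main__":
    payload_mb = 64.0  # arbitrary example payload
    print(f"PCIe:   {transfer_ms(payload_mb, PCIE_GB_S):.2f} ms")
    print(f"NVLink: {transfer_ms(payload_mb, NVLINK_GB_S):.2f} ms")
    print(f"ratio:  {NVLINK_GB_S / PCIE_GB_S:.1f}x")
```

The end-to-end penalty is far smaller than the 3.5× link ratio because only part of each decode step is spent on cross-card communication.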

See the multi-GPU guide for topology tradeoffs, and the RunLocalAI Will-It-Run Framework for the citable fit-tier method.

Topology: single-node-multi-gpu

Interconnect: pcie, ~32 GB/s

Component count: 2 units

Components: 2 × rtx-4090

Recommended runtime: vllm (also: sglang, exllamav2)

Recommended split strategy: tensor-parallel (also: pipeline-parallel)

Setup difficulty: intermediate (~900W peak)

Models that fit comfortably (24)

Effective VRAM utilization ≤ 85% at the smallest production quant. Comfortable headroom for KV cache.
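The utilization percentages in the entries below follow from dividing each model's footprint by the 45 GB effective figure. A minimal sketch of that check, using only the 85% threshold stated above (the function names are illustrative, and no cutoff for the other tiers is assumed):

```python
EFFECTIVE_GB = 45          # effective VRAM quoted on this page
COMFORTABLE_MAX_PCT = 85   # "fits comfortably" threshold stated above


def utilization_pct(footprint_gb: float,
                    effective_gb: float = EFFECTIVE_GB) -> int:
    """Footprint as a rounded percentage of effective VRAM."""
    return round(footprint_gb / effective_gb * 100)


def is_comfortable(footprint_gb: float,
                   effective_gb: float = EFFECTIVE_GB) -> bool:
    """True when utilization is at or below the comfortable threshold."""
    return footprint_gb / effective_gb * 100 <= COMFORTABLE_MAX_PCT


if __name__ == "__main__":
    for name, gb in [("52B Q4_K_M", 36), ("32B AWQ-INT4", 22), ("72B Q4_K_M", 48)]:
        print(f"{name}: {utilization_pct(gb)}% comfortable={is_comfortable(gb)}")
```

A 36 GB footprint works out to 80% and counts as comfortable; a 48 GB footprint works out to 107% and does not.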

Jamba 1.5 Mini · Fits · 52B · Q4_K_M → 36 GB · 80% of effective VRAM · ~20% speed penalty vs ideal

Nemotron 3 Super 49B · Fits · 49B · AWQ-INT4 → 32 GB · 71% of effective VRAM · ~20% speed penalty vs ideal

Mixtral 8x7B Instruct · Fits · 47B · Q4_K_M → 32 GB · 71% of effective VRAM · ~20% speed penalty vs ideal

Mixtral 8X7B Instruct v0.1 GPTQ · Fits · 46.7B · Q4_K_M → 33 GB · 73% of effective VRAM · ~20% speed penalty vs ideal

Falcon 40B Instruct · Fits · 40B · Q4_K_M → 28 GB · 62% of effective VRAM · ~20% speed penalty vs ideal

ALIA 40b instruct 2601 · Fits · 40B · Q4_K_M → 28 GB · 62% of effective VRAM · ~20% speed penalty vs ideal

Mihenk LLM v2 35B (Turkish Financial) · Fits · 35B · Q4_K_M → 25 GB · 56% of effective VRAM · ~20% speed penalty vs ideal

Command R 35B · Fits · 35B · Q4_K_M → 26 GB · 58% of effective VRAM · ~20% speed penalty vs ideal

Aya 23 35B · Fits · 35B · Q4_K_M → 24 GB · 53% of effective VRAM · ~20% speed penalty vs ideal

Phind CodeLlama 34B v2 · Fits · 34B · Q4_K_M → 24 GB · 53% of effective VRAM · ~20% speed penalty vs ideal

Yi 1.5 34B · Fits · 34B · Q4_K_M → 24 GB · 53% of effective VRAM · ~20% speed penalty vs ideal

DeepSeek Coder V3 · Fits · 33B · AWQ-INT4 → 22 GB · 49% of effective VRAM · ~20% speed penalty vs ideal

Magistral 32B · Fits · 32B · AWQ-INT4 → 22 GB · 49% of effective VRAM · ~20% speed penalty vs ideal

Qwen 2.5 32B Instruct · Fits · 32B · Q4_K_M → 24 GB · 53% of effective VRAM · ~20% speed penalty vs ideal

Qwen3 Swallow 32B RL v0.2 · Fits · 32B · Q4_K_M → 23 GB · 51% of effective VRAM · ~20% speed penalty vs ideal

EXAONE 3.5 32B · Fits · 32B · AWQ-INT4 → 22 GB · 49% of effective VRAM · ~20% speed penalty vs ideal

EXAONE 4.0.1 32B · Fits · 32B · Q4_K_M → 23 GB · 51% of effective VRAM · ~20% speed penalty vs ideal

Aya Expanse 32B · Fits · 32B · AWQ-INT4 → 22 GB · 49% of effective VRAM · ~20% speed penalty vs ideal

llm-jp 4 32B A3B Thinking · Fits · 32B · Q4_K_M → 23 GB · 51% of effective VRAM · ~20% speed penalty vs ideal

EXAONE 3.5 32B Instruct AWQ · Fits · 32B · Q4_K_M → 23 GB · 51% of effective VRAM · ~20% speed penalty vs ideal

OLMo 2 32B · Fits · 32B · Q4_K_M → 24 GB · 53% of effective VRAM · ~20% speed penalty vs ideal

Qwen 3 Coder 32B · Fits · 32B · AWQ-INT4 → 22 GB · 49% of effective VRAM · ~20% speed penalty vs ideal

EXAONE 3.5 32B Instruct · Fits · 32B · Q4_K_M → 23 GB · 51% of effective VRAM · ~20% speed penalty vs ideal

Qwen 3 32B · Fits · 32B · Q4_K_M → 24 GB · 53% of effective VRAM · ~20% speed penalty vs ideal

Borderline (12)

Fits but with little headroom. KV cache for long context may not fit; verify before deployment.
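Whether the KV cache fits can be estimated before deployment with the standard sizing rule: two tensors (key and value) per layer, each holding one vector per KV head per token. The sketch below applies it to a hypothetical 70B-class shape with grouped-query attention; the layer count, head count, and head dimension are assumptions for illustration and should be replaced with values from the model's own config.

```python
def kv_cache_gib(tokens: int, layers: int, kv_heads: int, head_dim: int,
                 bytes_per_value: int = 2) -> float:
    """Total KV cache size in GiB for `tokens` cached positions.

    bytes_per_value=2 assumes an FP16/BF16 cache.
    """
    # Factor 2: one key tensor plus one value tensor per layer.
    total_bytes = 2 * layers * kv_heads * head_dim * bytes_per_value * tokens
    return total_bytes / 2**30


if __name__ == "__main__":
    # Hypothetical shape: 80 layers, 8 KV heads, head dimension 128.
    for ctx in (8192, 32768):
        gib = kv_cache_gib(ctx, layers=80, kv_heads=8, head_dim=128)
        print(f"{ctx} tokens -> {gib:.1f} GiB of KV cache")
```

Under these assumptions a single 32,768-token sequence needs about 10 GiB of cache across both cards, which a model already near or above 100% of effective VRAM cannot accommodate without a smaller quant, a shorter context, or a quantized cache.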

Molmo 72B · Borderline · 72B · Q4_K_M → 48 GB · 107% of effective VRAM · ~20% speed penalty vs ideal