


What runs on Quad RTX 3090 (24 GB × 4)?

Four used 3090s in a homelab chassis. 96 GB total / ~88 GB effective. The cheapest path to 100B+ class models and high-concurrency 70B serving.

At a glance
  • Effective VRAM: 88 / 96 GB (not pooled)
  • Speed penalty: ~10% vs ideal single-card
  • Recommended runtime: vLLM (tensor parallel)
  • Setup difficulty: advanced (~1400 W peak)
  • Models that fit: 24 · Borderline: 8 · Not practical: 8
Deployment recipe
Quad RTX 3090 workstation →

Step-by-step setup with a WRX80/W790 motherboard, NVLink pair verification, vLLM tensor-parallel-4, and power/thermal warnings.
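A minimal launch sketch of the tensor-parallel-4 serve command, assuming vLLM is installed and an AWQ-quantized 70B checkpoint is on disk. The model path is a placeholder, not a specific repo:

    # Serve a 70B AWQ model across all four cards via tensor parallelism.
    # <awq-70b-model> is a placeholder; substitute your checkpoint path or HF repo.
    vllm serve <awq-70b-model> \
      --tensor-parallel-size 4 \
      --quantization awq \
      --gpu-memory-utilization 0.90

Tensor-parallel-size 4 shards every layer across the cards, which is why the NVLink bridges matter: the all-reduce traffic that would otherwise saturate PCIe rides the bridged pairs.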

Memory budget
  • Total VRAM: 96 GB
  • Effective for inference: 88 GB (92% of total)
  • Not pooled

Four 3090s in a single chassis over PCIe plus NVLink (paired bridges between cards 0-1 and 2-3) does not produce 96 GB of pooled VRAM. Tensor parallelism across 4 ranks with vLLM yields ~88 GB effective for model weights: total minus ~2 GB per card for activations, KV cache, and runtime overhead. That envelope fits 100B-class dense models (Command R+ 104B, for example), but not the largest MoE models; DeepSeek V2.5 (236B total / 21B active) needs ~134 GB at Q4 and does not fit. The 88 GB envelope is the realistic ceiling for prosumer multi-GPU before you pay for datacenter hardware.
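The budget math as a reproducible sketch. The ~2 GB/card overhead and the weight sizes are this page's estimates, not measurements:

    # Effective-VRAM rule of thumb for 4x RTX 3090 (estimates from this page)
    PER_CARD=24; CARDS=4; OVERHEAD_PER_CARD=2
    TOTAL=$((PER_CARD * CARDS))                        # 96 GB
    EFFECTIVE=$((TOTAL - OVERHEAD_PER_CARD * CARDS))   # 88 GB for weights
    # Fit check: 70B at Q4_K_M is ~48 GB of weights -> fits (55% of 88 GB)
    # DeepSeek V2.5 at Q4 is ~134 GB -> does not fit (152% of 88 GB)
    echo "effective: ${EFFECTIVE} of ${TOTAL} GB"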

Why total VRAM is not the whole story

[Diagram: two 3090s joined by an NVLink bridge]

NVLink (~112.5 GB/s bidirectional) keeps tensor parallelism efficient but does NOT pool memory. Each card holds its own shard of the weights via tensor or pipeline parallelism, for 88 GB effective out of 96 GB total.

See the multi-GPU guide for the full math + tradeoffs.
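To confirm the paired bridges before trusting tensor parallelism, two standard nvidia-smi queries are enough; nothing here is page-specific:

    # Interconnect matrix: the 0-1 and 2-3 pairs should show an NV# entry,
    # everything else falls back to PCIe
    nvidia-smi topo -m
    # Per-link NVLink status and speeds
    nvidia-smi nvlink --status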

Topology
  • Topology: single-node multi-GPU
  • Interconnect: NVLink (~112.5 GB/s)
  • Component count: 4 units
  • Components: 4× RTX 3090
  • Recommended runtime: vLLM (also: SGLang, ExLlamaV2)
  • Recommended split strategy: tensor parallel (also: pipeline parallel, expert routing)
  • Setup difficulty: advanced, ~1400 W peak (power-limit sketch below)
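Four 3090s at their stock ~350 W limit is where the ~1400 W peak comes from. A common mitigation is per-card power capping; the 280 W figure below is an assumption about a reasonable cap, not a benchmarked sweet spot:

    # Cap each card at 280 W (re-apply after reboot, e.g. via a systemd unit)
    for i in 0 1 2 3; do
      sudo nvidia-smi -i "$i" -pl 280
    done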

Models that fit comfortably (24)

Effective VRAM utilization ≤ 85% at the smallest production quant, with comfortable headroom for KV cache. Every row carries the ~10% tensor-parallel speed penalty vs an ideal single card. Rows read: model · params · quant → weights · share of effective VRAM. A launch sketch for the Q4_K_M rows follows the list.

  • Command R+ 104B · 104B · Q4_K_M → 70 GB · 80%
  • Command R+ (Aug 2024) · 104B · AWQ-INT4 → 72 GB · 82%
  • Llama 3.2 90B Vision · 90B · AWQ-INT4 → 64 GB · 73%
  • Llama 3.2 90B Vision Instruct · 90B · Q4_K_M → 60 GB · 68%
  • InternVL 2.5 78B · 78B · Q4_K_M → 52 GB · 59%
  • Qwen 2.5 72B Instruct · 72B · Q4_K_M → 48 GB · 55%
  • Molmo 72B · 72B · Q4_K_M → 48 GB · 55%
  • Qwen 2.5-VL 72B · 72B · AWQ-INT4 → 48 GB · 55%
  • Qwen 3 72B · 72B · AWQ-INT4 → 48 GB · 55%
  • Qwen 2.5 Math 72B · 72B · Q4_K_M → 48 GB · 55%
  • DeepSeek R1 Distill Llama 70B · 70B · Q4_K_M → 48 GB · 55%
  • OpenBioLLM Llama 3 70B · 70B · Q4_K_M → 48 GB · 55%
  • Llama 3.3 70B Instruct · 70B · Q4_K_M → 48 GB · 55%
  • Dolphin 3 Llama 3.3 70B · 70B · AWQ-INT4 → 48 GB · 55%
  • Llama 3.1 Nemotron 70B Instruct · 70B · Q4_K_M → 48 GB · 55%
  • Tulu 3 70B · 70B · Q4_K_M → 48 GB · 55%
  • Hermes 4 Llama 3.3 70B · 70B · AWQ-INT4 → 48 GB · 55%
  • Hermes 3 Llama 3.1 70B · 70B · Q4_K_M → 48 GB · 55%
  • Llama 3.1 70B Instruct · 70B · Q4_K_M → 48 GB · 55%
  • Llama 4 70B · 70B · AWQ-INT4 → 48 GB · 55%
  • EVA Llama 3.3 70B · 70B · AWQ-INT4 → 48 GB · 55%
  • Jamba 1.5 Mini · 52B · Q4_K_M → 36 GB · 41%
  • Nemotron 3 Super 49B · 49B · AWQ-INT4 → 32 GB · 36%
  • Mixtral 8x7B Instruct · 47B · Q4_K_M → 32 GB · 36%
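The Q4_K_M rows are GGUF quants, which usually means llama.cpp rather than vLLM. A minimal multi-GPU sketch, assuming a CUDA build of llama.cpp and a downloaded 70B Q4_K_M file; the model path is a placeholder:

    # Spread layers across all four cards; llama.cpp splits by layer by default
    llama-server \
      -m /models/llama-3.3-70b-instruct.Q4_K_M.gguf \
      -ngl 99 \
      --split-mode layer

Add --tensor-split to weight the distribution if one card also drives a display and has less free VRAM than the others.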

Borderline (8)

Little or no headroom at the listed quant: KV cache for long context will not fit. For every row below, cap context at ~4-8K (sketch after the list) or move to a larger combo, and verify before deployment. Rows read: model · params · quant → weights · share of effective VRAM.

  • GLM-5 Pro · 144B · AWQ-INT4 → 96 GB · 109%
  • Mixtral 8x22B Instruct · 141B · Q4_K_M → 96 GB · 109%
  • WizardLM-2 8x22B · 141B · Q4_K_M → 96 GB · 109%
  • DBRX Base · 132B · Q4_K_M → 96 GB · 109%
  • DBRX Instruct · 132B · AWQ-INT4 → 96 GB · 109%
  • Mistral Large 2 (123B) · 123B · Q4_K_M → 88 GB · 100%
  • Nemotron 3 Super (120B-A12B) · 120B · Q4_K_M → 84 GB · 95%
  • Llama 4 Scout · 109B · Q4_K_M → 80 GB · 91%
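Capping context in vLLM is a single flag. The 8192 below matches the ~4-8K guidance above; pushing --gpu-memory-utilization past the 0.90 default assumes nothing else is resident on the cards, and the model path is a placeholder:

    # Borderline fits: trade context length for weight room
    vllm serve <120b-class-model> \
      --tensor-parallel-size 4 \
      --max-model-len 8192 \
      --gpu-memory-utilization 0.95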

Not practical (8)

Model weights exceed effective combo VRAM. Even with the recommended split strategy, none of these run cleanly; drop to a smaller quant or move to a larger combo. Rows read: model · params · quant → weights · share of effective VRAM.

  • DeepSeek V4 Pro (1.6T MoE) · 1600B · Q4_K_M → 1024 GB · 1164%
  • Kimi K2.6 · 1000B · Q4_K_M → 700 GB · 795%
  • Step-3 · 1000B · AWQ-INT4 → 640 GB · 727%
  • DeepSeek V4 · 745B · AWQ-INT4 → 480 GB · 545%
  • Mistral Medium 3.5 (675B MoE) · 675B · Q4_K_M → 448 GB · 509%
  • DeepSeek R1 (671B reasoning) · 671B · Q4_K_M → 420 GB · 477%
  • DeepSeek V3 (671B MoE) · 671B · Q4_K_M → 420 GB · 477%
  • Llama 4 405B · 405B · AWQ-INT4 → 280 GB · 318%

Benchmark opportunities (estimates, not measurements)

Pending benchmark targets for this combo. Once measured, results land in the catalog as benchmarks.

  • 4× RTX 3090 + DeepSeek R1 Distill Llama 70B (vLLM TP-4) · pending
    Estimate: 20-28 tok/s decode (with thinking-mode bloat). Reasoning workload on a quad-3090: the R1 distill produces 5-15× more tokens per query, so per-stream throughput drops vs a same-size non-reasoning model.
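One way to turn the pending estimate into a measurement. Recent vLLM builds ship a bench subcommand (older builds carry the same tool as benchmarks/benchmark_throughput.py in the vLLM repo); the model path and token counts here are placeholders approximating a decode-heavy reasoning workload:

    # Decode-heavy run: long outputs dominate, as with thinking-mode traffic
    vllm bench throughput \
      --model <r1-distill-llama-70b> \
      --tensor-parallel-size 4 \
      --input-len 256 \
      --output-len 2048 \
      --num-prompts 8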

Going deeper

  • Full combo detail page — operational review with failure modes and runtime matrix.
  • Multi-GPU buying guide — when multi-GPU is worth it and when it isn't.
  • Will-it-run home — single-card check + custom builds.