NVIDIA GeForce RTX 5090
Blackwell flagship. 32GB GDDR7 on a 512-bit bus delivers ~1.79 TB/s memory bandwidth — the new top of consumer hardware for local LLM inference. Comfortably loads 70B Q4 with room for context.
Affiliate disclosure: as an Amazon Associate and partner of other retailers, we earn from qualifying purchases. The verdict on this page is our editorial opinion; affiliate links never influence what we recommend.
Sub-scores sum to 900 / 1000. Headline = 900 × 0.70 (Estimated-confidence discount) = 630. This is an algorithmic performance-tier score — distinct from, and often lower than, the editorial “Our verdict” below, which weighs value and real-world fit (especially for hardware we haven’t measured yet). How scoring works →
Extrapolated from 1792 GB/s bandwidth — 215.0 tok/s estimated. No measured benchmarks yet.
Plain-English: Comfortable at 32B and below — snappy enough for a coding agent; vision models supported.
Verdicts extrapolated from catalog VRAM + bandwidth + ecosystem flags. Hover any chip for the rationale. Want measured numbers? Submit your own run with runlocalai-bench --submit.
What it does well
The 32 GB VRAM is the operational headline — it's the smallest amount of memory that runs the 70B-class models fully on-GPU at Q4 (Llama 3.3 70B, DeepSeek R1 Distill 70B, Qwen 2.5 72B all land in 39–42 GB with KV cache headroom for 8K context). Memory bandwidth at ~1.79 TB/s is the second headline — that's roughly 1.8× the RTX 4090's 1.0 TB/s, and decode speed scales nearly linearly with bandwidth on memory-bound workloads, so 70B Q4 runs at ~40–55 tok/s on a 5090 versus ~22–28 tok/s on a 4090 with system-RAM offload. CUDA support is universal: every local runtime (vLLM, llama.cpp, Ollama, SGLang) has a happy path on consumer Blackwell.
Where it breaks
- 575 W TGP is real. This is a 1000 W+ PSU card, not a 750 W card. Add headroom for CPU + drives + transient spikes; many operators end up at 1200 W. The 12V-2x6 connector replaces the controversial 4090-era 12VHPWR but the fitment + power-budget caution stays.
- Supply + price are not normal yet. 2025-into-2026 retail is supply-constrained — MSRP $1,999 is rarely the price you actually pay. Scalper-adjacent pricing of $2,300–2,800 is the operator-grade reality.
- 32B-class workloads are over-spec. If your daily target is Qwen 3 32B / Qwen 2.5 Coder 32B / QwQ 32B, the 5090 isn't doing more for you than a 4090 does — the workload fits 24 GB. You're paying the 5090 premium for headroom you don't need.
- Multi-GPU economics are awkward. Two 5090s for ~$5,000 buy you 64 GB combined VRAM. Two used 3090s buy you 48 GB combined for ~$1,800. For homelab operators chasing $/VRAM, the calculus often favors the older silicon.
Ideal model range
- Sweet spot: 70B-class at Q4 full-GPU — Llama 3.3 70B, DeepSeek R1 Distill 70B, Qwen 2.5 72B at ~40–55 tok/s with comfortable 8–16K context.
- Stretch: 70B at Q5_K_M (50 GB) — partial offload to system RAM, drops to ~28–38 tok/s. Or 32B FP16 (64 GB) — same partial-offload story.
- Comfortable: 32B-class at Q4/Q8 with full 32K context, or 14B-class with 128K context, both at 80+ tok/s with significant headroom.
- Future-proof zone: emerging 32B reasoning models (R1-class) with extended thinking-token budgets fit comfortably; agent loops with 32–64K live context don't pressure VRAM.
Bad use cases
- Genuine frontier-MoE workloads — DeepSeek V3 671B, Llama 4 Maverick / Behemoth — need workstation hardware (RTX 6000 Ada / RTX PRO 6000 Blackwell) or multi-GPU. 32 GB doesn't change that math.
- Power-constrained builds — mini-ITX cases, 750 W PSUs, anyone running 24/7 inference and paying retail electricity. The 5090 is a thermal + power statement.
- Maximum tok/s on small models — 7B at >~300 tok/s is throughput territory where smaller cards (RTX 5070 Ti, RTX 4070 Super) are better $/throughput. The 5090 is over-specced for sub-13B workloads.
- Anyone betting on future supply normalization — if the 5090 is still scalper-priced when you check, the 4090 used market and the dual-3090 path are honest alternatives. Don't pay 30%+ premium for a card you can wait on.
Verdict
Buy this if 70B Q4 fully on GPU is your daily-driver target, you can find one at $2,300 or below (so within a reasonable scalper premium), AND your build has 1000 W+ PSU + thermal headroom + a case that fits a 4-slot card. The 32 GB at 1.79 TB/s is genuinely the new sweet spot for serious local-AI work in 2026.
Skip this if the RTX 4090 covers your model range (32B-class), if used RTX 3090s make better $/VRAM sense for a multi-GPU rig, if your power envelope is tight, or if you're price-sensitive enough that a 30%+ scalper premium hurts. The 5090 isn't a value play; it's a capability play.
How it compares
- vs RTX 4090 → 4090's 24 GB caps at 32B-class full-GPU and forces partial offload on 70B (~22–28 tok/s vs 5090's ~40–55 tok/s). Pick 5090 when 70B is the target, or you can wait for normal pricing. See /compare/rtx-4090-vs-rtx-5090.
- vs Dual RTX 3090 → 48 GB combined VRAM at ~$1,800 used vs $2,500 new. Better $/VRAM but real multi-GPU complexity (NCCL, driver pinning, PCIe lane budgeting). See /compare/dual-3090-vs-rtx-5090.
- vs Apple M4 Max 128 GB → unified memory comfortably runs 70B FP16 (~140 GB) where the 5090 can't. Apple wins on memory ceiling + total system noise/power; 5090 wins on raw decode speed + CUDA ecosystem maturity. See /compare/apple-m4-max-vs-rtx-5090.
- vs RTX 5080 (16 GB) → wrong tier — 5080 caps at 13B-class full-GPU. The 16 GB → 32 GB jump is the whole reason to pay the 5090 premium.
- vs RX 7900 XTX → 7900 XTX matches 24 GB at half the price but ROCm software stack still trails NVIDIA. Pick 5090 for production-grade local AI; 7900 XTX for hobby + Linux + tight budget.
- vs RTX 6000 Ada / RTX PRO 6000 Blackwell → 48–96 GB workstation VRAM at $7,000–$10,000. Right answer when you need >32 GB and can pay; the 5090 is the consumer ceiling, not the absolute ceiling.
Overview
What the RTX 5090 actually is, in local-AI terms
The RTX 5090 is the new consumer-flagship local-AI GPU in 2026 — 32 GB of GDDR7 at ~1.79 TB/s memory bandwidth, the Blackwell consumer architecture with native FP4 acceleration, and the first single consumer card with enough VRAM to host a 70B-class model at INT4 with comfortable context headroom on one PCIe slot. It is roughly 1.5-1.8× faster than the RTX 4090 on most LLM workloads and adds the architectural piece — FP4 — that consumer cards have lacked since the H100 introduced FP8 on Hopper.
It is also expensive, power-hungry, and supply-constrained through most of 2026. For operators who do not need 32 GB and do not need FP4, the 4090 still wins on dollars-per-token. For operators who do need either, there's no alternative below the RTX A6000 or RTX Pro 6000 Blackwell in the consumer-adjacent tier.
Where it fits in the hardware ladder
In the consumer-NVIDIA tier:
| Card | VRAM | BW | Bin |
|---|---|---|---|
| RTX 4090 | 24 GB | 1008 GB/s | workstation default through 2025 |
| RTX 5090 | 32 GB | 1792 GB/s | consumer flagship 2026 |
| RTX Pro 6000 Blackwell | 96 GB | ~1.8 TB/s | workstation tier above 5090 |
vs the datacenter ladder:
| Card | VRAM | BW | Notes |
|---|---|---|---|
| RTX 5090 | 32 GB | 1.79 TB/s | consumer; no NVLink |
| H100 SXM | 80 GB | 3.35 TB/s | datacenter; NVLink |
| H200 | 141 GB | 4.8 TB/s | datacenter capacity tier |
The 5090's 32 GB ceiling is what defines the 2026 "consumer-tier sweet spot" — large enough that 70B at INT4 fits comfortably, small enough that 405B is firmly out of reach without a multi-card or datacenter step.
Best use cases
- Single-card 70B-class inference. Llama 3.3 70B at AWQ-INT4 fits with realistic context headroom on a single 5090 — the first time this has been true on a consumer card. Pair with vLLM or ExLlamaV2.
- High-throughput single-user agentic stacks. Qwen 2.5 Coder 32B at FP16 fits with substantial context; a 4090 can't do that. See /stacks/local-coding-agent.
- FP4 inference experimentation. Blackwell consumer cards expose FP4 acceleration; the engines that target it (TensorRT-LLM, vLLM) are catching up through 2026.
- Local fine-tuning of 13B-32B models with QLoRA via PyTorch + bitsandbytes; 32 GB is enough to hold a quantized 32B + optimizer states + a meaningful batch.
- Concurrent image-gen + LLM. A 5090 can host a Stable Diffusion XL-class model and a 7B-13B chat model simultaneously without thrashing.
What it can run
The realistic working set on a single 5090 in May 2026:
| Model class | Quant | Context | Headroom |
|---|---|---|---|
| 7B | F16 | 128K | massive |
| 13B-14B | F16 | 64K | comfortable |
| 32B | F16 | 32K | comfortable |
| 32B | AWQ-INT4 | 128K | substantial |
| 70B | AWQ-INT4 / EXL2 4.0bpw | 16-32K | tight but works |
| 70B | FP4 (when engine-supported) | 32K | comfortable |
| 405B | — | — | does NOT fit single card |
For 405B-class you need a datacenter tier — see NVIDIA H100 SXM and /stacks/h100-tensor-parallel-workstation.
OS support
| OS | Quality |
|---|---|
| Linux (Ubuntu 24.04 LTS) | excellent — reference |
| Windows 11 native | excellent |
| Windows (WSL2) | excellent |
| macOS | unsupported |
If your CUDA path is broken on WSL2, see /errors/wsl2-gpu-not-detected.
Software / runtime support
The 5090's Blackwell architecture is supported across the leading-edge inference engines, with the caveat that engine support for FP4 lags hardware availability through 2026:
- Ollama / llama.cpp — full GGUF + CUDA; FP4 lands incrementally
- vLLM — full AWQ / GPTQ / FP16 / FP8; FP4 maturing through 2026
- SGLang — same coverage as vLLM
- ExLlamaV2 — single-stream throughput king on this hardware via TabbyAPI
- TensorRT-LLM — first-class; FP4 path the most mature here
- LM Studio — full GUI path with CUDA acceleration
- PyTorch — first-class CUDA target
What breaks first
- Power delivery. The 5090 pulls up to 575 W under sustained inference; the 12V-2x6 connector + cheap PSUs is a known fire-and-instability path. Pair with a Platinum-rated 1200 W+ PSU and high-quality cabling.
- Thermals in compact cases. 575 W of dissipation is a real cooling problem; small-form-factor builds throttle quickly without aggressive airflow.
- CUDA toolkit / driver lag for FP4. Engines are still catching up; expect a 6-12 month tail of "this engine doesn't yet use the 5090's FP4 path" through 2026.
- PCIe Gen5 x16 dependency. The 5090 wants Gen5 bandwidth for prefill on long contexts; older Gen4 boards still work but are bandwidth-limited.
- Multi-GPU absence of NVLink. Like the 4090, no NVLink — multi-card is PCIe only.
Alternatives by intent
| If you want… | Reach for |
|---|---|
| Cheaper, same-tier consumer | RTX 5080 (16 GB) or used RTX 4090 |
| Even more VRAM | RTX Pro 6000 Blackwell (96 GB) or RTX A6000 (48 GB) |
| 70B FP16 single-machine | Apple M3 Ultra 192 GB unified memory |
| AMD path | RX 9070 XT — much cheaper, ROCm tax applies |
| Datacenter throughput | H100 SXM or H200 |
Best pairings
- vLLM + 70B AWQ-INT4 — the canonical multi-user homelab default
- ExLlamaV2 + EXL2 4.65bpw + 70B — the single-stream king setup
- TensorRT-LLM FP4 path — the throughput-king path as engines mature through 2026
- Ubuntu 24.04 LTS + CUDA 12.6+ + Open WebUI in Docker — the homelab default
- Continue.dev routed at vLLM for the 32B-class coding agent — see /stacks/local-coding-agent
Who should avoid the RTX 5090
- Operators happy with 24 GB. A 4090 is dramatically better dollars-per-token; the 5090 only wins when 32 GB or FP4 matter.
- Anyone on a sub-1200 W PSU. Power delivery is non-negotiable.
- Compact-case builders without aggressive cooling. 575 W is a real thermal problem.
- Apple-ecosystem operators. Different stack entirely.
- Workloads where 13B-class models suffice. A 16 GB card saves ~$2000 at the same tier of usefulness.
Related
- Stacks: /stacks/local-coding-agent, /stacks/h100-tensor-parallel-workstation
- System guides: /guides/running-local-ai-on-multiple-gpus-2026, /systems/quantization-formats
- Tools: vLLM, TensorRT-LLM, ExLlamaV2
- Errors: /errors/wsl2-gpu-not-detected
Some links above are affiliate links. We may earn a commission at no extra cost to you. How we make money.
Specs
| VRAM | 32 GB |
| Power draw (peak) | 575 W |
| Released | 2025 |
| MSRP | $1999 |
| Backends | CUDA Vulkan |
Models that fit
Open-weight models small enough to run on NVIDIA GeForce RTX 5090 with usable context.
The 5090 only justifies its price for buyers who specifically need 32 GB on one card or are running production image/video gen. The guides below cover those workloads.
Frequently asked
What models can NVIDIA GeForce RTX 5090 run?
Does NVIDIA GeForce RTX 5090 support CUDA?
How much does NVIDIA GeForce RTX 5090 cost?
Where next?
Reviewed by RunLocalAI Editorial. See our editorial policy for how we research and verify hardware specifications.