UNIT · NVIDIA · GPU

32 GB VRAMenthusiastReviewed June 2026

NVIDIA GeForce RTX 5090

diagram

Credit: RunLocalAI·License: CC-BY-4.0 (original illustration)·Source

Blackwell flagship. 32GB GDDR7 on a 512-bit bus delivers ~1.79 TB/s memory bandwidth — the new top of consumer hardware for local LLM inference. Comfortably loads 70B Q4 with room for context.

Released 2025·~$2499 street·1792 GB/s memory bandwidth

▼ CHECK CURRENT PRICE· 1 retailer

NVIDIA GeForce RTX 5090

Check on Amazon

Affiliate disclosure: as an Amazon Associate and partner of other retailers, we earn from qualifying purchases. The verdict on this page is our editorial opinion; affiliate links never influence what we recommend.

RUNLOCALAI SCORE

See full leaderboard →

630/ 1000

BB-tier

Estimated

Throughput

500/ 500

VRAM-fit

170/ 200

Ecosystem

200/ 200

Efficiency

30/ 100

Sub-scores sum to 900 / 1000. Headline = 900 × 0.70 (Estimated-confidence discount) = 630. This is an algorithmic performance-tier score — distinct from, and often lower than, the editorial “Our verdict” below, which weighs value and real-world fit (especially for hardware we haven’t measured yet). How scoring works →

Extrapolated from 1792 GB/s bandwidth — 215.0 tok/s estimated. No measured benchmarks yet.

WORKLOAD FIT

Try other hardware →

Plain-English: Comfortable at 32B and below — snappy enough for a coding agent; vision models supported.

7B chat✓

Comfortable

14B chat✓

Comfortable

32B chat✓

Comfortable

70B chat✗

Doesn't fit

Coding agent✓

Comfortable

Vision (≤8B VLM)✓

Comfortable

Long context (32K)✓

Comfortable

✓Comfortable — fits with headroom

~Tight — works, no slack

△Marginal — needs aggressive quant

✗Doesn't fit usefully

Verdicts extrapolated from catalog VRAM + bandwidth + ecosystem flags. Hover any chip for the rationale. Want measured numbers? Submit your own run with runlocalai-bench --submit.

BLK · VERDICT

Our verdict

OP · Fredoline Eruo|VERIFIED JUN 12, 2026

9.6/10

What it does well

The 32 GB VRAM is the operational headline — it's the smallest amount of memory that runs the 70B-class models fully on-GPU at Q4 (Llama 3.3 70B, DeepSeek R1 Distill 70B, Qwen 2.5 72B all land in 39–42 GB with KV cache headroom for 8K context). Memory bandwidth at ~1.79 TB/s is the second headline — that's roughly 1.8× the RTX 4090's 1.0 TB/s, and decode speed scales nearly linearly with bandwidth on memory-bound workloads, so 70B Q4 runs at ~40–55 tok/s on a 5090 versus ~22–28 tok/s on a 4090 with system-RAM offload. CUDA support is universal: every local runtime (vLLM, llama.cpp, Ollama, SGLang) has a happy path on consumer Blackwell.

Where it breaks

575 W TGP is real. This is a 1000 W+ PSU card, not a 750 W card. Add headroom for CPU + drives + transient spikes; many operators end up at 1200 W. The 12V-2x6 connector replaces the controversial 4090-era 12VHPWR but the fitment + power-budget caution stays.
Supply + price are not normal yet. 2025-into-2026 retail is supply-constrained — MSRP $1,999 is rarely the price you actually pay. Scalper-adjacent pricing of $2,300–2,800 is the operator-grade reality.
32B-class workloads are over-spec. If your daily target is Qwen 3 32B / Qwen 2.5 Coder 32B / QwQ 32B, the 5090 isn't doing more for you than a 4090 does — the workload fits 24 GB. You're paying the 5090 premium for headroom you don't need.
Multi-GPU economics are awkward. Two 5090s for ~$5,000 buy you 64 GB combined VRAM. Two used 3090s buy you 48 GB combined for ~$1,800. For homelab operators chasing $/VRAM, the calculus often favors the older silicon.

Ideal model range

Sweet spot: 70B-class at Q4 full-GPU — Llama 3.3 70B, DeepSeek R1 Distill 70B, Qwen 2.5 72B at ~40–55 tok/s with comfortable 8–16K context.
Stretch: 70B at Q5_K_M (50 GB) — partial offload to system RAM, drops to ~28–38 tok/s. Or 32B FP16 (64 GB) — same partial-offload story.
Comfortable: 32B-class at Q4/Q8 with full 32K context, or 14B-class with 128K context, both at 80+ tok/s with significant headroom.
Future-proof zone: emerging 32B reasoning models (R1-class) with extended thinking-token budgets fit comfortably; agent loops with 32–64K live context don't pressure VRAM.

Bad use cases

Genuine frontier-MoE workloads — DeepSeek V3 671B, Llama 4 Maverick / Behemoth — need workstation hardware (RTX 6000 Ada / RTX PRO 6000 Blackwell) or multi-GPU. 32 GB doesn't change that math.
Power-constrained builds — mini-ITX cases, 750 W PSUs, anyone running 24/7 inference and paying retail electricity. The 5090 is a thermal + power statement.
Maximum tok/s on small models — 7B at >~300 tok/s is throughput territory where smaller cards (RTX 5070 Ti, RTX 4070 Super) are better $/throughput. The 5090 is over-specced for sub-13B workloads.
Anyone betting on future supply normalization — if the 5090 is still scalper-priced when you check, the 4090 used market and the dual-3090 path are honest alternatives. Don't pay 30%+ premium for a card you can wait on.

Verdict

Buy this if 70B Q4 fully on GPU is your daily-driver target, you can find one at $2,300 or below (so within a reasonable scalper premium), AND your build has 1000 W+ PSU + thermal headroom + a case that fits a 4-slot card. The 32 GB at 1.79 TB/s is genuinely the new sweet spot for serious local-AI work in 2026.

Skip this if the RTX 4090 covers your model range (32B-class), if used RTX 3090s make better $/VRAM sense for a multi-GPU rig, if your power envelope is tight, or if you're price-sensitive enough that a 30%+ scalper premium hurts. The 5090 isn't a value play; it's a capability play.

How it compares

vs RTX 4090 → 4090's 24 GB caps at 32B-class full-GPU and forces partial offload on 70B (~22–28 tok/s vs 5090's ~40–55 tok/s). Pick 5090 when 70B is the target, or you can wait for normal pricing. See /compare/rtx-4090-vs-rtx-5090.
vs Dual RTX 3090 → 48 GB combined VRAM at ~$1,800 used vs $2,500 new. Better $/VRAM but real multi-GPU complexity (NCCL, driver pinning, PCIe lane budgeting). See /compare/dual-3090-vs-rtx-5090.
vs Apple M4 Max 128 GB → unified memory comfortably runs 70B FP16 (~140 GB) where the 5090 can't. Apple wins on memory ceiling + total system noise/power; 5090 wins on raw decode speed + CUDA ecosystem maturity. See /compare/apple-m4-max-vs-rtx-5090.
vs RTX 5080 (16 GB) → wrong tier — 5080 caps at 13B-class full-GPU. The 16 GB → 32 GB jump is the whole reason to pay the 5090 premium.
vs RX 7900 XTX → 7900 XTX matches 24 GB at half the price but ROCm software stack still trails NVIDIA. Pick 5090 for production-grade local AI; 7900 XTX for hobby + Linux + tight budget.
vs RTX 6000 Ada / RTX PRO 6000 Blackwell → 48–96 GB workstation VRAM at $7,000–$10,000. Right answer when you need >32 GB and can pay; the 5090 is the consumer ceiling, not the absolute ceiling.

BLK · OVERVIEW

Overview

What the RTX 5090 actually is, in local-AI terms

The RTX 5090 is the new consumer-flagship local-AI GPU in 2026 — 32 GB of GDDR7 at ~1.79 TB/s memory bandwidth, the Blackwell consumer architecture with native FP4 acceleration, and the first single consumer card with enough VRAM to host a 70B-class model at INT4 with comfortable context headroom on one PCIe slot. It is roughly 1.5-1.8× faster than the RTX 4090 on most LLM workloads and adds the architectural piece — FP4 — that consumer cards have lacked since the H100 introduced FP8 on Hopper.

It is also expensive, power-hungry, and supply-constrained through most of 2026. For operators who do not need 32 GB and do not need FP4, the 4090 still wins on dollars-per-token. For operators who do need either, there's no alternative below the RTX A6000 or RTX Pro 6000 Blackwell in the consumer-adjacent tier.

Where it fits in the hardware ladder

In the consumer-NVIDIA tier:

Card	VRAM	BW	Bin
RTX 4090	24 GB	1008 GB/s	workstation default through 2025
RTX 5090	32 GB	1792 GB/s	consumer flagship 2026
RTX Pro 6000 Blackwell	96 GB	~1.8 TB/s	workstation tier above 5090

vs the datacenter ladder:

Card	VRAM	BW	Notes
RTX 5090	32 GB	1.79 TB/s	consumer; no NVLink
H100 SXM	80 GB	3.35 TB/s	datacenter; NVLink
H200	141 GB	4.8 TB/s	datacenter capacity tier

The 5090's 32 GB ceiling is what defines the 2026 "consumer-tier sweet spot" — large enough that 70B at INT4 fits comfortably, small enough that 405B is firmly out of reach without a multi-card or datacenter step.

Best use cases

Single-card 70B-class inference. Llama 3.3 70B at AWQ-INT4 fits with realistic context headroom on a single 5090 — the first time this has been true on a consumer card. Pair with vLLM or ExLlamaV2.
High-throughput single-user agentic stacks. Qwen 2.5 Coder 32B at FP16 fits with substantial context; a 4090 can't do that. See /stacks/local-coding-agent.
FP4 inference experimentation. Blackwell consumer cards expose FP4 acceleration; the engines that target it (TensorRT-LLM, vLLM) are catching up through 2026.
Local fine-tuning of 13B-32B models with QLoRA via PyTorch + bitsandbytes; 32 GB is enough to hold a quantized 32B + optimizer states + a meaningful batch.
Concurrent image-gen + LLM. A 5090 can host a Stable Diffusion XL-class model and a 7B-13B chat model simultaneously without thrashing.

What it can run

The realistic working set on a single 5090 in May 2026:

Model class	Quant	Context	Headroom
7B	F16	128K	massive
13B-14B	F16	64K	comfortable
32B	F16	32K	comfortable
32B	AWQ-INT4	128K	substantial
70B	AWQ-INT4 / EXL2 4.0bpw	16-32K	tight but works
70B	FP4 (when engine-supported)	32K	comfortable
405B	—	—	does NOT fit single card

For 405B-class you need a datacenter tier — see NVIDIA H100 SXM and /stacks/h100-tensor-parallel-workstation.

OS support

OS	Quality
Linux (Ubuntu 24.04 LTS)	excellent — reference
Windows 11 native	excellent
Windows (WSL2)	excellent
macOS	unsupported

If your CUDA path is broken on WSL2, see /errors/wsl2-gpu-not-detected.

Software / runtime support

The 5090's Blackwell architecture is supported across the leading-edge inference engines, with the caveat that engine support for FP4 lags hardware availability through 2026:

Ollama / llama.cpp — full GGUF + CUDA; FP4 lands incrementally
vLLM — full AWQ / GPTQ / FP16 / FP8; FP4 maturing through 2026
SGLang — same coverage as vLLM
ExLlamaV2 — single-stream throughput king on this hardware via TabbyAPI
TensorRT-LLM — first-class; FP4 path the most mature here
LM Studio — full GUI path with CUDA acceleration
PyTorch — first-class CUDA target

What breaks first

Power delivery. The 5090 pulls up to 575 W under sustained inference; the 12V-2x6 connector + cheap PSUs is a known fire-and-instability path. Pair with a Platinum-rated 1200 W+ PSU and high-quality cabling.
Thermals in compact cases. 575 W of dissipation is a real cooling problem; small-form-factor builds throttle quickly without aggressive airflow.
CUDA toolkit / driver lag for FP4. Engines are still catching up; expect a 6-12 month tail of "this engine doesn't yet use the 5090's FP4 path" through 2026.
PCIe Gen5 x16 dependency. The 5090 wants Gen5 bandwidth for prefill on long contexts; older Gen4 boards still work but are bandwidth-limited.
Multi-GPU absence of NVLink. Like the 4090, no NVLink — multi-card is PCIe only.

Alternatives by intent

If you want…	Reach for
Cheaper, same-tier consumer	RTX 5080 (16 GB) or used RTX 4090
Even more VRAM	RTX Pro 6000 Blackwell (96 GB) or RTX A6000 (48 GB)
70B FP16 single-machine	Apple M3 Ultra 192 GB unified memory
AMD path	RX 9070 XT — much cheaper, ROCm tax applies
Datacenter throughput	H100 SXM or H200

Best pairings

vLLM + 70B AWQ-INT4 — the canonical multi-user homelab default
ExLlamaV2 + EXL2 4.65bpw + 70B — the single-stream king setup
TensorRT-LLM FP4 path — the throughput-king path as engines mature through 2026
Ubuntu 24.04 LTS + CUDA 12.6+ + Open WebUI in Docker — the homelab default
Continue.dev routed at vLLM for the 32B-class coding agent — see /stacks/local-coding-agent

Who should avoid the RTX 5090

Operators happy with 24 GB. A 4090 is dramatically better dollars-per-token; the 5090 only wins when 32 GB or FP4 matter.
Anyone on a sub-1200 W PSU. Power delivery is non-negotiable.
Compact-case builders without aggressive cooling. 575 W is a real thermal problem.
Apple-ecosystem operators. Different stack entirely.
Workloads where 13B-class models suffice. A 16 GB card saves ~$2000 at the same tier of usefulness.

Stacks: /stacks/local-coding-agent, /stacks/h100-tensor-parallel-workstation
System guides: /guides/running-local-ai-on-multiple-gpus-2026, /systems/quantization-formats
Tools: vLLM, TensorRT-LLM, ExLlamaV2
Errors: /errors/wsl2-gpu-not-detected

Retailers we'd check:Amazon

Some links above are affiliate links. We may earn a commission at no extra cost to you. How we make money.

BLK · SPECS

Specs

VRAM	32 GB
Power draw (peak)	575 W
Released	2025
MSRP	$1999
Backends	CUDA Vulkan

Models that fit

Open-weight models small enough to run on NVIDIA GeForce RTX 5090 with usable context.

Nomic Embed Text v1.5

0.137B · other

Kokoro 82M

0.082B · other

Llama 3.1 8B Instruct

8B · llama

XTTS v2

0.46B · other

Buyer guides where this card is the right answer

The 5090 only justifies its price for buyers who specifically need 32 GB on one card or are running production image/video gen. The guides below cover those workloads.

Frequently asked

What models can NVIDIA GeForce RTX 5090 run?

With 32GB VRAM, the NVIDIA GeForce RTX 5090 runs models up to ~32B in 4-bit, with room for context. See the model list below for tested combinations.

Does NVIDIA GeForce RTX 5090 support CUDA?

Yes — NVIDIA GeForce RTX 5090 is an NVIDIA card with full CUDA support, the most mature local-AI backend. llama.cpp, Ollama, vLLM, and ExLlamaV2 all run natively.

How much does NVIDIA GeForce RTX 5090 cost?

Current street price for NVIDIA GeForce RTX 5090 is around $2499 (MSRP $1999). Prices vary by region and supply.

Where next?

Compare NVIDIA GeForce RTX 5090

Buyer guides

Troubleshooting

Reviewed by RunLocalAI Editorial. See our editorial policy for how we research and verify hardware specifications.

UNIT · NVIDIA · GPU

32 GB VRAMenthusiastReviewed June 2026

NVIDIA GeForce RTX 5090

diagram

Credit: RunLocalAI·License: CC-BY-4.0 (original illustration)·Source

Blackwell flagship. 32GB GDDR7 on a 512-bit bus delivers ~1.79 TB/s memory bandwidth — the new top of consumer hardware for local LLM inference. Comfortably loads 70B Q4 with room for context.

Released 2025·~$2499 street·1792 GB/s memory bandwidth

▼ CHECK CURRENT PRICE· 1 retailer

NVIDIA GeForce RTX 5090

Check on Amazon

RUNLOCALAI SCORE

See full leaderboard →

630/ 1000

BB-tier

Estimated

Throughput

500/ 500

VRAM-fit

170/ 200

Ecosystem

200/ 200

Efficiency

30/ 100

Extrapolated from 1792 GB/s bandwidth — 215.0 tok/s estimated. No measured benchmarks yet.

WORKLOAD FIT

Try other hardware →

Plain-English: Comfortable at 32B and below — snappy enough for a coding agent; vision models supported.

7B chat✓

Comfortable

14B chat✓

Comfortable

32B chat✓

Comfortable

70B chat✗

Doesn't fit

Coding agent✓

Comfortable

Vision (≤8B VLM)✓

Comfortable

Long context (32K)✓

Comfortable

✓Comfortable — fits with headroom

~Tight — works, no slack

△Marginal — needs aggressive quant

✗Doesn't fit usefully

Verdicts extrapolated from catalog VRAM + bandwidth + ecosystem flags. Hover any chip for the rationale. Want measured numbers? Submit your own run with runlocalai-bench --submit.

BLK · VERDICT

Our verdict

OP · Fredoline Eruo|VERIFIED JUN 12, 2026

9.6/10

What it does well

Where it breaks

575 W TGP is real. This is a 1000 W+ PSU card, not a 750 W card. Add headroom for CPU + drives + transient spikes; many operators end up at 1200 W. The 12V-2x6 connector replaces the controversial 4090-era 12VHPWR but the fitment + power-budget caution stays.
Supply + price are not normal yet. 2025-into-2026 retail is supply-constrained — MSRP $1,999 is rarely the price you actually pay. Scalper-adjacent pricing of $2,300–2,800 is the operator-grade reality.
32B-class workloads are over-spec. If your daily target is Qwen 3 32B / Qwen 2.5 Coder 32B / QwQ 32B, the 5090 isn't doing more for you than a 4090 does — the workload fits 24 GB. You're paying the 5090 premium for headroom you don't need.
Multi-GPU economics are awkward. Two 5090s for ~$5,000 buy you 64 GB combined VRAM. Two used 3090s buy you 48 GB combined for ~$1,800. For homelab operators chasing $/VRAM, the calculus often favors the older silicon.

Ideal model range

Sweet spot: 70B-class at Q4 full-GPU — Llama 3.3 70B, DeepSeek R1 Distill 70B, Qwen 2.5 72B at ~40–55 tok/s with comfortable 8–16K context.
Stretch: 70B at Q5_K_M (50 GB) — partial offload to system RAM, drops to ~28–38 tok/s. Or 32B FP16 (64 GB) — same partial-offload story.
Comfortable: 32B-class at Q4/Q8 with full 32K context, or 14B-class with 128K context, both at 80+ tok/s with significant headroom.
Future-proof zone: emerging 32B reasoning models (R1-class) with extended thinking-token budgets fit comfortably; agent loops with 32–64K live context don't pressure VRAM.

Bad use cases

Genuine frontier-MoE workloads — DeepSeek V3 671B, Llama 4 Maverick / Behemoth — need workstation hardware (RTX 6000 Ada / RTX PRO 6000 Blackwell) or multi-GPU. 32 GB doesn't change that math.
Power-constrained builds — mini-ITX cases, 750 W PSUs, anyone running 24/7 inference and paying retail electricity. The 5090 is a thermal + power statement.
Maximum tok/s on small models — 7B at >~300 tok/s is throughput territory where smaller cards (RTX 5070 Ti, RTX 4070 Super) are better $/throughput. The 5090 is over-specced for sub-13B workloads.
Anyone betting on future supply normalization — if the 5090 is still scalper-priced when you check, the 4090 used market and the dual-3090 path are honest alternatives. Don't pay 30%+ premium for a card you can wait on.

Verdict

How it compares

vs RTX 4090 → 4090's 24 GB caps at 32B-class full-GPU and forces partial offload on 70B (~22–28 tok/s vs 5090's ~40–55 tok/s). Pick 5090 when 70B is the target, or you can wait for normal pricing. See /compare/rtx-4090-vs-rtx-5090.
vs Dual RTX 3090 → 48 GB combined VRAM at ~$1,800 used vs $2,500 new. Better $/VRAM but real multi-GPU complexity (NCCL, driver pinning, PCIe lane budgeting). See /compare/dual-3090-vs-rtx-5090.
vs Apple M4 Max 128 GB → unified memory comfortably runs 70B FP16 (~140 GB) where the 5090 can't. Apple wins on memory ceiling + total system noise/power; 5090 wins on raw decode speed + CUDA ecosystem maturity. See /compare/apple-m4-max-vs-rtx-5090.
vs RTX 5080 (16 GB) → wrong tier — 5080 caps at 13B-class full-GPU. The 16 GB → 32 GB jump is the whole reason to pay the 5090 premium.
vs RX 7900 XTX → 7900 XTX matches 24 GB at half the price but ROCm software stack still trails NVIDIA. Pick 5090 for production-grade local AI; 7900 XTX for hobby + Linux + tight budget.
vs RTX 6000 Ada / RTX PRO 6000 Blackwell → 48–96 GB workstation VRAM at $7,000–$10,000. Right answer when you need >32 GB and can pay; the 5090 is the consumer ceiling, not the absolute ceiling.

BLK · OVERVIEW

Overview

What the RTX 5090 actually is, in local-AI terms

Where it fits in the hardware ladder

In the consumer-NVIDIA tier:

Card	VRAM	BW	Bin
RTX 4090	24 GB	1008 GB/s	workstation default through 2025
RTX 5090	32 GB	1792 GB/s	consumer flagship 2026
RTX Pro 6000 Blackwell	96 GB	~1.8 TB/s	workstation tier above 5090

vs the datacenter ladder:

Card	VRAM	BW	Notes
RTX 5090	32 GB	1.79 TB/s	consumer; no NVLink
H100 SXM	80 GB	3.35 TB/s	datacenter; NVLink
H200	141 GB	4.8 TB/s	datacenter capacity tier

Best use cases

Single-card 70B-class inference. Llama 3.3 70B at AWQ-INT4 fits with realistic context headroom on a single 5090 — the first time this has been true on a consumer card. Pair with vLLM or ExLlamaV2.
High-throughput single-user agentic stacks. Qwen 2.5 Coder 32B at FP16 fits with substantial context; a 4090 can't do that. See /stacks/local-coding-agent.
FP4 inference experimentation. Blackwell consumer cards expose FP4 acceleration; the engines that target it (TensorRT-LLM, vLLM) are catching up through 2026.
Local fine-tuning of 13B-32B models with QLoRA via PyTorch + bitsandbytes; 32 GB is enough to hold a quantized 32B + optimizer states + a meaningful batch.
Concurrent image-gen + LLM. A 5090 can host a Stable Diffusion XL-class model and a 7B-13B chat model simultaneously without thrashing.

What it can run

The realistic working set on a single 5090 in May 2026:

Model class	Quant	Context	Headroom
7B	F16	128K	massive
13B-14B	F16	64K	comfortable
32B	F16	32K	comfortable
32B	AWQ-INT4	128K	substantial
70B	AWQ-INT4 / EXL2 4.0bpw	16-32K	tight but works
70B	FP4 (when engine-supported)	32K	comfortable
405B	—	—	does NOT fit single card

For 405B-class you need a datacenter tier — see NVIDIA H100 SXM and /stacks/h100-tensor-parallel-workstation.

OS support

OS	Quality
Linux (Ubuntu 24.04 LTS)	excellent — reference
Windows 11 native	excellent
Windows (WSL2)	excellent
macOS	unsupported

If your CUDA path is broken on WSL2, see /errors/wsl2-gpu-not-detected.

Software / runtime support

The 5090's Blackwell architecture is supported across the leading-edge inference engines, with the caveat that engine support for FP4 lags hardware availability through 2026:

Ollama / llama.cpp — full GGUF + CUDA; FP4 lands incrementally
vLLM — full AWQ / GPTQ / FP16 / FP8; FP4 maturing through 2026
SGLang — same coverage as vLLM
ExLlamaV2 — single-stream throughput king on this hardware via TabbyAPI
TensorRT-LLM — first-class; FP4 path the most mature here
LM Studio — full GUI path with CUDA acceleration
PyTorch — first-class CUDA target

What breaks first

Power delivery. The 5090 pulls up to 575 W under sustained inference; the 12V-2x6 connector + cheap PSUs is a known fire-and-instability path. Pair with a Platinum-rated 1200 W+ PSU and high-quality cabling.
Thermals in compact cases. 575 W of dissipation is a real cooling problem; small-form-factor builds throttle quickly without aggressive airflow.
CUDA toolkit / driver lag for FP4. Engines are still catching up; expect a 6-12 month tail of "this engine doesn't yet use the 5090's FP4 path" through 2026.
PCIe Gen5 x16 dependency. The 5090 wants Gen5 bandwidth for prefill on long contexts; older Gen4 boards still work but are bandwidth-limited.
Multi-GPU absence of NVLink. Like the 4090, no NVLink — multi-card is PCIe only.

Alternatives by intent

If you want…	Reach for
Cheaper, same-tier consumer	RTX 5080 (16 GB) or used RTX 4090
Even more VRAM	RTX Pro 6000 Blackwell (96 GB) or RTX A6000 (48 GB)
70B FP16 single-machine	Apple M3 Ultra 192 GB unified memory
AMD path	RX 9070 XT — much cheaper, ROCm tax applies
Datacenter throughput	H100 SXM or H200

Best pairings

vLLM + 70B AWQ-INT4 — the canonical multi-user homelab default
ExLlamaV2 + EXL2 4.65bpw + 70B — the single-stream king setup
TensorRT-LLM FP4 path — the throughput-king path as engines mature through 2026
Ubuntu 24.04 LTS + CUDA 12.6+ + Open WebUI in Docker — the homelab default
Continue.dev routed at vLLM for the 32B-class coding agent — see /stacks/local-coding-agent

Who should avoid the RTX 5090

Operators happy with 24 GB. A 4090 is dramatically better dollars-per-token; the 5090 only wins when 32 GB or FP4 matter.
Anyone on a sub-1200 W PSU. Power delivery is non-negotiable.
Compact-case builders without aggressive cooling. 575 W is a real thermal problem.
Apple-ecosystem operators. Different stack entirely.
Workloads where 13B-class models suffice. A 16 GB card saves ~$2000 at the same tier of usefulness.

Stacks: /stacks/local-coding-agent, /stacks/h100-tensor-parallel-workstation
System guides: /guides/running-local-ai-on-multiple-gpus-2026, /systems/quantization-formats
Tools: vLLM, TensorRT-LLM, ExLlamaV2
Errors: /errors/wsl2-gpu-not-detected

Retailers we'd check:Amazon

Some links above are affiliate links. We may earn a commission at no extra cost to you. How we make money.

BLK · SPECS

Specs

VRAM	32 GB
Power draw (peak)	575 W
Released	2025
MSRP	$1999
Backends	CUDA Vulkan

Models that fit

Open-weight models small enough to run on NVIDIA GeForce RTX 5090 with usable context.

Nomic Embed Text v1.5

0.137B · other

Kokoro 82M

0.082B · other

Llama 3.1 8B Instruct

8B · llama

XTTS v2

0.46B · other

Buyer guides where this card is the right answer

The 5090 only justifies its price for buyers who specifically need 32 GB on one card or are running production image/video gen. The guides below cover those workloads.

Frequently asked

What models can NVIDIA GeForce RTX 5090 run?

With 32GB VRAM, the NVIDIA GeForce RTX 5090 runs models up to ~32B in 4-bit, with room for context. See the model list below for tested combinations.

Does NVIDIA GeForce RTX 5090 support CUDA?

Yes — NVIDIA GeForce RTX 5090 is an NVIDIA card with full CUDA support, the most mature local-AI backend. llama.cpp, Ollama, vLLM, and ExLlamaV2 all run natively.

How much does NVIDIA GeForce RTX 5090 cost?

Current street price for NVIDIA GeForce RTX 5090 is around $2499 (MSRP $1999). Prices vary by region and supply.

Where next?

Compare NVIDIA GeForce RTX 5090

Buyer guides

Troubleshooting

Reviewed by RunLocalAI Editorial. See our editorial policy for how we research and verify hardware specifications.

What it does well

Where it breaks

Ideal model range

Bad use cases

Verdict

How it compares

Overview

What the RTX 5090 actually is, in local-AI terms

Where it fits in the hardware ladder

Best use cases

What it can run

OS support

Software / runtime support

What breaks first

Alternatives by intent

Best pairings

Who should avoid the RTX 5090

Related

Specs

Models that fit

Frequently asked

What models can NVIDIA GeForce RTX 5090 run?

Does NVIDIA GeForce RTX 5090 support CUDA?

How much does NVIDIA GeForce RTX 5090 cost?

Where next?

What it does well

Where it breaks

Ideal model range

Bad use cases

Verdict

How it compares

Overview

What the RTX 5090 actually is, in local-AI terms

Where it fits in the hardware ladder

Best use cases

What it can run

OS support

Software / runtime support

What breaks first

Alternatives by intent

Best pairings

Who should avoid the RTX 5090

Related

Specs

Models that fit

Frequently asked

What models can NVIDIA GeForce RTX 5090 run?

Does NVIDIA GeForce RTX 5090 support CUDA?

How much does NVIDIA GeForce RTX 5090 cost?

Where next?