UNIT · NVIDIA · GPU
24 GB VRAMenthusiastReviewed June 2026

NVIDIA GeForce RTX 4090

RTX 4090 spec card — 24 GB VRAM, 1008 GB/s bandwidth, 450 W; best for 32B AWQ-INT4 + 16K context
diagram
Credit: RunLocalAI·License: CC-BY-4.0 (original illustration)·Source

The community-default high-end local-AI card from 2022 to 2025. 24GB GDDR6X at ~1 TB/s makes 70B Q4 comfortably loadable.

Released 2022·~$1899 street·1008 GB/s memory bandwidth
▼ CHECK CURRENT PRICE· 1 retailer
NVIDIA GeForce RTX 4090

Affiliate disclosure: as an Amazon Associate and partner of other retailers, we earn from qualifying purchases. The verdict on this page is our editorial opinion; affiliate links never influence what we recommend.

RUNLOCALAI SCORE
See full leaderboard →
520/ 1000
BB-tier
Estimated
Throughput
351/ 500
VRAM-fit
170/ 200
Ecosystem
200/ 200
Efficiency
22/ 100

Sub-scores sum to 743 / 1000. Headline = 743 × 0.70 (Estimated-confidence discount) = 520. This is an algorithmic performance-tier score — distinct from, and often lower than, the editorial “Our verdict” below, which weighs value and real-world fit (especially for hardware we haven’t measured yet). How scoring works →

Extrapolated from 1008 GB/s bandwidth — 121.0 tok/s estimated. No measured benchmarks yet.

Plain-English: Workable at 32B, comfortable at 14B and below — snappy enough for a coding agent; vision models supported.

7B chat
Comfortable
14B chat
Comfortable
32B chat~
Tight
70B chat
Doesn't fit
Coding agent
Comfortable
Vision (≤8B VLM)
Comfortable
Long context (32K)
Comfortable
Comfortable — fits with headroom
~Tight — works, no slack
Marginal — needs aggressive quant
Doesn't fit usefully

Verdicts extrapolated from catalog VRAM + bandwidth + ecosystem flags. Hover any chip for the rationale. Want measured numbers? Submit your own run with runlocalai-bench --submit.

BLK · VERDICT

Our verdict

OP · Fredoline Eruo|VERIFIED JUN 12, 2026
9.4/10

What it does well

The 24 GB VRAM is the headline — it's the smallest amount of memory that runs the modern 32B-class models full-GPU at Q4 (Qwen 3 32B, Qwen 2.5 Coder 32B, QwQ 32B, R1 Distill Qwen 32B all live in 19–22 GB), and the largest amount that's reasonable to buy on consumer pricing. Memory bandwidth at 1 TB/s keeps tokens/sec respectable on those models — 70+ tok/s is normal at Q4. CUDA support is universal: every runner has a happy path here.

Where it breaks

  • 70B-class models partial-offload — Q4 70B is 39 GB, so you offload onto system RAM and watch tok/s drop from 70+ to 22–28. Fine for serious work, painful for autocomplete.
  • Llama 4 / DeepSeek V3 don't fit — true workstation models start at 80+ GB at Q4.
  • Power draw at sustained load — 450 W TGP is real; consider PSU and thermals before slotting one in.

Ideal model range

  • Sweet spot: Qwen 3 32B / Qwen 2.5 Coder 32B / QwQ 32B / R1 Distill Qwen 32B — full GPU, 70+ tok/s, full 16K context.
  • Stretch: Llama 3.3 70B / R1 Distill Llama 70B at Q4 with offload — 22–28 tok/s, requires 64+ GB system RAM.
  • Comfortable: 14B-class with full 32K context, or 7–8B with 128K context — runs at 60+ tok/s with headroom.

Bad use cases

  • Genuine 100B+ MoE workloads (DeepSeek V3, Llama 4 Maverick) — workstation hardware required.
  • Maximum tok/s on tiny models — for sub-7B at >200 tok/s, integrated solutions and lower-end cards are better $/throughput.
  • Workstation reliability over years — the 4090 has consumer warranty terms; sustained 24/7 inference is technically out-of-spec.

Verdict

Buy this if you want the best single-card local-AI experience and either bought one at MSRP or are willing to pay the current resale premium. The model sweet spot (32B-class) is the highest-quality tier that runs full-GPU. Skip this if the RTX 5090 is in your budget (32 GB at higher bandwidth fits 70B more comfortably), if you can wait for RTX 5090/5080 supply to normalize, or if 16 GB cards (RTX 4080, 5080) cover your model range — the 32B-class jump is what you're paying the premium for.

How it compares

  • vs RTX 5090 → 5090 is the proper successor (32 GB, ~1.8 TB/s bandwidth) and lets 70B-class models partial-offload less aggressively. Pick 5090 if available at sane prices.
  • vs RTX 3090 → 3090 has the same 24 GB VRAM at lower bandwidth (~940 GB/s). Roughly 60% the tok/s on the same model. Best value-for-VRAM card if you can find a clean used unit.
  • vs RTX 4080 (16 GB) → 4080 is the wrong tier — 16 GB caps you at 14B-class full-GPU. The jump to 32B-class is what makes the 4090 worth the premium.
  • vs RX 7900 XTX → 7900 XTX matches 24 GB at lower price, but ROCm is still a hassle and Vulkan paths are slower for most workloads. NVIDIA wins on software maturity.
  • vs Apple M3 Max (64 GB+) → unified memory lets the M3 Max run 70B at Q4 without offload, at slower tokens/sec than 4090 partial-offload but with much less hassle. Different platform tradeoff.
Why this rating

9.4/10 — the consumer card every local-AI build benchmarks against. 24 GB VRAM at frontier-tier compute means you can full-GPU-offload Qwen 3 32B and Qwen 2.5 Coder 32B; partial-offload Llama 3.3 70B at usable speeds. Loses points only because the RTX 5090 exists at higher VRAM and the resale price is now stupid.

BLK · OVERVIEW

Overview

What the RTX 4090 actually is, in local-AI terms

The RTX 4090 is the single most important consumer GPU in the local-AI ecosystem of the last three years. 24 GB of GDDR6X VRAM at ~1 TB/s memory bandwidth, full Ada-Lovelace tensor cores with 4-bit and 8-bit acceleration, mature CUDA tooling, abundant software support across every major inference engine — there is no other consumer card with the same combination, and in May 2026 it remains the workstation default for serious solo-user local AI even after the RTX 5090 launched.

The 5090 is faster on paper. The 4090 is more available, has cheaper used-market supply, and runs every inference path that exists today. For the marginal local-AI operator in 2026, the 4090 is still the right buy unless they specifically need the 5090's 32 GB or its FP4 acceleration.

Where it fits in the hardware ladder

In the consumer-NVIDIA tier:

Card VRAM BW Bin
RTX 3090 24 GB 936 GB/s floor for serious local AI
RTX 4090 24 GB 1008 GB/s workstation default
RTX 5090 32 GB 1792 GB/s next-gen frontier

vs the datacenter ladder:

Card VRAM BW Notes
RTX 4090 24 GB 1 TB/s consumer; no NVLink
H100 PCIe 80 GB 2 TB/s datacenter; expensive
H100 SXM 80 GB 3.35 TB/s datacenter; NVLink at scale

The 4090's 24 GB ceiling is what defines "consumer-tier" workloads in 2026. Models that fit a single 4090 at AWQ-INT4 or EXL2 4.65bpw — that's the canonical "solo user, no datacenter" sweet spot.

Best use cases

  • Single-user autonomous coding agent. The canonical /stacks/local-coding-agent target hardware. Pairs with Qwen 2.5 Coder 32B at AWQ-INT4 + 32K context with comfortable headroom.
  • 32B-class chat / reasoning models. Llama 3.3 70B does NOT fit a single 4090 at any quality-preserving quant; 32B-class is the realistic ceiling.
  • Long-context single-stream inference. ExLlamaV2 + EXL2 4.65bpw + 32K context is the throughput-king setup.
  • Local fine-tuning of 7B-13B models with LoRA / QLoRA via PyTorch + bitsandbytes.
  • Image generation alongside LLMs — a 4090 can run Stable Diffusion XL and a 7B chat model concurrently.

What it can run

The hard ceiling on a single 4090 is VRAM, not compute. The realistic working set in May 2026:

Model class Quant Context Headroom
7B F16 32K comfortable
13B-14B Q5_K_M / EXL2 5bpw 32K comfortable
32B AWQ-INT4 / EXL2 4.65bpw 32K tight but works
32B EXL2 4.0bpw 64K tight
70B does NOT fit single card

For anything 70B+ you need a second GPU. See /guides/running-local-ai-on-multiple-gpus-2026 and /stacks/dual-4090-workstation.

OS support

OS Quality
Linux (Ubuntu 24.04 LTS) excellent — reference platform
Windows 11 native excellent
Windows (WSL2) excellent — matches Linux
macOS unsupported (no NVIDIA on Apple Silicon)

If your CUDA path is broken on WSL2, see /errors/wsl2-gpu-not-detected.

Software / runtime support

The 4090 is supported by every major local-AI inference engine in 2026:

  • Ollama / llama.cpp — full GGUF / CUDA support
  • vLLM — full AWQ / GPTQ / FP16 support; the production-default for multi-user
  • SGLang — same coverage as vLLM; preferred for prefix-cache-heavy agentic workloads
  • ExLlamaV2 — single-stream throughput king on this hardware
  • TensorRT-LLM — supported but engineered for H100; using it on 4090 is overkill
  • LM Studio — full GUI path with CUDA acceleration
  • PyTorch — first-class CUDA target

What breaks first

  1. VRAM at long context. A 32B AWQ-INT4 model + 32K context + a 5K-token system prompt + agentic memory injection will OOM with no warning. Budget KV-cache headroom explicitly.
  2. Power draw / thermals. The 4090 pulls up to 450 W under sustained inference; cheap 850 W PSUs often fail. Pair with a Gold-rated 1000 W+ PSU and three-fan tower or AIO cooling.
  3. PCIe bandwidth on multi-GPU. Tensor-parallel across 2× 4090s on consumer motherboards usually hits PCIe 4.0 x8 + x8; not a hard limit but a real factor on prefill-heavy workloads.
  4. Driver vs CUDA toolkit drift. Mixing CUDA 12.4 toolkit with a 12.1-era driver is a common cause of "loads but uses CPU."
  5. NVLink absence. The 4090 has no NVLink; multi-GPU goes over PCIe. This is fine for layer-split inference but limits training scale-out vs Ampere RTX A6000 / Hopper.

Alternatives by intent

If you want… Reach for
Cheaper, same VRAM RTX 3090 used
More VRAM in one card RTX 5090 (32 GB) or RTX A6000 (48 GB)
70B-class single user dual 3090 or dual 4090 — see /stacks/dual-3090-workstation
Apple-native, big unified memory Apple M3 Ultra 192 GB
AMD path RX 7900 XTX — half the price, ROCm tax applies
Datacenter throughput H100 SXM

Best pairings

  • Ollama + Qwen 2.5 Coder 32B Q4_K_M — solo coding-agent default
  • vLLM + same model AWQ-INT4 — same intent, multi-user
  • ExLlamaV2 + EXL2 4.65bpw 32B — single-stream throughput king
  • Ubuntu 24.04 LTS + CUDA 12.4 + Open WebUI in Docker — the homelab default
  • Used 4090 from a retired SI build — the cheapest real path to this tier in mid-2026

Who should avoid the RTX 4090

  • Anyone running 70B-class models day-to-day. Either go dual-card or jump to a Mac M3 Ultra or a datacenter H100.
  • Anyone on a sub-1000 W PSU. The thermals and transient spikes are not negotiable.
  • Apple-ecosystem operators. Macs and 4090s don't share a stack.
  • Operators who only need 13B-class models. A 16 GB card is sufficient and saves ~$1000.

Related

Retailers we'd check:Amazon

Some links above are affiliate links. We may earn a commission at no extra cost to you. How we make money.

Featured in these stacks

The L3 execution stacks that pick this hardware as a recommended component, with the one-line note explaining the role it plays in each.

  • Stack · L3·Workstation tier·Role: GPU (where the model runs)
    Build a local coding-agent stack (May 2026)

    RTX 4090 24GB is the sweet spot for this stack: enough VRAM for Qwen 32B AWQ-INT4 + 32K context, enough memory bandwidth (1 TB/s) for sub-second TTFT, and consumer-grade thermals. The 5090 helps but isn't required; the 4080 16GB doesn't have headroom for the context window the agent actually needs.

  • Stack · L3·Workstation tier·Role: GPU (the hardware that defines this stack)
    Build an RTX 4090 AI workstation stack (May 2026)

    24GB VRAM is the first-class consumer tier in May 2026 — 4080 16GB doesn't have headroom for 32K context on 32B models; 5090 helps but is 2-3x the price for ~30% more throughput. The 4090 stays the sweet spot until 5090 supply normalises.

  • Stack · L3·Workstation tier·Role: GPU (LLM + embedding generation)
    Build an offline RAG workstation stack (May 2026)

    RTX 4090 24GB is the workstation default. Embedding 50,000 PDF chunks takes ~30 minutes on a 4090 vs ~3 hours on CPU; the GPU pays for itself on the ingestion side alone for any meaningful document corpus.

  • Stack · L3·Workstation tier·Role: GPU
    Build a memory-enabled local agent stack (May 2026)

    RTX 4090 24GB is the workstation default. The added memory-retrieval workload doesn't need more VRAM; what changes is system RAM (Mem0 + Postgres + agent buffer = bump to 64GB).

  • Stack · L3·Workstation tier·Role: GPU (minimum tier for 32B AWQ + 32K context)
    Build a local reasoning-model stack (May 2026)

    RTX 4090 24GB is the floor. 32B AWQ + 32K context fits with ~2GB headroom — enough for reasoning-block emission but tight. The 5090 32GB is the comfortable tier; M3 Max 64GB / M4 Max are credible alternatives via MLX-LM.

  • Stack · L3·Workstation tier·Role: GPU (minimum tier — vision tokens are heavy)
    Build a local vision-model stack (May 2026)

    Vision-language models tokenize images as long sequences (a 1024x1024 image becomes ~256-1024 vision tokens depending on the model's tokenizer). VRAM budget shrinks fast on multi-image queries. RTX 4090 24GB is the floor; 5090 32GB or M-class Apple is more comfortable.

  • Stack · L3·Workstation tier·Role: GPU
    Build a fully offline coding stack (May 2026)

    RTX 4090 24GB is the workstation default. Same hardware constraint as /stacks/local-coding-agent; the offline pivot is software + network, not GPU choice.

  • Stack · L3·Production tier·Role: GPUs (2× 24GB new, FP8-capable Ada-architecture)
    Dual RTX 4090 workstation stack — newer-architecture 70B serving without NVLink

    RTX 4090 brings Ada-architecture compute (FP8 transformer engine, faster GDDR6X memory) but NVIDIA removed NVLink from this generation. Two 4090s communicate only over PCIe at ~32 GB/s aggregate vs ~112 GB/s on dual-3090 NVLink. Pick 4090 over 3090 for new-card warranty + FP8 support; pick 3090 for cost-efficiency.

  • Stack · L3·Homelab tier·Role: Primary GPU (faster, takes more layers)
    Mixed RTX 4090 + 3090 workstation — the asymmetric upgrade path

    RTX 4090 is the throughput leader; in layer-split mode it takes ~55% of the layers. Its FP8 capability is wasted in this config since llama.cpp doesn't extract FP8 the way TensorRT-LLM does.

BLK · SPECS

Specs

VRAM24 GB
Power draw (peak)450 W
Released2022
MSRP$1599
Backends
CUDA
Vulkan

Models that fit

Open-weight models small enough to run on NVIDIA GeForce RTX 4090 with usable context.

Compare alternatives

Hardware worth comparing

The closest alternatives by price, memory bandwidth, and form factor, plus a step up and down — so you can frame the buying decision against real options.

Buyer guides where this card is the right answer

The RTX 4090 lands squarely in the production-tier slot for most workloads. The guides below cover the model-specific decisions where this card actually shines.

Honest buyer truths

Who should buy the RTX 4090 in 2026

If you want the strongest 24 GB single-card option and you're shopping new. The 4090 is the single-card AI production pick of 2026 — Ada compute is 30-50% faster than the 3090 on Flux + 15-25% faster on Llama 3.3 70B Q4. New retail at $1,800-2,200 (post-5090-launch market correction).

If you run image generation production at any scale. Flux Dev FP16 + ControlNet + IPAdapter + LoRA stacks fit comfortably. ComfyUI multi-checkpoint workflows have headroom. The 4090's compute advantage shows up disproportionately on image-gen vs LLM decode.

If you want one card that handles everything but datacenter workloads. Llama 3.3 70B Q4, Qwen 3 32B FP16, Flux Dev, ComfyUI multi-model, agentic coding loops, RAG over moderate-sized corpora — all fit comfortably on a single 4090.

If your budget tops out around $2,000 and you can't accept used. New + warranty in the 24 GB tier means 4090. The 5090 at $2,000-2,500 is faster but only justified if 32 GB matters.

Who should skip the RTX 4090

If your daily workload is 7-13B chat. A 4060 Ti 16 GB or used 3090 handles that without breaking 50% utilization on the 4090. The compute headroom is wasted; the price isn't justified.

If you're considering it for "future proofing." 32 GB on the 5090 is the ceiling local AI is moving toward. Flux 2 / video gen workflows already exceed 24 GB comfortably. If you're buying for a 3-year horizon and budget allows, the 5090 is the more honest choice.

If you specifically need 32 GB on a single card. Step up to the 5090. The 4090's 24 GB will OOM on long-context 70B FP16 or HunyuanVideo workflows.

If your PSU is 750W or less. The 4090 wants 850W minimum, 1000W comfortably. Don't try to make the 750W work with a Y-splitter — the 12VHPWR connector failure mode is documented and not worth the savings.

If you'd rather buy a used 3090 and pocket $1,000. Honest framing: most 4090 buyers don't need the Ada compute advantage. Used 3090 + the difference invested in faster storage / more system RAM is often the better build.

What breaks first on the RTX 4090

The 12VHPWR connector under sustained load. This is the documented failure mode that NVIDIA acknowledged in late 2022 and quietly fixed in later board revisions. Cards manufactured before mid-2023 may still have the original connector design. Use the included native cable adapter — straight 35mm minimum before any bend at the card-side connector. Aftermarket angled connectors voided RMA coverage; check before installing.

Thermal throttling in mid-tower cases under sustained AI load. The 4090's 450W TDP demands serious airflow. In a typical mid-tower with stock fans, expect 75-80°C under sustained inference and gradual clock drop from peak boost. Mitigation: undervolt -100mV (no perf cost), or move to a case with direct GPU airflow.

VRAM at 32K+ context length. Even 24 GB fills with KV cache faster than people expect. A 70B Q4 model at 32K context uses ~5 GB for weights and ~20 GB for KV cache. Long-context workflows will trip OOM well below the model's nominal VRAM cost. Budget context length carefully.

Driver regression on PyTorch nightlies. The 4090 sees more frequent driver-version sensitivity than older cards because it ships compute features (FP8, Ada-specific paths) that newer runtimes target aggressively. Pin your CUDA + PyTorch + driver combination once it works for your workload.

Power, noise, heat, and electricity cost

Sustained decode draws ~320-380W, not the 450W TDP. LLM decode is bandwidth-bound — the GPU core is underutilized. Image generation pushes closer to 400-450W during prefill and diffusion sampling.

Audible noise floor under load: ~38-42 dBA at 1m with stock fan curves at 75°C sustained. Quieter than a typical office vent; louder than the fanless Mac mini benchmark. Founder operator perspective: it's noticeable in a quiet room during long batch runs but not distracting during normal use.

Heat output to a small office: ~1100-1300 BTU/hour under sustained load. Adds ~1-2°F to ambient temperature in a typical 100-sqft home office over a 4-hour session. Air conditioning running in the same room offsets this.

Electricity cost: at the US average $0.16/kWh and 4 hours/day usage, the 4090 adds ~$8-10/month to the electricity bill. Not nothing, but well under the $20/month ChatGPT Plus subscription it replaces. In high-electricity-cost regions ($0.30-0.50/kWh in parts of Europe), the figure doubles or triples — but the comparison subscription scales the same way.

Frequently asked

What models can NVIDIA GeForce RTX 4090 run?

With 24GB VRAM, the NVIDIA GeForce RTX 4090 runs models up to ~32B in 4-bit, with room for context. See the model list below for tested combinations.

Does NVIDIA GeForce RTX 4090 support CUDA?

Yes — NVIDIA GeForce RTX 4090 is an NVIDIA card with full CUDA support, the most mature local-AI backend. llama.cpp, Ollama, vLLM, and ExLlamaV2 all run natively.

How much does NVIDIA GeForce RTX 4090 cost?

Current street price for NVIDIA GeForce RTX 4090 is around $1899 (MSRP $1599). Prices vary by region and supply.

Where next?

Reviewed by RunLocalAI Editorial. See our editorial policy for how we research and verify hardware specifications.