
Best GPU for Stable Diffusion (local)

Honest 2026 guide to picking a GPU for local Stable Diffusion + Flux + ComfyUI: 4070 Ti Super, used 3090, 4090, 5090, M4 Max. SDXL vs Flux VRAM math, LoRA training requirements, video gen tier.

By Fredoline Eruo · Last reviewed 2026-05-08

The short answer

For most local image-gen operators, a used RTX 3090 24 GB at $700-1,000 is the right answer. 24 GB unlocks Flux Dev FP16 + LoRA training + ComfyUI multi-model workflows.

If you want new with warranty and SDXL/SD 1.5 is your daily, the RTX 4070 Ti Super 16 GB is the value entry. The RTX 4090 is the best mainstream pick across all 2026 image-gen workloads.

Image gen is compute-bound, not memory-bandwidth-bound — the opposite of LLM inference. FP16 TFLOPS matters. The 5090's 32 GB unlocks video gen (LTX-Video, Mochi) and is the right pick if Flux + video are your daily.
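
If you want to sanity-check that claim on your own hardware, a bare FP16 matmul benchmark gives you a usable ceiling number. A minimal sketch, assuming a CUDA build of PyTorch; it prints a matmul upper bound, not a diffusion benchmark:

import time
import torch

def fp16_tflops(n: int = 8192, iters: int = 20) -> float:
    # One big half-precision matmul per iteration; tensor cores pick this up.
    a = torch.randn(n, n, dtype=torch.float16, device="cuda")
    b = torch.randn(n, n, dtype=torch.float16, device="cuda")
    for _ in range(3):                       # warmup so clocks settle
        a @ b
    torch.cuda.synchronize()
    t0 = time.perf_counter()
    for _ in range(iters):
        a @ b
    torch.cuda.synchronize()
    elapsed = time.perf_counter() - t0
    return (2 * n ** 3 * iters) / elapsed / 1e12  # 2*n^3 FLOPs per matmul

print(f"~{fp16_tflops():.0f} FP16 TFLOPS (matmul upper bound)")

Per-step speed in ComfyUI tracks this number much more closely than memory bandwidth does.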

The picks, ranked by buyer-leverage

#1

RTX 4070 Ti Super

full verdict →

16 GB · $800-1,000 (2026 retail)

Best 16 GB new card for SDXL + entry-tier Flux. Compute-strong, warranty included.

Buy if
  • SDXL / SD 1.5 daily workflows
  • Flux Dev FP8 (with offloading)
  • Buyers wanting new + warranty without used 3090 risk
Skip if
  • LoRA training on Flux (16 GB is too tight)
  • Video gen workflows (LTX-Video, Mochi need 24 GB+)
  • Buyers open to a used 3090 instead (8 GB more VRAM at a similar price)
Affiliate disclosure: we earn a small commission on purchases made through these links. The opinion comes first.
#2

RTX 3090 (used)

full verdict →

24 GB · $700-1,000 (2026 used)

The single highest-leverage image-gen buy in 2026. 24 GB unlocks Flux Dev FP16 + LoRA training comfortably.

Buy if
  • Flux Dev FP16 + LoRA training daily
  • ComfyUI multi-model workflows (multiple checkpoints loaded)
  • Best $/GB-VRAM at the 24 GB tier
Skip if
  • Buyers who hate used silicon (warranty risk)
  • Power-budget-constrained builds (350W TDP)
  • Video gen production (32 GB on 5090 is meaningfully better)

#3

RTX 4090

full verdict →

24 GB · $1,400-1,900 used / $1,800-2,200 new

The best mainstream image-gen card. 24 GB + 1008 GB/s + Ada compute = fastest at every quant tier the 4090 can fit.

Buy if
  • Buyers wanting fastest 24 GB single-card image gen
  • Flux + video gen in one machine
  • ComfyUI heavy users (multi-model + LoRA + ControlNet stacks)
Skip if
  • Tight budgets where used 3090 covers the same workload
  • Buyers stretching to 5090 for video-gen production
  • Multi-GPU setups (4090s in 2-card rigs are tight)

#4

RTX 5090

full verdict →

32 GB · $2,000-2,500 (2026 retail)

The first consumer card with real video-gen headroom. 32 GB runs LTX-Video + Mochi comfortably and unlocks long-form video workflows.

Buy if
  • Local video generation (LTX-Video, Mochi, AnimateDiff long-form)
  • Flux Dev + video gen + LoRA training same machine
  • Production image-gen serving with multiple concurrent generations
Skip if
  • Image-gen-only operators (4090's 24 GB is plenty for Flux)
  • PSU-constrained builds (575W TDP)
  • Multi-GPU rigs (typical 3-4-slot partner coolers make dense builds brutal)
#5

Apple M4 Max (64-128 GB unified)

full verdict →

64 GB · $3,500-5,500 (MacBook Pro / Mac Studio configs)

Mac alternative for image gen. ComfyUI runs on MPS (quick sanity check after this card). Roughly 30-50% slower than a 4090 on Flux, but silent and plug-and-play.

Buy if
  • Mac-first creative workflows (Photoshop + ComfyUI integration)
  • Buyers wanting silence + simplicity over peak speed
  • Privacy-first creative work (no cloud dependency)
Skip if
  • Production image gen at scale (CUDA wins 1.5-2x)
  • Video gen workflows (Apple Silicon support is partial)
  • LoRA training (PyTorch MPS lacks parity with CUDA)
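
Before committing a Mac to this workflow, confirm PyTorch actually sees the MPS backend. A minimal sketch; nothing here is ComfyUI-specific:

import torch

if torch.backends.mps.is_available():
    x = torch.randn(1024, 1024, device="mps")
    print("MPS ok:", (x @ x).shape)          # one matmul proves the device path
else:
    print("MPS unavailable: check macOS version and PyTorch build")
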
Honesty: why benchmark numbers on this page might not reflect your real experience
  • tok/s is not user experience. Humans read at ~10-15 tok/s — anything above that is buffer time, not perceived speed.
  • Context length changes everything. A 70B Q4 model at 1024 tokens generates ~25 tok/s; the same model at 32K context drops to ~8-12 tok/s as KV cache fills.
  • Quantization changes the conclusion. Q4_K_M vs Q5_K_M vs Q8 produce different speed AND different quality. A benchmark at one quant doesn't translate to another.
  • Thermal throttling changes long sessions. The first 15 minutes of a benchmark see boost-clock peak; the next 4 hours see steady-state, which is 5-15% slower depending on case airflow.
  • Driver and runtime versions silently shift winners. A 2024 benchmark on PyTorch 2.4 + CUDA 12.4 doesn't reflect 2026 reality on PyTorch 2.6 + CUDA 12.6. Discount benchmarks older than 6 months.
  • Vendor and YouTuber benchmarks are cherry-picked. The standard 'Llama 3.1 70B Q4 at 1024 tokens' chart shows peak decode on a tiny prompt — exactly the conditions least representative of daily use.
  • Our ranking is by workload fit at the buyer's actual budget — not by raw benchmark order. A faster card that doesn't fit your workload ranks below a slower card that does.

We try to surface these caveats where they apply. If a number on this page reads more confident than it should, please email us via our contact page. See also our methodology and editorial philosophy.

How to think about VRAM tiers

Image gen prioritizes compute (FP16 TFLOPS) over bandwidth, unlike LLM inference. VRAM still sets the ceiling: resolution, how many checkpoints you keep loaded, and whether LoRA training fits. Pick the tier that fits your largest realistic workflow; the sketch after this list puts rough numbers on the boundaries.

  • 8 GB: SD 1.5 only, with optimizations. SDXL impractical.
  • 12 GB: SDXL workable. Flux Dev not realistic. No LoRA training.
  • 16 GB: SDXL comfortable; Flux Dev FP8 fits with offloading. LoRA training tight on SDXL, not viable on Flux.
  • 24 GB (the sweet spot): Flux Dev FP16; LoRA training comfortable on SDXL + Flux; ComfyUI multi-model workflows.
  • 32 GB+: video gen (LTX-Video, Mochi); long-form video; concurrent multi-model + LoRA + ControlNet.
  • Apple unified 64+ GB: workable alternative; the trade-off is bandwidth + ecosystem maturity.
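
Most of these tiers fall out of weight-size arithmetic. A back-of-envelope sketch: the parameter counts are approximate public figures (SDXL UNet ~2.6B, Flux Dev ~12B), and the 30% working-overhead allowance for activations, VAE, and text encoders is an assumption, not a measurement:

APPROX_PARAMS = {"sdxl_unet": 2.6e9, "flux_dev": 12e9}   # rough public figures
BYTES_PER_PARAM = {"fp16": 2, "fp8": 1}

def weights_gb(model: str, dtype: str) -> float:
    # Weight memory alone: parameter count times bytes per parameter.
    return APPROX_PARAMS[model] * BYTES_PER_PARAM[dtype] / 1e9

for model in APPROX_PARAMS:
    for dtype in BYTES_PER_PARAM:
        w = weights_gb(model, dtype)
        # +30% is an assumed allowance for activations/VAE/text encoders.
        print(f"{model} {dtype}: ~{w:.0f} GB weights, ~{w * 1.3:.0f} GB in practice")

Flux Dev FP8 lands in the 12-16 GB band and FP16 at 24 GB and up, which is exactly where the tiers break.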


Frequently asked questions

VRAM for SDXL vs Flux Dev?

SDXL works on 8 GB with optimizations (xformers, fp16, sequential CPU offload). Flux Dev FP8 needs ~12-16 GB practical. Flux Dev FP16 needs 24+ GB comfortably. Flux Dev FP32 isn't realistic on consumer hardware.
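
Concretely, the 8 GB SDXL recipe named above looks like this with Hugging Face diffusers. A hedged sketch: the method names are current diffusers API (check your installed version), xformers needs the xformers package, and offloading needs accelerate:

import torch
from diffusers import StableDiffusionXLPipeline

pipe = StableDiffusionXLPipeline.from_pretrained(
    "stabilityai/stable-diffusion-xl-base-1.0",
    torch_dtype=torch.float16,               # fp16 halves weight memory
)
# Memory-efficient attention; often optional on PyTorch 2.x SDPA.
pipe.enable_xformers_memory_efficient_attention()
# Streams submodules to the GPU one at a time: slow, but fits ~8 GB cards.
pipe.enable_sequential_cpu_offload()

image = pipe("a lighthouse at dawn", num_inference_steps=30).images[0]
image.save("out.png")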

Can I run video generation locally in 2026?

Yes. LTX-Video and Mochi run on 24 GB cards (3090, 4090, 7900 XTX). 32 GB (5090) gives meaningful headroom. Sub-24 GB cards don't run these models reliably even with offloading.

Mac vs PC for image gen?

PC wins on speed (Flux runs 1.5-2x faster on an RTX 4090 than on an M4 Max) and on the CUDA ecosystem (ComfyUI custom nodes, training repos). Mac wins on simplicity, silence, and integration with creative apps. Pick PC for production, Mac for casual use and Mac-native workflows.

ComfyUI vs A1111 — does hardware preference differ?

ComfyUI's multi-model graph wants more VRAM than A1111's single-pipeline architecture. With a 16 GB card, A1111 is more comfortable; with 24+ GB, ComfyUI's flexibility pays off. Forge (faster A1111 fork) sits in between.

Used 3090 vs new 5080 for image gen specifically?

3090 wins. Going from 16 GB to 24 GB unlocks Flux FP16, LoRA training, and ComfyUI multi-model workflows. The 5080's bandwidth advantage matters little for image gen, which is compute-bound. Pick the 5080 only if used silicon is a dealbreaker.

How much VRAM for LoRA training?

SDXL LoRA: 16 GB tight, 24 GB comfortable. Flux LoRA: 24 GB minimum, 32 GB comfortable for higher batch sizes / resolutions. Below 16 GB, LoRA training isn't viable for serious work.
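
The floor comes from what training holds resident at once: frozen base weights, the (tiny) adapters, their gradients and optimizer states, and activations for backprop. A rough sketch; the layer count and width are illustrative stand-ins, not exact Flux specs, and FP32 gradients + moments is one common mixed-precision setup among several:

def lora_params(layers: int, width: int, rank: int, targets: int = 2) -> int:
    # Each targeted projection gets adapter A (width x rank) + B (rank x width).
    return layers * targets * 2 * width * rank

base_gb = 12e9 * 2 / 1e9                     # ~12B base at FP16 stays resident
p = lora_params(layers=57, width=3072, rank=16)
adapters_gb = p * 2 / 1e9                    # FP16 adapter weights
optim_gb = p * (4 + 8) / 1e9                 # FP32 grad + two AdamW moments

print(f"base ~{base_gb:.0f} GB, adapters ~{adapters_gb:.2f} GB, "
      f"grads+optimizer ~{optim_gb:.2f} GB, plus activations per batch")

The adapters are noise next to the ~24 GB of resident FP16 base weights; activations at higher batch sizes and resolutions are what push you from 24 GB toward 32 GB.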
