
Best GPU for Stable Diffusion (local)

Honest 2026 guide to picking a GPU for local Stable Diffusion + Flux + ComfyUI: 4070 Ti Super, used 3090, 4090, 5090, M4 Max. SDXL vs Flux VRAM math, LoRA training requirements, video gen tier.

By Fredoline Eruo · Last reviewed 2026-05-08

The short answer

For most local image-gen operators, a used RTX 3090 24 GB at $700-1,000 is the right answer. 24 GB unlocks Flux Dev FP16 + LoRA training + ComfyUI multi-model workflows.

If you want new with warranty and SDXL/SD 1.5 is your daily, the RTX 4070 Ti Super 16 GB is the value entry. The RTX 4090 is the best mainstream pick across all 2026 image-gen workloads.

Image gen is compute-bound, not memory-bandwidth-bound — the opposite of LLM inference. FP16 TFLOPS matters. The 5090's 32 GB unlocks video gen (LTX-Video, Mochi) and is the right pick if Flux + video are your daily.
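
If you want to sanity-check that claim on your own hardware, a bare FP16 matmul benchmark gives you a usable ceiling number. A minimal sketch, assuming a CUDA build of PyTorch; it prints a matmul upper bound, not a diffusion benchmark:

import time
import torch

def fp16_tflops(n: int = 8192, iters: int = 20) -> float:
    # One big half-precision matmul per iteration; tensor cores pick this up.
    a = torch.randn(n, n, dtype=torch.float16, device="cuda")
    b = torch.randn(n, n, dtype=torch.float16, device="cuda")
    for _ in range(3):                       # warmup so clocks settle
        a @ b
    torch.cuda.synchronize()
    t0 = time.perf_counter()
    for _ in range(iters):
        a @ b
    torch.cuda.synchronize()
    elapsed = time.perf_counter() - t0
    return (2 * n ** 3 * iters) / elapsed / 1e12  # 2*n^3 FLOPs per matmul

print(f"~{fp16_tflops():.0f} FP16 TFLOPS (matmul upper bound)")

Per-step speed in ComfyUI tracks this number much more closely than memory bandwidth does.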

The picks, ranked by buyer-leverage

#1

RTX 4070 Ti Super

full verdict →

16 GB · $800-1,000 (2026 retail)

Best 16 GB new card for SDXL + entry-tier Flux. Compute-strong, warranty included.

Buy if
  • SDXL / SD 1.5 daily workflows
  • Flux Dev FP8 (with offloading)
  • Buyers wanting new + warranty without used 3090 risk
Skip if
  • LoRA training on Flux (16 GB is too tight)
  • Video gen workflows (LTX-Video, Mochi need 24 GB+)
  • Buyers open to a used 3090 instead (8 GB more VRAM at a similar price)
Affiliate disclosure: we earn a small commission on purchases made through these links. The opinion comes first.
#2

RTX 3090 (used)

full verdict →

24 GB · $700-1,000 (2026 used)

The single highest-leverage image-gen buy in 2026. 24 GB unlocks Flux Dev FP16 + LoRA training comfortably.

Buy if
  • Flux Dev FP16 + LoRA training daily
  • ComfyUI multi-model workflows (multiple checkpoints loaded)
  • Best $/GB-VRAM at the 24 GB tier
Skip if
  • Buyers who hate used silicon (warranty risk)
  • Power-budget-constrained builds (350W TDP)
  • Video gen production (32 GB on 5090 is meaningfully better)

#3

RTX 4090

full verdict →

24 GB · $1,400-1,900 used / $1,800-2,200 new

The best mainstream image-gen card. 24 GB + 1008 GB/s + Ada compute = fastest at every quant tier the 4090 can fit.

Buy if
  • Buyers wanting fastest 24 GB single-card image gen
  • Flux + video gen in one machine
  • ComfyUI heavy users (multi-model + LoRA + ControlNet stacks)
Skip if
  • Tight budgets where used 3090 covers the same workload
  • Buyers stretching to 5090 for video-gen production
  • Multi-GPU setups (4090s in 2-card rigs are tight)

#4

RTX 5090

full verdict →

32 GB · $2,000-2,500 (2026 retail)

The first consumer card with real video-gen headroom. 32 GB runs LTX-Video + Mochi comfortably and unlocks long-form video workflows.

Buy if
  • Local video generation (LTX-Video, Mochi, AnimateDiff long-form)
  • Flux Dev + video gen + LoRA training same machine
  • Production image-gen serving with multiple concurrent generations
Skip if
  • Image-gen-only operators (4090's 24 GB is plenty for Flux)
  • PSU-constrained builds (575W TDP)
  • Multi-GPU rigs (typical 3-4-slot partner coolers make dense builds brutal)
#5

Apple M4 Max (64-128 GB unified)

full verdict →

64 GB · $3,500-5,500 (MacBook Pro / Mac Studio configs)

Mac alternative for image gen. ComfyUI runs on MPS (quick sanity check after this card). Roughly 30-50% slower than a 4090 on Flux, but silent and plug-and-play.

Buy if
  • Mac-first creative workflows (Photoshop + ComfyUI integration)
  • Buyers wanting silence + simplicity over peak speed
  • Privacy-first creative work (no cloud dependency)
Skip if
  • Production image gen at scale (CUDA wins 1.5-2x)
  • Video gen workflows (Apple Silicon support is partial)
  • LoRA training (PyTorch MPS lacks parity with CUDA)
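
Before committing a Mac to this workflow, confirm PyTorch actually sees the MPS backend. A minimal sketch; nothing here is ComfyUI-specific:

import torch

if torch.backends.mps.is_available():
    x = torch.randn(1024, 1024, device="mps")
    print("MPS ok:", (x @ x).shape)          # one matmul proves the device path
else:
    print("MPS unavailable: check macOS version and PyTorch build")
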
Honesty: why benchmark numbers on this page might not reflect your real experience
  • tok/s is not user experience. Humans read at ~10-15 tok/s — anything above that is buffer time, not perceived speed.
  • Context length changes everything. A 70B Q4 model at 1024 tokens generates ~25 tok/s; the same model at 32K context drops to ~8-12 tok/s as KV cache fills.
  • Quantization changes the conclusion. Q4_K_M vs Q5_K_M vs Q8 produce different speed AND different quality. A benchmark at one quant doesn't translate to another.
  • Thermal throttling changes long sessions. The first 15 minutes of a benchmark see boost-clock peak; the next 4 hours see steady-state, which is 5-15% slower depending on case airflow.
  • Driver and runtime versions silently shift winners. A 2024 benchmark on PyTorch 2.4 + CUDA 12.4 doesn't reflect 2026 reality on PyTorch 2.6 + CUDA 12.6. Discount benchmarks older than 6 months.
  • Vendor and YouTuber benchmarks are cherry-picked. The standard 'Llama 3.1 70B Q4 at 1024 tokens' chart shows peak decode on a tiny prompt — exactly the conditions least representative of daily use.
  • Our ranking is by workload fit at the buyer's actual budget — not by raw benchmark order. A faster card that doesn't fit your workload ranks below a slower card that does.

We try to surface these caveats where they apply. If a number on this page reads more confident than it should, please email us via our contact page. See also our methodology and editorial philosophy.

How to think about VRAM tiers

Image gen prioritizes compute (FP16 TFLOPS) over bandwidth, unlike LLM inference. VRAM still sets the ceiling: resolution, how many checkpoints you keep loaded, and whether LoRA training fits. Pick the tier that fits your largest realistic workflow; the sketch after this list puts rough numbers on the boundaries.

  • 8 GB: SD 1.5 only, with optimizations. SDXL impractical.
  • 12 GB: SDXL workable. Flux Dev not realistic. No LoRA training.
  • 16 GB: SDXL comfortable; Flux Dev FP8 fits with offloading. LoRA training tight on SDXL, not viable on Flux.
  • 24 GB (the sweet spot): Flux Dev FP16; LoRA training comfortable on SDXL + Flux; ComfyUI multi-model workflows.
  • 32 GB+: video gen (LTX-Video, Mochi); long-form video; concurrent multi-model + LoRA + ControlNet.
  • Apple unified 64+ GB: workable alternative; the trade-off is bandwidth + ecosystem maturity.
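
Most of these tiers fall out of weight-size arithmetic. A back-of-envelope sketch: the parameter counts are approximate public figures (SDXL UNet ~2.6B, Flux Dev ~12B), and the 30% working-overhead allowance for activations, VAE, and text encoders is an assumption, not a measurement:

APPROX_PARAMS = {"sdxl_unet": 2.6e9, "flux_dev": 12e9}   # rough public figures
BYTES_PER_PARAM = {"fp16": 2, "fp8": 1}

def weights_gb(model: str, dtype: str) -> float:
    # Weight memory alone: parameter count times bytes per parameter.
    return APPROX_PARAMS[model] * BYTES_PER_PARAM[dtype] / 1e9

for model in APPROX_PARAMS:
    for dtype in BYTES_PER_PARAM:
        w = weights_gb(model, dtype)
        # +30% is an assumed allowance for activations/VAE/text encoders.
        print(f"{model} {dtype}: ~{w:.0f} GB weights, ~{w * 1.3:.0f} GB in practice")

Flux Dev FP8 lands in the 12-16 GB band and FP16 at 24 GB and up, which is exactly where the tiers break.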


Frequently asked questions

VRAM for SDXL vs Flux Dev?

SDXL works on 8 GB with optimizations (xformers, fp16, sequential CPU offload). Flux Dev FP8 needs ~12-16 GB practical. Flux Dev FP16 needs 24+ GB comfortably. Flux Dev FP32 isn't realistic on consumer hardware.
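
Concretely, the 8 GB SDXL recipe named above looks like this with Hugging Face diffusers. A hedged sketch: the method names are current diffusers API (check your installed version), xformers needs the xformers package, and offloading needs accelerate:

import torch
from diffusers import StableDiffusionXLPipeline

pipe = StableDiffusionXLPipeline.from_pretrained(
    "stabilityai/stable-diffusion-xl-base-1.0",
    torch_dtype=torch.float16,               # fp16 halves weight memory
)
# Memory-efficient attention; often optional on PyTorch 2.x SDPA.
pipe.enable_xformers_memory_efficient_attention()
# Streams submodules to the GPU one at a time: slow, but fits ~8 GB cards.
pipe.enable_sequential_cpu_offload()

image = pipe("a lighthouse at dawn", num_inference_steps=30).images[0]
image.save("out.png")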

Can I run video generation locally in 2026?

Yes. LTX-Video and Mochi run on 24 GB cards (3090, 4090, 7900 XTX). 32 GB (5090) gives meaningful headroom. Sub-24 GB cards don't run these models reliably even with offloading.

Mac vs PC for image gen?

PC wins on speed (Flux runs 1.5-2x faster on an RTX 4090 than on an M4 Max) and on the CUDA ecosystem (ComfyUI custom nodes, training repos). Mac wins on simplicity, silence, and integration with creative apps. Pick PC for production, Mac for casual use and Mac-native workflows.

ComfyUI vs A1111 — does hardware preference differ?

ComfyUI's multi-model graph wants more VRAM than A1111's single-pipeline architecture. With a 16 GB card, A1111 is more comfortable; with 24+ GB, ComfyUI's flexibility pays off. Forge (faster A1111 fork) sits in between.

Used 3090 vs new 5080 for image gen specifically?

3090 wins. Going from 16 GB to 24 GB unlocks Flux FP16, LoRA training, and ComfyUI multi-model workflows. The 5080's bandwidth advantage matters little for image gen, which is compute-bound. Pick the 5080 only if used silicon is a dealbreaker.

How much VRAM for LoRA training?

SDXL LoRA: 16 GB tight, 24 GB comfortable. Flux LoRA: 24 GB minimum, 32 GB comfortable for higher batch sizes / resolutions. Below 16 GB, LoRA training isn't viable for serious work.
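
The floor comes from what training holds resident at once: frozen base weights, the (tiny) adapters, their gradients and optimizer states, and activations for backprop. A rough sketch; the layer count and width are illustrative stand-ins, not exact Flux specs, and FP32 gradients + moments is one common mixed-precision setup among several:

def lora_params(layers: int, width: int, rank: int, targets: int = 2) -> int:
    # Each targeted projection gets adapter A (width x rank) + B (rank x width).
    return layers * targets * 2 * width * rank

base_gb = 12e9 * 2 / 1e9                     # ~12B base at FP16 stays resident
p = lora_params(layers=57, width=3072, rank=16)
adapters_gb = p * 2 / 1e9                    # FP16 adapter weights
optim_gb = p * (4 + 8) / 1e9                 # FP32 grad + two AdamW moments

print(f"base ~{base_gb:.0f} GB, adapters ~{adapters_gb:.2f} GB, "
      f"grads+optimizer ~{optim_gb:.2f} GB, plus activations per batch")

The adapters are noise next to the ~24 GB of resident FP16 base weights; activations at higher batch sizes and resolutions are what push you from 24 GB toward 32 GB.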
