Hardware buyer guide · 4 picks · Editorial · Reviewed May 2026

Best GPU for Flux models

Honest 2026 GPU buyer guide for Flux Dev, Flux Pro, and LoRA training locally: VRAM math, FP8 vs FP16 tradeoffs, ComfyUI vs A1111 hardware fit.

By Fredoline Eruo · Last reviewed 2026-05-08

The short answer

For Flux Dev FP8 inference (the dominant local-Flux workload), 16 GB VRAM is workable — but tight. RTX 4070 Ti Super at $800 or used RTX 3090 at $800 are the entry tier.

For Flux Dev FP16 + serious LoRA training, 24 GB minimum: RTX 4090 is the comfort pick. The Ada compute advantage on image gen is real (30-50% faster Flux throughput vs 3090).

Flux is compute-bound, not bandwidth-bound. FP16 TFLOPS matters more than memory bandwidth — the opposite of LLM inference. The 5090's compute advantage shows up here in ways it doesn't on Llama 70B Q4.

The picks, ranked by buyer-leverage

#1

RTX 4070 Ti Super — Flux Dev FP8 entry pick

full verdict →

16 GB · $800-1,000 (2026 retail)

Best new 16 GB card for Flux Dev FP8 inference. SDXL + Flux Dev FP8 comfortable; Flux Dev FP16 doesn't fit.

Buy if
  • Flux Dev FP8 daily image generation
  • SDXL + Flux Dev FP8 mixed workflows
  • New + warranty preference
Skip if
  • Flux Dev FP16 inference (16 GB blocks you)
  • Flux LoRA training (need 24 GB+)
  • Buyers willing to accept used 3090 (more VRAM, similar price)
Affiliate disclosure: we earn a small commission on purchases made through these links. The opinion comes first.
#2

RTX 3090 (used) — Flux + LoRA training value pick

full verdict →

24 GB · $700-1,000 (2026 used)

24 GB unlocks Flux Dev FP16 + LoRA training comfortably. The leverage Flux pick.

Buy if
  • Flux Dev FP16 inference
  • Flux LoRA training (single-LoRA, comfortable)
  • ComfyUI multi-model workflows (multiple checkpoints loaded)
Skip if
  • Buyers who hate used silicon
  • Production Flux serving (Ada efficiency real)
  • Flux + video gen concurrent (need 32 GB)
#3

RTX 4090 — Flux production pick

full verdict →

24 GB · $1,400-1,900 used / $1,800-2,200 new

Best mainstream Flux card. Ada compute advantage = 30-50% faster Flux throughput vs 3090.

Buy if
  • Production Flux serving (high-throughput batch generation)
  • Flux + LoRA training daily
  • ComfyUI heavy users (multi-model + LoRA + ControlNet stacks)
Skip if
  • Tight budgets where used 3090 covers it (slower but works)
  • Buyers stretching to 5090 for video + Flux concurrent
  • Multi-GPU operators (4090s in 2-card rigs are tight)
#4

RTX 5090 — Flux + video gen pick

full verdict →

32 GB · $2,000-2,500 (2026 retail)

32 GB unlocks Flux + LTX-Video / Mochi concurrent workflows. Production-tier image + video gen.

Buy if
  • Local video gen + Flux concurrent
  • Flux LoRA training + image gen serving same machine
  • Production multi-tenant Flux serving
Skip if
  • Image-gen-only operators (4090's 24 GB is plenty)
  • PSU-constrained builds (575W TDP)
  • Multi-GPU rigs (4-slot reference brutal)
Honesty: why benchmark numbers on this page might not reflect your real experience
  • tok/s is not user experience. Humans read at ~10-15 tok/s — anything above that is buffer time, not perceived speed.
  • Context length changes everything. A 70B Q4 model at 1024 tokens generates ~25 tok/s; the same model at 32K context drops to ~8-12 tok/s as KV cache fills.
  • Quantization changes the conclusion. Q4_K_M vs Q5_K_M vs Q8 produce different speed AND different quality. A benchmark at one quant doesn't translate to another.
  • Thermal throttling changes long sessions. The first 15 minutes of a benchmark see boost-clock peak; the next 4 hours see steady-state, which is 5-15% slower depending on case airflow.
  • Driver and runtime versions silently shift winners. A 2024 benchmark on PyTorch 2.4 + CUDA 12.4 doesn't reflect 2026 reality on PyTorch 2.6 + CUDA 12.6. Discount benchmarks older than 6 months.
  • Vendor and YouTuber benchmarks are cherry-picked. The standard 'Llama 3.1 70B Q4 at 1024 tokens' chart shows peak decode on a tiny prompt — exactly the conditions least representative of daily use.
  • Our ranking is by workload fit at the buyer's actual budget — not by raw benchmark order. A faster card that doesn't fit your workload ranks below a slower card that does.

We try to surface these caveats where they apply. If a number on this page reads more confident than it should, please email us via contact. See also our methodology and editorial philosophy.

How to think about VRAM tiers

Flux is compute-bound. VRAM matters for resolution + LoRA training stack, but FP16 TFLOPS decides throughput. Higher-tier cards win disproportionately on Flux vs LLM inference.

  • 8-12 GB: SD 1.5 / SDXL only. Flux Dev FP8 doesn't realistically fit.
  • 16 GB: Flux Dev FP8 with offloading. LoRA training tight on SDXL, not viable on Flux.
  • 24 GB (Flux sweet spot): Flux Dev FP16; LoRA training comfortable; ComfyUI multi-model workflows.
  • 32 GB+: Flux + video gen concurrent; long-form video; production multi-tenant.
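To make the tier logic concrete, here's a minimal lookup sketch in Python. The thresholds and descriptions are lifted straight from the tiers above; treat them as planning figures, not measured limits, and the function name is ours, not any standard API.

  # Rough VRAM-tier lookup, using this guide's tiers as planning figures.
  TIERS = [
      (32, "Flux + video gen concurrent; production multi-tenant"),
      (24, "Flux Dev FP16; LoRA training comfortable; ComfyUI multi-model"),
      (16, "Flux Dev FP8 with offloading; SDXL comfortable"),
      (8,  "SD 1.5 / SDXL only; Flux not realistic"),
  ]

  def tier_for(vram_gb: float) -> str:
      for floor_gb, workloads in TIERS:
          if vram_gb >= floor_gb:
              return workloads
      return "below the tiers this guide covers"

  print(tier_for(24))  # Flux Dev FP16; LoRA training comfortable; ...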

Who should skip this guide

Flux is a specific image generation model family with specific hardware requirements. Not every local-AI user needs to optimize for it.

If you only do text generation. Flux is an image diffusion model. It uses approximately 16-24 GB of VRAM at FP16 depending on the variant (Flux Dev, Flux Schnell, Flux Pro). If your workload is 100% LLM inference — chat, coding, summarization, RAG — optimizing for Flux is optimizing for the wrong thing. This guide assumes you're doing image generation alongside or instead of LLM work. If you're LLM-only, use the best-gpu-for-local-ai guide.

If you're on a 12 GB card. Flux Dev FP16 requires approximately 16-18 GB of VRAM. Flux Dev FP8 (via ComfyUI, or NF4 via Diffusers with bitsandbytes) squeezes into approximately 12 GB — barely, and with generation times of approximately 30-45 seconds per 1024×1024 image on an RTX 3060 12 GB. This is functional but painful. Flux Schnell at FP16 is approximately 18-20 GB. If you're on 12 GB, expect to use NF4 quantization or GGUF quantized Flux variants, which trade image quality for VRAM savings. This guide's recommendations start at 16 GB.
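For reference, the usual Diffusers pattern for squeezing Flux onto a small card is CPU offloading. A minimal sketch, assuming a recent diffusers build with Flux support and access to the gated FLUX.1-dev weights; exact VRAM use and speed depend on your versions:

  import torch
  from diffusers import FluxPipeline

  pipe = FluxPipeline.from_pretrained(
      "black-forest-labs/FLUX.1-dev",
      torch_dtype=torch.bfloat16,
  )
  # Keep submodules on the CPU and move each to the GPU only while it
  # runs; trades generation speed for a much smaller VRAM footprint.
  pipe.enable_model_cpu_offload()
  # pipe.enable_sequential_cpu_offload() is more aggressive (and slower),
  # for cards at the bottom of the 12 GB tier.

  image = pipe(
      "studio photo of a brass pocket watch",
      height=1024, width=1024,
      num_inference_steps=28,  # typical Flux Dev step count
      guidance_scale=3.5,
  ).images[0]
  image.save("watch.png")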

If you're using Stable Diffusion XL (SDXL) as your daily driver. SDXL at FP16 fits comfortably in 8-10 GB of VRAM and generates a 1024×1024 image in approximately 6-12 seconds on a budget card. Flux's quality improvement over SDXL is real but comes with a 3-5× VRAM and time-cost multiplier. If SDXL meets your quality bar, skip Flux optimization — the hardware requirement gap is large and the cost of entry into Flux-grade hardware is approximately $500-2,000, not $150-300.

If you're not doing LoRA or ControlNet with Flux. Flux with LoRA adapters adds approximately 1-3 GB of VRAM overhead. Flux with ControlNet adds approximately 2-4 GB. Flux with IP-Adapter FaceID adds approximately 2-3 GB. A card that runs Flux Dev bare at FP16 (16-18 GB) may OOM with LoRA + ControlNet stacked (21-26 GB total). If your Flux workflow includes multiple adapters stacked, add approximately 4-8 GB to the base VRAM requirement. This guide's recommendations for "Flux capable" mean bare Flux Dev FP16 — you need 24 GB for the stacked workflow.
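To see where a planned stack lands, you can just add up the ranges above. A minimal sketch (pure arithmetic; the component names and numbers come from this guide's estimates, not from any profiler):

  # Approximate VRAM budget for a Flux Dev FP16 adapter stack (GB).
  # Low/high estimates are the ranges quoted in this guide.
  COMPONENTS = {
      "flux_dev_fp16": (16, 18),
      "lora":          (1, 3),
      "controlnet":    (2, 4),
      "ip_adapter":    (2, 3),
  }

  def stack_budget(*parts: str) -> tuple[int, int]:
      lows, highs = zip(*(COMPONENTS[p] for p in parts))
      return sum(lows), sum(highs)

  low, high = stack_budget("flux_dev_fp16", "lora", "controlnet")
  print(f"estimated stack: {low}-{high} GB")  # 19-25 GB before the VAE
  # decode spike; a 16 GB card OOMs here, a 24 GB card is workable.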

What breaks first when running Flux on consumer GPUs

Flux is not a lightweight model. Here's the OOM and performance degradation sequence on consumer hardware.

First: OOM on Flux Dev FP16 + LoRA. Flux Dev at FP16 is approximately 16-18 GB base. Add a LoRA (1-3 GB) and you're at approximately 17-21 GB. This fits on a 24 GB card with headroom but OOMs on a 16 GB card. The typical user experience: "Flux works fine on its own, but when I add my character LoRA, ComfyUI crashes." This is the most common Flux failure mode on 16 GB cards. Mitigation: use Flux Dev NF4 (approximately 8-10 GB base + LoRA) on 16 GB cards, accepting the quality trade. On 24 GB cards, FP16 + LoRA is comfortable.
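In Diffusers terms the failing step looks like this. The repo id and file name below are placeholders, not real artifacts; the point is that the adapter is extra VRAM on top of a base model that already nearly fills the card:

  # Continuing from a loaded FluxPipeline `pipe` (see the offload sketch
  # above). Repo id and weight name are hypothetical placeholders.
  pipe.load_lora_weights(
      "your-account/your-flux-character-lora",
      weight_name="character_lora.safetensors",
  )
  # The LoRA adds roughly 1-3 GB on top of the base model. On a 16 GB
  # card this is typically the step that turns a working setup into OOM.
  image = pipe("portrait in the trained character style").images[0]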

Second: ControlNet stacking OOM. Flux ControlNet Canny adds approximately 2-3 GB of VRAM overhead. ControlNet Depth adds approximately 2-3 GB. If you stack both (Canny + Depth), the total with Flux Dev FP16 exceeds 24 GB. The mitigation — run ControlNets sequentially, not simultaneously — works but doubles generation time.
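The sequential workaround has a simple shape regardless of UI: attach one ControlNet, run the pass, free it, then attach the next. A sketch of that pattern (the load/run callables are hypothetical hooks standing in for whatever your pipeline uses; the free-between-passes step is the point):

  import gc
  import torch

  def run_controlnets_sequentially(passes, latents):
      """passes: list of (load_controlnet, run_pass) callables,
      e.g. Canny first, then Depth. Frees each ControlNet's VRAM
      before loading the next, instead of stacking both at once."""
      for load_controlnet, run_pass in passes:
          controlnet = load_controlnet()           # ~2-3 GB each
          latents = run_pass(controlnet, latents)  # condition this pass only
          del controlnet                           # drop before the next one
          gc.collect()
          torch.cuda.empty_cache()                 # return VRAM to the pool
      return latents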

Third: batch size > 1 on generation pipelines. Flux generates one image per forward pass. Batch generation in ComfyUI processes images sequentially, not in parallel, so VRAM stays flat. But if you're using a custom pipeline that batches multiple images in a single forward pass to amortize the VAE decode cost, each additional image in the batch costs approximately 2-4 GB of VRAM. On a 16 GB card, batch size 2 may OOM where batch size 1 was fine. This catches users who try to optimize generation throughput without realizing the VRAM cost.
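In Diffusers the difference is one line. A loop generates sequentially and keeps VRAM flat; passing a list of prompts in a single call batches the forward pass and multiplies activation memory (assuming the `pipe` from the earlier sketch):

  prompts = ["a lighthouse at dusk", "a lighthouse at dawn"]

  # Sequential: one forward pass at a time; VRAM stays flat.
  images = [pipe(p).images[0] for p in prompts]

  # Batched: one forward pass for all prompts. Faster per image on big
  # cards, but each extra image costs roughly 2-4 GB of activations,
  # so this is the line that OOMs a 16 GB card at batch size 2.
  images = pipe(prompts).images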

Fourth: VAE decode memory spike. Flux's VAE decode step temporarily allocates additional VRAM (approximately 2-4 GB above the denoising forward pass). If you're running at 90%+ VRAM utilization during denoising, the VAE decode OOMs. This is why "the model loads fine but fails at 95% generation" — the peak VRAM usage happens at decode, not during the denoising steps.
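Diffusers exposes two switches on the VAE that cap this spike by decoding in slices or tiles instead of one allocation. Both cost a little speed; tiling can leave faint seams at very high resolutions. Assuming the `pipe` from the earlier sketch:

  # Decode the latent piecewise instead of in one large allocation,
  # flattening the end-of-generation VRAM spike on near-full cards.
  pipe.vae.enable_slicing()
  pipe.vae.enable_tiling()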

Fifth: thermal throttling during batch generation. Generating 100 images sequentially at Flux Dev FP16 takes approximately 30-60 minutes on a consumer GPU. This is sustained 100% utilization — the GPU stabilizes at its thermal ceiling after 15-20 minutes, and each subsequent image generates approximately 5-10% slower than the first. For a single image this is invisible; for a batch of 100, the cumulative time penalty is approximately 3-6 minutes. This is small but real, and it's the reason cloud GPU instances with aggressive cooling (or water-cooled consumer cards) show better sustained Flux throughput than air-cooled cards.
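You can watch the throttle happen rather than guess. A minimal monitoring sketch with the pynvml package; run it alongside a batch job and watch SM clocks settle after the first 15-20 minutes:

  import time
  import pynvml

  pynvml.nvmlInit()
  gpu = pynvml.nvmlDeviceGetHandleByIndex(0)

  # Sample once a minute for an hour. Steady-state SM clocks a few
  # percent below the first samples are the 5-10% slowdown above.
  for _ in range(60):
      temp = pynvml.nvmlDeviceGetTemperature(gpu, pynvml.NVML_TEMPERATURE_GPU)
      sm_mhz = pynvml.nvmlDeviceGetClockInfo(gpu, pynvml.NVML_CLOCK_SM)
      print(f"{temp} C, SM {sm_mhz} MHz")
      time.sleep(60)

  pynvml.nvmlShutdown()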

Sixth: FP8 performance on pre-Blackwell architectures. Flux benefits from FP8 acceleration, but only Ada Lovelace (RTX 40-series) and Blackwell (RTX 50-series) have native FP8 tensor core support. On Ampere (RTX 30-series), FP8 inference falls back to FP16, adding approximately 10-20% to generation time. The RTX 4060 Ti 16 GB and RTX 4090 both benefit from FP8; the RTX 3090 does not. For Flux specifically, the 40-series architectural advantage is real.
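A quick way to check which side of the line your card is on: native FP8 tensor cores start at CUDA compute capability 8.9 (Ada), while Ampere reports 8.6 and falls back. A one-off check with PyTorch:

  import torch

  major, minor = torch.cuda.get_device_capability(0)
  native_fp8 = (major, minor) >= (8, 9)  # Ada is 8.9; Blackwell is higher

  print(torch.cuda.get_device_name(0),
        "has native FP8" if native_fp8 else "runs FP8 as FP16 compute")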

Used GPU market for Flux workloads

Flux's VRAM requirements create a specific used-GPU buying dynamic that's different from the LLM market.

The 24 GB floor is the buyer's dividing line. Flux Dev FP16 + LoRA + ControlNet requires 24 GB. This makes the used RTX 3090 ($700-900) the value king for Flux specifically — it's the cheapest 24 GB card on the market. Be careful importing the roughly 25% throughput gap vs the RTX 4090 you see in LLM benchmarks: Flux generation is compute-bound during the denoising steps, not bandwidth-bound during decode, so the real gap is wider. A 3090 generates a 1024×1024 Flux Dev image in approximately 12-18 seconds; a 4090 does it in approximately 8-12 seconds. That's 35-50% faster, not 25%, because the Ada FP8 advantage compounds. The 3090 still wins on value because VRAM, not throughput, is the binding constraint for adapter-stacked workflows.

16 GB cards are the "Flux on a budget" tier — with compromises. The RTX 4060 Ti 16 GB ($450 new, approximately $350 used) runs Flux Dev NF4 comfortably; Flux Dev FP16 fits only with offloading, if at all. For single-image, no-adapter Flux generation, the 16 GB tier is usable. But the moment you add a LoRA, the math changes. Budget an additional $200-300 for the jump to a used 24 GB card if you plan to use adapters.

AMD cards and Flux: ROCm is the variable. The RX 7900 XTX (24 GB, $900 new, approximately $750 used) should be a Flux monster on spec — 24 GB VRAM, 960 GB/s bandwidth, 122.8 FP16 TFLOPS. In practice, ComfyUI on ROCm for Flux is functional but approximately 10-20% slower than CUDA equivalents and occasionally hits driver edge cases with custom nodes. If you're buying a card primarily for Flux, the NVIDIA premium buys you software reliability on the generation pipeline. If you're dual-purposing the card for both LLM and Flux, the 7900 XTX's value proposition is stronger but the Flux-specific experience is rougher.

Used RTX 4080 Super (16 GB) vs used RTX 3090 (24 GB) for Flux. This is the most common buyer's dilemma in mid-2026. The 4080 Super at approximately $900-1,000 used has better compute, FP8 support, and is newer; the 3090 at approximately $700-900 used has 50% more VRAM. For Flux specifically, the VRAM wins. A 3090 at 24 GB runs Flux Dev FP16 + LoRA + ControlNet Canny simultaneously; the 4080 Super at 16 GB is confined to FP8 (or offloaded FP16) and OOMs once adapters stack. Buy the 3090 for Flux.

Power, noise, heat, and electricity cost for Flux workloads

Flux image generation has a distinct power profile from LLM inference: high utilization spikes followed by idle periods, rather than sustained throughput. This changes the thermal and acoustic experience.

Spiky power draw, not sustained. A Flux generation cycle draws peak power for approximately 10-20 seconds (denoising steps), then drops to idle for the VAE decode and save-to-disk phases. The GPU doesn't reach thermal steady-state on a single image — it takes 4-6 sequential generations to reach thermal equilibrium. For a user generating one image every few minutes, the card spends more time near-idle than at load. This means Flux workloads are thermally kinder than LLM inference, where sustained throughput keeps the card at steady-state temperature indefinitely.

Noise is intermittent and less fatiguing than LLM workloads. A card that hits 42 dBA for 15 seconds and then drops to 25 dBA for 45 seconds between generations is much less annoying than a card that holds 42 dBA continuously for an hour-long coding session. The intermittent pattern means you can work between generations in relative quiet, with brief fan ramp-ups. This is one reason Flux is more comfortable to run in a shared workspace than sustained LLM inference.

Electricity cost for a typical Flux user. A user generating 50 images per day at 15 seconds per image on an RTX 4090 (450W peak) is drawing peak power for approximately 12.5 minutes total and idling the rest. The daily power cost is approximately $0.08-0.12 at $0.16/kWh — approximately $2.50-3.60/month. On an RTX 3090 (350W) with slightly longer generation times (18 seconds), it's approximately $0.06-0.09/day, approximately $1.80-2.70/month. Flux is one of the cheapest AI workloads to run locally in electricity terms because it's bursty, not continuous.
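Here is the arithmetic behind those figures, worked through. The burst itself is nearly free; the always-on idle draw is most of the bill. The idle wattages below are our assumptions for a parked card, not measurements:

  RATE = 0.16  # $/kWh, as above

  def daily_cost(images, secs_per_image, peak_w, idle_w):
      burst_h = images * secs_per_image / 3600   # time at peak power
      burst_kwh = burst_h * peak_w / 1000
      idle_kwh = (24 - burst_h) * idle_w / 1000  # idle dominates the bill
      return (burst_kwh + idle_kwh) * RATE

  print(f"4090: ${daily_cost(50, 15, 450, 25):.2f}/day")  # ~$0.11
  print(f"3090: ${daily_cost(50, 18, 350, 20):.2f}/day")  # ~$0.09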

Heat is a non-issue for casual Flux use. A card that's at peak load for 15 seconds and idle for 45 seconds dumps minimal heat into the room compared to continuous inference. Only batch Flux workflows (100+ sequential images, LoRA training) generate sustained heat. For the typical user generating 20-50 images per session, the room temperature impact is negligible.

Compare these picks head-to-head

Frequently asked questions

Can I run Flux Dev on a 16 GB GPU?

Flux Dev FP8: yes with offloading (slower, but works). Flux Dev FP16: no, doesn't fit. Most 16 GB users run FP8 quants and accept the quality trade-off (small, often acceptable for non-final work).

How much VRAM for Flux LoRA training?

16 GB tight (works for some LoRAs at low resolution). 24 GB comfortable for typical Flux LoRA training. 32 GB allows higher batch sizes and longer training runs without OOM.

Flux vs SDXL — does GPU choice differ?

Same hardware-tier logic — both are compute-bound. SDXL works on 8-12 GB; Flux needs 16+ GB minimum. If your roadmap includes Flux, pick at least 16 GB; if you're only doing SDXL, 12 GB is workable.

ComfyUI vs A1111 — does VRAM preference differ?

ComfyUI's multi-model graph wants more VRAM than A1111's single-pipeline. With 16 GB, A1111 / Forge is more comfortable; with 24+ GB, ComfyUI's flexibility pays off.

Go deeper

When it doesn't work

Hardware bought, set up correctly, still failing? The highest-volume local-AI errors and their fixes:

If this isn't the right fit

Common alternatives readers consider: