Capability notes
Open-weight text-to-image in 2026 splits into three VRAM-defined tiers. The Flux family (Black Forest Labs) leads: Flux Schnell (4-step distilled, Apache 2.0, 12 GB VRAM) outputs 1024×1024 in 1.5–3 seconds on [consumer GPUs](/hardware/rtx-4090) with best-in-class text rendering — embedded text is legible at 12pt+, a 3× improvement over SDXL. Flux Dev (50-step, non-commercial, 16 GB VRAM) adds fine prompt adherence at 8–15 second generation time. Flux Pro (API-only, 24 GB+ for equivalent quality) handles 2048×2048 with ControlNet guidance.
Stable Diffusion 3.5 Large (8B params, permissive license) excels at photorealistic portraits, natural lighting, and skin texture at 1024×1024 — its MMDiT architecture renders human faces with 40% fewer anatomical errors than SDXL's UNet. Weakness: text rendering is garbled on ~60% of outputs, making it unsuitable for posters or marketing assets with embedded text.
[SDXL](/tools/diffusers) (2.6B+ params) remains the most widely-supported open-weight model with the largest fine-tuned ecosystem — thousands of LoRAs, ControlNets, and community fine-tunes. SDXL is the default starting point at 8 GB VRAM minimum. Resolution ceiling: 1024×1024 natively, 1536×1536 with high-res fix.
Quality differentiators: prompt adherence, text rendering, photorealism, anatomical correctness, style consistency. No single model leads across all. Flux dominates text rendering and prompt adherence. SD3.5 dominates photorealism. SDXL dominates ecosystem breadth. Match model to output requirement, not to benchmark leaderboards.
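The match-model-to-requirement rule can be encoded as a small lookup. This is a hypothetical helper (the requirement keys and the 12 GB fallback threshold are assumptions drawn from the tiers above, not a published API):

```python
# Hypothetical mapping from output requirement to model tier, following
# the text: Flux for text/adherence, SD3.5 for photorealism/anatomy,
# SDXL for ecosystem breadth.
BEST_FOR = {
    "text_rendering": "flux-schnell",
    "prompt_adherence": "flux-schnell",
    "photorealism": "sd3.5-large",
    "anatomy": "sd3.5-large",
    "lora_ecosystem": "sdxl",
    "style_finetunes": "sdxl",
}

def pick_model(requirement: str, vram_gb: int) -> str:
    """Suggest an open-weight model for one output requirement,
    falling back to SDXL below the ~12 GB Flux/SD3.5 floor."""
    model = BEST_FOR.get(requirement, "sdxl")
    if model != "sdxl" and vram_gb < 12:
        return "sdxl"  # SDXL's 8 GB minimum is the lowest tier above
    return model

print(pick_model("text_rendering", 16))  # flux-schnell
print(pick_model("photorealism", 8))     # sdxl (VRAM fallback)
```

The point of the helper is the shape of the decision, not the names: start from the output requirement, then apply the VRAM constraint.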
If you just want to try this
Lowest-friction path to a working setup.
Install [ComfyUI](/tools/comfyui) via Stability Matrix (stabilitymatrix.com → download → one-click install) on Windows, or via Pinokio (pinokio.ai → search "ComfyUI") on any OS. Both bundle ComfyUI with GPU detection — no manual Python setup.
Once launched (browser tab at localhost:8188), download Flux Schnell:
1. ComfyUI Manager → Install Models → search "flux1-schnell" → download safetensors (~23 GB).
2. ComfyUI Manager → Install Custom Nodes → search "ComfyUI-GGUF" to run quantized (FP8/NF4-class) model variants at reduced VRAM.
3. Load the default Flux workflow from ComfyUI's workflow library.
4. Set width=1024, height=1024, steps=4, guidance=3.5, type prompt → Queue Prompt.
On [RTX 4060 Ti 16GB](/hardware/rtx-4060-ti-16gb): 3–5 seconds per image. [RTX 4090](/hardware/rtx-4090): 1.5–2.5 seconds. [RTX 3060 12GB](/hardware/rtx-3060-12gb) with FP8 via GGUF: 6–10 seconds.
If Flux exceeds VRAM, fall back to SDXL. Download "sd_xl_base_1.0.safetensors" via ComfyUI Manager (6.9 GB). SDXL runs on 8 GB GPUs at 8–15 seconds for 1024×1024. Quality is lower on prompt adherence and text rendering, but the LoRA ecosystem (character styles, art styles, specific subjects) is 10× larger.
Alternative: [LM Studio](/tools/lm-studio) + "Stable Diffusion WebUI" plugin gives an A1111-style UI without manual Python/CUDA setup. LM Studio handles model download and GPU config in one application.
For production deployment
Operator-grade recommendation.
Production image generation requires GPU sizing for throughput and OOM monitoring. Throughput (images/min) = (60 / seconds-per-batch) × batch-size. ComfyUI + Flux Schnell (4 steps, 1024×1024, FP8):
- [RTX 4090](/hardware/rtx-4090) (24 GB): 25–35 images/min at batch=1, 55–70 at batch=4. VRAM saturated at batch=4 (22.5 GB).
- [RTX 5090](/hardware/rtx-5090) (32 GB): 40–55 images/min at batch=1, 80–100 at batch=4. Saturation at batch=6 (30 GB).
- [RTX 6000 Ada](/hardware/rtx-6000-ada) (48 GB): 30–40 at batch=1, 80–110 at batch=8. Larger batches despite lower bandwidth than 5090.
- [L40S](/hardware/nvidia-l40s) (48 GB): identical profile — datacenter SKU.
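The throughput arithmetic above reduces to a one-line sizing helper. The time input is seconds per batch iteration (at batch=1, that is just seconds per image); the sample values below are the RTX 4090 figures from the list, used as a sanity check:

```python
def images_per_min(seconds_per_batch: float, batch_size: int) -> float:
    """Throughput in images/min: (60 / seconds-per-batch) * batch-size."""
    return (60.0 / seconds_per_batch) * batch_size

# RTX 4090, Flux Schnell FP8, 1024x1024:
print(round(images_per_min(2.0, 1)))  # ~2.0 s/image at batch=1 -> ~30/min
print(round(images_per_min(3.8, 4)))  # ~3.8 s per 4-image batch -> ~63/min
```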
For Flux Dev (50 steps): divide throughput by 6–8. Use Dev only when Schnell's 4-step distillation produces visible artifacts on your specific prompt domain — roughly the ~15% of prompts that require precise spatial composition.
**API vs self-host.** Flux Pro via Replicate ~$0.05/image at 1024×1024. SD3.5 Turbo ~$0.003/image. At 10,000 images/month, API = $50–500/month. Self-hosted [RTX 4090](/hardware/rtx-4090) (~$250/month amortized) breaks even at Flux Pro tier. At 100,000 images/month, self-hosted [L40S](/hardware/nvidia-l40s) (~$400–600/month cloud rental) saves 60–80% vs API.
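The break-even comparison is simple enough to keep in a script next to your billing data. A sketch using the Flux Pro figures quoted above (prices are the article's estimates, not current rates):

```python
def monthly_cost_api(images: int, usd_per_image: float) -> float:
    """API bill at a flat per-image price."""
    return images * usd_per_image

def breakeven_images(gpu_usd_per_month: float, usd_per_image: float) -> float:
    """Monthly volume at which self-hosting matches the API bill."""
    return gpu_usd_per_month / usd_per_image

# Flux Pro via API at ~$0.05/image vs ~$250/month amortized RTX 4090:
print(breakeven_images(250, 0.05))     # 5000.0 images/month
print(monthly_cost_api(10_000, 0.05))  # 500.0 (top of the $50-500 range)
```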
**Production architecture.** ComfyUI in API mode (`python main.py --listen --port 8188`) behind NGINX with a Redis job queue. Each API call accepts a prompt + optional reference image + workflow template, queues to a GPU worker, and returns an image URL. Version-control workflows as JSON in git. Separate GPU workers by workload: 24 GB for Flux Schnell batch=4 (throughput), 48 GB for Flux Dev + ControlNet combos (quality), 8–12 GB for SDXL LoRA style-specific batches.
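A minimal client for this setup might look like the sketch below. The `/prompt` endpoint and `{"prompt": workflow}` payload shape follow ComfyUI's HTTP API; the node id `"6"` is illustrative (it depends on your saved graph), and the Redis queue, auth, and image delivery are left to the surrounding service:

```python
import json
import urllib.request

def build_payload(workflow: dict, prompt_text: str,
                  prompt_node: str = "6") -> bytes:
    """Inject the prompt into a version-controlled workflow JSON.
    The node id is illustrative; look it up in your saved graph."""
    wf = json.loads(json.dumps(workflow))  # deep copy; keep the git copy pristine
    wf[prompt_node]["inputs"]["text"] = prompt_text
    return json.dumps({"prompt": wf}).encode("utf-8")

def queue_prompt(payload: bytes, host: str = "http://127.0.0.1:8188") -> dict:
    """POST to ComfyUI's /prompt endpoint; the response carries a
    prompt_id the job queue can poll for completion."""
    req = urllib.request.Request(f"{host}/prompt", data=payload,
                                 headers={"Content-Type": "application/json"})
    with urllib.request.urlopen(req) as resp:
        return json.loads(resp.read())

workflow = {"6": {"class_type": "CLIPTextEncode", "inputs": {"text": ""}}}
payload = build_payload(workflow, "a red bicycle, studio lighting")
print(json.loads(payload)["prompt"]["6"]["inputs"]["text"])
```

Keeping `build_payload` pure (no network) makes the prompt-injection step unit-testable without a GPU worker running.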
**OOM management.** VRAM scales quadratically with image dimensions. Monitor per-workflow VRAM and set hard batch-size caps. Retry queue: on OOM, halve batch size and retry. Set max resolution per GPU tier: 8 GB → 1024×1024, 16 GB → 1536×1536, 24 GB → 2048×2048. Flux at 2048 needs 24–32 GB at FP8 — it will OOM on 16 GB cards.
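The halve-and-retry policy is a small wrapper around the generate call. Both `GpuOom` and `fake_generate` below are stand-ins (for the framework's OOM error, e.g. `torch.cuda.OutOfMemoryError`, and for your worker's real generate function):

```python
class GpuOom(RuntimeError):
    """Stand-in for the framework's out-of-memory error."""

def generate_with_retry(generate, prompts, batch_size: int):
    """On OOM, halve the batch size and retry; give up below batch=1."""
    while batch_size >= 1:
        try:
            return [generate(prompts[i:i + batch_size])
                    for i in range(0, len(prompts), batch_size)]
        except GpuOom:
            batch_size //= 2
    raise GpuOom("single-image generation still exceeds VRAM")

def fake_generate(batch):
    """Simulated worker that OOMs above batch=2."""
    if len(batch) > 2:
        raise GpuOom()
    return len(batch)

print(generate_with_retry(fake_generate, list(range(8)), 4))  # [2, 2, 2, 2]
```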
What breaks
Failure modes operators see in the wild.
**OOM at high resolutions.** Symptom: CUDA OOM when resolution exceeds GPU capacity. 2048×2048 Flux Schnell FP16 = 22–28 GB; + ControlNet = +4–8 GB; + LoRA = +1–3 GB. Cause: VRAM scales quadratically with image dimensions (2× dimensions = 4× VRAM). Mitigation: apply FP8/NF4 quantization to cut VRAM 40–50% (ComfyUI-GGUF), limit max resolution per GPU in workflow config, and use tiled VAE decoding. For batches, process sequentially when VRAM-tight — batching 4×1024 uses 3.5× the VRAM of sequential generation.
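The quadratic rule is easier to apply if fixed weight memory is separated from resolution-dependent activation memory; only the latter scales with pixel count. The split below is an illustrative assumption (not measured), chosen to land inside the 24–32 GB figure quoted above for Flux at 2048 under FP8:

```python
def est_vram_gb(weights_gb: float, act_gb_at_1024: float, edge: int) -> float:
    """Weights are resolution-independent; activation/latent memory
    scales with pixel count (2x each edge = 4x that component)."""
    return weights_gb + act_gb_at_1024 * (edge / 1024) ** 2

# Illustrative split for Flux Schnell at FP8:
# ~12 GB weights, ~3 GB activations at 1024x1024.
print(est_vram_gb(12, 3, 1024))  # 15.0
print(est_vram_gb(12, 3, 2048))  # 24.0
```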
**CFG scale artifacts.** Symptom: CFG above 7–8 produces oversaturated colors, burnt highlights, unnatural contrast. Cause: high CFG pushes too far from unconditional path — "CFG burn." Mitigation: CFG 3.5–5.0 for Flux family, cap at 7 for SDXL, use dynamic thresholding (CFG Rescale node in ComfyUI).
**Face and limb distortion.** Symptom: six-plus fingers, merged limbs, asymmetrical eyes. SD3.5 reduces this to ~15% of outputs (vs SDXL's ~25%). Cause: diffusion models denoise patches without an architectural guarantee of global anatomy. Mitigation: use SD3.5 or Flux for humans (DiT architectures handle global coherence better), apply face-detail LoRAs, use ADetailer (automatic face inpainting), or batch-generate 5–10 variants and filter with an aesthetic-scoring model.
**Text gibberish in images.** Symptom: embedded text reads as random characters. SDXL: ~10% legible; SD3.5: ~40%; Flux: ~85%. Cause: patch-based denoising doesn't produce character-level stroke coherence; Flux's T5 text encoder and MMDiT backbone preserve more character-level structure. Mitigation: use Flux for text-containing images, keep text short (1–5 words), and composite generated images with real text in post-processing for commercial use.
**NSFW filter false positives.** Symptom: innocuous prompts containing "girl," "body," anatomical terms blocked. SD3.5's safety classifier blocks ~3–5% of benign prompts. Cause: keyword matching without semantic understanding. Mitigation: rephrase to avoid flagged keywords (community maintains lists), use SDXL community fine-tunes lacking safety classifier (verify license), deploy custom workflows bypassing the filter node.
**Prompt bleed in batch generation.** Symptom: previous-prompt elements appear in subsequent images — jacket color bleeds, background style persists. Cause: diffusion samplers can carry residual noise patterns between consecutive generations. Mitigation: randomize seed per generation, clear GPU cache between batch runs, toggle "Clear Cache" node in persistent ComfyUI workflows between batches.
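Per-generation seed randomization is easy to enforce at the API layer. The `seed` field name matches KSampler's input in ComfyUI API-format workflows; the one-node graph below is illustrative:

```python
import secrets

def randomize_seeds(workflow: dict) -> dict:
    """Overwrite every sampler seed so consecutive generations
    share no noise state."""
    for node in workflow.values():
        if "seed" in node.get("inputs", {}):
            node["inputs"]["seed"] = secrets.randbelow(2**64)
    return workflow

# Minimal illustrative graph: a single KSampler-style node.
wf = randomize_seeds({"3": {"class_type": "KSampler",
                            "inputs": {"seed": 0, "steps": 4}}})
print(wf["3"]["inputs"]["seed"])
```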
Hardware guidance
**Hobbyist tier ($300–600).** [RTX 3060 12GB](/hardware/rtx-3060-12gb): SDXL at 12–20 seconds — functional but slow. Flux Schnell FP8 via GGUF: 8–15 seconds, with VRAM at its floor. [Intel Arc B580](/hardware/intel-arc-b580) at 12 GB: SDXL in 20–30 seconds via IPEX. [RX 7600 XT](/hardware/rx-7600-xt) at 16 GB: SDXL in 15–25 seconds via DirectML/ROCm — 30–50% slower than NVIDIA at the same tier.
**Enthusiast tier ($1,500–2,500).** [RTX 4090](/hardware/rtx-4090) at 24 GB: the image generation king — Flux Schnell 1.5–2.5s, Flux Dev 8–12s, SDXL 2–4s. Fits Flux Dev + 1 ControlNet + 2 LoRAs. [RTX 5090](/hardware/rtx-5090) at 32 GB: Flux Dev + 2 ControlNets + 3 LoRAs + 2048 upscale — best single-card consumer gen machine. [RTX 5080](/hardware/rtx-5080) at 16 GB: Flux Schnell 3–5s but cannot fit Flux Dev + ControlNet combos — 16 GB restrictive for professional workflows. [RX 7900 XTX](/hardware/rx-7900-xtx) at 24 GB: SDXL 5–8s via ROCm on Linux — AMD stack has ComfyUI/Flux rough edges (node compatibility, fp8 gaps).
**Professional tier ($6,000–15,000).** [RTX 6000 Ada](/hardware/rtx-6000-ada) at 48 GB: fits Flux Dev + 3 ControlNets + 5 LoRAs + 2048 output simultaneously. Batch=8 at 1024 → 100+ images/min. [L40S](/hardware/nvidia-l40s) at 48 GB: datacenter equivalent, better sustained thermals.
**Enterprise tier ($25,000+).** [A100 80GB SXM](/hardware/nvidia-a100-80gb-sxm): all workflows with headroom — Flux Pro-tier quality at 4096, multi-ControlNet, batch=16. [H100](/hardware/nvidia-h100-pcie) is less suited: image gen is bandwidth-bound, not compute-bound, so its extra compute goes unused at a much higher price. The enterprise pick maximizes VRAM/$: A100 80 GB for 2048+ resolution, L40S 48 GB for 1024–1536 throughput.
VRAM is the binding constraint — more important than bandwidth or TFLOPS. If a workflow exceeds VRAM, it crashes; if it fits, generation time scales with step count and inversely with memory bandwidth. For throughput, prioritize VRAM → batch size → images/min. For low latency, prioritize bandwidth → per-image speed.
Runtime guidance
**ComfyUI vs Automatic1111 vs Diffusers — workflow paradigm comparison.**
[ComfyUI](/tools/comfyui) uses a node-based graph editor: each operation (load model, encode prompt, sample, decode, upscale, save) is a node connected by edges. This is the production standard in 2026. Advantages: workflows are JSON files — version-controllable, shareable, reproducible. The node architecture enables complex pipelines (multi-model, multi-ControlNet, upscale chains) impossible in linear UIs. API mode serves as a production backend — POST workflow JSON + prompt, get an image. Tradeoffs: a learning curve for the node graph, debugging that means tracing data flow across a large graph, and a community fragmented across 500+ custom node packages.
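For a sense of what "workflows are JSON" means, here is a fragment in ComfyUI's API-format JSON. Node ids and exact field names depend on the saved graph; edges are `[source_node_id, output_index]` pairs, and a runnable graph needs more nodes (negative prompt, latent image, VAE decode, save) than shown:

```json
{
  "4": { "class_type": "CheckpointLoaderSimple",
         "inputs": { "ckpt_name": "sd_xl_base_1.0.safetensors" } },
  "6": { "class_type": "CLIPTextEncode",
         "inputs": { "text": "a red bicycle", "clip": ["4", 1] } },
  "3": { "class_type": "KSampler",
         "inputs": { "model": ["4", 0], "positive": ["6", 0],
                     "seed": 42, "steps": 20, "cfg": 7.0 } }
}
```

Because the whole pipeline is plain JSON, diffs in git show exactly which node or parameter changed between workflow versions.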
[Automatic1111 WebUI](/tools/automatic1111) uses a linear tab interface: txt2img → img2img → extras. This was the default from 2022–2024 but is largely superseded. Advantages: simpler learning curve, one-click installers, enormous tutorial base. Tradeoffs: linear workflows cannot express multi-model pipelines, modal UI can't show full pipeline at once, development slowed — last major update 1.6.0 (October 2024). Acceptable for casual use; not production-grade.
**SD WebUI Forge** (A1111 fork by lllyasviel, ControlNet author) optimizes memory management for 30–50% lower VRAM. Forge handles UNet offloading, VAE tiling, and gradient checkpointing more aggressively. Drop-in upgrade for A1111 users with VRAM constraints — identical UI, same extensions, lower VRAM. Right choice for 8–12 GB GPUs running SDXL.
[Diffusers](/tools/diffusers) (Hugging Face) is the Python API for programmatic generation — no GUI, pure code. Use for API-level control: dynamic prompt scheduling, custom sampling, model merging at inference, integration into larger Python apps. Loads any Hugging Face model (Flux, SDXL, SD3.5) with same API. Tradeoffs: write Python for every pipeline, no visual workflow debugging, manual GPU memory management.
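For scale, a minimal Diffusers script (not runnable without `diffusers`, `torch`, and a CUDA GPU; the model id and parameters are illustrative — SDXL is shown because it fits the 8 GB floor discussed above, and Flux or SD3.5 checkpoints load through the same API):

```python
import torch
from diffusers import DiffusionPipeline

# Illustrative checkpoint from the Hugging Face hub.
pipe = DiffusionPipeline.from_pretrained(
    "stabilityai/stable-diffusion-xl-base-1.0",
    torch_dtype=torch.float16,
)
pipe.to("cuda")  # manual memory management: you pick device and offload strategy

image = pipe(
    prompt="a red bicycle, studio lighting",
    num_inference_steps=30,
    guidance_scale=7.0,  # stays under the SDXL CFG cap noted above
    width=1024, height=1024,
).images[0]
image.save("bicycle.png")
```

Everything a GUI would hide — scheduler choice, device placement, batch shaping — is explicit code here, which is exactly the tradeoff the paragraph above describes.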
**Decision tree.** Primary: [ComfyUI](/tools/comfyui) — standard, community momentum, scales from hobbyist to production API. [Automatic1111](/tools/automatic1111) or Forge: only for existing A1111 workflows with simple pipelines (txt2img, img2img, no ControlNets). [Diffusers](/tools/diffusers): custom applications wrapping image gen in business logic. Production serving: ComfyUI API mode + Python client — workflow JSON defines pipeline, application handles queuing/auth/delivery.
**Model loading.** ComfyUI's "Load Checkpoint" loads full model to VRAM at startup.