Latent Diffusion
Latent diffusion is a technique used in image generation models such as Stable Diffusion that runs the diffusion process in a compressed, lower-dimensional latent space rather than directly in pixel space. During training, a pretrained autoencoder encodes each image into a latent representation, noise is added to that latent, and the model learns to remove it. During generation, the model starts from noise in latent space, denoises it step by step, and decodes the result back into an image. Working in latent space sharply reduces compute and memory requirements, which makes it feasible to run on consumer GPUs. Operators encounter latent diffusion in tools like Stable Diffusion WebUI or ComfyUI, where the VAE (Variational Autoencoder) handles the encode and decode steps and the UNet denoises the latent.
Deeper dive
Pixel-space diffusion models operate directly on high-resolution pixel grids (e.g., 512x512x3 = ~786k values), which is computationally expensive. Latent diffusion compresses the image into a smaller latent representation (e.g., 64x64x4 = ~16k values) using a pretrained VAE encoder. The diffusion process (forward noise addition and reverse denoising) then runs in this latent space, so the tensor being denoised is roughly 48x smaller, with a corresponding drop in memory and compute. After denoising, the VAE decoder reconstructs the final image. This design is why Stable Diffusion can run on GPUs with as little as 4 GB VRAM (at reduced resolution or with optimizations). The trade-off is that the VAE is lossy, so it can introduce slight compression artifacts, and the structure of the latent space affects output quality. Stable Diffusion XL keeps the same 8x spatial compression but targets 1024x1024 output, so its latents are larger (128x128x4) and carry finer detail.
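The encode, noise, denoise, and decode stages described above can be written compactly. The notation below is the usual DDPM-style convention and is an assumption for illustration, not taken from this article: \mathcal{E} and \mathcal{D} are the VAE encoder and decoder, \bar{\alpha}_t is the cumulative noise schedule at step t, \epsilon_\theta is the UNet noise predictor, and c is the conditioning (for example, the text prompt).

```latex
\begin{align}
z_0 &= \mathcal{E}(x)
  && \text{encode the image into a latent} \\
z_t &= \sqrt{\bar{\alpha}_t}\, z_0 + \sqrt{1 - \bar{\alpha}_t}\, \epsilon,
  \qquad \epsilon \sim \mathcal{N}(0, I)
  && \text{forward noising in latent space} \\
\mathcal{L} &= \mathbb{E}_{z_0, \epsilon, t}
  \left[ \lVert \epsilon - \epsilon_\theta(z_t, t, c) \rVert_2^2 \right]
  && \text{training objective for the UNet} \\
\hat{x} &= \mathcal{D}(\hat{z}_0)
  && \text{decode the denoised latent}
\end{align}
```

Every quantity the UNet touches has the shape of z, not of x, which is where the savings come from.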
Practical example
A 512x512 RGB image has 512 x 512 x 3 = 786,432 pixel values. After VAE encoding, the latent representation is 64x64x4 = 16,384 values, a 48x reduction. The UNet denoising step therefore operates on about 16k values instead of 786k, which lets a 512x512 generation fit in roughly 4 GB VRAM. On an RTX 3060 12 GB, generating a 512x512 image with Stable Diffusion 1.5 takes roughly 2-3 seconds, depending on the sampler and step count. Running the same denoising network directly on pixels would need far more memory (a rough estimate is more than 48 GB VRAM), which is impractical on consumer hardware.
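The arithmetic above can be checked with a few lines of Python. This is a minimal sketch, assuming the 8x spatial downsampling and 4 latent channels implied by the 512x512 to 64x64x4 figures; `latent_shape` is a hypothetical helper, not part of any library.

```python
import math


def latent_shape(height, width, downsample=8, channels=4):
    """Return the (channels, height, width) shape of the latent.

    Assumes a VAE with 8x spatial downsampling and 4 latent channels,
    which matches the 512x512 -> 64x64x4 example in the text.
    """
    if height % downsample or width % downsample:
        raise ValueError("image size must be a multiple of the downsample factor")
    return (channels, height // downsample, width // downsample)


pixel_values = math.prod((3, 512, 512))            # RGB image
latent_values = math.prod(latent_shape(512, 512))  # VAE latent
reduction = pixel_values / latent_values

print(pixel_values, latent_values, reduction)  # 786432 16384 48.0
```

Under the same assumptions the ratio stays at 48x for any resolution; a 1024x1024 image, for instance, maps to a 128x128x4 latent.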
Workflow example
In Stable Diffusion WebUI, the operator selects a checkpoint (e.g., v1-5-pruned-emaonly.safetensors) and sets the VAE to 'automatic' or to a specific VAE file such as vae-ft-mse-840000-ema-pruned. For text-to-image generation, the workflow is: (1) random noise is sampled directly in latent space (the VAE encoder is used only when starting from an existing image, as in img2img or inpainting), (2) UNet denoises the latent over ~20-50 steps (controlled by sampler settings), (3) VAE decoder reconstructs the final image. The operator sees a VAE loading message in the console and can monitor VRAM usage. Latent diffusion keeps VRAM under ~6 GB for 512x512, enabling batch generation on mid-range GPUs.
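The denoising in step (2) can be sketched as a loop over a small number of timesteps. The example below is a minimal sketch, assuming a linear noise schedule and the deterministic DDIM update (eta = 0) as one example sampler; WebUI offers several samplers and this is not claimed to match any of them exactly. The names `alpha_bar_schedule`, `ddim_sample`, and `predict_noise` are hypothetical. In a real pipeline `predict_noise` would be the prompt-conditioned UNet and the returned latent would go to the VAE decoder; here an "oracle" predictor that already knows the clean latent stands in for it, purely to check the update rule.

```python
import math
import random


def alpha_bar_schedule(num_train_steps=1000, beta_start=1e-4, beta_end=0.02):
    """Cumulative product of (1 - beta_t) for a linear beta schedule (assumed)."""
    alpha_bars, running = [], 1.0
    for t in range(num_train_steps):
        beta = beta_start + (beta_end - beta_start) * t / (num_train_steps - 1)
        running *= 1.0 - beta
        alpha_bars.append(running)
    return alpha_bars


def ddim_sample(latent, predict_noise, alpha_bars, num_steps=20):
    """Deterministic DDIM-style loop (eta = 0) from a noisy latent to a clean one."""
    last = len(alpha_bars) - 1
    # Evenly spaced timesteps, visited from most noisy to least noisy.
    timesteps = [round(i * last / (num_steps - 1)) for i in range(num_steps)][::-1]
    for i, t in enumerate(timesteps):
        ab_t = alpha_bars[t]
        # After the final step there is no noise left, so alpha_bar is 1.
        ab_prev = alpha_bars[timesteps[i + 1]] if i + 1 < num_steps else 1.0
        eps = predict_noise(latent, t)
        latent = [
            # Estimate the clean latent, then re-noise it to the previous level.
            math.sqrt(ab_prev) * (z - math.sqrt(1 - ab_t) * e) / math.sqrt(ab_t)
            + math.sqrt(1 - ab_prev) * e
            for z, e in zip(latent, eps)
        ]
    return latent


alpha_bars = alpha_bar_schedule()
rng = random.Random(0)
clean = [rng.gauss(0, 1) for _ in range(4 * 64 * 64)]  # stand-in "true" latent
noise = [rng.gauss(0, 1) for _ in clean]
ab_T = alpha_bars[-1]
start = [math.sqrt(ab_T) * c + math.sqrt(1 - ab_T) * n for c, n in zip(clean, noise)]


def oracle_noise(latent, t):
    """Placeholder for the UNet: recovers the noise exactly from the known clean latent."""
    ab = alpha_bars[t]
    return [(z - math.sqrt(ab) * c) / math.sqrt(1 - ab) for z, c in zip(latent, clean)]


result = ddim_sample(start, oracle_noise, alpha_bars, num_steps=20)
max_error = max(abs(r - c) for r, c in zip(result, clean))
```

With a perfect noise predictor the loop returns the clean latent up to floating-point error, which is why `max_error` is tiny here. A trained UNet only approximates the noise, and the number of steps then trades speed against quality.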
Reviewed by Fredoline Eruo. See our editorial policy.