ControlNet

ControlNet is a neural network architecture that adds spatial conditioning to pretrained image diffusion models such as Stable Diffusion. It takes an additional input image (e.g., a depth map, edge map, or pose skeleton) and guides generation to follow that structure. The operator loads a ControlNet alongside the base model; the extra input constrains where content appears. ControlNets are small enough (typically 1–2 GB at FP16) to fit alongside a base model such as Stable Diffusion 1.5 or SDXL (roughly 1B to 3.5B parameters) on a 12–24 GB GPU, though VRAM usage increases by roughly 20–30%.
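The sizing claim above can be checked with rough arithmetic. The following is a minimal sketch, not a measurement tool: the function names, the 30% headroom reserved for activations and framework overhead, and the example sizes are all assumptions chosen for illustration.

```python
def weights_gb(num_params: float, bytes_per_param: int = 2) -> float:
    """Approximate weight size in GB (10^9 bytes); FP16 uses 2 bytes per parameter."""
    return num_params * bytes_per_param / 1e9


def fits_in_vram(vram_gb: float, model_sizes_gb: list[float],
                 headroom_fraction: float = 0.3) -> bool:
    """Check whether the listed weights fit while reserving a fraction of VRAM
    for activations, latents, and framework overhead (assumed, not measured)."""
    usable_gb = vram_gb * (1.0 - headroom_fraction)
    return sum(model_sizes_gb) <= usable_gb


# Illustrative values only: a 7 GB base checkpoint plus a 1.5 GB ControlNet.
print(weights_gb(1.0e9))                 # 1B parameters at FP16
print(fits_in_vram(24, [7.0, 1.5]))      # 24 GB card
print(fits_in_vram(8, [7.0, 1.5]))       # 8 GB card
```

Real peak usage also depends on resolution, batch size, and attention implementation, so an estimate like this is only a first filter before testing on actual hardware.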

Deeper dive

ControlNet keeps the pretrained diffusion model's weights locked and trains a copy of its UNet encoder blocks as a separate 'control' network. The copy receives the conditioning image, and its outputs are added back into the locked UNet at multiple resolutions through zero-initialized convolution layers, so training starts from the base model's unmodified behavior. During inference, the operator provides a conditioning image (e.g., Canny edges, a depth map, or an OpenPose skeleton) and a prompt. The ControlNet modifies the UNet's intermediate activations so the output respects the spatial layout. Standard variants include Canny (edge-guided), depth (3D structure), normal map, and scribble. Operators often combine multiple ControlNets (e.g., depth + Canny) for finer control, though each adds VRAM overhead. In practice, ControlNet is used in Stable Diffusion workflows via ComfyUI, Automatic1111, or InvokeAI; the operator selects a preprocessor to generate the conditioning image from a source photo, then runs the combined model.
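To make the injection idea concrete, here is a toy sketch in plain Python. It is not the real architecture: `locked_block` and `control_copy` are hypothetical stand-ins for a frozen UNet block and its trainable copy, and a single scalar stands in for a zero-initialized 1x1 convolution. It only illustrates two properties: a zero-initialized control path leaves the base output unchanged, and several ControlNets contribute by adding their scaled residuals.

```python
from typing import List, Tuple

Vector = List[float]


def locked_block(x: Vector) -> Vector:
    """Stand-in for a frozen UNet block (hypothetical toy function)."""
    return [2.0 * v + 1.0 for v in x]


def control_copy(x: Vector, cond: Vector) -> Vector:
    """Stand-in for the trainable copy, which also sees the conditioning input."""
    return [2.0 * (v + c) + 1.0 for v, c in zip(x, cond)]


def controlled_output(x: Vector, controls: List[Tuple[Vector, float, float]]) -> Vector:
    """Add each control residual to the locked output.

    controls holds one (cond, zero_conv_weight, strength) tuple per ControlNet.
    """
    out = locked_block(x)
    for cond, zero_conv_weight, strength in controls:
        residual = control_copy(x, cond)
        out = [o + strength * zero_conv_weight * r for o, r in zip(out, residual)]
    return out


x = [1.0, 2.0]
depth_cond = [0.5, 0.0]

# Zero-initialized weight: the control path contributes nothing yet.
print(controlled_output(x, [(depth_cond, 0.0, 1.0)]) == locked_block(x))

# After "training" (weight 0.5), the user-facing strength scales the residual.
print(controlled_output(x, [(depth_cond, 0.5, 0.5)]))
```

Passing two tuples in `controls` mimics stacking two ControlNets, each with its own strength.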

Practical example

An operator wants to generate an image of a castle that matches the layout of a photo. They load the Stable Diffusion XL (SDXL) base checkpoint (6.9 GB) plus a depth ControlNet (1.2 GB) on an RTX 4090 (24 GB). They run the depth preprocessor on the photo to produce a grayscale depth map, then set the ControlNet weight to 0.8. The output preserves the photo's 3D structure while the prompt 'fantasy castle, sunset' sets the subject and style. VRAM usage peaks at ~18 GB for a 1024×1024 image, leaving about 6 GB of headroom on the 24 GB card.
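The same scenario can be sketched in code with the Hugging Face diffusers library, which is an assumption here since the page itself describes GUI tools. The model IDs, the helper names, the step count, and the strength bounds are illustrative choices, and `generate` needs a CUDA GPU plus downloaded weights, so treat it as a starting point rather than a tested recipe.

```python
def pipeline_call_kwargs(prompt: str, strength: float = 0.8, size: int = 1024) -> dict:
    """Build keyword arguments for the pipeline call; strength is the ControlNet weight."""
    if not 0.0 <= strength <= 2.0:
        raise ValueError("strength outside the range this sketch allows (0 to 2)")
    return {
        "prompt": prompt,
        "controlnet_conditioning_scale": strength,
        "width": size,
        "height": size,
        "num_inference_steps": 30,  # assumed value for illustration
    }


def generate(depth_map_path: str, prompt: str, strength: float = 0.8):
    """Load SDXL plus a depth ControlNet and generate one image (needs a CUDA GPU)."""
    # Heavy imports stay local so the helper above works without them installed.
    import torch
    from diffusers import ControlNetModel, StableDiffusionXLControlNetPipeline
    from diffusers.utils import load_image

    controlnet = ControlNetModel.from_pretrained(
        "diffusers/controlnet-depth-sdxl-1.0", torch_dtype=torch.float16
    )
    pipe = StableDiffusionXLControlNetPipeline.from_pretrained(
        "stabilityai/stable-diffusion-xl-base-1.0",
        controlnet=controlnet,
        torch_dtype=torch.float16,
    ).to("cuda")

    depth_map = load_image(depth_map_path)  # an already-preprocessed depth map
    result = pipe(image=depth_map, **pipeline_call_kwargs(prompt, strength))
    return result.images[0]
```

A call such as `generate("castle_depth.png", "fantasy castle, sunset")` (the file name is hypothetical) would return a PIL image that can be saved with `.save(...)`.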

Workflow example

In ComfyUI, the operator loads a checkpoint (e.g., sd_xl_base_1.0.safetensors) and a ControlNet model (e.g., controlnet-depth-sdxl-1.0.safetensors). They connect a 'Load Image' node for the source photo, a 'ControlNet Preprocessor' node (set to 'Depth MiDaS'), and a 'ControlNet Apply' node that takes the prompt conditioning, the ControlNet model, and the preprocessed image. They set the strength to 0.9 and queue the prompt. The runtime loads both models into VRAM; the operator monitors memory via nvidia-smi to avoid out-of-memory (OOM) errors.
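The memory check at the end of this workflow can be scripted. The sketch below wraps nvidia-smi's CSV query mode and flags high usage; the function names and the 90% warning threshold are assumptions for illustration, and `query_gpu_memory` only works on a machine with an NVIDIA driver installed.

```python
import subprocess


def parse_memory_csv(output: str) -> list[tuple[int, int]]:
    """Parse 'used, total' MiB pairs, one line per GPU."""
    pairs = []
    for line in output.strip().splitlines():
        used, total = (int(field.strip()) for field in line.split(","))
        pairs.append((used, total))
    return pairs


def query_gpu_memory() -> list[tuple[int, int]]:
    """Run nvidia-smi and return (used_mib, total_mib) for each GPU."""
    output = subprocess.run(
        ["nvidia-smi",
         "--query-gpu=memory.used,memory.total",
         "--format=csv,noheader,nounits"],
        capture_output=True, text=True, check=True,
    ).stdout
    return parse_memory_csv(output)


def near_oom(used_mib: int, total_mib: int, threshold: float = 0.9) -> bool:
    """True when usage exceeds the chosen fraction of total memory (assumed threshold)."""
    return used_mib / total_mib > threshold


# Example with a captured output line instead of a live query:
sample = "18432, 24564\n"
for used, total in parse_memory_csv(sample):
    print(used, total, near_oom(used, total))
```

Polling `query_gpu_memory()` while a generation runs gives an early warning before the runtime hits an out-of-memory error.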

Reviewed by Fredoline Eruo. See our editorial policy.

When it doesn't work