DreamBooth — AI glossary

DreamBooth is a fine-tuning technique that personalizes a text-to-image model (like Stable Diffusion) so it can generate images of a specific subject (e.g., a person, pet, or object) in new contexts. The model is trained on a small set of images of the subject (typically 3–5), each captioned with a rare identifier token plus the subject's class noun (e.g., "sks dog"). A prior-preservation loss limits overfitting and keeps the model from forgetting what the broader class looks like. The result is a custom checkpoint or LoRA adapter that can be loaded into image-generation software to produce novel scenes featuring the subject. For operators, DreamBooth requires significant VRAM (12–24 GB for full fine-tuning) and time (30–60 minutes on a consumer GPU), though LoRA-based variants reduce both.

Deeper dive

DreamBooth, introduced by Google Research in 2022, personalizes a diffusion model by binding a new subject to a rare identifier token, so that prompts containing the token reproduce that subject. The process involves: (1) collecting 3–5 images of the subject from different angles/backgrounds, (2) assigning a rare token (e.g., "sks") as a placeholder, (3) fine-tuning the UNet and, optionally, the text encoder on those images with a prior-preservation term that uses the base model's own generated samples of the class to retain general knowledge. The output is a fine-tuned checkpoint (roughly 2 GB in half precision for Stable Diffusion 1.5) or, more commonly, a LoRA adapter (on the order of 100 MB, depending on rank) that is applied on top of the base model at inference time or merged into it.

Operator-relevant variants include full DreamBooth (high VRAM, high quality) and DreamBooth + LoRA (lower VRAM, faster training). Textual Inversion is a related alternative that leaves the model weights untouched and trains only embedding vectors. Tools like Kohya_ss, EveryDream2, and Hugging Face's diffusers library provide scripts for training. On a 24 GB GPU (RTX 3090), full fine-tuning takes roughly 45 minutes at 512x512 resolution; LoRA training on a 12 GB card takes roughly 15 minutes. Inference requires loading the custom checkpoint or LoRA into software like Automatic1111, ComfyUI, or InvokeAI.
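The prior-preservation term described above can be sketched as a weighted sum of two mean-squared-error losses: one on the subject images and one on class images generated by the base model. This is a minimal illustration, assuming the noise-prediction objective commonly used by Stable Diffusion training scripts; the function names and the toy numbers are hypothetical, and real implementations operate on tensors rather than Python lists.

```python
from typing import Sequence


def mse(pred: Sequence[float], target: Sequence[float]) -> float:
    """Mean squared error between two equal-length vectors."""
    if len(pred) != len(target):
        raise ValueError("pred and target must have the same length")
    return sum((p - t) ** 2 for p, t in zip(pred, target)) / len(pred)


def dreambooth_loss(
    instance_pred: Sequence[float],
    instance_noise: Sequence[float],
    class_pred: Sequence[float],
    class_noise: Sequence[float],
    prior_loss_weight: float = 1.0,
) -> float:
    """Subject (instance) loss plus weighted prior-preservation loss.

    The instance term teaches the model the new subject; the prior term
    is computed on class images sampled from the base model and pulls
    the fine-tuned model back toward its original notion of the class.
    """
    instance_loss = mse(instance_pred, instance_noise)
    prior_loss = mse(class_pred, class_noise)
    return instance_loss + prior_loss_weight * prior_loss


# Toy values standing in for flattened noise tensors (illustrative only).
total = dreambooth_loss(
    instance_pred=[0.5, 0.0],
    instance_noise=[0.0, 0.0],
    class_pred=[1.0, 1.0],
    class_noise=[1.0, 0.0],
    prior_loss_weight=0.5,
)
print(total)  # 0.125 + 0.5 * 0.5 = 0.375
```

Setting prior_loss_weight to 0 reduces this to plain fine-tuning on the subject images, which is where overfitting and loss of class diversity tend to appear.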

Practical example

An operator wants to generate images of their cat "Mittens" in various styles. They take 5 photos of Mittens from different angles, then use Kohya_ss to train a LoRA with the token "mttns cat" on a 12 GB RTX 3060. Training takes 20 minutes at 512x512, outputting a 144 MB LoRA file. They load this LoRA into Automatic1111 alongside the same base model it was trained against (a LoRA trained on Stable Diffusion 1.5 will not work with Stable Diffusion XL, and vice versa), and prompt "a portrait of mttns cat wearing a wizard hat, oil painting". The model generates Mittens in that style, preserving fur color and face shape.
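To see why a LoRA file is so much smaller than a full checkpoint, it helps to look at what a LoRA stores: for each adapted weight matrix W, two thin matrices B and A whose product is added to W. The sketch below is illustrative only; the 768 x 768 layer size, the rank of 8, and the function names are assumptions chosen for the example, not values taken from any particular trainer.

```python
def full_param_count(d_out: int, d_in: int) -> int:
    """Parameters in a dense weight matrix of shape (d_out, d_in)."""
    return d_out * d_in


def lora_param_count(d_out: int, d_in: int, rank: int) -> int:
    """Parameters in one LoRA pair: B is (d_out, rank), A is (rank, d_in)."""
    return rank * (d_out + d_in)


def apply_lora(W, B, A, scale):
    """Return W + scale * (B @ A) using plain nested lists.

    `scale` plays the role of alpha / rank in the usual LoRA formulation.
    """
    rows, cols, rank = len(W), len(W[0]), len(A)
    return [
        [
            W[i][j] + scale * sum(B[i][k] * A[k][j] for k in range(rank))
            for j in range(cols)
        ]
        for i in range(rows)
    ]


# Hypothetical layer size and rank, for illustration.
full = full_param_count(768, 768)
lora = lora_param_count(768, 768, rank=8)
print(full, lora, full // lora)  # 589824 12288 48

# Tiny rank-1 update applied to a 2 x 2 identity matrix.
merged = apply_lora(
    W=[[1.0, 0.0], [0.0, 1.0]],
    B=[[1.0], [2.0]],
    A=[[3.0, 4.0]],
    scale=0.5,
)
print(merged)  # [[2.5, 2.0], [3.0, 5.0]]
```

Because the update is defined relative to specific weight matrices, the adapter only makes sense with a base model that has the same layers and shapes as the one it was trained on.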

Workflow example

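In a typical workflow, the operator first prepares a dataset of 3–5 subject images, resized to 512x512. They then run a DreamBooth training script such as the diffusers example script. The command below enables prior preservation explicitly, since --class_prompt has no effect without it; the class_data_dir and output_dir paths are placeholders:

    accelerate launch train_dreambooth.py \
      --pretrained_model_name_or_path=runwayml/stable-diffusion-v1-5 \
      --instance_data_dir=./mittens \
      --class_data_dir=./class_images \
      --output_dir=./mittens-model \
      --with_prior_preservation --prior_loss_weight=1.0 \
      --instance_prompt="a photo of sks cat" \
      --class_prompt="a photo of a cat" \
      --resolution=512 \
      --train_batch_size=1 \
      --gradient_accumulation_steps=1 \
      --learning_rate=5e-6 \
      --lr_scheduler=constant \
      --lr_warmup_steps=0 \
      --max_train_steps=800

After training, the output folder can be loaded directly with the diffusers library. UIs such as Automatic1111 generally expect a single .safetensors or .ckpt file, so the folder usually has to be converted to that format and placed in the tool's model directory first. Inference prompts use the unique token (e.g., "sks cat") to trigger the personalized concept.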
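When the same training run is repeated for several subjects, it can be convenient to assemble the command from a settings dictionary rather than editing a long shell line by hand. The sketch below is one possible approach, not part of any tool: build_dreambooth_command is a hypothetical helper, and the dictionary reuses flags from the command quoted above. It only builds and prints the command; actually launching it would require accelerate and the training script to be installed.

```python
import shlex


def build_dreambooth_command(config: dict) -> list[str]:
    """Turn a settings dict into an argv list for the training script.

    True becomes a bare flag (--name), False/None are skipped, and any
    other value becomes --name=value.
    """
    argv = ["accelerate", "launch", "train_dreambooth.py"]
    for key, value in config.items():
        if value is True:
            argv.append(f"--{key}")
        elif value is False or value is None:
            continue
        else:
            argv.append(f"--{key}={value}")
    return argv


config = {
    "pretrained_model_name_or_path": "runwayml/stable-diffusion-v1-5",
    "instance_data_dir": "./mittens",
    "instance_prompt": "a photo of sks cat",
    "class_prompt": "a photo of a cat",
    "resolution": 512,
    "train_batch_size": 1,
    "gradient_accumulation_steps": 1,
    "learning_rate": "5e-6",
    "lr_scheduler": "constant",
    "lr_warmup_steps": 0,
    "max_train_steps": 800,
}

argv = build_dreambooth_command(config)
# shlex.join quotes arguments containing spaces so the line can be pasted into a shell.
print(shlex.join(argv))
```

Passing the list form to subprocess.run(argv) avoids shell quoting problems with prompts that contain spaces.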