Evaluation metrics

FID (Fréchet Inception Distance)

FID (Fréchet Inception Distance) measures the quality of generated images by comparing their statistical distribution to that of real images. It uses a pre-trained Inception v3 network to extract feature vectors from both sets of images, fits a multivariate Gaussian to each set of features, and computes the Fréchet distance between the two Gaussians. Lower FID scores indicate generated images whose feature distribution is closer to that of real images. Operators encounter FID when evaluating generative models like Stable Diffusion or GANs; a typical FID for a good text-to-image model on a standard benchmark (e.g., COCO) is around 10-30, though the exact value depends on sample count and preprocessing.

Deeper dive

FID improves upon earlier metrics like Inception Score (IS) by comparing generated images to real ones; IS looks only at the generated samples (how confident and how varied the classifier's predictions are) and never references real data. The calculation involves: (1) passing images through Inception v3 (trained on ImageNet) and taking activations from the last pooling layer (a 2048-dimensional vector per image); (2) fitting a multivariate Gaussian to the feature vectors of real images and another to generated ones; (3) computing the Fréchet distance between the two Gaussians (for Gaussians this coincides with the Wasserstein-2 distance). The formula is: FID = ||μ_r - μ_g||² + Tr(Σ_r + Σ_g - 2(Σ_r Σ_g)^(1/2)), where μ_r, Σ_r are the mean and covariance of the real features and μ_g, Σ_g those of the generated features. FID is sensitive to mode dropping and intra-class diversity. However, it has limitations: it relies on Inception features that may not capture all perceptual differences, and it requires a large sample size (typically 10k-50k images) for stable estimates. The estimate is biased at small sample sizes, so scores computed from different numbers of images are not directly comparable. For local AI operators, computing FID on a single GPU can take minutes to hours depending on image count and resolution.
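
To make the formula concrete, here is a minimal NumPy sketch of step (3) applied to feature arrays that have already been extracted. It is an illustration under stated assumptions, not the reference implementation: the function name frechet_distance and the synthetic 8-dimensional features are made up for the example, and instead of an explicit matrix square root it uses the fact that Tr((Σ_r Σ_g)^(1/2)) equals the sum of the square roots of the eigenvalues of Σ_r Σ_g.

```python
import numpy as np


def frechet_distance(feats_real: np.ndarray, feats_gen: np.ndarray) -> float:
    """Fréchet distance between Gaussians fitted to two (n_samples, dim) arrays."""
    # Step (2): fit a Gaussian (mean and covariance) to each feature set.
    mu_r, mu_g = feats_real.mean(axis=0), feats_gen.mean(axis=0)
    sigma_r = np.cov(feats_real, rowvar=False)
    sigma_g = np.cov(feats_gen, rowvar=False)

    # ||mu_r - mu_g||^2
    mean_term = float(np.sum((mu_r - mu_g) ** 2))

    # Tr((sigma_r sigma_g)^(1/2)) via eigenvalues of the product. They are real
    # and non-negative in exact arithmetic; clip away floating-point noise.
    eigvals = np.linalg.eigvals(sigma_r @ sigma_g)
    tr_sqrt = float(np.sum(np.sqrt(np.clip(eigvals.real, 0.0, None))))

    return mean_term + float(np.trace(sigma_r) + np.trace(sigma_g)) - 2.0 * tr_sqrt


# Synthetic stand-ins for Inception features (real ones would be 2048-dimensional).
rng = np.random.default_rng(0)
real = rng.normal(size=(5000, 8))
shifted = real + 0.5  # same covariance, mean moved by 0.5 in each of 8 dimensions

same = frechet_distance(real, real)      # identical sets: essentially zero
moved = frechet_distance(real, shifted)  # only the mean term remains: 8 * 0.5**2
```

With identical feature sets the covariance terms cancel and the distance is zero up to rounding; shifting every feature by 0.5 leaves the covariances unchanged, so the result reduces to the squared mean difference, 8 × 0.25 = 2.0.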

Practical example

After fine-tuning Stable Diffusion 2.1 on a custom dataset using an RTX 3090, an operator runs a script that generates 10,000 images from 10,000 prompts. They compute FID against the real dataset (also 10,000 images) using the FrechetInceptionDistance metric from torchmetrics.image.fid. As a rough guide, a score of 15 indicates generated images are reasonably close to real ones, while a score above 50 suggests poor quality or mode collapse. The operator might iterate on training hyperparameters to bring FID below 20, keeping the image count and preprocessing fixed so that scores remain comparable between runs.

Workflow example

In a typical evaluation workflow, an operator uses Hugging Face's diffusers library to generate images, then computes FID with torchmetrics, usually wrapped in a small script. The command might be: python fid_score.py --real_path ./real_images --fake_path ./generated_images --batch_size 32. Such a script loads Inception v3 (the weights are downloaded once and cached), processes images in batches, and outputs the FID score. Operators often compare FID across checkpoints to select the best model for deployment.
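
One way such a script could be put together is sketched below. Everything about its layout is an assumption made for illustration: the file name fid_score.py and its flags are taken from the example command above, the helpers parse_args, batches and compute_fid are hypothetical, and the image loading (PIL resize to 299x299, PNG/JPEG files only) is a simplification that may not match the preprocessing of other FID tools. The torchmetrics calls used are FrechetInceptionDistance(feature=2048), update(images, real=...) and compute(); depending on the installed version, torchmetrics may need its image extras (torch-fidelity) for this metric.

```python
import argparse
from pathlib import Path


def parse_args(argv=None):
    """Parse the flags used in the example command (hypothetical script layout)."""
    parser = argparse.ArgumentParser(description="Compute FID between two image folders")
    parser.add_argument("--real_path", type=Path, required=True)
    parser.add_argument("--fake_path", type=Path, required=True)
    parser.add_argument("--batch_size", type=int, default=32)
    return parser.parse_args(argv)


def batches(items, size):
    """Yield consecutive slices of at most `size` items."""
    for start in range(0, len(items), size):
        yield items[start:start + size]


def compute_fid(real_path, fake_path, batch_size):
    # Heavy imports are deferred so argument parsing works without them installed.
    import numpy as np
    import torch
    from PIL import Image
    from torchmetrics.image.fid import FrechetInceptionDistance

    device = "cuda" if torch.cuda.is_available() else "cpu"
    fid = FrechetInceptionDistance(feature=2048).to(device)

    def load(paths):
        # Build a uint8 tensor of shape (N, 3, 299, 299), the metric's default input format.
        arrays = [np.asarray(Image.open(p).convert("RGB").resize((299, 299))) for p in paths]
        return torch.from_numpy(np.stack(arrays)).permute(0, 3, 1, 2).to(device)

    for folder, is_real in ((real_path, True), (fake_path, False)):
        files = sorted(p for p in folder.iterdir()
                       if p.suffix.lower() in {".png", ".jpg", ".jpeg"})
        for chunk in batches(files, batch_size):
            fid.update(load(chunk), real=is_real)  # accumulate feature statistics
    return fid.compute().item()


# Parsing the example command's arguments; no images are read at this point.
args = parse_args(["--real_path", "./real_images", "--fake_path", "./generated_images"])
```

Calling compute_fid(args.real_path, args.fake_path, args.batch_size) then returns the score as a float. To compare checkpoints, the same real folder, image count and batch of prompts would be reused for each one so that the only thing changing between scores is the model.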

Reviewed by Fredoline Eruo. See our editorial policy.