14. CLIP Models

Chapter 14 of 18 · 15 min

KEY INSIGHT

CLIP learns joint image-text representations by training on image-caption pairs. This enables zero-shot classification and cross-modal retrieval without task-specific fine-tuning.

CLIP encodes images and text into a shared embedding space where related concepts cluster. The query text defines the classification space at inference time, enabling flexible recognition of concepts that were never explicit training labels.

```python
import torch
import clip  # OpenAI's CLIP package; torchvision does not ship a `clip` module


class CLIPEmbedder:
    def __init__(self, model_name: str = "ViT-B/32"):
        self.device = "cuda" if torch.cuda.is_available() else "cpu"
        # Load the pre-trained CLIP model and its matching image preprocessing transform
        self.model, self.preprocess = clip.load(model_name, device=self.device)

    def encode_image(self, image_tensor: torch.Tensor) -> torch.Tensor:
        """Encode a single preprocessed image (shape 1 x 3 x H x W) into an embedding."""
        with torch.no_grad():
            return self.model.encode_image(image_tensor.to(self.device))

    def encode_text(self, text: str) -> torch.Tensor:
        """Encode text into an embedding."""
        with torch.no_grad():
            # Tokens must live on the same device as the model
            text_tokens = clip.tokenize([text]).to(self.device)
            return self.model.encode_text(text_tokens)

    def compute_similarity(
        self, image_emb: torch.Tensor, text_emb: torch.Tensor
    ) -> torch.Tensor:
        """Compute cosine similarity between image and text embeddings."""
        return torch.cosine_similarity(image_emb, text_emb, dim=-1)

    def zero_shot_classify(
        self, image_tensor: torch.Tensor, candidate_labels: list[str]
    ) -> list[dict]:
        """
        Classify an image without task-specific training.
        candidate_labels: ["cat", "dog", "bird", "fish"]
        """
        with torch.no_grad():
            # Encode candidate labels
            text_tokens = clip.tokenize(candidate_labels).to(self.device)
            text_embeddings = self.model.encode_text(text_tokens)

            # Encode image
            image_embedding = self.model.encode_image(image_tensor.to(self.device))

        # Normalize embeddings so the dot product equals cosine similarity
        text_embeddings = text_embeddings / text_embeddings.norm(dim=-1, keepdim=True)
        image_embedding = image_embedding / image_embedding.norm(dim=-1, keepdim=True)

        # Compute scaled similarities, shape (1, num_labels)
        similarity = 100.0 * image_embedding @ text_embeddings.T

        # Convert to probabilities over the candidate labels
        probs = similarity.softmax(dim=-1)[0]

        return [
            {"label": label, "probability": prob.item()}
            for label, prob in zip(candidate_labels, probs)
        ]
```

**Failure Modes:**
- Fine-grained distinctions: CLIP struggles to separate closely related classes (for example, breeds of dogs). Consider specialized models for precision tasks.
- Domain mismatch: CLIP is trained on internet images and may underperform on specialized domains (medical, satellite).
- Text encoding limit: CLIP's context window is 77 tokens including the start and end markers, so keep labels under 75 tokens. By default `clip.tokenize` raises an error on longer prompts; with `truncate=True` it silently cuts them off.
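The factor of 100 in `zero_shot_classify` is easy to overlook. Cosine similarities are bounded in [-1, 1], so a softmax applied to them directly can never be sharply peaked. The sketch below illustrates the effect with plain Python lists standing in for embeddings; the vectors and the helper name `label_probabilities` are made up for illustration and are not real CLIP outputs.

```python
import math


def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def label_probabilities(image_emb, text_embs, scale=100.0):
    """Scale cosine similarities, then turn them into probabilities."""
    sims = [cosine(image_emb, t) for t in text_embs]
    return softmax([scale * s for s in sims])


# Toy 3-dimensional "embeddings" (hypothetical values, not CLIP outputs)
image_emb = [0.9, 0.1, 0.2]
text_embs = {
    "cat": [1.0, 0.0, 0.1],
    "dog": [0.6, 0.6, 0.1],
    "bird": [0.0, 0.2, 1.0],
}
labels = list(text_embs)
vectors = list(text_embs.values())

unscaled = label_probabilities(image_emb, vectors, scale=1.0)
scaled = label_probabilities(image_emb, vectors, scale=100.0)

for label, p_raw, p_scaled in zip(labels, unscaled, scaled):
    print(f"{label}: unscaled={p_raw:.3f} scaled={p_scaled:.3f}")
```

Without scaling, the best label stays below 50% even though its similarity is clearly the highest; with scaling, nearly all of the probability mass moves to it. The ranking of labels is the same in both cases, so the scale affects confidence values, not which label wins.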

Local verification checkpoint

Run the smallest example from this chapter in a local workspace and record the package version, runtime, data path, and observed output. If the result depends on model size, vector count, CPU/GPU backend, or available memory, note that constraint beside the exercise so the lesson remains reproducible.
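One way to capture those details is a small helper that prints the interpreter version, platform, and installed package versions alongside your results. This is a sketch using only the standard library; the function name `describe_environment` is hypothetical, and the distribution names `torch` and `clip` are assumptions about how the packages are installed in your environment.

```python
import json
import platform
import sys
from importlib import metadata


def describe_environment(packages=("torch", "clip")):
    """Collect interpreter, platform, and package versions for a run log."""
    versions = {}
    for name in packages:
        try:
            versions[name] = metadata.version(name)
        except metadata.PackageNotFoundError:
            # Record the absence instead of failing, so the log is still written
            versions[name] = "not installed"
    return {
        "python": sys.version.split()[0],
        "platform": platform.platform(),
        "machine": platform.machine(),
        "packages": versions,
    }


print(json.dumps(describe_environment(), indent=2))
```

Paste the printed JSON next to the observed output of the exercise, and add the backend (CPU or GPU) and data path by hand.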

EXERCISE

Implement image search using CLIP: encode the query text, then retrieve matching images from a corpus by computing text-image similarity scores.
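A possible starting point for the ranking step, assuming the corpus images have already been embedded (for example with `CLIPEmbedder.encode_image`) and converted to plain lists of floats. The function name `search_images`, the file names, and the toy vectors are hypothetical; in a full solution the query vector would come from `encode_text`.

```python
import math


def normalize(vec):
    """Scale a vector to unit length so dot products equal cosine similarity."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec]


def search_images(query_emb, corpus, top_k=3):
    """Return the top_k (image_id, score) pairs, best match first."""
    query = normalize(query_emb)
    scored = []
    for image_id, emb in corpus.items():
        unit = normalize(emb)
        score = sum(q * e for q, e in zip(query, unit))
        scored.append((image_id, score))
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:top_k]


# Toy corpus: image id -> precomputed embedding (hypothetical values)
corpus = {
    "beach.jpg": [0.9, 0.1, 0.0],
    "forest.jpg": [0.1, 0.9, 0.1],
    "city.jpg": [0.2, 0.2, 0.9],
}
query_emb = [0.8, 0.2, 0.1]  # stands in for an encoded text query

results = search_images(query_emb, corpus, top_k=2)
for image_id, score in results:
    print(f"{image_id}: {score:.3f}")
```

For a large corpus you would normally normalize the image embeddings once, store them as a matrix, and score every image with a single matrix-vector product instead of a Python loop.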