RUNLOCALAI · v38

Independently operated catalog for local-AI hardware and software. Hand-written verdicts. Source-cited claims. Reproducible commands when we have them.

OP·Fredoline Eruo
© 2026 runlocalai.co · Independently operated
Multi-Modal AI: Vision and Text

14. CLIP Models

Chapter 14 of 18 · 15 min
KEY INSIGHT

CLIP learns joint image-text representations by training contrastively on image-caption pairs. It encodes images and text into a shared embedding space where related concepts cluster, which enables zero-shot classification and cross-modal retrieval without task-specific fine-tuning. The query text defines the label set at inference time, so the model can recognize categories it was never explicitly trained to classify.

```python
import clip  # OpenAI's CLIP package; torchvision does not ship a clip module
import torch


class CLIPEmbedder:
    def __init__(self, model_name: str = "ViT-B/32"):
        self.device = "cuda" if torch.cuda.is_available() else "cpu"
        # Load the pre-trained CLIP model and its matching image preprocessing
        self.model, self.preprocess = clip.load(model_name, device=self.device)

    def encode_image(self, image_tensor: torch.Tensor) -> torch.Tensor:
        """Encode a preprocessed image tensor (with batch dimension) into an embedding"""
        with torch.no_grad():
            return self.model.encode_image(image_tensor.to(self.device))

    def encode_text(self, text: str) -> torch.Tensor:
        """Encode text into embedding"""
        with torch.no_grad():
            # Tokens must live on the same device as the model
            text_tokens = clip.tokenize([text]).to(self.device)
            return self.model.encode_text(text_tokens)

    def compute_similarity(
        self, image_emb: torch.Tensor, text_emb: torch.Tensor
    ) -> torch.Tensor:
        """Compute cosine similarity between image and text embeddings"""
        return torch.cosine_similarity(image_emb, text_emb, dim=-1)

    def zero_shot_classify(
        self, image_tensor: torch.Tensor, candidate_labels: list[str]
    ) -> list[dict]:
        """
        Classify image without task-specific training.

        image_tensor: preprocessed batch of shape (1, 3, H, W)
        candidate_labels: ["cat", "dog", "bird", "fish"]
        """
        with torch.no_grad():
            # Encode candidate labels
            text_tokens = clip.tokenize(candidate_labels).to(self.device)
            text_embeddings = self.model.encode_text(text_tokens)

            # Encode image
            image_embedding = self.model.encode_image(image_tensor.to(self.device))

            # Normalize embeddings so the dot product equals cosine similarity
            text_embeddings = text_embeddings / text_embeddings.norm(dim=-1, keepdim=True)
            image_embedding = image_embedding / image_embedding.norm(dim=-1, keepdim=True)

            # Compute similarities; the 100.0 factor sharpens the softmax
            similarity = 100.0 * image_embedding @ text_embeddings.T

            # Convert to probabilities over the candidate labels (first image only)
            probs = similarity.softmax(dim=-1)[0]

        return [
            {"label": label, "probability": prob.item()}
            for label, prob in zip(candidate_labels, probs)
        ]
```

**Failure Modes:**

- Fine-grained distinctions: CLIP struggles to separate closely related classes (for example, dog breeds). Consider specialized models for precision tasks.
- Domain mismatch: CLIP is trained on internet images and may underperform on specialized domains (medical, satellite).
- Text encoding limit: CLIP's context window is 77 tokens including the start and end markers, which leaves 75 for the prompt. Longer prompts make `clip.tokenize` raise an error by default, or are cut off when `truncate=True` is passed. Keep labels to 75 tokens or fewer.
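The scoring step of zero-shot classification (normalize, take scaled cosine similarities, apply softmax) can be traced by hand without downloading a model. The following is a minimal standard-library sketch, not the CLIP implementation itself: the 3-dimensional vectors and the helper names `l2_normalize`, `softmax`, and `classify` are hypothetical stand-ins, and real ViT-B/32 embeddings have 512 dimensions.

```python
import math


def l2_normalize(vec: list[float]) -> list[float]:
    """Scale a vector to unit length so dot products equal cosine similarity."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec]


def softmax(logits: list[float]) -> list[float]:
    """Numerically stable softmax: subtract the max before exponentiating."""
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def classify(
    image_emb: list[float],
    label_embs: dict[str, list[float]],
    scale: float = 100.0,
) -> dict[str, float]:
    """Return a probability per label from scaled cosine similarities."""
    image_unit = l2_normalize(image_emb)
    logits = []
    for emb in label_embs.values():
        text_unit = l2_normalize(emb)
        cosine = sum(a * b for a, b in zip(image_unit, text_unit))
        logits.append(scale * cosine)
    return dict(zip(label_embs, softmax(logits)))


# Hypothetical toy embeddings, chosen only to make the arithmetic visible.
toy_image = [0.9, 0.1, 0.2]
toy_labels = {
    "cat": [1.0, 0.0, 0.1],
    "dog": [0.7, 0.6, 0.1],
    "bird": [0.0, 1.0, 0.3],
}

sharp = classify(toy_image, toy_labels)             # scale = 100.0
soft = classify(toy_image, toy_labels, scale=1.0)   # raw cosine as logits

for label in toy_labels:
    print(f"{label}: scale=100 -> {sharp[label]:.4f}, scale=1 -> {soft[label]:.4f}")
```

With the scale at 100, nearly all probability mass lands on the closest label; with a scale of 1, the same cosine similarities give a much flatter distribution. This is why the reported probabilities reflect the chosen scale and the candidate set, not a calibrated confidence.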

Local verification checkpoint

Run the smallest example from this chapter in a local workspace and record the package version, runtime, data path, and observed output. If the result depends on model size, vector count, CPU/GPU backend, or available memory, note that constraint beside the exercise so the lesson remains reproducible.
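One way to capture the details this checkpoint asks for is a small script that writes them out as JSON. This is only a sketch using the Python standard library: the helper names `package_version` and `build_run_record`, the data path, and the distribution names `torch` and `clip` are assumptions, so adjust them to match what is actually installed.

```python
import json
import platform
import sys
from importlib import metadata


def package_version(name: str) -> str:
    """Return the installed version of a distribution, or a marker if absent."""
    try:
        return metadata.version(name)
    except metadata.PackageNotFoundError:
        return "not installed"


def build_run_record(
    packages: list[str],
    data_path: str,
    observed_output: str,
    notes: str = "",
) -> dict:
    """Collect runtime details so a run can be reproduced later."""
    return {
        "python": sys.version.split()[0],
        "platform": platform.platform(),
        "packages": {name: package_version(name) for name in packages},
        "data_path": data_path,
        "observed_output": observed_output,
        "notes": notes,
    }


record = build_run_record(
    packages=["torch", "clip"],      # assumed distribution names
    data_path="data/sample.jpg",     # hypothetical path
    observed_output="<paste the top label and probability here>",
    notes="backend: cpu or cuda; model: ViT-B/32",
)
print(json.dumps(record, indent=2))
```

Saving the printed record next to the exercise output keeps the constraints (model size, backend, memory) attached to the result they produced.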

EXERCISE

Implement image search using CLIP: encode the query text, then retrieve matching images from a corpus by computing text-image similarity scores and ranking the images by score.
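A possible starting point for the retrieval half of this exercise is sketched below. It assumes the image and query embeddings have already been produced by a CLIP encoder; the `ImageIndex` class, the file names, and the 3-dimensional vectors are hypothetical placeholders, and producing real embeddings is left as the exercise.

```python
import heapq
import math


def l2_normalize(vec: list[float]) -> list[float]:
    """Scale a vector to unit length."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec]


class ImageIndex:
    """Stores unit-length image embeddings keyed by image path."""

    def __init__(self) -> None:
        self._items: list[tuple[str, list[float]]] = []

    def add(self, path: str, embedding: list[float]) -> None:
        # Normalize once at insert time so each search is a plain dot product.
        self._items.append((path, l2_normalize(embedding)))

    def search(
        self, query_embedding: list[float], top_k: int = 3
    ) -> list[tuple[str, float]]:
        """Return the top_k (path, cosine similarity) pairs, best first."""
        query = l2_normalize(query_embedding)
        scored = (
            (path, sum(q * x for q, x in zip(query, emb)))
            for path, emb in self._items
        )
        return heapq.nlargest(top_k, scored, key=lambda item: item[1])


index = ImageIndex()
index.add("beach.jpg", [0.9, 0.1, 0.0])    # placeholder image embeddings
index.add("forest.jpg", [0.1, 0.9, 0.1])
index.add("city.jpg", [0.2, 0.2, 0.9])

# Stand-in for the embedding of a text query such as "a sunny beach".
query_embedding = [1.0, 0.2, 0.1]
results = index.search(query_embedding, top_k=2)

for path, score in results:
    print(f"{path}: {score:.3f}")
```

A linear scan like this is adequate for a small corpus. For larger collections, stacking the normalized embeddings into one matrix and scoring with a single matrix product is the usual next step.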
