RUNLOCALAI · v38

Independently operated catalog for local-AI hardware and software. Hand-written verdicts. Source-cited claims. Reproducible commands when we have them.

OP·Fredoline Eruo
© 2026 runlocalai.co · Independently operated
Multi-Modal AI: Vision and Text

13. Image Embeddings

Chapter 13 of 18 · 15 min
KEY INSIGHT

Vision embeddings compress visual information into dense vectors that capture semantic content. Embedding dimensionality and normalization both have a significant effect on retrieval accuracy.

Image embeddings transform pixel data into fixed-length vectors where semantically similar images cluster together. The embedding model determines which aspects of similarity matter for your use case. The example below takes a proxy route: it asks a multimodal model for a short description of each image and embeds that text, so similarity reflects what the description captures rather than the raw pixels.

```python
import base64
from typing import Protocol

import numpy as np


class EmbeddingModel(Protocol):
    """Interface for anything that turns image files into vectors."""

    def embed(self, image_path: str) -> np.ndarray: ...

    def batch_embed(self, image_paths: list[str]) -> list[np.ndarray]: ...


class VertexEmbeddingModel:
    def __init__(self, model_name: str = "imagen-3.0-fast"):
        # Stored for reference only; the calls below name their models explicitly.
        self.model_name = model_name
        # No image-embedding endpoint is called here. A multimodal model
        # describes each image and the description is embedded as a proxy.

    async def embed_images(self, image_paths: list[str]) -> list[list[float]]:
        """
        Generate embeddings via a multimodal API.
        Returns one embedding vector per input image.
        """
        embeddings = []
        # NOTE: AsyncVertexAI is a placeholder client that is not imported in
        # this listing. Its method names are illustrative, not a verified SDK
        # surface. Supply your own client before running this method.
        async with AsyncVertexAI() as client:
            for path in image_paths:
                # Encode the image as base64
                with open(path, "rb") as f:
                    img_b64 = base64.b64encode(f.read()).decode()

                # Ask a vision model for a short description
                desc_response = await client.messages.create(
                    model="gemini-2.0-flash-thinking",
                    messages=[{
                        "role": "user",
                        "content": [
                            {"type": "image", "source": {"type": "base64", "data": img_b64}},
                            {"type": "text", "text": "Describe this image in exactly 10 words."},
                        ],
                    }],
                )
                desc = desc_response.content[0].text

                # Embed the description text as a proxy for the image
                embed_response = await client.models.embed_content(
                    model="text-embedding-005",
                    content=desc,
                )
                embeddings.append(embed_response.embedding)
        return embeddings

    def compute_similarity(self, emb1: list[float], emb2: list[float]) -> float:
        """Cosine similarity between two embeddings."""
        v1 = np.array(emb1)
        v2 = np.array(emb2)
        norm1 = np.linalg.norm(v1)
        norm2 = np.linalg.norm(v2)
        if norm1 == 0 or norm2 == 0:
            # A zero vector has no direction; treat it as dissimilar
            return 0.0
        return float(np.dot(v1, v2) / (norm1 * norm2))

    def batch_similarity_matrix(self, embeddings: list[list[float]]) -> np.ndarray:
        """Compute the pairwise similarity matrix."""
        n = len(embeddings)
        matrix = np.zeros((n, n))
        for i in range(n):
            # Similarity is symmetric, so fill both halves from one computation
            for j in range(i, n):
                sim = self.compute_similarity(embeddings[i], embeddings[j])
                matrix[i, j] = sim
                matrix[j, i] = sim
        return matrix
```

**Common Mistakes:**

- Embedding mismatch: vectors produced by different model versions are not comparable. Pin model versions.
- Ignoring normalization: unnormalized embeddings produce misleading similarity scores.
- Batch size limits: large images or large batches cause timeouts. Resize images and chunk requests.
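The normalization point can be made concrete with a small NumPy sketch. It L2-normalizes a stack of embeddings once, so the full cosine similarity matrix reduces to a single matrix product, and it contrasts that with raw dot products on toy vectors. The helper names `l2_normalize` and `cosine_matrix` and the vectors themselves are illustrative assumptions, not part of any embedding API.

```python
import numpy as np


def l2_normalize(embeddings: np.ndarray, eps: float = 1e-12) -> np.ndarray:
    """Scale each row to unit length; eps guards against all-zero rows."""
    norms = np.linalg.norm(embeddings, axis=1, keepdims=True)
    return embeddings / np.maximum(norms, eps)


def cosine_matrix(embeddings: np.ndarray) -> np.ndarray:
    """Pairwise cosine similarity as one matrix product of unit-length rows."""
    unit = l2_normalize(embeddings)
    return unit @ unit.T


# Toy vectors (made up for illustration):
# b points the same way as a but is 10x longer;
# c points in a different direction and has a larger magnitude than a.
a = np.array([1.0, 0.0])
b = np.array([10.0, 0.0])
c = np.array([3.0, 4.0])
stack = np.stack([a, b, c])

raw = stack @ stack.T       # unnormalized dot products
cos = cosine_matrix(stack)  # cosine similarities

# Raw dot products: a.a = 1 but a.c = 3, so a appears closer to c than to itself.
# Cosine similarity: a vs b is 1.0 (same direction), a vs c is 0.6.
print(raw[0], cos[0])
```

With raw dot products, vector length dominates the score. After normalization only direction matters, which is usually what a similarity threshold is meant to measure.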

EXERCISE

Build an image deduplication system using embeddings. Generate embeddings for a folder of images, compute the pairwise similarity matrix, and cluster duplicate candidates (pairs whose similarity exceeds a chosen threshold).
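One possible starting point for the clustering step, assuming the embeddings are already available as a NumPy array with one row per image: link every pair whose cosine similarity exceeds the threshold, then group linked images with union-find. The function name `duplicate_groups`, the 0.95 threshold, the file names, and the toy vectors are all hypothetical. A sensible threshold depends on the embedding model and should be tuned on your own data.

```python
import numpy as np


def duplicate_groups(
    names: list[str], embeddings: np.ndarray, threshold: float = 0.95
) -> list[list[str]]:
    """Group images whose pairwise cosine similarity exceeds a positive threshold."""
    # Normalize rows so the matrix product yields cosine similarity
    norms = np.linalg.norm(embeddings, axis=1, keepdims=True)
    unit = embeddings / np.maximum(norms, 1e-12)
    sim = unit @ unit.T

    parent = list(range(len(names)))

    def find(i: int) -> int:
        # Follow parent links to the root, shortening the path as we go
        while parent[i] != i:
            parent[i] = parent[parent[i]]
            i = parent[i]
        return i

    # Union every pair above the threshold (upper triangle only,
    # which skips self-comparisons and mirrored pairs)
    rows, cols = np.where(np.triu(sim, k=1) > threshold)
    for i, j in zip(rows, cols):
        parent[find(int(i))] = find(int(j))

    groups: dict[int, list[str]] = {}
    for idx, name in enumerate(names):
        groups.setdefault(find(idx), []).append(name)
    # Only groups with more than one member are duplicate candidates
    return [g for g in groups.values() if len(g) > 1]


names = ["cat_1.jpg", "cat_1_copy.jpg", "dog.jpg", "cat_1_resized.jpg"]
vectors = np.array([
    [1.0, 0.0, 0.0],
    [0.99, 0.05, 0.0],
    [0.0, 1.0, 0.0],
    [0.98, 0.0, 0.1],
])
candidates = duplicate_groups(names, vectors, threshold=0.95)
print(candidates)
```

Union-find groups transitively: if A matches B and B matches C, all three land in one group even when A and C fall below the threshold. That is often acceptable for flagging candidates for review, but a stricter system might require every pair in a group to pass.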
