RUNLOCALAI · v38

Independently operated catalog for local-AI hardware and software. Hand-written verdicts. Source-cited claims. Reproducible commands when we have them.

OP·Fredoline Eruo
© 2026 runlocalai.co · Independently operated
Multi-Modal AI: Vision and Text

12. Multi-Modal RAG

Chapter 12 of 18 · 15 min
KEY INSIGHT

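Multi-Modal RAG retrieves relevant images and text chunks, then uses a vision model to ground generated responses in visual evidence from the corpus.

Text-only RAG fails when a query depends on what an image shows, because the images are never indexed. Multi-Modal RAG indexes both the images and their descriptions, which enables retrieval across modalities. At query time, the system retrieves candidate images and generates a response that cites them as evidence.

The example below indexes each image through a generated text description, so retrieval compares the query embedding against description embeddings rather than against the image pixels.

```python
import base64
from pathlib import Path
from typing import Awaitable, Callable

# AsyncAnthropicVertex is the SDK's async Vertex AI client
# (pip install "anthropic[vertex]"); the SDK has no AsyncVertexAI class.
from anthropic import AsyncAnthropicVertex

# Assumption: the Anthropic SDK has no embeddings endpoint, so the embedding
# call is injected. embed_fn(model, text) must return a single vector.
EmbedFn = Callable[[str, str], Awaitable[list[float]]]


class MultiModalVectorStore:
    def __init__(
        self,
        client: AsyncAnthropicVertex,
        embedding_model: str,
        embed_fn: EmbedFn,
        vision_model: str,
    ):
        self.client = client
        self.embedding_model = embedding_model
        self.embed_fn = embed_fn
        # Must be a vision-capable Claude model ID enabled on your Vertex
        # project; a Gemini model name cannot be served by this client.
        self.vision_model = vision_model
        self.text_chunks = {}
        self.text_embeddings = {}
        self.image_descriptions = {}
        self.image_embeddings = {}

    async def _get_embedding(self, text: str) -> list[float]:
        """Embed text with the injected embedding function."""
        return await self.embed_fn(self.embedding_model, text)

    async def index_document(self, doc_path: Path, images_dir: Path):
        """Index a document's text and the images that accompany it."""
        text_content = doc_path.read_text(encoding="utf-8")

        # Embed and store the text (the whole document as one chunk, for brevity)
        self.text_chunks[doc_path.stem] = text_content
        self.text_embeddings[doc_path.stem] = await self._get_embedding(text_content)

        for img_path in sorted(images_dir.glob("*.png")):
            # Describe the image, then embed the description for retrieval
            description = await self._describe_image(img_path)
            img_key = img_path.stem
            self.image_descriptions[img_key] = description
            self.image_embeddings[img_key] = await self._get_embedding(description)

    async def _describe_image(self, img_path: Path) -> str:
        """Generate a searchable description of an image."""
        img_data = base64.b64encode(img_path.read_bytes()).decode()

        response = await self.client.messages.create(
            model=self.vision_model,
            max_tokens=1024,  # required by the Messages API
            messages=[{
                "role": "user",
                "content": [
                    {"type": "image", "source": {
                        "type": "base64",
                        "media_type": "image/png",  # matches the *.png glob
                        "data": img_data,
                    }},
                    {"type": "text", "text": "Generate a detailed description suitable for retrieval."},
                ],
            }],
        )
        return response.content[0].text

    async def retrieve(self, query: str, top_k: int = 4) -> list[dict]:
        """Retrieve relevant images and text chunks."""
        query_embedding = await self._get_embedding(query)

        # Simple similarity search (use a vector DB in production)
        stores = (
            ("image", self.image_embeddings, self.image_descriptions),
            ("text", self.text_embeddings, self.text_chunks),
        )
        candidates = []
        for kind, embeddings, contents in stores:
            for key, emb in embeddings.items():
                candidates.append({
                    "kind": kind,
                    "key": key,
                    # For text chunks, "description" holds the chunk itself
                    "description": contents[key],
                    "score": self._cosine_similarity(query_embedding, emb),
                })
        return sorted(candidates, key=lambda x: x["score"], reverse=True)[:top_k]

    def _cosine_similarity(self, a: list, b: list) -> float:
        dot = sum(x * y for x, y in zip(a, b))
        norm_a = sum(x * x for x in a) ** 0.5
        norm_b = sum(x * x for x in b) ** 0.5
        if norm_a == 0 or norm_b == 0:
            return 0.0  # avoid dividing by zero on an all-zero vector
        return dot / (norm_a * norm_b)
```

**Failure Modes:**

- Mismatched embeddings when an image's metadata and its generated description carry different information, so a query that matches one misses the other. Index both.
- Missed relevance when the query is about how something looks but retrieval only compares the meaning of descriptions. Consider hybrid retrieval.
- RAG hallucination when the response cites images that are only loosely related to the query. Include citation validation.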
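The generation and citation-validation steps are not shown in the chapter's code, so here is a minimal sketch of one way they could look. Everything in it is an assumption made for illustration: the `[img:<key>]` citation format, the helper names `build_grounded_prompt` and `validate_citations`, and the `min_score` threshold of 0.3. It only relies on retrieval results being dicts with `key`, `description`, and `score` fields.

```python
import re

# Hypothetical citation format: the generator is asked to cite images as [img:<key>].
CITATION_PATTERN = re.compile(r"\[img:([^\]\s]+)\]")


def build_grounded_prompt(query: str, retrieved: list[dict]) -> str:
    """Build a prompt that exposes retrieved image descriptions under citable keys."""
    # Skip any non-image results; entries without a "kind" field count as images.
    images = [r for r in retrieved if r.get("kind", "image") == "image"]
    evidence = "\n".join(f"[img:{r['key']}] {r['description']}" for r in images)
    return (
        "Answer using only the image descriptions below. "
        "Cite each image you rely on as [img:<key>].\n\n"
        f"{evidence}\n\nQuestion: {query}"
    )


def validate_citations(answer: str, retrieved: list[dict], min_score: float = 0.3) -> dict:
    """Check every [img:<key>] citation in the answer against the retrieved set."""
    scores = {r["key"]: r["score"] for r in retrieved}
    cited = CITATION_PATTERN.findall(answer)
    # Citations to keys that were never retrieved are likely hallucinated.
    unknown = [k for k in cited if k not in scores]
    # Citations to low-scoring images are only loosely related to the query.
    weak = [k for k in cited if k in scores and scores[k] < min_score]
    return {
        "cited": cited,
        "unknown": unknown,
        "weak": weak,
        "ok": bool(cited) and not unknown and not weak,
    }
```

An answer that fails the check could be regenerated with a stricter prompt or returned with the weak citations removed. The right threshold depends on the embedding model and should be tuned on real queries.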

EXERCISE

Create a product manual RAG system that users query by visual similarity. The system should retrieve similar diagrams and generate answers grounded in the retrieved images.
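One possible starting point for the ranking step is the hybrid retrieval suggested under Failure Modes: blend a visual similarity score with a description similarity score. The sketch below assumes you already have two embeddings per diagram, one from an image encoder (`image_emb`) and one from a text encoder applied to its description (`desc_emb`), plus matching embeddings for the query. The function name `hybrid_rank`, the field names, and the weight `alpha` are all hypothetical choices, not part of the chapter's code.

```python
import math


def cosine(a: list[float], b: list[float]) -> float:
    """Cosine similarity, returning 0.0 for zero-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def hybrid_rank(
    query_text_emb: list[float],
    query_image_emb: list[float],
    items: list[dict],
    alpha: float = 0.5,
    top_k: int = 4,
) -> list[dict]:
    """Rank items by alpha * visual similarity + (1 - alpha) * description similarity.

    Each item is a dict with "key", "desc_emb", and "image_emb" (assumed fields).
    The query and item vectors must come from the same encoder within each pair.
    """
    ranked = []
    for item in items:
        visual = cosine(query_image_emb, item["image_emb"])
        textual = cosine(query_text_emb, item["desc_emb"])
        ranked.append({
            "key": item["key"],
            "visual": visual,
            "textual": textual,
            "score": alpha * visual + (1 - alpha) * textual,
        })
    return sorted(ranked, key=lambda r: r["score"], reverse=True)[:top_k]
```

Setting `alpha` near 1.0 favors diagrams that look like the query image, while values near 0.0 fall back to description matching. Keeping both component scores in the result makes it easier to see which signal produced a given ranking.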
