12. Multi-Modal RAG

Chapter 12 of 18 · 15 min

KEY INSIGHT

Multi-Modal RAG retrieves relevant images and text chunks, then uses vision models to ground generated responses in visual evidence from the corpus. Standard RAG fails when queries rely on visual similarity, because a text-only index has nothing visual to match against. Multi-Modal RAG indexes both images and their descriptions, enabling retrieval across modalities. At query time, the system retrieves candidate images and generates responses that cite visual evidence.

```python
import base64
from pathlib import Path

from anthropic import AsyncAnthropicVertex


class MultiModalVectorStore:
    def __init__(
        self,
        client: AsyncAnthropicVertex,
        embedding_model: str,
        vision_model: str,
    ):
        self.client = client
        self.embedding_model = embedding_model
        # ID of a vision-capable Claude model available in your Vertex AI project
        self.vision_model = vision_model
        self.text_embeddings = {}
        self.image_descriptions = {}
        self.image_embeddings = {}

    async def _get_embedding(self, text: str) -> list[float]:
        """Embed text with self.embedding_model."""
        # Placeholder: the Anthropic SDK has no embeddings endpoint,
        # so call your embedding provider here.
        raise NotImplementedError("plug in an embedding provider")

    async def index_document(self, doc_path: Path, images_dir: Path):
        """Index a document together with its images."""
        text_content = doc_path.read_text()

        # Embed the document text (chunk long documents in production)
        self.text_embeddings[doc_path.stem] = await self._get_embedding(text_content)

        # Process images
        for img_path in images_dir.glob("*.png"):
            # Generate a description and store it for retrieval
            description = await self._describe_image(img_path)
            img_key = img_path.stem
            self.image_descriptions[img_key] = description

            # Embed the description so text queries can match the image
            self.image_embeddings[img_key] = await self._get_embedding(description)

    async def _describe_image(self, img_path: Path) -> str:
        """Generate a searchable description of an image."""
        img_data = base64.b64encode(img_path.read_bytes()).decode()

        response = await self.client.messages.create(
            model=self.vision_model,
            max_tokens=1024,  # required by the Messages API
            messages=[{
                "role": "user",
                "content": [
                    {
                        "type": "image",
                        "source": {
                            "type": "base64",
                            "media_type": "image/png",  # matches the *.png glob
                            "data": img_data,
                        },
                    },
                    {
                        "type": "text",
                        "text": "Generate a detailed description suitable for retrieval.",
                    },
                ],
            }],
        )
        return response.content[0].text

    async def retrieve(self, query: str, top_k: int = 4) -> list[dict]:
        """Retrieve the images whose descriptions best match the query."""
        query_embedding = await self._get_embedding(query)

        # Simple similarity search (use a vector DB in production)
        candidates = []
        for key, emb in self.image_embeddings.items():
            candidates.append({
                "key": key,
                "description": self.image_descriptions[key],
                "score": self._cosine_similarity(query_embedding, emb),
            })
        return sorted(candidates, key=lambda x: x["score"], reverse=True)[:top_k]

    def _cosine_similarity(self, a: list, b: list) -> float:
        dot = sum(x * y for x, y in zip(a, b))
        norm_a = sum(x * x for x in a) ** 0.5
        norm_b = sum(x * x for x in b) ** 0.5
        if norm_a == 0 or norm_b == 0:
            return 0.0  # avoid division by zero for empty or all-zero vectors
        return dot / (norm_a * norm_b)
```

**Failure Modes:**

- Mismatched embeddings: an image's metadata and its generated description capture different information, since the description explains context the metadata lacks. Index both.
- Missed relevance: visual similarity to the query can differ from semantic similarity, so description-only search may skip images that look right. Consider hybrid retrieval.
- RAG hallucination: the response cites images that are only loosely related to the query. Include citation validation.
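The citation validation mentioned in the last failure mode could look roughly like the sketch below. It rests on assumptions that are not part of the original code: the generation prompt is assumed to ask the model to cite images as `[img:<key>]`, and the names `validate_citations`, `CITATION_PATTERN`, and `min_score` are hypothetical. The function only checks that each cited key was actually retrieved for the query with a high enough score.

```python
import re

# Assumed citation format: the generation prompt asks the model to cite
# images as [img:<key>], where <key> is a key returned by retrieve().
CITATION_PATTERN = re.compile(r"\[img:([^\]]+)\]")


def validate_citations(
    response_text: str, retrieved: list[dict], min_score: float = 0.5
) -> dict:
    """Split cited image keys into supported and unsupported lists.

    A citation counts as supported only if its key was retrieved for this
    query and its similarity score is at least min_score (a hypothetical
    cutoff that needs tuning per corpus).
    """
    scores = {item["key"]: item["score"] for item in retrieved}
    cited = CITATION_PATTERN.findall(response_text)

    supported, unsupported = [], []
    for key in dict.fromkeys(cited):  # de-duplicate, keep first-seen order
        if scores.get(key, float("-inf")) >= min_score:
            supported.append(key)
        else:
            unsupported.append(key)
    return {"supported": supported, "unsupported": unsupported}


# Illustrative data shaped like the output of retrieve()
retrieved = [
    {"key": "wiring_diagram", "description": "Wiring diagram", "score": 0.82},
    {"key": "front_panel", "description": "Front panel photo", "score": 0.31},
]
answer = (
    "Connect the red lead first [img:wiring_diagram]. "
    "See also [img:front_panel] and [img:exploded_view]."
)
result = validate_citations(answer, retrieved)
```

Here `front_panel` is rejected for its low score and `exploded_view` because it was never retrieved. A response with unsupported citations can be regenerated or have those citations stripped before it reaches the user.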

EXERCISE

Create a product manual RAG system where users query by visual similarity. The system should retrieve similar diagrams and generate answers grounded in the retrieved images.
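One possible starting point for the ranking step is a hybrid score that blends visual and description similarity, in the spirit of the hybrid retrieval suggested under Failure Modes. This is only a sketch under stated assumptions: each diagram is assumed to have two precomputed vectors, one from an image encoder and one from its description embedding, and the names `hybrid_rank`, `alpha`, and the index layout are hypothetical.

```python
def cosine_similarity(a: list[float], b: list[float]) -> float:
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = sum(x * x for x in a) ** 0.5
    norm_b = sum(x * x for x in b) ** 0.5
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def hybrid_rank(
    query_visual: list[float],
    query_text: list[float],
    index: dict[str, dict[str, list[float]]],
    alpha: float = 0.5,
    top_k: int = 4,
) -> list[dict]:
    """Rank images by a weighted mix of visual and description similarity.

    index maps an image key to {"visual": [...], "text": [...]}.
    alpha=1.0 ranks purely by visual similarity, alpha=0.0 purely by
    description similarity.
    """
    ranked = []
    for key, vecs in index.items():
        visual = cosine_similarity(query_visual, vecs["visual"])
        semantic = cosine_similarity(query_text, vecs["text"])
        ranked.append({"key": key, "score": alpha * visual + (1 - alpha) * semantic})
    return sorted(ranked, key=lambda r: r["score"], reverse=True)[:top_k]


# Toy 2-D vectors, purely for illustration
index = {
    "diagram_a": {"visual": [1.0, 0.0], "text": [0.0, 1.0]},
    "diagram_b": {"visual": [0.0, 1.0], "text": [1.0, 0.0]},
}
results = hybrid_rank([1.0, 0.0], [1.0, 0.0], index, alpha=0.8)
```

With `alpha=0.8` the visually similar `diagram_a` ranks first; lowering `alpha` shifts the ranking toward the image whose description matches the query text.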