Advanced RAG — Chunking, Retrieval, Re-ranking

18. Multi-Modal RAG

Chapter 18 of 24 · 25 min

KEY INSIGHT

Retrieval pipelines must handle text, images, tables, and code as distinct modalities with modality-appropriate embeddings and generation.

### Beyond Text-Only Retrieval

Production RAG systems face documents that contain figures, diagrams, screenshots, code blocks, and tables. A text-only embedding pipeline discards the visual structure. Multi-modal RAG retrieves across modalities and generates answers that synthesize text with visual evidence.

### Modality-Specific Processing

```python
import pdfplumber
import pypdf


def extractModalContent(pdf_path: str) -> list[dict]:
    """
    Extract text, images, and tables from a PDF with modality labels.
    """
    content = []
    reader = pypdf.PdfReader(pdf_path)
    # pypdf has no table-extraction API, so tables are read with pdfplumber here.
    with pdfplumber.open(pdf_path) as plumber_pdf:
        for page_num, page in enumerate(reader.pages):
            # Extract text (extract_text() can come back empty on scanned pages)
            text = page.extract_text() or ""
            if text.strip():
                content.append({
                    "modality": "text",
                    "content": text,
                    "page": page_num + 1,
                    "source": f"{pdf_path}#page={page_num + 1}"
                })

            # Extract images (pypdf needs Pillow installed for page.images)
            for img_idx, image in enumerate(page.images):
                img_bytes = image.data
                img_desc = describeImage(img_bytes)  # vision model, see next section
                content.append({
                    "modality": "image",
                    "content": img_desc,
                    "raw_bytes": img_bytes,
                    "page": page_num + 1,
                    "source": f"{pdf_path}#page={page_num + 1}&image={img_idx}"
                })

            # Extract tables from the same page via pdfplumber
            tables = plumber_pdf.pages[page_num].extract_tables()
            for tbl_idx, table in enumerate(tables):
                # formatTable is assumed to be a helper that renders the rows
                # as text; it is not defined in this chapter.
                table_text = formatTable(table)
                content.append({
                    "modality": "table",
                    "content": table_text,
                    "raw_table": table,
                    "page": page_num + 1,
                    "source": f"{pdf_path}#page={page_num + 1}&table={tbl_idx}"
                })
    return content
```

### Describing Images for Retrieval

```python
import base64

from openai import OpenAI

client = OpenAI()


def describeImage(image_bytes: bytes, mime_type: str = "image/png") -> str:
    """
    Use a vision model to generate a text description for image retrieval.
    """
    # A data URL needs base64 text, not raw bytes. Images embedded in PDFs are
    # often JPEG, so pass the matching mime_type when it is known.
    b64_image = base64.b64encode(image_bytes).decode("ascii")
    response = client.chat.completions.create(
        model="gpt-4o",
        messages=[
            {"role": "user", "content": [
                {"type": "image_url",
                 "image_url": {"url": f"data:{mime_type};base64,{b64_image}"}},
                {"type": "text", "text": "Describe this image with enough technical detail "
                 "that a semantic search system can retrieve it for a RAG query. "
                 "Include labels, values, and relationships visible in the diagram."}
            ]}
        ],
        temperature=0.0,
        max_tokens=256
    )
    return response.choices[0].message.content.strip()
```

### Multi-Modal Indexing

```python
def indexByModality(content: list[dict], vector_store) -> None:
    """
    Index every item into one vector store, tagged with its modality
    so that queries can filter or group results by modality.
    """
    # embed_texts and vector_store.insert are assumed to be provided by the
    # surrounding pipeline; they are not defined in this chapter.
    for item in content:
        # Images and tables are embedded through their text representations
        embedding = embed_texts([item["content"]])[0]
        vector_store.insert(
            embedding=embedding,
            text=item["content"],
            modality=item["modality"],
            source=item["source"],
            metadata={"page": item["page"]}
        )
```

### Multi-Modal Generation

```python
def multiModalAnswer(query: str, retrieved: list[dict]) -> dict:
    """
    Generate an answer from text chunks plus image and table descriptions.
    """
    text_chunks = [c for c in retrieved if c["modality"] == "text"]
    image_items = [c for c in retrieved if c["modality"] == "image"]
    table_items = [c for c in retrieved if c["modality"] == "table"]

    synthesis_parts = [c["content"] for c in text_chunks]

    # Include image descriptions inline
    for img in image_items:
        synthesis_parts.append(f"[Image: {img['content']}]")

    # Include table representations inline
    for tbl in table_items:
        synthesis_parts.append(f"[Table: {tbl['content']}]")

    synthesis_context = "\n\n".join(synthesis_parts)

    # Reuses the OpenAI client created in the image-description section
    response = client.chat.completions.create(
        model="gpt-4o",
        messages=[
            {"role": "system", "content": "Answer using both the text and image/table descriptions provided."},
            {"role": "user", "content": f"Question: {query}\n\nContext: {synthesis_context}"}
        ],
        temperature=0.0
    )
    return {"answer": response.choices[0].message.content, "sources": retrieved}
```

### Failure Modes

Image descriptions are lossy: the vision model may miss small axis labels, faint grid lines, or color-coded legend items. Always include the image source link so users can verify the original. Row and column boundary detection in table extraction varies with PDF structure, so validate on a sample of your own documents before production. Multi-modal pipelines can roughly triple the processing time of text-only pipelines; parallelize embedding calls across modalities to recover some of that cost.

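The advice to parallelize embedding calls across modalities could be sketched roughly as follows. This is a minimal illustration, not the chapter's implementation: `embed_by_modality` and `fake_embed_texts` are hypothetical names, and `fake_embed_texts` is a deterministic stand-in for whatever `embed_texts` does in your pipeline. The sketch assumes the real embedding call is network-bound, which is the case where threads tend to help.

```python
from collections import defaultdict
from concurrent.futures import ThreadPoolExecutor


def fake_embed_texts(texts: list[str]) -> list[list[float]]:
    """Hypothetical stand-in for embed_texts: one tiny vector per input."""
    return [[float(len(t)), float(sum(map(ord, t)) % 97)] for t in texts]


def embed_by_modality(content: list[dict], embed_fn=fake_embed_texts,
                      max_workers: int = 3) -> list[list[float]]:
    """Embed items with one batched call per modality, run concurrently.

    Returns embeddings in the same order as `content`.
    """
    # Group item positions by modality so each modality is one batch
    groups = defaultdict(list)
    for idx, item in enumerate(content):
        groups[item["modality"]].append(idx)

    def embed_group(indices: list[int]):
        texts = [content[i]["content"] for i in indices]
        return indices, embed_fn(texts)

    embeddings = [None] * len(content)
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        # Each modality's batch runs in its own worker thread
        for indices, vectors in pool.map(embed_group, groups.values()):
            for i, vec in zip(indices, vectors):
                embeddings[i] = vec
    return embeddings


sample = [
    {"modality": "text", "content": "Throughput rose after the cache change."},
    {"modality": "image", "content": "Line chart of latency by release."},
    {"modality": "table", "content": "release | p50 | p99"},
]
vectors = embed_by_modality(sample)
```

Grouping by modality keeps each batch homogeneous, which also leaves room to swap in a different embedding function per modality later. If the embedding step is CPU-bound rather than network-bound, a process pool may be a better fit than threads.
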
EXERCISE

Extract images from a PDF, use a vision model to describe them, index with text chunks, and run a query that retrieves both text and image content. Verify the generated answer references both. (15 min)
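
One way to check the retrieval half of the exercise is to count which modalities came back for the query. The helper below is a small sketch with a hypothetical name, `modality_coverage`; it assumes each retrieved item is a dict with a `"modality"` key, as in the extraction code in this chapter. Whether the generated answer actually uses both kinds of evidence still needs a manual read.

```python
from collections import Counter


def modality_coverage(retrieved: list[dict],
                      required: tuple[str, ...] = ("text", "image")) -> dict:
    """Report how many items of each modality were retrieved and which
    required modalities are missing entirely."""
    counts = Counter(item["modality"] for item in retrieved)
    missing = [m for m in required if counts[m] == 0]
    return {"counts": dict(counts), "missing": missing}


# Example with hand-written retrieval results
retrieved = [
    {"modality": "text", "content": "Section 3 describes the rollout."},
    {"modality": "image", "content": "Diagram of the rollout stages."},
]
report = modality_coverage(retrieved)
if report["missing"]:
    print("Query did not retrieve:", report["missing"])
```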