08. Multi-Modal Enterprise RAG
Enterprise data is not just text. Contracts include tables, diagrams, and signature images. Product manuals contain schematics. Financial reports embed charts. Training materials combine video, audio transcripts, and slides.
Multi-modal retrieval requires converting non-text content into queryable representations. Tables become structured data with row-level metadata. Images are captioned, and any embedded text is extracted via OCR. Videos are transcribed into chunks linked to timestamps.
# Multi-modal document processing pipeline
def process_multimodal_document(doc: Document) -> list[Chunk]:
    chunks = []
    # Process text sections
    for section in doc.text_sections:
        chunks.append(TextChunk(content=section.text, metadata=section.meta))
    # Extract and process tables
    for table in doc.tables:
        table_data = table_to_structured_records(table)
        for row in table_data:
            # Each row is queryable with its table context
            chunks.append(TableChunk(
                content=row.to_text(),
                table_context=table.header,
                row_metadata=row.metadata
            ))
    # Process images with OCR and captioning
    for image in doc.images:
        ocr_text = ocr.extract(image)
        caption = vision_model.describe(image)
        chunks.append(ImageChunk(
            content=f"{ocr_text}\nCaption: {caption}",
            image_ref=image.id
        ))
    return chunks
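The pipeline above handles text, tables, and images but not video. A minimal sketch of timestamp-linked transcript chunking follows, assuming the speech-to-text step yields segments with start and end times. `TranscriptSegment`, `VideoChunk`, `chunk_transcript`, and the 30-second window are hypothetical and only illustrate the idea.

```python
from dataclasses import dataclass


@dataclass
class TranscriptSegment:
    start: float  # seconds from the start of the video
    end: float
    text: str


@dataclass
class VideoChunk:
    content: str
    start_timestamp: float
    end_timestamp: float


def _merge(segments: list[TranscriptSegment]) -> VideoChunk:
    """Join consecutive segments into one chunk that keeps the outer timestamps."""
    return VideoChunk(
        content=" ".join(s.text for s in segments),
        start_timestamp=segments[0].start,
        end_timestamp=segments[-1].end,
    )


def chunk_transcript(
    segments: list[TranscriptSegment], max_seconds: float = 30.0
) -> list[VideoChunk]:
    """Group consecutive segments into windows of roughly max_seconds.

    A chunk exceeds max_seconds only when a single segment is longer than that.
    """
    chunks: list[VideoChunk] = []
    current: list[TranscriptSegment] = []
    for seg in segments:
        # Close the current window if this segment would push it past the limit.
        if current and seg.end - current[0].start > max_seconds:
            chunks.append(_merge(current))
            current = []
        current.append(seg)
    if current:
        chunks.append(_merge(current))
    return chunks
```

Keeping `start_timestamp` and `end_timestamp` on every chunk is what later allows a retrieved passage to be mapped back to a playable clip.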
Query routing determines which modalities to search. A query about "revenue by region" should search tables. A query about "assembly procedure" should search video transcripts. A general query should search everything.
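One simple way to sketch this routing is with keyword hints and a search-everything fallback. The `Modality` enum, the hint lists, and `route_query` are assumptions made for illustration; a production router might instead use a trained classifier or an LLM.

```python
from enum import Enum


class Modality(Enum):
    TEXT = "text"
    TABLE = "table"
    IMAGE = "image"
    VIDEO = "video"


# Hypothetical keyword hints; a real system would tune or learn these.
ROUTING_HINTS: dict[Modality, tuple[str, ...]] = {
    Modality.TABLE: ("revenue", "by region", "total", "breakdown"),
    Modality.VIDEO: ("procedure", "how to", "demonstration", "walkthrough"),
    Modality.IMAGE: ("diagram", "schematic", "chart", "signature"),
}


def route_query(query: str) -> set[Modality]:
    """Return the modalities whose hints appear in the query.

    If no hint matches, fall back to searching every modality.
    """
    q = query.lower()
    matched = {
        modality
        for modality, hints in ROUTING_HINTS.items()
        if any(hint in q for hint in hints)
    }
    return matched or set(Modality)
```

A query such as "chart of revenue by region" matches both the table and image hints, so this sketch searches both. That is the kind of ambiguity a real router has to resolve or tolerate.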
Cross-modal retrieval enables queries like "show me the table mentioned in the Q3 earnings call video." This requires linking chunks across modalities through shared metadata.
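A minimal sketch of such linking, assuming each chunk carries a set of shared keys (for example an event or report identifier) assigned at ingestion time. `LinkedChunk`, `link_keys`, `build_link_index`, and `linked_chunks` are hypothetical names.

```python
from collections import defaultdict
from dataclasses import dataclass, field


@dataclass
class LinkedChunk:
    chunk_id: str
    modality: str
    content: str
    # Hypothetical shared keys, e.g. a report or event identifier.
    link_keys: set[str] = field(default_factory=set)


def build_link_index(chunks: list[LinkedChunk]) -> dict[str, list[LinkedChunk]]:
    """Map each link key to every chunk that carries it."""
    index: dict[str, list[LinkedChunk]] = defaultdict(list)
    for chunk in chunks:
        for key in chunk.link_keys:
            index[key].append(chunk)
    return index


def linked_chunks(
    source: LinkedChunk, index: dict[str, list[LinkedChunk]], modality: str
) -> list[LinkedChunk]:
    """Return chunks of the given modality sharing a link key with source."""
    seen: set[str] = set()
    results: list[LinkedChunk] = []
    for key in sorted(source.link_keys):
        for chunk in index.get(key, []):
            if (
                chunk.modality == modality
                and chunk.chunk_id != source.chunk_id
                and chunk.chunk_id not in seen
            ):
                seen.add(chunk.chunk_id)
                results.append(chunk)
    return results
```

In this sketch the example query becomes a two-step lookup: retrieve the video chunk first, then follow its link keys to the table chunks.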
Multi-modal pipelines fail in ways that text-only pipelines do not. OCR produces garbage from low-resolution scanned tables. Captions describe images incorrectly, so irrelevant chunks surface in retrieval results. Video transcripts drift out of sync with the actual content when the video is edited after transcription.
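One possible mitigation for the OCR and transcription failures is to gate chunks on extraction confidence before indexing. The sketch below assumes the extraction step reports a confidence score between 0 and 1. `ExtractedChunk`, `split_by_confidence`, and the 0.8 threshold are hypothetical. It does nothing for wrong captions or transcript drift, which need separate checks.

```python
from dataclasses import dataclass


@dataclass
class ExtractedChunk:
    chunk_id: str
    content: str
    confidence: float  # 0.0 to 1.0, as reported by the OCR or transcription step


def split_by_confidence(
    chunks: list[ExtractedChunk], threshold: float = 0.8
) -> tuple[list[ExtractedChunk], list[ExtractedChunk]]:
    """Separate chunks into (indexable, needs_review) by confidence.

    The threshold is an assumed value and should be tuned per corpus.
    """
    indexable: list[ExtractedChunk] = []
    needs_review: list[ExtractedChunk] = []
    for chunk in chunks:
        if chunk.confidence >= threshold:
            indexable.append(chunk)
        else:
            needs_review.append(chunk)
    return indexable, needs_review
```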
Index architecture must accommodate modality-specific metadata. Table chunks need column names and data types for filtering. Image chunks need resolution and source document for access control. Video chunks need timestamps for clip extraction.
# Modality-specific metadata schemas
from datetime import datetime
from pydantic import BaseModel

class BaseMetadata(BaseModel):
    # Fields shared by every chunk; the field types here are assumed
    document_id: str
    page: int
    created_at: datetime

class ChunkMetadata(BaseModel):
    base: BaseMetadata  # document_id, page, created_at

class TableChunkMetadata(ChunkMetadata):
    table_id: str
    column_names: list[str]
    row_count: int
    header_text: str

class ImageChunkMetadata(ChunkMetadata):
    image_id: str
    width: int
    height: int
    source_page: int
    ocr_confidence: float

class VideoChunkMetadata(ChunkMetadata):
    video_id: str
    start_timestamp: float
    end_timestamp: float
    transcript_confidence: float
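To illustrate how this metadata might be used at query time, the sketch below filters table chunks by required column names and derives padded clip bounds from video timestamps. `TableMeta` is a simplified dataclass stand-in for the table schema above. The helper names and the 2-second padding are hypothetical.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class TableMeta:
    table_id: str
    column_names: list[str]


def tables_with_columns(
    metas: list[TableMeta], required: list[str]
) -> list[TableMeta]:
    """Keep tables that contain every required column (case-insensitive)."""
    needed = {name.lower() for name in required}
    return [
        meta
        for meta in metas
        if needed <= {name.lower() for name in meta.column_names}
    ]


def clip_bounds(
    start_timestamp: float,
    end_timestamp: float,
    padding: float = 2.0,
    video_duration: Optional[float] = None,
) -> tuple[float, float]:
    """Pad a chunk's time range for clip extraction, clamped to the video."""
    clip_start = max(0.0, start_timestamp - padding)
    clip_end = end_timestamp + padding
    if video_duration is not None:
        clip_end = min(clip_end, video_duration)
    return clip_start, clip_end
```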
Design a query routing system that routes a user query to the appropriate modalities. List 10 example queries and how they should be routed. Identify edge cases where routing is ambiguous.