Multi-Modal Enterprise RAG — Enterprise-Scale RAG (Chapter 8)

Enterprise data is not just text. Contracts include tables, diagrams, and signature images. Product manuals contain schematics. Financial reports embed charts. Training materials combine video, audio transcripts, and slides.

Multi-modal retrieval requires converting non-text content into queryable representations. Tables become structured records with row-level metadata. Images get a generated caption, plus any embedded text extracted via OCR. Videos are transcribed into chunks linked to timestamps.

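# Multi-modal document processing pipeline
# Assumes Document, Chunk, TextChunk, TableChunk, ImageChunk, the ocr and
# vision_model clients, and table_to_structured_records are defined elsewhere.
def process_multimodal_document(doc: Document) -> list[Chunk]:
    chunks: list[Chunk] = []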
    
    # Process text sections
    for section in doc.text_sections:
        chunks.append(TextChunk(content=section.text, metadata=section.meta))
    
    # Extract and process tables
    for table in doc.tables:
        table_data = table_to_structured_records(table)
        for row in table_data:
            # Each row is queryable with its table context
            chunks.append(TableChunk(
                content=row.to_text(),
                table_context=table.header,
                row_metadata=row.metadata
            ))
    
    # Process images with OCR and captioning
    for image in doc.images:
        ocr_text = ocr.extract(image)
        caption = vision_model.describe(image)
        chunks.append(ImageChunk(
            content=f"{ocr_text}\nCaption: {caption}",
            image_ref=image.id
        ))
    
    return chunks
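
The pipeline above covers text, tables, and images but not video. A minimal sketch of timestamp-linked transcript chunking follows, assuming the speech-to-text step returns timed segments. `TranscriptSegment`, `VideoChunk`, `chunk_transcript`, and the 30-second window are hypothetical names and values chosen for illustration, not part of the pipeline above.

```python
from dataclasses import dataclass


@dataclass
class TranscriptSegment:
    """One timed segment from a speech-to-text system (assumed shape)."""
    start: float  # seconds from the start of the video
    end: float
    text: str


@dataclass
class VideoChunk:
    content: str
    video_id: str
    start_timestamp: float
    end_timestamp: float


def _to_chunk(video_id: str, segs: list[TranscriptSegment]) -> VideoChunk:
    # A chunk inherits the start of its first segment and the end of its last.
    return VideoChunk(
        content=" ".join(s.text for s in segs),
        video_id=video_id,
        start_timestamp=segs[0].start,
        end_timestamp=segs[-1].end,
    )


def chunk_transcript(
    video_id: str,
    segments: list[TranscriptSegment],
    max_seconds: float = 30.0,
) -> list[VideoChunk]:
    """Group consecutive segments into chunks of roughly max_seconds.

    A single segment longer than max_seconds becomes its own chunk.
    """
    chunks: list[VideoChunk] = []
    current: list[TranscriptSegment] = []
    for seg in segments:
        # Close the current chunk if adding this segment would exceed the window.
        if current and seg.end - current[0].start > max_seconds:
            chunks.append(_to_chunk(video_id, current))
            current = []
        current.append(seg)
    if current:
        chunks.append(_to_chunk(video_id, current))
    return chunks
```

Chunk boundaries fall on segment boundaries, so each chunk's timestamps can be used directly to seek into the source video.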

Query routing determines which modalities to search. A query about "revenue by region" should search tables. A query about "assembly procedure" should search video transcripts. A general query should search everything.
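
One simple way to route is with keyword hints, falling back to all modalities when nothing matches. The sketch below is illustrative only: the `Modality` enum, the `ROUTING_HINTS` vocabulary, and `route_query` are assumptions, and a production router would more likely use a trained classifier or an LLM call than a fixed word list.

```python
import re
from enum import Enum


class Modality(str, Enum):
    TEXT = "text"
    TABLE = "table"
    IMAGE = "image"
    VIDEO = "video"


# Hypothetical keyword hints per modality; tune these to the corpus.
ROUTING_HINTS: dict[Modality, set[str]] = {
    Modality.TABLE: {"revenue", "cost", "quarterly", "region", "percentage"},
    Modality.VIDEO: {"procedure", "assembly", "demo", "walkthrough", "training"},
    Modality.IMAGE: {"diagram", "schematic", "chart", "signature"},
}


def route_query(query: str) -> set[Modality]:
    """Return the modalities to search for a query.

    Falls back to every modality when no hint matches, so a general
    query searches everything.
    """
    tokens = set(re.findall(r"[a-z0-9]+", query.lower()))
    matched = {m for m, hints in ROUTING_HINTS.items() if tokens & hints}
    return matched or set(Modality)
```

With these hints, `route_query("revenue by region")` selects only the table index, while a query with no hint words fans out to all four.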

Cross-modal retrieval enables queries like "show me the table mentioned in the Q3 earnings call video." This requires linking chunks across modalities through shared metadata.
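
A minimal sketch of that linking, assuming every chunk carries a shared key at ingestion time. `IndexedChunk`, `CrossModalIndex`, and the `link_key` field (for example, a reporting-event ID) are hypothetical names used for illustration; how the key is assigned depends on the ingestion pipeline.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass(frozen=True)
class IndexedChunk:
    chunk_id: str
    modality: str   # "text", "table", "image", or "video"
    content: str
    link_key: str   # assumed shared metadata, e.g. a reporting-event ID


class CrossModalIndex:
    """Groups chunks by link_key so a hit in one modality can reach another."""

    def __init__(self, chunks: list[IndexedChunk]) -> None:
        self._by_key: dict[str, list[IndexedChunk]] = defaultdict(list)
        for chunk in chunks:
            self._by_key[chunk.link_key].append(chunk)

    def linked(self, source: IndexedChunk, target_modality: str) -> list[IndexedChunk]:
        """Return chunks of target_modality that share the source's link_key."""
        return [
            c
            for c in self._by_key.get(source.link_key, [])
            if c.modality == target_modality and c.chunk_id != source.chunk_id
        ]


# Usage: retrieve the video chunk first, then follow the link to its tables.
video = IndexedChunk("v1", "video", "As the table shows, revenue grew.", "fy24-q3-earnings")
table = IndexedChunk("t1", "table", "Region: EMEA, Revenue: 12.4", "fy24-q3-earnings")
other = IndexedChunk("t2", "table", "Region: APAC, Revenue: 9.1", "fy24-q2-earnings")
index = CrossModalIndex([video, table, other])
related_tables = index.linked(video, "table")
```

Here the query is answered in two hops: a normal retrieval finds the earnings-call transcript chunk, and the shared key narrows the table search to the same event.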

The failure modes differ from those of text-only RAG. OCR produces garbage from low-resolution scanned tables. Captions describe images incorrectly, so irrelevant chunks appear in retrieval results. Video transcripts fall out of sync with the actual content when the video is edited after transcription.
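
Two of these failures can be guarded against at ingestion time. The sketch below drops low-confidence OCR text and flags transcripts whose source video has changed. `ImageChunkCandidate`, `build_image_content`, `content_fingerprint`, `transcript_is_stale`, and the 0.6 threshold are assumptions for illustration; incorrect captions are not addressed here and generally need a separate check or human review.

```python
import hashlib
from dataclasses import dataclass
from typing import Optional


@dataclass
class ImageChunkCandidate:
    ocr_text: str
    ocr_confidence: float  # 0.0 to 1.0, as reported by the OCR engine (assumed)
    caption: str


def build_image_content(
    candidate: ImageChunkCandidate,
    min_ocr_confidence: float = 0.6,  # hypothetical threshold; tune per corpus
) -> Optional[str]:
    """Build chunk text, leaving out OCR output below the confidence threshold.

    Returns None when neither usable OCR text nor a caption remains,
    so the caller can skip indexing the image.
    """
    parts: list[str] = []
    ocr_text = candidate.ocr_text.strip()
    if ocr_text and candidate.ocr_confidence >= min_ocr_confidence:
        parts.append(ocr_text)
    caption = candidate.caption.strip()
    if caption:
        parts.append(f"Caption: {caption}")
    return "\n".join(parts) if parts else None


def content_fingerprint(video_bytes: bytes) -> str:
    """Hash the video file so later edits can be detected."""
    return hashlib.sha256(video_bytes).hexdigest()


def transcript_is_stale(fingerprint_at_transcription: str, current_fingerprint: str) -> bool:
    """True when the video changed after its transcript was produced."""
    return fingerprint_at_transcription != current_fingerprint
```

Storing the fingerprint alongside each transcript chunk lets a periodic job find videos that need re-transcription.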

Index architecture must accommodate modality-specific metadata. Table chunks need column names and data types for filtering. Image chunks need resolution, plus the source document for access control. Video chunks need timestamps for clip extraction.

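# Modality-specific metadata schemas
# BaseModel is assumed to be pydantic's; the field types in BaseMetadata are assumptions.
from datetime import datetime

from pydantic import BaseModel


class BaseMetadata(BaseModel):
    document_id: str
    page: int
    created_at: datetime


class ChunkMetadata(BaseModel):
    base: BaseMetadata  # fields shared by every modality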
    
class TableChunkMetadata(ChunkMetadata):
    table_id: str
    column_names: list[str]
    column_types: list[str]  # data type per column, used for filtering
    row_count: int
    header_text: str
    
class ImageChunkMetadata(ChunkMetadata):
    image_id: str
    width: int
    height: int
    source_page: int
    ocr_confidence: float
    
class VideoChunkMetadata(ChunkMetadata):
    video_id: str
    start_timestamp: float
    end_timestamp: float
    transcript_confidence: float
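
As a usage sketch, the helpers below show how such metadata might be consumed at query time: one filters table chunks by column name, the other derives clip bounds from video timestamps. They operate on plain dictionaries that mirror the field names above so the example stays self-contained. `matches_table_filter`, `clip_bounds`, and the 2-second padding are hypothetical.

```python
from typing import Any


def matches_table_filter(metadata: dict[str, Any], required_columns: set[str]) -> bool:
    """True if the table chunk exposes every required column (case-insensitive)."""
    available = {name.lower() for name in metadata.get("column_names", [])}
    return {col.lower() for col in required_columns} <= available


def clip_bounds(metadata: dict[str, Any], padding: float = 2.0) -> tuple[float, float]:
    """Return (start, end) in seconds for clip extraction, padded on both sides.

    The start is clamped at 0.0; the end is not clamped because the
    video duration is not part of the chunk metadata.
    """
    start = max(0.0, metadata["start_timestamp"] - padding)
    end = metadata["end_timestamp"] + padding
    return start, end
```

The padding gives viewers a little context before and after the matched transcript span.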