RUNLOCALAI · v38
Independently operated catalog for local-AI hardware and software. Hand-written verdicts. Source-cited claims. Reproducible commands when we have them.

OP·Fredoline Eruo
© 2026 runlocalai.coIndependently operated
Enterprise-Scale RAG

08. Multi-Modal Enterprise RAG

Chapter 8 of 24 · 15 min
KEY INSIGHT

Multi-modal RAG multiplies system complexity. Each modality introduces its own parser, embedding strategy, and metadata schema. Start with text-only, prove retrieval quality, then expand modalities incrementally.

Enterprise data is not just text. Contracts include tables, diagrams, and signature images. Product manuals contain schematics. Financial reports embed charts. Training materials combine video, audio transcripts, and slides.

Multi-modal retrieval requires converting non-text content into queryable representations. Tables become structured data with row-level metadata. Images get model-generated captions plus any embedded text extracted via OCR. Videos are transcribed into chunks linked to timestamps.

# Multi-modal document processing pipeline (sketch)
# Document, Chunk, TextChunk, TableChunk, ImageChunk, table_to_structured_records,
# ocr, and vision_model are assumed to be defined elsewhere in the application.
def process_multimodal_document(doc: Document) -> list[Chunk]:
    """Convert one document into text, table, and image chunks."""
    chunks: list[Chunk] = []

    # Process text sections: one chunk per section
    for section in doc.text_sections:
        chunks.append(TextChunk(content=section.text, metadata=section.meta))

    # Extract and process tables: one chunk per row
    for table in doc.tables:
        table_data = table_to_structured_records(table)
        for row in table_data:
            # Each row is queryable with its table context
            chunks.append(TableChunk(
                content=row.to_text(),
                table_context=table.header,
                row_metadata=row.metadata,
            ))

    # Process images with OCR and captioning
    for image in doc.images:
        ocr_text = ocr.extract(image)            # text embedded in the image
        caption = vision_model.describe(image)   # model-generated description
        chunks.append(ImageChunk(
            content=f"{ocr_text}\nCaption: {caption}",
            image_ref=image.id,
        ))

    return chunks
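
The pipeline above covers text, tables, and images but not video. The following is a minimal sketch of the video case, assuming the transcription step yields segments with start and end times in seconds. `TranscriptSegment`, `VideoChunk`, and `chunk_transcript` are hypothetical names, and the 30-second window is an arbitrary illustration rather than a recommended value.

```python
from dataclasses import dataclass


@dataclass
class TranscriptSegment:
    start: float  # seconds from the beginning of the video
    end: float
    text: str


@dataclass
class VideoChunk:
    video_id: str
    start_timestamp: float
    end_timestamp: float
    content: str


def _make_chunk(video_id, segs):
    # The chunk spans from the first segment's start to the last segment's end
    return VideoChunk(
        video_id=video_id,
        start_timestamp=segs[0].start,
        end_timestamp=segs[-1].end,
        content=" ".join(s.text for s in segs),
    )


def chunk_transcript(video_id, segments, max_seconds=30.0):
    """Group consecutive segments into chunks spanning at most max_seconds."""
    chunks = []
    current = []
    for seg in segments:
        # Close the current chunk if adding this segment would exceed the window
        if current and seg.end - current[0].start > max_seconds:
            chunks.append(_make_chunk(video_id, current))
            current = []
        current.append(seg)
    if current:
        chunks.append(_make_chunk(video_id, current))
    return chunks
```

Because every chunk keeps its own start and end timestamps, a retrieval hit can be mapped straight back to a playable clip.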

Query routing determines which modalities to search. A query about "revenue by region" should search tables. A query about "assembly procedure" should search video transcripts. A general query should search everything.
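
As a starting point, routing can be sketched with keyword hints. This is a deliberately naive illustration: the `Modality` enum, the `ROUTING_HINTS` keyword lists, and `route_query` are all assumptions made for the example, and a production router would more likely use a trained classifier or an LLM call.

```python
from enum import Enum


class Modality(Enum):
    TEXT = "text"
    TABLE = "table"
    IMAGE = "image"
    VIDEO = "video"


# Hypothetical keyword hints; a real system would tune or learn these.
ROUTING_HINTS = {
    Modality.TABLE: ("revenue", "by region", "total", "how many", "average"),
    Modality.VIDEO: ("procedure", "how to", "demonstration", "walkthrough"),
    Modality.IMAGE: ("diagram", "schematic", "chart", "signature"),
}


def route_query(query: str) -> set[Modality]:
    """Return the modalities to search; fall back to all when nothing matches."""
    q = query.lower()
    matched = {
        modality
        for modality, hints in ROUTING_HINTS.items()
        if any(hint in q for hint in hints)
    }
    # A general query with no modality hints searches everything
    return matched or set(Modality)
```

With these hints, "revenue by region" routes to tables only, "assembly procedure" routes to video only, and a query that matches nothing falls back to every modality.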

Cross-modal retrieval enables queries like "show me the table mentioned in the Q3 earnings call video." This requires linking chunks across modalities through shared metadata.
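
One way to picture the linking step is an index keyed on a shared metadata value. The sketch below assumes every chunk carries a `link_key` (for example an event or report identifier) assigned at ingestion. That field, the `Chunk` class, and both function names are hypothetical.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass(frozen=True)
class Chunk:
    chunk_id: str
    modality: str   # "text", "table", "image", or "video"
    link_key: str   # hypothetical shared metadata, e.g. an event or report ID


def build_link_index(chunks):
    """Map each link_key to the chunks that share it, across modalities."""
    index = defaultdict(list)
    for chunk in chunks:
        index[chunk.link_key].append(chunk)
    return index


def linked_chunks(hit, index, modality):
    """Given a retrieved chunk, return linked chunks of the requested modality."""
    return [
        c for c in index.get(hit.link_key, [])
        if c.modality == modality and c.chunk_id != hit.chunk_id
    ]
```

For the earnings-call example, retrieval would first find the video chunk, then call `linked_chunks(video_hit, index, "table")` to pull the tables that share its key.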

The failure modes differ from those of text-only pipelines. OCR produces garbage from low-resolution scanned tables. Captions describe images incorrectly, so irrelevant chunks appear in retrieval results. Video transcripts fall out of sync with the actual content when the video is edited after transcription.
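
A simple guard against the OCR failure mode is to gate chunks on extraction confidence before indexing. The sketch below assumes the OCR engine reports a confidence score between 0 and 1. The `ImageChunk` shape, `partition_by_confidence`, and the 0.8 threshold are illustrative assumptions, and the check does nothing for wrong captions or drifting transcripts.

```python
from dataclasses import dataclass


@dataclass
class ImageChunk:
    image_ref: str
    content: str
    ocr_confidence: float  # assumed to be in [0.0, 1.0]


def partition_by_confidence(chunks, min_confidence=0.8):
    """Split chunks into (indexable, needs_review) by OCR confidence."""
    indexable, needs_review = [], []
    for chunk in chunks:
        if chunk.ocr_confidence >= min_confidence:
            indexable.append(chunk)
        else:
            # Low-confidence OCR is held back rather than silently indexed
            needs_review.append(chunk)
    return indexable, needs_review
```

Routing low-confidence chunks to a review queue keeps garbage out of the index while preserving a record of what was skipped.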

Index architecture must accommodate modality-specific metadata. Table chunks need column names and data types for filtering. Image chunks need resolution and source document for access control. Video chunks need timestamps for clip extraction.

# Modality-specific metadata schemas
from datetime import datetime

from pydantic import BaseModel


class BaseMetadata(BaseModel):
    # Fields shared by every chunk; the field types here are assumed
    document_id: str
    page: int
    created_at: datetime


class ChunkMetadata(BaseModel):
    base: BaseMetadata


class TableChunkMetadata(ChunkMetadata):
    table_id: str
    column_names: list[str]
    row_count: int
    header_text: str


class ImageChunkMetadata(ChunkMetadata):
    image_id: str
    width: int    # pixels
    height: int   # pixels
    source_page: int
    ocr_confidence: float


class VideoChunkMetadata(ChunkMetadata):
    video_id: str
    start_timestamp: float  # seconds
    end_timestamp: float    # seconds
    transcript_confidence: float
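
These fields are typically used as pre-filters at query time. The following is a minimal sketch that uses plain dataclasses as stand-ins for the schemas above so it runs without dependencies. `TableMeta`, `VideoMeta`, and both filter functions are hypothetical names.

```python
from dataclasses import dataclass


@dataclass
class TableMeta:
    table_id: str
    column_names: list[str]


@dataclass
class VideoMeta:
    video_id: str
    start_timestamp: float
    end_timestamp: float


def tables_with_column(metas, column):
    """Keep table chunks whose schema includes the requested column."""
    wanted = column.lower()
    return [m for m in metas if wanted in (c.lower() for c in m.column_names)]


def clips_overlapping(metas, start, end):
    """Keep video chunks whose time span overlaps [start, end] seconds."""
    return [m for m in metas if m.start_timestamp < end and m.end_timestamp > start]
```

Filtering on metadata before similarity search narrows the candidate set, which matters more as each new modality enlarges the index.
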
EXERCISE

Design a query routing system that routes a user query to the appropriate modalities. List 10 example queries and how they should be routed. Identify edge cases where routing is ambiguous.
