RUNLOCALAI · v38

Independently operated catalog for local-AI hardware and software. Hand-written verdicts. Source-cited claims. Reproducible commands when we have them.

OP·Fredoline Eruo
© 2026 runlocalai.co · Independently operated
Enterprise-Scale RAG

05. Document Ingestion Pipeline

Chapter 5 of 24 · 15 min
KEY INSIGHT

Document ingestion is a reliability engineering problem. Parser correctness, chunking quality, and error handling matter more than throughput for the first 90% of the implementation work. Optimize throughput only after the pipeline reaches a 99.9% success rate.

Document ingestion transforms raw files into searchable chunks. The pipeline includes: receipt, validation, parsing, chunking, enrichment, embedding, and indexing. Each stage can fail, delay, or corrupt data.
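Because any stage can fail, it helps to record which stage a document reached before it stopped. The following is a minimal sketch, assuming each stage is a plain callable that takes the previous stage's output. `STAGE_ORDER`, `PipelineResult`, and `run_pipeline` are hypothetical names introduced for illustration, not part of any library.

```python
from dataclasses import dataclass, field
from typing import Any, Callable, Dict, List, Optional

# Stage names follow the list above; the stage functions themselves are
# hypothetical placeholders supplied by the caller.
STAGE_ORDER = ["receipt", "validation", "parsing", "chunking",
               "enrichment", "embedding", "indexing"]


@dataclass
class PipelineResult:
    ok: bool
    failed_stage: Optional[str] = None
    error: Optional[str] = None
    completed: List[str] = field(default_factory=list)
    output: Any = None


def run_pipeline(doc: Any, stages: Dict[str, Callable[[Any], Any]]) -> PipelineResult:
    """Run stages in order; stop at the first failure and record where it happened."""
    result = PipelineResult(ok=True)
    current = doc
    for name in STAGE_ORDER:
        stage = stages.get(name)
        if stage is None:
            continue  # stage not configured; this sketch simply skips it
        try:
            current = stage(current)
        except Exception as exc:  # broad on purpose: every stage error is recorded
            result.ok = False
            result.failed_stage = name
            result.error = f"{type(exc).__name__}: {exc}"
            return result
        result.completed.append(name)
    result.output = current
    return result
```

Counting results by `failed_stage` gives a per-stage failure rate, which is one way to track progress toward the success-rate target before tuning throughput.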

Receipt and validation check file integrity before any processing starts. Verify that the MIME type matches the file extension. Enforce file size limits (for example, reject PDFs over 100MB). Scan for malware. Validate the file against an allow-list of file types.

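# Validation that prevents common ingestion failures
from pathlib import Path

# MAX_FILE_SIZE, ALLOWED_MIME_TYPES (MIME type -> set of lowercase extensions),
# and ValidationResult are assumed to be defined elsewhere in the pipeline.

def validate_document(file_path: Path, metadata: dict) -> ValidationResult:
    # Size check prevents memory exhaustion
    file_size = file_path.stat().st_size
    if file_size > MAX_FILE_SIZE:
        return ValidationResult.invalid(f"File size {file_size} exceeds {MAX_FILE_SIZE}")

    # A missing or unlisted MIME type is rejected explicitly
    mime_type = metadata.get("mime_type")
    expected_extensions = ALLOWED_MIME_TYPES.get(mime_type)
    if expected_extensions is None:
        return ValidationResult.invalid(f"MIME type {mime_type!r} is not allowed")

    # Extension-MIME mismatch indicates upload errors; compare case-insensitively
    if file_path.suffix.lower() not in expected_extensions:
        return ValidationResult.invalid(
            f"Extension {file_path.suffix} inconsistent with MIME {mime_type}"
        )

    return ValidationResult.valid()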
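The validation function above references `MAX_FILE_SIZE`, `ALLOWED_MIME_TYPES`, and `ValidationResult` without defining them. The sketch below shows one plausible shape for them; the values and the allow-list entries are assumptions for illustration. It also adds a magic-byte check, which is not in the original: a declared MIME type can be wrong, so comparing the first bytes of the file against a known signature catches some renamed or mislabeled uploads.

```python
from dataclasses import dataclass

MAX_FILE_SIZE = 100 * 1024 * 1024  # 100 MB, the PDF limit mentioned above

DOCX_MIME = "application/vnd.openxmlformats-officedocument.wordprocessingml.document"

# Illustrative allow-list: MIME type -> accepted lowercase extensions.
ALLOWED_MIME_TYPES = {
    "application/pdf": {".pdf"},
    DOCX_MIME: {".docx"},
    "text/plain": {".txt"},
}

# Leading bytes for two of the formats above. DOCX files are ZIP archives,
# so they start with the ZIP local-file signature.
MAGIC_BYTES = {
    "application/pdf": b"%PDF-",
    DOCX_MIME: b"PK\x03\x04",
}


@dataclass(frozen=True)
class ValidationResult:
    ok: bool
    reason: str = ""

    @classmethod
    def valid(cls) -> "ValidationResult":
        return cls(ok=True)

    @classmethod
    def invalid(cls, reason: str) -> "ValidationResult":
        return cls(ok=False, reason=reason)


def check_magic_bytes(header: bytes, mime_type: str) -> ValidationResult:
    """Compare a file's first bytes with the signature expected for its MIME type."""
    expected = MAGIC_BYTES.get(mime_type)
    if expected is None:
        return ValidationResult.valid()  # no known signature; nothing to compare
    if not header.startswith(expected):
        return ValidationResult.invalid(f"Content does not look like {mime_type}")
    return ValidationResult.valid()
```

A caller would read the first few bytes of the file (for example with `file_path.open("rb").read(8)`) and pass them as `header`. A matching signature does not prove the file is well-formed; it only rules out the most obvious mismatches.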

Parsing is the most fragile stage. PDFs carry 30+ years of format variations. Some parsers handle PDF/A well but fail on scanned pages, which contain images rather than a text layer. DOCX files store metadata in XML structures that vary by Office version. Table extraction depends on layout analysis, because rows and columns have to be inferred from where text sits on the page.

# Parser selection based on file characteristics
from pathlib import Path

DOCX_MIME = "application/vnd.openxmlformats-officedocument.wordprocessingml.document"

# DocumentParser, the parser classes, and is_image_pdf are assumed to be defined elsewhere.
def select_parser(file_path: Path, mime_type: str) -> DocumentParser:
    if mime_type == "application/pdf":
        # Check if PDF is image-based (scanned) or text-based
        if is_image_pdf(file_path):
            return OCRParser()  # Slow, expensive
        return PDFMinerParser()  # Fast, accurate on PDFs with a text layer

    if mime_type == DOCX_MIME:
        return DOCXParser()

    return FallbackParser()  # Plain text extraction
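The selection logic above depends on `is_image_pdf`, which the chapter does not define. One common heuristic is to attempt plain text extraction first and treat the PDF as scanned when almost no pages yield text. The sketch below assumes the per-page text has already been extracted by whatever PDF library is in use; `looks_scanned` and its thresholds are hypothetical and would need tuning against real documents.

```python
from typing import List


def looks_scanned(page_texts: List[str], min_chars: int = 50,
                  min_text_page_ratio: float = 0.2) -> bool:
    """Return True when too few pages carry an extractable text layer.

    page_texts holds the text extracted from each page, in order.
    A page counts as a "text page" when it has at least min_chars
    non-whitespace-trimmed characters.
    """
    if not page_texts:
        return True  # nothing extractable at all; treat as image-based
    text_pages = sum(1 for t in page_texts if len(t.strip()) >= min_chars)
    return text_pages / len(page_texts) < min_text_page_ratio
```

Using a ratio rather than requiring every page to have text keeps documents with a few blank or image-only pages on the fast path.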

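Chunking strategy determines retrieval quality. Fixed-size chunking (for example, 2,000 characters per chunk) ignores semantic boundaries and can cut a sentence in half. Semantic chunking (splitting at paragraph boundaries) preserves context but creates variable chunk sizes. Parent-child chunking indexes small child chunks for matching while keeping a link to the larger parent section, so the hierarchy of the document survives.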
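The three strategies can be compared side by side. This is a minimal sketch, assuming paragraphs are separated by blank lines and sizes are measured in characters; the function names are hypothetical and the parent-child variant simply treats each paragraph as a parent.

```python
from typing import Dict, List


def fixed_size_chunks(text: str, size: int = 2000) -> List[str]:
    """Cut every `size` characters, regardless of sentence or paragraph breaks."""
    return [text[i:i + size] for i in range(0, len(text), size)]


def paragraph_chunks(text: str, max_chars: int = 2000) -> List[str]:
    """Pack whole paragraphs into chunks of at most `max_chars` characters."""
    chunks: List[str] = []
    current = ""
    for para in (p.strip() for p in text.split("\n\n")):
        if not para:
            continue
        if len(para) > max_chars:
            # A single oversized paragraph falls back to fixed-size cuts.
            if current:
                chunks.append(current)
                current = ""
            chunks.extend(fixed_size_chunks(para, max_chars))
            continue
        candidate = f"{current}\n\n{para}" if current else para
        if len(candidate) <= max_chars:
            current = candidate
        else:
            chunks.append(current)
            current = para
    if current:
        chunks.append(current)
    return chunks


def parent_child_chunks(text: str, child_chars: int = 500) -> List[Dict]:
    """Each paragraph is a parent; its fixed-size pieces are children pointing back to it."""
    paragraphs = [p.strip() for p in text.split("\n\n") if p.strip()]
    records = []
    for parent_id, para in enumerate(paragraphs):
        for child in fixed_size_chunks(para, child_chars):
            records.append({"parent_id": parent_id, "parent": para, "child": child})
    return records
```

In the parent-child variant, the `child` text would be embedded and searched, while the `parent` text is what gets passed to the model once a child matches.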

The enrichment stage adds metadata: document classification, named entity extraction, and summary generation. This metadata powers filtering during retrieval, but it must stay synchronized with the document: when the text changes, the metadata derived from it has to be regenerated.
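One way to keep enrichment synchronized is to store a hash of the text the metadata was derived from, then compare it whenever the document is re-ingested. This is a sketch under that assumption; `enrich`, `enrichment_is_stale`, and the `source_hash` field are hypothetical names, and `classify` stands in for whatever classifier the pipeline uses.

```python
import hashlib
from typing import Callable, Dict


def content_hash(text: str) -> str:
    """Stable fingerprint of the document text."""
    return hashlib.sha256(text.encode("utf-8")).hexdigest()


def enrich(document_text: str, classify: Callable[[str], str]) -> Dict[str, str]:
    """Attach the hash of the text the metadata was derived from."""
    return {
        "classification": classify(document_text),
        "source_hash": content_hash(document_text),
    }


def enrichment_is_stale(document_text: str, enrichment: Dict[str, str]) -> bool:
    """True when the metadata was computed from a different version of the text."""
    return enrichment.get("source_hash") != content_hash(document_text)
```

Metadata without a `source_hash` is reported as stale, which errs on the side of re-running enrichment.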

The hardest failure mode is silent data corruption. A parser that produces garbled text from a malformed PDF passes validation because the file exists and has the correct MIME type. You discover the issue months later when users report that queries return gibberish.
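A cheap guard against this failure mode is to check the parser's output, not just its input. The sketch below flags extracted text that is nearly empty, full of Unicode replacement characters, or dominated by symbols. The function name and all thresholds are assumptions for illustration and would need calibrating on a sample of known-good documents.

```python
from typing import List


def text_quality_issues(text: str, min_chars: int = 20,
                        max_replacement_ratio: float = 0.01,
                        min_wordlike_ratio: float = 0.6) -> List[str]:
    """Return a list of reasons the extracted text looks corrupted (empty if none)."""
    stripped = text.strip()
    if len(stripped) < min_chars:
        return ["too little text extracted"]

    issues = []
    # U+FFFD marks bytes the decoder could not interpret.
    if stripped.count("\ufffd") / len(stripped) > max_replacement_ratio:
        issues.append("many replacement characters")

    # Share of characters that are letters, digits, or whitespace.
    wordlike = sum(1 for ch in stripped if ch.isalnum() or ch.isspace())
    if wordlike / len(stripped) < min_wordlike_ratio:
        issues.append("low share of letters and digits")
    return issues
```

Documents that return a non-empty list can be routed to a review queue or retried with a different parser instead of being indexed silently.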

EXERCISE

Write a chunking strategy that handles documents containing both narrative text and data tables. Each chunk should preserve table context (which paragraph discusses the table) while staying under 4,000 characters.
