05. Document Ingestion Pipeline
Document ingestion transforms raw files into searchable chunks. The pipeline has seven stages: receipt, validation, parsing, chunking, enrichment, embedding, and indexing. Any stage can fail outright, stall, or silently corrupt data.
Receipt and validation check file integrity before any processing begins. Verify that the MIME type matches the file extension. Enforce file size limits (for example, reject PDFs over 100 MB). Scan for malware. Check the file type against an allow-list.
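# Validation that prevents common ingestion failures
# (ValidationResult, MAX_FILE_SIZE, and ALLOWED_MIME_TYPES are assumed to be defined elsewhere)
from pathlib import Path

def validate_document(file_path: Path, metadata: dict) -> ValidationResult:
    # Size check prevents memory exhaustion
    file_size = file_path.stat().st_size
    if file_size > MAX_FILE_SIZE:
        return ValidationResult.invalid(f"File size {file_size} exceeds {MAX_FILE_SIZE}")
    # Reject MIME types that are missing or not on the allow-list
    mime_type = metadata.get("mime_type")
    if mime_type not in ALLOWED_MIME_TYPES:
        return ValidationResult.invalid(f"MIME type {mime_type!r} is not allowed")
    # Extension-MIME mismatch indicates upload errors. Compare case-insensitively;
    # this assumes ALLOWED_MIME_TYPES stores lowercase suffixes with the dot, e.g. {".pdf"}
    suffix = file_path.suffix.lower()
    if suffix not in ALLOWED_MIME_TYPES[mime_type]:
        return ValidationResult.invalid(
            f"Extension {suffix} inconsistent with MIME {mime_type}"
        )
    return ValidationResult.valid()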
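The check above trusts the MIME type declared in the upload metadata, which a client can get wrong. A minimal sketch of a complementary check, assuming it is acceptable to read the first few bytes of the file, compares them against known signatures. The names `sniff_mime` and `MAGIC_SIGNATURES` are hypothetical and the signature table is an illustrative subset, not an exhaustive list.

```python
from pathlib import Path
from typing import Optional

# Leading-byte signatures (illustrative subset).
# PDF files normally start with "%PDF-"; DOCX files are ZIP archives,
# and ZIP archives start with "PK\x03\x04".
MAGIC_SIGNATURES = {
    b"%PDF-": "application/pdf",
    b"PK\x03\x04": "application/zip",  # DOCX, XLSX, and PPTX are all ZIP containers
}

def sniff_mime(file_path: Path) -> Optional[str]:
    """Return a coarse MIME type guessed from the file's first bytes, or None."""
    with file_path.open("rb") as f:
        head = f.read(8)
    for signature, mime in MAGIC_SIGNATURES.items():
        if head.startswith(signature):
            return mime
    return None
```

A result of `None`, or a sniffed type that disagrees with the declared one, is a reasonable trigger for rejecting the file or routing it to manual review. Note that a DOCX file sniffs only as a generic ZIP here; telling it apart from other ZIP-based formats would require inspecting the archive contents.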
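Parsing is the most fragile stage. PDF has accumulated more than 30 years of format variations. Some parsers handle PDF/A well but fail on scanned images, which often have no text layer to extract. DOCX files store metadata in XML structures that vary by Office version. Table extraction requires reconstructing the layout, because a PDF typically records where glyphs are drawn, not which table cells they belong to.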
# Parser selection based on file characteristics
# (is_image_pdf and the parser classes are assumed to be defined elsewhere)
from pathlib import Path

def select_parser(file_path: Path, mime_type: str) -> DocumentParser:
    if mime_type == "application/pdf":
        # Check if PDF is image-based (scanned) or text-based
        if is_image_pdf(file_path):
            return OCRParser()  # Slow, expensive
        return PDFMinerParser()  # Fast, accurate
    if mime_type == "application/vnd.openxmlformats-officedocument.wordprocessingml.document":
        return DOCXParser()
    return FallbackParser()  # Plain text extraction
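The `is_image_pdf` helper is not shown in the text. One possible heuristic behind it, assuming a text extractor has already returned the text of each page, is to treat the document as image-based when most pages yield almost no characters. The function name `looks_image_based` and both thresholds are assumptions for illustration and would need tuning on real documents.

```python
from typing import Sequence

def looks_image_based(
    page_texts: Sequence[str],
    min_chars_per_page: int = 20,  # assumed threshold
    max_empty_ratio: float = 0.8,  # assumed threshold
) -> bool:
    """Guess whether a PDF is scanned, given the text extracted from each page."""
    if not page_texts:
        return True  # nothing extractable at all: fall back to OCR
    empty_pages = sum(
        1 for text in page_texts if len(text.strip()) < min_chars_per_page
    )
    return empty_pages / len(page_texts) >= max_empty_ratio
```

Working from a ratio rather than a single page keeps a text PDF with a few blank or image-only pages on the fast path.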
Chunking strategy determines retrieval quality. Fixed-size chunking (for example, every 2,000 characters) ignores semantic boundaries and can cut a sentence in half. Semantic chunking (splitting at paragraph boundaries) preserves context but produces chunks of variable size. Parent-child chunking keeps hierarchical relationships by linking each small chunk to the larger section it came from.
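A minimal sketch of paragraph-boundary chunking with a size cap, assuming paragraphs are separated by blank lines. The function name `chunk_by_paragraph` is hypothetical. Whole paragraphs are packed together until the next one would exceed the limit, and a single oversized paragraph falls back to a fixed-size split.

```python
from typing import List

def chunk_by_paragraph(text: str, max_chars: int = 2000) -> List[str]:
    """Pack whole paragraphs into chunks of at most max_chars characters."""
    chunks: List[str] = []
    current = ""
    for paragraph in (p.strip() for p in text.split("\n\n")):
        if not paragraph:
            continue
        # Oversized paragraph: flush what we have, then hard-split it.
        if len(paragraph) > max_chars:
            if current:
                chunks.append(current)
                current = ""
            chunks.extend(
                paragraph[i:i + max_chars]
                for i in range(0, len(paragraph), max_chars)
            )
            continue
        candidate = f"{current}\n\n{paragraph}" if current else paragraph
        if len(candidate) <= max_chars:
            current = candidate
        else:
            chunks.append(current)
            current = paragraph
    if current:
        chunks.append(current)
    return chunks
```

The hard-split fallback is the weakest part of this sketch: it reintroduces the boundary problem of fixed-size chunking for very long paragraphs, so splitting at sentence boundaries may be a better choice there.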
The enrichment stage adds metadata: document classification, named entity extraction, summary generation. This metadata powers filtering during retrieval but must stay synchronized with document updates.
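One way to detect metadata that has drifted out of sync, sketched here under the assumption that enrichment results are stored alongside a hash of the text they were derived from. The `Enrichment` record and the `content_hash` and `is_stale` helpers are hypothetical names, not part of any particular library.

```python
import hashlib
from dataclasses import dataclass

def content_hash(text: str) -> str:
    """Stable fingerprint of the document text."""
    return hashlib.sha256(text.encode("utf-8")).hexdigest()

@dataclass
class Enrichment:
    summary: str
    entities: list
    source_hash: str  # hash of the text this metadata was derived from

def is_stale(enrichment: Enrichment, current_text: str) -> bool:
    """True when the document has changed since the metadata was generated."""
    return enrichment.source_hash != content_hash(current_text)
```

When `is_stale` returns `True`, the document can be queued for re-enrichment before its metadata is used for filtering again.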
The hardest failure mode is silent data corruption. A parser that produces garbled text from a malformed PDF passes validation because the file exists and has the correct MIME type. You discover the issue months later when users report that queries return gibberish.
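A cheap check on the parser output can surface some of this corruption at ingestion time. The sketch below, with hypothetical names `garble_score` and `looks_garbled` and an assumed threshold, measures the share of characters that look like extraction damage. It catches only one class of problem (replacement characters and stray control characters); text decoded with the wrong encoding can be fully printable and would pass.

```python
def garble_score(text: str) -> float:
    """Fraction of characters that look like extraction damage (0.0 = clean)."""
    if not text:
        return 1.0  # empty output from a non-empty file is itself suspicious
    suspicious = sum(
        1
        for ch in text
        if ch == "\ufffd"  # Unicode replacement character
        or (not ch.isprintable() and not ch.isspace())  # stray control characters
    )
    return suspicious / len(text)

def looks_garbled(text: str, threshold: float = 0.05) -> bool:
    """Flag parser output whose damage ratio meets an assumed threshold."""
    return garble_score(text) >= threshold
```

Flagged documents can be routed to a quarantine queue or retried with a different parser instead of being embedded and indexed.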
Write a chunking strategy that handles documents with both narrative text and data tables. The chunks should preserve table context (what paragraph discusses this table) while staying under 4,000 characters.
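One possible starting point for this exercise, not a reference solution. It assumes the parser has already split the document into labelled blocks, with each table rendered as one row per line and the header row first. The `Block` type and `chunk_with_table_context` function are hypothetical. Each table chunk is prefixed with the most recent paragraph (trimmed to half the budget) and the header row (trimmed to a quarter), and long tables are split across chunks that repeat that prefix.

```python
from dataclasses import dataclass
from typing import List

@dataclass
class Block:
    kind: str  # "text" or "table"; assumed to be labelled by the parser
    text: str  # for tables: one row per line, header row first

def chunk_with_table_context(blocks: List[Block], max_chars: int = 4000) -> List[str]:
    """Chunk mixed content so every table chunk carries its preceding paragraph."""
    chunks: List[str] = []
    last_paragraph = ""
    for block in blocks:
        if not block.text.strip():
            continue
        if block.kind == "text":
            last_paragraph = block.text
            # Narrative text: hard-split anything over the limit.
            chunks.extend(
                block.text[i:i + max_chars]
                for i in range(0, len(block.text), max_chars)
            )
            continue
        header, *rows = block.text.strip().splitlines()
        # Trim context and header so the prefix always leaves room for rows.
        context = last_paragraph[: max_chars // 2]
        header = header[: max_chars // 4]
        prefix = f"{context}\n{header}" if context else header
        room = max_chars - len(prefix) - 1  # space left for a single row
        current = prefix
        for row in rows:
            row = row[:room]  # truncate pathological rows rather than overflow
            if len(current) + 1 + len(row) > max_chars and current != prefix:
                chunks.append(current)
                current = prefix
            current = f"{current}\n{row}"
        chunks.append(current)
    return chunks
```

This sketch keeps every chunk at or below `max_chars`, at the cost of truncating very long context paragraphs, headers, and rows. It also treats only the paragraph before a table as its context; a paragraph that follows the table and discusses it would need a look-ahead step.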