What this does

Extracting structured data from PDFs, especially those that mix text, tables, and figures, works best when layout-aware parsing (PyMuPDF) is combined with LLM reasoning. This guide covers page-level extraction, table detection, schema validation, and confidence scoring for reliable automated pipelines.

Steps

1. Install and configure dependencies

pip install pymupdf pydantic openai python-dotenv

2. Parse pages with layout awareness

import fitz  # PyMuPDF

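def extract_pages(path: str) -> list[dict]:
    """Return text blocks and table rows for every page of the PDF."""
    pages = []
    # The context manager closes the document once parsing is done.
    with fitz.open(path) as doc:
        for page_num, page in enumerate(doc):
            # Each block is a tuple: (x0, y0, x1, y1, text, block_no, block_type).
            blocks = page.get_text("blocks")
            # find_tables() returns a TableFinder; detected tables live in .tables.
            finder = page.find_tables()
            pages.append({
                "page_num": page_num,
                "text_blocks": blocks,
                # extract() belongs to each Table and returns rows of cell strings.
                "tables": [table.extract() for table in finder.tables],
            })
    return pages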

3. Build the extraction prompt

Send page content to an LLM with a structured output schema. Include table data in markdown format so the model handles column mapping correctly.

SYSTEM_PROMPT = """Extract data according to this schema:
{"invoice_id": str, "date": str, "total": float, "line_items": [{"description": str, "quantity": int, "unit_price": float}]}
Return JSON only. If a field is missing, use null."""
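
The table-to-markdown step described above could look roughly like the sketch below. It assumes the page dictionaries have the shape produced in step 2 (block tuples with the text at index 4 and the block type at index 6, and tables as rows of cell strings). The helper names table_to_markdown and build_user_message are hypothetical and not part of PyMuPDF.

```python
from typing import Optional


def table_to_markdown(rows: list[list[Optional[str]]]) -> str:
    """Render table rows as a markdown table; the first row is treated as the header."""
    if not rows:
        return ""

    def clean(cell: Optional[str]) -> str:
        # Empty cells arrive as None; pipes and newlines would break the table syntax.
        if cell is None:
            return ""
        return str(cell).replace("|", "\\|").replace("\n", " ").strip()

    # Pad short rows so every row has the same number of columns.
    width = max(len(row) for row in rows)
    norm = [[clean(c) for c in row] + [""] * (width - len(row)) for row in rows]

    lines = ["| " + " | ".join(norm[0]) + " |"]
    lines.append("| " + " | ".join(["---"] * width) + " |")
    lines += ["| " + " | ".join(row) + " |" for row in norm[1:]]
    return "\n".join(lines)


def build_user_message(page: dict) -> str:
    """Combine a page's text blocks and markdown tables into one prompt string."""
    # Keep only text blocks (block type 0); image blocks have type 1.
    text = "\n".join(b[4].strip() for b in page["text_blocks"] if b[6] == 0)
    tables = "\n\n".join(table_to_markdown(rows) for rows in page["tables"])
    parts = [f"Page {page['page_num'] + 1} text:\n{text}"]
    if tables:
        parts.append(f"Tables:\n{tables}")
    return "\n\n".join(parts)
```

The resulting string would be sent as the user message alongside SYSTEM_PROMPT. Treating the first row as the header is an assumption that will not hold for every table.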

4. Validate with pydantic

Parse the LLM output through a pydantic model to catch schema mismatches. Re-prompt on failure with a correction hint.

from pydantic import BaseModel, ValidationError

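class LineItem(BaseModel):
    description: str | None
    quantity: int | None
    unit_price: float | None

class Invoice(BaseModel):
    # Fields are required but nullable, matching the "use null" rule in the prompt.
    invoice_id: str | None
    date: str | None
    total: float | None
    # Validating each item catches mismatches that a plain list[dict] would accept.
    line_items: list[LineItem]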

def validate_output(raw: str) -> Invoice | None:
    try:
        return Invoice.model_validate_json(raw)
    except ValidationError:
        return None
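
The re-prompt step mentioned above is not shown in the snippet, so here is one possible sketch of it. Both call_llm (a function that takes a prompt and returns the raw model reply) and extract_with_retry are hypothetical names, and the wording of the correction hint is only an illustration. The loop catches ValueError, which covers json.JSONDecodeError and, in pydantic v2, ValidationError as well.

```python
from typing import Callable, Optional, TypeVar

T = TypeVar("T")


def extract_with_retry(
    call_llm: Callable[[str], str],  # hypothetical: prompt in, raw reply out
    parse: Callable[[str], T],       # raises ValueError when the reply is invalid
    page_text: str,
    max_attempts: int = 3,
) -> Optional[T]:
    """Call the model, and re-prompt with the validation error until parsing succeeds."""
    prompt = page_text
    for _ in range(max_attempts):
        raw = call_llm(prompt)
        try:
            return parse(raw)
        except ValueError as exc:
            # Feed the error back so the model can correct its previous reply.
            prompt = (
                f"{page_text}\n\n"
                "Your previous reply failed validation:\n"
                f"{exc}\n"
                "Return corrected JSON only."
            )
    # Every attempt failed; the caller decides how to handle the record.
    return None
```

With the model from this step, the parser argument would be Invoice.model_validate_json. A None result maps naturally to the 0.0 confidence case in the next step.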

5. Compute confidence scores

Score each extraction: 1.0 if all required fields are present, 0.5 if only some are, and 0.0 if validation failed entirely. Flag low-confidence records for manual review.
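
A minimal sketch of this scoring rule follows. It assumes the record is the validated output as a plain dictionary (for example from model_dump()), or None when validation failed. The list of required fields and the review threshold are assumptions taken from the prompt schema, not fixed by any library.

```python
from typing import Optional

# Assumption: all four schema fields count as required.
REQUIRED_FIELDS = ("invoice_id", "date", "total", "line_items")


def confidence(record: Optional[dict]) -> float:
    """Return 1.0 for a complete record, 0.5 for a partial one, 0.0 if validation failed."""
    if record is None:
        return 0.0
    # A field counts as populated unless it is null, an empty string, or an empty list.
    populated = sum(
        1 for field in REQUIRED_FIELDS if record.get(field) not in (None, "", [])
    )
    return 1.0 if populated == len(REQUIRED_FIELDS) else 0.5


def needs_review(score: float, threshold: float = 1.0) -> bool:
    """Flag anything below the threshold for manual review."""
    return score < threshold
```

A total of 0.0 still counts as populated here, since a zero-value invoice is a legitimate value rather than a missing one.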

Record the local run evidence. Save the exact command, runtime or package version, model name if applicable, and observed output so the result can be reproduced later.
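
One way to capture that evidence is a small JSON run record, sketched below using only the standard library. The function names, the default package list, and the run_record.json filename are assumptions for illustration.

```python
import json
import platform
import sys
from datetime import datetime, timezone
from importlib import metadata
from typing import Optional


def package_version(name: str) -> Optional[str]:
    """Return the installed version of a package, or None if it is not installed."""
    try:
        return metadata.version(name)
    except metadata.PackageNotFoundError:
        return None


def build_run_record(
    model_name: str,
    observed_output: str,
    packages: tuple = ("pymupdf", "pydantic", "openai"),
) -> dict:
    """Collect the command, versions, model name, and output for one run."""
    return {
        "timestamp": datetime.now(timezone.utc).isoformat(),
        "command": " ".join(sys.argv),
        "python": platform.python_version(),
        "packages": {name: package_version(name) for name in packages},
        "model": model_name,
        "observed_output": observed_output,
    }


def save_run_record(record: dict, path: str = "run_record.json") -> None:
    """Write the record as indented JSON so it is easy to diff between runs."""
    with open(path, "w", encoding="utf-8") as fh:
        json.dump(record, fh, indent=2)
```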

Verification

python -m extraction.extract --file sample_invoice.pdf

Expected output:

Page 1: 2 text blocks, 1 table extracted
LLM extraction: valid JSON, 4/4 fields populated
Confidence: 1.0 — auto-confirm

Run validation tests:

pytest tests/test_extraction.py -v --tb=short

Expected: test_valid_invoice PASSED, test_missing_fields PASSED (confidence: 0.5)

Common failures

Table detection errors: PyMuPDF's find_tables() struggles with rotated or nested tables. Use tabula-py or a dedicated table detector as a fallback.
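LLM hallucination: Always validate through pydantic, but note that schema validation only catches type and structure errors. The model may still invent plausible-but-wrong values for dates or totals, so cross-check those fields against the source text or flag them for manual review.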
Large PDFs: Process pages in batches of 10. Memory usage scales linearly with page count; stream pages for files over 200 pages.
Multi-column layout: Use page.get_text("dict") to retrieve block-level coordinates and reorder blocks by position before sending to the LLM.
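
Two of the mitigations above, reordering blocks by position and processing pages in batches, can be sketched as follows. The reordering assumes block dictionaries with a "bbox" key, as found in the "blocks" list of page.get_text("dict"), and a page split into equal-width columns with no full-width headers. That is a simplification that real layouts often violate. The names reading_order and batched are hypothetical.

```python
from typing import Iterator, Sequence


def reading_order(blocks: list[dict], page_width: float, n_columns: int = 2) -> list[dict]:
    """Sort blocks column by column, then top to bottom within each column."""
    col_width = page_width / n_columns

    def key(block: dict) -> tuple:
        x0, y0 = block["bbox"][0], block["bbox"][1]
        # Assign the block to a column by its left edge, clamped to a valid index.
        column = max(0, min(int(x0 // col_width), n_columns - 1))
        return (column, y0, x0)

    return sorted(blocks, key=key)


def batched(items: Sequence, size: int = 10) -> Iterator[Sequence]:
    """Yield consecutive slices of at most `size` items."""
    for start in range(0, len(items), size):
        yield items[start:start + size]
```

For batching, the sequence could be range(doc.page_count), so that only one batch of pages is parsed and sent to the LLM at a time.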

Related guides

How to build an AI content generation pipeline — The structured output and validation patterns used here integrate directly into content pipelines.
How to build a local AI product with Nigerian naira pricing — Extracted data feeds into automated report generation that can be packaged as a monetized product.