How to implement AI-powered data extraction from PDFs
PyMuPDF, LLM endpoint, structured output format
What this does
Extracting structured data from PDFs—especially mixed content with text, tables, and figures—requires combining layout-aware parsing (PyMuPDF) with LLM reasoning. This guide covers page-level extraction, table detection, schema validation, and confidence scoring for reliable automated pipelines.
Steps
1. Install and configure dependencies
pip install pymupdf pydantic openai python-dotenv
2. Parse pages with layout awareness
import fitz # PyMuPDF
def extract_pages(path: str):
doc = fitz.open(path)
pages = []
for page_num, page in enumerate(doc):
blocks = page.get_text("blocks")
tables = page.find_tables()
pages.append({
"page_num": page_num,
"text_blocks": blocks,
"tables": tables.extract(),
})
return pages
3. Build the extraction prompt Send page content to an LLM with a structured output schema. Include table data in markdown format so the model handles column mapping correctly.
SYSTEM_PROMPT = """Extract data according to this schema:
{"invoice_id": str, "date": str, "total": float, "line_items": [{"description": str, "quantity": int, "unit_price": float}]}
Return JSON only. If a field is missing, use null."""
4. Validate with pydantic Parse the LLM output through a pydantic model to catch schema mismatches. Re-prompt on failure with a correction hint.
from pydantic import BaseModel, ValidationError
class Invoice(BaseModel):
invoice_id: str | None
date: str | None
total: float | None
line_items: list[dict]
def validate_output(raw: str) -> Invoice | None:
try:
return Invoice.model_validate_json(raw)
except ValidationError:
return None
5. Compute confidence scores Score each extraction: 1.0 if all required fields present, 0.5 if partial, 0.0 if validation failed entirely. Flag low-confidence records for manual review.
- Record the local run evidence. Save the exact command, runtime or package version, model name if applicable, and observed output so the result can be reproduced later.
Verification
python -m extraction.extract --file sample_invoice.pdf
Expected output:
Page 1: 2 text blocks, 1 table extracted
LLM extraction: valid JSON, 4/4 fields populated
Confidence: 1.0 — auto-confirm
Run validation tests:
pytest tests/test_extraction.py -v --tb=short
Expected: test_valid_invoice PASSED, test_missing_fields PASSED (confidence: 0.5)
Common failures
- Table detection errors: PyMuPDF's
find_tables()struggles with rotated or nested tables. Usetabula-pyor a dedicated table detector as a fallback. - LLM hallucination: Always validate through pydantic. The model may invent plausible-but-wrong values for dates or totals.
- Large PDFs: Process pages in batches of 10. Memory usage scales linearly with page count; stream pages for files over 200 pages.
- Multi-column layout: Use
page.get_text("dict")to retrieve block-level coordinates and reorder blocks by position before sending to the LLM.
Related guides
- How to build an AI content generation pipeline — The structured output and validation patterns used here integrate directly into content pipelines.
- How to build a local AI product with Nigerian naira pricing — Extracted data feeds into automated report generation that can be packaged as a monetized product.