HOW-TO · SUP

How to implement AI-powered data extraction from PDFs

advanced30 minBy Fredoline Eruo
Target environment
Ubuntu 24.04 · Ollama 0.4.x
PREREQUISITES

PyMuPDF, LLM endpoint, structured output format

What this does

Extracting structured data from PDFs—especially mixed content with text, tables, and figures—requires combining layout-aware parsing (PyMuPDF) with LLM reasoning. This guide covers page-level extraction, table detection, schema validation, and confidence scoring for reliable automated pipelines.

Steps

1. Install and configure dependencies

pip install pymupdf pydantic openai python-dotenv

2. Parse pages with layout awareness

import fitz  # PyMuPDF

def extract_pages(path: str):
    doc = fitz.open(path)
    pages = []
    for page_num, page in enumerate(doc):
        blocks = page.get_text("blocks")
        tables = page.find_tables()
        pages.append({
            "page_num": page_num,
            "text_blocks": blocks,
            "tables": tables.extract(),
        })
    return pages

3. Build the extraction prompt Send page content to an LLM with a structured output schema. Include table data in markdown format so the model handles column mapping correctly.

SYSTEM_PROMPT = """Extract data according to this schema:
{"invoice_id": str, "date": str, "total": float, "line_items": [{"description": str, "quantity": int, "unit_price": float}]}
Return JSON only. If a field is missing, use null."""

4. Validate with pydantic Parse the LLM output through a pydantic model to catch schema mismatches. Re-prompt on failure with a correction hint.

from pydantic import BaseModel, ValidationError

class Invoice(BaseModel):
    invoice_id: str | None
    date: str | None
    total: float | None
    line_items: list[dict]

def validate_output(raw: str) -> Invoice | None:
    try:
        return Invoice.model_validate_json(raw)
    except ValidationError:
        return None

5. Compute confidence scores Score each extraction: 1.0 if all required fields present, 0.5 if partial, 0.0 if validation failed entirely. Flag low-confidence records for manual review.

  • Record the local run evidence. Save the exact command, runtime or package version, model name if applicable, and observed output so the result can be reproduced later.

Verification

python -m extraction.extract --file sample_invoice.pdf

Expected output:

Page 1: 2 text blocks, 1 table extracted
LLM extraction: valid JSON, 4/4 fields populated
Confidence: 1.0 — auto-confirm

Run validation tests:

pytest tests/test_extraction.py -v --tb=short

Expected: test_valid_invoice PASSED, test_missing_fields PASSED (confidence: 0.5)

Common failures

  • Table detection errors: PyMuPDF's find_tables() struggles with rotated or nested tables. Use tabula-py or a dedicated table detector as a fallback.
  • LLM hallucination: Always validate through pydantic. The model may invent plausible-but-wrong values for dates or totals.
  • Large PDFs: Process pages in batches of 10. Memory usage scales linearly with page count; stream pages for files over 200 pages.
  • Multi-column layout: Use page.get_text("dict") to retrieve block-level coordinates and reorder blocks by position before sending to the LLM.

Related guides

  • How to build an AI content generation pipeline — The structured output and validation patterns used here integrate directly into content pipelines.
  • How to build a local AI product with Nigerian naira pricing — Extracted data feeds into automated report generation that can be packaged as a monetized product.