© 2026 runlocalai.co · Independently operated
HOW-TO · RAG

How to Extract Tables from PDFs for Structured Data

advanced·30 min·By Fredoline Eruo
Target environment
Ubuntu 24.04 · Ollama 0.4.x
PREREQUISITES

PyMuPDF (1.23 or later, for find_tables()), Camelot, and pandas installed; PDFs with tables

What this does

Tables inside PDFs, such as financial statements and schedules, lose their row and column structure when read by standard text loaders, which flatten them into runs of text. This guide covers extracting tables into structured formats (CSV, pandas DataFrame, Markdown) using PyMuPDF and Camelot.

Steps

  1. Find table regions with PyMuPDF.

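    import fitz  # PyMuPDF; find_tables() requires version 1.23 or later
    
    doc = fitz.open("/data/financials.pdf")
    page = doc[0]  # first page only; loop over doc to scan every page
    tabs = page.find_tables()
    print(f"Found {len(tabs.tables)} table(s)")
    for i, tbl in enumerate(tabs.tables):
        # row_count and col_count are integers; tbl.rows is a list of row objects
        print(f"Table {i}: bbox={tbl.bbox}, rows={tbl.row_count}, cols={tbl.col_count}")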
    
  2. Extract table data into a DataFrame.

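    import pandas as pd
    
    if tabs.tables:
        tbl = tabs.tables[0]
        data = tbl.extract()  # list of rows; each row is a list of cell strings (or None)
        # Treat the first extracted row as the header and the rest as the body.
        df = pd.DataFrame(data[1:], columns=data[0])
        print(df.to_string())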
    
  3. Use Camelot's lattice flavor for tables with ruled borders.

    from camelot import read_pdf
    
    tables = read_pdf("/data/financials.pdf", pages="1-end", flavor="lattice")
    print(f"Camelot found {len(tables)} table(s)")
    for i, t in enumerate(tables):
        print(t.df.head(3).to_string())
    
  4. Serialize for RAG ingestion.

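    # Write each Camelot table to its own CSV, named by source page and table index.
    for i, t in enumerate(tables):
        t.df.to_csv(f"/data/table_page_{t.page}_{i}.csv", index=False)
    # Render the PyMuPDF DataFrame from step 2 as Markdown (to_markdown needs the tabulate package).
    table_md = df.to_markdown(index=False)
    print(table_md)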
    
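The header handling in step 2 assumes the first extracted row is a clean header. Below is a minimal sketch of a more defensive split, assuming `data` has the list-of-lists shape that `tbl.extract()` returns (cells may be `None` or empty). The helper name `split_header` and the `col_N` placeholder naming are hypothetical choices for illustration, not part of PyMuPDF or pandas.

```python
def split_header(data):
    """Split extracted rows into (header, body) with usable column names.

    `data` is a list of rows, each a list of cell values (str or None).
    Empty header cells become "col_<index>"; repeated names get a numeric suffix.
    """
    if not data:
        return [], []
    header, seen = [], {}
    for idx, cell in enumerate(data[0]):
        name = (cell or "").strip() or f"col_{idx}"
        count = seen.get(name, 0)
        seen[name] = count + 1
        header.append(name if count == 0 else f"{name}_{count}")
    width = len(header)
    body = []
    for row in data[1:]:
        cells = ["" if c is None else str(c).strip() for c in row]
        # Pad or trim so every row matches the header width.
        cells = (cells + [""] * width)[:width]
        body.append(cells)
    return header, body


# Hypothetical sample data with a missing header cell, a duplicate, and a short row.
sample = [
    ["Item", None, "Amount", "Amount"],
    ["Revenue", "Q1", "100", None],
    ["Costs", "Q1", "60"],
]
header, body = split_header(sample)
print(header)
print(body)
```

The result can be passed to `pd.DataFrame(body, columns=header)` in place of the `data[1:]` / `data[0]` split, which avoids `None` or duplicate column names reaching the CSV and Markdown output.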

Verification

python -c "import fitz; print('PyMuPDF version:', fitz.__version__)"
# Expected: PyMuPDF version: <version>
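
The check above covers PyMuPDF only. A small sketch that reports every package the steps rely on, using the standard library's `importlib.metadata`; the distribution names listed (`PyMuPDF`, `camelot-py`, `pandas`, `tabulate`) are assumed to match how the packages were installed with pip, and the function name `installed_versions` is hypothetical.

```python
from importlib import metadata


def installed_versions(distributions=("PyMuPDF", "camelot-py", "pandas", "tabulate")):
    """Return {distribution name: version string, or None if not installed}."""
    versions = {}
    for name in distributions:
        try:
            versions[name] = metadata.version(name)
        except metadata.PackageNotFoundError:
            versions[name] = None
    return versions


for name, version in installed_versions().items():
    print(f"{name}: {version or 'NOT INSTALLED'}")
```

Because this reads package metadata rather than importing the modules, it runs even when one of the packages is missing or fails to import.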

Common failures

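  • No tables found (PyMuPDF and Camelot both return an empty result). The PDF is probably a scanned image with no text layer. Run OCR first, for example with pytesseract. Switching Camelot to flavor="stream" does not help here, because Camelot only reads text-based PDFs.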
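  • Camelot returns empty DataFrames. The lattice flavor needs drawn ruling lines; switch to flavor="stream", which infers columns from the whitespace between text.
  • Mismatched column headers. The first extracted row is not the real header: pass header=1 to pandas read_csv to use the second row, or set the column names explicitly instead of relying on data[0].
  • ModuleNotFoundError for camelot. Install with pip install "camelot-py[cv]" and verify Ghostscript with gs -version.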
  • Version mismatch. The installed package or runtime differs from the command shown; check the version first and rerun the smallest verification command.
  • Local environment drift. Another service, virtual environment, model, or path is being used; print the active binary path and configuration before changing the guide steps.
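
The first two failures above can be triaged in code. This is a rough sketch under stated assumptions, not a definitive implementation: `has_text_layer` and `read_tables_with_fallback` are hypothetical helper names, the text-layer check is a heuristic (a scan with a few stray text objects would still pass), and the `reader` parameter exists only so the fallback logic can be exercised without a PDF on disk. By default it uses `camelot.read_pdf` with the same arguments as step 3.

```python
def has_text_layer(page):
    """True if the page exposes extractable text (i.e. is not a bare scan).

    `page` is expected to be a PyMuPDF page; only `page.get_text()` is used.
    """
    return bool(page.get_text().strip())


def read_tables_with_fallback(path, pages="1-end", reader=None):
    """Try Camelot's lattice flavor first, then stream.

    Returns (flavor, tables) for the first flavor that yields at least one
    non-empty DataFrame, or (None, []) if neither does.
    """
    if reader is None:
        # Imported lazily so the module loads even if Camelot is missing.
        from camelot import read_pdf as reader
    for flavor in ("lattice", "stream"):
        found = reader(path, pages=pages, flavor=flavor)
        tables = [t for t in found if not t.df.empty]
        if tables:
            return flavor, tables
    return None, []
```

A typical flow would call `has_text_layer(doc[0])` on the page opened in step 1; if it returns `False`, run OCR before anything else, otherwise call `read_tables_with_fallback("/data/financials.pdf")` and record which flavor produced the tables.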

Operator checkpoint

Before treating this as solved, write down the local runtime, model or package version, hardware/backend if relevant, and the verification output. This keeps the guide useful as a Will-It-Run style decision instead of a one-off command transcript.

Related guides

  • How to Load Documents with LangChain Document Loaders
  • How to Extract Metadata from Documents for Filtering