What this does

Tables inside PDFs, such as financial statements and schedules, lose their row and column structure when read by standard text loaders. This guide shows how to extract them into structured formats (CSV, pandas DataFrame) using PyMuPDF and Camelot.

Steps

Find table regions with PyMuPDF.

import fitz

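doc = fitz.open("/data/financials.pdf")
page = doc[0]  # first page only; page indices are 0-based
tabs = page.find_tables()
print(f"Found {len(tabs.tables)} table(s)")
for i, tbl in enumerate(tabs.tables):
    # row_count and col_count hold the dimensions; tbl.rows is a list of row objects
    print(f"Table {i}: bbox={tbl.bbox}, rows={tbl.row_count}, cols={tbl.col_count}")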
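The snippet above inspects only the first page. A minimal sketch of scanning every page could look like the following. `collect_tables` is a hypothetical helper name, not part of PyMuPDF, and it assumes only that each page exposes `find_tables()` with a `.tables` list whose items have `extract()`, as used in this step.

```python
def collect_tables(doc):
    """Return (page_index, table_index, rows) for every table in the document.

    `doc` is any iterable of page objects exposing find_tables(), such as a
    document opened with fitz.open(). Hypothetical helper, not a PyMuPDF API.
    """
    found = []
    for page_index, page in enumerate(doc):
        finder = page.find_tables()
        for table_index, tbl in enumerate(finder.tables):
            # extract() returns the table as a list of rows (lists of cell text)
            found.append((page_index, table_index, tbl.extract()))
    return found
```

Usage would be `collect_tables(fitz.open("/data/financials.pdf"))`. Page indices in the result are 0-based, unlike Camelot's 1-based page strings.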

Extract table data into a DataFrame.

import pandas as pd

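if tabs.tables:
    tbl = tabs.tables[0]
    data = tbl.extract()  # list of rows, each a list of cell strings
    # Treat the first extracted row as the header and the rest as data
    df = pd.DataFrame(data[1:], columns=data[0])
    print(df.to_string())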
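Using `data[0]` directly as column names works only when every header cell is filled and unique. The sketch below is one way to tidy the header first, assuming header cells may be `None`, empty, multi-line, or repeated (for example under merged cells). `clean_header` is a hypothetical helper, not a PyMuPDF or pandas function.

```python
def clean_header(header):
    """Turn raw header cells into unique, non-empty column names."""
    seen = {}
    names = []
    for position, cell in enumerate(header):
        # Empty or missing cells get a positional placeholder name
        name = (cell or "").strip().replace("\n", " ") or f"col_{position}"
        count = seen.get(name, 0)
        seen[name] = count + 1
        # Repeated names get a numeric suffix so every column stays unique
        names.append(name if count == 0 else f"{name}_{count + 1}")
    return names
```

With this in place, the DataFrame could be built as `pd.DataFrame(data[1:], columns=clean_header(data[0]))`.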

Use Camelot for lattice extraction.

from camelot import read_pdf

tables = read_pdf("/data/financials.pdf", pages="1-end", flavor="lattice")
print(f"Camelot found {len(tables)} table(s)")
for i, t in enumerate(tables):
    # t.page is the page the table was found on
    print(f"Table {i} (page {t.page}):")
    print(t.df.head(3).to_string())
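Lattice parsing depends on drawn cell borders, so a common pattern is to retry with `flavor="stream"` when lattice yields nothing usable. The following is a sketch under that assumption. `read_with_fallback` and its `reader` parameter are hypothetical names introduced here; `reader` defaults to `camelot.read_pdf` and exists so the logic can be exercised without Camelot installed.

```python
def read_with_fallback(path, pages="1-end", reader=None):
    """Try lattice first, then stream; return (flavor_used, non_empty_tables)."""
    if reader is None:
        # Imported lazily so the helper can be defined without Camelot installed
        from camelot import read_pdf as reader
    for flavor in ("lattice", "stream"):
        tables = reader(path, pages=pages, flavor=flavor)
        # Keep only tables whose DataFrame actually holds cells
        usable = [t for t in tables if not t.df.empty]
        if usable:
            return flavor, usable
    return None, []
```

A call such as `flavor, tables = read_with_fallback("/data/financials.pdf")` returns which flavor succeeded, which is worth recording alongside the extracted data.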

Serialize for RAG ingestion.

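for i, t in enumerate(tables):
    # i is the table index, not the page; t.page records the source page
    t.df.to_csv(f"/data/table_{i}_page_{t.page}.csv", index=False)
# df is the PyMuPDF DataFrame built earlier; to_markdown needs the tabulate package
table_md = df.to_markdown(index=False)
print(table_md)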
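`DataFrame.to_markdown` relies on the optional `tabulate` package. Where that dependency is unwanted, extracted rows can be rendered as a Markdown table with the standard library alone. The sketch below assumes rows are lists of strings or `None`, as returned by `tbl.extract()`; `rows_to_markdown` is a hypothetical helper name.

```python
def rows_to_markdown(header, rows):
    """Render a header and data rows as a pipe-delimited Markdown table."""
    def cell(value):
        # Pipes and newlines would break the Markdown table layout
        text = "" if value is None else str(value)
        return text.replace("|", "\\|").replace("\n", " ").strip()

    lines = ["| " + " | ".join(cell(h) for h in header) + " |"]
    lines.append("| " + " | ".join("---" for _ in header) + " |")
    for row in rows:
        lines.append("| " + " | ".join(cell(v) for v in row) + " |")
    return "\n".join(lines)
```

For the PyMuPDF result this could be called as `rows_to_markdown(data[0], data[1:])`.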

Verification

python -c "import fitz; print('PyMuPDF version:', fitz.__version__)"
# Expected: PyMuPDF version: <version>

Common failures

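No tables found. The PDF is probably a scanned image with no text layer. Camelot only works on text-based PDFs, so changing the flavor will not help; run OCR first (for example with pytesseract) to produce a searchable PDF, then rerun the extraction.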
Camelot returns empty DataFrames. The lattice flavor needs ruled cell borders; for tables without drawn lines, switch to flavor="stream", which infers columns from whitespace.
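Mismatched column headers. The real header may sit below a title row or span several rows. When reloading a saved CSV, pass header=1 to pandas read_csv to use the second row as the header, or set the column names explicitly instead of relying on data[0].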
ModuleNotFoundError for camelot. Install with pip install "camelot-py[cv]" (the quotes stop shells such as zsh from expanding the brackets) and verify Ghostscript with gs -version.
Version mismatch. The installed package or runtime differs from the command shown; check the version first and rerun the smallest verification command.
Local environment drift. Another service, virtual environment, model, or path is being used; print the active binary path and configuration before changing the guide steps.
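To check for version mismatch or environment drift, a short report of the active interpreter and package versions can help. The sketch below uses only the standard library. `environment_report` is a hypothetical helper, and it assumes the Ghostscript binary is named `gs` on PATH, which is not the case on Windows.

```python
import shutil
import sys
from importlib import metadata


def environment_report(packages=("PyMuPDF", "camelot-py", "pandas")):
    """Collect the interpreter path, package versions, and Ghostscript location."""
    report = {"python": sys.executable}
    for name in packages:
        try:
            report[name] = metadata.version(name)
        except metadata.PackageNotFoundError:
            report[name] = "not installed"
    # Ghostscript is an external binary, so look it up on PATH
    report["gs"] = shutil.which("gs") or "not found"
    return report


for key, value in environment_report().items():
    print(f"{key}: {value}")
```

Saving this output next to the extracted tables gives a record of which environment produced them.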

Operator checkpoint

Before treating this as solved, write down the local runtime, model or package version, hardware/backend if relevant, and the verification output. This keeps the guide useful as a Will-It-Run style decision instead of a one-off command transcript.

How to Extract Tables from PDFs for Structured Data
