HOW-TO · RAG
How to Extract Tables from PDFs for Structured Data
Target environment
Ubuntu 24.04 · Ollama 0.4.x
PREREQUISITES
PyMuPDF 1.23.0 or later (required for find_tables) or Camelot installed, plus pandas; PDFs with tables
What this does
Standard text loaders flatten tabular data inside PDFs, such as financial statements and schedules, into plain text and lose the row and column structure. This guide covers extracting tables into structured formats (CSV, pandas DataFrame) using PyMuPDF and Camelot.
Steps
Find table regions with PyMuPDF.
    import fitz  # PyMuPDF

    doc = fitz.open("/data/financials.pdf")
    page = doc[0]  # first page only

    tabs = page.find_tables()
    print(f"Found {len(tabs.tables)} table(s)")
    for i, tbl in enumerate(tabs.tables):
        # row_count and col_count are the table's dimensions
        print(f"Table {i}: bbox={tbl.bbox}, rows={tbl.row_count}, cols={tbl.col_count}")
Extract table data into a DataFrame.
    import pandas as pd

    if tabs.tables:
        tbl = tabs.tables[0]
        data = tbl.extract()  # list of rows, each a list of cell values
        df = pd.DataFrame(data[1:], columns=data[0])  # first row becomes the header
        print(df.to_string())
Use Camelot for lattice extraction.
    from camelot import read_pdf

    # flavor="lattice" relies on ruling lines drawn between cells
    tables = read_pdf("/data/financials.pdf", pages="1-end", flavor="lattice")
    print(f"Camelot found {len(tables)} table(s)")
    for i, t in enumerate(tables):
        print(t.df.head(3).to_string())
Serialize for RAG ingestion.
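    # One CSV per Camelot table; i is the table index, not the page number
    for i, t in enumerate(tables):
        t.df.to_csv(f"/data/table_page_{i}.csv", index=False)

    # df is the PyMuPDF DataFrame built earlier; to_markdown() needs the tabulate package
    table_md = df.to_markdown(index=False)
    print(table_md)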
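Extracted rows may contain None for empty cells, line breaks inside cells, and blank or repeated header names, all of which make poor RAG chunks. The following is a minimal sketch of one way to tidy a list of rows and render it as Markdown without pandas or tabulate. The helper names (clean_rows, unique_headers, rows_to_markdown) and the sample data are hypothetical, not part of PyMuPDF, Camelot, or pandas.

```python
from typing import Dict, List, Optional, Sequence

Row = Sequence[Optional[str]]


def clean_rows(rows: Sequence[Row]) -> List[List[str]]:
    """Replace None with "" and collapse whitespace and newlines in every cell."""
    return [[" ".join((cell or "").split()) for cell in row] for row in rows]


def unique_headers(header: Sequence[str]) -> List[str]:
    """Name blank headers col_<position> and add a numeric suffix to repeats."""
    seen: Dict[str, int] = {}
    result = []
    for pos, name in enumerate(header):
        base = name or f"col_{pos}"
        count = seen.get(base, 0)
        seen[base] = count + 1
        result.append(base if count == 0 else f"{base}_{count + 1}")
    return result


def rows_to_markdown(rows: Sequence[Row]) -> str:
    """Render rows (first row is the header) as a pipe-delimited Markdown table."""
    cleaned = clean_rows(rows)
    if not cleaned:
        return ""
    header = unique_headers(cleaned[0])
    width = len(header)
    lines = [
        "| " + " | ".join(header) + " |",
        "| " + " | ".join(["---"] * width) + " |",
    ]
    for row in cleaned[1:]:
        # Pad or truncate so each row has exactly as many cells as the header.
        cells = (list(row) + [""] * width)[:width]
        # Escape pipes so cell text cannot break the table layout.
        cells = [c.replace("|", "\\|") for c in cells]
        lines.append("| " + " | ".join(cells) + " |")
    return "\n".join(lines)


# Hypothetical sample in the same list-of-rows shape used in the extraction step.
sample = [
    ["Item", None, "Item"],
    ["Revenue", "1,200", None],
    ["Net\nincome", "300"],
]
print(rows_to_markdown(sample))
```

Keeping the cleanup separate from the rendering means the same cleaned rows can also be passed to pd.DataFrame if a DataFrame is preferred over Markdown text.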
Verification
python -c "import fitz; print('PyMuPDF version:', fitz.__version__)"
# Expected: PyMuPDF version: <version>
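A broader check can report every package the steps rely on, along with the interpreter that is actually running. This is a rough sketch using only the standard library; the PACKAGES mapping and the function names are assumptions made for illustration, and the distribution names are the ones used on PyPI.

```python
import importlib.metadata
import importlib.util
import sys
from typing import Dict, Optional

# PyPI distribution name -> module name used in import statements.
PACKAGES = {"PyMuPDF": "fitz", "camelot-py": "camelot", "pandas": "pandas"}


def installed_version(dist_name: str) -> Optional[str]:
    """Return the installed version string, or None if the distribution is absent."""
    try:
        return importlib.metadata.version(dist_name)
    except importlib.metadata.PackageNotFoundError:
        return None


def environment_report() -> Dict[str, Dict[str, object]]:
    """Collect version and importability for each package without importing it."""
    report: Dict[str, Dict[str, object]] = {}
    for dist_name, module_name in PACKAGES.items():
        report[dist_name] = {
            "version": installed_version(dist_name),
            "importable": importlib.util.find_spec(module_name) is not None,
        }
    return report


print("Python executable:", sys.executable)
for name, info in environment_report().items():
    print(f"{name}: version={info['version']}, importable={info['importable']}")
```

A package that shows a version but is not importable, or the reverse, usually points to a different virtual environment than the one the guide commands ran in.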
Common failures
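- No tables found (reported as an empty result or a ValueError). The PDF is probably a scanned image with no text layer. Run OCR first, for example with pytesseract; Camelot's flavor="stream" only helps on text-based PDFs.
- Camelot returns empty DataFrames. Switch to flavor="stream", which infers column boundaries from whitespace instead of ruling lines.
- Mismatched column headers. Camelot DataFrames carry integer column labels, so the exported CSV starts with a 0,1,2,... line; pass header=1 to pandas read_csv to use the next line as the header, or set data[0] explicitly as the columns when building the DataFrame.
- ModuleNotFoundError for camelot. Install with pip install "camelot-py[cv]" and verify Ghostscript with gs -version.
- Version mismatch - The installed package or runtime differs from the command shown; check the version first and rerun the smallest verification command.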
- Local environment drift - Another service, virtual environment, model, or path is being used; print the active binary path and configuration before changing the guide steps.
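The lattice-then-stream advice can be expressed as a small fallback loop. This is a sketch under stated assumptions: extract_with_fallback, camelot_reader, and non_empty are hypothetical helper names, the reader is passed in as a parameter so the logic can be exercised without Camelot installed, and fake_reader is only a stand-in for demonstration.

```python
from types import SimpleNamespace
from typing import Any, Callable, Iterable, List, Optional, Sequence, Tuple


def camelot_reader(path: str, flavor: str) -> List[Any]:
    """Default reader. Needs camelot-py; imported lazily so this module loads without it."""
    import camelot

    return list(camelot.read_pdf(path, pages="1-end", flavor=flavor))


def non_empty(tables: Iterable[Any]) -> List[Any]:
    """Keep only tables whose DataFrame holds at least one cell."""
    return [t for t in tables if not t.df.empty]


def extract_with_fallback(
    path: str,
    reader: Callable[[str, str], List[Any]] = camelot_reader,
    flavors: Sequence[str] = ("lattice", "stream"),
) -> Tuple[Optional[str], List[Any]]:
    """Try each flavor in order and return the first one that yields non-empty tables."""
    for flavor in flavors:
        tables = non_empty(reader(path, flavor))
        if tables:
            return flavor, tables
    # Nothing worked: the PDF may be scanned and need OCR first.
    return None, []


def fake_reader(path: str, flavor: str) -> List[Any]:
    """Stand-in for Camelot: pretends lattice finds only an empty table."""
    is_empty = flavor == "lattice"
    return [SimpleNamespace(df=SimpleNamespace(empty=is_empty))]


flavor, tables = extract_with_fallback("/data/financials.pdf", reader=fake_reader)
print(f"flavor used: {flavor}, tables kept: {len(tables)}")
```

In real use, omit the reader argument so the Camelot-backed default is called. A result of (None, []) suggests checking for a missing text layer before changing anything else.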
Operator checkpoint
Before treating this as solved, write down the local runtime, model or package version, hardware/backend if relevant, and the verification output. This keeps the guide useful as a Will-It-Run style decision instead of a one-off command transcript.