10. Table Extraction
Tables appear in contracts, reports, and scientific papers. Extracting them as raw text produces a tangled mess. This chapter covers methods to reconstruct tables faithfully and export them to structured formats.
Why Tables Break
PDF files typically store text as glyphs placed at page coordinates, with no record of rows, columns, or cells. Without knowledge of cell boundaries, extracted text simply follows reading order (left to right, top to bottom) and loses row and column relationships entirely.
A table with merged cells, spanning columns, or nested headers becomes unreadable when processed as a stream of strings.
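A small illustration of the problem, using a made-up three-column table. The data and the flatten_row helper are invented for this sketch and do not come from any library:

```python
# Hypothetical table: None marks an empty cell.
table = [
    ["Region", "Q1", "Q2"],
    ["North", "10", None],
    ["South", None, "7"],
]


def flatten_row(row):
    """Join only the visible text, roughly as a plain text extractor would."""
    return " ".join(cell for cell in row if cell)


flattened = [flatten_row(row) for row in table]
for line in flattened:
    print(line)
# Region Q1 Q2
# North 10
# South 7
```

In the flattened output, nothing indicates that 10 belongs to Q1 and 7 to Q2. The column position, which carried the meaning, is gone.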
Extracting Tabular Data
Libraries like pdfplumber provide table detection and cell extraction.
import pdfplumber


def extract_tables(pdf_path):
    """Yield one dict per table found in the PDF, tagged with its page number."""
    with pdfplumber.open(pdf_path) as pdf:
        for page_num, page in enumerate(pdf.pages, start=1):
            tables = page.extract_tables()
            for table_idx, table in enumerate(tables):
                yield {
                    "page": page_num,
                    "table_index": table_idx,
                    "rows": table,
                    # Treat the first row as the header; the rest is data.
                    "header": table[0] if table else [],
                    "data": table[1:] if len(table) > 1 else [],
                }


for result in extract_tables("quarterly-report.pdf"):
    print(f"Page {result['page']}, Table {result['table_index']}: {len(result['data'])} rows")
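The extract_tables() method relies on heuristics: by default it looks for ruling lines drawn on the page, and it can optionally infer cell edges from text alignment and whitespace. It needs a text layer to work with, so for scanned documents that have not been through OCR, accuracy drops significantly and often nothing is extracted at all.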
Improving Detection Accuracy
Configure detection parameters when defaults fail:
with pdfplumber.open("complex-report.pdf") as pdf:
    page = pdf.pages[0]
    tables = page.extract_tables(
        table_settings={
            "vertical_strategy": "lines",
            "horizontal_strategy": "lines",
            "explicit_vertical_lines": [],
            "explicit_horizontal_lines": [],
            "intersection_tolerance": 5
        }
    )
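The "lines" strategy works when tables have clearly drawn borders. Without borders, fall back to the "text" strategy, which infers cell edges from how words line up, and tune its spacing thresholds (tolerances and minimum word counts) for the document at hand.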
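For borderless tables, a configuration along the following lines may be a reasonable starting point. This is a sketch, not a recommended preset: the helper names text_strategy_settings and extract_borderless_tables are invented here, and the threshold values are only starting points that will likely need tuning per document. The setting keys themselves (min_words_vertical, min_words_horizontal, snap_tolerance) are pdfplumber table settings.

```python
def text_strategy_settings(min_words_vertical=3, min_words_horizontal=1,
                           snap_tolerance=3):
    """Build a table_settings dict that infers cell edges from word alignment.

    Hypothetical helper for this sketch; the values are starting points only.
    """
    return {
        "vertical_strategy": "text",
        "horizontal_strategy": "text",
        # How many words must line up before an implied edge is accepted.
        "min_words_vertical": min_words_vertical,
        "min_words_horizontal": min_words_horizontal,
        # Nearly aligned edges within this distance are snapped together.
        "snap_tolerance": snap_tolerance,
    }


def extract_borderless_tables(pdf_path, page_index=0, **overrides):
    """Extract tables from one page using the text strategy."""
    import pdfplumber  # imported here so the settings helper runs without it

    with pdfplumber.open(pdf_path) as pdf:
        page = pdf.pages[page_index]
        settings = text_strategy_settings(**overrides)
        return page.extract_tables(table_settings=settings)
```

Lowering min_words_vertical tends to find more columns in sparse tables, at the cost of more false positives in ordinary paragraphs.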
Converting to Structured Formats
Export extracted tables as CSV or JSON:
import csv
import json


def table_to_csv(table_data, output_path):
    # newline="" lets the csv module control line endings itself.
    with open(output_path, "w", newline="", encoding="utf-8") as f:
        writer = csv.writer(f)
        writer.writerows(table_data["rows"])


def table_to_json(table_data, output_path):
    header = table_data.get("header", [])
    with open(output_path, "w", encoding="utf-8") as f:
        # Empty cells (None) are written as JSON null.
        json.dump({
            "header": header,
            "rows": table_data["data"]
        }, f, indent=2)
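When downstream code expects one object per row rather than parallel header and row lists, the same dictionary can be reshaped first. The sketch below assumes the "header" and "data" keys produced earlier in this chapter; the function name table_to_records and the column_N fallback naming are choices made for this example.

```python
def table_to_records(table_data):
    """Turn a header row plus data rows into a list of dicts, one per row."""
    header = [
        # Fall back to a positional name when a header cell is empty.
        str(name).strip() if name else f"column_{i}"
        for i, name in enumerate(table_data.get("header", []))
    ]
    records = []
    for row in table_data.get("data", []):
        # Pad short rows with None so every header key is present.
        padded = list(row) + [None] * (len(header) - len(row))
        # zip() stops at the header length, so extra cells are dropped.
        records.append(dict(zip(header, padded)))
    return records


sample = {
    "header": ["Item", None, "Total"],
    "data": [["Widget", "2", "9.50"], ["Gadget", "1"]],
}
records = table_to_records(sample)
```

The resulting list can be passed straight to json.dump(). Note that duplicate header names would overwrite one another in each dict, so headers may need de-duplicating first.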
Handling Complex Tables
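Merged cells require post-processing to reconstruct relationships. Compare each row's length with the widest row in the table; a shorter row, or one padded with empty cells, usually indicates that a merged cell has absorbed one or more positions.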
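One way to flag candidate rows is sketched below. It assumes rows are lists of strings in which an empty cell is either None or an empty string; extractors differ on this, so the check covers both short rows and padded ones. The function name find_irregular_rows is invented for this example.

```python
def find_irregular_rows(rows):
    """Return indexes of rows that are short or contain empty cells."""
    if not rows:
        return []
    # Use the widest row as the expected column count.
    expected = max(len(row) for row in rows)
    irregular = []
    for idx, row in enumerate(rows):
        is_short = len(row) < expected
        has_gap = any(cell is None or cell == "" for cell in row)
        if is_short or has_gap:
            irregular.append(idx)
    return irregular


rows = [
    ["Region", "Q1", "Q2"],
    ["North", "10", "12"],
    ["Total", None, "22"],
    ["Note"],
]
suspects = find_irregular_rows(rows)
```

A flagged row is only a candidate: a genuinely blank cell looks the same as a merged one, so the result is best treated as a list of rows to inspect rather than a verdict.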
Spanning cells need special handling. When a cell value appears once but occupies multiple columns in the visual layout, duplicate the value across all spanned columns.
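The duplication step can be sketched as a left-to-right fill. This assumes a spanned position shows up as None or an empty string after the cell that holds the value; fill_spanned_cells and its width parameter are names chosen for this example.

```python
def fill_spanned_cells(row, width=None):
    """Copy each value rightward into the empty cells that follow it."""
    filled = []
    last_value = None
    for cell in row:
        if cell is None or cell == "":
            # Empty position: reuse the most recent value to its left.
            filled.append(last_value)
        else:
            filled.append(cell)
            last_value = cell
    # Optionally extend a short row to the full table width.
    if width is not None and len(filled) < width:
        filled.extend([last_value] * (width - len(filled)))
    return filled


header = fill_spanned_cells(["Metric", "2023", None, "2024", None])
```

Because the fill cannot tell a spanned position from a cell that is simply blank, it is safest to apply it only to rows known to contain spans, such as multi-level headers.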
Create a script that extracts tables from a multi-page PDF, identifies which tables contain financial data (by checking for currency symbols or numeric patterns), and exports only those tables to a JSON file with page numbers preserved.