Table Extraction — Document Processing with Local AI (Chapter 10)

Tables appear in contracts, reports, and scientific papers. Extracting them as raw text produces a tangled mess. This chapter covers methods to reconstruct tables faithfully and export them to structured formats.

Why Tables Break

PDF renderers flatten tables into line positions. Without semantic understanding of cell boundaries, extracted text follows reading order (left to right, top to bottom), losing row and column relationships entirely.

A table with merged cells, spanning columns, or nested headers becomes unreadable when processed as a stream of strings.
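A small sketch can make the failure concrete. The grid and the helper names below (flatten, rechunk) are invented for illustration and do not come from any library; the example only assumes that naive extraction emits non-empty cells in reading order.

```python
# A small table with one empty cell (None), as a grid of rows.
grid = [
    ["Region", "Q1", "Q2"],
    ["North", "120", None],
    ["South", "95", "110"],
]

def flatten(rows):
    """Mimic naive text extraction: emit non-empty cells in reading order."""
    return [cell for row in rows for cell in row if cell]

def rechunk(stream, width):
    """Try to rebuild rows by cutting the stream every `width` items."""
    return [stream[i:i + width] for i in range(0, len(stream), width)]

stream = flatten(grid)
rebuilt = rechunk(stream, 3)

print(stream)
print(rebuilt[1])  # "South" has slid into the "North" row
```

Because the empty cell leaves no trace in the stream, every value after it shifts one position, and no fixed row width can recover the original layout.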

Extracting Tabular Data

Libraries like pdfplumber provide table detection and cell extraction.

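import pdfplumber

def extract_tables(pdf_path):
    """Yield one dict per table found in the PDF, page by page."""
    with pdfplumber.open(pdf_path) as pdf:
        for page_num, page in enumerate(pdf.pages, start=1):
            # Each table is a list of rows; each row is a list of cell strings (or None).
            tables = page.extract_tables()
            for table_idx, table in enumerate(tables):
                yield {
                    "page": page_num,
                    "table_index": table_idx,
                    "rows": table,
                    # Treat the first row as the header; this is a guess, not detected.
                    "header": table[0] if table else [],
                    "data": table[1:] if len(table) > 1 else []
                }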

for result in extract_tables("quarterly-report.pdf"):
    print(f"Page {result['page']}, Table {result['table_index']}: {len(result['data'])} rows")

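By default, extract_tables() builds cells from the ruling lines and rectangles drawn on the page; a text-based strategy instead infers boundaries from how words align. Both rely on the PDF's embedded text layer, so scanned documents need OCR first, and accuracy then depends on the OCR quality.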

Improving Detection Accuracy

Configure detection parameters when defaults fail:

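import pdfplumber

with pdfplumber.open("complex-report.pdf") as pdf:
    page = pdf.pages[0]
    tables = page.extract_tables(
        table_settings={
            # Build the grid from ruling lines drawn on the page.
            "vertical_strategy": "lines",
            "horizontal_strategy": "lines",
            # Extra line positions to add by hand; empty means none.
            "explicit_vertical_lines": [],
            "explicit_horizontal_lines": [],
            # Distance within which two lines still count as intersecting.
            "intersection_tolerance": 5
        }
    )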

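The "lines" strategy works when tables have clearly drawn borders. Without borders, fall back to the "text" strategy, which infers cell edges from word alignment, and tune its thresholds (for example min_words_vertical and text_x_tolerance) until the columns separate cleanly.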
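For borderless tables, a minimal sketch of a text-based configuration might look like the following. The helper names borderless_settings and extract_borderless are hypothetical; the dictionary keys are pdfplumber table settings, but the values are starting points to adjust per document, not recommended defaults.

```python
def borderless_settings(min_words_vertical=3, text_x_tolerance=3):
    """Build table settings that infer cell edges from word alignment."""
    return {
        "vertical_strategy": "text",
        "horizontal_strategy": "text",
        # Minimum number of aligned words needed to imply a column edge.
        "min_words_vertical": min_words_vertical,
        # Minimum number of aligned words needed to imply a row edge.
        "min_words_horizontal": 1,
        # Spacing thresholds used when grouping characters into words.
        "text_x_tolerance": text_x_tolerance,
        "text_y_tolerance": 3,
        # How aggressively nearby edges are snapped and joined together.
        "snap_tolerance": 3,
        "join_tolerance": 3,
    }

def extract_borderless(pdf_path, page_index=0, **overrides):
    """Extract tables from one page using the text strategy."""
    import pdfplumber  # imported here so the settings helper works without it

    with pdfplumber.open(pdf_path) as pdf:
        page = pdf.pages[page_index]
        settings = borderless_settings(**overrides)
        return page.extract_tables(table_settings=settings)

settings = borderless_settings(min_words_vertical=2)
print(settings["vertical_strategy"], settings["min_words_vertical"])
```

Lowering min_words_vertical tends to find more columns in sparse tables at the cost of false splits, so it is worth checking the result against the page visually.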

Converting to Structured Formats

Export extracted tables as CSV or JSON:

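import csv
import json

def table_to_csv(table_data, output_path):
    # newline="" lets the csv module control line endings itself.
    with open(output_path, "w", newline="", encoding="utf-8") as f:
        writer = csv.writer(f)
        # None cells are written as empty fields.
        writer.writerows(table_data["rows"])

def table_to_json(table_data, output_path):
    header = table_data.get("header", [])
    with open(output_path, "w", encoding="utf-8") as f:
        # ensure_ascii=False keeps non-ASCII cell text readable in the file.
        json.dump({
            "header": header,
            "rows": table_data["data"]
        }, f, indent=2, ensure_ascii=False)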
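Downstream tools often prefer one JSON object per row, keyed by column name. The sketch below shows one way to do that; rows_to_records is a hypothetical helper, and the column_N fallback name for blank headers is an assumption made here, not a convention from any library.

```python
import json

def rows_to_records(header, data):
    """Pair each data row with the header to get one dict per row."""
    # Replace missing or blank header cells with positional names.
    keys = [(h or "").strip() or f"column_{i}" for i, h in enumerate(header)]
    records = []
    for row in data:
        # Pad short rows with None; zip drops any cells beyond the header width.
        padded = list(row) + [None] * (len(keys) - len(row))
        records.append(dict(zip(keys, padded)))
    return records

header = ["Region", "Q1", None]
data = [["North", "120", "130"], ["South", "95"]]
records = rows_to_records(header, data)
print(json.dumps(records, indent=2))
```

Duplicate header names would overwrite each other in the resulting dicts, so tables with repeated column titles need the headers made unique first.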

Handling Complex Tables

Merged cells require post-processing to reconstruct relationships. Compare each row against the widest row in the table: a shorter row, or one containing empty (None) cells, suggests that a merged cell has absorbed one or more positions.
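The row comparison can be sketched as a small check. find_irregular_rows is a hypothetical helper, and it assumes the extractor represents missing positions either by returning a shorter row or by placing None in the row; a genuinely empty cell will be flagged too, so the output is a list of candidates to review rather than a verdict.

```python
def find_irregular_rows(rows):
    """Return (row_index, reason) pairs for rows that may contain merged cells."""
    width = max((len(row) for row in rows), default=0)
    flagged = []
    for i, row in enumerate(rows):
        if len(row) < width:
            flagged.append((i, f"short by {width - len(row)}"))
        elif any(cell is None for cell in row):
            flagged.append((i, "contains None"))
    return flagged

rows = [
    ["A", "B", "C"],
    ["x", "y"],
    ["p", None, "q"],
    ["1", "2", "3"],
]
flagged = find_irregular_rows(rows)
print(flagged)
```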

Spanning cells need special handling. When a cell value appears once but occupies multiple columns in the visual layout, duplicate the value across all spanned columns.
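Duplicating a spanned value can be sketched as a left-to-right fill. fill_spans is a hypothetical helper, and it assumes the extractor puts the value in the first spanned column and None in the rest, which should be verified against real output before relying on it.

```python
def fill_spans(row):
    """Copy each value rightward into the None cells that follow it."""
    filled = []
    last = None
    for cell in row:
        if cell is None:
            # Reuse the most recent value; stays None if nothing precedes it.
            filled.append(last)
        else:
            filled.append(cell)
            last = cell
    return filled

header = ["Region", "2023", None, "2024", None]
filled_header = fill_spans(header)
print(filled_header)
```

This fill cannot tell a spanned cell from a genuinely empty one, so it is safest to apply it only to header rows, where blanks are rarely meaningful.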