RUNLOCALAI · v38
Independently operated catalog for local-AI hardware and software. Hand-written verdicts. Source-cited claims. Reproducible commands when we have them.

OP·Fredoline Eruo
© 2026 runlocalai.co · Independently operated
Document Processing with Local AI

10. Table Extraction

Chapter 10 of 18 · 20 min
KEY INSIGHT

Table extraction requires both layout analysis and semantic interpretation. No single strategy works across all document types. Test extraction accuracy on samples before processing large batches.

Tables appear in contracts, reports, and scientific papers. Extracting them as raw text produces a tangled mess. This chapter covers methods to reconstruct tables faithfully and export them to structured formats.

Why Tables Break

PDF renderers flatten tables into line positions. Without semantic understanding of cell boundaries, extracted text follows reading order (left to right, top to bottom), losing row and column relationships entirely.

A table with merged cells, spanning columns, or nested headers becomes unreadable when processed as a stream of strings.
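To make the failure concrete, here is a minimal sketch using made-up word positions. The WORDS list, its coordinates, and both function names are illustrative assumptions, not part of any library. It contrasts a reading-order stream with a naive row reconstruction that groups words by vertical position.

```python
# Hypothetical word boxes for a small table: (text, x, y), with y growing downward.
WORDS = [
    ("Region", 50, 100), ("Q1", 200, 100), ("Q2", 300, 100),
    ("North", 50, 120), ("1,200", 200, 120), ("1,350", 300, 120),
]

def flatten(words):
    """Reading-order stream: sort top to bottom, then left to right."""
    ordered = sorted(words, key=lambda w: (w[2], w[1]))
    return " ".join(text for text, _, _ in ordered)

def rebuild_rows(words, y_tolerance=3):
    """Group words whose y positions are within y_tolerance into one row."""
    rows = []
    for text, x, y in sorted(words, key=lambda w: (w[2], w[1])):
        if rows and abs(rows[-1]["y"] - y) <= y_tolerance:
            rows[-1]["cells"].append((x, text))
        else:
            rows.append({"y": y, "cells": [(x, text)]})
    # Order each row's cells by x so columns line up left to right.
    return [[text for _, text in sorted(r["cells"])] for r in rows]

print(flatten(WORDS))       # Region Q1 Q2 North 1,200 1,350
print(rebuild_rows(WORDS))  # [['Region', 'Q1', 'Q2'], ['North', '1,200', '1,350']]
```

The flattened string gives no hint where one row ends and the next begins. The grouped version recovers rows only because this toy table has one word per cell and no merged cells; real layouts need the detection strategies covered in the rest of this chapter.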

Extracting Tabular Data

Libraries like pdfplumber provide table detection and cell extraction.

import pdfplumber

def extract_tables(pdf_path):
    """Yield one dict per table found in the PDF, with 1-based page numbers."""
    with pdfplumber.open(pdf_path) as pdf:
        for page_num, page in enumerate(pdf.pages, start=1):
            # Each table is a list of rows; each row is a list of cell strings (or None).
            tables = page.extract_tables()
            for table_idx, table in enumerate(tables):
                yield {
                    "page": page_num,
                    "table_index": table_idx,
                    "rows": table,
                    # Assumes the first row is the header, which is not true of every table.
                    "header": table[0] if table else [],
                    "data": table[1:] if len(table) > 1 else []
                }

for result in extract_tables("quarterly-report.pdf"):
    print(f"Page {result['page']}, Table {result['table_index']}: {len(result['data'])} rows")

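By default, extract_tables() builds the cell grid from ruling lines drawn on the page; the alternative "text" strategy infers boundaries from how words align. Both depend on the PDF's embedded text layer, so accuracy drops significantly on scanned documents, which need OCR before any text can be extracted.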
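Because accuracy varies by document, it helps to score extraction against a few hand-checked samples before running a large batch. The sketch below is one possible cell-level metric, not a standard one: normalize and cell_accuracy are hypothetical helper names, and the expected table is assumed to be typed in by hand from the source document.

```python
def normalize(cell):
    """Treat None as empty and collapse runs of whitespace, including newlines."""
    return " ".join((cell or "").split())

def cell_accuracy(extracted, expected):
    """Fraction of expected cells reproduced at the same row and column."""
    total = sum(len(row) for row in expected)
    if total == 0:
        raise ValueError("expected table is empty")
    matches = 0
    for r, expected_row in enumerate(expected):
        extracted_row = extracted[r] if r < len(extracted) else []
        for c, want in enumerate(expected_row):
            # A missing row or column counts as an empty cell.
            got = extracted_row[c] if c < len(extracted_row) else None
            if normalize(got) == normalize(want):
                matches += 1
    return matches / total

# Illustrative data: one cell was lost during extraction.
expected = [["Item", "Cost"], ["Disk", "$80"], ["RAM", "$120"]]
extracted = [["Item", "Cost"], ["Disk", None], ["RAM", "$120"]]
print(f"{cell_accuracy(extracted, expected):.2f}")  # 0.83
```

A positional comparison like this penalizes a shifted column heavily, which is usually what you want: a table whose values sit under the wrong header is wrong even if every string was captured.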

Improving Detection Accuracy

Configure detection parameters when defaults fail:

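import pdfplumber

with pdfplumber.open("complex-report.pdf") as pdf:
    page = pdf.pages[0]
    tables = page.extract_tables(
        table_settings={
            # Build the grid from ruling lines drawn on the page.
            "vertical_strategy": "lines",
            "horizontal_strategy": "lines",
            # Extra line positions to add by hand; empty here.
            "explicit_vertical_lines": [],
            "explicit_horizontal_lines": [],
            # Treat crossing lines as intersecting if they come within 5 points.
            "intersection_tolerance": 5
        }
    )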

The "lines" strategy works when tables have clearly drawn borders. Without borders, fall back to the "text" strategy, which infers cell edges from word alignment, and tune its tolerance settings to match the document's spacing.
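For a borderless table, a settings dictionary along these lines is a reasonable starting point. The keys are pdfplumber table settings; the values are assumed starting points chosen for illustration, not recommendations from the library, and the constant name TEXT_STRATEGY_SETTINGS is hypothetical.

```python
# Assumed starting values for a borderless table; tune them per document.
TEXT_STRATEGY_SETTINGS = {
    "vertical_strategy": "text",    # infer column edges from aligned words
    "horizontal_strategy": "text",  # infer row edges from aligned words
    "snap_tolerance": 4,            # snap nearly parallel edges within 4 points together
    "join_tolerance": 4,            # join edge segments whose ends are within 4 points
    "min_words_vertical": 3,        # words that must share an alignment to imply a column edge
    "min_words_horizontal": 1,      # words that must share an alignment to imply a row edge
    "text_x_tolerance": 2,          # max horizontal gap between letters of one word
}

# Usage with a pdfplumber page object (not executed here):
#   tables = page.extract_tables(table_settings=TEXT_STRATEGY_SETTINGS)
```

Raising min_words_vertical tends to suppress spurious columns in prose-heavy pages, at the cost of missing short tables. Change one value at a time and re-check a sample page.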

Converting to Structured Formats

Export extracted tables as CSV or JSON:

import csv
import json

def table_to_csv(table_data, output_path):
    with open(output_path, "w", newline="") as f:
        writer = csv.writer(f)
        writer.writerows(table_data["rows"])

def table_to_json(table_data, output_path):
    header = table_data.get("header", [])
    with open(output_path, "w") as f:
        json.dump({
            "header": header,
            "rows": table_data["data"]
        }, f, indent=2)
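When downstream code wants one object per row rather than parallel lists, the header can be zipped onto each data row. This is a sketch under assumptions: table_to_records is a hypothetical helper, it expects the same "header" and "data" keys used above, and the column_N fallback name for blank headers is an arbitrary choice.

```python
import json

def table_to_records(table_data):
    """Turn a header row plus data rows into a list of dicts, one per row."""
    header = [
        (name or "").strip() or f"column_{i}"
        for i, name in enumerate(table_data.get("header", []), start=1)
    ]
    records = []
    for row in table_data.get("data", []):
        # Pad short rows so every header has a value; zip drops any extra cells.
        padded = list(row) + [None] * (len(header) - len(row))
        records.append(dict(zip(header, padded)))
    return records

# Illustrative input with a blank header cell and a short row.
sample = {"header": ["Item", None], "data": [["Disk", "$80"], ["RAM"]]}
print(json.dumps(table_to_records(sample)))
# [{"Item": "Disk", "column_2": "$80"}, {"Item": "RAM", "column_2": null}]
```

Note that duplicate header names would overwrite each other in the resulting dicts, so tables with repeated column titles need the names made unique first.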

Handling Complex Tables

Merged cells require post-processing to reconstruct relationships. Compare row lengths; shorter rows indicate merged cells in preceding positions.
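The row-length comparison can be sketched as follows. find_short_rows is a hypothetical helper, and it assumes the extractor really does return ragged rows; some extractors pad merged positions with None instead, in which case counting None cells per row is the equivalent check.

```python
from collections import Counter

def find_short_rows(rows):
    """Return (row_index, missing_cells) for rows narrower than the usual width."""
    if not rows:
        return []
    # Take the most common row length as the table's expected width.
    expected_width = Counter(len(row) for row in rows).most_common(1)[0][0]
    return [
        (i, expected_width - len(row))
        for i, row in enumerate(rows)
        if len(row) < expected_width
    ]

# Illustrative table where the third row lost a cell.
rows = [
    ["Region", "Q1", "Q2"],
    ["North", "1", "2"],
    ["3", "4"],
    ["South", "5", "6"],
]
print(find_short_rows(rows))  # [(2, 1)]
```

The function only flags suspect rows. Deciding which column the missing cell belongs to still requires looking at the layout, since length alone cannot say where the merge happened.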

Spanning cells need special handling. When a cell value appears once but occupies multiple columns in the visual layout, duplicate the value across all spanned columns.
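Duplicating a spanned value can be done with a left-to-right fill. This sketch assumes the extractor marks the positions covered by a span as None, and fill_spanned_cells is a hypothetical name. It is best applied to header rows only, since a None in a data row may be a genuinely empty cell.

```python
def fill_spanned_cells(row):
    """Copy each value rightward into the None cells that follow it."""
    filled = []
    last = None
    for cell in row:
        if cell is None and last is not None:
            # Position covered by a span: reuse the value to its left.
            filled.append(last)
        else:
            filled.append(cell)
            last = cell
    return filled

# Illustrative header where each year spans two columns.
header = ["Region", "2024", None, "2025", None]
print(fill_spanned_cells(header))
# ['Region', '2024', '2024', '2025', '2025']
```

Leading None cells are left untouched because there is no value to their left to copy.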

EXERCISE

Create a script that extracts tables from a multi-page PDF, identifies which tables contain financial data (by checking for currency symbols or numeric patterns), and exports only those tables to a JSON file with page numbers preserved.
