RUNLOCALAIv38

Independently operated catalog for local-AI hardware and software. Hand-written verdicts. Source-cited claims. Reproducible commands when we have them.

OP·Fredoline Eruo
© 2026 runlocalai.co. Independently operated.
Document Processing with Local AI

02. PDF Text Extraction

Chapter 2 of 18 · 20 min
KEY INSIGHT

Text extraction from born-digital PDFs is straightforward with PyMuPDF: call `get_text()` on each page. Multi-column layouts and tables are harder and require some document structure analysis.

### Extracting Text with PyMuPDF

PyMuPDF (historically imported as `fitz`; recent releases also accept `import pymupdf`) is a widely used library for PDF text extraction on local systems.

Installation:

```bash
pip install pymupdf
```

Basic extraction:

```python
import fitz  # PyMuPDF

doc = fitz.open("document.pdf")
for page_num, page in enumerate(doc):
    text = page.get_text()
    print(f"--- Page {page_num + 1} ---")
    print(text[:500])  # preview the first 500 characters
doc.close()
```

This works for straightforward documents but can return text in the wrong order on complex layouts.

### Handling Multi-Column Documents

Academic papers and legal documents often use multi-column layouts. Default extraction returns text in the order it is stored in the file, which can produce scrambled output. Use `get_text("blocks")` to inspect the layout:

```python
import fitz  # PyMuPDF

doc = fitz.open("paper.pdf")
page = doc[0]
blocks = page.get_text("blocks")
for block in blocks:
    x0, y0, x1, y1, text, block_no, block_type = block
    # block_type 0 = text, 1 = image
    if block_type == 0:
        print(f"Y:{y0:.0f} - {text[:100]}")
doc.close()
```

Sort blocks by `y0` (vertical position), then `x0` (horizontal position), to reconstruct a top-to-bottom, left-to-right reading order:

```python
def extract_by_layout(page):
    blocks = page.get_text("blocks")
    # Keep text blocks only, as (y0, x0, text) so tuples sort by y, then x
    text_blocks = [(b[1], b[0], b[4]) for b in blocks if b[6] == 0]
    text_blocks.sort()
    return "\n".join(text for _, _, text in text_blocks)
```

This restores order when blocks are stored out of sequence. On side-by-side columns, blocks at similar heights can still alternate between columns, so group blocks by column before sorting when that happens.

### Extracting Tables

Tables are notoriously difficult. PyMuPDF's `find_tables()` (available since version 1.23.0) attempts detection but often fails on complex formats:

```python
page = doc[0]
tables = page.find_tables()  # returns a TableFinder object
for table in tables.tables:
    for row in table.extract():  # each row is a list of cell values
        # Empty cells come back as None, so print them as empty strings
        print("\t".join("" if cell is None else str(cell) for cell in row))
```

For more reliable table extraction, consider `tabula-py` (requires Java) or `camelot` (installed as `camelot-py`). Table extraction is not a solved problem; expect to iterate.

### Handling Encodings

Extracted text sometimes contains stray control characters and inconsistent line endings. Handle them explicitly:

```python
text = page.get_text()
# Remove null bytes that break downstream processing
text = text.replace("\x00", "")
# Normalize line endings to "\n"
text = text.replace("\r\n", "\n").replace("\r", "\n")
```

### Extracting Metadata

PDF metadata includes standard document-information fields such as title, author, and creation date:

```python
import fitz  # PyMuPDF

doc = fitz.open("document.pdf")
meta = doc.metadata
print(f"Author: {meta.get('author')}")
print(f"Created: {meta.get('creationDate')}")
print(f"Pages: {len(doc)}")
doc.close()
```
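The `creationDate` value is typically returned as a raw PDF date string (for example `D:20240115103000+01'00'`) rather than a Python `datetime`. The sketch below shows one way to convert it using only the standard library. `parse_pdf_date` is a hypothetical helper, not part of PyMuPDF, and it assumes the common `D:YYYYMMDDHHmmSS` layout with an optional UTC offset; malformed values return `None`.

```python
import re
from datetime import datetime, timedelta, timezone

# Assumed layout: optional "D:" prefix, a 4-digit year, then optional
# 2-digit month/day/hour/minute/second fields, then an optional offset
# such as "Z", "+01'00'", or "-0500".
_PDF_DATE = re.compile(
    r"^(?:D:)?(\d{4})(\d{2})?(\d{2})?(\d{2})?(\d{2})?(\d{2})?"
    r"(?:([Zz+\-])(?:(\d{2})'?(\d{2})?'?)?)?$"
)


def parse_pdf_date(value):
    """Hypothetical helper: convert a PDF date string to a datetime, or None."""
    if not value:
        return None
    match = _PDF_DATE.match(value.strip())
    if match is None:
        return None
    year, month, day, hour, minute, second, sign, off_h, off_m = match.groups()
    try:
        tz = None  # no offset in the string -> naive datetime
        if sign in ("Z", "z"):
            tz = timezone.utc
        elif sign in ("+", "-"):
            delta = timedelta(hours=int(off_h or 0), minutes=int(off_m or 0))
            tz = timezone(delta if sign == "+" else -delta)
        # Missing fields fall back to the earliest valid value
        return datetime(
            int(year), int(month or 1), int(day or 1),
            int(hour or 0), int(minute or 0), int(second or 0),
            tzinfo=tz,
        )
    except ValueError:
        # Out-of-range field (e.g. month 13) or an impossible offset
        return None
```

With a document open, this could be called as `parse_pdf_date(doc.metadata.get("creationDate"))`. Returning `None` instead of raising keeps batch jobs running when a file carries an empty or nonstandard date.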

EXERCISE

Download a multi-page PDF (try an arXiv paper). Extract text using basic `get_text()` and observe any scrambled output. Rewrite using block-based extraction sorted by position. Compare the two outputs: both raw text length and perceived readability.
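For the block-based half of the exercise, note that a plain sort by `y0` then `x0` can still alternate between columns on a two-column paper. The sketch below is one possible workaround, not a general layout analyzer. It assumes exactly two columns split at the page midline and treats any block that crosses the midline (a title or abstract, for instance) as a full-width separator. `order_blocks_two_columns` is a hypothetical name; it takes tuples in the layout returned by `get_text("blocks")`.

```python
def order_blocks_two_columns(blocks, page_width):
    """Hypothetical helper: join text blocks column by column.

    Each block is (x0, y0, x1, y1, text, block_no, block_type),
    the tuple layout produced by page.get_text("blocks").
    Assumes two columns separated at the horizontal midline.
    """
    midline = page_width / 2
    # Text blocks only (block_type 0), top to bottom, then left to right
    text_blocks = sorted(
        (b for b in blocks if b[6] == 0),
        key=lambda b: (b[1], b[0]),
    )

    ordered, left, right = [], [], []
    for b in text_blocks:
        x0, x1 = b[0], b[2]
        if x0 < midline < x1:
            # Full-width block: emit the columns collected so far
            # (left column first), then the block itself.
            ordered.extend(left + right)
            left, right = [], []
            ordered.append(b)
        elif x1 <= midline:
            left.append(b)
        else:
            right.append(b)
    # Flush whatever remains below the last full-width block
    ordered.extend(left + right)
    return "\n".join(b[4].strip() for b in ordered)
```

On a real page this might be called as `order_blocks_two_columns(page.get_text("blocks"), page.rect.width)`. Pages with three or more columns, sidebars, or off-center gutters would need a different split rule.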
