02. PDF Text Extraction

Chapter 2 of 18 · 20 min

KEY INSIGHT

Text extraction from born-digital PDFs is trivial with PyMuPDF (just call `get_text()`), but handling multi-column layouts and tables requires some understanding of document structure analysis.

### Extracting Text with PyMuPDF

PyMuPDF (imported as `fitz`; recent releases also accept `import pymupdf`) is a widely used library for PDF text extraction on local systems. Installation:

```bash
pip install pymupdf
```

Basic extraction:

```python
import fitz

doc = fitz.open("document.pdf")
for page_num, page in enumerate(doc):
    text = page.get_text()  # plain text, in the order stored in the file
    print(f"--- Page {page_num + 1} ---")
    print(text[:500])  # preview the first 500 characters
doc.close()
```

This works for straightforward documents but can return text in the wrong order on complex layouts.

### Handling Multi-Column Documents

Academic papers and legal documents often use multi-column layouts. Default extraction follows the order in which text is stored in the file, which may not match reading order, so the output can come out scrambled. Use `get_text("blocks")` to inspect the layout:

```python
import fitz

doc = fitz.open("paper.pdf")
page = doc[0]
blocks = page.get_text("blocks")
for block in blocks:
    x0, y0, x1, y1, text, block_no, block_type = block
    # block_type 0 = text, 1 = image
    if block_type == 0:
        print(f"Y:{y0:.0f} - {text[:100]}")
doc.close()
```

Sorting blocks by `y0` (vertical position) and then `x0` (horizontal position) gives a top-to-bottom, left-to-right order:

```python
def extract_by_layout(page):
    blocks = page.get_text("blocks")
    # Keep text blocks only (block_type is the seventh field)
    text_blocks = [b for b in blocks if b[6] == 0]
    # Sort by top edge (y0), then left edge (x0)
    text_blocks.sort(key=lambda b: (b[1], b[0]))
    return "\n".join(b[4] for b in text_blocks)
```

This repairs pages whose blocks are stored out of sequence. On a true side-by-side layout, however, a pure `y0` sort interleaves the columns, so blocks need to be grouped by column first and then sorted by `y0` within each column.

### Extracting Tables

Tables are notoriously difficult. PyMuPDF's `find_tables()` (available from version 1.23.0) attempts detection but often fails on complex formats:

```python
import fitz

doc = fitz.open("document.pdf")
page = doc[0]
finder = page.find_tables()
for table in finder.tables:
    # extract() returns rows as lists of cell strings (None for empty cells)
    for row in table.extract():
        print("\t".join("" if cell is None else str(cell) for cell in row))
doc.close()
```

For more reliable table extraction, consider `tabula-py` (requires Java) or `camelot`. No tool extracts every table correctly, so expect to iterate.

### Handling Encodings

Extracted text sometimes contains stray null bytes or inconsistent line endings. Clean these up explicitly:

```python
text = page.get_text()
# Remove null bytes that break processing
text = text.replace("\x00", "")
# Normalize line endings
text = text.replace("\r\n", "\n").replace("\r", "\n")
```

### Extracting Metadata

`doc.metadata` returns a dictionary of standard document information fields such as author, title, and creation date:

```python
import fitz

doc = fitz.open("document.pdf")
meta = doc.metadata
print(f"Author: {meta.get('author')}")
print(f"Created: {meta.get('creationDate')}")
print(f"Pages: {len(doc)}")
doc.close()
```
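The `creationDate` value usually arrives as a raw PDF date string such as `D:20240131120000+01'00'` rather than a parsed date. Below is a minimal sketch of converting it to a `datetime`, assuming the common `D:YYYYMMDDHHmmSS` layout with an optional timezone suffix. `parse_pdf_date` is a hypothetical helper written for this chapter, not part of PyMuPDF.

```python
import re
from datetime import datetime, timedelta, timezone

# Assumed layout: optional "D:" prefix, a 4-digit year, then optional
# month/day/hour/minute/second and an optional timezone (Z, +HH'mm', -HH'mm').
_PDF_DATE = re.compile(
    r"^(?:D:)?(?P<year>\d{4})(?P<month>\d{2})?(?P<day>\d{2})?"
    r"(?P<hour>\d{2})?(?P<minute>\d{2})?(?P<second>\d{2})?"
    r"(?P<sign>[+\-Zz])?(?:(?P<tzh>\d{2})'?(?:(?P<tzm>\d{2})'?)?)?"
)


def parse_pdf_date(value):
    """Hypothetical helper: return a datetime for a PDF date string, or None."""
    if not value:
        return None
    match = _PDF_DATE.match(value.strip())
    if match is None:
        return None
    parts = match.groupdict()
    try:
        parsed = datetime(
            int(parts["year"]),
            int(parts["month"] or 1),   # missing fields fall back to defaults
            int(parts["day"] or 1),
            int(parts["hour"] or 0),
            int(parts["minute"] or 0),
            int(parts["second"] or 0),
        )
    except ValueError:
        # Out-of-range values such as month 13
        return None
    sign = parts["sign"]
    if sign is None:
        return parsed  # no timezone given: return a naive datetime
    if sign in "Zz":
        offset = timedelta(0)
    else:
        offset = timedelta(hours=int(parts["tzh"] or 0),
                           minutes=int(parts["tzm"] or 0))
        if sign == "-":
            offset = -offset
    return parsed.replace(tzinfo=timezone(offset))
```

With a document open, this would be called as `parse_pdf_date(doc.metadata.get("creationDate"))`. Returning `None` for missing or malformed values keeps a batch job running when a PDF has no usable date.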

EXERCISE

Download a multi-page PDF (try an arXiv paper). Extract text using basic `get_text()` and observe any scrambled output. Rewrite using block-based extraction sorted by position. Compare the two outputs on both raw text length and perceived readability.
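If the paper you pick is set in two columns, sorting purely by `y0` will interleave lines from the left and right columns. The sketch below is one rough way to order blocks column by column. It assumes exactly two columns split at the page midline, and it treats any block wider than 60% of the page as a full-width element such as a title. `order_two_column`, `page_width`, and `span_ratio` are illustrative names, not PyMuPDF APIs; the function works on the tuples returned by `page.get_text("blocks")`.

```python
def order_two_column(blocks, page_width, span_ratio=0.6):
    """Order text blocks for a simple two-column page (illustrative sketch).

    Each block is a tuple (x0, y0, x1, y1, text, block_no, block_type).
    Assumptions: two columns split at the page midline; blocks at least
    span_ratio of the page width wide are full-width elements.
    """
    midline = page_width / 2
    # Keep text blocks only, pre-sorted top-to-bottom then left-to-right
    text_blocks = sorted(
        (b for b in blocks if b[6] == 0),
        key=lambda b: (b[1], b[0]),
    )
    ordered, left, right = [], [], []
    for block in text_blocks:
        x0, x1 = block[0], block[2]
        if (x1 - x0) >= span_ratio * page_width:
            # A full-width block closes the columns collected so far
            ordered.extend(left + right)
            left, right = [], []
            ordered.append(block)
        elif (x0 + x1) / 2 < midline:
            left.append(block)
        else:
            right.append(block)
    # Flush whatever remains: whole left column first, then the right
    ordered.extend(left + right)
    return ordered
```

With PyMuPDF this could be called as `order_two_column(page.get_text("blocks"), page.rect.width)`, joining the fifth field of each returned block to get the page text. Pages with three or more columns, sidebars, or figures that break a column would need a more careful grouping step.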