03. PyMuPDF Deep Dive

Chapter 3 of 18 · 20 min

KEY INSIGHT

PyMuPDF's granular control over page rotation, image extraction, redaction, and annotations enables sophisticated document preprocessing directly within your pipeline.

### Beyond Basic Text Extraction

PyMuPDF provides low-level PDF manipulation, so preprocessing steps such as fixing orientation, pulling out images, and redacting content can run inside your pipeline without external tools.

### Image Extraction from PDFs

Extracting images embedded in a PDF:

```python
import fitz  # PyMuPDF

doc = fitz.open("document.pdf")
page = doc[0]

# List images referenced by the page; each entry starts with the image's xref
image_list = page.get_images(full=True)

for img_index, img in enumerate(image_list):
    xref = img[0]
    base_image = doc.extract_image(xref)
    image_bytes = base_image["image"]
    image_ext = base_image["ext"]

    with open(f"image_{img_index}.{image_ext}", "wb") as f:
        f.write(image_bytes)

doc.close()
```

This extracts raster images only. Vector graphics are stored as drawing commands rather than image objects and require different handling.

### Page Rotation and Orientation

Many scanned documents arrive rotated. The rotation stored in the page itself is available as `page.rotation`. Content that is rotated without that flag being set has to be inferred, here with a rough heuristic on text block positions. Pages without a text layer give the heuristic nothing to work with and are left unchanged.

```python
import fitz

def detect_rotation(page):
    """Return a suggested clockwise correction in degrees (0 or 90)."""
    # Keep text blocks only (block type 0); no text means no signal
    blocks = [b for b in page.get_text("blocks") if b[6] == 0]
    if not blocks:
        return 0

    # Crude heuristic: if most text blocks start in the right half of the page,
    # the content is probably rotated
    right_heavy = sum(1 for b in blocks if b[0] > page.rect.width / 2)
    if right_heavy > len(blocks) * 0.7:
        return 90
    return 0

doc = fitz.open("scanned.pdf")
for page in doc:
    correction = detect_rotation(page)
    if correction:
        # Add the correction to the rotation already stored on the page
        page.set_rotation((page.rotation + correction) % 360)

doc.save("corrected.pdf")
doc.close()
```

### Text Extraction Methods

PyMuPDF offers multiple text extraction modes with different output structures:

| Method | Use Case |
|--------|----------|
| `get_text()` | Raw text with optional block sorting |
| `get_text("dict")` | Structured dict with font info |
| `get_text("blocks")` | List of positioned text rectangles |
| `get_text("words")` | Individual words with positions |
| `get_text("rawdict")` | Low-level internal structure |

```python
import fitz

doc = fitz.open("document.pdf")
page = doc[0]

# Word-level extraction for precise positioning
words = page.get_text("words")
for word in words:
    x0, y0, x1, y1, content, block_no, line_no, word_no = word
    if "search_term" in content.lower():
        print(f"Found at ({x0:.0f}, {y0:.0f})")

doc.close()
```

### Redacting Content

Redaction permanently removes content, unlike an annotation, which only overlays it. Mark the areas to remove, then apply the redactions:

```python
import fitz

doc = fitz.open("document.pdf")
page = doc[0]

# Mark every occurrence of the target text, then apply all redactions at once
for rect in page.search_for("search_term"):
    page.add_redact_annot(rect, fill=(1, 1, 1))  # white fill over the removed area
page.apply_redactions()

doc.save("redacted.pdf")
doc.close()
```

### Creating New PDFs

Generate processed output by copying pages into a new document:

```python
import fitz

doc = fitz.open("source.pdf")
new_doc = fitz.open()  # empty PDF

for page in doc:
    # Create a blank page of the same size and draw the source page onto it
    new_page = new_doc.new_page(width=page.rect.width, height=page.rect.height)
    new_page.show_pdf_page(new_page.rect, doc, page.number)

new_doc.save("processed.pdf")
new_doc.close()
doc.close()
```

### Performance Considerations

Opening a document with `stream=` requires the entire file as an in-memory bytes object, which works against memory efficiency. For large files, open the document by path and yield each page's result as soon as it is produced instead of collecting everything first:

```python
import fitz

def iter_page_text(path):
    """Yield (page_number, text) one page at a time."""
    with fitz.open(path) as doc:
        for page in doc:
            # Process and yield immediately so results do not accumulate
            yield page.number, page.get_text()

for page_number, text in iter_page_text("large.pdf"):
    print(page_number, len(text))
```
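Word tuples from `get_text("words")` carry block, line, and word indices, which is enough to rebuild lines of text after filtering or reordering words. The sketch below is a minimal illustration and does not call PyMuPDF: `SAMPLE_WORDS` is made-up data in the same tuple shape, and `group_words_into_lines` is a hypothetical helper name, not part of the library.

```python
from collections import defaultdict

# Hypothetical sample data shaped like page.get_text("words") output:
# (x0, y0, x1, y1, text, block_no, line_no, word_no)
SAMPLE_WORDS = [
    (115.0, 100.0, 150.0, 112.0, "total:", 0, 0, 1),
    (72.0, 100.0, 110.0, 112.0, "Invoice", 0, 0, 0),
    (72.0, 120.0, 100.0, 132.0, "Due", 0, 1, 0),
    (105.0, 120.0, 140.0, 132.0, "today", 0, 1, 1),
    (300.0, 100.0, 340.0, 112.0, "Page", 1, 0, 0),
    (345.0, 100.0, 352.0, 112.0, "1", 1, 0, 1),
]

def group_words_into_lines(words):
    """Rebuild lines of text from word tuples, keyed by (block_no, line_no)."""
    lines = defaultdict(list)
    for x0, y0, x1, y1, text, block_no, line_no, word_no in words:
        lines[(block_no, line_no)].append((word_no, text))
    # Order lines by (block_no, line_no) and words within a line by word_no
    return [
        " ".join(text for _, text in sorted(entries))
        for _, entries in sorted(lines.items())
    ]

for line in group_words_into_lines(SAMPLE_WORDS):
    print(line)
```

Grouping by the indices rather than by coordinates keeps the helper simple, at the cost of trusting PyMuPDF's own block and line segmentation.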

EXERCISE

Take a PDF with multiple pages, different orientations, and embedded images. Write a script that: (1) detects rotated pages and rotates them, (2) extracts all embedded images to a folder, (3) extracts text from each page with word-level positions, (4) identifies pages containing specific keywords by position.
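For step (4), one possible starting point is to classify each keyword hit by where it sits on the page. This is a sketch under stated assumptions rather than a reference solution: `find_keyword_hits` is a hypothetical helper, the sample tuples are invented to match the shape of `page.get_text("words")` output, and the three-way top/middle/bottom split is an arbitrary choice. It relies on PyMuPDF page coordinates having their origin at the top-left, so smaller `y` values are nearer the top.

```python
def find_keyword_hits(words, keyword, page_height):
    """Return (text, x0, y0, region) for each word containing keyword."""
    needle = keyword.lower()
    hits = []
    for x0, y0, x1, y1, text, *_ in words:
        if needle not in text.lower():
            continue
        # Classify by the vertical centre of the word's bounding box
        centre = (y0 + y1) / 2
        if centre < page_height / 3:
            region = "top"
        elif centre < 2 * page_height / 3:
            region = "middle"
        else:
            region = "bottom"
        hits.append((text, x0, y0, region))
    return hits

# Hypothetical word tuples for a page 792 points tall
sample_words = [
    (72.0, 40.0, 150.0, 52.0, "Confidential", 0, 0, 0),
    (72.0, 380.0, 120.0, 392.0, "Report", 1, 0, 0),
    (72.0, 700.0, 150.0, 712.0, "confidential", 2, 0, 0),
]

for hit in find_keyword_hits(sample_words, "confidential", page_height=792.0):
    print(hit)
```

In a real script, `words` would come from `page.get_text("words")` and `page_height` from `page.rect.height` for each page.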