RUNLOCALAI · v38
Independently operated catalog for local-AI hardware and software. Hand-written verdicts. Source-cited claims. Reproducible commands when we have them.

OP·Fredoline Eruo
© 2026 runlocalai.co · Independently operated
Document Processing with Local AI

03. PyMuPDF Deep Dive

Chapter 3 of 18 · 20 min
KEY INSIGHT

PyMuPDF's granular control (page rotation, image extraction, redaction, annotations) enables sophisticated document preprocessing directly within your pipeline.

### Beyond Basic Text Extraction

PyMuPDF exposes low-level PDF manipulation, so preprocessing steps such as rotating pages, extracting images, and redacting content can run inside your pipeline without external tools.

### Image Extraction from PDFs

Extract the images embedded in a PDF page:

```python
import fitz

doc = fitz.open("document.pdf")
page = doc[0]

# List images on the page; the first item of each entry is the image xref
image_list = page.get_images(full=True)
for img_index, img in enumerate(image_list):
    xref = img[0]
    base_image = doc.extract_image(xref)
    image_bytes = base_image["image"]
    image_ext = base_image["ext"]
    # Write the raw image bytes using the format PyMuPDF reports
    with open(f"image_{img_index}.{image_ext}", "wb") as f:
        f.write(image_bytes)

doc.close()
```

This extracts raster images. Vector graphics require different handling.

### Page Rotation and Orientation

Many scanned documents arrive rotated. The sketch below guesses a correction from text block positions and applies it on top of the page's existing rotation. The heuristic is rough, so verify it against your own documents:

```python
import fitz

def detect_rotation(page):
    """Guess the extra rotation (in degrees) a page needs; 0 means none."""
    # Analyze text block positions for implicit rotation
    blocks = page.get_text("blocks")
    if not blocks:
        return 0
    # Heuristic: if most text starts on the right side of the page, likely rotated
    right_heavy = sum(1 for b in blocks if b[0] > page.rect.width / 2)
    if right_heavy > len(blocks) * 0.7:
        return 90
    return 0

doc = fitz.open("scanned.pdf")
for page in doc:
    correction = detect_rotation(page)
    if correction:
        # Add the correction to the page's current rotation attribute
        page.set_rotation((page.rotation + correction) % 360)
doc.save("corrected.pdf")
doc.close()
```

### Text Extraction Methods

PyMuPDF offers multiple text extraction modes with different output structures:

| Method | Use Case |
|--------|----------|
| `get_text()` | Plain text with optional block sorting |
| `get_text("dict")` | Structured dict with font info |
| `get_text("blocks")` | List of positioned text rectangles |
| `get_text("words")` | Individual words with positions |
| `get_text("rawdict")` | Low-level structure with per-character detail |

```python
import fitz

doc = fitz.open("document.pdf")

# Word-level extraction for precise positioning
page = doc[0]
words = page.get_text("words")
for word in words:
    x0, y0, x1, y1, content, block_no, line_no, word_no = word
    if "search_term" in content.lower():
        print(f"Found at ({x0:.0f}, {y0:.0f})")

doc.close()
```

### Redacting Content

Redaction permanently removes content, unlike an annotation, which only overlays it. Mark the rectangles to remove, then apply the redactions:

```python
import fitz

doc = fitz.open("document.pdf")
page = doc[0]

# Redact specific text ("confidential" is a placeholder search string)
for rect in page.search_for("confidential"):
    page.add_redact_annot(rect, fill=(1, 1, 1))

# Permanently remove everything under the marked rectangles
page.apply_redactions()
doc.save("redacted.pdf")
doc.close()
```

Passing `page.rect` to `add_redact_annot` would redact the entire page, so restrict the rectangles to the content you want removed.

### Creating New PDFs

Generate processed output by copying pages into a new document:

```python
import fitz

doc = fitz.open("source.pdf")
new_doc = fitz.open()  # empty PDF

for page in doc:
    # Create a page of the same size and draw the source page onto it
    new_page = new_doc.new_page(width=page.rect.width, height=page.rect.height)
    new_page.show_pdf_page(new_page.rect, doc, page.number)

new_doc.save("processed.pdf")
new_doc.close()
doc.close()
```

### Performance Considerations

For large files, avoid holding everything in memory at once. Open the document by path rather than reading the whole file into a byte stream, and process pages one at a time instead of accumulating results:

```python
import fitz

def iter_page_text(path):
    """Yield (page_number, text) one page at a time."""
    with fitz.open(path) as doc:
        for page in doc:
            # Process and yield immediately
            text = page.get_text()
            yield page.number, text

for page_number, text in iter_page_text("large.pdf"):
    print(page_number, len(text))
```
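The word tuples returned by `get_text("words")` carry block and line indices, so they can be regrouped into lines with bounding boxes when you need positions for whole phrases rather than single words. The following is a minimal sketch, assuming the 8-field tuple layout used in the word-level example above. `group_words_into_lines` is a hypothetical helper, not part of PyMuPDF, and the sample tuples are hand-written stand-ins for real output.

```python
from collections import defaultdict

def group_words_into_lines(words):
    """Group word tuples into lines keyed by (block_no, line_no).

    Each word is assumed to be an 8-field tuple:
    (x0, y0, x1, y1, text, block_no, line_no, word_no).
    Returns a list of (bbox, text) pairs in block/line order.
    """
    grouped = defaultdict(list)
    for w in words:
        grouped[(w[5], w[6])].append(w)

    lines = []
    for key in sorted(grouped):
        # Order words within a line by their word index
        line_words = sorted(grouped[key], key=lambda w: w[7])
        # The line's bounding box is the union of its word boxes
        bbox = (
            min(w[0] for w in line_words),
            min(w[1] for w in line_words),
            max(w[2] for w in line_words),
            max(w[3] for w in line_words),
        )
        lines.append((bbox, " ".join(w[4] for w in line_words)))
    return lines

# Hand-written sample tuples standing in for page.get_text("words") output
sample_words = [
    (72.0, 100.0, 110.0, 112.0, "Invoice", 0, 0, 0),
    (114.0, 100.0, 150.0, 112.0, "Total", 0, 0, 1),
    (72.0, 120.0, 100.0, 132.0, "Due", 0, 1, 0),
]
lines = group_words_into_lines(sample_words)
for bbox, text in lines:
    print(bbox, text)
```

With a real document, you would pass `page.get_text("words")` in place of `sample_words`. Grouping by the indices PyMuPDF reports avoids guessing line breaks from y coordinates, though it inherits whatever block segmentation the library produced.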

EXERCISE

Take a PDF with multiple pages, different orientations, and embedded images. Write a script that: (1) detects rotated pages and rotates them, (2) extracts all embedded images to a folder, (3) extracts text from each page with word-level positions, (4) identifies pages containing specific keywords by position.
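For step (4), one possible starting point is a pure-Python filter over word tuples that matches a keyword and restricts hits to a vertical region of the page. This is a sketch under stated assumptions, not a reference solution: `keyword_hits` and its `top_fraction` parameter are hypothetical names, the tuples follow the `get_text("words")` layout, and the page height would come from `page.rect.height` in a real script.

```python
def keyword_hits(words, keyword, page_height, top_fraction=1.0):
    """Return (x0, y0) positions of words that contain `keyword`.

    Only words whose top edge lies within the top `top_fraction` of the
    page are kept. PyMuPDF's y axis grows downward from the top of the page.
    """
    limit = page_height * top_fraction
    needle = keyword.lower()
    return [
        (w[0], w[1])
        for w in words
        if needle in w[4].lower() and w[1] <= limit
    ]

# Hand-written sample tuples standing in for page.get_text("words") output
sample_words = [
    (72.0, 50.0, 120.0, 62.0, "CONFIDENTIAL", 0, 0, 0),
    (72.0, 400.0, 130.0, 412.0, "confidential", 3, 0, 0),
    (72.0, 420.0, 110.0, 432.0, "report", 3, 1, 0),
]

# Every match on the page
all_hits = keyword_hits(sample_words, "confidential", page_height=792.0)
# Only matches in the top 10% of the page (a rough "header" region)
header_hits = keyword_hits(
    sample_words, "confidential", page_height=792.0, top_fraction=0.1
)
print(all_hits)
print(header_hits)
```

Running this per page and recording the page numbers with non-empty results gives the "pages containing specific keywords by position" list the exercise asks for.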
