01. Document Processing Overview

Chapter 1 of 18 · 15 min
EXERCISE

Inspect three documents: one born-digital PDF, one scanned PDF, and one image file (JPG/PNG). Run the first command on each PDF and the second on the image file, then compare the output:

# Check whether the first page of a PDF has embedded text (fitz is the PyMuPDF module: pip install pymupdf)
python3 -c "import fitz; doc = fitz.open('doc.pdf'); text = doc[0].get_text(); print(text[:200] if text.strip() else 'NO EMBEDDED TEXT')"

# Check image properties
python3 -c "from PIL import Image; img = Image.open('doc.jpg'); print(f'Size: {img.size}, Mode: {img.mode}')"

Categorize each document as born-digital (embedded text is present) or scanned (no embedded text; the image file belongs in this group). This classification determines which extraction method you use in subsequent chapters.
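
The same check can be scripted across every page of a file rather than only the first. The following is a minimal sketch, not a definitive classifier: the function names `classify_from_text` and `classify_document`, the `min_chars` threshold of 20, and the list of image suffixes are all assumptions made for illustration. Note that a scanned PDF that already carries an OCR text layer would be reported as born-digital by this heuristic.

```python
from pathlib import Path

# Assumed set of image extensions for this exercise (hypothetical, extend as needed).
IMAGE_SUFFIXES = {".jpg", ".jpeg", ".png"}


def classify_from_text(page_texts, min_chars=20):
    """Return 'born-digital' if any page holds at least min_chars
    non-whitespace characters of embedded text, else 'scanned'.

    min_chars is an arbitrary illustrative threshold, not a standard value.
    """
    for text in page_texts:
        if len("".join(text.split())) >= min_chars:
            return "born-digital"
    return "scanned"


def classify_document(path):
    """Classify a file by suffix, reading embedded text for PDFs."""
    suffix = Path(path).suffix.lower()
    if suffix in IMAGE_SUFFIXES:
        # Image files have no text layer, so they always need OCR.
        return "scanned"
    if suffix == ".pdf":
        import fitz  # PyMuPDF, imported lazily so image-only use needs no install

        with fitz.open(path) as doc:
            return classify_from_text(page.get_text() for page in doc)
    raise ValueError(f"Unsupported file type: {suffix!r}")


if __name__ == "__main__":
    import sys

    for name in sys.argv[1:]:
        print(f"{name}: {classify_document(name)}")
```

Saved under a name of your choice (for example `classify.py`, a hypothetical filename), it could be run as `python3 classify.py doc.pdf doc.jpg`. Separating the text-based decision from the file handling keeps the threshold easy to adjust when a scanned PDF contains only a few stray characters of embedded text.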