04. OCR with Tesseract

Chapter 4 of 18 · 25 min

KEY INSIGHT

Tesseract OCR requires preprocessing to achieve reliable results: raw scanned images typically yield 60-70% accuracy, while properly preprocessed images often exceed 95%.

### Installing and Configuring Tesseract

Tesseract is among the most widely used open-source OCR engines. Install both the engine and the Python bindings:

```bash
# Ubuntu/Debian
sudo apt install tesseract-ocr
pip install pytesseract

# macOS
brew install tesseract
pip install pytesseract

# Windows
# Download installer from github.com/UB-Mannheim/tesseract/wiki
# Add to PATH, then:
pip install pytesseract
```

Verify the installation:

```bash
tesseract --version
```

Tesseract supports multiple languages. Install additional language packs for non-English documents:

```bash
# Install French and German language data
sudo apt install tesseract-ocr-fra tesseract-ocr-deu
```

### Basic OCR with pytesseract

```python
import pytesseract
from PIL import Image

image = Image.open("scanned_page.png")
text = pytesseract.image_to_string(image)
print(text)
```

The single `image_to_string` call handles basic cases. Output quality depends heavily on input image quality.

### Image Preprocessing for OCR

Preprocessing dramatically affects accuracy. Key operations:

```python
from PIL import Image, ImageFilter, ImageOps

def preprocess_for_ocr(image_path, output_path=None):
    img = Image.open(image_path)

    # Convert to grayscale
    img = img.convert('L')

    # Increase contrast
    img = ImageOps.autocontrast(img)

    # Resize if too small (Tesseract works better on larger images)
    if min(img.size) < 1500:
        scale = 1500 / min(img.size)
        new_size = tuple(int(s * scale) for s in img.size)
        img = img.resize(new_size, Image.LANCZOS)

    # Apply sharpening
    img = img.filter(ImageFilter.SHARPEN)

    if output_path:
        img.save(output_path)
    return img
```

### Page Segmentation Modes

Tesseract uses different page segmentation modes (PSM) for different document types:

```python
# Assume a single uniform block of text
text = pytesseract.image_to_string(img, config='--psm 6')

# Treat the image as a single text line
text = pytesseract.image_to_string(img, config='--psm 7')

# Treat the image as a single word
text = pytesseract.image_to_string(img, config='--psm 8')

# Sparse text: find as much text as possible, in no particular order
text = pytesseract.image_to_string(img, config='--psm 11')

# Treat the image as a single character
text = pytesseract.image_to_string(img, config='--psm 10')
```

A rough aspect-ratio heuristic can choose a mode automatically:

```python
def auto_psm(img):
    width, height = img.size
    aspect = width / height

    # Wide, short strip: likely a single line (e.g. a form field)
    if aspect > 3:
        return '--psm 7'
    # Tall page: likely a book page with one block of text
    elif aspect < 0.8:
        return '--psm 6'
    # Standard document: fully automatic segmentation (Tesseract's default)
    else:
        return '--psm 3'
```

### Extracting Structure

To get more than plain text, extract per-word bounding boxes and confidence scores:

```python
data = pytesseract.image_to_data(img, output_type=pytesseract.Output.DICT)

for i, text in enumerate(data['text']):
    if text.strip():
        # Some pytesseract versions return confidences as strings
        conf = float(data['conf'][i])
        x, y, w, h = data['left'][i], data['top'][i], data['width'][i], data['height'][i]
        print(f"'{text}' (conf: {conf:.0f}%) at ({x}, {y})")
```

### Handling Multi-column Documents

For multi-column layouts, split the image before OCR. The function below assumes equal-width columns:

```python
def split_columns(img, num_cols=2):
    width, height = img.size
    col_width = width // num_cols
    columns = []
    for i in range(num_cols):
        left = i * col_width
        # The last column absorbs any leftover pixels
        right = (i + 1) * col_width if i < num_cols - 1 else width
        column = img.crop((left, 0, right, height))
        columns.append(column)
    return columns

# Process each column separately
for i, col in enumerate(split_columns(img)):
    text = pytesseract.image_to_string(col, config='--psm 6')
    print(f"Column {i + 1}: {text[:200]}")
```

### Common Failure Modes

Tesseract fails in predictable ways:

- **Inverted images** (light text on a dark background) produce gibberish. Detect them by checking the average pixel value and invert if needed.
- **Excessive noise** confuses recognition. Apply thresholding after deskewing.
- **Skewed text** reduces accuracy. Correct skew before OCR.
- **Low resolution** loses detail. Upscale before processing.
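
The inverted-image check from the list above might be sketched as follows. This is a minimal sketch, assuming dark text on a light background is the target polarity and that background pixels dominate the page; the function name `ensure_dark_on_light` and the threshold of 127 are illustrative choices, not part of Tesseract or pytesseract.

```python
from PIL import Image, ImageOps, ImageStat

def ensure_dark_on_light(img, threshold=127):
    """Return a grayscale copy with dark text on a light background.

    Assumption: the background covers most of the page, so a low mean
    brightness suggests a light-on-dark (inverted) scan.
    """
    gray = img.convert("L")
    mean_brightness = ImageStat.Stat(gray).mean[0]
    if mean_brightness < threshold:
        # Mostly dark page: flip polarity before OCR
        gray = ImageOps.invert(gray)
    return gray
```

A page that is mostly dark for other reasons (a large photo, heavy shadows) would trip this check too, so the threshold is worth tuning on real samples.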

EXERCISE

Take a poorly scanned document (for example, a photographed receipt or an angled scan). Write a preprocessing pipeline that: (1) converts to grayscale, (2) rotates to the correct orientation, (3) applies deskew correction, (4) binarizes with adaptive thresholding, (5) runs OCR with an appropriate PSM. Compare output quality at each stage to identify which preprocessing step has the most impact.
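
For the comparison step, one possible yardstick is the mean word confidence reported by `pytesseract.image_to_data`. This is a minimal sketch, assuming the dictionary layout used in the Extracting Structure section (`text` and `conf` lists of equal length); `mean_confidence` is a hypothetical helper name. Confidence is only a proxy for accuracy and does not replace checking the text against a hand-typed reference.

```python
def mean_confidence(data):
    """Average Tesseract confidence over recognized words.

    `data` is assumed to be the dict returned by
    pytesseract.image_to_data(img, output_type=pytesseract.Output.DICT).
    Entries with empty text or a confidence of -1 (non-word boxes) are skipped.
    """
    scores = []
    for text, conf in zip(data["text"], data["conf"]):
        conf = float(conf)  # some pytesseract versions return strings
        if text.strip() and conf >= 0:
            scores.append(conf)
    # Return 0.0 when nothing was recognized, rather than dividing by zero
    return sum(scores) / len(scores) if scores else 0.0
```

Calling it on the `image_to_data` output after each preprocessing stage gives one number per stage to compare.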