06. Image Preprocessing

Chapter 6 of 18 · 25 min

KEY INSIGHT

OCR accuracy depends more on preprocessing quality than on the OCR engine itself. Invest time in proper image preparation and you can get by with a less capable (faster) OCR model.

### Why Preprocessing Matters

Raw scanned documents contain noise, skew, uneven lighting, and artifacts. Preprocessing transforms these into clean inputs that OCR engines can handle reliably. The improvement from good preprocessing typically exceeds the improvement from switching OCR engines.

### Binarization: Converting to Black and White

Binarization converts grayscale or color images to pure black and white. This suppresses background variation and standardizes the input.

```python
import cv2
import numpy as np

def binarize(image_path, method='otsu'):
    img = cv2.imread(image_path, cv2.IMREAD_GRAYSCALE)
    if img is None:
        raise FileNotFoundError(f"Could not read image: {image_path}")

    if method == 'otsu':
        # Otsu's method finds a single global threshold automatically
        _, binary = cv2.threshold(img, 0, 255, cv2.THRESH_BINARY + cv2.THRESH_OTSU)
    elif method == 'adaptive':
        # Adaptive threshold handles uneven lighting: each pixel is compared
        # with a Gaussian-weighted mean of its 11x11 neighborhood, minus 2
        binary = cv2.adaptiveThreshold(img, 255, cv2.ADAPTIVE_THRESH_GAUSSIAN_C,
                                       cv2.THRESH_BINARY, 11, 2)
    else:
        raise ValueError(f"Unknown method: {method!r}")

    cv2.imwrite('binary.png', binary)
    return binary
```

Adaptive thresholding excels on documents with uneven lighting, which is common in photographs of documents.

### Deskewing: Correcting Rotation

Skewed documents reduce OCR accuracy.
Detect and correct skew:

```python
import cv2
import numpy as np

def deskew(image_path):
    img = cv2.imread(image_path, cv2.IMREAD_GRAYSCALE)
    if img is None:
        raise FileNotFoundError(f"Could not read image: {image_path}")

    # Apply Gaussian blur to reduce noise
    blurred = cv2.GaussianBlur(img, (5, 5), 0)

    # Edge detection
    edges = cv2.Canny(blurred, 50, 150, apertureSize=3)

    # Hough line transform to find lines
    lines = cv2.HoughLines(edges, 1, np.pi / 180, 200)
    if lines is None:
        return img  # No lines found: return the grayscale image unchanged

    # Collect line angles relative to horizontal
    angles = []
    for line in lines[:20]:  # Sample first 20 lines
        rho, theta = line[0]
        angle = np.degrees(theta) - 90
        # Skip near-vertical lines (page borders, table rules), which
        # would otherwise pull the estimate toward +/-90 degrees
        if abs(angle) < 45:
            angles.append(angle)
    if not angles:
        return img

    # The median is less sensitive to stray lines than the mean
    median_angle = np.median(angles)

    # Rotate image around its center
    h, w = img.shape
    center = (w // 2, h // 2)
    rotation_matrix = cv2.getRotationMatrix2D(center, median_angle, 1.0)
    rotated = cv2.warpAffine(img, rotation_matrix, (w, h),
                             flags=cv2.INTER_CUBIC,
                             borderMode=cv2.BORDER_REPLICATE)

    cv2.imwrite('deskewed.png', rotated)
    return rotated
```

Typical deskew angles range from -5° to +5°. Angles outside this range often indicate scanning errors rather than document skew.

### Noise Reduction

Scanner noise and compression artifacts can be misread as text.
Apply morphological operations:

```python
import cv2
import numpy as np

def denoise(image_path):
    img = cv2.imread(image_path, cv2.IMREAD_GRAYSCALE)

    # Morphological opening (remove small white noise)
    kernel = np.ones((2, 2), np.uint8)
    opened = cv2.morphologyEx(img, cv2.MORPH_OPEN, kernel)

    # Morphological closing (remove small black noise)
    closed = cv2.morphologyEx(opened, cv2.MORPH_CLOSE, kernel)

    cv2.imwrite('denoised.png', closed)
    return closed
```

For heavy noise, apply bilateral filtering, which preserves edges while smoothing:

```python
# img is a grayscale image loaded as in denoise() above
# Arguments: neighborhood diameter 9, sigmaColor 75, sigmaSpace 75
denoised = cv2.bilateralFilter(img, 9, 75, 75)
```

### Contrast Enhancement

Enhance text visibility with adaptive histogram equalization:

```python
def enhance_contrast(image_path):
    img = cv2.imread(image_path, cv2.IMREAD_GRAYSCALE)

    # CLAHE (Contrast Limited Adaptive Histogram Equalization)
    clahe = cv2.createCLAHE(clipLimit=2.0, tileGridSize=(8, 8))
    enhanced = clahe.apply(img)

    cv2.imwrite('enhanced.png', enhanced)
    return enhanced
```

CLAHE handles documents with both bright and dark regions better than global histogram equalization.

### Complete Preprocessing Pipeline

```python
import cv2
import numpy as np

def preprocess_document(input_path, output_path):
    # Load image
    img = cv2.imread(input_path)
    if img is None:
        raise FileNotFoundError(f"Could not read image: {input_path}")
    gray = cv2.cvtColor(img, cv2.COLOR_BGR2GRAY)

    # Step 1: Denoise
    denoised = cv2.bilateralFilter(gray, 9, 75, 75)

    # Step 2: Binarize with adaptive threshold
    binary = cv2.adaptiveThreshold(denoised, 255, cv2.ADAPTIVE_THRESH_GAUSSIAN_C,
                                   cv2.THRESH_BINARY, 11, 2)

    # Step 3: Morphological cleanup
    # (2x2 kernel; a 1x1 kernel would leave the image unchanged)
    kernel = np.ones((2, 2), np.uint8)
    cleaned = cv2.morphologyEx(binary, cv2.MORPH_CLOSE, kernel)

    # Step 4: Deskew
    # ... (deskew function from above) ...
    # deskew_from_image is not defined in this chapter; it stands for a
    # variant of deskew() that takes an image array instead of a file path
    deskewed = deskew_from_image(cleaned)

    cv2.imwrite(output_path, deskewed)
    return output_path
```

EXERCISE

Take a poorly scanned document (a phone photo of a printed receipt works well). Create a preprocessing script that applies each transformation incrementally, running OCR after each step. Measure character accuracy at each stage against a hand-typed transcription of the document. Identify which transformation has the greatest impact on your specific document type. This tells you which preprocessing steps to prioritize in production.
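The exercise does not say how to measure character accuracy. One common choice, sketched here as an assumption rather than the course's prescribed metric, is one minus the edit distance between the OCR output and a reference transcription, divided by the reference length. The function names `levenshtein` and `character_accuracy` are illustrative.

```python
def levenshtein(a: str, b: str) -> int:
    """Minimum number of single-character insertions, deletions,
    and substitutions needed to turn a into b."""
    if len(a) < len(b):
        a, b = b, a  # keep the inner row as short as possible
    previous = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        current = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            current.append(min(previous[j] + 1,          # deletion
                               current[j - 1] + 1,       # insertion
                               previous[j - 1] + cost))  # substitution
        previous = current
    return previous[-1]

def character_accuracy(reference: str, ocr_output: str) -> float:
    """Accuracy in [0, 1]: 1 - edit_distance / len(reference)."""
    if not reference:
        return 1.0 if not ocr_output else 0.0
    distance = levenshtein(reference, ocr_output)
    return max(0.0, 1.0 - distance / len(reference))
```

Call `character_accuracy(reference_text, ocr_text)` after each preprocessing stage, with the same reference text every time, and compare the scores to see which step helps most.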