KEY INSIGHT
OCR accuracy often depends more on preprocessing quality than on the OCR engine itself. Invest time in proper image preparation and you can often get by with a lighter, faster OCR model.
### Why Preprocessing Matters
Raw scanned documents contain noise, skew, uneven lighting, and artifacts. Preprocessing transforms these into clean inputs that OCR engines can handle reliably. The improvement from good preprocessing typically exceeds the improvement from switching OCR engines.
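To check whether preprocessing really helps more than an engine swap on your own documents, you need a number to compare. A common choice is character error rate (CER): the edit distance between the OCR output and a hand-checked reference, divided by the reference length. The sketch below is a minimal pure-Python version; the function names are illustrative and not part of any OCR library.

```python
def levenshtein(a, b):
    """Edit distance between two strings (insertions, deletions, substitutions)."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1]


def character_error_rate(reference, hypothesis):
    """CER = edit distance / reference length. Lower is better."""
    if not reference:
        raise ValueError("reference text must be non-empty")
    return levenshtein(reference, hypothesis) / len(reference)
```

Run the same OCR engine on the raw and the preprocessed image, compute the CER of each output against the same reference text, and compare the two values.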
### Binarization: Converting to Black and White
Binarization converts grayscale or color images to pure black and white. This removes noise and standardizes input.
```python
import cv2


def binarize(image_path, method='otsu'):
    img = cv2.imread(image_path, cv2.IMREAD_GRAYSCALE)
    if img is None:
        raise FileNotFoundError(f"Could not read image: {image_path}")
    if method == 'otsu':
        # Otsu's method picks a single global threshold automatically
        _, binary = cv2.threshold(img, 0, 255, cv2.THRESH_BINARY + cv2.THRESH_OTSU)
    elif method == 'adaptive':
        # Adaptive thresholding computes a local threshold per neighborhood
        # (block size 11, constant 2), which handles uneven lighting
        binary = cv2.adaptiveThreshold(img, 255, cv2.ADAPTIVE_THRESH_GAUSSIAN_C,
                                       cv2.THRESH_BINARY, 11, 2)
    else:
        raise ValueError(f"Unknown method: {method!r}")
    cv2.imwrite('binary.png', binary)
    return binary
```
Adaptive thresholding excels on documents with uneven lighting, which is common in photographs of documents.
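To make the Otsu branch above less of a black box, here is a minimal NumPy sketch of the criterion it relies on: try every gray level as a threshold and keep the one that maximizes the variance between the two resulting classes. This is an illustration of the idea, not OpenCV's implementation, and `otsu_threshold` is a hypothetical name.

```python
import numpy as np


def otsu_threshold(gray):
    """Return the gray level that maximizes between-class variance.

    `gray` is expected to be a uint8 array (values 0..255).
    """
    hist = np.bincount(gray.ravel(), minlength=256).astype(np.float64)
    prob = hist / hist.sum()
    levels = np.arange(256)

    w0 = np.cumsum(prob)                 # weight of the class "<= t"
    w1 = 1.0 - w0                        # weight of the class "> t"
    cum_mean = np.cumsum(prob * levels)  # cumulative mean up to t
    total_mean = cum_mean[-1]

    denom = w0 * w1
    with np.errstate(divide="ignore", invalid="ignore"):
        between = (total_mean * w0 - cum_mean) ** 2 / denom
    between[denom == 0] = 0.0            # thresholds that leave one class empty
    return int(np.argmax(between))
```

Because a single threshold is chosen for the whole image, a page whose background brightness drifts from one side to the other has no good global value, which is where the adaptive method helps.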
### Deskewing: Correcting Rotation
Skewed documents reduce OCR accuracy. Detect and correct skew:
```python
import cv2
import numpy as np


def deskew(image_path):
    img = cv2.imread(image_path, cv2.IMREAD_GRAYSCALE)
    if img is None:
        raise FileNotFoundError(f"Could not read image: {image_path}")
    # Apply Gaussian blur to reduce noise
    blurred = cv2.GaussianBlur(img, (5, 5), 0)
    # Edge detection
    edges = cv2.Canny(blurred, 50, 150, apertureSize=3)
    # Hough line transform to find lines
    lines = cv2.HoughLines(edges, 1, np.pi / 180, 200)
    if lines is None:
        # No lines detected: return the grayscale image unchanged
        return img
    # Collect each line's angle relative to horizontal
    # (theta is the angle of the line's normal, so subtract 90 degrees)
    angles = []
    for line in lines[:20]:  # Sample first 20 lines
        rho, theta = line[0]
        angle = np.degrees(theta) - 90
        angles.append(angle)
    # Use the median so a few stray lines do not dominate the estimate
    median_angle = np.median(angles)
    # Rotate image around its center
    h, w = img.shape
    center = (w // 2, h // 2)
    rotation_matrix = cv2.getRotationMatrix2D(center, median_angle, 1.0)
    rotated = cv2.warpAffine(img, rotation_matrix, (w, h),
                             flags=cv2.INTER_CUBIC,
                             borderMode=cv2.BORDER_REPLICATE)
    cv2.imwrite('deskewed.png', rotated)
    return rotated
```
Typical deskew angles range from -5° to +5°. Angles outside this range often indicate scanning errors rather than document skew.
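One way to act on that range is to discard implausible line angles before taking the median, and to skip rotation when nothing plausible remains. The sketch below assumes this policy; the name `robust_skew` and the choice to return 0.0 (no rotation) are illustrative assumptions, not something the section prescribes.

```python
from statistics import median


def robust_skew(angles_deg, limit_deg=5.0):
    """Median of the angles within +/- limit_deg, or 0.0 if none qualify."""
    plausible = [a for a in angles_deg if abs(a) <= limit_deg]
    if not plausible:
        # Nothing in the expected range: likely a scanning problem or
        # vertical rules/borders, so leave the page unrotated.
        return 0.0
    return median(plausible)
```

For example, a page with a vertical table border produces a line angle near -90°, which would otherwise drag the estimate far away from the true skew.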
### Noise Reduction
Scanner noise and compression artifacts can be misread as text. Apply morphological operations to remove small specks:
```python
import cv2
import numpy as np


def denoise(image_path):
    img = cv2.imread(image_path, cv2.IMREAD_GRAYSCALE)
    # Morphological opening (remove small white noise)
    kernel = np.ones((2, 2), np.uint8)
    opened = cv2.morphologyEx(img, cv2.MORPH_OPEN, kernel)
    # Morphological closing (remove small black noise)
    closed = cv2.morphologyEx(opened, cv2.MORPH_CLOSE, kernel)
    cv2.imwrite('denoised.png', closed)
    return closed
```
For heavy noise, apply bilateral filtering, which smooths the image while preserving edges:
```python
denoised = cv2.bilateralFilter(img, 9, 75, 75)
```
### Contrast Enhancement
Enhance text visibility through histogram equalization:
```python
import cv2


def enhance_contrast(image_path):
    img = cv2.imread(image_path, cv2.IMREAD_GRAYSCALE)
    # CLAHE (Contrast Limited Adaptive Histogram Equalization)
    # equalizes each 8x8 tile separately and clips the histogram to limit noise
    clahe = cv2.createCLAHE(clipLimit=2.0, tileGridSize=(8, 8))
    enhanced = clahe.apply(img)
    cv2.imwrite('enhanced.png', enhanced)
    return enhanced
```
CLAHE handles documents with both bright and dark regions better than global histogram equalization.
### Complete Preprocessing Pipeline
```python
import cv2
import numpy as np


def preprocess_document(input_path, output_path):
    # Load image and convert to grayscale
    img = cv2.imread(input_path)
    if img is None:
        raise FileNotFoundError(f"Could not read image: {input_path}")
    gray = cv2.cvtColor(img, cv2.COLOR_BGR2GRAY)
    # Step 1: Denoise
    denoised = cv2.bilateralFilter(gray, 9, 75, 75)
    # Step 2: Binarize with adaptive threshold
    binary = cv2.adaptiveThreshold(denoised, 255, cv2.ADAPTIVE_THRESH_GAUSSIAN_C,
                                   cv2.THRESH_BINARY, 11, 2)
    # Step 3: Morphological cleanup
    # NOTE: a 1x1 kernel leaves the image unchanged; use (2, 2) or larger
    # if specks remain after thresholding
    kernel = np.ones((1, 1), np.uint8)
    cleaned = cv2.morphologyEx(binary, cv2.MORPH_CLOSE, kernel)
    # Step 4: Deskew
    # Assumes a deskew_from_image() helper: the deskew() logic above,
    # adapted to take an image array instead of a file path
    deskewed = deskew_from_image(cleaned)
    cv2.imwrite(output_path, deskewed)
    return output_path
```