RUNLOCALAI · v38
Independently operated catalog for local-AI hardware and software. Hand-written verdicts. Source-cited claims. Reproducible commands when we have them.

OP·Fredoline Eruo
© 2026 runlocalai.co · Independently operated
Document Processing with Local AI

04. OCR with Tesseract

Chapter 4 of 18 · 25 min
KEY INSIGHT

Tesseract OCR requires preprocessing to achieve reliable results: raw scanned images produce 60-70% accuracy, while properly preprocessed images often exceed 95%.

### Installing and Configuring Tesseract

Tesseract is the most widely used open-source OCR engine. Install both the engine and the Python bindings:

```bash
# Ubuntu/Debian
sudo apt install tesseract-ocr
pip install pytesseract

# macOS
brew install tesseract
pip install pytesseract

# Windows
# Download installer from github.com/UB-Mannheim/tesseract/wiki
# Add to PATH, then:
pip install pytesseract
```

Verify the installation:

```bash
tesseract --version
```

Tesseract supports multiple languages. Install additional language packs for non-English documents:

```bash
# Install French and German language data
sudo apt install tesseract-ocr-fra tesseract-ocr-deu
```

### Basic OCR with pytesseract

```python
import pytesseract
from PIL import Image

image = Image.open("scanned_page.png")
text = pytesseract.image_to_string(image)
print(text)
```

This single call handles basic cases. The output quality depends heavily on the quality of the input image.

### Image Preprocessing for OCR

Preprocessing dramatically affects accuracy. Key operations:

```python
from PIL import Image, ImageFilter, ImageOps


def preprocess_for_ocr(image_path, output_path=None):
    img = Image.open(image_path)

    # Convert to grayscale
    img = img.convert('L')

    # Increase contrast
    img = ImageOps.autocontrast(img)

    # Resize if too small (Tesseract works better on larger images)
    if min(img.size) < 1500:
        scale = 1500 / min(img.size)
        new_size = tuple(int(s * scale) for s in img.size)
        img = img.resize(new_size, Image.LANCZOS)

    # Apply sharpening
    img = img.filter(ImageFilter.SHARPEN)

    # Optionally save the result for inspection
    if output_path:
        img.save(output_path)

    return img
```

### Page Segmentation Modes

Tesseract uses different page segmentation modes (PSM) for different document types:

```python
# Assume a single uniform block of text
text = pytesseract.image_to_string(img, config='--psm 6')

# Treat as a single text line
text = pytesseract.image_to_string(img, config='--psm 7')

# Treat as a single word
text = pytesseract.image_to_string(img, config='--psm 8')

# Sparse text: find as much text as possible, in no particular order
text = pytesseract.image_to_string(img, config='--psm 11')

# Treat as a single character
text = pytesseract.image_to_string(img, config='--psm 10')
```

A rough heuristic can pick a mode from the image's aspect ratio:

```python
def auto_psm(img):
    width, height = img.size
    aspect = width / height

    # Very wide strip: likely a single line of text (e.g. a form field)
    if aspect > 3:
        return '--psm 7'
    # Tall image: likely a book page with one block of text
    elif aspect < 0.8:
        return '--psm 6'
    # Standard document: fully automatic page segmentation
    else:
        return '--psm 3'
```

### Extracting Structure

To get more than plain text, extract bounding boxes and confidence scores:

```python
data = pytesseract.image_to_data(img, output_type=pytesseract.Output.DICT)

for i, text in enumerate(data['text']):
    # Skip layout rows, which carry no text
    if text.strip():
        # Coerce to float in case the value is not already numeric
        conf = float(data['conf'][i])
        x, y, w, h = data['left'][i], data['top'][i], data['width'][i], data['height'][i]
        print(f"'{text}' (conf: {conf:.0f}%) at ({x}, {y})")
```

### Handling Multi-column Documents

For multi-column layouts, split the image before OCR:

```python
def split_columns(img, num_cols=2):
    # Assumes equal-width columns; the last column absorbs any remainder
    width, height = img.size
    col_width = width // num_cols
    columns = []

    for i in range(num_cols):
        left = i * col_width
        right = (i + 1) * col_width if i < num_cols - 1 else width
        column = img.crop((left, 0, right, height))
        columns.append(column)

    return columns


# Process each column separately
for i, col in enumerate(split_columns(img)):
    text = pytesseract.image_to_string(col, config='--psm 6')
    print(f"Column {i + 1}: {text[:200]}")
```

### Common Failure Modes

Tesseract fails in predictable ways:

- **Inverted images** produce gibberish. Detect by checking the average pixel value and invert if needed.
- **Excessive noise** confuses recognition. Apply thresholding after deskewing.
- **Skewed text** reduces accuracy. Correct skew before OCR.
- **Low resolution** loses detail. Upscale before processing.
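The first two failure modes can be checked from a grayscale histogram alone. The sketch below is one possible approach, not the chapter's own method: the function names are hypothetical, the cutoff of 128 is an illustrative default, and Otsu's method is swapped in as one common way to choose a global threshold (the text above only says "thresholding"). It assumes a 256-bin histogram such as the one Pillow returns for an `'L'` mode image.

```python
def mean_intensity(hist):
    """Average gray level from a 256-bin histogram (0 = black, 255 = white)."""
    total = sum(hist)
    if total == 0:
        raise ValueError("empty histogram")
    return sum(level * count for level, count in enumerate(hist)) / total


def looks_inverted(hist, cutoff=128):
    """Guess whether the page is light text on a dark background.

    Assumption: a normal scan is mostly light background, so a low mean
    gray level suggests inverted polarity. The cutoff is illustrative.
    """
    return mean_intensity(hist) < cutoff


def otsu_threshold(hist):
    """Return the gray level t that maximizes between-class variance.

    Pixels with value <= t form one class, pixels > t the other.
    """
    total = sum(hist)
    sum_all = sum(level * count for level, count in enumerate(hist))
    best_t, best_var = 0, -1.0
    w_low = 0    # pixel count at or below the candidate threshold
    sum_low = 0  # intensity sum at or below the candidate threshold

    for t, count in enumerate(hist):
        w_low += count
        sum_low += t * count
        if w_low == 0:
            continue  # no pixels in the lower class yet
        w_high = total - w_low
        if w_high == 0:
            break  # every pixel is already in the lower class
        mean_low = sum_low / w_low
        mean_high = (sum_all - sum_low) / w_high
        between_var = w_low * w_high * (mean_low - mean_high) ** 2
        if between_var > best_var:
            best_var, best_t = between_var, t

    return best_t
```

With Pillow, the histogram of a grayscale image comes from `img.histogram()`. If `looks_inverted` returns true, `ImageOps.invert(img)` flips the polarity, and `img.point(lambda p: 255 if p > t else 0)` binarizes at the threshold `t` returned by `otsu_threshold`. A global threshold like this tends to struggle with uneven lighting, which is where adaptive thresholding becomes the better choice.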

EXERCISE

Take a poorly scanned document (for example, a photographed receipt or an angled scan). Write a preprocessing pipeline that: (1) converts to grayscale, (2) rotates to the correct orientation, (3) binarizes with adaptive thresholding, (4) applies deskew correction, (5) runs OCR with an appropriate PSM. Compare output quality at each stage to identify which preprocessing step has the most impact.
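For the comparison step, one option is to score each stage by its mean word confidence. The sketch below is a scoring helper only, not a solution to the exercise. It assumes each stage's result is a dictionary shaped like the output of `pytesseract.image_to_data(..., output_type=pytesseract.Output.DICT)`, with parallel `text` and `conf` lists. The function names are hypothetical, and confidence is only a proxy: it is not ground-truth accuracy.

```python
def mean_word_confidence(data):
    """Average confidence over recognized words in an image_to_data-style dict.

    Rows with empty text or a confidence of -1 (layout rows rather than
    words) are skipped. Returns None when no words were recognized.
    """
    scores = []
    for text, conf in zip(data["text"], data["conf"]):
        conf = float(conf)  # defensive: accept numeric strings as well
        if text.strip() and conf >= 0:
            scores.append(conf)
    return sum(scores) / len(scores) if scores else None


def rank_stages(stage_results):
    """Sort {stage_name: data_dict} by mean confidence, best first.

    Stages with no recognized words (score None) are placed last.
    """
    scored = {name: mean_word_confidence(d) for name, d in stage_results.items()}
    return sorted(
        scored.items(),
        key=lambda item: (item[1] is None, -(item[1] or 0.0)),
    )
```

Calling `rank_stages({"grayscale": data_gray, "deskewed": data_deskewed})` with the dictionaries from two stages returns the stage names paired with their scores. Where a hand-typed transcription of the document is available, comparing the OCR text against it gives a more trustworthy measure than confidence alone.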

← Chapter 3
PyMuPDF Deep Dive
Chapter 5 →
OCR with AI Models