05. OCR with AI Models

Chapter 5 of 18 · 25 min

KEY INSIGHT

Modern OCR models (TrOCR, EasyOCR) handle imperfect input better than Tesseract, but they typically need a GPU for reasonable speed and produce different error patterns. Know when to use which approach.

### The AI OCR Landscape

Tesseract remains the fastest option for clean documents but struggles with challenging inputs. AI-based OCR models built on transformer architectures handle imperfect images better because they rely on learned features.

Three primary options for local AI OCR:

- **TrOCR** (Microsoft): Encoder-decoder transformer, excels at handwriting
- **EasyOCR**: Multi-language support, balanced speed/accuracy
- **PaddleOCR**: Fast, good Chinese support, quantized models available

### TrOCR for Document Recognition

TrOCR pairs a vision transformer encoder with a text transformer decoder. It is best for structured documents and handwriting. It recognizes one cropped line of text at a time, so multi-line pages need a separate text-detection or line-segmentation step first:

```bash
pip install transformers torch pillow
```

```python
from transformers import TrOCRProcessor, VisionEncoderDecoderModel
from PIL import Image

# Load the processor and model once; the first call downloads the weights
processor = TrOCRProcessor.from_pretrained("microsoft/trocr-base-handwritten")
model = VisionEncoderDecoderModel.from_pretrained("microsoft/trocr-base-handwritten")

def ocr_trocr(image_path):
    # TrOCR expects an RGB image cropped to a single line of text
    image = Image.open(image_path).convert("RGB")
    pixel_values = processor(images=image, return_tensors="pt").pixel_values
    # Generate token IDs, then decode them to a string
    generated_ids = model.generate(pixel_values)
    generated_text = processor.batch_decode(generated_ids, skip_special_tokens=True)[0]
    return generated_text

text = ocr_trocr("handwritten_notes.jpg")
print(text)
```

The base model requires about 2 GB of VRAM. Larger models improve accuracy but increase memory requirements.

### EasyOCR for Multi-language Support

EasyOCR supports 80+ languages and handles mixed-language documents:

```bash
pip install easyocr
```

```python
import easyocr

# Set gpu=False to run on CPU
reader = easyocr.Reader(['en', 'de', 'fr'], gpu=True)
results = reader.readtext("multilingual_doc.png")

# Each result is a (bounding box, text, confidence) tuple
for (bbox, text, confidence) in results:
    print(f"{text} (conf: {confidence:.2f})")
```

Results include bounding boxes for each detected text region.
This is useful for document understanding tasks beyond simple extraction.

### PaddleOCR for Speed

PaddleOCR emphasizes inference speed while maintaining accuracy:

```bash
pip install paddlepaddle paddleocr
```

```python
from paddleocr import PaddleOCR

ocr = PaddleOCR(use_angle_cls=True, lang='en')
results = ocr.ocr("document.png")

# results[0] holds the lines for the first (only) image;
# it can be None when no text is detected
for line in results[0] or []:
    bbox, (text, confidence) = line
    print(text)
```

### Comparing Approaches

| Engine | Speed (CPU) | Accuracy | Memory | Best For |
|--------|-------------|----------|--------|----------|
| Tesseract | Fast | Medium | Low | Clean documents, batch processing |
| TrOCR | Slow | High | High | Handwriting, structured forms |
| EasyOCR | Medium | High | Medium | Multi-language, varied quality |
| PaddleOCR | Fast | High | Medium | Production pipelines, Chinese |

### Hybrid Pipelines

Combine approaches for more reliable results:

```python
import easyocr
import pytesseract
from PIL import Image

# Build the reader once; constructing it loads the models and is slow
reader = easyocr.Reader(['en'], gpu=False)

def hybrid_ocr(image_path, min_confidence=0.7):
    # Try EasyOCR first: it reports a confidence score per text region
    try:
        results = reader.readtext(image_path)  # list of (bbox, text, confidence)
        if not results:
            raise ValueError("No text detected")
        avg_conf = sum(conf for _, _, conf in results) / len(results)
        # If average confidence is low, fall back to Tesseract
        if avg_conf < min_confidence:
            raise ValueError("Low confidence")
        return "\n".join(text for _, text, _ in results)
    except Exception:
        # Fallback to Tesseract
        with Image.open(image_path) as img:
            return pytesseract.image_to_string(img)
```

### Quantized Models for CPU

When no GPU is available, use quantized models:

```python
import easyocr
from paddleocr import PaddleOCR

# EasyOCR applies dynamic quantization on CPU when quantize=True (the default);
# model_storage_directory only controls where the weights are cached
reader = easyocr.Reader(['en'], gpu=False, quantize=True,
                        model_storage_directory='./models')

# PaddleOCR without TensorRT, which is a GPU-only accelerator
ocr = PaddleOCR(use_tensorrt=False, use_angle_cls=True)
```

Expect processing to be roughly 2-3x slower than on a GPU. Output quality should stay close to identical, though quantization can change results slightly, so spot-check accuracy on your own documents.
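
The speed figures in this chapter depend heavily on hardware, so it is worth measuring them on your own machine. The following is a minimal sketch, assuming each engine is wrapped in a function that takes an image path and returns text. The name `time_ocr` and its parameters are hypothetical and not part of any OCR library:

```python
import time
from statistics import mean

def time_ocr(ocr_fn, image_paths, repeats=3):
    """Return the mean seconds per image for ocr_fn over image_paths.

    ocr_fn is any callable that accepts an image path (hypothetical wrapper
    around an OCR engine). Each image is processed `repeats` times.
    """
    if not image_paths:
        raise ValueError("image_paths must not be empty")
    per_image = []
    for path in image_paths:
        runs = []
        for _ in range(repeats):
            start = time.perf_counter()
            ocr_fn(path)
            runs.append(time.perf_counter() - start)
        # Keep the fastest run to reduce noise from warm-up and other processes
        per_image.append(min(runs))
    return mean(per_image)
```

Calling `time_ocr` once per engine on the same set of images gives directly comparable numbers. Model loading should happen outside the timed function, otherwise the first run dominates the result.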

EXERCISE

Take a challenging document (old newspaper scan, restaurant menu with decorative fonts, mixed-language invoice). Process it with Tesseract (optimized), EasyOCR, and TrOCR. Calculate the word error rate for each engine by comparing its output against a manually created ground truth. Document which engine performs best and why, and use your findings as a decision guide in future projects.
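
The word error rate for this exercise can be computed as the word-level edit distance between the ground truth and the OCR output, divided by the number of words in the ground truth. Below is a minimal sketch in pure Python. The function name `word_error_rate` is illustrative, and splitting on whitespace is an assumption; you may want to normalize case and punctuation first:

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level Levenshtein distance divided by the reference word count."""
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # prev[j] is the edit distance between the reference words processed
    # so far and the first j hypothesis words
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            cost = 0 if ref_word == hyp_word else 1
            curr.append(min(
                prev[j] + 1,         # deletion: reference word missing from output
                curr[j - 1] + 1,     # insertion: extra word in output
                prev[j - 1] + cost,  # substitution or exact match
            ))
        prev = curr
    return prev[-1] / len(ref)


# One substitution and one deletion against four reference words
print(word_error_rate("the quick brown fox", "the quack brown"))  # 0.5
```

A value of 0.0 means a perfect match. The result can exceed 1.0 when the engine outputs many more words than the ground truth contains.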