07. OCR with Vision Models

Chapter 7 of 18 · 20 min

KEY INSIGHT

Vision-language models perform OCR through learned visual patterns rather than explicit character recognition. They excel at contextual understanding but struggle with precise text extraction compared to dedicated OCR engines. Vision models approach text recognition differently than Tesseract or similar OCR engines. Instead of pixel-to-character mapping, they learn to "read" as part of their language understanding. This produces more human-like interpretation but with different trade-offs. ```python def extract_text_vision_model(model, processor, image_path, context_aware=True): image = Image.open(image_path).convert("RGB") if context_aware: prompt = """Transcribe ALL text visible in this image. Preserve line breaks and formatting. Include every word, number, and symbol exactly as written.""" else: prompt = "What text do you see? List all words." conversation = [ { "role": "user", "content": [ {"type": "image"}, {"type": "text", "text": prompt} ] } ] prompt = processor.apply_chat_template(conversation, add_generation_prompt=True) inputs = processor( images=image, text=prompt, return_tensors="pt" ).to(model.device) with torch.no_grad(): output = model.generate( **inputs, max_new_tokens=500, do_sample=False # Deterministic for text extraction ) return processor.batch_decode(output, skip_special_tokens=True)[0] ``` Compare vision model OCR with dedicated engines: ```python def hybrid_ocr(image_path, use_fallback=True): """Combine vision model with Tesseract for optimal results.""" # First attempt: Vision model vision_text = extract_text_vision_model(model, processor, image_path) if use_fallback: # Fallback: Tesseract for exact transcription import pytesseract tesseract_text = pytesseract.image_to_string( Image.open(image_path), output_type=pytesseract.Output.STRING ) return { "vision_model": vision_text, "tesseract": tesseract_text, "combined": f"{vision_text}\n\n---Tesseract---\n{tesseract_text}" } return vision_text ``` Text extraction across document types: ```python document_prompts = { "screenshot": "Extract all visible UI text, labels, buttons, and any other textual elements.", "receipt": "Extract the itemized list, prices, totals, and vendor information.", "document": "Extract the full document text maintaining paragraph structure.", "signage": "Transcribe all visible text including size indicators if present." } ``` Performance characteristics: - **Handwriting**: Poor performance; consider specialized handwriting models - **Low resolution**: Text below 12px height becomes unreadable - **Perspective distortion**: Models often fail to correct tilted text - **Multi-column layouts**: Column order may be confused

EXERCISE

Create a document processing pipeline that attempts text extraction and flags confidence levels. Manually verify a sample of outputs to understand failure patterns.