Computer vision

Optical Character Recognition (OCR)

Optical Character Recognition (OCR) converts images of text (scanned documents, photos, or screenshots) into machine-readable text. In local AI, OCR models run on the operator's own hardware, so no image data is sent to cloud services and privacy is preserved. Operators encounter OCR when processing scanned PDFs, receipts, or screenshots with dedicated engines such as Tesseract or with vision-language models such as Llama 3.2 Vision. The output is typically a plain text string, often accompanied by per-word or per-line bounding boxes that make it possible to reconstruct the layout. Speed and accuracy depend on image quality, font variation, and model size: a small OCR engine runs fast on a CPU, while a vision LLM typically needs several gigabytes of GPU VRAM.
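To illustrate how bounding boxes help preserve layout, here is a minimal sketch that rebuilds reading order from word boxes. The Word schema, the sample coordinates, and the 10-pixel tolerance are all assumptions made for illustration; real engines report boxes in their own formats.

```python
from dataclasses import dataclass


@dataclass
class Word:
    """One recognized word with its pixel bounding box (hypothetical schema)."""
    text: str
    left: int
    top: int
    width: int
    height: int


def words_to_lines(words, tolerance=10):
    """Group words into text lines by vertical position, then read left to right."""
    lines = []  # each entry is a list of Word objects on the same visual line
    for word in sorted(words, key=lambda w: w.top):
        # Stay on the current line if the word's top edge is within
        # `tolerance` pixels of the first word on that line.
        if lines and abs(word.top - lines[-1][0].top) <= tolerance:
            lines[-1].append(word)
        else:
            lines.append([word])
    return [
        " ".join(w.text for w in sorted(line, key=lambda w: w.left))
        for line in lines
    ]


# Sample boxes in scrambled order, as an engine might return them.
words = [
    Word("Total:", 40, 203, 60, 18),
    Word("Invoice", 40, 100, 80, 18),
    Word("$42.00", 120, 200, 70, 18),
    Word("#1001", 130, 102, 60, 18),
]
print(words_to_lines(words))  # ['Invoice #1001', 'Total: $42.00']
```

A fixed pixel tolerance is the simplest possible grouping rule; skewed scans or multi-column pages would need something more robust.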

Deeper dive

Traditional OCR engines such as Tesseract work as a pipeline: binarization, layout analysis and segmentation, then recognition. Tesseract's legacy engine segments individual characters and matches them against patterns, while version 4 and later recognize whole text lines with an LSTM neural network. Modern approaches use transformer-based vision-language models (VLMs) like Llama 3.2 Vision or Qwen2-VL, which treat OCR as a visual question answering task, for example 'What text is in this image?' These models handle complex layouts, handwriting, and mixed text, but require more compute: a 7B VLM needs roughly 4 GB of VRAM for its weights at Q4, plus overhead for the vision encoder and context, and runs at ~10-20 tok/s on an RTX 4090. For batch processing of many documents, lightweight OCR engines (Tesseract, EasyOCR) are faster and more memory-efficient. Operators trade speed (CPU-based Tesseract) against accuracy on difficult inputs (GPU-based VLM) according to their hardware and latency tolerance.
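The VRAM figures above follow from simple arithmetic on parameter count and bits per weight. The sketch below is a rough estimate for weights only; the 4.5 bits-per-weight average for Q4-style quantization is an assumption, and real usage adds the vision encoder, context cache, and runtime overhead.

```python
def weight_memory_gb(params_billion, bits_per_weight):
    """Approximate memory for model weights alone, in gigabytes (10^9 bytes)."""
    total_bytes = params_billion * 1e9 * bits_per_weight / 8
    return total_bytes / 1e9


# Assumption: Q4-style quantization averages about 4.5 bits per weight
# once scaling metadata is included; the exact figure varies by format.
for name, size in [("7B", 7), ("11B", 11)]:
    print(f"{name} at ~4.5 bits: {weight_memory_gb(size, 4.5):.1f} GB")
# 7B at ~4.5 bits: 3.9 GB
# 11B at ~4.5 bits: 6.2 GB
```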

Practical example

An operator scans a multi-page contract and saves each page as an image, since Tesseract reads image files rather than PDFs. Running tesseract page.png output extracts the text to output.txt in seconds on CPU (Tesseract appends the .txt extension to the output name itself). For a handwritten note, they switch to Llama 3.2 11B Vision with ollama run llama3.2-vision:11b and prompt 'Read the text in this image.' followed by the image path. The VLM uses ~7 GB VRAM at Q4 and takes ~30 seconds per page on an RTX 4090, but captures cursive script that Tesseract misses.
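The same VLM request can be scripted against Ollama's local HTTP API instead of the interactive prompt. This is a minimal sketch using only the standard library; it assumes an Ollama server on the default port with the model already pulled, and the helper names build_ocr_request and ocr_image are made up for this example.

```python
import base64
import json
import urllib.request

OLLAMA_URL = "http://localhost:11434/api/generate"  # Ollama's default local endpoint


def build_ocr_request(image_bytes, model="llama3.2-vision:11b",
                      prompt="Read the text in this image."):
    """Build the JSON body for a single non-streaming generate call."""
    return {
        "model": model,
        "prompt": prompt,
        # The API expects images as base64-encoded strings.
        "images": [base64.b64encode(image_bytes).decode("ascii")],
        "stream": False,
    }


def ocr_image(path, url=OLLAMA_URL):
    """Send one image to a running Ollama server and return the model's text."""
    with open(path, "rb") as f:
        payload = build_ocr_request(f.read())
    request = urllib.request.Request(
        url,
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(request) as response:
        return json.loads(response.read())["response"]
```

Calling ocr_image("note.png"), with note.png standing in for the operator's own file, returns the transcription as a string. Keeping the payload builder separate makes it possible to inspect the request without a server running.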

Workflow example

In a local RAG pipeline, an operator runs ollama run llama3.2-vision:11b to OCR a scanned invoice, then embeds the extracted text and stores it in a vector database. They may also use pytesseract in a Python script: import pytesseract; text = pytesseract.image_to_string('invoice.png'). For batch processing, they run a shell loop: for f in *.png; do tesseract "$f" stdout >> all_text.txt; done. They monitor VRAM usage with nvidia-smi to ensure the VLM doesn't exceed available memory.
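The batch step can also be written in Python so that the OCR output is chunked and ready for embedding. The sketch below is one possible arrangement, not a prescribed pipeline: the function names, the 500-character chunk size, and the (file name, chunk) record shape are assumptions, and the embedding and vector database steps are left out. The OCR backend is passed in as a parameter so that a VLM-based function could replace pytesseract.

```python
from pathlib import Path


def tesseract_ocr(path):
    """Default OCR backend: pytesseract, imported lazily so it stays optional."""
    import pytesseract
    return pytesseract.image_to_string(str(path))


def chunk_text(text, max_chars=500):
    """Split text into chunks of about max_chars, breaking on whitespace.

    A single word longer than max_chars is kept whole in its own chunk.
    """
    chunks, current = [], ""
    for word in text.split():
        candidate = f"{current} {word}".strip()
        if len(candidate) > max_chars and current:
            chunks.append(current)
            current = word
        else:
            current = candidate
    if current:
        chunks.append(current)
    return chunks


def ocr_folder(folder, ocr=tesseract_ocr, max_chars=500):
    """OCR every PNG in a folder and return (file name, chunk) records."""
    records = []
    for path in sorted(Path(folder).glob("*.png")):
        for chunk in chunk_text(ocr(path), max_chars):
            records.append((path.name, chunk))
    return records
```

Each record keeps the source file name next to its text, so a later retrieval step can point back to the page a chunk came from.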

Reviewed by Fredoline Eruo. See our editorial policy.
