RUNLOCALAI · v38

Independently operated catalog for local-AI hardware and software. Hand-written verdicts. Source-cited claims. Reproducible commands when we have them.

OP·Fredoline Eruo
DISCLOSURE

Some links on this site are affiliate links (Amazon Associates and other first-class retailers). When you buy through them, we earn a small commission at no extra cost to you. Affiliate links do not influence our verdicts: there are cards we rate highly that we don't have affiliate relationships with, and cards that sell well that we refuse to recommend.

© 2026 runlocalai.co · Independently operated
Computer vision

Optical Character Recognition (OCR)

Optical Character Recognition (OCR) is the process of converting images of text (scanned documents, photos, or screenshots) into machine-readable text. In local AI, OCR runs on your own hardware, so images are never sent to cloud services, which preserves privacy. Operators encounter OCR when processing PDFs, receipts, or screenshots with tools like Tesseract or with vision-language models (e.g., Llama 3.2 Vision) that can read text from images. The output is typically plain text, often accompanied by per-word or per-line bounding boxes that preserve the page layout. Accuracy depends on image quality and font variation; speed and memory use depend on model size: a small OCR engine runs fast on CPU, while a vision LLM may need GPU VRAM.
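To see what layout-preserving output looks like, Tesseract can write word-level results as TSV (for example `tesseract page.png page tsv`, which produces page.tsv): one row per detected element, with columns for its page, block, paragraph, line, and word number, its bounding box, a confidence score, and the text. The sketch below is a minimal standard-library parser that groups word rows back into lines with one enclosing box each. The function name `words_by_line` is hypothetical, and the code assumes Tesseract's standard TSV header.

```python
import csv
import io
from collections import defaultdict


def words_by_line(tsv_text: str) -> list[dict]:
    """Group word rows (level 5) from Tesseract TSV output into text lines.

    Each returned line has its text plus a bounding box
    (left, top, right, bottom) enclosing all of its words.
    """
    # QUOTE_NONE: treat quote characters in recognized text as ordinary text
    reader = csv.DictReader(io.StringIO(tsv_text), delimiter="\t",
                            quoting=csv.QUOTE_NONE)
    lines = defaultdict(list)  # (page, block, paragraph, line) -> word rows
    for row in reader:
        # Level 5 rows are individual words; skip structural rows and blanks
        if row["level"] != "5" or not (row["text"] or "").strip():
            continue
        key = (int(row["page_num"]), int(row["block_num"]),
               int(row["par_num"]), int(row["line_num"]))
        lines[key].append(row)

    result = []
    for key in sorted(lines):  # reading order: page, block, paragraph, line
        words = sorted(lines[key], key=lambda r: int(r["word_num"]))
        lefts = [int(w["left"]) for w in words]
        tops = [int(w["top"]) for w in words]
        rights = [int(w["left"]) + int(w["width"]) for w in words]
        bottoms = [int(w["top"]) + int(w["height"]) for w in words]
        result.append({
            "text": " ".join(w["text"] for w in words),
            "box": (min(lefts), min(tops), max(rights), max(bottoms)),
        })
    return result
```

Keeping a box per line lets a later step restore columns, tables, or form fields in reading order instead of flattening the page into one string.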

Deeper dive

Traditional OCR engines such as Tesseract run a fixed pipeline: binarization (converting the image to black and white), layout analysis to find blocks and text lines, and recognition, either by segmenting characters and matching them against learned patterns (Tesseract's legacy engine) or with an LSTM neural network that reads whole lines at once (Tesseract 4 and later). Newer approaches use transformer-based vision-language models (VLMs) such as Llama 3.2 Vision or Qwen2-VL, which treat OCR as a visual question answering task: the image is paired with a prompt like 'What text is in this image?' These models handle complex layouts, handwriting, and mixed text, but they need more compute: a 7B VLM needs about 4 GB of VRAM for its weights at Q4 and generates around 10-20 tok/s on an RTX 4090, with exact figures depending on the runtime, quantization, and image resolution. For batch processing of many documents, lightweight OCR engines (Tesseract, EasyOCR) are faster and use less memory. Operators trade speed (CPU-based Tesseract) against accuracy on difficult inputs (GPU-based VLM) according to their hardware and latency tolerance.
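The binarization stage can be illustrated with Otsu's method, a classic global-threshold technique that picks the gray level that best separates dark ink from light paper by maximizing the variance between the two classes. This is a minimal pure-Python sketch over a flat list of 8-bit grayscale values; it is not necessarily the method Tesseract uses internally, and the names `otsu_threshold` and `binarize` are hypothetical.

```python
def otsu_threshold(pixels: list[int]) -> int:
    """Return the gray level t (0-255) that maximizes between-class variance.

    Pixels <= t form one class (ink), pixels > t the other (paper).
    """
    hist = [0] * 256
    for p in pixels:
        hist[p] += 1
    total = len(pixels)
    sum_all = sum(i * hist[i] for i in range(256))

    best_t, best_var = 0, -1.0
    weight_bg = 0  # number of pixels with value <= t
    sum_bg = 0     # sum of gray levels with value <= t
    for t in range(256):
        weight_bg += hist[t]
        if weight_bg == 0:
            continue  # no pixels at or below t yet
        weight_fg = total - weight_bg
        if weight_fg == 0:
            break  # every pixel is on one side; no further split possible
        sum_bg += t * hist[t]
        mean_bg = sum_bg / weight_bg
        mean_fg = (sum_all - sum_bg) / weight_fg
        # Between-class variance (up to a constant factor of 1 / total**2)
        between = weight_bg * weight_fg * (mean_bg - mean_fg) ** 2
        if between > best_var:
            best_var, best_t = between, t
    return best_t


def binarize(pixels: list[int], threshold: int) -> list[int]:
    """Map pixels at or below the threshold to 0 (ink) and the rest to 255 (paper)."""
    return [0 if p <= threshold else 255 for p in pixels]
```

A single global threshold works well on evenly lit scans; photos with shadows or uneven lighting usually call for an adaptive (per-region) threshold instead.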

Practical example

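An operator scans a multi-page contract to one PNG image per page (Tesseract reads image files, not PDFs). Running `tesseract page.png output` on a page extracts its text to output.txt in seconds on CPU; Tesseract appends the .txt extension to the output base name itself. For a handwritten note, they switch to Llama 3.2 11B Vision: they start `ollama run llama3.2-vision:11b` and enter a prompt such as 'Read the text in this image.' followed by the image's file path, which the Ollama CLI attaches to the request. The VLM uses ~7 GB of VRAM at Q4 and takes ~30 seconds per page on an RTX 4090, but it captures cursive script that Tesseract misses.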
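The VRAM figures quoted in this entry can be sanity-checked with back-of-envelope arithmetic: weight memory ≈ parameters × bits per weight ÷ 8. The sketch assumes a Q4_0-style layout (blocks of 32 four-bit weights sharing one 16-bit scale, i.e. 4.5 bits per weight); other Q4 variants differ slightly, and the helper name `weight_memory_gb` is hypothetical.

```python
def weight_memory_gb(params_billion: float, bits_per_weight: float) -> float:
    """Approximate memory for model weights alone, in decimal gigabytes."""
    # billions of parameters * bits per weight / 8 bits per byte = gigabytes
    return params_billion * bits_per_weight / 8


# Assumption: Q4_0-style blocks of 32 four-bit weights plus one 16-bit scale,
# so each weight costs 4 + 16/32 = 4.5 bits on average.
Q4_BITS = 4 + 16 / 32

for name, size in [("7B VLM", 7), ("11B VLM", 11)]:
    print(f"{name}: ~{weight_memory_gb(size, Q4_BITS):.1f} GB of weights at Q4")
```

This gives about 3.9 GB for a 7B model, in line with the ~4 GB figure, and about 6.2 GB of weights for an 11B model. The rest of the ~7 GB observed for Llama 3.2 11B Vision would typically come from the KV cache, runtime buffers, and any tensors kept at higher precision than Q4.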

Workflow example

In a local RAG pipeline, an operator runs `ollama run llama3.2-vision:11b` to OCR a scanned invoice, then embeds the extracted text and stores it in a vector database. They may also use pytesseract, a Python wrapper that requires the tesseract binary to be installed: `import pytesseract; text = pytesseract.image_to_string('invoice.png')`. For batch processing, a shell loop such as `for f in *.png; do tesseract "$f" stdout >> all_text.txt; done` appends every page's text to one file. They monitor VRAM with nvidia-smi to make sure the VLM does not exceed available memory.
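In a scripted pipeline, calling Ollama's local REST API is usually more convenient than the interactive CLI. Below is a minimal standard-library sketch, assuming Ollama is serving on its default port 11434 and that llama3.2-vision:11b has already been pulled; the function names `build_ocr_payload` and `ocr_with_ollama` are hypothetical.

```python
import base64
import json
import urllib.request

# Assumption: Ollama's default local endpoint for single-prompt generation
OLLAMA_URL = "http://localhost:11434/api/generate"


def build_ocr_payload(image_bytes: bytes,
                      model: str = "llama3.2-vision:11b",
                      prompt: str = "Read the text in this image.") -> dict:
    """Build a non-streaming /api/generate request carrying one image."""
    return {
        "model": model,
        "prompt": prompt,
        # Multimodal models take images as a list of base64-encoded strings
        "images": [base64.b64encode(image_bytes).decode("ascii")],
        "stream": False,  # return one JSON object instead of a token stream
    }


def ocr_with_ollama(path: str) -> str:
    """Send an image file to a running Ollama server and return the model's text."""
    with open(path, "rb") as f:
        payload = build_ocr_payload(f.read())
    req = urllib.request.Request(
        OLLAMA_URL,
        data=json.dumps(payload).encode("utf-8"),  # data= makes this a POST
        headers={"Content-Type": "application/json"},
    )
    # Generous timeout: a VLM can take tens of seconds per page
    with urllib.request.urlopen(req, timeout=300) as resp:
        return json.loads(resp.read())["response"]
```

With the server running, `text = ocr_with_ollama("invoice.png")` returns the transcription as a string, ready to be chunked, embedded, and stored in the vector database.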

Reviewed by Fredoline Eruo. See our editorial policy.
