OCR / Document Text Extraction

Extracting text from images, PDFs, screenshots, and handwritten documents. Modern multimodal LLMs (Qwen2.5-VL, InternVL, GPT-4V) increasingly outperform specialized OCR engines on complex layouts.

Capability notes

OCR in 2026 splits into specialized engines and multimodal LLMs, each dominating different document types.

**Specialized engines.** PaddleOCR (Baidu, Apache 2.0) achieves 97–99% character accuracy on clean printed English/Chinese documents — the gold standard for structured digitization. It handles rotated text up to 45°, curved text on packaging, and multi-column layouts with 90%+ structure preservation. Weaknesses: handwriting accuracy drops to 75–85%, and low-contrast text (below ~30% contrast ratio) degrades rapidly. Tesseract 5 (Google, Apache 2.0) scores 85–95% on clean printed English with LSTM recognition — adequate for basic scanning, but it requires pre-processing (binarization, deskewing) on anything below pristine scan quality. It covers 100+ languages; non-Latin scripts score 10–15% lower than Latin ones.

**Multimodal LLMs.** Qwen2.5-VL and [Llama 3.2 Vision](/models/llama-3-3-70b) extract text from complex documents — receipts, forms, handwritten notes, screenshots. Printed accuracy: 95–98% on clean English, 90–95% on medium-quality scans. LLMs excel at understanding structure — they identify headers vs body vs footnotes, extract table cells with row/column relationships, and handle handwriting-on-printed-forms (signatures over printed lines) that confuse specialized OCR. Weaknesses: speed (3–10 seconds/page vs PaddleOCR's 0.3–1 second), cost ([GPU inference](/tools/vllm) vs CPU), and consistency — the same page processed twice yields slightly different extraction on borderline-legible text.

**Accuracy by document type.**

| Document type | PaddleOCR | Tesseract | Multimodal LLM |
|---|---|---|---|
| Printed English invoices | 99% | 93% | 98% |
| Handwritten notes | 75% | 55% | 85% |
| Faded receipts | 82% | 65% | 88% |

The pattern: multimodal LLMs win on degraded and heterogeneous documents; specialized OCR wins on clean, high-volume scanning.

If you just want to try this

Lowest-friction path to a working setup.

Install [LM Studio](/tools/lm-studio), search "llama-3.2-vision", and download the 11B instruct at Q4_K_M (~8 GB). Start the local server on port 1234. Use any OpenAI-compatible client:

```python
import base64, requests

# Encode the document image for the OpenAI-compatible vision API
with open("document.jpg", "rb") as f:
    img = base64.b64encode(f.read()).decode()

resp = requests.post("http://localhost:1234/v1/chat/completions", json={
    "model": "llama-3.2-vision-11b",
    "messages": [{"role": "user", "content": [
        {"type": "text", "text": "Extract all visible text from this document. Preserve structure."},
        {"type": "image_url", "image_url": {"url": f"data:image/jpeg;base64,{img}"}}
    ]}]
})
print(resp.json()["choices"][0]["message"]["content"])
```

On an [RTX 4060 Ti 16GB](/hardware/rtx-4060-ti-16gb): 3–8 seconds per page. Accuracy: 90–98% on clean print, 80–90% on handwriting.

For high-volume batch OCR of clean printed documents, install PaddleOCR:

```bash
pip install paddlepaddle paddleocr
```

```python
from paddleocr import PaddleOCR

ocr = PaddleOCR(lang='en')
result = ocr.ocr('document.jpg')
for line in result[0]:
    print(line[1][0])  # recognized text for each detected line
```

PaddleOCR processes a page in 0.3–1 second on CPU — no GPU required, with 97–99% accuracy on clean print. The right path for digitizing filing cabinets.

Simplest no-code path: [Pinokio](https://pinokio.ai) → search "OCR" → install the "Docling" or "Marker" one-click installer — wraps PaddleOCR + Llama Vision into a web UI: drag-and-drop a PDF/image, receive text.

For production deployment

Operator-grade recommendation.

Production OCR combines a specialized engine for fast high-confidence extraction with a multimodal LLM for complex and degraded documents.

**Two-stage pipeline.** Stage 1: PaddleOCR processes every document and extracts text regions with bounding boxes and confidence scores. Regions at or above 95% confidence are accepted directly. Regions below 95% (handwriting, low contrast, complex layout) are cropped and passed to Stage 2. Stage 2: a multimodal LLM receives the cropped region plus the specialized engine's tentative output as context and produces a corrected extraction. The result: 80% of pages go through fast Stage 1 (0.3–1 sec/page), 20% through slower Stage 2 (3–10 sec/page) — averaging ~2 seconds/page on a mixed batch. A minimal sketch of the routing appears below.

**Throughput.** PaddleOCR on CPU (Ryzen 9): 40–60 pages/min. On GPU ([RTX 4060 Ti 16GB](/hardware/rtx-4060-ti-16gb)): 80–120 pages/min. LLM OCR ([Llama 3.2 Vision 11B](/models/llama-3-3-70b) on [RTX 4090](/hardware/rtx-4090)): 10–20 pages/min. Two-stage hybrid: ~35 pages/min on a CPU+GPU server — a 100,000-page archive in ~48 hours on one server.

**Specialized vs general.** Use specialized OCR when documents are cleanly printed, volume exceeds 1,000 pages/day, the budget is CPU-only, or extraction consistency across identical pages matters. Use a multimodal LLM when handwriting is present, documents are degraded (faxes, old records), structure is complex (multi-column with callout boxes), or error tolerance allows slight extraction variance.

**Accuracy calibration.** Financial compliance: 99% accuracy required at the line-item level — errors on dollar amounts are unacceptable. Use PaddleOCR Stage 1 plus LLM verification on every document containing financial figures (regex-detected). This doubles cost per page but eliminates 95% of dollar-amount errors. Archival search: 90% accuracy is acceptable — PaddleOCR alone suffices. Legal production: citations and case names must be exact — LLM-only, slower, with human review on low-confidence regions.

**Table extraction.** The hardest sub-task. PaddleOCR's table mode detects cell boundaries, but complex tables (merged cells, nested headers) get incorrect cell association on ~15% of tables. LLMs handle merged cells and nested headers better but produce inconsistent column alignment on ~10% of wide tables (>6 columns). Hybrid: detect table regions → PaddleOCR extracts cells → LLM validates structure → output structured JSON with the corrected grid.
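A minimal sketch of that Stage 1/Stage 2 routing, assuming PaddleOCR's classic per-line result format of `[box, (text, confidence)]`; `llm_extract` is a hypothetical stand-in for whatever multimodal LLM endpoint you deploy:

```python
from paddleocr import PaddleOCR
from PIL import Image

CONF_THRESHOLD = 0.95
ocr = PaddleOCR(lang='en')

def llm_extract(region_img, tentative_text):
    """Hypothetical Stage 2: send the cropped region plus PaddleOCR's
    tentative text to a multimodal LLM, return the corrected string."""
    raise NotImplementedError  # wire up to your vLLM / LM Studio endpoint

def two_stage_page(path):
    page = Image.open(path)
    lines = []
    result = ocr.ocr(path)
    for box, (text, conf) in result[0]:
        if conf >= CONF_THRESHOLD:
            lines.append(text)  # Stage 1: accept directly
        else:
            # Stage 2: crop the low-confidence region and escalate to the LLM
            xs = [p[0] for p in box]
            ys = [p[1] for p in box]
            crop = page.crop((int(min(xs)), int(min(ys)),
                              int(max(xs)), int(max(ys))))
            lines.append(llm_extract(crop, text))
    return "\n".join(lines)
```

The threshold is the main tuning knob: lowering it shifts more pages into the slow LLM path and raises average accuracy at the cost of throughput.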

What breaks

Failure modes operators see in the wild.

**Table structure loss.** Symptom: extracted table text is correct but cell relationships are wrong — values land in the wrong columns, merged cells get split. Cause: OCR detects text regions as independent bounding boxes with no grid awareness. Mitigation: PaddleOCR's PP-Structure pipeline explicitly models table grid detection. For LLMs, prompt "Extract this table preserving exact row and column structure. Output as CSV." — CSV forces grid maintenance. Post-process to validate column-count consistency.

**Handwriting on printed forms.** Symptom: printed fields extract correctly but handwritten fill-ins produce garbage or are missed entirely. Cause: text detection is trained on printed fonts — handwriting differs in stroke thickness, spacing, and baseline alignment. Mitigation: two passes — specialized OCR for the printed text, then an LLM over the full page with "Focus on handwritten text and form fill-ins." The LLM distinguishes handwriting from print. For checkboxes, prompt explicitly for checked/unchecked state.

**Rotated text and multi-column confusion.** Symptom: columns merged into an incoherent stream; rotated text extracted as character soup. Cause: OCR detects text left-to-right — multi-column layouts require column boundary detection before line detection. Mitigation: pre-process with layout analysis (PaddleOCR PP-Structure, DocLayout-YOLO) to identify column boundaries and reading order, then extract per column. PaddleOCR handles up to 45° rotation, but vertical text requires rotation-aware detection.

**Low-contrast text.** Symptom: light-gray text, watermarks, and faded thermal paper go undetected below ~30% contrast ratio. Cause: detection models threshold on pixel-intensity gradients — low contrast produces weak gradients. Mitigation: pre-process with CLAHE (contrast-limited adaptive histogram equalization) to boost local contrast, as sketched below. For consistently degraded inputs, build a dedicated contrast-enhancement pipeline. For LLMs, include "Extract all text including low-contrast and faint text" in the prompt.

**Multi-language mixed documents.** Symptom: a document with English + Arabic + Chinese extracts the English correctly, but the Arabic is garbled and the Chinese romanized incorrectly. Cause: specialized OCR uses a single language model — unsupported languages get passed through the wrong character classifier. Mitigation: use multilingual models (PaddleOCR multilingual). For 3+ languages, use a multimodal LLM — it handles mixed languages natively. For specialized OCR, implement per-region language detection and model selection.
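For the low-contrast failure mode, a minimal CLAHE pre-processing pass with OpenCV, applied before detection; the `clipLimit` and tile size are starting points to tune per document set, not recommended values:

```python
import cv2

def enhance_contrast(path, out_path):
    """Boost local contrast with CLAHE before running OCR detection."""
    gray = cv2.imread(path, cv2.IMREAD_GRAYSCALE)
    clahe = cv2.createCLAHE(clipLimit=2.0, tileGridSize=(8, 8))
    enhanced = clahe.apply(gray)
    cv2.imwrite(out_path, enhanced)
    return out_path
```

Run the detection pass on the enhanced copy but keep the original for any LLM escalation: vision models sometimes read faint text better from the unprocessed scan.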

Hardware guidance

OCR is the lightest-weight local AI workload. Specialized OCR runs on CPU; LLM OCR benefits from a GPU but is CPU-viable at low volume.

**CPU-only ($0).** PaddleOCR: 40–60 pages/min on a modern desktop — sufficient for weekend digitization. Tesseract: 30–50 pages/min. LLM OCR ([Llama 3.2 Vision 11B](/models/llama-3-3-70b) at Q4): 0.5–1 page/min — occasional use only.

**Entry GPU ($300–600).** Any 8 GB+ GPU makes LLM OCR viable. [RTX 3060 12GB](/hardware/rtx-3060-12gb): 5–10 pages/min — adequate for 200–500 pages/day. [RTX 4060 Ti 16GB](/hardware/rtx-4060-ti-16gb): 8–15 pages/min; the full model fits with 8 GB of headroom.

**SMB tier ($1,500–2,500).** [RTX 4090](/hardware/rtx-4090) at 24 GB: 15–25 pages/min with 10 GB of headroom — scan-to-searchable in under 3 seconds. Also enables a 90B vision model for maximum accuracy at 2–4 pages/min — use the 11B for throughput, the 90B for quality-critical documents.

**Enterprise ($8,000+).** Enterprise GPUs are overkill here. An [RTX 6000 Ada](/hardware/rtx-6000-ada) enables simultaneous OCR + document analysis and batches of 4–8 images, but four small GPUs (~$2,000 total) outperform one enterprise GPU costing 5× as much. Only invest if the same hardware serves other LLM workloads too.

**CPU-GPU partition.** Deploy PaddleOCR on CPU servers for the high-confidence first pass — CPU cores are cheap and PaddleOCR is CPU-optimized. Deploy one GPU for LLM QA on low-confidence regions. A single [RTX 4060 Ti 16GB](/hardware/rtx-4060-ti-16gb) handles the QA pass for 10–15 CPU workers — the LLM only processes the ~20% of regions where PaddleOCR confidence falls below 95%. This maximizes throughput per dollar; a queue sketch follows.
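One way to wire up that partition, sketched with an in-process queue; `run_paddleocr` and `llm_extract` are hypothetical helpers (and `region.conf`/`region.text`/`region.crop` hypothetical fields), and a real deployment would use a broker such as Redis across machines rather than threads:

```python
import queue, threading

low_conf_q = queue.Queue()   # CPU workers push low-confidence regions here
results = []                 # demo only; use a thread-safe store in production

def cpu_worker(pages):
    # Fast pass: accept high-confidence regions, queue the rest for the GPU.
    for page in pages:
        for region in run_paddleocr(page):          # hypothetical helper
            if region.conf >= 0.95:
                results.append(region.text)
            else:
                low_conf_q.put(region)

def gpu_worker():
    # Slow pass: one GPU drains low-confidence regions from all CPU workers.
    while True:
        region = low_conf_q.get()
        results.append(llm_extract(region.crop, region.text))  # hypothetical
        low_conf_q.task_done()

threading.Thread(target=gpu_worker, daemon=True).start()
```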

Runtime guidance

**PaddleOCR vs Tesseract vs multimodal LLM — document-type routing.** PaddleOCR is the specialized OCR leader. Three-stage pipeline: text detection (DB++), recognition (SVTR), layout analysis (PP-Structure for tables/multi-column). Its integrated table mode detects cells, extracts text, and outputs HTML/Excel — the only open-weight engine handling tables end-to-end. 80+ languages. Speed: 40–120 pages/min. Use for: clean printed documents at scale, table-heavy documents, and mixed-language documents within its supported languages.

Tesseract 5 is the legacy workhorse. Advantages: a C library callable from any language, 100+ languages, zero Python dependency. Disadvantages: 5–15% lower accuracy than PaddleOCR, no table structure, low handwriting accuracy (55%). Use for: C-library integration into existing apps, languages PaddleOCR doesn't support, or legacy Tesseract pipelines where migration cost exceeds the accuracy gap.

Multimodal LLMs are the new paradigm. [Llama 3.2 Vision](/models/llama-3-3-70b) via [LM Studio](/tools/lm-studio) for interactive/single-page work; Qwen2.5-VL via [vLLM](/tools/vllm) for batch. Advantages: visual structure understanding (column relationships, callout boxes, handwriting vs print), degraded-document handling, semantic context. Disadvantages: 5–30× slower than specialized OCR, per-page cost 10–100× higher, extraction variance between identical pages. Use for: heterogeneous documents, complex structure, or when accuracy matters more than speed.

**Decision tree.** Route by document type: clean printed invoices/receipts → PaddleOCR on CPU (40+ pages/min). Handwritten notes → LLM on GPU (5–15 pages/min). Mixed (printed form + handwriting) → PaddleOCR first pass + LLM on handwritten regions. Table-heavy → PaddleOCR PP-Structure. 3+ languages → multimodal LLM. Degraded documents → multimodal LLM.

For production: deploy PaddleOCR behind a FastAPI wrapper over its Python API. Deploy the LLM via [vLLM](/tools/vllm) vision support (OpenAI-compatible). Route documents to the appropriate backend based on a lightweight document-type classification first pass. Cache OCR results by document hash to avoid re-processing — a minimal sketch of the wrapper follows.
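A minimal sketch of that FastAPI wrapper with hash-keyed caching; the endpoint name and in-memory cache are illustrative (swap in Redis or SQLite for anything persistent), and result parsing assumes PaddleOCR's classic `[box, (text, confidence)]` line format:

```python
import hashlib
from fastapi import FastAPI, UploadFile
from paddleocr import PaddleOCR

app = FastAPI()
ocr = PaddleOCR(lang='en')
cache: dict[str, str] = {}  # document hash -> extracted text

@app.post("/extract")
async def extract(file: UploadFile):
    data = await file.read()
    key = hashlib.sha256(data).hexdigest()   # cache key = document hash
    if key in cache:
        return {"text": cache[key], "cached": True}
    tmp = f"/tmp/{key}.img"                  # assumes image input
    with open(tmp, "wb") as f:
        f.write(data)
    result = ocr.ocr(tmp)
    text = "\n".join(line[1][0] for line in result[0])
    cache[key] = text
    return {"text": text, "cached": False}
```

Re-submitted scans of the same file hit the cache instead of burning another OCR pass, which matters once the same archive gets processed by multiple downstream jobs.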

Setup walkthrough

  1. `pip install surya-ocr` (VikParuchuri's Surya — SOTA open-weight OCR for documents).
  2. The first run auto-downloads the detection + recognition models (~1 GB total). No manual setup.
  3. CLI: `surya_ocr image.jpg` → outputs JSON with bounding boxes + text for every detected line.
  4. For PDFs: `surya_ocr document.pdf --output_dir out/` → processes page by page, outputs per-page JSON + Markdown.
  5. First result in 10–30 seconds on CPU for a single page; faster on GPU.
  6. Alternative for simple cases: `pip install pytesseract` (wraps Tesseract) — faster but worse on complex layouts.
  7. For multimodal LLM OCR: `ollama run minicpm-v` → upload an image → ask "Transcribe all text in this image." (A scripted version is sketched after this list.)
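If you want step 7 scripted rather than interactive, the `ollama` Python client accepts image paths in the message; a sketch, assuming Ollama is running locally and `minicpm-v` has already been pulled:

```python
import ollama

resp = ollama.chat(
    model="minicpm-v",
    messages=[{
        "role": "user",
        "content": "Transcribe all text in this image.",
        "images": ["document.jpg"],  # local path to the scan
    }],
)
print(resp["message"]["content"])
```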

The cheap setup

Surya OCR runs on CPU at 10–30 seconds per page on a modern laptop (Ryzen 5 / Intel i5). No GPU required. Any $300–400 laptop handles batch OCR of documents overnight. For faster throughput, a used GTX 1060 6 GB ($60) drops per-page time to 2–5 seconds. For multimodal LLM OCR (Qwen2-VL, MiniCPM-V), a used GTX 1660 Super 6 GB (~$100) handles 7B VL models at 5–10 seconds per image — good enough for complex layouts like tables and forms.

The serious setup

A used [RTX 3060 12GB](/hardware/rtx-3060-12gb) (~$200–250) runs Surya OCR at 1–3 seconds per page on GPU, and can run Qwen2-VL 7B at 3–5 seconds per image for complex document understanding (combined OCR + layout + table extraction). For production document pipelines processing thousands of pages/day, pair it with a Ryzen 7 7700X + 32 GB DDR5 + 2 TB NVMe. Total: ~$900–1,100. OCR is VRAM-light — 6 GB is sufficient for most models.

Common beginner mistake

**The mistake.** Running Tesseract with default settings on a scanned document with a complex layout (multi-column, tables, headers) and getting garbled output.

**Why it fails.** Tesseract is a line-level OCR engine — it doesn't understand document layout. On multi-column PDFs it reads across columns, mixing unrelated text.

**The fix.** Use Surya OCR or a multimodal VL model (Qwen2-VL, MiniCPM-V) for complex documents. These models understand layout — they detect columns, tables, headers, and reading order before extracting text. For simple single-column documents Tesseract is fine; for anything with structure, use a layout-aware model. If you must stay on Tesseract, its page segmentation mode is the first setting to change (sketch below).
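A sketch via pytesseract of that page-segmentation tweak: `--psm 1` enables automatic page segmentation with orientation/script detection, versus the default `--psm 3`, which assumes a simpler block layout. This softens the multi-column failure but doesn't match a layout-aware model:

```python
import pytesseract
from PIL import Image

img = Image.open("scan.png")
# Default (--psm 3) segments without orientation/script detection;
# --psm 1 runs full automatic page segmentation with OSD.
text = pytesseract.image_to_string(img, config="--psm 1")
print(text)
```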

Reality check

Local AI workloads have real hardware constraints that vary by task type. VRAM ceiling decides what model fits; bandwidth decides decode speed; compute decides prefill speed. Pick the GPU tier that fits your actual workload, not the spec sheet.

Common mistakes

  • Buying for spec-sheet VRAM without modeling KV cache + activation overhead
  • Underestimating quantization quality loss below Q4
  • Skipping flash-attention support (real perf gap on long context)
  • Ignoring sustained-load thermals (laptops thermal-throttle within 30 min)

Hardware buying guidance for OCR / Document Text Extraction

OCR and document-understanding workloads use vision-language models — the buyer math is different from text-only LLM shopping.
