OCR / Document Text Extraction

Extracting text from images, PDFs, screenshots, and handwritten documents. Modern multimodal LLMs (Qwen2.5-VL, InternVL, GPT-4V) increasingly outperform specialized OCR engines on complex layouts.

Capability notes

OCR in 2026 splits into specialized engines and multimodal LLMs, each dominating different document types.

**Specialized engines.** PaddleOCR (Baidu, Apache 2.0) achieves 97–99% character accuracy on clean printed English/Chinese documents — the gold standard for structured digitization. It handles rotated text up to 45°, curved text on packaging, and multi-column layouts with 90%+ structure preservation. Weaknesses: handwriting accuracy drops to 75–85%, and low-contrast text (below ~30% contrast ratio) degrades rapidly. Tesseract 5 (Google, Apache 2.0) scores 85–95% on clean printed English with LSTM recognition — adequate for basic scanning, but it requires pre-processing (binarization, deskewing) on anything below pristine scan quality. It covers 100+ languages; non-Latin scripts score 10–15% lower than Latin ones.

**Multimodal LLMs.** Qwen2.5-VL and [Llama 3.2 Vision](/models/llama-3-3-70b) extract text from complex documents — receipts, forms, handwritten notes, screenshots. Printed accuracy: 95–98% on clean English, 90–95% on medium-quality scans. LLMs excel at understanding structure — they identify headers vs body vs footnotes, extract table cells with row/column relationships, and handle handwriting-on-printed-forms (signatures over printed lines) that confuse specialized OCR. Weaknesses: speed (3–10 seconds/page vs PaddleOCR's 0.3–1 second), cost ([GPU inference](/tools/vllm) vs CPU), and consistency — the same page processed twice yields slightly different extraction on borderline-legible text.

**Accuracy by document type.**

| Document type | PaddleOCR | Tesseract | Multimodal LLM |
|---|---|---|---|
| Printed English invoices | 99% | 93% | 98% |
| Handwritten notes | 75% | 55% | 85% |
| Faded receipts | 82% | 65% | 88% |

The pattern: multimodal LLMs win on degraded and heterogeneous documents; specialized OCR wins on clean, high-volume scanning.

If you just want to try this

Lowest-friction path to a working setup.

Install [LM Studio](/tools/lm-studio), search "llama-3.2-vision", and download the 11B instruct at Q4_K_M (~8 GB). Start the local server on port 1234. Use any OpenAI-compatible client:

```python
import base64, requests

# Encode the document image for the OpenAI-compatible vision API
with open("document.jpg", "rb") as f:
    img = base64.b64encode(f.read()).decode()

resp = requests.post("http://localhost:1234/v1/chat/completions", json={
    "model": "llama-3.2-vision-11b",
    "messages": [{"role": "user", "content": [
        {"type": "text", "text": "Extract all visible text from this document. Preserve structure."},
        {"type": "image_url", "image_url": {"url": f"data:image/jpeg;base64,{img}"}}
    ]}]
})
print(resp.json()["choices"][0]["message"]["content"])
```

On an [RTX 4060 Ti 16GB](/hardware/rtx-4060-ti-16gb): 3–8 seconds per page. Accuracy: 90–98% on clean print, 80–90% on handwriting.

For high-volume batch OCR of clean printed documents, install PaddleOCR:

```bash
pip install paddlepaddle paddleocr
```

```python
from paddleocr import PaddleOCR

ocr = PaddleOCR(lang='en')
result = ocr.ocr('document.jpg')
for line in result[0]:
    print(line[1][0])  # recognized text for each detected line
```

PaddleOCR processes a page in 0.3–1 second on CPU — no GPU required, with 97–99% accuracy on clean print. The right path for digitizing filing cabinets.

Simplest no-code path: [Pinokio](https://pinokio.ai) → search "OCR" → install the "Docling" or "Marker" one-click installer — wraps PaddleOCR + Llama Vision into a web UI: drag-and-drop a PDF/image, receive text.

For production deployment

Operator-grade recommendation.

Production OCR combines a specialized engine for fast high-confidence extraction with a multimodal LLM for complex and degraded documents.

**Two-stage pipeline.** Stage 1: PaddleOCR processes every document and extracts text regions with bounding boxes and confidence scores. Regions at or above 95% confidence are accepted directly. Regions below 95% (handwriting, low contrast, complex layout) are cropped and passed to Stage 2. Stage 2: a multimodal LLM receives the cropped region plus the specialized engine's tentative output as context and produces a corrected extraction. The result: 80% of pages go through fast Stage 1 (0.3–1 sec/page), 20% through slower Stage 2 (3–10 sec/page) — averaging ~2 seconds/page on a mixed batch. A minimal sketch of the routing appears below.

**Throughput.** PaddleOCR on CPU (Ryzen 9): 40–60 pages/min. On GPU ([RTX 4060 Ti 16GB](/hardware/rtx-4060-ti-16gb)): 80–120 pages/min. LLM OCR ([Llama 3.2 Vision 11B](/models/llama-3-3-70b) on [RTX 4090](/hardware/rtx-4090)): 10–20 pages/min. Two-stage hybrid: ~35 pages/min on a CPU+GPU server — a 100,000-page archive in ~48 hours on one server.

**Specialized vs general.** Use specialized OCR when documents are cleanly printed, volume exceeds 1,000 pages/day, the budget is CPU-only, or extraction consistency across identical pages matters. Use a multimodal LLM when handwriting is present, documents are degraded (faxes, old records), structure is complex (multi-column with callout boxes), or error tolerance allows slight extraction variance.

**Accuracy calibration.** Financial compliance: 99% accuracy required at the line-item level — errors on dollar amounts are unacceptable. Use PaddleOCR Stage 1 plus LLM verification on every document containing financial figures (regex-detected). This doubles cost per page but eliminates 95% of dollar-amount errors. Archival search: 90% accuracy is acceptable — PaddleOCR alone suffices. Legal production: citations and case names must be exact — LLM-only, slower, with human review on low-confidence regions.

**Table extraction.** The hardest sub-task. PaddleOCR's table mode detects cell boundaries, but complex tables (merged cells, nested headers) get incorrect cell association on ~15% of tables. LLMs handle merged cells and nested headers better but produce inconsistent column alignment on ~10% of wide tables (>6 columns). Hybrid: detect table regions → PaddleOCR extracts cells → LLM validates structure → output structured JSON with the corrected grid.
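A minimal sketch of that Stage 1/Stage 2 routing, assuming PaddleOCR's classic per-line result format of `[box, (text, confidence)]`; `llm_extract` is a hypothetical stand-in for whatever multimodal LLM endpoint you deploy:

```python
from paddleocr import PaddleOCR
from PIL import Image

CONF_THRESHOLD = 0.95
ocr = PaddleOCR(lang='en')

def llm_extract(region_img, tentative_text):
    """Hypothetical Stage 2: send the cropped region plus PaddleOCR's
    tentative text to a multimodal LLM, return the corrected string."""
    raise NotImplementedError  # wire up to your vLLM / LM Studio endpoint

def two_stage_page(path):
    page = Image.open(path)
    lines = []
    result = ocr.ocr(path)
    for box, (text, conf) in result[0]:
        if conf >= CONF_THRESHOLD:
            lines.append(text)  # Stage 1: accept directly
        else:
            # Stage 2: crop the low-confidence region and escalate to the LLM
            xs = [p[0] for p in box]
            ys = [p[1] for p in box]
            crop = page.crop((int(min(xs)), int(min(ys)),
                              int(max(xs)), int(max(ys))))
            lines.append(llm_extract(crop, text))
    return "\n".join(lines)
```

The threshold is the main tuning knob: lowering it shifts more pages into the slow LLM path and raises average accuracy at the cost of throughput.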

What breaks

Failure modes operators see in the wild.

**Table structure loss.** Symptom: extracted table text is correct but cell relationships are wrong — values land in the wrong columns, merged cells get split. Cause: OCR detects text regions as independent bounding boxes with no grid awareness. Mitigation: PaddleOCR's PP-Structure pipeline explicitly models table grid detection. For LLMs, prompt "Extract this table preserving exact row and column structure. Output as CSV." — CSV forces grid maintenance. Post-process to validate column-count consistency.

**Handwriting on printed forms.** Symptom: printed fields extract correctly but handwritten fill-ins produce garbage or are missed entirely. Cause: text detection is trained on printed fonts — handwriting differs in stroke thickness, spacing, and baseline alignment. Mitigation: two passes — specialized OCR for the printed text, then an LLM over the full page with "Focus on handwritten text and form fill-ins." The LLM distinguishes handwriting from print. For checkboxes, prompt explicitly for checked/unchecked state.

**Rotated text and multi-column confusion.** Symptom: columns merged into an incoherent stream; rotated text extracted as character soup. Cause: OCR detects text left-to-right — multi-column layouts require column boundary detection before line detection. Mitigation: pre-process with layout analysis (PaddleOCR PP-Structure, DocLayout-YOLO) to identify column boundaries and reading order, then extract per column. PaddleOCR handles up to 45° rotation, but vertical text requires rotation-aware detection.

**Low-contrast text.** Symptom: light-gray text, watermarks, and faded thermal paper go undetected below ~30% contrast ratio. Cause: detection models threshold on pixel-intensity gradients — low contrast produces weak gradients. Mitigation: pre-process with CLAHE (contrast-limited adaptive histogram equalization) to boost local contrast, as sketched below. For consistently degraded inputs, build a dedicated contrast-enhancement pipeline. For LLMs, include "Extract all text including low-contrast and faint text" in the prompt.

**Multi-language mixed documents.** Symptom: a document with English + Arabic + Chinese extracts the English correctly, but the Arabic is garbled and the Chinese romanized incorrectly. Cause: specialized OCR uses a single language model — unsupported languages get passed through the wrong character classifier. Mitigation: use multilingual models (PaddleOCR multilingual). For 3+ languages, use a multimodal LLM — it handles mixed languages natively. For specialized OCR, implement per-region language detection and model selection.
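For the low-contrast failure mode, a minimal CLAHE pre-processing pass with OpenCV, applied before detection; the `clipLimit` and tile size are starting points to tune per document set, not recommended values:

```python
import cv2

def enhance_contrast(path, out_path):
    """Boost local contrast with CLAHE before running OCR detection."""
    gray = cv2.imread(path, cv2.IMREAD_GRAYSCALE)
    clahe = cv2.createCLAHE(clipLimit=2.0, tileGridSize=(8, 8))
    enhanced = clahe.apply(gray)
    cv2.imwrite(out_path, enhanced)
    return out_path
```

Run the detection pass on the enhanced copy but keep the original for any LLM escalation: vision models sometimes read faint text better from the unprocessed scan.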

Hardware guidance

OCR is the lightest-weight local AI workload. Specialized OCR runs on CPU; LLM OCR benefits from a GPU but is CPU-viable at low volume.

**CPU-only ($0).** PaddleOCR: 40–60 pages/min on a modern desktop — sufficient for weekend digitization. Tesseract: 30–50 pages/min. LLM OCR ([Llama 3.2 Vision 11B](/models/llama-3-3-70b) at Q4): 0.5–1 page/min — occasional use only.

**Entry GPU ($300–600).** Any 8 GB+ GPU makes LLM OCR viable. [RTX 3060 12GB](/hardware/rtx-3060-12gb): 5–10 pages/min — adequate for 200–500 pages/day. [RTX 4060 Ti 16GB](/hardware/rtx-4060-ti-16gb): 8–15 pages/min; the full model fits with 8 GB of headroom.

**SMB tier ($1,500–2,500).** [RTX 4090](/hardware/rtx-4090) at 24 GB: 15–25 pages/min with 10 GB of headroom — scan-to-searchable in under 3 seconds. Also enables a 90B vision model for maximum accuracy at 2–4 pages/min — use the 11B for throughput, the 90B for quality-critical documents.

**Enterprise ($8,000+).** Enterprise GPUs are overkill here. An [RTX 6000 Ada](/hardware/rtx-6000-ada) enables simultaneous OCR + document analysis and batches of 4–8 images, but four small GPUs (~$2,000 total) outperform one enterprise GPU costing 5× as much. Only invest if the same hardware serves other LLM workloads too.

**CPU-GPU partition.** Deploy PaddleOCR on CPU servers for the high-confidence first pass — CPU cores are cheap and PaddleOCR is CPU-optimized. Deploy one GPU for LLM QA on low-confidence regions. A single [RTX 4060 Ti 16GB](/hardware/rtx-4060-ti-16gb) handles the QA pass for 10–15 CPU workers — the LLM only processes the ~20% of regions where PaddleOCR confidence falls below 95%. This maximizes throughput per dollar; a queue sketch follows.
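One way to wire up that partition, sketched with an in-process queue; `run_paddleocr` and `llm_extract` are hypothetical helpers (and `region.conf`/`region.text`/`region.crop` hypothetical fields), and a real deployment would use a broker such as Redis across machines rather than threads:

```python
import queue, threading

low_conf_q = queue.Queue()   # CPU workers push low-confidence regions here
results = []                 # demo only; use a thread-safe store in production

def cpu_worker(pages):
    # Fast pass: accept high-confidence regions, queue the rest for the GPU.
    for page in pages:
        for region in run_paddleocr(page):          # hypothetical helper
            if region.conf >= 0.95:
                results.append(region.text)
            else:
                low_conf_q.put(region)

def gpu_worker():
    # Slow pass: one GPU drains low-confidence regions from all CPU workers.
    while True:
        region = low_conf_q.get()
        results.append(llm_extract(region.crop, region.text))  # hypothetical
        low_conf_q.task_done()

threading.Thread(target=gpu_worker, daemon=True).start()
```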

Runtime guidance

**PaddleOCR vs Tesseract vs multimodal LLM — document-type routing.** PaddleOCR is the specialized OCR leader. Three-stage pipeline: text detection (DB++), recognition (SVTR), layout analysis (PP-Structure for tables/multi-column). Its integrated table mode detects cells, extracts text, and outputs HTML/Excel — the only open-weight engine handling tables end-to-end. 80+ languages. Speed: 40–120 pages/min. Use for: clean printed documents at scale, table-heavy documents, and mixed-language documents within its supported languages.

Tesseract 5 is the legacy workhorse. Advantages: a C library callable from any language, 100+ languages, zero Python dependency. Disadvantages: 5–15% lower accuracy than PaddleOCR, no table structure, low handwriting accuracy (55%). Use for: C-library integration into existing apps, languages PaddleOCR doesn't support, or legacy Tesseract pipelines where migration cost exceeds the accuracy gap.

Multimodal LLMs are the new paradigm. [Llama 3.2 Vision](/models/llama-3-3-70b) via [LM Studio](/tools/lm-studio) for interactive/single-page work; Qwen2.5-VL via [vLLM](/tools/vllm) for batch. Advantages: visual structure understanding (column relationships, callout boxes, handwriting vs print), degraded-document handling, semantic context. Disadvantages: 5–30× slower than specialized OCR, per-page cost 10–100× higher, extraction variance between identical pages. Use for: heterogeneous documents, complex structure, or when accuracy matters more than speed.

**Decision tree.** Route by document type: clean printed invoices/receipts → PaddleOCR on CPU (40+ pages/min). Handwritten notes → LLM on GPU (5–15 pages/min). Mixed (printed form + handwriting) → PaddleOCR first pass + LLM on handwritten regions. Table-heavy → PaddleOCR PP-Structure. 3+ languages → multimodal LLM. Degraded documents → multimodal LLM.

For production: deploy PaddleOCR behind a FastAPI wrapper over its Python API. Deploy the LLM via [vLLM](/tools/vllm) vision support (OpenAI-compatible). Route documents to the appropriate backend based on a lightweight document-type classification first pass. Cache OCR results by document hash to avoid re-processing — a minimal sketch of the wrapper follows.
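A minimal sketch of that FastAPI wrapper with hash-keyed caching; the endpoint name and in-memory cache are illustrative (swap in Redis or SQLite for anything persistent), and result parsing assumes PaddleOCR's classic `[box, (text, confidence)]` line format:

```python
import hashlib
from fastapi import FastAPI, UploadFile
from paddleocr import PaddleOCR

app = FastAPI()
ocr = PaddleOCR(lang='en')
cache: dict[str, str] = {}  # document hash -> extracted text

@app.post("/extract")
async def extract(file: UploadFile):
    data = await file.read()
    key = hashlib.sha256(data).hexdigest()   # cache key = document hash
    if key in cache:
        return {"text": cache[key], "cached": True}
    tmp = f"/tmp/{key}.img"                  # assumes image input
    with open(tmp, "wb") as f:
        f.write(data)
    result = ocr.ocr(tmp)
    text = "\n".join(line[1][0] for line in result[0])
    cache[key] = text
    return {"text": text, "cached": False}
```

Re-submitted scans of the same file hit the cache instead of burning another OCR pass, which matters once the same archive gets processed by multiple downstream jobs.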

Setup walkthrough

  1. `pip install surya-ocr` (VikParuchuri's Surya — SOTA open-weight OCR for documents).
  2. The first run auto-downloads the detection + recognition models (~1 GB total). No manual setup.
  3. CLI: `surya_ocr image.jpg` → outputs JSON with bounding boxes + text for every detected line.
  4. For PDFs: `surya_ocr document.pdf --output_dir out/` → processes page by page, outputs per-page JSON + Markdown.
  5. First result in 10–30 seconds on CPU for a single page; faster on GPU.
  6. Alternative for simple cases: `pip install pytesseract` (wraps Tesseract) — faster but worse on complex layouts.
  7. For multimodal LLM OCR: `ollama run minicpm-v` → upload an image → ask "Transcribe all text in this image." (A scripted version is sketched after this list.)
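If you want step 7 scripted rather than interactive, the `ollama` Python client accepts image paths in the message; a sketch, assuming Ollama is running locally and `minicpm-v` has already been pulled:

```python
import ollama

resp = ollama.chat(
    model="minicpm-v",
    messages=[{
        "role": "user",
        "content": "Transcribe all text in this image.",
        "images": ["document.jpg"],  # local path to the scan
    }],
)
print(resp["message"]["content"])
```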

The cheap setup

Surya OCR runs on CPU at 10–30 seconds per page on a modern laptop (Ryzen 5 / Intel i5). No GPU required. Any $300–400 laptop handles batch OCR of documents overnight. For faster throughput, a used GTX 1060 6 GB ($60) drops per-page time to 2–5 seconds. For multimodal LLM OCR (Qwen2-VL, MiniCPM-V), a used GTX 1660 Super 6 GB (~$100) handles 7B VL models at 5–10 seconds per image — good enough for complex layouts like tables and forms.

The serious setup

A used [RTX 3060 12GB](/hardware/rtx-3060-12gb) (~$200–250) runs Surya OCR at 1–3 seconds per page on GPU, and can run Qwen2-VL 7B at 3–5 seconds per image for complex document understanding (combined OCR + layout + table extraction). For production document pipelines processing thousands of pages/day, pair it with a Ryzen 7 7700X + 32 GB DDR5 + 2 TB NVMe. Total: ~$900–1,100. OCR is VRAM-light — 6 GB is sufficient for most models.

Common beginner mistake

**The mistake.** Running Tesseract with default settings on a scanned document with a complex layout (multi-column, tables, headers) and getting garbled output.

**Why it fails.** Tesseract is a line-level OCR engine — it doesn't understand document layout. On multi-column PDFs it reads across columns, mixing unrelated text.

**The fix.** Use Surya OCR or a multimodal VL model (Qwen2-VL, MiniCPM-V) for complex documents. These models understand layout — they detect columns, tables, headers, and reading order before extracting text. For simple single-column documents Tesseract is fine; for anything with structure, use a layout-aware model. If you must stay on Tesseract, its page segmentation mode is the first setting to change (sketch below).
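A sketch via pytesseract of that page-segmentation tweak: `--psm 1` enables automatic page segmentation with orientation/script detection, versus the default `--psm 3`, which assumes a simpler block layout. This softens the multi-column failure but doesn't match a layout-aware model:

```python
import pytesseract
from PIL import Image

img = Image.open("scan.png")
# Default (--psm 3) segments without orientation/script detection;
# --psm 1 runs full automatic page segmentation with OSD.
text = pytesseract.image_to_string(img, config="--psm 1")
print(text)
```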

Reality check

Local AI workloads have real hardware constraints that vary by task type. VRAM ceiling decides what model fits; bandwidth decides decode speed; compute decides prefill speed. Pick the GPU tier that fits your actual workload, not the spec sheet.

Common mistakes

  • Buying for spec-sheet VRAM without modeling KV cache + activation overhead
  • Underestimating quantization quality loss below Q4
  • Skipping flash-attention support (real perf gap on long context)
  • Ignoring sustained-load thermals (laptops thermal-throttle within 30 min)

Hardware buying guidance for OCR / Document Text Extraction

OCR and document-understanding workloads use vision-language models — the buyer math is different from text-only LLM shopping.
