Parsing complex document layouts — tables, multi-column text, footnotes, equations. Combines OCR + structure understanding + reasoning.
ollama pull qwen2.5-vl:7b (~5 GB — strong document parsing, 128K context).
pip install surya-ocr for layout detection + text extraction.

# Stage 1: Extract layout + text with Surya
from surya.detection import batch_text_detection
from surya.recognition import batch_recognition
# ... (Surya extracts reading-order text with bounding boxes)
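# A minimal Stage-1 sketch (an assumption, not the page's full pipeline): instead of
# calling the batch_* entry points imported above directly, it uses Surya's higher-level
# run_ocr wrapper. Module paths and signatures vary across surya-ocr releases, so check
# them against your installed version.
from PIL import Image
from surya.ocr import run_ocr
from surya.model.detection.model import load_model as load_det_model, load_processor as load_det_processor
from surya.model.recognition.model import load_model as load_rec_model
from surya.model.recognition.processor import load_processor as load_rec_processor

image = Image.open("document.png")
det_processor, det_model = load_det_processor(), load_det_model()
rec_model, rec_processor = load_rec_model(), load_rec_processor()

# One OCR result per page; text lines come back in reading order with bounding boxes.
ocr_results = run_ocr([image], [["en"]], det_model, det_processor, rec_model, rec_processor)
extracted_text = "\n".join(line.text for line in ocr_results[0].text_lines)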
# Stage 2: Feed structured output to VLM for understanding
import ollama
resp = ollama.chat(model="qwen2.5-vl:7b", messages=[{
    "role": "user",
    # The OCR text carries the structure; the page image is only visual context.
    "content": f"This document contains: {extracted_text}\n\nAnswer: What is the total revenue in Q3? What are the key risks listed?",
    "images": [open("document.png", "rb").read()],
}])
print(resp["message"]["content"])
Used RTX 3060 12 GB (~$200-250, see /hardware/rtx-3060-12gb). Runs Surya OCR at 1-3 seconds per page plus Qwen2.5-VL 7B at 5-10 seconds per page for understanding; the combined pipeline handles 100-200 pages/hour. For simpler documents (invoices, forms), Surya alone extracts all the structured data without an LLM. Pair with a Ryzen 5 5600 + 32 GB DDR4 + 1 TB NVMe. Total: ~$420-490. Document understanding at this budget works well for small-to-medium document sets (1,000-5,000 pages).
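For the invoice/form case, extraction can stay entirely in Python once Surya has produced reading-order text. A minimal sketch, assuming the extracted_text variable from the Stage 1 snippet above; the field names and regex patterns are illustrative, not templates that ship with Surya.

import re

# Illustrative patterns for a simple invoice; adapt them to your own documents.
FIELD_PATTERNS = {
    "invoice_number": re.compile(r"Invoice\s*(?:No\.?|#)\s*[:\-]?\s*(\S+)", re.IGNORECASE),
    "invoice_date": re.compile(r"Date\s*[:\-]?\s*(\d{4}-\d{2}-\d{2})", re.IGNORECASE),
    "total_due": re.compile(r"Total\s*(?:Due)?\s*[:\-]?\s*\$?([\d,]+\.\d{2})", re.IGNORECASE),
}

def extract_fields(ocr_text: str) -> dict:
    """Pull structured fields straight out of Surya's reading-order text; no LLM involved."""
    fields = {}
    for name, pattern in FIELD_PATTERNS.items():
        match = pattern.search(ocr_text)
        fields[name] = match.group(1) if match else None
    return fields

print(extract_fields(extracted_text))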
Used RTX 3090 24 GB (~$700-900, see /hardware/rtx-3090). Runs Surya OCR + Qwen2.5-VL 72B at 15-25 seconds per page — highest-quality document understanding. The 72B model correctly handles complex reasoning over table data, cross-page references, and technical diagrams. For enterprise document processing (50K+ pages/day), run Surya plus a 7B VL model on 2× RTX 3060 in parallel. Total: ~$1,500-2,200. Document understanding is a pipeline problem — OCR + layout + reasoning. Budget for the OCR stage (CPU/GPU) AND the reasoning stage (GPU).
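One way to run the 7B VL stage across two cards is to start one Ollama server per GPU and round-robin pages between them. A minimal sketch, assuming each server was launched with CUDA_VISIBLE_DEVICES pinned to one card and OLLAMA_HOST set to its own port; the ports, file names, and prompt below are illustrative.

# Assumed launch commands, one per GPU:
#   CUDA_VISIBLE_DEVICES=0 OLLAMA_HOST=127.0.0.1:11434 ollama serve
#   CUDA_VISIBLE_DEVICES=1 OLLAMA_HOST=127.0.0.1:11435 ollama serve
from concurrent.futures import ThreadPoolExecutor
import ollama

clients = [ollama.Client(host="http://127.0.0.1:11434"),
           ollama.Client(host="http://127.0.0.1:11435")]

def understand_page(args):
    idx, (image_path, page_text) = args
    client = clients[idx % len(clients)]  # round-robin pages across the two GPUs
    resp = client.chat(model="qwen2.5-vl:7b", messages=[{
        "role": "user",
        "content": f"This document contains: {page_text}\n\nSummarize the key figures.",
        "images": [open(image_path, "rb").read()],
    }])
    return resp["message"]["content"]

# pages: (image_path, ocr_text) pairs produced by the Surya stage.
pages = [("page_001.png", "OCR text for page 1"), ("page_002.png", "OCR text for page 2")]
with ThreadPoolExecutor(max_workers=len(clients)) as pool:
    answers = list(pool.map(understand_page, enumerate(pages)))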
The mistake: feeding a raw PDF page image directly to a VLM and asking for structured output like "extract the table as JSON."

Why it fails: VLMs read page images at roughly 980×980 resolution and compress them into a fixed budget of visual tokens. A dense PDF page with 50+ table cells, 100+ numbers, and a multi-column layout exceeds that visual token budget, so the model hallucinates values or misses cells entirely.

The fix: always run a layout-aware OCR stage first (Surya, or Tesseract with layout analysis) to extract text with bounding boxes and reading order. Then feed the VLM (1) the OCR-extracted text, which is already structured, and (2) the original image for visual context. The VLM reasons over the text instead of trying to read every pixel. For production: OCR → structured extraction (regex/template) → VLM only for ambiguous cases.
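A minimal sketch of that routing, assuming the same Surya output and the qwen2.5-vl:7b model pulled above; the single field pattern here is illustrative.

import re
import ollama

TOTAL_REVENUE = re.compile(r"Total\s+revenue\s*[:\-]?\s*\$?([\d,.]+)", re.IGNORECASE)

def total_revenue(ocr_text: str, image_path: str) -> str:
    # Cheap path: the template matches the OCR text directly, so no model call is needed.
    match = TOTAL_REVENUE.search(ocr_text)
    if match:
        return match.group(1)
    # Ambiguous case: fall back to the VLM with the OCR text plus the page image.
    resp = ollama.chat(model="qwen2.5-vl:7b", messages=[{
        "role": "user",
        "content": f"OCR text:\n{ocr_text}\n\nWhat is the total revenue? Reply with the number only.",
        "images": [open(image_path, "rb").read()],
    }])
    return resp["message"]["content"].strip()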
Browse all tools for runtimes that fit this workload.
Local AI workloads have real hardware constraints that vary by task type. VRAM ceiling decides what model fits; bandwidth decides decode speed; compute decides prefill speed. Pick the GPU tier that fits your actual workload, not the spec sheet.
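A back-of-envelope way to sanity-check those three constraints; the bytes-per-parameter figure and the RTX 3060 bandwidth below are rough assumptions, not measurements.

def estimate_vram_gb(params_b: float, bytes_per_param: float, kv_cache_gb: float = 1.0, overhead_gb: float = 1.0) -> float:
    """Weights plus KV cache plus runtime overhead, in GB (rough)."""
    return params_b * bytes_per_param + kv_cache_gb + overhead_gb

def estimate_decode_tps(mem_bandwidth_gbs: float, params_b: float, bytes_per_param: float) -> float:
    """Decode is bandwidth-bound: each generated token reads every weight once (upper bound)."""
    return mem_bandwidth_gbs / (params_b * bytes_per_param)

# Example: a 7B model at roughly 4.5 bits per weight (~0.56 bytes/param) on an RTX 3060 (~360 GB/s).
print(estimate_vram_gb(7, 0.56))          # ~5.9 GB -> fits in 12 GB with room for context
print(estimate_decode_tps(360, 7, 0.56))  # ~92 tok/s upper bound; real throughput is lower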
The errors most operators hit when running document understanding locally. Each links to a diagnose+fix walkthrough.
Verify your specific hardware can handle document understanding before committing money.
RAG workflows mix embedding throughput, long-context inference, and reasonable VRAM headroom. The guides below cover the buyer decision honestly.