Parsing complex document layouts — tables, multi-column text, footnotes, equations. Combines OCR + structure understanding + reasoning.
ollama pull qwen2.5-vl:7b (~5 GB — strong document parsing, 128K context).
pip install surya-ocr for layout detection + text extraction.

# Stage 1: Extract layout + text with Surya
from surya.detection import batch_text_detection
from surya.recognition import batch_recognition
# ... (Surya extracts reading-order text with bounding boxes)
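# A minimal Stage-1 sketch (an assumption, not the page's full pipeline): instead of
# calling the batch_* entry points imported above directly, it uses Surya's higher-level
# run_ocr wrapper. Module paths and signatures vary across surya-ocr releases, so check
# them against your installed version.
from PIL import Image
from surya.ocr import run_ocr
from surya.model.detection.model import load_model as load_det_model, load_processor as load_det_processor
from surya.model.recognition.model import load_model as load_rec_model
from surya.model.recognition.processor import load_processor as load_rec_processor

image = Image.open("document.png")
det_processor, det_model = load_det_processor(), load_det_model()
rec_model, rec_processor = load_rec_model(), load_rec_processor()

# One OCR result per page; text lines come back in reading order with bounding boxes.
ocr_results = run_ocr([image], [["en"]], det_model, det_processor, rec_model, rec_processor)
extracted_text = "\n".join(line.text for line in ocr_results[0].text_lines)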
# Stage 2: Feed structured output to VLM for understanding
import ollama
resp = ollama.chat(model="qwen2.5-vl:7b", messages=[{
    "role": "user",
    # The OCR text carries the structure; the page image is only visual context.
    "content": f"This document contains: {extracted_text}\n\nAnswer: What is the total revenue in Q3? What are the key risks listed?",
    "images": [open("document.png", "rb").read()],
}])
print(resp["message"]["content"])
Used RTX 3060 12 GB (~$200-250, see /hardware/rtx-3060-12gb). Runs Surya OCR at 1-3 seconds per page plus Qwen2.5-VL 7B at 5-10 seconds per page for understanding; the combined pipeline handles 100-200 pages/hour. For simpler documents (invoices, forms), Surya alone extracts all the structured data without an LLM. Pair with a Ryzen 5 5600 + 32 GB DDR4 + 1 TB NVMe. Total: ~$420-490. Document understanding at this budget works well for small-to-medium document sets (1,000-5,000 pages).
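For the invoice/form case, extraction can stay entirely in Python once Surya has produced reading-order text. A minimal sketch, assuming the extracted_text variable from the Stage 1 snippet above; the field names and regex patterns are illustrative, not templates that ship with Surya.

import re

# Illustrative patterns for a simple invoice; adapt them to your own documents.
FIELD_PATTERNS = {
    "invoice_number": re.compile(r"Invoice\s*(?:No\.?|#)\s*[:\-]?\s*(\S+)", re.IGNORECASE),
    "invoice_date": re.compile(r"Date\s*[:\-]?\s*(\d{4}-\d{2}-\d{2})", re.IGNORECASE),
    "total_due": re.compile(r"Total\s*(?:Due)?\s*[:\-]?\s*\$?([\d,]+\.\d{2})", re.IGNORECASE),
}

def extract_fields(ocr_text: str) -> dict:
    """Pull structured fields straight out of Surya's reading-order text; no LLM involved."""
    fields = {}
    for name, pattern in FIELD_PATTERNS.items():
        match = pattern.search(ocr_text)
        fields[name] = match.group(1) if match else None
    return fields

print(extract_fields(extracted_text))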
Used RTX 3090 24 GB (~$700-900, see /hardware/rtx-3090). Runs Surya OCR + Qwen2.5-VL 72B at 15-25 seconds per page — highest-quality document understanding. The 72B model correctly handles complex reasoning over table data, cross-page references, and technical diagrams. For enterprise document processing (50K+ pages/day), run Surya plus a 7B VL model on 2× RTX 3060 in parallel. Total: ~$1,500-2,200. Document understanding is a pipeline problem — OCR + layout + reasoning. Budget for the OCR stage (CPU/GPU) AND the reasoning stage (GPU).
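One way to run the 7B VL stage across two cards is to start one Ollama server per GPU and round-robin pages between them. A minimal sketch, assuming each server was launched with CUDA_VISIBLE_DEVICES pinned to one card and OLLAMA_HOST set to its own port; the ports, file names, and prompt below are illustrative.

# Assumed launch commands, one per GPU:
#   CUDA_VISIBLE_DEVICES=0 OLLAMA_HOST=127.0.0.1:11434 ollama serve
#   CUDA_VISIBLE_DEVICES=1 OLLAMA_HOST=127.0.0.1:11435 ollama serve
from concurrent.futures import ThreadPoolExecutor
import ollama

clients = [ollama.Client(host="http://127.0.0.1:11434"),
           ollama.Client(host="http://127.0.0.1:11435")]

def understand_page(args):
    idx, (image_path, page_text) = args
    client = clients[idx % len(clients)]  # round-robin pages across the two GPUs
    resp = client.chat(model="qwen2.5-vl:7b", messages=[{
        "role": "user",
        "content": f"This document contains: {page_text}\n\nSummarize the key figures.",
        "images": [open(image_path, "rb").read()],
    }])
    return resp["message"]["content"]

# pages: (image_path, ocr_text) pairs produced by the Surya stage.
pages = [("page_001.png", "OCR text for page 1"), ("page_002.png", "OCR text for page 2")]
with ThreadPoolExecutor(max_workers=len(clients)) as pool:
    answers = list(pool.map(understand_page, enumerate(pages)))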
The mistake: feeding a raw PDF page image directly to a VLM and asking for structured output like "extract the table as JSON."

Why it fails: VLMs read page images at roughly 980×980 resolution and compress them into a fixed budget of visual tokens. A dense PDF page with 50+ table cells, 100+ numbers, and a multi-column layout exceeds that visual token budget, so the model hallucinates values or misses cells entirely.

The fix: always run a layout-aware OCR stage first (Surya, or Tesseract with layout analysis) to extract text with bounding boxes and reading order. Then feed the VLM (1) the OCR-extracted text, which is already structured, and (2) the original image for visual context. The VLM reasons over the text instead of trying to read every pixel. For production: OCR → structured extraction (regex/template) → VLM only for ambiguous cases.
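A minimal sketch of that routing, assuming the same Surya output and the qwen2.5-vl:7b model pulled above; the single field pattern here is illustrative.

import re
import ollama

TOTAL_REVENUE = re.compile(r"Total\s+revenue\s*[:\-]?\s*\$?([\d,.]+)", re.IGNORECASE)

def total_revenue(ocr_text: str, image_path: str) -> str:
    # Cheap path: the template matches the OCR text directly, so no model call is needed.
    match = TOTAL_REVENUE.search(ocr_text)
    if match:
        return match.group(1)
    # Ambiguous case: fall back to the VLM with the OCR text plus the page image.
    resp = ollama.chat(model="qwen2.5-vl:7b", messages=[{
        "role": "user",
        "content": f"OCR text:\n{ocr_text}\n\nWhat is the total revenue? Reply with the number only.",
        "images": [open(image_path, "rb").read()],
    }])
    return resp["message"]["content"].strip()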
Browse all tools for runtimes that fit this workload.
Local AI workloads have real hardware constraints that vary by task type. VRAM ceiling decides what model fits; bandwidth decides decode speed; compute decides prefill speed. Pick the GPU tier that fits your actual workload, not the spec sheet.
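A back-of-envelope way to sanity-check those three constraints; the bytes-per-parameter figure and the RTX 3060 bandwidth below are rough assumptions, not measurements.

def estimate_vram_gb(params_b: float, bytes_per_param: float, kv_cache_gb: float = 1.0, overhead_gb: float = 1.0) -> float:
    """Weights plus KV cache plus runtime overhead, in GB (rough)."""
    return params_b * bytes_per_param + kv_cache_gb + overhead_gb

def estimate_decode_tps(mem_bandwidth_gbs: float, params_b: float, bytes_per_param: float) -> float:
    """Decode is bandwidth-bound: each generated token reads every weight once (upper bound)."""
    return mem_bandwidth_gbs / (params_b * bytes_per_param)

# Example: a 7B model at roughly 4.5 bits per weight (~0.56 bytes/param) on an RTX 3060 (~360 GB/s).
print(estimate_vram_gb(7, 0.56))          # ~5.9 GB -> fits in 12 GB with room for context
print(estimate_decode_tps(360, 7, 0.56))  # ~92 tok/s upper bound; real throughput is lower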
The errors most operators hit when running document understanding locally. Each links to a diagnose+fix walkthrough.
Verify your specific hardware can handle document understanding before committing money.
RAG workflows mix embedding throughput, long-context inference, and reasonable VRAM headroom. The guides below cover the buyer decision honestly.