General screenshot understanding for productivity workflows — code screenshots, terminal output, error messages, document screenshots.
Pull a model first: ollama pull minicpm-v (8 GB) or ollama pull llava:13b (8 GB). Then query it from Python:

import ollama

with open("screenshot.png", "rb") as f:
    img = f.read()

resp = ollama.chat(model="minicpm-v", messages=[{
    "role": "user",
    "content": "What's in this screenshot? If there's an error message, explain what it means and how to fix it.",
    "images": [img],
}])
print(resp["message"]["content"])
Screenshot analysis is the "universal input" for local AI: any visual information becomes queryable.
A used RTX 3060 12 GB (~$200-250, see /hardware/rtx-3060-12gb) runs MiniCPM-V at 5-10 seconds per screenshot, fast enough for interactive use, and can process 200-300 screenshots/hour in batch. Pair it with a Ryzen 5 5600, 16 GB DDR4, and a 512 GB NVMe drive for a total of ~$360-405. For CPU-only setups, LLaVA 7B via llama.cpp runs at 30-60 seconds per screenshot on a modern laptop: slow, but workable for occasional use. Screenshot analysis is a "quality of model" task rather than a raw-speed task; even small VLMs handle basic text extraction and scene description well.
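For the batch case, here is a minimal loop sketch, assuming captures land in a screenshots/ folder (a hypothetical path) and the minicpm-v model pulled above. Writing each description to a sidecar .txt file makes the day's screenshots searchable with grep:

from pathlib import Path

import ollama

PROMPT = "Summarize what is on screen. Transcribe any error messages verbatim."

for shot in sorted(Path("screenshots").glob("*.png")):
    resp = ollama.chat(
        model="minicpm-v",
        messages=[{"role": "user", "content": PROMPT, "images": [shot.read_bytes()]}],
    )
    # Save the description next to the screenshot for later search.
    shot.with_suffix(".txt").write_text(resp["message"]["content"])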
A used RTX 3090 24 GB ($700-900, see /hardware/rtx-3090) runs Qwen2-VL 72B at 10-20 seconds per screenshot, the highest-quality local screenshot analysis available: it can extract text from dense dashboards, understand multi-window layouts, and provide detailed technical analysis of error screenshots. For productivity workflows that analyze every screenshot taken during the day, Qwen2-VL 7B at 2-4 seconds per screenshot is the throughput play. Total: ~$1,800-2,200. An RTX 4090 ($2,000, see /hardware/rtx-4090) drops analysis to 1-3 seconds.
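Qwen2-VL is typically served locally through vLLM's OpenAI-compatible endpoint rather than Ollama. A sketch under that assumption; the model name, port, and prompt are illustrative:

# Assumes a running server, e.g.: vllm serve Qwen/Qwen2-VL-7B-Instruct
import base64

from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="unused")

with open("screenshot.png", "rb") as f:
    b64 = base64.b64encode(f.read()).decode()

resp = client.chat.completions.create(
    model="Qwen/Qwen2-VL-7B-Instruct",
    messages=[{
        "role": "user",
        "content": [
            {"type": "text", "text": "Extract all visible text, then summarize the screen."},
            {"type": "image_url", "image_url": {"url": f"data:image/png;base64,{b64}"}},
        ],
    }],
)
print(resp.choices[0].message.content)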
The mistake: taking a screenshot of a 4K monitor showing six windows of dense text, then asking "What does this say?" and getting a garbled summary. Why it fails: VLMs process images on a fixed resolution grid, so a 4K screenshot downscaled to 980×980 loses the text in small windows entirely; the model hallucinates content because it sees pixel blobs, not readable text. The fix: crop to the region of interest before analysis, as in the sketch below. Take screenshots of individual windows, not the entire desktop. For dense multi-window screenshots, use OCR (Surya/Tesseract) to extract text from each region first, then feed the extracted text to an LLM. Vision models complement OCR: they are good at understanding layout and context, bad at reading tiny text at low resolution.
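A sketch of the crop-then-OCR pipeline using Pillow and Tesseract (via pytesseract); the file names, crop coordinates, and the llama3.1 model choice are illustrative, and Surya would slot into the same place with its own API:

import ollama
import pytesseract
from PIL import Image

full = Image.open("desktop_4k.png")
# Crop one window's bounding box: (left, top, right, bottom).
window = full.crop((100, 80, 1380, 880))
window.save("window.png")  # feed this crop to the VLM for layout/context questions

# OCR the crop for exact text, then let a text-only LLM explain it.
text = pytesseract.image_to_string(window)
answer = ollama.chat(model="llama3.1", messages=[{
    "role": "user",
    "content": f"Explain this terminal output and any error messages:\n\n{text}",
}])
print(answer["message"]["content"])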
Browse all tools for runtimes that fit this workload.
Local AI workloads have real hardware constraints that vary by task type. VRAM ceiling decides what model fits; bandwidth decides decode speed; compute decides prefill speed. Pick the GPU tier that fits your actual workload, not the spec sheet.
The errors most operators hit when running screenshot analysis locally. Each links to a diagnose+fix walkthrough.
Verify your specific hardware can handle screenshot analysis before committing money.
OCR and document-understanding workloads use vision-language models — the buyer math is different from text-only LLM shopping.