Build a local vision-model stack (May 2026)
Run a vision-language model locally for image understanding, document Q&A over screenshots, OCR-plus-reasoning, and visual analysis tasks. All processing on your hardware; images never leave the network.
- 01 · Model · Primary multimodal model (text + vision): llama-4-scout
Llama 4 Scout is the multimodal flagship in the open Llama 4 family. Strong image understanding combined with the same reasoning quality the text-only Llama 4 line delivers. The pick when you need image-grounded analysis at frontier-tier quality.
- 02 · Model · Higher-capability reasoning + vision (when 24GB lets it fit): llama-4-maverick
Llama 4 Maverick is the larger variant — better reasoning quality but heavier. AWQ-INT4 makes it borderline-feasible on 24GB; the 5090 32GB is where it comfortably fits with image-token headroom.
- 03 · Tool · Inference engine (vision-aware): vllm
vLLM has first-class vision-language model support as of v0.7+. Image preprocessing happens server-side; the OAI endpoint accepts image URLs and base64 images. Continuous batching matters for vision because image tokenization is more expensive than text.
- 04 · Tool · Single-user alternative runtime: ollama
Ollama supports vision models (the LLaVA family, Llama 3.2 Vision, Qwen 2.5 VL) at the solo-developer tier. Drop-in replacement for vLLM in this stack when concurrency doesn't matter; loses ~30% throughput vs vLLM but wins on setup time.
- 05 · Tool · Frontend with image upload: openwebui
Open WebUI's image upload integration with vision models is the cleanest in the local-AI category. Drag-and-drop images into chat; the model sees them. RAG can also accept images for visual document search.
- 06 · Hardware · GPU (minimum tier; vision tokens are heavy): rtx-4090
Vision-language models tokenize images as long sequences (a 1024×1024 image becomes ~256-1024 vision tokens depending on the model's tokenizer). VRAM budget shrinks fast on multi-image queries. RTX 4090 24GB is the floor; 5090 32GB or M-class Apple is more comfortable.
Why vision models are different
Vision-language models (VLMs) tokenize images as long sequences of vision tokens that get concatenated into the regular text token stream. The architectural reality this stack respects: vision tokens are expensive — a single 1024×1024 image can consume 256-1024 tokens of context depending on the model's vision encoder.
The downstream consequences:
- Multi-image queries fill context fast. Five 1024×1024 images at 512 vision tokens each = 2560 tokens of just images, before the user prompt or model response.
- Higher-resolution images cost more. Some vision encoders use tiled processing — a 2048×2048 image becomes 4 tiles of 1024×1024, multiplying token count.
- VRAM budget shrinks. KV cache for vision tokens is the same per-token cost as text tokens, so VRAM sized for text-only chat may not fit multi-image vision workloads.
- TTFT is longer. Image preprocessing + tokenization + encoder forward pass adds 1-3 seconds before the model starts generating its text response.
The headline architectural choice this stack makes: vLLM with native vision-language support, not the llama.cpp / Ollama vision path that pre-dates dedicated VLM serving infrastructure. vLLM's v0.7+ vision support handles image tokenization server-side, batches efficiently, and formats multi-modal tool calls correctly. Ollama works for single-user solo workflows but loses concurrency efficiency.
Step-by-step setup
1. Bring up vLLM with a vision model
# Llama 4 Scout via vLLM with vision support enabled
docker run --gpus all -d --name vllm-vision \
-p 8000:8000 \
-v ~/.cache/huggingface:/root/.cache/huggingface \
--restart unless-stopped \
vllm/vllm-openai:v0.17.1 \
--model meta-llama/Llama-4-Scout-AWQ \
--gpu-memory-utilization 0.85 \
--max-model-len 32768 \
--enable-chunked-prefill \
--max-num-seqs 4
# --max-num-seqs 4 because vision queries hold long KV cache;
# higher concurrency leads to OOM on 24GB.
The vLLM vision-language path uses the same OpenAI-compatible chat completions API; image inputs are passed as image_url message parts (URLs or base64). Open WebUI's image upload uses base64 by default.
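For a quick check from the command line before wiring up the frontend, a minimal sketch of the base64 path that Open WebUI itself uses: encode a local image as a data URL and post it to the same endpoint. The file path and prompt are placeholders; base64 -w0 is the GNU coreutils form (use base64 -i on macOS).
# Hedged sketch: send a local screenshot as a base64 data URL.
IMG_B64=$(base64 -w0 /tmp/screenshot.png)   # GNU base64; macOS: base64 -i /tmp/screenshot.png
curl http://localhost:8000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d @- <<EOF
{
  "model": "meta-llama/Llama-4-Scout-AWQ",
  "messages": [{
    "role": "user",
    "content": [
      {"type": "text", "text": "Summarize the error shown in this screenshot."},
      {"type": "image_url", "image_url": {"url": "data:image/png;base64,${IMG_B64}"}}
    ]
  }],
  "max_tokens": 1024
}
EOF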
2. Optional — Ollama as a fallback for fast iteration
# Smaller, faster vision model via Ollama for quick iteration
ollama pull qwen2.5vl:7b
# Verify
# (the CLI attaches local image file paths it finds in the prompt text)
ollama run qwen2.5vl:7b "Describe this image: /path/to/test-image.jpg"
Run alongside vLLM on a different port. Use Ollama's smaller VLM for fast iteration; switch to vLLM's Llama 4 Scout when frontier-quality matters. Open WebUI's model switcher handles both endpoints.
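If you're scripting against Ollama rather than the CLI, its native API accepts base64 images directly in an images array. A minimal sketch; the image path is a placeholder and streaming is disabled so a single JSON object comes back.
IMG_B64=$(base64 -w0 /path/to/test-image.jpg)
curl http://localhost:11434/api/generate -d @- <<EOF
{
  "model": "qwen2.5vl:7b",
  "prompt": "Describe this image",
  "images": ["${IMG_B64}"],
  "stream": false
}
EOF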
3. Configure Open WebUI for image upload
docker run -d --name open-webui \
-p 3000:8080 \
--restart unless-stopped \
-v open-webui:/app/backend/data \
--add-host=host.docker.internal:host-gateway \
-e OPENAI_API_BASE_URLS="http://host.docker.internal:8000/v1" \
-e OPENAI_API_KEYS="any-string" \
-e ENABLE_OLLAMA_API=true \
-e OLLAMA_BASE_URLS="http://host.docker.internal:11434" \
-e ENABLE_IMAGE_GENERATION=false \
ghcr.io/open-webui/open-webui:latest
Image upload works out of the box once a vision model is selected. Drag-and-drop images into the chat input; Open WebUI base64-encodes them and includes them in the OAI chat-completion request.
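Before picking a model in the UI, a quick sanity check that both backends are reachable on their default ports (adjust if you changed the port mappings above):
curl -s http://localhost:8000/v1/models     # vLLM: OpenAI-format model list should include the Scout model
curl -s http://localhost:11434/api/tags     # Ollama: should list qwen2.5vl:7b if step 2 was run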
4. Test with a multi-image query
# Direct API test — drop 2 images into a single chat completion
curl http://localhost:8000/v1/chat/completions \
-H "Content-Type: application/json" \
-d '{
"model": "meta-llama/Llama-4-Scout-AWQ",
"messages": [{
"role": "user",
"content": [
{"type": "text", "text": "What changed between these two screenshots?"},
{"type": "image_url", "image_url": {"url": "https://example.com/before.png"}},
{"type": "image_url", "image_url": {"url": "https://example.com/after.png"}}
]
}],
"max_tokens": 512
}'
Vision token economics
The honest math you need to plan VRAM and latency budgets:
- Llama 4 Scout / Maverick use approximately 512 vision tokens per 1024×1024 image at default resolution.
- Qwen 2.5 VL uses tiled tokenization — roughly 256-1024 tokens per image depending on aspect ratio and resolution.
- Llama 3.2 Vision uses fewer vision tokens (256 per image at 560×560) but lower resolution.
- VRAM cost of a vision query: model weights (~22GB for AWQ-INT4 32B-class) + KV cache for text + image tokens. A 5-image query is roughly 0.5-1.5GB of additional KV cache.
- TTFT on a single 1024×1024 image: ~1.5-3 seconds on RTX 4090 (image preprocessing + encoder forward). Then text generation at the usual 30-40 tok/s.
- End-to-end on a typical query (one image + short prompt + 200-token response): ~5-10 seconds.
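To sanity-check the 0.5-1.5GB KV-cache figure, a back-of-envelope sketch. The layer/head dimensions below are illustrative assumptions for a 32B-class model, not Scout's published architecture.
# ASSUMED dimensions for illustration only: 48 layers, 8 KV heads,
# head_dim 128, FP16 KV cache (2 bytes per element).
LAYERS=48; KV_HEADS=8; HEAD_DIM=128; BYTES_PER_ELEM=2
PER_TOKEN_BYTES=$(( 2 * LAYERS * KV_HEADS * HEAD_DIM * BYTES_PER_ELEM ))         # K and V planes
IMAGES=5; TOKENS_PER_IMAGE=512; TEXT_TOKENS=1024
TOTAL_TOKENS=$(( IMAGES * TOKENS_PER_IMAGE + TEXT_TOKENS ))
echo "tokens in context: $TOTAL_TOKENS"                                          # -> 3584
echo "extra KV cache: $(( PER_TOKEN_BYTES * TOTAL_TOKENS / 1024 / 1024 )) MiB"   # -> ~672 MiB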
Failure modes you'll hit
- OOM on multi-image queries. A 5-image query blows past KV-cache budgets sized for text-only workloads. Lower --max-num-seqs to 4 or 2; constrain users to fewer images per query.
- Image-format mismatch. Some vision models require specific image formats (RGB, specific channel orders). Open WebUI handles common cases but obscure formats (TIFF, RAW) may fail. Convert to PNG/JPEG client-side.
- Resolution silently downsampled. Vision encoders typically downsample images to a fixed resolution (560-1024 px). High-detail tasks (OCR on small text) may need explicit higher-resolution model variants.
- Image upload size limits. Open WebUI's default upload limit is 10MB; vLLM accepts up to 32MB base64-encoded. Configure both consistently or large screenshots fail upload.
- Vision model tokenizer mismatch. Some vision models share the base text model's tokenizer; others have a separate visual tokenizer. Mismatches between the configured tokenizer and the model's actual expectations produce garbled output.
- Streaming-response truncation on long visual descriptions. Vision queries often produce verbose responses (describing what's in an image). The default max_tokens of 512 truncates frequently; raise it to 1024-2048 for visual-description workloads (a quick truncation check is sketched after this list).
- Mixed-modality tool-calling failures. Some clients can't handle the mixed text + image content format. Verify the OAI tool-call format works with your specific MCP server / agent harness before committing.
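To confirm the truncation failure mode in practice, send a deliberately low max_tokens and inspect finish_reason, which the OpenAI-compatible response sets to "length" when the cap was hit. A small sketch; jq is assumed installed and the prompt is a placeholder.
curl -s http://localhost:8000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{"model": "meta-llama/Llama-4-Scout-AWQ",
       "messages": [{"role": "user", "content": "Describe a cluttered desk in detail."}],
       "max_tokens": 64}' \
  | jq -r '.choices[0].finish_reason'   # "length" means max_tokens truncated the response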
Variations and alternatives
Apple Silicon variation. M3-M4 Max with unified memory handles vision models gracefully — large images don't fight a separate VRAM pool. Swap vLLM for MLX-LM when MLX-VL support covers your model; otherwise use Ollama's vision path.
Smaller-VLM variation. Qwen 2.5 VL 7B fits comfortably on 16GB VRAM. Drop the Llama 4 Scout pick and point Open WebUI at the 7B-class VLM instead. Lower-quality image understanding but viable on the budget tier.
Multi-modal RAG variation. Combine vision with document RAG via AnythingLLM. Upload PDFs that contain images; AnythingLLM extracts both text and images, retrieval finds the right passage, the VLM analyzes the image inline. Heaviest of these variations; the right pick for “chat with my image-rich documents.”
Specialized OCR variation. For text-heavy screenshots / documents, dedicated OCR models (Florence-2, DocLayout-YOLO) preprocess the image into structured text before the VLM sees it. Lower latency and higher accuracy than asking a general VLM to do OCR; harder to set up.
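A minimal sketch of the OCR-preprocessing idea using Tesseract: extract plain text first, then send only text to the model. The invoice path and question are placeholders; jq is assumed installed and is used here to build the JSON body safely.
OCR_TEXT=$(tesseract /tmp/invoice.png stdout 2>/dev/null)   # plain-text OCR pass, no VLM involved
curl -s http://localhost:8000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d "$(jq -n --arg t "$OCR_TEXT" '{
        model: "meta-llama/Llama-4-Scout-AWQ",
        messages: [{role: "user",
                    content: ("Extract the total amount due from this OCR output:\n\n" + $t)}],
        max_tokens: 256}')"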
Who should avoid this stack
- Anyone whose primary need is OCR. General VLMs are mediocre at OCR. Use specialized OCR models (Tesseract, Florence-2) for document text extraction; reserve VLMs for image-and-text reasoning.
- Anyone needing real-time visual processing. Vision tokenization + encoder forward = 1-3 seconds before text generation starts. Real-time use cases need specialized vision models, not VLMs.
- Anyone with strict 16GB VRAM ceiling. 32B-class VLMs don't fit. Drop to 7B-class VLMs (Qwen 2.5 VL 7B, Llama 3.2 Vision 11B) — useful but lower quality.
- Anyone running concurrent multi-user vision queries. The KV-cache cost per query is high; few concurrent users on a 24GB card. Use a 5090 (32GB) or multiple 4090s for team-shared vision deployments.
- Anyone whose image data is sensitive enough that even local model inference is concerning. VLMs don't exfiltrate data, but they consume VRAM unpredictably. For maximum-privacy workloads, dedicated air-gapped vision models with measured token budgets are safer.
Going deeper
- Llama 4 Scout catalog entry — multimodal architecture, capabilities, benchmarks.
- vLLM operational review — the runtime-specific operator detail; vision-language support landed in v0.7+.
- Open WebUI operational review — the L1.5 review covering image-upload UX and provider abstraction.
- Inference runtime ecosystem map — full landscape of vision-capable runtimes.
- RTX 4090 workstation stack — the text-only equivalent, with the same hardware.