Build a local vision-model stack (May 2026)
Run a vision-language model locally for image understanding, document Q&A over screenshots, OCR-plus-reasoning, and visual analysis tasks. All processing on your hardware; images never leave the network.
- 01 · Model · Primary multimodal model (text + vision): llama-4-scout
Llama 4 Scout is the multimodal flagship in the open Llama 4 family. Strong image understanding combined with the same reasoning quality the text-only Llama 4 line delivers. The pick when you need image-grounded analysis at frontier-tier quality.
- 02 · Model · Higher-capability reasoning + vision (when 24GB lets it fit): llama-4-maverick
Llama 4 Maverick is the larger variant — better reasoning quality but heavier. AWQ-INT4 makes it borderline-feasible on 24GB; the 5090 32GB is where it comfortably fits with image-token headroom.
- 03 · Tool · Inference engine (vision-aware): vllm
vLLM has first-class vision-language model support as of v0.7+. Image preprocessing happens server-side; the OAI endpoint accepts image URLs and base64 images. Continuous batching matters for vision because image tokenization is more expensive than text.
- 04 · Tool · Single-user alternative runtime: ollama
Ollama supports vision models (the LLaVA family, Llama 3.2 Vision, Qwen 2.5 VL) at the solo-developer tier. Drop-in replacement for vLLM in this stack when concurrency doesn't matter; loses ~30% throughput vs vLLM but wins on setup time.
- 05 · Tool · Frontend with image upload: openwebui
Open WebUI's image upload integration with vision models is the cleanest in the local-AI category. Drag-and-drop images into chat; the model sees them. RAG can also accept images for visual document search.
- 06 · Hardware · GPU (minimum tier; vision tokens are heavy): rtx-4090
Vision-language models tokenize images as long sequences (a 1024×1024 image becomes ~256-1024 vision tokens depending on the model's tokenizer). VRAM budget shrinks fast on multi-image queries. RTX 4090 24GB is the floor; 5090 32GB or M-class Apple is more comfortable.
Why vision models are different
Vision-language models (VLMs) tokenize images as long sequences of vision tokens that get concatenated into the regular text token stream. The architectural reality this stack respects: vision tokens are expensive — a single 1024×1024 image can consume 256-1024 tokens of context depending on the model's vision encoder.
The downstream consequences:
- Multi-image queries fill context fast. Five 1024×1024 images at 512 vision tokens each = 2560 tokens of just images, before the user prompt or model response.
- Higher-resolution images cost more. Some vision encoders use tiled processing — a 2048×2048 image becomes 4 tiles of 1024×1024, multiplying token count.
- VRAM budget shrinks. KV cache for vision tokens is the same per-token cost as text tokens, so VRAM sized for text-only chat may not fit multi-image vision workloads.
- TTFT is longer. Image preprocessing + tokenization + encoder forward pass adds 1-3 seconds before the model starts generating its text response.
The headline architectural choice this stack makes: vLLM with native vision-language support, not the llama.cpp / Ollama vision path that pre-dates dedicated VLM serving infrastructure. vLLM's v0.7+ vision support handles image tokenization server-side, batches efficiently, and formats multi-modal tool calls correctly. Ollama works for single-user solo workflows but loses concurrency efficiency.
Step-by-step setup
1. Bring up vLLM with a vision model
# Llama 4 Scout via vLLM with vision support enabled
docker run --gpus all -d --name vllm-vision \
-p 8000:8000 \
-v ~/.cache/huggingface:/root/.cache/huggingface \
--restart unless-stopped \
vllm/vllm-openai:v0.17.1 \
--model meta-llama/Llama-4-Scout-AWQ \
--gpu-memory-utilization 0.85 \
--max-model-len 32768 \
--enable-chunked-prefill \
--max-num-seqs 4
# --max-num-seqs 4 because vision queries hold long KV cache;
# higher concurrency leads to OOM on 24GB.
The vLLM vision-language path uses the same OpenAI-compatible chat completions API; image inputs are passed as image_url message parts (URLs or base64). Open WebUI's image upload uses base64 by default.
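For a quick check from the command line before wiring up the frontend, a minimal sketch of the base64 path that Open WebUI itself uses: encode a local image as a data URL and post it to the same endpoint. The file path and prompt are placeholders; base64 -w0 is the GNU coreutils form (use base64 -i on macOS).
# Hedged sketch: send a local screenshot as a base64 data URL.
IMG_B64=$(base64 -w0 /tmp/screenshot.png)   # GNU base64; macOS: base64 -i /tmp/screenshot.png
curl http://localhost:8000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d @- <<EOF
{
  "model": "meta-llama/Llama-4-Scout-AWQ",
  "messages": [{
    "role": "user",
    "content": [
      {"type": "text", "text": "Summarize the error shown in this screenshot."},
      {"type": "image_url", "image_url": {"url": "data:image/png;base64,${IMG_B64}"}}
    ]
  }],
  "max_tokens": 1024
}
EOF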
2. Optional — Ollama as a fallback for fast iteration
# Smaller, faster vision model via Ollama for quick iteration
ollama pull qwen2.5vl:7b
# Verify
# (the CLI attaches local image file paths it finds in the prompt text)
ollama run qwen2.5vl:7b "Describe this image: /path/to/test-image.jpg"
Run alongside vLLM on a different port. Use Ollama's smaller VLM for fast iteration; switch to vLLM's Llama 4 Scout when frontier-quality matters. Open WebUI's model switcher handles both endpoints.
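If you're scripting against Ollama rather than the CLI, its native API accepts base64 images directly in an images array. A minimal sketch; the image path is a placeholder and streaming is disabled so a single JSON object comes back.
IMG_B64=$(base64 -w0 /path/to/test-image.jpg)
curl http://localhost:11434/api/generate -d @- <<EOF
{
  "model": "qwen2.5vl:7b",
  "prompt": "Describe this image",
  "images": ["${IMG_B64}"],
  "stream": false
}
EOF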
3. Configure Open WebUI for image upload
docker run -d --name open-webui \
-p 3000:8080 \
--restart unless-stopped \
-v open-webui:/app/backend/data \
--add-host=host.docker.internal:host-gateway \
-e OPENAI_API_BASE_URLS="http://host.docker.internal:8000/v1" \
-e OPENAI_API_KEYS="any-string" \
-e ENABLE_OLLAMA_API=true \
-e OLLAMA_BASE_URLS="http://host.docker.internal:11434" \
-e ENABLE_IMAGE_GENERATION=false \
ghcr.io/open-webui/open-webui:latest
Image upload works out of the box once a vision model is selected. Drag-and-drop images into the chat input; Open WebUI base64-encodes them and includes them in the OAI chat-completion request.
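Before picking a model in the UI, a quick sanity check that both backends are reachable on their default ports (adjust if you changed the port mappings above):
curl -s http://localhost:8000/v1/models     # vLLM: OpenAI-format model list should include the Scout model
curl -s http://localhost:11434/api/tags     # Ollama: should list qwen2.5vl:7b if step 2 was run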
4. Test with a multi-image query
# Direct API test — drop 2 images into a single chat completion
curl http://localhost:8000/v1/chat/completions \
-H "Content-Type: application/json" \
-d '{
"model": "meta-llama/Llama-4-Scout-AWQ",
"messages": [{
"role": "user",
"content": [
{"type": "text", "text": "What changed between these two screenshots?"},
{"type": "image_url", "image_url": {"url": "https://example.com/before.png"}},
{"type": "image_url", "image_url": {"url": "https://example.com/after.png"}}
]
}],
"max_tokens": 512
}'
Vision token economics
The honest math you need to plan VRAM and latency budgets:
- Llama 4 Scout / Maverick use approximately 512 vision tokens per 1024×1024 image at default resolution.
- Qwen 2.5 VL uses tiled tokenization — roughly 256-1024 tokens per image depending on aspect ratio and resolution.
- Llama 3.2 Vision uses fewer vision tokens (256 per image at 560×560) but lower resolution.
- VRAM cost of a vision query: model weights (~22GB for AWQ-INT4 32B-class) + KV cache for text + image tokens. A 5-image query is roughly 0.5-1.5GB of additional KV cache.
- TTFT on a single 1024×1024 image: ~1.5-3 seconds on RTX 4090 (image preprocessing + encoder forward). Then text generation at the usual 30-40 tok/s.
- End-to-end on a typical query (one image + short prompt + 200-token response): ~5-10 seconds.
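To sanity-check the 0.5-1.5GB KV-cache figure, a back-of-envelope sketch. The layer/head dimensions below are illustrative assumptions for a 32B-class model, not Scout's published architecture.
# ASSUMED dimensions for illustration only: 48 layers, 8 KV heads,
# head_dim 128, FP16 KV cache (2 bytes per element).
LAYERS=48; KV_HEADS=8; HEAD_DIM=128; BYTES_PER_ELEM=2
PER_TOKEN_BYTES=$(( 2 * LAYERS * KV_HEADS * HEAD_DIM * BYTES_PER_ELEM ))         # K and V planes
IMAGES=5; TOKENS_PER_IMAGE=512; TEXT_TOKENS=1024
TOTAL_TOKENS=$(( IMAGES * TOKENS_PER_IMAGE + TEXT_TOKENS ))
echo "tokens in context: $TOTAL_TOKENS"                                          # -> 3584
echo "extra KV cache: $(( PER_TOKEN_BYTES * TOTAL_TOKENS / 1024 / 1024 )) MiB"   # -> ~672 MiB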
Failure modes you'll hit
- OOM on multi-image queries. A 5-image query blows past KV-cache budgets sized for text-only workloads. Lower --max-num-seqs to 4 or 2; constrain users to fewer images per query.
- Image-format mismatch. Some vision models require specific image formats (RGB, specific channel orders). Open WebUI handles common cases but obscure formats (TIFF, RAW) may fail. Convert to PNG/JPEG client-side.
- Resolution silently downsampled. Vision encoders typically downsample images to a fixed resolution (560-1024 px). High-detail tasks (OCR on small text) may need explicit higher-resolution model variants.
- Image upload size limits. Open WebUI's default upload limit is 10MB; vLLM accepts up to 32MB base64-encoded. Configure both consistently or large screenshots fail upload.
- Vision model tokenizer mismatch. Some vision models share the base text model's tokenizer; others have a separate visual tokenizer. Mismatches between the configured tokenizer and the model's actual expectations produce garbled output.
- Streaming-response truncation on long visual descriptions. Vision queries often produce verbose responses (describing what's in an image). The default max_tokens of 512 truncates frequently; raise it to 1024-2048 for visual-description workloads (a quick truncation check is sketched after this list).
- Mixed-modality tool-calling failures. Some clients can't handle the mixed text + image content format. Verify the OAI tool-call format works with your specific MCP server / agent harness before committing.
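To confirm the truncation failure mode in practice, send a deliberately low max_tokens and inspect finish_reason, which the OpenAI-compatible response sets to "length" when the cap was hit. A small sketch; jq is assumed installed and the prompt is a placeholder.
curl -s http://localhost:8000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{"model": "meta-llama/Llama-4-Scout-AWQ",
       "messages": [{"role": "user", "content": "Describe a cluttered desk in detail."}],
       "max_tokens": 64}' \
  | jq -r '.choices[0].finish_reason'   # "length" means max_tokens truncated the response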
Variations and alternatives
Apple Silicon variation. M3-M4 Max with unified memory handles vision models gracefully — large images don't fight a separate VRAM pool. Swap vLLM for MLX-LM when MLX-VL support covers your model; otherwise use Ollama's vision path.
Smaller-VLM variation. Qwen 2.5 VL 7B fits comfortably on 16GB VRAM. Drop the Llama 4 Scout pick and point Open WebUI at the 7B-class VLM instead. Lower-quality image understanding but viable on the budget tier.
Multi-modal RAG variation. Combine vision with document RAG via AnythingLLM. Upload PDFs that contain images; AnythingLLM extracts both text and images, retrieval finds the right passage, the VLM analyzes the image inline. Heaviest of these variations; the right pick for “chat with my image-rich documents.”
Specialized OCR variation. For text-heavy screenshots / documents, dedicated OCR models (Florence-2, DocLayout-YOLO) preprocess the image into structured text before the VLM sees it. Lower latency and higher accuracy than asking a general VLM to do OCR; harder to set up.
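A minimal sketch of the OCR-preprocessing idea using Tesseract: extract plain text first, then send only text to the model. The invoice path and question are placeholders; jq is assumed installed and is used here to build the JSON body safely.
OCR_TEXT=$(tesseract /tmp/invoice.png stdout 2>/dev/null)   # plain-text OCR pass, no VLM involved
curl -s http://localhost:8000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d "$(jq -n --arg t "$OCR_TEXT" '{
        model: "meta-llama/Llama-4-Scout-AWQ",
        messages: [{role: "user",
                    content: ("Extract the total amount due from this OCR output:\n\n" + $t)}],
        max_tokens: 256}')"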
Who should avoid this stack
- Anyone whose primary need is OCR. General VLMs are mediocre at OCR. Use specialized OCR models (Tesseract, Florence-2) for document text extraction; reserve VLMs for image-and-text reasoning.
- Anyone needing real-time visual processing. Vision tokenization + encoder forward = 1-3 seconds before text generation starts. Real-time use cases need specialized vision models, not VLMs.
- Anyone with strict 16GB VRAM ceiling. 32B-class VLMs don't fit. Drop to 7B-class VLMs (Qwen 2.5 VL 7B, Llama 3.2 Vision 11B) — useful but lower quality.
- Anyone running concurrent multi-user vision queries. The KV-cache cost per query is high; few concurrent users on a 24GB card. Use a 5090 (32GB) or multiple 4090s for team-shared vision deployments.
- Anyone whose image data is sensitive enough that even local model inference is concerning. VLMs don't exfiltrate data, but they consume VRAM unpredictably. For maximum-privacy workloads, dedicated air-gapped vision models with measured token budgets are safer.
Going deeper
- Llama 4 Scout catalog entry — multimodal architecture, capabilities, benchmarks.
- vLLM operational review — the runtime-specific operator detail; vision-language support landed in v0.7+.
- Open WebUI operational review — the L1.5 review covering image-upload UX and provider abstraction.
- Inference runtime ecosystem map — full landscape of vision-capable runtimes.
- RTX 4090 workstation stack — the text-only equivalent, with the same hardware.