RUNLOCALAI · v38

Independently operated catalog for local-AI hardware and software. Hand-written verdicts. Source-cited claims. Reproducible commands when we have them.

OP · Fredoline Eruo
DISCLOSURE

Some links on this site are affiliate links (Amazon Associates and other first-class retailers). When you buy through them, we earn a small commission at no extra cost to you. Affiliate links do not influence our verdicts: there are cards we rate highly that we don't have affiliate relationships with, and cards that sell well that we refuse to recommend.

© 2026 runlocalai.co · Independently operated
BLK · IMAGE MODELS
diffusion · vision encoders · OCR

Local image models

Diffusion models that generate images, and vision encoders that understand them — both in one hub. FLUX.1 dev/schnell, SDXL-Turbo, Stable Diffusion 3.5 medium for generation; SigLIP, ColPali, Florence-2, GOT-OCR 2.0 for understanding.

Models curated
44
Vendors
18
Commercial OK
40/44
Benchmarked
0/44

Local image AI comes in two shapes: diffusion models that produce pixels from text, and vision encoders that read meaning out of pixels. Most catalog work focuses on the LLM side; this hub fills that gap for both image generation and vision encoders.

On the generation side: FLUX.1 [dev] (12B, non-commercial — the most-liked model on HuggingFace), FLUX.1 [schnell] (12B Apache-2.0 4-step distilled — the production pick), SDXL-Turbo (2.6B 1-step real-time, non-commercial), Stable Diffusion 3.5 medium (2.5B Community license — commercial OK up to $1M revenue).

On the encoder side: SigLIP-SO400M (428M, the vision tower behind PaliGemma/Idefics/most open VLMs), ColPali v1.3 (3B, visual-document RAG SOTA), Florence-2-large (770M unified caption/OCR/grounding/segmentation), GOT-OCR 2.0 (580M end-to-end formula and table OCR).

License posture matters a lot here. FLUX.1 [dev] is research-only; FLUX.1 [schnell] is fully commercial. SDXL-Turbo blocks commercial use; SD 3.5 medium has the $1M-revenue cap. Each row calls out the exact license trap.
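
The license rules in the paragraph above can be sketched as a small lookup, useful when filtering rows for a commercial project. This is a minimal sketch, not legal guidance: the `LicenseRule` type, the `RULES` table, and both function names are hypothetical, the values are transcribed from this page only, and treating exactly $1M of revenue as still allowed is an assumption. The actual license text governs.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass(frozen=True)
class LicenseRule:
    """Hypothetical summary of one row's license posture."""
    commercial_ok: bool
    revenue_cap_usd: Optional[int] = None  # None: no cap stated on this page


# Values transcribed from the paragraph above; keys are illustrative labels.
RULES = {
    "FLUX.1 [dev]": LicenseRule(commercial_ok=False),
    "FLUX.1 [schnell]": LicenseRule(commercial_ok=True),
    "SDXL-Turbo": LicenseRule(commercial_ok=False),
    "Stable Diffusion 3.5 medium": LicenseRule(
        commercial_ok=True, revenue_cap_usd=1_000_000
    ),
}


def usable_commercially(model: str, annual_revenue_usd: int) -> bool:
    """True if the page's stated license posture allows commercial use."""
    rule = RULES[model]
    if not rule.commercial_ok:
        return False
    # Assumption: revenue exactly at the cap still qualifies.
    if rule.revenue_cap_usd is not None and annual_revenue_usd > rule.revenue_cap_usd:
        return False
    return True


def commercial_picks(annual_revenue_usd: int) -> List[str]:
    """Generation models from RULES that pass the check, in table order."""
    return [m for m in RULES if usable_commercially(m, annual_revenue_usd)]
```

Calling `commercial_picks(500_000)` keeps both FLUX.1 [schnell] and Stable Diffusion 3.5 medium; at `5_000_000` only FLUX.1 [schnell] remains.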

FAM · OTHER

Other / from-scratch

14 models
FLUX.1 [dev]
12B params · Black Forest Labs
▸ High-fidelity text-to-image for research, internal prototyping, and personal projects on a 24GB workstation

12B-parameter rectified-flow transformer for text-to-image, guidance-distilled from the FLUX.1 [pro] teacher. Currently the most-liked model on Hugging Face (~12.9k likes). Sets a new open-weights bar for prompt adherenc

License
flux-1-dev-non-c
Context
—
FLUX.1 [schnell]
12B params · Black Forest Labs
▸ Apache-2.0 4-step text-to-image for commercial products on a workstation GPU

12B rectified-flow transformer, timestep-distilled to 1-4 sampling steps, released under Apache-2.0. Same architecture as FLUX.1 [dev] but trades a bit of fidelity for ~10x faster sampling and an unrestricted commercial

License
apache-2.0 · OK
Context
—
SigLIP SO400M (patch14-384)
428M params · Google
▸ Zero-shot image classification, image-text retrieval, or as a frozen vision tower for a custom VLM on edge/consumer hardware

428M-parameter Shape-Optimized vision-language encoder trained with the sigmoid (not softmax) contrastive loss on WebLI. Hits ~83% zero-shot ImageNet-1k top-1 at 384px — the strongest open contrastive encoder in its size

License
apache-2.0 · OK
Context
—
SDXL Turbo
2.6B params · Stability AI
▸ Real-time interactive text-to-image (~50-100ms/frame) on a consumer GPU for research and demos

2.6B SDXL backbone trained with Adversarial Diffusion Distillation (ADD), producing photorealistic 512px images in a single forward pass. Designed for real-time, interactive text-to-image.

License
stabilityai-non-
Context
—
Florence-2 Large
770M params · Microsoft
▸ Edge-tier unified caption / OCR / detection / grounding pipeline where you want one model instead of four

770M-parameter unified vision foundation model with a DaViT image encoder and BART-style seq2seq decoder. One model, one set of weights — handles captioning, OCR, region/grounding, segmentation, and dense detection via t

License
mit · OK
Context
—
Stable Diffusion 3.5 Medium
2.5B params · Stability AI
▸ Permissively-licensed text-to-image for small-business and indie commercial products on a 12-16GB consumer GPU

2.5B MMDiT-X with improved Querying Key Normalization and dual attention blocks at lower resolutions. Trained for 0.25-2MP output. Positioned as the mid-tier of the SD3.5 family, designed to run on consumer hardware whil

License
stabilityai-comm · OK
Context
—
SmolVLM Instruct
2.25B params · Hugging Face
▸ Lowest-VRAM open VLM for image captioning on consumer GPU

SmolVLM-Instruct is Hugging Face's compact vision-language model built on the Idefics3 architecture, pairing SmolLM2-1.7B-Instruct with a SigLIP-SO400M vision encoder. It is engineered for minimum VRAM footprint and ship

License
apache-2.0 · OK
Context
8K
InternVL 2.5 78B
78B params · OpenGVLab
▸ datacenter-tier permissive VLM

InternVL 2.5 flagship. Approaches frontier proprietary VLMs on document and OCR tasks.

License
MIT · OK
Context
32K
Molmo 7B-D
8B params · Allen Institute (AI2)
▸ open-research VLM with UI grounding

AI2's fully-open VLM. Trained on PixMo dataset; pointing capability for UI grounding.

License
Apache 2.0 · OK
Context
4K
Moondream 2
1.9B params · vikhyat (community)
▸ edge / phone-tier vision Q&A

Tiny vision-language model. ~1.9B; designed for edge / embedded multimodal use cases. Apache 2.0.

License
Apache 2.0 · OK
Context
2K
LLaVA-OneVision 7B
7B params · LLaVA Team
▸ permissively-licensed multi-image / video VLM

LLaVA-OneVision unified single-image / multi-image / video VLM on Qwen 2 base.

License
Apache 2.0 · OK
Context
32K
InternVL 2.5 26B
26B params · OpenGVLab
▸ permissively-licensed VLM at 24GB VRAM

InternVL 2.5 mid-tier — Shanghai AI Lab vision-language model with strong document and chart understanding.

License
MIT · OK
Context
32K
LLaVA 1.6 Mistral 7B
7B params · LLaVA Team
▸ consumer-tier vision-language with permissive license

LLaVA 1.6 on Mistral 7B base. Apache 2.0 vision-language with strong OCR.

License
Apache 2.0 · OK
Context
32K
Molmo 72B
72B params · Allen Institute (AI2)
▸ datacenter-tier open VLM for agent UI

Molmo flagship. Apache 2.0 VLM rivaling proprietary models on UI pointing and visual reasoning.

License
Apache 2.0 · OK
Context
4K
FAM · GEMMA

Gemma-based

12 models
Gemma 4 31B Dense
31B params · Google
▸ workstation-tier multilingual chat with permissive license

Google's flagship dense Gemma 4. Beats some 400B-class proprietary models on benchmarks. Targets the 24GB single-GPU sweet spot.

License
Gemma Terms of U · OK
Context
128K
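
As a rough check on the 24GB target quoted for Gemma 4 31B Dense, weight memory can be estimated as parameters times bits per weight. This is a back-of-envelope sketch under stated assumptions: it counts weights only (no KV cache, activations, or vision tower overhead), uses decimal GB, and the `headroom_gb` default is an arbitrary illustrative number, not a measured one. Both function names are hypothetical.

```python
def weight_memory_gb(params_billion: float, bits_per_weight: float) -> float:
    """Approximate memory for the weights alone, in decimal GB."""
    total_bytes = params_billion * 1e9 * bits_per_weight / 8
    return total_bytes / 1e9


def fits(params_billion: float, bits_per_weight: float,
         vram_gb: float, headroom_gb: float = 2.0) -> bool:
    """Weights plus an assumed fixed headroom must fit in VRAM."""
    return weight_memory_gb(params_billion, bits_per_weight) + headroom_gb <= vram_gb


if __name__ == "__main__":
    for bits in (16, 8, 4):
        gb = weight_memory_gb(31, bits)
        print(f"31B @ {bits}-bit: {gb:.1f} GB weights, fits 24GB: {fits(31, bits, 24)}")
```

Under these assumptions a 31B dense model needs about 62 GB at 16-bit, 31 GB at 8-bit, and 15.5 GB at 4-bit, so the 24GB figure only works with roughly 4-bit quantization.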
Gemma 4 26B MoE
26B params · Google
▸ Gemma 4 MoE — workstation efficiency variant

MoE variant of Gemma 4. Faster per-token than the 31B dense at similar quality on most tasks.

License
Gemma Terms of U · OK
Context
128K
Gemma 3 27B
27B params · Google
judged 8.2/10
▸ Google's open-weight workstation-tier multilingual flagship — pre-Gemma-4 baseline

Pre-Gemma-4 flagship. Multimodal (4B+ variants), 128K context, 140 languages. Strong daily driver on 24GB cards.

License
Gemma Terms of U · OK
Context
128K
Gemma 4 E4B (Effective 4B)
4B params · Google
▸ edge-tier Gemma 4 — laptop friendly

Edge-class Gemma 4. The 'Effective 4B' branding signals it punches above its parameter count via training-data quality.

License
Gemma Terms of U · OK
Context
128K
Gemma 3 12B
12B params · Google
judged 7.9/10
▸ consumer-tier multilingual chat with vision support in 'it' variant

12B Gemma 3. Fits on 12GB consumer cards. Multimodal.

License
Gemma Terms of U · OK
Context
128K
Trendyol LLM Asure 12B
11.8B params · Trendyol
▸ Turkish business workflow assistants

Trendyol LLM Asure 12B is a Gemma 3 based multimodal instruct model for Turkish and English business workflows. The public Ollama build used in local testing is the alibayram GGUF distribution.

License
Gemma · OK
Context
128K
Gemma 3 4B
4B params · Google
judged 7.5/10
▸ edge-tier chat — Apple Silicon laptop friendly

4B Gemma 3 for edge. Multimodal.

License
Gemma Terms of U · OK
Context
128K
Gemma 4 E2B (Effective 2B)
2B params · Google
▸ phone-tier Gemma 4

Smallest Gemma 4. Designed for phones and Raspberry-Pi-class hardware.

License
Gemma Terms of U · OK
Context
128K
MedGemma 27B
27B params · Google
▸ medical-domain fine-tune of Gemma 3 27B

Medical-specialist Gemma fine-tune. Trained on de-identified medical literature and imaging. Research use under HAI-DEF terms.

License
Gemma Terms of U
Context
128K
ColPali v1.3
3B params · ColPali team (Illuin Technology)
▸ Visual-document retrieval for multi-page PDFs with charts, tables, and scans where OCR pipelines fail

3B-parameter visual document retriever built on PaliGemma-3B using a ColBERT-style late-interaction objective. Encodes a PDF page as a grid of patch embeddings, skipping OCR/layout parsing entirely. Sets SOTA on the ViDo

License
mit · OK
Context
—
PaliGemma 2 3B
3B params · Google
▸ task-specific VLM fine-tuning base

PaliGemma 2 — Gemma 2 base + SigLIP vision encoder. Designed for fine-tuning on specific vision tasks.

License
Gemma License · OK
Context
8K
PaliGemma 2 10B
10B params · Google
▸ VLM fine-tuning at 24GB VRAM

Mid-tier PaliGemma 2 fine-tuning base. Better baseline for complex vision tasks.

License
Gemma License · OK
Context
8K
FAM · QWEN

Qwen-based

5 models
Qwen2-VL 2B Instruct
2B params · Alibaba
▸ Lightweight document and chart understanding on a consumer GPU

Qwen2-VL 2B Instruct is Alibaba's compact vision-language model with native dynamic-resolution image handling and multimodal RoPE (M-RoPE) for video and multi-image inputs. It supports 32K-token context and is Apache-2.0

License
apache-2.0 · OK
Context
32K
Qwen 2.5-VL 7B
7B params · Alibaba
▸ consumer-tier OCR + image Q&A

Consumer-tier Qwen 2.5 VL. 7B + vision. Fits 8GB cards; the smallest practical multimodal Qwen.

License
Apache 2.0 · OK
Context
32K
Qwen 2-VL 7B
7B params · Alibaba
▸ consumer-tier multimodal — pre-2.5-VL baseline

Qwen 2 vision-language predecessor to Qwen 2.5-VL. Apache 2.0 with strong document Q&A.

License
Apache 2.0 · OK
Context
32K
Qwen 2.5-VL 72B
72B params · Alibaba
▸ frontier-tier multimodal serving

Qwen 2.5 vision-language flagship at 72B. Strong on document understanding + multi-image queries. Apache 2.0.

License
Apache 2.0 · OK
Context
32K
Qwen 2.5-VL 3B
3B params · Alibaba
▸ edge-tier multimodal

Smallest Qwen 2.5-VL. Edge-deployable VLM with strong document Q&A.

License
Qwen License · OK
Context
32K
FAM · LLAMA

Llama-based

5 models
Llama 3.2 11B Vision Instruct
11B params · Meta
▸ consumer-tier vision-language Llama

First-party multimodal Llama. Accepts images alongside text for VQA, document understanding, and chart reading. Runs on 12GB+ VRAM.

License
Llama 3.2 Commun · OK
Context
128K
Llama 4 Maverick
400B params · Meta
judged 8.7/10
▸ frontier-tier multimodal serving on multi-machine clusters

Meta's high-end Llama 4 sibling — 128 experts MoE built for performance over efficiency. Multilingual strength is its standout. Effectively a server-tier model; consumer hardware can't load it without aggressive quantiza

License
Llama 4 Communit · OK
Context
977K
Llama 3.2 90B Vision Instruct
90B params · Meta
▸ datacenter vision-language Llama at 70B-class

The 90B vision Llama. Best-in-class first-party multimodal open weight at the time of release. Workstation-class only.

License
Llama 3.2 Commun · OK
Context
128K
Llama 3.2 90B Vision
90B params · Meta
▸ datacenter-tier multimodal serving

Llama 3.2 multimodal at 90B. Datacenter-tier predecessor to Llama 4 Maverick. Strong visual reasoning.

License
Llama Community · OK
Context
128K
Llama 3.2 11B Vision
11B params · Meta
▸ consumer-tier multimodal — Llama-ecosystem migration path for vision workflows

Llama 3.2 multimodal at 11B. Consumer-tier multimodal predecessor to Llama 4 Scout.

License
Llama Community · OK
Context
128K
FAM · PHI

Phi-based

2 models
Phi-3.5 Vision
4.2B params · Microsoft
▸ edge-tier vision-language Phi

Multimodal Phi 3.5. Document and chart understanding at edge size. MIT licensed.

License
MIT · OK
Context
128K
Phi-4 Multimodal
14B params · Microsoft
▸ 16GB-consumer multimodal Q&A

Multimodal variant of Phi-4 14B. Vision + text. Smaller than Llama 4 Scout but covers most image-Q&A workflows; right-sized for 16GB consumer cards.

License
MIT · OK
Context
128K
FAM · MINICPM

MiniCPM-based

2 models
MiniCPM-V 2.6 8B
8B params · OpenBMB
▸ consumer multimodal document Q&A

Multimodal MiniCPM at 8B. Vision + text; strong on document Q&A for the size class.

License
MIT · OK
Context
32K
MiniCPM-V 3 8B
8B params · OpenBMB
▸ consumer multimodal document Q&A

MiniCPM-V successor. Multimodal at 8B with stronger document Q&A than 2.6.

License
MIT · OK
Context
32K
FAM · MISTRAL

Mistral-based

1 model
Pixtral 12B
12B params · Mistral AI
▸ consumer-tier vision-language Mistral

Mistral's multimodal entry. 12B parameters, vision + text, Apache 2.0. Good document and chart understanding.

License
Apache 2.0 · OK
Context
128K
FAM · STEPFUN

StepFun-based

1 model
GOT-OCR 2.0
580M params · StepFun AI
▸ Self-hosted OCR for printed formulas, tables, and dense scientific PDFs to LaTeX/Markdown

580M-parameter end-to-end OCR-2.0 model: a vision encoder paired with a Qwen-based decoder, trained specifically for general OCR including math formulas (LaTeX out), tables (Markdown/HTML out), sheet music, geometric sha

License
apache-2.0 · OK
Context
—
FAM · JANUS

Janus-based

1 model
Janus-Pro 7B
7B params · DeepSeek AI
▸ consumer multimodal with image-generation

DeepSeek's multimodal 7B. Decoupled visual encoding for understanding vs generation — different from typical VLM design.

License
DeepSeek License · OK
Context
4K
FAM · GLM

GLM-based

1 model
GLM-4V 9B
13.9B params · Zhipu AI
▸ Chinese document VLM

GLM-4 with vision encoder. Strong on Chinese document Q&A; restricted commercial license.

License
GLM License
Context
8K
COVERAGE

Building an image pipeline?

Pair a diffusion model with a vision encoder for image → text → image loops. For document pipelines, combine the OCR rows (Florence-2, GOT-OCR 2.0) with an embedding model from /embeddings.
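
The image → text → image loop can be sketched without committing to a specific runtime. In this minimal sketch, `describe` and `generate` are hypothetical callables you would back with a captioning row and a diffusion row from this catalog; the function name, parameters, and the string stand-ins in the usage section are all illustrative, and no real model API is called.

```python
from typing import Callable, List, Tuple, TypeVar

Image = TypeVar("Image")


def image_text_image_loop(
    seed_image: Image,
    describe: Callable[[Image], str],   # hypothetical: image -> caption
    generate: Callable[[str], Image],   # hypothetical: caption -> image
    rounds: int = 2,
) -> List[Tuple[str, Image]]:
    """Alternate captioning and generation, recording each (caption, image) step."""
    history: List[Tuple[str, Image]] = []
    image = seed_image
    for _ in range(rounds):
        caption = describe(image)   # understanding side (vision model)
        image = generate(caption)   # generation side (diffusion model)
        history.append((caption, image))
    return history


if __name__ == "__main__":
    # Stand-in stubs so the loop runs without any model weights.
    fake_describe = lambda img: f"caption of {img}"
    fake_generate = lambda text: f"image from [{text}]"
    for caption, image in image_text_image_loop("seed", fake_describe, fake_generate):
        print(caption, "->", image)
```

Keeping the two sides behind plain callables means either model can be swapped for a row with a different license or VRAM footprint without touching the loop.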