Diffusion models that generate images, and vision encoders that understand them — both in one hub. FLUX.1 dev/schnell, SDXL-Turbo, Stable Diffusion 3.5 medium for generation; SigLIP, ColPali, Florence-2, GOT-OCR 2.0 for understanding.
Local image AI comes in two shapes: diffusion models that produce pixels from text, and vision encoders that read meaning out of pixels. Most catalog work focuses on the LLM side; this hub fills that gap for both image generation and vision encoders.
On the generation side: FLUX.1 [dev] (12B, non-commercial; the most-liked model on Hugging Face), FLUX.1 [schnell] (12B, Apache-2.0, 4-step distilled; the production pick), SDXL-Turbo (2.6B, 1-step real-time, non-commercial), Stable Diffusion 3.5 medium (2.5B, Community license; commercial use OK up to $1M revenue).
On the encoder side: SigLIP-SO400M (428M, the vision tower behind PaliGemma/Idefics/most open VLMs), ColPali v1.3 (3B, visual-document RAG SOTA), Florence-2-large (770M unified caption/OCR/grounding/segmentation), GOT-OCR 2.0 (580M end-to-end formula and table OCR).
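The SigLIP row further down notes that it is trained with a sigmoid rather than a softmax contrastive loss. As a rough illustration of what that means, here is a minimal pure-Python sketch: every image-text pair is scored independently as a binary match/no-match problem instead of being normalized across the batch. The function names, the toy inputs, and the default `temperature` and `bias` values are assumptions chosen for illustration, not the trained model's parameters.

```python
import math


def softplus(x: float) -> float:
    """Numerically stable log(1 + exp(x))."""
    return max(x, 0.0) + math.log1p(math.exp(-abs(x)))


def sigmoid_contrastive_loss(image_embs, text_embs, temperature=10.0, bias=-10.0):
    """Pairwise sigmoid loss over a batch of (image, text) embeddings.

    Assumes both lists hold L2-normalized vectors and that image_embs[i]
    is the true match for text_embs[i]. temperature and bias are
    illustrative defaults (hypothetical), not values from any checkpoint.
    """
    n = len(image_embs)
    total = 0.0
    for i, img in enumerate(image_embs):
        for j, txt in enumerate(text_embs):
            logit = temperature * sum(a * b for a, b in zip(img, txt)) + bias
            label = 1.0 if i == j else -1.0  # +1 for true pairs, -1 otherwise
            # -log(sigmoid(label * logit)) == softplus(-label * logit)
            total += softplus(-label * logit)
    return total / n
```

Because each pair contributes its own term, the loss needs no batch-wide softmax normalization; correctly aligned pairs should yield a lower loss than shuffled ones.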
License posture matters here. FLUX.1 [dev] is non-commercial; FLUX.1 [schnell] is Apache-2.0 and open to commercial use. SDXL-Turbo blocks commercial use; SD 3.5 medium allows it only up to the $1M-revenue cap. Each row calls out the exact license trap.
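The license rules for the four generation models can be encoded as a small lookup so a deployment script can refuse a model that does not fit the use case. This is only a sketch: the `LicenseRule` class and `commercial_use_allowed` function are hypothetical names, the entries are transcribed from this page rather than from the license texts, and treating the $1M figure as an inclusive annual-revenue threshold is an assumption.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class LicenseRule:
    license_name: str
    commercial: bool
    revenue_cap_usd: Optional[int] = None  # None means no cap is stated


# Transcribed from the summary on this page; check the real license text
# before relying on any of these entries.
LICENSES = {
    "FLUX.1 [dev]": LicenseRule("non-commercial", commercial=False),
    "FLUX.1 [schnell]": LicenseRule("Apache-2.0", commercial=True),
    "SDXL-Turbo": LicenseRule("non-commercial", commercial=False),
    "Stable Diffusion 3.5 medium": LicenseRule(
        "Community license", commercial=True, revenue_cap_usd=1_000_000
    ),
}


def commercial_use_allowed(model: str, annual_revenue_usd: int) -> bool:
    """Return True if the page's summary permits commercial use of the model."""
    rule = LICENSES[model]
    if not rule.commercial:
        return False
    # Assumption: the cap is inclusive and measured on annual revenue.
    if rule.revenue_cap_usd is not None and annual_revenue_usd > rule.revenue_cap_usd:
        return False
    return True
```

For example, `commercial_use_allowed("Stable Diffusion 3.5 medium", 250_000)` passes under these assumptions, while the same call for FLUX.1 [dev] fails at any revenue level.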
12B-parameter rectified-flow transformer for text-to-image, guidance-distilled from the FLUX.1 [pro] teacher. Currently the most-liked model on Hugging Face (~12.9k likes). Sets a new open-weights bar for prompt adherenc
12B rectified-flow transformer, timestep-distilled to 1-4 sampling steps, released under Apache-2.0. Same architecture as FLUX.1 [dev] but trades a bit of fidelity for ~10x faster sampling and an unrestricted commercial
428M-parameter Shape-Optimized vision-language encoder trained with the sigmoid (not softmax) contrastive loss on WebLI. Hits ~83% zero-shot ImageNet-1k top-1 at 384px — the strongest open contrastive encoder in its size
2.6B SDXL backbone trained with Adversarial Diffusion Distillation (ADD), producing photorealistic 512px images in a single forward pass. Designed for real-time, interactive text-to-image.
770M-parameter unified vision foundation model with a DaViT image encoder and BART-style seq2seq decoder. One model, one set of weights — handles captioning, OCR, region/grounding, segmentation, and dense detection via t
2.5B MMDiT-X with improved Querying Key Normalization and dual attention blocks at lower resolutions. Trained for 0.25-2MP output. Positioned as the mid-tier of the SD3.5 family, designed to run on consumer hardware whil
SmolVLM-Instruct is Hugging Face's compact vision-language model built on the Idefics3 architecture, pairing SmolLM2-1.7B-Instruct with a SigLIP-SO400M vision encoder. It is engineered for minimum VRAM footprint and ship
InternVL 2.5 flagship. Approaches frontier proprietary VLMs on document and OCR tasks.
AI2's fully-open VLM. Trained on PixMo dataset; pointing capability for UI grounding.
Tiny vision-language model. ~1.9B; designed for edge / embedded multimodal use cases. Apache 2.0.
LLaVA-OneVision unified single-image / multi-image / video VLM on Qwen 2 base.
InternVL 2.5 mid-tier — Shanghai AI Lab vision-language model with strong document and chart understanding.
LLaVA 1.6 on Mistral 7B base. Apache 2.0 vision-language with strong OCR.
Molmo flagship. Apache 2.0 VLM rivaling proprietary models on UI pointing and visual reasoning.
Google's flagship dense Gemma 4. Beats some 400B-class proprietary models on benchmarks. Targets the 24GB single-GPU sweet spot.
MoE variant of Gemma 4. Faster per-token than the 31B dense at similar quality on most tasks.
Pre-Gemma-4 flagship. Multimodal (4B+ variants), 128K context, 140 languages. Strong daily driver on 24GB cards.
Edge-class Gemma 4. The 'Effective 4B' branding signals it punches above its parameter count via training-data quality.
12B Gemma 3. Fits on 12GB consumer cards. Multimodal.
Trendyol LLM Asure 12B is a Gemma 3 based multimodal instruct model for Turkish and English business workflows. The public Ollama build used in local testing is the alibayram GGUF distribution.
4B Gemma 3 for edge. Multimodal.
Smallest Gemma 4. Designed for phones and Raspberry-Pi-class hardware.
Medical-specialist Gemma fine-tune. Trained on de-identified medical literature and imaging. Research use under HAI-DEF terms.
3B-parameter visual document retriever built on PaliGemma-3B using a ColBERT-style late-interaction objective. Encodes a PDF page as a grid of patch embeddings, skipping OCR/layout parsing entirely. Sets SOTA on the ViDo
PaliGemma 2 — Gemma 2 base + SigLIP vision encoder. Designed for fine-tuning on specific vision tasks.
Mid-tier PaliGemma 2 fine-tuning base. Better baseline for complex vision tasks.
Qwen2-VL 2B Instruct is Alibaba's compact vision-language model with native dynamic-resolution image handling and multimodal RoPE (M-RoPE) for video and multi-image inputs. It supports 32K-token context and is Apache-2.0
Consumer-tier Qwen 2.5 VL. 7B + vision. Fits 8GB cards; the smallest practical multimodal Qwen.
Qwen 2 vision-language predecessor to Qwen 2.5-VL. Apache 2.0 with strong document Q&A.
Qwen 2.5 vision-language flagship at 72B. Strong on document understanding + multi-image queries. Apache 2.0.
Smallest Qwen 2.5-VL. Edge-deployable VLM with strong document Q&A.
First-party multimodal Llama. Accepts images alongside text for VQA, document understanding, and chart reading. Runs on 12GB+ VRAM.
Meta's high-end Llama 4 sibling — 128 experts MoE built for performance over efficiency. Multilingual strength is its standout. Effectively a server-tier model; consumer hardware can't load it without aggressive quantiza
The 90B vision Llama. Best-in-class first-party multimodal open weight at the time of release. Workstation-class only.
Llama 3.2 multimodal at 90B. Datacenter-tier predecessor to Llama 4 Maverick. Strong visual reasoning.
Llama 3.2 multimodal at 11B. Consumer-tier multimodal predecessor to Llama 4 Scout.
Multimodal Phi 3.5. Document and chart understanding at edge size. MIT licensed.
Multimodal variant of Phi-4 14B. Vision + text. Smaller than Llama 4 Scout but covers most image-Q&A workflows; right-sized for 16GB consumer cards.
Multimodal MiniCPM at 8B. Vision + text; strong on document Q&A for the size class.
MiniCPM-V successor. Multimodal at 8B with stronger document Q&A than 2.6.
Pair a diffusion model with a vision encoder for image → text → image loops. The OCR rows (Florence-2, GOT-OCR 2.0) can likewise be combined with an embedding model from /embeddings.
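For the retrieval half of such a pipeline, the ColPali row describes ColBERT-style late interaction over a grid of patch embeddings. A minimal sketch of that scoring rule follows: each query-token vector is matched to its best page-patch vector, and the per-token maxima are summed. The function names and the plain-list vector representation are illustrative assumptions; a real setup would take its embeddings from the model and use a tensor library.

```python
def dot(u, v):
    """Dot product of two equal-length vectors."""
    return sum(a * b for a, b in zip(u, v))


def late_interaction_score(query_vecs, page_vecs):
    """MaxSim score: for every query vector, take the best-matching
    page patch vector, then sum those maxima over the query."""
    return sum(max(dot(q, p) for p in page_vecs) for q in query_vecs)


def rank_pages(query_vecs, pages):
    """Rank pages (a dict of name -> list of patch vectors) by score,
    highest first. Returns a list of (name, score) tuples."""
    scored = [
        (name, late_interaction_score(query_vecs, vecs))
        for name, vecs in pages.items()
    ]
    return sorted(scored, key=lambda item: item[1], reverse=True)
```

Scoring works directly on patch embeddings, so no OCR or layout parsing step has to run before a page can be ranked, which is the property the ColPali row highlights.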