Local image models

Local image AI comes in two shapes: diffusion models that produce pixels from text, and vision encoders that extract meaning from pixels. Most catalog work focuses on LLMs; this hub fills that gap for both image generation and vision encoders.

On the generation side: FLUX.1 [dev] (12B, non-commercial; the most-liked model on Hugging Face), FLUX.1 [schnell] (12B, Apache-2.0, 4-step distilled; the production pick), SDXL-Turbo (2.6B, 1-step real-time, non-commercial), and Stable Diffusion 3.5 medium (2.5B, Community license; commercial use OK up to $1M revenue).

On the encoder side: SigLIP-SO400M (428M, the vision tower behind PaliGemma, Idefics, and most open VLMs), ColPali v1.3 (3B, state of the art for visual-document RAG), Florence-2-large (770M, unified captioning, OCR, grounding, and segmentation), and GOT-OCR 2.0 (580M, end-to-end OCR for formulas and tables).

License posture matters here because models of the same size can carry very different terms. FLUX.1 [dev] is non-commercial; FLUX.1 [schnell] is Apache-2.0 and permits commercial use. SDXL-Turbo blocks commercial use; SD 3.5 medium allows it only up to the $1M revenue cap. Each row calls out the exact license trap.
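The license rules above can be written as a small filter. This is a minimal sketch, not a legal check: the `ImageModel` class, the `CATALOG` list, and the `commercial_picks` helper are hypothetical names introduced for illustration, and the sketch assumes the $1M cap is measured against annual revenue and that revenue exactly at the cap still qualifies. Always read the actual license text before shipping.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass(frozen=True)
class ImageModel:
    """Hypothetical catalog row; fields mirror the facts listed on this page."""
    name: str
    params_b: float                         # parameter count in billions
    commercial_ok: bool                     # does the license permit commercial use at all?
    revenue_cap_usd: Optional[int] = None   # None means no revenue cap is stated


# Generation-side models as described above.
CATALOG = [
    ImageModel("FLUX.1 [dev]", 12.0, commercial_ok=False),
    ImageModel("FLUX.1 [schnell]", 12.0, commercial_ok=True),
    ImageModel("SDXL-Turbo", 2.6, commercial_ok=False),
    ImageModel("Stable Diffusion 3.5 medium", 2.5, commercial_ok=True,
               revenue_cap_usd=1_000_000),
]


def usable_commercially(model: ImageModel, annual_revenue_usd: int) -> bool:
    """Return True if the model's license, as summarized here, allows this use."""
    if not model.commercial_ok:
        return False
    # Assumption: revenue exactly at the cap is still allowed.
    if model.revenue_cap_usd is not None and annual_revenue_usd > model.revenue_cap_usd:
        return False
    return True


def commercial_picks(annual_revenue_usd: int) -> List[str]:
    """Names of catalog models usable commercially at the given revenue."""
    return [m.name for m in CATALOG if usable_commercially(m, annual_revenue_usd)]


if __name__ == "__main__":
    print(commercial_picks(500_000))
    print(commercial_picks(5_000_000))
```

Under these assumptions, a business below the cap can choose between FLUX.1 [schnell] and SD 3.5 medium, while one above it is left with FLUX.1 [schnell] only.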


Building an image pipeline?