Every open-weight model under 3.5B parameters that's worth running locally. Phone-class on-device assistants, laptop-class instruction followers, edge VLMs — all in one catalog with real VRAM math and license clarity.
Small language models are how local AI actually lands on phones, developer laptops without discrete GPUs, and battery-constrained devices. A 1B or 3B model that runs at 30 tok/s on a Pixel 9 beats a 70B model that takes 8 minutes to load on the same machine. That is operator math, not vendor math: what matters is what runs on the hardware you actually have.
The frontier has shifted: Qwen 3 0.6B clears 20M Hugging Face downloads, Gemma 3 270M went viral on r/LocalLLaMA, and Hugging Face's own SmolLM2 family has reached production speech assistants. This hub catalogs the ≤3.5B tier with the same editorial depth we give the flagship 70B models: license traps, context ceilings, and missing GGUF builds are all called out.
Each row links to /models/[slug] for the full operator notes: tested hardware, recommended quantization, and the prompting kit that worked in our test runs. Where we've benchmarked, the score sits next to the model. Where we haven't, the row says so — no fake numbers.
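For a rough sense of where VRAM figures like the ones in the rows below come from, here is a minimal sketch of the usual back-of-envelope estimate: weight memory is parameter count times bits per weight, plus an allowance for KV cache and runtime buffers. The function names and the 0.25 GB overhead are assumptions chosen for illustration, not the formula this catalog uses. Real usage varies with context length, runtime, and quantization format.

```python
def weight_gb(n_params: float, bits_per_weight: float) -> float:
    """Memory for the weights alone, in GB (1 GB = 1e9 bytes)."""
    return n_params * bits_per_weight / 8 / 1e9


def estimate_vram_gb(n_params: float, bits_per_weight: float,
                     overhead_gb: float = 0.25) -> float:
    """Weights plus a flat allowance for KV cache and runtime buffers.

    overhead_gb is a hypothetical placeholder; measure it on your own setup.
    """
    return weight_gb(n_params, bits_per_weight) + overhead_gb


if __name__ == "__main__":
    # Illustrative sizes only, not measurements of any specific model.
    for name, params, bits in [
        ("1.1B @ 4-bit", 1.1e9, 4),
        ("1.1B @ 16-bit", 1.1e9, 16),
        ("3B @ 4-bit", 3.0e9, 4),
    ]:
        print(f"{name}: ~{estimate_vram_gb(params, bits):.2f} GB")
```

Under these assumptions a 1.1B model at 4 bits needs about 0.55 GB for weights alone, so a sub-1 GB total for a 4-bit 1B-class build is plausible once overhead is added.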
428M-parameter Shape-Optimized vision-language encoder trained with the sigmoid (not softmax) contrastive loss on WebLI. Hits ~83% zero-shot ImageNet-1k top-1 at 384px — the strongest open contrastive encoder in its size
SmolLM2-135M-Instruct is the smallest instruction-tuned model in Hugging Face's SmolLM2 family, a 135M-parameter Llama-architecture model trained for on-device deployment. It uses an 8K context window and is shipped with
770M-parameter unified vision foundation model with a DaViT image encoder and BART-style seq2seq decoder. One model, one set of weights — handles captioning, OCR, region/grounding, segmentation, and dense detection via t
TinyLlama 1.1B Chat v0.3 is a 1.1B-parameter chat model quantized to 4-bit AWQ by TheBloke. It uses the ChatML prompt format and fits comfortably in very low VRAM environments. Context is capped at 2048 tokens.
GPTQ-quantized build of TinyLlama 1.1B Chat v0.3, trained on SlimPajama, StarCoder, and OpenAssistant data. Runs in roughly 0.8 GB VRAM thanks to 4-bit quantization. English only, 2048-token context window.
Turkish-from-scratch language model trained by Ali Safaya (Koç University researcher). Named after the kanarya (Turkish for 'canary'). Trained on 250+ GB of Turkish text including Wikipedia, news, and books.
SmolLM2-360M-Instruct is the middle tier of the SmolLM2 instruct family, a 360M-parameter Llama-architecture model with an 8K context. It is shipped with ONNX and Transformers.js artifacts and aimed at on-device assistan
Turkish BART-style sequence-to-sequence model fine-tuned specifically for summarization. Not a chat model — purpose-built for input-document → Turkish-summary pipelines.
Smaller Kanarya variant at 750M parameters. Runs comfortably on CPU or a 4 GB GPU. Useful for low-resource Turkish text classification, embeddings, or completion tasks where latency matters more than quality.
GPT-2 Large architecture trained from scratch on Turkish. Reference baseline for measuring how much modern instruction-tuned models actually improve on the GPT-2 era.
SmolVLM-Instruct is Hugging Face's compact vision-language model built on the Idefics3 architecture, pairing SmolLM2-1.7B-Instruct with a SigLIP-SO400M vision encoder. It is engineered for minimum VRAM footprint and ship
A 124M-parameter GPT-2 base model trained on French Wikipedia (wiki40b/fr) and a CC-100/fr subset, with a 50,000-token BPE vocabulary. It generates French text but has no instruction-following capability. Context window
GPT-2 Spanish is a 124M-parameter model trained from scratch on 11.5GB of Spanish text (Wikipedia, books) with a custom Spanish BPE tokenizer. It generates Spanish prose but is not instruction-tuned — it completes text,
A 355M-parameter GPT-2 Medium trained from scratch on 11.5 GB of Spanish text (Wikipedia and books), with a BPE tokenizer built specifically for Spanish. Context window is 1024 tokens. Training data was not filtered for
OpenELM-3B-Instruct is Apple's 3-billion-parameter instruct model using a layer-wise scaled transformer with varying FFN multipliers and KV-head counts across 36 layers. It is released under the Apple Sample Code License
A 1.3B-parameter GPT-2-style model fine-tuned on Uzbek text for 50,000 steps on a single A100. Covers Uzbek, Russian, and English generation. It is a base model only — no instruction tuning.
A 175M-parameter GPT-2 model fine-tuned on Dostoevsky's digitized works, built on top of ruGPT3-small. Trained for five epochs, it generates Russian prose in a 19th-century literary register. Context tops out at 1024 tok
A 1.3B-parameter GPT model fine-tuned from ai-forever's mGPT base for Mongolian, with English and Russian also supported. Fine-tuning ran for 50,000 steps on Mongolian-specific data, yielding a validation perplexity of 4
A 0.5B Russian-language instruct model fine-tuned from Qwen2.5-0.5B on the GrandMaster-PRO-MAX dataset (~150k instructions). Vikhrmodels claims 4x efficiency over the base Qwen2.5-0.5B, and the quantized footprint lands
Hugging Face's small-model line at 3B. Apache 2.0. Designed for edge / educational deployments.
Hugging Face's SmolLM 2 at 360M. Apache 2.0; targets phone / Pi-class deployments.
BigCode's StarCoder 2 at 3B. Trained on The Stack v2 with 600+ programming languages.
OpenAI's flagship open speech-to-text model. 99 languages, MIT license. The de-facto open ASR baseline.
SmolLM 2 flagship. Open data + open weights at the edge tier.
BAAI's multilingual embedding flagship. Dense + sparse + ColBERT-style multi-vector. The de-facto open multilingual embedding pick.
Distilled Whisper Large v3. ~8x faster decode at near-equivalent accuracy on most languages.
Tiny vision-language model. ~1.9B; designed for edge / embedded multimodal use cases. Apache 2.0.
Qwen3-0.6B is the smallest dense model in Alibaba's Qwen3 generation, supporting a 40K-token context and dual-mode operation that toggles between explicit reasoning ('think') and fast direct response. It is post-trained
Qwen3-1.7B is the mid-tier dense model in Qwen3, sharing the same hybrid thinking architecture and 40K context as the 0.6B but with ~3x the parameters for noticeably stronger reasoning, math, and code. It targets the con
Qwen2-VL 2B Instruct is Alibaba's compact vision-language model with native dynamic-resolution image handling and multimodal RoPE (M-RoPE) for video and multi-image inputs. It supports 32K-token context and is Apache-2.0
Qwen 3.5 2B base with supervised fine-tuning on Turkish instruction-following data. Recent community fine-tune (early 2026) that bridges Qwen 3.5's strong multilingual base with Turkish-specific chat capability.
A 0.6B Qwen3 model fine-tuned on English-to-Hindi instruction pairs and quantized to GGUF. Fits in 370 MB and runs on CPU-only hardware. Trained on only 2,000 instruction pairs, so its scope is narrow.
Smallest Qwen 2.5. Apache 2.0; phone / Pi-class deployment target.
Compact Qwen 2.5. The 1.5B Apache-2.0 baseline.
Mid-edge Qwen 2.5. Note: the 3B variant uses the Qwen Research License, not Apache 2.0.
Smallest Qwen 2.5-VL. Edge-deployable VLM with strong document Q&A.
Compact Qwen 2.5 Coder. Sweet spot for laptop autocomplete and small refactor agents.
Smallest Qwen 2.5 Coder. Targets edge / autocomplete on integrated GPUs and Apple Silicon laptops.
Gemma 3 270M is the smallest member of Google's Gemma 3 family, a 270-million-parameter text-only model designed for on-device deployment and task-specific fine-tuning. It carries the Gemma license and Google's acceptabl
Gemma 2 2B Instruct is Google's instruction-tuned 2B model from the Gemma 2 generation, trained with knowledge distillation from larger Gemma models. It targets the consumer-GPU and high-end mobile tier with an 8K contex
Smallest Gemma 4. Designed for phones and Raspberry-Pi-class hardware.
Smallest text-only Gemma 3 for phones and IoT.
3B-parameter visual document retriever built on PaliGemma-3B using a ColBERT-style late-interaction objective. Encodes a PDF page as a grid of patch embeddings, skipping OCR/layout parsing entirely. Sets SOTA on the ViDo
PaliGemma 2 — Gemma 2 base + SigLIP vision encoder. Designed for fine-tuning on specific vision tasks.
Lightweight 3B for edge and laptop deployment. Runs comfortably in 8 GB of VRAM and reaches 30+ tok/s on Apple Silicon.
TinyLlama-1.1B-Chat-v1.0 is a 1.1B Llama-2-architecture model pretrained on 3 trillion tokens and chat-tuned on UltraChat and UltraFeedback. It was one of the earliest production-grade SLMs and remains a popular base mod
True edge-tier Llama. Runs on a phone or Raspberry Pi. Useful for classification, simple summarization, and on-device agents.
Salamandra 2B is a base-only transformer trained from scratch by Barcelona Supercomputing Center on 12.875 trillion tokens across 35 European languages and code. At 2.25B parameters and an 8192-token context window, it i
Salamandra 2B Instruct is a transformer model from BSC pretrained from scratch on 12.875 trillion tokens across 35 European languages and code. The instruct variant is fine-tuned for instruction following using the ChatM
Kumru 2B is a compact Turkish text-generation model from VNGRS. The Hugging Face config reports a Mistral-family architecture with an 8K context window, and the public Ollama build makes it a practical edge-speed Turkish
Mistral edge model at 3B. Designed for on-device inference with an extended 128K context. Research license only.
EXAONE 3.5 2.4B Instruct is LG AI Research's bilingual English/Korean model built for low-resource devices. It handles up to 32K context tokens and shows competitive results on Korean-specific benchmarks like KoMT-Bench
LG AI's edge-tier EXAONE. Strong Korean / English. Research-only license.
Granite 3.1 2B Instruct is IBM's 2B-parameter dense instruct model with a 128K context window, post-trained for enterprise tasks including RAG, function calling, and structured citation generation. It is part of IBM's Ap
IBM Granite at 2B. Apache 2.0 enterprise-friendly small model with safety tuning.
The discovery pipeline sweeps Hugging Face weekly for new sub-3.5B releases. If you know of one we missed, point us to the Hugging Face repo via contact.
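A weekly sweep like the one described above could be sketched as a filter over release records. The record fields (`repo_id`, `params`, `created`), the `sweep` helper, and the sample repo names are all hypothetical. This sketch does not call the Hugging Face API and is not the pipeline this hub runs.

```python
from dataclasses import dataclass
from datetime import date, timedelta

MAX_PARAMS = 3.5e9  # size ceiling for this catalog (assumed inclusive)


@dataclass(frozen=True)
class Release:
    repo_id: str    # hypothetical field: repo identifier
    params: float   # hypothetical field: parameter count
    created: date   # hypothetical field: publication date


def sweep(releases, known, today, window_days=7):
    """Return unseen releases at or under MAX_PARAMS from the last window_days."""
    cutoff = today - timedelta(days=window_days)
    return [
        r for r in releases
        if r.params <= MAX_PARAMS
        and r.created >= cutoff
        and r.repo_id not in known
    ]


if __name__ == "__main__":
    today = date(2026, 1, 15)
    sample = [  # made-up records for illustration
        Release("example-org/tiny-0.6b", 0.6e9, date(2026, 1, 12)),
        Release("example-org/big-7b", 7.0e9, date(2026, 1, 13)),
        Release("example-org/old-1b", 1.0e9, date(2025, 11, 1)),
    ]
    for r in sweep(sample, known=set(), today=today):
        print(r.repo_id)
```

Passing the set of already-cataloged repo IDs as `known` keeps the sweep from resurfacing rows that are already listed.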