RUNLOCALAIv38
->Will it run?Best GPUCompareTroubleshootStartLearnPulseModelsHardwareToolsBench
Run check
RUNLOCALAI

Independently operated catalog for local-AI hardware and software. Hand-written verdicts. Source-cited claims. Reproducible commands when we have them.

OP·Fredoline Eruo
DIR
  • Models
  • Hardware
  • Tools
  • Benchmarks
TOOLS
  • Will it run?
  • Compare hardware
  • Cost vs cloud
  • Choose my GPU
  • Prompting kits
  • Quick answers
REF
  • All buyer guides
  • Learn local AI
  • Methodology
  • Glossary
  • Errors KB
  • Trust
EDITOR
  • About
  • Author
  • How we make money
  • Editorial policy
  • Contact
LEGAL
  • Privacy
  • Terms
  • Sitemap
MAIL · MONTHLY DIGEST
Get monthly local AI changes
Monthly recap. No spam.
DISCLOSURE

Some links on this site are affiliate links (Amazon Associates and other first-class retailers). When you buy through them, we earn a small commission at no extra cost to you. Affiliate links do not influence our verdicts — there are cards we rate highly that we don't have affiliate relationships with, and cards that sell well that we refuse to recommend. Read more →

© 2026 runlocalai.coIndependently operated
RUNLOCALAI · v38
BLK · SMALL LANGUAGE MODELS≤ 3.5B params · edge · phone · laptop

Small Language Models

Every open-weight model under 3.5B parameters that's worth running locally. Phone-class on-device assistants, laptop-class instruction followers, edge VLMs — all in one catalog with real VRAM math and license clarity.

Models curated
62
Vendors
37
Commercial OK
57/62
Benchmarked
0/62

Small language models are how local AI actually lands on phones, dev laptops without dGPUs, and battery-constrained devices. A 1B or 3B model that runs at 30 tok/s on a Pixel 9 beats a 70B model that takes 8 minutes to load on the same machine — operator math, not vendor math.

The frontier has shifted: Qwen 3 0.6B clears 20M HuggingFace downloads, Gemma 3 270M went viral on r/LocalLLaMA, and HuggingFace's own SmolLM2 family hits production speech assistants. This hub catalogs the ≤3.5B tier with the same depth of editorial we give the flagship 70B models — license trap, context ceiling, missing GGUF, all called out.

Each row links to /models/[slug] for the full operator notes: tested hardware, recommended quantization, and the prompting kit that worked in our test runs. Where we've benchmarked, the score sits next to the model. Where we haven't, the row says so — no fake numbers.

FAM · OTHER

Other / from-scratch

27 models
SigLIP SO400M (patch14-384)
428M params · Google
▸ Zero-shot image classification, image-text retrieval, or as a frozen vision tower for a custom VLM on edge/consumer hardware

428M-parameter Shape-Optimized vision-language encoder trained with the sigmoid (not softmax) contrastive loss on WebLI. Hits ~83% zero-shot ImageNet-1k top-1 at 384px — the strongest open contrastive encoder in its size

License
apache-2.0 · OK
Context
—
SmolLM2 135M Instruct
135M params · Hugging Face
▸ In-browser WebGPU chat demo and edge autocomplete

SmolLM2-135M-Instruct is the smallest instruction-tuned model in Hugging Face's SmolLM2 family, a 135M-parameter Llama-architecture model trained for on-device deployment. It uses an 8K context window and is shipped with

License
apache-2.0 · OK
Context
8K
Florence-2 Large
770M params · Microsoft
▸ Edge-tier unified caption / OCR / detection / grounding pipeline where you want one model instead of four

770M-parameter unified vision foundation model with a DaViT image encoder and BART-style seq2seq decoder. One model, one set of weights — handles captioning, OCR, region/grounding, segmentation, and dense detection via t

License
mit · OK
Context
—
TinyLlama 1.1B Chat v0.3 AWQ
1.1B params · Zhang Peiyuan
judged 9.1/10
▸ Low-resource English chatbot prototyping

TinyLlama 1.1B Chat v0.3 is a 1.1B-parameter chat model quantized to 4-bit AWQ by TheBloke. It uses the ChatML prompt format and fits comfortably in very low VRAM environments. Context is capped at 2048 tokens.

License
apache-2.0 · OK
Context
2K
TinyLlama 1.1B Chat v0.3 GPTQ
1.1B params · TheBloke
judged 9.0/10
▸ Lightweight English chatbot on severely VRAM-constrained hardware

GPTQ-quantized build of TinyLlama 1.1B Chat v0.3, trained on SlimPajama, StarCoder, and OpenAssistant data. Runs in roughly 0.8 GB VRAM thanks to 4-bit quantization. English only, 2048-token context window.

License
apache-2.0 · OK
Context
2K
Kanarya 2B
2B params · asafaya

Turkish-from-scratch language model trained by Ali Safaya (Koç University researcher). Named after the kanarya (Turkish for 'canary'). Trained on 250+ GB of Turkish text including Wikipedia, news, and books.

License
Apache-2.0 · OK
Context
2K
SmolLM2 360M Instruct
360M params · Hugging Face
▸ On-device assistant for Raspberry Pi-class hardware

SmolLM2-360M-Instruct is the middle tier of the SmolLM2 instruct family, a 360M-parameter Llama-architecture model with an 8K context. It is shipped with ONNX and Transformers.js artifacts and aimed at on-device assistan

License
apache-2.0 · OK
Context
8K
VBART Large (Turkish Summarization)
400M params · VNGRS

Turkish BART-style sequence-to-sequence model fine-tuned specifically for summarization. Not a chat model — purpose-built for input-document → Turkish-summary pipelines.

License
Apache-2.0 · OK
Context
1K
Kanarya 750M
750M params · asafaya

Smaller Kanarya variant — 750M parameters. Runs on CPU or 4GB GPU comfortably. Useful for low-resource Turkish text classification, embeddings, or completion tasks where latency matters more than quality.

License
Apache-2.0 · OK
Context
2K
Turkish GPT-2 Large
700M params · ytu-ce-cosmos

GPT-2 Large architecture trained from scratch on Turkish. Reference baseline for measuring how much modern instruction-tuned models actually improve on the GPT-2 era.

License
MIT · OK
Context
1K
SmolVLM Instruct
2.25B params · Hugging Face
▸ Lowest-VRAM open VLM for image captioning on consumer GPU

SmolVLM-Instruct is Hugging Face's compact vision-language model built on the Idefics3 architecture, pairing SmolLM2-1.7B-Instruct with a SigLIP-SO400M vision encoder. It is engineered for minimum VRAM footprint and ship

License
apache-2.0 · OK
Context
8K
gpt2-base-french
124M params · ClassCat
judged 9.0/10
▸ French text completion research or academic prototyping

A 124M-parameter GPT-2 base model trained on French Wikipedia (wiki40b/fr) and a CC-100/fr subset, with a 50,000-token BPE vocabulary. It generates French text but has no instruction-following capability. Context window

License
cc-by-sa-4.0
Context
1K
GPT-2 Spanish
124M params · DeepESP
judged 9.3/10
▸ Spanish text completion or language modeling research

GPT-2 Spanish is a 124M-parameter model trained from scratch on 11.5GB of Spanish text (Wikipedia, books) with a custom Spanish BPE tokenizer. It generates Spanish prose but is not instruction-tuned — it completes text,

License
mit · OK
Context
1K
GPT-2 Spanish Medium
355M params · DeepESP
judged 9.2/10
▸ Lightweight Spanish text generation or fine-tuning base

A 355M-parameter GPT-2 Medium trained from scratch on 11.5 GB of Spanish text (Wikipedia and books), with a BPE tokenizer built specifically for Spanish. Context window is 1024 tokens. Training data was not filtered for

License
mit · OK
Context
1K
OpenELM 3B Instruct
3B params · Apple
▸ Academic study of layer-wise scaled transformer architectures

OpenELM-3B-Instruct is Apple's 3-billion-parameter instruct model using a layer-wise scaled transformer with varying FFN multipliers and KV-head counts across 36 layers. It is released under the Apple Sample Code License

License
apple-amlr
Context
2K
mGPT 1.3B Uzbek
1.3B params · ai-forever
judged 9.4/10
▸ Uzbek-language text generation and corpus experimentation

A 1.3B-parameter GPT-2-style model fine-tuned on Uzbek text for 50,000 steps on a single A100. Covers Uzbek, Russian, and English generation. It is a base model only — no instruction tuning.

License
mit · OK
Context
2K
Dostoevsky Doesn't Write It GPT2
175M params · evilfreelancer
judged 9.1/10
▸ Dostoevsky-style Russian prose generation for creative or novelty projects

A 175M-parameter GPT-2 model fine-tuned on Dostoevsky's digitized works, built on top of ruGPT3-small. Trained for five epochs, it generates Russian prose in a 19th-century literary register. Context tops out at 1024 tok

License
mit · OK
Context
1K
mGPT 1.3B Mongol
1.3B params · ai-forever
judged 9.3/10
▸ Mongolian-language text generation and basic NLP prototyping

A 1.3B-parameter GPT model fine-tuned from ai-forever's mGPT base for Mongolian, with English and Russian also supported. Fine-tuning ran for 50,000 steps on Mongolian-specific data, yielding a validation perplexity of 4

License
mit · OK
Context
2K
Vikhr Qwen 2.5 0.5B Instruct
500M params · Vikhrmodels
judged 9.0/10
▸ Russian-language mobile chatbot or on-device assistant

A 0.5B Russian-language instruct model fine-tuned from Qwen2.5-0.5B on the GrandMaster-PRO-MAX dataset (~150k instructions). Vikhrmodels claims 4x efficiency over the base Qwen2.5-0.5B, and the quantized footprint lands

License
apache-2.0 · OK
Context
4K
SmolLM 3 3B
3B params · HuggingFace
▸ edge-tier reasoning

HuggingFace's small-model line at 3B. Apache 2.0. Designed for edge / educational deployments.

License
Apache 2.0 · OK
Context
32K
SmolLM 2 360M Instruct
360M params · Hugging Face
▸ phone / Pi-class chat

Hugging Face's SmolLM 2 at 360M. Apache 2.0; targets phone / Pi-class deployments.

License
Apache 2.0 · OK
Context
8K
StarCoder 2 3B
3B params · BigCode
▸ edge-tier code completion

BigCode's StarCoder 2 at 3B. Trained on The Stack v2 with 600+ programming languages.

License
BigCode OpenRAIL · OK
Context
16K
Whisper Large v3
1.55B params · OpenAI
▸ open speech-to-text baseline

OpenAI's flagship open speech-to-text model. 99 languages, MIT license. The de-facto open ASR baseline.

License
MIT · OK
Context
—
SmolLM 2 1.7B Instruct
1.7B params · Hugging Face
▸ edge-tier Apache 2.0 baseline

SmolLM 2 flagship. Open data + open weights at the edge tier.

License
Apache 2.0 · OK
Context
8K
BGE M3
570M params · BAAI
▸ multilingual RAG embeddings

BAAI's multilingual embedding flagship. Dense + sparse + ColBERT-style multi-vector. The de-facto open multilingual embedding pick.

License
MIT · OK
Context
8K
Whisper Large v3 Turbo
810M params · OpenAI
▸ real-time / batch transcription

Distilled Whisper Large v3. ~8x faster decode at near-equivalent accuracy on most languages.

License
MIT · OK
Context
—
Moondream 2
1.9B params · vikhyat (community)
▸ edge / phone-tier vision Q&A

Tiny vision-language model. ~1.9B; designed for edge / embedded multimodal use cases. Apache 2.0.

License
Apache 2.0 · OK
Context
2K
FAM · QWEN

Qwen-based

11 models
Qwen 3 0.6B
600M params · Alibaba
▸ Sub-1B on-device chat and tool-calling agent on phones

Qwen3-0.6B is the smallest dense model in Alibaba's Qwen3 generation, supporting a 40K-token context and dual-mode operation that toggles between explicit reasoning ('think') and fast direct response. It is post-trained

License
apache-2.0 · OK
Context
40K
Qwen 3 1.7B
1.7B params · Alibaba
▸ Edge laptop assistant with reasoning that fits in 2GB VRAM

Qwen3-1.7B is the mid-tier dense model in Qwen3, sharing the same hybrid thinking architecture and 40K context as the 0.6B but with ~3x the parameters for noticeably stronger reasoning, math, and code. It targets the con

License
apache-2.0 · OK
Context
40K
Qwen2-VL 2B Instruct
2B params · Alibaba
▸ Lightweight document and chart understanding on a consumer GPU

Qwen2-VL 2B Instruct is Alibaba's compact vision-language model with native dynamic-resolution image handling and multimodal RoPE (M-RoPE) for video and multi-image inputs. It supports 32K-token context and is Apache-2.0

License
apache-2.0 · OK
Context
32K
Qwen 3.5 2B Turkish SFT
2B params · Tuguberk

Qwen 3.5 2B base with supervised fine-tuning on Turkish instruction-following data. Recent community fine-tune (early 2026) that bridges Qwen 3.5's strong multilingual base with Turkish-specific chat capability.

License
Apache-2.0 · OK
Context
32K
Qwen3 0.6B Hindi Instruct v1 GGUF
600M params · pankajpandey-dev
judged 9.1/10
▸ Simple Hindi instruction following on CPU-only devices

A 0.6B Qwen3 model fine-tuned on English-to-Hindi instruction pairs and quantized to GGUF. Fits in 370MB and runs on CPU-only hardware. Trained on 2,000 instruction pairs, so scope is narrow.

License
apache-2.0 · OK
Context
2K
Qwen 2.5 0.5B Instruct
500M params · Alibaba
▸ phone-tier Qwen baseline

Smallest Qwen 2.5. Apache 2.0; phone / Pi-class deployment target.

License
Apache 2.0 · OK
Context
32K
Qwen 2.5 1.5B Instruct
1.5B params · Alibaba
▸ edge-tier Apache 2.0 chat

Compact Qwen 2.5. The 1.5B Apache-2.0 baseline.

License
Apache 2.0 · OK
Context
32K
Qwen 2.5 3B Instruct
3B params · Alibaba
▸ edge-tier Qwen 2.5 chat

Mid-edge Qwen 2.5. Note: 3B variant uses Qwen License (not Apache 2.0).

License
Qwen License · OK
Context
32K
Qwen 2.5-VL 3B
3B params · Alibaba
▸ edge-tier multimodal

Smallest Qwen 2.5-VL. Edge-deployable VLM with strong document Q&A.

License
Qwen License · OK
Context
32K
Qwen 2.5 Coder 3B
3B params · Alibaba
▸ Apple Silicon laptop coding autocomplete

Compact Qwen 2.5 Coder. Sweet spot for laptop autocomplete and small refactor agents.

License
Apache 2.0 · OK
Context
32K
Qwen 2.5 Coder 1.5B
1.5B params · Alibaba
▸ IDE autocomplete on integrated GPUs

Smallest Qwen 2.5 Coder. Targets edge / autocomplete on integrated GPUs and Apple Silicon laptops.

License
Apache 2.0 · OK
Context
32K
FAM · GEMMA

Gemma-based

6 models
Gemma 3 270M
270M params · Google
▸ Fine-tuning base for sub-1W on-device classifiers and routers

Gemma 3 270M is the smallest member of Google's Gemma 3 family, a 270-million-parameter text-only model designed for on-device deployment and task-specific fine-tuning. It carries the Gemma license and Google's acceptabl

License
gemma · OK
Context
32K
Gemma 2 2B Instruct
2B params · Google
▸ Consumer-GPU local chat with strong safety defaults

Gemma 2 2B Instruct is Google's instruction-tuned 2B model from the Gemma 2 generation, trained with knowledge distillation from larger Gemma models. It targets the consumer-GPU and high-end mobile tier with an 8K contex

License
gemma · OK
Context
8K
Gemma 4 E2B (Effective 2B)
2B params · Google
▸ phone-tier Gemma 4

Smallest Gemma 4. Designed for phones and Raspberry-Pi-class hardware.

License
Gemma Terms of U · OK
Context
128K
Gemma 3 1B
1B params · Google
▸ phone-tier Gemma — smallest practical Gemma 3

Smallest text-only Gemma 3 for phones and IoT.

License
Gemma Terms of U · OK
Context
32K
ColPali v1.3
3B params · ColPali team (Illuin Technology)
▸ Visual-document retrieval for multi-page PDFs with charts, tables, and scans where OCR pipelines fail

3B-parameter visual document retriever built on PaliGemma-3B using a ColBERT-style late-interaction objective. Encodes a PDF page as a grid of patch embeddings, skipping OCR/layout parsing entirely. Sets SOTA on the ViDo

License
mit · OK
Context
—
PaliGemma 2 3B
3B params · Google
▸ task-specific VLM fine-tuning base

PaliGemma 2 — Gemma 2 base + SigLIP vision encoder. Designed for fine-tuning on specific vision tasks.

License
Gemma License · OK
Context
8K
FAM · LLAMA

Llama-based

5 models
Llama 3.2 3B Instruct
3B params · Meta
judged 7.4/10
▸ battery-powered laptop chat tier

Lightweight 3B for edge and laptop deployment. Runs comfortably on 8GB VRAM at 30+ tok/s on Apple Silicon.

License
Llama 3.2 Commun · OK
Context
128K
TinyLlama 1.1B Chat v1.0
1.1B params · TinyLlama
▸ Reproducible SLM research baseline and legacy llama.cpp deployments

TinyLlama-1.1B-Chat-v1.0 is a 1.1B Llama-2-architecture model pretrained on 3 trillion tokens and chat-tuned on UltraChat and UltraFeedback. It was one of the earliest production-grade SLMs and remains a popular base mod

License
apache-2.0 · OK
Context
2K
Llama 3.2 1B Instruct
1B params · Meta
judged 6.0/10
▸ edge / phone-tier chat — smallest practical Llama

True edge-tier Llama. Runs on a phone or Raspberry Pi. Useful for classification, simple summarization, and on-device agents.

License
Llama 3.2 Commun · OK
Context
128K
Salamandra 2B
2.25B params · BSC-LT
judged 9.4/10
▸ Fine-tuning base for Spanish or Catalan/Galician/Basque NLP tasks

Salamandra 2B is a base-only transformer trained from scratch by Barcelona Supercomputing Center on 12.875 trillion tokens across 35 European languages and code. At 2.25B parameters and an 8192-token context window, it i

License
apache-2.0 · OK
Context
8K
Salamandra 2B Instruct
2B params · BSC
judged 9.4/10
▸ Spanish and European multilingual instruction following on low-VRAM hardware

Salamandra 2B Instruct is a transformer model from BSC pretrained from scratch on 12.875 trillion tokens across 35 European languages and code. The instruct variant is fine-tuned for instruction following using the ChatM

License
apache-2.0 · OK
Context
8K
FAM · MISTRAL

Mistral-based

2 models
Kumru 2B
2.4B params · VNGRS
▸ fast Turkish edge chat

Kumru 2B is a compact Turkish text-generation model from VNGRS. The Hugging Face config reports a Mistral-family architecture with an 8K context window, and the public Ollama build makes it a practical edge-speed Turkish

License
Apache-2.0 · OK
Context
8K
Ministral 3B Instruct
3B params · Mistral AI
▸ edge-tier long-context — research only

Mistral edge model at 3B. Designed for on-device inference with extended 128k context. Research license only.

License
Mistral Research
Context
128K
FAM · EXAONE

EXAONE-based

2 models
EXAONE 3.5 2.4B Instruct
2.4B params · LG AI Research
judged 9.3/10
▸ Korean/English bilingual research prototyping on edge hardware

EXAONE 3.5 2.4B Instruct is LG AI Research's bilingual English/Korean model built for low-resource devices. It handles up to 32K context tokens and shows competitive results on Korean-specific benchmarks like KoMT-Bench

License
other (EXAONE AI
Context
32K
EXAONE 3.5 2.4B
2.4B params · LG AI Research
▸ edge-tier Korean chat

LG AI's edge-tier EXAONE. Strong Korean / English. Research-only license.

License
EXAONE AI Model
Context
32K
FAM · GRANITE

Granite-based

2 models
Granite 3.1 2B Instruct
2B params · IBM
▸ Enterprise RAG and tool-use with vendor indemnification

Granite 3.1 2B Instruct is IBM's 2B-parameter dense instruct model with a 128K context window, post-trained for enterprise tasks including RAG, function calling, and structured citation generation. It is part of IBM's Ap

License
apache-2.0 · OK
Context
128K
Granite 3.0 2B Instruct
2B params · IBM
▸ edge-tier IBM Granite

IBM Granite at 2B. Apache 2.0 enterprise-friendly small model with safety tuning.

License
Apache 2.0 · OK
Context
4K
FAM · STEPFUN

StepFun-based

1 model
GOT-OCR 2.0
580M params · StepFun AI
▸ Self-hosted OCR for printed formulas, tables, and dense scientific PDFs to LaTeX/Markdown

580M-parameter end-to-end OCR-2.0 model: a vision encoder paired with a Qwen-based decoder, trained specifically for general OCR including math formulas (LaTeX out), tables (Markdown/HTML out), sheet music, geometric sha

License
apache-2.0 · OK
Context
—
FAM · OLMO

OLMo-based

1 model
OLMo 2 1B Instruct
1B params · AllenAI
▸ Research baseline where full training reproducibility is required

OLMo 2 1B Instruct is AllenAI's 1-billion-parameter instruct model from the April 2025 OLMo 2 release, post-trained with RLVR on math. It is fully open: weights, training data, training code, and intermediate checkpoints

License
apache-2.0 · OK
Context
4K
FAM · FALCON

Falcon-based

1 model
Falcon 3 3B Instruct
3B params · TII
▸ Multilingual European chat where Falcon license is acceptable

Falcon 3 3B Instruct is TII's 3-billion-parameter instruct model from the Falcon 3 family, supporting English, French, Spanish, and Portuguese with a 32K context window. It uses the Llama architecture for runtime compati

License
falcon-llm-licen · OK
Context
32K
FAM · DEEPSEEK

DeepSeek-based

1 model
DeepSeek R1 Distill Qwen 1.5B
1.5B params · DeepSeek AI
▸ edge-tier reasoning

Smallest R1 distill. Surprisingly capable reasoning at 1.5B for its size class; right pick when you need reasoning AND edge deployment.

License
Apache 2.0 · OK
Context
128K
FAM · HERMES

hermes

1 model
Hermes 3 Llama 3.2 3B
3B params · Nous Research
▸ edge-tier instruction following

Nous Research's Hermes 3 fine-tune of Llama 3.2 3B. Strong general-instruction following at the 3B tier.

License
Llama Community · OK
Context
128K
FAM · DOLPHIN

dolphin

1 model
Dolphin 3.0 Llama 3.2 3B
3B params · Cognitive Computations
▸ creative / less-restricted generation at edge tier

Eric Hartford's Dolphin fine-tune at 3B. Less-censored than the base Llama; popular for unconstrained-generation use cases.

License
Llama Community · OK
Context
128K
FAM · RWKV

rwkv

1 model
RWKV 7 'Goose' 1.5B
1.5B params · RWKV community
▸ long-context edge inference where memory matters more than quality

RWKV 7 'Goose' at 1.5B. Linear-time inference architecture (constant memory regardless of context). Apache 2.0.

License
Apache 2.0 · OK
Context
1024K
COVERAGE

Don't see a model you'd run on your phone?

The discovery pipeline sweeps HuggingFace for new sub-3.5B releases weekly. If you know one we missed, point us to the HF repo via contact.