Gemma 3 12B
12B Gemma 3. Fits on 12GB consumer cards. Multimodal.
Positioning
The 12B sweet-spot member of the Gemma 3 family. It offers native multimodal support in a footprint that fits 12 GB of VRAM at Q4, making it one of the few 12B-class models with serious multimodal capability in 2025 (Pixtral 12B is the main alternative).
Strengths
- Native vision + text at 12B, unusually small for a multimodal model.
- 128K context with decent recall.
- Distilled writing quality from Gemini-class data.
Limitations
- The Gemma license's usage restrictions apply.
- Multimodal quality lags Pixtral 12B on dense visual reasoning.
- No thinking mode.
Real-world performance on RTX 4090
- Q4_K_M (7.6 GB): 80–95 tok/s decode, TTFT ~95 ms
- Q5_K_M (8.9 GB): 70–84 tok/s
- Q8_0 (13.4 GB): 50–62 tok/s
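To put the decode rates above in practical terms, here is a rough sketch that turns a tokens-per-second figure and a TTFT into an approximate wall-clock time for a reply. The function name `generation_seconds` and the 500-token reply length are illustrative assumptions; the model is simply TTFT plus steady decode and ignores any slowdown at long context.

```python
def generation_seconds(n_tokens: int, tok_per_s: float, ttft_ms: float = 0.0) -> float:
    """Approximate wall-clock time: time to first token plus steady decode."""
    if tok_per_s <= 0:
        raise ValueError("tok_per_s must be positive")
    return ttft_ms / 1000.0 + n_tokens / tok_per_s


# Q4_K_M on an RTX 4090, using the 80-95 tok/s and ~95 ms TTFT figures listed above.
# The 500-token reply length is an arbitrary example, not a measured workload.
slow = generation_seconds(500, 80, ttft_ms=95)
fast = generation_seconds(500, 95, ttft_ms=95)
print(f"500 tokens at Q4_K_M: roughly {fast:.1f} to {slow:.1f} s")
```

The same function can be fed the Q5_K_M or Q8_0 rates to see how much latency each step up in quantization costs for a given reply length.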
Should you run this locally?
Yes, for 12 GB VRAM owners who want multimodal in a single model. No, for text-only workloads where Qwen 2.5 14B is stronger, or vision-priority work where Pixtral 12B is the better pick.
How it compares
- vs Pixtral 12B → Pixtral wins on dense visual reasoning; Gemma 3 12B has stronger general text quality.
- vs Qwen 2.5 14B → Qwen wins on text capability; Gemma 3 12B has multimodal as a bonus.
- vs Mistral Nemo 12B → a close call on text quality. Gemma adds multimodal; Nemo has a cleaner license.
- vs Gemma 3 27B → 27B is meaningfully stronger; 12B is the constrained-VRAM pick.
Run this yourself
ollama pull gemma3:12b-it-q4_K_M
ollama run gemma3:12b-it-q4_K_M
Settings: Q4_K_M GGUF, 16384 ctx, full GPU offload on RTX 3060 12 GB / 4060 Ti 16 GB / 4090
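Once the model is pulled, the same settings can be applied programmatically. The sketch below assumes a local Ollama server on its default port (11434) and uses its `/api/generate` endpoint, passing the 16384-token context via `options.num_ctx` and an optional image as base64 in the `images` field. The helper names `build_request` and `generate` are hypothetical, and this is a minimal illustration rather than a complete client.

```python
import base64
import json
import urllib.request

# Assumption: Ollama is running locally on its default port.
OLLAMA_URL = "http://localhost:11434/api/generate"


def build_request(prompt, image_bytes=None,
                  model="gemma3:12b-it-q4_K_M", num_ctx=16384):
    """Build a non-streaming generate payload with the settings listed above."""
    payload = {
        "model": model,
        "prompt": prompt,
        "stream": False,
        "options": {"num_ctx": num_ctx},
    }
    if image_bytes is not None:
        # Images are sent as a list of base64-encoded strings.
        payload["images"] = [base64.b64encode(image_bytes).decode("ascii")]
    return payload


def generate(payload, url=OLLAMA_URL):
    """POST the payload to the local server and return the generated text."""
    req = urllib.request.Request(
        url,
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req) as resp:
        return json.loads(resp.read())["response"]


# Example usage (requires a running server; photo.jpg is a placeholder path):
# with open("photo.jpg", "rb") as f:
#     print(generate(build_request("Describe this image.", f.read())))
```

Keeping payload construction separate from the network call makes it easy to inspect or log exactly which context size and model tag are being requested.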
Why this rating
7.9/10 — solid 12B-class entry with native multimodal. Sits between Qwen 2.5 14B and Mistral Nemo 12B in capability; the multimodal feature is the differentiator. Loses points on license restrictiveness.
Family & lineage
How this model relates to others in its lineage. Family members share architecture and training-data roots; parent / children edges record direct distillation or fine-tune relationships.
Strengths
- Multimodal at 12B
- 128K context
Weaknesses
- Gemma license restrictions
Quantization variants
Each quantization trades model quality for file size and VRAM. Q4_K_M is the most popular starting point.
| Quantization | File size | VRAM required |
|---|---|---|
| Q4_K_M | 7.3 GB | 10 GB |
| Q8_0 | 13.0 GB | 16 GB |
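As a small illustration of how to read the table above, the following sketch picks the largest quantization whose listed VRAM requirement fits a given card. The numbers are copied from the table; the function name `pick_quant` and the "largest file that fits" selection rule are assumptions made for the example, not a recommendation from the source.

```python
# (name, file size in GB, VRAM required in GB), copied from the table above.
QUANTS = [
    ("Q4_K_M", 7.3, 10.0),
    ("Q8_0", 13.0, 16.0),
]


def pick_quant(vram_gb):
    """Return the largest listed quantization that fits in vram_gb, or None."""
    fitting = [q for q in QUANTS if q[2] <= vram_gb]
    if not fitting:
        return None
    # Larger file size means less aggressive quantization.
    return max(fitting, key=lambda q: q[1])[0]


for card_vram in (8, 12, 24):
    print(card_vram, "GB ->", pick_quant(card_vram))
```

The listed VRAM requirement is higher than the file size because room is also needed for the KV cache and runtime buffers, so longer contexts can push real usage above these figures.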
Get the model
Ollama
One-line install
ollama run gemma3:12b
HuggingFace
Original weights
Source repository with the unquantized weights; you need to quantize them yourself.
Benchmarks
Real measurements on real hardware. Numbers ship with the runner version, quant, and date.
| Hardware | Provenance | Quant | Ctx | Tokens / sec | TTFT | Date |
|---|---|---|---|---|---|---|
| NVIDIA GeForce RTX 3080 16GB (Mobile) | EditorialM | Q4_K_M | 4K | 43.3 tok/s | 767 ms | Jun 2, 26 |
Hardware that runs this
Cards with enough VRAM for at least one quantization of Gemma 3 12B.
Models worth comparing
Same parameter band, plus what's one tier above and below — so you can decide what actually fits your hardware.
Frequently asked
What's the minimum VRAM to run Gemma 3 12B?
Can I use Gemma 3 12B commercially?
What's the context length of Gemma 3 12B?
How do I install Gemma 3 12B with Ollama?
Does Gemma 3 12B support images?
Source: huggingface.co/google/gemma-3-12b-it
Reviewed by RunLocalAI Editorial. See our editorial policy for how we research and verify model claims.