Gemma 3 4B
4B Gemma 3 for edge. Multimodal.
Positioning
Gemma 3 4B is the 4B-parameter Gemma 3 model with multimodal (image + text) input. It is the strongest small-model pick when image input matters and VRAM is constrained: it fits in under 4 GB at Q4.
Strengths
- Native multimodal at 4B — no other model in this size class does this credibly.
- Conversational quality materially better than Phi 3.5 Mini for general chat.
- 128K context even at this size.
Limitations
- The Gemma license carries use restrictions; review the terms before commercial deployment.
- Math and structured tasks weaker than Phi 3.5 Mini.
- Knowledge breadth narrow — small-model limitations are real.
Real-world performance on RTX 4090
- Q4_K_M (2.7 GB): 130–150 tok/s decode, TTFT under 50 ms
- Q5_K_M (3.2 GB): 115–135 tok/s
- Q8_0 (4.8 GB): 95–115 tok/s
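To make the decode rates above concrete, the sketch below converts them into wall-clock time for a reply. The 500-token reply length is an illustrative assumption, and time to first token is left at zero by default because the list above only reports it for Q4_K_M.

```python
# Decode-rate ranges (tok/s) copied from the RTX 4090 list above.
DECODE_RANGES = {
    "Q4_K_M": (130, 150),
    "Q5_K_M": (115, 135),
    "Q8_0": (95, 115),
}


def reply_seconds(n_tokens, tok_per_s, ttft_ms=0.0):
    """Time to stream n_tokens: time to first token plus decode time."""
    return ttft_ms / 1000.0 + n_tokens / tok_per_s


def reply_time_range(quant, n_tokens=500):
    """Return (best_case, worst_case) seconds for a reply of n_tokens."""
    slow, fast = DECODE_RANGES[quant]
    return reply_seconds(n_tokens, fast), reply_seconds(n_tokens, slow)


for quant in DECODE_RANGES:
    best, worst = reply_time_range(quant)
    print(f"{quant}: {best:.1f} to {worst:.1f} s for a 500-token reply")
```

At these rates every quantization streams a long reply in a few seconds, so on a card like this the choice between quants is mostly about quality and memory rather than speed.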
Should you run this locally?
Yes, if you are targeting edge devices that need multimodal input, or you own a 4–6 GB GPU and want chat plus vision. No, if your workload is math or structured output (pick Phi 3.5 Mini), or if a chat-only model of 8B or larger fits your hardware better.
How it compares
- vs Phi-3.5 Mini (3.8B) → Gemma 3 4B wins on chat + multimodal; Phi wins on math + structured output.
- vs Llama 3.2 3B → similar text capability; Gemma adds multimodal.
- vs Gemma 3 1B → 4B is meaningfully smarter; 1B is for very tight constraints.
Run this yourself
ollama pull gemma3:4b-it-q4_K_M
ollama run gemma3:4b-it-q4_K_M
Settings: Q4_K_M GGUF, 8192 ctx, llama.cpp/CUDA, RTX 4090
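Once the model is pulled, image input can also be driven from a script. The following is a minimal sketch that assumes Ollama is serving on its default local port (11434) and uses its /api/generate endpoint with a base64-encoded image. The function names and the default prompt are illustrative, not part of any library.

```python
import base64
import json
import urllib.request

# Assumption: Ollama is running locally on its default port.
OLLAMA_URL = "http://localhost:11434/api/generate"


def build_vision_request(prompt, image_bytes, model="gemma3:4b-it-q4_K_M"):
    """Build the JSON payload for one non-streaming image + text prompt."""
    return {
        "model": model,
        "prompt": prompt,
        # Ollama expects images as base64-encoded strings.
        "images": [base64.b64encode(image_bytes).decode("ascii")],
        "stream": False,
    }


def describe_image(path, prompt="Describe this image in one sentence."):
    """Send the image at `path` to the local model and return its reply text."""
    with open(path, "rb") as f:
        payload = build_vision_request(prompt, f.read())
    req = urllib.request.Request(
        OLLAMA_URL,
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req) as resp:
        return json.loads(resp.read())["response"]
```

Calling describe_image("photo.jpg"), where photo.jpg stands in for any local image file, should return the model's description as a string, provided the server is up and the model tag has been pulled.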
Why this rating
7.5/10 — best 4B-class general model when you want multimodal at edge size. Loses to Phi-3.5 Mini on math + structured tasks but beats it on chat naturalness.
How to run it
Gemma 3 4B is one of Google's smallest Gemma 3 models (only the 1B is smaller): 4B parameters, designed for the "runs on almost anything" tier. Run it at Q4_K_M via Ollama (ollama pull gemma3:4b) or llama.cpp with -ngl 999 -fa -c 4096. The Q4_K_M file is ~2.5 GB on disk.
Hardware: the weights alone are ~2.5 GB, so a 4 GB GPU holds the whole model, while 2 GB cards, integrated graphics, and phone GPUs need partial offload or CPU fallback. CPU-only works fine at 10-20 tok/s on a modern laptop. Recommended: any device with 4+ GB RAM. Throughput: ~100-150+ tok/s on an RTX 4090 at Q4_K_M, and ~15-25 tok/s on a MacBook Air. The Gemma 3 architecture is well supported across runtimes.
Use cases: the 4B is the tiny-but-capable tier, good for text classification, extraction, simple Q&A, grammar correction, and lightweight chat. It is not suited to complex reasoning, long-form generation, nuanced conversation, or coding; those tasks exceed what 4B parameters can deliver.
Context: 128K advertised; about 8K is the practical setting at Q4 on constrained devices. For a step up, consider Gemma 3 12B, or Granite 3 MoE 3B-Active for a MoE with a similar active size and better quality. License: Gemma license (verify the commercial terms and use restrictions before deploying). Gemma 3 4B is the default recommendation for edge, mobile, and CPU-only deployments when quality requirements are modest and speed and deployability matter most.
Hardware guidance
VRAM math: 4B dense at Q4_K_M ≈ 2.5 GB of weights, plus ~0.5 GB of KV cache at 4K context, for ~3 GB total. That sets the practical floor at about 3 GB of free memory, whether on CPU or GPU.
- CPU-only on low-end hardware: ~5-10 tok/s at Q4_K_M.
- Raspberry Pi 5 8GB: Q4 at 10-18 tok/s.
- Phone: 5-10 tok/s via MLC LLM or llama.cpp on Android.
- CPU-only on a modern laptop (16 GB RAM): 10-20 tok/s.
- Intel integrated graphics: 8-15 tok/s via the llama.cpp Vulkan backend.
- MacBook Air M1 8GB: Q4 at 15-25 tok/s via MLX-LM.
- Discrete GPUs: any card with 4 GB of VRAM or more (GTX 1050 Ti class and up) fits the model comfortably.
- RTX 4090 24GB: heavily over-provisioned, but delivers 150+ tok/s.
This is one of the most deployable models available: it runs on phones, Raspberry Pi boards, and old laptops. The hardware floor is essentially any device made after 2018 with enough free memory for the ~3 GB footprint.
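The VRAM arithmetic above can be reproduced with a short estimate. This is a sketch under stated assumptions: the layer count, KV-head count, and head dimension below are assumed values for Gemma 3 4B that do not appear on this page, the cache is assumed to be stored in 16-bit precision, and every layer is treated as full attention, which ignores any savings a runtime gets from sliding-window layers.

```python
# Assumed architecture values (not stated on this page): 34 layers,
# 4 key/value heads, head dimension 256, 16-bit cache entries.
def kv_cache_bytes(ctx_tokens, n_layers=34, n_kv_heads=4, head_dim=256,
                   bytes_per_value=2):
    """Bytes needed to cache keys and values for ctx_tokens positions."""
    keys_and_values = 2  # one K tensor and one V tensor per layer
    return (keys_and_values * n_layers * n_kv_heads * head_dim
            * ctx_tokens * bytes_per_value)


def total_gb(weights_gb, ctx_tokens):
    """Weights plus KV cache, in decimal gigabytes (1 GB = 1e9 bytes)."""
    return weights_gb + kv_cache_bytes(ctx_tokens) / 1e9


for ctx in (4096, 8192):
    print(f"ctx={ctx}: ~{total_gb(2.5, ctx):.2f} GB with 2.5 GB of weights")
```

Under these assumptions the 4K case comes out close to the ~3 GB total quoted above, and the cache term doubles each time the context doubles, which is why long contexts are costly on small devices.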
What breaks first
1. 4B quality ceiling. The model's fundamental capability is limited by its 4B parameters. Complex reasoning, nuanced understanding, and deep knowledge recall hit the wall hard.
2. Hallucination rate. Small models hallucinate more. Gemma 3 4B will confidently produce incorrect facts more often than 7B+ models. Don't treat it as a knowledge base.
3. Context window overclaim. 128K is advertised, but usable context at 4B degrades past 4K tokens. The model loses track of earlier context. Keep prompts concise.
4. Weak multilingual quality. English is strongest; other languages may produce broken or English-mixed outputs.
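One way to act on the context-degradation point is to trim chat history to a token budget before each request. The sketch below is an illustration, not a tokenizer-accurate method: the 4-characters-per-token ratio is a rough assumption for English text, and both function names are hypothetical.

```python
def estimate_tokens(text, chars_per_token=4):
    """Rough token estimate; assumes about 4 characters per token."""
    return max(1, len(text) // chars_per_token)


def trim_to_budget(messages, budget_tokens=4096):
    """Keep the most recent messages whose estimated size fits the budget.

    Walks the history from newest to oldest and stops at the first
    message that would push the running total over the budget.
    """
    kept = []
    used = 0
    for msg in reversed(messages):
        cost = estimate_tokens(msg)
        if used + cost > budget_tokens:
            break
        kept.append(msg)
        used += cost
    return list(reversed(kept))


history = ["a" * 400, "b" * 400, "c" * 400]  # ~100 estimated tokens each
print(len(trim_to_budget(history, budget_tokens=250)))
```

Dropping the oldest turns first keeps the most recent instructions in view, which is usually what matters for a short chat. For exact budgets, swap the estimate for the runtime's own tokenizer.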
Runtime recommendation
Common beginner mistakes
- Mistake: Deploying Gemma 3 4B for tasks that need reasoning depth. Fix: 4B is a lightweight model. For complex tasks, use at least Gemma 4 26B MoE or Qwen 3 32B.
- Mistake: Running 8K context and expecting coherent outputs. Fix: 4B quality degrades past 4K context. Keep prompts short and focused.
- Mistake: Using Gemma 3 4B as a production knowledge base. Fix: 4B models hallucinate more. Pair with RAG on vetted documents for factual tasks.
- Mistake: Over-provisioning hardware ("I need an RTX 4090 for this"). Fix: It runs on a phone. You don't need a powerful GPU. CPU-only inference at 10-20 tok/s is perfectly usable for chat.
Family & lineage
How this model relates to others in its lineage. Family members share architecture and training-data roots; parent / children edges record direct distillation or fine-tune relationships.
Strengths
- Multimodal at 4B
- Edge-class
Weaknesses
- License restrictions
Quantization variants
Each quantization trades model quality for file size and VRAM. Q4_K_M is the most popular starting point.
| Quantization | File size | VRAM required |
|---|---|---|
| Q4_K_M | 2.5 GB | 4 GB |
| Q8_0 | 4.4 GB | 6 GB |
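The table can be turned into a simple selection rule: pick the largest quantization whose VRAM requirement fits the card. The sketch below hard-codes the two rows above, and the function name is hypothetical.

```python
# (name, file size in GB, VRAM required in GB), copied from the table above.
QUANTS = [
    ("Q4_K_M", 2.5, 4.0),
    ("Q8_0", 4.4, 6.0),
]


def pick_quant(vram_gb):
    """Return the largest quantization that fits in vram_gb, or None."""
    fitting = [q for q in QUANTS if q[2] <= vram_gb]
    if not fitting:
        return None
    # A larger file means less aggressive quantization, so prefer it.
    return max(fitting, key=lambda q: q[1])[0]


for vram in (2, 4, 8):
    print(f"{vram} GB VRAM -> {pick_quant(vram)}")
```

A None result means neither listed quantization fits entirely in VRAM, so the model would need partial offload or CPU inference.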
Get the model
Ollama
One-line install
ollama run gemma3:4b
HuggingFace
Original weights
Source repository — direct quantization required.
Benchmarks
Real measurements on real hardware. Numbers ship with the runner version, quant, and date.
| Hardware | Provenance | Quant | Ctx | Tokens / sec | TTFT | Date |
|---|---|---|---|---|---|---|
| NVIDIA GeForce RTX 3080 16GB (Mobile) | EditorialM | Q4_K_M | 4K | 97.7 tok/s | 743 ms | Jun 2, 26 |
Frequently asked
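What's the minimum VRAM to run Gemma 3 4B?
4 GB for Q4_K_M and 6 GB for Q8_0, per the quantization table above.
Can I use Gemma 3 4B commercially?
The model ships under the Gemma license. Verify its commercial terms and use restrictions before deploying.
What's the context length of Gemma 3 4B?
128K tokens advertised, though quality degrades past about 4K tokens at this size.
How do I install Gemma 3 4B with Ollama?
Run ollama run gemma3:4b, which pulls the model on first use.
Does Gemma 3 4B support images?
Yes. It is natively multimodal and accepts image input alongside text.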
Source: huggingface.co/google/gemma-3-4b-it
Reviewed by RunLocalAI Editorial. See our editorial policy for how we research and verify model claims.