gemma
4B parameters
Commercial OK
Multimodal
Reviewed June 2026

Gemma 3 4B

4B Gemma 3 for edge. Multimodal.

License: Gemma Terms of Use · Released Mar 12, 2025 · Context: 131,072 tokens

Our verdict

Fredoline Eruo | Verified Jun 12, 2026
7.5/10

Positioning

The 4B Gemma 3 with multimodal capability. Genuinely the best small-model pick when image input matters and VRAM is constrained — fits in under 4 GB at Q4.

Strengths

  • Native multimodal at 4B — no other model in this size class does this credibly.
  • Conversational quality materially better than Phi 3.5 Mini for general chat.
  • 128K context even at this size.

Limitations

  • The Gemma Terms of Use carry usage restrictions that permissive licenses such as Apache 2.0 do not.
  • Math and structured tasks weaker than Phi 3.5 Mini.
  • Knowledge breadth narrow — small-model limitations are real.

Real-world performance on RTX 4090

  • Q4_K_M (2.7 GB): 130–150 tok/s decode, TTFT under 50 ms
  • Q5_K_M (3.2 GB): 115–135 tok/s
  • Q8_0 (4.8 GB): 95–115 tok/s

Should you run this locally?

Yes, if you are deploying to edge devices that need image input, or if you own a 4–6 GB GPU and want chat plus vision. No, if your workload is math or structured output (pick Phi 3.5 Mini), or if you only need chat and can fit a model of 8B or larger.

How it compares

  • vs Phi-3.5 Mini (3.8B) → Gemma 3 4B wins on chat + multimodal; Phi wins on math + structured output.
  • vs Llama 3.2 3B → similar text capability; Gemma adds multimodal.
  • vs Gemma 3 1B → 4B is meaningfully smarter; 1B is for very tight constraints.

Run this yourself

ollama pull gemma3:4b-it-q4_K_M
ollama run gemma3:4b-it-q4_K_M
Settings: Q4_K_M GGUF, 8192 ctx, llama.cpp/CUDA, RTX 4090

Why this rating

7.5/10 — best 4B-class general model when you want multimodal at edge size. Loses to Phi-3.5 Mini on math + structured tasks but beats it on chat naturalness.

How to run it

Gemma 3 4B is one of Google's smaller dense models: 4B parameters, aimed at the "runs on almost anything" tier. The Gemma 3 architecture is well supported across runtimes.

Setup: run at Q4_K_M via Ollama (ollama pull gemma3:4b) or llama.cpp with -ngl 999 -fa -c 4096. The Q4_K_M file is ~2.5 GB on disk.

Memory: about 4 GB of VRAM holds the whole model at Q4_K_M; smaller GPUs (integrated graphics, phone GPUs) need partial offload or CPU fallback. CPU-only works fine on a modern laptop. Recommended: any device with 4+ GB RAM.

Throughput: ~100-150 tok/s on an RTX 4090 at Q4_K_M, ~15-25 tok/s on a MacBook Air, and 10-20 tok/s CPU-only on a modern laptop.

Good for: text classification, extraction, simple Q&A, grammar correction, lightweight chat. Not for: complex reasoning, long-form generation, nuanced conversation, or coding, where 4B is the quality ceiling.

Context: 131,072 tokens advertised; a practical setting at Q4 is 8K on any device.

Step-up options: Gemma 3 12B, or Granite 3 MoE 3B-Active for an MoE at similar active size with better quality.

License: Gemma Terms of Use (commercial use is permitted, but read the restrictions before deploying).

Gemma 3 4B is the default recommendation for edge, mobile, and CPU-only deployments when quality requirements are modest and speed and deployability matter most.

Hardware guidance

VRAM math: 4B dense at Q4_K_M ≈ 2.5 GB of weights, plus ~0.5 GB of KV cache at 4K context, for ~3 GB total at 4K.

  • Minimum: about 3 GB of free RAM for CPU-only inference at Q4_K_M (~5-10 tok/s), since the weights alone are ~2.5 GB.
  • Raspberry Pi 5 8GB: Q4 at 10-18 tok/s.
  • Phone: runs via MLC LLM or llama.cpp Android at 5-10 tok/s.
  • Discrete GPU: any card with 4 GB of VRAM or more (for example, a GTX 1050 Ti) fits comfortably.
  • RTX 4090 24GB: absurdly over-provisioned, but delivers 130-150 tok/s at Q4_K_M.
  • CPU-only on a modern laptop (16 GB RAM): 10-20 tok/s.
  • Intel integrated graphics: 8-15 tok/s via the llama.cpp Vulkan backend.
  • MacBook Air M1 8GB: Q4 at 15-25 tok/s via MLX-LM.

This is among the most deployable models available: it runs on phones, Raspberry Pi boards, and old laptops. The hardware floor is essentially "any device made after 2018" with enough free memory for the ~3 GB footprint.
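The VRAM math above can be sketched in a few lines of Python. This is a rough estimate, not a measurement. The function name is ours, and the linear scaling of KV cache with context length is an assumption; runtimes that exploit Gemma 3's sliding-window attention layers may use less at long contexts.

```python
def estimate_total_gb(ctx_tokens, weights_gb=2.5, kv_gb_at_4k=0.5):
    """Rough memory estimate: quantized weights plus KV cache.

    Assumption: KV cache grows linearly with context length, anchored
    to the ~0.5 GB at 4K figure quoted in the hardware guidance.
    """
    kv_gb = kv_gb_at_4k * (ctx_tokens / 4096)
    return weights_gb + kv_gb


# Compare the footprint at the two context sizes this review uses.
for ctx in (4096, 8192):
    print(f"{ctx} tokens: ~{estimate_total_gb(ctx):.1f} GB")
```

Under these assumptions, doubling the context from 4K to 8K adds roughly another 0.5 GB, which is why a 4 GB card is a comfortable floor at Q4_K_M.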

What breaks first

  1. 4B quality ceiling. The model's fundamental capability is limited by its 4B parameters. Complex reasoning, nuanced understanding, and deep knowledge recall hit the wall hard.
  2. Hallucination rate. Small models hallucinate more. Gemma 3 4B will confidently produce incorrect facts more often than 7B+ models. Don't treat it as a knowledge base.
  3. Context window overclaim. The advertised window is 131,072 tokens, but usable context at 4B degrades past 4K tokens. The model loses track of earlier context. Keep prompts concise.
  4. Not multilingual. Gemma 3 4B's multilingual quality is weak. English is strongest; other languages may produce broken or English-mixed outputs.

Runtime recommendation

Ollama for quick-start on any device. llama.cpp for CPU-only, Vulkan integrated GPU, or edge deployment. MLX-LM for Apple Silicon. MLC LLM for Android/iOS. Gemma 3 4B is universally supported — pick the runtime that matches your hardware. CPU-only via llama.cpp is surprisingly fast at this size.

Common beginner mistakes

  • Mistake: Deploying Gemma 3 4B for tasks that need reasoning depth. Fix: 4B is a lightweight model. For complex tasks, use at least Gemma 4 26B MoE or Qwen 3 32B.
  • Mistake: Running 8K context and expecting coherent outputs. Fix: 4B quality degrades past 4K context. Keep prompts short and focused.
  • Mistake: Using Gemma 3 4B as a production knowledge base. Fix: 4B models hallucinate more. Pair with RAG on vetted documents for factual tasks.
  • Mistake: Over-provisioning hardware ("I need an RTX 4090 for this"). Fix: It runs on a phone. You don't need a powerful GPU. CPU-only inference at 10-20 tok/s is perfectly usable for chat.
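One way to act on the "keep prompts short" advice is a pre-flight budget check. The sketch below is illustrative only: both function names are hypothetical, and the four-characters-per-token ratio is a crude heuristic for English text, not Gemma's tokenizer. Use the real tokenizer when accuracy matters.

```python
def rough_token_count(text, chars_per_token=4.0):
    """Crude token estimate from character count (heuristic, not a tokenizer)."""
    return int(len(text) / chars_per_token) + 1


def within_budget(prompt, budget_tokens=4096, reserve_for_reply=512):
    """True if the prompt plus a reply reserve fits the chosen token budget."""
    return rough_token_count(prompt) + reserve_for_reply <= budget_tokens


# A short prompt passes; a very long one should be trimmed or summarized first.
print(within_budget("Summarize this paragraph in one sentence."))
print(within_budget("x" * 20000))
```

The default budget of 4096 tokens mirrors the 4K point past which this review says quality degrades; the 512-token reply reserve is an arbitrary placeholder.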

Family & lineage

How this model relates to others in its lineage. Family members share architecture and training-data roots; parent / children edges record direct distillation or fine-tune relationships.

Parent / base model
Gemma 3 12B · 12B · Consumer
Distilled / fine-tuned from this

Strengths

  • Multimodal at 4B
  • Edge-class

Weaknesses

  • License restrictions

Quantization variants

Each quantization trades model quality for file size and VRAM. Q4_K_M is the most popular starting point.

Quantization   File size   VRAM required
Q4_K_M         2.5 GB      4 GB
Q8_0           4.4 GB      6 GB
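As a small illustration of how the table can drive a choice, the sketch below picks the largest listed quantization that fits a given VRAM budget. The helper name is hypothetical, and the figures are copied from the table above rather than measured.

```python
# (name, file size in GB, VRAM required in GB), copied from the table above
QUANTS = [("Q4_K_M", 2.5, 4), ("Q8_0", 4.4, 6)]


def pick_quant(vram_gb):
    """Return the largest listed quantization that fits, or None if none do."""
    fitting = [q for q in QUANTS if q[2] <= vram_gb]
    if not fitting:
        return None
    # Larger file size is used here as a proxy for higher quality.
    return max(fitting, key=lambda q: q[1])[0]


for vram in (2, 4, 8):
    print(f"{vram} GB VRAM -> {pick_quant(vram)}")
```

A result of None means no listed quantization fits fully in VRAM, in which case partial offload or CPU-only inference is the fallback.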

Get the model

Ollama

One-line install

ollama run gemma3:4b

HuggingFace

Original weights

huggingface.co/google/gemma-3-4b-it

Source repository: these are unquantized weights, so you must quantize them yourself.

Benchmarks

Real measurements on real hardware. Numbers ship with the runner version, quant, and date.

1 run on record

Hardware                                Provenance    Quant    Ctx   Tokens / sec   TTFT     Date
NVIDIA GeForce RTX 3080 16GB (Mobile)   Editorial M   Q4_K_M   4K    97.7 tok/s     743 ms   Jun 2, 2026
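To turn a benchmark row into a felt latency, a simple model is time to first token plus output length divided by decode speed. The function below is a sketch under that assumption; the name is ours, and it ignores prompt length effects beyond what TTFT already captures.

```python
def response_seconds(output_tokens, tok_per_sec, ttft_ms):
    """Approximate wall-clock time: TTFT plus steady-state decode time."""
    return ttft_ms / 1000.0 + output_tokens / tok_per_sec


# Figures from the RTX 3080 Mobile run on record: 97.7 tok/s, 743 ms TTFT.
seconds = response_seconds(256, tok_per_sec=97.7, ttft_ms=743)
print(f"256-token reply: ~{seconds:.1f} s")
```

For short replies the TTFT term dominates, so a high tok/s figure alone does not guarantee a snappy chat experience.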

Frequently asked

What's the minimum VRAM to run Gemma 3 4B?

4 GB of VRAM is enough to run Gemma 3 4B at the Q4_K_M quantization (file size 2.5 GB). Higher-quality quantizations need more.

Can I use Gemma 3 4B commercially?

Yes — Gemma 3 4B ships under the Gemma Terms of Use, which permits commercial use. Always read the license text before deployment.

What's the context length of Gemma 3 4B?

Gemma 3 4B supports a context window of 131,072 tokens (128K).

How do I install Gemma 3 4B with Ollama?

Run `ollama pull gemma3:4b` to download, then `ollama run gemma3:4b` to start a chat session. The default quantization is Q4_K_M.

Does Gemma 3 4B support images?

Yes — Gemma 3 4B is multimodal and accepts text + vision inputs. Vision support requires a runner that handles its image-conditioning architecture.
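For a sense of what image input looks like in practice, here is a minimal sketch against Ollama's local REST API, which accepts base64-encoded images in an `images` list on `/api/generate`. It assumes an Ollama server on the default port with `gemma3:4b` already pulled. The helper names and the `photo.jpg` path are hypothetical.

```python
import base64
import json
import urllib.request

OLLAMA_URL = "http://localhost:11434/api/generate"  # Ollama's default local endpoint


def build_request(prompt, image_bytes=None, model="gemma3:4b"):
    """Build a non-streaming generate request, optionally with one image."""
    payload = {"model": model, "prompt": prompt, "stream": False}
    if image_bytes is not None:
        # Ollama expects images as base64-encoded strings.
        payload["images"] = [base64.b64encode(image_bytes).decode("ascii")]
    return urllib.request.Request(
        OLLAMA_URL,
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def ask(prompt, image_path=None):
    """Send the prompt (and optional image file) and return the reply text."""
    image_bytes = None
    if image_path is not None:
        with open(image_path, "rb") as f:
            image_bytes = f.read()
    with urllib.request.urlopen(build_request(prompt, image_bytes)) as resp:
        return json.loads(resp.read())["response"]


# Example call (needs a running Ollama server and a real image file):
# print(ask("Describe this image in one sentence.", "photo.jpg"))
```

Splitting request construction from the network call keeps the payload easy to inspect without a server running.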

Source: huggingface.co/google/gemma-3-4b-it

Reviewed by RunLocalAI Editorial. See our editorial policy for how we research and verify model claims.
