gemma

4B parameters

Commercial OK

Multimodal

Reviewed June 2026

Gemma 3 4B

4B Gemma 3 for edge. Multimodal.

License: Gemma Terms of Use·Released Mar 12, 2025·Context: 131,072 tokens

BLK · VERDICT

Our verdict

OP · Fredoline Eruo|VERIFIED JUN 12, 2026

7.5/10

Positioning

The 4B Gemma 3 with multimodal capability. Genuinely the best small-model pick when image input matters and VRAM is constrained — fits in under 4 GB at Q4.

Strengths

Native multimodal at 4B — no other model in this size class does this credibly.
Conversational quality materially better than Phi 3.5 Mini for general chat.
128K context even at this size.

Limitations

Gemma license restrictiveness.
Math and structured tasks weaker than Phi 3.5 Mini.
Knowledge breadth narrow — small-model limitations are real.

Real-world performance on RTX 4090

Q4_K_M (2.7 GB): 130–150 tok/s decode, TTFT under 50 ms
Q5_K_M (3.2 GB): 115–135 tok/s
Q8_0 (4.8 GB): 95–115 tok/s

Should you run this locally?

Yes, for edge devices with multimodal input requirements, 4–6 GB GPU owners who want chat + vision. No, for math/structured tasks (pick Phi 3.5 Mini), or where chat-only ≥ 8B is a better fit.

How it compares

vs Phi-3.5 Mini (3.8B) → Gemma 3 4B wins on chat + multimodal; Phi wins on math + structured output.
vs Llama 3.2 3B → similar text capability; Gemma adds multimodal.
vs Gemma 3 1B → 4B is meaningfully smarter; 1B is for very tight constraints.

Run this yourself

ollama pull gemma3:4b-it-q4_K_M
ollama run gemma3:4b-it-q4_K_M

Settings: Q4_K_M GGUF, 8192 ctx, llama.cpp/CUDA, RTX 4090

›Why this rating

7.5/10 — best 4B-class general model when you want multimodal at edge size. Loses to Phi-3.5 Mini on math + structured tasks but beats it on chat naturalness.

Overview

4B Gemma 3 for edge. Multimodal.

How to run it

Gemma 3 4B is Google's smallest dense model — 4B parameters, designed for the "runs on literally anything" tier. Run at Q4_K_M via Ollama (ollama pull gemma3:4b) or llama.cpp with -ngl 999 -fa -c 4096. Q4_K_M file size ~2.5 GB on disk. Minimum VRAM: 2 GB — integrated graphics, phone GPU, or any discrete GPU. CPU-only works fine at 10-20 tok/s on modern laptop. Recommended: any device with 4+ GB RAM. Throughput: ~100-150+ tok/s on RTX 4090 at Q4_K_M — flies. ~15-25 tok/s CPU-only on MacBook Air. Gemma 3 architecture — well-supported everywhere. The 4B is the tiny-but-capable tier: good for text classification, extraction, simple Q&A, grammar correction, lightweight chat. Not for: complex reasoning, long-form generation, nuanced conversation, coding — 4B is the quality ceiling. Context: 8K advertised; practical at Q4 is 8K on any device. For step-up: Gemma 3 12B or Granite 3 MoE 3B-Active for MoE at similar active size with better quality. License: Gemma license (permissive, verify commercial terms). Gemma 3 4B is the default recommendation for edge/mobile/CPU-only when quality requirements are modest and speed + deployability matter most.

Hardware guidance

Minimum: 2 GB RAM CPU-only at Q4_K_M (~5-10 tok/s). Raspberry Pi 5 8GB: Q4 at 10-18 tok/s. Phone: runs via MLC LLM or llama.cpp Android at 5-10 tok/s. VRAM math: 4B dense, Q4_K_M ≈ 2.5 GB. KV cache at 4K: ~0.5 GB. Total: ~3 GB at 4K. Any discrete GPU from GTX 1050 upward fits comfortably. RTX 4090 24GB: absurdly over-provisioned — but delivers 150+ tok/s. CPU-only on modern laptop (16 GB RAM): 10-20 tok/s. Intel integrated graphics: 8-15 tok/s via llama.cpp Vulkan backend. MacBook Air M1 8GB: Q4 at 15-25 tok/s via MLX-LM. This is the most deployable model — runs on phones, Raspberry Pi, old laptops, AWS t3.micro. The hardware floor is essentially "any device made after 2018."

What breaks first

4B quality ceiling. The model's fundamental capability is limited by 4B parameters. Complex reasoning, nuanced understanding, and deep knowledge recall hit the wall hard. 2. Hallucination rate. Small models hallucinate more. Gemma 3 4B will confidently produce incorrect facts more often than 7B+ models. Don't treat it as a knowledge base. 3. Context window overclaim. 8K is advertised but usable context at 4B degrades past 4K tokens. The model loses track of earlier context. Keep prompts concise. 4. Not multilingual. Gemma 3 4B's multilingual quality is weak. English is strongest; other languages may produce broken or English-mixed outputs.

Runtime recommendation

Ollama for quick-start on any device. llama.cpp for CPU-only, Vulkan integrated GPU, or edge deployment. MLX-LM for Apple Silicon. MLC LLM for Android/iOS. Gemma 3 4B is universally supported — pick the runtime that matches your hardware. CPU-only via llama.cpp is surprisingly fast at this size.

Common beginner mistakes

Mistake: Deploying Gemma 3 4B for tasks that need reasoning depth. Fix: 4B is a lightweight model. For complex tasks, use at least Gemma 4 26B MoE or Qwen 3 32B. Mistake: Running 8K context and expecting coherent outputs. Fix: 4B quality degrades past 4K context. Keep prompts short and focused. Mistake: Using Gemma 3 4B as a production knowledge base. Fix: 4B models hallucinate more. Pair with RAG on vetted documents for factual tasks. Mistake: Over-provisioning hardware ("I need an RTX 4090 for this"). Fix: It runs on a phone. You don't need a powerful GPU. CPU-only inference at 10-20 tok/s is perfectly usable for chat.

Family & lineage

How this model relates to others in its lineage. Family members share architecture and training-data roots; parent / children edges record direct distillation or fine-tune relationships.

Parent / base model

Gemma 3 12B12B

Consumer

Family siblings (gemma-3)

Gemma 3 1B1B

Edge

Gemma 3 4B4B

You are here

Trendyol LLM Asure 12B11.8B

Distilled / fine-tuned from this

Gemma 3 1B1B

Edge

Strengths

Multimodal at 4B
Edge-class

Weaknesses

License restrictions

Quantization variants

Each quantization trades model quality for file size and VRAM. Q4_K_M is the most popular starting point.

Quantization	File size	VRAM required
Q4_K_M	2.5 GB	4 GB
Q8_0	4.4 GB	6 GB

Get the model

Ollama

One-line install

ollama run gemma3:4bRead our Ollama review →

HuggingFace

Original weights

huggingface.co/google/gemma-3-4b-it

Source repository — direct quantization required.

Benchmarks

Real measurements on real hardware. Numbers ship with the runner version, quant, and date.

1 run on record

Hardware	Provenance	Quant	Ctx	Tokens / sec	TTFT	Date
NVIDIA GeForce RTX 3080 16GB (Mobile)	EditorialM	Q4_K_M	4K	97.7tok/s	743 ms	Jun 2, 26

What to do next

Got this model running on real hardware? Share what you measured — the form arrives with the model pre-selected.

Submit a benchmark for Gemma 3 4B

OrBrowse the benchmark roadmap Compare hardware options

Hardware that runs this

Cards with enough VRAM for at least one quantization of Gemma 3 4B.

NVIDIA B300 (Blackwell Ultra)

Compare alternatives

Models worth comparing

Same parameter band, plus what's one tier above and below — so you can decide what actually fits your hardware.

Same tier

Models in the same parameter band as this one

Step up

More capable — bigger memory footprint

Step down

Smaller — faster, runs on weaker hardware

Frequently asked

What's the minimum VRAM to run Gemma 3 4B?

4GB of VRAM is enough to run Gemma 3 4B at the Q4_K_M quantization (file size 2.5 GB). Higher-quality quantizations need more.

Can I use Gemma 3 4B commercially?

Yes — Gemma 3 4B ships under the Gemma Terms of Use, which permits commercial use. Always read the license text before deployment.

What's the context length of Gemma 3 4B?

Gemma 3 4B supports a context window of 131,072 tokens (about 131K).

How do I install Gemma 3 4B with Ollama?

Run `ollama pull gemma3:4b` to download, then `ollama run gemma3:4b` to start a chat session. The default quantization is Q4_K_M.

Does Gemma 3 4B support images?

Yes — Gemma 3 4B is multimodal and accepts text + vision inputs. Vision support requires a runner that handles its image-conditioning architecture.

Source: huggingface.co/google/gemma-3-4b-it

Reviewed by RunLocalAI Editorial. See our editorial policy for how we research and verify model claims.

Related — keep moving

Compare hardware

Buyer guides

When it doesn't work

Recommended hardware

Alternatives

Gemma 3 1B Gemma 3 27B Gemma 3 12B Trendyol LLM Asure 12B

Before you buy

Verify Gemma 3 4B runs on your specific hardware before committing money.

Will it run on my hardware? →Custom hardware comparison →GPU recommender (4 questions) →