Gemma 4 E2B (Effective 2B)
Smallest Gemma 4. Designed for phones and Raspberry-Pi-class hardware.
Positioning
Gemma 4 E2B (Effective 2B) is the smallest entry in Google's Gemma 4 family, a dense 2-billion-parameter model released under the Gemma Terms of Use. With a 131,072-token context window, it is designed for edge deployment on phones, Raspberry Pi boards, and similar low-power hardware. Its compact size and permissive license make it a candidate for on-device applications where privacy and offline capability are priorities.
Strengths
- Extremely compact footprint: At 2B parameters, the model fits comfortably on consumer hardware. Model files range from ~4 GB unquantized (FP16) down to ~0.7 GB at the smallest quantization (Q2_K), enabling deployment on devices with limited RAM.
- Long context for an edge model: A 131K token context window is unusually large for a 2B-parameter model, allowing it to process substantial documents or conversation histories on-device.
- Permissive licensing for commercial use: The Gemma Terms of Use allow broad commercial deployment, making it suitable for integration into products without restrictive licensing.
- Designed for low-power hardware: Google explicitly targets phones and Raspberry-Pi-class devices, which suggests the model is intended for inference on ARM CPUs, mobile GPUs, and other constrained environments.
Limitations
- Small parameter count limits capability: As a 2B dense model, it will not match the reasoning depth or knowledge breadth of larger models. Operators should expect higher perplexity and narrower competence on complex tasks.
- No community benchmarks available: We do not yet have independent measurements for this model. Published vendor metrics should be treated as best-case, and real-world performance may vary significantly.
- KV cache overhead at full context: With 131K context, the KV cache can dominate memory. At FP16, the cache alone may exceed 2 GB, pushing total memory requirements well beyond the model weights. Quantization helps but careful memory budgeting is required.
- Limited ecosystem maturity: As a new model, tooling (e.g., llama.cpp support, quantization scripts, community fine-tunes) may lag behind more established edge models like Gemma 2 or Phi-3.
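The KV cache limitation above can be made concrete with the usual back-of-the-envelope estimate for a plain full-attention transformer: two tensors (keys and values) per layer, each holding one vector per KV head per token. The sketch below is only an illustration. The layer count, KV head count, and head dimension are hypothetical placeholders, not published Gemma 4 E2B values, and the formula ignores sliding-window attention or any cache sharing, which would reduce the real figure. Substitute the numbers from the model's actual configuration before relying on the result.

```python
def kv_cache_bytes(n_layers, n_kv_heads, head_dim, n_tokens, bytes_per_elem=2):
    """Estimate KV cache size for a plain full-attention transformer.

    bytes_per_elem=2 corresponds to FP16 cache entries.
    """
    # Factor of 2: one key tensor and one value tensor per layer.
    return 2 * n_layers * n_kv_heads * head_dim * n_tokens * bytes_per_elem


# Hypothetical architecture values, chosen only for illustration.
# They are NOT taken from the Gemma 4 E2B model card.
HYPOTHETICAL_LAYERS = 30
HYPOTHETICAL_KV_HEADS = 2
HYPOTHETICAL_HEAD_DIM = 256

for n_tokens in (4_096, 32_768, 131_072):
    size = kv_cache_bytes(
        HYPOTHETICAL_LAYERS, HYPOTHETICAL_KV_HEADS, HYPOTHETICAL_HEAD_DIM, n_tokens
    )
    print(f"{n_tokens:>7} tokens: {size / 2**30:.2f} GiB at FP16")
```

Because the cache grows linearly with the number of tokens, capping the context length is the most direct way to keep total memory close to the size of the weights.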
What it takes to run this locally
Model file sizes by quantization:
- FP16: ~4 GB
- Q8_0: ~2 GB
- Q6_K: ~1.6 GB
- Q5_K_M: ~1.4 GB
- Q4_K_M: ~1.1 GB
- Q3_K_M: ~1.0 GB
- Q2_K: ~0.7 GB
Add ~30-50% for KV cache and framework overhead at typical context lengths. At the full 131K context the KV cache alone can be significant, so plan for additional memory.
Deployment class: edge. A single 4-8 GB GPU or a modern phone SoC (e.g., Apple A-series, Snapdragon 8 Gen series) can run quantized versions. A Raspberry Pi 4 or 5 with 4-8 GB RAM can run Q4_K_M or smaller quantizations.
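The sizing rule above amounts to simple arithmetic, sketched here as a small helper. It uses the approximate file sizes from the list in this section and the upper end of the 30-50% overhead range by default. The function names and the notion of "available memory" (RAM or VRAM left over after the operating system and other applications) are assumptions made for illustration, and the estimate only covers typical context lengths, not the full 131K window.

```python
# Approximate model file sizes in GB, copied from the list in this section.
FILE_SIZE_GB = {
    "FP16": 4.0,
    "Q8_0": 2.0,
    "Q6_K": 1.6,
    "Q5_K_M": 1.4,
    "Q4_K_M": 1.1,
    "Q3_K_M": 1.0,
    "Q2_K": 0.7,
}


def estimated_memory_gb(quant, overhead=0.5):
    """File size plus a fractional allowance for KV cache and framework overhead."""
    return FILE_SIZE_GB[quant] * (1.0 + overhead)


def quants_that_fit(available_gb, overhead=0.5):
    """Return the quantizations whose estimated footprint fits, largest first."""
    by_size = sorted(FILE_SIZE_GB, key=FILE_SIZE_GB.get, reverse=True)
    return [q for q in by_size if estimated_memory_gb(q, overhead) <= available_gb]


# Example: a device with roughly 2 GB free for the model (hypothetical figure).
print(quants_that_fit(2.0))
```

Picking the largest quantization that fits is a reasonable default, since larger files generally preserve more model quality.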
Should you run this locally?
Yes, if you need a permissively licensed, small model for on-device inference where privacy, offline capability, and low power consumption are critical. It is best suited to mobile apps, IoT, or embedded systems that require long-context understanding without cloud connectivity.
No, if your task demands strong reasoning, factual accuracy, or broad knowledge. Larger models (e.g., Gemma 4 27B or other 7B+ models) will likely serve better. Also avoid it if you need mature community tooling or verified benchmarks; this model is early in its lifecycle.
Catalog cross-links
- Gemma 4 27B
- Gemma 2 2B
- Raspberry Pi 5
Family & lineage
How this model relates to others in its lineage. Family members share architecture and training-data roots; parent / children edges record direct distillation or fine-tune relationships.
Strengths
- Phone-class footprint
- Multimodal
Weaknesses
- Limited reasoning
Quantization variants
Each quantization trades model quality for file size and VRAM. Q4_K_M is the most popular starting point.
| Quantization | File size | VRAM required |
|---|---|---|
| Q4_K_M | 1.3 GB | 3 GB |
| Q8_0 | 2.2 GB | 4 GB |
Get the model
Ollama
One-line install
ollama run gemma4:e2b
HuggingFace
Original weights
Source repository. The weights are unquantized, so you need to quantize them yourself.
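For the Ollama route, once the model has been pulled with the command above, it can also be called programmatically through Ollama's local HTTP API. The sketch below uses only the Python standard library and assumes a default Ollama install listening on `localhost:11434`. The helper names `build_request` and `generate` are illustrative, and the `num_ctx` default of 4096 is an arbitrary choice made to keep KV cache memory small, not a recommended setting.

```python
import json
import urllib.request

# Default endpoint of a locally running Ollama server (assumed unchanged).
OLLAMA_URL = "http://localhost:11434/api/generate"


def build_request(prompt, model="gemma4:e2b", num_ctx=4096):
    """Build a non-streaming generate request for the local Ollama server."""
    payload = {
        "model": model,
        "prompt": prompt,
        "stream": False,  # return one JSON object instead of a token stream
        "options": {"num_ctx": num_ctx},  # context window to allocate
    }
    return urllib.request.Request(
        OLLAMA_URL,
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def generate(prompt, **kwargs):
    """Send the prompt and return the generated text (requires a running server)."""
    with urllib.request.urlopen(build_request(prompt, **kwargs)) as resp:
        return json.loads(resp.read())["response"]
```

With the server running, `generate("Explain on-device inference in one sentence.")` returns the model's reply as a string. Raising `num_ctx` toward the full context window increases memory use accordingly.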
Benchmarks
Real measurements on real hardware. Numbers ship with the runner version, quant, and date.
| Hardware | Provenance | Quant | Ctx | Tokens / sec | TTFT | Date |
|---|---|---|---|---|---|---|
| NVIDIA GeForce RTX 3080 16GB (Mobile) | EditorialM | Q4_K_M | 4K | 99.1 tok/s | 792 ms | Jun 2, 26 |
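The two speed figures in the table combine into a rough end-to-end latency estimate: time to first token plus the output length divided by the generation rate. The sketch below applies that to the measured row. It is an approximation only, since TTFT grows with prompt length and throughput varies with context size and hardware, and the output lengths in the loop are arbitrary examples.

```python
def estimated_response_seconds(n_output_tokens, tokens_per_sec, ttft_ms):
    """Approximate wall-clock time for one response: TTFT plus generation time."""
    return ttft_ms / 1000.0 + n_output_tokens / tokens_per_sec


# Figures from the RTX 3080 Mobile row above (Q4_K_M, 4K context).
TOKENS_PER_SEC = 99.1
TTFT_MS = 792

for n_tokens in (64, 256, 1024):
    seconds = estimated_response_seconds(n_tokens, TOKENS_PER_SEC, TTFT_MS)
    print(f"{n_tokens:>5} output tokens: ~{seconds:.1f} s")
```

For short replies the fixed TTFT dominates, while for long replies the tokens-per-second figure matters more, so both columns are worth comparing across hardware.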
Frequently asked
What's the minimum VRAM to run Gemma 4 E2B (Effective 2B)?
Can I use Gemma 4 E2B (Effective 2B) commercially?
What's the context length of Gemma 4 E2B (Effective 2B)?
How do I install Gemma 4 E2B (Effective 2B) with Ollama?
Does Gemma 4 E2B (Effective 2B) support images?
Source: huggingface.co/google/gemma-4-e2b-it
Reviewed by RunLocalAI Editorial. See our editorial policy for how we research and verify model claims.