gemma
26B parameters
Commercial OK
Multimodal
Reviewed June 2026

Gemma 4 26B MoE

MoE variant of Gemma 4. Faster per-token than the 31B dense at similar quality on most tasks.

License: Gemma Terms of Use·Released Apr 2, 2026·Context: 131,072 tokens

Our verdict

Fredoline Eruo · Verified Jun 12, 2026
unrated

Positioning

Gemma 4 26B MoE is Google's Mixture-of-Experts variant of the Gemma 4 family, designed to deliver faster per-token inference than the dense 31B sibling while maintaining similar quality on most tasks. Released under the Gemma Terms of Use, it offers a 131,072-token context window and targets workstation-class deployments. Because only a subset of its parameters is active per token, inference is more efficient than on a dense model of comparable total parameter count.

Strengths

  • MoE efficiency: With 26B total parameters but only ~7B active per token, per-token inference cost is closer to that of a small dense model than a 26B one, enabling faster generation on workstation hardware.
  • Long context: The 131K context window suits document analysis, codebase understanding, and multi-turn conversations without truncation.
  • Permissive licensing: The Gemma Terms of Use allow broad commercial use, making it a strong choice for enterprise deployment.
  • Quantization-friendly: At Q4_K_M (14.6 GB) or Q3_K_M (12.7 GB), the model fits comfortably on a single 24GB GPU with room for KV cache overhead.

Limitations

  • No independent benchmarks available: We do not have community-reported benchmark scores for this model. Published vendor metrics should be treated as best-case until verified by third parties.
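  • Workstation-class for full-GPU inference: Keeping all experts in VRAM takes about 20 GB at Q4_K_M, so a 24GB card is the practical baseline. 12GB consumer GPUs work only with expert offload to system RAM or at Q2_K, at a cost in speed or quality.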
  • KV cache overhead: At full 131K context, the KV cache can add 30-50% memory overhead, potentially requiring more aggressive (lower-bit) quantization or shorter contexts on limited hardware.
  • MoE routing overhead: While per-token speed is improved, MoE models can have higher memory bandwidth demands and may underperform on tasks requiring dense reasoning if routing is suboptimal.

What it takes to run this locally

Quantized sizes (disk): FP16 ~52 GB, Q8_0 ~28 GB, Q6_K ~21.4 GB, Q5_K_M ~18.5 GB, Q4_K_M ~14.6 GB, Q3_K_M ~12.7 GB, Q2_K ~8.5 GB. Add ~30-50% for KV cache and framework overhead at typical context lengths. The model is best suited for workstation deployment (single 24GB+ GPU). Q4_K_M or Q3_K_M quantizations are practical on a 24GB card; Q2_K may fit on 12GB but with significant quality loss. Multi-GPU setups can run higher precision variants.
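The sizing rule above can be checked with a few lines of arithmetic. This is a rough sketch, not a measurement: the disk sizes are copied from the paragraph, the 30-50% overhead is applied as a flat multiplier, and the function names are illustrative.

```python
# Disk sizes in GB, copied from the paragraph above.
QUANT_SIZES_GB = {
    "FP16": 52.0, "Q8_0": 28.0, "Q6_K": 21.4, "Q5_K_M": 18.5,
    "Q4_K_M": 14.6, "Q3_K_M": 12.7, "Q2_K": 8.5,
}

def memory_range_gb(quant, low=0.30, high=0.50):
    """Estimated memory (GB) at the low and high end of the overhead range."""
    size = QUANT_SIZES_GB[quant]
    return size * (1 + low), size * (1 + high)

def quants_that_fit(vram_gb, overhead=0.30):
    """Quantizations whose size plus a flat overhead fits in vram_gb."""
    return [q for q, size in QUANT_SIZES_GB.items()
            if size * (1 + overhead) <= vram_gb]

if __name__ == "__main__":
    for vram in (12, 24):
        print(f"{vram} GB card, 30% overhead:", quants_that_fit(vram))
    low, high = memory_range_gb("Q4_K_M")
    print(f"Q4_K_M needs roughly {low:.1f}-{high:.1f} GB")
```

Under these assumptions a 24GB card fits Q4_K_M and below, while a 12GB card fits only Q2_K, and only at the low end of the overhead range.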

Should you run this locally?

Yes if you need a permissively licensed MoE model with long context for workstation-class hardware, and you value faster per-token inference over the dense 31B variant. No if you lack a 24GB+ GPU or require verified third-party benchmarks before adoption.

Catalog cross-links

  • Gemma 4 31B Dense
  • Gemma 4 9B
  • Google Gemma family


How to run it

Gemma 4 26B MoE is Google's Mixture-of-Experts model with 26B total parameters (~7B active per token). The MoE architecture gives it quality closer to a ~15B dense model with the generation speed of a ~7B model.

Run at Q4_K_M via Ollama (ollama pull gemma4:26b-moe) or llama.cpp with -ngl 999 -fa -c 16384. Q4_K_M file size: ~15 GB on disk. Gemma's MoE layout may differ from other MoE models, so verify that your llama.cpp build supports Gemma 4 MoE specifically.

Minimum VRAM: 12 GB. An RTX 4070 (12GB) runs Q4_K_M with expert offload at 4K context. Recommended: RTX 4090 24GB at Q4_K_M, which handles 32K context comfortably with all experts in VRAM. Throughput: ~50-80 tok/s on RTX 4090 at Q4_K_M; with only ~7B active parameters, generation runs close to 7B dense speeds.

Use for: general chat, reasoning, and coding, where it bridges the gap between 7B and 30B dense quality at 7B speed. Context: 131,072 tokens supported; 32K+ is practical at Q4 on 24 GB. License: Gemma Terms of Use (verify commercial terms). For larger Gemma MoE: none in the 4 family. For dense Gemma: Gemma 4 31B if available.
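For scripted runs, the llama.cpp flags quoted above can be assembled in one place. This is a minimal sketch: the flags are copied from the text, but the binary name llama-cli and the GGUF filename are assumptions, so substitute whatever your build and download actually provide. The function only builds the argument list and does not launch anything.

```python
import shlex

def llama_cpp_command(model_path, context=16384, gpu_layers=999,
                      binary="llama-cli"):
    """Build an argument list from the flags quoted in the text.

    `binary` is an assumption; adjust it to match your llama.cpp build.
    """
    return [
        binary,
        "-m", model_path,           # path to the GGUF file
        "-ngl", str(gpu_layers),    # offload as many layers as possible to the GPU
        "-fa",                      # flash attention, as written in the text
        "-c", str(context),         # context length in tokens
    ]

if __name__ == "__main__":
    # Hypothetical filename, for illustration only.
    cmd = llama_cpp_command("gemma-4-26b-moe-Q4_K_M.gguf")
    print(shlex.join(cmd))
```

Pass the returned list to subprocess.run once you have confirmed that your runtime version accepts these flags in this form.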

Hardware guidance

Minimum: RTX 3060 12GB at Q3_K_M with expert offload. Recommended: RTX 4090 24GB at Q4_K_M (32K+ context, all experts in VRAM).

VRAM math: 26B total MoE, ~7B active. Q4_K_M ≈ 15 GB for full weights. KV cache at 16K: ~5 GB. Total with all experts in VRAM: ~20 GB, which fits 24 GB GPUs comfortably. With expert offload: ~5-7 GB of active experts in VRAM, rest in RAM.

Other hardware:

  • RTX 3080 10GB: Q3_K_M with expert offload.
  • RTX 4080 16GB: Q4_K_M with all experts on-GPU, 8K context.
  • MacBook Pro M4 Pro 24GB+: Q4 at 20-35 tok/s.
  • Cloud: A10 24GB at Q4_K_M.

This is one of the most efficient models in its quality tier: ~15B dense quality at 7B active speed on consumer GPUs.
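The VRAM math above is plain addition, and a small helper makes the two configurations easy to compare. This is a sketch using only the figures quoted in this section (15 GB of Q4_K_M weights, ~5 GB of KV cache at 16K, 5-7 GB of resident experts); the function name is illustrative, and real usage also includes framework overhead that is not modeled here.

```python
def vram_budget_gb(weights_gb, kv_cache_gb, resident_weights_gb=None):
    """Split memory between GPU and system RAM.

    With resident_weights_gb=None all weights stay on the GPU. Otherwise
    only that many GB of weights are GPU-resident (expert offload) and
    the remainder lives in system RAM.
    """
    on_gpu = weights_gb if resident_weights_gb is None else resident_weights_gb
    return {
        "vram_gb": on_gpu + kv_cache_gb,
        "ram_gb": weights_gb - on_gpu,
    }

if __name__ == "__main__":
    # All experts in VRAM at Q4_K_M with a 16K context.
    print(vram_budget_gb(15, 5))
    # Expert offload, low and high end of the 5-7 GB resident range.
    print(vram_budget_gb(15, 5, resident_weights_gb=5))
    print(vram_budget_gb(15, 5, resident_weights_gb=7))
```

At the high end of the resident range, a 16K context already fills a 12GB card, which is consistent with the shorter contexts quoted for 12GB hardware.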

What breaks first

  1. Gemma MoE architecture support. Gemma's MoE implementation may differ from Mixtral/DeepSeek MoE. Verify llama.cpp supports Gemma 4 MoE specifically.
  2. Expert offload stall. With experts in RAM, routing to a RAM-resident expert adds 30-80ms latency. On 12 GB GPUs with DDR4 RAM, this is noticeable. Use DDR5 RAM to minimize the penalty.
  3. Ollama tag availability. Gemma 4 MoE is new and may not be in Ollama's catalog yet. Use raw llama.cpp with conversion from Hugging Face.
  4. Q3 quant on MoE. MoE routing gates become noisier at low quants. The model may route to suboptimal experts at Q3, degrading output quality more than a dense model at the same quant.
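To get a feel for how much the expert offload stall in item 2 can cost, a simple average-latency model helps. This is an illustrative sketch only: the 30-80ms stall comes from the text, but the miss rate (the fraction of tokens that hit a RAM-resident expert) and the baseline speed are hypothetical inputs, and the model assumes at most one stall per token.

```python
def effective_tokens_per_second(base_tps, miss_rate, stall_ms):
    """Average throughput when a fraction of tokens pays an offload stall.

    base_tps  : tokens/s with every needed expert already in VRAM
    miss_rate : hypothetical fraction of tokens that hit a RAM-resident expert
    stall_ms  : extra latency per miss (30-80 ms per the text)
    """
    base_ms = 1000.0 / base_tps
    average_ms = base_ms + miss_rate * stall_ms
    return 1000.0 / average_ms

if __name__ == "__main__":
    # Hypothetical baseline of 40 tok/s; sweep miss rate and stall length.
    for miss_rate in (0.1, 0.25, 0.5):
        for stall_ms in (30, 80):
            tps = effective_tokens_per_second(40, miss_rate, stall_ms)
            print(f"miss={miss_rate:.2f} stall={stall_ms}ms -> {tps:.1f} tok/s")
```

Measure your own miss rate and baseline before relying on the output; the point is only that the stall is large compared with the time a fast GPU spends on a token, so even moderate miss rates have a visible effect.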

Runtime recommendation

llama.cpp with -ngl 999 for local use (verify Gemma MoE support). Ollama if tag exists. vLLM for serving. Gemma MoE benefits from expert-parallel scheduling in vLLM. Standard Gemma + MoE support required — verify your runtime version supports both.

Common beginner mistakes

  • Mistake: Assuming 26B MoE = 26B dense VRAM requirements. Fix: Q4 is 15 GB. With expert offload, only ~5-7 GB needs to be in VRAM. The model fits 12 GB GPUs. Do the math.
  • Mistake: Expecting 26B dense quality. Fix: MoE quality is closer to a ~15B dense model. The 26B total doesn't translate to 26B dense performance. The active subset (7B) drives quality.
  • Mistake: Confusing Gemma 4 26B MoE with Gemma 4 31B dense. Fix: Different architectures entirely. 26B MoE has 7B active, 31B dense has 31B active. The 31B is higher quality but slower. Pick based on your quality/speed tradeoff.
  • Mistake: Using standard Gemma GGUF conversion for MoE. Fix: MoE requires MoE-aware conversion. Use Gemma-4-MoE-specific scripts or pre-converted GGUFs.

Family & lineage

How this model relates to others in its lineage. Family members share architecture and training-data roots; parent / children edges record direct distillation or fine-tune relationships.

Strengths

  • MoE speed advantage
  • Multilingual
  • Multimodal

Weaknesses

  • MoE routing variance

Quantization variants

Each quantization trades model quality for file size and VRAM. Q4_K_M is the most popular starting point.

Quantization    File size    VRAM required
Q4_K_M          16.0 GB      20 GB

Get the model

Ollama

One-line install

ollama run gemma4:26b-moe

HuggingFace

Original weights

huggingface.co/google/gemma-4-26b-moe-it

Source repository — direct quantization required.

Frequently asked

What's the minimum VRAM to run Gemma 4 26B MoE?

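20GB of VRAM is enough to run Gemma 4 26B MoE at the Q4_K_M quantization (file size 16.0 GB) with all experts on the GPU. With expert offload to system RAM, 12GB cards can run it at reduced speed. Higher-quality quantizations need more.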

Can I use Gemma 4 26B MoE commercially?

Yes — Gemma 4 26B MoE ships under the Gemma Terms of Use, which permits commercial use. Always read the license text before deployment.

What's the context length of Gemma 4 26B MoE?

Gemma 4 26B MoE supports a context window of 131,072 tokens (about 131K).

How do I install Gemma 4 26B MoE with Ollama?

Run `ollama pull gemma4:26b-moe` to download, then `ollama run gemma4:26b-moe` to start a chat session. The default quantization is Q4_K_M.

Does Gemma 4 26B MoE support images?

Yes — Gemma 4 26B MoE is multimodal and accepts text + vision inputs. Vision support requires a runner that handles its image-conditioning architecture.

Source: huggingface.co/google/gemma-4-26b-moe-it

Reviewed by RunLocalAI Editorial. See our editorial policy for how we research and verify model claims.

Before you buy

Verify Gemma 4 26B MoE runs on your specific hardware before committing money.