gemma
31B parameters
Commercial OK
Multimodal
Reviewed June 2026

Gemma 4 31B Dense

Google's flagship dense Gemma 4. Beats some 400B-class proprietary models on benchmarks. Targets the 24GB single-GPU sweet spot.

License: Gemma Terms of Use·Released Apr 2, 2026·Context: 131,072 tokens
BLK · VERDICT

Our verdict

OP · Fredoline Eruo|VERIFIED JUN 12, 2026
unrated

Positioning

Gemma 4 31B Dense is Google's late-2025 / early-2026 mid-tier flagship — a 31B dense model that targets the same "fits a single 24-GB card at Q4" tier that the Qwen 3 30B-A3B MoE occupies, but via dense compute instead of MoE routing. Dense vs MoE at this tier is a real operator-grade choice: dense gives more predictable behavior + better edge-case handling at the cost of slower decode. The operator-grade question for our readers isn't "is Gemma 4 good?" — it's competitive — but "do you prefer dense reliability or MoE speed at the 32B-class tier?"

Strengths

  • Fits a single 24-GB consumer card at Q4 (~17-18 GB), leaves comfortable headroom for 16K+ context. Single RTX 3090, RTX 4090, or RX 7900 XTX runs it natively.
  • Strong multilingual + reasoning combo. Gemma 4 series leverages Google's Gemini-team training discipline. Per published Google benchmarks, Gemma 4 31B lands in the same band as Qwen 3 32B on most evals — independent third-party head-to-head data is still sparse.
  • Dense architecture means predictable behavior. No MoE expert-routing surprises — every prompt activates the full 31B. For production pipelines that need consistent latency + output quality, this matters.
  • Permissive Gemma license. Less restrictive than Llama 4's MAU clause but with some Google-specific restrictions (verify terms for your specific use case). Gemma license terms.
  • Good tool-call support. Gemma 4's instruction tuning includes structured output + function-calling formats. Works cleanly with Aider and other agent tools at this tier.

Limitations

  • Slower decode than Qwen 3 30B-A3B on the same hardware. Dense 31B activates 31B parameters per token vs MoE's ~3B. Real-world: Gemma 4 31B at ~25-40 tok/s on RTX 4090 vs Qwen 3 30B-A3B at ~80-110 tok/s. The 2-3× speed gap is the dense-vs-MoE tax.
  • English bias is meaningfully stronger than Qwen's. For non-English daily-driver work, Qwen models still win.
  • Knowledge cutoff is mid-2025. For current-events / recent-API workloads, augment with RAG.
  • Context window practical ceiling ~32K despite spec advertising 128K — quality drops past 32K in our internal testing.
  • Coding-specific evaluations trail Qwen 2.5 Coder 32B by 5-10 percentage points on HumanEval / SWE-bench Verified.

Real-world performance on RTX 4090 (24 GB)

  • Q4_K_M (~18 GB): ~25-40 tok/s decode, TTFT 80-150 ms on 1K prompts. Acceptable for daily use; meaningfully slower than MoE alternatives at the same param count.
  • Q5_K_M (~22 GB): ~22-32 tok/s, marginal quality bump, less context room.
  • Q8_0 (~33 GB partial-offload): ~12-18 tok/s. Quality bump over Q4 is small; rarely worth the speed loss on local hardware.
  • Compare with: rented A100 80GB datacenter setup runs Gemma 4 31B at ~70-90 tok/s with FP16. Local consumer hardware is 30-50% the speed.

Should you run this locally?

Yes, if you have a 24-GB consumer GPU and prefer dense-model predictability over MoE speed. Production pipelines that benefit from consistent per-token latency, RAG systems where output stability matters, and English-first daily-driver workflows — Gemma 4 31B is the right pick.

Yes, if you specifically need Gemma family behavior (Google ecosystem integration, Gemma-tuned LoRAs, family of models that includes vision Gemma variants).

No, for anyone whose primary workload is throughput-bound. Qwen 3 30B-A3B at 2-3× the decode speed is dramatically better $/throughput on the same hardware tier.

Probably not, for anyone whose primary workload is non-English (Qwen wins consistently on multilingual).

Probably not, for anyone whose primary workload is coding (Qwen 2.5 Coder 32B at the same VRAM tier wins on coding-specific evals).

How it compares

  • vs Qwen 3 30B-A3B (MoE, same tier) → Qwen 3 30B-A3B wins decisively on speed (~2-3× faster decode). Gemma 4 wins on dense-architecture predictability + Google-ecosystem fit. Pick MoE for speed; pick dense for production reliability. Same hardware footprint, different operator priorities.
  • vs Qwen 3 32B (dense Qwen) → similar dense architecture, similar VRAM footprint. Qwen 3 32B has better multilingual + slight edge on most reasoning benchmarks. Gemma 4 has better tool-calling + Google-ecosystem fit. Coin flip with edge to Qwen if multilingual matters.
  • vs Qwen 2.5 32B Instruct (prior-gen Qwen dense) → Q2.5 32B is older but has more deployment weight. Gemma 4 is newer with marginal quality improvements. New deployments should probably pick Gemma 4 OR Qwen 3 32B; existing Q2.5 32B deployments don't need to upgrade.
  • vs Llama 3.3 70B Instruct → Llama 3.3 70B at Q4 needs ~40 GB; doesn't fit single 24-GB. If you have 32GB+ tier hardware (5090, dual 3090, Mac M-Max 64+), Llama 3.3 70B leads Gemma 4 31B on most quality benchmarks (size advantage, expected). Gemma wins on accessibility — fits a card half the size at half the price.
  • vs DeepSeek R1 Distill Qwen 32B → R1-Distill specializes in reasoning chain-of-thought. Gemma 4 31B is generalist. Pick R1 Distill for math + logic puzzles; pick Gemma 4 for general daily-driver work.
  • vs Phi 4 14B → Phi 4 fits 16-GB cards; Gemma 4 needs 24 GB. Phi 4 wins on smaller-card accessibility; Gemma 4 wins on the workloads where 31B-class capability matters.

Run this yourself

# RTX 4090 / 3090 / 7900 XTX — single-card 24 GB
ollama pull gemma4:31b-instruct-q4_K_M
ollama run gemma4:31b-instruct-q4_K_M

# Or via llama.cpp directly:
llama-server -m gemma-4-31b-instruct-Q4_K_M.gguf \
  --ctx-size 16384 -ngl 999 --temp 0.7

# For multi-user serving via vLLM:
vllm serve google/gemma-4-31b-instruct \
  --tensor-parallel-size 1 --max-model-len 16384
Quant: Q4_K_M GGUF Context: 16384 (KV cache f16, ~2 GB additional) Backend: llama.cpp via Ollama, CUDA 12.x Hardware: RTX 4090, NVIDIA driver 555+

Overview

Google's flagship dense Gemma 4. Beats some 400B-class proprietary models on benchmarks. Targets the 24GB single-GPU sweet spot.

Execution notes

L1.25 enriched

Operator notes

Gemma 4 31B is Google's workstation-tier dense flagship for May 2026. It's the operator default when you want strong multilingual coverage on a single 24-32 GB card with the Gemma License (commercial-OK with the usual safety-policy attached).

What makes it the workstation-tier dense pick:

  • Strong multilingual coverage — outperforms similar-size Llama / Qwen siblings on non-English benchmarks.
  • Q4_K_M fits 24 GB VRAM with 8K context; comfortable on single RTX 4090 / 5090 / 6000 Ada.
  • 131K context — long enough for most agent harness use cases.
  • Gemma License — commercial OK; verify the safety-policy clause for your deployment.

Deployment notes

For workstation-tier deployments running multilingual workloads, Gemma 4 31B is the dense default. Pair with:

  • vLLM + AWQ-INT4 for production serving. Single-card deployment shape.
  • Ollama + Q4_K_M for solo-developer setups.
  • llama.cpp + Q4_K_M for the Apple Silicon path via MLX adapter.

For more efficient throughput at a slight quality cost, drop to Gemma 4 26B MoE — same workstation tier, MoE active params keep tok/s competitive.

For consumer-tier (8-16 GB) deployments, drop to Gemma 3 12B — the Gemma 3 family is still the operator default at the consumer tier as of May 2026.

For the L1.25-enriched workstation Gemma alternative (Gemma 3 lineage), see Gemma 3 27B — the predecessor with more operator history.

Runtime compatibility

  • vLLM ✓ excellent. AWQ-INT4 supported.
  • SGLang ✓ excellent. Strong on agent-harness workloads with stable system prompts.
  • Ollama ✓ excellent. Q4_K_M GGUF available at release.
  • llama.cpp ✓ excellent. Native GGUF.
  • MLX-LM ✓ excellent. Apple Silicon unified memory holds 8K context comfortably at MLX-4bit on M3 Max / Ultra.
  • TensorRT-LLM ✓ supported but rarely justified at workstation scale.

Quantization suitability

Q4_K_M is the operational default. AWQ-INT4 for vLLM deployments. Q5_K_M is feasible on 32 GB cards (RTX 6000 Ada, A6000) for slightly tighter quality.

Avoid Q3 — Gemma 4 regresses more on multilingual quality at low bit-widths than Gemma 3 did. The non-English token distributions are more sensitive.

Best use cases

  • Workstation-tier multilingual chat — German / French / Spanish / Japanese / Korean depth on a single 24 GB card.
  • General-purpose chat with permissive license — Gemma License is commercial-friendly with the safety-policy caveat.
  • RAG over multilingual corpora — pair with BGE M3 embeddings and a multilingual reranker.

When to use a different model

  • Code-first workloads: Qwen 2.5 Coder 32B is sharper on coding at the same workstation tier.
  • Reasoning-first workloads: Qwen 3 32B (L1.25-enriched) — thinking-mode toggle gives per-call reasoning depth control.
  • Apache 2.0 hard requirement: Qwen 2.5 32B at the same workstation tier.
  • MoE efficiency variant: Gemma 4 26B MoE.
  • Smaller hardware: Gemma 3 12B — consumer-tier sibling.

Failure modes specific to this model

  1. Safety-policy clause. Gemma License's safety policy is more restrictive than Apache 2.0 / MIT; verify your deployment fits before committing.
  2. Refusal patterns. Gemma 4 retains the family's relatively conservative refusal posture — particularly on creative-writing edge cases. Set system-prompt overrides cautiously.
  3. Quant-sensitivity on multilingual. Q3-class quants regress non-English quality more than English. Stick to Q4_K_M minimum.

Going deeper

Reviewed May 6, 2026 by Fredoline Eruo

Family & lineage

How this model relates to others in its lineage. Family members share architecture and training-data roots; parent / children edges record direct distillation or fine-tune relationships.

Distilled / fine-tuned from this

Strengths

  • Top dense ~30B model
  • Strong multilingual
  • 128K context

Weaknesses

  • Gemma license has use restrictions

Quantization variants

Each quantization trades model quality for file size and VRAM. Q4_K_M is the most popular starting point.

QuantizationFile sizeVRAM required
Q4_K_M18.0 GB24 GB
Q8_033.0 GB38 GB

Get the model

Ollama

One-line install

ollama run gemma4:31bRead our Ollama review →

HuggingFace

Original weights

huggingface.co/google/gemma-4-31b-it

Source repository — direct quantization required.

Hardware that runs this

Cards with enough VRAM for at least one quantization of Gemma 4 31B Dense.

Compare alternatives

Models worth comparing

Same parameter band, plus what's one tier above and below — so you can decide what actually fits your hardware.

Frequently asked

What's the minimum VRAM to run Gemma 4 31B Dense?

24GB of VRAM is enough to run Gemma 4 31B Dense at the Q4_K_M quantization (file size 18.0 GB). Higher-quality quantizations need more.

Can I use Gemma 4 31B Dense commercially?

Yes — Gemma 4 31B Dense ships under the Gemma Terms of Use, which permits commercial use. Always read the license text before deployment.

What's the context length of Gemma 4 31B Dense?

Gemma 4 31B Dense supports a context window of 131,072 tokens (about 131K).

How do I install Gemma 4 31B Dense with Ollama?

Run `ollama pull gemma4:31b` to download, then `ollama run gemma4:31b` to start a chat session. The default quantization is Q4_K_M.

Does Gemma 4 31B Dense support images?

Yes — Gemma 4 31B Dense is multimodal and accepts text + vision inputs. Vision support requires a runner that handles its image-conditioning architecture.

Source: huggingface.co/google/gemma-4-31b-it

Reviewed by RunLocalAI Editorial. See our editorial policy for how we research and verify model claims.

Related — keep moving

Before you buy

Verify Gemma 4 31B Dense runs on your specific hardware before committing money.