Gemma 4 31B Dense
Google's flagship dense Gemma 4. Beats some 400B-class proprietary models on benchmarks. Targets the 24GB single-GPU sweet spot.
Positioning
Gemma 4 31B Dense is Google's late-2025 / early-2026 mid-tier flagship — a 31B dense model that targets the same "fits a single 24-GB card at Q4" tier that the Qwen 3 30B-A3B MoE occupies, but via dense compute instead of MoE routing. Dense vs MoE at this tier is a real operator-grade choice: dense gives more predictable behavior + better edge-case handling at the cost of slower decode. The operator-grade question for our readers isn't "is Gemma 4 good?" — it's competitive — but "do you prefer dense reliability or MoE speed at the 32B-class tier?"
Strengths
- Fits a single 24-GB consumer card at Q4 (~17-18 GB), leaves comfortable headroom for 16K+ context. Single RTX 3090, RTX 4090, or RX 7900 XTX runs it natively.
- Strong multilingual + reasoning combo. Gemma 4 series leverages Google's Gemini-team training discipline. Per published Google benchmarks, Gemma 4 31B lands in the same band as Qwen 3 32B on most evals — independent third-party head-to-head data is still sparse.
- Dense architecture means predictable behavior. No MoE expert-routing surprises — every prompt activates the full 31B. For production pipelines that need consistent latency + output quality, this matters.
- Permissive Gemma license. Less restrictive than Llama 4's MAU clause but with some Google-specific restrictions (verify terms for your specific use case). Gemma license terms.
- Good tool-call support. Gemma 4's instruction tuning includes structured output + function-calling formats. Works cleanly with Aider and other agent tools at this tier.
Limitations
- Slower decode than Qwen 3 30B-A3B on the same hardware. Dense 31B activates 31B parameters per token vs MoE's ~3B. Real-world: Gemma 4 31B at ~25-40 tok/s on RTX 4090 vs Qwen 3 30B-A3B at ~80-110 tok/s. The 2-3× speed gap is the dense-vs-MoE tax.
- English bias is meaningfully stronger than Qwen's. For non-English daily-driver work, Qwen models still win.
- Knowledge cutoff is mid-2025. For current-events / recent-API workloads, augment with RAG.
- Context window practical ceiling ~32K despite spec advertising 128K — quality drops past 32K in our internal testing.
- Coding-specific evaluations trail Qwen 2.5 Coder 32B by 5-10 percentage points on HumanEval / SWE-bench Verified.
Real-world performance on RTX 4090 (24 GB)
- Q4_K_M (~18 GB): ~25-40 tok/s decode, TTFT 80-150 ms on 1K prompts. Acceptable for daily use; meaningfully slower than MoE alternatives at the same param count.
- Q5_K_M (~22 GB): ~22-32 tok/s, marginal quality bump, less context room.
- Q8_0 (~33 GB partial-offload): ~12-18 tok/s. Quality bump over Q4 is small; rarely worth the speed loss on local hardware.
- Compare with: rented A100 80GB datacenter setup runs Gemma 4 31B at ~70-90 tok/s with FP16. Local consumer hardware is 30-50% the speed.
Should you run this locally?
Yes, if you have a 24-GB consumer GPU and prefer dense-model predictability over MoE speed. Production pipelines that benefit from consistent per-token latency, RAG systems where output stability matters, and English-first daily-driver workflows — Gemma 4 31B is the right pick.
Yes, if you specifically need Gemma family behavior (Google ecosystem integration, Gemma-tuned LoRAs, family of models that includes vision Gemma variants).
No, for anyone whose primary workload is throughput-bound. Qwen 3 30B-A3B at 2-3× the decode speed is dramatically better $/throughput on the same hardware tier.
Probably not, for anyone whose primary workload is non-English (Qwen wins consistently on multilingual).
Probably not, for anyone whose primary workload is coding (Qwen 2.5 Coder 32B at the same VRAM tier wins on coding-specific evals).
How it compares
- vs Qwen 3 30B-A3B (MoE, same tier) → Qwen 3 30B-A3B wins decisively on speed (~2-3× faster decode). Gemma 4 wins on dense-architecture predictability + Google-ecosystem fit. Pick MoE for speed; pick dense for production reliability. Same hardware footprint, different operator priorities.
- vs Qwen 3 32B (dense Qwen) → similar dense architecture, similar VRAM footprint. Qwen 3 32B has better multilingual + slight edge on most reasoning benchmarks. Gemma 4 has better tool-calling + Google-ecosystem fit. Coin flip with edge to Qwen if multilingual matters.
- vs Qwen 2.5 32B Instruct (prior-gen Qwen dense) → Q2.5 32B is older but has more deployment weight. Gemma 4 is newer with marginal quality improvements. New deployments should probably pick Gemma 4 OR Qwen 3 32B; existing Q2.5 32B deployments don't need to upgrade.
- vs Llama 3.3 70B Instruct → Llama 3.3 70B at Q4 needs ~40 GB; doesn't fit single 24-GB. If you have 32GB+ tier hardware (5090, dual 3090, Mac M-Max 64+), Llama 3.3 70B leads Gemma 4 31B on most quality benchmarks (size advantage, expected). Gemma wins on accessibility — fits a card half the size at half the price.
- vs DeepSeek R1 Distill Qwen 32B → R1-Distill specializes in reasoning chain-of-thought. Gemma 4 31B is generalist. Pick R1 Distill for math + logic puzzles; pick Gemma 4 for general daily-driver work.
- vs Phi 4 14B → Phi 4 fits 16-GB cards; Gemma 4 needs 24 GB. Phi 4 wins on smaller-card accessibility; Gemma 4 wins on the workloads where 31B-class capability matters.
Run this yourself
# RTX 4090 / 3090 / 7900 XTX — single-card 24 GB
ollama pull gemma4:31b-instruct-q4_K_M
ollama run gemma4:31b-instruct-q4_K_M
# Or via llama.cpp directly:
llama-server -m gemma-4-31b-instruct-Q4_K_M.gguf \
--ctx-size 16384 -ngl 999 --temp 0.7
# For multi-user serving via vLLM:
vllm serve google/gemma-4-31b-instruct \
--tensor-parallel-size 1 --max-model-len 16384
Quant: Q4_K_M GGUF
Context: 16384 (KV cache f16, ~2 GB additional)
Backend: llama.cpp via Ollama, CUDA 12.x
Hardware: RTX 4090, NVIDIA driver 555+
Overview
Google's flagship dense Gemma 4. Beats some 400B-class proprietary models on benchmarks. Targets the 24GB single-GPU sweet spot.
Execution notes
Operator notes
Gemma 4 31B is Google's workstation-tier dense flagship for May 2026. It's the operator default when you want strong multilingual coverage on a single 24-32 GB card with the Gemma License (commercial-OK with the usual safety-policy attached).
What makes it the workstation-tier dense pick:
- Strong multilingual coverage — outperforms similar-size Llama / Qwen siblings on non-English benchmarks.
- Q4_K_M fits 24 GB VRAM with 8K context; comfortable on single RTX 4090 / 5090 / 6000 Ada.
- 131K context — long enough for most agent harness use cases.
- Gemma License — commercial OK; verify the safety-policy clause for your deployment.
Deployment notes
For workstation-tier deployments running multilingual workloads, Gemma 4 31B is the dense default. Pair with:
- vLLM + AWQ-INT4 for production serving. Single-card deployment shape.
- Ollama + Q4_K_M for solo-developer setups.
- llama.cpp + Q4_K_M for the Apple Silicon path via MLX adapter.
For more efficient throughput at a slight quality cost, drop to Gemma 4 26B MoE — same workstation tier, MoE active params keep tok/s competitive.
For consumer-tier (8-16 GB) deployments, drop to Gemma 3 12B — the Gemma 3 family is still the operator default at the consumer tier as of May 2026.
For the L1.25-enriched workstation Gemma alternative (Gemma 3 lineage), see Gemma 3 27B — the predecessor with more operator history.
Runtime compatibility
- vLLM ✓ excellent. AWQ-INT4 supported.
- SGLang ✓ excellent. Strong on agent-harness workloads with stable system prompts.
- Ollama ✓ excellent. Q4_K_M GGUF available at release.
- llama.cpp ✓ excellent. Native GGUF.
- MLX-LM ✓ excellent. Apple Silicon unified memory holds 8K context comfortably at MLX-4bit on M3 Max / Ultra.
- TensorRT-LLM ✓ supported but rarely justified at workstation scale.
Quantization suitability
Q4_K_M is the operational default. AWQ-INT4 for vLLM deployments. Q5_K_M is feasible on 32 GB cards (RTX 6000 Ada, A6000) for slightly tighter quality.
Avoid Q3 — Gemma 4 regresses more on multilingual quality at low bit-widths than Gemma 3 did. The non-English token distributions are more sensitive.
Best use cases
- Workstation-tier multilingual chat — German / French / Spanish / Japanese / Korean depth on a single 24 GB card.
- General-purpose chat with permissive license — Gemma License is commercial-friendly with the safety-policy caveat.
- RAG over multilingual corpora — pair with BGE M3 embeddings and a multilingual reranker.
When to use a different model
- Code-first workloads: Qwen 2.5 Coder 32B is sharper on coding at the same workstation tier.
- Reasoning-first workloads: Qwen 3 32B (L1.25-enriched) — thinking-mode toggle gives per-call reasoning depth control.
- Apache 2.0 hard requirement: Qwen 2.5 32B at the same workstation tier.
- MoE efficiency variant: Gemma 4 26B MoE.
- Smaller hardware: Gemma 3 12B — consumer-tier sibling.
Failure modes specific to this model
- Safety-policy clause. Gemma License's safety policy is more restrictive than Apache 2.0 / MIT; verify your deployment fits before committing.
- Refusal patterns. Gemma 4 retains the family's relatively conservative refusal posture — particularly on creative-writing edge cases. Set system-prompt overrides cautiously.
- Quant-sensitivity on multilingual. Q3-class quants regress non-English quality more than English. Stick to Q4_K_M minimum.
Going deeper
- /stacks/local-coding-agent — agent-loop deployment recipe (Coder model is the canonical pick; this is the multilingual-chat alternative)
- Gemma 3 27B — L1.25-enriched predecessor with more operator history
- vLLM operational review — production-recommended runtime
- Ollama operational review — solo-developer alternative
Family & lineage
How this model relates to others in its lineage. Family members share architecture and training-data roots; parent / children edges record direct distillation or fine-tune relationships.
Strengths
- Top dense ~30B model
- Strong multilingual
- 128K context
Weaknesses
- Gemma license has use restrictions
Quantization variants
Each quantization trades model quality for file size and VRAM. Q4_K_M is the most popular starting point.
| Quantization | File size | VRAM required |
|---|---|---|
| Q4_K_M | 18.0 GB | 24 GB |
| Q8_0 | 33.0 GB | 38 GB |
Get the model
Ollama
One-line install
ollama run gemma4:31bRead our Ollama review →HuggingFace
Original weights
Source repository — direct quantization required.
Hardware that runs this
Cards with enough VRAM for at least one quantization of Gemma 4 31B Dense.
Models worth comparing
Same parameter band, plus what's one tier above and below — so you can decide what actually fits your hardware.
Frequently asked
What's the minimum VRAM to run Gemma 4 31B Dense?
Can I use Gemma 4 31B Dense commercially?
What's the context length of Gemma 4 31B Dense?
How do I install Gemma 4 31B Dense with Ollama?
Does Gemma 4 31B Dense support images?
Source: huggingface.co/google/gemma-4-31b-it
Reviewed by RunLocalAI Editorial. See our editorial policy for how we research and verify model claims.
Related — keep moving
Verify Gemma 4 31B Dense runs on your specific hardware before committing money.