Nemotron 3 Nano (30B-A3B)
NVIDIA's hybrid Mamba-2 + Transformer MoE for on-device agents. 30B total / 3B active. 1M-token context window with reasoning ON/OFF modes and 4× faster inference than the previous Nemotron Nano.
Positioning
Nemotron 3 Nano is NVIDIA's late-2025 / 2026 reasoning-MoE in the same 30B-A3B parameter footprint as Qwen 3 30B-A3B — 30B total params, ~3B active per token. Where Qwen 3 30B-A3B is a generalist MoE, Nemotron 3 Nano is reasoning-tuned + tool-use-tuned, distilled from larger Nemotron internal training. Together they're the two strongest 24-GB-card-friendly MoE picks in 2026 — and they're not interchangeable. The operator-grade question is "do you prioritize reasoning + tool calling (Nemotron) or generalist + multilingual (Qwen)?"
Strengths
- Fits a single 24-GB consumer GPU at Q4 (~17-19 GB), comfortable headroom for context. Same hardware footprint as Qwen 3 30B-A3B — runs natively on RTX 3090 / 4090 / 7900 XTX.
- Reasoning-first training. Nemotron's RLHF + chain-of-thought distillation produces noticeably stronger behavior on multi-step problems vs general-purpose MoE alternatives. Per NVIDIA's published benchmarks, Nemotron 3 Nano lands ahead of Qwen 3 30B-A3B on GSM8K-Hard / MATH / GPQA-Diamond — the size of the gap is NVIDIA's claim, not an independent reproduction.
- Strong tool-use behavior. Trained explicitly for function-calling format adherence — the agent loop (Aider + Continue + custom function-calling) lands more reliably than with most 30B-class alternatives.
- NVIDIA Open Model License — permissive, commercial-friendly, with mild attribution requirements. Verify the license for your specific use case.
- Excellent vLLM + TensorRT-LLM optimization. NVIDIA-trained models get day-zero tensor-parallel + FP8 paths in vLLM and TensorRT-LLM. For production serving, this matters.
Limitations
- Multilingual performance trails Qwen. Nemotron is English-strong but doesn't match Qwen's 60+ language coverage or Chinese-specific quality.
- Coding-specific evals trail Qwen 2.5 Coder 32B — Nemotron is reasoning-tuned, not coding-tuned. For pure code work, the dedicated coding model wins by 5-10 pp on HumanEval / SWE-bench.
- MoE expert routing inherits the standard 30B-A3B MoE caveats — occasional "off" responses on edge cases vs dense 32B alternatives. Less of an issue with mature vLLM routing.
- Knowledge cutoff is mid-2025. For current-events / 2026-specific workloads, augment with RAG.
- Less battle-tested than Qwen 3 30B-A3B. Released later; community deployment weight is lower.
Real-world performance on RTX 4090 (24 GB)
- Q4_K_M (~17-18 GB): ~75-105 tok/s decode, TTFT ~80-150 ms on 1K prompts. Comparable MoE efficiency profile to Qwen 3 30B-A3B.
- Q5_K_M (~21 GB): ~60-85 tok/s. Quality bump is meaningful for reasoning tasks; speed loss is modest.
- Q8_0 (~30 GB partial-offload): ~22-35 tok/s. Quality bump over Q4 noticeable; speed loss large.
- vLLM FP8 on Hopper datacenter (rented): ~150-220 tok/s — production-tier.
Should you run this locally?
Yes, for anyone with a 24-GB GPU whose primary workload is reasoning, math, multi-step problem solving, or agent-style tool-calling. Nemotron 3 Nano is the right pick over Qwen 3 30B-A3B when reasoning quality matters more than multilingual breadth.
Yes, for vLLM production serving — Nemotron's NVIDIA-tuned tensor-parallel + FP8 paths give it a serving advantage over Qwen 3 30B-A3B at the same VRAM tier.
No, for anyone running a sub-16-GB card. Use Qwen 3 8B or smaller dense models at smaller-card tiers.
Probably not, for pure coding workflows (Qwen 2.5 Coder 32B wins on coding-specific evals).
Probably not, for non-English-heavy work (Qwen 3 30B-A3B wins on multilingual + Chinese).
How it compares
- vs Qwen 3 30B-A3B (same MoE footprint, generalist) → Nemotron wins on reasoning + tool-calling + NVIDIA-stack optimization. Qwen wins on multilingual + community deployment weight. Same hardware tier, different operator priorities. Pick Nemotron for agentic / reasoning workloads; Qwen for multilingual / general daily-driver.
- vs Qwen 2.5 Coder 32B (dense coding specialist) → Coder 32B wins on coding-specific evals at 24 GB. Nemotron 3 Nano wins on speed (MoE efficiency: 80+ tok/s vs Coder 32B's ~35-50 tok/s on RTX 4090). Pick Coder for serious code work; Nemotron for mixed reasoning + lighter coding.
- vs DeepSeek R1 Distill Qwen 32B → R1 Distill is reasoning-specialist via Qwen distillation. Nemotron is reasoning-specialist via NVIDIA distillation. Similar quality on reasoning benchmarks; tool-use slightly better on Nemotron, raw reasoning depth slightly better on R1 Distill. Coin flip with edge to whichever's better-supported in your stack.
- vs Llama 3.3 70B Instruct → Llama 3.3 70B at Q4 needs 40+ GB; doesn't fit single 24-GB card. If you have 32+GB hardware, Llama 3.3 70B beats Nemotron on most benchmarks — but Nemotron wins by being the larger model that actually fits the consumer-card tier.
- vs Gemma 4 31B → Gemma is dense (slower decode), Nemotron is MoE (faster decode). Gemma has Google ecosystem fit; Nemotron has NVIDIA ecosystem fit. For NVIDIA-heavy stacks (vLLM TP, TensorRT-LLM serving), Nemotron is the natural pick.
Run this yourself
# RTX 4090 / 3090 / 7900 XTX — single-card 24 GB
ollama pull nemotron-3-nano:30b-a3b-q4_K_M
ollama run nemotron-3-nano:30b-a3b-q4_K_M
# vLLM production-tier (preferred for Nemotron):
vllm serve nvidia/Nemotron-3-Nano-30B-A3B-Instruct \
--tensor-parallel-size 1 --max-model-len 32768
# llama.cpp direct:
llama-server -m nemotron-3-nano-30b-a3b-Q4_K_M.gguf \
--ctx-size 32768 -ngl 999 --temp 0.7
Quant: Q4_K_M GGUF
Context: 32768 (KV cache f16, ~2 GB additional)
Backend: vLLM preferred (NVIDIA-tuned), Ollama / llama.cpp acceptable
Hardware: RTX 4090, NVIDIA driver 555+, CUDA 12.4+
Overview
NVIDIA's hybrid Mamba-2 + Transformer MoE for on-device agents. 30B total / 3B active. 1M-token context window with reasoning ON/OFF modes and 4× faster inference than the previous Nemotron Nano.
Family & lineage
How this model relates to others in its lineage. Family members share architecture and training-data roots; parent / children edges record direct distillation or fine-tune relationships.
Strengths
- 1M-token context
- Reasoning toggle
- Hybrid Mamba architecture
Weaknesses
- Newer architecture — runner support varies
Quantization variants
Each quantization trades model quality for file size and VRAM. Q4_K_M is the most popular starting point.
| Quantization | File size | VRAM required |
|---|---|---|
| Q4_K_M | 18.0 GB | 22 GB |
| Q8_0 | 32.0 GB | 36 GB |
Get the model
Ollama
One-line install
ollama run nemotron3:nanoRead our Ollama review →HuggingFace
Original weights
Source repository — direct quantization required.
Hardware that runs this
Cards with enough VRAM for at least one quantization of Nemotron 3 Nano (30B-A3B).
Models worth comparing
Same parameter band, plus what's one tier above and below — so you can decide what actually fits your hardware.
Frequently asked
What's the minimum VRAM to run Nemotron 3 Nano (30B-A3B)?
Can I use Nemotron 3 Nano (30B-A3B) commercially?
What's the context length of Nemotron 3 Nano (30B-A3B)?
How do I install Nemotron 3 Nano (30B-A3B) with Ollama?
Source: huggingface.co/nvidia/Nemotron-3-Nano
Reviewed by RunLocalAI Editorial. See our editorial policy for how we research and verify model claims.
Related — keep moving
Verify Nemotron 3 Nano (30B-A3B) runs on your specific hardware before committing money.