Nemotron 3 Super (120B-A12B)
Workstation-tier Nemotron 3. 120B total / 12B active. 5× higher throughput than the prior Super, 1M context, designed for multi-agent applications.
Positioning
NVIDIA's Nemotron 3 Super (120B-A12B) is a Mixture-of-Experts (MoE) model with 120B total parameters and 12B active per token. Released under the NVIDIA Open Model License, it targets datacenter-tier reasoning and multi-agent applications. Its 1M-token context window and architecture designed for high throughput distinguish it in the open-weight landscape, though it requires substantial hardware.
Strengths
- Massive context window: 1M tokens enables processing of entire codebases, long documents, or multi-turn agent conversations without truncation.
- MoE efficiency: With only 12B active parameters per token, inference cost is closer to a dense 12B-30B model than a dense 120B model, reducing per-token compute.
- NVIDIA ecosystem tuning: Designed for NVIDIA hardware and multi-agent workflows, likely benefiting from vendor-optimized kernels and libraries.
- Permissive commercial license: The NVIDIA Open Model License allows commercial use, making it suitable for enterprise deployment.
Limitations
- Extreme hardware requirements: Even at Q4_K_M (67.5 GB), the model requires multiple high-end GPUs; FP16 (240 GB) demands datacenter-class multi-GPU setups.
- No community benchmarks available: We lack independent measurements for this model. Operators should treat vendor-published metrics as best-case until third-party validation emerges.
- KV cache overhead: At 1M context, KV cache can add 30-50% or more to memory requirements, potentially exceeding 100 GB even with quantization.
- Narrow deployment class: Not suitable for consumer or workstation hardware; requires datacenter infrastructure (e.g., multi-A100/H100 nodes).
What it takes to run this locally
Quantized sizes (disk): FP16 ~240 GB, Q8_0 ~128 GB, Q6_K ~99 GB, Q5_K_M ~85.5 GB, Q4_K_M ~67.5 GB, Q3_K_M ~58.5 GB, Q2_K ~39 GB. Add ~30-50% for KV cache and framework overhead at typical context lengths. This model is firmly in the datacenter deployment class — expect multiple high-memory GPUs (e.g., 8× A100 80GB) even with aggressive quantization.
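The size list above can be turned into a rough memory budget. The sketch below is illustrative only: it takes the listed file sizes and the 30-50% overhead rule at face value, treats 1 GB as 1e9 bytes, and assumes 80 GB GPUs. The function names are hypothetical, not part of any runtime.

```python
import math

# Figures copied from the size list above (GB on disk); 120B total parameters.
TOTAL_PARAMS_BILLIONS = 120
QUANT_SIZES_GB = {
    "FP16": 240.0,
    "Q8_0": 128.0,
    "Q6_K": 99.0,
    "Q5_K_M": 85.5,
    "Q4_K_M": 67.5,
    "Q3_K_M": 58.5,
    "Q2_K": 39.0,
}

# The text's rule of thumb: add 30-50% for KV cache and framework overhead.
OVERHEAD_RANGE = (0.30, 0.50)


def bits_per_weight(size_gb: float) -> float:
    """Average bits stored per parameter, treating 1 GB as 1e9 bytes."""
    return size_gb * 8 / TOTAL_PARAMS_BILLIONS


def memory_budget_gb(size_gb: float) -> tuple[float, float]:
    """Low and high memory estimates after applying the overhead range."""
    low, high = OVERHEAD_RANGE
    return size_gb * (1 + low), size_gb * (1 + high)


def min_gpus(size_gb: float, gpu_memory_gb: float = 80.0) -> int:
    """Lower bound on GPU count: pooled memory must cover the high estimate."""
    _, high = memory_budget_gb(size_gb)
    return math.ceil(high / gpu_memory_gb)


if __name__ == "__main__":
    for name, size in QUANT_SIZES_GB.items():
        low, high = memory_budget_gb(size)
        print(f"{name:7s} {bits_per_weight(size):5.2f} bits/weight  "
              f"{low:6.1f}-{high:6.1f} GB  >= {min_gpus(size)} x 80 GB GPU")
```

The GPU count is a floor, not a recommendation. It ignores tensor-parallel layout constraints and the much larger KV cache needed near the 1M-token limit, which is why real deployments tend to need more cards than this arithmetic suggests.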
Should you run this locally?
Yes if you have access to multi-GPU datacenter hardware, need a 1M context window for long-context or multi-agent applications, and require a permissive commercial license. No if you lack the infrastructure, need single-GPU inference, or prefer models with extensive community benchmarks and tooling.
Catalog cross-links
- NVIDIA Nemotron-4 340B
- Mixtral 8x22B
- DeepSeek-V2
How to run it
Nemotron-3-Super is NVIDIA's 51B dense model in the Nemotron family. Nemotron models are NVIDIA's instruction-tuned suite with focus on coding, math, and agent tasks. 51B dense is the sweet spot: strong quality with consumer GPU accessibility.
- Quick start: run at Q4_K_M via Ollama (ollama pull nemotron:3-super) or llama.cpp with -ngl 999 -fa -c 8192. The Q4_K_M file is 29 GB on disk.
- Minimum VRAM: 32 GB. An RTX 5090 (32GB) runs Q4_K_M with 4K context; an RTX 4090 24GB needs Q3_K_M (22 GB) or Q4_K_M with KV cache offload.
- Recommended: RTX 4090 24GB at Q4_K_M with Q8 KV cache offloaded to RAM (works for 8K context).
- Throughput: ~20-35 tok/s on RTX 4090 at Q4_K_M; ~30-45 tok/s on RTX 5090.
- Ecosystem: standard Llama/Nemotron architecture, so support is broad.
- Serving: vLLM on a single A10 24GB at AWQ-INT4.
- Context: 32K max; practical at Q4 on 24 GB is 8-16K.
Hardware guidance
Minimum: RTX 3090 24GB at Q3_K_M (4K context). Recommended: RTX 4090 24GB at Q4_K_M with KV offload (8-16K context). Optimal: RTX 5090 32GB at Q4_K_M (about 4K context with KV on the GPU; 16-32K requires KV offload). VRAM math: 51B dense, Q4_K_M ~0.57 bytes/param → ~29 GB. KV cache at 8K: ~8-12 GB. Total: ~37-41 GB at 8K. RTX 5090 32GB: must offload KV cache to RAM for >4K context. llama.cpp keeps the KV cache on the GPU by default, which is fastest (fits batch=1 at 2K). For >4K, pass --no-kv-offload to hold the KV cache in system RAM: this adds latency but enables longer context. MacBook Pro M4 Max 36GB+: Q4_K_M at 6-10 tok/s. RTX 3060 12GB: Q2_K only, quality degraded. Cloud: single A10 24GB at AWQ or RTX 4090 at Q4_K_M.
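The VRAM arithmetic in this section can be checked with a few lines of Python. This is a minimal sketch that uses only the figures quoted above (29 GB of Q4_K_M weights, 8-12 GB of KV cache at 8K); the helper names are hypothetical, and real usage also depends on runtime buffers that are not modeled here.

```python
# Figures quoted in the hardware guidance above.
WEIGHTS_Q4_K_M_GB = 29.0
KV_CACHE_AT_8K_GB = (8.0, 12.0)


def total_vram_gb(weights_gb, kv_range_gb):
    """Low and high total: weights plus each end of the KV cache estimate."""
    low, high = kv_range_gb
    return weights_gb + low, weights_gb + high


def kv_headroom_gb(card_vram_gb, weights_gb):
    """VRAM left for KV cache once the weights are resident.

    A negative result means the weights alone exceed the card, so some
    layers would have to stay off the GPU as well.
    """
    return card_vram_gb - weights_gb


def fits_on_gpu(card_vram_gb, weights_gb, kv_range_gb):
    """True only if even the high KV estimate fits next to the weights."""
    _, high = total_vram_gb(weights_gb, kv_range_gb)
    return high <= card_vram_gb


if __name__ == "__main__":
    print("total at 8K:", total_vram_gb(WEIGHTS_Q4_K_M_GB, KV_CACHE_AT_8K_GB))
    for card, vram in [("RTX 4090", 24.0), ("RTX 5090", 32.0)]:
        print(card,
              "headroom:", kv_headroom_gb(vram, WEIGHTS_Q4_K_M_GB),
              "fits at 8K:", fits_on_gpu(vram, WEIGHTS_Q4_K_M_GB, KV_CACHE_AT_8K_GB))
```

With these inputs a 32 GB card has only about 3 GB left for KV cache, which is consistent with the advice to offload the KV cache beyond roughly 4K context.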
What breaks first
- KV cache offload latency. Offloading KV cache to RAM on 24 GB cards adds 10-30% latency overhead. Generation becomes RAM-bandwidth-bound for the KV component. Keep context under 4K to keep KV on GPU.
- Q3_K_M quality on code/math. Nemotron-3 is tuned for reasoning. At Q3_K_M, code generation and math reasoning degrade more than general chat, because the reasoning-specialized weights are more sensitive to quantization.
- Chat template mismatch. Nemotron uses a custom chat template different from standard Llama 3 templates. Using the wrong template produces garbled or repetitive output. Verify in tokenizer_config.json.
- FP16 inference precision expectations. NVIDIA tuned Nemotron-3 at BF16, so Q4_K_M may show different behavior on edge cases. Test your specific prompts.
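For the chat template check, one option is to read tokenizer_config.json directly. The sketch below assumes the Hugging Face convention of storing the template under a chat_template key, either as a single string or as a list of named templates; the function names and the model directory passed on the command line are placeholders.

```python
import json
import sys
from pathlib import Path


def read_chat_template(model_dir):
    """Return the chat_template entry from tokenizer_config.json, or None."""
    config_path = Path(model_dir) / "tokenizer_config.json"
    with config_path.open(encoding="utf-8") as handle:
        config = json.load(handle)
    return config.get("chat_template")


def describe_template(template):
    """Summarize what was found so it can be compared with the model card."""
    if template is None:
        return "missing"
    if isinstance(template, list):
        # Some tokenizers ship several named templates.
        names = [entry.get("name", "?") for entry in template]
        return "named templates: " + ", ".join(names)
    return f"single template, {len(template)} characters"


if __name__ == "__main__" and len(sys.argv) > 1:
    # Usage: python check_template.py /path/to/downloaded/model
    found = read_chat_template(sys.argv[1])
    print(describe_template(found))
    if isinstance(found, str):
        print(found[:400])  # preview the start of the template
```

This only confirms that a template ships with the weights and shows its text. Whether a given runtime actually applies it still has to be checked against that runtime's own settings.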
Runtime recommendation
Ollama for quick-start — Nemotron-3 is in Ollama's catalog. llama.cpp for fine-grained control (KV offload, context tuning). vLLM for serving. Nemotron uses standard Llama architecture — all major runtimes support it. NVIDIA's own TensorRT-LLM is the optimal path on NVIDIA GPUs but requires more setup.
Common beginner mistakes
- Mistake: Pulling Ollama's default tag assuming Q4_K_M. Fix: Ollama defaults vary. Run ollama show nemotron:3-super to verify quantization. Q8_0 requires 58 GB, which is an OOM on 24 GB GPUs.
- Mistake: Using Llama 3 chat template with Nemotron. Fix: Nemotron uses a custom template. Check the model card on Hugging Face for the correct format or use Ollama's built-in template.
- Mistake: Running at 32K context on 24 GB GPU. Fix: KV cache at 32K is 30-40 GB plus 29 GB weights = 59-69 GB total. OOM. Start at -c 4096.
- Mistake: Disabling flash attention. Fix: Flash attention saves 20-30% VRAM on KV cache. Always enable with -fa in llama.cpp.
Family & lineage
How this model relates to others in its lineage. Family members share architecture and training-data roots; parent / children edges record direct distillation or fine-tune relationships.
Strengths
- 5× throughput vs prior gen
- 1M context
- Multi-agent design
Weaknesses
- Server / multi-GPU only
Quantization variants
Each quantization trades model quality for file size and VRAM. Q4_K_M is the most popular starting point.
| Quantization | File size | VRAM required |
|---|---|---|
| Q4_K_M | 72.0 GB | 84 GB |
Get the model
Ollama
One-line install
ollama run nemotron3:super
HuggingFace
Original weights
Source repository — direct quantization required.
Source: huggingface.co/nvidia/Nemotron-3-Super
Reviewed by RunLocalAI Editorial. See our editorial policy for how we research and verify model claims.