120B parameters

Commercial OK

Reviewed June 2026

Nemotron 3 Super (120B-A12B)

Server-tier Nemotron 3. 120B total / 12B active parameters. 5× higher throughput than the prior Super, 1M context, designed for multi-agent applications.

License: NVIDIA Open Model License · Released Feb 15, 2026 · Context: 1,000,000 tokens

Our verdict

Fredoline Eruo · Verified Jun 12, 2026

unrated

Positioning

NVIDIA's Nemotron 3 Super (120B-A12B) is a Mixture-of-Experts (MoE) model with 120B total parameters and 12B active per token. Released under the NVIDIA Open Model License, it targets datacenter-tier reasoning and multi-agent applications. Its 1M-token context window and architecture designed for high throughput distinguish it in the open-weight landscape, though it requires substantial hardware.

Strengths

Massive context window: 1M tokens enables processing of entire codebases, long documents, or multi-turn agent conversations without truncation.
MoE efficiency: With only 12B active parameters per token, inference cost is closer to a dense 12B-30B model than a dense 120B model, reducing per-token compute.
NVIDIA ecosystem tuning: Designed for NVIDIA hardware and multi-agent workflows, likely benefiting from vendor-optimized kernels and libraries.
Permissive commercial license: The NVIDIA Open Model License allows commercial use, making it suitable for enterprise deployment.

Limitations

Extreme hardware requirements: Even at Q4_K_M (67.5 GB), the model requires multiple high-end GPUs; FP16 (240 GB) demands datacenter-class multi-GPU setups.
No community benchmarks available: We lack independent measurements for this model. Operators should treat vendor-published metrics as best-case until third-party validation emerges.
KV cache overhead: At 1M context, KV cache can add 30-50% or more to memory requirements, potentially exceeding 100 GB even with quantization.
Narrow deployment class: Not suitable for consumer or workstation hardware; requires datacenter infrastructure (e.g., multi-A100/H100 nodes).
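The KV cache concern above can be sanity-checked with the usual attention-cache formula. This is a minimal sketch, assuming a conventional transformer in which every layer stores full keys and values; the layer count, KV-head count, and head dimension are hypothetical placeholders rather than published Nemotron 3 Super values, so substitute the numbers from the model's own config.

```python
def kv_cache_gb(n_layers, n_kv_heads, head_dim, context_tokens, bytes_per_value=2):
    """Estimate KV cache size in GB (10^9 bytes) for a single sequence."""
    # Factor of 2: each layer stores one key tensor and one value tensor.
    total_bytes = (
        2 * n_layers * n_kv_heads * head_dim * context_tokens * bytes_per_value
    )
    return total_bytes / 1e9


# Hypothetical architecture values, chosen only for illustration.
estimate = kv_cache_gb(n_layers=48, n_kv_heads=8, head_dim=128, context_tokens=1_000_000)
print(f"Estimated KV cache at 1M tokens (16-bit values): {estimate:.1f} GB")
```

The cache grows linearly with context length, so setting bytes_per_value to 1 (an 8-bit cache) or cutting the context in half each halves the estimate.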

What it takes to run this locally

Quantized sizes (disk): FP16 ~240 GB, Q8_0 ~128 GB, Q6_K ~99 GB, Q5_K_M ~85.5 GB, Q4_K_M ~67.5 GB, Q3_K_M ~58.5 GB, Q2_K ~39 GB. Add ~30-50% for KV cache and framework overhead at typical context lengths. At Q4_K_M that comes to roughly 88-101 GB, more than any single 80 GB GPU provides, and FP16 lands at roughly 312-360 GB. This model is firmly in the datacenter deployment class: expect multiple high-memory GPUs (e.g., an 8× A100 80GB node for FP16), with the GPU count depending on the quantization you choose.
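The overhead rule above is simple arithmetic, sketched here for every listed quantization. The file sizes are the ones quoted on this page; the function name and the 30-50% bounds as default arguments are illustrative choices, not part of any tool.

```python
# On-disk sizes in GB, as listed on this page.
FILE_SIZES_GB = {
    "FP16": 240.0,
    "Q8_0": 128.0,
    "Q6_K": 99.0,
    "Q5_K_M": 85.5,
    "Q4_K_M": 67.5,
    "Q3_K_M": 58.5,
    "Q2_K": 39.0,
}


def memory_budget_gb(file_size_gb, overhead_low=0.30, overhead_high=0.50):
    """Return (low, high) total memory estimates after adding runtime overhead."""
    return (
        file_size_gb * (1 + overhead_low),
        file_size_gb * (1 + overhead_high),
    )


for name, size in FILE_SIZES_GB.items():
    low, high = memory_budget_gb(size)
    print(f"{name:7s} {size:6.1f} GB on disk -> {low:6.1f}-{high:6.1f} GB total")
```

Treat the result as a planning range only: at very long contexts the KV cache can push well past the 50% bound.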

Should you run this locally?

Yes if you have access to multi-GPU datacenter hardware, need a 1M context window for long-context or multi-agent applications, and require a permissive commercial license. No if you lack the infrastructure, need single-GPU inference, or prefer models with extensive community benchmarks and tooling.

Catalog cross-links

NVIDIA Nemotron-4 340B
Mixtral 8x22B
DeepSeek-V2

How to run it

Nemotron-3-Super is NVIDIA's 51B dense model in the Nemotron family. Run it at Q4_K_M via Ollama (ollama pull nemotron:3-super) or llama.cpp with -ngl 999 -fa -c 8192. The Q4_K_M file is 29 GB on disk.

Minimum VRAM: 32 GB, i.e. an RTX 5090 (32GB) at Q4_K_M with 4K context. RTX 4090 24GB: Q3_K_M (22 GB) or Q4_K_M with KV cache offload. Recommended: RTX 4090 24GB at Q4_K_M with Q8 KV cache offloaded to RAM (works for 8K context).

Throughput: ~20-35 tok/s on RTX 4090 at Q4_K_M; ~30-45 tok/s on RTX 5090. Standard Llama/Nemotron architecture, so ecosystem support is broad. For serving: vLLM on a single A10 24GB at AWQ-INT4. Context: 32K max; practical at Q4 on 24 GB is 8-16K.

51B dense is the sweet spot: strong quality with consumer GPU accessibility. Nemotron models are NVIDIA's instruction-tuned suite with a focus on coding, math, and agent tasks.

Hardware guidance

Minimum: RTX 3090 24GB at Q3_K_M (4K context). Recommended: RTX 4090 24GB at Q4_K_M with KV offload (8-16K context). Optimal: RTX 5090 32GB at Q4_K_M (16-32K context, KV cache offloaded beyond 4K).

VRAM math: 51B dense at Q4_K_M is ~0.57 bytes/param → ~29 GB. KV cache at 8K: ~8-12 GB. Total: ~37-41 GB at 8K. RTX 5090 32GB: must offload KV cache to RAM for >4K context. llama.cpp keeps the KV cache on the GPU by default, which is fastest (fits batch=1 at 2K). For >4K, pass --no-kv-offload to hold the KV cache in system RAM: this adds latency but enables longer context.

MacBook Pro M4 Max 36GB+: Q4_K_M at 6-10 tok/s. RTX 3060 12GB: Q2_K only, quality degraded. Cloud: single A10 24GB at AWQ or RTX 4090 at Q4_K_M.
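The VRAM math above reduces to adding weights and KV cache and comparing the sum against the card. A minimal sketch using this section's own figures (29 GB of weights, 8-12 GB of KV cache at 8K); the helper name is hypothetical and the calculation ignores framework overhead.

```python
def offload_needed_gb(weights_gb, kv_cache_gb, vram_gb):
    """Return how many GB do not fit in VRAM (0.0 if everything fits)."""
    return max(0.0, weights_gb + kv_cache_gb - vram_gb)


WEIGHTS_GB = 29.0         # Q4_K_M file size quoted in this section
KV_8K_GB = (8.0, 12.0)    # KV cache range at 8K context quoted in this section

for card, vram_gb in [("RTX 4090", 24.0), ("RTX 5090", 32.0)]:
    low = offload_needed_gb(WEIGHTS_GB, KV_8K_GB[0], vram_gb)
    high = offload_needed_gb(WEIGHTS_GB, KV_8K_GB[1], vram_gb)
    print(f"{card} ({vram_gb:.0f} GB): {low:.0f}-{high:.0f} GB must live in system RAM at 8K")
```

A non-zero result means part of the model or its cache has to be served from system RAM, which is where the latency penalty comes from.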

What breaks first

1. KV cache offload latency. Offloading KV cache to RAM on 24 GB cards adds 10-30% latency overhead. Generation becomes RAM-bandwidth-bound for the KV component. Keep context under 4K to keep KV on GPU.
2. Q3_K_M quality on code/math. Nemotron-3 is tuned for reasoning. At Q3_K_M, code generation and math reasoning degrade more than general chat, because the reasoning-specialized weights are more sensitive to quantization.
3. Chat template mismatch. Nemotron uses a custom chat template different from standard Llama 3 templates. Using the wrong template produces garbled or repetitive output. Verify in tokenizer_config.json.
4. FP16 inference precision expectations. NVIDIA tuned Nemotron-3 at BF16, so Q4_K_M may show different behavior on edge cases. Test your specific prompts.
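For the chat template check, Hugging Face model repositories commonly store the template under a chat_template key in tokenizer_config.json. The sketch below reads that key from a local download; the directory path is a hypothetical placeholder, and the helper returns None when the file or key is absent rather than guessing a format.

```python
import json
from pathlib import Path


def read_chat_template(model_dir):
    """Return the chat_template entry from tokenizer_config.json, or None."""
    config_path = Path(model_dir) / "tokenizer_config.json"
    if not config_path.is_file():
        return None
    with config_path.open(encoding="utf-8") as f:
        config = json.load(f)
    return config.get("chat_template")


# Hypothetical local download path; point this at your own copy of the weights.
template = read_chat_template("./Nemotron-3-Super")
if template is None:
    print("No chat_template found; check the model card for the prompt format.")
else:
    print(str(template)[:300])
```

Comparing the printed template against the prompt your runtime actually sends is usually enough to spot a Llama 3 template being applied by mistake.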

Runtime recommendation

Ollama for quick-start — Nemotron-3 is in Ollama's catalog. llama.cpp for fine-grained control (KV offload, context tuning). vLLM for serving. Nemotron uses standard Llama architecture — all major runtimes support it. NVIDIA's own TensorRT-LLM is the optimal path on NVIDIA GPUs but requires more setup.

Common beginner mistakes

Mistake: Pulling Ollama's default tag assuming Q4_K_M.
Fix: Ollama defaults vary. Run ollama show nemotron:3-super to verify quantization. Q8_0 requires 58 GB, which is OOM on 24 GB GPUs.

Mistake: Using Llama 3 chat template with Nemotron.
Fix: Nemotron uses a custom template. Check the model card on Hugging Face for the correct format or use Ollama's built-in template.

Mistake: Running at 32K context on 24 GB GPU.
Fix: KV cache at 32K is 30-40 GB plus 29 GB weights = 59-69 GB total. OOM. Start at -c 4096.

Mistake: Disabling flash attention.
Fix: Flash attention saves 20-30% VRAM on KV cache. Always enable with -fa in llama.cpp.

Family & lineage

How this model relates to others in its lineage. Family members share architecture and training-data roots; parent / children edges record direct distillation or fine-tune relationships.

Parent / base model

Nemotron 3 Nano (30B-A3B) · 30B · Consumer

Family siblings (nemotron-3)

Nemotron 3 Nano 9B · 9B · Consumer
Nemotron 3 Nano (30B-A3B) · 30B · Consumer
Nemotron 3 Super 49B · 49B · Workstation
Nemotron 3 Super (120B-A12B) · 120B · You are here

Strengths

5× throughput vs prior gen
1M context
Multi-agent design

Weaknesses

Server / multi-GPU only

Quantization variants

Each quantization trades model quality for file size and VRAM. Q4_K_M is the most popular starting point.

Quantization	File size	VRAM required
Q4_K_M	72.0 GB	84 GB

Get the model

Ollama

One-line install

ollama run nemotron3:super

HuggingFace

Original weights

huggingface.co/nvidia/Nemotron-3-Super

Source repository — direct quantization required.

Hardware that runs this

Cards with enough VRAM for at least one quantization of Nemotron 3 Super (120B-A12B).

NVIDIA B300 (Blackwell Ultra)

Frequently asked

What's the minimum VRAM to run Nemotron 3 Super (120B-A12B)?

84 GB of VRAM is enough to run Nemotron 3 Super (120B-A12B) at the Q4_K_M quantization (file size 72.0 GB). Higher-quality quantizations need more.

Can I use Nemotron 3 Super (120B-A12B) commercially?

Yes — Nemotron 3 Super (120B-A12B) ships under the NVIDIA Open Model License, which permits commercial use. Always read the license text before deployment.

What's the context length of Nemotron 3 Super (120B-A12B)?

Nemotron 3 Super (120B-A12B) supports a context window of 1,000,000 tokens (1M).

How do I install Nemotron 3 Super (120B-A12B) with Ollama?

Run `ollama pull nemotron3:super` to download, then `ollama run nemotron3:super` to start a chat session. The default quantization is listed as Q4_K_M; confirm it with `ollama show nemotron3:super`, since Ollama defaults vary.

Source: huggingface.co/nvidia/Nemotron-3-Super

Reviewed by RunLocalAI Editorial. See our editorial policy for how we research and verify model claims.
