Nemotron 3 Ultra (550B-A55B)
NVIDIA Nemotron 3 Ultra (550B-A55B) is a frontier-scale open-weight reasoning model from NVIDIA (Hugging Face `nvidia/NVIDIA-Nemotron-3-Ultra-550B-A55B-BF16`, 2026-06) with 550B total and 55B active parameters. Its "LatentMoE" architecture interleaves Mamba-2, MoE, and select attention layers with multi-token prediction, and it was trained on an NVFP4 recipe. It supports up to 1M-token context and 10 languages. The model is text-only and released under the Linux Foundation OpenMDW License v1.1, which permits commercial use. Vendor-reported benchmarks (via NVIDIA's Nemo Evaluator SDK) include SWE-bench Verified 70.7, GPQA (no tools) 87.0, and RULER@1M 94.7; none have been independently verified. The smaller Nemotron 3 Nano and Super tiers are far more practical to run on local hardware.
Positioning
Nemotron 3 Ultra (550B-A55B) is NVIDIA's frontier-scale open-weight reasoning model, released in June 2026. It is notably the most capable open model from a US lab, and it ships fully open (weights and training recipes) under the Linux Foundation's OpenMDW-1.1 license, which permits commercial use.
What stands out
The architecture is the interesting part: a "LatentMoE" hybrid that interleaves Mamba-2, MoE, and select attention layers with multi-token prediction, trained on an NVFP4 recipe. The NVFP4 checkpoints make a 550B model unusually tractable on Blackwell hardware. The full 1M-token context is supported with NVFP4 on Blackwell; in BF16 the limit is 262K. A fully open US-lab model with permissive licensing is rare, and valuable for teams that want provenance.
Honest caveats
All benchmarks are NVIDIA-reported (via their Nemo Evaluator SDK) and not independently verified. Per Artificial Analysis it still trails the leading Chinese open models on the intelligence index. At 550B it is datacenter-class; the smaller Nemotron 3 Super (120B) and Nemotron 3 Nano (30B) tiers are the locally-practical choices for most readers.
Verdict
Run it if you need a fully-open, commercially-licensed frontier reasoner with documented training recipes and you have Blackwell / NVFP4 infrastructure. Most local users should start with Nemotron 3 Super or Nano instead — same family, same openness, hardware you can actually buy.
Quantization variants
Each quantization level trades some model quality for a smaller file and lower VRAM use. Q4_K_M is the most popular starting point.
| Quantization | File size | VRAM required |
|---|---|---|
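Since no per-quantization sizes are listed here, a rough estimate can stand in: weight storage is approximately the total parameter count times the bits per weight, divided by 8. The sketch below applies that to the three tiers named in this article. The bits-per-weight values for NVFP4 and Q4_K_M are assumptions chosen for illustration, not measured sizes of published files, and the names `weight_gb` and `footprint_table` are hypothetical helpers. All 550B parameters must be resident in memory even though only 55B are active per token, so the total count is the one that matters.

```python
# Rough weight-only memory estimate: parameters x bits per weight / 8.
# Bits-per-weight values are ASSUMPTIONS for illustration, not measured
# sizes of any published checkpoint.
BITS_PER_WEIGHT = {
    "BF16": 16.0,
    "NVFP4": 4.5,    # assumed: 4-bit values plus one 8-bit scale per 16-value block
    "Q4_K_M": 4.85,  # assumed: approximate average for this GGUF quantization type
}

# Total parameter counts in billions, as stated in this article.
TIERS = {"Ultra": 550, "Super": 120, "Nano": 30}


def weight_gb(params_billion: float, bits_per_weight: float) -> float:
    """Approximate weight storage in decimal gigabytes (1 GB = 1e9 bytes)."""
    return params_billion * bits_per_weight / 8


def footprint_table() -> dict:
    """Map each tier to its estimated weight size per format, rounded to 0.1 GB."""
    return {
        tier: {
            fmt: round(weight_gb(params, bits), 1)
            for fmt, bits in BITS_PER_WEIGHT.items()
        }
        for tier, params in TIERS.items()
    }


if __name__ == "__main__":
    for tier, sizes in footprint_table().items():
        row = ", ".join(f"{fmt}: {gb} GB" for fmt, gb in sizes.items())
        print(f"{tier:<6} {row}")
```

Under these assumptions the BF16 weights for Ultra alone come to about 1,100 GB. These figures cover weights only; context cache and activation memory come on top, so treat them as a lower bound when sizing hardware.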
Get the model
HuggingFace
Original weights
Source repository with the original weights only; you need to quantize them yourself.
Frequently asked
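Can I use Nemotron 3 Ultra (550B-A55B) commercially?
Yes. It is released under the Linux Foundation OpenMDW License v1.1, which permits commercial use.
What's the context length of Nemotron 3 Ultra (550B-A55B)?
Up to 1M tokens with NVFP4 on Blackwell, and 262K in BF16.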
Source: huggingface.co/nvidia/NVIDIA-Nemotron-3-Ultra-550B-A55B-BF16
Reviewed by RunLocalAI Editorial. See our editorial policy for how we research and verify model claims.