49B parameters · Commercial OK · Reviewed May 2026

Nemotron 3 Super 49B

Nemotron 3 mid-tier. 49B dense; fits 32GB cards with AWQ. NVIDIA stack alignment carries through.

License: NVIDIA Open Model License · Released Jan 22, 2026 · Context: 131,072 tokens

How to run it

Nemotron-3-Super 49B is NVIDIA's 49B dense model, the mid-tier Nemotron aimed at single-consumer-GPU deployment. Run it at Q4_K_M via Ollama (ollama pull nemotron:3-super-49b) or llama.cpp with -ngl 999 -fa -c 8192. The Q4_K_M file is ~28 GB on disk. Minimum VRAM for Q4_K_M: 24 GB. On an RTX 4090 or RTX 3090 (24 GB), the 28 GB of weights don't fit entirely on the card, so keep the KV cache in system RAM and leave a few layers on the CPU for 8K context. Recommended: RTX 4090 24GB at Q4_K_M (8K context with KV offload). Throughput: ~25-40 tok/s on an RTX 4090 at Q4_K_M; ~35-55 tok/s on an RTX 5090. Standard Llama/Nemotron architecture means broad compatibility. The 49B size is a sweet spot: quality close to the 70B tier with 24 GB GPU accessibility. Use it for general chat, reasoning, coding, and agent tasks; it suits VRAM-constrained operators who want 70B-class quality on a single consumer GPU. Nemotron models are NVIDIA's instruction-tuned suite with a focus on structured outputs and tool calling. Context: 131K advertised; practical at Q4 on 24 GB is 8-16K. For the earlier 51B variant, see Nemotron-3-Super 51B.
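
A minimal quick-start for both paths (a sketch: the Ollama tag and GGUF filename are this page's examples, so verify them before pulling, and the -ngl value for 24 GB cards is a starting guess, not a measured fit):

  # Ollama (verify the exact tag first; see "What breaks first" below)
  ollama pull nemotron:3-super-49b
  ollama run nemotron:3-super-49b

  # llama.cpp, 32 GB card: all layers on GPU, flash attention, 8K context
  ./llama-cli -m Nemotron-3-Super-49B.Q4_K_M.gguf -ngl 999 -fa -c 8192

  # llama.cpp, 24 GB card: keep the KV cache in system RAM (-nkvo) and lower
  # -ngl until the weights load (the exact layer count is hardware-dependent)
  ./llama-cli -m Nemotron-3-Super-49B.Q4_K_M.gguf -ngl 70 -nkvo -fa -c 8192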

Hardware guidance

Minimum: RTX 3060 12GB at Q3_K_M with most layers offloaded to system RAM. Recommended: RTX 4090 24GB at Q4_K_M (8K context, KV in system RAM). Optimal: RTX 5090 32GB at Q4_K_M (weights fully on-GPU). VRAM math: 49B dense at Q4_K_M ≈ 28 GB of weights; KV cache at 8K ≈ 8 GB; total ≈ 36 GB at 8K. RTX 4090 24GB: the 28 GB of Q4 weights exceed the card, so offload the KV cache (and a few layers) to system RAM for any usable context; generation speed drops 10-20% with KV offload. RTX 3090 24GB: same picture as the 4090. RTX 5090 32GB: Q4 weights fit with ~4 GB left for KV, roughly 4K context on-GPU; offload KV for longer. MacBook Pro M4 Max 36GB+: Q4 at 6-12 tok/s. Cloud: an A10 24GB at Q4_K_M with KV offload works well. AWQ-INT4 drops weights to ~25 GB, easing an 8K fit on 32 GB cards.
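
The same math as a back-of-envelope script (figures taken from this page, not measured):

  WEIGHTS_GB=28   # 49B dense at Q4_K_M
  KV_8K_GB=8      # KV cache at 8K context
  echo "Total at 8K context: $((WEIGHTS_GB + KV_8K_GB)) GB"     # 36 GB
  echo "Free for KV on a 32 GB card: $((32 - WEIGHTS_GB)) GB"   # ~4 GB, i.e. ~4K context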

What breaks first

  1. KV cache offload penalty. On 24 GB GPUs, KV offload to system RAM is mandatory for >2K context, and it adds 10-20% latency overhead. Enabling CUDA async allocation in your runtime can reduce the penalty.
  2. Nemotron chat template. Same issue as all Nemotrons: a custom template that differs from standard Llama. Wrong template = degraded instruction-following. A quick check is sketched below.
  3. Q3 quality on code/math. Same quant-sensitivity pattern as Nemotron-3-Super 51B: reasoning tasks degrade more at Q3 than general chat does. Use Q4_K_M minimum.
  4. Ollama tag naming confusion. Nemotron 49B may be tagged as nemotron:49b, nemotron-super:49b, or similar. Verify the exact tag before pulling.
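
For the chat-template issue (item 2), you can inspect the template straight from the repo before serving (the repo id is taken from this page's source link; chat_template is the standard Hugging Face field name):

  curl -s https://huggingface.co/nvidia/Nemotron-3-Super-49B/raw/main/tokenizer_config.json \
    | python3 -c "import json,sys; print(json.load(sys.stdin).get('chat_template', '<no template field>')[:400])"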

Runtime recommendation

Ollama for quick-start. llama.cpp with explicit KV offload config for 24 GB GPUs. TensorRT-LLM for maximum throughput on NVIDIA GPUs. Standard Llama architecture — any stack works. MLX-LM on Apple Silicon for unified memory efficiency.
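
On Apple Silicon, a minimal MLX-LM route looks roughly like this (a sketch: the mlx-community repo id below is hypothetical, so search Hugging Face for the real conversion; a 4-bit 49B needs roughly 28 GB of unified memory, consistent with the M4 Max guidance above):

  pip install mlx-lm
  # Repo id below is a placeholder -- substitute the actual MLX conversion
  mlx_lm.generate --model mlx-community/Nemotron-3-Super-49B-4bit \
    --prompt "Explain KV cache offload in two sentences." --max-tokens 128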

Common beginner mistakes

  • Mistake: expecting Q4_K_M (28 GB) to fit entirely on a 24 GB GPU with 8K context. Fix: 28 GB weights + 8 GB KV = 36 GB. Keep the KV cache in system RAM (llama.cpp: -nkvo / --no-kv-offload) or accept ~2K context on-GPU.
  • Mistake: confusing the 49B and 51B Nemotron variants. Fix: they're different models in the same family; 51B is the original Super, 49B is a different size point. Check the Hugging Face repo for the specific model.
  • Mistake: disabling flash attention on tight VRAM. Fix: flash attention saves 20-30% of KV cache. Always enable it with -fa.
  • Mistake: using the default Llama chat template. Fix: Nemotron ships a custom template. Verify tokenizer_config.json on Hugging Face and override the served template if needed (sketch below).
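
If the GGUF's embedded template turns out to be wrong, recent llama.cpp builds let you supply one at serve time (a sketch: nemotron.jinja is a hypothetical file saved from the chat_template field checked earlier, and flag availability depends on your llama.cpp version):

  ./llama-server -m Nemotron-3-Super-49B.Q4_K_M.gguf -ngl 999 -fa -c 8192 \
    --jinja --chat-template-file nemotron.jinja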

Family & lineage

How this model relates to others in its lineage. Family members share architecture and training-data roots; parent / children edges record direct distillation or fine-tune relationships.

Parent / base model
  • Nemotron 3 Nano 9B · 9B · Consumer

Family siblings (nemotron-3)
  • Nemotron 3 Nano 9B · 9B · Consumer
  • Nemotron 3 Nano (30B-A3B) · 30B · Consumer
  • Nemotron 3 Super 49B · 49B · you are here
  • Nemotron 3 Super (120B-A12B) · 120B · Datacenter

Strengths

  • 32GB-VRAM workstation deployment
  • NVIDIA tool-call discipline

Weaknesses

  • Less battle-tested than Llama 70B class

Quantization variants

Each quantization trades model quality for file size and VRAM. Q4_K_M is the most popular starting point.

Quantization    File size    VRAM required
AWQ-INT4        28.0 GB      32 GB
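
For the AWQ build under a server runtime, a vLLM launch might look like this (a sketch: the quantized repo id is hypothetical, and --max-model-len is capped because weights plus KV must fit in 32 GB):

  # Repo id is a placeholder -- substitute the real AWQ upload
  vllm serve nvidia/Nemotron-3-Super-49B-AWQ --quantization awq --max-model-len 8192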

Get the model

HuggingFace

Original weights

huggingface.co/nvidia/Nemotron-3-Super-49B

Source repository: only the original weights are published, so you quantize them yourself.
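
A self-quantization pass might look like this (a sketch: assumes a built llama.cpp checkout, that its convert script supports this architecture, and that the repo above has already been downloaded locally):

  # Convert HF weights to GGUF, then quantize to Q4_K_M
  python convert_hf_to_gguf.py ./Nemotron-3-Super-49B --outtype f16 --outfile nemotron-49b-f16.gguf
  ./build/bin/llama-quantize nemotron-49b-f16.gguf nemotron-49b-Q4_K_M.gguf Q4_K_M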

Hardware that runs this

Cards with enough VRAM for at least one quantization of Nemotron 3 Super 49B.

  • NVIDIA GB200 NVL72 · 13824 GB · nvidia
  • AMD Instinct MI355X · 288 GB · amd
  • AMD Instinct MI325X · 256 GB · amd
  • AMD Instinct MI300X · 192 GB · amd
  • NVIDIA B200 · 192 GB · nvidia
  • NVIDIA H100 NVL · 188 GB · nvidia
  • NVIDIA H200 · 141 GB · nvidia
  • Intel Gaudi 3 · 128 GB · intel

Frequently asked

What's the minimum VRAM to run Nemotron 3 Super 49B?

32GB of VRAM is enough to run Nemotron 3 Super 49B at the AWQ-INT4 quantization (file size 28.0 GB). Higher-quality quantizations need more.

Can I use Nemotron 3 Super 49B commercially?

Yes — Nemotron 3 Super 49B ships under the NVIDIA Open Model License, which permits commercial use. Always read the license text before deployment.

What's the context length of Nemotron 3 Super 49B?

Nemotron 3 Super 49B supports a context window of 131,072 tokens (128K).

Source: huggingface.co/nvidia/Nemotron-3-Super-49B

Reviewed by RunLocalAI Editorial. See our editorial policy for how we research and verify model claims.

Related — keep moving

Compare hardware
  • RTX 3090 vs RTX 5080 (24 vs 16 GB) →
  • Used 3090 vs 4090 →
Buyer guides
  • Best GPU for local AI — 32B-class models →
  • Best laptop for local AI →
  • Best Mac for local AI →
  • Best used GPU for local AI →
  • Will it run on my hardware? →
When it doesn't work
  • CUDA out of memory →
  • Ollama running slowly →
  • ROCm not detected →
  • Model keeps crashing →
Recommended hardware
  • NVIDIA GB200 NVL72 →
  • AMD Instinct MI355X →
  • AMD Instinct MI325X →
  • AMD Instinct MI300X →
  • NVIDIA B200 →
Alternatives
Nemotron 3 Nano 9B · Nemotron 3 Nano (30B-A3B) · Nemotron 3 Super (120B-A12B)
Before you buy

Verify Nemotron 3 Super 49B runs on your specific hardware before committing money.

  • Will it run on my hardware? →
  • Custom hardware comparison →
  • GPU recommender (4 questions) →
Compare alternatives

Models worth comparing

Same parameter band, plus what's one tier above and below — so you can decide what actually fits your hardware.

Same tier · models in the same parameter band as this one
  • Qwen 3 30B-A3B · qwen · 30B · unrated
  • Gemma 4 31B Dense · gemma · 31B · unrated
  • Nemotron 3 Nano (30B-A3B) · other · 30B · unrated
  • DeepSeek Coder V3 · deepseek · 33B · unrated

Step up · more capable, bigger memory footprint
  • Llama 3.3 70B Instruct · llama · 70B · 9.1/10
  • DeepSeek R1 Distill Llama 70B · deepseek · 70B · 9.0/10

Step down · smaller, faster, runs on weaker hardware
  • DeepSeek V3 Lite (16B MoE) · deepseek · 16B · unrated
  • Mistral Small 3 24B · mistral · 24B · 8.4/10