49B parameters

Commercial OK

Reviewed May 2026

Nemotron 3 Super 49B

Nemotron 3 mid-tier. 49B dense; fits 32GB cards with AWQ. NVIDIA stack alignment carries through.

License: NVIDIA Open Model License · Released Jan 22, 2026 · Context: 131,072 tokens

How to run it

Nemotron-3-Super 49B is NVIDIA's 49B dense model, the mid-tier Nemotron aimed at single consumer GPU deployment. Run it at Q4_K_M via Ollama (ollama pull nemotron:3-super-49b) or llama.cpp with -ngl 999 -fa -c 8192. The Q4_K_M file is ~28 GB on disk.

Minimum VRAM: 24 GB (RTX 4090 or RTX 3090) at Q4_K_M with offload to system RAM for 8K context. The 28 GB of weights alone exceeds 24 GB, so on these cards lower -ngl until the model loads rather than relying on KV offload alone. Recommended: RTX 4090 24GB at Q4_K_M (8K context with offload). Throughput: ~25-40 tok/s on RTX 4090 at Q4_K_M; ~35-55 tok/s on RTX 5090.

The architecture is standard Llama/Nemotron, so compatibility is broad. The 49B size is a sweet spot: strong quality (close to the 70B tier) with 24 GB GPU accessibility. Use it for general chat, reasoning, coding, and agent tasks, particularly if you are VRAM-constrained and want 70B-class quality on a single consumer GPU. Nemotron models are NVIDIA's instruction-tuned suite, with a focus on structured outputs and tool-calling. Context: 131,072 tokens advertised; practical at Q4 on 24 GB is 8-16K. For the neighboring size point, see Nemotron-3-Super 51B.
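The llama.cpp invocation above can be sketched as a small Python helper that assembles the argument list. The binary name llama-server, the model path, and the helper name build_llama_cmd are assumptions for illustration; only the flags -ngl, -fa, and -c come from the text. Some recent llama.cpp builds expect a value after -fa (for example -fa on), so check --help for your build.

```python
import shlex

def build_llama_cmd(model_path, ctx=8192, gpu_layers=999, flash_attn=True,
                    binary="llama-server"):
    """Assemble a llama.cpp argument list (hypothetical helper, not an official API)."""
    cmd = [binary, "-m", model_path, "-ngl", str(gpu_layers), "-c", str(ctx)]
    if flash_attn:
        cmd.append("-fa")  # flash attention, as recommended in the text
    return cmd

# The model path is a placeholder; point it at your own GGUF file.
cmd = build_llama_cmd("models/nemotron-3-super-49b.Q4_K_M.gguf")
print(shlex.join(cmd))
```

On a 24 GB card, passing a smaller gpu_layers value is the knob to turn when the model fails to load; the right number depends on your build and context size.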

Hardware guidance

Minimum: RTX 3060 12GB at Q3_K_M with aggressive offload to system RAM.
Recommended: RTX 4090 24GB at Q4_K_M (8K context, with offload).
Optimal: RTX 5090 32GB at Q4_K_M (weights fully on-GPU; ~4K context without offload, longer with KV offload).

VRAM math: 49B dense at Q4_K_M ≈ 28 GB of weights. KV cache at 8K: ~8 GB. Total: ~36 GB at 8K.

RTX 4090 24GB: the 28 GB of Q4 weights already exceeds VRAM, so the KV cache and part of the weights must live in system RAM. Generation speed drops 10-20% with KV offload.
RTX 5090 32GB: Q4 weights fit, leaving ~4 GB for KV, or about 4K context on-GPU.
RTX 3090 24GB: same as the 4090; offload required.
MacBook Pro M4 Max 36GB+: Q4 at 6-12 tok/s.
Cloud: A10 24GB at Q4_K_M works, with the same offload as other 24 GB cards.
AWQ-INT4: weights of roughly 25-28 GB, which makes 32 GB cards the natural fit.
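The VRAM math can be checked with a few lines of Python. This is a back-of-the-envelope sketch: it takes the 28 GB weight size and the ~8 GB KV cache at 8K context from the text, assumes the KV cache grows linearly with context length, and ignores runtime overhead such as compute buffers. The function names are illustrative.

```python
WEIGHTS_GB = 28.0          # Q4_K_M weights, from the text
KV_GB_PER_1K = 8.0 / 8     # ~8 GB at 8K context, assumed linear in tokens

def total_vram_gb(ctx_tokens):
    """Weights plus KV cache for a given context length (no overhead modeled)."""
    return WEIGHTS_GB + KV_GB_PER_1K * ctx_tokens / 1024

def max_on_gpu_ctx(vram_gb):
    """Largest context (in tokens) whose KV cache fits beside the weights."""
    free = vram_gb - WEIGHTS_GB
    if free <= 0:
        return 0  # weights alone do not fit; offload is unavoidable
    return int(free / KV_GB_PER_1K * 1024)

for name, vram in [("RTX 4090", 24), ("RTX 5090", 32)]:
    print(f"{name}: 8K needs {total_vram_gb(8192):.0f} GB, "
          f"on-GPU context = {max_on_gpu_ctx(vram)} tokens")
```

Under these assumptions a 24 GB card has no room for on-GPU context at Q4_K_M, while a 32 GB card has room for about 4K tokens.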

What breaks first

1. KV cache offload penalty. On 24 GB GPUs, offloading the KV cache to system RAM is mandatory for any practical context length. This adds 10-20% latency overhead. CUDA's asynchronous allocator (cudaMallocAsync), where your runtime exposes it, may reduce the penalty.
2. Nemotron chat template. Same issue as all Nemotrons: the custom template differs from standard Llama. Wrong template = degraded instruction-following.
3. Q3 quality on code/math. Same quant sensitivity pattern as Nemotron-3-Super 51B: reasoning tasks degrade more at Q3 than general chat. Use Q4_K_M minimum.
4. Ollama tag naming confusion. Nemotron 49B may be tagged as nemotron:49b, nemotron-super:49b, or similar. Verify the exact tag before pulling.
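For the tag-naming issue, one way to see which Nemotron tags are actually installed is to scan the output of ollama list. The sketch below parses that listing as plain text; the helper name find_tags and the sample rows (including the IDs and dates) are made up for illustration, and the column layout is assumed to be NAME first, separated by whitespace.

```python
def find_tags(listing, needle="nemotron"):
    """Return model names from `ollama list`-style text that contain `needle`."""
    tags = []
    for line in listing.splitlines()[1:]:   # skip the header row
        parts = line.split()
        if parts and needle in parts[0].lower():
            tags.append(parts[0])
    return tags

# Illustrative sample only. To use real data, capture the command output, e.g.
# subprocess.run(["ollama", "list"], capture_output=True, text=True).stdout
sample = """NAME                   ID            SIZE    MODIFIED
nemotron-super:49b     0123456789ab  28 GB   2 days ago
llama3:latest          ba9876543210  4.7 GB  3 weeks ago
"""
print(find_tags(sample))
```

This only covers locally installed models; the tag to pull still has to be confirmed against the Ollama model library.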

Runtime recommendation

Ollama for quick-start. llama.cpp with explicit KV offload config for 24 GB GPUs. TensorRT-LLM for maximum throughput on NVIDIA GPUs. Standard Llama architecture — any stack works. MLX-LM on Apple Silicon for unified memory efficiency.

Common beginner mistakes

Mistake: Expecting Q4_K_M (28 GB) to fit entirely on a 24 GB GPU with 8K context. Fix: 28 GB weights + 8 GB KV = 36 GB, so the KV cache must be offloaded to RAM. In llama.cpp, pass --no-kv-offload (-nkvo) to keep the KV cache in system RAM, or accept a much shorter context.

Mistake: Confusing the 49B and 51B Nemotron variants. Fix: They are different models in the same family. 51B is the original Super; 49B is a different size point. Check the Hugging Face repo for the specific model.

Mistake: Disabling flash attention on tight VRAM. Fix: Flash attention trims attention-related memory by roughly 20-30%. Always enable it with -fa.

Mistake: Using the default Llama chat template. Fix: Nemotron has a custom template. Verify it in tokenizer_config.json on Hugging Face.
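The chat template check can be sketched in Python against a locally downloaded tokenizer_config.json. The helper name read_chat_template is hypothetical, and the demo file below uses a made-up template string rather than Nemotron's real one. The sketch assumes the template is stored under the chat_template key; some repositories ship it in a separate file instead, in which case this returns None.

```python
import json
import os
import tempfile

def read_chat_template(config_path):
    """Return the chat_template string from a tokenizer_config.json, or None."""
    with open(config_path, encoding="utf-8") as f:
        return json.load(f).get("chat_template")

# Stand-in config for demonstration; the template text is invented.
demo = {"chat_template": "{% for m in messages %}<{{ m['role'] }}>{{ m['content'] }}{% endfor %}"}
with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "tokenizer_config.json")
    with open(path, "w", encoding="utf-8") as f:
        json.dump(demo, f)
    template = read_chat_template(path)

print("custom template present" if template else "no chat_template key found")
```

Comparing the returned string with the template your runtime applies is a quick way to catch a silent fallback to the default Llama format.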

Family & lineage

How this model relates to others in its lineage. Family members share architecture and training-data roots; parent / children edges record direct distillation or fine-tune relationships.

Parent / base model

Nemotron 3 Nano 9B · 9B · Consumer

Family siblings (nemotron-3)

Nemotron 3 Nano 9B · 9B · Consumer
Nemotron 3 Nano (30B-A3B) · 30B · Consumer
Nemotron 3 Super 49B · 49B · this model
Nemotron 3 Super (120B-A12B) · 120B · Datacenter

Strengths

32GB-VRAM workstation deployment
NVIDIA tool-call discipline

Weaknesses

Less battle-tested than Llama 70B class

Quantization variants

Each quantization trades model quality for file size and VRAM. Q4_K_M is the most popular starting point.

Quantization	File size	VRAM required
AWQ-INT4	28.0 GB	32 GB

Get the model

HuggingFace

Original weights

huggingface.co/nvidia/Nemotron-3-Super-49B

Source repository with the original weights; you need to quantize them yourself.

Frequently asked

What's the minimum VRAM to run Nemotron 3 Super 49B?

32 GB of VRAM is enough to hold Nemotron 3 Super 49B entirely on the GPU at the AWQ-INT4 quantization (file size 28.0 GB). Cards with 24 GB can run the Q4_K_M build only with offload to system RAM. Higher-quality quantizations need more.

Can I use Nemotron 3 Super 49B commercially?

Yes — Nemotron 3 Super 49B ships under the NVIDIA Open Model License, which permits commercial use. Always read the license text before deployment.

What's the context length of Nemotron 3 Super 49B?

Nemotron 3 Super 49B supports a context window of 131,072 tokens (128K).

Source: huggingface.co/nvidia/Nemotron-3-Super-49B

Reviewed by RunLocalAI Editorial. See our editorial policy for how we research and verify model claims.

Alternatives

Nemotron 3 Nano 9B
Nemotron 3 Nano (30B-A3B)
Nemotron 3 Super (120B-A12B)
