by NVIDIA
NVIDIA's reasoning-tuned model family, spanning the Nemotron-3 and Nemotron-4 lineages. Tight integration with NVIDIA's own tooling (NeMo, TensorRT-LLM); strong on agentic and reasoning workloads.
Start with Nemotron-3 Nano 8B at Q4_K_M via Ollama — it fits on a single RTX 3060 (12 GB), using ~5 GB of VRAM. Nemotron-3 Nano is NVIDIA's instruction-tuned 8B built on the Llama-3.1 architecture with additional NVIDIA-curated instruction data. It scores 81.2% on IFEval, competitive with Llama 3.3 70B on instruction-following accuracy despite roughly 8× fewer parameters, which makes it the best sub-10B model for structured output generation (JSON, function calls, tool use). For chat quality, Nemotron-3 8B outperforms Llama 3.1 8B on AlpacaEval and MT-Bench by measurable margins, and it is optimized for NVIDIA hardware with FlashAttention-2 — expect 35+ tok/s on an RTX 4090. Skip Nemotron-4 — it's closed-weight and API-only. Skip older Nemotron variants as well: Nano is the current generation and replaces the 15B/43B predecessors.
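The ~5 GB VRAM figure follows from simple arithmetic on the quantized weights. A minimal sketch, assuming Q4_K_M averages roughly 4.8 bits per weight (a llama.cpp mixed-precision format; the exact average varies by model):

```python
def quantized_weight_gb(n_params_billion: float, bits_per_weight: float) -> float:
    """Approximate VRAM footprint of quantized weights alone, in GB."""
    return n_params_billion * 1e9 * bits_per_weight / 8 / 1e9

# 8B parameters at ~4.8 bits/weight (assumed Q4_K_M average)
weights_gb = quantized_weight_gb(8, 4.8)
print(round(weights_gb, 1))  # ~4.8 GB before KV cache and runtime overhead
```

The KV cache and runtime buffers add a few hundred MB on top, which is why the practical figure lands near 5 GB rather than 4.8.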
- For single-user local use: Ollama + nemotron:8b Q4_K_M on an RTX 4090 (24 GB) — achieves 35+ tok/s with FlashAttention-2.
- For maximum NVIDIA throughput: TensorRT-LLM 0.12.0+ with FP8 on an L40S — build the engine from the HuggingFace checkpoint (~20 min build time, ~55 tok/s decode).
- For multi-user serving: vLLM 0.6.3+ with AWQ 4-bit on an L4 (24 GB) — serves ~800 concurrent requests thanks to the small model footprint.
- For structured generation (JSON mode, function calling): SGLang v0.2.5+ with constrained decoding — Nemotron's instruction tuning makes it particularly responsive to grammar-constrained generation.

The model uses the Llama-3.1 chat template, so any Llama-compatible pipeline works without modification. Nemotron is released under the NVIDIA Open Model License — permissive for research and commercial use, but review the specific terms for redistribution.
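Since the model follows the Llama-3.1 chat template, prompts can be rendered by hand if you bypass a framework's built-in templating. A minimal sketch of that format (in practice, prefer the tokenizer's `apply_chat_template`, which also handles edge cases):

```python
def llama31_prompt(messages: list[dict]) -> str:
    """Render chat messages into the Llama-3.1 template format (sketch).

    Each message is {"role": ..., "content": ...}; the trailing assistant
    header tells the model to begin generating its reply.
    """
    parts = ["<|begin_of_text|>"]
    for m in messages:
        parts.append(
            f"<|start_header_id|>{m['role']}<|end_header_id|>\n\n"
            f"{m['content']}<|eot_id|>"
        )
    parts.append("<|start_header_id|>assistant<|end_header_id|>\n\n")
    return "".join(parts)

prompt = llama31_prompt([
    {"role": "system", "content": "You are a helpful assistant."},
    {"role": "user", "content": "List three GPUs."},
])
```

Any serving stack that accepts raw prompts (llama.cpp, TensorRT-LLM) can consume this string directly; OpenAI-compatible endpoints apply the template server-side.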
Models in this family with our verdicts
Verify that Nemotron runs on your specific hardware before committing to a purchase.