by NVIDIA
NVIDIA's reasoning-tuned model family, spanning the Nemotron-3 and Nemotron-4 lineages. Tight integration with NVIDIA's own tooling (NeMo, TensorRT-LLM); strong on agentic and reasoning workloads.
Start with Nemotron-3 Nano 8B at Q4_K_M via Ollama — it fits on a single RTX 3060 (12 GB), using ~5 GB of VRAM. Nemotron-3 Nano is NVIDIA's instruction-tuned 8B built on the Llama-3.1 architecture with additional NVIDIA-curated instruction data. It scores 81.2% on IFEval, competitive with Llama 3.3 70B on instruction-following accuracy despite roughly 8× fewer parameters, which makes it the best sub-10B model for structured output generation (JSON, function calls, tool use). For chat quality, Nemotron-3 8B outperforms Llama 3.1 8B on AlpacaEval and MT-Bench by measurable margins, and it is optimized for NVIDIA hardware with FlashAttention-2 — expect 35+ tok/s on an RTX 4090. Skip Nemotron-4 — it's closed-weight and API-only. Skip older Nemotron variants as well: Nano is the current generation and replaces the 15B/43B predecessors.
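The ~5 GB VRAM figure follows from simple arithmetic on the quantized weights. A minimal sketch, assuming Q4_K_M averages roughly 4.8 bits per weight (a llama.cpp mixed-precision format; the exact average varies by model):

```python
def quantized_weight_gb(n_params_billion: float, bits_per_weight: float) -> float:
    """Approximate VRAM footprint of quantized weights alone, in GB."""
    return n_params_billion * 1e9 * bits_per_weight / 8 / 1e9

# 8B parameters at ~4.8 bits/weight (assumed Q4_K_M average)
weights_gb = quantized_weight_gb(8, 4.8)
print(round(weights_gb, 1))  # ~4.8 GB before KV cache and runtime overhead
```

The KV cache and runtime buffers add a few hundred MB on top, which is why the practical figure lands near 5 GB rather than 4.8.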
- For single-user local use: Ollama + nemotron:8b Q4_K_M on an RTX 4090 (24 GB) — achieves 35+ tok/s with FlashAttention-2.
- For maximum NVIDIA throughput: TensorRT-LLM 0.12.0+ with FP8 on an L40S — build the engine from the HuggingFace checkpoint (~20 min build time, ~55 tok/s decode).
- For multi-user serving: vLLM 0.6.3+ with AWQ 4-bit on an L4 (24 GB) — serves ~800 concurrent requests thanks to the small model footprint.
- For structured generation (JSON mode, function calling): SGLang v0.2.5+ with constrained decoding — Nemotron's instruction tuning makes it particularly responsive to grammar-constrained generation.

The model uses the Llama-3.1 chat template, so any Llama-compatible pipeline works without modification. Nemotron is released under the NVIDIA Open Model License — permissive for research and commercial use, but review the specific terms for redistribution.
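Since the model follows the Llama-3.1 chat template, prompts can be rendered by hand if you bypass a framework's built-in templating. A minimal sketch of that format (in practice, prefer the tokenizer's `apply_chat_template`, which also handles edge cases):

```python
def llama31_prompt(messages: list[dict]) -> str:
    """Render chat messages into the Llama-3.1 template format (sketch).

    Each message is {"role": ..., "content": ...}; the trailing assistant
    header tells the model to begin generating its reply.
    """
    parts = ["<|begin_of_text|>"]
    for m in messages:
        parts.append(
            f"<|start_header_id|>{m['role']}<|end_header_id|>\n\n"
            f"{m['content']}<|eot_id|>"
        )
    parts.append("<|start_header_id|>assistant<|end_header_id|>\n\n")
    return "".join(parts)

prompt = llama31_prompt([
    {"role": "system", "content": "You are a helpful assistant."},
    {"role": "user", "content": "List three GPUs."},
])
```

Any serving stack that accepts raw prompts (llama.cpp, TensorRT-LLM) can consume this string directly; OpenAI-compatible endpoints apply the template server-side.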
Models in this family with our verdicts
Verify that Nemotron runs on your specific hardware before committing to a purchase.