Hermes 4 Llama 3.3 70B
Nous Research's Hermes 4 fine-tune of Llama 3.3 70B. Strong on instruction following and creative tasks; community-favored alternative to base Llama.
Overview
Hermes 4 Llama 3.3 70B is Nous Research's instruction-tuned fine-tune of Llama 3.3 70B, part of the long-running Hermes lineage focused on long-form reasoning, structured outputs, and agentic behavior. It is strong on instruction following and creative tasks, and a community-favored alternative to the base Llama release.
How to run it
Hermes 4 Llama 3.3 70B is Nous Research's instruction-tuned model based on Llama 3.3 70B. Hermes is Nous's long-running fine-tune lineage focused on long-form reasoning, structured outputs, and agentic behavior. It uses the standard Llama 3.3 architecture, so tooling compatibility is full.
- Run at Q4_K_M via Ollama (`ollama pull hermes4:70b`) or llama.cpp with `-ngl 999 -fa -c 8192`. The Q4_K_M file is ~40 GB on disk.
- VRAM: 48 GB (e.g. RTX A6000) runs Q4_K_M at 4K context; a 24 GB card like the RTX 4090 can run Q3_K_M with KV-cache offload. Recommended: A100 80GB at AWQ-INT4.
- Throughput: ~15-25 tok/s on an A6000 at Q4_K_M.
- Strengths: chat, reasoning, and structured output (JSON function calling). Nous's training emphasizes coherent long-form outputs (>2K tokens), chain-of-thought reasoning, and tool use.
- License: inherited from the base model — the Llama 3.3 Community License (verify on the Hugging Face repo).
- Use for: agent workflows, structured JSON output, complex multi-step reasoning, long-form content generation.
- Context: Llama 3.3's 128K window, but practically 4-8K on 48 GB of VRAM.
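A minimal sketch of driving the model through Ollama's local REST API. The model tag `hermes4:70b` is taken from the pull command above (verify it with `ollama list`); `num_predict` and `num_ctx` are Ollama's option names for output budget and context window.

```python
import json

# Build a request body for Ollama's /api/generate endpoint (default local
# server: http://localhost:11434). Model tag assumed from the doc — verify.
def build_request(prompt: str, max_tokens: int = 2048) -> dict:
    return {
        "model": "hermes4:70b",
        "prompt": prompt,
        "stream": False,
        "options": {
            "num_predict": max_tokens,  # Hermes generates long outputs; keep this generous
            "num_ctx": 8192,            # mirrors the -c 8192 llama.cpp setting above
        },
    }

payload = build_request("Summarize the Llama 3.3 architecture.")
body = json.dumps(payload)  # POST this to /api/generate
```

Send `body` with any HTTP client; with `"stream": False` Ollama returns one JSON object containing the full response.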
Hardware guidance
- Minimum: RTX 3090 24GB at Q3_K_M (4K context). Recommended: RTX A6000 48GB at Q4_K_M (8K). Optimal: A100 80GB at AWQ-INT4.
- VRAM math is identical to base Llama 3.3 70B: 70B at Q4 ≈ 40 GB of weights, plus ~10 GB of KV cache at 8K context, ≈ 50 GB total — so a single A6000 48GB is borderline at 8K.
- Dual RTX 4090 (2×24 GB = 48 GB): Q4 at 8K is viable. Single RTX 4090 24GB: Q3 with KV offload. RTX 5090 32GB: Q4 requires KV offload. Mac Studio M4 Max 64GB: Q4 at 5-10 tok/s.
- Cloud: A100 80GB at roughly $5-10/hr; AWQ-INT4 on an A100 enables 32K context.
- Hermes 4's long-form generation produces more output tokens than most chat models — make sure your max_tokens setting is high enough (2K-4K).
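The VRAM math above can be sketched as a small estimator. The ~10 GB KV-cache figure at 8K context comes from the text; the effective bits-per-weight for Q4_K_M (~4.5) is an assumption, chosen so that 70B lands near the stated ~40 GB.

```python
def vram_estimate_gb(params_b: float, bits_per_weight: float,
                     ctx_tokens: int, kv_gb_per_8k: float = 10.0) -> float:
    """Rough VRAM estimate: quantized weights plus KV cache.

    kv_gb_per_8k is the ~10 GB at 8K figure from the text; the KV cache
    scales roughly linearly with context length.
    """
    weights_gb = params_b * bits_per_weight / 8   # 70B at ~4.5 bits ≈ 40 GB
    kv_gb = kv_gb_per_8k * ctx_tokens / 8192      # linear scaling assumption
    return weights_gb + kv_gb

# 70B at Q4_K_M with 8K context → ≈ 49 GB, matching the "A6000 is borderline" call.
total = vram_estimate_gb(70, 4.5, 8192)
```

Halving the context to 4K drops the estimate by about 5 GB, which is why 4K context fits a 48 GB card comfortably while 8K is tight.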
What breaks first
1. Chat template specificity. Hermes 4 uses Nous's custom chat template, which differs from the standard Llama 3.3, Mistral, and OpenAI templates. Using the wrong template degrades output quality significantly — verify against the Hugging Face repo.
2. Long-form degradation at Q3. Hermes 4's strength is coherent long-form output, and at Q3 coherence over 2K+ tokens degrades more than in shorter generations: the longer the generation, the more quantization errors compound.
3. Tool-calling format drift. Hermes 4's function-calling format may differ from industry "standard" function-calling APIs. Validate that the model's JSON output matches your parser's expected schema.
4. Over-generation. Hermes 4 is trained for long-form output and may produce excessively long responses to simple prompts. Set an appropriate max_tokens and use stop sequences.
Runtime recommendation
On a 48 GB card, use Ollama (ollama pull hermes4:70b) or llama.cpp with -ngl 999 -fa at Q4_K_M. On an A100 80GB, AWQ-INT4 is the better fit and enables 32K context.
Common beginner mistakes
- Mistake: Using Llama 3.3's default chat template with Hermes 4. Fix: Hermes 4 uses Nous's custom template — verify it in tokenizer_config.json on the Hugging Face repo. Wrong template = garbled outputs.
- Mistake: Setting max_tokens too low for Hermes. Fix: Hermes is optimized for long-form generation; set max_tokens to 2K-4K for best results. A short max_tokens truncates the model's reasoning.
- Mistake: Expecting Hermes 4 to match other Hermes versions. Fix: Hermes 4 on Llama 3.3 70B is different from Hermes 3 on Llama 3.1 70B — different base models, different fine-tuning. Expect different behavior.
- Mistake: Using function calling without validating the JSON schema. Fix: Hermes 4's function-calling format may differ from OpenAI's. Test and validate the JSON output format before deploying.
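To make the first mistake concrete: earlier Hermes releases used a ChatML-style template, so a sketch of that format is shown below. This is an assumption about Hermes 4 — the exact special tokens must be verified in tokenizer_config.json on the repo, or applied automatically via your runtime's built-in template support.

```python
# ChatML-style prompt assembly (<|im_start|> / <|im_end|> tokens, as used by
# earlier Hermes releases). Verify Hermes 4's actual template before use.
def chatml_prompt(messages: list) -> str:
    parts = [f"<|im_start|>{m['role']}\n{m['content']}<|im_end|>\n"
             for m in messages]
    parts.append("<|im_start|>assistant\n")  # open the assistant turn to cue generation
    return "".join(parts)

prompt = chatml_prompt([
    {"role": "system", "content": "You are a helpful assistant."},
    {"role": "user", "content": "List three uses for a 70B model."},
])
```

Note how this differs from Llama 3.3's native header tokens — sending a prompt in the wrong wrapper is the single most common cause of degraded output with fine-tunes.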
Family & lineage
How this model relates to others in its lineage. Family members share architecture and training-data roots; parent / children edges record direct distillation or fine-tune relationships.
Strengths
- Strong instruction tuning at 70B
- Active Nous community
Weaknesses
- Llama Community License unchanged
Quantization variants
Each quantization trades model quality for file size and VRAM. Q4_K_M is the most popular starting point.
| Quantization | File size | VRAM required |
|---|---|---|
| Q4_K_M | ~40 GB | 48 GB |
| AWQ-INT4 | 40.0 GB | 48 GB |
Get the model
HuggingFace
Original weights
Source repository — direct quantization required.
Hardware that runs this
Cards with enough VRAM for at least one quantization of Hermes 4 Llama 3.3 70B.
Models worth comparing
Same parameter band, plus what's one tier above and below — so you can decide what actually fits your hardware.
Frequently asked
What's the minimum VRAM to run Hermes 4 Llama 3.3 70B?
24 GB (e.g. RTX 3090) at Q3_K_M with a short context; 48 GB (e.g. RTX A6000) is the practical minimum for Q4_K_M at 4-8K context.
Can I use Hermes 4 Llama 3.3 70B commercially?
It inherits the Llama 3.3 Community License from its base model, which permits commercial use within Meta's terms — verify the license file on the Hugging Face repo before deploying.
What's the context length of Hermes 4 Llama 3.3 70B?
128K tokens architecturally (inherited from Llama 3.3), but VRAM limits practical context to roughly 4-8K on a 48 GB card; AWQ-INT4 on an A100 80GB enables 32K.
Source: huggingface.co/NousResearch/Hermes-4-Llama-3.3-70B
Reviewed by RunLocalAI Editorial. See our editorial policy for how we research and verify model claims.
Related — keep moving
Verify Hermes 4 Llama 3.3 70B runs on your specific hardware before committing money.