Hermes 4 70B FP8
Hermes 4 is a 70B reasoning model from NousResearch, built on Llama-3.1-70B with FP8 quantization to cut memory overhead. It supports explicit `<think>` reasoning segments and structured output, and was post-trained on roughly 5M samples (~60B tokens) targeting math, code, and STEM. No specific Arabic training was included.
If you need a capable reasoning model for math, code, or strict JSON output and already have the hardware, Hermes 4 FP8 is a reasonable pick: the post-training corpus is substantial (roughly 5M samples, ~60B tokens) and the hybrid think mode is useful in practice. For Arabic-region deployments, proceed with caution. There is no documented Arabic fine-tuning, so you are relying on whatever Arabic capability carried over from Llama-3.1's base training. HuggingFace engagement is low for a 70B model (47K downloads, 29 likes), which is a mild flag. Hedge: solid for English STEM workloads, but verify Arabic quality yourself before shipping.
Why this rating
Auto-generated rating (Opus 4.7 judge, claude-opus-4-7). Overall 9.05/10. License (llama3) matches the HF card exactly and commercial use is correctly flagged. Metadata aligns with the model card: 70B Llama-3.1 base, FP8 variant, NousResearch vendor. Description and verdict are honest and operator-voiced, correctly calling out the lack of Arabic-specific training and flagging FP8 quality tradeoffs and VRAM realities. The useCases array including 'arabic' is questionable since the description itself says no Arabic training was done — this is a mild inconsistency. bestUseCase is reasonably specific (STEM reasoning + structured extraction). Just barely clears the 9.0 bar.
Flags:
- useCases includes 'arabic' despite the description explicitly stating no Arabic-specific training (inconsistent signaling)
- contextLength 128000 is inherited from the Llama-3.1 base; not explicitly confirmed in the excerpt shown
Strengths
- Hybrid reasoning mode: the model can expose step-by-step thinking inside `<think>` tags before the final answer
- Large post-training run — 5M samples, ~60B tokens — with documented gains in math, code, and STEM
- Reliable structured output and JSON schema adherence
- FP8 quantization reduces VRAM demand compared to BF16 at the same parameter count
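The first and third strengths above can be combined into a small post-processing step: strip the reasoning segment, then check that what remains is valid JSON with the fields you expect. The sketch below uses only the Python standard library. It assumes the completion contains a single `<think>...</think>` pair followed by the answer, and the helper names (`split_reasoning`, `parse_answer`) and the sample string are hypothetical, written for illustration rather than taken from the model card.

```python
import json
import re

# Assumption: reasoning is wrapped in one <think>...</think> pair.
# Adjust the pattern if your serving stack strips or renames the tags.
THINK_RE = re.compile(r"<think>(.*?)</think>", re.DOTALL)


def split_reasoning(raw: str) -> tuple[str, str]:
    """Return (reasoning, answer) from a raw completion string."""
    match = THINK_RE.search(raw)
    if match is None:
        # No reasoning segment: treat the whole completion as the answer.
        return "", raw.strip()
    reasoning = match.group(1).strip()
    answer = (raw[:match.start()] + raw[match.end():]).strip()
    return reasoning, answer


def parse_answer(answer: str, required_keys: set[str]) -> dict:
    """Parse the answer as a JSON object and check the required keys exist."""
    data = json.loads(answer)  # raises json.JSONDecodeError on invalid JSON
    if not isinstance(data, dict):
        raise ValueError("expected a JSON object")
    missing = required_keys - set(data)
    if missing:
        raise ValueError(f"missing keys: {sorted(missing)}")
    return data


# Hand-written sample for illustration; this is not real model output.
sample = '<think>2 + 2 is 4.</think>\n{"answer": 4, "unit": "none"}'
reasoning, answer = split_reasoning(sample)
result = parse_answer(answer, {"answer", "unit"})
print(reasoning)
print(result)
```

This only checks key presence, not types or nested structure. For full schema validation you would want a dedicated validator, which is outside what this sketch covers.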
Weaknesses
- No Arabic-specific training data reported — Arabic quality is untested and likely uneven
- FP8 quantization introduces potential quality degradation versus full BF16
- 70B parameters still demand serious hardware even with FP8: at one byte per parameter the weights alone come to about 70 GB, and the Q4_K_M quantization listed below needs 49 GB of VRAM
- Low adoption signal: 47K downloads and 29 likes on HuggingFace for a 70B model
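The FP8 versus BF16 tradeoff mentioned above comes down to bytes per parameter. As a rough, weights-only estimate (it ignores the KV cache, activations, and runtime overhead, so real VRAM use will be higher), the arithmetic looks like this. The 70e9 parameter count is the nominal figure from the model name, and `weight_memory_gb` is a hypothetical helper.

```python
def weight_memory_gb(params: float, bits_per_param: float) -> float:
    """Approximate weight storage in decimal GB (1 GB = 1e9 bytes)."""
    return params * bits_per_param / 8 / 1e9


PARAMS = 70e9  # nominal parameter count; the exact count may differ slightly

bf16_gb = weight_memory_gb(PARAMS, 16)  # 2 bytes per parameter
fp8_gb = weight_memory_gb(PARAMS, 8)    # 1 byte per parameter

print(f"BF16 weights: ~{bf16_gb:.0f} GB")
print(f"FP8 weights:  ~{fp8_gb:.0f} GB")
```

Under these assumptions FP8 halves the weight footprint relative to BF16, but the weights alone still exceed what a single 48 GB card can hold.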
Quantization variants
Each quantization trades model quality for file size and VRAM. Q4_K_M is the most popular starting point.
| Quantization | File size | VRAM required |
|---|---|---|
| Q4_K_M | 38.5 GB | 49 GB |
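To turn the table into a quick go/no-go check, the sketch below compares the listed VRAM requirement against the total memory of a set of cards. It assumes the model can be split across GPUs so that their VRAM pools without extra per-card overhead, which is optimistic. The `Variant` class and `variants_that_fit` function are hypothetical names, and the only data point is the single row shown above.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Variant:
    name: str
    file_gb: float
    vram_gb: float


# Values copied from the table above.
VARIANTS = [Variant("Q4_K_M", 38.5, 49.0)]


def variants_that_fit(gpu_vram_gb: list[float]) -> list[str]:
    """Return the variants whose listed VRAM fits in the pooled GPU memory."""
    total = sum(gpu_vram_gb)
    return [v.name for v in VARIANTS if v.vram_gb <= total]


q4 = VARIANTS[0]
# Gap between file size and listed VRAM: headroom for cache and runtime.
headroom_gb = q4.vram_gb - q4.file_gb

print(f"Headroom over file size: {headroom_gb} GB")
print("2 x 24 GB:", variants_that_fit([24, 24]))
print("3 x 24 GB:", variants_that_fit([24, 24, 24]))
print("1 x 80 GB:", variants_that_fit([80]))
```

By the table's own numbers, a pair of 24 GB cards falls 1 GB short of the Q4_K_M requirement, so treat a 48 GB setup as marginal at best.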
Get the model
HuggingFace
Original weights
Source repository — direct quantization required.
Frequently asked
What's the minimum VRAM to run Hermes 4 70B FP8?
Can I use Hermes 4 70B FP8 commercially?
What's the context length of Hermes 4 70B FP8?
Source: huggingface.co/NousResearch/Hermes-4-70B-FP8
Reviewed by RunLocalAI Editorial. See our editorial policy for how we research and verify model claims.