Hermes 4 70B FP8

Hermes 4 is a 70B reasoning model from NousResearch, built on Llama-3.1-70B and quantized to FP8 to cut memory overhead. It supports explicit `<think>` reasoning segments and structured output, and was post-trained on roughly 5M samples (~60B tokens) targeting math, code, and STEM. No Arabic-specific training was included.

License: llama3 · Context: 128,000 tokens


Our verdict

OP · Fredoline Eruo | Verified May 28, 2026

9.1/10

If you need a capable reasoning model for math, code, or strict JSON output and already have the hardware, Hermes 4 FP8 is a reasonable pick: the post-training corpus is substantial and the hybrid think-mode is genuinely useful. For Arabic-region deployments, proceed with caution. There is no documented Arabic fine-tuning, so you are relying on whatever Arabic capability carried over from Llama-3.1's base training. The low HuggingFace engagement for a 70B model is a mild flag. Bottom line: solid for English STEM workloads, but verify Arabic quality yourself before shipping.

Why this rating

Auto-generated rating (Opus 4.7 judge, claude-opus-4-7). Overall 9.05/10. License (llama3) matches the HF card exactly and commercial use is correctly flagged. Metadata aligns with the model card: 70B Llama-3.1 base, FP8 variant, NousResearch vendor. Description and verdict are honest and operator-voiced, correctly calling out the lack of Arabic-specific training and flagging FP8 quality tradeoffs and VRAM realities. The useCases array includes 'arabic' even though the description itself says no Arabic training was done, which is a mild inconsistency. bestUseCase is reasonably specific (STEM reasoning + structured extraction). Just barely clears the 9.0 bar.

Flags:
- useCases includes 'arabic' despite the description explicitly stating no Arabic-specific training (inconsistent signaling)
- contextLength 128000 is inherited from the Llama-3.1 base; not explicitly confirmed in the excerpt shown

Overview

Strengths

Hybrid reasoning mode: the model can expose step-by-step thinking in `<think>` tags before the final answer
Large post-training run (5M samples, ~60B tokens) with documented gains in math, code, and STEM
Reliable structured output and JSON schema adherence
FP8 quantization reduces VRAM demand compared to BF16 at the same parameter count
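
For operators wiring this into a pipeline, the first two strengths can be handled together: strip the `<think>` segment, then validate the remaining text as JSON. The following is a minimal sketch, not the vendor's reference client. It assumes the reasoning arrives inline as `<think>...</think>` ahead of the answer, and the helper names (`split_reasoning`, `parse_json_answer`) and the sample payload are hypothetical.

```python
import json
import re

# Assumption: reasoning is emitted inline as <think>...</think> before the answer.
THINK_RE = re.compile(r"<think>(.*?)</think>", re.DOTALL)


def split_reasoning(text: str) -> tuple[str, str]:
    """Return (reasoning, answer) extracted from raw model output."""
    reasoning = "\n".join(m.strip() for m in THINK_RE.findall(text))
    answer = THINK_RE.sub("", text).strip()
    return reasoning, answer


def parse_json_answer(answer: str, required_keys: set[str]) -> dict:
    """Parse the answer as a JSON object and check that required keys exist."""
    data = json.loads(answer)  # raises json.JSONDecodeError on malformed output
    if not isinstance(data, dict):
        raise ValueError("expected a JSON object")
    missing = required_keys - set(data)
    if missing:
        raise ValueError(f"missing keys: {sorted(missing)}")
    return data


# Hypothetical raw output, used only to exercise the helpers.
raw = '<think>2 + 2 is 4.</think>\n{"result": 4, "unit": "none"}'
reasoning, answer = split_reasoning(raw)
parsed = parse_json_answer(answer, {"result", "unit"})
print(parsed)
```

Keeping the validation step outside the model call means a schema failure can trigger a retry without trusting the model's own adherence.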

Weaknesses

No Arabic-specific training data reported — Arabic quality is untested and likely uneven
FP8 quantization introduces potential quality degradation versus full BF16
70B parameters still demand serious hardware even with FP8 (at one byte per parameter, expect roughly 70 GB of VRAM for the weights alone, before KV cache and activations)
Low adoption signal: 47K downloads and 29 likes on HuggingFace for a 70B model
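
A back-of-the-envelope way to sanity-check the hardware requirement is to multiply parameter count by bytes per parameter, then add the KV cache for the context you plan to serve. The sketch below is an estimate under stated assumptions, not a measured figure. It assumes the Llama-3.1-70B layout (80 layers, 8 KV heads, head dimension 128) and a 16-bit KV cache, and it ignores activations and runtime overhead. The function names are illustrative.

```python
def weight_memory_gb(params_billion: float, bytes_per_param: float) -> float:
    """Weights-only footprint in GB (1 GB = 1e9 bytes)."""
    return params_billion * bytes_per_param


def kv_cache_gb(tokens: int, layers: int, kv_heads: int,
                head_dim: int, bytes_per_value: float) -> float:
    """KV cache size in GB for one sequence; the factor 2 covers keys and values."""
    return 2 * layers * kv_heads * head_dim * bytes_per_value * tokens / 1e9


fp8_weights = weight_memory_gb(70, 1)   # FP8: one byte per parameter
bf16_weights = weight_memory_gb(70, 2)  # BF16: two bytes per parameter

# Assumed Llama-3.1-70B attention layout and a 16-bit cache at full context.
full_context_kv = kv_cache_gb(tokens=128_000, layers=80, kv_heads=8,
                              head_dim=128, bytes_per_value=2)

print(f"FP8 weights:  {fp8_weights:.0f} GB")
print(f"BF16 weights: {bf16_weights:.0f} GB")
print(f"KV cache at 128,000 tokens: {full_context_kv:.1f} GB")
```

Under these assumptions FP8 halves the weight footprint relative to BF16, but a single full-length context can still add tens of gigabytes on top, so shorter serving contexts or a quantized KV cache change the budget considerably.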