llama
70B parameters
Commercial OK
Reviewed May 2026

Hermes 4 70B FP8

Hermes 4 is a 70B reasoning model from NousResearch, built on Llama-3.1-70B; this variant uses FP8 quantization to cut memory overhead. It supports explicit `<think>` reasoning segments and structured output, and was post-trained on roughly 5M samples (~60B tokens) targeting math, code, and STEM. No Arabic-specific training is documented.

License: llama3 · Context: 128,000 tokens

Our verdict

Fredoline Eruo · Verified May 28, 2026
9.1/10

If you need a capable reasoning model for math, code, or strict JSON output and already have the hardware, Hermes 4 FP8 is a reasonable pick: the post-training corpus is substantial and the hybrid think mode is genuinely useful. For Arabic-region deployments, proceed with caution. There is no documented Arabic fine-tuning, so you are relying on whatever Arabic capability carried over from Llama-3.1's base training. The low HuggingFace engagement for a 70B model is a mild flag worth noting. Bottom line: solid for English STEM workloads; verify Arabic quality yourself before shipping.

Why this rating

Auto-generated rating (Opus 4.7 judge, claude-opus-4-7). Overall 9.05/10. License (llama3) matches the HF card exactly and commercial use is correctly flagged. Metadata aligns with the model card: 70B Llama-3.1 base, FP8 variant, NousResearch vendor. Description and verdict are honest and operator-voiced, correctly calling out the lack of Arabic-specific training and flagging FP8 quality tradeoffs and VRAM realities. The useCases array including 'arabic' is questionable since the description itself says no Arabic training was done — this is a mild inconsistency. bestUseCase is reasonably specific (STEM reasoning + structured extraction). Just barely clears the 9.0 bar.

Flags:
  • useCases includes 'arabic' despite the description explicitly stating no Arabic-specific training (inconsistent signaling)
  • contextLength 128000 is inherited from the Llama-3.1 base; not explicitly confirmed in the excerpt shown


Strengths

  • Hybrid reasoning mode: the model can expose step-by-step reasoning inside `<think>` tags before its final answer
  • Large post-training run — 5M samples, ~60B tokens — with documented gains in math, code, and STEM
  • Reliable structured output and JSON schema adherence
  • FP8 quantization roughly halves weight memory compared to BF16 at the same parameter count (1 byte per weight instead of 2)
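The `<think>` segments and JSON output listed above are usually separated client-side before use. A minimal sketch, assuming the raw completion contains at most one `<think>...</think>` block followed by the final answer; exact tag placement depends on the chat template, so check the model card. `split_reasoning`, `parse_json_answer`, and the sample completion are hypothetical.

```python
import json
import re

# Matches one <think>...</think> block; DOTALL lets "." span newlines.
THINK_RE = re.compile(r"<think>(.*?)</think>", re.DOTALL)


def split_reasoning(text: str) -> tuple[str, str]:
    """Return (reasoning, answer) from a raw completion.

    Assumes reasoning, if present, sits in a single <think>...</think>
    pair and the final answer follows the closing tag.
    """
    match = THINK_RE.search(text)
    if match is None:
        return "", text.strip()
    reasoning = match.group(1).strip()
    answer = text[match.end():].strip()
    return reasoning, answer


def parse_json_answer(text: str) -> dict:
    """Drop any reasoning block, then parse the remainder as a JSON object."""
    _, answer = split_reasoning(text)
    data = json.loads(answer)
    if not isinstance(data, dict):
        raise ValueError("expected a JSON object")
    return data


# Hypothetical completion in the assumed format.
raw = '<think>\nThe user wants the sum. 2 + 3 = 5.\n</think>\n{"result": 5}'
print(split_reasoning(raw))
print(parse_json_answer(raw))
```

Keeping the reasoning text separate lets you log it for debugging while passing only the answer to a JSON validator.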

Weaknesses

  • No Arabic-specific training data reported — Arabic quality is untested and likely uneven
  • FP8 quantization introduces potential quality degradation versus full BF16
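  • 70B parameters still demand serious hardware even with FP8: the weights alone take roughly 70 GB at 1 byte per parameter, before KV cache; a 40–48 GB budget is only realistic with a 4-bit quantization such as Q4_K_M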
  • Low adoption signal: 47K downloads and 29 likes on HuggingFace for a 70B model
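The VRAM figures above follow from simple arithmetic: weight memory is parameter count times bytes per weight. A rough sketch using the nominal 70B count; it covers weights only (no KV cache, activations, or runtime overhead), and the "4bit" entry is an idealized 0.5 bytes per weight, an assumption, since real K-quant files also store per-block scale data.

```python
# Approximate bytes per weight at each precision (weights only).
BYTES_PER_PARAM = {
    "bf16": 2.0,  # 16-bit brain floating point
    "fp8": 1.0,   # 8-bit floating point, as in this checkpoint
    "4bit": 0.5,  # idealized 4-bit; real K-quants store extra scale data
}


def weight_gb(num_params: float, precision: str) -> float:
    """Weight memory in decimal gigabytes (1 GB = 1e9 bytes)."""
    return num_params * BYTES_PER_PARAM[precision] / 1e9


PARAMS = 70e9  # nominal "70B" parameter count

for name in ("bf16", "fp8", "4bit"):
    print(f"{name:>4}: {weight_gb(PARAMS, name):6.1f} GB of weights")
```

Add KV cache and runtime overhead on top of these numbers when sizing a GPU.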

Quantization variants

Each quantization trades model quality for file size and VRAM. Q4_K_M is the most popular starting point.

Quantization | File size | VRAM required
Q4_K_M | 38.5 GB | 49 GB
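The two columns can be related with plain arithmetic. This sketch uses only the Q4_K_M row and the nominal 70B parameter count to compute the average bits stored per weight and how much of the listed VRAM goes beyond the file itself. How the page derived the 49 GB figure is not stated, so treat the headroom as an observation rather than a rule; both function names are hypothetical.

```python
def implied_bits_per_weight(file_gb: float, num_params: float) -> float:
    """Average stored bits per parameter for a file of file_gb decimal GB."""
    return file_gb * 1e9 * 8 / num_params


def headroom_gb(vram_required_gb: float, file_gb: float) -> float:
    """VRAM listed beyond the file itself (KV cache, buffers, overhead)."""
    return vram_required_gb - file_gb


# Q4_K_M row from the table; 70e9 is the nominal parameter count.
bpw = implied_bits_per_weight(38.5, 70e9)
extra = headroom_gb(49.0, 38.5)
print(f"Q4_K_M: ~{bpw:.1f} bits/weight, {extra:.1f} GB beyond the file")
```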

Get the model

HuggingFace

Original weights

huggingface.co/NousResearch/Hermes-4-70B-FP8

Source repository with the FP8 weights only; GGUF quantizations such as Q4_K_M are not included and must be produced or obtained separately.

Frequently asked

What's the minimum VRAM to run Hermes 4 70B FP8?

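About 49 GB of VRAM is enough to run Hermes 4 70B at the Q4_K_M quantization (file size 38.5 GB). Higher-quality quantizations need more, and running the FP8 checkpoint as published needs roughly 70 GB for the weights alone, before KV cache.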

Can I use Hermes 4 70B FP8 commercially?

Yes. Hermes 4 70B FP8 ships under the llama3 license, which permits commercial use. Always read the license text before deployment.

What's the context length of Hermes 4 70B FP8?

Hermes 4 70B FP8 lists a context window of 128,000 tokens (128K), matching its Llama-3.1-70B base.
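Using the full window costs memory on top of the weights, because the KV cache grows linearly with tokens. A rough per-sequence sketch, assuming Hermes 4 keeps the published Llama-3.1-70B attention shape (80 layers, 8 key/value heads, head dimension 128) and a 16-bit cache; it ignores batching and engine-specific paging overhead, and `kv_cache_gb` is a hypothetical helper.

```python
# Llama-3.1-70B attention shape, assumed unchanged in Hermes 4 70B.
NUM_LAYERS = 80
NUM_KV_HEADS = 8  # grouped-query attention
HEAD_DIM = 128


def kv_cache_gb(num_tokens: int, bytes_per_value: float = 2.0) -> float:
    """KV-cache size in decimal GB for one sequence.

    The leading 2 counts both the key and the value tensor in each layer.
    bytes_per_value=2.0 models a 16-bit cache; 1.0 models an FP8 cache.
    """
    per_token = 2 * NUM_LAYERS * NUM_KV_HEADS * HEAD_DIM * bytes_per_value
    return num_tokens * per_token / 1e9


for tokens in (8_192, 32_768, 128_000):
    print(f"{tokens:>7} tokens: {kv_cache_gb(tokens):5.1f} GB")
```

At the full 128K window a single sequence's 16-bit cache comes to roughly 42 GB under these assumptions, which is why long-context use needs far more memory than the file size suggests.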

Source: huggingface.co/NousResearch/Hermes-4-70B-FP8

Reviewed by RunLocalAI Editorial. See our editorial policy for how we research and verify model claims.
