hermes
8B parameters
Commercial OK

Hermes 3 Llama 3.1 8B

NousResearch's Hermes fine-tune of Llama 3.1 8B. Stronger system-prompt adherence, JSON output, role-play, and agent steering than the base Llama.

License: Llama 3.1 Community License · Released Aug 15, 2024 · Context: 131,072 tokens
Our verdict
By Fredoline Eruo · Last verified May 6, 2026
7.7/10
Positioning

Hermes 3 is the uncensored / less-aligned alternative on the Llama 3.1 8B base. Right pick for security research, red-team work, technical writing on dual-use topics, or any case where the base Llama's refusal layer gets in the way of legitimate work.

Strengths
  • Refusals dramatically reduced vs base Llama 3.1 8B without losing instruction quality.
  • Same VRAM, same Llama license — drop-in replacement.
  • Tool-use compatibility preserved.
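The system-prompt adherence and JSON output called out above are easy to exercise through Ollama's local HTTP API. A minimal sketch, assuming an Ollama server on the default `localhost:11434` and the `hermes3:8b` tag already pulled:

```python
import json
import urllib.request

OLLAMA_URL = "http://localhost:11434/api/chat"  # default Ollama endpoint

def build_request(user_prompt: str) -> dict:
    """Build a chat payload that asks Hermes 3 for strict JSON output."""
    return {
        "model": "hermes3:8b",
        "messages": [
            # Hermes 3 is tuned for strong system-prompt adherence,
            # so a constraint like this is usually respected.
            {"role": "system",
             "content": 'Reply only with a JSON object: {"answer": string}.'},
            {"role": "user", "content": user_prompt},
        ],
        "format": "json",  # Ollama constrains decoding to valid JSON
        "stream": False,
    }

def ask(prompt: str) -> dict:
    """Send the request to a locally running Ollama server."""
    req = urllib.request.Request(
        OLLAMA_URL,
        data=json.dumps(build_request(prompt)).encode(),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req) as resp:
        body = json.load(resp)
    return json.loads(body["message"]["content"])

if __name__ == "__main__":
    print(ask("What model are you?"))
```

The `format: "json"` field and `/api/chat` endpoint are standard Ollama API features; the system-prompt wording is just an illustration.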
Limitations
  • Niche use case — most users don't need this; default to Llama 3.1 8B.
  • Slightly weaker on creative writing than base Llama (alignment training adds polish).
  • Reduced refusals can be too eager — produces content that requires judgment to use.
Real-world performance on RTX 4090
  • Q4_K_M (4.9 GB): 90–110 tok/s decode
  • Q5_K_M (5.6 GB): 80–95 tok/s
  • Q8_0 (8.5 GB): 65–80 tok/s
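These throughput figures are at the 8192-token context used in our test setup; VRAM use grows with context because of the KV cache. A back-of-envelope estimate using Llama 3.1 8B's published architecture numbers (32 layers, 8 grouped-query KV heads, head dimension 128, fp16 cache) — an illustration, not a measurement:

```python
# Rough fp16 KV-cache size for Llama 3.1 8B (GQA architecture).
# Architecture constants come from the model config; treat as an estimate.
LAYERS = 32      # transformer blocks
KV_HEADS = 8     # grouped-query attention KV heads
HEAD_DIM = 128   # per-head dimension
BYTES = 2        # fp16

def kv_cache_gib(ctx_tokens: int) -> float:
    """GiB of KV cache: 2 tensors (K and V) per layer, per token."""
    per_token = 2 * LAYERS * KV_HEADS * HEAD_DIM * BYTES  # bytes per token
    return per_token * ctx_tokens / 2**30

print(kv_cache_gib(8192))    # 1 GiB at the 8192-token setting used here
print(kv_cache_gib(131072))  # full 131,072-token window: 16 GiB
```

In other words, running the full 131K window adds roughly 16 GiB on top of the weights, which is why long-context use on a 24 GB card pairs best with the Q4_K_M quant.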
Should you run this locally?

Yes, for security/research work where base Llama's refusals are blocking legitimate tasks. No, for general chat — the base Llama 3.1 8B is the right default.

How it compares
  • vs Llama 3.1 8B (base) → Hermes 3 is base Llama minus the alignment layer. Pick base for general use, Hermes for technical/research work.
  • vs Hermes 3 Llama 3.1 70B → 70B is meaningfully smarter at higher VRAM cost.
  • vs Dolphin 3.0 Mistral 24B → similar philosophy, different base model. Dolphin is bigger and on Apache base.
Run this yourself
ollama pull hermes3:8b-llama-3.1-q4_K_M
ollama run hermes3:8b-llama-3.1-q4_K_M
Settings: Q4_K_M GGUF, 8192 ctx, llama.cpp/CUDA, RTX 4090
Why this rating

7.7/10 — the right pick when Llama 3.1 8B's alignment refusals get in the way. NousResearch's Hermes 3 strips the over-cautious layer while keeping instruction-following intact. Loses points only on niche use case.

Overview

NousResearch's Hermes fine-tune of Llama 3.1 8B. Stronger system-prompt adherence, JSON output, role-play, and agent steering than the base Llama.

Strengths

  • Excellent system-prompt obedience
  • JSON / structured output
  • Agent-friendly

Weaknesses

  • Inherits Llama 3.1 license

Quantization variants

Each quantization trades model quality for file size and VRAM. Q4_K_M is the most popular starting point.

Quantization | File size | VRAM required
Q4_K_M       | 4.9 GB    | 6 GB
Q8_0         | 8.5 GB    | 10 GB

Get the model

Ollama

One-line install

ollama run hermes3:8b

Read our Ollama review →

HuggingFace

Original weights

huggingface.co/NousResearch/Hermes-3-Llama-3.1-8B

Source repository with full-precision weights; you'll need to quantize them yourself (e.g. to GGUF via llama.cpp) before running locally.

Hardware that runs this

Cards with enough VRAM for at least one quantization of Hermes 3 Llama 3.1 8B.

Compare alternatives

Models worth comparing

Same parameter band, plus what's one tier above and below — so you can decide what actually fits your hardware.

Frequently asked

What's the minimum VRAM to run Hermes 3 Llama 3.1 8B?

6 GB of VRAM is enough to run Hermes 3 Llama 3.1 8B at the Q4_K_M quantization (file size 4.9 GB). Higher-quality quantizations need more.

Can I use Hermes 3 Llama 3.1 8B commercially?

Yes — Hermes 3 Llama 3.1 8B ships under the Llama 3.1 Community License, which permits commercial use. Always read the license text before deployment.

What's the context length of Hermes 3 Llama 3.1 8B?

Hermes 3 Llama 3.1 8B supports a context window of 131,072 tokens (128K).

How do I install Hermes 3 Llama 3.1 8B with Ollama?

Run `ollama pull hermes3:8b` to download, then `ollama run hermes3:8b` to start a chat session. The default quantization is Q4_K_M.

Source: huggingface.co/NousResearch/Hermes-3-Llama-3.1-8B

Reviewed by RunLocalAI Editorial. See our editorial policy for how we research and verify model claims.