llama · 8B parameters · Commercial OK

Llama 3.1 8B Instruct

Meta's small flagship. Strong general reasoning, 128K context, broad multilingual. The default first try for most local-AI use cases on consumer hardware.

License: Llama 3.1 Community License · Released Jul 23, 2024 · Context: 131,072 tokens
Our verdict
By Fredoline Eruo · Last verified May 6, 2026
8.7/10
Positioning

The default 8B-class model for anyone who wants a permissive, English-strong, runs-everywhere chat assistant. If you have an RTX 3060 12 GB or anything stronger, this is the model you start with — it's the one the entire local-LLM tutorial ecosystem is calibrated against.

Strengths
  • Fits everything: Q4_K_M is 4.9 GB. Runs on a 6 GB card with reduced context, comfortably on 8 GB+, and at full 128K context on a 12 GB+ card with KV cache trimming (see the context-capping sketch after this list).
  • Instruction following is excellent: handles multi-turn, system prompts, JSON-mode-via-prompt, and tool-call-style outputs without the brittleness Mistral 7B shows.
  • Genuinely permissive license: the Llama 3.1 Community License allows commercial use up to 700M MAUs — which is everyone reading this.
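
If VRAM is tight, the usual lever is capping the context window so the KV cache stays small. A minimal sketch, assuming a stock Ollama install; num_ctx is Ollama's context-length parameter, 4096 is an illustrative value, and llama3.1-8b-4k is a made-up tag name:

# Inside an interactive session (the /set line is typed at the >>> prompt)
ollama run llama3.1:8b-instruct-q4_K_M
/set parameter num_ctx 4096

# Or bake the cap into a reusable tag via a Modelfile
printf 'FROM llama3.1:8b-instruct-q4_K_M\nPARAMETER num_ctx 4096\n' > Modelfile
ollama create llama3.1-8b-4k -f Modelfile
ollama run llama3.1-8b-4k
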
Limitations
  • Math and code are average, not strong. For coding work, Qwen 2.5 Coder 7B is meaningfully better.
  • 128K context is nominal, not real — quality starts degrading past ~32K tokens, and effective recall over very long inputs is weaker than the spec suggests.
  • Alignment refusals are noticeable in technical domains (security research, pen-testing tutorials). Hermes-3-8B is a good uncensored alternative on the same base.
Real-world performance on RTX 4090
  • Q4_K_M (4.9 GB): 95–115 tok/s decode, TTFT under 80 ms on a 1K prompt
  • Q5_K_M (5.7 GB): 88–100 tok/s
  • Q8_0 (8.5 GB): 70–82 tok/s — the quality bump over Q5 is small; rarely worth the speed loss
Should you run this locally?

Yes, for general assistant work, summarization, drafting, RAG pipelines, and as the chat model behind tooling/agents that need a fast, predictable backbone. No, for serious code generation (use Qwen 2.5 Coder), heavy reasoning (use QwQ 32B or DeepSeek R1 Distill), or non-English tasks where Qwen 2.5 7B is consistently stronger.
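
When it sits behind tooling or a RAG pipeline, the usual pattern is the Ollama HTTP API with a system prompt and JSON-constrained output. A rough sketch, assuming Ollama is serving on its default port (the prompt text is illustrative, not a recommended template):

curl http://localhost:11434/api/chat -d '{
  "model": "llama3.1:8b-instruct-q4_K_M",
  "messages": [
    {"role": "system", "content": "You extract fields from the user text and reply only with JSON."},
    {"role": "user", "content": "The RTX 4090 has 24 GB of VRAM and launched in 2022."}
  ],
  "format": "json",
  "stream": false
}'
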

How it compares
  • vs Qwen 2.5 7B → Qwen wins on knowledge breadth and multilingual tasks; Llama wins on instruction reliability and ecosystem maturity. Coin flip with the edge to Qwen if you're comfortable using it.
  • vs Mistral 7B v0.3 → Llama wins decisively on instruction following and long-context behavior. Mistral 7B is the previous default; there's no reason to start there now.
  • vs Phi-3.5 Mini (3.8B) → Llama is far more capable; Phi is the right pick only when VRAM is genuinely tight (sub-6 GB cards).
  • vs Llama 3.2 3B → Llama 3.1 8B is materially better at almost everything but uses ~2× the VRAM. The 3B is for VRAM-constrained edge devices.
Run this yourself
ollama pull llama3.1:8b-instruct-q4_K_M
ollama run llama3.1:8b-instruct-q4_K_M
Settings used in the timing range above: Quant Q4_K_M GGUF · Context 8192 (KV cache f16) · Backend llama.cpp via Ollama, CUDA 12.4 · GPU RTX 4090, driver 555.99
Why this rating

8.7/10 — the boring, correct answer for almost every "I have an 8 GB GPU and want a chat model" question. Loses points only because Qwen 2.5 7B has overtaken it on raw capability per parameter.

Strengths

  • 128K context
  • Excellent instruction following
  • Strong tool/function calling

Weaknesses

  • Refusals on edge use cases
  • Slower than 3B siblings
  • No vision

Quantization variants

Each quantization trades a small amount of model quality for a smaller file and VRAM footprint. Q4_K_M is the most popular starting point.

Quantization | File size | VRAM required
Q4_K_M | 4.9 GB | 6 GB
Q5_K_M | 5.7 GB | 7 GB
Q8_0 | 8.5 GB | 10 GB
FP16 | 16.1 GB | 18 GB
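
Each row maps to an Ollama tag of the form llama3.1:8b-instruct-<quant>. A sketch of pulling the specific variants, assuming the tag names on the Ollama library still follow that pattern (check the library page if a pull fails):

ollama pull llama3.1:8b-instruct-q4_K_M   # 4.9 GB, the usual starting point
ollama pull llama3.1:8b-instruct-q5_K_M   # 5.7 GB, slightly higher quality
ollama pull llama3.1:8b-instruct-q8_0     # 8.5 GB, near-lossless
ollama pull llama3.1:8b-instruct-fp16     # 16.1 GB, original precision
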

Get the model

Ollama

One-line install

ollama run llama3.1:8b
Read our Ollama review →

HuggingFace

Original weights

huggingface.co/meta-llama/Meta-Llama-3.1-8B-Instruct

Source repository with the original FP16 weights; you'll need to convert and quantize them yourself for local runners (see the sketch below).
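
If you start from the HuggingFace repository rather than a prebuilt GGUF, the common route is llama.cpp's converter plus its quantize tool. A rough sketch, assuming you have accepted Meta's license gate on the repo and have a recent llama.cpp checkout built locally (script and binary names vary between older releases):

# Download the original weights (gated: accept the license on HuggingFace first)
huggingface-cli download meta-llama/Meta-Llama-3.1-8B-Instruct --local-dir llama31-8b

# Convert to an FP16 GGUF, then quantize down to Q4_K_M
python convert_hf_to_gguf.py llama31-8b --outfile llama31-8b-f16.gguf --outtype f16
./llama-quantize llama31-8b-f16.gguf llama31-8b-Q4_K_M.gguf Q4_K_M
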

Benchmarks

Real measurements on real hardware. Numbers ship with the runner version, quant, and date.

1 run on record
Hardware | Conf. | Quant | Ctx | Tokens/sec | VRAM | TTFT | Date
NVIDIA GeForce RTX 4090 (Ollama) | M | Q4_K_M | 8K | 104.7 tok/s | 5.4 GB | 78 ms | Apr 22, 2026
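
To add a comparable run of your own, Ollama's --verbose flag prints prompt-eval and generation rates after each reply, and ollama ps shows how much of the model is resident on the GPU. A minimal sketch:

# Prints prompt-eval and eval rates (tok/s) plus timings after each reply
ollama run llama3.1:8b-instruct-q4_K_M --verbose

# In another terminal: loaded models and their CPU/GPU split
ollama ps
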

Hardware that runs this

Cards with enough VRAM for at least one quantization of Llama 3.1 8B Instruct.

Compare alternatives

Models worth comparing

Same parameter band, plus what's one tier above and below — so you can decide what actually fits your hardware.

Frequently asked

What's the minimum VRAM to run Llama 3.1 8B Instruct?

6 GB of VRAM is enough to run Llama 3.1 8B Instruct at the Q4_K_M quantization (file size 4.9 GB). Higher-quality quantizations need more.

Can I use Llama 3.1 8B Instruct commercially?

Yes — Llama 3.1 8B Instruct ships under the Llama 3.1 Community License, which permits commercial use. Always read the license text before deployment.

What's the context length of Llama 3.1 8B Instruct?

Llama 3.1 8B Instruct supports a context window of 131,072 tokens (i.e., 128K).

How do I install Llama 3.1 8B Instruct with Ollama?

Run `ollama pull llama3.1:8b` to download, then `ollama run llama3.1:8b` to start a chat session. The default quantization is Q4_K_M.

Source: huggingface.co/meta-llama/Meta-Llama-3.1-8B-Instruct

Reviewed by RunLocalAI Editorial. See our editorial policy for how we research and verify model claims.