Llama 4 Scout

Meta's 2025 flagship MoE model: 109B total parameters, only 17B active per forward pass, and a 10-million-token context window that Meta described as industry-leading at launch. Built for long-document workflows, RAG over entire codebases, and continuous-context agents.

License: Llama 4 Community License · Released Apr 5, 2025 · Context: 10,000,000 tokens

Positioning

Llama 4 Scout is Meta's "small flagship" of the new generation — natively multimodal, MoE architecture (109B total, 17B active), and the first Llama with a serious long-context story. It's the Llama 3.3 70B replacement for users with ~64 GB VRAM or unified memory.

Strengths

Native vision-language — single model handles image+text without a separate adapter, unlike Llama 3.2 11B Vision's bolted-on approach.
MoE active parameters (17B) keep tokens/sec respectable at flagship quality — ~30–38 tok/s on a 4090 at Q4 with offload.
Architectural long context that genuinely works further than 32K — recall stays competitive into 100K territory in practice.

Limitations

109B total params mean Q4 is ~62 GB — needs dual high-VRAM cards, an A6000-class workstation, or Apple Silicon with 96 GB+ unified memory.
License added new clauses vs Llama 3 — review the AUP if you ship at scale.
Vision quality is solid but not best-in-class — Pixtral and Qwen 2.5 VL still edge it on dense OCR and chart understanding.

Real-world performance on RTX 4090

Q4_K_M (62 GB) — heavy offload required: 8–14 tok/s, only practical with 64 GB+ system RAM
Q5_K_M (74 GB) — workstation only
Q8_0 (~110 GB) — Mac Studio territory
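
The gap between "fits in VRAM" and "heavy offload" can be sanity-checked with a rough memory-bandwidth bound: during decoding, each token has to read roughly the active weights once, so throughput is capped at about bandwidth divided by bytes of active weights. The sketch below is an approximation under stated assumptions, not a benchmark. The bandwidth figures are illustrative placeholders, the 4.77 bits/weight is derived from a 65.0 GB Q4_K_M file over 109B parameters, and the function name is hypothetical. It ignores KV-cache reads, compute time, and PCIe transfers.

```python
def decode_tokens_per_sec_bound(active_params: float,
                                bits_per_weight: float,
                                bandwidth_gb_s: float) -> float:
    """Rough upper bound on decode speed.

    Assumes every active weight is read from memory once per generated token
    and that memory bandwidth is the only bottleneck.
    """
    bytes_per_token = active_params * bits_per_weight / 8
    return bandwidth_gb_s * 1e9 / bytes_per_token


ACTIVE_PARAMS = 17e9   # active parameters per forward pass (from the model description)
Q4_BITS = 4.77         # assumed effective bits/weight: 65.0 GB * 8 / 109B

# Bandwidth values are illustrative assumptions, not measurements.
scenarios = [
    ("weights served from system RAM (~80 GB/s assumed)", 80.0),
    ("weights served from GPU VRAM (~1000 GB/s assumed)", 1000.0),
]

for label, bandwidth in scenarios:
    bound = decode_tokens_per_sec_bound(ACTIVE_PARAMS, Q4_BITS, bandwidth)
    print(f"{label}: at most ~{bound:.0f} tok/s")
```

With most of the weights in system RAM, the bound lands in the single digits, which is the same order of magnitude as the offloaded Q4_K_M figure above. Real throughput depends on how many layers sit on the GPU and on which experts each token activates.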

Should you run this locally?

Yes, for workstation rigs (dual 4090, A6000, RTX 6000 Ada) and high-RAM Mac Studios. Excellent native multimodal model. No, for single-card consumer setups — at Q4 you're CPU-offloaded; at lower quants, quality erodes faster than usual on MoE.

How it compares

vs Llama 3.3 70B → Scout is multimodal and has better architectural long context; Llama 3.3 70B is faster on a single 24 GB card. Pick Scout if you have the memory and want vision; otherwise stick with 3.3 70B.
vs Llama 4 Maverick → Maverick is the bigger sibling (400B/17B active). Same active compute but Maverick has a much larger expert pool — better quality if you can afford the disk + memory.
vs Qwen 2.5 VL 72B → Qwen 2.5 VL is stronger on dense visual reasoning; Scout is more usable as a general assistant. Different jobs.

Run this yourself

ollama pull llama4:scout
ollama run llama4:scout

Settings: Q4_K_M GGUF, 16384 ctx, --n-gpu-layers 30 of 49, RTX 4090 + 64 GB DDR5
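
Once the model is pulled, the same settings can be passed per request through Ollama's local REST API instead of the interactive CLI. The following is a minimal sketch, assuming the Ollama server is running on its default port (11434) and that the `num_ctx` and `num_gpu` options correspond to the context size and GPU layer count listed above. The helper names `build_generate_payload` and `generate` are hypothetical.

```python
import json
import urllib.request

# Default local endpoint of an Ollama server (assumes a stock install).
OLLAMA_URL = "http://localhost:11434/api/generate"


def build_generate_payload(prompt: str, num_ctx: int = 16384, num_gpu: int = 30) -> dict:
    """Build the JSON body for a single non-streaming generation request."""
    return {
        "model": "llama4:scout",
        "prompt": prompt,
        "stream": False,  # return one JSON object instead of a token stream
        "options": {
            "num_ctx": num_ctx,  # context window in tokens
            "num_gpu": num_gpu,  # layers to place on the GPU (assumed mapping)
        },
    }


def generate(prompt: str) -> str:
    """Send the request to the local server and return the generated text."""
    body = json.dumps(build_generate_payload(prompt)).encode("utf-8")
    request = urllib.request.Request(
        OLLAMA_URL,
        data=body,
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(request) as response:
        return json.loads(response.read())["response"]
```

Calling `generate("Summarize this repository's README.")` with the server running returns the completion as a string. Raising `num_ctx` increases KV-cache memory, so it is worth increasing gradually on an offloaded setup.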

Quantization	File size	VRAM required
Q4_K_M	65.0 GB	80 GB
Q5_K_M	78.0 GB	95 GB
FP16	218.0 GB	240 GB
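
The file sizes in this table follow from simple arithmetic: total parameters times bits per weight, divided by 8. The sketch below works in both directions, assuming decimal gigabytes (1 GB = 10^9 bytes) and the 109B total parameter count. The function names are hypothetical, and the "VRAM required" column is not modeled because it also includes runtime overhead such as the KV cache.

```python
def quantized_size_gb(total_params: float, bits_per_weight: float) -> float:
    """Approximate weight file size in decimal GB for a given bit width."""
    return total_params * bits_per_weight / 8 / 1e9


def effective_bits_per_weight(file_size_gb: float, total_params: float) -> float:
    """Invert the size formula to get the average bits stored per weight."""
    return file_size_gb * 1e9 * 8 / total_params


TOTAL_PARAMS = 109e9  # total parameters, from the model description

# File sizes copied from the quantization table.
file_sizes_gb = {"Q4_K_M": 65.0, "Q5_K_M": 78.0, "FP16": 218.0}

for name, size_gb in file_sizes_gb.items():
    bpw = effective_bits_per_weight(size_gb, TOTAL_PARAMS)
    print(f"{name}: {size_gb:.1f} GB is about {bpw:.2f} bits per weight")

# Forward direction: 16-bit weights for 109B parameters.
print(f"FP16 estimate: {quantized_size_gb(TOTAL_PARAMS, 16):.1f} GB")
```

The K-quant formats come out above their nominal 4 or 5 bits because they store scaling metadata and keep some tensors at higher precision.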

Frequently asked

What's the minimum VRAM to run Llama 4 Scout?

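About 80 GB of VRAM is enough to run Llama 4 Scout entirely on GPU at the Q4_K_M quantization (file size 65.0 GB). Higher-quality quantizations need more. With less VRAM, part of the model must be offloaded to system RAM, which costs substantial speed.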

Can I use Llama 4 Scout commercially?

Yes — Llama 4 Scout ships under the Llama 4 Community License, which permits commercial use. Always read the license text before deployment.

What's the context length of Llama 4 Scout?

Llama 4 Scout supports a context window of 10,000,000 tokens (10M).

How do I install Llama 4 Scout with Ollama?

Run `ollama pull llama4:scout` to download, then `ollama run llama4:scout` to start a chat session. The default quantization is Q4_K_M.
