Llama 4 Scout
Llama 4 Scout is Meta's "small flagship" of the new generation — natively multimodal, MoE architecture (109B total, 17B active), and the first Llama with a serious long-context story. It's the Llama 3.3 70B replacement for users with ~64 GB VRAM or unified memory.
Strengths
- Native vision-language — a single model handles image+text without a separate adapter, unlike Llama 3.2 11B Vision's bolted-on approach.
- MoE routing keeps only 17B parameters active per token, so generation speed is respectable for flagship quality — expect roughly 8–14 tok/s on a 4090 at Q4 with CPU offload.
- Architectural long context that genuinely works well past 32K — recall stays competitive into 100K territory in practice.
Weaknesses
- 109B total params mean Q4 is ~65 GB — you need dual high-VRAM cards, an A6000-class workstation, or Apple Silicon with 96 GB+ unified memory.
- The license adds new clauses vs Llama 3 — review the AUP if you ship at scale.
- Vision quality is solid but not best-in-class — Pixtral and Qwen 2.5 VL still edge it on dense OCR and chart understanding.
Quantization picks
- Q4_K_M (65 GB) — heavy offload required: 8–14 tok/s, only practical with 64 GB+ system RAM
- Q5_K_M (78 GB) — workstation only
- Q8_0 (~110 GB) — Mac Studio territory
Verdict
Yes for workstation rigs (dual 4090, A6000, RTX 6000 Ada) and high-RAM Mac Studios — an excellent native multimodal model. No for single-card consumer setups: at Q4 you're CPU-offloaded, and at lower quants quality erodes faster than usual on MoE.
How it compares
- vs Llama 3.3 70B → Scout is multimodal and has better architectural long context; Llama 3.3 70B is faster on a single 24 GB card. Pick Scout if you have the memory and want vision; otherwise stick with 3.3 70B.
- vs Llama 4 Maverick → Maverick is the bigger sibling (400B/17B active). Same active compute but Maverick has a much larger expert pool — better quality if you can afford the disk + memory.
- vs Qwen 2.5 VL 72B → Qwen 2.5 VL is stronger on dense visual reasoning; Scout is more usable as a general assistant. Different jobs.
ollama pull llama4:scout
ollama run llama4:scout
Settings: Q4_K_M GGUF, 16384 ctx, --n-gpu-layers 30 of 49, RTX 4090 + 64 GB DDR5
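For reference, a minimal llama.cpp invocation that approximates these settings — the GGUF filename is illustrative, and the prompt is just a placeholder:

```bash
# Roughly the reviewed setup, via llama.cpp. Point -m at whichever
# Q4_K_M GGUF you actually downloaded (filename here is illustrative).
./llama-cli \
  -m Llama-4-Scout-17B-16E-Instruct-Q4_K_M.gguf \
  -c 16384 \
  -ngl 30 \
  -p "Summarize the following report in five bullet points: ..."
# -c sets the context window (16384, per the settings above);
# -ngl 30 offloads 30 of 49 layers to the GPU, rest stays in RAM.
```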
Why this rating
8.4/10 — the smallest Llama 4 is the model most local users will actually run, with native multimodality and a 10M-context architecture. Loses points only because real-world recall over the full advertised context is still imperfect.
Overview
The smallest member of Meta's 2025 Llama 4 MoE family: 109B total parameters with only 17B active per forward pass, plus a 10-million-token context window — the longest of any production model. Built for long-document workflows, RAG over entire codebases, and continuous-context agents.
Strengths
- 10M token context (industry-leading)
- Efficient MoE — runs at 17B-active speed
- Strong tool/function calling
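A quick sketch of exercising that tool calling through Ollama's /api/chat endpoint — the get_weather function is a made-up example, not a real API:

```bash
# Hedged tool-calling request against a local Ollama server.
# get_weather is a hypothetical function used only for illustration.
curl http://localhost:11434/api/chat -d '{
  "model": "llama4:scout",
  "stream": false,
  "messages": [{"role": "user", "content": "Weather in Paris today?"}],
  "tools": [{
    "type": "function",
    "function": {
      "name": "get_weather",
      "description": "Get the current weather for a city",
      "parameters": {
        "type": "object",
        "properties": {"city": {"type": "string"}},
        "required": ["city"]
      }
    }
  }]
}'
```

If the model decides to call the tool, the response's message carries a tool_calls array instead of plain text; your code runs the function and feeds the result back as a follow-up message.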
Weaknesses
- Total weights still need 65 GB+ of VRAM or unified memory at Q4
- Long-context attention is RAM-hungry — the KV cache alone can rival a GPU's entire VRAM (back-of-envelope math below)
- Newer than Llama 3.x — less ecosystem battle-testing
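To see why long context is the binding constraint, here's a back-of-envelope KV-cache calculation. The layer and head counts are assumptions for illustration, not confirmed Scout hyperparameters — check the model's config.json for the real values:

```bash
# Back-of-envelope KV-cache size at fp16. Layer/head/dim values are
# ASSUMED for illustration; verify against the model's config.json.
awk 'BEGIN {
  layers = 48; kv_heads = 8; head_dim = 128   # assumed hyperparameters
  ctx = 131072; bytes = 2                     # 128K context, fp16 cache
  gb = 2 * layers * kv_heads * head_dim * ctx * bytes / 1e9
  printf "KV cache at %dK ctx: ~%.0f GB\n", ctx / 1024, gb
}'
```

Under those assumptions that's ~26 GB of cache at just 128K context, before any weights — at any meaningful fraction of the 10M window, the cache, not the weights, is what eats your memory.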
Quantization variants
Each quantization trades model quality for file size and VRAM. Q4_K_M is the most popular starting point; the arithmetic behind the sizes is sketched below the table.
| Quantization | File size | VRAM required |
|---|---|---|
| Q4_K_M | 65.0 GB | 80 GB |
| Q5_K_M | 78.0 GB | 95 GB |
| FP16 | 218.0 GB | 240 GB |
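The file sizes follow directly from total parameter count times average bits per weight. The bits-per-weight figures here are approximate averages for each quant, used only to show the arithmetic:

```bash
# File size ≈ params × bits-per-weight ÷ 8. Bits-per-weight values
# are approximate averages for each quant scheme.
awk 'BEGIN {
  params = 109e9
  n = split("Q4_K_M:4.8 Q5_K_M:5.7 FP16:16", quants, " ")
  for (i = 1; i <= n; i++) {
    split(quants[i], q, ":")
    printf "%-7s ~%.0f GB\n", q[1], params * q[2] / 8 / 1e9
  }
}'
# Prints ~65 GB, ~78 GB, and ~218 GB — matching the table above.
```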
Get the model
Ollama
One-line install
ollama run llama4:scout
Read our Ollama review →
HuggingFace
Original weights
Source repository with the original weights — you quantize them yourself (one possible path is sketched below).
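A plausible conversion path using llama.cpp's stock tooling — directory names and output filenames are illustrative:

```bash
# From original HF weights to a local Q4_K_M GGUF. Paths illustrative.
git clone https://github.com/ggml-org/llama.cpp && cd llama.cpp
pip install -r requirements.txt
# (build llama.cpp first so the llama-quantize binary exists)
python convert_hf_to_gguf.py /path/to/Llama-4-Scout-17B-16E-Instruct \
  --outfile scout-f16.gguf --outtype f16
./llama-quantize scout-f16.gguf scout-Q4_K_M.gguf Q4_K_M
```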
Hardware that runs this
Cards with enough VRAM for at least one quantization of Llama 4 Scout.
Models worth comparing
Same parameter band, plus what's one tier above and below — so you can decide what actually fits your hardware.
Frequently asked
What's the minimum VRAM to run Llama 4 Scout?
Roughly 80 GB for fully GPU-resident Q4_K_M inference (see the table above). A single 24 GB card works with CPU offload and 64 GB+ of system RAM, at about 8–14 tok/s.
Can I use Llama 4 Scout commercially?
Yes, under the Llama 4 Community License — but it adds new clauses versus Llama 3, so review the license and AUP before shipping at scale.
What's the context length of Llama 4 Scout?
10 million tokens architecturally, though real-world recall over the full advertised window is still imperfect; it stays competitive into 100K territory in practice.
How do I install Llama 4 Scout with Ollama?
ollama pull llama4:scout, then ollama run llama4:scout. To raise the context window, see the Modelfile sketch below.
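Ollama defaults to a modest context window regardless of what the model supports. A minimal Modelfile sketch for raising it — the num_ctx value is an example; size it to your RAM (see the KV-cache note above):

```bash
# Create a long-context variant of the stock tag. 65536 is an example
# value, not a recommendation — KV cache grows linearly with num_ctx.
cat > Modelfile <<'EOF'
FROM llama4:scout
PARAMETER num_ctx 65536
EOF
ollama create scout-long -f Modelfile
ollama run scout-long
```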
Source: huggingface.co/meta-llama/Llama-4-Scout-17B-16E-Instruct
Reviewed by RunLocalAI Editorial. See our editorial policy for how we research and verify model claims.