Llama 4 Scout
Meta's 2025 flagship MoE model: 109B total parameters with only 17B active per forward pass, and a record 10-million-token context window that is unmatched in production at any tier. Built for long-document workflows, RAG over entire codebases, and continuous-context agents.
Positioning
Llama 4 Scout is Meta's "small flagship" of the new generation — natively multimodal, MoE architecture (109B total, 17B active), and the first Llama with a serious long-context story. It's the Llama 3.3 70B replacement for users with ~64 GB VRAM or unified memory.
Strengths
- Native vision-language — single model handles image+text without a separate adapter, unlike Llama 3.2 11B Vision's bolted-on approach.
- MoE active parameters (17B) keep tokens/sec respectable at flagship quality — ~30–38 tok/s on a 4090 at Q4 with offload.
- Long context that holds up well past 32K: in practice, recall stays competitive into the 100K range.
Limitations
- 109B total params mean Q4 is ~62 GB — needs dual high-VRAM cards, an A6000-class workstation, or Apple Silicon with 96 GB+ unified memory.
- The license adds new clauses relative to Llama 3; review the acceptable use policy (AUP) if you ship at scale.
- Vision quality is solid but not best-in-class — Pixtral and Qwen 2.5 VL still edge it on dense OCR and chart understanding.
Real-world performance on RTX 4090
- Q4_K_M (62 GB) — heavy offload required: 8–14 tok/s, only practical with 64 GB+ system RAM
- Q5_K_M (74 GB) — workstation only
- Q8_0 (~110 GB) — Mac Studio territory
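The sizes listed above follow from simple arithmetic: file size is roughly total parameters times average bits per weight, divided by 8. The sketch below backs out the effective bits per weight implied by each listed size, assuming the 109B total-parameter figure from this review and decimal gigabytes. The function names are illustrative, not part of any tool.

```python
TOTAL_PARAMS = 109e9  # total parameter count quoted in this review


def effective_bits_per_weight(file_size_gb: float, params: float = TOTAL_PARAMS) -> float:
    """Average bits stored per weight for a file of the given size (decimal GB)."""
    return file_size_gb * 1e9 * 8 / params


def weights_size_gb(bits_per_weight: float, params: float = TOTAL_PARAMS) -> float:
    """Approximate weight-file size in decimal GB for a given average bit width."""
    return params * bits_per_weight / 8 / 1e9


# Sizes as listed in the section above.
listed_sizes_gb = {"Q4_K_M": 62, "Q5_K_M": 74, "Q8_0": 110}

for name, size_gb in listed_sizes_gb.items():
    bpw = effective_bits_per_weight(size_gb)
    print(f"{name}: {size_gb} GB is about {bpw:.2f} bits per weight")
```

The same formula run forward gives a quick sanity check on any quantization you have not downloaded yet: plug in the expected bits per weight and compare the result against your combined VRAM and system RAM.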
Should you run this locally?
Yes, for workstation rigs (dual 4090, A6000, RTX 6000 Ada) and high-RAM Mac Studios. Excellent native multimodal model. No, for single-card consumer setups — at Q4 you're CPU-offloaded; at lower quants, quality erodes faster than usual on MoE.
How it compares
- vs Llama 3.3 70B → Scout is multimodal and has better architectural long context; Llama 3.3 70B is faster on a single 24 GB card. Pick Scout if you have the memory and want vision; otherwise stick with 3.3 70B.
- vs Llama 4 Maverick → Maverick is the bigger sibling (400B/17B active). Same active compute but Maverick has a much larger expert pool — better quality if you can afford the disk + memory.
- vs Qwen 2.5 VL 72B → Qwen 2.5 VL is stronger on dense visual reasoning; Scout is more usable as a general assistant. Different jobs.
Run this yourself
ollama pull llama4:scout
ollama run llama4:scout
Settings: Q4_K_M GGUF, 16384 ctx, --n-gpu-layers 30 of 49, RTX 4090 + 64 GB DDR5
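Once the model is pulled, it can also be queried programmatically through Ollama's local REST API. The following is a minimal sketch using only the Python standard library. It assumes an Ollama server is running on the default port 11434 and reuses the 16384-token context from the settings above; the helper names `build_request` and `generate` are illustrative.

```python
import json
import urllib.request

OLLAMA_URL = "http://localhost:11434/api/generate"  # default local Ollama endpoint


def build_request(prompt: str, model: str = "llama4:scout",
                  num_ctx: int = 16384) -> urllib.request.Request:
    """Build a non-streaming generate request for a local Ollama server."""
    payload = {
        "model": model,
        "prompt": prompt,
        "stream": False,                  # return one JSON object instead of chunks
        "options": {"num_ctx": num_ctx},  # context length, matching the settings above
    }
    return urllib.request.Request(
        OLLAMA_URL,
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def generate(prompt: str) -> str:
    """Send the prompt and return the generated text (needs a running server)."""
    with urllib.request.urlopen(build_request(prompt)) as resp:
        return json.loads(resp.read())["response"]


# Usage, with `ollama serve` running and the model already pulled:
#   print(generate("Summarize the trade-offs of MoE models in two sentences."))
```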
Why this rating
8.4/10 — the smallest Llama 4 is the model most local users will actually run, with native multimodality and a 10M-context architecture. Loses points only because real-world recall over the full advertised context is still imperfect.
Featured in this stack
The L3 execution stacks that pick this model as a recommended component, with the one-line note explaining the role it plays in each.
- Stack: Build a local vision-model stack (May 2026) · L3 · Workstation tier · Role: Primary multimodal model (text + vision)
Llama 4 Scout is the multimodal flagship in the open Llama 4 family. Strong image understanding combined with the same reasoning quality the text-only Llama 4 line delivers. The pick when you need image-grounded analysis at frontier-tier quality.
Execution notes
Operator notes
Llama 4 Scout is the production multimodal flagship in the Llama 4 line for the workstation-cluster tier (datacenter-class deployment, not consumer-tier). It is released under the Llama 4 Community License, which is not Apache-equivalent: it carries an acceptable use policy and additional clauses that matter if you ship at scale. Strong on image-text reasoning at the workstation-cluster scale; the right pick when you need multimodal capability above the consumer Pixtral / Qwen-VL tier but don't need Maverick's frontier-cluster cost.
Deployment notes
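The /stacks/local-vision-model recipe pairs this model with vLLM on RTX 4090 using AWQ-INT4. Treat that configuration as very tight: 4-bit weights for a 109B-parameter model come to roughly 62 GB, well beyond a single 24 GB card, so it depends on multiple cards or offload and on carefully managed vision-token headroom. Production-tier deployment is 2x A100 80GB or 1x H100, with comfortable headroom for multi-image queries.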
For consumer-tier multimodal, drop to Pixtral 12B or Qwen 2.5-VL 7B — same multimodal capability shape, smaller hardware envelope.
For frontier-tier multimodal, Llama 4 Maverick is the parent — better image understanding at the multi-node cluster cost.
Runtime compatibility
- vLLM ✓ excellent. Vision-language support landed in v0.7+; multi-image inputs handled cleanly.
- SGLang ✓ good. Vision support younger than vLLM's — verify on your specific multi-image patterns.
- Ollama ✓ partial. Vision-model support landed but vLLM has the production lead for multimodal workloads.
- TensorRT-LLM ✗ partial. Multimodal support exists but the per-model recompile friction is severe for vision.
- MLX-LM ✓ partial. Apple Silicon multimodal path is younger; Pixtral and Qwen 2.5-VL have stronger MLX integration.
Vision token economics
Vision-language models tokenize images as long sequences of vision tokens — Llama 4 Scout uses approximately 512 vision tokens per 1024×1024 image at default resolution. Multi-image queries (5 images at 512 tokens each) consume 2560 tokens of context just on images, before the user prompt or model response. Plan KV-cache budget accordingly.
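The budget arithmetic above can be sketched as a small helper. It assumes the approximate 512-tokens-per-image figure quoted here; the function name and the example prompt, response, and context sizes are illustrative, not measured values.

```python
VISION_TOKENS_PER_IMAGE = 512  # approximate figure quoted above for a 1024x1024 image


def context_budget(num_images: int, prompt_tokens: int, max_new_tokens: int,
                   context_window: int,
                   tokens_per_image: int = VISION_TOKENS_PER_IMAGE) -> dict:
    """Estimate how much of the context window a multimodal query consumes."""
    image_tokens = num_images * tokens_per_image
    used = image_tokens + prompt_tokens + max_new_tokens
    return {
        "image_tokens": image_tokens,
        "used": used,
        "remaining": context_window - used,
        "fits": used <= context_window,
    }


# Five images, a 300-token prompt, and a 512-token reply in a 16384-token window.
budget = context_budget(num_images=5, prompt_tokens=300,
                        max_new_tokens=512, context_window=16384)
print(budget)
```

Running the check before dispatching a request makes it easy to reject or split oversized multi-image queries instead of discovering the overflow at inference time.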
Best use cases
- Production multimodal serving at the workstation-cluster tier — image-Q&A, document understanding, visual reasoning.
- Multi-image queries — the model handles 5-10 images per query cleanly.
- Document layout + OCR + reasoning combined — strong on mixed-content workloads.
- Llama-ecosystem migration path — drop-in for teams already on Llama 3 with multimodal needs.
When to use a different model
- Consumer-tier multimodal (16-24GB VRAM): use Pixtral 12B or Qwen 2.5-VL 7B.
- Frontier-tier multimodal: use Llama 4 Maverick — same family, larger.
- Apple Silicon multimodal: Pixtral 12B has stronger MLX integration today.
- OCR-first workloads: dedicated OCR models (Florence-2, MiniCPM-V) often beat general VLMs at text extraction.
- Apache 2.0 license required: Pixtral 12B or Qwen 2.5-VL 7B, both released under Apache 2.0 (the 72B Qwen 2.5-VL variant ships under Qwen's own license).
Failure modes specific to this model
- OOM on multi-image queries. A 5-image query blows past KV-cache budgets sized for text-only workloads. Lower `--max-num-seqs` to 4 or 2.
- Image-format mismatch. Some vision models require RGB; obscure formats (TIFF, RAW) fail. Pre-convert client-side.
- Resolution silently downsampled. Default vision encoder downsamples; high-detail tasks (small text OCR) need explicit higher-resolution model variants.
- Mixed-modality tool-call format. Some agent harnesses can't handle text+image content blocks correctly.
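To see why lowering `--max-num-seqs` relieves the multi-image OOM case, it helps to estimate the worst-case KV-cache footprint per concurrent sequence. The sketch below is a rough upper bound under stated assumptions: the layer count, KV-head count, and head dimension are illustrative values that should be confirmed against the model's config.json, the cache is taken to be 16-bit, and every layer is treated as full attention.

```python
# Assumed architecture values for illustration; confirm against the model's config.json.
NUM_LAYERS = 48
NUM_KV_HEADS = 8
HEAD_DIM = 128
BYTES_PER_VALUE = 2  # 16-bit (FP16/BF16) cache entries


def kv_bytes_per_token() -> int:
    """Bytes of KV cache per token; the factor of 2 covers keys and values."""
    return 2 * NUM_LAYERS * NUM_KV_HEADS * HEAD_DIM * BYTES_PER_VALUE


def kv_cache_gib(tokens_per_seq: int, num_seqs: int) -> float:
    """Worst-case KV-cache size in GiB if every sequence fills its token budget."""
    return kv_bytes_per_token() * tokens_per_seq * num_seqs / 2**30


# Compare concurrency limits at a 16384-token budget per sequence.
for seqs in (8, 4, 2):
    print(f"max-num-seqs={seqs}: {kv_cache_gib(16384, seqs):.1f} GiB")
```

Because the footprint scales linearly with the number of concurrent sequences, halving the concurrency limit halves the worst case, which is usually the quickest fix when image tokens push each sequence longer than a text-only sizing assumed.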
Going deeper
- /stacks/local-vision-model — the canonical deployment recipe
- Llama 4 Maverick — the parent / larger family member
- Pixtral 12B, Qwen 2.5-VL 7B — consumer-tier multimodal alternatives
- vLLM operational review — multimodal serving runtime
Family & lineage
How this model relates to others in its lineage. Family members share architecture and training-data roots; parent / children edges record direct distillation or fine-tune relationships.
Strengths
- 10M token context (industry-leading)
- Efficient MoE — runs at 17B-active speed
- Strong tool/function calling
Weaknesses
- Total weights still need 65GB+ VRAM at Q4
- Long-context attention is RAM-hungry
- Newer than Llama 3.x — less ecosystem battle-testing
Quantization variants
Each quantization trades model quality for file size and VRAM. Q4_K_M is the most popular starting point.
| Quantization | File size | VRAM required |
|---|---|---|
| Q4_K_M | 65.0 GB | 80 GB |
| Q5_K_M | 78.0 GB | 95 GB |
| FP16 | 218.0 GB | 240 GB |
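The table can be turned into a quick fit check. This is a minimal sketch that copies the table's figures verbatim and returns the highest-quality variant whose stated VRAM requirement fits a given budget; the function name is illustrative, and combined VRAM across cards or unified memory is treated as a single number.

```python
# (name, file size in GB, VRAM required in GB), ordered from highest quality down.
QUANTS = [
    ("FP16", 218.0, 240),
    ("Q5_K_M", 78.0, 95),
    ("Q4_K_M", 65.0, 80),
]


def best_quant_for(vram_gb: float):
    """Return the highest-quality quantization that fits, or None if nothing does."""
    for name, _file_size_gb, vram_required_gb in QUANTS:
        if vram_gb >= vram_required_gb:
            return name
    return None


for budget_gb in (48, 80, 96, 256):
    print(f"{budget_gb} GB -> {best_quant_for(budget_gb)}")
```

The gap between file size and required VRAM in the table (15 to 22 GB) is the allowance for KV cache and runtime overhead, so long contexts or multi-image workloads may need more than the listed minimum.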
Get the model
Ollama
One-line install
ollama run llama4:scout
HuggingFace
Original weights
Source repository — direct quantization required.
Frequently asked
What's the minimum VRAM to run Llama 4 Scout?
Can I use Llama 4 Scout commercially?
What's the context length of Llama 4 Scout?
How do I install Llama 4 Scout with Ollama?
Source: huggingface.co/meta-llama/Llama-4-Scout-17B-16E-Instruct
Reviewed by RunLocalAI Editorial. See our editorial policy for how we research and verify model claims.