llama
109B parameters
Commercial OK
Reviewed June 2026

Llama 4 Scout

Meta's smallest Llama 4 MoE model: 109B total parameters with only 17B active per forward pass, and a 10-million-token context window that Meta describes as industry-leading. Built for long-document workflows, RAG over entire codebases, and continuous-context agents.

License: Llama 4 Community License · Released Apr 5, 2025 · Context: 10,000,000 tokens

Our verdict

Fredoline Eruo · Verified Jun 12, 2026
8.4/10

Positioning

Llama 4 Scout is Meta's "small flagship" of the new generation — natively multimodal, MoE architecture (109B total, 17B active), and the first Llama with a serious long-context story. It's the Llama 3.3 70B replacement for users with ~64 GB VRAM or unified memory.

Strengths

  • Native vision-language — single model handles image+text without a separate adapter, unlike Llama 3.2 11B Vision's bolted-on approach.
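  • MoE active parameters (17B) keep per-token compute close to that of a 17B dense model, so throughput stays usable for a model of this total size: 8–14 tok/s on a 4090 at Q4 with heavy offload (see the RTX 4090 numbers below).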
  • Long context that holds up well beyond 32K: recall stays competitive out to roughly 100K tokens in practice.

Limitations

  • 109B total params mean Q4 is ~62 GB — needs dual high-VRAM cards, an A6000-class workstation, or Apple Silicon with 96 GB+ unified memory.
  • License added new clauses vs Llama 3 — review the AUP if you ship at scale.
  • Vision quality is solid but not best-in-class — Pixtral and Qwen 2.5 VL still edge it on dense OCR and chart understanding.

Real-world performance on RTX 4090

  • Q4_K_M (62 GB) — heavy offload required: 8–14 tok/s, only practical with 64 GB+ system RAM
  • Q5_K_M (74 GB) — workstation only
  • Q8_0 (~110 GB) — Mac Studio territory

Should you run this locally?

Yes, for workstation rigs (dual 4090, A6000, RTX 6000 Ada) and high-RAM Mac Studios. Excellent native multimodal model. No, for single-card consumer setups — at Q4 you're CPU-offloaded; at lower quants, quality erodes faster than usual on MoE.

How it compares

  • vs Llama 3.3 70B → Scout is multimodal and has better architectural long context; Llama 3.3 70B is faster on a single 24 GB card. Pick Scout if you have the memory and want vision; otherwise stick with 3.3 70B.
  • vs Llama 4 Maverick → Maverick is the bigger sibling (400B/17B active). Same active compute but Maverick has a much larger expert pool — better quality if you can afford the disk + memory.
  • vs Qwen 2.5 VL 72B → Qwen 2.5 VL is stronger on dense visual reasoning; Scout is more usable as a general assistant. Different jobs.

Run this yourself

ollama pull llama4:scout
ollama run llama4:scout
Settings: Q4_K_M GGUF, 16384 ctx, --n-gpu-layers 30 of 49, RTX 4090 + 64 GB DDR5
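
For scripted use, the same settings can be passed through Ollama's local HTTP API instead of the CLI. The following is a minimal sketch, assuming an Ollama server on its default port (11434) and assuming that Ollama's `num_ctx` and `num_gpu` options correspond to the context size and GPU layer count listed above; the helper names `build_request` and `generate` are illustrative.

```python
import json
import urllib.request

OLLAMA_URL = "http://localhost:11434/api/generate"  # assumed default local endpoint


def build_request(prompt: str, num_ctx: int = 16384, num_gpu: int = 30) -> dict:
    """Build a non-streaming generate payload mirroring the settings above."""
    return {
        "model": "llama4:scout",
        "prompt": prompt,
        "stream": False,
        # Assumption: num_ctx = context length, num_gpu = layers offloaded to GPU.
        "options": {"num_ctx": num_ctx, "num_gpu": num_gpu},
    }


def generate(prompt: str) -> str:
    """Send the request to a running Ollama server and return the reply text."""
    body = json.dumps(build_request(prompt)).encode("utf-8")
    req = urllib.request.Request(
        OLLAMA_URL, data=body, headers={"Content-Type": "application/json"}
    )
    with urllib.request.urlopen(req) as resp:
        return json.loads(resp.read())["response"]
```

With the server running and the model pulled, `generate("Describe MoE routing in two sentences.")` would return the model's reply as a string. Lower `num_gpu` if the load fails with an out-of-memory error.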

Why this rating

8.4/10 — the smallest Llama 4 is the model most local users will actually run, with native multimodality and a 10M-context architecture. Loses points only because real-world recall over the full advertised context is still imperfect.

Featured in this stack

The L3 execution stacks that pick this model as a recommended component, with the one-line note explaining the role it plays in each.

  • Stack · L3 · Workstation tier · Role: Primary multimodal model (text + vision)
    Build a local vision-model stack (May 2026)

    Llama 4 Scout is the multimodal flagship in the open Llama 4 family. Strong image understanding combined with the same reasoning quality the text-only Llama 4 line delivers. The pick when you need image-grounded analysis at frontier-tier quality.

Execution notes

Operator notes

Llama 4 Scout is the production multimodal model in the Llama 4 line for the workstation-cluster tier (multi-GPU workstation or datacenter-class deployment, not consumer-tier). It ships under the Llama 4 Community License, which permits commercial use but is not Apache 2.0 and carries an acceptable use policy. Strong on image-text reasoning at this scale; the right pick when you need multimodal capability above the consumer Pixtral / Qwen-VL tier but don't need Maverick's frontier-cluster cost.

Deployment notes

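The /stacks/local-vision-model recipe pairs this model with vLLM on RTX 4090 hardware using AWQ-INT4. This is tight: 109B parameters at 4 bits is roughly 55 GB of weights before KV cache and vision tokens, so a single 24 GB card cannot hold them without offload or additional GPUs, and vision-token headroom has to be managed carefully. Production-tier deployment is 2x A100 80GB or 1x H100, with comfortable headroom for multi-image queries.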

For consumer-tier multimodal, drop to Pixtral 12B or Qwen 2.5-VL 7B — same multimodal capability shape, smaller hardware envelope.

For frontier-tier multimodal, Llama 4 Maverick is the parent — better image understanding at the multi-node cluster cost.

Runtime compatibility

  • vLLM ✓ excellent. Vision-language support landed in v0.7+; multi-image inputs handled cleanly.
  • SGLang ✓ good. Vision support younger than vLLM's — verify on your specific multi-image patterns.
  • Ollama ✓ partial. Vision-model support landed but vLLM has the production lead for multimodal workloads.
  • TensorRT-LLM ✗ partial. Multimodal support exists but the per-model recompile friction is severe for vision.
  • MLX-LM ✓ partial. Apple Silicon multimodal path is younger; Pixtral and Qwen 2.5-VL have stronger MLX integration.

Vision token economics

Vision-language models tokenize images as long sequences of vision tokens — Llama 4 Scout uses approximately 512 vision tokens per 1024×1024 image at default resolution. Multi-image queries (5 images at 512 tokens each) consume 2560 tokens of context just on images, before the user prompt or model response. Plan KV-cache budget accordingly.
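
The budgeting arithmetic above can be sketched as follows. This is a rough illustration, assuming the approximately 512 tokens per image quoted above and a hypothetical 16,384-token serving context; `vision_token_budget` is an illustrative helper, not part of any runtime.

```python
TOKENS_PER_IMAGE = 512  # approximate figure quoted above; varies with resolution


def vision_token_budget(num_images: int, context_len: int, prompt_tokens: int = 0) -> dict:
    """Return tokens consumed by images and what is left for the response."""
    image_tokens = num_images * TOKENS_PER_IMAGE
    remaining = context_len - image_tokens - prompt_tokens
    return {
        "image_tokens": image_tokens,
        "remaining_tokens": max(remaining, 0),
        "fits": remaining >= 0,
    }


# Hypothetical request: 5 images plus a 300-token prompt in a 16,384-token context.
budget = vision_token_budget(num_images=5, context_len=16384, prompt_tokens=300)
print(budget["image_tokens"], budget["remaining_tokens"])  # 2560 13524
```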

Best use cases

  • Production multimodal serving at the workstation-cluster tier — image-Q&A, document understanding, visual reasoning.
  • Multi-image queries — the model handles 5-10 images per query cleanly.
  • Document layout + OCR + reasoning combined — strong on mixed-content workloads.
  • Llama-ecosystem migration path — drop-in for teams already on Llama 3 with multimodal needs.

When to use a different model

  • Consumer-tier multimodal (16-24GB VRAM): use Pixtral 12B or Qwen 2.5-VL 7B.
  • Frontier-tier multimodal: use Llama 4 Maverick — same family, larger.
  • Apple Silicon multimodal: Pixtral 12B has stronger MLX integration today.
  • OCR-first workloads: dedicated OCR models (Florence-2, MiniCPM-V) often beat general VLMs at text extraction.
  • Apache 2.0 license required: Pixtral 12B or Qwen 2.5-VL 72B — clean Apache 2.0.

Failure modes specific to this model

  1. OOM on multi-image queries. A 5-image query blows past KV-cache budgets sized for text-only workloads. Lower `--max-num-seqs` to 4 or 2.
  2. Image-format mismatch. Some vision models require RGB; obscure formats (TIFF, RAW) fail. Pre-convert client-side.
  3. Resolution silently downsampled. Default vision encoder downsamples; high-detail tasks (small text OCR) need explicit higher-resolution model variants.
  4. Mixed-modality tool-call format. Some agent harnesses can't handle text+image content blocks correctly.
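
For the first failure mode, one way to pick a starting `--max-num-seqs` is to divide the KV-cache token budget by the worst-case tokens per request. A minimal sketch, assuming the roughly 512 tokens per image quoted in this review and a hypothetical KV-cache capacity of 32,768 tokens (substitute the capacity your runtime actually reports); `max_concurrent_seqs` is an illustrative helper.

```python
TOKENS_PER_IMAGE = 512  # approximate per-image cost quoted in this review


def max_concurrent_seqs(kv_budget_tokens: int, images_per_query: int,
                        text_tokens_per_query: int) -> int:
    """Worst-case number of requests whose KV cache fits in the budget at once."""
    per_query = images_per_query * TOKENS_PER_IMAGE + text_tokens_per_query
    if per_query <= 0:
        raise ValueError("each query must consume at least one token")
    return kv_budget_tokens // per_query


# Hypothetical numbers: 32,768-token KV budget, 4,096 text tokens per request.
text_only = max_concurrent_seqs(32768, images_per_query=0, text_tokens_per_query=4096)
five_images = max_concurrent_seqs(32768, images_per_query=5, text_tokens_per_query=4096)
print(text_only, five_images)  # 8 4
```

Under these assumed numbers, a budget that supports 8 text-only sequences supports only 4 five-image sequences, which is why a concurrency setting tuned for text can run out of memory once images arrive.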

Going deeper

Reviewed May 6, 2026 by Fredoline Eruo

Family & lineage

How this model relates to others in its lineage. Family members share architecture and training-data roots; parent / children edges record direct distillation or fine-tune relationships.

Parent / base model: Llama 4 Maverick (400B, Frontier tier)

Strengths

  • 10M token context (industry-leading)
  • Efficient MoE — runs at 17B-active speed
  • Strong tool/function calling

Weaknesses

  • Total weights still need 65GB+ VRAM at Q4
  • Long-context attention is RAM-hungry
  • Newer than Llama 3.x — less ecosystem battle-testing

Quantization variants

Each quantization trades model quality for file size and VRAM. Q4_K_M is the most popular starting point.

Quantization   File size   VRAM required
Q4_K_M         65.0 GB     80 GB
Q5_K_M         78.0 GB     95 GB
FP16           218.0 GB    240 GB
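
As a sanity check on the table, the effective bits per weight and the ratio of VRAM to file size can be derived from the listed figures. A minimal sketch, assuming 109B total parameters and decimal gigabytes (1 GB = 10^9 bytes); the function names are illustrative.

```python
TOTAL_PARAMS = 109e9  # total parameter count from this review

# (file size in GB, VRAM required in GB), copied from the table above
VARIANTS = {
    "Q4_K_M": (65.0, 80.0),
    "Q5_K_M": (78.0, 95.0),
    "FP16": (218.0, 240.0),
}


def bits_per_weight(file_gb: float) -> float:
    """Average bits stored per parameter, assuming 1 GB = 1e9 bytes."""
    return file_gb * 1e9 * 8 / TOTAL_PARAMS


def vram_overhead(file_gb: float, vram_gb: float) -> float:
    """Ratio of listed VRAM to file size (KV cache, activations, buffers)."""
    return vram_gb / file_gb


for name, (file_gb, vram_gb) in VARIANTS.items():
    print(f"{name}: {bits_per_weight(file_gb):.2f} bits/weight, "
          f"{vram_overhead(file_gb, vram_gb):.2f}x VRAM vs file size")
```

The FP16 row works out to 16 bits per weight, so the listed file sizes are consistent with a 109B parameter count; the quantized rows land near 4.8 and 5.7 bits per weight, with the VRAM column running roughly 10 to 25 percent above file size.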

Get the model

Ollama

One-line install

ollama run llama4:scout

HuggingFace

Original weights

huggingface.co/meta-llama/Llama-4-Scout-17B-16E-Instruct

Source repository with the original, unquantized weights; you must quantize them yourself.

Frequently asked

What's the minimum VRAM to run Llama 4 Scout?

80GB of VRAM is enough to run Llama 4 Scout at the Q4_K_M quantization (file size 65.0 GB). Higher-quality quantizations need more.

Can I use Llama 4 Scout commercially?

Yes — Llama 4 Scout ships under the Llama 4 Community License, which permits commercial use. Always read the license text before deployment.

What's the context length of Llama 4 Scout?

Llama 4 Scout supports a context window of 10,000,000 tokens (10M).

How do I install Llama 4 Scout with Ollama?

Run `ollama pull llama4:scout` to download, then `ollama run llama4:scout` to start a chat session. The default quantization is Q4_K_M.

Source: huggingface.co/meta-llama/Llama-4-Scout-17B-16E-Instruct

Reviewed by RunLocalAI Editorial. See our editorial policy for how we research and verify model claims.
