llama
90B parameters
Commercial OK
Multimodal
Reviewed June 2026

Llama 3.2 90B Vision Instruct

The 90B vision Llama. Best-in-class first-party multimodal open weight at the time of release. Workstation-class only.

License: Llama 3.2 Community License·Released Sep 25, 2024·Context: 131,072 tokens
BLK · VERDICT

Our verdict

OP · Fredoline Eruo|VERIFIED JUN 12, 2026
unrated

Positioning

Llama 3.2 90B Vision Instruct is Meta's first-party multimodal extension of the Llama 3.2 family, adding native vision understanding to the dense 90B-parameter architecture. Released under the Llama 3.2 Community License, it targets operators who need a permissive, open-weight vision-language model at the 70B-class scale. With a 131,072-token context window, it is designed for long-context multimodal tasks such as document analysis, video understanding, and complex visual reasoning. At 90B dense parameters, this is a datacenter-class model: the FP16 checkpoint alone is ~180 GB, and even aggressive quantization requires workstation-grade hardware.

Strengths

  • First-party multimodal Llama: As a Meta release, this model benefits from the same training data, safety mitigations, and ecosystem support as the text-only Llama 3.2 models, making it a natural choice for operators already invested in the Llama stack.
  • Massive context window: 131,072 tokens of context enables processing of long documents, high-resolution image sequences, or extended video clips without truncation — a significant advantage over many open-weight VLMs with shorter contexts.
  • Permissive commercial license: The Llama 3.2 Community License allows for commercial use, including fine-tuning and deployment, with only usage-based restrictions for very large-scale applications (monthly active users thresholds).
  • Quantization flexibility: With quantized sizes ranging from Q8_0 (96 GB) down to Q2_K (29.3 GB), operators can trade off precision for hardware fit. The Q4_K_M variant (~50.6 GB) offers a practical balance for dual-GPU workstation setups.

Limitations

  • Datacenter-only at full precision: The FP16 checkpoint requires ~180 GB of GPU memory, plus substantial overhead for KV cache (add ~30-50% at typical context lengths). This effectively limits full-precision inference to multi-GPU datacenter nodes (e.g., 4× A100 80GB or 2× H100).
  • No community benchmarks yet: As a recent release, we lack independent, community-verified performance numbers for this model. Operators should treat vendor-published metrics as best-case and plan for their own evaluation.
  • Dense architecture at 90B: Unlike Mixture-of-Experts models that activate only a fraction of parameters per token, Llama 3.2 90B is dense — every forward pass uses all 90B parameters. This means inference cost scales linearly with parameter count, making it more expensive per token than an MoE model of similar total size.
  • Vision modality adds complexity: Running vision-language models requires additional preprocessing (image encoding) and often larger batch sizes for throughput. The vision encoder itself consumes memory and compute, further increasing hardware demands beyond the language model alone.

What it takes to run this locally

At FP16, the model requires ~180 GB of GPU memory just for weights. Adding KV cache and framework overhead (typically 30-50% at 131K context) pushes total memory beyond 250 GB. This places full-precision inference firmly in the datacenter class: 4× A100 80GB or 2× H100 80GB are the minimum viable configurations.

Quantization reduces the memory footprint significantly:

  • Q8_0: ~96 GB weights → ~125-145 GB total → still requires 2× A100 80GB or 4× A6000 48GB.
  • Q4_K_M: ~50.6 GB weights → ~66-76 GB total → fits on a single A100 80GB or 2× RTX 6000 Ada 48GB (with careful context management).
  • Q2_K: ~29.3 GB weights → ~38-44 GB total → possible on a single 48GB workstation GPU (e.g., RTX A6000) but with significant quality loss.

For practical deployment, a workstation with 2× 48GB GPUs (e.g., RTX 6000 Ada) running Q4_K_M is the most accessible path, while consumer hardware (single 24GB GPU) is not viable even at Q2_K due to memory constraints.

Should you run this locally?

Yes if you need a permissively licensed, first-party multimodal Llama model for commercial deployment and have access to datacenter or high-end workstation GPUs (2× 48GB or better). The 131K context window is a strong differentiator for long-document or video analysis tasks.

No if you are limited to consumer hardware (single 24GB GPU) or need fast, low-cost inference. The dense 90B architecture is expensive to run, and smaller VLMs (e.g., 7B-13B class) may be more practical. Also, if you require community-verified benchmarks before committing, wait for independent evaluations.

Catalog cross-links

Overview

The 90B vision Llama. Best-in-class first-party multimodal open weight at the time of release. Workstation-class only.

Family & lineage

How this model relates to others in its lineage. Family members share architecture and training-data roots; parent / children edges record direct distillation or fine-tune relationships.

Family siblings (llama-3.x-vision)
Llama 3.2 11B Vision Instruct11B
Consumer
Llama 3.2 90B Vision Instruct90B
You are here

Strengths

  • Top-tier open-weight vision quality
  • 128K context

Weaknesses

  • Needs 60GB+ VRAM
  • EU restricted

Quantization variants

Each quantization trades model quality for file size and VRAM. Q4_K_M is the most popular starting point.

QuantizationFile sizeVRAM required
Q4_K_M51.0 GB60 GB

Get the model

Ollama

One-line install

ollama run llama3.2-vision:90bRead our Ollama review →

HuggingFace

Original weights

huggingface.co/meta-llama/Llama-3.2-90B-Vision-Instruct

Source repository — direct quantization required.

Hardware that runs this

Cards with enough VRAM for at least one quantization of Llama 3.2 90B Vision Instruct.

Compare alternatives

Models worth comparing

Same parameter band, plus what's one tier above and below — so you can decide what actually fits your hardware.

Frequently asked

What's the minimum VRAM to run Llama 3.2 90B Vision Instruct?

60GB of VRAM is enough to run Llama 3.2 90B Vision Instruct at the Q4_K_M quantization (file size 51.0 GB). Higher-quality quantizations need more.

Can I use Llama 3.2 90B Vision Instruct commercially?

Yes — Llama 3.2 90B Vision Instruct ships under the Llama 3.2 Community License, which permits commercial use. Always read the license text before deployment.

What's the context length of Llama 3.2 90B Vision Instruct?

Llama 3.2 90B Vision Instruct supports a context window of 131,072 tokens (about 131K).

How do I install Llama 3.2 90B Vision Instruct with Ollama?

Run `ollama pull llama3.2-vision:90b` to download, then `ollama run llama3.2-vision:90b` to start a chat session. The default quantization is Q4_K_M.

Does Llama 3.2 90B Vision Instruct support images?

Yes — Llama 3.2 90B Vision Instruct is multimodal and accepts text + vision inputs. Vision support requires a runner that handles its image-conditioning architecture.

Source: huggingface.co/meta-llama/Llama-3.2-90B-Vision-Instruct

Reviewed by RunLocalAI Editorial. See our editorial policy for how we research and verify model claims.

Related — keep moving

Before you buy

Verify Llama 3.2 90B Vision Instruct runs on your specific hardware before committing money.