mistral
123B parameters
Restricted
Reviewed June 2026

Mistral Large 2 (123B)

Mistral's flagship dense model. Open weights but restricted commercial license — research and non-commercial only.

License: Mistral Research License · Released Jul 24, 2024 · Context: 131,072 tokens

Our verdict

Fredoline Eruo · Verified Jun 12, 2026
unrated

Positioning

Mistral Large 2 (123B) is Mistral AI's flagship dense model, released under the Mistral Research License for research and non-commercial use only. With 123 billion parameters and a 131,072-token context window, it is among the largest open-weight models Mistral has released and is positioned as a datacenter-class model. Because the architecture is dense, every forward pass uses all 123B parameters, so each token costs more compute and memory bandwidth than it would on a Mixture-of-Experts model of similar total parameter count.

Strengths

  • Massive context window: 131,072 tokens of native context length allows processing of very long documents, codebases, or multi-turn conversations without truncation.
  • Dense architecture simplicity: As a dense model, it avoids the routing overhead and potential expert-load imbalance of MoE designs, offering predictable inference behavior.
  • Open weights for research: The Mistral Research License permits academic study, experimentation, and non-commercial applications, making it accessible to the research community.
  • Large parameter count: 123B parameters provide substantial model capacity for complex reasoning and generation tasks, typical of flagship dense models.

Limitations

  • Restricted commercial use: The Mistral Research License prohibits commercial deployment, limiting its use to research and non-commercial settings. Enterprises must seek alternative licensing.
  • High hardware requirements: At FP16, the weights alone take ~246 GB (123B parameters × 2 bytes). KV cache and runtime buffers come on top and grow with context length, so even quantized versions demand multiple high-end GPUs or a datacenter setup.
  • No community benchmarks available: We do not yet have independently verified performance measurements for this model. Published vendor metrics should be treated as best-case until community replication.
  • Dense inference cost: Unlike MoE models that activate only a fraction of parameters per token, Mistral Large 2 uses all 123B parameters at every step, resulting in higher per-token compute and memory requirements.
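
The dense-versus-MoE cost gap in the last bullet can be sketched with the common rule of thumb that a decoder forward pass costs about 2 FLOPs per active parameter per token. The rule of thumb, the function name, and the Mixtral 8x22B figures (about 39B active out of about 141B total) are assumptions brought in for illustration, not numbers from this page.

```python
def flops_per_token(active_params: float) -> float:
    """Approximate forward-pass FLOPs per generated token (~2 per active parameter)."""
    return 2.0 * active_params

# Mistral Large 2 is dense: all 123B parameters are active for every token.
dense = flops_per_token(123e9)

# Assumed comparison point: Mixtral 8x22B activates roughly 39B of ~141B parameters.
moe = flops_per_token(39e9)

print(f"dense 123B     : {dense / 1e9:.0f} GFLOPs/token")
print(f"MoE ~39B active: {moe / 1e9:.0f} GFLOPs/token")
print(f"ratio          : {dense / moe:.2f}x")
```

This only covers compute. An MoE model still has to keep all of its experts in memory, so the saving does not carry over to VRAM.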

What it takes to run this locally

Quantized model sizes (disk, approximate): Q8_0 ~131 GB, Q6_K ~101.5 GB, Q5_K_M ~87.6 GB, Q4_K_M ~73 GB, Q3_K_M ~60.0 GB, Q2_K ~40.0 GB. Add headroom for KV cache and framework overhead: the quantization table on this page budgets 88 GB of VRAM for the 73.0 GB Q4_K_M file, about 20% extra, and long contexts need more. This places the model firmly in the datacenter deployment class: even the smallest quant (Q2_K) barely fits a single 48 GB card and realistically calls for a multi-GPU workstation (e.g., dual 48 GB GPUs) or a server-grade setup. Consumer single-GPU systems (12–24 GB) are insufficient.
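
File sizes like these can be approximated as parameter count times average bits per weight. The sketch below is a rough estimator, not how any tool reports sizes: the bits-per-weight values are rounded assumptions for llama.cpp quantization types (real files mix tensor types and vary by a few GB), and the function name is made up for this example.

```python
PARAMS = 123e9  # parameter count stated on this page

# Assumed average bits per weight for each format (rounded, illustrative only).
BITS_PER_WEIGHT = {
    "FP16": 16.0,
    "Q8_0": 8.5,
    "Q6_K": 6.56,
    "Q5_K_M": 5.7,
    "Q4_K_M": 4.8,
    "Q3_K_M": 3.9,
}

def weight_size_gb(params: float, bits_per_weight: float) -> float:
    """Weights-only size in decimal gigabytes (no KV cache, no runtime buffers)."""
    return params * bits_per_weight / 8 / 1e9

for name, bpw in BITS_PER_WEIGHT.items():
    print(f"{name:7s} ~{weight_size_gb(PARAMS, bpw):6.1f} GB")
```

Treat the output as a planning estimate and check the actual file size of the build you download.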

Should you run this locally?

Yes if you are a researcher with access to multi-GPU datacenter hardware and need a large dense model whose license permits non-commercial experimentation. The 131,072-token context window is a strong draw for long-document analysis.

No if you need commercial deployment rights, lack multi-GPU infrastructure, or prefer a model with lower hardware requirements. For commercial use, consider Mistral's commercial offerings or other permissively licensed models.

Catalog cross-links

  • Mistral 7B
  • Mixtral 8x7B
  • Mistral Small

How to run it

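Mistral Large 2 is Mistral AI's 123B dense model. Run it at Q4_K_M via Ollama (ollama pull mistral-large:123b) or llama.cpp with -ngl 999 -fa -c 8192. The Q4_K_M file is about 73 GB on disk.

Minimum practical VRAM is 80 GB: a single A100 80GB at Q4_K_M (tight, short context only), or dual RTX A6000 with row split (96 GB total). The catalog's 88 GB figure for Q4_K_M budgets extra headroom on top of that. A single RTX 4090 24GB cannot hold any quantization. Dual RTX 4090s (48 GB) fit Q2_K at most; Q3_K_M (~60 GB) needs three cards. For serving, use AWQ-INT4 on an A100 80GB for short contexts, or on two A100s for 32K.

Throughput estimates: ~8-15 tok/s on an A100 at 8K context; ~5-10 tok/s on dual A6000. The model uses the standard Mistral architecture, which the major inference stacks support well.

The advertised context window is 128K (131,072 tokens), but the practical range at Q4_K_M on 80 GB is 8-16K. Assuming an fp16 cache and the published attention configuration (88 layers, 8 KV heads, head dimension 128), the KV cache at 32K adds roughly 12 GB on top of the weights, pushing the total above 80 GB. Scale context to the VRAM you have. Mistral Large 2 is known for strong multilingual performance and coding, but benchmark it on your own domain.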

Hardware guidance

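Minimum: A100 80GB at Q4_K_M (4K-8K context, tight). Recommended: dual A100 80GB at AWQ-INT4 for 32K context with headroom. Budget: 2× RTX A6000 (96 GB total) at Q4_K_M.

VRAM math: 123B dense at Q4_K_M is ~0.59 bytes/param, or ~73 GB. Assuming an fp16 cache and the published attention configuration (88 layers, 8 KV heads, head dimension 128), the KV cache costs about 0.36 MB per token, or ~3 GB at 8K. Weights plus cache come to ~76 GB at 8K with batch=1, before compute buffers and runtime overhead. A single A100 80GB therefore has only a few GB of slack: start at 4K context and increase, or use Q3_K_M. Dual A6000 (96 GB) is comfortable for 8K context.

Other options: a Mac Studio with 128 GB of unified memory runs Q4_K_M at roughly 4-8 tok/s. RTX 4090 24GB × 3 = 72 GB is too small for Q4_K_M, so use Q3_K_M. A single RTX 6000 Ada 48GB fits Q2_K only, with minimal context. Cloud: single A100 at $5-10/hr for Q4_K_M.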
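
The KV cache part of the VRAM math can be worked out directly from the attention layout. This is a minimal sketch, assuming an fp16 cache and architecture values of 88 layers, 8 key/value heads (grouped-query attention), and a head dimension of 128. Those values are assumptions to confirm against the model's config.json, and the function name is made up for this example.

```python
def kv_cache_bytes(n_tokens: int, n_layers: int, n_kv_heads: int,
                   head_dim: int, bytes_per_elem: int = 2) -> int:
    """Bytes needed to cache keys and values for n_tokens of a single sequence."""
    # Factor 2 = one key vector plus one value vector per layer and KV head.
    return 2 * n_layers * n_kv_heads * head_dim * bytes_per_elem * n_tokens

# Assumed architecture values; verify against the model's config.json.
N_LAYERS, N_KV_HEADS, HEAD_DIM = 88, 8, 128

for ctx in (4096, 8192, 32768, 131072):
    gib = kv_cache_bytes(ctx, N_LAYERS, N_KV_HEADS, HEAD_DIM) / 2**30
    print(f"{ctx:>7} tokens: ~{gib:.2f} GiB KV cache (fp16, batch=1)")
```

The result scales linearly with batch size and context length, and runtimes add their own compute buffers on top, so measure real usage before committing to a context setting.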

What breaks first

  1. VRAM underestimated for context. About 73 GB of Q4_K_M weights leaves only a few GB free on an 80 GB card, and KV cache plus compute buffers grow with context. Start with a short context and raise it gradually.
  2. Q4_K_M quality on coding tasks. Code generation quality can degrade at Q4 relative to Q8. If code quality matters, compare AWQ-INT4 (calibrated) against GGUF Q4_K_M on your own tasks.
  3. Tokenization quirks. Mistral Large 2's tokenizer handles whitespace and special characters differently from Llama's, so prompts with unusual formatting may produce higher token counts than expected.
  4. Multilingual performance variance. Non-English quality varies by language. Spanish, French, and German are strong; less-resourced languages may degrade significantly. Benchmark your target language.

Runtime recommendation

vLLM for production serving — Mistral models are first-class citizens in vLLM's supported architectures. llama.cpp for single-node local use. Ollama for quick-start (wraps llama.cpp). Mistral Large 2 uses standard Mistral architecture — no exotic kernels needed. MLX-LM for Apple Silicon users.

Common beginner mistakes

  • Mistake: Running at 32K context on a single A100 80GB. Fix: With an fp16 cache, the KV cache at 32K is roughly 12 GB, so weights plus cache come to about 73 + 12 = 85 GB before runtime buffers, which is out of memory on 80 GB. Start at -c 4096 and increment.
  • Mistake: Not checking which quantization an Ollama tag ships. Fix: Q4_0 is slightly smaller than Q4_K_M but generally lower quality. Prefer the K-quant variants, and check ollama show for the exact tag.
  • Mistake: Assuming every Mistral Large 2 download is an instruct model. Fix: Verify the specific Hugging Face repo, since distributions may differ. The instruct variant, such as the official repo linked on this page, needs the Mistral chat template.
  • Mistake: Mixing the Mistral Large 2 tokenizer with a Llama tokenizer in the same pipeline. Fix: The tokenizers have different vocabularies, so token counts will differ. Use the model's own tokenizer for accurate context calculations.
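
The "start small and increment" advice can be turned into a rough planning calculation: subtract weights and runtime overhead from the VRAM budget, then divide what is left by the KV cache cost per token. This is a sketch under stated assumptions, not a measurement: the 73 GB weight size comes from this page, while the 3 GB overhead allowance, the per-token cache cost (fp16, 88 layers, 8 KV heads, head dimension 128), and the function name are assumptions made for illustration.

```python
def max_context_tokens(vram_gb: float, weights_gb: float, overhead_gb: float,
                       kv_bytes_per_token: int, model_limit: int = 131_072) -> int:
    """Largest context whose KV cache fits in the VRAM left after weights and overhead."""
    free_bytes = (vram_gb - weights_gb - overhead_gb) * 1e9
    if free_bytes <= 0:
        return 0  # the weights alone do not fit
    return min(int(free_bytes // kv_bytes_per_token), model_limit)

# Assumed fp16 KV cost per token: keys + values x layers x KV heads x head dim x 2 bytes.
KV_PER_TOKEN = 2 * 88 * 8 * 128 * 2

for vram in (48, 80, 96, 160):
    tokens = max_context_tokens(vram, weights_gb=73, overhead_gb=3,
                                kv_bytes_per_token=KV_PER_TOKEN)
    print(f"{vram:>3} GB VRAM -> up to ~{tokens} tokens of context")
```

Use the result as an upper bound for the first -c value to try, then confirm with the runtime's own memory report.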

Strengths

  • Top-tier dense quality
  • 128K context
  • Strong multilingual

Weaknesses

  • Non-commercial license
  • Workstation-only

Quantization variants

Each quantization trades model quality for file size and VRAM. Q4_K_M is the most popular starting point.

Quantization    File size    VRAM required
Q4_K_M          73.0 GB      88 GB

Get the model

Ollama

One-line install

ollama run mistral-large:123b

HuggingFace

Original weights

huggingface.co/mistralai/Mistral-Large-Instruct-2407

Source repository with the original weights. You must quantize them yourself for local use.

Frequently asked

What's the minimum VRAM to run Mistral Large 2 (123B)?

88 GB of VRAM is enough to run Mistral Large 2 (123B) at the Q4_K_M quantization (file size 73.0 GB). Smaller quantizations need less; higher-quality quantizations need more.

Can I use Mistral Large 2 (123B) commercially?

Not under the default license. Mistral Large 2 (123B) is released under the Mistral Research License, which covers research and non-commercial use only. Review the license terms and arrange separate commercial licensing before using it in a product.

What's the context length of Mistral Large 2 (123B)?

Mistral Large 2 (123B) supports a context window of 131,072 tokens (128K).

How do I install Mistral Large 2 (123B) with Ollama?

Run `ollama pull mistral-large:123b` to download, then `ollama run mistral-large:123b` to start a chat session. The default quantization is Q4_K_M.
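
Once the model is pulled, it can also be called programmatically through Ollama's local HTTP API. The sketch below assumes an Ollama server on its default address (http://localhost:11434) and uses the /api/generate endpoint with the num_ctx option to keep the context small. The helper names and the 4096 default are choices made for this example.

```python
import json
import urllib.request

def build_generate_request(prompt: str, model: str = "mistral-large:123b",
                           num_ctx: int = 4096,
                           host: str = "http://localhost:11434") -> urllib.request.Request:
    """Build a non-streaming request for Ollama's /api/generate endpoint."""
    payload = {
        "model": model,
        "prompt": prompt,
        "stream": False,                  # return one JSON object instead of a stream
        "options": {"num_ctx": num_ctx},  # context length; start small to limit VRAM use
    }
    return urllib.request.Request(
        f"{host}/api/generate",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )

def generate(prompt: str, **kwargs) -> str:
    """Send the request to a running Ollama server and return the generated text."""
    with urllib.request.urlopen(build_generate_request(prompt, **kwargs)) as resp:
        return json.loads(resp.read())["response"]

# Example call (requires a running Ollama server with the model pulled):
# print(generate("Explain grouped-query attention in two sentences.", num_ctx=8192))
```

Raising num_ctx increases KV cache memory, so increase it step by step while watching VRAM usage.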

Source: huggingface.co/mistralai/Mistral-Large-Instruct-2407

Reviewed by RunLocalAI Editorial. See our editorial policy for how we research and verify model claims.
