Mistral Large 2 (123B)
Mistral's flagship dense model. Open weights but restricted commercial license — research and non-commercial only.
Positioning
Mistral Large 2 (123B) is Mistral AI's flagship dense model, released under the Mistral Research License for research and non-commercial use only. With 123 billion parameters and a 131,072-token context window, it is Mistral's largest dense open-weight model and sits in the datacenter class. Because the architecture is dense, every forward pass uses all 123B parameters, so per-token compute is higher than for Mixture-of-Experts models of similar total parameter count, which activate only a subset of their weights.
Strengths
- Massive context window: 131,072 tokens of native context length allows processing of very long documents, codebases, or multi-turn conversations without truncation.
- Dense architecture simplicity: As a dense model, it avoids the routing overhead and potential expert-load imbalance of MoE designs, offering predictable inference behavior.
- Open weights for research: The Mistral Research License permits academic study, experimentation, and non-commercial applications, making it accessible to the research community.
- Large parameter count: 123B parameters provide substantial model capacity for complex reasoning and generation tasks, typical of flagship dense models.
Limitations
- Restricted commercial use: The Mistral Research License prohibits commercial deployment, limiting its use to research and non-commercial settings. Enterprises must seek alternative licensing.
- High hardware requirements: At FP16, the model requires ~246 GB of storage, and with KV cache overhead (typically 30–50% additional memory at full context), even quantized versions demand multiple high-end GPUs or a datacenter setup.
- No community benchmarks available: We do not yet have independently verified performance measurements for this model. Published vendor metrics should be treated as best-case until community replication.
- Dense inference cost: Unlike MoE models that activate only a fraction of parameters per token, Mistral Large 2 uses all 123B parameters at every step, resulting in higher per-token compute and memory requirements.
What it takes to run this locally
Quantized model sizes (disk): Q8_0 ~131 GB, Q6_K ~101.5 GB, Q5_K_M ~87.6 GB, Q4_K_M ~69.2 GB, Q3_K_M ~60.0 GB, Q2_K ~40.0 GB. Add ~30–50% for KV cache and framework overhead at typical context lengths. This places the model firmly in the datacenter deployment class: even the smallest quant (Q2_K) requires a multi-GPU workstation (e.g., dual 48GB GPUs) or a server-grade setup. Consumer single-GPU systems (12–24 GB) are insufficient.
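The sizing rule above can be sketched in a few lines of Python. This is a rough estimate under stated assumptions, not a measurement: the file sizes are the approximate figures quoted in this section, the 30-50% overhead is the rule of thumb from the text, and the helper names (`memory_range_gb`, `gpus_needed`) are made up for illustration. It also assumes memory splits evenly across GPUs, which real runtimes only approximate.

```python
import math

# Approximate on-disk sizes (GB) as quoted in this section.
QUANT_SIZES_GB = {
    "Q8_0": 131.0,
    "Q6_K": 101.5,
    "Q5_K_M": 87.6,
    "Q4_K_M": 69.2,
    "Q3_K_M": 60.0,
    "Q2_K": 40.0,
}


def memory_range_gb(file_gb, low=0.30, high=0.50):
    """Return (low, high) memory estimates after adding 30-50% overhead."""
    return file_gb * (1 + low), file_gb * (1 + high)


def gpus_needed(file_gb, gpu_vram_gb, overhead=0.50):
    """Worst-case GPU count, assuming an even split with no per-GPU overhead."""
    return math.ceil(file_gb * (1 + overhead) / gpu_vram_gb)


for name, size in QUANT_SIZES_GB.items():
    lo, hi = memory_range_gb(size)
    print(f"{name}: {lo:.0f}-{hi:.0f} GB total, "
          f"{gpus_needed(size, 48)} x 48 GB GPUs (worst case)")
```

With these inputs, even Q2_K lands on two 48 GB cards in the worst case, which matches the dual-GPU guidance above.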
Should you run this locally?
Yes if you are a researcher with access to multi-GPU datacenter hardware and need a large dense model whose license permits non-commercial experimentation. The 131,072-token (128K) context window is a strong draw for long-document analysis.
No if you need commercial deployment rights, lack multi-GPU infrastructure, or prefer a model with lower hardware requirements. For commercial use, consider Mistral's commercial offerings or other permissively licensed models.
Catalog cross-links
- Mistral 7B
- Mixtral 8x7B
- Mistral Small
How to run it
Mistral Large 2 is Mistral AI's 123B dense model. Run it at Q4_K_M via Ollama (ollama pull mistral-large:123b) or llama.cpp with -ngl 999 -fa -c 8192. The Q4_K_M file is about 70 GB on disk.
Minimum VRAM: 80 GB. That means a single A100 80GB at Q4_K_M (tight) or dual RTX A6000 with row split (96 GB). A single RTX 4090 (24 GB) cannot run Q4, and two of them (48 GB) still fall short of Q3_K_M (~60 GB); use three cards for Q3_K_M, or drop to Q2_K. Recommended for serving: A100 80GB at AWQ-INT4.
Throughput: ~8-15 tok/s on A100 at 8K context; ~5-10 tok/s on dual A6000. The model uses the standard Mistral architecture, which is well supported across inference stacks.
Context: 128K is advertised, but the practical limit at Q4 on 80 GB is 8-16K. KV cache at 32K adds ~30-40 GB, pushing the total above 80 GB, so scale context to the VRAM you have. Mistral Large 2 is known for strong multilingual performance and coding; benchmark your own domain.
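As a small illustration of the llama.cpp invocation described above, the sketch below only assembles an argument list; it does not launch anything. The flags -ngl, -fa and -c come from the text. The binary name `llama-cli`, the `-m` model flag, the `-sm row` split-mode option and the GGUF file name are assumptions added for the example, and flag spellings can differ between llama.cpp builds, so check `--help` on your install.

```python
import shlex


def llama_cli_command(model_path, ctx=8192, gpu_layers=999,
                      flash_attn=True, row_split=False):
    """Build (but do not run) a llama.cpp command line as a list of arguments."""
    args = ["llama-cli", "-m", model_path, "-ngl", str(gpu_layers)]
    if flash_attn:
        args.append("-fa")          # flash attention, as written in the text
    args += ["-c", str(ctx)]        # context length in tokens
    if row_split:
        args += ["-sm", "row"]      # assumed flag for row split across GPUs
    return args


# Hypothetical file name; start with a small context and raise it gradually.
cmd = llama_cli_command("mistral-large-2-q4_k_m.gguf", ctx=4096, row_split=True)
print(shlex.join(cmd))
```

Keeping the context length as a parameter makes it easy to start low and raise it until memory runs out, rather than editing the command by hand.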
Hardware guidance
Minimum: A100 80GB at Q4_K_M (4K context, tight). Recommended: dual A100 80GB at AWQ-INT4 for 32K context with headroom. Budget: 2× RTX A6000 (96 GB total) at Q4_K_M.
VRAM math: 123B dense at Q4_K_M is ~0.57 bytes/param, or ~70 GB of weights. KV cache at 8K: ~20 GB. Total: ~90 GB at 8K, batch=1. A single A100 80GB is 10 GB short, so trim context to 4K or use Q3_K_M. Dual A6000 (96 GB) is comfortable at 8K context.
Other options: Mac Studio M4 Max 128GB runs Q4_K_M at 4-8 tok/s. 3× RTX 4090 (72 GB total) is tight at Q4, so use Q3_K_M. A single RTX 6000 Ada 48GB cannot hold Q3_K_M (~60 GB) entirely in VRAM; use Q2_K with minimal context. Cloud: single A100 at $5-10/hr for Q4_K_M.
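The VRAM math above is simple enough to check in code. The sketch below reproduces it using only the figures quoted in this section (123B parameters, ~0.57 bytes/param at Q4_K_M, ~20 GB of KV cache at 8K); the function names are made up for the example, and the KV cache figure is taken as an input rather than derived from the architecture.

```python
def weights_gb(params_billion, bytes_per_param):
    """Approximate weight footprint in GB (billions of params x bytes each)."""
    return params_billion * bytes_per_param


def vram_headroom_gb(vram_gb, params_billion, bytes_per_param, kv_cache_gb):
    """Positive result: it fits with that much room. Negative: short by that much."""
    return vram_gb - (weights_gb(params_billion, bytes_per_param) + kv_cache_gb)


w = weights_gb(123, 0.57)
single_a100 = vram_headroom_gb(80, 123, 0.57, kv_cache_gb=20)
dual_a6000 = vram_headroom_gb(96, 123, 0.57, kv_cache_gb=20)

print(f"Q4_K_M weights: ~{w:.0f} GB")
print(f"Single A100 80GB at 8K: {single_a100:+.0f} GB headroom")
print(f"Dual A6000 96GB at 8K:  {dual_a6000:+.0f} GB headroom")
```

Swapping in the bytes/param of another quant, or a different KV cache estimate, shows quickly whether a given card combination is worth trying.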
What breaks first
1. VRAM underestimated for context. 70 GB of weights plus KV cache quickly exceeds 80 GB above 4K context. Trim context aggressively.
2. Q4_K_M quality on coding tasks. Mistral Large 2's code generation quality may degrade noticeably at Q4 relative to Q8, possibly more than for similarly sized Llama models. Prefer AWQ-INT4 (calibrated) over GGUF Q4_K_M for code if quality matters, and measure on your own tasks.
3. Tokenization quirks. Mistral Large 2's tokenizer handles whitespace and special characters differently from Llama's, so prompts with unusual formatting may produce higher token counts than expected.
4. Multilingual performance variance. Non-English quality varies by language. Spanish, French, and German are strong; less-resourced languages may degrade significantly. Benchmark your target language.
Common beginner mistakes
- Mistake: Running at 32K context on a single A100 80GB. Fix: KV cache at 32K is ~40 GB, so the total is 70 + 40 = 110 GB and the run goes out of memory. Start at -c 4096 and increment.
- Mistake: Using Ollama's default Q4_0 instead of Q4_K_M. Fix: Q4_0 is an older format that is generally lower quality than Q4_K_M at a similar file size. Use the K-quant variants, and check ollama show for the exact tag.
- Mistake: Assuming every Mistral Large 2 download is an instruct model. Fix: Verify the specific Hugging Face repo; some distributions are base, some are instruct. The instruct variant needs the Mistral chat template.
- Mistake: Mixing the Mistral Large 2 tokenizer with a Llama tokenizer in the same pipeline. Fix: The tokenizers and vocabularies differ, so token counts differ too. Use the model's own tokenizer for accurate context calculations.
Strengths
- Top-tier dense quality
- 128K context
- Strong multilingual
Weaknesses
- Non-commercial license
- Workstation-only
Quantization variants
Each quantization trades model quality for file size and VRAM. Q4_K_M is the most popular starting point.
| Quantization | File size | VRAM required |
|---|---|---|
| Q4_K_M | 73.0 GB | 88 GB |
Get the model
Ollama
One-line install
ollama run mistral-large:123b
HuggingFace
Original weights
Source repository — direct quantization required.
Source: huggingface.co/mistralai/Mistral-Large-Instruct-2407
Reviewed by RunLocalAI Editorial. See our editorial policy for how we research and verify model claims.