Mixtral 8x7B Instruct
The MoE model that introduced the 8-experts pattern to the open-weight world. 47B parameters total, 13B active per token. Still a viable workhorse on 36 GB+ setups.
Positioning
The first practical MoE model in local AI. Today it's stuck in an awkward middle: routing activates only 13B parameters per token (fast for its size), but all 47B still have to be resident in memory (26 GB at Q4_K_M). The model that should beat it on both axes, Llama 3.3 70B at Q4, exists and runs in similar memory with offload.
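The footprint argument above comes down to simple arithmetic, sketched here as a rough estimate rather than a measurement. The 4.5 bits-per-weight value for Q4_K_M is an assumption back-derived from the 26 GB figure quoted in this review, and the function name is hypothetical.

```python
def quantized_size_gb(params_billions: float, bits_per_weight: float) -> float:
    """Approximate weight footprint in GB (10^9 bytes) at a given bit width."""
    total_bytes = params_billions * 1e9 * bits_per_weight / 8
    return total_bytes / 1e9


TOTAL_PARAMS_B = 47   # every expert must be loaded, routed to or not
ACTIVE_PARAMS_B = 13  # parameters actually used for each token
Q4_K_M_BPW = 4.5      # ASSUMPTION: effective bits per weight for Q4_K_M

weights_gb = quantized_size_gb(TOTAL_PARAMS_B, Q4_K_M_BPW)
active_gb = quantized_size_gb(ACTIVE_PARAMS_B, Q4_K_M_BPW)

print(f"resident weights: {weights_gb:.1f} GB")
print(f"weights touched per token: {active_gb:.1f} GB")
print(f"memory paid per unit of active compute: {weights_gb / active_gb:.1f}x")
```

The last ratio is the "awkward middle" in one number: memory is sized for the full expert set while per-token compute is sized for the routed subset.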
Strengths
- Active-parameter speed: 28–35 tok/s on a 4090 at Q4 (with offload), notably faster than dense 47B equivalents.
- Apache 2.0 license — clean commercial story.
- Strong multilingual for its era; French and German specifically remain solid.
Limitations
- VRAM-heavy for the active compute: you pay 26 GB to use 13B worth of compute per token. Bad memory-vs-quality tradeoff today.
- Routing instability at long contexts — output quality degrades noticeably past 16K.
- Beaten by Llama 3.3 70B on almost every general benchmark while needing similar memory.
Real-world performance on RTX 4090
- Q4_K_M (26 GB), partial offload on 24 GB: 28–35 tok/s decode, TTFT ~250 ms on a 1K-token prompt
- Q5_K_M (33 GB), heavy offload: 14–20 tok/s
- Q8_0 (~50 GB): workstation territory only
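To check decode and prefill figures like these on your own hardware, one option is to read the timing fields that a local Ollama server returns with each non-streamed response. The sketch below assumes a server on Ollama's default port and relies on the `eval_count`, `eval_duration`, and `prompt_eval_duration` fields (durations in nanoseconds). The helper names are hypothetical, and prompt-evaluation time is only a rough proxy for TTFT.

```python
import json
import urllib.request

# ASSUMPTION: Ollama is running locally on its default port.
OLLAMA_URL = "http://localhost:11434/api/generate"


def decode_tok_per_s(response: dict) -> float:
    """Decode speed: generated tokens divided by generation time in seconds."""
    return response["eval_count"] / (response["eval_duration"] / 1e9)


def prefill_ms(response: dict) -> float:
    """Prompt-processing time in milliseconds (a rough stand-in for TTFT)."""
    return response["prompt_eval_duration"] / 1e6


def generate(model: str, prompt: str) -> dict:
    """Send one non-streaming generation request and return the parsed JSON."""
    payload = json.dumps({"model": model, "prompt": prompt, "stream": False}).encode()
    request = urllib.request.Request(
        OLLAMA_URL,
        data=payload,
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(request) as resp:
        return json.load(resp)
```

With the server running, `decode_tok_per_s(generate("mixtral:8x7b-instruct-v0.1-q4_K_M", "Explain MoE routing."))` gives a single-run number. Averaging several runs after a warm-up request is more representative, since the first call includes model load time.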
Should you run this locally?
Yes, for legacy fine-tunes you depend on, or where the Apache 2.0 license is required and you need MoE speed characteristics. No, for new deployments — Llama 3.3 70B at Q4_K_M lives in similar memory and produces meaningfully better outputs.
How it compares
- vs Llama 3.3 70B Q4 → similar VRAM footprint, Llama 3.3 wins on quality across general tasks. The MoE speed advantage is real (~25% faster) but the quality gap is larger.
- vs Mixtral 8x22B → 8x22B is the modern Mixtral pick; uses ~3× the VRAM but earns it on quality.
- vs Qwen 3 30B-A3B (MoE) → Qwen 3 30B-A3B does what Mixtral 8x7B promised: smaller VRAM (~17 GB Q4), tighter routing, better quality. Pick Qwen 3 30B-A3B if you want MoE speed today.
Run this yourself
ollama pull mixtral:8x7b-instruct-v0.1-q4_K_M
ollama run mixtral:8x7b-instruct-v0.1-q4_K_M
Settings: Q4_K_M GGUF, 8192 ctx, --n-gpu-layers 24 of 33 on 4090, CUDA 12.4
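The 24-of-33 layer split in these settings can be reasoned about with a back-of-the-envelope estimate. The sketch below assumes all offloadable layers are the same size, which is only approximately true, and uses the 26 GB Q4_K_M figure from this review. The function name is hypothetical.

```python
def gpu_resident_gb(model_gb: float, gpu_layers: int, total_layers: int) -> float:
    """Weights placed on the GPU, ASSUMING every layer is the same size."""
    return model_gb * gpu_layers / total_layers


MODEL_GB = 26.0      # Q4_K_M size quoted in this review
CARD_VRAM_GB = 24.0  # RTX 4090
GPU_LAYERS = 24
TOTAL_LAYERS = 33

on_gpu = gpu_resident_gb(MODEL_GB, GPU_LAYERS, TOTAL_LAYERS)
on_cpu = MODEL_GB - on_gpu
headroom = CARD_VRAM_GB - on_gpu

print(f"on GPU: {on_gpu:.1f} GB, in system RAM: {on_cpu:.1f} GB")
print(f"VRAM left for KV cache, buffers, and the desktop: {headroom:.1f} GB")
```

If the estimated headroom looks generous, raising the layer count by one or two at a time and watching actual VRAM use is a reasonable way to find the real limit.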
Why this rating
6.4/10: the original sparse-MoE story for local inference was important, but the math no longer pencils out. Llama 3.3 70B uses similar VRAM and is materially better, and Mixtral 8x22B is a more credible MoE option for heavier workloads.
Strengths
- Apache 2.0
- Pioneer MoE
- Wide ecosystem support
Weaknesses
- Now outpaced by Qwen 3 30B-A3B
Quantization variants
Each quantization trades model quality for file size and VRAM. Q4_K_M is the most popular starting point.
| Quantization | File size | VRAM required |
|---|---|---|
| Q4_K_M | 26.0 GB | 32 GB |
| Q5_K_M | 33.0 GB | 38 GB |
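Part of the gap between file size and required VRAM is the KV cache, which grows with context length. The sketch below estimates it for an fp16 cache. The architecture values (32 layers, 8 KV heads, head dimension 128) are assumptions recalled from the published Mixtral 8x7B configuration and should be checked against the model's `config.json`. Compute buffers and runtime overhead are not modeled.

```python
def kv_cache_bytes(n_layers: int, n_kv_heads: int, head_dim: int,
                   ctx_tokens: int, bytes_per_elem: int = 2) -> int:
    """Size of the key and value caches for one sequence (fp16 by default)."""
    # Leading factor of 2: one tensor for keys and one for values per layer.
    return 2 * n_layers * n_kv_heads * head_dim * ctx_tokens * bytes_per_elem


# ASSUMPTION: values as recalled from the published Mixtral 8x7B config.
N_LAYERS, N_KV_HEADS, HEAD_DIM = 32, 8, 128

GIB = 1024 ** 3
kv_8k = kv_cache_bytes(N_LAYERS, N_KV_HEADS, HEAD_DIM, 8192)
kv_32k = kv_cache_bytes(N_LAYERS, N_KV_HEADS, HEAD_DIM, 32768)

print(f"KV cache at 8K context:  {kv_8k / GIB:.1f} GiB")
print(f"KV cache at 32K context: {kv_32k / GIB:.1f} GiB")
```

Under these assumptions the cache is modest at 8K context, so most of the extra VRAM budget goes to buffers and headroom rather than the cache itself.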
Get the model
Ollama
One-line install
ollama run mixtral:8x7b
HuggingFace
Original weights
Source repository — direct quantization required.
Frequently asked
What's the minimum VRAM to run Mixtral 8x7B Instruct?
Can I use Mixtral 8x7B Instruct commercially?
What's the context length of Mixtral 8x7B Instruct?
How do I install Mixtral 8x7B Instruct with Ollama?
Source: huggingface.co/mistralai/Mixtral-8x7B-Instruct-v0.1
Reviewed by RunLocalAI Editorial. See our editorial policy for how we research and verify model claims.