Mixtral 8x7B Instruct
The MoE model that introduced the 8-experts pattern to the open-weight world. 47B params total, 13B active. Still a viable workhorse on 36GB+ setups.
The first practical MoE model in local AI. Today it's stuck in an awkward middle: the routing means it activates only 13B parameters per token (fast for its size), but you still need to fit all 47B in VRAM (26 GB at Q4_K_M). The model that should beat it on both axes — Llama 3.3 70B at Q4 — exists and runs in similar memory with offload.
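For intuition, here is a back-of-envelope sketch (ours, not the vendor's numbers) that reproduces the 47B-total / 13B-active split and the ~26 GB Q4_K_M figure from Mixtral's published configuration; the ~4.5 bits-per-weight average assumed for Q4_K_M is an approximation, not an exact spec.

```python
# Back-of-envelope parameter math for Mixtral 8x7B (published config values).
# The 4.5 bits/weight for Q4_K_M is an assumed effective average, not an exact spec.
d_model, ffn, layers = 4096, 14336, 32
n_experts, top_k = 8, 2
vocab = 32_000
kv_heads, head_dim = 8, 128

# Shared (always-active) parameters: attention + embeddings (router etc. omitted).
attn_per_layer = d_model * d_model * 2 + d_model * kv_heads * head_dim * 2  # Q,O + K,V (GQA)
shared = layers * attn_per_layer + 2 * vocab * d_model

# Expert FFNs: gate/up/down projections per expert, 8 experts per layer.
expert = 3 * d_model * ffn
total = shared + layers * n_experts * expert   # ~46.7B
active = shared + layers * top_k * expert      # ~12.9B (only 2 of 8 experts fire per token)

q4_gb = total * 4.5 / 8 / 1e9                  # ~26 GB at ~4.5 bits/weight
print(f"total {total/1e9:.1f}B  active {active/1e9:.1f}B  Q4_K_M ~ {q4_gb:.0f} GB")
```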
Strengths
- Active-parameter speed: 28–35 tok/s on a 4090 at Q4 (with offload), notably faster than dense 47B equivalents.
- Apache 2.0 license — clean commercial story.
- Strong multilingual for its era; French and German specifically remain solid.
Weaknesses
- VRAM-heavy for the active compute: you pay 26 GB to use 13B worth of compute per token. A bad memory-vs-quality tradeoff today.
- Routing instability at long contexts — output quality degrades noticeably past 16K.
- Beaten by Llama 3.3 70B on almost every general benchmark while needing similar memory.
Quantization picks
- Q4_K_M (26 GB) — partial offload on 24 GB: 28–35 tok/s decode, TTFT ~250 ms on a 1K prompt
- Q5_K_M (33 GB) — heavy offload: 14–20 tok/s
- Q8_0 (47 GB) — workstation territory only
Should you run it?
Yes, for legacy fine-tunes you depend on, or where the Apache 2.0 license is required and you need MoE speed characteristics. No, for new deployments — Llama 3.3 70B at Q4_K_M lives in similar memory and produces meaningfully better outputs.
How it compares
- vs Llama 3.3 70B Q4 → similar VRAM footprint, Llama 3.3 wins on quality across general tasks. The MoE speed advantage is real (~25% faster) but the quality gap is larger.
- vs Mixtral 8x22B → 8x22B is the modern Mixtral pick; uses ~3× the VRAM but earns it on quality.
- vs Qwen 3 30B-A3B (MoE) → Qwen 3 30B-A3B does what Mixtral 8x7B promised: smaller VRAM (~17 GB Q4), tighter routing, better quality. Pick Qwen 3 30B-A3B if you want MoE speed today.
ollama pull mixtral:8x7b-instruct-v0.1-q4_K_M
ollama run mixtral:8x7b-instruct-v0.1-q4_K_M
Settings: Q4_K_M GGUF, 8192 ctx, --n-gpu-layers 24 of 33 on 4090, CUDA 12.4
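If you drive Ollama from Python rather than the CLI, the same settings can be passed as request options. A minimal sketch, assuming the official ollama client package and a server that already has the model pulled; num_ctx and num_gpu are Ollama's option names for the context window and the number of layers offloaded to the GPU.

```python
# Sketch: calling the same Mixtral tag through the ollama Python client
# with the review's settings (8K context, 24 of 33 layers on the GPU).
# Assumes `pip install ollama` and a running Ollama server with the model pulled.
import ollama

resp = ollama.generate(
    model="mixtral:8x7b-instruct-v0.1-q4_K_M",
    prompt="Summarise the trade-offs of sparse MoE models in two sentences.",
    options={
        "num_ctx": 8192,  # context window from the review settings
        "num_gpu": 24,    # layers offloaded to the GPU (partial offload on a 24 GB card)
    },
)
print(resp["response"])
```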
Why this rating
6.4/10 — the original sparse-MoE story for local inference was important, but the math no longer pencils out: Llama 3.3 70B uses similar VRAM and is materially better, and Mixtral 8x22B is the more credible MoE option if you can spend the VRAM.
Overview
The MoE model that introduced the 8-experts pattern to the open-weight world. 47B params total, 13B active. Still a viable workhorse on 36GB+ setups.
Strengths
- Apache 2.0
- Pioneering open-weight MoE
- Wide ecosystem support
Weaknesses
- Now outpaced by Qwen 3 30B-A3B
Quantization variants
Each quantization trades model quality for file size and VRAM. Q4_K_M is the most popular starting point.
| Quantization | File size | VRAM required |
|---|---|---|
| Q4_K_M | 28.0 GB | 32 GB |
| Q5_K_M | 33.0 GB | 38 GB |
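The VRAM-required column is higher than the file size because the KV cache and runtime buffers also live on the GPU. Below is a rough sketch of the fp16 KV-cache cost at 8K context, using Mixtral's published attention shape; the few extra GB of runtime overhead beyond that is an assumption.

```python
# Rough KV-cache estimate for Mixtral 8x7B at fp16 (2 bytes per value).
# Shape values are from the published model config; runtime overhead is assumed.
layers, kv_heads, head_dim = 32, 8, 128
ctx = 8192
bytes_per_val = 2  # fp16

kv_cache = layers * 2 * kv_heads * head_dim * ctx * bytes_per_val  # K and V
print(f"KV cache at {ctx} ctx: {kv_cache / 1e9:.2f} GB")  # ~1.07 GB

# Weights file + KV cache + a few GB of buffers is why the table budgets
# noticeably more VRAM than the raw GGUF size.
```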
Get the model
Ollama
One-line install
ollama run mixtral:8x7b
Read our Ollama review →
HuggingFace
Original weights
Source repository — you will need to quantize the weights yourself (e.g. to GGUF) before running them locally.
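A minimal sketch for pulling the original weights with huggingface_hub before quantizing them yourself; the local directory name is our placeholder, and you may need a Hugging Face token if the repo is gated.

```python
# Sketch: fetching the original Mixtral 8x7B Instruct weights from the source repo.
# Assumes `pip install huggingface_hub`; add token=... if the repo requires login.
from huggingface_hub import snapshot_download

path = snapshot_download(
    repo_id="mistralai/Mixtral-8x7B-Instruct-v0.1",
    local_dir="mixtral-8x7b-instruct",  # hypothetical local path
)
print(f"Weights downloaded to {path}")  # quantize to GGUF (e.g. with llama.cpp) before local use
```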
Benchmarks
Real measurements on real hardware. Numbers ship with the runner version, quant, and date.
| Hardware | Conf. | Quant | Ctx | Tokens / sec | VRAM | TTFT | Date |
|---|---|---|---|---|---|---|---|
| NVIDIA GeForce RTX 4090 (Ollama) | M | Q4_K_M | 8K | 31.4 tok/s | 23.1 GB | 248 ms | Apr 23, 26 |
Hardware that runs this
Cards with enough VRAM for at least one quantization of Mixtral 8x7B Instruct.
Models worth comparing
Same parameter band, plus what's one tier above and below — so you can decide what actually fits your hardware.
Frequently asked
What's the minimum VRAM to run Mixtral 8x7B Instruct?
The Q4_K_M weights are about 26 GB, so budget roughly 32 GB of VRAM with context and overhead; a 24 GB card such as the RTX 4090 still works with partial offload at 28–35 tok/s.
Can I use Mixtral 8x7B Instruct commercially?
Yes. The model is released under Apache 2.0, which permits commercial use without additional licensing.
What's the context length of Mixtral 8x7B Instruct?
32K tokens, though as noted above output quality degrades noticeably past 16K in practice.
How do I install Mixtral 8x7B Instruct with Ollama?
Run ollama pull mixtral:8x7b-instruct-v0.1-q4_K_M followed by ollama run mixtral:8x7b-instruct-v0.1-q4_K_M, or use the shorter ollama run mixtral:8x7b alias shown above.
Source: huggingface.co/mistralai/Mixtral-8x7B-Instruct-v0.1
Reviewed by RunLocalAI Editorial. See our editorial policy for how we research and verify model claims.