Mixtral 8x7B Instruct
The MoE model that introduced the 8-experts pattern to the open-weight world. 47B params total, 13B active. Still a viable workhorse on 36GB+ setups.
The first practical MoE model in local AI. Today it's stuck in an awkward middle: the routing means it activates only 13B parameters per token (fast for its size), but you still need to fit all 47B in VRAM (26 GB at Q4_K_M). The model that should beat it on both axes — Llama 3.3 70B at Q4 — exists and runs in similar memory with offload.
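For intuition, here is a back-of-envelope sketch (ours, not the vendor's numbers) that reproduces the 47B-total / 13B-active split and the ~26 GB Q4_K_M figure from Mixtral's published configuration; the ~4.5 bits-per-weight average assumed for Q4_K_M is an approximation, not an exact spec.

```python
# Back-of-envelope parameter math for Mixtral 8x7B (published config values).
# The 4.5 bits/weight for Q4_K_M is an assumed effective average, not an exact spec.
d_model, ffn, layers = 4096, 14336, 32
n_experts, top_k = 8, 2
vocab = 32_000
kv_heads, head_dim = 8, 128

# Shared (always-active) parameters: attention + embeddings (router etc. omitted).
attn_per_layer = d_model * d_model * 2 + d_model * kv_heads * head_dim * 2  # Q,O + K,V (GQA)
shared = layers * attn_per_layer + 2 * vocab * d_model

# Expert FFNs: gate/up/down projections per expert, 8 experts per layer.
expert = 3 * d_model * ffn
total = shared + layers * n_experts * expert   # ~46.7B
active = shared + layers * top_k * expert      # ~12.9B (only 2 of 8 experts fire per token)

q4_gb = total * 4.5 / 8 / 1e9                  # ~26 GB at ~4.5 bits/weight
print(f"total {total/1e9:.1f}B  active {active/1e9:.1f}B  Q4_K_M ~ {q4_gb:.0f} GB")
```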
Strengths
- Active-parameter speed: 28–35 tok/s on a 4090 at Q4 (with offload), notably faster than dense 47B equivalents.
- Apache 2.0 license — clean commercial story.
- Strong multilingual for its era; French and German specifically remain solid.
Weaknesses
- VRAM-heavy for the active compute: you pay 26 GB to use 13B worth of compute per token. A bad memory-vs-quality tradeoff today.
- Routing instability at long contexts — output quality degrades noticeably past 16K.
- Beaten by Llama 3.3 70B on almost every general benchmark while needing similar memory.
Quantization picks
- Q4_K_M (26 GB) — partial offload on 24 GB: 28–35 tok/s decode, TTFT ~250 ms on a 1K prompt
- Q5_K_M (33 GB) — heavy offload: 14–20 tok/s
- Q8_0 (47 GB) — workstation territory only
Should you run it?
Yes, for legacy fine-tunes you depend on, or where the Apache 2.0 license is required and you need MoE speed characteristics. No, for new deployments — Llama 3.3 70B at Q4_K_M lives in similar memory and produces meaningfully better outputs.
How it compares
- vs Llama 3.3 70B Q4 → similar VRAM footprint, Llama 3.3 wins on quality across general tasks. The MoE speed advantage is real (~25% faster) but the quality gap is larger.
- vs Mixtral 8x22B → 8x22B is the modern Mixtral pick; uses ~3× the VRAM but earns it on quality.
- vs Qwen 3 30B-A3B (MoE) → Qwen 3 30B-A3B does what Mixtral 8x7B promised: smaller VRAM (~17 GB Q4), tighter routing, better quality. Pick Qwen 3 30B-A3B if you want MoE speed today.
ollama pull mixtral:8x7b-instruct-v0.1-q4_K_M
ollama run mixtral:8x7b-instruct-v0.1-q4_K_M
Settings: Q4_K_M GGUF, 8192 ctx, --n-gpu-layers 24 of 33 on 4090, CUDA 12.4
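If you drive Ollama from Python rather than the CLI, the same settings can be passed as request options. A minimal sketch, assuming the official ollama client package and a server that already has the model pulled; num_ctx and num_gpu are Ollama's option names for the context window and the number of layers offloaded to the GPU.

```python
# Sketch: calling the same Mixtral tag through the ollama Python client
# with the review's settings (8K context, 24 of 33 layers on the GPU).
# Assumes `pip install ollama` and a running Ollama server with the model pulled.
import ollama

resp = ollama.generate(
    model="mixtral:8x7b-instruct-v0.1-q4_K_M",
    prompt="Summarise the trade-offs of sparse MoE models in two sentences.",
    options={
        "num_ctx": 8192,  # context window from the review settings
        "num_gpu": 24,    # layers offloaded to the GPU (partial offload on a 24 GB card)
    },
)
print(resp["response"])
```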
Why this rating
6.4/10 — the original sparse-MoE story for local inference was important, but the math no longer pencils out: Llama 3.3 70B uses similar VRAM and is materially better, and Mixtral 8x22B is the more credible MoE option if you can spend the VRAM.
Overview
The MoE model that introduced the 8-experts pattern to the open-weight world. 47B params total, 13B active. Still a viable workhorse on 36GB+ setups.
Strengths
- Apache 2.0
- Pioneering open-weight MoE
- Wide ecosystem support
Weaknesses
- Now outpaced by Qwen 3 30B-A3B
Quantization variants
Each quantization trades model quality for file size and VRAM. Q4_K_M is the most popular starting point.
| Quantization | File size | VRAM required |
|---|---|---|
| Q4_K_M | 28.0 GB | 32 GB |
| Q5_K_M | 33.0 GB | 38 GB |
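The VRAM-required column is higher than the file size because the KV cache and runtime buffers also live on the GPU. Below is a rough sketch of the fp16 KV-cache cost at 8K context, using Mixtral's published attention shape; the few extra GB of runtime overhead beyond that is an assumption.

```python
# Rough KV-cache estimate for Mixtral 8x7B at fp16 (2 bytes per value).
# Shape values are from the published model config; runtime overhead is assumed.
layers, kv_heads, head_dim = 32, 8, 128
ctx = 8192
bytes_per_val = 2  # fp16

kv_cache = layers * 2 * kv_heads * head_dim * ctx * bytes_per_val  # K and V
print(f"KV cache at {ctx} ctx: {kv_cache / 1e9:.2f} GB")  # ~1.07 GB

# Weights file + KV cache + a few GB of buffers is why the table budgets
# noticeably more VRAM than the raw GGUF size.
```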
Get the model
Ollama
One-line install
ollama run mixtral:8x7b
Read our Ollama review →
HuggingFace
Original weights
Source repository — you will need to quantize the weights yourself (e.g. to GGUF) before running them locally.
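A minimal sketch for pulling the original weights with huggingface_hub before quantizing them yourself; the local directory name is our placeholder, and you may need a Hugging Face token if the repo is gated.

```python
# Sketch: fetching the original Mixtral 8x7B Instruct weights from the source repo.
# Assumes `pip install huggingface_hub`; add token=... if the repo requires login.
from huggingface_hub import snapshot_download

path = snapshot_download(
    repo_id="mistralai/Mixtral-8x7B-Instruct-v0.1",
    local_dir="mixtral-8x7b-instruct",  # hypothetical local path
)
print(f"Weights downloaded to {path}")  # quantize to GGUF (e.g. with llama.cpp) before local use
```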
Benchmarks
Real measurements on real hardware. Numbers ship with the runner version, quant, and date.
| Hardware | Conf. | Quant | Ctx | Tokens / sec | VRAM | TTFT | Date |
|---|---|---|---|---|---|---|---|
| NVIDIA GeForce RTX 4090 (Ollama) | M | Q4_K_M | 8K | 31.4 tok/s | 23.1 GB | 248 ms | Apr 23, 26 |
Hardware that runs this
Cards with enough VRAM for at least one quantization of Mixtral 8x7B Instruct.
Models worth comparing
Same parameter band, plus what's one tier above and below — so you can decide what actually fits your hardware.
Frequently asked
What's the minimum VRAM to run Mixtral 8x7B Instruct?
The Q4_K_M weights are about 26 GB, so budget roughly 32 GB of VRAM with context and overhead; a 24 GB card such as the RTX 4090 still works with partial offload at 28–35 tok/s.
Can I use Mixtral 8x7B Instruct commercially?
Yes. The model is released under Apache 2.0, which permits commercial use without additional licensing.
What's the context length of Mixtral 8x7B Instruct?
32K tokens, though as noted above output quality degrades noticeably past 16K in practice.
How do I install Mixtral 8x7B Instruct with Ollama?
Run ollama pull mixtral:8x7b-instruct-v0.1-q4_K_M followed by ollama run mixtral:8x7b-instruct-v0.1-q4_K_M, or use the shorter ollama run mixtral:8x7b alias shown above.
Source: huggingface.co/mistralai/Mixtral-8x7B-Instruct-v0.1
Reviewed by RunLocalAI Editorial. See our editorial policy for how we research and verify model claims.