Mistral
by Mistral AI
Mistral's mixed open + closed family. Mistral 7B + Mixtral 8x22B are the open-weight standards; Mistral Large + Codestral are commercial. Codestral Mamba 7B introduced state-space models to production code workflows.
Best entry point for local use
Start with Mistral Small 22B at Q4_K_M via Ollama — it fits on a single RTX 4090 (24 GB), delivers strong European-language token efficiency (32K SentencePiece BPE vocab) and competitive reasoning (MMLU 80%). The 22B sits at Mistral's optimal density-performance intersection — better than Llama 3.1 8B for multilingual tasks and easier to deploy than Mixtral's MoE variants. If your VRAM budget is under 12 GB, use Mistral 7B v0.3 at Q4 (~5 GB) — it runs on a MacBook Pro M4 Max at 30+ tok/s and remains competitive for general assistant workloads. Skip Mixtral 8x22B for a first deployment — the MoE complexity adds serving overhead without proportional quality gains over the 22B dense model. Skip Codestral Mamba unless you specifically need O(1) per-token long-context inference — the Mamba architecture has narrower runtime support.
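To make the entry point concrete, here is a minimal smoke test against Ollama's default local REST API. It assumes Ollama is listening on its default port (11434) and that the model has already been pulled as mistral-small:22b; the exact tag is an assumption, so check ollama list on your machine.

```python
# Minimal local smoke test against Ollama's default REST API (http://localhost:11434).
# Assumes `ollama pull mistral-small:22b` has already been run; the exact tag name
# is an assumption -- verify with `ollama list`.
import requests

OLLAMA_URL = "http://localhost:11434/api/generate"

def ask(prompt: str, model: str = "mistral-small:22b") -> str:
    """Send a single non-streaming generation request and return the response text."""
    resp = requests.post(
        OLLAMA_URL,
        json={"model": model, "prompt": prompt, "stream": False},
        timeout=120,
    )
    resp.raise_for_status()
    return resp.json()["response"]

if __name__ == "__main__":
    print(ask("Summarise the EU AI Act in three sentences, in French."))
```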
Deployment guidance
For single-user local: Ollama with mistral-small:22b at Q4_K_M on an RTX 4090 24 GB. For multi-user serving: vLLM with AWQ 4-bit on 2× L40S — Mistral's sliding window attention (SWA) enables efficient KV-cache management at high concurrency. For Mixtral 8x7B MoE: vLLM 0.5.4+ on 2× RTX 4090 with expert parallelism — only 12.9B parameters are active per token, but all 46.7B must stay resident in VRAM, so the Q4 footprint is still ~30 GB. For mobile/edge: llama.cpp with Mistral 7B Q4_0 on Snapdragon X Elite — ~20 tok/s. For Codestral Mamba: Mamba-specific CUDA kernels are required (e.g. the mamba-ssm selective-scan implementation) — standard transformer engines do not support the architecture out of the box. Mistral models use a European-optimized tokenizer — English throughput is ~12% lower token-for-token vs Llama. See GPU buyer guide (same GPU class applies).
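For the multi-user serving path, a minimal vLLM sketch along these lines is a reasonable starting point. The model id below is a placeholder for whichever AWQ 4-bit export of Mistral Small you actually use, and the parallelism setting assumes the 2× L40S layout mentioned above.

```python
# Sketch of the multi-user serving path described above, assuming vLLM is installed
# and a 4-bit AWQ export of Mistral Small is available locally or on the HF Hub.
# The model id below is a placeholder -- substitute whichever AWQ checkpoint you use.
from vllm import LLM, SamplingParams

llm = LLM(
    model="mistral-small-22b-awq",   # placeholder id; point at your AWQ checkpoint
    quantization="awq",
    tensor_parallel_size=2,          # split across the 2x L40S mentioned above
    gpu_memory_utilization=0.90,
)

params = SamplingParams(temperature=0.2, max_tokens=256)
outputs = llm.generate(["Explain sliding window attention in two sentences."], params)
print(outputs[0].outputs[0].text)
```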
Verify Mistral runs on your specific hardware before committing money.
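One way to do that verification before spending anything is a back-of-the-envelope fit check. The bytes-per-parameter constants below are rough approximations for common quant formats, not vendor figures, and the KV-cache allowance is a guess for moderate context lengths.

```python
# Back-of-the-envelope VRAM check before buying hardware. The bytes-per-parameter
# figures are rough approximations for common quant formats, not vendor numbers.
BYTES_PER_PARAM = {"fp16": 2.0, "q8_0": 1.06, "q4_k_m": 0.60}

def fits(params_b: float, quant: str, vram_gb: float, kv_overhead_gb: float = 2.0) -> bool:
    """Return True if weights plus a KV-cache allowance fit in the given VRAM budget."""
    weights_gb = params_b * BYTES_PER_PARAM[quant]
    return weights_gb + kv_overhead_gb <= vram_gb

# Mistral Small 22B at Q4_K_M on a 24 GB RTX 4090: ~13 GB of weights plus overhead -> fits.
print(fits(22, "q4_k_m", 24))      # True
# Mixtral 8x7B (46.7B total params, all resident) on a single 24 GB card: does not fit.
print(fits(46.7, "q4_k_m", 24))    # False
```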