Mistral Small 3 24B
Re-release of Mistral Small under Apache 2.0. Competitive with Llama 3.3 70B at one-third the size for many tasks.
Positioning
The model that proved Mistral could still ship competitive open weights post-2024. Mistral Small 3 24B is the cleanest-licensed option in the 20–32B class, with strong instruction polish and runtime stability. Right pick if Apache 2.0 is required and you have a 16 GB+ card.
Strengths
- True Apache 2.0 — no MAU caps, no usage restrictions.
- Instruction following is excellent — among the most reliable in this size class.
- Tool-use format clean and well-documented — Mistral's function-call convention is mature.
Limitations
- Slightly weaker than Qwen 3 32B on hard reasoning tasks.
- No thinking-mode equivalent — it's a single-mode dense model.
- Multilingual is European-focused — Asian languages weaker than Qwen.
Real-world performance on RTX 4090
- Q4_K_M (14.6 GB): 75–92 tok/s decode, TTFT ~110 ms — full GPU
- Q5_K_M (17.3 GB): 62–78 tok/s
- Q8_0 (26 GB): partial offload, 22–30 tok/s
Should you run this locally?
Yes, for Apache-licensed deployments, RTX 4070 Ti 16 GB / 4080 / 5080 owners, or anyone who values instruction-polish over raw capability. No, for users who can run 32B+ and don't care about license terms — Qwen 3 32B is slightly stronger.
How it compares
- vs Qwen 3 32B → Qwen 3 32B is slightly smarter; Mistral Small 3 24B has cleaner license + better instruction polish. Pick by priority.
- vs Mistral 7B v0.3 → Mistral Small 3 24B is the modern Mistral; 7B v0.3 is obsolete.
- vs Mistral Nemo 12B → Small 3 24B wins on capability; Nemo wins on VRAM (8 GB Q4 vs 14.6 GB).
- vs Mixtral 8x7B → Small 3 24B uses far less VRAM (14.6 GB vs 26 GB) and is comparable in quality.
Run this yourself
ollama pull mistral-small:24b-instruct-q4_K_M
ollama run mistral-small:24b-instruct-q4_K_M
Settings: Q4_K_M GGUF, 16384 ctx, full GPU on RTX 4080 / 4090
›Why this rating
8.4/10 — Mistral's return to relevance in the dense mid-tier. Apache 2.0, strong instruction following, fits on a 24 GB card with comfort. Loses points to Qwen 3 32B which is a slightly bigger, slightly stronger sibling at similar VRAM.
Overview
Re-release of Mistral Small under Apache 2.0. Competitive with Llama 3.3 70B at one-third the size for many tasks.
Execution notes
Operator notes
Mistral Small 3 24B Instruct is Mistral AI's open-weight instruction-tuned baseline at 24B as of late 2025. Apache 2.0. The right pick when you want Mistral's traditional instruction-following polish + European-multilingual depth + a clean commercial license.
The 24B size hits a specific operating point: bigger than the 14B-class models (Phi-4 14B, Qwen 2.5 14B) where instruction following starts to feel constrained, smaller than the 32B class (Qwen 2.5 32B, DeepSeek R1 Distill Qwen 32B) that requires AWQ-INT4 to fit consumer cards. 24B at Q4_K_M fits 16GB VRAM comfortably with 8K context.
Deployment notes
The /stacks/16gb-vram-local-ai recipe doesn't default to this model (Phi-4 14B and Qwen 2.5 7B win on raw benchmarks at the same VRAM tier), but Mistral Small 3 is the right pick when instruction-following polish matters more than raw benchmark scores. Tool-call reliability is consistently strong; structured output works without elaborate prompting.
For consumer-tier multilingual workflows with European-language depth, this is the operator default.
Runtime compatibility
- Ollama ✓ excellent. Q4_K_M GGUF; one-line pull via `ollama pull mistral-small3:24b`.
- vLLM ✓ excellent. AWQ-INT4 supported; production-tier multi-user serving on RTX 5090 / A100 80GB.
- MLX-LM ✓ good. Apple Silicon with MLX-4bit quant on M3 Max 64GB / M4 Max.
- llama.cpp ✓ excellent. Native GGUF support — the engine under Ollama / LM Studio.
- TensorRT-LLM ✓ partial. Compiles but the recompile-per-config friction is high for iteration.
Best use cases
- Multilingual chat — Mistral's traditional European-language depth carries; better than 14B-class on long-tail languages.
- Instruction-following workflows — tool-call reliability + structured output strong out of the box.
- Apache 2.0 commercial deployments — clean license for production self-hosting.
- Consumer-tier serving with a polished baseline — when you don't need reasoning-mode toggle (Qwen 3) or coding specialization (Qwen 2.5 Coder).
When to use a different model
- Coding-first workloads: Qwen 2.5 Coder 14B or Qwen 2.5 Coder 32B — Mistral Small 3 isn't coding-specialized.
- Reasoning-first workloads: Phi-4 14B (cheaper) or Qwen 3 32B (reasoning-toggle) or DeepSeek R1 Distill Qwen 14B (always-on reasoning at consumer scale).
- CJK languages: Qwen 2.5 14B — Qwen has deeper CJK depth.
- Edge / phone tier: Phi-4 Mini 4B or Llama 3.2 3B.
- Latest Mistral release: Mistral Small 3.2 24B — the iterative refresh with improved tool-call reliability.
Failure modes specific to this model
- No reasoning-mode toggle. Mistral Small 3 is pure instruction-tuned; no `` block emission. If your workflow needs explicit reasoning, use Qwen 3 / R1 Distill variants instead.
- Tool-call format quirks on older runtimes. Some llama.cpp builds parse Mistral's tool-call format slightly differently — verify with your specific runtime + version pin.
- Short context vs newer flagships. 131K context is competitive; long-context recall holds reasonably but not as strong as the 256K Mistral Medium 3.5 family.
Going deeper
- Mistral Small 3.2 24B — the iterative refresh
- Magistral 32B — reasoning-specialized Mistral
- Codestral Mamba 7B — Mistral SSM coding architecture alternative
- /stacks/16gb-vram-local-ai — consumer-tier deployment context
- /stacks/rtx-4090-workstation — workstation deployment context
Family & lineage
How this model relates to others in its lineage. Family members share architecture and training-data roots; parent / children edges record direct distillation or fine-tune relationships.
Strengths
- Apache 2.0
- Strong instruction following
- 32K context
Weaknesses
- Smaller context than Qwen/Llama
Prompting kit
Tested patterns for getting the most out of Mistral Small 3 24B locally. Local models are pickier about prompt structure than cloud models — what works on Claude or GPT-5 often fails here.
Recommended system prompt
You are a precise and helpful assistant. Answer directly. For tool calls, emit valid JSON; for text answers, keep them concise unless detail is requested.
Quirks to know
- •Mistral Small 3.0 (January 2025 release). Per Mistral's release notes, positioned as a 'workhorse' replacement for GPT-4o-mini-class workloads at much lower inference cost.
- •32K context window per the model card. Note: this is shorter than the 3.2 release's 128K window.
- •Native tool calling supported, but Mistral's release notes note that tool-call reliability was materially improved in 3.1+ and 3.2. If reliability matters, prefer 3.2 over 3.0.
- •Per Mistral's docs, recommended sampler is low temperature (0.15) for tool/reasoning workloads — same convention as the 3.2 sibling.
- •Multilingual: same set as 3.2 — dozens of languages with strong Western European and CJK quality, weaker lower-resource coverage than Gemma or Qwen.
Chat template
[INST]...[/INST] markers with system-prompt support. Ships in tokenizer_config.json — apply via the runtime.
Tool calling
OpenAI-compatible tool call format per the model card. Reliability is lower than 3.2 — re-prompt on tool_call parse failures.
Sampler settings
- temperature
- 0.15
- top_p
- 1
Mistral's recommended low-temperature default for tool use and reasoning. Raise to ~0.7 for creative writing.
Quantization variants
Each quantization trades model quality for file size and VRAM. Q4_K_M is the most popular starting point.
| Quantization | File size | VRAM required |
|---|---|---|
| Q4_K_M | 14.0 GB | 18 GB |
| Q8_0 | 26.0 GB | 30 GB |
Get the model
Ollama
One-line install
ollama run mistral-small:24bRead our Ollama review →HuggingFace
Original weights
Source repository — direct quantization required.
Hardware that runs this
Cards with enough VRAM for at least one quantization of Mistral Small 3 24B.
Models worth comparing
Same parameter band, plus what's one tier above and below — so you can decide what actually fits your hardware.
Frequently asked
What's the minimum VRAM to run Mistral Small 3 24B?
Can I use Mistral Small 3 24B commercially?
What's the context length of Mistral Small 3 24B?
How do I install Mistral Small 3 24B with Ollama?
Source: huggingface.co/mistralai/Mistral-Small-24B-Instruct-2501
Reviewed by RunLocalAI Editorial. See our editorial policy for how we research and verify model claims.
Related — keep moving
Verify Mistral Small 3 24B runs on your specific hardware before committing money.