mistral
24B parameters
Commercial OK
Reviewed June 2026

Mistral Small 3 24B

Re-release of Mistral Small under Apache 2.0. Competitive with Llama 3.3 70B at one-third the size for many tasks.

License: Apache 2.0·Released Jan 30, 2025·Context: 32,768 tokens
BLK · VERDICT

Our verdict

OP · Fredoline Eruo|VERIFIED JUN 12, 2026
8.4/10

Positioning

The model that proved Mistral could still ship competitive open weights post-2024. Mistral Small 3 24B is the cleanest-licensed option in the 20–32B class, with strong instruction polish and runtime stability. Right pick if Apache 2.0 is required and you have a 16 GB+ card.

Strengths

  • True Apache 2.0 — no MAU caps, no usage restrictions.
  • Instruction following is excellent — among the most reliable in this size class.
  • Tool-use format clean and well-documented — Mistral's function-call convention is mature.

Limitations

  • Slightly weaker than Qwen 3 32B on hard reasoning tasks.
  • No thinking-mode equivalent — it's a single-mode dense model.
  • Multilingual is European-focused — Asian languages weaker than Qwen.

Real-world performance on RTX 4090

  • Q4_K_M (14.6 GB): 75–92 tok/s decode, TTFT ~110 ms — full GPU
  • Q5_K_M (17.3 GB): 62–78 tok/s
  • Q8_0 (26 GB): partial offload, 22–30 tok/s

Should you run this locally?

Yes, for Apache-licensed deployments, RTX 4070 Ti 16 GB / 4080 / 5080 owners, or anyone who values instruction-polish over raw capability. No, for users who can run 32B+ and don't care about license terms — Qwen 3 32B is slightly stronger.

How it compares

  • vs Qwen 3 32B → Qwen 3 32B is slightly smarter; Mistral Small 3 24B has cleaner license + better instruction polish. Pick by priority.
  • vs Mistral 7B v0.3 → Mistral Small 3 24B is the modern Mistral; 7B v0.3 is obsolete.
  • vs Mistral Nemo 12B → Small 3 24B wins on capability; Nemo wins on VRAM (8 GB Q4 vs 14.6 GB).
  • vs Mixtral 8x7B → Small 3 24B uses far less VRAM (14.6 GB vs 26 GB) and is comparable in quality.

Run this yourself

ollama pull mistral-small:24b-instruct-q4_K_M
ollama run mistral-small:24b-instruct-q4_K_M
Settings: Q4_K_M GGUF, 16384 ctx, full GPU on RTX 4080 / 4090
Why this rating

8.4/10 — Mistral's return to relevance in the dense mid-tier. Apache 2.0, strong instruction following, fits on a 24 GB card with comfort. Loses points to Qwen 3 32B which is a slightly bigger, slightly stronger sibling at similar VRAM.

Overview

Re-release of Mistral Small under Apache 2.0. Competitive with Llama 3.3 70B at one-third the size for many tasks.

Execution notes

L1.25 enriched

Operator notes

Mistral Small 3 24B Instruct is Mistral AI's open-weight instruction-tuned baseline at 24B as of late 2025. Apache 2.0. The right pick when you want Mistral's traditional instruction-following polish + European-multilingual depth + a clean commercial license.

The 24B size hits a specific operating point: bigger than the 14B-class models (Phi-4 14B, Qwen 2.5 14B) where instruction following starts to feel constrained, smaller than the 32B class (Qwen 2.5 32B, DeepSeek R1 Distill Qwen 32B) that requires AWQ-INT4 to fit consumer cards. 24B at Q4_K_M fits 16GB VRAM comfortably with 8K context.

Deployment notes

The /stacks/16gb-vram-local-ai recipe doesn't default to this model (Phi-4 14B and Qwen 2.5 7B win on raw benchmarks at the same VRAM tier), but Mistral Small 3 is the right pick when instruction-following polish matters more than raw benchmark scores. Tool-call reliability is consistently strong; structured output works without elaborate prompting.

For consumer-tier multilingual workflows with European-language depth, this is the operator default.

Runtime compatibility

  • Ollama ✓ excellent. Q4_K_M GGUF; one-line pull via `ollama pull mistral-small3:24b`.
  • vLLM ✓ excellent. AWQ-INT4 supported; production-tier multi-user serving on RTX 5090 / A100 80GB.
  • MLX-LM ✓ good. Apple Silicon with MLX-4bit quant on M3 Max 64GB / M4 Max.
  • llama.cpp ✓ excellent. Native GGUF support — the engine under Ollama / LM Studio.
  • TensorRT-LLM ✓ partial. Compiles but the recompile-per-config friction is high for iteration.

Best use cases

  • Multilingual chat — Mistral's traditional European-language depth carries; better than 14B-class on long-tail languages.
  • Instruction-following workflows — tool-call reliability + structured output strong out of the box.
  • Apache 2.0 commercial deployments — clean license for production self-hosting.
  • Consumer-tier serving with a polished baseline — when you don't need reasoning-mode toggle (Qwen 3) or coding specialization (Qwen 2.5 Coder).

When to use a different model

Failure modes specific to this model

  1. No reasoning-mode toggle. Mistral Small 3 is pure instruction-tuned; no `` block emission. If your workflow needs explicit reasoning, use Qwen 3 / R1 Distill variants instead.
  2. Tool-call format quirks on older runtimes. Some llama.cpp builds parse Mistral's tool-call format slightly differently — verify with your specific runtime + version pin.
  3. Short context vs newer flagships. 131K context is competitive; long-context recall holds reasonably but not as strong as the 256K Mistral Medium 3.5 family.

Going deeper

Reviewed May 6, 2026 by Fredoline Eruo

Family & lineage

How this model relates to others in its lineage. Family members share architecture and training-data roots; parent / children edges record direct distillation or fine-tune relationships.

Family siblings (mistral-small-3)
Mistral Small 3 24B24B
You are here
Mistral Small 3.2 24B24B
Consumer
Distilled / fine-tuned from this

Strengths

  • Apache 2.0
  • Strong instruction following
  • 32K context

Weaknesses

  • Smaller context than Qwen/Llama

Prompting kit

From model card
source

Tested patterns for getting the most out of Mistral Small 3 24B locally. Local models are pickier about prompt structure than cloud models — what works on Claude or GPT-5 often fails here.

Recommended system prompt

You are a precise and helpful assistant. Answer directly. For tool calls, emit valid JSON; for text answers, keep them concise unless detail is requested.

Quirks to know

  • Mistral Small 3.0 (January 2025 release). Per Mistral's release notes, positioned as a 'workhorse' replacement for GPT-4o-mini-class workloads at much lower inference cost.
  • 32K context window per the model card. Note: this is shorter than the 3.2 release's 128K window.
  • Native tool calling supported, but Mistral's release notes note that tool-call reliability was materially improved in 3.1+ and 3.2. If reliability matters, prefer 3.2 over 3.0.
  • Per Mistral's docs, recommended sampler is low temperature (0.15) for tool/reasoning workloads — same convention as the 3.2 sibling.
  • Multilingual: same set as 3.2 — dozens of languages with strong Western European and CJK quality, weaker lower-resource coverage than Gemma or Qwen.

Chat template

Mistral Instruct v3

[INST]...[/INST] markers with system-prompt support. Ships in tokenizer_config.json — apply via the runtime.

Tool calling

✓ Supported(openai-compatible)

OpenAI-compatible tool call format per the model card. Reliability is lower than 3.2 — re-prompt on tool_call parse failures.

Sampler settings

temperature
0.15
top_p
1

Mistral's recommended low-temperature default for tool use and reasoning. Raise to ~0.7 for creative writing.

Browse prompting kits for every model →/prompting

Quantization variants

Each quantization trades model quality for file size and VRAM. Q4_K_M is the most popular starting point.

QuantizationFile sizeVRAM required
Q4_K_M14.0 GB18 GB
Q8_026.0 GB30 GB

Get the model

Ollama

One-line install

ollama run mistral-small:24bRead our Ollama review →

HuggingFace

Original weights

huggingface.co/mistralai/Mistral-Small-24B-Instruct-2501

Source repository — direct quantization required.

Hardware that runs this

Cards with enough VRAM for at least one quantization of Mistral Small 3 24B.

Compare alternatives

Models worth comparing

Same parameter band, plus what's one tier above and below — so you can decide what actually fits your hardware.

Frequently asked

What's the minimum VRAM to run Mistral Small 3 24B?

18GB of VRAM is enough to run Mistral Small 3 24B at the Q4_K_M quantization (file size 14.0 GB). Higher-quality quantizations need more.

Can I use Mistral Small 3 24B commercially?

Yes — Mistral Small 3 24B ships under the Apache 2.0, which permits commercial use. Always read the license text before deployment.

What's the context length of Mistral Small 3 24B?

Mistral Small 3 24B supports a context window of 32,768 tokens (about 33K).

How do I install Mistral Small 3 24B with Ollama?

Run `ollama pull mistral-small:24b` to download, then `ollama run mistral-small:24b` to start a chat session. The default quantization is Q4_K_M.

Source: huggingface.co/mistralai/Mistral-Small-24B-Instruct-2501

Reviewed by RunLocalAI Editorial. See our editorial policy for how we research and verify model claims.

Related — keep moving

Before you buy

Verify Mistral Small 3 24B runs on your specific hardware before committing money.