mistral
24B parameters
Commercial OK
Reviewed June 2026

Devstral Small 2 24B

Mistral's coding-specialized Mistral Small 2 successor. Apache 2.0 — the rare commercial-OK Mistral coder.

License: Apache 2.0 · Released Sep 25, 2025 · Context: 131,072 tokens

Our verdict

OP · Fredoline Eruo | Verified Jun 12, 2026
unrated

Positioning

Devstral Small 2 24B is a dense 24-billion-parameter model from Mistral AI with a 131K-token context window, positioned as a coding-specialized successor to Mistral Small 2. Its permissive Apache 2.0 license makes it a rare commercial-OK coding model from Mistral and an open-weight option for developers who need unrestricted deployment.

Strengths

  • Apache 2.0 license for commercial coding: Unlike many Mistral models, this one is fully open for commercial use, making it suitable for proprietary codebases and enterprise deployment.
  • Large 131K context window: Supports long code files, multi-file projects, or extensive documentation in a single prompt.
  • Dense architecture at 24B params: Inference cost scales predictably with parameter count, without the overhead of an MoE router.
  • Consumer-grade deployment possible: At Q4_K_M (13.5 GB) or Q3_K_M (11.7 GB), the model fits on a single 16-24 GB GPU, with room for KV cache.

Limitations

  • No community benchmarks available: We do not have independent measurements of coding accuracy, instruction following, or speed. Vendor claims should be treated as best-case.
  • 24B dense requires significant VRAM: FP16 (~48 GB) is impractical for consumer hardware; quantized versions are necessary, which may affect output quality.
  • KV cache overhead at long context: At 131K tokens, the KV cache can add 30-50% to memory requirements, potentially pushing beyond single-GPU limits.
  • Not a frontier model: As a 24B dense model, it is not designed to compete with larger frontier models; its value is in permissive licensing and local deployability.

What it takes to run this locally

Quantized sizes (disk): Q8_0 ~26 GB, Q6_K ~19.8 GB, Q5_K_M ~17.1 GB, Q4_K_M ~13.5 GB, Q3_K_M ~11.7 GB, Q2_K ~7.8 GB. Add ~30-50% for KV cache and framework overhead. A single consumer GPU with 16-24 GB VRAM can run Q4_K_M or Q3_K_M comfortably. For full FP16 precision, a workstation GPU (e.g., 48 GB) or multi-GPU setup is required.
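The sizing rule above can be turned into a quick fit check. This is a minimal sketch: the disk sizes and the 30-50% overhead range are copied from the paragraph above, while the function names and the three verdict labels are illustrative choices, not part of any tool.

```python
# Disk sizes in GB, copied from the list above.
QUANT_SIZES_GB = {
    "Q8_0": 26.0,
    "Q6_K": 19.8,
    "Q5_K_M": 17.1,
    "Q4_K_M": 13.5,
    "Q3_K_M": 11.7,
    "Q2_K": 7.8,
}

# Rule-of-thumb overhead for KV cache and framework (30-50%), from the text.
OVERHEAD_LOW, OVERHEAD_HIGH = 0.30, 0.50


def vram_estimate_gb(quant: str) -> tuple[float, float]:
    """Return the (optimistic, pessimistic) VRAM estimate for a quantization."""
    size = QUANT_SIZES_GB[quant]
    return size * (1 + OVERHEAD_LOW), size * (1 + OVERHEAD_HIGH)


def fit_report(gpu_vram_gb: float) -> dict[str, str]:
    """Label each quantization as 'fits', 'tight', or 'needs offload'."""
    report = {}
    for quant in QUANT_SIZES_GB:
        low, high = vram_estimate_gb(quant)
        if high <= gpu_vram_gb:
            report[quant] = "fits"           # fits even at 50% overhead
        elif low <= gpu_vram_gb:
            report[quant] = "tight"          # fits only near 30% overhead
        else:
            report[quant] = "needs offload"  # exceeds VRAM even at 30%
    return report


for quant, verdict in fit_report(24).items():
    print(f"{quant:7s} {verdict}")
```

The "tight" label flags cases where the result depends on how much context you allocate; the overhead range is a rule of thumb, so treat the output as a starting point and confirm on real hardware.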

Should you run this locally?

Yes if you need a permissively licensed coding model for commercial use and have a consumer GPU with at least 16 GB VRAM. The Apache 2.0 license removes deployment friction.

No if you require frontier-level coding performance, or if your hardware forces you below Q4_K_M and you cannot accept the resulting quality loss. Also not ideal if you need a general-purpose model rather than a coding specialist.

How to run it

Devstral Small 2 24B is a developer-oriented fine-tune of a Mistral 24B base model, optimized for coding tasks: code generation, debugging, refactoring, and developer-tool integration. Fill-in-the-middle (FIM) is likely supported but unconfirmed; verify before relying on it. The "Small 2" versioning suggests this is the second iteration of the Devstral Small line, presumably with improvements over v1 in code quality and tool use.

Run at Q4_K_M via Ollama (ollama pull devstral:24b; confirm that the tag matches this release) or llama.cpp with -ngl 999 -fa -c 16384. The Q4_K_M file is ~14 GB on disk. The architecture is Mistral-derived, so standard inference stacks are compatible.

Minimum VRAM: 12 GB, for example an RTX 4070 (12GB) at Q4_K_M with part of the model and the KV cache offloaded to system RAM, at about 4K context. Recommended: RTX 4090 24GB, which runs Q4_K_M comfortably at 16K+ context. Expected throughput on an RTX 4090 at Q4_K_M is ~40-65 tok/s; this is an estimate, not an independent measurement.

Strong on: code generation, technical documentation, system design discussions. Weaker on: general chat, creative writing, non-technical tasks.

Context: the listed window is 131,072 tokens, but code prompts typically use 2-8K, which keeps the KV cache small. Verify exact model provenance and license on the Hugging Face repo. For a general-purpose Mistral model in the same size class, see Mistral Small 3.2 24B.
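Once the model is pulled, it can be called from a script through Ollama's local HTTP API. The sketch below assumes an Ollama server on its default port (11434) and reuses the tag from the pull command above; if the tag you pulled differs, change MODEL_TAG. The helper names are illustrative.

```python
import json
import urllib.request

# Assumptions: Ollama is running locally on its default port, and this tag
# (taken from the pull command above) is the one you actually pulled.
OLLAMA_URL = "http://localhost:11434/api/generate"
MODEL_TAG = "devstral:24b"


def build_generate_request(prompt: str, num_ctx: int = 16384) -> urllib.request.Request:
    """Build a non-streaming /api/generate request for the local server."""
    payload = {
        "model": MODEL_TAG,
        "prompt": prompt,
        "stream": False,                  # return one JSON object, not a stream
        "options": {"num_ctx": num_ctx},  # context window to allocate
    }
    return urllib.request.Request(
        OLLAMA_URL,
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def generate(prompt: str) -> str:
    """Send the prompt to Ollama and return the generated text."""
    request = build_generate_request(prompt)
    with urllib.request.urlopen(request, timeout=300) as response:
        return json.loads(response.read())["response"]
```

Calling generate("Write a Python function that reverses a linked list.") returns the completion as a string, provided the server is running and the model is available locally.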

Hardware guidance

Minimum: RTX 3060 12GB at Q3_K_M with KV offload. Recommended: RTX 4090 24GB at Q4_K_M (16K+ context).

VRAM math: 24B dense, Q4_K_M ≈ 14 GB. KV cache at 16K: ~5 GB. Total: ~19 GB, which fits comfortably on-GPU on an RTX 4090 24GB.

Other hardware:

  • RTX 3080 10GB: Q3_K_M with partial layer and KV offload.
  • RTX 4080 16GB: Q4 + 8K context on-GPU.
  • MacBook Pro M4 Pro 24GB+: Q4 at 15-30 tok/s.
  • Cloud: A10 24GB at Q4_K_M.

For IDE integration (FIM): the VRAM profile is similar, but FIM adds context for the prefix and suffix; an RTX 4090 handles this well. AWQ-INT4 drops the weights to ~12 GB. As a specialized fine-tune, this model may have fewer prebuilt GGUF options than Mistral's general-purpose models; check bartowski's uploads.
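The VRAM math above can be sketched as a small calculator. It takes the ~14 GB weight size and the ~5 GB KV cache at 16K straight from the text and assumes the KV cache grows linearly with context length, which holds for standard attention but ignores cache quantization and framework overhead. Function names are illustrative.

```python
# Figures from the VRAM math above (approximate).
WEIGHTS_GB = 14.0        # Q4_K_M weights
KV_GB_AT_16K = 5.0       # KV cache at a 16K context
REFERENCE_CTX = 16_384


def kv_cache_gb(context_tokens: int) -> float:
    """Scale the 16K KV-cache estimate linearly to another context length."""
    return KV_GB_AT_16K * context_tokens / REFERENCE_CTX


def total_vram_gb(context_tokens: int) -> float:
    """Weights plus KV cache for the given context length."""
    return WEIGHTS_GB + kv_cache_gb(context_tokens)


def fits_on_gpu(context_tokens: int, gpu_vram_gb: float) -> bool:
    """True if weights and KV cache fit entirely in GPU memory."""
    return total_vram_gb(context_tokens) <= gpu_vram_gb


print(total_vram_gb(16_384))    # the 16K case from the text
print(fits_on_gpu(16_384, 24))  # RTX 4090 24GB
```

Because the inputs are rounded estimates, results near a card's limit should be checked against real memory usage rather than trusted outright.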

What breaks first

  1. Model provenance. This page lists the original weights under the mistralai organization on Hugging Face. Confirm the training source, data, and license on the repo you actually download from before production use.
  2. FIM support. If Devstral supports fill-in-the-middle, standard chat interfaces won't expose it. Use llama.cpp FIM server + IDE plugin.
  3. Code quality vs general Mistral. Devstral's coding specialization may have tradeoffs: worse general chat quality, possible catastrophic forgetting of non-coding knowledge.
  4. Chat template. Devstral may use a custom chat template that differs from standard Mistral. Verify on the Hugging Face repo before deploying.
  5. Surprise license. Fine-tunes and re-uploads may carry different licenses than the base model. Verify commercial use terms.
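For the FIM point above, a fill-in-the-middle request sends the code before and after the cursor instead of a chat message. The sketch below targets the /infill endpoint of llama.cpp's HTTP server and assumes the server runs on its default port (8080) with a model that actually supports FIM, which this page has not verified for Devstral. The helper names are illustrative.

```python
import json
import urllib.request

# Assumption: llama.cpp's server is running locally on its default port
# with a FIM-capable model loaded.
INFILL_URL = "http://localhost:8080/infill"


def build_infill_payload(prefix: str, suffix: str, n_predict: int = 64) -> dict:
    """Build the JSON body: code before the cursor, code after, token budget."""
    return {
        "input_prefix": prefix,
        "input_suffix": suffix,
        "n_predict": n_predict,
    }


def infill(prefix: str, suffix: str) -> str:
    """Ask the server to fill the gap between prefix and suffix."""
    request = urllib.request.Request(
        INFILL_URL,
        data=json.dumps(build_infill_payload(prefix, suffix)).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )
    with urllib.request.urlopen(request, timeout=120) as response:
        return json.loads(response.read())["content"]
```

An IDE plugin does the same thing on every keystroke pause, with the prefix and suffix taken from the open file. If the model lacks FIM tokens, the request may fail or return poor completions, so test it before wiring it into an editor.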

Runtime recommendation

Use llama.cpp with FIM support for IDE integration, with Continue.dev as the IDE frontend. Use Ollama for quick chat-based code help and vLLM for serving. The model uses standard Mistral inference, so support is broad. For a general-purpose Mistral model in the same size class, see Mistral Small 3.2 24B.

Common beginner mistakes

  • Mistake: Assuming every Devstral download is the official release. Fix: The original weights are listed under the mistralai organization on Hugging Face (see the source link on this page). Third-party quantizations and re-uploads may differ, so verify provenance and license on the repo you pull from.
  • Mistake: Using Devstral for general chat and wondering why quality is low. Fix: It's coding-optimized. General conversational ability may be weaker than in general-purpose 24B models.
  • Mistake: Expecting FIM to work out of the box. Fix: FIM requires specific inference stack setup (FIM server + IDE plugin). Standard chat interfaces don't expose it.
  • Mistake: Trusting generated code without review. Fix: As with all code models, generated code may have bugs, security issues, or hallucinated APIs. Always review and test.

Strengths

  • Apache 2.0
  • Coding-specialized Mistral lineage

Weaknesses

  • Community smaller than Qwen Coder

Quantization variants

Each quantization trades model quality for file size and VRAM. Q4_K_M is the most popular starting point.

Quantization | File size | VRAM required
Q4_K_M | 14.0 GB | 18 GB

Get the model

HuggingFace

Original weights

huggingface.co/mistralai/Devstral-Small-2-24B

Source repository — direct quantization required.

Frequently asked

What's the minimum VRAM to run Devstral Small 2 24B?

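About 18 GB of VRAM is enough to run Devstral Small 2 24B at the Q4_K_M quantization (file size 14.0 GB) fully on-GPU. Cards with 12 GB can run a smaller quantization or offload part of the model to system RAM, at reduced speed. Higher-quality quantizations need more.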

Can I use Devstral Small 2 24B commercially?

Yes. Devstral Small 2 24B ships under the Apache 2.0 license, which permits commercial use. Always read the license text before deployment.

What's the context length of Devstral Small 2 24B?

Devstral Small 2 24B supports a context window of 131,072 tokens (about 131K).

Source: huggingface.co/mistralai/Devstral-Small-2-24B

Reviewed by RunLocalAI Editorial. See our editorial policy for how we research and verify model claims.
