Devstral Small 2 24B

Mistral's coding-specialized Mistral Small 2 successor. Apache 2.0 — the rare commercial-OK Mistral coder.

License: Apache 2.0 · Released Sep 25, 2025 · Context: 131,072 tokens

Our verdict

OP · Fredoline Eruo | VERIFIED JUN 12, 2026

unrated

Positioning

Devstral Small 2 24B is a dense 24-billion-parameter model from Mistral AI with a 131K-token context window, positioned as a coding-specialized successor to Mistral Small 2. It ships under the permissive Apache 2.0 license, which makes it a rare commercial-OK coding model from Mistral and an open-weight option for developers who need unrestricted deployment.

Strengths

Apache 2.0 license for commercial coding: Unlike many Mistral models, this one is fully open for commercial use, making it suitable for proprietary codebases and enterprise deployment.
Large 131K context window: Supports long code files, multi-file projects, or extensive documentation in a single prompt.
Dense architecture at 24B params: Inference cost scales predictably with parameter count, without the overhead of an MoE router.
Consumer-grade deployment possible: At Q4_K_M (13.5 GB) or Q3_K_M (11.7 GB), the model fits on a single 16-24 GB GPU, with room for KV cache.

Limitations

No community benchmarks available: We do not have independent measurements of coding accuracy, instruction following, or speed. Vendor claims should be treated as best-case.
24B dense requires significant VRAM: FP16 (~48 GB) is impractical for consumer hardware; quantized versions are necessary, which may affect output quality.
KV cache overhead at long context: KV cache grows linearly with context length, so the 30-50% overhead typical at moderate contexts climbs much higher near the full 131K window, potentially pushing beyond single-GPU limits.
Not a frontier model: As a 24B dense model, it is not designed to compete with larger frontier models; its value is in permissive licensing and local deployability.

What it takes to run this locally

Quantized sizes (disk): Q8_0 ~26 GB, Q6_K ~19.8 GB, Q5_K_M ~17.1 GB, Q4_K_M ~13.5 GB, Q3_K_M ~11.7 GB, Q2_K ~7.8 GB. Add ~30-50% for KV cache and framework overhead. A single consumer GPU with 16-24 GB VRAM can run Q4_K_M or Q3_K_M; on a 16 GB card, Q4_K_M leaves little headroom, so keep context short. Full FP16 precision needs ~48 GB for the weights alone, so a 48 GB workstation GPU has no room left for KV cache; plan on a larger card or a multi-GPU setup.
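The sizing rule above can be sketched as a quick fit check. This is a rough illustration, not a measurement: the file sizes are the figures quoted in this section, the 30-50% overhead range is applied as a flat multiplier, and the function names are hypothetical.

```python
# Quantized file sizes in GB, as quoted in this section.
QUANT_SIZES_GB = {
    "Q8_0": 26.0,
    "Q6_K": 19.8,
    "Q5_K_M": 17.1,
    "Q4_K_M": 13.5,
    "Q3_K_M": 11.7,
    "Q2_K": 7.8,
}


def estimated_footprint_gb(quant: str, overhead: float = 0.3) -> float:
    """File size plus a flat overhead fraction (0.3 to 0.5 per the text)."""
    return QUANT_SIZES_GB[quant] * (1.0 + overhead)


def quants_that_fit(vram_gb: float, overhead: float = 0.3) -> list[str]:
    """Quantizations whose estimated footprint fits in vram_gb, largest first."""
    by_size = sorted(QUANT_SIZES_GB.items(), key=lambda item: -item[1])
    return [
        name
        for name, size in by_size
        if size * (1.0 + overhead) <= vram_gb
    ]
```

With the lower 30% overhead bound, quants_that_fit(24.0) keeps Q5_K_M and everything smaller, while quants_that_fit(16.0) keeps only Q3_K_M and Q2_K. Q4_K_M on a 16 GB card therefore depends on running with less overhead than the rule assumes, which in practice means a short context.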

Should you run this locally?

Yes if you need a permissively licensed coding model for commercial use and have a consumer GPU with at least 16 GB VRAM. The Apache 2.0 license removes deployment friction.

No if you require frontier-level coding performance or your hardware forces you below Q4_K_M, where quality loss becomes more likely. Also not ideal if you need a general-purpose model rather than a coding specialist.

Catalog cross-links

Mistral Small 2
Qwen 2.5 Coder
Apache 2.0 models

How to run it

Devstral Small 2 24B is a developer-oriented fine-tune of a Mistral 24B base model, optimized for coding tasks: code generation, debugging, refactoring, and developer-tool integration. Fill-in-the-middle (FIM) is likely supported, but verify before relying on it. The "Small 2" versioning suggests the second iteration of the Devstral Small line, with improvements over v1 in code quality and tool use.

Run at Q4_K_M via Ollama (ollama pull devstral:24b; confirm the tag for this release) or llama.cpp with -ngl 999 -fa -c 16384. The Q4_K_M file is ~14 GB on disk. The architecture is Mistral-derived, so standard inference stacks are compatible.

Minimum VRAM: 12 GB. The ~14 GB Q4_K_M file does not fit entirely on an RTX 4070 (12 GB), so expect partial CPU offload of layers and KV cache at about 4K context.
Recommended: RTX 4090 24 GB at Q4_K_M, which comfortably handles 16K+ context.
Throughput (estimate, not independently measured): ~40-65 tok/s on RTX 4090 at Q4_K_M.

Strong on: code generation, technical documentation, system design discussions.
Weaker on: general chat, creative writing, non-technical tasks.

Context: 131,072 tokens per the model listing; typical code contexts of 2-8K tokens are far cheaper to serve. Verify exact model provenance and license on the Hugging Face repo. For a general-purpose Mistral model of the same size, see Mistral Small 3.2 24B.
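For scripted use, a running Ollama server can be called over its local HTTP API. The sketch below is a minimal illustration under stated assumptions: the model tag is the one quoted above and may differ for this release, the server is assumed to listen on Ollama's default port 11434, and the helper names are hypothetical.

```python
import json
import urllib.request

OLLAMA_URL = "http://localhost:11434/api/generate"  # Ollama's default local endpoint
MODEL_TAG = "devstral:24b"  # tag as quoted in the text; verify it for this release


def build_request(prompt: str, num_ctx: int = 16384) -> dict:
    """Build a non-streaming generate request with an explicit context size."""
    return {
        "model": MODEL_TAG,
        "prompt": prompt,
        "stream": False,
        "options": {"num_ctx": num_ctx},
    }


def generate(prompt: str, num_ctx: int = 16384) -> str:
    """Send the request to a locally running Ollama server and return the text."""
    body = json.dumps(build_request(prompt, num_ctx)).encode("utf-8")
    req = urllib.request.Request(
        OLLAMA_URL, data=body, headers={"Content-Type": "application/json"}
    )
    with urllib.request.urlopen(req) as resp:
        return json.loads(resp.read())["response"]
```

Calling generate("Write a function that reverses a linked list.") returns the completion as a single string once the server has the model pulled. Setting num_ctx explicitly keeps the context size, and with it the KV cache, under your control rather than relying on the runtime default.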

Hardware guidance

Minimum: RTX 3060 12GB at Q3_K_M with KV offload.
Recommended: RTX 4090 24GB at Q4_K_M (16K+ context).

VRAM math: 24B dense, Q4_K_M ≈ 14 GB. KV cache at 16K: ~5 GB. Total: ~19 GB.

RTX 4090 24GB: comfortable on-GPU.
RTX 4080 16GB: Q4 + 8K context. By the math above this slightly exceeds 16 GB (14 GB of weights plus about 2.5 GB of KV cache), so expect to trim context or offload some KV cache.
RTX 3080 10GB: Q3_K_M with partial layer offload to CPU in addition to KV offload, since the 11.7 GB file exceeds VRAM.
MacBook Pro M4 Pro 24GB+: Q4 at 15-30 tok/s.
Cloud: A10 24GB at Q4_K_M.

For IDE integration (FIM): similar VRAM profile, but FIM adds context for prefix/suffix. RTX 4090 handles FIM well. AWQ-INT4 drops to ~12 GB. As a specialized release, this may have fewer GGUF options than Mistral's general-purpose models; check bartowski's uploads.
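The VRAM math above can be extended to other context lengths. This is a rough sketch, assuming the KV cache scales linearly from the ~5 GB at 16K figure quoted in this section and ignoring framework overhead and KV quantization; the constants and function names are illustrative, not measured.

```python
WEIGHTS_GB = 14.0          # Q4_K_M weights, per the text
KV_GB_AT_REFERENCE = 5.0   # KV cache at the reference context, per the text
REFERENCE_CONTEXT = 16384  # "16K" taken as 16,384 tokens (assumption)


def kv_cache_gb(context_tokens: int) -> float:
    """Scale the quoted KV figure linearly with context length."""
    return KV_GB_AT_REFERENCE * context_tokens / REFERENCE_CONTEXT


def total_vram_gb(context_tokens: int) -> float:
    """Weights plus estimated KV cache, with no framework overhead."""
    return WEIGHTS_GB + kv_cache_gb(context_tokens)


def fits(vram_gb: float, context_tokens: int) -> bool:
    """True if the estimated total stays within the given VRAM budget."""
    return total_vram_gb(context_tokens) <= vram_gb
```

Under these assumptions, total_vram_gb(16384) reproduces the 19 GB total, and doubling to 32K already reaches the full 24 GB of an RTX 4090 before any framework overhead. Treat the numbers as a planning aid and measure on your own stack.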

What breaks first

1. Model provenance. Confirm you are pulling the official Mistral AI repo rather than a third-party re-upload or derivative, and verify the training source, data, and license before production use.
2. FIM support. If Devstral supports fill-in-the-middle, standard chat interfaces won't expose it. Use llama.cpp FIM server + IDE plugin.
3. Code quality vs general Mistral. Devstral's coding specialization may have tradeoffs: worse general chat quality, possible catastrophic forgetting of non-coding knowledge.
4. Chat template. Devstral may use a custom chat template that differs from standard Mistral. Verify on the Hugging Face repo before deploying.
5. Surprise license. Re-uploads and derivative fine-tunes may carry different licenses than the Apache 2.0 listed here. Verify commercial use terms.

Runtime recommendation

llama.cpp with FIM support for IDE integration, with Continue.dev as the IDE frontend. Ollama for quick chat-based code help. vLLM for serving. The model uses standard Mistral inference, so runtime support is broad. For a general-purpose Mistral model, see Mistral Small 3.2 24B.
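If the model does turn out to support fill-in-the-middle, the llama.cpp server exposes it through its /infill endpoint rather than the chat route. This is a minimal sketch, assuming a llama-server instance is already running on localhost port 8080 with a FIM-capable GGUF loaded. The helper names and the address are assumptions, and FIM support for this model is unverified.

```python
import json
import urllib.request

INFILL_URL = "http://localhost:8080/infill"  # assumed local llama-server address


def build_infill_request(prefix: str, suffix: str, max_tokens: int = 64) -> dict:
    """Build an infill request: the model generates text between prefix and suffix."""
    return {
        "input_prefix": prefix,
        "input_suffix": suffix,
        "n_predict": max_tokens,
        "stream": False,
    }


def infill(prefix: str, suffix: str, max_tokens: int = 64) -> str:
    """POST the request to the server and return the generated middle section."""
    body = json.dumps(build_infill_request(prefix, suffix, max_tokens)).encode("utf-8")
    req = urllib.request.Request(
        INFILL_URL, data=body, headers={"Content-Type": "application/json"}
    )
    with urllib.request.urlopen(req) as resp:
        # The generated text is expected in the "content" field; check your
        # llama.cpp version's server docs if the response shape differs.
        return json.loads(resp.read())["content"]
```

An IDE plugin such as Continue.dev sends the text before and after the cursor in much the same way. If the loaded model lacks FIM tokens, the server is likely to reject the request or return poor completions, which makes this a quick way to test FIM support.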

Common beginner mistakes

Mistake: Assuming every repo named Devstral is the official Mistral AI release. Fix: Third-party re-uploads and derivatives may differ in weights, chat template, or license. Verify provenance and license on Hugging Face.
Mistake: Using Devstral for general chat and wondering why quality is low. Fix: It's coding-optimized. General conversational ability is likely degraded vs general-purpose 24B models.
Mistake: Expecting FIM to work out of the box. Fix: FIM requires specific inference stack setup (FIM server + IDE plugin). Standard chat interfaces don't expose it.
Mistake: Trusting generated code without review. Fix: As with all code models, generated code may have bugs, security issues, or hallucinated APIs. Always review and test.
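One cheap first gate before human review is a parse check on the model's output. This is a minimal sketch, assuming the generated code is Python; the function name is hypothetical, and a passing result says nothing about logic bugs, security issues, or hallucinated APIs.

```python
import ast


def passes_syntax_check(source: str) -> bool:
    """Return True if source parses as Python; this does not prove correctness."""
    try:
        ast.parse(source)
    except SyntaxError:
        return False
    return True
```

A check like this only filters out output that cannot run at all. Tests and a human read are still needed for everything that parses.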

Family & lineage

How this model relates to others in its lineage. Family members share architecture and training-data roots; parent / children edges record direct distillation or fine-tune relationships.

Parent / base model

Mistral Small 3 24B

Consumer