Jamba 1.5 Mini
AI21's hybrid Mamba-Transformer MoE. 256k context with the SSM throughput advantage.
Overview
Jamba 1.5 Mini is AI21's hybrid Mamba-Transformer mixture-of-experts model: 52B total parameters with ~12B active per token, a 256K context window, and the throughput and KV-cache advantages of an SSM backbone.
How to run it
Jamba 1.5 Mini is AI21's smaller SSM-hybrid model (52B total parameters, ~12B active via MoE). The SSM backbone enables efficient long-context handling. Run it at Q4_K_M via Ollama (ollama pull jamba:1.5-mini, if the catalog carries it; see What breaks first) or llama.cpp with -ngl 999 -fa -c 16384. The Q4_K_M file is roughly 25-30 GB on disk. Minimum VRAM at Q4_K_M is 16 GB: an RTX 4080 (16 GB) works with KV offload at 8K context, and an RTX 4090 (24 GB) handles 16K comfortably. Recommended: a single RTX 4090 24GB at Q4_K_M, for roughly 30-50 tok/s.

The SSM architecture keeps KV-cache growth low, so 32K+ context is practical on 24 GB, and the ~12B active subset keeps generation efficient. SSM layers decode sequentially, so peak tok/s is slightly lower than pure attention at the same active size; context efficiency is the tradeoff. Jamba 1.5 Mini is the most accessible SSM-hybrid model and is consumer-GPU friendly. For larger SSM models, see Jamba 1.5 Large.
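A minimal launch sketch under those settings. It assumes a llama.cpp build with Jamba support and an already-downloaded GGUF; the jamba:1.5-mini tag and the file name are illustrative, not verified catalog entries.

```sh
# Ollama path: only works if the catalog actually carries a Jamba tag.
ollama pull jamba:1.5-mini

# llama.cpp path: all layers on GPU (-ngl 999), flash attention (-fa),
# 16K context to start (-c 16384); scale up after checking VRAM.
./llama-server -m jamba-1.5-mini-Q4_K_M.gguf -ngl 999 -fa -c 16384
```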
Hardware guidance
- Minimum: RTX 3060 12GB at Q3_K_M with KV offload. Recommended: RTX 4090 24GB at Q4_K_M.
- VRAM math: ~52B total, ~12B active. Q4_K_M ≈ 25-30 GB for the full weights; expert offload cuts the GPU-resident portion to ~8-12 GB (active experts only).
- SSM KV cache: ~2-5 GB at 32K context, significantly less than attention models. Total at Q4 with offload: ~15-20 GB for 32K context, comfortable on 24 GB cards.
- RTX 3090 24GB: Q4_K_M with expert offload at 32K context. RTX 4080 16GB: Q4_K_M with expert offload at 8-16K.
- MacBook Pro M4 Pro 24GB+: Q4_K_M at 8-12 tok/s. Cloud: A10 24GB at Q4_K_M.
- SSM kernels require CUDA 11.8+ and SM 7.5+ (Turing or newer); Pascal GPUs are not supported. A quick preflight check follows.
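A quick preflight sketch for those requirements; the compute_cap query field needs a reasonably recent NVIDIA driver.

```sh
# SM 7.5+ (Turing or newer) is required for the Mamba/SSM kernels.
nvidia-smi --query-gpu=name,compute_cap --format=csv

# Total and free VRAM, to sanity-check the ~15-20 GB budget at Q4 + 32K.
nvidia-smi --query-gpu=memory.total,memory.free --format=csv
```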
What breaks first
1. SSM kernel on older GPUs. Mamba kernels require Turing (SM 7.5) or newer. GTX 10-series and older won't run Jamba. Check your CUDA compute capability (see the preflight check above).
2. Ollama Jamba support. Jamba's SSM-hybrid architecture may not be in Ollama's default catalog. Verify with ollama list, or use raw llama.cpp.
3. Per-token speed ceiling. SSM decode is sequential, so tok/s is lower than attention at the same active parameter count. Jamba 1.5 Mini trades peak speed for context efficiency.
4. Expert offload latency. When experts live in system RAM, routing to a RAM-resident expert causes 30-80 ms stalls. On systems with slow DDR4 the stall is noticeable; fast DDR5 minimizes the penalty. A hedged offload sketch follows this list.
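For item 4, one way to control which tensors stay in system RAM under llama.cpp is the --override-tensor flag. The regex below follows the common MoE tensor-naming pattern (ffn_*_exps) and is an assumption; confirm it against the tensor names in your Jamba GGUF.

```sh
# Pin the MoE expert weights in system RAM and keep everything else on GPU.
# Expect 30-80 ms stalls whenever a RAM-resident expert is routed to.
./llama-server -m jamba-1.5-mini-Q4_K_M.gguf -ngl 999 -fa -c 32768 \
  --override-tensor "ffn_.*_exps.*=CPU"
```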
Runtime recommendation
vLLM has the most mature Jamba support (runtime support outside vLLM is limited; see Weaknesses). For local GGUF inference, llama.cpp is the fallback; Ollama availability is uncertain.
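A minimal vLLM serving sketch, assuming a vLLM build with Jamba support; the experts_int8 quantization choice is vLLM's MoE-oriented scheme and is an assumption here, not something this guide benchmarked.

```sh
# Serve the original HF weights; cap context at 32K to keep memory predictable,
# then raise --max-model-len after measuring.
vllm serve ai21labs/AI21-Jamba-1.5-Mini \
  --max-model-len 32768 \
  --quantization experts_int8
```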
Common beginner mistakes
- Mistake: Running Jamba on GTX 1080-class GPUs. Fix: SSM kernels require Turing+ (SM 7.5); Pascal GPUs will crash or produce undefined behavior.
- Mistake: Expecting 100+ tok/s because active params are only 12B. Fix: SSM decode is sequential. 30-50 tok/s at Q4 on an RTX 4090 is realistic, not 100+.
- Mistake: Setting 256K context and expecting it to work on 24 GB. Fix: SSM is efficient, but 256K is extreme. Start at 32K, benchmark VRAM, scale up (see the sketch after this list).
- Mistake: Using Q8 because "the file size is small." Fix: Q8 is ~50 GB, twice Q4_K_M. Stick to Q4_K_M on consumer hardware; Q8 gains are marginal on SSM models.
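One hedged way to do the "start at 32K and benchmark" step is llama.cpp's bundled llama-bench tool (the GGUF file name is a placeholder):

```sh
# Compare prompt-processing speed at growing prompt sizes (-p takes a list)
# and generation speed (-n) on the same quant.
./llama-bench -m jamba-1.5-mini-Q4_K_M.gguf -p 512,8192,32768 -n 128

# In a second terminal, watch VRAM while a 32K-context server runs.
nvidia-smi --query-gpu=memory.used --format=csv -l 1
```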
Family & lineage
How this model relates to others in its lineage. Family members share architecture and training-data roots; parent / children edges record direct distillation or fine-tune relationships.
Strengths
- 256k context
- Hybrid SSM-Transformer
- Long-context throughput
Weaknesses
- Limited runtime support outside vLLM
Quantization variants
Each quantization trades model quality for file size and VRAM. Q4_K_M is the most popular starting point.
| Quantization | File size | VRAM required (all weights on GPU) |
|---|---|---|
| Q4_K_M | 30.0 GB | 36 GB |
Get the model
HuggingFace
Original weights
Source repository with the original weights; you quantize them yourself (conversion sketch below).
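A sketch of the usual llama.cpp conversion path, assuming your llama.cpp checkout supports the Jamba architecture; directory and file names are placeholders.

```sh
# 1) Fetch the original weights from HuggingFace.
huggingface-cli download ai21labs/AI21-Jamba-1.5-Mini --local-dir ./jamba-mini

# 2) Convert the HF checkpoint to a full-precision GGUF.
python convert_hf_to_gguf.py ./jamba-mini --outtype f16 --outfile jamba-mini-f16.gguf

# 3) Quantize to Q4_K_M (~25-30 GB on disk, per the table above).
./llama-quantize jamba-mini-f16.gguf jamba-1.5-mini-Q4_K_M.gguf Q4_K_M
```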
Hardware that runs this
Cards with enough VRAM for at least one quantization of Jamba 1.5 Mini.
Frequently asked
What's the minimum VRAM to run Jamba 1.5 Mini?
12 GB is the practical floor (RTX 3060 at Q3_K_M with KV and expert offload); 16 GB is the minimum for Q4_K_M, and 24 GB is the comfortable recommendation.
Can I use Jamba 1.5 Mini commercially?
AI21 releases the Jamba 1.5 models under the Jamba Open Model License, which permits commercial use subject to its terms; confirm against the license text in the HuggingFace repo.
What's the context length of Jamba 1.5 Mini?
256K tokens. Locally, start around 32K and scale up as VRAM allows.
Source: huggingface.co/ai21labs/AI21-Jamba-1.5-Mini
Reviewed by RunLocalAI Editorial. See our editorial policy for how we research and verify model claims.
Related — keep moving
Verify Jamba 1.5 Mini runs on your specific hardware before committing money.