Jamba 1.5 Mini
AI21's hybrid Mamba-Transformer MoE. 256k context with the SSM throughput advantage.
Overview
Jamba 1.5 Mini is AI21's hybrid Mamba-Transformer mixture-of-experts model: 52B total parameters with ~12B active per token, a 256K context window, and the throughput and KV-cache advantages of an SSM backbone.
How to run it
Jamba 1.5 Mini is AI21's smaller SSM-hybrid model (52B total parameters, ~12B active via MoE). The SSM backbone enables efficient long-context handling. Run it at Q4_K_M via Ollama (ollama pull jamba:1.5-mini, if the catalog carries it; see What breaks first) or llama.cpp with -ngl 999 -fa -c 16384. The Q4_K_M file is roughly 25-30 GB on disk. Minimum VRAM at Q4_K_M is 16 GB: an RTX 4080 (16 GB) works with KV offload at 8K context, and an RTX 4090 (24 GB) handles 16K comfortably. Recommended: a single RTX 4090 24GB at Q4_K_M, for roughly 30-50 tok/s.

The SSM architecture keeps KV-cache growth low, so 32K+ context is practical on 24 GB, and the ~12B active subset keeps generation efficient. SSM layers decode sequentially, so peak tok/s is slightly lower than pure attention at the same active size; context efficiency is the tradeoff. Jamba 1.5 Mini is the most accessible SSM-hybrid model and is consumer-GPU friendly. For larger SSM models, see Jamba 1.5 Large.
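A minimal launch sketch under those settings. It assumes a llama.cpp build with Jamba support and an already-downloaded GGUF; the jamba:1.5-mini tag and the file name are illustrative, not verified catalog entries.

```sh
# Ollama path: only works if the catalog actually carries a Jamba tag.
ollama pull jamba:1.5-mini

# llama.cpp path: all layers on GPU (-ngl 999), flash attention (-fa),
# 16K context to start (-c 16384); scale up after checking VRAM.
./llama-server -m jamba-1.5-mini-Q4_K_M.gguf -ngl 999 -fa -c 16384
```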
Hardware guidance
- Minimum: RTX 3060 12GB at Q3_K_M with KV offload. Recommended: RTX 4090 24GB at Q4_K_M.
- VRAM math: ~52B total, ~12B active. Q4_K_M ≈ 25-30 GB for the full weights; expert offload cuts the GPU-resident portion to ~8-12 GB (active experts only).
- SSM KV cache: ~2-5 GB at 32K context, significantly less than attention models. Total at Q4 with offload: ~15-20 GB for 32K context, comfortable on 24 GB cards.
- RTX 3090 24GB: Q4_K_M with expert offload at 32K context. RTX 4080 16GB: Q4_K_M with expert offload at 8-16K.
- MacBook Pro M4 Pro 24GB+: Q4_K_M at 8-12 tok/s. Cloud: A10 24GB at Q4_K_M.
- SSM kernels require CUDA 11.8+ and SM 7.5+ (Turing or newer); Pascal GPUs are not supported. A quick preflight check follows.
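A quick preflight sketch for those requirements; the compute_cap query field needs a reasonably recent NVIDIA driver.

```sh
# SM 7.5+ (Turing or newer) is required for the Mamba/SSM kernels.
nvidia-smi --query-gpu=name,compute_cap --format=csv

# Total and free VRAM, to sanity-check the ~15-20 GB budget at Q4 + 32K.
nvidia-smi --query-gpu=memory.total,memory.free --format=csv
```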
What breaks first
1. SSM kernel on older GPUs. Mamba kernels require Turing (SM 7.5) or newer. GTX 10-series and older won't run Jamba. Check your CUDA compute capability (see the preflight check above).
2. Ollama Jamba support. Jamba's SSM-hybrid architecture may not be in Ollama's default catalog. Verify with ollama list, or use raw llama.cpp.
3. Per-token speed ceiling. SSM decode is sequential, so tok/s is lower than attention at the same active parameter count. Jamba 1.5 Mini trades peak speed for context efficiency.
4. Expert offload latency. When experts live in system RAM, routing to a RAM-resident expert causes 30-80 ms stalls. On systems with slow DDR4 the stall is noticeable; fast DDR5 minimizes the penalty. A hedged offload sketch follows this list.
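For item 4, one way to control which tensors stay in system RAM under llama.cpp is the --override-tensor flag. The regex below follows the common MoE tensor-naming pattern (ffn_*_exps) and is an assumption; confirm it against the tensor names in your Jamba GGUF.

```sh
# Pin the MoE expert weights in system RAM and keep everything else on GPU.
# Expect 30-80 ms stalls whenever a RAM-resident expert is routed to.
./llama-server -m jamba-1.5-mini-Q4_K_M.gguf -ngl 999 -fa -c 32768 \
  --override-tensor "ffn_.*_exps.*=CPU"
```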
Runtime recommendation
vLLM has the most mature Jamba support (runtime support outside vLLM is limited; see Weaknesses). For local GGUF inference, llama.cpp is the fallback; Ollama availability is uncertain.
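A minimal vLLM serving sketch, assuming a vLLM build with Jamba support; the experts_int8 quantization choice is vLLM's MoE-oriented scheme and is an assumption here, not something this guide benchmarked.

```sh
# Serve the original HF weights; cap context at 32K to keep memory predictable,
# then raise --max-model-len after measuring.
vllm serve ai21labs/AI21-Jamba-1.5-Mini \
  --max-model-len 32768 \
  --quantization experts_int8
```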
Common beginner mistakes
- Mistake: Running Jamba on GTX 1080-class GPUs. Fix: SSM kernels require Turing+ (SM 7.5); Pascal GPUs will crash or produce undefined behavior.
- Mistake: Expecting 100+ tok/s because active params are only 12B. Fix: SSM decode is sequential. 30-50 tok/s at Q4 on an RTX 4090 is realistic, not 100+.
- Mistake: Setting 256K context and expecting it to work on 24 GB. Fix: SSM is efficient, but 256K is extreme. Start at 32K, benchmark VRAM, scale up (see the sketch after this list).
- Mistake: Using Q8 because "the file size is small." Fix: Q8 is ~50 GB, twice Q4_K_M. Stick to Q4_K_M on consumer hardware; Q8 gains are marginal on SSM models.
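One hedged way to do the "start at 32K and benchmark" step is llama.cpp's bundled llama-bench tool (the GGUF file name is a placeholder):

```sh
# Compare prompt-processing speed at growing prompt sizes (-p takes a list)
# and generation speed (-n) on the same quant.
./llama-bench -m jamba-1.5-mini-Q4_K_M.gguf -p 512,8192,32768 -n 128

# In a second terminal, watch VRAM while a 32K-context server runs.
nvidia-smi --query-gpu=memory.used --format=csv -l 1
```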
Family & lineage
How this model relates to others in its lineage. Family members share architecture and training-data roots; parent / children edges record direct distillation or fine-tune relationships.
Strengths
- 256k context
- Hybrid SSM-Transformer
- Long-context throughput
Weaknesses
- Limited runtime support outside vLLM
Quantization variants
Each quantization trades model quality for file size and VRAM. Q4_K_M is the most popular starting point.
| Quantization | File size | VRAM required (all weights on GPU) |
|---|---|---|
| Q4_K_M | 30.0 GB | 36 GB |
Get the model
HuggingFace
Original weights
Source repository with the original weights; you quantize them yourself (conversion sketch below).
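A sketch of the usual llama.cpp conversion path, assuming your llama.cpp checkout supports the Jamba architecture; directory and file names are placeholders.

```sh
# 1) Fetch the original weights from HuggingFace.
huggingface-cli download ai21labs/AI21-Jamba-1.5-Mini --local-dir ./jamba-mini

# 2) Convert the HF checkpoint to a full-precision GGUF.
python convert_hf_to_gguf.py ./jamba-mini --outtype f16 --outfile jamba-mini-f16.gguf

# 3) Quantize to Q4_K_M (~25-30 GB on disk, per the table above).
./llama-quantize jamba-mini-f16.gguf jamba-1.5-mini-Q4_K_M.gguf Q4_K_M
```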
Hardware that runs this
Cards with enough VRAM for at least one quantization of Jamba 1.5 Mini.
Frequently asked
What's the minimum VRAM to run Jamba 1.5 Mini?
12 GB is the practical floor (RTX 3060 at Q3_K_M with KV and expert offload); 16 GB is the minimum for Q4_K_M, and 24 GB is the comfortable recommendation.
Can I use Jamba 1.5 Mini commercially?
AI21 releases the Jamba 1.5 models under the Jamba Open Model License, which permits commercial use subject to its terms; confirm against the license text in the HuggingFace repo.
What's the context length of Jamba 1.5 Mini?
256K tokens. Locally, start around 32K and scale up as VRAM allows.
Source: huggingface.co/ai21labs/AI21-Jamba-1.5-Mini
Reviewed by RunLocalAI Editorial. See our editorial policy for how we research and verify model claims.
Related — keep moving
Verify Jamba 1.5 Mini runs on your specific hardware before committing money.