llm-jp 4 32B A3B Thinking
A 32B MoE model from Japan's National Institute of Informatics, with only 3B parameters active per forward pass. Trained on 11.7T tokens across four stages: pre-training, mid-training, SFT, and DPO. Targets Japanese and English conversational tasks with a 65K context window.
If you need a Japanese-capable MoE that runs leaner than a dense 32B, this is a legitimate option from a credible academic source. The 11.7T-token training run and multi-stage alignment give it a serious foundation. That said, the vendor explicitly flags incomplete safety tuning, so keep it out of customer-facing or sensitive workflows for now. Verdict: hedge. It is worth evaluating for internal Japanese NLP tasks, but it is not a drop-in production pick yet.
Why this rating
Auto-generated rating (Opus 4.7 judge, claude-opus-4-7). Overall 9.05/10. License is explicitly apache-2.0 on the HF card, commercial-OK flag is correct. Metadata is accurate: 32B total / ~3B active MoE, 65,536 context, llm-jp vendor, Japanese/English all verified from the card. Description is honest and operator-voiced, correctly flags the Harmony-template-but-different-tokenizer trap and incomplete safety alignment. Use case is reasonably specific (Japanese-English long-context) though could be sharper. Family 'other' is acceptable since the architecture is qwen3_moe but the model is llm-jp's own; family=qwen could also be defended. Clears the 9.0 bar narrowly.
Flags:
- family='other' is defensible, but the qwen3_moe tag suggests the 'qwen' family could also apply (a minor consistency question)
- The 11.7T-token training claim is not visible in the included README excerpt; verify it appears elsewhere on the card before publishing
Strengths
- MoE efficiency: 32B total params, 3B active — lower inference cost than a dense 32B
- Extensive training: 11.7T tokens across pre-training, mid-training, SFT, and DPO
- 65,536-token context window supports long documents and multi-turn sessions
- Benchmarked on Japanese-specific evals (MT-Bench, AnswerCarefully)
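To put the MoE efficiency point in rough numbers, the sketch below compares per-token forward compute for a dense 32B model and for 3B active parameters. It relies on the common rule of thumb of about 2 FLOPs per active parameter per token, which is an assumption here rather than a figure from the model card; attention cost, routing overhead, and memory bandwidth are ignored, and the function name is illustrative.

```python
# Back-of-envelope compute comparison: dense 32B vs. 3B active (MoE).
# Assumption: a forward pass costs ~2 FLOPs per active parameter per token.

TOTAL_PARAMS = 32e9   # total parameters (from the listing)
ACTIVE_PARAMS = 3e9   # parameters active per forward pass (from the listing)


def flops_per_token(active_params: float) -> float:
    """Approximate forward-pass FLOPs for one token."""
    return 2.0 * active_params


dense = flops_per_token(TOTAL_PARAMS)
moe = flops_per_token(ACTIVE_PARAMS)
print(f"dense 32B : {dense / 1e9:.0f} GFLOPs/token")
print(f"MoE active: {moe / 1e9:.0f} GFLOPs/token")
print(f"ratio     : {dense / moe:.1f}x")
```

Under that approximation the compute gap is roughly an order of magnitude. Real throughput gains are usually smaller, because decoding is often limited by memory bandwidth rather than arithmetic.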
Weaknesses
- Safety alignment is described as incomplete — not production-ready for sensitive use cases
- Custom tokenizer and chat template; expect friction with OpenAI-compatible tooling
- Low download count (2,262) means limited community testing and real-world feedback
- Full 32B weight footprint still loads into VRAM even though only 3B params are active per pass
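The last point can be checked with simple arithmetic. The sketch below estimates weight memory from the total parameter count, since every expert must be resident even if only some are used per token. It treats 32B as an exact count and uses decimal gigabytes; both are simplifying assumptions, and the helper names are illustrative.

```python
# Rough weight-memory estimate. Memory scales with TOTAL parameters,
# not with the ~3B that are active on any single forward pass.

TOTAL_PARAMS = 32e9  # from the listing (rounded, so results are approximate)


def weight_gb(params: float, bits_per_weight: float) -> float:
    """Size of the weights alone, in decimal gigabytes (1 GB = 1e9 bytes)."""
    return params * bits_per_weight / 8 / 1e9


def implied_bits_per_weight(file_gb: float, params: float) -> float:
    """Average bits per weight implied by a given file size."""
    return file_gb * 1e9 * 8 / params


print(f"16-bit weights: {weight_gb(TOTAL_PARAMS, 16):.1f} GB")
# 17.6 GB is the Q4_K_M file size listed on this page.
print(f"Q4_K_M implied: {implied_bits_per_weight(17.6, TOTAL_PARAMS):.2f} bits/weight")
```

The 16-bit figure covers weights only; the KV cache and runtime buffers come on top of it.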
Quantization variants
Each quantization trades model quality for file size and VRAM. Q4_K_M is the most popular starting point.
| Quantization | File size | VRAM required |
|---|---|---|
| Q4_K_M | 17.6 GB | 23 GB |
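A minimal sketch of how the table can be used to check a card: it compares a GPU's VRAM against the "VRAM required" column. The dictionary layout, function name, and example card sizes are hypothetical; the numbers are copied from the table.

```python
# Check which listed quantizations fit a given GPU, using the table's
# "VRAM required" column as the threshold.

QUANTS = {
    # name: (file size in GB, VRAM required in GB), copied from the table
    "Q4_K_M": (17.6, 23.0),
}


def fitting_quants(gpu_vram_gb: float) -> list[str]:
    """Return the quantization names whose VRAM requirement fits the card."""
    return [name for name, (_, vram) in QUANTS.items() if vram <= gpu_vram_gb]


for card_gb in (16, 24, 48):  # hypothetical card sizes
    print(card_gb, "GB ->", fitting_quants(card_gb) or "nothing fits")
```

The listed VRAM requirement sits about 5.4 GB above the file size. That margin presumably covers the KV cache and runtime buffers, so long contexts may need more than the table suggests.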
Get the model
HuggingFace
Original weights
Source repository. These are the original, unquantized weights, so you need to quantize them yourself before running on smaller cards.
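One way to fetch the original weights is the `huggingface_hub` Python package. The sketch below assumes that package is installed and that the repository ID matches the source link on this page; the function name and target directory are hypothetical, and the download is large.

```python
# Sketch: fetch the original (unquantized) weights from the Hugging Face Hub.
# Requires `pip install huggingface_hub`. Nothing is downloaded until
# fetch_weights() is actually called.

REPO_ID = "llm-jp/llm-jp-4-32b-a3b-thinking"  # taken from the page's Source link


def fetch_weights(local_dir: str) -> str:
    """Download a snapshot of the repository and return its local path."""
    from huggingface_hub import snapshot_download  # imported lazily on purpose

    return snapshot_download(repo_id=REPO_ID, local_dir=local_dir)


# Example call (hypothetical target directory):
#   path = fetch_weights("./llm-jp-4-32b-a3b-thinking")
```

Quantizing the downloaded weights is a separate step and depends on the inference runtime you choose.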
Frequently asked
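What's the minimum VRAM to run llm-jp 4 32B A3B Thinking?
About 23 GB, the requirement listed for Q4_K_M, the only quantization in the table above.
Can I use llm-jp 4 32B A3B Thinking commercially?
Yes. The Hugging Face card lists the license as apache-2.0, which permits commercial use.
What's the context length of llm-jp 4 32B A3B Thinking?
65,536 tokens.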
Source: huggingface.co/llm-jp/llm-jp-4-32b-a3b-thinking
Reviewed by RunLocalAI Editorial. See our editorial policy for how we research and verify model claims.