other

32B parameters

Commercial OK

Reviewed May 2026

llm-jp 4 32B A3B Thinking

A 32B MoE model from Japan's National Institute of Informatics, with only 3B parameters active per forward pass. Trained on 11.7T tokens across four stages: pre-training, mid-training, SFT, and DPO. Targets Japanese and English conversational tasks with a 65K context window.

License: apache-2.0·Context: 65,536 tokens

BLK · VERDICT

Our verdict

OP · Fredoline Eruo|VERIFIED MAY 28, 2026

9.1/10

If you need a Japanese-capable MoE that runs leaner than a dense 32B, this is a legitimate option from a credible academic source. The 11.7T token training and multi-stage alignment give it a serious foundation. That said, the vendor explicitly flags incomplete safety tuning, so keep it out of customer-facing or sensitive workflows for now. Hedge — worth evaluating for internal Japanese NLP tasks, but not a drop-in production pick yet.

›Why this rating

Auto-generated rating (Opus 4.7 judge, claude-opus-4-7). Overall 9.05/10. License is explicitly apache-2.0 on the HF card, commercial-OK flag is correct. Metadata is accurate: 32B total / ~3B active MoE, 65,536 context, llm-jp vendor, Japanese/English all verified from the card. Description is honest and operator-voiced, correctly flags the Harmony-template-but-different-tokenizer trap and incomplete safety alignment. Use case is reasonably specific (Japanese-English long-context) though could be sharper. Family 'other' is acceptable since the architecture is qwen3_moe but the model is llm-jp's own; family=qwen could also be defended. Clears the 9.0 bar narrowly.

Flags: - family='other' is defensible but qwen3_moe tag suggests 'qwen' family could also apply — minor consistency question - 11.7T token training claim is not visible in the included README excerpt; verify it appears elsewhere on the card before publishing

Overview

Strengths

MoE efficiency: 32B total params, 3B active — lower inference cost than a dense 32B
Extensive training: 11.7T tokens across pre-training, mid-training, SFT, and DPO
65,536-token context window supports long documents and multi-turn sessions
Benchmarked on Japanese-specific evals (MT-Bench, AnswerCarefully)

Weaknesses

Safety alignment is described as incomplete — not production-ready for sensitive use cases
Custom tokenizer and chat template; expect friction with OpenAI-compatible tooling
Low download count (2,262) means limited community testing and real-world feedback
Full 32B weight footprint still loads into VRAM even though only 3B params are active per pass

Quantization variants

Each quantization trades model quality for file size and VRAM. Q4_K_M is the most popular starting point.

Quantization	File size	VRAM required
Q4_K_M	17.6 GB	23 GB

Get the model

HuggingFace

Original weights

huggingface.co/llm-jp/llm-jp-4-32b-a3b-thinking

Source repository — direct quantization required.

Hardware that runs this

Cards with enough VRAM for at least one quantization of llm-jp 4 32B A3B Thinking.

NVIDIA B300 (Blackwell Ultra)

Compare alternatives

Models worth comparing

Same parameter band, plus what's one tier above and below — so you can decide what actually fits your hardware.

Same tier

Models in the same parameter band as this one

Step up

More capable — bigger memory footprint

Step down

Smaller — faster, runs on weaker hardware

Frequently asked

What's the minimum VRAM to run llm-jp 4 32B A3B Thinking?

23GB of VRAM is enough to run llm-jp 4 32B A3B Thinking at the Q4_K_M quantization (file size 17.6 GB). Higher-quality quantizations need more.

Can I use llm-jp 4 32B A3B Thinking commercially?

Yes — llm-jp 4 32B A3B Thinking ships under the apache-2.0, which permits commercial use. Always read the license text before deployment.

What's the context length of llm-jp 4 32B A3B Thinking?

llm-jp 4 32B A3B Thinking supports a context window of 65,536 tokens (about 66K).

Source: huggingface.co/llm-jp/llm-jp-4-32b-a3b-thinking

Reviewed by RunLocalAI Editorial. See our editorial policy for how we research and verify model claims.

Related — keep moving

Compare hardware

Buyer guides

When it doesn't work

Recommended hardware

Before you buy

Verify llm-jp 4 32B A3B Thinking runs on your specific hardware before committing money.

Will it run on my hardware? →Custom hardware comparison →GPU recommender (4 questions) →