Qwen 3.5 235B-A17B (MoE)
Alibaba's May 2026 flagship. 397B total / 17B active MoE with hybrid thinking-mode toggle inherited from Qwen 3. Strongest open scientific reasoner per GPQA Diamond. The strongest multilingual open model in 2026 — Chinese, Korean, Japanese, German, French, Spanish all near-frontier.
Positioning
Qwen 3.5 235B-A17B is the late-2025 / early-2026 frontier MoE that's actually achievable on consumer-tier hardware — the DeepSeek V4 Pro for the rest of us. 397B total parameters with 235B activated and 17B effective per-token compute, distilled from Alibaba's Qwen 3.5 internal training. Where V4 Pro asks for $5,000-7,000 in unified memory or rented datacenter GPUs, Qwen 3.5 235B-A17B fits in a 128-GB Mac Studio M3 Ultra at Q4 (140 GB) and runs at usable interactive speeds. That's the operator-grade story: this is the frontier model the homelab tier can actually run.
Strengths
- Best-in-class multilingual capability at the frontier tier. Qwen's Chinese + English + 60+ language coverage is meaningfully stronger than DeepSeek or Llama 4. Critical for non-English-only operators.
- Strong reasoning + coding combo. Q3.5 is a generalist frontier model — not as specialist as DeepSeek R1 on pure reasoning, not as code-tuned as Qwen 2.5 Coder 32B, but solid on both at the frontier-quality tier.
- MoE architecture (~17B per-token compute) means decode is closer to a 17B dense model than a 397B dense one. Tok/s is reasonable for the absolute parameter count.
- 128-GB consumer-hardware accessibility. This is the bar Alibaba targeted. Q4 fits Mac Studio M3 Ultra; Q5 fits with offload. The 235B-A17B variant exists specifically because the team wanted a frontier MoE that wasn't datacenter-only.
- Apache 2.0 license — genuinely permissive, no MAU clauses, commercial use unrestricted.
Limitations
- Memory still substantial. ~140 GB at Q4, ~110 GB at Q3. 192-GB hardware (Mac Studio M3 Ultra flagship config) handles Q4 + 32K context comfortably. 128-GB tier handles Q4 with tight context. 96-GB tier is partial-offload territory.
- No 24-GB single-card path. RTX 5090 at 32 GB doesn't fit any usable quant. Multi-GPU is the NVIDIA path: 4× RTX 3090 (96 GB combined) runs Q3 with system-RAM offload.
- Chinese-language bias in some outputs. A frontier-tier consequence of the training corpus mix. Most users won't notice; sensitive deployments should evaluate.
- Tooling lag for new architectures. vLLM added Qwen 3.5 MoE support shortly after release; llama.cpp tracked within days. Day-zero performance was 60% of peak; current builds are at parity.
Real-world performance on Mac Studio M3 Ultra (192 GB)
- Q4 (~140 GB): ~12-18 tok/s decode, TTFT ~2s on 1K prompt. Genuinely interactive — the headline workload.
- Q3 (~110 GB): ~15-22 tok/s decode, faster TTFT, slight quality dip vs Q4. Good for most daily tasks.
- Q5 (~165 GB partial-offload to swap on 192-GB hardware): 8-~12 tok/s. Quality bump over Q4 is small; rarely worth the speed loss.
- Compare with: rented H100 80GB ×4 datacenter setup runs FP8 Q3.5 235B at ~80-120 tok/s — the production target, not consumer hardware.
Should you run this locally?
Yes, if you have a 128-GB+ Mac Studio (or equivalent unified-memory hardware), you want frontier-tier reasoning + coding + multilingual on local hardware, AND privacy-or-offline matters more than absolute peak quality. This is the right model for the "I want frontier capability locally" use case in 2026.
No, for anyone running a single consumer GPU. Anyone whose use case is "I need frontier-tier output today" with sub-1k-msg/day volume — rent the API (Alibaba Cloud Qwen API or via OpenRouter). Hosted-API economics are dramatically better for low-volume operators.
Probably not, for anyone whose primary workload is coding (Qwen 2.5 Coder 32B at 24 GB beats 3.5 235B-A17B on coding-specific benchmarks at 1/6 the hardware cost) or pure reasoning (the DeepSeek R1 Distill family fits smaller hardware).
How it compares
- vs DeepSeek V4 Pro (1.6T MoE) → V4 Pro has higher absolute quality ceiling but needs 192-GB+ hardware to run at any usable quant. Qwen 3.5 235B-A17B fits 128-GB hardware comfortably. Pick V4 Pro if you have the hardware AND need absolute frontier; pick Qwen 3.5 235B for accessibility + multilingual.
- vs Qwen 3 235B-A22B (prior generation) → Q3 235B is the late-2024 predecessor. Q3.5 has incremental quality gains + better coding + better multilingual. Same hardware footprint. Pick Q3.5 if available; Q3 is the prior-default.
- vs Llama 4 Maverick (Meta frontier MoE) → similar quality tier. Maverick has stronger ecosystem support (Meta's vLLM contributions land first). Qwen 3.5 has better multilingual + Apache 2.0 license. Pick on language requirements + license preference.
- vs DeepSeek R1 (671B reasoning) → R1 specializes in reasoning. Qwen 3.5 is generalist. For reasoning-only workloads R1 wins; for mixed daily-driver tasks Qwen 3.5 is more useful.
- vs Qwen 3 30B-A3B (smaller MoE sibling) → 30B-A3B fits 24 GB single-card hardware. Quality is meaningfully lower (closer to 8B-class than frontier) but accessibility is dramatically higher. Pick the small MoE for "I want Qwen on a 4090"; pick 235B-A17B for "I have a 128-GB workstation and want frontier."
Run this yourself
# Mac Studio M3 Ultra 192GB — Q4 fits comfortably
ollama pull qwen3.5:235b-a17b-q4_K_M
ollama run qwen3.5:235b-a17b-q4_K_M
# Or via llama.cpp directly:
llama-server -m qwen3.5-235b-a17b-Q4_K_M.gguf \
--ctx-size 32768 -ngl 999 --temp 0.7
Quant: Q4_K_M GGUF
Context: 32768 (KV cache f16, ~24 GB additional)
Backend: llama.cpp Metal via Ollama
Hardware: Mac Studio M3 Ultra 192 GB unified memory
Overview
Alibaba's May 2026 flagship. 397B total / 17B active MoE with hybrid thinking-mode toggle inherited from Qwen 3. Strongest open scientific reasoner per GPQA Diamond. The strongest multilingual open model in 2026 — Chinese, Korean, Japanese, German, French, Spanish all near-frontier.
Featured in this stack
The L3 execution stacks that pick this model as a recommended component, with the one-line note explaining the role it plays in each.
- Stack · L3·Production tier·Role: Frontier MoE (235B/17B-active)4× H100 SXM tensor-parallel workstation — frontier MoE serving reference
Qwen 3.5 frontier MoE at FP8 fits comfortably in 4× H100 80GB. The strongest open-weight multilingual + reasoning model in 2026. Apache 2.0 successor to Qwen 3.
Execution notes
Operator notes
Qwen 3.5 235B-A17B is the strongest open-weight MoE in May 2026 for general reasoning, multilingual coverage, and tool-using agentic workloads. It's the model you reach for when DeepSeek V4 Pro is overkill or when the Qwen license is acceptable for your deployment.
What makes it the operator default at the frontier tier:
- 17B active params — keeps tok/s competitive with dense 30B-class models despite the 235B total weight count.
- Strongest multilingual coverage in open weights — Chinese / Korean / Japanese / German / French / Spanish all near-frontier.
- GPQA Diamond leader among open-weight models as of May 2026.
- Hybrid thinking-mode toggle carries forward from Qwen 3 —
enable_thinking=truefor hard reasoning,falsefor fast chat.
Deployment notes
This is a frontier-tier model. AWQ-INT4 fits on:
- Dual H100 80GB (160 GB total, ~4 GB headroom for KV cache at 32K context).
- 4× RTX 6000 Ada (192 GB, more comfortable headroom).
- Apple Mac Studio M3 Ultra 192GB — viable but throttled; expect ~12 tok/s decode.
For most operators the practical access path is hosted inference via Together / Fireworks / Alibaba's own API. Local self-hosting only makes sense at organizational scale where the AWQ quant + multi-GPU + vLLM tensor-parallel setup is sustainable.
For sub-frontier hardware, drop to:
- Qwen 3.5 32B — workstation tier, hybrid thinking-mode preserved.
- Qwen 3.5 14B — consumer tier, 16 GB VRAM friendly.
Runtime compatibility
- vLLM ✓ excellent. Tensor-parallel + AWQ-INT4 is the production path. Set
--tensor-parallel-size 2for dual-H100. - SGLang ✓ excellent. Particularly strong here because RadixAttention compounds over the agent-loop prefix-cache pattern at frontier scale.
- Ollama ✗ impractical at this size — single-machine GGUF wasn't designed for 235B MoE.
- MLX-LM ✓ partial. Mac Studio M3 Ultra path is feasible at 4-bit but expect ~12 tok/s ceiling.
- TensorRT-LLM ✓ enterprise-tier path; recompile cost is significant but throughput at scale beats vLLM.
Quantization suitability
AWQ-INT4 is the production quant. The MoE routing is sensitive to lower bit-widths — Q3-class quants degrade more than equivalents on dense models because the routing decisions get noisier. Avoid them.
For research / cluster eval, FP16 (~450 GB) is the reference; expect ~3% quality lift over AWQ-INT4 on coding benchmarks but not enough to justify the 3× hardware cost.
Best use cases
- Reasoning agents at the frontier tier — pair with vLLM + filesystem/git MCP. The thinking-mode toggle gives you per-call control over reasoning depth.
- Multilingual production serving — strongest open multilingual coverage in 2026.
- GPQA / scientific reasoning workloads — leader among open-weight as of release.
- Long-context document analysis — 262K context inherited from Qwen 3.
When to use a different model
- Coding-first workloads: Qwen 2.5 Coder 32B at workstation tier is sharper for SWE-Bench-shape tasks. The frontier MoE doesn't beat it on code by much.
- Maximum quality, license unconstrained: DeepSeek V4 Pro — currently leads on most benchmarks.
- Smaller-hardware reasoning: DeepSeek R1 Distill Qwen 32B at workstation tier preserves the reasoning quality without the cluster requirement.
- Apache 2.0 hard requirement: drop to Qwen 3 235B-A22B — the predecessor — for the Apache lineage.
Failure modes specific to this model
- License confusion. Qwen 3.5 uses the Qwen License (commercial OK below 100M MAU), not Apache 2.0. Verify your deployment scale fits within the license terms before committing.
- Geopolitical refusal posture. Like Qwen 3, refuses certain prompts about Chinese politics. Disable system-prompt overrides at your own legal risk.
- Thinking-mode token bloat.
enable_thinking=trueproduces 5-15× more tokens per query. Budget your concurrency model accordingly when serving multiple clients.
Going deeper
- /stacks/local-coding-agent — agent-loop deployment recipe (Qwen 2.5 Coder is the workstation pick; this is the frontier upgrade)
- /maps/inference-runtimes-2026 — runtime ecosystem map
- vLLM operational review — the production-recommended runtime
- SGLang operational review — multi-tenant agent-cluster alternative
Family & lineage
How this model relates to others in its lineage. Family members share architecture and training-data roots; parent / children edges record direct distillation or fine-tune relationships.
Strengths
- GPQA Diamond leader among open models
- Hybrid thinking-mode toggle (think / no_think per turn)
- Strongest multilingual coverage in open-weight 2026
- 17B active params keep tok/s competitive with dense 30B
Weaknesses
- Qwen license caps commercial use at 100M MAU
- 397B total ⇒ workstation territory at Q4 (226 GB)
- Geopolitical refusal posture remains a concern for some deployments
Quantization variants
Each quantization trades model quality for file size and VRAM. Q4_K_M is the most popular starting point.
| Quantization | File size | VRAM required |
|---|---|---|
| Q4_K_M | 226.0 GB | 256 GB |
Get the model
HuggingFace
Original weights
Source repository — direct quantization required.
Hardware that runs this
Cards with enough VRAM for at least one quantization of Qwen 3.5 235B-A17B (MoE).
Models worth comparing
Same parameter band, plus what's one tier above and below — so you can decide what actually fits your hardware.
Frequently asked
What's the minimum VRAM to run Qwen 3.5 235B-A17B (MoE)?
Can I use Qwen 3.5 235B-A17B (MoE) commercially?
What's the context length of Qwen 3.5 235B-A17B (MoE)?
Source: huggingface.co/Qwen/Qwen3.5-235B-A17B
Reviewed by RunLocalAI Editorial. See our editorial policy for how we research and verify model claims.
Related — keep moving
Verify Qwen 3.5 235B-A17B (MoE) runs on your specific hardware before committing money.