RUNLOCALAI · v38

Operator-grade instrument for local-AI hardware intelligence. Hand-written verdicts. Real benchmarks. Reproducible commands.

OP · Fredoline Eruo

Text & Reasoning · Open-weight · Apache 2.0 (most variants)

Qwen

by Alibaba (Qwen Team)

Alibaba's flagship open-weight family with permissive licensing across most variants. Qwen 3 235B-A22B is the leading open-weight MoE for production reasoning; Qwen 3 32B dense is the strongest 32B-class chat model.

Best entry point for local use

Start with Qwen 3 32B at Q4_K_M via Ollama — it fits on a single RTX 4090 24 GB and delivers MMLU 88.5% and GSM8K 94.2%, outperforming Llama 3.1 70B at less than half the VRAM. The 32B is Qwen's efficiency sweet spot: Apache 2.0 license with no MAU cap, strong bilingual English-Chinese performance, and best-in-class math scores (MATH 500 ~82%). If you have limited VRAM (<12 GB), use Qwen 3 8B Q4 — it runs on a MacBook Pro M4 Max at 22+ tok/s.

Skip Qwen 3 235B MoE for a first deployment — the expert-offloading complexity is unnecessary for most workloads, and the 32B dense handles 95% of use cases. Skip Qwen 2.5-Coder unless you specifically need code-first behavior — the base Qwen 3 models' improved code generation obsoletes the dedicated coder variants for general use.
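The "fits on a single 24 GB card" claim is easy to sanity-check with back-of-envelope arithmetic. Below is a minimal Python sketch; the bits-per-weight figures are approximations (Q4_K_M averages roughly 4.8 bits/weight because some tensors stay at higher precision), and the 2 GB overhead allowance for KV cache and activations is a rough assumption, not a measured number.

```python
def fits_in_vram(params_b: float, bits_per_weight: float,
                 vram_gb: float, overhead_gb: float = 2.0) -> bool:
    """Rough check: quantized weight size plus a fixed KV-cache/activation
    overhead allowance, compared against available VRAM.

    bits_per_weight is an approximation: Q4_K_M averages ~4.8 bits/weight
    since attention and embedding tensors are kept at higher precision.
    """
    weight_gb = params_b * bits_per_weight / 8  # params in billions -> GB
    return weight_gb + overhead_gb <= vram_gb

# Qwen 3 32B at Q4_K_M on a 24 GB RTX 4090: ~19.2 GB of weights + overhead
print(fits_in_vram(32, 4.8, 24))   # True
# The same model at Q8_0 (~8.5 bits/weight) would not fit
print(fits_in_vram(32, 8.5, 24))   # False
```

Longer contexts push the real overhead well past 2 GB, so treat a marginal "True" as a reason to test before buying.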

Deployment guidance

  • Single-user local: Ollama + qwen3:32b Q4_K_M on an RTX 4090 24 GB, or Apple M3 Ultra via MLX-LM.
  • Multi-user serving: vLLM 0.6.5+ with AWQ 4-bit on 2× L40S — Qwen's GQA architecture enables efficient prefix caching at high concurrency.
  • MoE frontier: SGLang v0.2.5+ with the DeepSeek/Qwen MoE backend on 4× H100 SXM for Qwen 3 235B-A22B FP8 — ~8,000 tok/s at batch 64.
  • Mobile: llama.cpp with Qwen 3 8B Q4_0 on Snapdragon X Elite — ~15 tok/s.

Always keep MoE router weights at FP16. Verify the chat template uses the <|im_start|> format — Llama-format templates silently degrade Qwen's instruction-following. See the GPU buyer guide.
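To make the chat-template warning concrete, here is a hand-rolled sketch of the ChatML-style format (the <|im_start|>/<|im_end|> markers) that Qwen expects. In practice the authoritative template ships inside the model's tokenizer config and your runtime applies it for you; this illustration only shows the shape a correctly rendered prompt should have.

```python
def chatml_prompt(messages: list[dict]) -> str:
    """Render a message list in the ChatML format Qwen models expect.

    Each turn is wrapped in <|im_start|>role ... <|im_end|> markers.
    Feeding Qwen a Llama-style [INST] template instead will not error,
    but quietly degrades instruction-following quality.
    """
    parts = []
    for m in messages:
        parts.append(f"<|im_start|>{m['role']}\n{m['content']}<|im_end|>\n")
    parts.append("<|im_start|>assistant\n")  # cue the model to reply
    return "".join(parts)

prompt = chatml_prompt([
    {"role": "system", "content": "You are a helpful assistant."},
    {"role": "user", "content": "Why keep MoE router weights at FP16?"},
])
print(prompt.startswith("<|im_start|>system"))  # True
```

If a Qwen model's replies suddenly ramble or ignore the system prompt after switching runtimes, a mismatched template is the first thing to check.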

Featured models

Models in this family with our verdicts

  • Qwen 3 235B-A22B
  • Qwen 3 30B-A3B
  • Qwen 3 32B

Recommended runtimes

  • llama.cpp
  • Ollama
  • SGLang
  • vLLM

Related families

  • DeepSeek
  • Llama

Related — keep moving

Compare hardware
  • RTX 3090 vs RTX 4090 →
  • RTX 4090 vs RTX 5090 →
Buyer guides
  • Best GPU for Qwen models →
  • Best GPU for local AI →
  • Best laptop for local AI →
  • Best Mac for local AI →
  • Best used GPU for local AI →
When it doesn't work
  • CUDA out of memory →
  • Ollama running slowly →
  • ROCm not detected →
  • Model keeps crashing →
Runtimes that fit
  • llama.cpp →
  • Ollama →
  • SGLang →
  • vLLM →
Alternatives
  • DeepSeek
  • Llama
Before you buy

Verify Qwen runs on your specific hardware before committing money.

  • Will it run on my hardware? →
  • Custom hardware comparison →
  • GPU recommender (4 questions) →