WizardLM-2 8x22B
Microsoft's RLHF-heavy fine-tune of Mixtral 8x22B. Briefly the top open chat model on LMSYS at release.
Positioning
WizardLM-2 8x22B is an RLHF-heavy fine-tune of Mixtral 8x22B, released by Microsoft's WizardLM team under the permissive Apache 2.0 license. It has 141B total parameters and a 65,536-token context window, and it briefly held the top spot among open chat models on the LMSYS leaderboard at launch. Its Mixture-of-Experts architecture routes each token through 2 of 8 experts, so only about 39B parameters are active per token. Per-token compute is therefore closer to a dense ~39B model than a dense 141B model, although all 141B parameters must still be held in memory.
Strengths
- Apache 2.0 license: Permissive for commercial use, fine-tuning, and redistribution, subject to the license's attribution and notice requirements.
- Large context window: 65,536 tokens enables processing of long documents, codebases, or multi-turn conversations.
- RLHF-tuned for reasoning: Designed to follow instructions and produce coherent, step-by-step reasoning, as evidenced by its brief top ranking on LMSYS.
- Efficient MoE architecture: With 141B total parameters but only ~39B active per token, inference requires less compute than a dense model of equivalent total size. Memory is not reduced, because every expert must stay loaded.
Limitations
- Massive memory footprint: Even at Q4_K_M (~79.3 GB), adding KV cache and framework overhead pushes total VRAM requirements well beyond consumer and most workstation GPUs.
- No community-verified benchmarks: We lack independent measurements for this model. Published vendor metrics should be treated as best-case.
- Dependency on base model quality: As a fine-tune of Mixtral 8x22B, its performance is bounded by the base model's capabilities.
- Limited ecosystem support: Being a fine-tune rather than a base model, some tools and frameworks may not have optimized support out of the box.
What it takes to run this locally
Quantized sizes (disk): FP16 ~282 GB, Q8_0 ~150 GB, Q6_K ~116.3 GB, Q5_K_M ~100.5 GB, Q4_K_M ~79.3 GB, Q3_K_M ~68.7 GB, Q2_K ~45.8 GB. Add ~30–50% for KV cache and framework overhead at typical context lengths. Held fully in VRAM, this model is firmly in the datacenter deployment class: FP16 needs on the order of 4× A100 80GB or 8× RTX 6000 Ada for the weights alone, and Q4_K_M with overhead (roughly 103-119 GB) still exceeds a single 80 GB card. No single consumer or workstation GPU can hold the model without offloading part of it to system RAM.
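The overhead rule above is simple arithmetic, and a short script makes it easy to check against the hardware you have. This is a minimal sketch: the file sizes are copied from the list above, the 30-50% range is the same rule of thumb, and the helper names (`vram_budget_gb`, `gpus_needed`) are hypothetical. It ignores per-GPU fragmentation and assumes the model splits evenly across cards.

```python
import math

# File sizes in GB, copied from the list above.
QUANT_FILE_GB = {
    "FP16": 282.0,
    "Q8_0": 150.0,
    "Q6_K": 116.3,
    "Q5_K_M": 100.5,
    "Q4_K_M": 79.3,
    "Q3_K_M": 68.7,
    "Q2_K": 45.8,
}


def vram_budget_gb(file_gb, overhead_low=0.30, overhead_high=0.50):
    """Return (low, high) memory estimates: file size plus 30-50% overhead."""
    return file_gb * (1 + overhead_low), file_gb * (1 + overhead_high)


def gpus_needed(total_gb, per_gpu_gb):
    """Smallest number of identical GPUs whose combined VRAM covers total_gb."""
    return math.ceil(total_gb / per_gpu_gb)


for name, size in QUANT_FILE_GB.items():
    low, high = vram_budget_gb(size)
    count = gpus_needed(high, 80)
    print(f"{name:7s} {low:6.1f}-{high:6.1f} GB, up to {count} x 80 GB GPUs")
```

Using the high end of the range gives a conservative GPU count. Swap in your own card's capacity for the 80 GB figure.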
Should you run this locally?
Yes if you have access to multi-GPU datacenter hardware and need a permissively licensed, reasoning-tuned chat model with a large context window. No if you lack the infrastructure to run 80+ GB models, or if a smaller fine-tune (e.g., on Mixtral 8x7B) meets your needs.
Catalog cross-links
- Mixtral 8x22B
- WizardLM-2 7B
- Apache 2.0 license guide
How to run it
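WizardLM-2 8x22B is a 141B-parameter MoE model with 8 experts per layer. Each token is routed to 2 of the 8 experts, for about 39B active parameters per token (less than 2 × 22B = 44B, because attention layers are shared across experts). Per-token compute is therefore similar to a ~39B dense model. The Mixtral-style MoE architecture is well supported in llama.cpp.
Run it at Q4_K_M via Ollama (ollama pull wizardlm2:8x22b) or llama.cpp with -ngl 999 -fa -c 16384 when the whole model fits in VRAM; lower -ngl to keep some layers in system RAM when it does not. The Q4_K_M file is ~75 GB on disk, so it does not fit entirely on a 48 GB card.
Hardware options: RTX A6000 (48GB) at Q4_K_M and 8K context, with the remainder offloaded to system RAM. RTX 4090 24GB: Q3_K_M (55 GB) with expert offload to RAM. Dual RTX 4090 with row split (48 GB total) at Q4_K_M also needs partial offload. Recommended: single RTX A6000 48GB at Q4_K_M (8-16K context). Reported throughput: 15-25 tok/s on RTX A6000 at Q4_K_M.
For serving: vLLM on a single A100 80GB at AWQ-INT4. At 4 bits per weight, 141B parameters take about 70 GB, which leaves little headroom for KV cache.
WizardLM-2 is instruction-tuned, not a base model. Use it for chat, instruction-following, and agent workflows.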
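Once the model is pulled, Ollama exposes it over a local HTTP API. The sketch below builds a non-streaming request to the `/api/generate` endpoint using only the standard library. It assumes Ollama is listening on its default port (11434) and that the tag is `wizardlm2:8x22b` as above; `build_request` and `generate` are illustrative helper names, and the `num_ctx` value of 8192 is just an example matching the 8K context discussed here.

```python
import json
import urllib.request

# Assumption: Ollama is running locally on its default port.
OLLAMA_URL = "http://localhost:11434/api/generate"


def build_request(prompt, model="wizardlm2:8x22b", num_ctx=8192):
    """Build a non-streaming generate request without sending it."""
    payload = {
        "model": model,
        "prompt": prompt,
        "stream": False,                 # return one JSON object, not chunks
        "options": {"num_ctx": num_ctx},  # context window for this request
    }
    return urllib.request.Request(
        OLLAMA_URL,
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def generate(prompt, **kwargs):
    """Send the request and return the generated text."""
    with urllib.request.urlopen(build_request(prompt, **kwargs)) as resp:
        return json.loads(resp.read())["response"]
```

Calling `generate("Summarize the Apache 2.0 license in two sentences.")` returns the completion as a string. Raising `num_ctx` increases KV cache memory, so keep it in line with the VRAM you have left after loading the weights.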
Hardware guidance
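- Minimum: RTX 3090 24GB at Q3_K_M with expert offload (slow).
- Recommended: RTX A6000 48GB at Q4_K_M (8K context), with the weights that do not fit offloaded to system RAM.
- Optimal: A100 80GB at AWQ-INT4 for serving.
VRAM math: 141B total parameters, with 2 of 8 experts routed per token for ~39B active (not 2 × 22B = 44B, because attention layers are shared). The active count determines compute, not memory: every expert must be stored somewhere. Q4_K_M for the full 141B: ~70-80 GB. With every layer on the GPU, all 8 experts in VRAM come to ~75 GB; with expert offload to RAM, VRAM use drops to ~25-30 GB. Note that llama.cpp's --no-kv-offload flag keeps the KV cache in system RAM and does not control expert placement. KV cache and compute buffers at 8K: budget ~10-15 GB.
- RTX 4090 24GB + expert offload: tight but functional for 4K context.
- Mac Studio M4 Max 64GB: 4-8 tok/s, but the ~75 GB Q4_K_M file exceeds 64 GB of unified memory, so this needs a smaller quantization or a higher-memory configuration.
- Dual RTX 3090: row split at Q4_K_M, with partial offload to system RAM (48 GB total VRAM).
- Cloud: single A100 80GB at ~$5-10/hr for AWQ serving.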
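The KV-cache part of this budget can be estimated from the attention shape. The sketch below assumes the published Mixtral 8x22B configuration (56 layers, 8 key/value heads, head dimension 128) and an FP16 cache; none of these values appear in the text above, so check them against the model's config.json before relying on the numbers. The function name `kv_cache_bytes` is hypothetical.

```python
# Assumed architecture values (from the public Mixtral 8x22B config; verify
# against config.json in the model repository).
NUM_LAYERS = 56
NUM_KV_HEADS = 8      # grouped-query attention: fewer KV heads than query heads
HEAD_DIM = 128
BYTES_PER_VALUE = 2   # FP16 cache


def kv_cache_bytes(context_tokens):
    """Raw KV-cache size: one K and one V vector per layer, per KV head, per token."""
    per_token = 2 * NUM_LAYERS * NUM_KV_HEADS * HEAD_DIM * BYTES_PER_VALUE
    return per_token * context_tokens


for ctx in (4096, 8192, 16384, 65536):
    gib = kv_cache_bytes(ctx) / 2**30
    print(f"{ctx:6d} tokens: {gib:5.2f} GiB")
```

Under these assumptions the raw cache is under 2 GiB at 8K and reaches 14 GiB only at the full 65,536-token window, so the 10-15 GB figure quoted above leaves generous room for compute buffers and runtime overhead at short contexts.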
What breaks first
- Expert offload stall. With expert offload to system RAM, routing to a RAM-resident expert adds 30-100 ms latency per token switch. Visible as generation stutter. Keep as many experts in VRAM as possible.
- Ollama Q4_K_M size inflation. Some Ollama tags for WizardLM-2 package additional metadata that inflates the download. Check actual model size vs advertised.
- Instruction-following degradation at Q3. Below Q4_K_M, instruction adherence weakens noticeably on this model, more than on similarly sized dense models. The MoE expert gates become noisier at low precision.
- WizardLM's specific chat template. Using the wrong chat template (e.g., Llama 3 instead of Vicuna-style) produces garbled or repetitive output. Verify in the Hugging Face repo's tokenizer_config.json.
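To make the chat-template point concrete, here is a minimal sketch of a Vicuna-style prompt builder. The system line, the `USER:` / `ASSISTANT:` markers, and the `</s>` end-of-turn token follow the Vicuna convention as commonly documented for WizardLM-2; treat all three as assumptions and confirm them against tokenizer_config.json. `build_vicuna_prompt` is an illustrative name, not part of any library.

```python
# Assumed default system line for Vicuna-style prompts; verify before use.
DEFAULT_SYSTEM = (
    "A chat between a curious user and an artificial intelligence assistant. "
    "The assistant gives helpful, detailed, and polite answers to the user's "
    "questions."
)


def build_vicuna_prompt(turns, system=DEFAULT_SYSTEM, eos="</s>"):
    """Build a prompt from (user, assistant) pairs.

    Pass None as the assistant reply in the final pair to leave the prompt
    open for the model to complete.
    """
    prompt = system + " "
    for user_msg, assistant_msg in turns:
        prompt += f"USER: {user_msg} ASSISTANT:"
        if assistant_msg is not None:
            # Completed turns end with the end-of-sequence token.
            prompt += f" {assistant_msg}{eos}"
    return prompt


example = build_vicuna_prompt([("Hi", "Hello."), ("Who are you?", None)])
print(example)
```

Runtimes such as Ollama usually apply the template for you. Building it by hand matters mainly when you call llama.cpp or a raw completion endpoint directly.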
Common beginner mistakes
- Mistake: Pulling Ollama's default tag without checking the quantization. Fix: Ollama's default may be Q4_0 or Q8_0; verify with ollama show wizardlm2:8x22b. Q8_0 needs roughly 150 GB.
- Mistake: Assuming "8x22B = 176B parameters". Fix: The naming is misleading. The total is ~141B because the 8 experts share attention layers and embeddings; only the feed-forward blocks are replicated per expert. Check the Hugging Face repo for the actual parameter count.
- Mistake: Using the Llama 3 chat template. Fix: WizardLM-2 uses a different template. Check the model card on Hugging Face for the correct format.
- Mistake: Expecting 100+ tok/s because it's MoE. Fix: MoE saves compute per token compared with a dense model of comparable quality, but ~39B active is still substantial. Expect 15-25 tok/s on A6000, not 80+.
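The parameter-count point can be checked with a back-of-the-envelope calculation. The sketch below assumes the published Mixtral 8x22B shape (hidden size 6144, feed-forward size 16384, 56 layers, 48 query heads, 8 key/value heads, vocabulary 32000, untied input and output embeddings); these values are not stated in the text above and should be verified against config.json. It ignores the small normalization weights, and `MoEConfig` and `param_counts` are hypothetical names.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class MoEConfig:
    # Assumed values from the public Mixtral 8x22B config; verify before use.
    hidden_size: int = 6144
    intermediate_size: int = 16384
    num_layers: int = 56
    num_heads: int = 48
    num_kv_heads: int = 8
    vocab_size: int = 32000
    num_experts: int = 8
    experts_per_token: int = 2


def param_counts(cfg):
    """Return (total, active-per-token) parameter counts, ignoring norm weights."""
    head_dim = cfg.hidden_size // cfg.num_heads
    kv_dim = cfg.num_kv_heads * head_dim
    # Attention: Q and O projections are square; K and V are narrower (GQA).
    attn = 2 * cfg.hidden_size * cfg.hidden_size + 2 * cfg.hidden_size * kv_dim
    # Each expert is a gated feed-forward block with three weight matrices.
    expert = 3 * cfg.hidden_size * cfg.intermediate_size
    router = cfg.hidden_size * cfg.num_experts
    # Input embedding plus a separate output head.
    embed = 2 * cfg.vocab_size * cfg.hidden_size
    shared = attn + router
    total = cfg.num_layers * (shared + cfg.num_experts * expert) + embed
    active = cfg.num_layers * (shared + cfg.experts_per_token * expert) + embed
    return total, active


total, active = param_counts(MoEConfig())
print(f"total:  {total / 1e9:.1f}B")
print(f"active: {active / 1e9:.1f}B")
```

Under these assumptions the total comes to about 141B and the active count to about 39B, which is why neither "8 × 22B = 176B" nor "2 × 22B = 44B" matches the real figures.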
Family & lineage
How this model relates to others in its lineage. Family members share architecture and training-data roots; parent / children edges record direct distillation or fine-tune relationships.
Strengths
- Strong chat quality
- Apache 2.0
Weaknesses
- Needs workstation- or datacenter-class hardware
- Older release (April 2024)
Quantization variants
Each quantization trades model quality for file size and VRAM. Q4_K_M is the most popular starting point.
| Quantization | File size | VRAM required |
|---|---|---|
| Q4_K_M | 84.0 GB | 96 GB |
Get the model
HuggingFace
Original weights
Source repository — direct quantization required.
Frequently asked
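What's the minimum VRAM to run WizardLM-2 8x22B?
About 24 GB at Q3_K_M with expert offload to system RAM, which is slow. Holding Q4_K_M fully in VRAM takes roughly 75-85 GB for the weights, plus KV cache and framework overhead.
Can I use WizardLM-2 8x22B commercially?
Yes. It is released under the Apache 2.0 license, which permits commercial use.
What's the context length of WizardLM-2 8x22B?
65,536 tokens.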
Source: huggingface.co/microsoft/WizardLM-2-8x22B
Reviewed by RunLocalAI Editorial. See our editorial policy for how we research and verify model claims.