wizard
141B parameters
Commercial OK
Reviewed June 2026

WizardLM-2 8x22B

Microsoft's RLHF-heavy fine-tune of Mixtral 8x22B. Briefly the top open chat model on LMSYS at release.

License: Apache 2.0 · Released Apr 15, 2024 · Context: 65,536 tokens

Our verdict

OP · Fredoline Eruo | VERIFIED JUN 12, 2026
unrated

Positioning

WizardLM-2 8x22B is an RLHF-heavy fine-tune of Mixtral 8x22B, released by Microsoft's WizardLM team under the permissive Apache 2.0 license. With 141B total parameters and a 65,536-token context window, it briefly held the top spot among open chat models on the LMSYS leaderboard at launch. Its Mixture-of-Experts architecture routes each token through 2 of 8 experts (roughly 39B active parameters, per Mistral's figure for the base model), so per-token compute is closer to a dense ~39B model than a dense 141B model. All 141B parameters must still be held in memory.

Strengths

  • Apache 2.0 license: Permissive for commercial use, fine-tuning, and redistribution, subject only to the license's attribution and notice requirements.
  • Large context window: 65,536 tokens enables processing of long documents, codebases, or multi-turn conversations.
  • RLHF-tuned for reasoning: Designed to follow instructions and produce coherent, step-by-step reasoning, as evidenced by its brief top ranking on LMSYS.
  • Efficient MoE architecture: With 141B total parameters but only ~39B active per token, inference needs less compute per token than a dense model of equivalent total size. Memory is not reduced, because every expert must stay loaded.

Limitations

  • Massive memory footprint: Even at Q4_K_M (~79.3 GB), adding KV cache and framework overhead pushes total VRAM requirements well beyond consumer and most workstation GPUs.
  • No community-verified benchmarks: We lack independent measurements for this model. Published vendor metrics should be treated as best-case.
  • Dependency on base model quality: As a fine-tune of Mixtral 8x22B, its performance is bounded by the base model's capabilities.
  • Limited ecosystem support: Being a fine-tune rather than a base model, some tools and frameworks may not have optimized support out of the box.

What it takes to run this locally

Quantized sizes (disk): FP16 ~282 GB, Q8_0 ~150 GB, Q6_K ~116.3 GB, Q5_K_M ~100.5 GB, Q4_K_M ~79.3 GB, Q3_K_M ~68.7 GB, Q2_K ~45.8 GB. Add ~30–50% for KV cache and framework overhead at typical context lengths. This model is firmly in the datacenter deployment class — requiring multiple high-memory GPUs (e.g., 4× A100 80GB or 8× RTX 6000 Ada) even at aggressive quantization. Consumer and workstation setups are not viable.
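
The overhead rule above can be sketched as a quick calculation. This is a minimal estimate that only applies the 30-50% range to the disk sizes listed in this section; the function names are illustrative, and real overhead depends on context length, batch size, and runtime.

```python
# Disk sizes in GB, copied from the list above.
QUANT_SIZES_GB = {
    "FP16": 282.0,
    "Q8_0": 150.0,
    "Q6_K": 116.3,
    "Q5_K_M": 100.5,
    "Q4_K_M": 79.3,
    "Q3_K_M": 68.7,
    "Q2_K": 45.8,
}


def memory_range_gb(file_gb, low=0.30, high=0.50):
    """Return (min, max) total memory after adding 30-50% overhead."""
    return file_gb * (1 + low), file_gb * (1 + high)


def estimate_all(sizes=QUANT_SIZES_GB):
    """Map each quantization name to its (min, max) memory estimate."""
    return {quant: memory_range_gb(size) for quant, size in sizes.items()}


if __name__ == "__main__":
    for quant, (lo, hi) in estimate_all().items():
        print(f"{quant:7s} {lo:6.1f} - {hi:6.1f} GB")
```

For Q4_K_M this gives roughly 103 to 119 GB, which is already more than a single 80 GB accelerator can hold.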

Should you run this locally?

Yes if you have access to multi-GPU datacenter hardware and need a permissively licensed, reasoning-tuned chat model with a large context window. No if you lack the infrastructure to run 80+ GB models, or if a smaller fine-tune (e.g., on Mixtral 8x7B) meets your needs.

Catalog cross-links

  • Mixtral 8x22B
  • WizardLM-2 7B
  • Apache 2.0 license guide

How to run it

WizardLM-2 8x22B is a 141B-parameter MoE model with 8 experts per layer, 2 of which are routed per token (~39B active parameters, per Mistral's figure for the base Mixtral 8x22B). Per-token compute is therefore similar to a ~39B dense model, but all 141B parameters must still be stored. Run it at Q4_K_M via Ollama (ollama pull wizardlm2:8x22b) or llama.cpp with -ngl 999 -fa -c 16384. The Q4_K_M file is ~75 GB on disk.

Single-card floor: 48 GB. An RTX A6000 (48GB) cannot hold the full ~75 GB file, so it works at 8K context only with part of the model offloaded to system RAM; lower -ngl accordingly. RTX 4090 24GB: Q3_K_M (55 GB) with expert offload to RAM. Dual RTX 4090 (48 GB total, row-split) at Q4_K_M likewise needs offload. Recommended: single RTX A6000 48GB at Q4_K_M (8-16K context). Throughput: 15-25 tok/s on RTX A6000 at Q4_K_M.

The Mixtral-style MoE architecture is well supported in llama.cpp. For serving: vLLM on a single A100 80GB at AWQ-INT4. WizardLM-2 is instruction-tuned, not a base model. Use it for chat, instruction-following, and agent workflows.

Hardware guidance

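Minimum: RTX 3090 24GB at Q3_K_M with expert offload (slow). Recommended: RTX A6000 48GB at Q4_K_M (8K context). Optimal: A100 80GB at AWQ-INT4 for serving.

VRAM math: 141B total parameters; 2 of 8 experts are selected per token, for ~39B active (Mistral's figure for the base model). Q4_K_M for the full 141B: ~70-80 GB. With all 8 experts resident in VRAM, weights take ~75 GB; with expert offload to RAM, VRAM use drops to ~25-30 GB (shared layers plus whichever experts stay on the GPU). Note that llama.cpp's --no-kv-offload flag keeps the KV cache in system RAM; it does not control expert placement. KV cache at 8K: ~10-15 GB.

RTX 4090 24GB + expert offload: tight but functional for 4K context. Mac Studio M4 Max 64GB: 4-8 tok/s, though the ~75 GB Q4_K_M file exceeds 64 GB of unified memory, so a smaller quantization is needed there. Dual RTX 3090: row-split at Q4_K_M, with offload for whatever exceeds the combined 48 GB. Cloud: single A100 80GB at ~$5-10/hr for AWQ serving.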
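
The relationship between total and active parameters can be worked through with a two-term model: parameters shared by every token (attention, embeddings, routers) plus identical per-expert feed-forward blocks. The sketch below is a simplification, and the ~39B active figure it uses is the number Mistral published for the base Mixtral 8x22B, taken here as an assumption; the function name is illustrative.

```python
def split_params(total_b, active_b, n_experts=8, experts_per_token=2):
    """Solve for per-expert and shared parameter counts (in billions).

    Model:  total  = shared + n_experts * expert
            active = shared + experts_per_token * expert
    """
    expert = (total_b - active_b) / (n_experts - experts_per_token)
    shared = total_b - n_experts * expert
    return expert, shared


if __name__ == "__main__":
    # Assumed inputs: 141B total, ~39B active, 2 of 8 experts per token.
    expert, shared = split_params(141.0, 39.0)
    print(f"per-expert ~ {expert:.1f}B, shared ~ {shared:.1f}B")
    print(f"check active: {shared + 2 * expert:.1f}B")
```

Under these assumptions each expert is about 17B and the shared layers about 5B, which is why the total is 141B rather than 8 × 22B = 176B.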

What breaks first

  1. Expert offload stall. With expert offload to system RAM, routing to a RAM-resident expert adds 30-100ms latency per token switch. Visible as generation stutter. Keep as many experts in VRAM as possible.
  2. Ollama Q4_K_M size inflation. Some Ollama tags for WizardLM-2 package additional metadata that inflates the download. Check actual model size vs advertised.
  3. Instruction-following degradation at Q3. Below Q4_K_M, instruction adherence weakens noticeably on this model, more than on similarly-sized dense models. The MoE expert gates become noisier at low precision.
  4. WizardLM's specific chat template. Using the wrong chat template (e.g., Llama 3 instead of Vicuna-style) produces garbled or repetitive output. Verify in the Hugging Face repo's tokenizer_config.json.
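
To make the chat-template point concrete, here is a minimal sketch of a Vicuna-style prompt builder. The system sentence, the USER/ASSISTANT markers, and the </s> end-of-turn token follow the Vicuna convention as we understand it; the exact wording and whitespace are assumptions and should be checked against the repo's tokenizer_config.json before use. The helper name is hypothetical.

```python
# Assumed Vicuna-style system prompt; verify against the model's own template.
DEFAULT_SYSTEM = (
    "A chat between a curious user and an artificial intelligence assistant. "
    "The assistant gives helpful, detailed, and polite answers to the user's questions."
)


def build_vicuna_prompt(turns, system=DEFAULT_SYSTEM, eos="</s>"):
    """Build a prompt from (user, assistant) pairs.

    Pass None as the assistant text of the final pair to leave the prompt
    open for the model to complete.
    """
    prompt = system + " "
    for user, assistant in turns:
        prompt += f"USER: {user} ASSISTANT:"
        if assistant is not None:
            # Completed turns are closed with the end-of-sequence token.
            prompt += f" {assistant}{eos}"
    return prompt


if __name__ == "__main__":
    print(build_vicuna_prompt([("Hi", "Hello."), ("Who are you?", None)]))
```

Runtimes such as Ollama normally apply the packaged template for you; a hand-built prompt like this matters mainly when calling a raw completion endpoint.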

Runtime recommendation

llama.cpp with -ngl 999 for single-GPU local use. Ollama for quick-start (same llama.cpp backend). vLLM for multi-user serving on A100. Mixtral-style MoE is well-supported across all three. MLX-LM is only relevant on Apple Silicon; on NVIDIA hardware, use CUDA llama.cpp.

Common beginner mistakes

  • Mistake: Pulling Ollama's default tag without checking quant. Fix: Ollama's default may be Q4_0 or Q8_0. Verify with ollama show wizardlm2:8x22b. Q8_0 is ~150 GB on disk.
  • Mistake: Assuming "8x22B = 176B parameters". Fix: The naming is misleading. The total is ~141B, because only the feed-forward expert blocks are replicated 8 times while attention and embedding layers are shared. Check the Hugging Face repo for the actual parameter count.
  • Mistake: Using Llama 3 chat template. Fix: WizardLM-2 uses a different, Vicuna-style template. Check the model card on Hugging Face for the correct format.
  • Mistake: Expecting 100+ tok/s because it's MoE. Fix: MoE saves compute per token vs dense of same quality, but ~39B active is still substantial. Expect 15-25 tok/s on A6000, not 80+.
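
The throughput expectation can be sanity-checked with a common rule of thumb: single-stream decoding is limited by how many bytes of weights must be read per token. The sketch below is a rough ceiling, not a benchmark. It assumes all weights sit in GPU memory, ignores KV cache reads, compute, and offload stalls, and uses 768 GB/s (the RTX A6000's published memory bandwidth) and ~39B active parameters as assumed inputs.

```python
def decode_ceiling_tok_s(file_gb, total_b, active_b, bandwidth_gb_s):
    """Upper bound on tokens/s if decoding is purely memory-bandwidth bound.

    Bytes read per token are approximated as the quantized file size scaled
    by the fraction of parameters that are active for each token.
    """
    gb_read_per_token = file_gb * active_b / total_b
    return bandwidth_gb_s / gb_read_per_token


if __name__ == "__main__":
    # Assumed inputs: ~75 GB Q4_K_M file, 141B total, ~39B active, 768 GB/s.
    moe = decode_ceiling_tok_s(75.0, 141.0, 39.0, 768.0)
    dense = decode_ceiling_tok_s(75.0, 141.0, 141.0, 768.0)
    print(f"MoE ceiling:   {moe:.1f} tok/s")
    print(f"Dense ceiling: {dense:.1f} tok/s")
```

Even this idealized ceiling lands in the tens of tokens per second, so measured figures of 15-25 tok/s are plausible and 100+ tok/s is not.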

Family & lineage

How this model relates to others in its lineage. Family members share architecture and training-data roots; parent / children edges record direct distillation or fine-tune relationships.

Strengths

  • Strong chat quality
  • Apache 2.0

Weaknesses

  • Needs workstation- or datacenter-class hardware
  • Older (released April 2024)

Quantization variants

Each quantization trades model quality for file size and VRAM. Q4_K_M is the most popular starting point.

Quantization | File size | VRAM required
Q4_K_M | 84.0 GB | 96 GB

Get the model

HuggingFace

Original weights

huggingface.co/microsoft/WizardLM-2-8x22B

Source repository — direct quantization required.

Frequently asked

What's the minimum VRAM to run WizardLM-2 8x22B?

96GB of VRAM is enough to run WizardLM-2 8x22B at the Q4_K_M quantization (file size 84.0 GB). Higher-quality quantizations need more.

Can I use WizardLM-2 8x22B commercially?

Yes. WizardLM-2 8x22B ships under the Apache 2.0 license, which permits commercial use. Always read the license text before deployment.

What's the context length of WizardLM-2 8x22B?

WizardLM-2 8x22B supports a context window of 65,536 tokens (64K).

Source: huggingface.co/microsoft/WizardLM-2-8x22B

Reviewed by RunLocalAI Editorial. See our editorial policy for how we research and verify model claims.
