Hunyuan Large 389B MoE

Tencent's frontier MoE. 389B total / 52B active. License permits commercial use with restrictions on companies above MAU thresholds.

License: Tencent Hunyuan License · Released Nov 5, 2024 · Context: 256,000 tokens

Our verdict

OP · Fredoline Eruo|VERIFIED JUN 12, 2026

unrated

Positioning

Hunyuan Large 389B MoE is Tencent's frontier-class mixture-of-experts model, released under the Tencent Hunyuan License. With 389B total parameters and approximately 52B activated per token, it offers the capacity of a massive dense model while keeping per-token inference cost closer to a dense 52B-parameter model. Its 256K context window positions it for long-document and multi-turn applications. The license permits commercial use but includes restrictions for companies above certain MAU thresholds, making it a viable option for organizations that can accept those terms.

Strengths

Massive total capacity with efficient inference: As an MoE with 389B total parameters but only ~52B active per token, the model delivers high representational power while keeping per-token compute similar to a dense 52B model.
Very long context window: The 256K token context length enables processing of entire books, long codebases, or extended conversation histories without truncation.
Commercial license with clear terms: The Tencent Hunyuan License explicitly allows commercial use, with restrictions only for very large-scale deployments (above MAU thresholds), making it suitable for many businesses.
Frontier-class architecture from a major vendor: Releasing open weights at this scale signals Tencent's commitment to competitive open-weight AI.

Limitations

Extremely high hardware requirements: At FP16, the model requires ~778 GB of disk space, and even at Q4_K_M it needs ~219 GB plus substantial overhead for KV cache and framework. This places it firmly in datacenter territory.
License restrictions for large-scale use: Companies with user bases above the license's MAU threshold may face additional terms or fees, limiting deployment for high-traffic services.
No independently verified benchmarks available: As a relatively new model, community-run evaluations are sparse. Operators should treat vendor-published metrics as best-case and expect variability in real-world conditions.
MoE routing overhead: While per-token compute is low, the MoE architecture introduces routing and memory bandwidth challenges that can impact latency, especially at low batch sizes.

What it takes to run this locally

Hunyuan Large 389B MoE is a datacenter-class model. Weight files range from ~778 GB (FP16) down to ~126 GB (Q2_K), and even the smallest quant requires a multi-GPU server with high memory bandwidth to stay fully GPU-resident. For example, Q4_K_M (~219 GB) plus ~30-50% overhead for KV cache and framework at typical context lengths comes to roughly 285-330 GB of total GPU memory, which means four or five A100 80GB or H100 80GB GPUs. Consumer and workstation hardware (single 24-48 GB GPUs) cannot hold this model in VRAM.
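The sizing arithmetic above can be reproduced with a short sketch. The ~4.5 bits per weight used for Q4_K_M is an assumption chosen because it matches the ~219 GB figure quoted here; real GGUF files vary with the tensor mix. The function names are illustrative and not part of any library.

```python
import math

TOTAL_PARAMS_B = 389  # total parameters, in billions (from the text)


def weight_size_gb(params_b: float, bits_per_weight: float) -> float:
    """Approximate weight footprint in GB, taking 1 GB = 1e9 bytes."""
    return params_b * bits_per_weight / 8


def gpus_needed(weights_gb: float, overhead: float, gpu_gb: float = 80.0) -> int:
    """GPUs required to hold weights plus a fractional overhead for
    KV cache and framework buffers (overhead=0.30 means +30%)."""
    return math.ceil(weights_gb * (1 + overhead) / gpu_gb)


fp16_gb = weight_size_gb(TOTAL_PARAMS_B, 16)   # 2 bytes per weight
q4_gb = weight_size_gb(TOTAL_PARAMS_B, 4.5)    # assumed ~4.5 bits per weight

print(f"FP16:   {fp16_gb:.0f} GB")
print(f"Q4_K_M: {q4_gb:.0f} GB")
print("80 GB GPUs at +30% overhead:", gpus_needed(219, 0.30))
print("80 GB GPUs at +50% overhead:", gpus_needed(219, 0.50))
```

With the page's own numbers this lands on four to five 80 GB GPUs, which is why the "4+ A100 80GB" guidance sits at the low end of the overhead range.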

Should you run this locally?

Yes if you have access to a multi-GPU datacenter cluster (e.g., 4+ A100 80GB or H100 80GB) and need a frontier-capable model with a permissive commercial license for your use case. The MoE architecture gives you high capacity at lower per-token compute than a dense model of similar total size.

No if you lack the hardware budget for multi-GPU setups, need to deploy on consumer or workstation GPUs, or cannot accept the license's MAU-based restrictions. For smaller-scale deployments, consider smaller dense or MoE models in the 7B-70B range.

Catalog cross-links

Hunyuan Large 389B MoE
DeepSeek V2 (another large MoE with different license terms)
Mixtral 8x22B (a smaller MoE that fits on workstation hardware)

How to run it

Hunyuan-Large is Tencent's 389B MoE model (~52B active). Run it at Q4_K_M via llama.cpp with -ngl 999 -fa -c 8192. The Q4_K_M file is ~219 GB on disk because all 389B parameters are stored, even though only ~52B are active per token. Minimum VRAM: 96 GB, for example dual RTX A6000 (48 GB each) with tensor-split, and only with most expert weights held in system RAM. A single A100 80GB cannot hold the Q4_K_M weights on its own and needs the same offload. For GPU-poor setups, CPU-only inference at Q4_K_M is viable on a server with 256+ GB RAM, at ~3-6 tok/s on a high-core-count Xeon or Epyc. When VRAM is tight, llama.cpp can keep expert weights in system RAM, at the cost of speed whenever routing hits a RAM-resident expert. For serving, use vLLM with tensor parallelism at AWQ-INT4. Note that 2× A100 80GB (160 GB) is below the roughly 195 GB that 389B parameters occupy at 4 bits, so plan on 4× A100 80GB or more. Context: 256K advertised; the practical usable range at Q4_K_M on 96 GB is ~8-16K.
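As a minimal sketch, the llama.cpp invocation described above can be assembled like this. The flags are taken verbatim from the text; the `llama-cli` binary name and the GGUF file name are assumptions for illustration, and this page does not confirm that a GGUF conversion of the model exists. Some newer llama.cpp builds expect a value after `-fa`, so check `--help` for your build.

```python
import shlex


def llama_cpp_command(model_path: str, ctx: int = 8192, gpu_layers: int = 999) -> list[str]:
    """Build an argument list for a llama.cpp CLI run.

    model_path is a hypothetical local GGUF file; the flags mirror the text:
    -ngl (layers to offload to GPU), -fa (flash attention), -c (context size).
    """
    return [
        "llama-cli",            # assumed binary name
        "-m", model_path,
        "-ngl", str(gpu_layers),
        "-fa",
        "-c", str(ctx),
    ]


cmd = llama_cpp_command("hunyuan-large-q4_k_m.gguf")  # hypothetical file name
print(shlex.join(cmd))
```

Building the command as a list rather than a string keeps it safe to pass to `subprocess.run(cmd)` without shell quoting issues.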

Hardware guidance

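Minimum: a single A100 80GB at Q4_K_M, with the experts that do not fit offloaded to system RAM (~219 GB of weights in total, so roughly 140 GB lives in host RAM).
Recommended: dual RTX A6000 (96 GB total) at Q4_K_M with row-split and 8-16K context, again with expert offload.
Budget path: CPU-only on a 256 GB RAM server at Q4_K_M (3-6 tok/s).
VRAM math: 389B total, ~52B active. At Q4_K_M the active subset is ≈ 30 GB, but the set of active experts changes per token, so the inactive expert weights must still sit in VRAM or RAM depending on offload strategy. llama.cpp with KV offload and expert offload to RAM reduces the VRAM requirement but adds latency on expert switches.
RTX 4090 24GB: Q3_K_M with aggressive expert offload to RAM.
Mac Studio: Q4_K_M (~219 GB) needs a 256 GB or larger unified-memory configuration; a 128 GB machine cannot hold it. Expect ~4-8 tok/s.
Cloud: budget for 4× A100 80GB or more. 2× A100 (~$16-30/hr) totals 160 GB and only works with expert offload to host RAM.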
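The VRAM math above can be sketched as a simple placement estimate. It assumes ~4.5 bits per weight for Q4-class quants and a 16 GB VRAM reserve for KV cache and buffers; both values are illustrative assumptions, and the helper names are hypothetical.

```python
def active_subset_gb(active_params_b: float, bits_per_weight: float) -> float:
    """Size of the weights touched per token, in GB (1 GB = 1e9 bytes)."""
    return active_params_b * bits_per_weight / 8


def placement(weights_gb: float, vram_gb: float, reserve_gb: float) -> dict:
    """Split total weights between VRAM and system RAM.

    reserve_gb is VRAM held back for KV cache and framework buffers.
    """
    usable = max(vram_gb - reserve_gb, 0.0)
    on_gpu = min(weights_gb, usable)
    return {
        "gpu_gb": on_gpu,
        "ram_gb": weights_gb - on_gpu,
        "gpu_fraction": on_gpu / weights_gb,
    }


print(active_subset_gb(52, 4.5))  # active weights per token, close to the ~30 GB quoted

# Dual RTX A6000 (96 GB total) with an assumed 16 GB reserve
split = placement(weights_gb=219, vram_gb=96, reserve_gb=16)
print(split)
```

Under these assumptions only about a third of the weights are GPU-resident on 96 GB, so most routing decisions will touch RAM-resident experts unless the hottest experts are pinned to VRAM.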

What breaks first

1. Expert routing stall. When experts are offloaded to system RAM, a routing decision that hits a RAM-resident expert adds 50-200 ms of latency. At low batch sizes this creates visible stutter in generation. Keep as many experts in VRAM as possible.
2. Chinese-language bias. Hunyuan-Large is Tencent's model, and its training data is Chinese-heavy. English quality is competitive but may show Chinese-culture bias in nuanced prompts.
3. AWQ on MoE. AWQ-INT4 quantization on MoE architectures can degrade expert-routing stability more than it degrades dense models. Test routing correctness at Q4 before deploying.
4. Tensor-split imbalance. llama.cpp row-split across mismatched GPUs (e.g., A6000 + RTX 3090) leaves the faster GPU idle while it waits for the slower one. Use identical GPU pairs.
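A rough way to reason about the routing stall is an expected-latency model. This is a deliberate simplification: it assumes at most one stall per token, whereas a real MoE routes to several experts in every layer. The 50 ms base latency and 0.6 miss probability are hypothetical inputs; only the 50-200 ms stall range comes from the text.

```python
def expected_token_latency_ms(base_ms: float, miss_prob: float, stall_ms: float) -> float:
    """Average per-token latency when a fraction of tokens hit a RAM-resident expert."""
    return base_ms + miss_prob * stall_ms


def tokens_per_second(latency_ms: float) -> float:
    """Convert a per-token latency in milliseconds to throughput."""
    return 1000.0 / latency_ms


BASE_MS = 50.0     # hypothetical latency when all needed experts are in VRAM
MISS_PROB = 0.6    # hypothetical share of tokens that hit a RAM-resident expert

for stall_ms in (50.0, 200.0):  # stall range quoted in the text
    latency = expected_token_latency_ms(BASE_MS, MISS_PROB, stall_ms)
    print(f"stall {stall_ms:.0f} ms: {latency:.0f} ms/token, "
          f"{tokens_per_second(latency):.1f} tok/s")
```

The average hides the stutter: individual tokens take either the base latency or the base plus a full stall, which is why generation looks uneven at low batch sizes.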

Runtime recommendation

llama.cpp with -ngl 999 and expert offload tuning for single-node. vLLM for multi-user serving, with tensor parallelism across enough A100 80GB GPUs to hold the 4-bit weights (four or more). SGLang if vLLM MoE routing is unstable. Avoid Ollama: MoE expert offload isn't exposed in Ollama's config surface, and default settings may cause OOM.

Common beginner mistakes

Mistake: Assuming "52B active" means it fits in 32 GB VRAM. Fix: All experts (389B parameters in total) must be accessible from disk, RAM, or VRAM. The 52B is what is computed per token, not what is stored. Plan for ~219 GB of storage at Q4_K_M.
Mistake: Expecting consistent generation speed. Fix: Expert routing means some tokens route to VRAM-resident experts (fast) and some to RAM-resident experts (50-200 ms stall). Speed varies per token.
Mistake: Using Q8 for the full MoE. Fix: Q8 for 389B is ~413 GB (at ~8.5 bits per weight), which requires 6-8× A100 80GB. Start at Q4_K_M.
Mistake: Ignoring license restrictions. Fix: The Tencent Hunyuan License permits commercial use but restricts companies above its MAU thresholds. Verify the terms on huggingface.co/tencent/Hunyuan-Large before production deployment.