200B parameters
Commercial OK
Reviewed June 2026

GLM-5

Zhipu's GLM-5 currently leads the Open LLM Leaderboard 2026. Strong reasoning and bilingual EN/ZH capability.

License: GLM License · Released Feb 5, 2026 · Context: 200,000 tokens

Our verdict

By Fredoline Eruo | Verified Jun 12, 2026
unrated

Positioning

GLM-5 is a 200-billion-parameter dense model released by Zhipu AI (Z.AI) under the GLM License. With a context window of 200,000 tokens, it is designed for frontier-level reasoning and bilingual (English/Chinese) tasks. As of early 2026, it leads the Open LLM Leaderboard, indicating strong performance in open-weight evaluations. Its dense architecture means all 200B parameters are active during inference, placing it in the datacenter deployment class.

Strengths

  • Frontier-scale dense architecture: With 200B active parameters, GLM-5 offers maximum model capacity for complex reasoning, though at high computational cost.
  • Very long context window: 200,000 tokens of native context enables processing of large documents, codebases, or multi-turn conversations without truncation.
  • Bilingual strength: Designed and optimized for both English and Chinese, making it suitable for cross-lingual applications.
  • Leaderboard-topping performance: Currently ranks first on the Open LLM Leaderboard 2026, a community benchmark for open-weight models.

Limitations

  • Extremely high hardware requirements: At FP16, the model requires ~400 GB of disk space, and with KV cache and overhead, total memory needs can exceed 600 GB, necessitating multi-GPU datacenter setups.
  • Dense architecture: Unlike MoE models, all 200B parameters are active per token, leading to high compute and memory costs per inference step.
  • Custom license: The GLM License is not a standard permissive license such as Apache 2.0. Commercial use is listed as permitted, but the terms may restrict redistribution or specific uses; operators should review them carefully.
  • Limited community benchmarks: While leaderboard results are promising, independent third-party evaluations on specific tasks (e.g., coding, math) are not yet widely available.

What it takes to run this locally

Quantized sizes range from 65 GB (Q2_K) to ~400 GB (FP16). For practical deployment, add 30–50% for KV cache and framework overhead at typical context lengths. Even the smallest quant, Q2_K at 65 GB, needs about 85–95 GB of VRAM once that overhead (~20–30 GB) is included, which calls for at least dual 48 GB GPUs (96 GB total). For higher quality (Q4_K_M at ~112.5 GB, or roughly 146–169 GB with overhead), multi-GPU datacenter hardware (e.g., 4× A100 80GB) is necessary. No tokens-per-second measurements are available.
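The overhead arithmetic above can be sketched in a few lines of Python. The quant sizes and the 30–50% overhead range are the figures quoted in the text; the function name `memory_range_gb` is only illustrative, and real overhead depends on context length and runtime.

```python
# Weight sizes in GB as quoted in the text above.
QUANT_SIZES_GB = {"Q2_K": 65.0, "Q4_K_M": 112.5, "FP16": 400.0}


def memory_range_gb(weights_gb, low=0.30, high=0.50):
    """Return (min, max) total memory after adding 30-50% for KV cache and overhead."""
    return weights_gb * (1 + low), weights_gb * (1 + high)


for name, size in QUANT_SIZES_GB.items():
    lo, hi = memory_range_gb(size)
    print(f"{name}: {size:.1f} GB weights -> {lo:.1f} to {hi:.1f} GB total")
```

Treat the result as a planning range rather than a measurement: long contexts push real usage toward or past the upper bound.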

Should you run this locally?

Yes if you have access to multi-GPU datacenter hardware (e.g., 4–8 A100/H100 nodes) and need a dense, high-capacity model for bilingual reasoning tasks where the GLM License permits your use case.

No if you lack the hardware budget for multi-GPU setups, require a permissive license (e.g., Apache 2.0), or prefer an MoE architecture that offers lower inference cost per token.

How to run it

GLM-5 is Tsinghua/Zhipu AI's large MoE model. Run it at Q4_K_M via llama.cpp with -ngl 999 -fa -c 8192.

Parameter scale (speculated): ~300-400B total, ~40-60B active. The Q4_K_M file is ~150-200 GB on disk. Minimum VRAM: 160 GB, either 4× RTX A6000 (48 GB each) with row-split or 2× A100 80GB.

The GLM architecture uses a prefix-encoder design that differs from standard decoder-only Llama, so ecosystem support is narrower. llama.cpp has experimental GLM support; verify that your specific GLM-5 variant is supported before provisioning. For serving, vLLM may not support GLM's prefix architecture, so test with a raw llama.cpp server deployment.

Context: 128K advertised; realistically usable at Q4 on 4× A6000 is ~8-16K. Throughput: ~5-12 tok/s on 4× A6000 at 8K context. GLM-5 is a Chinese-first model; English performance is competitive, but verify it for your use case.
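As a rough sketch, the launch flags quoted above can be assembled into a llama.cpp server command from Python. The model filename `glm-5-Q4_K_M.gguf` is a hypothetical placeholder, and `--split-mode row` is added here as an assumption to match the row-split setup mentioned in the text. Some llama.cpp builds may expect a value after `-fa`; check `llama-server --help` for your build before relying on this.

```python
import shlex


def llama_server_args(model_path, ctx=8192, split_mode=None):
    """Build an argv list for llama-server using the flags quoted in the text."""
    args = ["llama-server", "-m", model_path, "-ngl", "999", "-fa", "-c", str(ctx)]
    if split_mode:
        # Assumption: row-split across GPUs, as described for the 4x A6000 setup.
        args += ["--split-mode", split_mode]
    return args


# Hypothetical filename; substitute the path to your own GGUF file.
cmd = llama_server_args("glm-5-Q4_K_M.gguf", split_mode="row")
print(shlex.join(cmd))
```

Building the argument list without executing it makes it easy to review the command first, then pass it to `subprocess.run` once the build and model file are confirmed.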

Hardware guidance

  • Minimum: 2× A100 80GB at Q4_K_M (160 GB pool for ~150-200 GB of weights; tight, may need Q3).
  • Recommended: 4× A100 80GB at Q4_K_M for headroom and context.
  • Budget: 4× RTX A6000 48GB (192 GB) at Q4_K_M. Row-split works, but throughput is limited by PCIe bandwidth.
  • Other options: 8× RTX 4090 24GB (192 GB) is borderline at Q3. A Mac Studio M4 Ultra with 192 GB can attempt Q4_K_M, but expect 2-5 tok/s.
  • Cloud: 4× A100 at ~$30-50/hr.

VRAM math: the model is MoE with ~300-400B total and ~40-60B active parameters. Inactive expert weights must still be hosted in VRAM or RAM, so Q4_K_M for the full MoE is ~150-200 GB. Offloading experts to system RAM on a 256 GB server reduces the VRAM requirement to ~80-100 GB (active experts only) but adds latency on routing.
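A quick way to sanity-check these tiers is to compare each memory pool against the ~150-200 GB weight range quoted above. The sketch below covers weights only and ignores KV cache and runtime overhead, so a "fits" result is optimistic; the labels and the `fit_status` helper are illustrative, not part of any tool.

```python
# Memory pools in GB, taken from the hardware options listed above.
POOLS_GB = {
    "2x A100 80GB": 160,
    "4x A100 80GB": 320,
    "4x RTX A6000 48GB": 192,
    "8x RTX 4090 24GB": 192,
    "Mac Studio 192 GB": 192,
}
WEIGHTS_GB = (150, 200)  # estimated Q4_K_M range from the text


def fit_status(pool_gb, weights=WEIGHTS_GB):
    """Classify a pool against the low and high ends of the weight estimate."""
    low, high = weights
    if pool_gb >= high:
        return "fits"
    if pool_gb >= low:
        return "borderline"
    return "does not fit"


for name, pool in POOLS_GB.items():
    print(f"{name}: {pool} GB -> {fit_status(pool)}")
```

Only the 4× A100 pool clears the upper end of the estimate, which matches its place as the recommended tier.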

What breaks first

  1. GLM prefix architecture. GLM-5's encoder-decoder hybrid design means standard llama.cpp GGUF conversion may fail or produce incorrect outputs. Test against GLM's reference outputs before trusting results.
  2. Chinese-first tokenizer. GLM-5's tokenizer is optimized for Chinese, so English and non-CJK scripts may have higher token counts, increasing effective prompt cost and reducing effective context.
  3. vLLM incompatibility. vLLM's pipeline assumes a standard decoder-only architecture. GLM-5 may not be supported or may require a custom model implementation. Verify before allocating cluster time.
  4. MoE routing bottleneck on PCIe. Expert routing across 4× PCIe A6000s causes stalls when an expert is on a different GPU. NVLink-bridged A100s mitigate this.

Runtime recommendation

llama.cpp is the primary (and possibly only) option for GLM-5. Verify GLM architecture support in your llama.cpp build (b4590+). vLLM support is uncertain — test with a small instance before provisioning. Avoid Ollama unless GLM-5 is explicitly in the supported model list. Avoid MLX-LM — GLM prefix architecture not supported.

Common beginner mistakes

  • Mistake: Assuming all GLM-5 variants (GLM-5, GLM-5-Pro) use the same architecture. Fix: Pro/base variants may differ in architecture and tokenizer. Verify the specific variant against your inference stack.
  • Mistake: Converting GLM-5 to GGUF with a standard Llama conversion script. Fix: GLM's architecture is not Llama-compatible. Use a GLM-specific conversion script or check llama.cpp's convert_hf_to_gguf.py for GLM support.
  • Mistake: Expecting English performance equal to Chinese. Fix: GLM-5 is Chinese-first. English benchmarks may be 5-15% lower on equivalent tasks. Test your specific English workloads.
  • Mistake: Running at 128K context without checking KV cache explosion. Fix: MoE KV cache at 128K on Q4 is ~80-120 GB. Scale context down or provision more GPUs.
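To see why KV cache grows so quickly with context, the standard per-sequence estimate for a transformer with grouped-query attention can be sketched as below. The layer count, KV-head count, and head dimension are hypothetical placeholders, not GLM-5's published configuration, so the printed numbers will not reproduce the ~80-120 GB figure above; substitute the values from the model's config file.

```python
def kv_cache_bytes(layers, kv_heads, head_dim, seq_len, bytes_per_elem=2):
    """Size of the key and value caches for one sequence, in bytes."""
    # The factor of 2 covers the separate key and value tensors in each layer.
    return 2 * layers * kv_heads * head_dim * seq_len * bytes_per_elem


# Hypothetical architecture values for illustration only (NOT GLM-5's real config).
HYPOTHETICAL = dict(layers=92, kv_heads=8, head_dim=128)

for ctx in (8_192, 32_768, 131_072):
    gib = kv_cache_bytes(seq_len=ctx, **HYPOTHETICAL) / 2**30
    print(f"{ctx:>7} tokens -> {gib:.1f} GiB of KV cache (FP16)")
```

The cache scales linearly with context length, so going from 8K to 128K multiplies it by 16 regardless of the exact architecture.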

Strengths

  • Top of leaderboards
  • Bilingual EN/ZH
  • Reasoning-tuned

Weaknesses

  • Less Western ecosystem support

Quantization variants

Each quantization trades model quality for file size and VRAM. Q4_K_M is the most popular starting point.

Quantization    File size    VRAM required
Q4_K_M          120.0 GB     140 GB

Get the model

HuggingFace

Original weights

huggingface.co/THUDM/GLM-5

Source repository — direct quantization required.

Frequently asked

What's the minimum VRAM to run GLM-5?

140 GB of VRAM is enough to run GLM-5 at the Q4_K_M quantization (file size 120.0 GB). Higher-quality quantizations need more.

Can I use GLM-5 commercially?

Yes — GLM-5 ships under the GLM License, which permits commercial use. Always read the license text before deployment.

What's the context length of GLM-5?

GLM-5 supports a context window of 200,000 tokens.

Source: huggingface.co/THUDM/GLM-5

Reviewed by RunLocalAI Editorial. See our editorial policy for how we research and verify model claims.
