stepfun

1000B parameters

Restricted

Reviewed June 2026

Step-3

StepFun's 1T-parameter MoE. 38B active. One of the largest open-weight models; cluster-only at any quant. Restricted license.

License: Step License · Released Sep 30, 2025 · Context: 65,536 tokens

Our verdict

OP · Fredoline Eruo|VERIFIED JUN 12, 2026

unrated

Positioning

Step-3 is a 1-trillion-parameter Mixture-of-Experts (MoE) model from StepFun, with approximately 38 billion parameters activated per token. It is one of the largest open-weight models ever released, but its restricted Step License limits commercial and redistribution rights. With a 65,536-token context window, it is designed for frontier research workloads where raw scale is paramount. Its MoE architecture means inference cost is closer to a dense ~38B-parameter model than a dense 1T-parameter model, but the total parameter count still demands massive memory and compute.
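
The gap between total and active parameters can be made concrete with a rule of thumb. The sketch below assumes forward-pass compute is roughly 2 FLOPs per active parameter per token; that is a common approximation, not a measured figure for Step-3. The parameter counts are the ones stated on this page.

```python
# Rule-of-thumb comparison of per-token compute for an MoE vs. a dense model.
# ASSUMPTION: forward-pass cost is ~2 FLOPs per active parameter per token.

TOTAL_PARAMS = 1_000e9   # 1T total parameters (from this page)
ACTIVE_PARAMS = 38e9     # ~38B active per token (from this page)


def forward_flops_per_token(active_params: float) -> float:
    """Approximate forward-pass FLOPs per token (2 * active parameters)."""
    return 2.0 * active_params


moe_flops = forward_flops_per_token(ACTIVE_PARAMS)
dense_flops = forward_flops_per_token(TOTAL_PARAMS)
active_fraction = ACTIVE_PARAMS / TOTAL_PARAMS

print(f"Active fraction:  {active_fraction:.1%}")
print(f"MoE forward cost: {moe_flops / 1e9:.0f} GFLOPs/token")
print(f"Dense 1T forward: {dense_flops / 1e9:.0f} GFLOPs/token")
```

Compute scales with the active parameters, but memory does not: every expert must still be resident, which is why the hardware floor is set by the full 1T.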

Strengths

Massive scale with efficient inference: As an MoE with 1T total parameters but only ~38B active per token, Step-3 offers the representational capacity of a very large model while keeping per-token computation closer to that of a dense ~38B model.
Long context window: 65,536 tokens of context support research into long-document understanding, multi-turn reasoning, and in-context learning at scale.
Open-weight availability: Despite the restrictive license, the model weights are publicly accessible, enabling academic and research institutions to study and experiment with a model of this size.
Unique position in the open-weight landscape: Step-3 is among the very few models at the 1T-parameter scale with open weights, making it a reference point for scaling research.

Limitations

Restrictive license: The Step License is not permissive; commercial use and redistribution are likely limited. Operators must review the license terms carefully before any deployment.
Extreme hardware requirements: Even at the lowest quant (Q2_K ~325 GB), Step-3 requires multiple datacenter GPUs (e.g., 8× A100 80GB or more) just to load. No consumer or workstation hardware can run it.
No community-verified benchmarks: We do not have independent measurements of Step-3's performance on standard tasks. Published vendor metrics should be treated as best-case until verified by the community.
High operational complexity: Running a 1T-parameter MoE requires sophisticated parallelism (tensor/pipeline sharding), significant engineering effort, and substantial energy costs.

What it takes to run this locally

Step-3 cannot run on consumer or workstation hardware. Even at the smallest quant (Q2_K 325 GB), plus ~30-50% overhead for KV cache and framework (≈422-488 GB total), it requires a multi-GPU datacenter cluster. For example, 8× A100 80GB (640 GB total) could potentially load a Q2_K quant with careful memory management, but inference would still demand high-bandwidth interconnects and optimized inference frameworks. FP16 (2000 GB) is only feasible on large clusters (e.g., 32× A100 80GB or more).
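
The arithmetic above can be reproduced with a short sketch. It takes the weight sizes and the 30-50% overhead range from this page and assumes 80 GB per GPU; the function name is illustrative only.

```python
import math


def gpus_needed(weights_gb: float, overhead_fraction: float,
                gpu_memory_gb: float = 80.0) -> tuple[float, int]:
    """Return (total memory in GB, minimum GPU count) for a given quant.

    overhead_fraction covers KV cache and framework overhead as a
    fraction of the weight size (0.30 to 0.50 on this page).
    """
    total_gb = weights_gb * (1.0 + overhead_fraction)
    return total_gb, math.ceil(total_gb / gpu_memory_gb)


for label, weights_gb in [("Q2_K", 325.0), ("FP16", 2000.0)]:
    for overhead in (0.30, 0.50):
        total_gb, count = gpus_needed(weights_gb, overhead)
        print(f"{label} +{overhead:.0%} overhead: "
              f"{total_gb:.1f} GB -> at least {count} x 80 GB GPUs")
```

This gives a lower bound on memory only. Tensor-parallel degrees usually have to divide the model's attention-head count evenly, which is one reason practical deployments round up to 8 GPUs rather than 6 or 7.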

Should you run this locally?

Yes if you are a research institution with access to a multi-GPU datacenter cluster and need to study or experiment with a model at the 1T-parameter scale under an open-weight license (with license restrictions acceptable to your use case).

No if you lack the hardware, engineering support, or license permission for your intended use. For most operators, smaller models (e.g., 70B-400B) will be more practical and cost-effective.

Catalog cross-links

Step-2 – earlier model from the same family
Mixtral 8x22B – another large MoE with a permissive license
A100 GPU – typical hardware for running models of this scale

How to run it

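Step-3 is a 1T-parameter MoE (~38B active per token) from StepFun, and there is no consumer path. The weights alone set the floor: a 4-bit quant of 1T parameters is roughly 500-565 GB on disk (the AWQ-INT4 build listed on this page is 565 GB), so it needs about 8× 80 GB GPUs (A100 or H100) at 4K context, with little headroom left for KV cache. FP8 weights are roughly 1,000 GB, which means 16× H100 or more; budget additional GPUs for 16K context or multi-user batching. Serve with vLLM, setting the tensor-parallel size to match the GPU count (for example, 8). If vLLM's MoE routing support is immature, fall back to SGLang with --tp 8. Expected throughput: 15-30 tok/s per user at FP8 (an estimate; validation is thin). There is no viable single-GPU path. Apple Silicon is not practical either: a Q2 quant (~325 GB) could only load on a Mac Studio M3 Ultra configured with 512 GB of unified memory, and the expected 2-6 tok/s makes it academic. Verify StepFun's license and weight availability before allocating cluster time.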
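
As a rough sketch of the two launch paths, the helper below only assembles command lines; it does not start a server. The repository id is taken from the link on this page. The tensor-parallel degree and context length are example values to replace with your own. The flags used (vLLM's --tensor-parallel-size and --max-model-len, SGLang's --tp and --context-length, and --trust-remote-code on both) exist in current releases, but whether either runtime actually supports Step-3's architecture is an assumption to verify.

```python
import shlex

MODEL_ID = "stepfun-ai/Step-3"  # repo id as linked on this page


def vllm_serve_command(model: str, tensor_parallel: int,
                       max_model_len: int) -> list[str]:
    """Build a `vllm serve` command line as an argument list."""
    return [
        "vllm", "serve", model,
        "--tensor-parallel-size", str(tensor_parallel),
        "--max-model-len", str(max_model_len),
        "--trust-remote-code",  # custom architectures often need this
    ]


def sglang_launch_command(model: str, tensor_parallel: int,
                          context_length: int) -> list[str]:
    """Build the equivalent SGLang launch command as an argument list."""
    return [
        "python", "-m", "sglang.launch_server",
        "--model-path", model,
        "--tp", str(tensor_parallel),
        "--context-length", str(context_length),
        "--trust-remote-code",
    ]


# Example values only: match tensor_parallel to the GPUs you provision.
print(shlex.join(vllm_serve_command(MODEL_ID, 8, 16_384)))
print(shlex.join(sglang_launch_command(MODEL_ID, 8, 16_384)))
```

Building the command as a list keeps it easy to pass to subprocess.run later without shell-quoting mistakes.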

Hardware guidance

Minimum: 8× A100 80GB at 4-bit (speculative; Step-3 tooling is unvalidated). Recommended: 16× H100 SXM at FP8. VRAM math: MoE with ~1T total parameters, ~38B active per token. 4-bit weights are ~500-565 GB on disk; FP8 weights are ~1,000 GB. KV cache at 16K context adds ~15-25 GB per replica. 8× H100 (640 GB total) covers 4-bit weights plus KV cache for batch=1. For FP8, 16× H100 (1,280 GB) is necessary. RTX 6000 Ada 48GB is insufficient per card for tensor-parallel splits. A Mac Studio M3 Ultra with 512 GB of unified memory at Q2 is the only consumer-adjacent path (2-6 tok/s expected) but untested. Cloud: RunPod/Lambda H100 cluster at $25-40/hr/node.
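
The KV-cache line in the VRAM math can be sanity-checked with the standard formula for multi-head or grouped-query attention: 2 (keys and values) × layers × KV heads × head dimension × sequence length × bytes per element. The sketch below is a minimal illustration. The layer count, KV-head count, and head dimension are hypothetical placeholders, not Step-3's published configuration, and a custom attention design could change the formula itself.

```python
def kv_cache_bytes(num_layers: int, num_kv_heads: int, head_dim: int,
                   seq_len: int, bytes_per_elem: int = 2,
                   batch_size: int = 1) -> int:
    """KV-cache size in bytes for standard multi-head / grouped-query attention."""
    # Factor 2: one tensor for keys and one for values in every layer.
    return (2 * num_layers * num_kv_heads * head_dim
            * seq_len * bytes_per_elem * batch_size)


# HYPOTHETICAL architecture values, for illustration only.
# Replace them with the values from the model's config file.
HYPOTHETICAL_LAYERS = 60
HYPOTHETICAL_KV_HEADS = 8
HYPOTHETICAL_HEAD_DIM = 128

for context in (4_096, 16_384, 65_536):
    size = kv_cache_bytes(HYPOTHETICAL_LAYERS, HYPOTHETICAL_KV_HEADS,
                          HYPOTHETICAL_HEAD_DIM, context)
    print(f"{context:>6} tokens: {size / 2**30:.2f} GiB per sequence (16-bit cache)")
```

The cache grows linearly with both context length and batch size, so a figure that fits at batch=1 can stop fitting as soon as several users share a replica.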

What breaks first

1. vLLM MoE routing: Step-3 uses StepFun's custom MoE architecture. vLLM's generic MoE kernels may not fuse correctly, causing silent correctness failures or NaN outputs. Validate against known reference outputs before trusting results.
2. Tensor-parallel communication: At 4-8 nodes, NCCL ring latency becomes dominant. MFU below 30% is common on non-NVLink clusters.
3. Weight availability: As of mid-2026, Step-3 weights may not be publicly downloadable. Verify that the Hugging Face repository exists before provisioning compute.
4. Quantization toolchain gap: llama.cpp may not support Step-3's architecture, and GGUF quantization depends on architecture-specific kernels. Expect 2-4 weeks of engineering to add support if missing.
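
The advice to validate against reference outputs can be sketched as two small checks: one that flags NaN or infinite values in a list of scores, and one that compares greedy-decoded completions against a trusted reference. Both helpers and the sample data are hypothetical. Where the reference comes from (for example, a run of the unquantized weights in the vendor's own code) is left to the operator.

```python
import math


def has_non_finite(values: list[float]) -> bool:
    """True if any value is NaN or +/-inf (a sign of broken kernels)."""
    return any(math.isnan(v) or math.isinf(v) for v in values)


def mismatched_prompts(reference: dict[str, str],
                       candidate: dict[str, str]) -> list[str]:
    """Return prompts whose candidate completion differs from the reference.

    Intended for greedy (temperature 0) decoding, where outputs should be
    repeatable. A missing prompt counts as a mismatch.
    """
    return [prompt for prompt, expected in reference.items()
            if candidate.get(prompt) != expected]


# Hypothetical sample data, for illustration only.
reference_outputs = {"2 + 2 =": " 4", "Capital of France:": " Paris"}
candidate_outputs = {"2 + 2 =": " 4", "Capital of France:": " Lyon"}

print("Non-finite logprobs:", has_non_finite([-0.1, float("nan"), -2.3]))
print("Mismatched prompts:", mismatched_prompts(reference_outputs, candidate_outputs))
```

Exact string matching is deliberately strict. Small numerical differences between runtimes can change a token legitimately, so treat a handful of mismatches as a prompt to investigate rather than proof of a bug.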

Runtime recommendation

Best path today: vLLM on H100s, with the tensor-parallel size set to the GPU count (8 for a 4-bit quant). If vLLM MoE routing fails, SGLang is the fallback; both support tensor parallelism (--tensor-parallel-size in vLLM, --tp in SGLang). Avoid Ollama and llama.cpp unless Step-3 architecture support is confirmed. Avoid MLX-LM: Apple Silicon is not viable for this model size at useful throughput.

Common beginner mistakes

Mistake: Assuming ollama pull step-3 works. Fix: Check Ollama's supported model list first; Step-3 likely isn't added yet. Use vLLM or SGLang.
Mistake: Renting a single H100 and expecting it to load. Fix: A 1T-parameter MoE has roughly 500-565 GB of weights at 4-bit and needs at least 8× 80GB GPUs. A single H100 has 80 GB. Do the VRAM math before renting.
Mistake: Trusting benchmark scores without independent validation. Fix: Step-3 has minimal third-party eval data as of mid-2026. Run your own benchmarks on your hardware before committing to production.
Mistake: Assuming an Apache/MIT license. Fix: Step-3 ships under the restricted Step License. Verify commercial terms before any production deployment.

Strengths

Frontier scale
Strong on multilingual tasks

Weaknesses

Multi-machine cluster only
Restricted license

Quantization variants

Each quantization trades model quality for file size and VRAM. Q4_K_M is usually the most popular starting point, but AWQ-INT4 is the only variant listed for Step-3.

Quantization	File size	VRAM required
AWQ-INT4	565.0 GB	640 GB
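
A quick way to sanity-check any quoted file size is to convert it into effective bits per weight. The sketch below uses the 1T parameter count and the file sizes quoted on this page, and assumes decimal gigabytes (1 GB = 10^9 bytes). Values above the nominal bit width are expected, since quantized files also carry scales, zero points, and some layers kept at higher precision.

```python
TOTAL_PARAMS = 1_000e9  # 1T parameters (from this page)


def effective_bits_per_weight(file_size_gb: float,
                              total_params: float = TOTAL_PARAMS) -> float:
    """Average bits stored per parameter, assuming 1 GB = 1e9 bytes."""
    return file_size_gb * 1e9 * 8 / total_params


# File sizes as quoted on this page.
for name, size_gb in [("Q2_K", 325.0), ("AWQ-INT4", 565.0), ("FP16", 2000.0)]:
    bits = effective_bits_per_weight(size_gb)
    print(f"{name:<9} {size_gb:>7.1f} GB -> {bits:.2f} bits/weight")
```

A 4-bit file that works out to well under 4 bits per weight, or a 2-bit file larger than a 4-bit one, is a sign that a quoted size belongs to a different model or parameter count.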

Get the model

HuggingFace

Original weights

huggingface.co/stepfun-ai/Step-3

Source repository — direct quantization required.

Hardware that runs this

Cards with enough VRAM for at least one quantization of Step-3.

NVIDIA GB200 NVL72

13824GB · nvidia

Frequently asked

What's the minimum VRAM to run Step-3?

640GB of VRAM is enough to run Step-3 at the AWQ-INT4 quantization (file size 565.0 GB). Higher-quality quantizations need more.

Can I use Step-3 commercially?

Step-3 is released under the Step License, which has restrictions for commercial use. Review the license terms before using it in a product.

What's the context length of Step-3?

Step-3 supports a context window of 65,536 tokens (64K).

Source: huggingface.co/stepfun-ai/Step-3

Reviewed by RunLocalAI Editorial. See our editorial policy for how we research and verify model claims.
