Expert Parallelism
Expert parallelism is a parallelism strategy specific to Mixture-of-Experts (MoE) models: each GPU holds a different subset of the experts, and tokens are routed to whichever GPU owns the expert they activate. It is distinct from tensor parallelism, which splits each layer's weights across devices, and from pipeline parallelism, which splits consecutive layers across devices.
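As a minimal sketch of the routing side (assuming a contiguous expert-to-GPU mapping and a top-1 router; all names and sizes below are illustrative, not any particular framework's API), the snippet shows how each token is assigned to the rank that owns its chosen expert:

```python
# Illustrative sketch: how experts map to GPUs under expert parallelism,
# and how tokens are routed to the owning GPU. Sizes are made up.
import torch

num_experts = 8        # total experts in the MoE layer
ep_size = 4            # number of expert-parallel ranks (GPUs)
experts_per_rank = num_experts // ep_size

def expert_to_rank(expert_id: int) -> int:
    """Contiguous sharding: rank r owns experts [r*experts_per_rank, (r+1)*experts_per_rank)."""
    return expert_id // experts_per_rank

# Top-1 gating over a small batch of token representations.
tokens = torch.randn(6, 16)                  # (num_tokens, hidden_dim)
gate_logits = torch.randn(6, num_experts)    # router output per token
chosen_expert = gate_logits.argmax(dim=-1)   # expert each token activates

# Each token must be sent to the GPU that holds its chosen expert.
dest_rank = torch.tensor([expert_to_rank(int(e)) for e in chosen_expert])
print(chosen_expert.tolist())  # e.g. [5, 0, 3, 7, 2, 5]
print(dest_rank.tolist())      # e.g. [2, 0, 1, 3, 1, 2]
```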
The advantage: each GPU stores only its share of the experts, and at inference only the active experts run, so an expert-parallel MoE serves at lower per-token compute than a dense model with the same total parameter count. The cost: routing tokens between GPUs adds all-to-all communication at every MoE layer, once to dispatch tokens to the GPUs holding their experts and once to gather the results back.
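The dispatch half of that communication can be sketched roughly as follows. This assumes an already-initialized torch.distributed process group with one rank per expert-parallel GPU; the helper name and variables are illustrative, not a specific library's API, and a real implementation also handles device placement and the reverse combine step:

```python
# Hedged sketch of the per-layer all-to-all that expert parallelism adds.
import torch
import torch.distributed as dist

def dispatch_tokens(hidden: torch.Tensor, dest_rank: torch.Tensor, ep_size: int):
    """Send each token's hidden state to the rank that owns its chosen expert."""
    # Sort tokens by destination rank so each rank's slice is contiguous.
    order = torch.argsort(dest_rank)
    hidden_sorted = hidden[order]

    # How many tokens this rank sends to / receives from every other rank.
    send_counts = torch.bincount(dest_rank, minlength=ep_size)
    recv_counts = torch.empty_like(send_counts)
    dist.all_to_all_single(recv_counts, send_counts)  # exchange counts first

    # Exchange the token representations themselves.
    recv_buf = hidden_sorted.new_empty((int(recv_counts.sum()), hidden.shape[1]))
    dist.all_to_all_single(
        recv_buf, hidden_sorted,
        output_split_sizes=recv_counts.tolist(),
        input_split_sizes=send_counts.tolist(),
    )
    # 'order' is kept so the combine step can un-permute expert outputs later.
    return recv_buf, order
```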
Large MoE models such as Mixtral 8x7B, DeepSeek-V3, and Qwen3-MoE are typically deployed with combined expert and tensor parallelism on multi-GPU servers. Single-GPU deployments use a degenerate form in which all experts live on one device and routing happens locally.
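A minimal sketch of that single-GPU case (illustrative sizes and a top-1 router, not a specific model) shows that routing reduces to indexing into a local list of experts:

```python
# Single-device MoE layer: no inter-GPU communication, routing is just indexing.
import torch
import torch.nn as nn

class LocalMoELayer(nn.Module):
    def __init__(self, hidden_dim=16, num_experts=8):
        super().__init__()
        self.router = nn.Linear(hidden_dim, num_experts)
        self.experts = nn.ModuleList(
            nn.Linear(hidden_dim, hidden_dim) for _ in range(num_experts)
        )

    def forward(self, x):                       # x: (num_tokens, hidden_dim)
        chosen = self.router(x).argmax(dim=-1)  # top-1 expert per token
        out = torch.zeros_like(x)
        for expert_id in chosen.unique():       # run only the active experts
            mask = chosen == expert_id
            out[mask] = self.experts[int(expert_id)](x[mask])
        return out

layer = LocalMoELayer()
print(layer(torch.randn(6, 16)).shape)          # torch.Size([6, 16])
```

Only the experts the router actually selects are executed, which is what keeps per-token compute low even though every expert still occupies memory on the device.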