RUNLOCALAIv38

Independently operated catalog for local-AI hardware and software. Hand-written verdicts. Source-cited claims. Reproducible commands when we have them.

OP·Fredoline Eruo
© 2026 runlocalai.co · Independently operated
Custom LLM Architecture Design

08. Mixture of Experts

Chapter 8 of 24 · 15 min
KEY INSIGHT

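MoE enables scaling model parameters without scaling compute per token. A 16-expert MoE layer with top_k=2 holds 16x the FFN parameters of a dense layer whose FFN is the size of one expert, yet spends only about 2x the per-token FFN computation: 8x more parameters per unit of compute. Turning that into real throughput assumes well-balanced expert load.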

Mixture of Experts (MoE) decouples parameter count from computation cost. Instead of activating all experts for every token, an MoE layer routes each token to a small subset of experts (top_k of them), so total parameters grow with the number of experts while per-token computation stays roughly constant for a fixed top_k.
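The arithmetic behind this decoupling can be checked with a short calculation. The sketch below is illustrative only: the helper names (`ffn_params`, `moe_stats`) and the sizes `d_model=1024`, `d_ffn=4096` are assumptions, biases are ignored, and active parameter count is used as a rough proxy for per-token matmul cost.

```python
def ffn_params(d_model: int, d_ffn: int) -> int:
    """Weights in one two-layer FFN (no biases): up-projection plus down-projection."""
    return 2 * d_model * d_ffn


def moe_stats(d_model: int, d_ffn: int, n_experts: int, top_k: int) -> dict:
    """Compare one MoE layer against a dense FFN the size of a single expert."""
    dense = ffn_params(d_model, d_ffn)
    router = d_model * n_experts           # router is a single linear map
    total = n_experts * dense + router     # parameters stored in memory
    active = top_k * dense + router        # parameters touched per token
    return {
        "param_ratio": total / dense,      # storage relative to the dense FFN
        "compute_ratio": active / dense,   # rough per-token cost relative to dense
        "params_per_compute": total / active,
    }


# Hypothetical sizes chosen for illustration.
stats = moe_stats(d_model=1024, d_ffn=4096, n_experts=16, top_k=2)
for name, value in stats.items():
    print(f"{name}: {value:.2f}")
```

With 16 experts and top_k=2, storage is roughly 16x the dense FFN, per-token cost is roughly 2x, and the ratio between them is roughly 8. The router adds a small amount to both.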

import torch
import torch.nn as nn
import torch.nn.functional as F


class MoELayer(nn.Module):
    """Mixture of Experts layer with top-k routing."""
    def __init__(self, d_model, d_ffn, n_experts, top_k=2, bias=False):
        super().__init__()
        self.n_experts = n_experts
        self.top_k = top_k
        
        # Each expert is an independent FFN
        self.experts = nn.ModuleList([
            nn.Sequential(
                nn.Linear(d_model, d_ffn, bias=bias),
                nn.GELU(),
                nn.Linear(d_ffn, d_model, bias=bias)
            )
            for _ in range(n_experts)
        ])
        
        # Router network: maps input to expert probabilities
        self.router = nn.Linear(d_model, n_experts, bias=False)
    
    def forward(self, x):
        """
        x: (batch, seq_len, d_model)
        Returns: (batch, seq_len, d_model)
        """
        batch_size, seq_len, d_model = x.shape
        
        # Flatten batch and sequence for routing
        # (reshape also handles non-contiguous input, where view would raise)
        x_flat = x.reshape(-1, d_model)
        
        # Compute routing logits
        router_logits = self.router(x_flat)  # (batch*seq, n_experts)
        
        # Select top-k experts
        weights, indices = torch.topk(router_logits, self.top_k, dim=-1)
        
        # Softmax over selected experts only
        weights = F.softmax(weights, dim=-1)
        
        # Process with selected experts
        output = torch.zeros_like(x_flat)
        
        for i in range(self.top_k):
            expert_idx = indices[:, i]
            expert_weight = weights[:, i]
            
            # For each expert, accumulate weighted outputs
            for e_idx in range(self.n_experts):
                mask = (expert_idx == e_idx)
                if mask.any():
                    expert_input = x_flat[mask]
                    expert_output = self.experts[e_idx](expert_input)
                    output[mask] += expert_weight[mask].unsqueeze(-1) * expert_output
        
        return output.view(batch_size, seq_len, d_model)

Failure mode: Load imbalance. Without constraints, a few experts receive most tokens while others remain underutilized. This wastes capacity and causes training instability. Addressing this requires auxiliary load balancing losses (covered in Chapter 10).
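Imbalance is easy to measure from the router logits alone. The sketch below counts top-k assignments per expert and computes one common auxiliary loss, the Switch Transformer style product of token fraction and mean router probability. This is an assumption for illustration and not necessarily the formulation used later in the course. The function name `load_balance_stats` and the toy logits are hypothetical.

```python
import torch
import torch.nn.functional as F


def load_balance_stats(router_logits: torch.Tensor, top_k: int):
    """
    router_logits: (n_tokens, n_experts)
    Returns per-expert assignment counts and a scalar auxiliary loss.
    The loss is about 1.0 under uniform routing and grows as routing concentrates.
    """
    n_tokens, n_experts = router_logits.shape
    probs = F.softmax(router_logits, dim=-1)
    _, indices = torch.topk(router_logits, top_k, dim=-1)

    # f_i: fraction of all top-k assignments that landed on expert i
    counts = torch.bincount(indices.reshape(-1), minlength=n_experts)
    frac_tokens = counts.float() / (n_tokens * top_k)

    # P_i: mean router probability given to expert i (differentiable part)
    mean_prob = probs.mean(dim=0)

    aux_loss = n_experts * torch.sum(frac_tokens * mean_prob)
    return counts, aux_loss


# Toy logits: a balanced router versus one that favors experts 0 and 1.
balanced = torch.zeros(8, 4)
for t in range(8):
    balanced[t, t % 4] = 2.0
    balanced[t, (t + 1) % 4] = 1.0

collapsed = torch.zeros(8, 4)
collapsed[:, 0] = 10.0
collapsed[:, 1] = 9.0

counts_bal, aux_bal = load_balance_stats(balanced, top_k=2)
counts_col, aux_col = load_balance_stats(collapsed, top_k=2)
print(counts_bal.tolist(), float(aux_bal))
print(counts_col.tolist(), float(aux_col))
```

In training, a term like this would be scaled by a small coefficient and added to the task loss. Gradients flow through `mean_prob`, since the counts themselves are not differentiable.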

Failure mode: Memory inefficiency from routing complexity. Naive MoE implementations batch tokens per expert, which produces variable-sized per-expert batches and padding overhead when those batches are padded to a common size. Production implementations use specialized kernels, such as the MoE support in Megatron-LM.
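The padding cost can be made concrete with a small sketch. It assumes every expert's batch is padded to the size of the busiest expert, then shows a sort-based alternative that groups tokens into contiguous, unpadded segments. Both helpers (`padded_dispatch_waste`, `group_by_expert`) are hypothetical names, and this is not how any particular production kernel works.

```python
import torch


def padded_dispatch_waste(expert_idx: torch.Tensor, n_experts: int):
    """Fraction of slots that are padding if every expert batch is padded to the largest one."""
    counts = torch.bincount(expert_idx, minlength=n_experts)
    padded_slots = int(counts.max()) * n_experts
    used_slots = int(counts.sum())
    return counts, 1.0 - used_slots / padded_slots


def group_by_expert(x_flat: torch.Tensor, expert_idx: torch.Tensor, n_experts: int):
    """Reorder tokens so each expert's tokens are contiguous; no padding needed."""
    _, order = torch.sort(expert_idx, stable=True)   # stable keeps original token order per expert
    counts = torch.bincount(expert_idx, minlength=n_experts)
    segments = torch.split(x_flat[order], counts.tolist())  # one (possibly empty) chunk per expert
    return segments, order


# Toy assignment: 8 tokens, 4 experts, expert 0 is overloaded.
expert_idx = torch.tensor([0, 0, 0, 0, 0, 1, 2, 0])
x_flat = torch.arange(8, dtype=torch.float32).unsqueeze(-1)  # token i carries the value i

counts, waste = padded_dispatch_waste(expert_idx, n_experts=4)
segments, order = group_by_expert(x_flat, expert_idx, n_experts=4)
print(counts.tolist(), round(waste, 3))
print([seg.shape[0] for seg in segments], order.tolist())
```

After each expert processes its segment, the outputs can be scattered back to the original token positions with `order`. The more skewed the routing, the larger the gap between the padded and grouped layouts.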

Failure mode: Expert collapse early in training. The router learns to send all tokens to one or two experts before auxiliary losses take effect. Clipping router logits or adding entropy regularization mitigates this.
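Both mitigations fit in a few lines. The sketch below is one possible reading, stated as an assumption: it measures the entropy of the batch-averaged routing distribution, which is highest when experts are used evenly, and optionally clamps logits first. The function name `routing_entropy` and the `clip` parameter are hypothetical.

```python
import math

import torch
import torch.nn.functional as F


def routing_entropy(router_logits: torch.Tensor, clip=None) -> torch.Tensor:
    """
    Entropy of the mean routing distribution over a batch of tokens.
    router_logits: (n_tokens, n_experts). Maximum value is log(n_experts).
    """
    if clip is not None:
        # Clipping bounds how confident the router can become about any one expert
        router_logits = torch.clamp(router_logits, -clip, clip)
    probs = F.softmax(router_logits, dim=-1)
    mean_probs = probs.mean(dim=0)
    return -(mean_probs * torch.log(mean_probs + 1e-9)).sum()


# Toy logits: an undecided router versus one that collapsed onto expert 0.
uniform_logits = torch.zeros(16, 4)
collapsed_logits = torch.zeros(16, 4)
collapsed_logits[:, 0] = 20.0

h_uniform = float(routing_entropy(uniform_logits))
h_collapsed = float(routing_entropy(collapsed_logits))
h_clipped = float(routing_entropy(collapsed_logits, clip=1.0))
print(h_uniform, h_collapsed, h_clipped, math.log(4))
```

As a regularizer, the entropy would be subtracted from the task loss with a small coefficient, for example `loss = task_loss - coef * routing_entropy(router_logits)`, so that minimizing the loss pushes routing toward wider expert usage.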

EXERCISE

Create a small MoE with 4 experts and test routing on random input. Print the expert selection counts to verify routing works. Add entropy regularization to the router to encourage wider expert usage.
