RUNLOCALAI · v38
Independently operated catalog for local-AI hardware and software. Hand-written verdicts. Source-cited claims. Reproducible commands when we have them.

OP·Fredoline Eruo
© 2026 runlocalai.co · Independently operated
Custom LLM Architecture Design

11. Mamba State Space Model

Chapter 11 of 24 · 15 min
KEY INSIGHT

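Mamba's "selective" property is crucial: in a standard SSM the parameters A, B, and C are constant across the sequence, whereas Mamba computes the step size Δ and the matrices B and C from each input token. A itself stays a learned parameter, but its discretized form exp(ΔA) becomes input-dependent through Δ. This lets the model ignore or emphasize state depending on content, which enables content-aware filtering.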

Mamba replaces attention with a selective state space model that processes a sequence in time linear in its length. Where transformer attention costs O(N²) in sequence length N, Mamba's recurrence costs O(N) with a fixed-size state, while still modeling long-range dependencies.
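The selective behavior and the single-pass recurrence described above can be sketched with a one-dimensional SSM in plain Python. This is an illustrative toy, not the reference implementation: the function name `scan`, the scalar parameters `a`, `b`, `c`, and the hand-picked step sizes are all assumptions made for the example. In Mamba the step size is produced by a learned projection of the token, not chosen by hand.

```python
import math

def scan(xs, deltas, a=-1.0, b=1.0, c=1.0):
    """Run a scalar SSM over xs in one pass: O(N) time, constant-size state.

    deltas holds one step size per token. A constant list gives a classic
    time-invariant SSM; a per-token list imitates an input-dependent step size.
    """
    h, ys = 0.0, []
    for x, dt in zip(xs, deltas):
        a_bar = math.exp(dt * a)          # zero-order hold for the state transition
        b_bar = (a_bar - 1.0) / a * b     # zero-order hold for the input term
        h = a_bar * h + b_bar * x         # state update uses only the previous state
        ys.append(c * h)
    return ys

xs = [1.0, 5.0, 1.0, 1.0]                 # treat 5.0 as a distractor token

# Fixed step size: every token, including the distractor, moves the state.
fixed = scan(xs, [0.5, 0.5, 0.5, 0.5])

# Hypothetical selection rule: a near-zero step size on the distractor makes
# a_bar close to 1 and b_bar close to 0, so the state passes through unchanged.
selective = scan(xs, [0.5, 1e-6, 0.5, 0.5])

print("fixed:    ", [round(y, 4) for y in fixed])
print("selective:", [round(y, 4) for y in selective])
```

With a fixed step size the distractor visibly shifts the output at position 1; with the near-zero step size the output at position 1 stays almost equal to the output at position 0.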

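The Mamba SSM (State Space Model) discretizes its continuous parameters A and B with a zero-order hold (ZOH) and a step size Δ: Ā = exp(ΔA) and B̄ = (ΔA)⁻¹(exp(ΔA) - I)·ΔB. The full block can be sketched in PyTorch as follows: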

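import torch
import torch.nn as nn
import torch.nn.functional as F
import math


def selective_scan(u, delta, A, B, C, D):
    """
    Reference (sequential) selective scan. Readable but slow: the official
    implementation replaces this loop with a fused, hardware-aware kernel.

    u, delta: (batch, seq, d_inner)
    A:        (d_inner, d_state), all entries negative
    B, C:     (batch, seq, d_state)
    D:        (d_inner,)
    """
    batch, seq_len, d_inner = u.shape
    # Zero-order hold for A: A_bar = exp(delta * A)
    A_bar = torch.exp(delta.unsqueeze(-1) * A)  # (batch, seq, d_inner, d_state)
    # Simplified first-order discretization for B: B_bar * u ~= delta * B * u
    Bu = delta.unsqueeze(-1) * B.unsqueeze(2) * u.unsqueeze(-1)

    h = u.new_zeros(batch, d_inner, A.shape[-1])
    ys = []
    for t in range(seq_len):
        h = A_bar[:, t] * h + Bu[:, t]                     # state update
        ys.append((h * C[:, t].unsqueeze(1)).sum(dim=-1))  # y_t = C_t h_t
    y = torch.stack(ys, dim=1)  # (batch, seq, d_inner)
    return y + u * D            # learned skip connection


class MambaBlock(nn.Module):
    """
    Mamba Selective State Space Model block.
    Based on 'Mamba: Linear-Time Sequence Modeling with Selective State Spaces'
    """
    def __init__(self, d_model, d_state=16, d_conv=4, expand=2, dt_rank=None):
        super().__init__()
        self.d_model = d_model
        self.d_state = d_state
        self.d_conv = d_conv
        self.d_inner = d_model * expand
        # Rank of the low-rank step-size projection
        self.dt_rank = dt_rank if dt_rank is not None else math.ceil(d_model / 16)

        # Input projection: produces the SSM branch (x) and the gate branch (z)
        self.in_proj = nn.Linear(d_model, self.d_inner * 2, bias=False)

        # Depthwise causal convolution over the sequence dimension
        self.conv1d = nn.Conv1d(
            in_channels=self.d_inner,
            out_channels=self.d_inner,
            kernel_size=d_conv,
            padding=d_conv - 1,
            groups=self.d_inner,
        )

        # Selective parameters: project each token to (dt, B, C)
        self.x_proj = nn.Linear(self.d_inner, self.dt_rank + d_state * 2, bias=False)

        # dt_proj: expands the low-rank dt to one step size per channel
        self.dt_proj = nn.Linear(self.dt_rank, self.d_inner, bias=True)

        # A: learned, not input-dependent. Stored as log(-A) so that
        # A = -exp(A_log) is always negative and exp(dt * A) stays in (0, 1).
        A = torch.arange(1, d_state + 1, dtype=torch.float32).repeat(self.d_inner, 1)
        self.A_log = nn.Parameter(torch.log(A))

        # D: learned per-channel skip connection (not input-dependent)
        self.D = nn.Parameter(torch.ones(self.d_inner))

        # Output projection
        self.out_proj = nn.Linear(self.d_inner, d_model, bias=False)

    def forward(self, x):
        """
        x: (batch, seq_len, d_model)
        """
        batch, seq_len, d_model = x.shape

        # Split the projection into the SSM branch and the gate branch
        xz = self.in_proj(x)
        x_inner, z = xz.chunk(2, dim=-1)

        # Causal convolution: pad by d_conv - 1, then trim back to seq_len
        x_conv = x_inner.transpose(1, 2)  # (batch, d_inner, seq)
        x_conv = self.conv1d(x_conv)[:, :, :seq_len]
        x_conv = x_conv.transpose(1, 2)  # (batch, seq, d_inner)
        x_conv = F.silu(x_conv)

        # Compute SSM parameters (selective: depend on the current input)
        x_dbl = self.x_proj(x_conv)  # (batch, seq, dt_rank + 2 * d_state)
        dt, B, C = x_dbl.split([self.dt_rank, self.d_state, self.d_state], dim=-1)
        dt = F.softplus(self.dt_proj(dt))  # positive step size per channel
        A = -torch.exp(self.A_log)         # (d_inner, d_state), negative

        # Selective scan
        y = selective_scan(x_conv, dt, A, B, C, self.D)

        # Output with SiLU gating from the z branch
        return self.out_proj(y * F.silu(z))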

Failure mode: State dimension mis-sized. d_state=16 is a typical default, but the right value varies by model size. A larger d_state improves expressiveness but costs memory: A holds d_inner × d_state parameters, and the recurrent state holds d_inner × d_state values per sequence. A very small value such as d_state=4 severely limits SSM expressiveness.
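The memory cost described above is simple arithmetic, sketched here for a few values of d_state. The helper name `ssm_state_footprint`, the choice of d_model=768, and the 4-byte (float32) value size are assumptions made for illustration; the figures cover only the A matrix and the recurrent state of one layer, not the rest of the block.

```python
def ssm_state_footprint(d_model, d_state=16, expand=2, batch=1, bytes_per_value=4):
    """Count A parameters and recurrent-state bytes for one SSM layer."""
    d_inner = d_model * expand
    a_params = d_inner * d_state                 # entries in the A matrix
    state_values = batch * d_inner * d_state     # recurrent state held per step
    return {
        "d_inner": d_inner,
        "a_params": a_params,
        "state_bytes": state_values * bytes_per_value,
    }

# Hypothetical model width; both quantities grow linearly with d_state.
for n in (4, 16, 64):
    print(n, ssm_state_footprint(d_model=768, d_state=n))
```

Doubling d_state doubles both numbers, and the state term also scales with batch size during inference.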

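Failure mode: Discretization instability. Each step multiplies the state by exp(ΔA). If any entry of A is positive, that factor exceeds 1 and the state grows exponentially with sequence length. Parameterize A as -exp(A_log) so it is always negative, pass Δ through softplus so it stays positive, and initialize A_log as the reference implementation does (the log of 1, 2, ..., d_state in each channel) rather than from a standard normal.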
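The effect of the sign of A on stability can be checked with a scalar zero-order-hold recurrence. This is a minimal sketch under stated assumptions: the helper names `zoh` and `state_after`, the constant input of 1.0, and the values Δ = 0.1 and A_log = 0 are chosen for illustration only.

```python
import math

def zoh(delta, a, b):
    """Zero-order-hold discretization of a scalar continuous SSM."""
    a_bar = math.exp(delta * a)
    b_bar = (a_bar - 1.0) / a * b
    return a_bar, b_bar

def state_after(n_steps, delta, a):
    """State reached after n_steps of constant input 1.0, starting from 0."""
    a_bar, b_bar = zoh(delta, a, 1.0)
    h = 0.0
    for _ in range(n_steps):
        h = a_bar * h + b_bar
    return h

a_log = 0.0                                            # illustrative value
stable = state_after(1000, 0.1, -math.exp(a_log))      # A = -1, so a_bar < 1
unstable = state_after(1000, 0.1, math.exp(a_log))     # A = +1, so a_bar > 1

print(f"A negative: {stable:.6f}")
print(f"A positive: {unstable:.3e}")
```

With negative A the state settles at a finite value; with positive A it grows by a factor of exp(0.1) at every step and reaches astronomically large magnitudes within a thousand tokens.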

EXERCISE

Implement the basic (non-selective) SSM with fixed A, B, C. Test it on a copy task (output the input shifted by one position). Then verify that it struggles when the task requires content-based selection, for example copying only certain marked tokens, and explain why selectivity in Mamba addresses this.
