Neural network architectures

State Space Models (Mamba)

State Space Models (SSMs), notably the Mamba architecture, are a class of sequence models whose compute scales linearly with sequence length, unlike the quadratic cost of transformer attention. Mamba uses a selective state space mechanism that updates a fixed-size internal state based on each input token, so memory use during generation does not grow with context length, and very long sequences (reportedly up to around 1M tokens) remain tractable. For local AI operators, this means models like Mamba-2.8B can run on consumer GPUs with 8-12 GB VRAM at high throughput (roughly 50 tok/s on an RTX 4090), rivaling similarly sized transformers while using less VRAM and running faster on long sequences.

Deeper dive

SSMs originate from control theory, where a sequence is modeled as a continuous-time system whose hidden state evolves according to linear differential equations. Mamba introduces selectivity: the step size and the input and output projections (usually written Δ, B, and C) are computed from the current input, which makes the effective state transition input-dependent and lets the model 'forget' irrelevant information and focus on salient tokens. This overcomes a key limitation of earlier SSMs (e.g., S4), whose fixed parameters struggled with content-based reasoning. Architecturally, Mamba replaces the attention mechanism with a selective scan operation that is hardware-efficient (implemented as a parallel scan on GPU). The result is a model whose compute scales linearly with sequence length (O(N)) vs. transformers' O(N²), and whose state has a fixed size. For operators, this means Mamba models can handle very long contexts (e.g., 128K tokens) on a single GPU without running out of memory, though they may lag behind transformers on tasks requiring precise cross-token interactions (e.g., copy tasks).
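The selective recurrence described above can be sketched in a few lines of NumPy. This is a minimal illustration under stated assumptions, not the reference implementation: the function name selective_scan, the weight names W_B, W_C, W_dt, and b_dt, and the full-rank step-size projection are chosen here for readability, and a real Mamba block also includes a convolution, gating, a skip term, and a fused parallel scan kernel instead of the Python loop.

```python
import numpy as np


def softplus(x):
    # Numerically stable log(1 + exp(x)); keeps the step size positive.
    return np.logaddexp(0.0, x)


def selective_scan(x, A, W_B, W_C, W_dt, b_dt, h0=None):
    """Sequential sketch of a selective SSM layer.

    x:    (L, D) input sequence
    A:    (D, N) diagonal state matrix per channel (negative values decay)
    W_B:  (D, N) projection producing the input matrix B_t from x_t
    W_C:  (D, N) projection producing the output matrix C_t from x_t
    W_dt: (D, D), b_dt: (D,) projection producing the step size dt_t
    h0:   optional (D, N) initial state
    Returns outputs y of shape (L, D) and the final state of shape (D, N).
    """
    L, D = x.shape
    N = A.shape[1]
    h = np.zeros((D, N)) if h0 is None else h0.copy()
    y = np.empty((L, D))
    for t in range(L):
        # Selectivity: dt, B, and C are all functions of the current token.
        dt = softplus(x[t] @ W_dt + b_dt)      # (D,)
        B_t = x[t] @ W_B                       # (N,)
        C_t = x[t] @ W_C                       # (N,)
        # Discretize: zero-order hold for A, simple dt * B for the input.
        A_bar = np.exp(dt[:, None] * A)        # (D, N)
        B_bar = dt[:, None] * B_t[None, :]     # (D, N)
        # Fixed-size state update, then readout.
        h = A_bar * h + B_bar * x[t][:, None]  # (D, N)
        y[t] = h @ C_t                         # (D,)
    return y, h


# Example usage with random (untrained) weights.
rng = np.random.default_rng(0)
L, D, N = 16, 4, 8
x = rng.normal(size=(L, D))
A = -np.tile(np.arange(1, N + 1, dtype=float), (D, 1))
W_B = rng.normal(size=(D, N))
W_C = rng.normal(size=(D, N))
W_dt = rng.normal(size=(D, D))
b_dt = rng.normal(size=D)
y, h = selective_scan(x, A, W_B, W_C, W_dt, b_dt)
```

The only thing carried from one step to the next is h, of shape (D, N), so processing a sequence in chunks and passing the final state of one chunk as h0 to the next gives the same outputs as a single pass. That fixed-size state is what replaces the growing KV cache of a transformer.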

Practical example

Running Mamba-2.8B on an RTX 4090 (24 GB VRAM) at Q4 quantization (~2 GB of weights) with a 128K context uses roughly 6 GB of VRAM in total and achieves roughly 50 tok/s. A transformer such as Llama 3.1 8B at Q4 (a larger model, so not a like-for-like comparison) needs ~5 GB for weights plus a KV cache that grows with context: at 128K tokens the cache is about 16 GB at FP16, or about 8 GB with an 8-bit quantized cache. That totals roughly 13 GB with a quantized cache and over 20 GB without, so it can hit memory limits on 16 GB cards. Mamba's lower memory footprint allows operators to run larger models or longer contexts on the same hardware.
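The memory comparison above can be checked with back-of-the-envelope arithmetic. The sketch below is an estimate of cache and state size only (weights and activation buffers are excluded). The helper names are hypothetical, and the model dimensions are assumptions taken from the published configurations: 32 layers, 8 KV heads, and head dimension 128 for Llama 3.1 8B; 64 layers, model width 2560, state size 16, convolution width 4, and expansion factor 2 for Mamba-2.8B.

```python
def kv_cache_bytes(n_layers, n_kv_heads, head_dim, n_tokens, bytes_per_elem=2):
    # Keys and values (factor 2) are stored for every layer and every token.
    return 2 * n_layers * n_kv_heads * head_dim * n_tokens * bytes_per_elem


def mamba_state_bytes(n_layers, d_model, d_state=16, d_conv=4, expand=2,
                      bytes_per_elem=4):
    # Per layer: an SSM state (d_inner x d_state) plus a small convolution
    # buffer (d_inner x d_conv). Neither depends on the number of tokens.
    d_inner = expand * d_model
    return n_layers * d_inner * (d_state + d_conv) * bytes_per_elem


GIB = 2 ** 30
MIB = 2 ** 20
n_tokens = 128 * 1024  # a 128K-token context

llama_fp16 = kv_cache_bytes(32, 8, 128, n_tokens, bytes_per_elem=2)
llama_q8 = kv_cache_bytes(32, 8, 128, n_tokens, bytes_per_elem=1)
mamba = mamba_state_bytes(64, 2560)

print(f"Llama 3.1 8B KV cache, FP16:  {llama_fp16 / GIB:.1f} GiB")
print(f"Llama 3.1 8B KV cache, 8-bit: {llama_q8 / GIB:.1f} GiB")
print(f"Mamba-2.8B recurrent state:   {mamba / MIB:.1f} MiB")
```

Under these assumptions the transformer cache grows by 128 KiB per token at FP16, while the Mamba state stays in the tens of megabytes regardless of context length. Real runtimes add their own buffers, so measured VRAM will be higher than these figures.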

Workflow example

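In llama.cpp, operators load a Mamba model by pointing the CLI at a Mamba GGUF file; no architecture-specific flag is needed, because the architecture is read from the GGUF metadata. For example: ./llama-cli -m mamba-2.8b.Q4_K_M.gguf -n 512 --temp 0.7 (older builds name the binary ./main). The runtime allocates a fixed-size state buffer rather than a growing KV cache, so context length affects prompt processing time, not VRAM. In Hugging Face Transformers, loading a Mamba model is similar to loading a transformer, using the Transformers-format checkpoint: from transformers import MambaForCausalLM; model = MambaForCausalLM.from_pretrained('state-spaces/mamba-2.8b-hf'). Sampling methods such as top-k operate on the output logits and do not depend on the architecture, but runtime features built around a KV cache (for example context shifting or prompt-cache reuse) may be unavailable or behave differently for Mamba models, so operators should check their runtime's documentation.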
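A minimal sketch of the Hugging Face path, assuming the transformers and torch packages are installed and that the Transformers-format checkpoint state-spaces/mamba-2.8b-hf is used. The helper name generate_with_mamba and its default arguments are illustrative and not part of any library; the first call downloads several gigabytes of weights.

```python
def generate_with_mamba(prompt, model_id="state-spaces/mamba-2.8b-hf",
                        max_new_tokens=64, temperature=0.7):
    # Imports are local so that defining this helper does not require the
    # heavy dependencies to be installed.
    import torch
    from transformers import AutoTokenizer, MambaForCausalLM

    # Use half precision on GPU; fall back to float32 on CPU.
    use_cuda = torch.cuda.is_available()
    device = "cuda" if use_cuda else "cpu"
    dtype = torch.float16 if use_cuda else torch.float32

    tokenizer = AutoTokenizer.from_pretrained(model_id)
    model = MambaForCausalLM.from_pretrained(model_id, torch_dtype=dtype)
    model.to(device)
    model.eval()

    input_ids = tokenizer(prompt, return_tensors="pt")["input_ids"].to(device)
    with torch.no_grad():
        output_ids = model.generate(
            input_ids,
            max_new_tokens=max_new_tokens,
            do_sample=True,
            temperature=temperature,
        )
    return tokenizer.decode(output_ids[0], skip_special_tokens=True)


# Example call (commented out because it downloads the model):
# print(generate_with_mamba("State space models are"))
```

For quick experiments, a smaller checkpoint can be passed as model_id to reduce download size and memory use.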

Reviewed by Fredoline Eruo. See our editorial policy.