Attention is quadratic. The world is sequential. Here's how three paradigms — self-attention, structured state spaces, and selective state spaces — each try to model time, and why it matters more than ever.
Every sequence model — whether it's reading a paragraph, listening to audio, or watching a video — faces the same core challenge: information from earlier in the sequence needs to influence decisions later. How you solve that determines everything about your model's speed, memory, and ability to generalize.
Three fundamentally different answers have emerged. Transformers say: look at everything. SSMs say: compress into a fixed state. Mamba says: selectively decide what to remember. Each answer comes with profound tradeoffs.
- **Transformer:** Full attention over all tokens. Every position attends to every other. Lossless memory — but O(n²) cost that makes long sequences brutally expensive.
- **Classic SSM:** Fixed linear recurrence. Compress the entire past into a constant-size state vector. O(n) cost — but the compression is time-invariant, so the model can't selectively remember.
- **Mamba:** Selective state space. The parameters of the recurrence are input-dependent — the model learns what to keep and what to forget at each step. O(n) cost with context-aware memory.
- **Hybrid:** Mamba layers interleaved with attention layers. Best of both worlds — attention for precise recall, SSM for efficient long-range context. The direction the field is moving.
Transformers, introduced by Vaswani et al. in 2017, solve the memory problem in the most direct possible way: don't compress anything. Just look at everything. Self-attention lets every token directly attend to every other token in the sequence.
This is extraordinarily powerful. A word at position 500 can directly see a word at position 1 with zero information loss. There's no compression bottleneck, no degradation of signal over distance. This is why transformers utterly dominate NLP benchmarks — their memory is lossless.
For each token, you compute three vectors: a Query (what am I looking for?), a Key (what do I contain?), and a Value (what do I contribute?). Then:

Attention(Q, K, V) = softmax(Q·Kᵀ / √dₖ)·V
The attention matrix is n × n. For a sequence of 1,000 tokens, that's 1,000,000 entries. For 100,000 tokens, it's 10 billion entries. Memory and compute scale as the square of sequence length — O(n²). This is not a minor implementation issue. It is a fundamental architectural ceiling.
Each cell represents one attention score (Q·Kᵀ). The entire grid must fit in GPU memory simultaneously.
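The quadratic blow-up is easy to see in code. Here is a minimal NumPy sketch of scaled dot-product attention (single head, random Q/K/V, purely illustrative): the score matrix it materializes is exactly n × n.

```python
import numpy as np

def attention(Q, K, V):
    """Scaled dot-product attention: softmax(Q·Kᵀ/√d)·V.
    The `scores` matrix is n×n — this is the quadratic term."""
    d = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d)                        # shape (n, n)
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)       # row-wise softmax
    return weights @ V, weights

rng = np.random.default_rng(0)
n, d = 1_000, 64
Q, K, V = (rng.standard_normal((n, d)) for _ in range(3))
out, W = attention(Q, K, V)
print(W.shape)   # one score per token pair: 1,000,000 entries at n = 1,000
```

Doubling n quadruples the size of `W`; at n = 100,000 the same array would hold 10 billion entries, which is exactly the ceiling described above.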
At 128K context (what GPT-4 supports), a single attention matrix has ~17 billion entries per head (over 30 GB in float16). This is why long-context inference is so expensive: the KV cache grows linearly with context length, and because every new token attends over that entire cache, total attention compute grows quadratically.
During inference, transformers cache Key and Value matrices for all previous tokens. For Llama-3-70B with 128K context, the KV cache alone is on the order of 40 GB per request, even with grouped-query attention. This is why long-context serving is so expensive — it's not compute, it's memory. A handful of long conversations can exhaust the entire VRAM of a server node.
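The arithmetic is simple enough to sketch. This back-of-envelope calculator assumes a Llama-3-70B-like configuration (80 layers, grouped-query attention with 8 KV heads, head dimension 128, float16); the formula is general, the specific numbers are assumptions.

```python
def kv_cache_bytes(n_layers, n_kv_heads, head_dim, seq_len, bytes_per_elem=2):
    """Per-request KV cache size: two tensors (K and V) per layer,
    one (n_kv_heads × head_dim) slot per token."""
    return 2 * n_layers * n_kv_heads * head_dim * seq_len * bytes_per_elem

# Assumed Llama-3-70B-like config: 80 layers, GQA with 8 KV heads, head dim 128
gb = kv_cache_bytes(80, 8, 128, 128 * 1024) / 2**30
print(f"{gb:.0f} GiB")   # 40 GiB at 128K context in float16
```

Without grouped-query attention (i.e., caching all 64 query heads) the same request would need 8× more, which is why GQA is near-universal in long-context models.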
And yet — despite all of this — transformers dominate. The reason is simple: they are right. When you process language, any token really can depend on any earlier token. Compressing that arbitrarily is dangerous. The quadratic cost is the honest price of lossless sequence memory.
The question isn't whether transformers are right. It's whether you can approximate what they're doing — with O(n) cost — without losing too much.
State Space Models come from control theory — they've been used in signal processing, physics simulations, and robotics for decades. The idea is conceptually simple: maintain a hidden state that summarizes everything seen so far, and update it as new input arrives.
If Transformers are like a person who re-reads the entire book before answering each question, SSMs are like a person taking notes — compressing what they've read into a finite notebook that they update continuously.
This is a linear ODE of the kind used in physics and control theory: h′(t) = A·h(t) + B·x(t), with readout y(t) = C·h(t). The state h(t) evolves continuously. But neural networks are discrete — they process tokens one at a time — so you discretize it using a step size Δ (delta). With the standard zero-order-hold rule, Ā = exp(Δ·A) and B̄ = (Δ·A)⁻¹(exp(Δ·A) − I)·Δ·B, giving the discrete recurrence hₜ = Ā·hₜ₋₁ + B̄·xₜ, yₜ = C·hₜ.
Look at that recurrence carefully. hₜ = Ā·hₜ₋₁ + B̄·xₜ is just like an RNN cell — but the matrices Ā and B̄ are fixed for the entire sequence. They don't depend on the input. This is the fundamental limitation of classic SSMs: they treat every position identically. The model can't "pay more attention" to an important word vs. a filler word.

Each state hₜ is computed from the previous state hₜ₋₁ and the current token xₜ. Information propagates left to right through the fixed recurrence.
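The whole recurrence fits in a few lines of NumPy. This sketch uses tiny, hand-picked matrices (a scalar input, a 4-dimensional state) just to make the fixed-parameter update concrete.

```python
import numpy as np

def ssm_scan(A_bar, B_bar, C, x):
    """Discrete SSM recurrence: hₜ = Ā·hₜ₋₁ + B̄·xₜ, yₜ = C·hₜ.
    Ā and B̄ are FIXED for the whole sequence — the classic-SSM limitation."""
    h = np.zeros(A_bar.shape[0])
    ys = []
    for x_t in x:                        # one O(1) state update per token
        h = A_bar @ h + B_bar * x_t
        ys.append(C @ h)
    return np.array(ys)

N = 4                                    # state size: constant, regardless of length
A_bar = 0.9 * np.eye(N)                  # a stable, fixed transition
B_bar = np.ones(N)
C = np.ones(N) / N
y = ssm_scan(A_bar, B_bar, C, x=np.array([1.0, 0.0, 0.0]))
# An impulse at t=0 decays geometrically through the fixed recurrence:
# y = [1.0, 0.9, 0.81]
```

Note what the loop does *not* do: it never looks back at earlier tokens. Everything it knows about the past lives in the N numbers of `h`.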
The problem with naive SSMs is the state matrix A. For the recurrence to work well, A needs to be a stable dynamical system — it can't have eigenvalues that cause the state to explode or vanish. S4 (Gu et al., 2021) solved this by constraining A to a specific structure called the HIPPO matrix.
HIPPO — High-order Polynomial Projection Operator: The HIPPO matrix constructs A such that the hidden state hₜ is provably an optimal polynomial approximation of the entire input history x₁…xₜ. It's a mathematically principled way to compress an arbitrary-length past into a finite state while minimizing information loss. This is the theoretical backbone of the SSM revolution.
S4 also showed that the recurrence can be computed as a convolution during training — making it fully parallelizable on GPUs. At inference, you run it as a fast recurrence. This dual nature — parallel training, O(1) per-step inference — is what makes SSMs so appealing.
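Because Ā and B̄ never change, the recurrence unrolls into a convolution with kernel K̄ₖ = C·Āᵏ·B̄. Here is a small NumPy sketch (random stable matrices, illustrative sizes) checking that the sequential path and the convolutional path compute identical outputs:

```python
import numpy as np

rng = np.random.default_rng(1)
N, L = 4, 8                              # state size, sequence length
A = 0.8 * np.eye(N) + 0.05 * rng.standard_normal((N, N))   # fixed, stable-ish Ā
B = rng.standard_normal(N)               # fixed B̄
C = rng.standard_normal(N)
x = rng.standard_normal(L)

# Path 1: sequential recurrence (how you'd run inference)
h, y_rec = np.zeros(N), []
for x_t in x:
    h = A @ h + B * x_t
    y_rec.append(C @ h)
y_rec = np.array(y_rec)

# Path 2: convolution with kernel K̄ₖ = C·Āᵏ·B̄ (how S4 trains in parallel)
kernel = np.array([C @ np.linalg.matrix_power(A, k) @ B for k in range(L)])
y_conv = np.array([kernel[:t + 1][::-1] @ x[:t + 1] for t in range(L)])

assert np.allclose(y_rec, y_conv)        # same outputs, two execution modes
```

S4 computes this convolution with an FFT rather than the explicit loop shown here, but the equivalence is the point: parallel training, recurrent inference.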
Despite the elegance, classic SSMs have one crippling weakness: the parameters Ā and B̄ are time-invariant. Every input token goes through the same linear transformation regardless of its content. Ask an SSM "what was the person's name mentioned at the beginning of this 10,000-word document?" and it has no way to selectively preserve that specific piece of information while discarding filler text. The compression is content-agnostic. Mamba was built to fix exactly this.
Mamba (Gu & Dao, 2023) makes one surgical change to the classic SSM: it makes the parameters input-dependent. The matrices B, C and the step size Δ are no longer fixed — they are computed from the current input token. This is called a selective state space model.
This seemingly small change is actually enormous. A model where B, C, Δ vary per token can selectively absorb or ignore each input. It can hold onto a name for 10,000 tokens and then surface it exactly when needed. It can forget irrelevant filler words immediately. It behaves much more like attention — but in O(n) time.
The key intuition: Δₜ is a "focus dial" per token. When Δ is large, Ā ≈ 0 and the state nearly resets — the model "forgets" previous context and focuses only on xₜ. When Δ is small, Ā ≈ I and the state changes little — the model "remembers" its history and largely ignores xₜ. This mimics what attention does, but without materializing the n×n matrix.
Large Δ → model focuses on current token (resets state). Small Δ → model preserves memory (ignores current token). Mamba learns this per-token from content.
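A scalar toy model makes the focus dial tangible. This is not Mamba's learned projection machinery — the per-token Δ values below are chosen by hand, and the input gate is a simplified zero-order-hold stand-in — but it shows the mechanism: with tiny Δ on filler tokens a stored fact survives; with large Δ it is wiped.

```python
import numpy as np

a = -1.0                                    # stable continuous-time dynamics (a < 0)

def selective_scan(x, deltas):
    """Scalar selective SSM: Δ is chosen PER TOKEN, so Āₜ = exp(Δₜ·a) varies.
    Large Δₜ → Āₜ ≈ 0 (reset/forget); small Δₜ → Āₜ ≈ 1 (retain)."""
    h, hs = 0.0, []
    for x_t, d_t in zip(x, deltas):
        A_bar = np.exp(d_t * a)             # per-token decay
        B_bar = 1.0 - A_bar                 # illustrative input gate
        h = A_bar * h + B_bar * x_t
        hs.append(h)
    return np.array(hs)

x = np.array([5.0, 0.0, 0.0, 0.0])          # a "fact" at t=0, then filler
remember = selective_scan(x, deltas=[10.0, 1e-3, 1e-3, 1e-3])
forget   = selective_scan(x, deltas=[10.0, 10.0, 10.0, 10.0])
# remember[-1] stays near 5.0; forget[-1] collapses toward 0.
```

A classic SSM cannot express the `remember` trajectory at all: its single fixed Δ forces the same decay on the fact and on the filler.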
The Mamba block is structured around a gated design. The input goes through two parallel branches — one linear path and one selective SSM path — which are combined multiplicatively (a GLU-style gate). This is similar to how SwiGLU works in transformer FFN layers.
Here's Mamba's second major innovation — one that's less discussed but equally important: the hardware-aware algorithm.
Making B and C input-dependent breaks the convolutional trick that made S4 fast. You can no longer compute the entire sequence as a single FFT-based convolution. The recurrence is now truly sequential — each hₜ depends on hₜ₋₁.

The naive approach: compute h₀, h₁, …, hₙ sequentially. This works but is slow because it can't use GPU parallelism. Mamba solves this with a parallel prefix scan (parallel associative scan):
Parallel Scan Intuition: The recurrence hₜ = Āₜ·hₜ₋₁ + B̄ₜ·xₜ is an associative operation. That means you can compute it in a binary tree fashion — pairs of elements first, then pairs of pairs — reducing the sequential depth from O(n) to O(log n) parallel steps. This is the same trick used in prefix sum / cumulative sum algorithms. On a GPU with thousands of cores, O(log n) parallel steps is orders of magnitude faster than O(n) sequential steps.
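The associativity is worth seeing concretely. Each step hₜ = aₜ·hₜ₋₁ + bₜ is an affine map h → a·h + b, and two such maps compose into another one: ((a₁,b₁), (a₂,b₂)) → (a₁a₂, a₂b₁+b₂). This scalar sketch runs a Hillis–Steele style scan over those maps (a simplified stand-in for Mamba's fused CUDA kernel) and checks it against the sequential loop:

```python
import numpy as np

def combine(l, r):
    """Compose two affine maps h → a·h + b: apply l first, then r."""
    a1, b1 = l
    a2, b2 = r
    return (a1 * a2, a2 * b1 + b2)

def parallel_scan(a, b):
    """Hillis–Steele inclusive scan: O(log n) parallel rounds instead of n
    sequential steps. Element i ends up holding the composition of maps 0..i."""
    elems = list(zip(a, b))
    n, shift = len(elems), 1
    while shift < n:                      # ⌈log₂ n⌉ rounds total
        elems = [elems[i] if i < shift
                 else combine(elems[i - shift], elems[i])
                 for i in range(n)]
        shift *= 2
    # Applying each prefix-composite to h₋₁ = 0 leaves hₜ in the b-slot.
    return np.array([b_t for _, b_t in elems])

rng = np.random.default_rng(2)
a, b = rng.uniform(0.5, 1.0, 16), rng.standard_normal(16)

h, ref = 0.0, []                          # sequential reference
for a_t, b_t in zip(a, b):
    h = a_t * h + b_t
    ref.append(h)

assert np.allclose(parallel_scan(a, b), ref)
```

On a GPU each round's combines run simultaneously, so the 16-element scan above would take 4 parallel rounds rather than 16 dependent steps.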
Memory-Efficient Execution: Mamba also avoids materializing the full state sequence in HBM (high-bandwidth memory). The kernel fuses the scan computation and keeps intermediate states in SRAM (on-chip cache, roughly an order of magnitude faster than HBM). This isn't just a software trick — it fundamentally changes the memory access pattern and is why Mamba achieves about 5× higher inference throughput than a transformer of equivalent parameter count on long sequences.
Mamba is genuinely impressive for long sequences. On tasks like DNA modeling (sequences of 1M+ tokens), audio waveform modeling, and long-document processing, it matches or beats transformers at a fraction of the compute. Inference is O(1) per new token — the state size is constant regardless of how many tokens you've processed.
But Mamba isn't perfect. On tasks that require precise recall of specific information — especially from arbitrary positions in the past — transformers still win. If you need to copy a 20-digit number from the beginning of a 10,000-token prompt, attention can reach back directly. Mamba has to hope that number survived 10,000 recurrence steps without being compressed away.
A well-known evaluation, the "needle in a haystack" test, hides a key fact ("The magic word is PINEAPPLE") deep inside a long document and asks about it at the end. Transformers ace this — attention directly retrieves the relevant token. Mamba-1 struggles significantly — the state compression often loses exact values. Mamba-2 and hybrid models (Jamba, Zamba) are specifically designed to address this failure mode.
Mamba 2 (Dao & Gu, 2024) rethinks the architecture from a theoretical perspective and arrives at a new formulation called Structured State Space Duality (SSD). The central insight: a specific class of SSMs is mathematically equivalent to a form of linear attention with matrix-valued states.
This duality is profound. It means you can implement the same computation either as a recurrence (efficient at inference) or as a matrix multiplication (efficient at training) — and switch between them automatically. Mamba 2 gets:
- **Tensor-core-friendly training:** By restructuring the inner loop to be matrix-multiplication heavy, Mamba 2 fits naturally onto the tensor cores of modern GPUs — the same hardware that transformers exploit. Training speed roughly doubles over Mamba 1.
- **Larger state dimension:** The SSD formulation allows expanding the state dimension from 16 (Mamba 1) to 64 or 128 without prohibitive cost. Larger state = more memory capacity = better recall.
- **Multi-GPU scaling:** Mamba 1 was hard to shard across multiple GPUs. The SSD formulation is naturally compatible with tensor parallelism, enabling multi-GPU scaling.
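The duality itself can be demonstrated in a few lines for the scalar-decay case SSD analyzes. This toy sketch (random inputs, illustrative sizes, not the blocked SSD algorithm) computes the same outputs two ways: as a recurrence, and as masked "linear attention" where a lower-triangular decay matrix M plays the role of the attention mask.

```python
import numpy as np

rng = np.random.default_rng(3)
T, N = 6, 4                              # sequence length, state size
a = rng.uniform(0.8, 1.0, T)             # per-token scalar decay (SSD's scalar-A case)
B = rng.standard_normal((T, N))          # input-dependent B_t
C = rng.standard_normal((T, N))          # input-dependent C_t
x = rng.standard_normal(T)

# Recurrent form (inference-friendly): hₜ = aₜ·hₜ₋₁ + Bₜ·xₜ,  yₜ = Cₜ·hₜ
h, y_rec = np.zeros(N), []
for t in range(T):
    h = a[t] * h + B[t] * x[t]
    y_rec.append(C[t] @ h)
y_rec = np.array(y_rec)

# Dual "linear attention" form (training-friendly): y = (M ∘ C·Bᵀ)·x,
# where M[t, s] = aₛ₊₁·…·aₜ for s ≤ t is a lower-triangular decay mask.
cum = np.cumprod(a)
M = np.tril(np.outer(cum, 1.0 / cum))    # M[t, s] = cum[t] / cum[s]
y_att = (M * (C @ B.T)) @ x

assert np.allclose(y_rec, y_att)         # one computation, two execution modes
```

The matmul form is what lands on tensor cores during training; the recurrence is what runs token-by-token at inference. Switching between them is the "duality" in SSD.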
If SSMs are great at long-range context and transformers are great at precise recall, why not use both? That's exactly what hybrid models do — interleave Mamba layers with attention layers in a single model.
| Model | Architecture | Attention Layers | Why It Works |
|---|---|---|---|
| Jamba (AI21) | Mamba + MoE + Attn | 1 in 8 layers | Attention handles precise recall; SSM handles long context cheaply |
| Zamba (Zyphra) | Mamba + shared Attn | 1 shared block | Single attention block is shared across layers — minimal overhead |
| Falcon Mamba | Pure Mamba 1 | 0 | First Mamba model competitive with transformers at 7B scale |
| RWKV-6 | Linear Attn | 0 (linear) | Different approach to O(n) — reformulates attention as RNN |
| Griffin (Google) | Linear Recurrence + MQA | Mixed | Real-gated linear recurrence with sliding window attention |
The emerging consensus: pure Mamba excels on modalities where long-range context matters but precise verbatim recall is rare — audio, DNA sequences, time series, video. For language modeling — especially instruction following and retrieval — hybrid models with a small fraction of attention layers consistently beat pure Mamba at the same parameter count. The cost of those attention layers is surprisingly small when you only need 1 in 8.
At 1K tokens the costs are comparable. By 100K tokens, attention's quadratic cost has grown ~10,000× while a linear model's has grown only ~100×, and the gap keeps widening with length.
| Dimension | Transformer | Classic SSM (S4) | Mamba | Mamba 2 / Hybrid |
|---|---|---|---|---|
| Compute complexity | O(n²) | O(n log n) | O(n) | O(n) |
| Memory (sequence) | O(n²) | O(n) | O(1) per step | O(1) + O(w) |
| Inference per token | O(n) grows | O(1) constant | O(1) constant | O(1) mostly |
| Selective memory | ✅ Full attention | ❌ Fixed recurrence | ✅ Input-dependent | ✅ Both |
| Parallelizable training | ✅ Fully | ✅ Via convolution | ✅ Parallel scan | ✅ Fully |
| Precise verbatim recall | ✅ Excellent | ⚠️ Degrades | ⚠️ Degrades (long) | ✅ Good |
| Long context (>100K) | ❌ Prohibitive | ✅ Efficient | ✅ Efficient | ✅ Efficient |
| Positional encoding | Required (RoPE etc.) | Implicit via Δ | Implicit via Δ | Mixed |
| In-context learning | ✅ Strong | ⚠️ Weak | ✅ Reasonable | ✅ Strong |
| Multi-GPU scaling | ✅ Mature | ⚠️ Limited | ⚠️ Difficult | ✅ Via Mamba 2 |
| Best use case | General NLP, reasoning | Signal, time series | Audio, DNA, video | General long-context |
**Choose a transformer when:** You need maximum reasoning quality. Short-to-medium context (≤32K). Tasks requiring precise fact recall, multi-step reasoning, or strong in-context learning. You have the GPU budget.

**Choose Mamba when:** Your sequences are long (>100K) or infinite (streaming). Modalities like audio, DNA, video where time-series dynamics matter. Real-time inference on edge devices where O(1) per-step cost is critical.

**Choose a hybrid when:** You need long context AND strong recall (the general case for language models). Hybrid models with 1-in-8 attention layers are now the pragmatic default for new model architecture research.
Alternative linear attention families are maturing fast. RWKV-7, xLSTM, and linear transformers offer different tradeoff points. The space is far from settled.
My take: Attention won because language is genuinely structured — words relate to words across arbitrary distances, and lossless access to that structure matters. Mamba is not a replacement for attention — it's a complement. The most honest description of the field right now is: we are learning where linear recurrence is sufficient and where attention is necessary, and building models that use each appropriately. The transformer era isn't ending. It's being refined.