Deep Dive · Architecture Analysis

SSM vs Mamba vs Transformers

Attention is quadratic. The world is sequential. Here's how three paradigms — self-attention, structured state spaces, and selective state spaces — each try to model time, and why it matters more than ever.

March 2026 · Prateek Singh, PhD
Sequence Modeling · Architecture · Mamba · State Spaces

How do you let a model remember the past?

Every sequence model — whether it's reading a paragraph, listening to audio, or watching a video — faces the same core challenge: information from earlier in the sequence needs to influence decisions later. How you solve that determines everything about your model's speed, memory, and ability to generalize.

Three fundamentally different answers have emerged. Transformers say: look at everything. SSMs say: compress into a fixed state. Mamba says: selectively decide what to remember. Each answer comes with profound tradeoffs.

🔭

Transformer

Full attention over all tokens. Every position attends to every other. Lossless memory — but O(n²) cost that makes long sequences brutally expensive.

📡

Classic SSM

Fixed linear recurrence. Compress the entire past into a constant-size state vector. O(n) cost — but the compression is time-invariant, so the model can't selectively remember.

🎯

Mamba

Selective state space. Parameters of the recurrence are input-dependent — the model learns what to keep and what to forget at each step. O(n) cost with context-aware memory.

Hybrid

Mamba layers interleaved with attention layers. Best of both worlds — attention for precise recall, SSM for efficient long-range context. The direction the field is moving.

01
The Transformer
Attention is All You Need — and also all your memory
O(n²) memory · O(n²) compute · Full attention · Parallelizable

Transformers, introduced by Vaswani et al. in 2017, solve the memory problem in the most direct possible way: don't compress anything. Just look at everything. Self-attention lets every token directly attend to every other token in the sequence.

This is extraordinarily powerful. A word at position 500 can directly see a word at position 1 with zero information loss. There's no compression bottleneck, no degradation of signal over distance. This is why transformers utterly dominate NLP benchmarks — their memory is lossless.

For each token, you compute three vectors: a Query (what am I looking for?), a Key (what do I contain?), and a Value (what do I contribute?). Then:

Self-Attention — Core Equation
Attention(Q, K, V) = softmax( Q·Kᵀ / √d_k ) · V

where Q = X·WQ ← queries (what token i wants)
       K = X·WK ← keys (what token j has)
       V = X·WV ← values (what token j contributes)

Attention matrix is n × n ← THIS is the problem

The attention matrix is n × n. For a sequence of 1,000 tokens, that's 1,000,000 entries. For 100,000 tokens, it's 10 billion entries. Memory and compute scale as the square of sequence length — O(n²). This is not a minor implementation issue. It is a fundamental architectural ceiling.
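The whole mechanism fits in a few lines. Here is a minimal single-head sketch in NumPy (random weights and illustrative shapes, not a trained model) — note the (n, n) score matrix, which is exactly the quadratic object at issue:

```python
import numpy as np

def self_attention(X, W_q, W_k, W_v):
    """Single-head self-attention over X of shape (n, d)."""
    Q, K, V = X @ W_q, X @ W_k, X @ W_v
    scores = Q @ K.T / np.sqrt(K.shape[-1])          # (n, n) — the quadratic object
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)   # row-wise softmax
    return weights @ V

rng = np.random.default_rng(0)
n, d = 8, 4
X = rng.standard_normal((n, d))
W_q, W_k, W_v = (rng.standard_normal((d, d)) for _ in range(3))
out = self_attention(X, W_q, W_k, W_v)
print(out.shape)  # → (8, 4)
```

Doubling n doubles the rows of `X` but quadruples the entries of `scores` — that is the O(n²) ceiling in one variable.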

Figure: Attention Matrix Growth — O(n²). Each cell represents one attention score (Q·Kᵀ); the entire n × n grid must fit in GPU memory simultaneously.

At 128K context (what GPT-4-class models support), the attention matrix alone has ~17 billion entries — about 32 GiB in float16 if materialized naively. This is why long-context inference is so expensive: the KV cache grows linearly with context length, and attention compute over the full sequence grows quadratically.
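That figure is easy to verify with a back-of-envelope calculation:

```python
# Raw size of a single n × n attention score matrix at 128K context (one head)
n = 131_072                     # 128K tokens
entries = n * n                 # the n × n attention matrix
gib = entries * 2 / 2**30       # float16 = 2 bytes per entry
print(f"{entries:,} entries ≈ {gib:.0f} GiB")
# → 17,179,869,184 entries ≈ 32 GiB
```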

Transformer Block Architecture
Output Projection
LayerNorm
+
Feed-Forward (MLP)
LayerNorm
+
Multi-Head Self-Attention
KV Cache grows with n
Token Embeddings + Positional Encoding
× N layers

The KV Cache Problem in Inference

During inference, transformers cache the Key and Value tensors for all previous tokens. For Llama-3-70B (80 layers, grouped-query attention with 8 KV heads of dimension 128) at 128K context, the KV cache alone is roughly 40 GB per request. This is why long-context serving is so expensive — it's not compute, it's memory. A handful of concurrent long conversations can exhaust the entire VRAM of a server node.
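A sketch of that arithmetic, assuming a Llama-3-70B-like configuration (80 layers, grouped-query attention with 8 KV heads of head dimension 128 — treat these as illustrative numbers):

```python
def kv_cache_bytes(n_tokens, n_layers, n_kv_heads, head_dim, bytes_per_elem=2):
    """Bytes needed to cache K and V for every token (float16 by default)."""
    per_token = 2 * n_layers * n_kv_heads * head_dim * bytes_per_elem  # K and V
    return per_token * n_tokens

# Assumed Llama-3-70B-like config: 80 layers, 8 KV heads, head dim 128
gib = kv_cache_bytes(131_072, 80, 8, 128) / 2**30
print(f"{gib:.0f} GiB at 128K context")  # → 40 GiB at 128K context
```

Without grouped-query attention (64 KV heads instead of 8) the same cache would be 8× larger — which is precisely why GQA became standard for long-context serving.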

And yet — despite all of this — transformers dominate. The reason is simple: they are right. When you process language, any token really can depend on any earlier token. Compressing that arbitrarily is dangerous. The quadratic cost is the honest price of lossless sequence memory.

The question isn't whether transformers are right. It's whether you can approximate what they're doing — with O(n) cost — without losing too much.

02
State Space Models
Control theory meets deep learning — compress time into a vector
O(n) memory · O(n log n) compute · Linear recurrence · Parallelizable

State Space Models come from control theory — they've been used in signal processing, physics simulations, and robotics for decades. The idea is conceptually simple: maintain a hidden state that summarizes everything seen so far, and update it as new input arrives.

If Transformers are like a person who re-reads the entire book before answering each question, SSMs are like a person taking notes — compressing what they've read into a finite notebook that they update continuously.

Continuous-Time SSM (Linear ODE)
State update (how the hidden state evolves):
h'(t) = A · h(t) + B · x(t)

Output (what we read from the state):
y(t) = C · h(t) + D · x(t)

where A ∈ ℝ^(N×N) ← state transition matrix (N = state size)
       B ∈ ℝ^(N×1) ← input-to-state projection
       C ∈ ℝ^(1×N) ← state-to-output projection

This is a linear ODE from physics. The state h(t) evolves continuously. But neural networks are discrete — they process tokens one at a time — so you discretize it using a step size Δ (delta):

Discretized SSM (what actually runs in the model)
Discretize using Zero-Order Hold (ZOH):
Ā = exp(A · Δ) ← discrete state matrix
B̄ = A⁻¹ · (Ā − I) · B ← discrete input matrix

Recurrence (run this for each token t = 1, 2, ... n):
ht = Ā · ht-1 + B̄ · xt
yt = C · ht

Look at that recurrence carefully. ht = Ā·ht-1 + B̄·xt is just like an RNN cell — but the matrices Ā and B̄ are fixed for the entire sequence. They don't depend on the input. This is the fundamental limitation of classic SSMs: they treat every position identically. The model can't "pay more attention" to an important word vs a filler word.
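A minimal NumPy sketch of this recurrence, using a diagonal A so the ZOH discretization is elementwise (illustrative values, not a trained model):

```python
import numpy as np

def ssm_scan(a_bar, b_bar, C, xs):
    """Classic SSM recurrence: ht = Ā·ht-1 + B̄·xt, yt = C·ht.
    Diagonal Ā is stored as a vector, so updates are elementwise."""
    h = np.zeros_like(a_bar)
    ys = []
    for x in xs:                  # Ā, B̄ fixed: every token treated identically
        h = a_bar * h + b_bar * x
        ys.append(C @ h)
    return np.array(ys)

# Toy setup: stable diagonal A, scalar input/output, N = 4 state dimensions
N, delta = 4, 0.1
a = -np.arange(1.0, N + 1)        # eigenvalues of continuous-time A (all < 0)
B = np.ones(N)
C = np.ones(N) / N
a_bar = np.exp(a * delta)         # ZOH: Ā = exp(A·Δ)
b_bar = (a_bar - 1.0) / a * B     # B̄ = A⁻¹(Ā − I)·B, elementwise for diagonal A

ys = ssm_scan(a_bar, b_bar, C, np.array([1.0, 0.0, 0.0, 0.0]))  # impulse input
print(ys)  # the impulse response decays as the state leaks away
```

Because `a_bar` and `b_bar` never change inside the loop, an important token and a filler token update the state with exactly the same weights — the limitation discussed above.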

Figure: SSM Recurrence — State Evolution Over Time
ht = Ā·ht-1 + B̄·xt

Each state ht is computed from the previous state ht-1 and the current token xt. Information propagates left to right through the fixed recurrence.

The problem with naive SSMs is the state matrix A. For the recurrence to work well, A needs to be a stable dynamical system — it can't have eigenvalues that cause the state to explode or vanish. S4 (Gu et al., 2021) solved this by constraining A to a specific structure called the HIPPO matrix.

📐

HIPPO — High-order Polynomial Projection Operator: The HIPPO matrix constructs A such that the hidden state ht is provably an optimal polynomial approximation of the entire input history x1…xt. It's a mathematically principled way to compress an arbitrary-length past into a finite state while minimizing information loss. This is the theoretical backbone of the SSM revolution.

S4 also showed that the recurrence can be computed as a convolution during training — making it fully parallelizable on GPUs. At inference, you run it as a fast recurrence. This dual nature — parallel training, O(1) per-step inference — is what makes SSMs so appealing.
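This dual nature is easy to verify numerically: the same toy diagonal SSM, evaluated once as a recurrence and once as a causal convolution with kernel K[k] = C·Āᵏ·B̄, produces identical outputs. A sketch under the same toy assumptions as before:

```python
import numpy as np

# Same diagonal toy SSM evaluated two ways — recurrence vs convolution.
N, L, delta = 4, 6, 0.1
a = -np.arange(1.0, N + 1)
a_bar = np.exp(a * delta)             # Ā (diagonal, stored as a vector)
b_bar = (a_bar - 1.0) / a             # B̄ with B = 1
C = np.ones(N) / N
x = np.array([1.0, 2.0, 0.0, -1.0, 0.5, 0.0])

# 1) Recurrent form: O(1) state per step (inference mode)
h, ys_rec = np.zeros(N), []
for xt in x:
    h = a_bar * h + b_bar * xt
    ys_rec.append(C @ h)

# 2) Convolutional form: kernel K[k] = C·Āᵏ·B̄, causal conv (training mode)
K = np.array([C @ (a_bar ** k * b_bar) for k in range(L)])
ys_conv = [sum(K[k] * x[t - k] for k in range(t + 1)) for t in range(L)]

print(np.allclose(ys_rec, ys_conv))  # → True
```

S4 computes this convolution with an FFT, which is where the O(n log n) training cost comes from; at inference it falls back to the O(1)-per-step recurrence.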

S4 / Classic SSM Block
Output yt
C · ht
State ht = Ā·ht-1 + B̄·xt
Fixed matrices Ā, B̄, C — same for every position in the sequence
Input xt

Why Classic SSMs Still Fall Short

Despite the elegance, classic SSMs have one crippling weakness: the parameters Ā and B̄ are time-invariant. Every input token goes through the same linear transformation regardless of its content. Ask an SSM "what was the person's name mentioned at the beginning of this 10,000-word document?" and it has no way to selectively preserve that specific piece of information while discarding filler text. The compression is content-agnostic. Mamba was built to fix exactly this.

03
Mamba
Selective State Spaces — the model decides what to remember
O(n) memory · O(n) compute · Selective recurrence · Hardware-aware

Mamba (Gu & Dao, 2023) makes one surgical change to the classic SSM: it makes the parameters input-dependent. The matrices B, C and the step size Δ are no longer fixed — they are computed from the current input token. This is called a selective state space model.

This seemingly small change is actually enormous. A model where B, C, Δ vary per token can selectively absorb or ignore each input. It can hold onto a name for 10,000 tokens and then surface it exactly when needed. It can forget irrelevant filler words immediately. It behaves much more like attention — but in O(n) time.

Mamba — Selective SSM (S6)
Classic SSM: B, C, Δ are fixed parameters
Mamba: B, C, Δ are functions of the input xt

Δt = softplus( LinearΔ(xt) ) ← how long to "dwell" on this token
Bt = LinearB(xt) ← how strongly to absorb xt into state
Ct = LinearC(xt) ← what to read out of the state

State update (same recurrence, but now time-varying):
Āt = exp(A · Δt)
B̄t = A⁻¹ · (Āt − I) · Bt
ht = Āt · ht-1 + B̄t · xt
yt = Ct · ht

The key intuition: Δt is a "focus dial" per token. When Δ is large, Ā ≈ 0 and the state nearly resets — the model "forgets" previous context and focuses only on xt. When Δ is small, Ā ≈ I and the state changes little — the model "remembers" its history and largely ignores xt. This mimics what attention does, but without materializing the n×n matrix.
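A toy scalar version of this focus dial, where only Δ is input-dependent and the gate weights are hand-picked rather than learned — a hypothetical sketch, not real Mamba code:

```python
import numpy as np

def softplus(z):
    return np.log1p(np.exp(z))

def selective_scan(xs, w=2.0, bias=-5.0, a=-1.0):
    """Toy scalar selective SSM: Δt is computed from the input, so each
    token decides how strongly it is absorbed into the state."""
    h, hs = 0.0, []
    for x in xs:
        delta = softplus(w * abs(x) + bias)  # salient tokens → large Δ
        a_bar = np.exp(a * delta)            # Āt = exp(A·Δt)
        b_bar = (a_bar - 1.0) / a            # B̄t = A⁻¹(Āt − 1)·B, with B = 1
        h = a_bar * h + b_bar * x            # absorb x, or mostly keep h
        hs.append(h)
    return np.array(hs)

# One salient token (5.0) followed by near-zero filler: the state latches
# onto 5.0 and barely decays while the filler streams past.
hs = selective_scan([5.0, 0.01, 0.01, 0.01])
print(hs)
```

With a fixed Δ (the classic SSM of the previous section) the filler tokens would erode the stored value at the same rate as any other input; here the gate keeps Ā ≈ 1 for filler, so the memory survives.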

Selectivity — Different Δ per Token

Large Δ → model focuses on current token (resets state). Small Δ → model preserves memory (ignores current token). Mamba learns this per-token from content.

The Mamba block is structured around a gated design. The input goes through two parallel branches — one linear path and one selective SSM path — which are combined multiplicatively (a GLU-style gate). This is similar to how SwiGLU works in transformer FFN layers.

Mamba Block (one layer)
Output + Residual
LayerNorm
Linear projection (out)
BRANCH A — SSM
Selective SSM (S6)
Depthwise Conv1D
SiLU activation
BRANCH B — GATE
SiLU(Linear(x))
Linear expand (×2)
LayerNorm
Input x

Here's Mamba's second major innovation — one that's less discussed but equally important: the hardware-aware algorithm.

Making B and C input-dependent breaks the convolutional trick that made S4 fast. You can no longer compute the entire sequence as a single FFT-based convolution. The recurrence is now truly sequential — each ht depends on ht-1.

The naive approach: compute h0, h1, ..., hn sequentially. This works but is slow because it can't use GPU parallelism. Mamba solves this with a parallel prefix scan (parallel associative scan):

🔬

Parallel Scan Intuition: The recurrence ht = Āt·ht-1 + B̄t·xt is an associative operation. That means you can compute it in a binary tree fashion — pairs of elements first, then pairs of pairs — reducing the sequential depth from O(n) to O(log n) parallel steps. This is the same trick used in prefix sum / cumulative sum algorithms. On a GPU with thousands of cores, O(log n) parallel steps is orders of magnitude faster than O(n) sequential steps.
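A sketch of that idea: each step ht = aₜ·ht-1 + bₜ is an affine map, and composing two maps — (a₁,b₁) then (a₂,b₂) gives (a₂a₁, a₂b₁+b₂) — is associative, so prefixes can be combined pairwise in a tree. A recursive toy version (length restricted to powers of two for simplicity; real kernels handle arbitrary lengths):

```python
import numpy as np

def combine(e1, e2):
    """Compose two affine maps h ↦ a·h + b: applying e1 then e2
    gives (a2·a1, a2·b1 + b2). This operation is associative."""
    a1, b1 = e1
    a2, b2 = e2
    return (a2 * a1, a2 * b1 + b2)

def inclusive_scan(elems):
    """Recursive pairwise scan. Each level's combines are independent,
    so on real hardware the depth is O(log n) parallel steps."""
    n = len(elems)
    if n == 1:
        return list(elems)
    paired = [combine(elems[2*i], elems[2*i + 1]) for i in range(n // 2)]
    scanned = inclusive_scan(paired)      # prefix over pair-combined elems
    out = [elems[0]]
    for i in range(1, n):
        if i % 2 == 1:
            out.append(scanned[i // 2])
        else:
            out.append(combine(scanned[i // 2 - 1], elems[i]))
    return out

# With h_{-1} = 0, the scanned "b" slot at position t IS h_t.
rng = np.random.default_rng(1)
a = rng.uniform(0.5, 1.0, 8)
b = rng.standard_normal(8)

h, hs_sequential = 0.0, []
for at, bt in zip(a, b):
    h = at * h + bt
    hs_sequential.append(h)

hs_parallel = [bb for _, bb in inclusive_scan(list(zip(a, b)))]
print(np.allclose(hs_sequential, hs_parallel))  # → True
```

The sequential loop and the tree-structured scan agree exactly — which is the whole point: Mamba's recurrence can be reorganized for GPU parallelism without changing the result.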

💾

Memory-Efficient Execution: Mamba also avoids materializing the full state sequence in HBM (high-bandwidth memory). The kernel fuses the scan computation and keeps intermediate states in SRAM (on-chip cache, ~100× faster than HBM). This isn't just a software trick — it fundamentally changes the memory access pattern and is why Mamba achieves 5× higher throughput than a transformer of equivalent parameter count on long sequences.

Mamba is genuinely impressive for long sequences. On tasks like DNA modeling (sequences of 1M+ tokens), audio waveform modeling, and long-document processing, it matches or beats transformers at a fraction of the compute. Inference is O(1) per new token — the state size is constant regardless of how many tokens you've processed.

But Mamba isn't perfect. On tasks that require precise recall of specific information — especially from arbitrary positions in the past — transformers still win. If you need to copy a 20-digit number from the beginning of a 10,000-token prompt, attention can reach back directly. Mamba has to hope that number survived 10,000 recurrence steps without being compressed away.

The "Needle in a Haystack" Problem

A well-known evaluation benchmark hides a key fact ("The magic word is PINEAPPLE") deep inside a long document and asks about it at the end. Transformers ace this — attention directly retrieves the relevant token. Mamba-1 struggles significantly — the state compression often loses exact values. Mamba-2 and hybrid models (Jamba, Zamba) are specifically designed to address this failure mode.

04
Mamba 2 & Hybrids
Structured State Space Duality — and the best of both worlds
O(n) memory · Higher throughput · SSD / Hybrid · Tensor parallel

Mamba 2 (Dao & Gu, 2024) rethinks the architecture from a theoretical perspective and arrives at a new formulation called Structured State Space Duality (SSD). The central insight: a specific class of SSMs is mathematically equivalent to a form of linear attention with matrix-valued states.

This duality is profound. It means you can implement the same computation either as a recurrence (efficient at inference) or as a matrix multiplication (efficient at training) — and switch between them automatically. Mamba 2 gets:

Training Speed

By restructuring the inner loop to be matrix-multiplication heavy, Mamba 2 fits naturally into tensor cores on modern GPUs — the same hardware that transformers exploit. Training speed roughly doubles over Mamba 1.

Larger State

The SSD formulation allows expanding the state dimension from 16 (Mamba 1) to 64 or 128 without prohibitive cost. Larger state = more memory capacity = better recall.

Tensor Parallelism

Mamba 1 was hard to shard across multiple GPUs. The SSD formulation is naturally compatible with tensor parallelism, enabling multi-GPU scaling.

If SSMs are great at long-range context and transformers are great at precise recall, why not use both? That's exactly what hybrid models do — interleave Mamba layers with attention layers in a single model.

Model            | Architecture            | Attention Layers | Why It Works
Jamba (AI21)     | Mamba + MoE + Attn      | 1 in 8 layers    | Attention handles precise recall; SSM handles long context cheaply
Zamba (Zyphra)   | Mamba + shared Attn     | 1 shared block   | Single attention block is shared across layers — minimal overhead
Falcon Mamba     | Pure Mamba 1            | 0                | First Mamba model competitive with transformers at 7B scale
RWKV-6           | Linear Attn             | 0 (linear)       | Different approach to O(n) — reformulates attention as an RNN
Griffin (Google) | Linear recurrence + MQA | Mixed            | Real-gated linear recurrence with sliding-window attention
💡

The emerging consensus: pure Mamba excels on modalities where long-range context matters but precise verbatim recall is rare — audio, DNA sequences, time series, video. For language modeling — especially instruction following and retrieval — hybrid models with a small fraction of attention layers consistently beat pure Mamba at the same parameter count. The cost of those attention layers is surprisingly small when you only need 1 in 8.

The Numbers That Matter

Figure: Memory Complexity vs Sequence Length — Transformer O(n²), Mamba O(n), hybrids in between. At 1K tokens the curves are close; the O(n²)/O(n) ratio grows linearly with n, reaching roughly 100,000× at 100K tokens.

Dimension               | Transformer            | Classic SSM (S4)    | Mamba               | Mamba 2 / Hybrid
Compute complexity      | O(n²)                  | O(n log n)          | O(n)                | O(n)
Memory (sequence)       | O(n²)                  | O(n)                | O(1) per step       | O(1) + O(w)
Inference per token     | O(n), grows            | O(1), constant      | O(1), constant      | O(1) mostly
Selective memory        | ✅ Full attention      | ❌ Fixed recurrence | ✅ Input-dependent  | ✅ Both
Parallelizable training | ✅ Fully               | ✅ Via convolution  | ✅ Parallel scan    | ✅ Fully
Precise verbatim recall | ✅ Excellent           | ⚠️ Degrades         | ⚠️ Degrades (long)  | ✅ Good
Long context (>100K)    | ❌ Prohibitive         | ✅ Efficient        | ✅ Efficient        | ✅ Efficient
Positional encoding     | Required (RoPE etc.)   | Implicit via Δ      | Implicit via Δ      | Mixed
In-context learning     | ✅ Strong              | ⚠️ Weak             | ✅ Reasonable       | ✅ Strong
Multi-GPU scaling       | ✅ Mature              | ⚠️ Limited          | ⚠️ Difficult        | ✅ Via Mamba 2
Best use case           | General NLP, reasoning | Signal, time series | Audio, DNA, video   | General long-context

How We Got Here

Jun 2017
Attention Is All You Need
Vaswani et al. introduce the Transformer. Self-attention replaces RNNs. The field shifts overnight.
Transformer
Oct 2021
S4 — Structured State Spaces
Gu et al. show that SSMs with HIPPO initialization can match transformers on long-range tasks at linear cost. The SSM revival begins.
S4HIPPO
2022
S4 Variants — H3, Hyena, S5
A wave of improvements: H3 adds associative recall, Hyena uses implicit convolutions, S5 adds multi-input multi-output structure.
H3HyenaS5
Dec 2023
Mamba — Selective State Spaces
Gu & Dao make SSM parameters input-dependent. Mamba-3B matches Transformer-3B on language modeling and runs 5× faster. A landmark paper.
MambaS6
Mar 2024
Jamba (AI21 Labs)
First hybrid Mamba+Transformer+MoE model at 52B total / 12B active parameters. 256K context window. Ships as a commercial API.
JambaMoE
May 2024
Mamba 2 — Structured State Space Duality
Dao & Gu prove SSMs and linear attention are dual. Mamba 2 is 2–8× faster, supports larger state dimensions, and is tensor-parallelism ready.
Mamba 2SSD
Jun 2024
Falcon Mamba 7B
TII releases the first pure Mamba model competitive with Llama-3-8B on standard benchmarks. Validates Mamba at production scale.
Falcon Mamba
2025+
The Hybrid Era
Zamba 2, Griffin, RWKV-6, and many others. The consensus: for language, hybrids outperform pure SSMs and pure transformers at matched compute.
Zamba 2GriffinRWKV-6

When to Use What

🔭

Use Transformers when…

You need maximum reasoning quality. Short-to-medium context (≤32K). Tasks requiring precise fact recall, multi-step reasoning, or strong in-context learning. You have the GPU budget.

🎯

Use Mamba when…

Your sequences are long (>100K) or infinite (streaming). Modalities like audio, DNA, video where time-series dynamics matter. Real-time inference on edge devices where O(1) per-step cost is critical.

Use Hybrids when…

You need long context AND strong recall (the general case for language models). Hybrid models with 1-in-8 attention layers are now the pragmatic default for new model architecture research.

📡

Watch RWKV + xLSTM

Alternative linear attention families are maturing fast. RWKV-7, xLSTM, and linear transformers offer different tradeoff points. The space is far from settled.

🔬

My take: Attention won because language is genuinely structured — words relate to words across arbitrary distances, and lossless access to that structure matters. Mamba is not a replacement for attention — it's a complement. The most honest description of the field right now is: we are learning where linear recurrence is sufficient and where attention is necessary, and building models that use each appropriately. The transformer era isn't ending. It's being refined.