Attention is quadratic. The world is sequential. Here's how three paradigms — self-attention, structured state spaces, and selective state spaces — each try to model time, and why it matters more than ever.
Every sequence model — whether it's reading a paragraph, listening to audio, or watching a video — faces the same core challenge: information from earlier in the sequence needs to influence decisions later. How you solve that determines everything about your model's speed, memory, and ability to generalize.
Three fundamentally different answers have emerged. Transformers say: look at everything. SSMs say: compress into a fixed state. Mamba says: selectively decide what to remember. Each answer comes with profound tradeoffs.
- **Transformer:** Full attention over all tokens. Every position attends to every other. Lossless memory — but O(n²) cost that makes long sequences brutally expensive.
- **Classic SSM:** Fixed linear recurrence. Compress the entire past into a constant-size state vector. O(n) cost — but the compression is time-invariant, so the model can't selectively remember.
- **Mamba:** Selective state space. The parameters of the recurrence are input-dependent — the model learns what to keep and what to forget at each step. O(n) cost with context-aware memory.
- **Hybrid:** Mamba layers interleaved with attention layers. Best of both worlds — attention for precise recall, SSM for efficient long-range context. The direction the field is moving.
Transformers, introduced by Vaswani et al. in 2017, solve the memory problem in the most direct possible way: don't compress anything. Just look at everything. Self-attention lets every token directly attend to every other token in the sequence.
This is extraordinarily powerful. A word at position 500 can directly see a word at position 1 with zero information loss. There's no compression bottleneck, no degradation of signal over distance. This is why transformers utterly dominate NLP benchmarks — their memory is lossless.
For each token, you compute three vectors: a Query (what am I looking for?), a Key (what do I contain?), and a Value (what do I contribute?). Then:

Attention(Q, K, V) = softmax(Q·Kᵀ / √dₖ)·V
The attention matrix is n × n. For a sequence of 1,000 tokens, that's 1,000,000 entries. For 100,000 tokens, it's 10 billion entries. Memory and compute scale as the square of sequence length — O(n²). This is not a minor implementation issue. It is a fundamental architectural ceiling.
Each cell represents one attention score (Q·Kᵀ). The entire grid must fit in GPU memory simultaneously.
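The quadratic blow-up is easy to see in code. Here is a minimal NumPy sketch of scaled dot-product attention (single head, random Q/K/V, purely illustrative): the score matrix it materializes is exactly n × n.

```python
import numpy as np

def attention(Q, K, V):
    """Scaled dot-product attention: softmax(Q·Kᵀ/√d)·V.
    The `scores` matrix is n×n — this is the quadratic term."""
    d = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d)                        # shape (n, n)
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)       # row-wise softmax
    return weights @ V, weights

rng = np.random.default_rng(0)
n, d = 1_000, 64
Q, K, V = (rng.standard_normal((n, d)) for _ in range(3))
out, W = attention(Q, K, V)
print(W.shape)   # one score per token pair: 1,000,000 entries at n = 1,000
```

Doubling n quadruples the size of `W`; at n = 100,000 the same array would hold 10 billion entries, which is exactly the ceiling described above.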
At 128K context (what GPT-4 supports), a single attention matrix has ~17 billion entries per head (over 30 GB in float16). This is why long-context inference is so expensive: the KV cache grows linearly with context length, and because every new token attends over that entire cache, total attention compute grows quadratically.
During inference, transformers cache Key and Value matrices for all previous tokens. For Llama-3-70B with 128K context, the KV cache alone is on the order of 40 GB per request, even with grouped-query attention. This is why long-context serving is so expensive — it's not compute, it's memory. A handful of long conversations can exhaust the entire VRAM of a server node.
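The arithmetic is simple enough to sketch. This back-of-envelope calculator assumes a Llama-3-70B-like configuration (80 layers, grouped-query attention with 8 KV heads, head dimension 128, float16); the formula is general, the specific numbers are assumptions.

```python
def kv_cache_bytes(n_layers, n_kv_heads, head_dim, seq_len, bytes_per_elem=2):
    """Per-request KV cache size: two tensors (K and V) per layer,
    one (n_kv_heads × head_dim) slot per token."""
    return 2 * n_layers * n_kv_heads * head_dim * seq_len * bytes_per_elem

# Assumed Llama-3-70B-like config: 80 layers, GQA with 8 KV heads, head dim 128
gb = kv_cache_bytes(80, 8, 128, 128 * 1024) / 2**30
print(f"{gb:.0f} GiB")   # 40 GiB at 128K context in float16
```

Without grouped-query attention (i.e., caching all 64 query heads) the same request would need 8× more, which is why GQA is near-universal in long-context models.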
And yet — despite all of this — transformers dominate. The reason is simple: they are right. When you process language, any token really can depend on any earlier token. Compressing that arbitrarily is dangerous. The quadratic cost is the honest price of lossless sequence memory.
The question isn't whether transformers are right. It's whether you can approximate what they're doing — with O(n) cost — without losing too much.
State Space Models come from control theory — they've been used in signal processing, physics simulations, and robotics for decades. The idea is conceptually simple: maintain a hidden state that summarizes everything seen so far, and update it as new input arrives.
If Transformers are like a person who re-reads the entire book before answering each question, SSMs are like a person taking notes — compressing what they've read into a finite notebook that they update continuously.
This is a linear ODE of the kind used in physics and control theory: h′(t) = A·h(t) + B·x(t), with readout y(t) = C·h(t). The state h(t) evolves continuously. But neural networks are discrete — they process tokens one at a time — so you discretize it using a step size Δ (delta). With the standard zero-order-hold rule, Ā = exp(Δ·A) and B̄ = (Δ·A)⁻¹(exp(Δ·A) − I)·Δ·B, giving the discrete recurrence hₜ = Ā·hₜ₋₁ + B̄·xₜ, yₜ = C·hₜ.
Look at that recurrence carefully. hₜ = Ā·hₜ₋₁ + B̄·xₜ is just like an RNN cell — but the matrices Ā and B̄ are fixed for the entire sequence. They don't depend on the input. This is the fundamental limitation of classic SSMs: they treat every position identically. The model can't "pay more attention" to an important word vs. a filler word.

Each state hₜ is computed from the previous state hₜ₋₁ and the current token xₜ. Information propagates left to right through the fixed recurrence.
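The whole recurrence fits in a few lines of NumPy. This sketch uses tiny, hand-picked matrices (a scalar input, a 4-dimensional state) just to make the fixed-parameter update concrete.

```python
import numpy as np

def ssm_scan(A_bar, B_bar, C, x):
    """Discrete SSM recurrence: hₜ = Ā·hₜ₋₁ + B̄·xₜ, yₜ = C·hₜ.
    Ā and B̄ are FIXED for the whole sequence — the classic-SSM limitation."""
    h = np.zeros(A_bar.shape[0])
    ys = []
    for x_t in x:                        # one O(1) state update per token
        h = A_bar @ h + B_bar * x_t
        ys.append(C @ h)
    return np.array(ys)

N = 4                                    # state size: constant, regardless of length
A_bar = 0.9 * np.eye(N)                  # a stable, fixed transition
B_bar = np.ones(N)
C = np.ones(N) / N
y = ssm_scan(A_bar, B_bar, C, x=np.array([1.0, 0.0, 0.0]))
# An impulse at t=0 decays geometrically through the fixed recurrence:
# y = [1.0, 0.9, 0.81]
```

Note what the loop does *not* do: it never looks back at earlier tokens. Everything it knows about the past lives in the N numbers of `h`.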
The problem with naive SSMs is the state matrix A. For the recurrence to work well, A needs to be a stable dynamical system — it can't have eigenvalues that cause the state to explode or vanish. S4 (Gu et al., 2021) solved this by constraining A to a specific structure called the HIPPO matrix.
HIPPO — High-order Polynomial Projection Operator: The HIPPO matrix constructs A such that the hidden state hₜ is provably an optimal polynomial approximation of the entire input history x₁…xₜ. It's a mathematically principled way to compress an arbitrary-length past into a finite state while minimizing information loss. This is the theoretical backbone of the SSM revolution.
S4 also showed that the recurrence can be computed as a convolution during training — making it fully parallelizable on GPUs. At inference, you run it as a fast recurrence. This dual nature — parallel training, O(1) per-step inference — is what makes SSMs so appealing.
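Because Ā and B̄ never change, the recurrence unrolls into a convolution with kernel K̄ₖ = C·Āᵏ·B̄. Here is a small NumPy sketch (random stable matrices, illustrative sizes) checking that the sequential path and the convolutional path compute identical outputs:

```python
import numpy as np

rng = np.random.default_rng(1)
N, L = 4, 8                              # state size, sequence length
A = 0.8 * np.eye(N) + 0.05 * rng.standard_normal((N, N))   # fixed, stable-ish Ā
B = rng.standard_normal(N)               # fixed B̄
C = rng.standard_normal(N)
x = rng.standard_normal(L)

# Path 1: sequential recurrence (how you'd run inference)
h, y_rec = np.zeros(N), []
for x_t in x:
    h = A @ h + B * x_t
    y_rec.append(C @ h)
y_rec = np.array(y_rec)

# Path 2: convolution with kernel K̄ₖ = C·Āᵏ·B̄ (how S4 trains in parallel)
kernel = np.array([C @ np.linalg.matrix_power(A, k) @ B for k in range(L)])
y_conv = np.array([kernel[:t + 1][::-1] @ x[:t + 1] for t in range(L)])

assert np.allclose(y_rec, y_conv)        # same outputs, two execution modes
```

S4 computes this convolution with an FFT rather than the explicit loop shown here, but the equivalence is the point: parallel training, recurrent inference.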
Despite the elegance, classic SSMs have one crippling weakness: the parameters Ā and B̄ are time-invariant. Every input token goes through the same linear transformation regardless of its content. Ask an SSM "what was the person's name mentioned at the beginning of this 10,000-word document?" and it has no way to selectively preserve that specific piece of information while discarding filler text. The compression is content-agnostic. Mamba was built to fix exactly this.
Mamba (Gu & Dao, 2023) makes one surgical change to the classic SSM: it makes the parameters input-dependent. The matrices B, C and the step size Δ are no longer fixed — they are computed from the current input token. This is called a selective state space model.
This seemingly small change is actually enormous. A model where B, C, Δ vary per token can selectively absorb or ignore each input. It can hold onto a name for 10,000 tokens and then surface it exactly when needed. It can forget irrelevant filler words immediately. It behaves much more like attention — but in O(n) time.
The key intuition: Δₜ is a "focus dial" per token. When Δ is large, Ā ≈ 0 and the state nearly resets — the model "forgets" previous context and focuses only on xₜ. When Δ is small, Ā ≈ I and the state changes little — the model "remembers" its history and largely ignores xₜ. This mimics what attention does, but without materializing the n×n matrix.
Large Δ → model focuses on current token (resets state). Small Δ → model preserves memory (ignores current token). Mamba learns this per-token from content.
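A scalar toy model makes the focus dial tangible. This is not Mamba's learned projection machinery — the per-token Δ values below are chosen by hand, and the input gate is a simplified zero-order-hold stand-in — but it shows the mechanism: with tiny Δ on filler tokens a stored fact survives; with large Δ it is wiped.

```python
import numpy as np

a = -1.0                                    # stable continuous-time dynamics (a < 0)

def selective_scan(x, deltas):
    """Scalar selective SSM: Δ is chosen PER TOKEN, so Āₜ = exp(Δₜ·a) varies.
    Large Δₜ → Āₜ ≈ 0 (reset/forget); small Δₜ → Āₜ ≈ 1 (retain)."""
    h, hs = 0.0, []
    for x_t, d_t in zip(x, deltas):
        A_bar = np.exp(d_t * a)             # per-token decay
        B_bar = 1.0 - A_bar                 # illustrative input gate
        h = A_bar * h + B_bar * x_t
        hs.append(h)
    return np.array(hs)

x = np.array([5.0, 0.0, 0.0, 0.0])          # a "fact" at t=0, then filler
remember = selective_scan(x, deltas=[10.0, 1e-3, 1e-3, 1e-3])
forget   = selective_scan(x, deltas=[10.0, 10.0, 10.0, 10.0])
# remember[-1] stays near 5.0; forget[-1] collapses toward 0.
```

A classic SSM cannot express the `remember` trajectory at all: its single fixed Δ forces the same decay on the fact and on the filler.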
The Mamba block is structured around a gated design. The input goes through two parallel branches — one linear path and one selective SSM path — which are combined multiplicatively (a GLU-style gate). This is similar to how SwiGLU works in transformer FFN layers.
Here's Mamba's second major innovation — one that's less discussed but equally important: the hardware-aware algorithm.
Making B and C input-dependent breaks the convolutional trick that made S4 fast. You can no longer compute the entire sequence as a single FFT-based convolution. The recurrence is now truly sequential — each hₜ depends on hₜ₋₁.

The naive approach: compute h₀, h₁, …, hₙ sequentially. This works but is slow because it can't use GPU parallelism. Mamba solves this with a parallel prefix scan (parallel associative scan):
Parallel Scan Intuition: The recurrence hₜ = Āₜ·hₜ₋₁ + B̄ₜ·xₜ is an associative operation. That means you can compute it in a binary tree fashion — pairs of elements first, then pairs of pairs — reducing the sequential depth from O(n) to O(log n) parallel steps. This is the same trick used in prefix sum / cumulative sum algorithms. On a GPU with thousands of cores, O(log n) parallel steps is orders of magnitude faster than O(n) sequential steps.
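The associativity is worth seeing concretely. Each step hₜ = aₜ·hₜ₋₁ + bₜ is an affine map h → a·h + b, and two such maps compose into another one: ((a₁,b₁), (a₂,b₂)) → (a₁a₂, a₂b₁+b₂). This scalar sketch runs a Hillis–Steele style scan over those maps (a simplified stand-in for Mamba's fused CUDA kernel) and checks it against the sequential loop:

```python
import numpy as np

def combine(l, r):
    """Compose two affine maps h → a·h + b: apply l first, then r."""
    a1, b1 = l
    a2, b2 = r
    return (a1 * a2, a2 * b1 + b2)

def parallel_scan(a, b):
    """Hillis–Steele inclusive scan: O(log n) parallel rounds instead of n
    sequential steps. Element i ends up holding the composition of maps 0..i."""
    elems = list(zip(a, b))
    n, shift = len(elems), 1
    while shift < n:                      # ⌈log₂ n⌉ rounds total
        elems = [elems[i] if i < shift
                 else combine(elems[i - shift], elems[i])
                 for i in range(n)]
        shift *= 2
    # Applying each prefix-composite to h₋₁ = 0 leaves hₜ in the b-slot.
    return np.array([b_t for _, b_t in elems])

rng = np.random.default_rng(2)
a, b = rng.uniform(0.5, 1.0, 16), rng.standard_normal(16)

h, ref = 0.0, []                          # sequential reference
for a_t, b_t in zip(a, b):
    h = a_t * h + b_t
    ref.append(h)

assert np.allclose(parallel_scan(a, b), ref)
```

On a GPU each round's combines run simultaneously, so the 16-element scan above would take 4 parallel rounds rather than 16 dependent steps.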
Memory-Efficient Execution: Mamba also avoids materializing the full state sequence in HBM (high-bandwidth memory). The kernel fuses the scan computation and keeps intermediate states in SRAM (on-chip cache, roughly an order of magnitude faster than HBM). This isn't just a software trick — it fundamentally changes the memory access pattern and is why Mamba achieves about 5× higher inference throughput than a transformer of equivalent parameter count on long sequences.
Mamba is genuinely impressive for long sequences. On tasks like DNA modeling (sequences of 1M+ tokens), audio waveform modeling, and long-document processing, it matches or beats transformers at a fraction of the compute. Inference is O(1) per new token — the state size is constant regardless of how many tokens you've processed.
But Mamba isn't perfect. On tasks that require precise recall of specific information — especially from arbitrary positions in the past — transformers still win. If you need to copy a 20-digit number from the beginning of a 10,000-token prompt, attention can reach back directly. Mamba has to hope that number survived 10,000 recurrence steps without being compressed away.
A well-known evaluation, the "needle in a haystack" test, hides a key fact ("The magic word is PINEAPPLE") deep inside a long document and asks about it at the end. Transformers ace this — attention directly retrieves the relevant token. Mamba-1 struggles significantly — the state compression often loses exact values. Mamba-2 and hybrid models (Jamba, Zamba) are specifically designed to address this failure mode.
Mamba 2 (Dao & Gu, 2024) rethinks the architecture from a theoretical perspective and arrives at a new formulation called Structured State Space Duality (SSD). The central insight: a specific class of SSMs is mathematically equivalent to a form of linear attention with matrix-valued states.
This duality is profound. It means you can implement the same computation either as a recurrence (efficient at inference) or as a matrix multiplication (efficient at training) — and switch between them automatically. Mamba 2 gets:
- **Tensor-core-friendly training:** By restructuring the inner loop to be matrix-multiplication heavy, Mamba 2 fits naturally onto the tensor cores of modern GPUs — the same hardware that transformers exploit. Training speed roughly doubles over Mamba 1.
- **Larger state dimension:** The SSD formulation allows expanding the state dimension from 16 (Mamba 1) to 64 or 128 without prohibitive cost. Larger state = more memory capacity = better recall.
- **Multi-GPU scaling:** Mamba 1 was hard to shard across multiple GPUs. The SSD formulation is naturally compatible with tensor parallelism, enabling multi-GPU scaling.
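The duality itself can be demonstrated in a few lines for the scalar-decay case SSD analyzes. This toy sketch (random inputs, illustrative sizes, not the blocked SSD algorithm) computes the same outputs two ways: as a recurrence, and as masked "linear attention" where a lower-triangular decay matrix M plays the role of the attention mask.

```python
import numpy as np

rng = np.random.default_rng(3)
T, N = 6, 4                              # sequence length, state size
a = rng.uniform(0.8, 1.0, T)             # per-token scalar decay (SSD's scalar-A case)
B = rng.standard_normal((T, N))          # input-dependent B_t
C = rng.standard_normal((T, N))          # input-dependent C_t
x = rng.standard_normal(T)

# Recurrent form (inference-friendly): hₜ = aₜ·hₜ₋₁ + Bₜ·xₜ,  yₜ = Cₜ·hₜ
h, y_rec = np.zeros(N), []
for t in range(T):
    h = a[t] * h + B[t] * x[t]
    y_rec.append(C[t] @ h)
y_rec = np.array(y_rec)

# Dual "linear attention" form (training-friendly): y = (M ∘ C·Bᵀ)·x,
# where M[t, s] = aₛ₊₁·…·aₜ for s ≤ t is a lower-triangular decay mask.
cum = np.cumprod(a)
M = np.tril(np.outer(cum, 1.0 / cum))    # M[t, s] = cum[t] / cum[s]
y_att = (M * (C @ B.T)) @ x

assert np.allclose(y_rec, y_att)         # one computation, two execution modes
```

The matmul form is what lands on tensor cores during training; the recurrence is what runs token-by-token at inference. Switching between them is the "duality" in SSD.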
If SSMs are great at long-range context and transformers are great at precise recall, why not use both? That's exactly what hybrid models do — interleave Mamba layers with attention layers in a single model.
| Model | Architecture | Attention Layers | Why It Works |
|---|---|---|---|
| Jamba (AI21) | Mamba + MoE + Attn | 1 in 8 layers | Attention handles precise recall; SSM handles long context cheaply |
| Zamba (Zyphra) | Mamba + shared Attn | 1 shared block | Single attention block is shared across layers — minimal overhead |
| Falcon Mamba | Pure Mamba 1 | 0 | First Mamba model competitive with transformers at 7B scale |
| RWKV-6 | Linear Attn | 0 (linear) | Different approach to O(n) — reformulates attention as RNN |
| Griffin (Google) | Linear Recurrence + MQA | Mixed | Real-gated linear recurrence with sliding window attention |
The emerging consensus: pure Mamba excels on modalities where long-range context matters but precise verbatim recall is rare — audio, DNA sequences, time series, video. For language modeling — especially instruction following and retrieval — hybrid models with a small fraction of attention layers consistently beat pure Mamba at the same parameter count. The cost of those attention layers is surprisingly small when you only need 1 in 8.
At 1K tokens the costs are comparable. By 100K tokens, attention's quadratic cost has grown ~10,000× while a linear model's has grown only ~100×, and the gap keeps widening with length.
| Dimension | Transformer | Classic SSM (S4) | Mamba | Mamba 2 / Hybrid |
|---|---|---|---|---|
| Compute complexity | O(n²) | O(n log n) | O(n) | O(n) |
| Memory (sequence) | O(n²) | O(n) | O(1) per step | O(1) + O(w) |
| Inference per token | O(n) grows | O(1) constant | O(1) constant | O(1) mostly |
| Selective memory | ✅ Full attention | ❌ Fixed recurrence | ✅ Input-dependent | ✅ Both |
| Parallelizable training | ✅ Fully | ✅ Via convolution | ✅ Parallel scan | ✅ Fully |
| Precise verbatim recall | ✅ Excellent | ⚠️ Degrades | ⚠️ Degrades (long) | ✅ Good |
| Long context (>100K) | ❌ Prohibitive | ✅ Efficient | ✅ Efficient | ✅ Efficient |
| Positional encoding | Required (RoPE etc.) | Implicit via Δ | Implicit via Δ | Mixed |
| In-context learning | ✅ Strong | ⚠️ Weak | ✅ Reasonable | ✅ Strong |
| Multi-GPU scaling | ✅ Mature | ⚠️ Limited | ⚠️ Difficult | ✅ Via Mamba 2 |
| Best use case | General NLP, reasoning | Signal, time series | Audio, DNA, video | General long-context |
**Choose a transformer when:** You need maximum reasoning quality. Short-to-medium context (≤32K). Tasks requiring precise fact recall, multi-step reasoning, or strong in-context learning. You have the GPU budget.

**Choose Mamba when:** Your sequences are long (>100K) or infinite (streaming). Modalities like audio, DNA, video where time-series dynamics matter. Real-time inference on edge devices where O(1) per-step cost is critical.

**Choose a hybrid when:** You need long context AND strong recall (the general case for language models). Hybrid models with 1-in-8 attention layers are now the pragmatic default for new model architecture research.
Alternative linear attention families are maturing fast. RWKV-7, xLSTM, and linear transformers offer different tradeoff points. The space is far from settled.
My take: Attention won because language is genuinely structured — words relate to words across arbitrary distances, and lossless access to that structure matters. Mamba is not a replacement for attention — it's a complement. The most honest description of the field right now is: we are learning where linear recurrence is sufficient and where attention is necessary, and building models that use each appropriately. The transformer era isn't ending. It's being refined.