Parallelizing recurrence with DeltaNet

Written by brainoid labs · Apr 06, 2026

Breaking the Quadratic Barrier: How DeltaNet and Parallelized Recurrence are Scaling LLMs

The AI industry is currently caught in a tug-of-war between two architectures. On one side, we have Transformers, the undisputed kings of accuracy, held back by the “Quadratic Tax”: doubling your sequence length quadruples your compute cost. On the other, Linear RNNs and State Space Models (SSMs) like Mamba, which offer lightning-fast, constant-memory inference but often stumble on “recall-intensive” tasks like long-document synthesis or complex RAG.

Enter DeltaNet. Recently highlighted at NeurIPS, this architecture isn’t just another “Linear Transformer” clone. It represents a fundamental shift in how we handle associative recall without the massive overhead of a KV cache.


The Problem: The “Unwieldy” KV Cache

In standard Softmax Attention, every new token must look back at every previous token. This requires storing Key/Value (KV) vectors for every single element in the sequence. For long-context windows (100k+ tokens), this KV cache becomes a memory monster that is difficult to manage on modern GPU clusters.
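To see why the cache becomes unmanageable, here is a back-of-the-envelope calculation. The model dimensions below (32 layers, 32 KV heads, head dimension 128, fp16) are illustrative assumptions for a 7B-class dense Transformer, not figures from the DeltaNet paper:

```python
# Back-of-the-envelope KV-cache size for a hypothetical 7B-class model.
# All dimensions are illustrative assumptions, not measured figures.

def kv_cache_bytes(seq_len, n_layers=32, n_kv_heads=32, head_dim=128,
                   dtype_bytes=2):
    """Bytes needed to store K and V for one sequence at fp16."""
    # 2 tensors (K and V) per layer, one head_dim vector per head per token
    return 2 * n_layers * n_kv_heads * head_dim * dtype_bytes * seq_len

for tokens in (4_096, 32_768, 131_072):
    gib = kv_cache_bytes(tokens) / 2**30
    print(f"{tokens:>7} tokens -> {gib:6.1f} GiB per sequence")
```

Under these assumptions, a single 128k-token sequence already needs 64 GiB of cache in fp16, which is why long-context serving tends to be memory-bound rather than compute-bound.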

While Linear Attention tried to solve this by using a dot-product kernel to enable RNN-style inference, it historically lacked the “memory” to compete with Transformers on retrieval tasks.
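For reference, vanilla linear attention can be sketched as a recurrence over a fixed-size state. The shapes and the identity feature map below are simplifying assumptions to keep the sketch short:

```python
import numpy as np

def linear_attention_step(S, k, v, q):
    """One recurrent step of (unnormalized) linear attention.
    S: (d_v, d_k) running state; k, q: (d_k,); v: (d_v,).
    The state only ever *accumulates* outer products -- a value previously
    associated with a key is never overwritten, which is one reason pure
    linear attention struggles on recall tasks."""
    S = S + np.outer(v, k)   # additive write
    o = S @ q                # read-out for the current query
    return S, o
```

Writing two different values to the same key leaves their sum in the state, so the read-out mixes stale and fresh information.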

The Solution: The Delta Rule & Householder Transformations

DeltaNet upgrades the linear recurrence by using a delta rule-like update. Instead of just adding information to a hidden state, it actively retrieves and updates value vectors associated with specific keys.
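A minimal sketch of that update, assuming a state matrix S of shape (d_v, d_k) and a scalar write strength β per token (the function and variable names here are ours, not the paper's):

```python
import numpy as np

def delta_rule_step(S, k, v, q, beta):
    """One DeltaNet-style recurrent step (a sketch, not the paper's kernel).
    S: (d_v, d_k) associative memory; beta in [0, 1] is the write strength."""
    v_old = S @ k                           # retrieve value currently bound to k
    S = S + beta * np.outer(v - v_old, k)   # correct the binding toward v
    o = S @ q                               # read-out for the current query
    return S, o
```

With β = 1 and a unit-norm key, writing a new value to an existing key fully replaces the old binding instead of summing with it, which is exactly the behavior vanilla linear attention lacks.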

The Mathematical Breakthrough

The real challenge with DeltaNet was training efficiency. The original sequential algorithm was a nightmare for GPU parallelization. The researchers solved this by reparameterizing DeltaNet as a matrix-valued RNN using generalized Householder transformations.

By leveraging the compact WY representation for products of Householder matrices, they eliminated the need to materialize massive hidden states at every timestep.

W_t = W_{t-1} + α_t (v_t − W_{t-1} k_t) k_tᵀ

This allows for a chunkwise parallel strategy, making it possible to train DeltaNets on massive datasets with the same hardware efficiency as standard Transformers.
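To make the WY idea concrete, the following sketch numerically checks that a product of generalized Householder factors (I − β_t k_t k_tᵀ) collapses into a compact form I − Wᵀ K, where each row of W is built by a cheap recurrence. This is an illustration of the identity, not the paper's chunkwise training kernel:

```python
import numpy as np

rng = np.random.default_rng(0)
d, T = 8, 16
K = rng.standard_normal((T, d))
K /= np.linalg.norm(K, axis=1, keepdims=True)   # unit-norm keys
beta = rng.uniform(0.0, 1.0, size=T)            # per-token write strengths

# Explicit product of generalized Householder factors, left to right:
#   P = (I - b_1 k_1 k_1^T)(I - b_2 k_2 k_2^T)...(I - b_T k_T k_T^T)
P = np.eye(d)
for t in range(T):
    P = P @ (np.eye(d) - beta[t] * np.outer(K[t], K[t]))

# Compact WY form: the same product equals I - W^T K, where row w_t of W
# comes from a cheap recurrence -- no d x d matrix materialized per step.
W = np.zeros((T, d))
for t in range(T):
    W[t] = beta[t] * (K[t] - W[:t].T @ (K[:t] @ K[t]))

print(np.allclose(P, np.eye(d) - W.T @ K))  # prints True
```

The per-step cost of the recurrence is a (t × d) matrix-vector product rather than a (d × d) matrix-matrix product, which is what keeps chunkwise training hardware-efficient.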


Performance: How Does it Stack Up?

In benchmarks involving 1.3B parameter models trained on 100B tokens, DeltaNet didn’t just participate—it led the pack.

Model                Language Modeling   Zero-Shot Tasks   In-Context Retrieval
Mamba (SSM)          Strong              Good              Average
GLA (Gated Linear)   Strong              Good              Good
DeltaNet             Superior            Superior          Excellent

Beyond pure linear models, the researchers experimented with hybrids. By combining DeltaNet layers with sliding-window attention, they created models that outperform ordinary Transformers while maintaining a significantly leaner memory footprint.


Why This Matters for the Next Generation of Agents

As we move toward “Computer Use Agents” and autonomous researchers, the ability to recall specific details from a 10,000-step trajectory is non-negotiable. DeltaNet provides a path toward:

  • Sub-1ms Latency: High-performance inference without the linear growth of the KV cache.
  • Infinite Context: Constant-memory footprints that don’t choke on long-form data.
  • True Associative Recall: The ability to “remember” key-value pairs as effectively as a Transformer.
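The associative-recall point can be illustrated with a toy experiment: bind several key→value pairs into one fixed-size state via the delta rule, then read each value back. Orthonormal keys and β = 1 make the example exact; a trained model only approximates this:

```python
import numpy as np

# Toy associative-recall check: bind several key -> value pairs into one
# fixed-size state with the delta rule, then read each value back.
# Orthonormal keys and beta = 1 make recall exact; this is an idealized
# illustration, not a trained model.
d_k, d_v, n_pairs = 6, 4, 6
keys = np.eye(d_k)[:n_pairs]                 # orthonormal keys
vals = np.random.default_rng(1).standard_normal((n_pairs, d_v))

S = np.zeros((d_v, d_k))
for k, v in zip(keys, vals):
    S += np.outer(v - S @ k, k)              # beta = 1 delta-rule write

recovered = np.stack([S @ k for k in keys])
print(np.allclose(recovered, vals))          # prints True
```

Note the state holds d_v × d_k numbers regardless of how many tokens have been processed; recall degrades gracefully as keys stop being orthogonal, rather than the cache growing without bound.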

Technical Note: If you are building high-performance inference kernels, the use of Householder reparameterization here is a masterclass in I/O optimization. By avoiding the materialization of matrix-sized hidden states, DeltaNet stays within the high-speed SRAM limits of modern accelerators.

Are you looking to implement these linear recurrence layers in a custom CUDA/Triton environment for your own models?