
Residual Connections Have Been Broken This Whole Time

author Jonathan Conway
timestamp 4 April 2026
classification public

I’m late to this one. Kimi’s Attention Residuals paper dropped in late February and I only got around to reading it properly last week. But it kept nagging at me because it touches something fundamental about how transformers propagate information through depth, and it connects directly to problems we think about constantly in agent memory systems.

The short version: every modern LLM uses residual connections that treat all previous layers equally. Layer 1’s output and layer 50’s output get blended together with the same weight, no matter what. This means the model can’t prioritize what matters. Early computations get buried under the noise of everything that came after. Later layers are forced to shout louder and louder just to be heard. The result is wasted capacity, unstable training, and models that are worse at tasks requiring compositional reasoning. The Kimi team finally did something about it.

The Problem Nobody Talks About

Here’s what a standard residual connection does. At each layer, you take the previous hidden state, run it through the layer’s transformation (attention or FFN), and add the result back:

h_l = h_{l-1} + f_{l-1}(h_{l-1})

Simple. Elegant. The thing that made training deep networks possible in the first place (He et al., 2015). But unroll that recurrence and look at what layer 32 actually receives:

h_32 = h_1 + f_1(h_1) + f_2(h_2) + ... + f_31(h_31)

Every single layer output gets added with a weight of exactly 1. The embedding. Layer 1’s output. Layer 17’s output. Layer 31’s output. All treated identically. There is no mechanism for layer 32 to say “I care a lot about what layer 4 computed but layer 12’s output is irrelevant to me right now.”
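To make the unrolling concrete, here is a minimal Python sketch. The layer functions are hypothetical stand-ins (a tanh of a scaled input), not anything from the paper. The only point is that running the recurrence and summing every layer output with a weight of 1 produce the same vector.

```python
import math

def make_layer(k):
    """Hypothetical stand-in for f_k: any deterministic vector-to-vector map works."""
    def f(h):
        return [math.tanh(0.1 * k * x + 0.01 * k) for x in h]
    return f

def run_residual_stack(h1, layers):
    """Apply h_l = h_{l-1} + f_{l-1}(h_{l-1}) and record every layer output."""
    h = list(h1)
    outputs = []
    for f in layers:
        out = f(h)
        outputs.append(out)
        h = [a + b for a, b in zip(h, out)]
    return h, outputs

h1 = [0.5, -1.0, 2.0, 0.0]                      # toy embedding, d = 4
layers = [make_layer(k) for k in range(1, 32)]  # f_1 ... f_31
h32, outputs = run_residual_stack(h1, layers)

# Unrolled form: the embedding plus every layer output, each with weight exactly 1.
unit_weights = [1.0] * len(outputs)
unrolled = list(h1)
for w, out in zip(unit_weights, outputs):
    unrolled = [a + w * b for a, b in zip(unrolled, out)]

max_gap = max(abs(a - b) for a, b in zip(h32, unrolled))
```

Here max_gap is zero up to floating-point rounding. The list unit_weights is the whole story: there is nothing in it for a layer to adjust.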

This sounds theoretical until you look at what happens in practice.

The PreNorm Dilution Problem

With PreNorm (the dominant paradigm since GPT-2), the hidden state magnitude grows as O(L) with depth. The norm of the accumulated residual stream gets bigger and bigger. Each new layer’s contribution gets proportionally smaller relative to the whole. By the time you’re 50 layers deep, early layer information is buried under the sheer magnitude of everything that came after it.

The Kimi team’s paper phrases this well: “early-layer information is buried and cannot be selectively retrieved.” They also cite work showing that a significant fraction of layers in deep networks can be pruned with minimal loss. Those layers aren’t contributing much. They’re just adding noise to an already bloated residual stream.
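A toy illustration of the dilution effect, under loud assumptions: the "layer outputs" below are independent Gaussian vectors, not outputs of a trained network. With independent outputs the stream norm grows roughly like the square root of depth, which is milder than the O(L) growth described above, so this sketch shows only the qualitative point that each new contribution becomes a smaller share of the stream.

```python
import math
import random

def norm(v):
    return math.sqrt(sum(x * x for x in v))

random.seed(0)
d, depth = 64, 50                     # hypothetical width and depth
stream = [random.gauss(0.0, 1.0) for _ in range(d)]   # toy embedding

stream_norms = []   # norm of the residual stream after each layer
shares = []         # norm of the newest output relative to the stream it joins
for _ in range(depth):
    out = [random.gauss(0.0, 1.0) for _ in range(d)]  # stand-in layer output
    stream = [a + b for a, b in zip(stream, out)]
    stream_norms.append(norm(stream))
    shares.append(norm(out) / norm(stream))
```

Comparing shares[0] with shares[-1] shows the newest layer shrinking relative to the accumulated stream, even though every output has about the same norm.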

Three Specific Failure Modes

The paper identifies three structural problems with fixed unit weight residuals:

No selective access. Attention layers and MLP layers receive the same accumulated state, even though they do very different things. An attention layer routing information between positions and an MLP layer performing feature transformation both get the same compressed history. There’s no way for either to pick and choose what prior computation matters for its task.

Irreversible loss. Once information is aggregated into the residual stream, it’s gone. If layer 8 computed something useful but layers 9 through 15 piled on top of it, layer 16 can’t dig it back out. The aggregation is destructive and irreversible.

Output growth. Later layers are forced to produce increasingly large outputs just to remain visible against the growing residual. This destabilizes training. You can see it empirically: output magnitudes grow monotonically with depth in standard PreNorm transformers.

Attention Residuals: The Fix

The insight comes from a duality that, in retrospect, feels obvious. RNNs had this same problem over the sequence dimension. They compressed all prior timesteps into a single hidden state, and each new timestep couldn’t selectively access earlier ones. The Transformer fixed this by replacing the recurrence with attention over all previous positions.

Attention Residuals (AttnRes) does the same thing, but over depth instead of sequence length. Instead of adding all previous layer outputs with weight 1, each layer computes learned softmax attention weights over all prior outputs:

h_l = sum_{i=0}^{l-1} alpha_{i->l} * v_i

where v_i is an earlier output and the weights alpha_{i->l} come from a softmax over i, so they sum to 1 for each layer l.

Each layer gets a single learned pseudo-query vector w_l (just d dimensions, one per layer). It computes dot-product attention against RMSNorm’d keys from all previous layer outputs. The softmax normalization means the weights are bounded, summing to 1, so output magnitudes stay controlled regardless of depth.
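The mechanism described above can be sketched in a few lines of plain Python. This is a minimal illustration, not the paper’s implementation: the RMSNorm here has no learned gain, the scores are unscaled dot products, and the vectors and the query w_l are made-up values.

```python
import math

def rms_norm(v, eps=1e-6):
    """RMSNorm without a learned gain (an assumption to keep the sketch short)."""
    rms = math.sqrt(sum(x * x for x in v) / len(v) + eps)
    return [x / rms for x in v]

def softmax(scores):
    top = max(scores)                      # subtract the max for numerical stability
    exps = [math.exp(s - top) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def attn_res_input(query, prior_outputs):
    """Mix all prior outputs with softmax weights from one pseudo-query."""
    keys = [rms_norm(v) for v in prior_outputs]
    scores = [sum(q * k for q, k in zip(query, key)) for key in keys]
    alphas = softmax(scores)
    h = [sum(a * v[j] for a, v in zip(alphas, prior_outputs))
         for j in range(len(query))]
    return h, alphas

def norm(v):
    return math.sqrt(sum(x * x for x in v))

# Hypothetical prior outputs v_0, v_1, v_2 and a pseudo-query for the current layer.
prior = [
    [1.0, 0.0, 0.0, 0.0],
    [0.0, 3.0, 0.0, 0.0],
    [0.0, 0.0, -2.0, 1.0],
]
w_l = [0.0, 2.0, 0.0, 0.0]
h_l, alphas = attn_res_input(w_l, prior)
```

The query lines up with the second source, so that source receives most of the weight (roughly 0.96). Because the result is a convex combination, norm(h_l) cannot exceed the largest norm among the sources, which is the bounded-magnitude property in miniature.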

Figure: Three Approaches to Residual Connections

The parameter overhead is negligible: one RMSNorm and one d-dimensional query vector per layer. In a 48B parameter model, this is rounding error.
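A back-of-the-envelope check of that claim. The layer count and width below are hypothetical, since this post does not state Kimi Linear’s actual shape, and the sketch assumes the RMSNorm carries one gain vector of size d.

```python
def attn_res_extra_params(num_layers, d_model):
    """One RMSNorm gain vector plus one pseudo-query vector per layer, each of size d_model."""
    return num_layers * 2 * d_model

# Hypothetical shape, chosen only to get an order of magnitude.
num_layers, d_model = 64, 4096
extra = attn_res_extra_params(num_layers, d_model)

total_params = 48e9                 # the 48B total from the article
fraction = extra / total_params
```

Under these assumptions the extra parameters number about half a million, on the order of one hundred-thousandth of the model.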

The Practical Version: Block AttnRes

Full AttnRes sounds great until you try to train a 48B model with pipeline parallelism. The problem: every layer output needs to be kept alive for all subsequent layers, and transmitted across pipeline stages. Memory grows as O(Ld) and communication overhead makes it impractical.

Block AttnRes is the engineering compromise. Group the L layers into N blocks (they use ~8). Within each block, layers use standard summation. Across blocks, layers attend over the N block-level representations with softmax. Memory drops from O(Ld) to O(Nd).
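One plausible reading of that scheme, as a sketch under assumptions. Treating the embedding as its own source and letting each layer also see the running sum of its current block are my choices for illustration. The toy layers and queries are hypothetical, and the RMSNorm has no learned gain.

```python
import math

def rms_norm(v, eps=1e-6):
    # RMSNorm without a learned gain (assumption, to keep the sketch short).
    rms = math.sqrt(sum(x * x for x in v) / len(v) + eps)
    return [x / rms for x in v]

def attend(query, sources):
    # Softmax attention of one pseudo-query over a list of source vectors.
    scores = [sum(q * k for q, k in zip(query, rms_norm(s))) for s in sources]
    top = max(scores)
    exps = [math.exp(s - top) for s in scores]
    total = sum(exps)
    return [sum(e / total * s[j] for e, s in zip(exps, sources))
            for j in range(len(query))]

def make_layer(k):
    # Hypothetical stand-in for a transformer layer.
    return lambda h: [math.tanh(x + 0.1 * k) for x in h]

def block_attn_res_forward(embedding, layers, queries, block_size):
    completed = [embedding]   # block-level representations kept alive
    partial = None            # running sum of the current block's outputs
    max_sources = 0           # most vectors any single layer attended over
    for idx, (layer, query) in enumerate(zip(layers, queries)):
        sources = completed if partial is None else completed + [partial]
        max_sources = max(max_sources, len(sources))
        h = attend(query, sources)             # softmax across blocks
        out = layer(h)
        # Standard summation inside the block.
        partial = out if partial is None else [a + b for a, b in zip(partial, out)]
        if (idx + 1) % block_size == 0:        # block boundary: freeze the sum
            completed.append(partial)
            partial = None
    if partial is not None:
        completed.append(partial)
    return completed, max_sources

d, num_layers, block_size = 4, 16, 4
embedding = [0.5, -1.0, 2.0, 0.25]
layers = [make_layer(k) for k in range(num_layers)]
queries = [[math.sin(k + j) for j in range(d)] for k in range(num_layers)]
completed, max_sources = block_attn_res_forward(embedding, layers, queries, block_size)
```

With 16 layers in blocks of 4, no layer attends over more than 5 vectors (the embedding, three finished blocks, and the current partial sum). Full AttnRes would need 16 at the last layer, which is the O(Ld) versus O(Nd) difference in miniature.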

The design is clever because the pseudo-query vectors are decoupled from the forward computation, meaning all queries within a block can be batched into a single matrix multiply. Combined with cross-stage caching (previously received blocks don’t need re-transmission) and a two-phase computation strategy that amortizes memory reads, Block AttnRes becomes a practical drop-in replacement.

Training overhead: less than 4% under pipeline parallelism. Inference latency overhead: less than 2%.

Does It Actually Work?

Yes. Across every metric they tested.

Figure: Scaling Law Curves

Scaling laws. They sweep five model sizes from 194M to 528M activated parameters. Block AttnRes consistently outperforms the PreNorm baseline across the entire compute range. At 5.6 PFLOP/s-days, Block AttnRes matches the loss of a baseline trained with 1.25x more compute. Full AttnRes does slightly better than Block AttnRes, but the gap in loss between the two shrinks to 0.001 at the largest scale.

48B model results. They integrate Block AttnRes into Kimi Linear (48B total / 3B activated, MoE) and pre-train on 1.4T tokens. The results are striking:

  • GPQA-Diamond: 36.9 to 44.4 (+7.5 points)
  • Minerva Math: 53.5 to 57.1 (+3.6 points)
  • HumanEval: 59.1 to 62.2 (+3.1 points)
  • MMLU: 73.5 to 74.6 (+1.1 points)
  • BBH: 76.3 to 78.0 (+1.7 points)

The gains are largest on multi-step reasoning tasks (GPQA-Diamond, Minerva Math), which makes intuitive sense. These tasks benefit most from later layers being able to selectively retrieve information from earlier computational stages rather than working with a diluted average of everything.

Why Should You Care?

If you’re not training foundation models from scratch, you might be wondering what any of this has to do with you. More than you’d think.

Cheaper models that perform the same. The scaling law results show Block AttnRes matches a baseline trained with 1.25x more compute. Flip that around: the same loss needs only 1/1.25 = 0.8 of the compute, so labs adopting this technique can train equivalent models for roughly 20% less. A single training run for a frontier model can cost tens of millions of dollars. 20% off that number is a meaningful amount of money, and those savings eventually flow downstream to API pricing and the cost of running inference for products built on top of these models.

Better reasoning and code generation. The biggest improvements aren’t on trivia benchmarks. They’re on multi-step reasoning (+7.5 on GPQA-Diamond, +3.6 on Minerva Math) and code generation (+3.1 on HumanEval). These are the tasks that matter most for real-world applications: AI agents that can plan and execute multi-step workflows, coding assistants that can hold a coherent chain of thought across a complex function, and reasoning systems that need to build conclusions on top of intermediate results. The ability for later layers to selectively retrieve earlier computational steps is exactly what these tasks demand.

Better memory in AI agents. This one is closer to home for us. The core insight of AttnRes, that uniform accumulation buries useful information, is the same problem that plagues agent memory systems today. When an AI agent stuffs every user interaction into a flat context window, early memories get drowned out by sheer volume. A user’s dietary restriction mentioned in conversation one gets buried under 200 subsequent chats about recipe suggestions. AttnRes validates a principle we’ve been building around: selective, weighted retrieval beats uniform accumulation every time, whether you’re aggregating across model layers or across months of user history.

A signal of where architecture research is heading. The era of “just make it bigger” is winding down. The next generation of improvements will come from revisiting parts of the transformer architecture that everyone assumed were good enough. Residual connections were one of those assumptions. If a change this small (one learned vector per layer) produces consistent gains across every benchmark, it suggests there are more wins hiding in the plumbing of these models.

Training Dynamics Tell the Real Story

The benchmark numbers are convincing, but the training dynamics plots are where you really see what AttnRes fixes.

Figure: Training Dynamics

Output magnitudes. In the baseline, per-block output magnitudes grow monotonically with depth, reaching ~14 at the final block. This is PreNorm dilution in action: deeper layers must produce bigger outputs to be heard over the accumulated residual. With Block AttnRes, output magnitudes stay bounded and show a periodic pattern that resets at block boundaries. The network is no longer in an arms race against its own residual stream.

Gradient distribution. The baseline shows disproportionately large gradients at the earliest layers and near-zero gradients at later ones. This is the flip side of dilution: early layers get hammered because they affect everything downstream, while later layers barely matter. AttnRes produces a substantially more uniform gradient distribution. Every layer gets useful learning signal.

Validation loss. AttnRes tracks below the baseline throughout training, with the gap widening during the decay phase. This suggests the improvement compounds as training progresses.

The Structured Matrix View

One of the more interesting theoretical contributions is framing all residual variants as instances of a depth mixing matrix M. Standard residuals produce an all-ones lower-triangular M. Highway networks produce a 1-semiseparable M. AttnRes produces a dense, rank-L matrix with input-dependent weights.

This unification reveals that standard residuals are performing depth-wise linear attention (fixed weights), while AttnRes performs depth-wise softmax attention (learned, input-dependent weights). It’s the same transition that Transformers made over the sequence dimension. Prior methods like Highway networks and mHC are intermediate points on this spectrum.
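A small sketch of the two ends of that spectrum, written as explicit depth mixing matrices. The score values are hypothetical placeholders for the query-key dot products, and the indexing convention (row l mixes sources 0 through l) is my own choice for illustration. In AttnRes the scores would depend on the input, whereas here they are fixed numbers.

```python
import math

def standard_residual_matrix(depth):
    """All-ones lower-triangular M: every earlier source gets weight exactly 1."""
    return [[1.0 if i <= l else 0.0 for i in range(depth)] for l in range(depth)]

def attn_res_matrix(scores):
    """Row-wise softmax over the lower-triangular part of a score matrix."""
    depth = len(scores)
    matrix = []
    for l in range(depth):
        row_scores = scores[l][: l + 1]            # only sources 0..l are visible
        top = max(row_scores)
        exps = [math.exp(s - top) for s in row_scores]
        total = sum(exps)
        matrix.append([e / total for e in exps] + [0.0] * (depth - l - 1))
    return matrix

depth = 4
# Hypothetical scores standing in for pseudo-query . key products.
scores = [[0.3 * l - 0.5 * i for i in range(depth)] for l in range(depth)]

M_std = standard_residual_matrix(depth)
M_attn = attn_res_matrix(scores)
row_sums_std = [sum(row) for row in M_std]
row_sums_attn = [sum(row) for row in M_attn]
```

The row sums make the contrast plain: the standard matrix has row sums 1, 2, 3, 4, growing with depth, while every softmax row sums to 1.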

The analogy goes even further. Test-Time Training (TTT) formalized residual connections as gradient descent steps, with each layer acting as one update. Under this lens, AttnRes replaces gradient descent (additive accumulation) with attention (selective retrieval), mirroring the historical RNN-to-Transformer transition.

What This Means For Agent Memory

I covered the user-facing angle above, but there’s a deeper technical connection worth spelling out.

Spreading activation in a knowledge graph is solving the same problem AttnRes solves for depth. Instead of treating all stored nodes equally, activation flows through weighted edges, selectively amplifying relevant context and letting irrelevant information decay. The graph structure is the equivalent of learned attention weights over the memory “depth.”

The lesson from AttnRes is that the fix isn’t more sophisticated ranking after retrieval. The fix is structural: give the system a mechanism to selectively weight its sources before aggregation, not after.
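As a rough illustration of spreading activation, here is a minimal sketch. The graph, node names, edge weights, decay factor, and step count are all hypothetical, and this is not a description of any particular memory system.

```python
def spread_activation(graph, seeds, decay=0.5, steps=2):
    """Propagate activation from seed nodes along weighted edges, damping each hop."""
    activation = dict(seeds)
    frontier = dict(seeds)
    for _ in range(steps):
        next_frontier = {}
        for node, energy in frontier.items():
            for neighbor, weight in graph.get(node, {}).items():
                passed = energy * weight * decay      # weak edges pass little on
                next_frontier[neighbor] = next_frontier.get(neighbor, 0.0) + passed
        for node, energy in next_frontier.items():
            activation[node] = activation.get(node, 0.0) + energy
        frontier = next_frontier
    return activation

# Hypothetical memory graph: edge weights encode how strongly two memories relate.
graph = {
    "query:dinner ideas": {"pref:vegetarian": 0.9, "chat:recipe #200": 0.4},
    "pref:vegetarian": {"fact:allergic to nuts": 0.8},
    "chat:recipe #200": {"chat:recipe #199": 0.3},
}
activation = spread_activation(graph, {"query:dinner ideas": 1.0})
ranked = sorted(activation, key=activation.get, reverse=True)
```

In this toy graph the dietary preference, and the allergy linked to it, end up more active than the older recipe chat, because the weighting happens while activation flows rather than as a ranking pass over a flat pile afterwards.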

Honest Assessment

A few things the paper doesn’t address:

Both the baseline and AttnRes use the same hyperparameters, which were optimized for standard residuals. AttnRes might do better with hyperparameters tuned for it, so the baseline has a home-field advantage and the reported improvements are likely conservative.

The 48B results are on a single training recipe (Kimi Linear’s). We don’t know how AttnRes interacts with other architectures, training recipes, or data mixes. The scaling law experiments across five sizes are encouraging, but there’s always the question of whether this generalizes beyond Kimi’s specific setup.

About 8 blocks is presented as the practical choice, but the ablation (Figure 6 in the paper) shows that finer-grained blocking (smaller block size S, meaning more blocks) approaches Full AttnRes performance at the cost of more memory. As hardware improves and cross-stage bandwidth increases, smaller blocks or even full AttnRes might become practical. This is a moving target.

The paper doesn’t compare against some concurrent work (ANCRe, MUDDFormer) that also addresses cross-layer access, though they do compare against DenseFormer and mHC and outperform both.

The Takeaway

Residual connections are the plumbing of modern LLMs. Everyone uses them, nobody thinks about them. The Kimi team demonstrated that this plumbing has a real, measurable flaw (PreNorm dilution, uncontrolled output growth, no selective access), and that fixing it with depth-wise softmax attention yields consistent improvements from small models to 48B, for less than 2% inference overhead.

The deeper insight is the duality between depth and sequence. Both involve aggregating information from a growing history. Both benefit from selective attention over that history rather than uniform accumulation. The Transformer made this transition for sequences in 2017. AttnRes makes it for depth in 2026.

If you’re training models and still using standard residuals, this paper is worth reading in full. The Block AttnRes variant is a practical drop-in replacement, the code is open source on GitHub (MoonshotAI/Attention-Residuals), and the engineering work to make it scale (cross-stage caching, two-phase computation) is well documented.

We’re past the point where residual connections should be static. The rest of the transformer learned input-dependent weights years ago. It’s about time the skip connections caught up.