03 [Evaluation] Residual connections assume every layer matters equally — these results say they’re wrong by design
Standard residual connections in transformers add each layer’s output with a fixed weight of 1.0. That uniform accumulation was never theoretically justified; it was a training stability hack that stuck. As depth increases, hidden states grow without bound, and each individual layer’s signal gets progressively diluted. The model learns despite this, not because of it.
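A minimal sketch of what that fixed unit-weight accumulation looks like (my illustration, not from the paper): a toy stack of linear layers where every output is added back with weight 1.0, so the running hidden state's norm typically grows with depth while any single layer's signal shrinks relative to the sum.

```python
import torch
import torch.nn as nn

# Toy illustration: each layer's output is added back with a fixed weight of 1.0.
# The running hidden state tends to grow in norm as depth increases, diluting
# the relative contribution of any individual layer.
torch.manual_seed(0)
d_model, depth = 64, 12
layers = nn.ModuleList([nn.Linear(d_model, d_model) for _ in range(depth)])

h = torch.randn(1, d_model)
for i, layer in enumerate(layers, start=1):
    h = h + layer(h)  # standard residual connection: unit weight, no selectivity
    print(f"after layer {i:2d}: ||h|| = {h.norm().item():.1f}")
```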
Attention Residuals (AttnRes) replaces those fixed unit-weight additions with softmax attention (how the model decides what to focus on) over all preceding layer outputs. Rather than summing every layer, each layer dynamically selects which earlier representations to pull from, with learned weights that vary per input. The mechanism is structurally analogous to cross-layer attention: layer N looks back at layers 1 through N-1 and decides how heavily to draw on each of them, preserving what is still useful. The problem is cost: attending over all preceding layers scales quadratically with depth, which is prohibitive for large-scale training.
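A hypothetical sketch of that mechanism (the module name, projections, and shapes are my assumptions, not the paper's exact formulation): the current layer forms a query from its own output, scores every preceding layer's output, and mixes them with input-dependent softmax weights instead of adding them all with weight 1.0.

```python
import torch
import torch.nn as nn

class AttnResidual(nn.Module):
    """Sketch of an attention-based residual: attend over all preceding layer
    outputs instead of summing them with fixed unit weights."""
    def __init__(self, d_model):
        super().__init__()
        self.q = nn.Linear(d_model, d_model)
        self.k = nn.Linear(d_model, d_model)
        self.scale = d_model ** -0.5

    def forward(self, layer_out, history):
        # layer_out: [batch, seq, d_model]; history: list of preceding layer
        # outputs, each [batch, seq, d_model]
        H = torch.stack(history, dim=2)            # [batch, seq, n_prev, d]
        q = self.q(layer_out).unsqueeze(2)         # [batch, seq, 1, d]
        k = self.k(H)                              # [batch, seq, n_prev, d]
        scores = (q * k).sum(-1) * self.scale      # [batch, seq, n_prev]
        w = torch.softmax(scores, dim=-1).unsqueeze(-1)
        residual = (w * H).sum(dim=2)              # input-dependent mix of history
        return layer_out + residual
```

In a full stack, each transformer layer would call this in place of the plain `h = h + layer_out` update and append its output to `history` afterwards, which is where the quadratic-in-depth cost comes from.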
Block AttnRes addresses this by partitioning the network into fixed-size layer blocks, then attending over block-level representations rather than every individual layer output. Memory footprint drops substantially while preserving most of the per-layer selectivity gains. Coarse-grained selection across blocks captures most of the benefit: which phase of computation matters most for a given input is more important than which exact layer within that phase.
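A block-level variant along the same lines (mean-pooled block summaries and the example block size are my assumptions): collapse the per-layer history into one summary per block, then run the same attention step over those summaries, so the cost scales with depth divided by block size rather than with depth.

```python
import torch

def block_summaries(history, block_size):
    """Collapse the per-layer history into one representative tensor per block
    (mean over the block's layer outputs; the pooling choice is an assumption)."""
    blocks = []
    for start in range(0, len(history), block_size):
        chunk = history[start:start + block_size]
        blocks.append(torch.stack(chunk, dim=0).mean(dim=0))
    return blocks

# Usage with the AttnResidual sketch above: attend over roughly
# len(history) / block_size block summaries instead of every layer output.
# mixed = attn_res(layer_out, block_summaries(history, block_size=4))
```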
The limitation is real. Results come from models at scales where depth-induced hidden-state growth is detectable and tractable to study. Whether the gains hold at 70B+ parameters, where training dynamics differ and residual scaling is sometimes already patched through weight initialization tricks, remains untested. Block size selection also introduces a new hyperparameter with non-obvious optimal values across architectures.
For practitioners, the implication is architectural rather than fine-tuning-level. This is not a drop-in for existing trained models. It changes the residual connection pattern at training time. The relevant decision point is pre-training or architecture search for new model families, not inference optimization for deployed models.
Key takeaways:
- Fixed unit-weight residual accumulation causes hidden-state magnitude to grow with depth, diluting shallow-layer contributions; softmax attention over preceding layers restores input-dependent selectivity and controls this growth at the source
- Block AttnRes shows that coarse block-level aggregation recovers most of the full per-layer attention gain, meaning the performance benefit is concentrated in selecting between processing phases, not individual layers
- Teams designing new model architectures from scratch, particularly for deeper networks where residual dilution compounds, should treat fixed-weight residual connections as a design choice worth revisiting, not a given
Source: Attention Residuals