Attention Residuals

(github.com)

88 points | by GaggiX 4 hours ago

6 comments

Murfalo 1 hour ago
Amazingly, the first author is a high school student! https://nathanchen.me/public/About%20me.html
- brcmthrowaway 52 minutes ago
  We're about to get an onslaught of young Chinese geniuses (raised in China). It's pure statistics.
  Sadly, the same can't be said about India (infrastructure and food security lag behind China).
  - jldugger 25 minutes ago
    > It's pure statistics
    I'm not so sure about that: https://www.populationpyramid.net/china/2026/ suggests peak high school in China was years ago.
jryio 2 hours ago
This is the key piece
> Full AttnRes is straightforward but requires O(Ld) memory at scale. Block AttnRes partitions layers into N blocks, accumulates within each block via standard residuals, and applies attention only over block-level representations. With ~8 blocks, it recovers most of Full AttnRes's gains while serving as a practical drop-in replacement with marginal overhead.
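A rough sketch of what that passage describes, not the authors' implementation. It assumes (the quote confirms none of this) one learned query vector per layer, scaled dot-product scores, and that the running sum of the in-progress block is attended over alongside the finished blocks. `block_attnres_forward`, `attend`, and `queries` are hypothetical names.

```python
import math

def softmax(scores):
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def attend(candidates, query):
    """Softmax-weighted mix of candidate vectors (assumed scoring: scaled dot product)."""
    d = len(query)
    scores = [sum(q * c for q, c in zip(query, cand)) / math.sqrt(d)
              for cand in candidates]
    weights = softmax(scores)
    return [sum(w * cand[j] for w, cand in zip(weights, candidates))
            for j in range(d)]

def block_attnres_forward(x, layers, queries, num_blocks):
    """Toy forward pass: plain residual sums inside a block, attention across blocks."""
    block_size = math.ceil(len(layers) / num_blocks)
    block_reps = [x]   # embedding plus one vector per finished block: O(N*d)
    partial = None     # running residual sum of the current block
    for i, (layer, q) in enumerate(zip(layers, queries)):
        candidates = block_reps if partial is None else block_reps + [partial]
        out = layer(attend(candidates, q))
        # Within a block, accumulate with fixed unit weights (standard residual).
        partial = out if partial is None else [a + b for a, b in zip(partial, out)]
        if (i + 1) % block_size == 0 or i == len(layers) - 1:
            block_reps.append(partial)  # close the block
            partial = None
    return block_reps

# Hypothetical toy run: 8 identity "layers", zero queries, 4 blocks.
reps = block_attnres_forward([1.0, 0.0], [lambda v: v] * 8, [[0.0, 0.0]] * 8, num_blocks=4)
```

The point of the sketch is the storage: only the embedding plus one vector per block is kept for attention, rather than one vector per layer.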
jjcm 3 hours ago
Two things stand out to me with this:
1. Drops compute required for training by ~20%. This approach won't just help the ever-escalating model sizes larger companies are pushing for, it means things like autoresearch can iterate on new model architectures faster.
2. WAY lower bandwidth requirements for inference. Means with approaches like this it should run on consumer hardware far better. It apparently requires 1/6th the memory bandwidth of a traditional approach for better results.
This is a big improvement if it can be generalized. They're claiming it's a drop-in replacement, so it seems like it can be.
- dvt 2 hours ago
  > Drops compute required for training by ~20%.
  This is not true. Authors claim that w.r.t. training, their method adds negligible overhead for AttnRes with no memory impact (but is way more complicated for Block AttnRes since we need to use pipelining for larger models, hence the O(Ld) & O(Nd) figures, with N ≪ L).
  > WAY lower bandwidth requirements for inference.
  Also not true. Paper has nothing to do with inference, apart from the benchmarks. If you're looking at the graph about "compute advantage," it's about training compute. They do some interpolation to get to the 1.25x number, basically answering the question "if non-AttnRes architecture were trained, how much compute would it take to get to the same loss as AttnRes?" (The answer being ~20% more compute.) It's an interesting claim, but there's all kinds of weird and unexpected convergence that can happen, so take it with a grain of salt.
  - observationist 1 hour ago
    I think what they're getting at is that for a given unit of compute, this method achieves 125% performance.
    If model A reaches performance level 100 using 100 units of compute using old methods, and you train model B using AttnRes, aiming at performance level 100, it costs you 80 units of compute.
    It probably doesn't map precisely, but that's where people are diverging from the claim - it doesn't explicitly say anything about reduced inference or training time, but that's the implicit value of these sorts of things. Less compute to equivalent performance can be a huge win for platforms at scale as well as for local models.
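    Spelled out, assuming the 1.25x compute-advantage figure mentioned upthread and using C (a symbol introduced here, not from the paper) for the training compute needed to reach a fixed loss:

    ```latex
    \[
    \frac{C_{\text{baseline}}}{C_{\text{AttnRes}}} = 1.25
    \quad\Longrightarrow\quad
    C_{\text{AttnRes}} = \frac{C_{\text{baseline}}}{1.25} = 0.8\,C_{\text{baseline}}
    \]
    ```

    So 100 baseline units correspond to 80 AttnRes units (20% less), or equivalently the baseline needs 25% more compute than AttnRes to reach the same loss.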
    - dvt 1 hour ago
      > I think what they're getting at is that for a given unit of compute, this method achieves 125% performance.
      This is not what they're getting at; I explained exactly what they're getting at. I mean, your equivalence of "loss" (what authors actually measured) and "performance" is just bizarre. We use benchmarks to measure performance, and the numbers there were like 1-5% better (apart from the GPQA-Diamond outlier).
      Do people even read these papers?
- com2kid 2 hours ago
  > 2. WAY lower bandwidth requirements for inference. Means with approaches like this it should run on consumer hardware far better. It apparently requires 1/6th the memory bandwidth of a traditional approach for better results.
  That should be the headline right there. Giant size 60 font headline.
  Some people have PhDs in burying the lede!
  - talloaktrees 2 hours ago
    except it's not true
    - observationist 1 hour ago
      It's not not true, it's just that things are getting lost in the excitement. There are some specific cases where there's a big boost, it's just not exactly what people are hoping.
      > The "1/6th" specifically appears in community comparisons to DeepSeek's mHC (Manifold-Constrained Hyper-Connections, a prior technique for better depth-wise information flow in deep models). Several Chinese-language sources and downstream discussions (e.g., translated articles, YouTube breakdowns, and blogs like houdao.com) state that Block AttnRes achieves comparable (or better) performance to mHC while using only one-sixth of the data read/write volume (or memory bandwidth pressure) during inference/engineering deployment.
      There are specific cases where that speedup does occur; it's not going to translate exactly into local models or other architectures or hardware.
      - djsjajah 1 hour ago
        No. It seems to me that the comment is objectively incorrect. The original comment was talking about inference and from what I can tell, it is strictly going to run slower than the model trained to the same loss without this approach (it has "minimal overhead"). The main point is that you won't need to train that model for as long.
jszymborski 3 hours ago
This reminds me of the input gates of an LSTM.
westurner 2 hours ago
ScholarlyArticle: "Attention Residuals" (2026) https://arxiv.org/abs/2603.15031 :
> Abstract: Residual connections with PreNorm are standard in modern LLMs, yet they accumulate all layer outputs with fixed unit weights. This uniform aggregation causes uncontrolled hidden-state growth with depth, progressively diluting each layer's contribution. We propose Attention Residuals (AttnRes), which replaces this fixed accumulation with softmax attention over preceding layer outputs, allowing each layer to selectively aggregate earlier representations with learned, input-dependent weights. To address the memory and communication overhead of attending over all preceding layer outputs for large-scale model training, we introduce Block AttnRes, which partitions layers into blocks and attends over block-level representations, reducing the memory footprint while preserving most of the gains of full AttnRes. [...]
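A minimal sketch of the contrast the abstract draws, not the paper's actual formulation. It assumes a hypothetical learned query vector per layer and scaled dot-product scoring; `attn_residual`, `standard_residual`, `forward`, and `queries` are made-up names for illustration.

```python
import math

def softmax(scores):
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def standard_residual(history):
    """Baseline: every earlier output is accumulated with fixed weight 1."""
    d = len(history[0])
    return [sum(h[j] for h in history) for j in range(d)]

def attn_residual(history, query):
    """Softmax attention over earlier outputs; weights depend on the outputs themselves."""
    d = len(query)
    scores = [sum(q * x for q, x in zip(query, h)) / math.sqrt(d) for h in history]
    weights = softmax(scores)
    mixed = [sum(w * h[j] for w, h in zip(weights, history)) for j in range(d)]
    return mixed, weights

def forward(x, layers, queries):
    """Toy forward pass: each layer reads an attention-weighted mix of all earlier outputs."""
    history = [x]  # embedding plus every layer output is kept: L + 1 vectors
    for layer, q in zip(layers, queries):
        layer_in, _ = attn_residual(history, q)
        history.append(layer(layer_in))
    return history

# Hypothetical toy inputs: two earlier outputs and a query aligned with the first.
mixed, weights = attn_residual([[1.0, 0.0], [0.0, 1.0]], [10.0, 0.0])
```

Because the weights sum to 1, the mixed input stays a convex combination of earlier outputs instead of growing with depth the way the plain sum in `standard_residual` does. Keeping every layer output in `history` is the cost that the Block variant is meant to reduce.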
- czbond 37 minutes ago
  Ah - now I understand how this has 2k+ (supposedly legitimate) GitHub stars in less than a week. Thank you - I was more skeptical before.
vibe42 3 hours ago
[dead]