This is great - always worth reading anything from Sebastian. I would also highly recommend his Build an LLM From Scratch book. I feel like I didn’t really understand the transformer mechanism until I worked through that book.
On the LLM Architecture Gallery, it’s interesting to see the variations between models, but I think the 30,000-foot view of this is that in the seven years since GPT-2 there have been a lot of improvements to LLM architecture but no fundamental innovations in that area. The best open-weight models today still look a lot like GPT-2 if you zoom out: a stack of attention layers and feed-forward layers.
Another way of putting this is that the astonishing improvements in LLM capabilities we’ve seen over the last 7 years have come mostly from scaling up and, critically, from new training methods like RLVR, which is responsible for coding agents going from barely working to amazing in the last year.
That’s not to say that architectures aren’t interesting or important, or that the improvements aren’t useful, but it is a little bit of a surprise. Though it shouldn’t be at this point: it’s probably just another instance of the Bitter Lesson.
The most interesting takeaway from this gallery is the convergence. After years of experimentation -- MoE routing, state-space models, linear attention, hybrid architectures -- competitive open-weight models have settled on a narrow design space: dense decoder-only transformer with RMSNorm, rotary position embeddings, SwiGLU activations, and grouped-query attention. The architectural diffs between these models are minor variations on the same template.
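To make the "same template" point concrete, here is a minimal sketch of the pre-norm block that these converged designs repeat: RMSNorm, attention, RMSNorm, SwiGLU feed-forward, with residual connections around each. This is an illustrative toy in plain NumPy with made-up dimensions, not any particular model's implementation; GQA and rotary embeddings are omitted for brevity.

```python
import numpy as np

def rms_norm(x, g, eps=1e-6):
    # RMSNorm: scale by root-mean-square only; no mean subtraction or bias,
    # unlike GPT-2's LayerNorm
    return x / np.sqrt(np.mean(x**2, axis=-1, keepdims=True) + eps) * g

def swiglu_ffn(x, w_gate, w_up, w_down):
    # SwiGLU feed-forward: silu(x @ w_gate) gates (x @ w_up), then project down
    gate = x @ w_gate
    silu = gate / (1.0 + np.exp(-gate))  # SiLU / swish activation
    return (silu * (x @ w_up)) @ w_down

def attention(x, w_q, w_k, w_v, w_o):
    # Single-head causal self-attention (grouped-query attention and RoPE
    # left out to keep the sketch short)
    q, k, v = x @ w_q, x @ w_k, x @ w_v
    scores = q @ k.T / np.sqrt(q.shape[-1])
    mask = np.triu(np.full(scores.shape, -np.inf), k=1)  # block future positions
    weights = np.exp(scores + mask)
    weights /= weights.sum(axis=-1, keepdims=True)
    return (weights @ v) @ w_o

def block(x, p):
    # The pre-norm residual block repeated N times in GPT-2's descendants
    x = x + attention(rms_norm(x, p["g1"]), p["wq"], p["wk"], p["wv"], p["wo"])
    x = x + swiglu_ffn(rms_norm(x, p["g2"]), p["w_gate"], p["w_up"], p["w_down"])
    return x
```

Swap RMSNorm for LayerNorm and SwiGLU for a GELU MLP and you essentially have GPT-2; the diffs between modern open-weight models mostly live at this level of detail.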
The real differentiation has moved to training recipes and data pipelines. DeepSeek-R1's architectural novelty is modest; its training pipeline (reinforcement learning on reasoning chains) is where the interesting work happened. Same story with Llama 3 -- the architecture barely changed from Llama 2, but the training data and post-training were completely reworked.
This mirrors what happened in chip design: ISA differences between x86 and ARM matter less than microarchitecture and process node. Once the field finds a good-enough architecture, progress shifts to the systems around it.
This is amazing, such a nice presentation. It reminds me of the Neural Network Zoo [1], which was also a nice visualization of different architectures.
Is there a sort order? It would be so nice to understand the threads of evolution and revolution in the progression, perhaps as a family-tree or influence layout. It would also be nice to have a scaled view so you can sense the difference in sizes over time.
Thank you so much! As a (bio)statistician, I've always wanted a "modular" way to go from "neural networks approximate functions" to a high-level understanding about how machine learning practitioners have engineered real-life models.
Interesting collection. The architecture differences show up in surprising ways when you actually look at prompt patterns across models. Longer context windows don't just let you write more, they change what kind of input structure works best.
Competitiveness doesn't really come from architecture, but from scale, pretraining data, and fine-tuning data. There has been little innovation in architecture over the last few years, and most of the innovations aim at making training or inference more efficient (fitting in more data), not at making models "fundamentally smarter".
If your definition of "competitive" is loose enough, you can write your own Markov chain in an evening. Transformer models rely on a lot of prior art that has to be learned incrementally.
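The "evening project" really is that small. Here's an illustrative word-level Markov chain (my own toy sketch, not anything from the article): record which word follows each word, then sample from those observed continuations.

```python
import random
from collections import defaultdict

def train(text, order=1):
    # Map each length-`order` tuple of words to the list of words that followed it
    words = text.split()
    model = defaultdict(list)
    for i in range(len(words) - order):
        key = tuple(words[i:i + order])
        model[key].append(words[i + order])
    return model

def generate(model, length=20, seed=None):
    # Start from a random observed key, then repeatedly sample a continuation
    rng = random.Random(seed)
    key = rng.choice(list(model.keys()))
    out = list(key)
    for _ in range(length):
        nxt = model.get(tuple(out[-len(key):]))
        if not nxt:
            break  # dead end: this context was only seen at the end of the text
        out.append(rng.choice(nxt))
    return " ".join(out)
```

With order=1 this produces locally plausible but globally incoherent text, which is exactly the gap between "generates words" and "competitive": everything past this point (attention, scale, RLVR) is what the prior art buys you.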
Thanks! This is cool. Can you tell me if you learnt anything interesting or surprising while pulling this together? As in, did it teach you something about LLM architecture that you didn't know before you began?
[1] https://www.asimovinstitute.org/neural-network-zoo/
I’m thinking it’s still llama / dense decoder only transformer.
I even brought my popcorn :(