<aside> 💡
What is this? This is a fairly technical overview of the various improvements that have been made to the decoder-only transformer architecture originally proposed in Attention is All You Need. Specifically, it highlights the main changes from AIAYN to Llama 3, as of May 2024.
</aside>
I feel like I have a solid understanding of the standard transformer model, mostly thanks to the awesome people in the ML community who publish quality educational content. Andrej Karpathy (and many others!), I'm talking about you 👁️👁️
However, there are many improvements that have been incorporated into SoTA decoder-only transformer architectures, improvements that I didn't fully understand. What is Grouped-Query Attention (GQA)? SwiGLU? RMSNorm? RoPE? Why did we make all these changes?
So, here is my attempt at defining a standard SoTA transformer (based on Llama / OpenELM).
Summary of the changes from Attention is All You Need to current state-of-the-art. Each element is expanded on below.
Quick Note: This doesn't cover several interesting and relevant SoTA topics like MoE, or other computational improvements like KV-Caching. Maybe someday…
Ok, so putting this all together gives us our SoTA decoder-only transformer:
Feel free to use these graphics. Slides link for those of you who want to tweak them too 🙂
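For those who prefer reading code to diagrams, here is a minimal PyTorch sketch of one such decoder block, combining pre-RMSNorm, RoPE, Grouped-Query Attention, and a SwiGLU feed-forward. This is not Llama's actual implementation; the dimensions, head counts, and names like `DecoderBlock` and `rope` are illustrative assumptions.

```python
# A minimal sketch (not Llama's actual code) of one pre-norm decoder block
# combining RMSNorm, RoPE, Grouped-Query Attention, and SwiGLU.
# All dimensions and names below are illustrative assumptions.
import torch
import torch.nn as nn
import torch.nn.functional as F


class RMSNorm(nn.Module):
    """Rescale by 1 / RMS(x): no mean subtraction, no bias."""
    def __init__(self, dim, eps=1e-6):
        super().__init__()
        self.weight = nn.Parameter(torch.ones(dim))
        self.eps = eps

    def forward(self, x):
        return self.weight * x * torch.rsqrt(x.pow(2).mean(-1, keepdim=True) + self.eps)


def rope(x, base=10000.0):
    """Rotary position embedding for a (batch, heads, seq, head_dim) tensor."""
    *_, t, d = x.shape
    half = d // 2
    freqs = base ** (-torch.arange(half, dtype=x.dtype, device=x.device) / half)
    angles = torch.arange(t, dtype=x.dtype, device=x.device)[:, None] * freqs
    cos, sin = angles.cos(), angles.sin()                  # (seq, half)
    x1, x2 = x[..., :half], x[..., half:]
    return torch.cat([x1 * cos - x2 * sin, x1 * sin + x2 * cos], dim=-1)


class DecoderBlock(nn.Module):
    def __init__(self, dim=512, n_heads=8, n_kv_heads=2):
        super().__init__()
        self.n_heads, self.n_kv_heads = n_heads, n_kv_heads
        self.head_dim = dim // n_heads
        # Grouped-Query Attention: fewer K/V heads than query heads.
        self.wq = nn.Linear(dim, n_heads * self.head_dim, bias=False)
        self.wk = nn.Linear(dim, n_kv_heads * self.head_dim, bias=False)
        self.wv = nn.Linear(dim, n_kv_heads * self.head_dim, bias=False)
        self.wo = nn.Linear(n_heads * self.head_dim, dim, bias=False)
        # SwiGLU feed-forward: three projections, SiLU-gated.
        hidden = int(dim * 8 / 3)
        self.w_gate = nn.Linear(dim, hidden, bias=False)
        self.w_up = nn.Linear(dim, hidden, bias=False)
        self.w_down = nn.Linear(hidden, dim, bias=False)
        self.attn_norm = RMSNorm(dim)                      # pre-norm, not post-norm
        self.ffn_norm = RMSNorm(dim)

    def attention(self, x):
        b, t, _ = x.shape
        q = self.wq(x).view(b, t, self.n_heads, self.head_dim).transpose(1, 2)
        k = self.wk(x).view(b, t, self.n_kv_heads, self.head_dim).transpose(1, 2)
        v = self.wv(x).view(b, t, self.n_kv_heads, self.head_dim).transpose(1, 2)
        q, k = rope(q), rope(k)                            # positions via rotation, no learned embedding
        rep = self.n_heads // self.n_kv_heads              # each K/V head serves `rep` query heads
        k, v = k.repeat_interleave(rep, dim=1), v.repeat_interleave(rep, dim=1)
        out = F.scaled_dot_product_attention(q, k, v, is_causal=True)
        return self.wo(out.transpose(1, 2).reshape(b, t, -1))

    def forward(self, x):
        x = x + self.attention(self.attn_norm(x))          # pre-norm residual
        h = self.ffn_norm(x)
        return x + self.w_down(F.silu(self.w_gate(h)) * self.w_up(h))


print(DecoderBlock()(torch.randn(1, 16, 512)).shape)       # torch.Size([1, 16, 512])
```

Note how the K/V heads are simply repeated to match the query heads: that repetition is what makes GQA a drop-in replacement for multi-head attention while shrinking the K/V projections.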