<aside> 💡

What is this? This is a fairly technical overview of the various improvements that have been made to the decoder-only transformer architecture originally proposed in Attention is All You Need. Specifically, it highlights the main changes from AIAYN to Llama 3, as of May 2024.

</aside>

I feel like I have a solid understanding of the standard transformer model, mostly thanks to the awesome people in the ML community who publish quality educational content. Andrej Karpathy (and many others!), I'm talking about you 👁️👁️

However, there are many improvements that have been incorporated into SoTA decoder-only transformer architectures, improvements that I didn't fully understand. What is Grouped-Query Attention (GQA)? SwiGLU? RMSNorm? RoPE? Why were all these changes made?

So, here is my attempt to define a standard SoTA transformer (based on Llama / OpenELM).

Summary of the changes from Attention is All You Need to current state-of-the-art. Each element is expanded on below.


Quick Note: This doesn't cover several interesting and relevant SoTA topics like MoE or other computational improvements like KV-Caching. Maybe someday…

Normalization (RMSNorm)

Positional Embeddings (RoPE)

Attention (GQA)

Feed Forward Networks (SwiGLU)

Kernel-Optimized Attention (FlashAttention-2)

Tokenization (Llama Tokenizer)

Other Minor Adjustments


Ok, so putting this all together gives us our SoTA decoder-only transformer:
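
To complement the diagram, here is a minimal PyTorch sketch of one such decoder block: RMSNorm pre-normalization, RoPE applied to queries and keys, grouped-query attention, and a SwiGLU feed-forward. This is not Llama's actual implementation; the dimensions (dim=512, n_heads=8, n_kv_heads=2, hidden_dim=1408) are made-up illustrative values, and it omits KV-caching and any FlashAttention-specific kernel work.

```python
# Minimal sketch of a Llama-style decoder block (illustrative, not Llama's code).
import torch
import torch.nn as nn
import torch.nn.functional as F


class RMSNorm(nn.Module):
    def __init__(self, dim, eps=1e-6):
        super().__init__()
        self.weight = nn.Parameter(torch.ones(dim))
        self.eps = eps

    def forward(self, x):
        # Scale by the root-mean-square of the activations (no mean subtraction, no bias).
        return self.weight * x * torch.rsqrt(x.pow(2).mean(-1, keepdim=True) + self.eps)


def rope(x, base=10000.0):
    # Rotary positional embeddings: rotate channel pairs by a position-dependent angle.
    # x: (batch, n_heads, seq_len, head_dim)
    b, h, t, d = x.shape
    half = d // 2
    freqs = base ** (-torch.arange(0, half, dtype=torch.float32) / half)      # (half,)
    angles = torch.arange(t, dtype=torch.float32)[:, None] * freqs[None, :]   # (t, half)
    cos, sin = angles.cos(), angles.sin()
    x1, x2 = x[..., :half], x[..., half:]
    return torch.cat([x1 * cos - x2 * sin, x1 * sin + x2 * cos], dim=-1)


class DecoderBlock(nn.Module):
    def __init__(self, dim=512, n_heads=8, n_kv_heads=2, hidden_dim=1408):
        super().__init__()
        self.n_heads, self.n_kv_heads = n_heads, n_kv_heads
        self.head_dim = dim // n_heads
        # Grouped-query attention: fewer K/V heads than Q heads.
        self.wq = nn.Linear(dim, n_heads * self.head_dim, bias=False)
        self.wk = nn.Linear(dim, n_kv_heads * self.head_dim, bias=False)
        self.wv = nn.Linear(dim, n_kv_heads * self.head_dim, bias=False)
        self.wo = nn.Linear(n_heads * self.head_dim, dim, bias=False)
        # SwiGLU feed-forward: w2(silu(w1(x)) * w3(x))
        self.w1 = nn.Linear(dim, hidden_dim, bias=False)
        self.w3 = nn.Linear(dim, hidden_dim, bias=False)
        self.w2 = nn.Linear(hidden_dim, dim, bias=False)
        # Pre-normalization with RMSNorm instead of post-LayerNorm.
        self.attn_norm = RMSNorm(dim)
        self.ffn_norm = RMSNorm(dim)

    def forward(self, x):
        b, t, _ = x.shape
        h = self.attn_norm(x)
        q = self.wq(h).view(b, t, self.n_heads, self.head_dim).transpose(1, 2)
        k = self.wk(h).view(b, t, self.n_kv_heads, self.head_dim).transpose(1, 2)
        v = self.wv(h).view(b, t, self.n_kv_heads, self.head_dim).transpose(1, 2)
        q, k = rope(q), rope(k)
        # Each K/V head is shared by n_heads // n_kv_heads query heads.
        k = k.repeat_interleave(self.n_heads // self.n_kv_heads, dim=1)
        v = v.repeat_interleave(self.n_heads // self.n_kv_heads, dim=1)
        attn = F.scaled_dot_product_attention(q, k, v, is_causal=True)
        x = x + self.wo(attn.transpose(1, 2).reshape(b, t, -1))
        h = self.ffn_norm(x)
        return x + self.w2(F.silu(self.w1(h)) * self.w3(h))


x = torch.randn(1, 16, 512)
print(DecoderBlock()(x).shape)  # torch.Size([1, 16, 512])
```

Stack N of these blocks between a token embedding and a final RMSNorm + linear head and you have the skeleton of the model in the diagram.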

Feel free to use these graphics. Slides link for those of you who want to tweak too 🙂
