<aside> 💡

What is this? This is a fairly technical overview of the various improvements that have been made to the decoder-only transformer architecture originally proposed in Attention is All You Need. Specifically, it highlights the main changes from AIAYN to Llama 3, as of May 2024.

</aside>

I feel like I have a solid understanding of the standard transformer model, mostly thanks to the awesome people in the ML community who publish quality educational content. Andrej Karpathy (and many others!), I'm talking about you 👁️👁️

However, there are many improvements that have been incorporated into SoTA decoder-only transformer architectures, improvements that I didn't fully understand. What is Grouped-Query Attention (GQA)? SwiGLU? RMSNorm? RoPE? Why were all these changes made?

So, here is my attempt to define a standard SoTA transformer (based on Llama / OpenELM).

Summary of the changes from Attention is All You Need to current state-of-the-art. Each element is expanded on below.


Quick Note: This doesn't cover several interesting and relevant SoTA topics like MoE or other computational improvements like KV-Caching. Maybe someday…

Normalization (RMSNorm)

Positional Embeddings (RoPE)

Attention (GQA)

Feed Forward Networks (SwiGLU)

Kernel-Optimized Attention (FlashAttention-2)

Tokenization (Llama Tokenizer)

Other Minor Adjustments


Ok, so putting this all together gives us our SoTA decoder-only transformer:
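
To complement the diagram, here is a minimal PyTorch sketch of one such decoder block: RMSNorm pre-normalization, RoPE applied to queries and keys, grouped-query attention, and a SwiGLU feed-forward. This is not Llama's actual implementation; the dimensions (dim=512, n_heads=8, n_kv_heads=2, hidden_dim=1408) are made-up illustrative values, and it omits KV-caching and any FlashAttention-specific kernel work.

```python
# Minimal sketch of a Llama-style decoder block (illustrative, not Llama's code).
import torch
import torch.nn as nn
import torch.nn.functional as F


class RMSNorm(nn.Module):
    def __init__(self, dim, eps=1e-6):
        super().__init__()
        self.weight = nn.Parameter(torch.ones(dim))
        self.eps = eps

    def forward(self, x):
        # Scale by the root-mean-square of the activations (no mean subtraction, no bias).
        return self.weight * x * torch.rsqrt(x.pow(2).mean(-1, keepdim=True) + self.eps)


def rope(x, base=10000.0):
    # Rotary positional embeddings: rotate channel pairs by a position-dependent angle.
    # x: (batch, n_heads, seq_len, head_dim)
    b, h, t, d = x.shape
    half = d // 2
    freqs = base ** (-torch.arange(0, half, dtype=torch.float32) / half)      # (half,)
    angles = torch.arange(t, dtype=torch.float32)[:, None] * freqs[None, :]   # (t, half)
    cos, sin = angles.cos(), angles.sin()
    x1, x2 = x[..., :half], x[..., half:]
    return torch.cat([x1 * cos - x2 * sin, x1 * sin + x2 * cos], dim=-1)


class DecoderBlock(nn.Module):
    def __init__(self, dim=512, n_heads=8, n_kv_heads=2, hidden_dim=1408):
        super().__init__()
        self.n_heads, self.n_kv_heads = n_heads, n_kv_heads
        self.head_dim = dim // n_heads
        # Grouped-query attention: fewer K/V heads than Q heads.
        self.wq = nn.Linear(dim, n_heads * self.head_dim, bias=False)
        self.wk = nn.Linear(dim, n_kv_heads * self.head_dim, bias=False)
        self.wv = nn.Linear(dim, n_kv_heads * self.head_dim, bias=False)
        self.wo = nn.Linear(n_heads * self.head_dim, dim, bias=False)
        # SwiGLU feed-forward: w2(silu(w1(x)) * w3(x))
        self.w1 = nn.Linear(dim, hidden_dim, bias=False)
        self.w3 = nn.Linear(dim, hidden_dim, bias=False)
        self.w2 = nn.Linear(hidden_dim, dim, bias=False)
        # Pre-normalization with RMSNorm instead of post-LayerNorm.
        self.attn_norm = RMSNorm(dim)
        self.ffn_norm = RMSNorm(dim)

    def forward(self, x):
        b, t, _ = x.shape
        h = self.attn_norm(x)
        q = self.wq(h).view(b, t, self.n_heads, self.head_dim).transpose(1, 2)
        k = self.wk(h).view(b, t, self.n_kv_heads, self.head_dim).transpose(1, 2)
        v = self.wv(h).view(b, t, self.n_kv_heads, self.head_dim).transpose(1, 2)
        q, k = rope(q), rope(k)
        # Each K/V head is shared by n_heads // n_kv_heads query heads.
        k = k.repeat_interleave(self.n_heads // self.n_kv_heads, dim=1)
        v = v.repeat_interleave(self.n_heads // self.n_kv_heads, dim=1)
        attn = F.scaled_dot_product_attention(q, k, v, is_causal=True)
        x = x + self.wo(attn.transpose(1, 2).reshape(b, t, -1))
        h = self.ffn_norm(x)
        return x + self.w2(F.silu(self.w1(h)) * self.w3(h))


x = torch.randn(1, 16, 512)
print(DecoderBlock()(x).shape)  # torch.Size([1, 16, 512])
```

Stack N of these blocks between a token embedding and a final RMSNorm + linear head and you have the skeleton of the model in the diagram.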

Feel free to use these graphics. Slides link for those of you who want to tweak too 🙂
