Transformers
What Are Transformers?
Transformers are the deep learning architecture that powers virtually all modern AI language models, including ChatGPT, Claude, Gemini, and Llama. Introduced in 2017 in a landmark paper titled 'Attention Is All You Need,' transformers solved a fundamental problem: how to process all parts of an input (like every word in a long document) simultaneously rather than one at a time. The key innovation is a mechanism called 'self-attention,' which lets the model weigh the importance of every word relative to every other word when making predictions. This means a transformer reading the sentence 'The bank by the river' can immediately connect 'bank' with 'river' to understand it means a riverbank, not a financial institution. Because transformers process data in parallel rather than sequentially, they can be trained much faster on massive datasets using modern GPUs. This parallelism, combined with self-attention, enabled the scaling of language models to billions of parameters.
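The weighting idea described above can be sketched in a few lines of NumPy. This is a deliberately minimal, hypothetical illustration (no learned projections, toy two-dimensional embeddings): each token's output is a mixture of all tokens, with mixing weights derived from pairwise similarity.

```python
import numpy as np

def softmax(x, axis=-1):
    # Numerically stable softmax: rows become probability distributions.
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def self_attention(X):
    """Simplest possible self-attention: queries, keys, and values
    are all just the raw token embeddings themselves."""
    d = X.shape[-1]
    scores = X @ X.T / np.sqrt(d)       # token-to-token relevance
    weights = softmax(scores, axis=-1)  # each row sums to 1
    return weights @ X                  # weighted mix over all tokens

# Toy embeddings for a 3-token "sentence" (values are made up)
X = np.array([[1.0, 0.0],
              [0.9, 0.1],
              [0.0, 1.0]])
out = self_attention(X)
print(out.shape)  # one context-aware vector per token
```

Because every row of `scores` is computed independently, the whole matrix product runs in parallel on a GPU, which is the parallelism advantage the paragraph above describes.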
Technical Deep Dive
Transformers are sequence-to-sequence architectures based entirely on self-attention mechanisms, eliminating the recurrence and convolution used in prior models. The original architecture (Vaswani et al., 2017) consists of an encoder-decoder structure with multi-head scaled dot-product self-attention, position-wise feedforward networks, residual connections, and layer normalization. Self-attention projects each token into query, key, and value vectors; attention weights, computed from query-key dot products, reflect token-to-token relevance and are used to form a weighted sum of the values. Positional encodings (sinusoidal or learned) inject sequence order information. Variants include encoder-only models (BERT, for classification and extraction), decoder-only models (GPT series, for generation), and encoder-decoder models (T5, BART, for seq2seq tasks). Key innovations since the original include rotary positional embeddings (RoPE), grouped query attention, flash attention for memory-efficient computation, and mixture-of-experts layers for conditional scaling. Transformers have been extended beyond NLP to vision (ViT), audio (Whisper), protein folding (AlphaFold2), and multimodal domains.
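The pieces named above (sinusoidal positional encodings, multi-head scaled dot-product attention) can be sketched more concretely. The sketch below is a simplified single-layer illustration under assumed dimensions, with random matrices standing in for learned projection weights; a real implementation would add masking, dropout, residual connections, and layer normalization.

```python
import numpy as np

rng = np.random.default_rng(0)

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def sinusoidal_positions(seq_len, d_model):
    """Fixed sinusoidal positional encodings (Vaswani et al., 2017):
    even channels get sin, odd channels get cos, at geometric frequencies."""
    pos = np.arange(seq_len)[:, None]
    i = np.arange(d_model // 2)[None, :]
    angles = pos / (10000 ** (2 * i / d_model))
    pe = np.zeros((seq_len, d_model))
    pe[:, 0::2] = np.sin(angles)
    pe[:, 1::2] = np.cos(angles)
    return pe

def multi_head_attention(X, n_heads, Wq, Wk, Wv, Wo):
    seq_len, d_model = X.shape
    d_head = d_model // n_heads

    def project(W):
        # Project, then split into heads: (n_heads, seq_len, d_head)
        return (X @ W).reshape(seq_len, n_heads, d_head).transpose(1, 0, 2)

    Q, K, V = project(Wq), project(Wk), project(Wv)
    # Scaled dot-product attention, computed per head in parallel
    scores = Q @ K.transpose(0, 2, 1) / np.sqrt(d_head)
    heads = softmax(scores, axis=-1) @ V
    # Concatenate heads and apply the output projection
    concat = heads.transpose(1, 0, 2).reshape(seq_len, d_model)
    return concat @ Wo

d_model, n_heads, seq_len = 8, 2, 5
# Random stand-ins for learned weights and token embeddings (assumptions)
X = rng.normal(size=(seq_len, d_model)) + sinusoidal_positions(seq_len, d_model)
Wq, Wk, Wv, Wo = (rng.normal(size=(d_model, d_model)) for _ in range(4))
out = multi_head_attention(X, n_heads, Wq, Wk, Wv, Wo)
print(out.shape)  # same shape as the input: (seq_len, d_model)
```

Splitting `d_model` across heads keeps the total cost comparable to single-head attention while letting each head attend to different relational patterns.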
Why It Matters
Transformers are the architectural breakthrough behind ChatGPT, Claude, Google Translate's modern engine, GitHub Copilot, DALL-E, and nearly every cutting-edge AI system released since 2020.
Related Concepts
Part of
- Deep Learning (DL) (architectures)
Includes
- Foundation Models (enables)
- Large Language Models (LLM) (powers)