By 2017, recurrent neural networks were the dominant architecture for sequence tasks: translation, language modeling, text generation. LSTMs and GRUs had solved the worst of the vanishing gradient problem that plagued vanilla RNNs. They worked. But they had two fundamental limitations that no amount of clever gating could fix.
First, sequential processing. An RNN processes tokens one at a time, left to right. The hidden state at position t depends on the hidden state at position t-1, which depends on t-2, and so on. You cannot compute the representation of position 50 until you have computed positions 1 through 49. This makes RNNs inherently serial. You can't parallelize the core computation, which means you can't fully exploit the thousands of cores sitting on a GPU. Training is slow. Scaling is painful.
Second, long-range dependencies. Even with LSTM gates designed to preserve information over long spans, the model still struggles when the relevant context is hundreds of tokens away. Information has to survive being passed through every intermediate step, getting compressed and potentially degraded at each one. In practice, LSTMs work well for spans of maybe 100-200 tokens. For a paragraph, that's fine. For a document, it's not.
That's the idea behind attention. And in June 2017, a team of eight researchers at Google published a paper that rebuilt sequence modeling from the ground up around that idea.
The paper, by Vaswani, Shazeer, Parmar, Uszkoreit, Jones, Gomez, Kaiser, and Polosukhin, introduced the Transformer architecture.[1] The title, "Attention Is All You Need," was provocative and accurate: they removed recurrence entirely, replacing it with a mechanism called self-attention that lets every position in a sequence attend directly to every other position. No sequential bottleneck. No information decay over distance.
The results were immediate. The Transformer matched or beat the best RNN-based models on machine translation (English-German, English-French) while training significantly faster — 3.5 days on 8 GPUs for the big model, compared to weeks for comparable RNN systems. But the real impact wasn't the benchmark scores. It was that the architecture scaled. More data, more parameters, more compute produced predictably better results. That property — which RNNs and LSTMs never reliably exhibited — turned out to be the thing that mattered most.
Within two years, the Transformer had replaced RNNs in virtually every NLP task. Within five years, it had expanded into vision (Vision Transformer, ViT), audio (Whisper), protein folding (AlphaFold 2), and code generation. Every major large language model — GPT, Claude, Gemini, Llama — is a Transformer.
Attention is not unique to the Transformer. Bahdanau, Cho, and Bengio introduced attention for neural machine translation in 2014, as an addition to an RNN encoder-decoder.[2] Their insight: when translating a sentence, instead of compressing the entire source sentence into a single fixed-length vector, let the decoder look back at all encoder positions and focus on the relevant ones for each output word.
The Transformer takes that idea and makes it the entire computation. No recurrence underneath. Attention all the way down.
The mechanism works through three learned projections called Query, Key, and Value. The intuition maps cleanly onto how you might think about information retrieval:

- Query: what this position is looking for.
- Key: what each position offers for matching, like an index entry.
- Value: the content each position actually carries, returned in proportion to the match.
The attention operation computes, for each query, how much it should attend to each key, then returns a weighted sum of the corresponding values. High attention weight means "this position is relevant to what I'm computing." Low weight means "ignore this."
In matrix form, the entire operation is:

Attention(Q, K, V) = softmax(QK^T / √d_k) V
Let's unpack this piece by piece.
QK^T is a matrix multiplication between the queries and the transpose of the keys. Each entry in the resulting matrix is a dot product between one query vector and one key vector. The dot product measures similarity — how much two vectors point in the same direction. So QK^T produces a matrix of "relevance scores": for every pair of positions (i, j), how much should position i attend to position j?
If you have a sequence of n tokens and each Q and K vector has dimension d_k, then Q is an n × d_k matrix, K^T is d_k × n, and the product QK^T is n × n — one score for every pair of positions.
/ √d_k is the scaling factor, and it matters more than it looks. The dot product of two d_k-dimensional vectors, when the components are roughly standard normal, has a variance that grows proportionally with d_k. For the original Transformer with d_k = 64, the raw dot products could have magnitudes around 8. That sounds modest, but the next step is softmax, which is exponential. A difference of 8 between logits means one value is about e^8 ≈ 3000 times larger than another after exponentiation. The softmax output would be essentially one-hot — all the attention concentrated on a single position — with gradients near zero everywhere else. Dividing by √d_k rescales the logits back to a range where softmax produces a useful distribution and gradients can flow.
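The saturation is easy to see numerically. A small sketch (NumPy, random vectors; the numbers are illustrative, not from the paper):

```python
import numpy as np

rng = np.random.default_rng(0)
d_k = 64

q = rng.standard_normal(d_k)            # one query vector
K = rng.standard_normal((10, d_k))      # ten key vectors
scores = K @ q                          # raw dot products, std ≈ sqrt(d_k) = 8

def softmax(x):
    e = np.exp(x - x.max())             # subtract max for numerical stability
    return e / e.sum()

unscaled = softmax(scores)
scaled = softmax(scores / np.sqrt(d_k))

# The unscaled distribution is nearly one-hot; the scaled one spreads attention.
print(f"max weight unscaled: {unscaled.max():.3f}, scaled: {scaled.max():.3f}")
```

Shrinking the logits by √d_k always flattens the softmax: the largest weight of the scaled distribution is strictly smaller than that of the unscaled one.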
softmax normalizes each row of scores into a probability distribution: position i's attention weights over all positions become non-negative numbers that sum to 1, interpretable as "what fraction of my attention do I give to each position?"
× V applies those weights. The output for position i is a weighted sum of all value vectors, where the weights are the attention scores. If position i attends strongly to position j, then position j's value vector contributes heavily to position i's output.
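The whole operation fits in a few lines. A minimal NumPy sketch, assuming one row per token and ignoring batching and masking:

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def attention(Q, K, V):
    """Scaled dot-product attention: softmax(Q K^T / sqrt(d_k)) V."""
    d_k = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)       # (n, n) pairwise relevance scores
    weights = softmax(scores, axis=-1)    # each row is a distribution over positions
    return weights @ V, weights           # weighted sum of value vectors

rng = np.random.default_rng(0)
n, d_k = 5, 64
Q = rng.standard_normal((n, d_k))
K = rng.standard_normal((n, d_k))
V = rng.standard_normal((n, d_k))
out, weights = attention(Q, K, V)
```

Each output row is a convex combination of the value rows, with the mixing weights determined entirely by query-key similarity.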
Consider the sentence: "The cat sat on the mat because it was tired."
When the model processes the word "it," it needs to figure out what "it" refers to. The query vector for "it" encodes something like "I'm a pronoun looking for my antecedent." The key vector for "cat" encodes something like "I'm a noun, an animal, a subject." The dot product between these two is high — the query and key are aligned. The key for "mat" is less aligned (it's a noun, but not a plausible antecedent for "tired"). So "it" attends strongly to "cat" and receives "cat"'s value vector, which carries the semantic content forward.
This is a simplification. The model doesn't literally learn "pronoun resolution" as a named concept. But it learns patterns of attention that achieve the same effect, entirely from seeing billions of examples of which words are relevant to which other words.
In the original Bahdanau attention, queries came from the decoder and keys/values came from the encoder — the decoder was attending to the input. Self-attention is what happens when Q, K, and V all come from the same sequence. Every position attends to every other position within the same input.
Mechanically, self-attention works by taking the input matrix X (where each row is the embedding for one token) and multiplying it by three learned weight matrices:

Q = X W^Q,  K = X W^K,  V = X W^V
The same input X produces all three projections, but through different weight matrices. This means Q, K, and V are different views of the same data — the model learns to project each token into a "what am I looking for?" space, a "what do I offer?" space, and a "what's my content?" space, separately.
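As a sketch (random matrices stand in for the learned weights):

```python
import numpy as np

rng = np.random.default_rng(0)
n, d_model, d_k = 6, 512, 64

X = rng.standard_normal((n, d_model))            # one row per token embedding
# In a trained model these are learned; random placeholders here.
W_Q = 0.02 * rng.standard_normal((d_model, d_k))
W_K = 0.02 * rng.standard_normal((d_model, d_k))
W_V = 0.02 * rng.standard_normal((d_model, d_k))

# Three different learned views of the same input X.
Q, K, V = X @ W_Q, X @ W_K, X @ W_V
```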
Self-attention is what makes the Transformer a fundamentally different architecture from anything that came before. In an RNN, the representation of a word is built from the words before it, sequentially. In self-attention, the representation of every word is built from all words simultaneously, weighted by relevance. Position 1 has direct access to position 500, with no intermediate compression. The computational graph is fully connected, not sequential.
A single attention operation computes one set of relevance scores. But language has multiple simultaneous types of relationship. In "The cat sat on the mat because it was tired," the word "it" has a coreference relationship with "cat," a syntactic relationship with "was" (subject-verb), and a logical relationship with "because" (causal). A single attention head would have to compress all of these relationships into one set of weights.
The Transformer instead runs multiple attention operations in parallel, each with its own learned W^Q, W^K, W^V matrices. These are called attention heads. Each head can learn to focus on a different type of relationship.
In the original Transformer, d_model = 512 and h = 8, so each head operates on 64-dimensional vectors. The outputs of all heads are concatenated and multiplied by a final projection matrix W^O to produce the output.
The key insight: the total computation is roughly the same as a single attention head with full dimensionality. You're not multiplying the cost by 8. You're dividing the representation into 8 subspaces and letting each one learn its own attention pattern independently. Then you recombine. It's a form of parallelism built into the architecture itself.
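The split-attend-concatenate pattern can be sketched in NumPy (single batch, no masking; the weight matrices are random placeholders for learned parameters):

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def multi_head_attention(X, W_Q, W_K, W_V, W_O, h):
    """Split d_model into h subspaces, attend in each independently, recombine."""
    n, d_model = X.shape
    d_head = d_model // h
    Q, K, V = X @ W_Q, X @ W_K, X @ W_V                    # (n, d_model) each
    # Reshape to (h, n, d_head): one independent attention problem per head.
    Qh = Q.reshape(n, h, d_head).transpose(1, 0, 2)
    Kh = K.reshape(n, h, d_head).transpose(1, 0, 2)
    Vh = V.reshape(n, h, d_head).transpose(1, 0, 2)
    scores = Qh @ Kh.transpose(0, 2, 1) / np.sqrt(d_head)  # (h, n, n)
    heads = softmax(scores, axis=-1) @ Vh                  # (h, n, d_head)
    concat = heads.transpose(1, 0, 2).reshape(n, d_model)  # concatenate the heads
    return concat @ W_O                                    # final output projection

rng = np.random.default_rng(0)
n, d_model, h = 10, 512, 8
X = rng.standard_normal((n, d_model))
proj = lambda: 0.02 * rng.standard_normal((d_model, d_model))  # random stand-ins
out = multi_head_attention(X, proj(), proj(), proj(), proj(), h)
```

Note that the reshape does the "dividing into 8 subspaces": each head sees only its own 64-dimensional slice, yet the total number of multiply-adds matches a single full-width head.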
Empirical analysis of trained Transformers shows that heads do specialize. Some heads learn syntactic relationships (subject-verb agreement), others learn positional patterns (attend to the previous token), others learn semantic similarity (attend to words with related meanings). This specialization isn't programmed — it emerges from training.3
Self-attention has a notable property: it is permutation-equivariant. If you shuffle the order of the input tokens, the attention scores between any pair of tokens don't change (the dot products are the same regardless of position); the outputs simply shuffle along with the inputs. But word order obviously matters — "the dog bit the man" and "the man bit the dog" have very different meanings.
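This is easy to check numerically: permuting the input rows permutes the output rows identically, with nothing in the computation encoding order (NumPy, random data):

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def attend(Q, K, V):
    return softmax(Q @ K.T / np.sqrt(Q.shape[-1])) @ V

rng = np.random.default_rng(0)
n, d_k = 6, 16
Q = rng.standard_normal((n, d_k))
K = rng.standard_normal((n, d_k))
V = rng.standard_normal((n, d_k))

perm = rng.permutation(n)
out = attend(Q, K, V)
out_shuffled = attend(Q[perm], K[perm], V[perm])

# Shuffling the tokens just shuffles the outputs the same way: order carries no signal.
same = np.allclose(out[perm], out_shuffled)
```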
RNNs get order for free because they process tokens sequentially. The Transformer, processing all tokens in parallel, has to inject position information explicitly. The original paper does this by adding a positional encoding vector to each token's embedding before it enters the attention layers.
Vaswani et al. used sinusoidal functions:

PE(pos, 2i) = sin(pos / 10000^(2i/d_model))
PE(pos, 2i+1) = cos(pos / 10000^(2i/d_model))
Each position gets a unique vector. Even dimensions use sine, odd dimensions use cosine, with frequencies that decrease geometrically across dimensions. The result is that each position has a unique "fingerprint" and — crucially — the relative distance between two positions can be derived from their encodings through a linear transformation. The model can learn to use relative position information even though it's given absolute positions.
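A direct NumPy transcription of the sinusoidal scheme:

```python
import numpy as np

def sinusoidal_encoding(n_positions, d_model):
    """PE[pos, 2i] = sin(pos / 10000^(2i/d_model)); PE[pos, 2i+1] = cos(same angle)."""
    pos = np.arange(n_positions)[:, None]        # (n_positions, 1)
    i = np.arange(0, d_model, 2)[None, :]        # even dimension indices 0, 2, 4, ...
    angles = pos / (10000 ** (i / d_model))      # (n_positions, d_model/2)
    pe = np.zeros((n_positions, d_model))
    pe[:, 0::2] = np.sin(angles)                 # even dimensions: sine
    pe[:, 1::2] = np.cos(angles)                 # odd dimensions: cosine
    return pe

pe = sinusoidal_encoding(100, 512)               # one unique vector per position
```

These vectors are added element-wise to the token embeddings before the first attention layer.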
The authors chose sinusoidal encoding partly because it generalizes: the model can potentially handle sequences longer than any seen during training, since the encoding pattern is defined for any position. In practice, modern Transformers have largely moved to learned positional embeddings (GPT-2, BERT) or rotary position embeddings (RoPE),[4] which encode relative position directly into the attention computation. But the original sinusoidal scheme established the key principle: position is information, and it has to be supplied.
Attention is the signature innovation, but a Transformer block has four components, not one. The full block, repeated N times (6 in the original paper, 96 or more in modern large models), combines:

1. Multi-head self-attention, which mixes information across positions.
2. A position-wise feed-forward network: two linear layers with a nonlinearity, applied independently to each position. There is evidence that these layers act as key-value memories storing learned associations.[5]
3. Residual connections around each of the two sublayers, which keep gradients flowing through deep stacks.
4. Layer normalization applied to each sublayer's output, which stabilizes training.
Stack this block 6 times (or 96 times, or 128 times) and you have the core of the Transformer. Each layer refines the representations, with early layers typically learning local syntactic patterns and later layers learning global semantic relationships.
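Under some simplifying assumptions, one block can be sketched in NumPy (a single attention head instead of multi-head, layer norm without learned gain and bias, random placeholder weights, post-norm as in the original paper):

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def layer_norm(x, eps=1e-5):
    mu = x.mean(-1, keepdims=True)
    var = x.var(-1, keepdims=True)
    return (x - mu) / np.sqrt(var + eps)

def self_attention(X, W_Q, W_K, W_V):
    Q, K, V = X @ W_Q, X @ W_K, X @ W_V
    scores = Q @ K.T / np.sqrt(Q.shape[-1])
    return softmax(scores, axis=-1) @ V

def transformer_block(X, p):
    # Self-attention sublayer, then residual connection + layer norm (post-norm).
    X = layer_norm(X + self_attention(X, p["W_Q"], p["W_K"], p["W_V"]))
    # Position-wise feed-forward: two linear layers with a ReLU in between.
    ffn = np.maximum(0, X @ p["W_1"]) @ p["W_2"]
    return layer_norm(X + ffn)

rng = np.random.default_rng(0)
d_model, d_ff, n = 512, 2048, 10
p = {k: 0.02 * rng.standard_normal(s) for k, s in {
    "W_Q": (d_model, d_model), "W_K": (d_model, d_model), "W_V": (d_model, d_model),
    "W_1": (d_model, d_ff), "W_2": (d_ff, d_model)}.items()}
X = rng.standard_normal((n, d_model))
out = transformer_block(X, p)
```

Stacking N such blocks (feeding each block's output into the next) gives the core of the full model.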
The original Transformer was designed for machine translation, which requires reading an input (source language) and producing an output (target language). It uses an encoder-decoder architecture: the encoder builds representations of the source sentence with bidirectional self-attention, and the decoder generates the target one token at a time, using causal self-attention over its own output plus cross-attention over the encoder's representations.
This was the original architecture. But researchers quickly discovered that you could take pieces of it and build powerful models for different tasks:
| Variant | Structure | Key example | Use case |
|---|---|---|---|
| Encoder-decoder | Full original architecture | T5 (Raffel et al., 2020) | Translation, summarization |
| Encoder-only | Just the encoder, bidirectional attention | BERT (Devlin et al., 2019) | Classification, NER, understanding |
| Decoder-only | Just the decoder, causal (left-to-right) attention | GPT (Radford et al., 2018) | Text generation, language modeling |
The decoder-only variant won the scaling race. GPT-2 (2019), GPT-3 (2020), GPT-4 (2023), Claude, Gemini, Llama — all decoder-only Transformers. The reason is pragmatic: decoder-only models have one training objective (predict the next token), one attention pattern (causal mask), and one architecture to scale. This simplicity made it easier to study scaling behavior and push to extreme sizes. BERT-style encoder models are still widely used for tasks like search and classification, but the generative frontier is decoder-only.
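The causal pattern comes from a mask: every attention score for a future position is set to -inf before the softmax, which zeroes its weight. A NumPy sketch:

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

n = 5
scores = np.random.default_rng(0).standard_normal((n, n))

# Causal mask: position i may attend only to positions j <= i.
mask = np.triu(np.ones((n, n), dtype=bool), k=1)   # True strictly above the diagonal
masked = np.where(mask, -np.inf, scores)           # -inf becomes weight 0 after softmax
weights = softmax(masked, axis=-1)
```

Row i of `weights` is a distribution over positions 0..i only, so position 0 attends entirely to itself and no position can "see" the future tokens it is being trained to predict.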
The Transformer displaced RNNs, LSTMs, and convolutional approaches for sequence modeling within roughly two years. That's fast, even by machine learning standards. Four properties explain it.
Parallel computation. An RNN's hidden state at position t depends on position t-1. You cannot parallelize across positions. A Transformer's attention operation is a matrix multiplication — all positions are computed simultaneously. On modern GPUs with thousands of cores, this is the difference between a sequential for-loop and a single parallel operation. Training a large Transformer is expensive in absolute terms, but the cost scales with hardware in a way that RNN training fundamentally cannot.
Constant path length. In an RNN, the connection between position 1 and position 500 passes through 499 sequential steps. Each step potentially degrades the signal. In a Transformer, position 1 and position 500 are connected in a single attention operation — one matrix multiplication, one softmax. The path length between any two positions is O(1), not O(n). This is the architectural reason Transformers handle long documents, multi-turn conversations, and complex reasoning chains better than RNNs.
Predictable scaling. This is the property that matters most and was least obvious at the time of the original paper. Transformers exhibit remarkably predictable scaling laws: performance (measured by loss) improves as a smooth power-law function of model size, dataset size, and compute budget. Kaplan et al. (2020) at OpenAI quantified this empirically — if you double the parameters, you get a predictable improvement in loss, and this relationship holds over many orders of magnitude.[6]
RNNs and LSTMs never showed this kind of predictable scaling. Performance tended to plateau or become unstable at large scales. The Transformer's combination of residual connections, layer normalization, and attention (with no recurrent state to destabilize) creates an architecture that just keeps getting better when you make it bigger. This is what enabled the jump from GPT-2 (1.5B parameters) to GPT-3 (175B) to GPT-4 (rumored >1T) — each increase delivered proportional improvements.
Generality. The attention mechanism makes no assumptions about the structure of its input. It works on any set of vectors. This means the same architecture, with minimal modification, works for text, images (by treating image patches as tokens), audio (spectrograms as sequences), protein sequences, code, and even graph-structured data. The Transformer didn't just win at language — it became the universal architecture.
The Transformer is not without limitations, and they're worth understanding because they shape the practical constraints of every model built on it.
Quadratic attention. Self-attention computes a score between every pair of positions. For a sequence of length n, that's an n × n matrix — O(n^2) in both memory and computation. For a 4,096-token context window, that's ~16 million attention scores per layer per head. For 128,000 tokens (GPT-4's context), it's ~16 billion. This quadratic cost is the primary reason context windows have hard limits, and it's driven significant research into efficient attention variants — sparse attention, linear attention, Flash Attention (which doesn't reduce the theoretical complexity but dramatically improves the memory access patterns).[7]
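The growth is easy to tabulate. (Storing one full n × n score matrix in fp32 is assumed purely for illustration; real kernels such as FlashAttention avoid materializing it.)

```python
# Score count and fp32 memory for one full n x n attention matrix (per layer, per head).
rows = []
for n in (4_096, 32_768, 128_000):
    scores = n * n
    gigabytes = scores * 4 / 1e9       # 4 bytes per fp32 score (illustrative accounting)
    rows.append((n, scores, gigabytes))
    print(f"{n:>7} tokens -> {scores:>14,} scores  ({gigabytes:8.1f} GB)")
```

A 31x increase in context length (4,096 to 128,000) costs roughly a 1,000x increase in scores, which is why long contexts demand algorithmic workarounds rather than just bigger hardware.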
No inherent recurrence. A Transformer has no mechanism for iterative refinement within a single forward pass. An RNN can, in principle, run for as many steps as needed. A Transformer gets exactly N layers of processing, regardless of whether the problem is easy or hard. This is a real limitation for tasks that require variable amounts of computation — and it's one reason chain-of-thought prompting works: it converts serial reasoning into sequential tokens, giving the model more "steps" within its fixed architecture.
Positional encoding limitations. The model has no built-in sense of order. The position information is added (via encoding or embedding), but it's just another vector component that can be overridden or ignored by the learned weights. In practice, positional understanding is good but not perfect, especially for very long sequences or tasks that require precise counting.
The Transformer paper has over 130,000 citations as of early 2026. It is, by several measures, the most impactful machine learning paper ever published. But the truly remarkable thing is not the architecture itself — it's that the Transformer turned out to be the architecture where scale works.
The attention mechanism is elegant. Multi-head attention is clever. Positional encoding is a necessary solution to a real problem. But individually, none of these ideas were unprecedented. Attention existed. Feed-forward networks existed. Residual connections and layer normalization existed. What Vaswani et al. did was assemble them into a configuration that turned out to be extraordinarily well-suited to massive-scale training on GPU hardware — and then the scaling laws did the rest.
The Transformer is not the final architecture. Research continues on state-space models (Mamba), mixture-of-experts, and other approaches that address the quadratic attention bottleneck.[8] But as of now, every frontier model is a Transformer, and the path from the 2017 paper to the current state of AI is a straight line through scaling.
[1] Vaswani, A., Shazeer, N., Parmar, N., Uszkoreit, J., Jones, L., Gomez, A. N., Kaiser, L., & Polosukhin, I. (2017). "Attention Is All You Need." Advances in Neural Information Processing Systems 30 (NeurIPS 2017). The paper was originally posted to arXiv in June 2017 and presented at NeurIPS in December 2017.
[2] Bahdanau, D., Cho, K., & Bengio, Y. (2015). "Neural Machine Translation by Jointly Learning to Align and Translate." Proceedings of the 3rd International Conference on Learning Representations (ICLR 2015). Originally posted to arXiv in September 2014. This paper introduced additive attention for sequence-to-sequence models.
[3] Clark, K., Khandelwal, U., Levy, O., & Manning, C. D. (2019). "What Does BERT Look At? An Analysis of BERT's Attention." Proceedings of the 2019 ACL Workshop BlackboxNLP. Showed that specific BERT attention heads correspond to syntactic relations like direct objects and coreference.
[4] Su, J., Lu, Y., Pan, S., Murtadha, A., Wen, B., & Liu, Y. (2024). "RoFormer: Enhanced Transformer with Rotary Position Embedding." Neurocomputing, 568, 127063. Originally posted to arXiv in April 2021. RoPE encodes relative position through rotation of the query and key vectors, and is used in LLaMA, Mistral, and other modern models.
[5] Geva, M., Schuster, R., Berant, J., & Levy, O. (2021). "Transformer Feed-Forward Layers Are Key-Value Memories." Proceedings of the 2021 Conference on Empirical Methods in Natural Language Processing (EMNLP). Showed that FFN layers in Transformers store factual associations and can be interpreted as key-value memory systems.
[6] Kaplan, J., McCandlish, S., Henighan, T., Brown, T. B., Chess, B., Child, R., Gray, S., Radford, A., Wu, J., & Amodei, D. (2020). "Scaling Laws for Neural Language Models." arXiv:2001.08361. Demonstrated power-law relationships between model performance and compute, dataset size, and parameter count.
[7] Dao, T., Fu, D. Y., Ermon, S., Rudra, A., & Ré, C. (2022). "FlashAttention: Fast and Memory-Efficient Exact Attention with IO-Awareness." Advances in Neural Information Processing Systems 35 (NeurIPS 2022). Achieves 2-4x wall-clock speedup on attention by reducing memory reads/writes, without approximation.
[8] Gu, A. & Dao, T. (2024). "Mamba: Linear-Time Sequence Modeling with Selective State Spaces." Proceedings of the 41st International Conference on Machine Learning (ICML 2024). A state-space model that matches Transformer quality at smaller scales with linear (rather than quadratic) sequence-length scaling. Originally posted to arXiv in December 2023.