A neural network is a function that takes vectors of numbers as input, multiplies them by matrices of numbers, and produces vectors of numbers as output. Every operation inside a transformer — attention, feedforward layers, layer normalization — is linear algebra applied to floating-point arrays. There is no instruction anywhere in the architecture that says "read this word" or "understand this sentence." The model operates entirely in the domain of numbers.
So text has to be converted. The question is how. The choice of conversion scheme — the tokenizer — determines what the model sees, how efficiently it uses its context window, and where entire categories of failure modes come from. Tokenization is not a preprocessing detail. It is a design decision that shapes everything downstream.
The simplest approach is to treat each character as a token. The English alphabet has 26 lowercase letters, 26 uppercase, 10 digits, and a handful of punctuation marks. A character-level vocabulary is small — maybe 256 entries if you include all of ASCII, or a few thousand if you handle Unicode broadly.
The problem is sequence length. The sentence "The transformer architecture revolutionized NLP" is 47 characters. That's 47 timesteps the model has to process, 47 positions that need attention computations. And attention is O(n²) in the sequence length — doubling the number of tokens quadruples the attention cost. Character-level models also have to learn spelling from scratch. The model has to figure out that t-h-e means one thing and t-h-a-t means another, which wastes capacity on patterns that are already solved by higher-level representations.
The opposite extreme is word-level tokenization. Split on whitespace and punctuation, assign each unique word an integer. "The" gets ID 1, "cat" gets ID 42, and so on. This is compact — "The transformer architecture revolutionized NLP" becomes just 5 tokens. But the vocabulary explodes. English has roughly 170,000 words in current use, and you also need every inflection (run, runs, running, ran), every proper noun, every technical term. A word-level vocabulary for a general-purpose model might need millions of entries. Most of those entries appear rarely in training data, so the model barely learns their embeddings. And the killer problem: any word not in the vocabulary is simply unrepresentable. The model can't handle it at all.
| Approach | Vocabulary size | Sequence length | Unknown words |
|---|---|---|---|
| Character-level | ~256 | Very long | None (all chars known) |
| Word-level | 100K–1M+ | Short | Frequent (OOV problem) |
| Subword (BPE) | 32K–100K | Moderate | Rare (falls back to chars) |
Neither extreme works well. What you want is something in between: a vocabulary large enough that common words get their own token, but flexible enough that rare or novel words can be constructed from smaller pieces. That's subword tokenization, and its dominant algorithm is Byte-Pair Encoding.
Byte-Pair Encoding was originally a data compression algorithm described by Philip Gage in 1994.[1] Its application to neural network vocabulary construction was proposed by Sennrich, Haddow, and Birch in 2016 for machine translation, and it became the standard approach for most large language models.[2]
The algorithm is straightforward. Start with a vocabulary consisting of individual characters (or bytes). Then, iteratively:

1. Count every adjacent pair of tokens in the training corpus.
2. Merge the most frequent pair into a single new token and add it to the vocabulary.
3. Repeat until the vocabulary reaches the target size.
The number of merge operations controls the final vocabulary size. GPT-2 uses about 50,257 tokens. GPT-4 reportedly uses around 100,000. Llama 2 uses 32,000. These aren't magic numbers — they're engineering trade-offs between sequence length and vocabulary coverage.
Take a small corpus consisting of the words: low, lower, newest, widest. Suppose they appear with these frequencies:
low: 5 times, lower: 2 times, newest: 6 times, widest: 3 times
Start by splitting every word into characters (with a special end-of-word marker _): l o w _, l o w e r _, n e w e s t _, and w i d e s t _.
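Using the corpus and frequencies above, the training loop can be sketched in a few lines of Python. This is a toy illustration: real implementations work at the byte level, cap the number of merges at the target vocabulary size, and break frequency ties deterministically.

```python
from collections import Counter

def pair_counts(vocab):
    # vocab maps a word (tuple of symbols) to its corpus frequency.
    counts = Counter()
    for symbols, freq in vocab.items():
        for pair in zip(symbols, symbols[1:]):
            counts[pair] += freq
    return counts

def apply_merge(pair, vocab):
    # Rewrite every word, replacing each occurrence of `pair` with one symbol.
    new_vocab = {}
    for symbols, freq in vocab.items():
        out, i = [], 0
        while i < len(symbols):
            if symbols[i:i + 2] == pair:
                out.append(pair[0] + pair[1])
                i += 2
            else:
                out.append(symbols[i])
                i += 1
        new_vocab[tuple(out)] = freq
    return new_vocab

# "low" x5, "lower" x2, "newest" x6, "widest" x3, split into characters
# plus the end-of-word marker "_".
vocab = {
    ("l", "o", "w", "_"): 5,
    ("l", "o", "w", "e", "r", "_"): 2,
    ("n", "e", "w", "e", "s", "t", "_"): 6,
    ("w", "i", "d", "e", "s", "t", "_"): 3,
}

merges = []
while True:
    counts = pair_counts(vocab)
    if not counts:
        break                      # every word is now a single token
    best = max(counts, key=counts.get)
    merges.append(best)
    vocab = apply_merge(best, vocab)
    print(f"merge {len(merges)}: {best[0]} + {best[1]}")
```

Run to exhaustion on this tiny corpus, the loop eventually collapses each word into a single token; in practice you stop after a fixed number of merges, and the recorded merge list *is* the tokenizer.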
After enough merges, frequent words like "the" or "and" become single tokens. Less common words get split into familiar pieces: "tokenization" might become token + ization, while "untoken" might become un + token. A completely novel word like "bloxify" might fall all the way back to bl + ox + ify, or even individual characters — built from subword units the model has seen in other contexts.
This is the power of BPE: it naturally learns a hierarchy of granularity. The most common patterns get their own tokens; everything else composes from smaller learned units. No word is truly out-of-vocabulary, because the fallback is always characters (or bytes).
When you type a prompt into an API, the tokenizer runs before anything else. Here's roughly what happens to a sentence under GPT-style BPE tokenization:
Note that "Tokenization" was split into two tokens (Token + ization) while "words" got its own single token but "subwords" was split into sub + words. The exact split depends on what the tokenizer learned from its training corpus. Common words are single tokens. Less common words are compositions. The token IDs shown are illustrative — the actual integers depend on the specific tokenizer's vocabulary file.
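Encoding at inference time replays the learned merges greedily: split the text into characters, then repeatedly apply whichever learned merge has the highest priority (earliest learned). A sketch with a small hand-picked merge list — real tokenizers ship tens of thousands of merges:

```python
def bpe_encode(word, merges):
    # merges: pairs in learned order; earlier in the list = higher priority.
    ranks = {pair: i for i, pair in enumerate(merges)}
    symbols = list(word) + ["_"]   # end-of-word marker, as in training
    while True:
        # Find the adjacent pair with the best (lowest) merge rank, if any.
        candidates = [(ranks[pair], i)
                      for i, pair in enumerate(zip(symbols, symbols[1:]))
                      if pair in ranks]
        if not candidates:
            return symbols
        _, i = min(candidates)
        symbols[i:i + 2] = [symbols[i] + symbols[i + 1]]

# Hypothetical merge list for illustration only.
merges = [("e", "s"), ("es", "t"), ("est", "_"),
          ("l", "o"), ("lo", "w"), ("low", "_")]
print(bpe_encode("lowest", merges))   # ['low', 'est_']
print(bpe_encode("low", merges))      # ['low_']
```

Note how "lowest", which never appeared in training, still decomposes into two meaningful learned units.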
Vocabulary size is one of the most consequential hyperparameters in model design, even though it rarely gets the attention that hidden dimension or number of layers does.
A smaller vocabulary (say, 32K tokens) means:

- A smaller embedding table, so fewer parameters spent on input and output representations.
- Longer token sequences for the same text, which consumes context window and attention compute.
- More words split into pieces, forcing the model to compose meaning from fragments more often.

A larger vocabulary (say, 100K+ tokens) means:

- Shorter sequences — more text fits in the same context window.
- A larger embedding table, with more rare tokens whose embeddings see few training examples.
- Better coverage of multiple languages, scripts, and code.
There's a sweet spot that depends on the training data size, the languages covered, and the target use cases. Multilingual models need larger vocabularies because they cover more scripts. Code-oriented models might dedicate vocabulary space to common programming patterns like def, return, or function. The trend in recent models has been toward larger vocabularies — GPT-4's ~100K versus GPT-2's ~50K — enabled by larger training datasets that provide sufficient examples for each token's embedding.
The reason BPE works as well as it does comes down to a linguistic observation: morphology is compositional. The word "unhappiness" is built from un- (negation prefix) + happy (root) + -ness (noun suffix). If the tokenizer splits it into un + happiness or un + happi + ness, the model can reuse its understanding of "un-" from "unlikely," "unusual," and "unfair." It can reuse its understanding of "-ness" from "darkness" and "sadness."
This decomposition isn't designed by linguists — it emerges from frequency statistics. BPE discovers morphological structure because morphemes (meaningful word parts) tend to recur frequently. The prefix "re-" appears in thousands of English words, so it naturally gets merged into its own token early in the BPE process. The model doesn't know it's a prefix. It just knows it's a common byte sequence.
SentencePiece, developed by Taku Kudo and John Richardson at Google in 2018, solves a problem that standard BPE has with non-English languages.[3] BPE implementations typically assume whitespace separates words — you split on spaces first, then apply merges within each word. But many languages (Japanese, Chinese, Thai) don't use spaces between words. And even in English, the assumption that spaces are word boundaries creates inconsistencies.
SentencePiece treats the input as a raw stream of Unicode characters (or bytes), including spaces. It represents spaces as a special character (the "meta-space," often displayed as ▁) and applies either BPE or a different algorithm called unigram language modeling to learn the vocabulary. The result is a tokenizer that works identically regardless of the input language — it doesn't need language-specific preprocessing or word segmentation.
Llama, T5, and many multilingual models use SentencePiece. Its language-agnosticism makes it the default choice for models intended to work across languages.
WordPiece, originally developed by Schuster and Nakajima in 2012 for Japanese and Korean speech processing and later adopted for BERT, is similar to BPE but differs in how it selects merges.[4] Instead of merging the most frequent pair, WordPiece merges the pair that maximizes the likelihood of the training data when treated as a language model. In practice, this means it prefers merges that produce tokens carrying high mutual information — tokens that co-occur more often than you'd expect by chance.
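The difference in selection criteria can be sketched with a toy score: count(ab) / (count(a) · count(b)) as a stand-in for the likelihood gain. The corpus and counts below are invented for illustration:

```python
from collections import Counter

def tally(vocab):
    # Count individual symbols and adjacent pairs, weighted by word frequency.
    pair_c, unit_c = Counter(), Counter()
    for symbols, freq in vocab.items():
        for s in symbols:
            unit_c[s] += freq
        for pair in zip(symbols, symbols[1:]):
            pair_c[pair] += freq
    return pair_c, unit_c

# Invented corpus: "undo" x3, "run" x5, "no" x4, pre-split into characters.
vocab = {("u", "n", "d", "o"): 3, ("r", "u", "n"): 5, ("n", "o"): 4}
pair_c, unit_c = tally(vocab)

bpe_pick = max(pair_c, key=pair_c.get)   # BPE: raw pair frequency
wp_pick = max(pair_c,                    # WordPiece-style: association score
              key=lambda p: pair_c[p] / (unit_c[p[0]] * unit_c[p[1]]))

print(bpe_pick)   # ('u', 'n') — the most frequent pair overall
print(wp_pick)    # ('d', 'o') — 'd' only ever precedes 'o': high association
```

BPE grabs the most frequent pair; the WordPiece-style score prefers the pair whose parts predict each other, even at lower raw frequency.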
WordPiece marks continuation tokens with a ## prefix. The word "embedding" might tokenize as em + ##bed + ##ding. This convention makes it easy to reconstruct the original text — any token starting with ## is glued to the previous token.
In practice, the differences between BPE, SentencePiece, and WordPiece are less dramatic than you might expect. They all produce subword vocabularies of similar quality. The choice is often driven by engineering convenience and the specific model's requirements rather than fundamental accuracy differences.
Once the tokenizer converts text to a sequence of integer IDs, the model needs to convert those integers into vectors the network can process. This happens through the embedding table — a matrix of shape (vocabulary_size, embedding_dimension).
Each row of this matrix corresponds to one token ID. If token ID 30642 represents "Token," then row 30642 of the embedding matrix is a vector of, say, 4096 floating-point numbers. This vector is the model's representation of that token. It's not hand-designed — it's learned during training, updated by backpropagation like any other parameter in the network.
The embedding lookup is computationally trivial — it's just indexing into a matrix, not a multiplication. But the table itself is a significant chunk of the model's parameters. For a vocabulary of 100,000 tokens with 4,096-dimensional embeddings, the embedding table alone contains 409,600,000 parameters. In a 7-billion parameter model, that's roughly 6% of all parameters dedicated entirely to the input representation.
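A minimal sketch of the lookup, at toy sizes so it runs instantly (real tables are at the 100,000 × 4,096 scale computed above):

```python
import random

random.seed(0)
vocab_size, d_model = 50, 8

# Toy embedding table: one vector per token ID. Rows are random here; in a
# real model they are learned parameters updated by backpropagation.
embedding = [[random.gauss(0.0, 0.02) for _ in range(d_model)]
             for _ in range(vocab_size)]

token_ids = [3, 17, 42]                      # hypothetical tokenizer output
vectors = [embedding[i] for i in token_ids]  # lookup is indexing, not a matmul
print(len(vectors), len(vectors[0]))         # 3 vectors of dimension 8

# At production scale the table is a large fraction of the parameters:
table_params = 100_000 * 4_096
print(table_params)                          # 409,600,000
```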
At the output end of the model, the same process runs in reverse. The transformer produces a vector for each position, and that vector is multiplied by the embedding table (or a separate output projection matrix, though many models tie the input and output embeddings to save parameters) to produce a probability distribution over the entire vocabulary. The token with the highest probability — or a token sampled from the distribution, depending on the decoding strategy — becomes the next predicted token.
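With tied embeddings, the output side scores the final hidden state against every embedding row and normalizes with a softmax. A toy three-token vocabulary, with invented numbers:

```python
import math

def softmax(logits):
    m = max(logits)                      # subtract max for numerical stability
    exps = [math.exp(v - m) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]

# Tied weights: the same table used for input lookup scores the output.
embedding = [[0.1, 0.2], [0.3, -0.1], [-0.2, 0.4]]   # toy 3 x 2 table
hidden = [0.5, 1.0]                                   # final-position vector

logits = [sum(h * e for h, e in zip(hidden, row)) for row in embedding]
probs = softmax(logits)
next_id = max(range(len(probs)), key=probs.__getitem__)  # greedy decoding
print(next_id)   # 2
```

Greedy decoding takes the argmax; sampling strategies instead draw from `probs`.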
It's worth being precise about the full pipeline from your prompt to the model's computation:

1. The tokenizer splits your text into tokens from its fixed vocabulary.
2. Each token is mapped to its integer ID.
3. Each ID indexes a row of the embedding table, yielding one vector per token.
4. That sequence of vectors is what flows through the transformer's layers.
5. The output vectors are projected back over the vocabulary to score the next token.
The model has never seen your text. It has seen a sequence of learned vectors, each one a compressed summary of what that token means in context, refined through billions of training examples. The word "bank" and the word "river" are, to the model, two different index lookups that produce two different 4096-dimensional vectors — and it's only through the attention mechanism (Chapter 9) that the model figures out whether "bank" here means a financial institution or the side of a waterway.
Many behaviors that seem like "the model is dumb" are actually tokenization artifacts. Once you understand what the model sees, these failures become predictable.
Ask a model to count the letters in "strawberry" and it often gets it wrong. Why? Because "strawberry" is likely tokenized as str + aw + berry or similar. The model doesn't see individual letters — it sees subword chunks. To count letters, it would need to decompose its tokens back into characters, which requires reasoning about the internal structure of units the model treats as atomic. It's like asking you to count the number of pen strokes in a word you read instantly — you've learned to see the word as a unit, not as a sequence of motor actions.
Numbers get tokenized inconsistently. "128" might be a single token, but "1285" might become 128 + 5. "42" is one token; "4200" might be 4 + 200. The model isn't working with place value — it's working with arbitrary chunks of digits that happen to have been frequent in the training data. Multi-digit arithmetic requires understanding that each digit represents a power of ten, but the tokenizer has destroyed that structure before the model ever sees the input.
An unusual surname like "Krzyzewski" will be tokenized into many small pieces, each carrying little semantic information on its own. The model has to reconstruct meaning from fragments that individually mean almost nothing. Common names like "Smith" are single tokens with rich learned representations. This asymmetry means models are systematically better at reasoning about common terms than rare ones — not because of any flaw in the architecture, but because of how the input is encoded.
Tokenizers trained primarily on English text develop vocabularies optimized for English. The word "hello" is one token. Its Korean equivalent, 안녕하세요, might be three or four tokens. This means non-English text consumes more of the context window, gets processed less efficiently, and provides less "meaning per token" for the model to work with. It's one of the reasons multilingual performance remains uneven despite training on multilingual data.
When a model is described as having a "128K context window," that means 128,000 tokens, not 128,000 words. The ratio of tokens to words varies by language and content type, but for English prose, a rough rule of thumb is 1 word equals roughly 1.3 tokens. Code tends to be more token-dense because of punctuation and syntax. Non-English languages with vocabularies underrepresented in the tokenizer can be 2-4x more token-dense.
This has practical consequences. If you're feeding a model a long document, the effective capacity in "words understood" depends on how efficiently your text tokenizes. A 128K context window holds roughly 96K words of English, but might hold only 40K words of Thai or Vietnamese if the tokenizer wasn't optimized for those languages.
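The back-of-envelope conversion is simple division; the tokens-per-word ratios below are rough assumptions, not measured values:

```python
def approx_words(context_tokens, tokens_per_word):
    # Estimate how many words a context window holds at a given token density.
    return int(context_tokens / tokens_per_word)

print(approx_words(128_000, 1.3))   # English prose at ~1.3 tokens/word
print(approx_words(128_000, 3.0))   # a script the tokenizer handles poorly
```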
| Model | Tokenizer | Vocabulary size | Context window |
|---|---|---|---|
| GPT-2 | BPE (byte-level) | 50,257 | 1,024 |
| GPT-4 | BPE (cl100k_base) | ~100,000 | 8K / 32K / 128K |
| BERT | WordPiece | 30,522 | 512 |
| Llama 2 | SentencePiece (BPE) | 32,000 | 4,096 |
| Llama 3 | tiktoken (BPE) | 128,256 | 8,192 |
| Claude 3 | BPE (byte-level) | ~100,000 | 200K |
Notice the trend: newer models tend toward larger vocabularies and larger context windows. Llama 3's jump from 32K to 128K vocabulary tokens was a deliberate choice — at the cost of a larger embedding table, it bought significantly better encoding efficiency, especially for non-English languages and code.[5]
Every tokenizer includes tokens that don't correspond to text but serve structural roles. Common ones include:
- [BOS] / <s> — beginning of sequence. Tells the model "this is the start."
- [EOS] / </s> — end of sequence. The model generates this when it's done.
- [PAD] — padding. Used to make all sequences in a batch the same length.
- [UNK] — unknown token. A fallback for truly unrepresentable input (rare in byte-level BPE, since bytes cover everything).
- [SEP] — separator. Used in BERT-style models to delimit sentence pairs.

In chat models, additional special tokens delimit user messages, assistant responses, system prompts, and tool calls. When you see a chat template like <|user|> or <|im_start|>, those are special tokens with their own IDs in the vocabulary — the model has learned to treat them as structural markers, not text to generate.
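As an illustration, here is a ChatML-style template sketch. The <|im_start|> / <|im_end|> markers are one real convention, but the exact markers and layout vary by model family:

```python
# Sketch of a ChatML-style chat template; markers and layout vary by model.
def render_chat(messages):
    parts = []
    for msg in messages:
        parts.append(f"<|im_start|>{msg['role']}\n{msg['content']}<|im_end|>\n")
    parts.append("<|im_start|>assistant\n")   # cue the model to generate a reply
    return "".join(parts)

prompt = render_chat([
    {"role": "system", "content": "You are concise."},
    {"role": "user", "content": "What is BPE?"},
])
print(prompt)
```

When this string is tokenized, each marker maps to a single dedicated token ID rather than being split like ordinary text.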
GPT-2 introduced a refinement called byte-level BPE. Instead of starting with Unicode characters (which number in the hundreds of thousands), it starts with 256 byte values — the raw bytes of UTF-8 encoding. This means the base vocabulary is always exactly 256 tokens, and any valid byte sequence can be represented. No [UNK] token needed.
The trade-off is that some characters — particularly non-ASCII characters like accented letters, CJK characters, or emoji — are represented by multiple bytes at the base level. BPE merges learn to group these bytes back together for common characters. But for rare scripts or unusual characters, the byte-level representation can be verbose. The advantage of universality is considered worth this cost.
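The variable width is easy to see with Python's own UTF-8 encoder: one byte for ASCII, up to four for emoji, before any merges recombine them:

```python
# Byte-level BPE starts from these raw UTF-8 bytes, so any input is representable.
for ch in ["A", "é", "語", "🙂"]:
    raw = ch.encode("utf-8")
    print(ch, len(raw), list(raw))
```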
Most modern LLMs (GPT-3, GPT-4, Llama 3, Claude) use byte-level BPE or closely related approaches. The universality — no possible input that can't be tokenized — makes it the practical default for general-purpose models.
One more point that matters for understanding the full system: the tokenizer is trained before the language model and is never updated during model training. It's a fixed preprocessing step. The BPE merges are learned on a text corpus, the vocabulary is locked, and then the language model is trained with that vocabulary.
This means the tokenizer reflects the statistics of whatever corpus was used to train it — which may not match the distribution of text the model encounters at inference time. A tokenizer trained mostly on English web text in 2020 may handle 2024 slang, new technical terms, or emerging languages inefficiently. The model can still process them — byte-level BPE guarantees that — but it might need four tokens where a better-optimized tokenizer would need one.
This is why some model releases include tokenizer improvements as a major feature. Llama 3's expansion from 32K to 128K tokens was specifically aimed at better multilingual and code performance — not by changing the model architecture, but by giving it a more expressive input representation.
[1] Gage, P. (1994). "A New Algorithm for Data Compression." C Users Journal, 12(2), 23-38. The original BPE algorithm was purely a compression technique; its adaptation to NLP vocabularies came two decades later.

[2] Sennrich, R., Haddow, B., & Birch, A. (2016). "Neural Machine Translation of Rare Words with Subword Units." Proceedings of the 54th Annual Meeting of the Association for Computational Linguistics (ACL), 1715-1725. This paper established BPE as the standard approach for neural NLP vocabularies.

[3] Kudo, T. & Richardson, J. (2018). "SentencePiece: A simple and language independent subword tokenizer and detokenizer for Neural Text Processing." Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing (EMNLP): System Demonstrations, 66-71.

[4] Schuster, M. & Nakajima, K. (2012). "Japanese and Korean voice search." 2012 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), 5149-5152. WordPiece was later adopted for BERT by Devlin et al. (2019), where it became widely known in the NLP community.

[5] Meta AI (2024). "The Llama 3 Herd of Models." The Llama 3 technical report discusses the tokenizer expansion and its impact on multilingual encoding efficiency. The 128K vocabulary was a 4x increase from Llama 2's 32K.