LLMs: Tokenization
If you’ve gone through the previous post, you should now have a basic intuition about how Large Language Models (LLMs) work. If not, I’d strongly recommend reading LLMs: Build Intuition before continuing.
At a high level, everything in an LLM eventually reduces to mathematical computations. But before any computation can happen, text needs to be converted into numbers.
This post focuses on exactly that: how text gets converted to a numerical representation.

Let’s take a simple example — a stanza from a well-known Beatles song:
Here comes the sun
Here comes the sun
And I say, "It's all right"
Sun, sun, sun, here it comes
Sun, sun, sun, here it comes

Our goal is to convert this text into a numerical representation that a model can understand.
A Naive Approach
A straightforward approach is to assign a unique number to each word:
here → 1
comes → 2
the → 3
sun → 4
...
So the text becomes something like:
1 2 3 4 1 2 3 4 ...

This looks simple, but it breaks quickly.
For example:
What about the word "come" instead of "comes"?
What about new words not seen during training?
Since the vocabulary is fixed, anything outside it becomes a problem.
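A minimal sketch makes the failure concrete (the vocabulary and function names here are purely illustrative):

```python
# Toy word-level tokenizer with a fixed vocabulary.
vocab = {"here": 1, "comes": 2, "the": 3, "sun": 4}

def encode(text):
    # Any word outside the vocabulary raises a KeyError.
    return [vocab[word] for word in text.lower().split()]

print(encode("here comes the sun"))  # [1, 2, 3, 4]
# encode("here come the sun")  -> KeyError: 'come'
```

One unseen inflection is enough to break the whole scheme.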
Moving to Subwords
To fix this, we move from word-level to subword-level tokenization.
Instead of storing entire words, we store pieces of words.
So:
come → [co][me]

Or, depending on the vocabulary:
come → [c][o][m][e]
come → [com][e]

This makes the system far more flexible and generalizable.
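To see how a fixed subword vocabulary can still cover unseen words, here is a toy greedy longest-match splitter (the subword set below is made up for illustration):

```python
# Illustrative subword vocabulary.
subwords = {"com", "e", "es", "s", "c", "o", "m", "un", "sun"}

def split_into_subwords(word):
    pieces, i = [], 0
    while i < len(word):
        # Try the longest matching piece starting at position i.
        for j in range(len(word), i, -1):
            if word[i:j] in subwords:
                pieces.append(word[i:j])
                i = j
                break
        else:
            return None  # no subword matches -> cannot encode
    return pieces

print(split_into_subwords("comes"))  # ['com', 'es']
print(split_into_subwords("come"))   # ['com', 'e']
```

Both "come" and "comes" are covered by the same small set of pieces, which word-level tokenization could not do.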
How Do We Decide Subwords?
Since there can be multiple ways to split a word into subwords, we need a principled way to choose one. This is where Byte Pair Encoding (BPE) comes in.
Intuition:
Start with the smallest units — characters.
Find the most frequent pair of adjacent tokens.
Merge them into a new token.
Repeat until you reach a desired vocabulary size.
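The steps above can be sketched in a few lines of Python. This is a toy implementation to show the idea, not how production tokenizers are built:

```python
from collections import Counter

def train_bpe(text, num_merges):
    # Start from individual characters.
    tokens = list(text)
    merges = []
    for _ in range(num_merges):
        # Count every pair of adjacent tokens.
        pairs = Counter(zip(tokens, tokens[1:]))
        if not pairs:
            break
        (a, b), _count = pairs.most_common(1)[0]
        merges.append((a, b))
        # Replace every occurrence of the most frequent pair with one token.
        merged, i = [], 0
        while i < len(tokens):
            if i < len(tokens) - 1 and (tokens[i], tokens[i + 1]) == (a, b):
                merged.append(a + b)
                i += 2
            else:
                merged.append(tokens[i])
                i += 1
        tokens = merged
    return tokens, merges

tokens, merges = train_bpe("sun sun sun here it comes", 2)
print(merges)  # [('s', 'u'), ('su', 'n')]
print(tokens)
```

On this tiny corpus, the frequent "sun" emerges as a single token after just two merges.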
BPE in Action
Iteration 1 (Character Level)
[H] [e] [r] [e] [ ] [c] [o] [m] [e] [s] ...

At this stage, the vocabulary consists of individual characters.
Iteration 2 (First Merge)
Suppose "un" appears frequently (because of "sun"):
[H] [e] [r] [e] [ ] [c] [o] [m] [e] [s] ... [un]

Now "u" and "n" are merged into a single token [un].
Iteration 3+
This process continues:
[su] may form
[sun] may eventually form
Common patterns become single tokens
Until we reach a predefined vocabulary size (say 100 tokens).
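Once the merges are learned, encoding new text simply replays them in training order. A sketch, using an illustrative merge list of the kind BPE training might produce on our lyric:

```python
def bpe_encode(text, merges):
    # Apply the learned merges, in training order, to fresh text.
    tokens = list(text)
    for a, b in merges:
        merged, i = [], 0
        while i < len(tokens):
            if i < len(tokens) - 1 and (tokens[i], tokens[i + 1]) == (a, b):
                merged.append(a + b)
                i += 2
            else:
                merged.append(tokens[i])
                i += 1
        tokens = merged
    return tokens

# Illustrative merges, as training might have learned them:
merges = [("s", "u"), ("su", "n"), ("c", "o"), ("co", "m")]
print(bpe_encode("sun comes", merges))  # ['sun', ' ', 'com', 'e', 's']
```

Note that "comes" is split into pieces even though the whole word was never a token: that is exactly the flexibility subwords buy us.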
So now we have a tokenizer, trained on a defined corpus.
Let's put it into action. But before doing so, we need to handle the scenario where our tokenizer encounters something out of its vocabulary (OOV), which is quite possible since it was trained on finite data.
Older tokenizers (like early transformer models) used a special token:
[UNK]

This represents anything not present in the vocabulary.
Example:
from transformers import BertTokenizer
tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")
text = "Hello 𐦖 world"
encoded = tokenizer.encode(text)
print(tokenizer.convert_ids_to_tokens(encoded))
# ['[CLS]', 'hello', '[UNK]', 'world', '[SEP]']

Here, the unknown symbol 𐦖 is replaced with [UNK].
Evolution
Modern tokenizers go one step further. Instead of characters as the lowest granularity, they operate at the byte level.
Why this matters
Every possible input can be represented
No unknown tokens
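This works because in UTF-8, every string reduces to a sequence of bytes with values 0 through 255, so a base vocabulary of just 256 byte tokens can represent any input:

```python
# Any text, including rare symbols, reduces to UTF-8 bytes in the range 0-255.
text = "Hello 𐦖"
byte_ids = list(text.encode("utf-8"))
print(byte_ids)
# "Hello " contributes one byte per character; the rare symbol expands to four bytes.
```

Nothing can fall outside this base vocabulary, so [UNK] is never needed.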
Example:
import tiktoken
tokenizer = tiktoken.get_encoding("cl100k_base")
encoded = tokenizer.encode("Hello 𐦖 world")
print(encoded)

Even unusual symbols are broken into valid byte-level tokens.
Now that our input is ready, it's time to dig into how the model finds patterns in these numeric representations. Stay tuned for the next post.
