All you need to know about Tokenization in LLMs


Tayyib Ul Hassan Gondal


In this blog, I’ll explain everything about tokenization, which is an important step before pre-training a large language model (LLM). By the end, you’ll have a thorough understanding of the concept. I’ll also share the code for training your own tokenizer in the coding section.

The rest of the article will cover the following points:

  1. What is tokenization?
  2. Why do we need tokenization?
  3. Naive tokenization schemes (Character level, Word level tokenizations)
  4. Tokenization using the Unicode ord function
  5. Tokenization using the UTF-8 encode function
  6. The Byte Pair Encoding (BPE) algorithm
  7. Coding a custom tokenizer for your own language model
  8. Conclusion

What is Tokenization?

Tokenization is the process of converting text into a sequence of tokens, which can be words, subwords, or characters. These tokens are the smallest units of meaning in a text that can be processed by a language model. For example, the sentence “Hello, world!” can be tokenized into [“Hello”, “,”, “world”, “!”]. Tokenization simplifies the text and allows the model to work with manageable chunks of data.

In the LLM pre-training and inference workflows, after the tokenization, each token is assigned a unique integer. For each integer, there is a corresponding row in a lookup table, which is the vector representation of that token. This vector is then used as input for a particular token in the language model.
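As a rough sketch of that lookup step (using PyTorch, with purely illustrative token ids and table sizes, not those of any real tokenizer):

import torch
import torch.nn as nn

vocab_size, embed_dim = 50257, 768               # illustrative sizes
embedding_table = nn.Embedding(vocab_size, embed_dim)

token_ids = torch.tensor([31373, 11, 995, 0])    # hypothetical ids for ["Hello", ",", "world", "!"]
vectors = embedding_table(token_ids)             # one row of the table per token
print(vectors.shape)                             # torch.Size([4, 768])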

Why do we need Tokenization?

Suppose you want to pre-train your own language model, and you have 10 GB of language data available. However, the problem is that your data is in textual form. Language models, or any machine learning models, process numbers. Therefore, you need a method to convert your text into numbers. This is where tokenization comes into play.

To use the language model on our dataset, we first tokenize the text and assign each token a unique integer. These integers are then vectorized and fed into the language model. The model processes this vectorized input and outputs integers again, which are converted back into text. Without tokenization, language models would not be able to operate on raw textual data at all.

Naive Tokenization Scheme for Pre-Training a Language Model

Suppose you have a corpus of text and want to tokenize it using a naive approach. You can follow these three steps:

  1. Find All Unique Characters: Identify all unique characters in your dataset.
  2. Create Lookup Tables: Create character-to-integer and integer-to-character lookup tables for converting strings into tokens.
  3. Use an Embedding Table: Convert each integer token into a vectorized representation using an embedding table. This embedding table will have rows equal in number to the vocabulary size.

The language model is then trained on these vectorized representations of text.
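Here is a minimal sketch of steps 1 and 2 in Python (the corpus and variable names are just illustrative; step 3 is the embedding lookup discussed earlier):

text = "hello world"                      # stand-in for your 10 GB corpus

# Step 1: find all unique characters
chars = sorted(set(text))
vocab_size = len(chars)                   # 8 for this tiny corpus

# Step 2: character-to-integer and integer-to-character lookup tables
stoi = {ch: i for i, ch in enumerate(chars)}
itos = {i: ch for ch, i in stoi.items()}

encode = lambda s: [stoi[c] for c in s]
decode = lambda ids: "".join(itos[i] for i in ids)

print(encode("hello"))                    # [3, 2, 4, 4, 5]
print(decode(encode("hello")))            # hello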

Problem with This Approach

Using character-level tokenization in this naive approach presents several issues:

  1. Language Diversity: This method does not account for languages other than the language for which you created vocabulary lookup tables.
  2. Sequence Length: Due to the small vocabulary size, the sequence length of tokenized sentences will be much larger. As a result, transformers will process much less data at a time due to their fixed window size.

These limitations make character-level tokenization less efficient for pre-training a language model, especially when dealing with diverse languages and long sequences.

What should we do to solve this problem? One possible solution is to use word-level tokenization instead of character-level tokenization. But this brings its own problems:

  1. Language Diversity: This method still does not account for languages other than English.
  2. Vocabulary Size: Word-level tokenization results in a much larger vocabulary size. Consequently:
  • The embedding table and the language model head will have more rows, increasing the computational complexity of the model.
  • With their fixed context window, transformers will now see each position condensed to an entire word, so sub-word details (spelling, morphology, rare or unseen words) are no longer visible to the model. This can lead to suboptimal performance.

A middle ground is therefore some kind of sub-word tokenization scheme, and BPE does exactly that. We’ll come to it in a moment :). But first, let’s look at some other possible ways to carry out tokenization.

Unicode ord function for tokenization? No.

If you’re not familiar with the Unicode ord function, it converts any character of any language directly into its Unicode code point, which is an integer. So why can’t we use it to tokenize our dataset? Well, it turns out this approach has a couple of problems:

  1. The vocabulary size would be very large, since Unicode defines well over 100,000 characters. This means more computation at the embedding table and the language model head. Normally a larger vocabulary means shorter tokenized sequences, but here the tokenized sequences will still be long, because we are ultimately doing character-level encoding. As a result, the transformer sees much less information at a time, resulting in suboptimal performance.
  2. Secondly, this approach would not be stable, because the Unicode standard keeps changing and new characters are added regularly.
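For reference, this is what ord (and its inverse, chr) does in Python; the code points alone already run into six figures, which is where the huge vocabulary comes from:

print(ord("h"))      # 104
print(ord("安"))     # 23433
print(ord("👋"))     # 128075
print(chr(128075))   # 👋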

UTF-8 Encoding for Tokenization? Yes, but with ‘BPE’ Only.

UTF-8 encodes characters into binary form, using 1 to 4 bytes per character. It is preferred over UTF-16 and UTF-32 because it offers better backward compatibility with ASCII, a simpler character encoding scheme.
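A quick way to see the variable-length property in Python:

for ch in ["h", "é", "安", "👋"]:
    print(ch, len(ch.encode("utf-8")), "byte(s)")   # 1, 2, 3 and 4 bytes respectively

print(list("hi👋".encode("utf-8")))                 # [104, 105, 240, 159, 145, 139]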

However, using UTF-8 encoding for tokenization presents significant challenges:

  1. Limited Vocabulary Size: The vocabulary size would be restricted to 256, corresponding to the number of possible byte values.
  2. Long Tokenized Sequences: This limited vocabulary size leads to much longer tokenized sequences. Given the fixed width of input that the attention block in transformers can process, the model ends up looking at less information at a time, which results in suboptimal performance.

If we could somehow increase the vocabulary size while reducing the sequence lengths at the same time, UTF-8 encoding would become practical for tokenization. The byte pair encoding algorithm lets us compress the byte sequence by a tunable amount. Let’s discuss BPE now so you can understand how it compresses long byte sequences by growing the vocabulary.

Byte Pair Encoding Algorithm (BPE)

The original algorithm operates by iteratively replacing the most common contiguous sequences of characters in a target text with unused ‘placeholder’ bytes. The iteration ends when no sequences can be found, leaving the target text effectively compressed. Decompression can be performed by reversing this process, querying known placeholder terms against their corresponding denoted sequence, using a lookup table. In the original paper, this lookup table is encoded and stored alongside the compressed text.

Example

Here is how BPE works:

Suppose the data to be encoded is

aaabdaaabac

The byte pair “aa” occurs most often, so it will be replaced by a byte that is not used in the data, such as “Z”. Now there is the following data and replacement table:

ZabdZabac

Z=aa

Then the process is repeated with byte pair “ab”, replacing it with “Y”:

ZYdZYac

Y=ab

Z=aa

The only literal byte pair left occurs only once, and the encoding might stop here. Alternatively, the process could continue with recursive byte pair encoding, replacing “ZY” with “X”:

XdXac

X=ZY

Y=ab

Z=aa

This data cannot be compressed further by byte pair encoding because there are no pairs of bytes that occur more than once.

To decompress the data, simply perform the replacements in the reverse order.
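The same toy example can be reproduced with plain Python string replacement (the placeholder symbols are chosen by hand here, exactly as in the walkthrough above):

data = "aaabdaaabac"
table = {}
for placeholder, pair in [("Z", "aa"), ("Y", "ab"), ("X", "ZY")]:
    data = data.replace(pair, placeholder)
    table[placeholder] = pair
print(data, table)   # XdXac {'Z': 'aa', 'Y': 'ab', 'X': 'ZY'}

# Decompression: undo the replacements in reverse order
for placeholder, pair in reversed(list(table.items())):
    data = data.replace(placeholder, pair)
print(data)          # aaabdaaabac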

Coding a custom tokenizer for training your very own LLM!

Above was a brief description of the BPE algorithm. But how do we code it in Python? There are five steps to it. You can play around with the complete notebook here.

First, we encode our text into UTF-8 encoding, which converts each character into a sequence of integers ranging from 0 to 255. UTF-8 uses variable-length encoding, where characters can be represented by 1 to 4 bytes depending on the character.

Next, we analyze the frequency of occurrence for each pair of bytes in the encoded sequence. The goal is to identify the most frequent pair, as this pair will be replaced by a new token. By replacing these frequent pairs with tokens, we effectively reduce the sequence length of the text corpus while expanding the vocabulary size. This process is crucial for effective tokenization using UTF-8 encoding in natural language processing tasks.

This is exactly what we wanted for doing tokenization with UTF-8 encoding on text.
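One note before the code: the training loop below calls a merge helper that replaces every occurrence of a given pair with a new token id. It isn’t shown in the original snippet, so here is a minimal sketch of what it could look like:

def merge(tokens, pair, idx):
    # replace every occurrence of `pair` in `tokens` with the new token id `idx`
    new_tokens = []
    i = 0
    while i < len(tokens):
        if i < len(tokens) - 1 and (tokens[i], tokens[i + 1]) == pair:
            new_tokens.append(idx)
            i += 2
        else:
            new_tokens.append(tokens[i])
            i += 1
    return new_tokens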

# Step 1: Encode the raw text
raw_bytes = text.encode('utf-8')

# Step 2: Convert the bytes into a list of integers (each in the range 0..255)
tokens = list(map(int, raw_bytes))

# Step 3: Count how often each adjacent pair of tokens occurs,
# then pick the most frequently occurring pair.
def get_stats(tokens):
    stats = {}
    for tok1, tok2 in zip(tokens, tokens[1:]):
        if (tok1, tok2) in stats:
            stats[(tok1, tok2)] += 1
        else:
            stats[(tok1, tok2)] = 1
    return stats

stats = get_stats(tokens)
top_pair = max(stats, key=stats.get)

# Step 4: Replace the most frequent pair with a new token
# Step 5: Repeat steps 3 and 4 as long as you want
vocab_size_final = 276
vocab_size_original = 256
num_merges = vocab_size_final - vocab_size_original

ids = list(tokens)   # copy so we don't destroy the original list
merges = {}          # (pair) -> new token id; needed by encode/decode below
for i in range(num_merges):
    stats = get_stats(ids)
    top_pair = max(stats, key=stats.get)
    idx = vocab_size_original + i
    print(f"merging {top_pair} into a new token {idx}")
    ids = merge(ids, top_pair, idx)
    merges[top_pair] = idx

When we run the above script, i.e., train the tokenizer, we get a list of merges. Before training, all integers were in the range 0 to 255. As more pairs are merged during training, the vocabulary size keeps growing. Initially, we create a vocab table that records which byte corresponds to each of the integers from 0 to 255. After training, the merges table records which pair was replaced by which new token.

This means we have successfully trained our tokenizer. Using the merges (and the vocabulary), we can convert any sequence of UTF-8 bytes into a tokenized sequence, and back.

But how do we do that concretely? For that, we need an encode function and a decode function. The decode function takes a list of integer ids, where each integer is an index into vocab. It reconstructs the original text by concatenating the bytes from vocab for each id into a single byte string (tokens). Finally, this byte string is decoded into a Python string using UTF-8, with errors="replace" substituting a replacement character for any invalid byte sequences, and returned as the decoded text.

For example, decoding [128] with this function returns the Unicode replacement character “�”: the single byte at index 128 in vocab is not a valid UTF-8 sequence on its own, which is exactly why errors="replace" is needed.

vocab = {idx: bytes([idx]) for idx in range(256)}
for (p0, p1), idx in merges.items():
    vocab[idx] = vocab[p0] + vocab[p1]  # concatenation of bytes objects

def decode(ids):
    # given ids (a list of integers), return a Python string
    tokens = b"".join(vocab[idx] for idx in ids)
    text = tokens.decode("utf-8", errors="replace")
    return text

print(decode([128]))

The encode function does the opposite of decode: it takes a string of text as input and converts it into a list of integers (the tokens) representing the encoded sequence. Here’s how it works, step by step:

  1. Text Encoding: Initially, the function converts the input text into a list of integers using UTF-8 encoding. Each character in the text is encoded into one or more bytes, and these byte values are stored in tokens.
  2. Merging Process: While there are at least two tokens in the list (while len(tokens) >= 2), the function calculates statistics (stats) about potential pairs of tokens that can be merged together to form new tokens. This is done using a helper function get_stats.
  3. Choosing the Best Merge: Among all candidate pairs (stats), the function selects the pair that was learned earliest during training, i.e., the pair with the lowest index in the merges dictionary. Merges must be applied in the same order they were learned, because later merges can build on tokens produced by earlier ones.
  4. Merging and Updating Tokens: If the selected pair (pair) is found in merges, indicating it can be merged, the function merges the pair into a single token identified by idx. This process updates the tokens list by replacing the pair with the new token.
  5. Termination: The process continues until no more eligible pairs can be found (if pair not in merges), indicating that all possible merges have been completed.
  6. Return Tokens: Finally, the function returns the list of tokens representing the encoded sequence of the input text.

def encode(text):
    # given a string, return a list of integers (the tokens)
    tokens = list(text.encode("utf-8"))

    while len(tokens) >= 2:
        stats = get_stats(tokens)
        # pick the candidate pair that was merged earliest during training
        pair = min(stats, key=lambda p: merges.get(p, float("inf")))
        if pair not in merges:
            break  # nothing else can be merged

        idx = merges[pair]
        tokens = merge(tokens, pair, idx)

    return tokens
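As a quick sanity check (assuming the merges and vocab built above), encoding any string and decoding the result should give back the original text:

sample = "hello world! 👋"
ids = encode(sample)
print(ids)
print(decode(ids) == sample)   # True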

This is so exciting. Again, I have shared the code here. Just by changing the text variable in the code, you can train your very own tokenizer for pre-training any LLM!

Note: Tokenizers and large language models may have different training sets. If the tokenizer’s training set contains multiple languages, then the conversion from Unicode byte sequences to tokenized sequences (for the languages the tokenizer has already seen during its training) will yield comparatively shorter tokenized sequences. As a result, the transformer will be able to look over a larger context and yield better performance. Therefore, it is better to train your tokenizer on as many languages as possible.

Conclusion

In this article, I covered the role of tokenizers in large language models. We explored how to train a tokenizer on a text dataset and use it to encode and decode raw strings into tokenized sequences and vice versa. By understanding and optimizing tokenization, we can improve the efficiency and performance of language models.

Thank you for reading the article!