March 17, 2026
It's likely that every LLM you've ever interacted with has used tokenization, because it is extremely useful for teaching AIs about language. And it turns out that it is no less useful for teaching people about language! But more on that later.
To show you what I mean by 'tokenization', let's take this sentence from the movie Conclave:
I want you to understand that, first of all, you are not in any kind of trouble.
If you wanted to break it into chunks, how would you do it? The most obvious way is to make each word a chunk. So you get this:
Words
I want you to understand that, first of all, you are not in any kind of trouble.
(Where I've made each chunk a different color.)
And that would be a perfectly valid tokenization, if a bit basic. But, despite how natural it is, I'd be willing to bet that your brain doesn't work this way at all.
For example, you've probably heard someone say "I want you to" hundreds of times. So your brain doesn't need to process every individual word. Instead, you already know the meaning of "I want you to" as a whole, and process it as one unit. Paying attention to each word individually would be almost as bad as reading each letter individually. It's just not possible to efficiently process a sentence one word at a time.
Most people don't have to worry about this, because everyone's brain figures out the chunks for them. But it raises an interesting question: how would we write a program to figure out these chunks? It's easier than you might think. And surprisingly relevant to machine learning. The very first step to training an LLM is to decide how you're going to break the input into chunks! In the context of LLMs, these chunks are called "tokens", and the idea is the same: instead of feeding the LLM one letter at a time, we typically want to feed it whole chunks of letters at a time, or maybe even whole chunks of words at a time. We do this just because it makes LLMs learn better and faster.
There is a catch though. We can't just split the sentence up any way we want. We have to decide on a predefined set of chunks. (AKA a predefined set of tokens.) This is called the "vocabulary". GPT-2's vocabulary has 50,257 chunks, Meta's Llama 3 has 128,256, and Google's Gemini has 256,000. Large or small, we have to decide what the chunks will be in advance, before even starting to train the LLM.
Compression-maximization
There are many approaches to this problem. I will focus on "top-down" approaches. These are the approaches that start with a big set of possible chunks, much larger than your desired vocabulary, and then "filter out" the useless ones.
Suppose you decide you want to have 50,000 chunks in your vocabulary. Find a big dataset of sentences, and then generate millions of chunks in the dumbest way possible: just take every group of 2, 3, or 4 words and make it a chunk. If you do this, you will end up with millions of chunks, and most of them will be useless ones that will almost never come up in practice. But, from this base, we can try to "trim down" the list to 50,000 really useful ones by getting rid of those useless ones. But that only pushes the problem down! You still have to figure out how to decide which chunks are useful, and which are not.
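If it helps to picture that first step, here's a minimal sketch (the function name is mine, and I'm assuming the corpus is already split into whitespace-separated words):

```python
from collections import Counter

def candidate_chunks(sentences, min_len=2, max_len=4):
    """Count every 2-, 3-, and 4-word span in the corpus as a candidate chunk."""
    counts = Counter()
    for sentence in sentences:
        words = sentence.split()
        for n in range(min_len, max_len + 1):
            for i in range(len(words) - n + 1):
                counts[" ".join(words[i:i + n])] += 1
    return counts

corpus = ["I want you to understand that , first of all , you are not in any kind of trouble ."]
print(len(candidate_chunks(corpus)), "candidate chunks from one sentence")
```

Even one sentence produces dozens of candidates; a real corpus produces millions, most of them junk.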
Instead of focusing on the how, it might be easier to focus on the objective. What do we want from our chunks, that we can easily express to a computer? Here is one attractive idea: maybe we want to be able to encode a sentence in the fewest chunks possible.
That sounds pretty good! It sounds especially good when you consider that LLMs are limited in how many input tokens they can take. If you can compress the same text into fewer tokens, that means you can fit longer text into the same number of input tokens!
And it turns out it's not too difficult to figure out which chunks to keep to get the maximum compression. You just start with your big set of chunks, segment all your text into chunks, then throw away the chunks that appear the least. I tried that, and got... this:
Words
I want you to understand that, first of all, you are not in any kind of trouble.
Min-chunks
I want you to understand that, first of all, you are not in any kind of trouble.
Ugh. Look at that 'all, you' token. It's great for encoding sentences in the fewest chunks possible, because "all, you" probably appears in a lot of sentences. But the problem with it is that it doesn't mean anything. LLMs work best when each chunk has a coherent meaning, and "all, you" just doesn't. In fact, this has been studied, and the vocabulary that is best for compression is not the best for LLMs. And I'd bet the same applies to people.
The Unigram algorithm
But there is a fix. A tiny tweak to the logic that lets us keep almost all the benefits of compression, but gives much better tokens. The solution is to use a measure defined by Claude Shannon nearly eighty years ago: surprisal. Shannon defined surprisal as the negative log of a token's probability, which is essentially a long-winded way of saying that when something is rare you're more surprised to see it.
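In code, the definition is a one-liner (I'll use the natural log; the base only changes the units):

```python
import math

def surprisal(count, total):
    # -log(probability): rare chunks score high, common chunks score near zero
    return -math.log(count / total)
```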
Instead of encoding the sentence in the fewest chunks possible, we encode it using the chunks that generate the "least surprisal". In other words, we're minimizing the total rarity of each chunk in the sentence. It's a subtle change, but the results speak for themselves.
Before getting into the details, let's just give the new algorithm a try and see how it does! Here it is in action:
Words
I want you to understand that, first of all, you are not in any kind of trouble.
Min-chunks
I want you to understand that, first of all, you are not in any kind of trouble.
Min-surprisal
I want you to understand that, first of all, you are not in any kind of trouble.
That's a lot better! The 'all, you' chunk is gone. It got replaced by "first of all," and "you are not". Both of those actually seem like useful chunks! They're common, and they're also coherent concepts. That has been shown to be desirable when training LLMs, but it also seems straightforwardly applicable to people learning a language! Everyone learning English needs to learn what "first of all" means, but never needs to explicitly learn what "all, you" means.
It's almost absurd how well it works. I ran it on my full English corpus, and here are the results:
I honestly find it pretty remarkable. There are like 10,000 phrases in there, but you can scroll to any point and always find something you've probably said or heard many times before.
Now, I promised I would explain how it works. It's very simple, and the starting point is the algorithm for maximum compression I mentioned earlier.
How that algorithm worked was: You just start with your big set of chunks, segment all your text into those chunks, then throw away the chunks that appear the least frequently. You actually do this iteratively, throwing away just a few chunks at a time, repeating the process until you reach your desired number of chunks. If you do this, the resulting vocabulary will be the one that encodes sentences into fewer chunks than any other.
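In case it helps to see it, here's roughly what that loop looks like. This is a simplified sketch rather than anything authoritative: the function names are mine, I'm using greedy longest-match as a cheap stand-in for true fewest-chunks segmentation, and single words always remain available as a fallback so every sentence can be segmented.

```python
from collections import Counter

def segment_fewest(sentence, vocab):
    """Greedy longest-match segmentation: at each position, take the longest chunk
    that's in the vocabulary, falling back to the single word if nothing matches."""
    words, chunks, i = sentence.split(), [], 0
    while i < len(words):
        for n in range(len(words) - i, 0, -1):
            chunk = " ".join(words[i:i + n])
            if n == 1 or chunk in vocab:
                chunks.append(chunk)
                i += n
                break
    return chunks

def prune_for_compression(sentences, vocab, target_size, step=1000):
    """Repeatedly segment the corpus and throw away the chunks that get used least."""
    vocab = set(vocab)
    while len(vocab) > target_size:
        usage = Counter()
        for s in sentences:
            usage.update(c for c in segment_fewest(s, vocab) if c in vocab)
        to_drop = min(step, len(vocab) - target_size)
        vocab -= set(sorted(vocab, key=lambda c: usage[c])[:to_drop])
    return vocab
```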
But, let's think about why we're chunking in the first place. The goal is to save mental effort, right? You remember what "I want you to" means so you can avoid having to think about the four individual words that make it up. So, our first algorithm, which encodes sentences into the minimum possible number of chunks, does seem like it would make sense. If your brain can process a sentence by encoding it into 3 chunks, shouldn't that be less mental effort than encoding it into 5 chunks? It sounds like it makes sense, but as we saw earlier, the results were bad. That leaves us with a paradox on our hands: if it makes sense, why did it give us chunks that feel useless and confusing?
I'm not sure that anyone knows a truly non-tautological answer to this question. But I have a theory. Let's formalize the argument above slightly.
- Premise 1: To understand a sentence, you must recall the meaning of each chunk.
- Premise 2: Recalling the meaning of a chunk takes some amount of mental effort.
- Conclusion: Therefore, the amount of effort it takes to understand a sentence is proportional to the number of chunks in it.
If you want, you can take a moment to try to poke holes in that argument. But here's my answer. Premise 2 implies that the effort required to recall a chunk is always the same. But I think you'll find that common chunks, like "ice cream", are much easier for you to remember the meaning of than rare ones like "sovereign immunity". And when you think about it, it makes sense. If the brain has some analogous concept to a cache, it would make perfect sense to dedicate it to high-frequency chunks like "ice cream" and relegate "sovereign immunity" to the mental equivalent of a spinning-rust hard drive.
So, if the effort required to recall a chunk is not always the same, then it's not necessarily the case that fewer chunks is always better. Take the bad "all, you" chunk example from earlier. You probably have absolutely no problem recalling the meaning of "all" or "you" on their own. They are very common words that you can process with almost zero effort. So, there's potentially very little benefit to chunking them. In fact, if "all, you" is rarer as a chunk than "all" and "you" are on their own, and the effort of recalling a chunk is proportional to its rarity, the chunking has actually increased the effort necessary to understand the sentence!
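To make that concrete, here's the arithmetic with some invented counts (the numbers are made up purely for illustration):

```python
import math

total = 1_000_000                                          # pretend corpus size
counts = {"all": 20_000, "you": 50_000, "all, you": 300}   # all numbers invented

def surprisal(token):
    return -math.log(counts[token] / total)

print(surprisal("all") + surprisal("you"))   # two easy recalls: ~3.9 + ~3.0 = ~6.9
print(surprisal("all, you"))                 # one hard recall:  ~8.1 -- more total effort
```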
So, let's adjust the argument.
- Premise 1: To understand a sentence, you must recall the meaning of each chunk.
- Premise 2: Recalling the meaning of a chunk takes mental effort proportional to the rarity of the chunk.
- Conclusion: Therefore, the amount of effort it takes to understand a sentence is proportional to the combined rarity of the chunks in it.
But it wouldn't be much of a theory without a test. So I took all the 6-chunk sentences in my dataset, and grabbed the ones with the lowest and highest combined rarity of their chunks. By coincidence, in addition to being 6 chunks under my model, both were also 6 words, making them a perfect comparison:
All I know is what I told you.
Positive atomic nuclei attract negative electrons.
I think it's pretty obvious which one is harder to understand! Each word in the latter required more mental effort to process than each word in the former.
At first I thought this example was cheating, but now I don't think so. In some ways, "All I know is what I told you." is actually a more complicated and more abstract sentence than "Positive atomic nuclei attract negative electrons." It doesn't feel that way, because we're so efficient at processing the first sentence, but the second sentence can be simplified to "A attracts B" which is a simple concept anyone can easily visualize. The first one cannot really be simplified or visualized in any way. So I think it serves as good evidence for the theory: the first sentence is actually more complicated and abstract than the second, but we find it incredibly easy to process because we have so much practice with all of the individual words in it. Meanwhile, the second sentence requires most of us to reach deep into our minds to recall the meaning of "positive", "atomic", "nuclei", and so on, which makes it much more effort to understand.
So, that finally brings us to the algorithm. It's called the "Unigram" algorithm. Instead of trying to find a vocabulary that minimizes the number of chunks in each sentence, we want to find a vocabulary that minimizes the combined rarity of the chunks in each sentence.
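Here's a toy, word-level version of the key piece: a dynamic program that picks the segmentation with the lowest combined surprisal. To be clear about what's mine and what isn't: the real Unigram tokenizer (the one implemented in SentencePiece) works on subword pieces and fits the chunk probabilities with EM, while this sketch just uses raw counts, invents its own names, and keeps single words in the vocabulary so every sentence has at least one valid segmentation.

```python
import math

def best_segmentation(words, log_probs):
    """Find the segmentation with the lowest total surprisal (-log probability),
    via dynamic programming over every chunk in the vocabulary."""
    n = len(words)
    best = [(0.0, 0)] + [(math.inf, 0)] * n            # best[i] = (cost, start of last chunk)
    for i in range(1, n + 1):
        for j in range(i):
            chunk = " ".join(words[j:i])
            if chunk in log_probs:
                cost = best[j][0] - log_probs[chunk]   # surprisal = -log p
                if cost < best[i][0]:
                    best[i] = (cost, j)
    chunks, i = [], n                                  # walk back to recover the chunks
    while i > 0:
        j = best[i][1]
        chunks.append(" ".join(words[j:i]))
        i = j
    return list(reversed(chunks))

# Made-up counts, purely to illustrate the trade-off.
counts = {"first of all": 40, "you are not": 60, "all you": 80,
          "first": 500, "of": 3000, "all": 800, "you": 2500, "are": 1500, "not": 1200}
total = sum(counts.values())
log_probs = {tok: math.log(c / total) for tok, c in counts.items()}

print(best_segmentation("first of all you are not".split(), log_probs))
# -> ['first of all', 'you are not']
```

From there, the pruning loop can work much like the earlier one, except that chunks get scored by how much the corpus's total surprisal rises when you remove them, rather than by raw usage counts.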
Under this model, I think an example of the perfect chunk might be "plausible deniability". It's probably much more common than all other usages of either of the words "plausible" or "deniability" on their own. So chunking them both into "plausible deniability" turns two slow lookups into one slow lookup - and coders used to optimizing their code will know the benefits of eliminating a cache miss in a hot loop! But turning "all, you" into a single chunk probably makes understanding harder, because "all" and "you" were already very common and "all, you" is probably rarer. So that chunk is replacing two fast cache hits with a single cache miss, which is a terrible deal.
Unexpected applications
Now here's what I find so interesting. This "Unigram" algorithm was developed specifically to help represent written text to computers. Claude Shannon originally developed the theory behind it for compression; only later was it applied to machine learning. It turns out to be a very good algorithm for making text more understandable to computers. But if you can use it to teach computers, why not use it to teach people?
I started researching this because I wanted Yap to have a more nuanced understanding of sentences. One of my goals is for it to be able to accurately model my users' mistakes. When a user learning French translates a French sentence into English, and gets it wrong, I want to be able to figure out what exact parts of the French sentence they struggled with. That way I can help them study that exact part of the language.
Previously, this logic mostly operated on the level of single words, but I rewrote it to use unigram-discovered phrases as its basic unit of learning instead. The improvement in the learning experience was immediate. I now believe that this is the best way to teach a language: one chunk at a time, rather than one word. I've been using the new Yap for a while, but now it's live for everyone. Also, the multilingual dictionaries include unigram chunks as well.
Where I think it gets really interesting is to compare the frequencies of words and the frequencies of phrases. Take a look at the French wordlist for example. At the beginning, it's almost half phrases, if not more! I expect some people will think that the phrase "vous êtes" (meaning "you are") is redundant, because you could just learn "vous" and "êtes" individually. But in my opinion, this does not lead to fluency. You want to be lightning quick at understanding "vous êtes" simply because you'll hear it so often. You won't have time to try processing it as two separate words. Of course, even if you learned them separately, as you consume input in your target language your brain would inevitably figure out the chunking itself. But we can and should short-circuit the process and teach according to the chunks your brain will eventually learn anyway.
But what's truly amazing to me is not this improvement to Yap on its own, but that I think it's the first time that an idea from machine learning has been directly applied to optimize human learning as well! To be clear, I'm talking about a specific algorithm developed to facilitate training models that also happens to be directly applicable to training people. Has this ever happened before? Maybe the real machine learning... was what we learned along the way.
P.S. A fun note about Claude Shannon.
I can't think of a scientist more limited by their time. Today, people derisively refer to LLMs as "stochastic parrots" due to the randomness they use. But in 1948, Claude Shannon was deliberately trying to invent stochastic parrots. He obviously didn't have a computer, so he was stuck executing his algorithms by hand and got his random numbers from a book. Despite these limitations, he came up with six "approximations" of English, each more realistic than the last.
0th: Symbols independent and equiprobable.
XFOML RXKHRJFFJUJ ZLPWCFWKCYJ FFJEYVKCQSGHYD QPAAMKBZAACIBZLHJQD.
1st: Symbols independent but with frequencies of English text.
OCRO HLI RGWR NMIELWIS EU LL NBNESEBYA TH EEI ALHENHTTPA OOBTTVA NAH BRL.
2nd: Digram structure as in English.
ON IE ANTSOUTINYS ARE T INCTORE ST BE S DEAMY ACHIN D ILONASIVE TUCOOWE AT TEASONARE FUSO TIZIN ANDY TOBE SEACE CTISBE.
3rd: Trigram structure as in English.
IN NO IST LAT WHEY CRATICT FROURE BIRS GROCID PONDENOME OF DEMONSTURES OF THE REPTAGIN IS REGOACTIONA OF CRE.
Words, 1st: Words chosen independently but with their appropriate frequencies.
REPRESENTING AND SPEEDILY IS AN GOOD APT OR COME CAN DIFFERENT NATURAL HERE HE THE A IN CAME THE TO OF TO EXPERT GRAY COME TO FURNISHES THE LINE MESSAGE HAD BE THESE.
Words, 2nd: Word transition probabilities are correct but no further structure is included.
THE HEAD AND IN FRONTAL ATTACK ON AN ENGLISH WRITER THAT THE CHARACTER OF THIS POINT IS THEREFORE ANOTHER METHOD FOR THE LETTERS THAT THE TIME OF WHO EVER TOLD THE PROBLEM FOR AN UNEXPECTED.
Shannon proudly notes: “The particular sequence of ten words 'attack on an English writer that the character of this' is not at all unreasonable. It appears then that a sufficiently complex stochastic process will give a satisfactory representation of a discrete source.”
Given his interests, it is not surprising that his work ended up being applicable to machine learning today.