But How do LLMs Work? (Part 2: Speak My Language!)

In this post, we’ll try to discover how LLMs emulate the first and second musketeers of communication. If you haven’t already, I’d recommend reading Part 1 of this series. It’s a short article and shouldn’t take more than 2-3 minutes to read through.

Before jumping into anything else, let’s start with three simple questions.

How are the words happy and joy related?
How are the words happy and sad related?
How are the words happy and cat related?

Here’s Tom with his interpretation of these words:

A three-panel illustration featuring two stick figures and a pixel art kitten, separated by vertical dashed lines. In the left panel, a stick figure tips a wide-brimmed hat with a smile, labeled with the quoted text "Happy/Joy." The middle panel features a detailed, colorful pixel art illustration of a brown tabby kitten sitting upright with the word "meow." written above it and the label "cat" below. In the final panel on the right, the same faint stick figure appears downcast while wearing the hat, labeled with the word "Sad." — How are these 4 words related to each other?

It doesn’t take a genius to understand that:

The words happy and joy have very similar meanings:

\(\text{Happy} \approx \text{Joy}\)

And that happy and sad are opposites:

\(\text{Happy} = - \text{Sad}\)

And that cat is completely different from either happy or sad:

\(\{\text{Happy, Sad}\} \neq \text{Cat}\)

As humans, we build this intuition over years of understanding and using language. But how do computers build an understanding of the semantics behind each word?

Tom and Hank were recently having a conversation about their pets.

Two-panel comic with two stick figures, one with a cat and one with a dog. In the first panel, one stick figure says, "My kitten loves to sleep". In the second panel, the other stick figure says, "My dog loves to sleep too!". — Even without knowing what “kitten” or “dog” mean, the shared context reveals they’re related concepts.

Imagine you knew nothing about what the words kitten or dog meant. Even so, from context alone, you could infer that these two words are similar or at least belong to the same category.

\(\underline{\text{kitten}} \text{ and } \underline{\text{dog}} \text{ are used interchangeably}\newline \implies \text{same category (pets)}\)

Now look at these two sentences:

Tom felt happy after getting a promotion.
Tom felt joy after getting a promotion.

Specifically, focus on the words happy and joy. In this context, the two are interchangeable. Though they might have different specific meanings, the fact that they are interchangeable somehow suggests that they must be similar!

Two words, despite being spelled completely differently, tend to have similar meanings if they appear in similar contexts.

This very concept, that similar words are usually interchangeable, gives rise to a simple way to find the semantic relation between different words.

You shall know a word by the company it keeps. (J. R. Firth, 1957)

In fact, from a couple of thousand sentences, it becomes almost trivial to see how closely related happy and joy are, or how different happy and sad are.
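To make the "company it keeps" idea concrete, here is a minimal sketch that counts which words appear near each word and then compares those counts. The corpus, the window size of 2, and the helper names `context_counts` and `cosine` are all made up for illustration. Real embedding models learn from far more text and use far more sophisticated training than raw counting.

```python
from collections import Counter, defaultdict
import math

# A tiny made-up corpus, invented purely for illustration.
SENTENCES = [
    "tom felt happy after getting a promotion",
    "tom felt joy after getting a promotion",
    "hank felt happy after the game",
    "hank felt joy after the game",
    "my kitten loves to sleep",
    "my dog loves to sleep too",
    "my cat loves to sleep",
]

def context_counts(sentences, window=2):
    """For every word, count the words found within `window` positions of it."""
    counts = defaultdict(Counter)
    for sentence in sentences:
        words = sentence.split()
        for i, word in enumerate(words):
            start, end = max(0, i - window), min(len(words), i + window + 1)
            for j in range(start, end):
                if j != i:
                    counts[word][words[j]] += 1
    return counts

def cosine(a, b):
    """Cosine similarity between two sparse count vectors (Counters)."""
    dot = sum(a[k] * b[k] for k in a if k in b)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0

counts = context_counts(SENTENCES)
happy_joy = cosine(counts["happy"], counts["joy"])
happy_cat = cosine(counts["happy"], counts["cat"])
kitten_dog = cosine(counts["kitten"], counts["dog"])

print(f"happy vs joy:   {happy_joy:.2f}")
print(f"happy vs cat:   {happy_cat:.2f}")
print(f"kitten vs dog:  {kitten_dog:.2f}")
```

In this toy corpus, happy and joy always appear with the same neighbours, so they score close to 1, while happy and cat share no neighbours at all and score 0. Seven sentences are enough for the pattern to show up, which is the whole point.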

Over hundreds of thousands of sentences from various sources, models build a solid understanding of how different words are related to each other. The models that specialize in learning only this semantic understanding are called embedding models.

Internally, this works through long lists of numbers called embeddings (hence the name of such models). Each word gets its own list. The idea is that similar words will have lists with similar values.
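Here is a small sketch of what "similar lists of numbers" could look like in practice. The 3-number embeddings below are hand-picked for illustration and are not taken from any real model (real embeddings have hundreds or thousands of values and are learned from data). Cosine similarity is one common way to compare two such lists, and the helper names are hypothetical.

```python
import math

# Hypothetical 3-number embeddings, hand-picked for illustration only.
# Real embedding models learn much longer lists from data.
EMBEDDINGS = {
    "happy": [0.9, 0.8, 0.0],
    "joy":   [0.8, 0.9, 0.1],
    "sad":   [0.9, -0.6, 0.0],
    "cat":   [0.0, 0.1, 0.9],
}

def cosine_similarity(u, v):
    """Return the cosine of the angle between two equal-length lists."""
    dot = sum(x * y for x, y in zip(u, v))
    norm_u = math.sqrt(sum(x * x for x in u))
    norm_v = math.sqrt(sum(y * y for y in v))
    return dot / (norm_u * norm_v)

def rank_by_similarity(word, embeddings):
    """Sort every other word by how similar its embedding is to `word`."""
    target = embeddings[word]
    scores = {
        other: cosine_similarity(target, vector)
        for other, vector in embeddings.items()
        if other != word
    }
    return sorted(scores.items(), key=lambda pair: pair[1], reverse=True)

ranking = rank_by_similarity("happy", EMBEDDINGS)
for other, score in ranking:
    print(f"happy vs {other}: {score:.2f}")
```

With these made-up values, joy ranks closest to happy, followed by sad, with cat last. The exact scores mean nothing here; only the ordering is meant to mirror the intuition.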

You can play around with real-world embedding models in this online playground to visualize these similarities.

A two-part illustration showing a heatmap on the left, with a grid on the left, showing numerical similarity of one word with another. A bar graph on the right shows the similarity of a set of words to the word "happy": "happy" (100%), "joy" (60%), "sad" (37%) and "cat" (28%). — Models learn the semantic similarity between all kinds of different words.

Keep in mind that though embeddings capture similarity, they do not necessarily capture meaning. Considering the words happy and joy, the model knows that they are similar; however, it doesn’t understand the meaning behind either of them.

Embeddings are processed by LLMs in a different way to extract meaning from context. This is actually the third musketeer of communication and will be dealt with in a later post.

When you communicate with an LLM, though the model responds in English, that’s usually not what the model sees. Instead, the model sees something called tokens.

A token usually represents a word, a part of a word, or a punctuation mark. Text is split into tokens, and each token has an ID associated with it. Each LLM usually has its own embedding layer that converts token IDs to embeddings. This sequence of embeddings is what the LLM processes.
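The pipeline from text to token IDs to embeddings can be sketched in a few lines. Everything here is an assumption made for illustration: the six-entry vocabulary, the `<unk>` fallback token, the whitespace splitting, and the randomly filled embedding table. In a real LLM, the vocabulary has tens of thousands of entries and the table's values are learned during training.

```python
import random

# Hypothetical vocabulary: each token maps to an integer ID.
VOCAB = {"my": 0, "kitten": 1, "loves": 2, "to": 3, "sleep": 4, "<unk>": 5}

# Hypothetical embedding table: one row of 3 numbers per token ID.
# The values are random placeholders; a real model learns them.
rng = random.Random(0)
EMBEDDING_TABLE = [
    [round(rng.uniform(-1, 1), 2) for _ in range(3)]
    for _ in VOCAB
]

def encode(text):
    """Split text on whitespace and look up each token's ID."""
    unknown_id = VOCAB["<unk>"]
    return [VOCAB.get(word, unknown_id) for word in text.lower().split()]

def embed(token_ids):
    """The 'embedding layer': pick the table row for each token ID."""
    return [EMBEDDING_TABLE[token_id] for token_id in token_ids]

token_ids = encode("My kitten loves to sleep")
vectors = embed(token_ids)

print(token_ids)
for token_id, vector in zip(token_ids, vectors):
    print(token_id, vector)
```

The embedding layer is nothing more than a lookup: token ID 1 always returns row 1 of the table. What makes it useful is that training nudges the rows of related tokens towards each other.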

An illustration showcasing the process of converting words into a format LLMs can understand (embeddings) — Words → Token IDs → Embeddings → Semantics → …→ Output!

A tokenizer is a program that splits text into tokens and looks up the ID of each one.

You can play around with the tokenizer used by ChatGPT here. Here’s an example, where each colour corresponds to a single unique token:

A sentence that reads, "Long words and long sentences confuse the common folk." Each word is highlighted in a different colour, including the punctuation mark that has its own unique colour. — One token is either a word, a part of a word or a punctuation mark.
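To see how one word can end up as several tokens, here is a toy tokenizer. It uses greedy longest-match against a tiny made-up vocabulary, which is a deliberate simplification: the tokenizer behind ChatGPT uses byte pair encoding with a much larger vocabulary, so its actual splits will differ. The vocabulary and the `tokenize` function are hypothetical.

```python
# Hypothetical subword vocabulary, invented for illustration.
VOCAB = {
    "long": 0, "word": 1, "s": 2, "sentence": 3,
    "and": 4, " ": 5, ".": 6,
}

def tokenize(text, vocab):
    """Greedily split text into the longest pieces found in the vocabulary."""
    text = text.lower()
    tokens = []
    i = 0
    while i < len(text):
        # Try the longest remaining piece first, then shrink it.
        for j in range(len(text), i, -1):
            piece = text[i:j]
            if piece in vocab:
                tokens.append(piece)
                i = j
                break
        else:
            raise ValueError(f"no token in the vocabulary covers {text[i]!r}")
    return tokens

tokens = tokenize("Long words and long sentences.", VOCAB)
token_ids = [VOCAB[token] for token in tokens]

print(tokens)
print(token_ids)
```

Notice that "words" is not in the vocabulary, so it gets split into "word" and "s". This is how a tokenizer with a limited vocabulary can still handle words it has never stored whole.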

In this post, we looked at one of the most important principles of language: similar words occur in similar contexts. Models exploit this property to learn how similar words are in meaning. This allows them to build an understanding of different words and how they relate to each other.

Next up: what exactly does the model do with these embeddings? If you have a guess, drop a comment below!

This post is a part of the “But How do LLMs Work?” series, designed to help you understand and discover LLMs through examples and intuition. If that’s something you’re interested in, I’d recommend subscribing!

If you don’t want to subscribe, that’s alright: share this post with a friend!

If you think kittens are cute, tell your friends about the (very) cute kitten in this post!
