GPT-4: It's not "just statistics".


GPT-4 is just statistics, just predicting the next token, it has no real understanding.

Right?

Probably not. In parts 1 and 2, I’ll present a mathematical argument to show that if GPT-4 is doing statistics, they are not the sort of statistics that would imply a lack of understanding.

Then in part 3 I will go a step further, and run a transparent and reproducible experiment that demonstrates how a GPT-like transformer model can learn not just surface statistics of a dataset, but rather the underlying process that produced that data, in a way that would be impossible to do through lesser means.

Let’s get these goalposts planted in the ground, because “statistics” has no definite meaning beyond “involving inference from observational data”.1 By that measure anything can be statistics, including almost all of human thought. Humans are very impressive creatures that do impressive things, and when people say “GPT-4 is just doing statistics” it is mostly meant to sound dismissive rather than impressed, so clearly a certain lesser kind of statistics is implied.

So by “just statistics” I mean a heuristic based on correlations between words, standing in contrast with a mechanistic (or algorithmic) model. I think this view was most eloquently expressed by Bender et al in the paper that coined the term “stochastic parrot”:

Contrary to how it may seem when we observe its output, a [language model] is a system for haphazardly stitching together sequences of linguistic forms it has observed in its vast training data, according to probabilistic information about how they combine, but without any reference to meaning: a stochastic parrot.

— On The Dangers of Stochastic Parrots: Can a Language Model Be Too Big? (2021)

If you’re still not sure of the difference, think of a royal court astronomer attempting to predict future eclipses based on just searching for patterns in historical records of past eclipses. If this astronomer emerges from their study having found the 223-month Saros eclipse cycle, it would be fair to label this work as “just statistics”. If the astronomer, in the course of their work, derives Newton’s laws of gravity, it seems fair to label this as “not just statistics”. The astronomer would have found not only the surface-level pattern, but also the compact, underlying laws2 that correctly predict eclipses for never-observed planets in never-observed orbits.

[Image: heliocentric model of the solar system — https://commons.wikimedia.org/wiki/File:Heliocentric.jpg]

In 1948, Claude Shannon published the groundbreaking paper A Mathematical Theory of Communication, founding the field of information theory in the same stroke.3 In it, Shannon briefly examined a process for compressing text for the purpose of digital transmission,4 which he called the “series of approximations to English”.5

The first line of the text below consists of uniformly-random words. The second line has its words drawn according to their relative frequency of occurrence in English, and the third according to the frequency with which they follow the previous word.6 As n increases, the text gradually starts to resemble more meaningful English.

n = 0: RECEIVE FALL SURPRISED FRIDAY INGREDIENT…

n = 1: REPRESENTING AND SPEEDILY IS AN GOOD…

n = 2: THE CHARACTER OF THIS POINT IS THEREFORE…

These are called n-gram models, where n is the depth of statistical correlation considered by the model (n = 3 would consider triplets of words, and so on). Among language models, n-grams fit the bill for “just statistics”, even as n becomes very large. There is no representation of concepts or abstraction; words are being sampled purely based on their correlations with the previous words. And yet, as further layers of statistical correlation are applied, the results start to become impressively natural.
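To make this concrete, here is a minimal word-level 2-gram sampler. The corpus and function names are my own toy illustration, not Shannon's original procedure:

```python
import random
from collections import defaultdict

def train_bigram(corpus):
    """Count, for each word, how often each other word follows it."""
    counts = defaultdict(lambda: defaultdict(int))
    words = corpus.split()
    for prev, nxt in zip(words, words[1:]):
        counts[prev][nxt] += 1
    return counts

def sample(counts, start, length=8):
    """Generate text by repeatedly sampling the next word in
    proportion to how often it followed the previous one."""
    out = [start]
    for _ in range(length):
        followers = counts.get(out[-1])
        if not followers:
            break  # dead end: this word never appeared mid-corpus
        words = list(followers)
        weights = [followers[w] for w in words]
        out.append(random.choices(words, weights=weights)[0])
    return " ".join(out)

counts = train_bigram("the cat chases the mouse that frightened the dog")
print(sample(counts, "the"))
```

Note that the model stores nothing but co-occurrence counts; there is nowhere in this structure for a concept of "cat" or "mouse" to live.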

Many people seem to think of GPT as a fancy n-gram. It’s not! It’s really not!

It’s an easy mistake to make, because the way GPT works is, in a certain sense, functionally equivalent to an n-gram, but that doesn’t mean GPT is an n-gram. If this sounds like a contradiction, let me just illustrate with a chess example.

We could imagine designing a chess-playing program that consists of a huge database of chess positions, and for each position, some recommended next moves. Each turn, the chess program picks a move corresponding to the current table state. As far as chess programs go, if the database is comprehensive and the recommendations are sufficiently good, this program could mimic any other chess program. Let’s call this the “playbook method”.

Now consider Stockfish, a world-leading open-source chess program. There is no way to tell, merely by playing against Stockfish and without inspecting its source code, whether or not it is using the playbook method internally. You could even use Stockfish to generate a chess playbook by repeatedly querying it and recording its output, and if you did this comprehensively, that playbook would essentially be a complete representation of Stockfish’s decision-making. If you wanted to emulate Stockfish or reason about its output, the playbook could be a perfectly valid mathematical model.

But that Stockfish playbook does not represent how Stockfish works, and the proof of that is that Stockfish’s source code (including its evaluation network) occupies less than 50 megabytes on my hard drive, whereas its playbook equivalent would be so big that storing it would cause my laptop to immediately collapse into a black hole.7 Since I survived to write this article, Stockfish must be doing something more clever that achieves the same goal of playing chess.
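A rough sense of the scale involved, as a back-of-envelope sketch: the position count is John Tromp's well-known estimate of legal chess positions, and the bytes-per-entry figure is my own assumption.

```python
from math import log10

legal_positions = 4.8e44  # Tromp's estimate of legal chess positions
bytes_per_entry = 40      # position encoding plus a recommended move (assumption)

# Work in log10, since the raw number is far too large to store directly.
log_bytes = log10(legal_positions) + log10(bytes_per_entry)

stockfish_bytes = 50e6    # Stockfish source + evaluation network, per the text
print(f"playbook ≈ 10^{log_bytes:.0f} bytes, vs Stockfish at {stockfish_bytes:.0e} bytes")
```

About 10^46 bytes against 5×10^7: some forty orders of magnitude of compression, achieved by replacing a table with an algorithm.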

[Image: two black holes colliding — credit Caltech/R. Hurt (IPAC), https://www.jpl.nasa.gov/news/black-hole-collision-may-have-exploded-with-light]
Undo! Undo!

I’m not sure about GPT-4’s exact specifications, but GPT-3’s token dictionary contains about 50,000 tokens.8 GPT-4’s context window — the length of correlations that it can consider, and roughly equivalent to the n of an n-gram — is 8,000 tokens long. GPT-4 is thought to contain 1 trillion parameters, although some sources say it’s as high as 170 trillion. That’s a lot of parameters! It sure sounds like brute-force statistics, doesn’t it?

Yeah, so… how many parameters would we need to build the n-gram model that mimics GPT-4?

A 2-gram Markov chain selecting the sentence THE CAT CHASES THE MOUSE THAT FRIGHTENED THE DOG as the most likely continuation of an initial state THE CAT CHASES. The transition probabilities are represented here by the weight of the connecting arrow. The 2-gram only remembers the previous token. It can encode some ideas, like that a mouse is unlikely to frighten another mouse, but not very compactly.

An n-gram is a type of Markov process that defines transition probabilities between states in a state space. The thing is that as you increase the order n of the n-gram, the number of possible states gets big, fast. The volume of GPT-4’s state-space is 50,000⁸⁰⁰⁰ states. For each of those states, we need to define a transition probability to the next token among those 50,000 tokens. That’s roughly ten-to-the-power-of-37,600 parameters. Storing these parameters would not just collapse your laptop into a black hole; it would collapse the entire observable universe into a black hole after barely starting the download.
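The state-space arithmetic is easy to check; working in log10 keeps the numbers finite (this is a back-of-envelope sketch, and the exact exponent depends on rounding):

```python
from math import log10

vocab = 50_000   # tokens in the dictionary
context = 8_000  # context window length, in tokens

# Distinct contexts an n-gram with n = 8,000 must distinguish: vocab ** context.
log_states = context * log10(vocab)     # ≈ 37,592

# Each state needs a transition probability for each of the 50,000 tokens.
log_params = log_states + log10(vocab)  # ≈ 37,596

print(f"parameters ≈ 10^{log_params:,.0f}")
```

For comparison, the observable universe contains somewhere on the order of 10^80 atoms, so there is no physical substrate on which to write this table down.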

Does the n-gram model really need all those parameters to mimic GPT-4? Yes, it does. And it gets worse when you consider the size of the dataset needed to calculate those parameters. This is because an n-gram cannot leverage abstractions.

Let’s look at a simple example of applying an abstraction. I have adapted the classic Wolf, Goat, and Cabbage puzzle, which can be found in many books and webpages, but changed the names to something unlikely to have been encountered verbatim.

[Prompt]

Please solve this puzzle.

An Earthling went to the Galactic Federation and met a Martian, a Venutian, and a Moonling. The Earthling wanted to take the wormhole with her back to Earth. But crossing the galaxy by wormhole, the Earthling could carry only herself and a single one of her comrades: the Martian, the Venutian, or the Moonling.

Due to ancient rivalries, if the Martian and the Venutian were left unattended together, the Martian would disintegrate the Venutian with his blaster. If the Venutian and the Moonling were left unattended together, the Venutian would melt the Moonling with her radiation beam.

The Earthling's challenge was to transport herself and her comrades to her home planet, while ensuring they all survived the journey. How did she do it?

In my testing, GPT-4 gets it right.9 The quicker and cheaper GPT-3.5 did not.10 (I’ve put their transcripts in the footnotes; you can click to read their responses).

Solving this should be quite an easy task for anyone who has learned the pattern of this riddle — even if they aren’t good at logic puzzles in general, they should be able to get this just by analogy to the original. All you need to be able to do is regurgitate the memorized Wolf, Goat, and Cabbage solution, while swapping the nouns for any <creature_1>, <creature_2>, and <creature_3>. It is really not a difficult puzzle, but let’s consider what would be necessary for an n-gram model to generate this answer.

Notice the mistakes that the GPTs didn’t make: the solution does not mention a wolf, a goat, a cabbage, or a river. Nor did it even attempt to correct my mis-spelling of Venusian. If you think of GPT as a big n-gram, this should actually surprise you! One can easily gather transition-frequency statistics for sentences like this one:

 Jack went to the → {cafe | school | office}

But you won’t find the specific sentence below even once in the entire internet,11 let alone often enough to estimate 14-gram transition frequencies:

the Earthling then returns to the Galactic Federation one last time to get the → {Martian | Venutian | Moonling}

A 7-gram language model trained on this problem could very reasonably be expected to output “Goat” just by picking up from “one last time to get the”. But for an n-gram to solve this as well as GPT-4 did, it will need transition frequencies that explicitly point to the word Venutian as the correct completion of that specific preceding state, without being able to apply the logical abstraction of “substitute whatever word was used in place of <creature_2> in cases like this”.

A pure n-gram structure is too rigid to allow that sort of indirection, so it needs to be explicitly populated somewhere in that universe-swallowingly-gigantic lookup table. Of course, someone could implement a more sophisticated language model that gets away with a smaller lookup table by tracking unexpected words in its input and identifying analogies with data that exists in its dataset, but would that still be just statistics?

The state space of a 3-gram Markov chain, shown selecting FRIGHTENED THE DOG as the most likely completion of THE CAT CHASES THE MOUSE THAT. Each axis of space represents a choice of token. A 3-gram can encode slightly more complex ideas than a 2-gram, but the cost of doing so grows rapidly. It is hard to do better if there are no regularities or patterns among the transitions, but it gets unwieldy as this gets extended to longer sentences. Learning a more compact “world model” of animal interactions can achieve the same prediction while needing fewer parameters.

An abstraction shrinks the required number of parameters by identifying a regularity in the problem space. Instead of needing to independently store transition frequencies for every version of this puzzle over 50,000+ different possible creature names, we can get away with just one pattern. This is an example of compression, which is a move in the opposite direction from the process that expands Stockfish into a chess playbook. The more abstract regularities we can find that shrink the problem space, the fewer parameters we need, but as we apply them to our program, it resembles a database less and Stockfish more.
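As a toy illustration of that compression (the template and function name are mine, and this is not a claim about how GPT represents the puzzle internally): the entire family of re-skinned riddles collapses to one parameterized solution, where the middle creature is the one that conflicts with both of the others.

```python
def river_solution(c1, c2, c3):
    """Solve the Wolf/Goat/Cabbage pattern for any three creatures,
    where c2 is dangerous to leave alone with either c1 or c3.
    One template replaces a separate lookup entry per name combination."""
    return [
        f"Take the {c2} across first.",
        f"Return alone, then take the {c1} across.",
        f"Bring the {c2} back, leave it, and take the {c3} across.",
        f"Return alone one last time to get the {c2}.",
    ]

# The classic puzzle and the re-skinned one are the same four lines:
for step in river_solution("Martian", "Venutian", "Moonling"):
    print(step)
```

Four format strings with three slots, versus one universe-sized table row per combination of names: that gap is the whole argument in miniature.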

So transformers like GPT-4 aren’t doing token prediction by naïve application of raw statistics. They compress the problem space way too effectively. But will transformers ever develop Newton-like understanding, or just dig up shallow patterns like the Saros cycle?

I don’t expect GPT-4 to fill Isaac Newton’s shoes anytime soon. But let me propose a simple experiment that, I believe, captures the main essence of this question.

If we teach a transformer network to sort lists of random integers, will it just learn the statistics of sorted lists, or will it learn to implement a sorting algorithm?

You might be wondering why we should waste our time training an AI to sort lists when we already have algorithms that can do that perfectly well. Here’s why this is a worthy investigation: First, it should be easy to tell the difference between the output of a sorting algorithm and statistics.12 Second, this is a simple enough problem that the role of each neuron in the trained network might be transparently understandable, rather than training something too vast to understand at a low level. Third, we can compare the learned result against closed-form solutions, which is always a good thing for software validation. And as Bender et al point out, while ChatGPT’s replies can seem sensible and coherent “in the eye of the beholder”, it is plausible that our human psychology is tempting us into reading more depth into those replies than is actually there. The correctness of a sorted list is objectively verifiable.

Also this experiment doesn’t cost a fortune to try, and unlimited training data can be generated on-the-fly without infringing anyone’s privacy or copyrights. So there’s that.

In a nutshell, here’s why sorting is a good test to rule out stochastic parroting:

  • The correctness of a sorted list is not in the eye of the beholder;

  • We can easily generate unique lists, both for training and for testing, to verify with confidence that the answer is not merely being copied from some remote corner of a vast and proprietary ocean of training data;

  • The correctness of each output always depends on the entire input sequence, not just neighbouring tokens; there is no way to succeed without learning some form of counting and ordering along the way.

Sorting, here, is really a stand-in for the general idea of whether the network can learn the underlying algorithm13 that generated the training data it observes — the Newton’s laws of this dataset.

SORT (51,87,46,2,17,46,80,55,3,24,70,73,52,52,11,74,40):
---> [2,3,11,17,24,40,46,46,51,52,52,55,70,73,74,80,87];

If you expect it’s possible to sort lists by statistics alone, think just for a minute about what that would look like. Sure, it should be easy to grasp that every second or third character is a comma, and the numbers monotonically increase, and that long lists will often tend to start with [0, or [1, and end with 98]; or 99];. But how many 0s? How many 1s? When is it preferable to start with 2 and end with 87? How often should there be two 46s and two 52s near the middle? There are no statistical shortcuts; every single token in the input counts. If you’re going to sort by pattern-guessing and not by a process that involves actual sorting, it becomes astronomically unlikely that you’ll pull the exact right sequence out of that universe-sized hat. An n-gram can pull this off by mapping every input list to a corresponding output list, but as usual, not without turning my laptop into a black hole in the process.

My expectation for this experiment is that at some point in training, we’ll see the error drop off a cliff, and at that point the neural network will have configured itself to encode a crude sorting algorithm.14 It might not go straight to zero; the learning process involves stumbling around in the dark without even knowing a priori that there’s a deterministic process that generates the data,15 and perhaps it will initially have stumbled on an algorithm for sorting small sub-sequences, mixed in with some guessing. But eventually it should be sorting long lists without error, most of the time.

On the other hand, if it only learns shallow statistics, then there will be many mistakes in the output: it might contain sequences of monotonically-rising numbers, but not every number in the input will be found in the output; repeated occurrences of certain numbers in the input might be dropped, and numbers that were never in the input might be hallucinated into the output. In that case, it will only be able to sort very short lists without error, like 6 or 7 numbers at the most, and only with some luck. Let’s see what happens!

To run this experiment I cloned Andrej Karpathy’s nanoGPT project, and replaced the Shakespeare dataset with lists of random numbers between 0 and 99, up to 127 characters long.16 The context window is twice as long, at 256 characters, because the first entry needs to still be in context by the time it’s gotten to the end of the list. Each of the characters 0-9, as well as the comma separators and the parentheses, is its own token.
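Generating that training data is trivial. Here is a sketch along the lines of what I used; the formatting constants come from the examples shown earlier, and the function name is mine:

```python
import random

def make_example(max_chars=127):
    """Produce one 'SORT (...): ---> [...];' training line with numbers
    0-99, keeping the unsorted part under max_chars characters."""
    nums = []
    while True:
        candidate = nums + [random.randint(0, 99)]
        text = "SORT (" + ",".join(map(str, candidate)) + "):"
        if len(text) > max_chars:
            break  # adding another number would overflow the budget
        nums = candidate
    answer = "[" + ",".join(map(str, sorted(nums))) + "];"
    return "SORT (" + ",".join(map(str, nums)) + "):\n---> " + answer

print(make_example())
```

Because each example is drawn fresh, the training set never repeats a list, which is exactly what rules out answering by memorization.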

It’s trained the normal way language models are, just predicting the next token. But to judge its performance at sorting throughout the training process, I periodically sample its output across ten random input lists of various lengths.17 Rather than attempt to judge “how good” a sort is, I score all-or-nothing: only a perfectly sorted list, with no missing or excess numbers or misplaced commas, and terminated with ‘];’ is admissible. In that sense, even scoring one point would be fairly impressive.
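The all-or-nothing check is simple enough to sketch. This is my reconstruction of the scoring rule described above, not the actual evaluation code:

```python
def score_output(input_nums, output_text):
    """Return True only for a perfectly sorted, correctly formatted
    answer: the same multiset of numbers, comma-separated, bracketed,
    and terminated with '];'. Anything else scores zero."""
    if not (output_text.startswith("[") and output_text.endswith("];")):
        return False
    try:
        output_nums = [int(tok) for tok in output_text[1:-2].split(",")]
    except ValueError:
        return False  # misplaced comma or stray character
    return output_nums == sorted(input_nums)

assert score_output([3, 1, 2], "[1,2,3];")
assert not score_output([3, 1, 2], "[1,2];")      # dropped a number
assert not score_output([3, 1, 2], "[1,2,3,3];")  # hallucinated a repeat
```

Note that this check is exactly strict enough to catch the single-dropped-number failure mode that shows up later in the training run.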

Here are the results after about 24 hours of training:

I would call this a success. I don’t expect this network will sort every given list flawlessly (yet), but it’s already performing far too well to be regurgitating memorized patterns. One thing does jump out though: what went wrong after 166,000 steps? It suddenly goes from getting everything right, to everything wrong. I looked at the log, and found that the same mistake was being made with every single list: each output was sorted correctly, with the exception of dropping a single number. For example:

INPUT (43 numbers): (59,44,48,74,43,0,85,55,36,16,84,40,35,41,70,20,43,21,24,7,18,11,75,70,63,93,47,1,44,1,65,20,87,21,94,79,29,7,18,95,66,11,91):
EXPECTED: [0,1,1,7,7,11,11,16,18,18,20,20,21,21,24,29,35,36,40,41,43,43,44,44,47,48,55,59,63,65,66,70,70,74,75,79,84,85,87,91,93,94,95];

OUTPUT: [0,1,1,7,7,11,11,16,18,18,20,20,21,21,24,29,35,36,40,41,43,43,44,44,47,48,55,63,65,66,70,70,74,75,79,84,85,87,91,93,94,95];    << did you spot the error?

All ten lists from that checkpoint have that exact same error. It seems the training process quickly corrected that mistake and the network returned to (mostly) perfect performance. Curious.

Anyway, I’m satisfied that this little network, trained on my laptop plugged into a single, anemic GPU, cannot possibly be a mere “stochastic parrot”. The math doesn’t work out. It’s sorting numbers by learning an algorithm that sorts them, and as far as I’m concerned, that means it’s doing more than just statistics.

Probably GPT-4 is as well.18

This post isn’t intended as professional engineering advice. If you are looking for professional engineering advice, please contact me with your requirements.
