Are you better than a language model at predicting the next word?
It's a neat idea, though not what I expected from the title talking about "smart" :)
You might want to replace the single page format with showing just one question at a time, and giving instant feedback after each answer.
First, it'd be more engaging. Even the small version of the quiz is a bit long for something where you don't know what the payoff will be. Second, you'd get to see the correct answer while still having the context on why you replied the way you did.
> not what I expected from the title talking about "smart"
I think the title is mainly a reference to the TV show “Are you smarter than a fifth grader?”
Fittingly, a lot of the questions they asked on that TV show were mostly trivia. Which I also don’t think of as being a particularly important characteristic of being “smart”.
When I think of “smart” people, I think of people who can take a limited amount of information and connect dots in ways that others can’t. Of course it also builds on knowledge. You need to have specific knowledge in the first place to make connections. But knowing facts like “the battle of so and so happened on August 18th 1924, one hundred years ago today” alone is not “smart”. A smart person is someone who uses knowledge in a surprising way, or in a way that others would not have been able to. After the smart person makes the connection, others might go “oh, that’s so obvious, why didn’t I think of that” or even “yeah, that’s really obvious, I could’ve thought of that too”. And yet the first person to actually make, and properly communicate, that connection was the smart one. Smart exactly because they did.
If you want to practice it one question at a time, you can set the question count to 1. https://joel.tools/smarter/?questions=1
When I tested it this way it resulted in less of an emotional reaction.
I retired as worldwide champion (tied) of text prediction.
you: 0/1
gpt-4o: 0/1
gpt-4: 0/1
gpt-4o-mini: 0/1
llama-2-7b: 0/1
llama-3-8b: 0/1
mistral-7b: 0/1
unigram: 0/1
Uhm, I was just wondering if all models could get a question correct at the same time, and, except for this "you" model, they all got it correct:
you: 0/1
gpt-4o: 1/1
gpt-4: 1/1
gpt-4o-mini: 1/1
llama-2-7b: 1/1
llama-3-8b: 1/1
mistral-7b: 1/1
unigram: 1/1
I found the "you" model to be exceptionally bad at this. Where can I see how many I got right?
If you're looking for "knowledge" try https://d.erenrich.net/are-you-smarter-than-an-llm/index.htm...
This is fun!
I bet this could be a unique testing resource for aspiring Jeopardy contestants.
Thanks - we've LLMified the title.
I made a little game/quiz where you try to guess the next word in a bunch of Hacker News comments and compete against various language models. I used llama2 to generate three alternative completions for each comment, creating a multiple choice question. For the local language models that you are competing against, I consider them to have picked the answer with the lowest total perplexity of prompt + answer. I am able to replicate this behavior with the OpenAI models by setting a logit_bias that limits the LLM to picking only one of the allowed answers. I tried just giving the full multiple choice question as a prompt and having it pick an answer, but that led to really poor results. So I'm not able to compare with Claude or any online LLMs that don't have logit_bias.
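For anyone curious what that looks like concretely, here is a minimal sketch of the perplexity-based picking (not the author's actual code; the model name is a placeholder, and it uses mean per-token perplexity from a HuggingFace causal LM):
    import torch
    from transformers import AutoModelForCausalLM, AutoTokenizer

    model_name = "meta-llama/Llama-2-7b-hf"  # placeholder; any causal LM works
    tok = AutoTokenizer.from_pretrained(model_name)
    model = AutoModelForCausalLM.from_pretrained(model_name)
    model.eval()

    def perplexity(text: str) -> float:
        # Mean per-token cross-entropy of the text under the model, exponentiated.
        ids = tok(text, return_tensors="pt").input_ids
        with torch.no_grad():
            loss = model(ids, labels=ids).loss
        return torch.exp(loss).item()

    def pick_answer(prompt: str, candidates: list[str]) -> str:
        # The model "picks" whichever candidate gives the lowest perplexity
        # for the full prompt + answer string.
        return min(candidates, key=lambda c: perplexity(prompt + " " + c))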
I wouldn't call the quiz fun exactly. After playing with it a lot I think I've been able to consistently get above 50% of questions right. I have slowed down a lot answering each question, which I think LLMs have trouble doing.
"This exercise helped me to understand how language models work on a much deeper level."
I'd like to hear more on this.
It's an interesting test, pretty cool idea. Thanks for sharing
you: 4/15
gpt-4o: 0/15
gpt-4: 1/15
gpt-4o-mini: 2/15
llama-2-7b: 2/15
llama-3-8b: 3/15
mistral-7b: 4/15
unigram: 1/15
Seems like none of us is really better than flipping a coin, so I'd wager that you cannot accurately predict the next word with the given information.
If one could instead sort the answers by likelihood and be scored based on how high one ranked the correct answer, things would probably look better than random.
Also, I wonder how these LLMs were prompted. Were they just used to complete the text, or were they put in a "mood" where they would try to complete the text in the original author's voice?
Obviously, as a human, I'd try to put myself in the author's head and emulate their way of speaking, whereas an LLM might just complete things in its default voice.
On the full set of 1000 questions, the language models are getting 30-35% correct. With patience, humans can do 40-50%.
The language models were prompted with the text + each candidate answer, and the one with the lowest perplexity was picked. I tried to avoid instruction tuned models wherever possible to avoid the "voice" problem.
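And a rough sketch of the logit_bias trick for the OpenAI models described a few comments up (assumed API usage with a hypothetical helper; real code would need more care with leading spaces and multi-token candidates):
    import tiktoken
    from openai import OpenAI

    client = OpenAI()
    enc = tiktoken.encoding_for_model("gpt-4o")

    def pick_answer_openai(prompt: str, candidates: list[str]) -> str:
        # Bias the first token of each candidate so the model can only
        # continue with one of the allowed answers, then decode greedily.
        bias = {str(enc.encode(" " + c)[0]): 100 for c in candidates}
        resp = client.chat.completions.create(
            model="gpt-4o",
            messages=[{"role": "user", "content": prompt}],
            logit_bias=bias,
            max_tokens=1,
            temperature=0,
        )
        piece = resp.choices[0].message.content.strip()
        # Map the emitted token back to the candidate it begins.
        matches = [c for c in candidates if c.startswith(piece)]
        return matches[0] if matches else candidates[0]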
i'm curious, how did you arrive at "40-50%" possible human performance?
the task of "predicting the next word" can be understood as either "correctly choosing the next word in the hidden context", or "predicting the likelihood of each possible word".
the quiz is evaluating against the former, but humans are still far from being able to express a percentile likelihood for each possibility.
i only consciously arrive at a vague feeling of confidence, rather than being able to weigh the prediction of each word with fractional precision.
one might say that LLMs have above human introspective ability in that regard.
This is also a good test for noticing that you spend too much time reading HN comments.
Nice. I found you can beat this by picking the word least likely to be selected by a language model, because it seems like the alternative choices are generated by an LLM. “Pick the outlier” is the best strategy.
This is presumably also a simple strategy for detecting AI content in general: see how many “high temperature” choices it makes.
This was always my strategy for Who Wants to Be a Millionaire?. Pick the answer that would seem the most unlikely to be listed if any of the other three answers were the correct one.
What scores are you getting using this technique?
> You scored 11/15. The best language model, llama-2-7b, scored 10/15.
I see that you get a random quiz every time, so results aren't comparable between people. I think I got an easy one. Neat game! If you could find a corpus that makes it easy for average humans to beat the LLMs, and add some nice design, maybe Wordle-style daily challenge plus social sharing etc, I could see it going viral just as a way for people to "prove" that they are "smarter" than AI.
Given the high scores, I guess it was an easy one. I've taken the longer one, and got the following
> You scored 28/100. The best language model, gpt-4, scored 32/100. The unigram model, which just picks the most common word without reading the prompt, scored 28/100.
Assuming question difficulty averages out at N=100, a small test where the LLMs score above ~5 is "easy".
Got 8/15, best AI model got 7/15, and unigram got 1/15.
Finally a use for all the wasted hours I’ve spent on HN — my next word prediction is marginally better than that of the AI.
I have wasted an inordinate amount of time on HN. I scored 2/15.
This is the best interactive website about LLMs at a meta level (so excluding prompt interfaces for actual AIs) that I've seen so far.
Quizzes can be magical.
Haven't seen any cooler new language-related interactive fun-project on the web since:
It would be great if the quiz included an intro or note about the training data, but as-is it also succeeds because it's obvious from the quiz prompts/questions that they're related to HN comments.
Sharing this with a general audience could spark funny discussions about bubbles and biases :)
I don't quite understand, what makes "Okay I've" more correct than "Okay so"? No meaningful context was provided here, how do we know "Okay I've" was at all meaningfully correct?
For the longer comments I understand, but for the ones where it's 1 or 2 words and many of the options are correct English phrases, I don't understand why there's bias towards one? Wouldn't we need a prompt here?
Also, I got bored halfway through and selected "D" for all of them
If the samples came from HN, I wonder how likely it is that the text is already a part of a dataset (ie common crawl snapshot) so that the LLMs have already seen them?
edit: judging from the comments I saw, they were all quite recent, so I guess this isn't happening. Though I do know that ChatGPT can sometimes use a Bing search tool during chats, which can actually link to recently indexed text, but I highly doubt that the gpt4o-mini API model is doing that.
Some of them are excerpts from a much larger context, which the LLM would be using for prediction, obviously giving them a gigantic edge.
I like it. It's a humorous reversal of the usual articles that boil down to "Look! I made the AI fail at something!"
My computer can compute 573034897183834790x3019487439184798 in less than a millisecond. Doesn't make it smarter than me.
This is just a test of how likely you are to generate the same word as the LLM. The LLM does not produce the "correct" next word as there are multiple correct words that fit grammatically and can be used to continue the sentence while maintaining context.
I don't see what this has to do with being "smarter" than anything. Example:
1. I see a business decision here. Arm cores have licensing fees attached to them. Arm is becoming ____
a) ether
b) a
c) the
d) more
But who's to say which is "correct"? Arm is becoming a household name. Arm is becoming the premier choice for new CPU architectures. Arm is becoming more valuable by the day. Any of b), c), or d) are equally good choices. What is there to be gained in divining which one the LLM would pick?
The LLM didn’t generate the next word. Hacker News commenters did. You can see the source of the comment on the results screen.
Do LLMs generate words on the fly, or can they sort of "go back" and correct themselves? stackghost brought up a good point I hadn't thought about before.
Beam search generates multiple potential completions, scores them by likelihood, then picks the most likely after some threshold or length, which is close to a "go back and try again".
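For reference, a generic beam-search sketch; the next_token_logprobs function here is hypothetical, standing in for whatever model supplies per-token log-probabilities:
    import heapq

    def beam_search(prefix, next_token_logprobs, beam_width=3, max_new_tokens=10):
        # prefix is a list of tokens; each beam is (cumulative logprob, token sequence).
        beams = [(0.0, list(prefix))]
        for _ in range(max_new_tokens):
            candidates = []
            for score, seq in beams:
                # next_token_logprobs(seq) is assumed to return the top
                # (token, logprob) pairs for the given prefix.
                for token, logprob in next_token_logprobs(seq):
                    candidates.append((score + logprob, seq + [token]))
            # Keep only the beam_width highest-scoring partial completions.
            beams = heapq.nlargest(beam_width, candidates, key=lambda b: b[0])
        # Return the single most likely completion found.
        return max(beams, key=lambda b: b[0])[1]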
afaik they do not go back. keep in mind there is a context in which they are generating the response, e.g. the system prompt and the actual question.
At this point we've all gotten quite used to the "style" of LLM outputs, and personally I doubt this is the case. However, it is possible that there is some, shall we say, corruption of the data here, since it was not possible to measure the ability of LLMs to predict the next word before there were LLMs.
I propose you do the same thing, but only include HN content from before the existence of LLMs. That should ensure there is no bias towards any of the models.
If I used old comments then it's likely that the models will have trained on them. I haven't tested if that makes a difference though.
an unbiased llm shouldn't be producing "style", it should be generating outputs that closely match the training set, as such their introduction should constitute only some biasing toward the average, which also happens in language usage in humans over time. the outcome is likely indistinguishable for large general data sets and large models. i am interested to see how chatbot outputs produce human output bias in generations growing up with them though, that seems likely and will probably be substantial
But that's clearly not the case. There was a post the other day about how GPT used certain words at a rate remarkably higher than average. Also the paragraph breaks, the politesse. No, I don't have much to back it up, but generally I can tell very quickly if a chunk of text is from ChatGPT, for instance, or if an image is generated by DALL-E.
in the above, when i say llm, i mean the base models, when i say chatbot, i mean things like chatgpt, they're not the same. chatgpt is not just a frontend for the base model, studies on chatgpt covering output biasing that it has from the fine tuning, prompts and contexts and other things they do are largely not applicable to the raw model generation in this quiz, and they are also largely not applicable to llms as a whole
An LLM takes a slice of data from the world; by nature it has to organize it in some way, depending on how it's trained, and the method of organizing it is hard-coded into the model. Therefore, all models will develop some sort of style, no matter what, since somebody, or a team of people, had to figure out a way to portion out a selection of data, and this problem is intractable.
generative models are trained to generate outputs, in response to an input, that closely resemble the training data. that’s literally all they do. if a base model were introducing “style”, training (as we currently do it) wouldn’t even function. what you’re implying is mathematically intractable for generative models, and that’s fundamental to what they are and how they are made. the style stuff you’re referring to is a side effect of the fine tuning and contexts of chatbots; it’s not a property of llms or generative models
So you agree with me? Style is fundamentally part of the set of all data used in production, and that can be “tuned” as you say, but never removed. It’s the ghost in the machine, the spark of contingency. Of course, all machines bear the mark of their creators, but LLMs doubly so, as creators themselves. Like shitty, partially incoherent children.
the models used in OP site are not tuned on stylized content
you keep saying LLM when you mean chatbot, i’m not sure if you’re really reading my posts
Where do the incorrect options come from?
In another comment the author wrote
> I made a little game/quiz where you try to guess the next word in a bunch of Hacker News comments
So I guess the correct answer comes from the HN user who wrote the comment?
Yeah, but I was wondering about the incorrect options.
I suspect they come from the LLMs.
For anyone else daring the full 100 question quiz: you need to get at least a third right to be considered better than guessing by traditional statistical standards. (You'd need more than half to be better than LLMs.)
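For anyone who wants to check that threshold themselves, a quick sketch using a one-sided binomial test (assuming four answer choices, so chance is 25%):
    from scipy.stats import binomtest

    n, p_chance = 100, 0.25
    for k in range(26, 40):
        # One-sided test: probability of scoring at least k by pure guessing.
        pval = binomtest(k, n, p_chance, alternative="greater").pvalue
        if pval < 0.05:
            print(f"{k}/{n} correct beats guessing at p = {pval:.3f}")
            break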
I got 9/15, vs. 4/15 for an LLM. I assume these are lifted from HN? Seems like an indication I should spend less time here...
You scored 6/15. The best language model, gpt-4o, scored 6/15. The unigram model, which just picks the most common word without reading the prompt, scored 2/15.
Keep in mind that you took 204 seconds to answer the questions, whereas the slowest language model was llama-3-8b taking only 10 seconds!
> You scored 8/15. The best language model, mistral-7b, scored 6/15. The unigram model, which just picks the most common word without reading the prompt, scored 5/15.
you: 8/15
gpt-4o: 2/15
gpt-4: 4/15
gpt-4o-mini: 4/15
llama-2-7b: 5/15
llama-3-8b: 5/15
mistral-7b: 6/15
unigram: 5/15
(In I think 120 seconds - didn't copy that part.)
Interesting that results differ this much between runs (for the LLMs).
Surely someone did better than me on their first run?
Edit: I wonder if the human scores correlate with the age of the HN account?
I took some mushrooms and hallucinated the answers.
Was mine broken? One of my prompts was just '>'. So of course I guessed a random word. The answer key showed I got it wrong, but showed the right answer inserted into a longer prompt. Or is that how it's supposed to work?
That isn't how it's supposed to work. I mean, sometimes you get a super annoying prompt like ">", but if you guess the right answer it should give you the point. I just checked the two prompts like that, and they seem to work for me.
Right, I got the answer incorrect, so that part worked right. I just wasn't sure if the question was intentionally clipped and missing that context, but it does sound intentional. I guess I make a poor LLM!
Yes. I can tell you about things that happened this morning. Your language model cannot.
I can also invite you out for a coffee and your LLM can’t do that either–yet.
They're perfectly capable of inviting you out for coffee. They just can't show up yet.
Well the showing up part is quite important I’d argue.
though, with web access and a credit card and the right information, you could probably get one to order a pizza to your house.
I’m cool with that as long as it’s not my credit card.
This isn't really the challenge (loss function) that language models are trained on. It's not a simple next-word challenge; they get more context. See how BERT was trained as a reference.
Like an ML model, I would prefer being scored with cross entropy and not right/wrong. Like, I might guess wrong but it might not be that far off in likelihood.
It helps that we get so many questions, but I agree it's inefficient. As a human forecaster I also prefer being judged in part on my confidence in each of the alternatives.
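A minimal illustration of that kind of scoring, i.e. log loss over the answer options; the probabilities, and which word counts as correct, are made up here:
    import math

    def log_loss(probs: dict[str, float], correct: str) -> float:
        # Lower is better: a confident correct guess scores near 0,
        # a confident wrong guess is heavily penalized.
        return -math.log(probs[correct])

    # Example: mostly betting on "more" when the hidden word was "the".
    guess = {"ether": 0.05, "a": 0.15, "the": 0.30, "more": 0.50}
    print(log_loss(guess, "the"))  # ~1.20, better than a flat wrong answer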
So... If I picked the same results, in the same timeframe... And I don't think glue should go on pizza... Does that mean LLMs are completely useless to me?
I got one of my own comments on the 15 question quiz!
I like the website, but it could be a bit more explicit about the point it's trying to make. Given that a lot of people tend to think of LLM as somehow a thinking entity rather than a statistical model for guessing the most likely next word, most will probably look at these questions and think the website is broken.
I got 2/15, so worse than random choice... I guess partly because English is not my mother tongue.
Of course not, but that does not mean LLMs will lead to AGI. We might never build AGI in fact: https://www.lycee.ai/blog/why-no-agi-openai
That article, disappointingly, doesn't provide any arguments as to why we can't build AGI.
>the quintessential language model task of predicting the next word?
Based on what? The whole test is flawed because of this. Even different LLMs would choose different answers and there's no objective argument to make for which one is the best.
The one provided in the original post.
I don't see any of that.
Quote?
The prompts you see in the quiz are from real hacker news comments. Whatever word the commenter said next is the "correct" word.
This is what I see,
> Are you smarter than a language model? There are a lot of benchmarks that try to see how good language models are at human tasks. But how good are you at the quintessential language model task of predicting the next word?
And then a list of questions. How am I supposed to know it has anything to do with HN?
After the quiz, the source is linked along with the full comment.
> 8. All of local politics in the muni I live in takes place in a forum like this, on Facebook[.] The electeds in our muni post on it; I've gotten two different local laws done by posting there (and I'm working on a bigger third); I met someone whose campaign I funded and helped run who is now a local elected. It is crazy to think you can HN-effortpost your way to changing the laws of the place you live in but I'm telling you right now that you can.
This is a magical experience. I've done something similar in my university's CS department when I pointed out how the learning experience in the first programming course varies too much depending upon who the professor is.
I've never experienced this anywhere else. American politicians at all levels don't appear to be the least bit responsive to the needs and issues of anyone but the wealthy and powerful.
7/15, 90 seconds. I'll blame it on the fact that I'm not a native English speaker, right? Right?
On a more serious note it was a cool thing to go through! It seemed like something that should have been so easy at first glance.
I am a native English speaker and only got 5/15 - and it took me over 100 seconds. You have permission to bask in the glory of your superiority over both GPT4 and your fellow HN readers!
I feel like I recognise the comment about tensors from HN a few days ago, haha.
I think this is a good joke on the naysayers. But if the author is here, I would like clarification on whether the user is picking the next token or the next word. Because if it is the latter, I think this test is invalid.
The language model generating the candidate answers generates tokens until a full word is produced. The language models picking their answer choose the completion that results in the lowest perplexity independent of the tokenization.
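A rough sketch of what "generate tokens until a full word is produced" might look like (assumed HuggingFace API with a placeholder model name, not the author's code):
    from transformers import AutoModelForCausalLM, AutoTokenizer

    model_name = "meta-llama/Llama-2-7b-hf"  # placeholder
    tok = AutoTokenizer.from_pretrained(model_name)
    model = AutoModelForCausalLM.from_pretrained(model_name)

    def candidate_word(prompt: str) -> str:
        # Sample a short continuation and keep only its first whole word.
        ids = tok(prompt, return_tensors="pt").input_ids
        out = model.generate(ids, do_sample=True, temperature=1.0,
                             max_new_tokens=8, pad_token_id=tok.eos_token_id)
        continuation = tok.decode(out[0, ids.shape[1]:], skip_special_tokens=True)
        words = continuation.strip().split()
        return words[0] if words else ""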
I'd say the test is still not quite valid, and sits somewhere between the original "valid" task and "guess what the LLM would say", as suggested in another comment here. The reason is: it might be easier for LLMs to choose the completion out of their own generated variants (1) than out of the real token distribution.
1. perhaps even out of variants generated by other LLMs
Everything I picked was grammatically correct, so I don't see the point. Is the point of a "language model" just to recall people's comments from the internet now?
Always has been.
5/15, so the same as choosing the most common word.
I think I did worse when the prompt was shorter. It just becomes a guessing game then, and I find myself thinking more like a language model.
It says choosing the most common word got just 1/15 (and their best LLM got 4/15).
Yeah, it should be sentences that have low next token distribution entropy. Where an LLM is sure what the next word is. I bet people do real well on those too. By the way, I also had 5/15.
The LLMs are better than me at knowing the finer probabilities of next words, and worse than me at guessing the points being made and reasoning about that.
Is this with the “temperature” parameter set to 0? Most LLM chatbots set it to something higher.
It would be interesting to try varying it, as well as the seed.
Temperature doesn't play a role here, because the LLM is not being sampled (other than to generate the candidate answers). Instead, the answer the LLM picks is decided by computing the perplexity of the full prompt + answer string.
Tried to respond like an LLM would
> You scored 7/15. The best language model, mistral-7b, scored 7/15.
I guess it's a success
This is a nonsense test. There is no context, so the 'next' word after the single word 'The' is effectively random.
I'm pretty certain that LLMs are unable to work at all without context.
They will "work", ie give a prediction, it's simply that it will have a pretty low probability of being the correct answer, which is a consequence of the highly limited context.
IMHO that doesn't make it nonsense, but maybe you are reading something different into the purpose of this test to what I am.
7/10. This is more about set shattering than 'smarts'.
LLMs are effectively DAGs; they literally have to unroll infinite possibilities, in the absence of larger context, into finite options.
You can unroll a cyclic graph into a DAG, but you constrict the solution space.
Take the spoken sentence:
"I never said she stole my money"
And say it multiple times with emphasis on each word and notice how the meaning changes.
That is text being a forgetful functor.
Since you can describe it as PAC learning, or as compression, which is exactly equivalent to the finite set shattering above, you can assign probabilities to next tokens.
But that is existential quantification, limited to your corpus and based on pattern matching and finding.
I guess if "Smart" is defined as pattern matching and finding it would apply.
But this is exactly why there was a split between symbolic AI, which targeted universal quantification and statistical learning, which targets existential quantification.
Even if ML had never been invented, I would assume that there were mechanical methods to stack rank next tokens from a corpus.
This isn't a case of 'smarter', but just different. If that difference is meaningful depends on context.
With some brief experimentation ChatGPT also fails this test.
It might make sense: any kind of fine-tuning of LLMs usually reduces generalization capabilities, and instruction-tuning is a kind of fine-tuning.
you: 6/15 (336sec)
gpt-4o: 5/15
gpt-4: 5/15
gpt-4o-mini: 5/15
llama-2-7b: 6/15
llama-3-8b: 6/15 (Slowest Bot: 14sec)
mistral-7b: 5/15
unigram: 2/15
Yes definitely
you: 5/15
gpt-4o: 5/15
gpt-4: 5/15
gpt-4o-mini: 4/15
llama-2-7b: 7/15
llama-3-8b: 7/15
mistral-7b: 7/15
unigram: 4/15
The only ones I got right were ones where I had read the actual HN comment…
Just proves why IQ tests are worthless.