Show HN: GPT crushes my high score in 2048.io
github.com
I can't think of a particularly polite way of putting this, so I apologise, but did you take some liberties with that title, or are you just not very good at 2048? Random inputs may get you a higher score than that.
I looked at the code to see the prompt, and I think it's a very limited way of having GPT play: the context has no board history and no information on how or where the game inserts new tiles, so the AI won't be able to execute any strategy.
There shouldn't need to be any board history, just like you don't need to know the board history to know the next move in checkers.
2048 spawns new numbers randomly on the board. A human player uses that information to determine the best move.
Yes, but that's _state_, not _history_.
Board history does not matter.
    Grid.prototype.randomAvailableCell = function () {
      var cells = this.availableCells();
      if (cells.length) {
        return cells[Math.floor(Math.random() * cells.length)];
      }
    };

Besides that, you need to predict where the next number can and should spawn. If you don't figure out a strategy for that, you will never get a decent score.
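For context, the companion function in the same codebase shows the value distribution too; this is quoted from memory of the original 2048 source, so treat the details as approximate:

    GameManager.prototype.addRandomTile = function () {
      if (this.grid.cellsAvailable()) {
        var value = Math.random() < 0.9 ? 2 : 4; // 90% twos, 10% fours
        var tile = new Tile(this.grid.randomAvailableCell(), value);
        this.grid.insertTile(tile);
      }
    };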
That relies on the board state but not the board history.
OP might be my little brother. Years back, I destroyed him in 2048, and then he called me a virgin. Then he wrecked me at a round of chess.
(I was engaged then, currently married now)
Ahh, good times.
I just played 2 games, randomly pressing all the arrow keys with my eyes closed, and got a score of ~1100 in the first game and 1468 in the second. OP's AI agent scored 1348.
Tried as well. Got 4344 by button-mashing left/up and, when it got stuck, sometimes right or down. https://imgur.com/a/k7cOPhY
What would be a good benchmark score for an AI? I just got ~4500 without any strategy.
Just write the same JavaScript that provides random moves, make it play 100 games, and take the average score.
Then do the same using GPT and compare the scores.
Anything else is just cherry-picking.
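A minimal sketch of that baseline, assuming a headless GameManager with move(), score, and isGameTerminated() roughly matching the original 2048 code (the headless constructor here is an assumption):

    // Play `games` full games with uniformly random moves, return the mean score.
    function randomBaseline(games) {
      var total = 0;
      for (var i = 0; i < games; i++) {
        var game = new GameManager(4); // hypothetical headless constructor
        while (!game.isGameTerminated()) {
          game.move(Math.floor(Math.random() * 4)); // 0: up, 1: right, 2: down, 3: left
        }
        total += game.score;
      }
      return total / games;
    }

    randomBaseline(100); // compare this number against GPT's average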
Yeah but that doesn't drive clicks.
You can get much better results at playing games with ChatGPT by adding "memory", as shown here - https://youtu.be/4oQUsiPsbOQ and here - https://youtu.be/lXFeq2yUy58
Prompting it to remember the game state and feeding it back in, or offloading that job to a plugin, can get some very interesting results.
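A rough sketch of that feedback loop, with askGPT standing in for whatever chat-completion call you use (askGPT and the game API here are both hypothetical):

    // Feed the move history and current board back into the prompt each turn.
    async function playWithMemory(game, askGPT) {
      var history = [];
      while (!game.isGameTerminated()) {
        var prompt = "Moves so far: " + history.join(",") +
                     "\nBoard:\n" + game.render() +
                     "\nReply with exactly one of: up, down, left, right";
        var move = (await askGPT(prompt)).trim();
        history.push(move);
        game.move(move);
      }
      return game.score;
    }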
OP is just bad at 2048
It looks like the wrong AI for the problem. RL should be more successful.
AlphaStar and AlphaGo were trained on thousands of games and played hundreds of thousands more to learn what works and what doesn't. This model is told the rules of the game and knows how to use the 4 basic inputs. I don't think it has any notion of strategy or of good and bad moves.
But maybe with a feedback loop it will improve.
Pretty cool example though to see its limitations.
Came here to say this: a DIY 2048 AI would actually be a very good project to take up to get your hands dirty with reinforcement learning.
This is a pretty interesting example, as I think most of you will find that after playing 2048 for a bit, you discover a way to beat the game every time. I'm not sure I could code it up, but the heuristic is basically to keep the big numbers squashed against the same edge (e.g. the top) for the whole game while using left-right-up movements to squash the smaller numbers into each other as opportunities arise.
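A crude approximation of that policy, far from the full heuristic, assuming a moveChangesBoard(board, dir) helper that reports whether a move actually does anything (the helper is made up for illustration):

    // Corner heuristic: keep the board compacted toward the top edge;
    // only move down when nothing else is legal.
    var PREFERENCE = ["up", "left", "right", "down"];

    function heuristicMove(board) {
      for (var i = 0; i < PREFERENCE.length; i++) {
        if (moveChangesBoard(board, PREFERENCE[i])) return PREFERENCE[i];
      }
      return null; // no legal moves: game over
    }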
I don't think GPT could figure this out. My impression of it lately is that it's a sort of very advanced cargo cultist, maybe with a bit of superficial intelligence confined to the linguistic sphere. Asking it for a history essay gives you a grammatically perfect melange of likely terms that will do just fine for high school but possibly not for graduate level studies.
I've never seen it do anything where I thought it had a parsimonious internal model of the problem. For instance I had it tell me about the quadratic equation, and the explanation was fine. When it came to plugging in numbers, it utterly failed, though the presentation was as if it understood it. If it had just a simple calculator inside it, this wouldn't be a problem.
This game is also pretty simple, and for the same reason I don't think it can actually do it.
> I've never seen it do anything where I thought it had a parsimonious internal model of the problem. For instance I had it tell me about the quadratic equation, and the explanation was fine. When it came to plugging in numbers, it utterly failed, though the presentation was as if it understood it. If it had just a simple calculator inside it, this wouldn't be a problem.
That's - at the moment, AFAIU - a limitation of the tokenizers used to interface with LLMs: numbers get chopped into arbitrary multi-digit tokens, so the model "calculates" bullshit because the input layer never sees the digits in a consistent form.
I’m not sure if you know yet, but one of the first ChatGPT plugins was WolframAlpha. That adds a rather advanced calculator to its toolset.
A set of random inputs will likely beat 128.
Heck, the heuristic of down/left on repeat until blocked, then up, then back to the start will win some games.
My highest score on 2048 was in 2015, when I scored over 120,000. I was close to the 16384 tile but failed to achieve it. If an AI were trained to play this game I would expect similar performance, because there are very few strategies to the game, but each of those strategies comprises multiple moves to solve a given problem in the moment.
I also modified the original game code to allow boards of different sizes. The modifications are just a minor fix to the CSS, an input field for board size, and the corresponding JS for that input field.
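If memory serves, the grid size is already the first argument to GameManager in the original source, so the JS side can be as small as this (the #board-size selector is just whatever you named the input field):

    // Rebuild the game whenever the (hypothetical) board-size input changes.
    document.querySelector("#board-size").addEventListener("change", function (e) {
      var size = parseInt(e.target.value, 10) || 4;
      new GameManager(size, KeyboardInputManager, HTMLActuator, LocalStorageManager);
    });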
The prompt contains rules of the game and a few examples.
Does anyone know if GPT really learns 2048 only from this prompt, or whether most of its knowledge comes from its training data?
The theory that GPT has merely learned the acceptable commands should be considered the more likely one, given the other comments pointing out that random input does about as well as GPT apparently does.
It is not building towards a corner, which makes it super weak to random new numbers appearing in bad places.
Little JS trick for you: you can use Array.from, so you can skip one of the .fill calls:
    let board = Array.from({length: N_ROWS}).map(() => new Array(N_COLS).fill(0));
Probably doesn't matter for a board this size, but it's one less loop over the array.
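For what it's worth, Array.from also takes a map function as its second argument, which skips the separate .map pass as well:

    let board = Array.from({length: N_ROWS}, () => new Array(N_COLS).fill(0));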
OP is just super bad at 2048
But he's excellent at clickbait!
GPT4 just told me that a billion golf balls take up 40 billion cc, and since packing efficiency of spheres is as high as 74%, they take up 26 billion cc.
No, you have to divide by 0.74, not multiply: if the balls alone occupy 40 billion cc and packing fills at most 74% of the container, the container needs 40 / 0.74 ≈ 54 billion cc.
In shit that GPT4 is not heavily trained on (as opposed to code, code, code, and more code), it can get really goofy.
Earlier in the same chat about golf balls, it claimed that if brain cells were the size of golf balls (an imaginary thread I started) there would have to be 40 billion of them. That doesn't follow; the number of brain cells is an external quantity that we hold constant, not something related to what size we imagine them to be. (The number is wrong too; the common estimate is over 80 billion.)
GPT4 wheedles tidbits of information out of your own questions and tries to work them into answers. For instance, today it claimed that the Lomuto partitioning scheme often seen in Quicksort implementations requires external storage of one bit per array element. That's utterly false; it requires no external storage proportional to the array, just a few registers to manipulate the values and array indices and whatnot. I had talked about an idea involving one bit of storage earlier in the chat. The stochastic DJ just jammed a needle into that groove and went with it.
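For the record, here is the Lomuto scheme in a few lines of JS; the only extra storage is a couple of index variables, nothing proportional to the array:

    // Lomuto partition: everything < pivot ends up left of the returned index.
    function lomutoPartition(a, lo, hi) {
      var pivot = a[hi];
      var i = lo; // boundary of the "< pivot" region
      for (var j = lo; j < hi; j++) {
        if (a[j] < pivot) {
          var tmp = a[i]; a[i] = a[j]; a[j] = tmp;
          i++;
        }
      }
      var tmp2 = a[i]; a[i] = a[hi]; a[hi] = tmp2;
      return i; // pivot is now at its final sorted position
    }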
I asked it where I could get a copy of Hoare's original paper on Quicksort. It said that it's hard to find because the paper is very old, blah blah - just excuses for not knowing where it might be. I switched to another window and found it in two seconds with Google: a free PDF of the complete text on an Oxford website.
A few days ago I asked GPT4 what the cell of a honeycomb is called in Japanese. It told me instead what a honeycomb is called. I explained that the cell of a honeycomb is a distinct object from the honeycomb. It had no idea what the cell might be called, in spite of being capable of chatting in fluent Japanese with you at the drop of a hat.
I found the info in the Japanese Wikipedia article on honeycombs: a caption under a picture calls the cells "heya", which is a common word for room (e.g. bedroom). Guess that's not one of the billions of texts it has assimilated.
Another trick up GPT4's sleeve is to ask you for hints when it can't solve something. You have to give it so many hints that it no longer needs to solve the actual problem, but then it acts like it has reasoned it out. When confronted, it admits: yes, sorry, the answer was deduced from your hints in such and such a way.
Can't say it's not entertaining, though.
I went through this protracted exercise whereby I took a paragraph from Edgar Allan Poe and encrypted it with a Vigenère cipher. I convinced GPT4 to try to crack it. First I had to get past its ethical objections. We worked out a protocol whereby it could ask me questions, the answers to which would prove that I know the plaintext and key, without it revealing the key to me. Eventually it forgot about its ethical obligation and started revealing to me what it thought the key might be. Which, if it had been right, would have amounted to cracking the text for me.
I convinced it to actually perform the letter frequency analysis to try to crack the key length. It was close, so I just gave that away.
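For reference, the standard way to do what it was attempting is the index-of-coincidence test: for each candidate key length, the columns of letters that share a key letter should have English-like statistics (IC around 0.066, versus ~0.038 for random text). A self-contained sketch:

    // Index of coincidence of a string of A-Z letters.
    function indexOfCoincidence(s) {
      var counts = {}, n = s.length, sum = 0;
      for (var i = 0; i < n; i++) counts[s[i]] = (counts[s[i]] || 0) + 1;
      for (var ch in counts) sum += counts[ch] * (counts[ch] - 1);
      return n > 1 ? sum / (n * (n - 1)) : 0;
    }

    // Guess a Vigenere key length: pick the length whose columns
    // have the highest average IC.
    function guessKeyLength(ciphertext, maxLen) {
      var text = ciphertext.toUpperCase().replace(/[^A-Z]/g, "");
      var best = 1, bestIC = 0;
      for (var len = 1; len <= maxLen; len++) {
        var ic = 0;
        for (var i = 0; i < len; i++) {
          var col = "";
          for (var j = i; j < text.length; j += len) col += text[j];
          ic += indexOfCoincidence(col);
        }
        ic /= len;
        if (ic > bestIC) { bestIC = ic; best = len; }
      }
      return best;
    }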
In my ciphertext, I preserved word divisions and also case. I told GPT4 about this and encouraged it to use that information - e.g. a single-letter lowercase ciphertext word is likely "a". It tried to use this but got the position wrong, the key offset wrong, and had other logical issues. In the end, I gave it so many hints about where the plaintext came from that it pulled the text from the network and then pretended to have solved the problem.
It then made up a fictitious Vigenère key and said, hey look, with this key that I cracked, your ciphertext decodes to the first paragraph of The Fall of the House of Usher. I reminded it that this couldn't possibly be the key, because the real one is six characters long (as we had established several times in the chat). It was basically just spewing smooth-sounding text.
It's not pure bullshit. It's like raisins of clarity in a pudding of bullshit, or something. We are seeing some sparks of something that resembles intelligence. In 5 to 15 years we will be having very different conversations about this stuff (not to mention with it).
It would be interesting to know how Anthropic's Claude+[1] (or at least plain old Claude[2]) would fare on these tests.
Pretty smart for a 3-year-old if you ask me.
Yet according to many HNers we're mere months from AGI that can replace programmers or synthesise 6-season TV series. :)
Can you link to someone actually saying this? The whole debate seems like people are simply talking past each other, with one side saying "this is a useful tool" and one side saying "this tool is not going to turn into skynet in the next two weeks, so who cares?"
Mostly it's tech journalists. "We are doomed! Along with artists and programmers!"
Given that a lot of what programmers do is write CRUD-style apps, and a lot of TV shows and movies - particularly the big IPs - follow one and the same formula... it's not like that argument is without merit.
AI is rapidly approaching a quality level near to "good enough with a bit of human cleanup afterwards".
And even then, isn't the main work with CRUD apps defining what you CRUD, and what the rules involved are? The actual writing is mostly busywork that should have been standardized already.
For the average episodic TV show, yeah. Just pull things from a hat and fill in the holes; that could already be done by pretty cheap writers.