Gemini 2.5 gets 24.4% on MathArena USAMO, beating previous top score of 4.7%

matharena.ai

54 points by alphabetting 9 months ago · 12 comments

IceHegel 9 months ago

I was playing around with having this model plot orbital trajectories and it was seriously impressive. Other top-tier models would struggle to get functional simulations working. Gemini 2.5 was able to do it after three or four turns in Cursor. It does feel like a meaningful step up in mathematical reasoning and math-dense coding.
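
For context, the task here is essentially a small numerical integrator. A minimal two-body sketch of the kind of thing being asked (my own illustration, not the model's output; units chosen so GM = 1):

    import numpy as np

    # Minimal two-body orbit: leapfrog-integrate a satellite around a
    # central mass (units chosen so GM = 1; circular orbit at r = 1).
    GM = 1.0
    pos = np.array([1.0, 0.0])
    vel = np.array([0.0, 1.0])    # circular-orbit speed for r = 1
    dt = 1e-3

    def accel(p):
        r = np.linalg.norm(p)
        return -GM * p / r**3

    for _ in range(10_000):
        vel += 0.5 * dt * accel(pos)   # kick
        pos += dt * vel                # drift
        vel += 0.5 * dt * accel(pos)   # kick

    print(pos, np.linalg.norm(pos))    # radius should stay near 1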

On the other hand, if you try to play chess with any of these reasoning models (including Gemini 2.5), it basically doesn't work at all. They keep forgetting where pieces are. Even with RL and sequential thinking turned up to max, they consistently move pieces in impossible ways and mutate the board position.

In a recent test with Gemini 2.5, it used around 1,700 thinking tokens to conclude it was in checkmate... but it wasn't. While this kind of behavior continues, it's going to be very hard to trust these models to do new science, or to operate in domains humans can't verify.
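
One way to ground this is to check the model's claim mechanically instead of by eye. A minimal sketch with the python-chess library (the move list here is made up for illustration):

    import chess

    # Reconstruct the position from the game's move list, then ask the
    # library, not the model, whether it is actually checkmate.
    board = chess.Board()
    for san in ["e4", "e5", "Qh5", "Nc6", "Bc4", "Nf6"]:  # made-up game
        board.push_san(san)

    print(board.is_checkmate())  # False: the claim doesn't hold here
    print(board.is_check())      # also distinguishes check from mate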

  • parsimo2010 9 months ago

    > They keep forgetting where pieces are

    The vast majority of human chess players need to look at the board to know where the pieces are. Only a few people can know where all the pieces are if you just give them a list of moves. Have you tried evaluations where you give the LLM a representation of the board state at every move, as most human players would have, and which all chess engines track?
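
    Something like this is easy to script with the python-chess library (a minimal sketch; the idea is to paste the printed state into the prompt after each move):

        import chess

        # Replay a move list and emit the full board state after every
        # move, so the model never reconstructs it from history alone.
        board = chess.Board()
        for san in ["e4", "e5", "Nf3", "Nc6"]:  # any move list
            board.push_san(san)
            print(f"After {san}:")
            print(board)        # ASCII diagram, as in the reply below
            print(board.fen())  # or a compact FEN string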

    • IceHegel 9 months ago

      My prompt is: "Let's play chess. Show me the board each time." Which gives something like:

            a b c d e f g h
        8 | r n b q k b n r | 8
        7 | p p p p . . p p | 7
        6 | . . . . . . . . | 6
        5 | . . . . . . . . | 5
        4 | . . . . P p . . | 4
        3 | . . . . . . . . | 3
        2 | P P P P . . P P | 2
        1 | R N B Q K B N R | 1
            a b c d e f g h
      
      
      I suppose I could use an external representation and paste that in, but I could also have it write a Python script to use Stockfish.
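
      For reference, that script is only a few lines with python-chess (a sketch, assuming a stockfish binary on the PATH):

        import chess
        import chess.engine

        # Let Stockfish pick the moves; the board object, not the
        # model, is the source of truth for the position.
        engine = chess.engine.SimpleEngine.popen_uci("stockfish")
        board = chess.Board()
        while not board.is_game_over():
            result = engine.play(board, chess.engine.Limit(time=0.1))
            board.push(result.move)
        engine.quit()
        print(board.result())
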
      • parsimo2010 9 months ago

        If it's forgetting the board after showing you that each turn then I suppose that is a legitimate criticism.

  • falcor84 9 months ago

    Regarding chess, I think that's an unfair challenge without access to long-term memory. Did you try giving it a memory scratchpad, e.g. with Letta [0]? A rough version of the idea is sketched below.

    [0] https://www.letta.com/
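
    Even without a framework, you can fake a crude scratchpad by keeping the authoritative position outside the model and re-injecting it into every prompt. A hypothetical sketch (ask_llm stands in for whatever chat API you use; here it just replays canned moves):

        import chess

        # The board object is the scratchpad: it, not the chat history,
        # holds the real position, and every prompt restates it as FEN.
        canned = iter(["e4", "e5", "Nf3", "Nc6"])  # stand-in replies

        def ask_llm(prompt: str) -> str:
            return next(canned)  # swap in a real chat API call here

        board = chess.Board()
        for _ in range(4):
            prompt = f"Position (FEN): {board.fen()}\nYour move, in SAN:"
            board.push_san(ask_llm(prompt))  # raises on illegal moves
        print(board.fen())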

adverbly 9 months ago

This does look like a large relative increase in score, but it comes from going from zero of the six problems correct to roughly one and a half: 24.4% of the 42 available points is about 10, with each problem worth 7. So the sample size here is quite small. Still, a record is a record. Congrats to the team!

  • onlyrealcuzzo 9 months ago

    From my small sample size (tens of queries per day), Gemini 2.5 seems like a noticeable improvement in (almost) every way compared to previous Gemini models.

    Answers do seem to take longer to generate, but they're well worth the cost.

jeffbee 9 months ago

Odd that the ETH Zurich authors published this less than a week ago and excluded Gemini 2.5:

"PROOF OR BLUFF? EVALUATING LLMS ON 2025 USA MATH OLYMPIAD"

https://files.sri.inf.ethz.ch/matharena/usamo_report.pdf

  • Tiberium 9 months ago

    Gemini 2.5 Pro was only released on 25th of March ;)

    • boroboro4 9 months ago

      They updated the paper and included Gemini 2.5. It's the only model that got a non-trivial score (it mostly solved one problem): 10/42.
