How well does your LLM do?

2 points by sscaryterry a year ago · 9 comments · 1 min read


You go to a shop and buy three things. You work out the cost as being £5.88. When you get to the till you realise that instead of adding the items together on your calculator, you actually pressed multiply each time. Interestingly, the cost at the register also comes out to £5.88. What were the costs of the three items?

Try asking your preferred LLM to solve this. How did it do?

PaulHoule a year ago

What's the right answer here?

   a + b + c = 5.88
   abc = 5.88
is terribly underdetermined and has a huge number of possible answers if the domain is the reals. If you postulate that a=1, then b=2.71 and c=2.17 is right to within rounding error (adds right, multiplies to 5.8807).
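Fixing one price turns the other two into the roots of a quadratic. A minimal sketch of that step, assuming a is pinned to 1.00 (the function name and the choice of a are illustrative, not part of the puzzle):

```python
import math

TOTAL = 5.88  # both the sum and the product of the three prices


def solve_with_fixed_price(a, total=TOTAL):
    """Fix one price a, then solve b + c = total - a and b * c = total / a."""
    s = total - a          # required sum of the remaining two prices
    p = total / a          # required product of the remaining two prices
    disc = s * s - 4 * p   # discriminant of x^2 - s*x + p = 0
    if disc < 0:
        return None        # no real solution for this choice of a
    root = math.sqrt(disc)
    return (s + root) / 2, (s - root) / 2


b, c = solve_with_fixed_price(1.0)
print(round(b, 2), round(c, 2))  # 2.71 2.17
```

Other choices of a give other real-valued solutions, which is why the problem is underdetermined over the reals.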

If you are treating it as actual cents though, I don't think there's an exact answer, as a/100, b/100 and c/100 will come out of the factors of 588 and I just don't see getting enough big numbers to add up to 588.
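The whole-pence case is cheap to check by brute force. A minimal sketch, assuming prices are positive whole pence and that the product condition in pence is a * b * c = 588 * 100 * 100, since (a/100)(b/100)(c/100) = 5.88 (the function name is made up for illustration):

```python
def exact_pence_solutions(total_pence=588):
    """Find a <= b <= c in whole pence with a + b + c = total and
    (a/100) * (b/100) * (c/100) = total/100."""
    target = total_pence * 100 * 100  # product condition scaled to pence
    solutions = []
    for a in range(1, total_pence // 3 + 1):
        if target % a:
            continue  # a must divide the scaled product
        for b in range(a, (total_pence - a) // 2 + 1):
            c = total_pence - a - b  # forced by the sum condition
            if a * b * c == target:
                solutions.append((a, b, c))
    return solutions


for a, b, c in exact_pence_solutions():
    print(f"£{a / 100:.2f}, £{b / 100:.2f}, £{c / 100:.2f}")
```

Run as written, the search does turn up an exact split: 98, 240 and 250 pence, since 0.98 + 2.40 + 2.50 = 5.88 and 0.98 × 2.40 × 2.50 = 5.88.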

PaulHoule a year ago

Microsoft Copilot says: "Interesting puzzle! Let's work through it step-by-step:"

then works through it with the kind of faulty thought process that I'd expect from a student who squeaks through an intro physics class and gets a D grade.

It gave me 2.7, 1.1, and 0.22, which neither multiply nor add to 5.88.

I told it it was wrong, and it gave me some numbers that summed right but multiplied wrong. I pointed that out and it gave

1.20, 1.50, 3.18, which sums right and multiplies to 5.724, which is a little low.
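Checking a candidate answer like these is easiest with exact decimal arithmetic rather than floats. A small sketch (the helper name is hypothetical):

```python
from decimal import Decimal


def sum_and_product(prices):
    """Return the exact sum and product of prices given as decimal strings."""
    values = [Decimal(p) for p in prices]
    total = sum(values)
    product = Decimal(1)
    for v in values:
        product *= v
    return total, product


print(sum_and_product(["2.7", "1.1", "0.22"]))    # sum 4.02, product 0.6534
print(sum_and_product(["1.20", "1.50", "3.18"]))  # sum 5.88, product 5.724
```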

  • cindylmcindy a year ago

    That's not the response when I tried it with Microsoft Copilot but then again I do prefer noise and lots of it.

smallerize a year ago

QwQ Preview 32B Q5_K_M: Tried a couple of techniques, then decided to set C to 1.00 and used the quadratic formula to get 2.71 and 2.17 for the other values.

Phi-4 (unsloth's dynamic 4-bit version) keeps trying random and repeating numbers with no particular strategy. Adding Maharshi's "contemplative prompt" did not really change anything.

smallerize a year ago

Oh wow, I just got an answer from Smallthinker 3B Preview. It talked for a long time but after 6,900 tokens, it eventually got the 1.00, 2.17, 2.71 answer. No system prompt, temperature 1.00 (which is probably too high, but I didn't notice until the test was over).

ghoul2 a year ago

R1 got it correct in one shot, I just pasted the statement.
