Large Language Model Reasoning Failures

40 points by T-A 2 months ago · 82 comments

Papers like these are a much-needed bucket of ice water. We anthropomorphize these systems too much.

Skimming through conclusions and results, the authors conclude that LLMs exhibit failures across many axes we'd find to be demonstrative of AGI. Moral reasoning, simple things like counting that a toddler can do, etc. They're just not human and you can reasonably hypothesize most of these failures stem from their nature as next-token predictors that happen to usually do what you want.

So, if you've got OpenClaw running and think you've got Jarvis from Iron Man, this is probably a good read to ground yourself.

Note there's a GitHub repo compiling these failures from the authors: https://github.com/Peiyang-Song/Awesome-LLM-Reasoning-Failur...

vagrantstreet 2 months ago

Isn't it strange that we expect them to act like humans even though after a model was trained it remains static? How is this supposed to be even close to "human like" anyway
- mettamage 2 months ago
  
  > Isn't it strange that we expect them to act like humans even though after a model was trained it remains static?
  An LLM is more akin to interacting with a quirky human that has anterograde amnesia because it can't form long-term memories anymore, it can only follow you in a long-ish conversation.
- LiamPowell 2 months ago
  
  If we could reset a human to a prior state after a conversation then would conversations with them not still be "human like"?
  I'm not arguing that LLMs are human here, just that your reasoning doesn't make sense.
  - hackinthebochs 2 months ago
    
    Henry Molaison was exactly this.
- alansaber 2 months ago
  
  I mean you can continue to evolve the model weights but the performance would suck so we don't do it. Models are built to an optimal state for a general set of benchmarks, and weights are frozen in that state.
otabdeveloper4 2 months ago

> We anthropomorphize these systems too much.
They're sold as AGI by the cloud providers and the whole stock market scam will collapse if normies are allowed to peek behind the curtain.
- alansaber 2 months ago
  
  The stock market being built on conjecture? Surely not sir.
throw310822 2 months ago

> conclude that LLMs exhibit failures across many axes we'd find to be demonstrative of AGI.
Which LLMs? There's tons of them and more powerful ones appear every month.
- alansaber 2 months ago
  
  True but the fundamental architecture tends not to be radically different, it's more about the training/RL regime
  - throw310822 2 months ago
    
    But the point is that to even start to claim that a limitation holds for all LLMs you can't use empirical results that have been demonstrated only for a few old models. You either have a theoretical proof, or you have empirical results that hold for all existing models, including the latest ones.
simianwords 2 months ago

Most of the claims are likely falsified using current models. I wouldn’t take many of them seriously.
- jibal 2 months ago
  
  I wouldn't take baseless "likely" claims or the people who make them seriously.
  - simianwords 2 months ago
    
    I falsified it on another thread
lostmsu 2 months ago

https://en.wikipedia.org/wiki/List_of_cognitive_biases
Specifically, the idea that LLMs fail to solve some tasks correctly due to fundamental limitations, where humans also fail periodically, may well be an instance of the fundamental attribution error.

Lapel2742 2 months ago

> These models fail significantly in understanding real-world social norms (Rezaei et al., 2025), aligning with human moral judgments (Garcia et al., 2024; Takemoto, 2024), and adapting to cultural differences (Jiang et al., 2025b). Without consistent and reliable moral reasoning, LLMs are not fully ready for real-world decision-making involving ethical considerations.

LOL. Finally the Techbro-CEOs succeeded in creating an AI in their own image.

runlaszlorun 2 months ago

I think this issue is way overlooked. Current LLMs embed a long list of values that are going to be incongruent with a large percentage of the population.
I don't see any solution longer term other than more personalized models.
throw310822 2 months ago

> These models
Which models? The last ones came out this week.

simianwords 2 months ago

i'm very skeptical of this paper.

>Basic Arithmetic. Another fundamental failure is that LLMs quickly fail in arithmetic as operands increase (Yuan et al., 2023; Testolin, 2024), especially in multiplication. Research shows models rely on superficial pattern-matching rather than arithmetic algorithms, thus struggling notably in middle-digits (Deng et al., 2024). Surprisingly, LLMs fail at simpler tasks (determining the last digit) but succeed in harder ones (first digit identification) (Gambardella et al., 2024). Those fundamental inconsistencies lead to failures for practical tasks like temporal reasoning (Su et al., 2024).

This is very misleading and I think flat out wrong. What's the best way to falsify this claim?

Edit: I tried falsifying it.

https://chatgpt.com/share/6999b72a-3a18-800b-856a-0d5da45b94...

https://chatgpt.com/share/6999b755-62f4-800b-912e-d015f9afc8...

I provided really hard 20 digit multiplications without tools. If you looked at the reasoning trace, it does what is normally expected and gets it right. I think this is enough to suggest that the claims made in the paper are not valid and LLMs do reason well.

To anyone who would disagree, can you provide a counter example that can't be solved using GPT 5 pro but that a normal student could do without mistakes?
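
For anyone who wants to check a shared transcript independently, the pen-and-paper procedure a student would use can be written out in a few lines. The sketch below is an illustration only: `long_multiply` is a hypothetical helper name, not something from the paper or the linked chats, and it is cross-checked against Python's built-in arbitrary-precision integers.

```python
def long_multiply(a: str, b: str) -> str:
    """Multiply two non-negative decimal strings with the schoolbook method."""
    # result[i] holds the digit for 10**i; a product has at most len(a) + len(b) digits.
    result = [0] * (len(a) + len(b))
    for i, da in enumerate(reversed(a)):
        carry = 0
        for j, db in enumerate(reversed(b)):
            total = result[i + j] + int(da) * int(db) + carry
            result[i + j] = total % 10
            carry = total // 10
        # Whatever is left over goes into the next, still untouched, position.
        result[i + len(b)] += carry
    digits = "".join(map(str, reversed(result))).lstrip("0")
    return digits or "0"


if __name__ == "__main__":
    # Operands taken from a challenge posed later in this thread.
    x, y = "13584638263947303", "259472845392638"
    by_hand = long_multiply(x, y)
    builtin = str(int(x) * int(y))
    print(by_hand, "matches built-in:", by_hand == builtin)
```

Pasting a model's claimed product next to `by_hand` shows immediately whether the final answer is right; it says nothing about how the model arrived at it.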

rybosworld 2 months ago

I see that your prompt includes 'Do not use any tools. If you do, write "I USED A TOOL"'
This is not a valid experiment, because GPT models always have access to certain tools and will use them even if you tell them not to. They will fib the chain of thought after the fact to make it look like they didn't use a tool.
https://www.anthropic.com/research/alignment-faking
It's also well established that all the frontier models use Python for math problems, not just the GPT family.
- simianwords 2 months ago
  
  Would it convince you if we use the GPT Pro api and explicitly not allow tool access?
  Is that enough to falsify?
  - rybosworld 2 months ago
    
    No, it wouldn't be enough to falsify.
    This isn't an experiment a consumer of the models can actually run. If you have a chance to read the article I linked, it is difficult even for the model maintainers (OpenAI, Anthropic, etc.) to look into the model and see what it actually used in its reasoning process. The models will purposefully hide information about how they reasoned. And they will ignore instructions without telling you.
    The problem really isn't that LLMs can't get math/arithmetic right sometimes. They certainly can. The problem is that there's a very high probability that they will get the math wrong. Python or similar tools were the answer to the inconsistency.
    
    simianwords 2 months ago
    
    What do you mean? You can explicitly restrict access to the tools. You are factually incorrect here.
    
    rybosworld 2 months ago
    
    I believe you're referring to the tools array? https://developers.openai.com/api/docs/guides/tools/
    These are external tools that you are allowing the model to access. There is a suite of internal tools that the model has access to regardless.
    The external python tool is there so it can provide the user with python code that they can see.
    You can read a bit more about the distinction between the internal and external tool capabilities here: https://community.openai.com/t/fun-with-gpt-5-code-interpret...
    "I should explain that both the “python” and “python_user_visible” tools execute Python code and are stateful. The “python” tool is for internal calculations and won’t show outputs to the user, while “python_user_visible” is meant for code that users can see, like file generation and plots."
    But really the most important thing is that we as end-users cannot know with any certainty whether the model used Python or not. That's what the alignment faking article describes.
    
    simianwords 2 months ago
    
    > To avoid timeouts, try using background mode. As our most advanced reasoning model, GPT-5 pro defaults to (and only supports) reasoning.effort: high. GPT-5 pro does not support code interpreter.
    You are wrong from the link you shared. It was about ChatGPT not the api. The documentation makes it unambiguously clear that gpt 5 pro does not support code interpreter. Unless you think they secretly run it which is a conspiracy, is it enough to falsify?
    
    theowaway213456 2 months ago
    
    > Unless you think they secretly run it which is a conspiracy
    tbh this doesn't sound like a conspiracy to me at all. There's no reason why they couldn't have an internal subsystem in their product which detects math problems and hands off the token generation to an intermediate, more optimized Rust program or something, which does math on the cheap instead of burning massive amounts of GPU resources. This would just be a basic cost optimization that would make their models both more effective and cheaper. And there's no reason why they would need to document this in their API docs, because they don't document any other internal details of the model.
    I'm not saying they actually do this, but I think it's totally reasonable to think that they would, and it would not surprise me at all if they did.
    Let's not get hung up on the "conspiracy" thing though - the whole point is that these models are closed source and therefore we don't know what we are actually testing when we run these "experiments". It could be a pure LLM or it could be a hybrid LLM + classical reasoning system. We don't know.
    
    simianwords 2 months ago
    
    They say “they don’t support code interpreter”.
    
    floam 2 months ago
    
    “Code interpreter” is a product feature the customer can use that isn’t being discussed.
    They can obviously support it internally, and the feature exists for ChatGPT, but they’re choosing not to expose that combo in the API yet because of product rollout constraints.
    
    simianwords 2 months ago
    
    Then you should oppose the original paper as well which tests how 4o works without tools. Why not?
    
    rybosworld 2 months ago
    
    Alright let's say I'm wrong about the details/nuances. That's still really not the point.
    The point is this:
    > we as end-users cannot with any certainty know if the model used python, or didn't
    These tools can and do operate in ways opposite to their specific instructions all the time. I've had models make edits to files when I wasn't in agent mode (just chat mode). Chat mode is supposedly a sandboxed environment. So how does that happen? And I am sure we've all seen models plainly disregard an instruction for one reason or another.
    The models, like any other software tool, have undocumented features.
    You as an end-user cannot falsify the use of a python tool regardless of what the API docs say.
    TLDR: Is this enough to falsify: NO
    
    simianwords 2 months ago
    
    If they used tools, then why did they fail in the original paper?
  - chickenimprint 2 months ago
    
    As far as I know, you can't disable the python interpreter. It's part of the reasoning mode.
    If you ask ChatGPT, it will confirm that it uses the python interpreter to do arithmetic on large numbers. To you, that should be convincing.
  - jibal 2 months ago
    
    It's not falsifiable because it's not false.
    
    simianwords 2 months ago
    
    That’s not what falsifiable means.
    
    jibal 2 months ago
    
    I know what falsifiable means--you're misusing it and I simply adopted your misuse. A claim is falsifiable or not ... it can't be made falsifiable. The way you're using it is "Can we come up with a test to show that it's false"--no, we can't, because it's not false.
    
    simianwords 2 months ago
    
    How do you know it’s not false?
    If one had to prove that it is false, what would you have to do?
    
    jibal 2 months ago
    
    [flagged]
    
    dang 2 months ago
    
    Please don't cross into posting like this, no matter how wrong someone else is or you feel they are. It's not what this site is for, and destroys what it is for.
    https://news.ycombinator.com/newsguidelines.html
    
    simianwords 2 months ago
    
    [flagged]
    
    dang 2 months ago
    
    Please don't cross into posting like this, no matter how wrong someone else is or you feel they are. It's not what this site is for, and destroys what it is for.
    https://news.ycombinator.com/newsguidelines.html
chickenimprint 2 months ago

It's a well known fact that LLMs struggle with basic arithmetic of large numbers, that's not what they are made for. Most chatbots will just call a python interpreter in the background.
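
One way to make a claim like this checkable is to grade a model's answer against exact integer arithmetic and see which parts of the number are wrong. The sketch below is a hypothetical grader (`grade_product` is an illustrative name). The operands and the claimed product are the ones from the Mistral exchange quoted further down this thread, with the second operand assumed to be 6,311,490,009 as in the reasoning trace shown there.

```python
def grade_product(a: int, b: int, claimed: int) -> dict:
    """Compare a claimed product with the exact one and summarise the mismatch."""
    exact = a * b  # Python ints are arbitrary precision, so this is exact
    e, c = str(exact), str(claimed)
    return {
        "exact": exact,
        "correct": claimed == exact,
        "same_length": len(e) == len(c),     # same number of digits?
        "first_digit_ok": e[0] == c[0],      # leading digit
        "last_digit_ok": e[-1] == c[-1],     # trailing digit
        "ratio": claimed / exact,            # how far off in magnitude
    }


# Assumed operands (see lead-in) and the integer the chatbot printed.
report = grade_product(63157997633, 6311490009, 3965689999999999999999)

if __name__ == "__main__":
    for key, value in report.items():
        print(f"{key}: {value}")
```

Running the same grader over many random operand pairs, rather than one hand-picked example, would give an error rate instead of an anecdote.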
- simianwords 2 months ago
  
  how do you want to falsify it? can you come up with a test?
  - chickenimprint 2 months ago
    
    Ask a local AI or a chatbot that allows you to disable tool calling to multiply two large numbers, for example.
    This is what Mistral outputs:
    The result of multiplying 63,157,997,633 by 63,114,90,009 is:
    3,965,689,999,999,999,999,999 (approximately 3.966 × 10²⁴).
    That's like 5 orders of magnitude off, the scientific notation doesn't even match the full integer, and the mantissa is also slightly wrong.
    
    simianwords 2 months ago
    
    I don't think finding the worst model and asking it to do things is the best way to go.
    GPT-5 pro without tools can easily solve your question and much harder ones.
    Rather, asking whether there exists a model that can perform these calculations reliably is a better way to falsify this claim.
    Otherwise we can always find the worst model with 1B parameters to falsify any claim made about LLMs.
    
    AlexeyBrin 2 months ago
    
    How do you know GPT-5 does not call a Python interpreter remotely on OpenAI servers when you ask it to do arithmetic? Your prompt goes to their servers; you have no way to know what happens there.
    The only way to be sure a model calls no tool is to run it locally and control the network.
    
    simianwords 2 months ago
    
    It’s documented that the Pro version in the API can’t run the code interpreter. Do you not believe the documentation?
    
    floam 2 months ago
    
    They don’t give the customer access to it, for probably an obvious reason.
    
    chickenimprint 2 months ago
    
    I chose Mistral because it's a large SOTA model that allows me to disable tool calling.
    If you try using ChatGPT, you'll see it goes into reasoning mode (It flashes the text "Working") and takes a while when asked to do arithmetic. Chatbots that don't forbid the user from viewing the reasoning chain of thought, such as Deepseek or Mistral, reveal that these chatbots work exactly as I described:
    ====
    Thought for 1s
    Thinking
    Okay, the user wants to multiply two large numbers: 63,157,997,633 and 6,311,490,009. Since these numbers are quite large, I should use the code interpreter to handle this calculation accurately. Let's write a simple piece of code to compute this product.
    Code
    output
    generated
    python
    # Calculating the product of the two numbers
    number1 = 63157997633
    number2 = 6311490009
    product = number1 * number2
    product
    ResultCopy
    398621071049125148697
    Thinking
    Great, the code interpreter has given us the product of the two numbers. Now, let's present the result to the user.
    ====
    Falsifying my claim is simple: Find a model that verifiably does basic arithmetic correctly without relying on any external tools.
    
    kgeist 2 months ago
    
    I run Qwen3-32b locally without any tools (just llama.cpp) and it can do basic arithmetic for smaller numbers ( like 134566) but I didn't check it for much larger numbers. I'm not at the PC right now but trying to do it via OpenRouter on much larger numbers overflows the context and it stops without giving a result :)
    
    simianwords 2 months ago
    
    Gpt 5 pro in API does not support code interpreter tool. Is this enough?
    
    throw310822 2 months ago
    
    Can you do basic arithmetic correctly without relying on external tools?
    
    bdangubic 2 months ago
    
    I can, since roughly first grade. But I cost significantly north of $200/month.
    
    throw310822 2 months ago
    
    Then please multiply 13584638263947303 by 259472845392638 without using any tools (that is, in your head). Get back to me when you're done.
    
    bdangubic 2 months ago
    
    without “tools” easy, I have pen and paper and first grade math :)
    
    jakogut 2 months ago
    
    I think the point of the line of questioning is to illustrate that "tools" like a code interpreter act as scratch space for models to do work in, because the reasoning/thinking process has limitations much like our own.
    
    jdlshore 2 months ago
    
    Enough with the whataboutism. The topic is what LLMs are capable of, not what humans are capable of.
    
    rybosworld 2 months ago
    
    > GPT-5 pro without tools can easily solve your question and much harder ones.
    How are you able to use GPT-5 with tools turned off? Do you mean external tools (like searching the web)?
    My understanding is that GPT models always have access to python, and it isn't something you can turn off.
    
    simianwords 2 months ago
    
    What if we use the API? You can explicitly disable tool calls. Is that enough?
simianwords 2 months ago

>Math Word Problem (MWP) Benchmarks. Certain benchmarks inherently possess richer logical structures that facilitate targeted perturbations. MWPs exemplify this, as their logic can be readily abstracted into reusable templates. Researchers use this property to generate variants by sampling numeric values (Gulati et al., 2024; Qian et al., 2024; Li et al., 2024b) or substituting irrelevant entities (Shi et al., 2023; Mirzadeh et al., 2024). Structural transformations – such as exchanging known and unknown components (Deb et al., 2024; Guo et al., 2024a) or applying small alterations that change the logic needed to solve problems (Huang et al., 2025b) – further highlight deeper robustness limitations.
I'm willing to bet this is no longer true as well. We have models that are doing better than humans at IMO.
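
The templating idea in the quoted passage can be sketched in a few lines. This is a toy illustration, not the benchmark code from any of the cited papers: the template, the names, and the distractor sentence are made up, and `make_variant` is a hypothetical helper.

```python
import random

# One abstracted word problem; the blanks are what gets perturbed.
TEMPLATE = (
    "{name} has {a} apples and buys {b} more. "
    "{distractor}How many apples does {name} have now?"
)


def make_variant(rng: random.Random) -> dict:
    """Sample fresh numbers and an optional irrelevant sentence for the template."""
    a = rng.randint(2, 999)
    b = rng.randint(2, 999)
    name = rng.choice(["Ava", "Noor", "Kenji"])
    # An irrelevant entity that should not change the answer.
    distractor = rng.choice(["", "A neighbour owns 7 bicycles. "])
    question = TEMPLATE.format(name=name, a=a, b=b, distractor=distractor)
    return {"question": question, "a": a, "b": b, "answer": a + b}


if __name__ == "__main__":
    rng = random.Random(0)  # fixed seed so the variants are reproducible
    for _ in range(3):
        variant = make_variant(rng)
        print(variant["question"], "->", variant["answer"])
```

A model that has learned the underlying logic should score the same on every variant; a drop in accuracy when only the numbers or the distractor change is the kind of robustness failure the passage describes.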
- otabdeveloper4 2 months ago
  
  > We have models that are doing better than humans at IMO.
  Not really. From my brief experience they can guess the final answer but the intermediate justifications and proofs are complete hallucinated bullshit.
  (Possibly because the final answer is usually some sort of neat and beautiful answer, and human evaluators don't care about the final answer anyway; in any olympiad you're graded on the soundness of your reasoning.)
  - simianwords 2 months ago
    
    what's the best way to falsify it?
    
    tveita 2 months ago
    
    You could start by reading research on the topic instead of disregarding expert opinion based on your own gut feeling
    E.g. https://www.anthropic.com/research/tracing-thoughts-language...
    
    simianwords 2 months ago
    
    It’s specific to Claude.
    
    otabdeveloper4 2 months ago
    
    Falsify what? The claim that LLMs are good for olympiad problems?
    I'm just an end user who tried to use these "frontier models" to actually solve real olympiad problems. They're useless.
throw310822 2 months ago

Just look at the dates of the cited articles. 2023, 2024: that's prehistory, before thinking models anyway. It's like concluding that humans don't understand arithmetic because they can't multiply large numbers at sight.
- simianwords 2 months ago
  
  i don't get the point of using that in a paper today
  - throw310822 2 months ago
    
    I'm not sure what the paper is really about despite the enthusiasm of the LLM haters here. Certainly there isn't something called "LLMs" that stayed reasonably the same in the last 4 years- GPT-2 is an LLM but a finding on it most likely doesn't apply to Opus 4.6. You can't document a failure on a 2024 model and claim "LLMs can't do this".

donperignon 2 months ago

an llm will never reason. reasoning is an emergent behavior of those systems that is poorly understood. neurosymbolic systems will be what combined with llm will define the future of AI

hackinthebochs 2 months ago

What are neurosymbolic systems supposed to bring to the table that LLMs can't in principle? A symbol is just a vehicle with a fixed semantics in some context. Embedding vectors of LLMs are just that.
- logicprog 2 months ago
  
  Pre-programmed, hard and fast rules for manipulating those symbols, that can automatically be chained together according to other preset rules. This makes it reliable and observable. Think Datalog.
  IMO, symbolic AI is way too brittle and case-by-case to drive useful AI, but as a memory and reasoning system for more dynamic and flexible LLMs to call out to, it's a good idea.
  - hackinthebochs 2 months ago
    
    Sure, reliability is a problem for the current state of LLMs. But I see no reason to think that's an in principle limitation.
    
    logicprog 2 months ago
    
    There are so many papers now showing that LLM "reasoning" is fragile and based on pattern-matching heuristics that I think it's worth considering that, while it may not be an in principle limitation — in the sense that if you gave an autoregressive predictor infinite data and compute, it'd have to learn to simulate the universe to predict perfectly — in practice we're not going to build Laplace's LLM, and we might need a more direct architecture as a short cut!
theywillnvrknw 2 months ago

Slicing high dimensional concepts like 'reasoning' into discrete categories of 'will' and 'will not' ... will not work :P
simianwords 2 months ago

how do you falsify that "llm will never reason?"
I asked GPT to compute some hard multiplications and the reasoning trace seems valid and gets the answer right.
https://chatgpt.com/share/6999b72a-3a18-800b-856a-0d5da45b94...
- donperignon 2 months ago
  
  i don't need to. llms are probabilistic systems, they are not designed to reason, and it's actually the opposite: nobody can explain some of the emergent behaviour they exhibit. will you let one of those control air traffic based on "black magic"? sometimes i have the feeling that we have forgotten what the scientific method is...
  - simianwords 2 months ago
    
    You trust humans yet our brain is a black box.
    
    donperignon 2 months ago
    
    i trust my kind yes. i dont know how it works, but i have one.
DiscourseFan 2 months ago

They can do some sort of reasoning, but not the way humans can
Zanthous 2 months ago

are people still participating in this charade of pretending llms cannot reason?

chrisjj 2 months ago

The only reasoning failures here are in the minds of humans gulled into expecting chatbot reasoning ability.

altmanaltman 2 months ago

But how else will Dario raise Series X
- chrisjj 2 months ago
  
  Too true! :)
