No, LLMs Still Cannot Reason - Part II

Alejandro Piad Morffis

Oct 1, 2024

The follow-up you didn't know you needed—and neither did I...

54 Comments

It seems to me that the LLMs are something similar to the part of our brain that recognize patterns without reasoning, and them we have another component that start iterating over those outputs, do the reasoning and try to find and verify if the coming up solutions are correct. What do you think about that idea?

Expand full comment

Alejandro Piad Morffis

Yep, it's an analogy many have made, system 1/system 2 kind of deal. Makes sense, imo, the only difference is that it seems in humans system 2 (the slow reasoner) is in charge and can override system 1 when necessary but if you use an LLM to call external tools you'll get a system 1 that can ignore system 2. So we would need to flip the chain of command somehow. And the other difference is that, in humans, language is definitely in system 2.

Expand full comment

LLMs are proving beyond a shadow of a doubt that pure symbol manipulation is a necessary but insufficient mechanism for intelligence/reasoning. For now, Searle's Chinese Room is a reality!

Expand full comment

Alejandro Piad Morffis

The man was way ahead of his time.

Expand full comment

Abhyudaya Ranglani

Fascinating.

I continue to take away super cool nuggets from your writing that I would otherwise never chance upon.

Expand full comment

Alejandro Piad Morffis

🙏

Expand full comment

Damn, sir. That's a robust double-takedown.

But also...I see your "humans can't fly, so there's that" comment, and I raise you: https://youtu.be/nlD9JYP8u5E?si=dmpzV5npnmgaI4E9

Expand full comment

Alejandro Piad Morffis

Haha I hadn't seen that one.

Expand full comment

First: great article and I learned a lot.

I agree fully that LLMs can’t reason and won’t reason, just due to fundamentals as you have described. So the argument overall is quite compelling. But I think it matters much less than we think. Because deductive reasoning is far less important to real-world computer solutions than it is to human solutions.

Because logic and deduction is central to our conscious minds, we lionize logic and deduction. But AI computing can solve a ton of real-world problems without formal logic and without deduction. I am speaking not just of LLMs but also more specific AI (like the perception and planning engines that run autonomous robots and cars, which is more my personal background). In perceiving complex surroundings and making real-time decisions about how to drive a car for example, AI has passed us already (or arguably is passing us, roughly right now). It’s not deduction and it’s not rules-based. But it surpasses our capability. Similar in chess…. We use logic and deduction, AI does what it does, and AI wins. And I think a lot of stuff will be like that in the future.

Expand full comment

Alejandro Piad Morffis

Great point! I may a couple of not counter arguments per se but maybe general comments on why I think formal reasoning might indeed become more important as we start to trust more and more on these systems. I'd have to organize my ideas better, maybe in a future article, but the key point is that trustworthiness works on epistemic grounds. I cannot trust what I cannot understand. I do agree for a whole lot of important applications this might not matter too much, but for some key areas like medical or legal advice, it could be a showstopper. Anyway, thanks for the food for thought!

Expand full comment

A propitious read as I attempt to wrangle a couple LLMs into compliance.

Expand full comment

Alejandro Piad Morffis

You have my blessings...

Expand full comment

Oct 17Edited

haha Thanks, *queue a Sabaton song of your choice*

Expand full comment

I came across limitations in reasoning early on with ChatGPT. I struggled to get the model to recognize that Michael Burnham, Spock, and Sybok (from the horrible Star Trek V, which still had some great lines like "What does God need with a starship?" and "I don't want my pain taken away, I need my pain!”) are all half siblings. Even after issuing prompts (logical premises) that essentially tried to guide it to the correct answer, the model remained "recalcitrant". It took over 10 well crafted (IMO) prompts to get to a satisfactory answer. A Trek fan would have arrived at the correct answer after the first prompt. All but the most logically challenged people would have gotten the correct answer after a maximum of 3 prompts.

I've seen more reasoning problems like this since LLMs became publicly available. Ironically, newer models have sometimes been doing better on reasoning chains with more nuanced or underspecified inputs than those with more straightforward specifications like the Trek example. I suspect this might be due to some "overengineering" in a rush to show that GAI can do dependable, consistent reasoning...

Expand full comment

Alejandro Piad Morffis

Yeah this is a very common experience I would say, for any of us who's played long enough and hard enough with any of the largest models. They're impressive for sure, but in a weird, unreliable way.

Expand full comment

I will believe that LLMs have the potential to reason when I see one solve the most elementary of cryptograms:

https://earlboebert.substack.com/p/simple-cryptograms-are-still-safe?r=2adh4p

One of the problems they have (akin to the "strawberry" problem) is that they have difficulty counting, and therefore cannot comprehend that there is a one to one correspondence between the number of letters in a ciphertext word and the number of letters in the corresponding plaintext word.

More teraflops isn't going solve this.

Expand full comment

Alejandro Piad Morffis

Exactly. There are fundamental limitations in the paradigm itself.

Expand full comment

Good writeup of o1's ability to decrypt substitution ciphers: https://arxiv.org/pdf/2410.01792

Expand full comment

Oct 6Edited

Thanks for the reference. A shift cipher is considerably simpler than a substitution cipher to solve because of the fixed relationship between plaintext and ciphertext; if I guess that ciphertext DOO is plaintext ALL, I have the shift value that is valid for the whole message. So I don't think the ability to solve them is an indication of much of anything, unfortunately.

Expand full comment

"Thus, when I say LLMs cannot reason, I’m simply saying there are—sometimes pretty simple—deduction problems they inherently cannot solve." Can you go into some of the pretty simple deduction problems they cannot inherently solve?

Expand full comment

Alejandro Piad Morffis

A rather obnoxious example but if you take a sufficiently complex boolean formula and ask it whether that formula is satisfiable, there is a formula size after which it has to fail for computability limitations. Of course these are not the kind of problems we care in practice, for that, just google around "openai o1 reasoning failure" and you'll find lots of practical examples where the best reasoning-capable LLM out there fails, but what the boolean satisfiability shows is these aren't just technological limitations, these are theoretical constraints of the stochastic language modelling paradigm itself.

Expand full comment

I’m quite aware of examples of o1 failing to solve a problem, but I struggle to generalize any given failure to the statement “o1 cannot reason”. Even if it cannot reason about everything, surely I can reason about some things?

Expand full comment

Alejandro Piad Morffis

Yes sure. It works like a charm for a lot of cases and then fails catastrophically for other cases, sometimes even easier problems. Reliability is the issue here. When it fails, you won't even know it failed.

Expand full comment

It's interesting, I am re-reading your claim, "Thus, when I say LLMs cannot reason, I’m simply saying there are—sometimes pretty simple—deduction problems they inherently cannot solve," you're saying something different than what most people believe you are saying. "There are some problems they cannot solve" is different from "they cannot reason."

Expand full comment

Alejandro Piad Morffis

That's a very good point, I can see how my framing can lead to confusion. Totally my fault, let me see if I can make myself clearer here.

I'm definitely not claiming LLMs will always fail to reason. That would be ludicrous, even a random algorithm would get a reasoning right sometimes.

But LLMs are better than random at reasoning, surprisingly better in fact so that requires some explanation.

On the other hand I'm also not claiming simply that LLMs make reasoning mistakes sometimes, that is true but it is also kind of trivial. We all do. And if LLMs only made totally arbitrary reasoning mistakes from time to time we could simply run the same input several times and get as good a confidence interval as we wanted.

So my claim, to be more precise, is that LLMs do something that is remarkably similar to reasoning in many cases, but not exactly a correct reasoning algorithm, so they will make reasoning mistakes, and those are unavoidable without a paradigm shift because (reasons explained in the article).

What this ultimately means is that when LLMs fail at reasoning, it's not simply a random mistake, no, they make systematic, unavoidable but also unpredictable mistakes. And this is more dangerous than just random mistakes: as long as the LLM is used in reasoning tasks that you can verify, no big deal, but the moment you want to use for novel problems that require reasoning to which we don't know the answer, we are screwed, but we can never trust it will work. It is an unreliable reasoning algorithm, and that is worse than no reasoning algorithm at all, because it gives you the false illusion that it works, and you may not be able to tell when it doesn't.

Hope this makes it clearer.

Expand full comment

Oct 6Edited

It does! Certainly the claim "LLMs cannot reason" is different from "LLMs are not a correct reasoning algorithm."

Let's imagine for a second though that the reasoning problem we're trying to solve is "guess a user's password." The solution space is near infinite. If we could use an algorithm to reduce that solution space from near inifite to 10,000 possible solutions, then brute force check 10,000, that is good enough. If you don't care that the other 9,999 guesses are wrong, who cares? You found the right guess. It doesn't matter if the algroithm you used is a "corecct reasoning algorithm." You really just needed to have correct 1 in 10,000 tries (still a massive improvement from brute forcing a near infinite number of passwords). (Also note, I'm not suggesting LLMs are good for this usecase, it's just an very understandable usecase.)

This is the power of o1, the ability ro reduce the search space from near infinite to 10,000 or so, but not just for one specific problem, but a general set of problems. That's incredible and a huge breakthrough.

One more thing. You called LLMs "toys" which impliles that a) they're trivial playthings and b) the people who desire to use them are children. Neither is a chartible description to many in the field (even if it is perhaps true for some).

Expand full comment

Jurgen Gravestein

Approximate reasoning will scale just as successfully as approximate retrieval — it won’t. Because larger and more instructable language models become less reliable, not more (https://www.nature.com/articles/s41586-024-07930-y).

To echo the words of Francois Chollet: “We need new ideas”.

Expand full comment

Alejandro Piad Morffis

Fantastic comment

Expand full comment

Chris Birmingham

I find this as a popular debate to be rather strange, but interesting enough I thought I would add my two cents.

While I appreciate the nuances you explore in the article and follow up, your titles indicate a stronger stance on the reasoning debate than I think is fair or warranted, even when the proper caveats are included with the definitions you give in the second article.

In your first article you formulate the claim as follows:

"While large language models may exhibit some reasoning capabilities, their fundamentally stochastic nature and fixed computational architecture hinder their ability to engage in open-ended, arbitrary-length deductions."

In your follow up you narrow the scope significantly:

"Thus, when I say LLMs cannot reason, I’m simply saying there are—sometimes pretty simple—deduction problems they inherently cannot solve." or even further: "LLMs are incapable, by design, of perfect reasoning."

I think this is both significantly more cogent and importantly, falsifiable. I hypothesize if you created a formal logic benchmark and LLM makers had sufficient motivations to train models to do well on it you would find that the scores would steadily increase and would become better 'reasoners' by your measuring stick. I would also hypothesize that any program that a human can come up with to do "perfect reasoning" is also something that an LLM, with sufficient prompting could write.

To start with your Argument 1: Reasoning, even narrowly prescribed logical deduction is a skill that humans are capable of developing but will also never perfect. As you describe, it is np-complete class difficulty. Humans will run the full spectrum of capability on this skill, from utterly incapable to virtuoso level. Right now LLMs may perform at the high school level, but there is nothing that structurally stands in the way of them also reaching the level of a human virtuoso. However if your standard is that "approximate reason is not enough." And that "If the LLM fails one out of every million times to produce the right deduction then that means the LLM cannot reason" then I am not sure what we are even talking about at this point. Certainly no human has ever reached such perfection, yet we still build society around human reason, from the president to the doorman. Your only evidence that humans can reason is "2000 years of solid math" yet 1) it is unclear how much trust we can put in that history or in the math it produced (just how different are we from your 'monkeys who are also the editors'?) and 2) there is no reason that given the right environment/training a LLM would not be able to recapitulate that math.

But you would rather compare the LLM to a SAT solver, which is not inherently wrong. They are both tools, and in that sense the SAT solver is a far better reasoner, if your definition of reasoner is simply the manipulation of symbols to produce patterns that are verifiable by algorithms that validate the internal consistency of the logical formalism. Again I agree that by this definition SAT solvers blow LLMs out of the water. But I would hazard a guess that 98% of people who read your article were not looking to compare LLMs to SAT solvers, they wanted to compare LLMs to human reasoners, and here I believe the distinction is much less clear than you posit.

In Argument 3 you push back against the idea that LLMs can be Turing complete. But to me you have missed the argument. I believe that much like us, a sufficiently competent LLM is capable of writing and calling a SAT solver. Yes, much like us they might not do so perfectly, but fundamentally if an LLM is capable of writing and running a program (not necessarily within its own internal structure) it is essentially computationally unbound and Turing complete.

In summary - I agree with you that LLMs are currently not fit for solving all reasoning challenges we face, and at the current level of skill we should be very wary of using them for anything important. But ultimately I am not sure if you are too pessimistic about the possibility of LLM agents reasoning successfully through worthwhile challenges or too optimistic about how well humans are currently performing at such challenges. In either case I think we are much closer than you believe to LLM agents that can reason about as well as a human - and which we can trust about as much as a human.

But thanks for your informative and thought provoking articles. If I could offer any feedback for your consideration it would be to try and include some citations for your claims and offer examples to illustrate your points (though I acknowledge I did not follow my own advice here.)

Expand full comment

Alejandro Piad Morffis

And regarding Turing completeness by calling a tool, yes, that solves it in principle. That's what my team is working on. Sadly the reliability issue comes back to bite us. The LLM will fail to call the right tool or will fail to interpret the output the right way in some small but important number of cases and there it can go waaaay off the wrong reasoning path. And unless you know the right answer, you have no way to notice it. There is no exception thrown, no sign of anything fishy. And another LLM trying to evaluate it? Yes we've tried that too, it improves a bit when you use weaker models, because ensembles of weak models are better models than either component, but when you use gpt4 or Claude, no, they themselves cannot catch them lying. And why would you trust a weaker models telling you that gpt4 is lying?

Expand full comment

Alejandro Piad Morffis

Thanks for all these comments, these are very good points and I don't necessarily disagree with any of it. I see that I've gone a bit farther in this debate than maybe was necessary and perhaps this shows a stubbornness on my side that is not really there. Yes I think incremental changes can make LLMs as good as the average human in reasoning and that's probably good enough for a lot of cool applications like a personal planner, and maybe we can even get as far as a math tutor for college students. I don't know if the current paradigm can get as far as AI that can prove novel and interesting theorems or propose novel and interesting physical theories reliably, though. I hope we do but I think we need a change of paradigm (or an update of paradigm, no need to ditch stochastic language modelling, just make it better).

Expand full comment

Alejandro Piad Morffis

So I guess my main concern in practical terms (beyond the super exciting but rather abstract discussion of whether these things can be truly intelligent if we don't solve this) is reliability. For most of the problems we solve with computers, we have some form of reliability guarantee. If the problem is in P we have exact and fast answers, mathematically proven. If not, we at least have exact and slow, or approximate answers. But approximate meaning we either have a performance bound (this solution is no worse than twice the optimal) or a probabilistic one (this solution is the right one 90% of the times) and with that probabilistic bound we often have the option to spend more compute and crank that confidence interval as high and narrow as we want. All this just means we know exactly what to expect from our algorithms. With LLMs we have nothing of the sort. Say a benchmark tells us it makes the right deduction 90% of the time. There is, as far as I know, no way to guarantee we can rerun the LLM 100 times and get independent solutions that push our confidence to 99.9%, because reruns are not really independent and basically the probability it gets something right is a super complex combination of weights and input prompt. And that's not taking into the account the benchmark is just an approximation of the real world problem set so we cannot really estimate how good the LLM will be in unseen problems, further obscured by the fact these things often have test leak because it is virtually impossible to filter out the test data from the internet crawl.

Anyway, this is a super interesting discussion and I thank you for your engagement. You've given me food for thought. I'm more than happy to chat about it anytime.

Expand full comment

Alejandro Lopez-Lira

So for loop + LLM + verifier can reason?

Expand full comment

Alejandro Piad Morffis

Yes but only is you can ensure the LLM calls the verifier with the right arguments and interprets the results correctly, so we're back to the unreliability of probabillistic language generation.

Expand full comment

Alejandro Lopez-Lira

Can't you check for those? What's a good resource for learning about verifiers?

Expand full comment

Alejandro Piad Morffis

A human can check but that beats the purpose of course. Other than that, you need an LLM to check, and we're back to square one with reliability.

Expand full comment

https://arxiv.org/pdf/2110.14168

Expand full comment

Alejandro Piad Morffis

Super cool! You'll notice even at the end with all the tricks, they still can't get optimal performance.

Expand full comment

This was 2021 and GPT-3.

Expand full comment

Alejandro Piad Morffis

Same paradigm and architecture, just more parameters and compute. I'm a bit tired of making the same arguments over and over. No amount of incremental innovation will solve this. We need a paradigm shift to incorporate true reasoning in language models. As it stands, the paradigm is incapable of modelling the math necessary to do precise reasoning.

Expand full comment

I really appreciate you calling out AI in general in terms of complexity classes—so much hype about them seems to ignore that decades of theoretical CS has shown a lot of this stuff to scale exponentially, something obscured by how opaque LLM operation is at its core. That said to be a little bit fair to Google and OpenAI, the recent systems that come closest to reasoning (Deepmind’s math systems and OpenAI o1) clearly are doing some form of SAT solver method behind the scenes (likely implemented as a form of Monte Carlo tree search). To the extent we can really trust anything technical out of OpenAI (probably not a lot), on their website it shows o1’s accuracy on some unspecified benchmark increasingly linearly with an exponential increase in compute, the characteristic of NP solvers for reasoning. I agree it’s still not trustworthy the way a real theorem prover is (though I think deepminds system calls one as a subroutine), and still a statistical approximation to reasoning with exponential runtime scaling, but nonetheless I think it’s stronger than this argument implies.

That said, unless they manage to break the exponential scaling barrier (not holding my breath) anything like artificial super intelligence is going to likely take more power than we can really imagine…

Expand full comment

Alejandro Piad Morffis

Yes you're absolutely right! I've argued before NP-hardness is probably out best defense against evil super intelligence.

Regarding o1 and DeepMind's model, yes these are the closest things we have to open ended reasoners and still have gaping holes in their performance but definitely that's the path forward, integration with traditional AI systems in a way that is reliable.

Expand full comment

Very similar argument as seen here: https://youtu.be/59PTmetkPCY?si=5LIs3NE71AZ_8lnz

Expand full comment

Alejandro Piad Morffis

Thanks, I haven't seen that one, will check it out right away.

Expand full comment

Actually, sometimes I see the LLMs, rather than something that reason, or something that has emergent probabilities, I see it as one of the greatest candidates we have for Kolmogorov complexity of our language. Like the language is being described by the the LLM code. The larger the code, the more imformation of all possible string combinations that for human make sense the LLM can encode, and therefore in that encoding you need to put together things that are similar or at least represent them similarly and is able to show capabilities that look as reasoning of math, in other areas because in its encoding it actually figure out that we humans have put those two pieces together in our language and can be compressed in a smaller representation. Of course, for this I have 0 proof, but is something that I believe could be a theory to explore to justify many things that happen with LLMs.

Expand full comment

Alejandro Piad Morffis

I do think compression and learning are kind of equivalent, in the sense that perfect compression would require finding the minimum length explanation for a set of examples, so yeah, I do think there is something cool going on there, though we still haven't saturated the largest models in terms of compute/token.

Expand full comment

Hi Ale, another great article, is sad that the previous one generated all that toxicity on Twitter, but you know how it is, unfortunately is a topic that is being strongly argued currently, some with real research objective questions, and others probably with other agendas.

I believe that the reason why even when we can prove so elegantly as you did in your previous article that LLMs cannot reaso (in the cases of people with real objective and interesting questions) is because of the emergent abilities.

Is incredible how a simple task as predict the next word, can encode in the weighs of the LLM so much information. And because, we humans write on the Internet so many incredible things and provide incredible reasoning and mathematical proves, the LLM actually ended expresing the similarities of reasonings in similar problems. And even more surprising when you use in context learning and tends to do better. Hopefully, the research community will find a great answer to this soon enough

You continue publishing this greats articles which, I am pretty sure have resulted as interesting and educational to so many people like it has been to me!!

Thank you!

Expand full comment

Alejandro Piad Morffis

Thanks man!

Expand full comment

Amplifier Worshiper

Question for a lay person: what does this inability to perfectly reason mean for using AI? This is both for using them and how likely they will truly automate away certain functions.

Expand full comment

I believe the point made is not that: "because LLMs cannot reason, then are not useful as AI", they are extremely useful and surprisingly good at things that we didn't even thought during the design phase they could do. However, this is very relevant when you actually want an application with perfect reasoning, for instance imagine that your doctor is an LLM, and that it fails an easy or complex reasoning about your test results and diagnose something wrong, that affects you terribly.

Of course a human could also make that mistake, but that doesn't mean is ok, for the same reasons we judge strongly humans that made wrong reasonings in important situations, we need to judge technology like LLMs with the same strength, which on top of everything, by default they cannot reason perfectly and what they look like reasoning, it is not what we humans call reasoning.

Finally, I believe that more tailored to the field of AI, an AI that cannot reason as humans do, it cannot be called an AGI which at least for now is supposed to be one of the main goals of every researcher involved in AI. Therefore since LLMs cannot reason, then right now they are not the answer to AGI.

Expand full comment

Alejandro Piad Morffis

I couldn't say it better. Just adding that as long as you want a user-facing bot to help with rather mundane planning tasks, that is probably ok, you won't be too bothered by the occasional logical slip any more you would when doing the same planing yourself. It is when these systems start to get integrated into life-or-death decision making processes that we must be really careful.

Expand full comment

Amplifier Worshiper

Thanks Gents.

Expand full comment

#nojs-banner { position: fixed; bottom: 0; left: 0; padding: 16px 16px 16px 32px; width: 100%; box-sizing: border-box; background: red; color: white; font-family: -apple-system, "Segoe UI", Roboto, Helvetica, Arial, sans-serif, "Apple Color Emoji", "Segoe UI Emoji", "Segoe UI Symbol"; font-size: 13px; line-height: 13px; } #nojs-banner a { color: inherit; text-decoration: underline; } This site requires JavaScript to run correctly. Please turn on JavaScript or unblock scripts