This article is part of my upcoming book How to Train your Chatbot, an ebook full of intuitive explanations and practical advice for using LLMs for all sorts of cool stuff. You can get it in early access at a ridiculously low price.
Large language models often seem to be able to reason effectively. They can generate coherent and contextually relevant responses that resemble human reasoning. However, this apparent capability can be misleading.
LLMs frequently make mistakes when faced with complex problems requiring extensive reasoning chains. Their responses may seem logical at first, but they often lack the depth and accuracy required for sound reasoning. This is particularly evident in tasks that involve multiple steps or intricate logical deductions, where the model may falter and produce incorrect conclusions.
This article explores the fundamental limitations of large language models (LLMs) in reasoning — highlighting the difference between their advanced outputs and their evident inability to perform logical deductions. By examining the stochastic nature of these models, their computational constraints, and their lack of complete computational capabilities, we will uncover the reasons behind their failures in complex reasoning tasks.
Additionally, we will discuss current strategies to enhance LLMs' reasoning capabilities, including chain of thought prompting and self-critique mechanisms, while critically assessing their effectiveness and underlying challenges. This article aims to foster a deeper understanding of what LLMs can and cannot achieve, urging caution in interpreting their seemingly intelligent responses.
Why LLMs can’t reason
One significant limitation of language models regarding reasoning is their stochastic nature. These models generate outputs based on probabilistic predictions rather than deterministic logical rules. This means that even a well-structured prompt can yield different responses on different occasions due to the randomness in their decision-making process.
Consequently, an LLM might arrive at a wrong conclusion purely by chance, leading to inconsistencies in reasoning. For example, when asked to solve a mathematical problem or make a logical inference, the model's response may vary significantly depending on the random seed used during generation, undermining trust in its reasoning capabilities.
Granted, you may set the temperature to zero, effectively fixing the output for a given input. But that output is still probabilistic; you are just sampling the most likely continuation. The fact that the mapping between input and output hinges on a probability distribution encoding correlations between elements of the input and corresponding elements of the output is already suspicious. It would be very strange, although not impossible, if we just happened to converge on exactly the probability distribution that produces the logically correct output for every input.
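To make this concrete, here is a toy sketch of next-token sampling (the vocabulary and logit values are invented for illustration). At temperature zero the sampler degenerates into argmax, but the result is still just the mode of a learned probability distribution:

```python
import math
import random

def sample_token(logits, temperature=1.0, rng=None):
    """Toy next-token sampler over a {token: logit} dict.

    At temperature 0 we fall back to argmax: the output is fixed,
    but it is still only the mode of a probability distribution.
    """
    rng = rng or random.Random()
    if temperature == 0:
        return max(logits, key=logits.get)  # greedy: most likely continuation
    # Softmax with temperature scaling (subtract max for numerical stability).
    scaled = {t: l / temperature for t, l in logits.items()}
    m = max(scaled.values())
    weights = {t: math.exp(l - m) for t, l in scaled.items()}
    total = sum(weights.values())
    tokens = list(weights)
    probs = [weights[t] / total for t in tokens]
    return rng.choices(tokens, weights=probs, k=1)[0]

logits = {"yes": 2.1, "no": 1.9, "maybe": 0.3}
print(sample_token(logits, temperature=0))  # always "yes", the most likely token
```

Raising the temperature flattens the distribution, so repeated calls with the same input can and will disagree with each other.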
This limitation alone is not definitive, though. But it gets worse.
By design, large language models spend a fixed amount of computation per token processed. This means the amount of computation an LLM performs before producing its first output token is a function of just two numbers: the input size and the model size. So, if you ask an LLM to answer a yes-or-no question about a logical puzzle, all the “thinking” the model can do is some fixed (albeit huge) number of matrix multiplications that depends only on the input size. See where I’m going here?
Now, consider that you have two different logical puzzles with the same input size, i.e., the same number of tokens. But one is an easy puzzle that can be solved with a short chain of deduction steps, while the other requires a much higher number of steps. Here is the kicker: any LLM will spend exactly the same amount of computation in both problems. This can’t be right, can it?
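A back-of-the-envelope sketch makes the point. Using the standard rough estimate of about 2 FLOPs per parameter per token for a transformer forward pass (and ignoring attention's quadratic term), the compute budget depends only on input length and model size, never on the difficulty of the puzzle:

```python
def forward_flops(n_tokens, n_params):
    """Very rough transformer forward-pass cost: ~2 FLOPs per parameter
    per token. A back-of-the-envelope estimate, not an exact count."""
    return 2 * n_params * n_tokens

easy_puzzle_tokens = 120   # solvable in a few deduction steps
hard_puzzle_tokens = 120   # needs a long deduction chain, same token count
n_params = 70_000_000_000  # a hypothetical 70B-parameter model

# Identical budgets, regardless of how hard the puzzles actually are.
print(forward_flops(easy_puzzle_tokens, n_params) ==
      forward_flops(hard_puzzle_tokens, n_params))  # True
```

The numbers here are invented for illustration; the point is that nothing in the formula knows about the reasoning depth the problem requires.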
A basic result in computational complexity theory is that some problems seem to require computational cost that grows exponentially with input size, so even modestly sized instances become intractable. These are the NP-complete problems, and most computer scientists believe there is no efficient algorithm to solve them. Crucially, a huge number of reasoning problems fall into this category, including the most basic logical puzzle of all: determining whether a given logical formula can be satisfied.
When faced with an instance of an NP-complete problem, an LLM will produce an answer after a fixed amount of computation defined solely by the input size. Now, by sheer size, some larger models might just spend enough computation to cover many smaller instances of NP-complete problems. As it happens, a huge constant function can be larger than an exponential function for smaller inputs. But crucially, we can always find instances of NP-complete problems that require, even in principle, a sufficiently large amount of computation to surpass the computational capacity of any LLM, no matter how big.
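To see where the exponential cost comes from, here is a minimal brute-force satisfiability checker (a toy sketch, not a practical solver). Each additional variable doubles the number of assignments to try, so any fixed compute budget is eventually overwhelmed:

```python
from itertools import product

def satisfiable(clauses, n_vars):
    """Brute-force SAT check. Clauses are lists of ints in DIMACS style:
    positive i means variable i, negative -i its negation (1-indexed).

    Tries all 2**n_vars assignments, so the worst-case cost doubles with
    every extra variable, no matter how short the formula is.
    """
    for assignment in product([False, True], repeat=n_vars):
        def lit_true(lit):
            value = assignment[abs(lit) - 1]
            return value if lit > 0 else not value
        if all(any(lit_true(l) for l in clause) for clause in clauses):
            return True
    return False

# (x1 or x2) and (not x1 or x2) and (not x2) -> unsatisfiable
print(satisfiable([[1, 2], [-1, 2], [-2]], n_vars=2))  # False
# (x1 or x2) and (not x1) -> satisfiable with x1=False, x2=True
print(satisfiable([[1, 2], [-1]], n_vars=2))           # True
```

No known algorithm avoids this exponential blowup in the worst case, which is exactly why a fixed chain of matrix multiplications cannot cover all instances.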
But this means something even more profound. Ultimately, LLMs are not Turing-complete systems but essentially very large finite automata. While they can handle a wide range of tasks and produce outputs that appear sophisticated, their underlying architecture limits the types of problems they can solve.
Turing completeness is the ability of a computational system to perform any computation, given sufficient time and resources. Modern computers and many seemingly simple systems, such as cellular automata, are Turing complete. But LLMs, ironically, are not.
The reason is simple. We know from computability theory that any Turing complete system must be able to loop indefinitely. There are some problems—some reasoning tasks—where the only possible solution is to compute, and compute, and compute until some condition holds, and the amount of computation required cannot be known in advance. You need potentially unbounded computation to be Turing complete.
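The classic illustration is the Collatz iteration: nobody knows how to bound the number of steps in advance for an arbitrary starting value, so the only known strategy is to keep computing until the halting condition holds.

```python
def collatz_steps(n):
    """Count iterations of the Collatz map (n -> n/2 if even, 3n+1 if odd)
    until reaching 1.

    No known formula bounds the number of steps for arbitrary n in
    advance; you just have to run the loop until the condition holds.
    """
    steps = 0
    while n != 1:
        n = 3 * n + 1 if n % 2 else n // 2
        steps += 1
    return steps

print(collatz_steps(6))   # 8
print(collatz_steps(27))  # 111 -- a tiny input with a surprisingly long run
```

A fixed-depth stack of matrix multiplications simply has no place to put a loop like this.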
And this is the final nail in the coffin. LLMs, by definition, are computationally bounded. No matter their size, there will always be problem instances—which we may not be able to identify beforehand—that require more computation than is available in the huge chain of matrix multiplications inside the LLM.
Thus, when LLMs seem to tackle complex reasoning problems, they often solve specific instances of those problems rather than demonstrating general problem-solving capabilities. This might just be enough for practical purposes—we may never need to tackle the larger instances—but, in principle, LLMs are incapable of truly open-ended computation, which means they are incapable of true reasoning. Case closed.
Improving LLM reasoning skills
However, we need not throw in the towel just yet. Researchers and practitioners have explored several strategies to improve the reasoning skills of large language models, including chain-of-thought prompting, self-critique mechanisms, and integration with external tools.
CoT prompting encourages LLMs to articulate their thought processes, allowing them to break complex problems into manageable steps and improve their accuracy in reasoning tasks. On the other hand, self-critique aims to refine outputs through an internal evaluation process, yet it has shown mixed effectiveness in meaningfully correcting errors. Additionally, incorporating external tools such as reasoning engines and code generation systems can significantly augment the LLMs' capabilities by providing structured logic and formal verification.
However, each approach has its own set of challenges, and their potential and limitations in fostering true reasoning abilities within LLMs need to be carefully examined.
Chain of Thought
Chain-of-thought prompting has emerged as a promising technique for enhancing the reasoning capabilities of large language models. By guiding models to articulate intermediate reasoning steps before arriving at a final answer, CoT prompting helps decompose complex problems into manageable parts. This method has improved performance across various reasoning tasks, such as arithmetic and commonsense reasoning.
CoT prompting instructs LLMs to break down complex problems into simpler, sequential steps and then tackle each step independently. This structured approach improves response accuracy and precision. Studies have shown that the technique can significantly boost performance on reasoning tasks, particularly when the model is large enough (around 100 billion parameters) to exploit the benefits of CoT prompting effectively.
By encouraging models to articulate their thought processes, CoT prompting taps into the extensive pool of knowledge that LLMs acquire during training. This mechanism helps models apply relevant information more effectively, addressing their inherent difficulties with logical reasoning and problem-solving.
Additionally, CoT makes the LLM “think harder” in the sense that it forces the model to produce what we can consider “internal thought” tokens. Thus, we may view it as a way to spend additional computation on the input before deciding on the response.
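As a concrete sketch, a zero-shot CoT setup can be as simple as wrapping the question in a template. The helper name and exact wording below are illustrative, following the common "let's think step by step" recipe:

```python
def make_cot_prompt(question):
    """Wrap a question in a minimal zero-shot chain-of-thought template.

    The phrasing follows the widely used zero-shot CoT recipe; the
    function name and template details are invented for illustration.
    """
    return (
        f"Question: {question}\n"
        "Let's think step by step, writing each deduction on its own line, "
        "and only then state the final answer on a line starting with 'Answer:'."
    )

print(make_cot_prompt("If all blorks are glifs and Tim is a blork, is Tim a glif?"))
```

Every "thought" token the model emits in response buys it one more forward pass of computation before it commits to an answer.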
However, despite its advantages, CoT prompting remains insufficient for several reasons.
The effectiveness of CoT prompting is highly contingent on the quality and diversity of the prompts used. If the examples provided are not representative or sufficiently varied, the model may struggle to generate coherent reasoning chains, leading to suboptimal performance. This reliance on effective prompt engineering can limit the technique's scalability and generalizability.
And again, the stochastic nature of LLMs means that even with CoT prompting, outputs can vary significantly across different runs due to randomness in generation processes. This variability can lead to inconsistent reasoning outcomes, undermining the reliability of the model's responses.
Ultimately, CoT extends the computation budget by a finite amount. Unless we adopt some cyclic scheme where the LLM is prompted to continue thinking, potentially indefinitely, until it is satisfied, the fundamental limitation of Turing incompleteness remains.
Self-critique
Another intuitive approach to improving reasoning is self-critique, which involves evaluating and refining an LLM's responses with the same model, using prompts that instruct the model to read its previous output, highlight potential errors, and try to correct them. A form of after-the-fact chain of thought, if you will. However, recent research has highlighted significant limitations in the effectiveness of this self-critique capability.
While LLMs can generate multiple ideas and attempt to critique their initial outputs, studies indicate that they often cannot meaningfully self-correct. The assumption that verifying correctness should be easier than generating a solution—a fundamental idea in computational complexity theory—does not seem to hold, in general, for LLMs. This is particularly problematic in reasoning tasks, where the model struggles to adequately assess the accuracy of its own outputs. For example, if an LLM generates a flawed answer, its attempt to critique and revise it can lead to further errors rather than improvements.
Research shows that self-correction techniques in LLMs are heavily contingent on the availability of external feedback. In many cases, LLMs perform better when they have access to an external verifier or additional context rather than relying solely on their internal reasoning capabilities. For example, when solving complex problems, such as graph coloring or planning tasks, LLMs often fail to produce reliable solutions without external guidance.
Interestingly, attempts at self-critique can sometimes degrade performance rather than enhance it. Studies have shown that when LLMs engage in self-critique without external validation, they may generate false positives or incorrect conclusions. If you push harder, you can easily fall into a cycle of self-reinforcement of invalid or erroneous arguments, making the LLM increasingly certain even as its answers get worse.
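A minimal sketch of such a self-critique loop, with a hypothetical `llm(prompt)` callable standing in for a real model, makes the failure mode easy to see: nothing in the loop validates the critique itself, so an erroneous critique feeds straight back into the next revision.

```python
def self_critique(llm, question, rounds=2):
    """Naive self-critique loop around a hypothetical `llm(prompt)` callable.

    There is no external verifier here: if the model's critique is itself
    wrong, each round can reinforce the error instead of fixing it.
    """
    answer = llm(f"Answer the following question:\n{question}")
    for _ in range(rounds):
        critique = llm(
            f"Question:\n{question}\n\nProposed answer:\n{answer}\n\n"
            "List any logical or factual errors in the proposed answer."
        )
        answer = llm(
            f"Question:\n{question}\n\nPrevious answer:\n{answer}\n\n"
            f"Critique:\n{critique}\n\nWrite a corrected answer."
        )
    return answer

# A counting stub stands in for the model, just to show the control flow:
calls = []
def fake_llm(prompt):
    calls.append(prompt)
    return f"response {len(calls)}"

print(self_critique(fake_llm, "Is 7 prime?", rounds=1))  # prints "response 3"
```

One round costs three model calls (answer, critique, revision), and the quality of the result still rests entirely on the model's own judgment at every step.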
External tools
Integrating external tools, such as reasoning engines or code generation systems, into large language models represents a promising—and, for me, the only really viable—approach to enhancing their reasoning capabilities.
Connecting LLMs to external reasoning engines or logical inference tools makes it possible to augment their reasoning capabilities significantly. These tools can handle complex logical deductions, mathematical computations, or even domain-specific knowledge that the LLM might not possess inherently. This integration allows for more accurate and reliable outputs, as the external tools can apply formal logic and structured reasoning that LLMs typically struggle with.
Similarly, external code generation systems enable LLMs to produce executable code for specific tasks. This capability can streamline software development processes and improve efficiency in generating functional code snippets. The external systems can provide rigorous checks and balances that help ensure the correctness of the generated code.
By leveraging these external resources, LLMs can potentially overcome some of their inherent limitations in logical reasoning and problem-solving. For starters, an external inference engine can be Turing-complete, so that scratches one problem off the list, right?
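As a toy sketch of this division of labor, here is an exhaustive propositional checker standing in for a real reasoning engine (the function names and formula encoding are invented for illustration). The checker is sound, but it can only verify the formula the model handed it:

```python
from itertools import product

def verify_model_answer(formula, variables, claimed):
    """Verify an LLM's claimed verdict ('satisfiable'/'unsatisfiable') for
    a propositional formula with an exhaustive external checker.

    `formula` is a Python boolean expression over the given variable
    names, standing in for a real engine's input language. If the model
    mistranslated the original problem into `formula`, the checker
    faithfully verifies the wrong formula: garbage in, garbage out.
    """
    sat = any(
        eval(formula, {v: val for v, val in zip(variables, assignment)})
        for assignment in product([False, True], repeat=len(variables))
    )
    return claimed == ("satisfiable" if sat else "unsatisfiable")

# Suppose the model translated a puzzle into this formula and claimed "satisfiable":
print(verify_model_answer("(a or b) and not a", ["a", "b"], "satisfiable"))  # True
```

The external check is formally rigorous, yet the pipeline's correctness still hinges on the stochastic translation step that produced `formula` in the first place.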
Not so fast. Unfortunately, this approach has many challenges, particularly regarding the LLM's ability to generate the correct input for function calls or code execution. It all circles back to the original sin of LLMs: stochastic output.
First, the effectiveness of function calling or code generation hinges on the model's ability to accurately interpret a task and generate appropriate inputs. If the model misinterprets the requirements or generates vague or incorrect prompts, the external tool may produce erroneous outputs or fail to execute altogether. This reliance introduces a potential failure point where the model's limitations in understanding context and intent become apparent.
Many reasoning tasks require a nuanced understanding of logic and context that may exceed the capabilities of language models. For instance, when generating inputs for a logical inference engine, the model must understand the problem and articulate it in a way that aligns with the system's requirements. If the model fails to capture these nuances, it may lead to incorrect deductions or ineffective reasoning processes.
Translating natural language into code or structured queries adds complexity and can undermine reasoning. The conversion requires knowledge of programming syntax and logic that may not come naturally to an LLM trained primarily on natural language data. Mistakes in this translation propagate to the external system, causing further errors.
While external tools can, in principle, improve the reasoning capabilities of an LLM by providing structured logic and formal verification, they cannot compensate for LLMs' basic limitations in generating precise inputs. Therefore, there is no formal guarantee that the outputs from this integration will be logically sound or appropriate for the context, simply because of the age-old adage: garbage in, garbage out.
Conclusions
While large language models may exhibit some reasoning capabilities, their fundamentally stochastic nature and fixed computational architecture hinder their ability to engage in open-ended, arbitrary-length deductions. This underlying limitation means that despite ongoing research into techniques to enhance reasoning, such as chain-of-thought prompting and self-critique mechanisms, and even duct-taping models to powerful reasoning engines, we still don't know how to make language models reason with flawless, formal logic.
The emergence of models like OpenAI’s o1, which boasts impressive reasoning abilities, may seem like a significant step forward. However, this approach does not represent a fundamentally new paradigm in logical reasoning with LLMs. Deep down, this is “just” a way to explicitly incorporate chain of thought prompting in a fine-tuning phase and teach the model via reinforcement learning to select mostly coherent paths of deduction.
Thus, while definitely an impressive technical and engineering feat, o1 (a terrible name, by the way) and any future models based on the same paradigm will continue to share the core limitations inherent to all LLMs, only mitigated by some clever tricks. While they may excel in certain contexts, caution must be exercised in interpreting their outputs as definitive reasoning.