Understanding Large Language Models
What is an LLM, how do you build one, and what can you do with it.
The following article is extracted from Chapter 1 of Mostly Harmless AI, a 160-page book on LLMs and everything you can, and cannot, do with them. The book is in beta, meaning it already has its final structure and content, but major revisions are still in progress. You can get it today at a 50% discount and secure access to all future versions, including huge discounts on printed versions, as well as a community of like-minded readers.
What is a (Large) Language Model?
In machine learning, language modelling means guessing how likely a given sentence is.
For example, "the sun rises in the east and sets in the west" is a typical sentence with a high subjective probability. You would probably agree this is a sentence you’re likely to hear at least one. But a sentence with random words that don't mean anything has a low probability of ever being uttered by anyone, or written in a book.
Language modelling can be tricky because it's hard to say how likely a sentence is to “exist”. What does it even mean? In machine learning, we use a collection of texts called a corpus to help with this. Instead of the abstract, ontological question, we might ask something much more straightforward: How likely is it for this sentence to appear in all the written text, for example, on the internet?
However, if we only looked at sentences that already exist on the internet, language modelling wouldn't be very useful. We'd just say a sentence is either there or not, with a probability of 0 or 1. So instead, we can think about it in statistical terms like this: if the internet were made and erased many times, how often would this sentence show up?
To answer this question, we can think about whether a word will likely come after a group of words in a sentence. For example, "The sun rises in the east and sets in the..." What word would most likely come next? We want our language model to be able to guess that word.
Thus, we need to know how often a given word appears after a group of words. If we can do that, we can find the best word to complete the sentence. We keep doing this repeatedly to create sentences, conversations, and even full books.
Now, let's talk about the most common way to make language modelling work in practice. It's called statistical language modelling. We start with lots of text and learn how words correlate with other words. That is, we estimate the correlation of each word with a given context.
In simple terms, a context is a group of words around a specific word in a sentence. For example, in the sentence "the sun rises in the east and sets in the west," the word "east" is in the context of "{ the, sun, rises, and, sets }." If we look at many sentences, we can find words that are often in the same context. This helps us understand which words are related to each other.
For example, if we see "the capital of France is Paris" and "the capital of the United States is Washington," we can learn that Paris and France, as well as Washington and the United States, are related. They have the same relationship: being the capital of a country. We might not know what to call this relationship, but we can know it's the same type.
Statistical language modelling thus means building a model that can guess how often a word appears in a certain context, using lots of data. This doesn't necessarily mean it truly understands the meaning of a sentence. But if we use enough data, it starts to look like the model can indeed capture at least some of the semantics.
The Simplest Language Model: N-grams
We've been building statistical language models since the early days of AI. The n-gram model is one of the simplest ones: it stores the probability of occurrence of each n-gram, a sequence of n consecutive words. For example, in a 2-gram model, we count how many times each pair of consecutive words appears in a large corpus, creating a table showing their frequency.
As we increase n to 3, 4, or 5, the collection of all n-grams becomes extremely large. Before the deep learning revolution, Google built a massive n-gram model from the entire internet with up to 5-grams. However, since the number of possible 5-word combinations in English is huge, it only stored probabilities for the most common combinations.
This simple model captures context in a very rigid way: it only counts words that fall within a fixed window, and it records each n-gram's probability or frequency explicitly. To compress this model further, we use embeddings.
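To make this concrete, here is a minimal sketch of a 2-gram model in Python, using a toy three-sentence corpus instead of a real one. A production n-gram model would also need smoothing for unseen pairs, which is omitted here.

```python
from collections import defaultdict, Counter

# A minimal bigram (2-gram) language model: count how often each word
# follows another word in a tiny toy corpus, then turn counts into
# probabilities. Real n-gram models do the same thing at internet scale.
corpus = [
    "the sun rises in the east and sets in the west",
    "the sun is a star",
    "the east is where the sun rises",
]

counts = defaultdict(Counter)
for sentence in corpus:
    words = sentence.split()
    for prev, nxt in zip(words, words[1:]):
        counts[prev][nxt] += 1

def next_word_distribution(word):
    """Probability of each word that follows `word` in the corpus."""
    total = sum(counts[word].values())
    return {w: c / total for w, c in counts[word].items()}

print(next_word_distribution("the"))
# e.g. {'sun': 0.5, 'east': 0.33, 'west': 0.17}
```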
Word Embeddings
An embedding is a mapping of some object—say, a word—to an n-dimensional vector of real numbers. Embeddings aim to transform semantic properties from an original space—i.e., words—into numerical properties of the embedding space. That is, we want words that occur together in context to map to similar vectors and form clusters in the embedding space.
Word2Vec, released in 2013, was the first massively successful use of embeddings. Google trained a large embedding model using statistics from text all over the internet and discovered an amazing property: directions in the embedding space can encode semantic properties.
For instance, if you take France and Paris, the vector you need to add to the word "France" to reach "Paris" is similar to the vector you need to add to the word "United States" to reach "Washington". The semantic property has-capital was encoded as a specific direction in this space. Many other semantic properties were found encoded this way, too.
This was an early example of how encoding words in a dense vector space can capture some of their semantics.
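As a toy illustration of that "directions encode relations" idea, here is a sketch with hand-picked 3-dimensional vectors; real Word2Vec embeddings have hundreds of dimensions and are learned from data, but the arithmetic is the same.

```python
import numpy as np

# Toy 3-dimensional "embeddings" chosen by hand to illustrate the idea;
# real embeddings are learned from co-occurrence statistics.
vectors = {
    "france":        np.array([1.0, 0.0, 0.2]),
    "paris":         np.array([1.0, 1.0, 0.2]),
    "united_states": np.array([0.0, 0.0, 0.9]),
    "washington":    np.array([0.0, 1.0, 0.9]),
}

# The "has-capital" direction: Paris - France.
capital_direction = vectors["paris"] - vectors["france"]

# Adding that same direction to United States should land near Washington.
candidate = vectors["united_states"] + capital_direction

def closest(vec):
    """Return the word whose vector is nearest to `vec`."""
    return min(vectors, key=lambda w: np.linalg.norm(vectors[w] - vec))

print(closest(candidate))  # washington
```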
Contextual Word Embeddings
The issue with Word2Vec is its assignment of a unique vector to each word, regardless of context. As words have different meanings in different contexts, many attempts were made to create contextual embeddings instead of static ones.
The most successful attempt is the transformer architecture, with BERT as one of the earliest and most influential examples. The first transformer paper revolutionized natural language processing in artificial intelligence, providing a single tool to tackle various NLP problems.
A transformer is a neural network that generates an embedding of each word in the input text by considering the entire content of the input. This means each word's embedding changes according to its context. Additionally, a global embedding for an entire sentence, paragraph, or general fragment of text can be computed.
Why does this matter? Neural networks are among the most powerful machine learning paradigms. We can find embeddings for text, images, audio, categories, and programming code using a single representation. This enables machine learning across multiple domains using a consistent approach.
With neural networks, you can transform images to text, text to image, text to code or audio, etc. The first idea of the transformer was to take a large chunk of text, obtain an embedding, and then use a specific neural network for tasks like text classification or translation. But then, you can build sequence-to-sequence architectures that allow a neural network to receive a chunk of text, embed it into a real-valued vector, and generate a completely different chunk of text from it.
For example, you can encode a sentence in English with a transformer that embeds it into a real-valued vector and then decode it with another transformer that "speaks" French. The real-valued vector in the middle represents the meaning of the text independent of language. So, you can have different encoders and decoders for various languages and translate any language pair.
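Here is a minimal sketch of this encoder-decoder setup using the Hugging Face transformers library, assuming it is installed and can download the small T5 checkpoint; any other translation model would work the same way.

```python
# A minimal sketch of encoder-decoder translation with the Hugging Face
# `transformers` library. The encoder maps English text into a latent
# representation; the decoder generates French from it.
from transformers import pipeline

translator = pipeline("translation_en_to_fr", model="t5-small")
result = translator("The sun rises in the east and sets in the west.")
print(result[0]["translation_text"])
```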
A remarkable phenomenon is that you can train embeddings on pairs of languages like English-Spanish and German-French and then translate from English to French without ever training on translations from English to French. This is due to using a shared internal representation for all languages. The sequence-to-sequence transformer is a fundamental piece behind technologies like ChatGPT. The next step is training it on massive amounts of text and then fine-tuning it for specific tasks.
Large Language Models
Large language models are the latest development in statistical language modelling, evolving from N-Gram models, embeddings, and transformers. Thanks to innovations that efficiently accommodate thousands of words in memory, these advanced architectures can compute contextual embeddings for extensive text contexts. This capacity has increased continuously, with the first version of ChatGPT holding something like 4000 words, while recent models hold anything from 30 thousand to a couple million words in the context!
A significant change is the scale of data on which these models are trained. BERT was trained on a vast dataset for its time, but it pales in comparison to GPT-2, 3, and 4. Large language models learn from a massive amount of internet text, including technical texts, books, Wikipedia articles, blog posts, social media, news, and more. This exposure to diverse text styles and content allows them to understand various mainstream languages.
Large language models, like GPT-2, generate text by predicting the next word in a sentence or paragraph, just like all previous language models. But when you combine the massive scale of data and computational resources that go into these beasts of language models with some clever tricks, they become something completely beyond what anyone thought possible.
GPT-2 was a huge leap forward in terms of coherent text generation. Given an initial prompt—say, the introductory paragraph of a fictional story—the model would generate token after token, creating a mostly coherent story full of fictional characters and a plot. After a while it would start to diverge, of course, but for short fragments of text, this was already mind-blowing.
However, things really exploded with GPT-3. At this size, emerging capabilities like "in-context learning" appeared, and this is where our story really begins.
How do LLMs work?
A generative language model, at its core, is just a statistical machine learning model trained to predict the continuation of a text sequence. Essentially, it's a prediction machine. You input a text prefix, run it through the model, and receive the most likely next token--a token is more or less a word or component of a word.
Actually, you don't really get just the next most likely token. The model provides a distribution across all possible tokens, giving you the probability of each one being the next continuation.
To use an LLM, we start with user input, like a query or text prefix, and run the model to get the next token. We append it to the sequence and repeat the whole process until reaching a maximum number of tokens or until the model predicts a special STOP token.
There are choices to make in this process. Choosing only the most likely continuation can quickly lead to repetitive predictions. Instead, you can choose from the top 50 most likely tokens at random, weighted by their probability. This injects some variety into the generated text and is the reason why, for the same prompt, you can get different albeit semantically similar responses.
There are a few key parameters in this sampling process: the number of top K tokens to choose from, the cumulative probability threshold (often called top-p), and the temperature, which is the most relevant. The temperature is a parameter that affects the weights of the tokens you will pick for continuation. If the temperature is 0, you always choose the most likely token. If it's higher, probabilities are smoothed out, making it more likely to choose less probable tokens. This increases the model's variability.
That's why some call high-temperature "creative mode" and low-temperature "precise mode." It has nothing to do with actual precision or creativity, just how deterministic the response to a given prompt will be.
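To make the sampling step concrete, here is a minimal sketch of a temperature and top-k sampler plus the generation loop; `model` and `tokenizer` are placeholders for whatever LLM and tokenizer you are actually running, not a real API.

```python
import numpy as np

def sample_next_token(logits, temperature=1.0, top_k=50):
    """Pick the next token id from raw model scores (logits).

    Low temperature sharpens the distribution toward the most likely token;
    high temperature flattens it, making less likely tokens more probable.
    Only the `top_k` highest-scoring tokens are considered.
    """
    logits = np.asarray(logits, dtype=np.float64)
    top = np.argsort(logits)[-top_k:]                 # keep only top-k candidates
    scaled = logits[top] / max(temperature, 1e-6)     # temperature ~0 -> near-greedy
    probs = np.exp(scaled - scaled.max())
    probs /= probs.sum()
    return int(np.random.choice(top, p=probs))

# Generation is just a loop: feed the sequence, sample one token, append, repeat.
def generate(model, tokenizer, prompt, max_new_tokens=50, **sampling_args):
    tokens = tokenizer.encode(prompt)
    for _ in range(max_new_tokens):
        logits = model(tokens)                        # scores for every possible next token
        next_id = sample_next_token(logits, **sampling_args)
        if next_id == tokenizer.stop_token_id:        # special STOP token
            break
        tokens.append(next_id)
    return tokenizer.decode(tokens)
```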
From this perspective, you can already see why some people say language models are "just autocomplete on steroids". Indeed, that is the gist of their mechanics: you're completing a text sequence by adding one token at a time until you decide to stop. However, this is just scratching the surface. There is so much more involved in getting these models to behave in a useful way, and we will talk about some of those aspects in the next section.
But before moving on, here is a key insight from this explanation of how LLMs work: A language model always performs a fixed amount of computation per token.
This means that whatever limited form of "reasoning" can happen in an LLM, the depth and complexity of that reasoning are directly proportional to the total number of tokens the model processes. This implies two things:
If the input prompt is larger, the model will perform more computation before starting to compute its answer. This is part of the reason why more detailed prompts are better. But crucially, if the output is larger, the model is also doing more computation.
This is why techniques like chain-of-thought—and basically anything that makes a model "talk more"—tend to improve its performance at some tasks.
They have more compute available to do whatever reasoning they can do. If you ask a model a quick question and instruct it to give a one-word answer, the amount of compute spent producing that answer is proportional to just the input size. But if you ask the model to produce a step-by-step reasoning of the answer before the final answer, there is a higher chance you'll get a better answer just by virtue of spending more computation.
At the risk of anthropomorphizing too much, I like to summarize this insight as follows: LLMs only think out loud. If you want them to think better, get them to talk more.
So, this is how a language model works from a user perspective. Let's see how you build one.
How to Train your Chatbot
How do you make your language model work? There are three main steps.
Pre-training
The first step is called self-supervised pretraining. In this step, you take a raw transformer architecture with uninitialized weights and train it on a massive amount of data to predict the next token. You use a large corpus of data, such as news, internet blog posts, articles, and books, and train the model on trillions of words.
The simplest training method is next token prediction. You show the model a random text and ask it what the next token is. Take a random substring from the dataset, remove the last token, show the prefix to the model, and ask for likely continuations. Compute a loss function to determine how mistaken the model was in its predictions and adjust it slightly to improve future predictions.
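Conceptually, one pre-training step looks something like the following sketch, assuming `model`, `optimizer`, and a tokenized `dataset` already exist (all three are placeholders here); real training runs this in huge batches across many GPUs.

```python
import random
import torch
import torch.nn.functional as F

# A minimal sketch of one self-supervised pre-training step. `model` is any
# transformer that returns logits over the vocabulary, `dataset` is a long
# list of token ids.
def training_step(model, optimizer, dataset, context_size=512):
    # Take a random substring of the corpus.
    start = random.randint(0, len(dataset) - context_size - 1)
    chunk = torch.tensor(dataset[start : start + context_size + 1])
    inputs, targets = chunk[:-1], chunk[1:]   # predict token i+1 from tokens <= i

    logits = model(inputs.unsqueeze(0))       # shape: (1, context_size, vocab_size)
    loss = F.cross_entropy(logits.view(-1, logits.size(-1)), targets.view(-1))

    optimizer.zero_grad()
    loss.backward()                           # adjust the model slightly
    optimizer.step()
    return loss.item()
```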
So far, this is a standard machine learning approach. We call it self-supervised learning because the targets are not given by humans, but chosen automatically from the input. But deep down, this is just supervised learning at scale.
Now, that being said, scaling this training process to billions of parameters and trillions of tokens presents a massive engineering challenge. No single supercomputer in the world can handle training GPT-4 from scratch, so you must resort to distributed systems that split the model across hundreds or thousands of GPUs for extended periods of time, synchronizing the different parts of the model across multiple machines. This means that, while the conceptual part of training an LLM is pretty straightforward, building something like GPT-4 is nothing short of an engineering feat.
Once pre-training is completed, you have what is called a "base model", a language model that can continue any sentence in a way that closely resembles existing text. This model is already extremely powerful. Give it any prefix of text with any content whatsoever and the model will complete it with a mostly coherent continuation. It's really autocompletion on steroids!
However, these base models, as powerful as they are, are still very hard to prompt. Crucially, they do not understand precise instructions, mostly because their training data doesn't contain a lot of examples of instructions. They are just stochastic parrots, in a sense. The next step is to tame them.
Instruction tuning
At this point, the LLM already has all the knowledge in the world somewhere hidden in its weights--metaphorically speaking--but it is very hard to locate any concrete piece of knowledge. You must juggle with transforming questions into the right prompts to find a pattern that matches what the model has seen.
The way to solve this problem is to include another training phase, but this time much shorter and focused on a very well-curated dataset of instructions and responses. Here, the quality is crucial, much more than the quantity. You won't teach the model anything new, you will just tune it to expect instruction-like inputs and produce answer-like outputs.
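For illustration, here is what a tiny, hypothetical instruction-tuning dataset might look like, together with one common way of formatting each pair into a training prompt; the exact field names and template vary from project to project.

```python
# A hypothetical example of what a (tiny) instruction-tuning dataset looks like.
# Real datasets contain tens of thousands of carefully curated pairs.
instruction_data = [
    {
        "instruction": "Summarize the following paragraph in one sentence.",
        "input": "The sun rises in the east and sets in the west...",
        "response": "The paragraph describes the daily path of the sun.",
    },
    {
        "instruction": "Translate to French: 'Good morning, how are you?'",
        "input": "",
        "response": "Bonjour, comment allez-vous ?",
    },
]

def format_example(example):
    """Turn a pair into the instruction-like prompt the model is tuned on."""
    return (
        f"### Instruction:\n{example['instruction']}\n"
        f"### Input:\n{example['input']}\n"
        f"### Response:\n{example['response']}"
    )

print(format_example(instruction_data[0]))
```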
Once finished, you have what's called an instruction-tuned model. These models are much more robust and easy to prompt compared to the base model, and this is the point where most open-source models end. But this is not the end of the story.
Instruction-tuned models are still not suitable for publicly facing products for one crucial reason: they can be coerced into answering anything at all, including producing biased, discriminatory, or hateful speech, and instructions on how to build bombs and deadly poisons.
Given that base models are trained on the whole Internet, they are full of all the good and bad you can read online--and although some effort is put into cleaning the pretraining dataset, it's never enough. We must teach the model that some questions are better left unanswered.
Preference tuning
The final step is to fine-tune the model to produce answers that are more closely aligned with user preferences. This is primarily used to avoid biased or hateful speech and to reject any questions deemed unethical by the developers training the model. However, it also has the effect of making the model more polite in general, if you wish.
The way this process works is to turn the problem from supervised learning into the realm of reinforcement learning. In short, the main difference is that, while in supervised learning we give the model the correct answers (as in instruction tuning), in reinforcement learning we don't have access to ground truth answers.
Instead, we use an evaluator that ranks different answers provided by the LLM, and a feedback loop that teaches the LLM to approximate that ranking. In its original inception, this process was performed with a human evaluator, thus giving rise to the term "reinforcement learning with human feedback", but since including humans makes this process slower and more expensive, smaller organizations have turned to using other models as evaluators.
For example, if you have one strong model, like GPT-4, you can use it to rank responses by a smaller, still-in-training model. This is one example of a more general concept in machine learning called "knowledge distillation", in which you attempt to compress the knowledge of a larger model into a smaller model, gaining in efficiency without sacrificing too much performance.
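Whichever evaluator you use, the preference signal is typically distilled into a reward model trained with a pairwise ranking loss. Here is a minimal sketch under that assumption; `reward_model` is a placeholder for any network that scores a (prompt, answer) pair.

```python
import torch.nn.functional as F

# A minimal sketch of the pairwise ranking loss commonly used to train a
# reward model for preference tuning. `chosen` and `rejected` are two answers
# to the same prompt, where the evaluator preferred `chosen`.
def preference_loss(reward_model, prompt, chosen, rejected):
    r_chosen = reward_model(prompt, chosen)      # scalar score for the preferred answer
    r_rejected = reward_model(prompt, rejected)  # scalar score for the other answer
    # The loss pushes the preferred answer's score above the rejected one.
    return -F.logsigmoid(r_chosen - r_rejected).mean()
```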
And finally, we now have something that works like GPT-4. The process was long and expensive: a massive pretraining followed by a carefully curated instruction tuning and a human-backed preference tuning. This is the reason why so few organizations have the resources to train a state-of-the-art large language model.
What can LLMs do?
Now that we understand how language models are built, let's turn our attention to their capabilities. As we've seen so far, base models are, ultimately, just autocompletion models. Given an initial prefix, they can produce a mostly coherent continuation that is plausible as far as the data and the training procedure allow.
But autocompletion is far from the only task you can do with LLMs. As we will see in this chapter, a sufficiently powerful autocompletion engine can be coerced into performing many disparate tasks. Combine this with task-specific fine-tuning, and you can turn a chatty, hallucination-prone LLM into a powerful tool for many domains.
We will start by examining what base models can do since, ultimately, all fine-tuning can do is unlock existing capabilities, making them easier to prompt. Then, we will survey many specific tasks for which LLMs can and have been used.
What can base models do?
As cool as it sounds, "autocompletion on steroids" doesn't exactly scream intelligence, right? Well, it turns out that if you are very, very good at completing any text prefix, you must be good at a wide range of cognitive tasks.
For example, suppose you want to build a question-answering engine. Take a question like "Who is the current president of the United States" and turn it into a prompt like "the current president of the United States is...". If you feed this to a powerful base LLM, the most likely continuation represents the correct answer to the question. This means autocomplete on steroids gives you question answering for free.
And you can do this for a whole lot of tasks. Just turn them into an appropriate prefix and continuation. Do you want to translate a sentence? Use a prompt like "An English translation of the previous sentence is..." Do you want to summarize a text? Use a prompt like "A summary of the previous text is..." You get the point.
But it goes much further than that! The scientists at OpenAI discovered that models the size of GPT-3 and above were capable of inferring the semantics of a task given examples without explicitly telling them what the task was. This is called in-context learning, and it works wonders. For example, if you want to use an LLM for sentiment analysis, you can use a prompt like the following.
Comment: This movie was so good!
Sentiment: Positive
Comment: This movie really sucks.
Sentiment: Negative
Comment: The book was better.
Sentiment: Neutral
Comment: I couldn't stop looking at the screen!
Sentiment:
That is, you build a prompt with a few examples of inputs and outputs and feed that to the LLM, leaving the last input unanswered. The most likely continuation is the right answer to the last input, so provided the base model has seen similar tasks in its training data, it will pick up the pattern and answer correctly most of the time.
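In code, building such a few-shot prompt is nothing more than string formatting; here is a small sketch that reproduces the sentiment example above (the comments and labels are, of course, just illustrative).

```python
# A small helper that builds an in-context-learning prompt from labeled
# examples plus one unlabeled input, mirroring the sentiment example above.
def build_few_shot_prompt(examples, query):
    lines = []
    for comment, sentiment in examples:
        lines.append(f"Comment: {comment}")
        lines.append(f"Sentiment: {sentiment}")
    lines.append(f"Comment: {query}")
    lines.append("Sentiment:")
    return "\n".join(lines)

examples = [
    ("This movie was so good!", "Positive"),
    ("This movie really sucks.", "Negative"),
    ("The book was better.", "Neutral"),
]
prompt = build_few_shot_prompt(examples, "I couldn't stop looking at the screen!")
# Feed `prompt` to the base model; the most likely continuation is the label.
```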
In-context learning is a surprising discovery at first, but when you look deep down, it makes total sense. Since base LLMs are completion machines, provided they have seen examples of some arbitrary task in their training set, all you need to do is come up with a text prefix that makes the model "remember" that task. And that prefix is often just a set of examples of a given task because that is actually what is stored in the LLM weights: a loosely and implicitly connected set of similar text fragments.
In a sense, the input to the LLM is a key to retrieving a part of its training set, but not in an accurate way. Since LLMs only store correlations between words, anything you "retrieve" from an LLM is a fuzzy approximation and aggregation of several (possibly millions) of similar training examples. For this reason, we say base models already "know" everything, but it's very hard for them to "remember" it, because you have to find the right key--i.e., the right context prefix.
But what if we could teach the LLM that some arbitrary instruction is equivalent to the right key for a given task? That is exactly what instruction tuning is about. By showing the LLM input/output pairs of, this time, precise instructions and the corresponding answers, we are rewiring some of its parameters to strengthen the correlation between the instruction and the response.
In a sense, fine-tuning is like finding a path between the input space and the output space in the base model's fuzzy web of word correlations and connecting those two subspaces of words with a shortcut, so the next time you input the instruction, the LLM will "remember" where the appropriate answer is.
If this sounds overly anthropomorphic, it is because we have stretched the analogies a bit to make it easier to understand. In reality, there is no "remembering" or "knowing" happening inside a large language model, at least not in any way akin to how human memory and reasoning work. I have written extensively about this difference and its implications and will continue to do so in future posts.
For the time being, please be cognizant that any analogy between LLMs and human brains is bound to break pretty soon and cause major misunderstandings if taken too seriously.
Use cases for fine-tuned LLMs
With proper fine-tuning in a concrete domain, you can turn LLMs into task-specific models for a huge variety of linguistic problems. In this section, we'll review some of the most common tasks for which LLMs can be deployed.
When discussing the use cases of fine-tuned LLMs, we don't talk about an "input prefix" anymore because even if, technically, that is still what we are feeding the LLM, the response is not necessarily a direct, human-like continuation of the text. Instead, depending on the dataset it was fine-tuned on, the LLM will respond with something that looks more like an answer to a question or an instruction than a pure continuation. Actually, if you give a fine-tuned LLM like GPT-4 an incomplete text prefix, it will often reply with something like "I didn't understand you entirely, but it appears what you are trying to do is [...]" instead of casually continuing where you left off.
Thus, it is often best to interpret this process as "prompting" the LLM with an instruction, and this is the reason why the input text is called a "prompt", and the process of designing, testing, and optimizing these prompts is called, sometimes undeservedly, "prompt engineering".
Text generation
The simplest, most straightforward use case for large language models is of course text generation, whether for fictional content, technical articles, office work, homework, emails, or anything in between. But instead of using a base model, where you have to provide a prefix to continue, an instruction-tuned model can be instructed directly to write a paragraph, passage, or even a short essay on a given topic. Depending on how powerful and well-trained the model is, you can even provide hints about the intended audience, the complexity of the language to use, etc.
Text generation--and all instructions in general--often works better the more descriptive the prompt. If you simply ask the LLM to "tell me a fairy story", yes, it will come up with something plausible, and it might even surprise you in a good way. But you most likely want to have finer control over the result, and thus crafting a well-structured and informative prompt is crucial. In @sec-prompting we will learn the most basic strategies to create effective prompts.
A common issue in text generation, especially in longer formats, is that the LLM can and often will steer away from the main points of the discourse. The longer the response, the more likely some hallucinations will happen, which may come in the form of incoherent or plainly contradictory items, e.g., characters acting "out of character" if you're generating fiction.
A battle-tested solution for generating coherent, long-form text is the divide-and-conquer approach. Instead of asking for a full text from the beginning, prompt the LLM to first generate an outline of the text, and then, sequentially, ask it to fill in the sections and subsections, potentially feeding it with previously generated content to help it maintain consistency.
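As a sketch of this divide-and-conquer approach, assume a hypothetical complete(prompt) helper that sends a prompt to your LLM of choice and returns the generated text.

```python
# A sketch of divide-and-conquer long-form generation. `complete` is a
# placeholder function that calls whatever LLM you are using.
def write_long_text(topic, complete):
    outline = complete(f"Write a numbered outline for an article about {topic}.")
    sections = []
    for heading in outline.splitlines():
        if not heading.strip():
            continue
        # Feed previously generated content back in to help consistency.
        section = complete(
            f"You are writing an article about {topic}.\n"
            f"Outline:\n{outline}\n"
            f"Previously written sections:\n{''.join(sections)}\n"
            f"Now write the section titled: {heading}"
        )
        sections.append(section + "\n\n")
    return "".join(sections)
```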
Summarization
Summarization is one of the most common and well-understood use cases of LLMs. In a sense, it is a special case of text generation--what isn't, right?--but it has specific quirks that merit a separate discussion. In general, LLMs excel at summarizing. After all, that's what they've been implicitly trained to do: construct a statistical model of the whole internet, which is, in a sense, a summary of the whole of human knowledge.
However, summarization isn't a trivial problem. Besides the usual concerns about the audience, complexity of the language, style, etc., you will probably also want to control which aspects of the original text the LLM focuses on. For example, rather than a simple compression of the text, you might want a summary that emphasizes the consequences of whatever is described in the original text, or that highlights and contrasts the benefits and limitations. This is a more abstract form of summary that produces novel value, beyond just being a shorter text.
There are important caveats with summarization, though. LLMs are very prone to hallucination, and the more you push the boundary between a plain summary and something closer to critical analysis, the more the LLM will tend to ignore the original text and rely on its own pre-trained knowledge.
And just like before, the best way to counteract any form of rebellious generation is to be very intentional in your prompt and make it as structured as necessary. For example, you can first ask the LLM to extract the key points, advantages, and limitations. Then, ask it to cluster the advantages and limitations according to your criteria. Only then do you ask it to provide a natural language summary of that semi-structured analysis. This gives you finer control over the end result and will tend to reduce hallucinations while being easier to debug, since you can see the intermediate steps.
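A sketch of this staged summarization, again assuming the hypothetical complete(prompt) helper from before.

```python
# Staged summarization: extract, cluster, then summarize. Each intermediate
# result can be inspected and debugged on its own.
def structured_summary(text, complete):
    key_points = complete(f"Extract the key points, advantages, and limitations of:\n{text}")
    clusters = complete(f"Group these advantages and limitations by theme:\n{key_points}")
    return complete(f"Write a short natural-language summary of this analysis:\n{clusters}")
```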
Translation & style transfer
The text-to-text transformer architecture (the precursor and core component of the modern language model) was originally designed for translation. By encoding the input sentence into a latent space of word correlations detached from a specific language and then decoding that sentence in a different vocabulary, these models achieved state-of-the-art translation results around 2017-2018. The more general notion of style transfer is, deep down, a translation problem, but instead of between English and French, say, between technical and plain language.
Modern LLMs carry this capability and are more than enough for many practical translation tasks. However, beware that plenty of studies show LLM translations often fall short of professional translations in many linguistic respects. Translation is an art, as much as or more than it is a science. It involves a deep knowledge of the cultural similarities and differences between readers of both languages, to correctly capture all the nuances that even a seemingly simple phrase can encode.
That being said, LLMs can help bridge the gap for non-native speakers in many domains where you don't need--or can't hope for--a professional translation. An example is inter-institutional communication, e.g., emails from co-workers who don't speak your native language. In these cases, you must also be careful nothing important is lost in translation, literally, but as long as everyone is aware of the limitations, this is one of the most practical use cases for LLMs.
Structured generation
Continuing with the topic of text generation capabilities, our next stop is generation from structured data. This is one specific area where LLMs mostly solve a long-standing problem in computer science: generating human-sounding explanations of dry, structured data.
Examples of this task are everywhere. You can generate a summary of your calendar for the day and pass it to a speech synthesis engine, so your personal assistant can send you every morning an audio message reminding you what you have to do, with cute linguistic cues like "Oh, and on the way to the office, remember to pick up your wife's present." We will see an example of this functionality in @sec-planner.
Other examples include generating summaries of recent purchases for a banking app or product descriptions for an online store—basically anywhere you'd have a dashboard full of numbers and stats, you can have an LLM generate a natural language description of what's going on. You can pair this capability with the super skills LLMs have for question answering (at least when the answer is explicit in the context) to construct linguistic interfaces to any number of online services or apps.
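As a small sketch of the calendar example, assuming once more the hypothetical complete(prompt) helper:

```python
import json

# Generate a natural-language briefing from structured data. The calendar
# entries here are made up for illustration.
calendar = [
    {"time": "09:00", "event": "Team stand-up"},
    {"time": "13:00", "event": "Lunch with Alice"},
    {"time": "17:30", "event": "Pick up the present on the way home"},
]

def morning_briefing(calendar, complete):
    prompt = (
        "Write a short, friendly spoken reminder of today's schedule "
        "for a personal assistant app:\n" + json.dumps(calendar, indent=2)
    )
    return complete(prompt)
```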
Text classification
Text classification is the problem of categorizing a text fragment—be it a single sentence, a whole book, or anything in between—into one of a fixed set of categories. Examples vary from categorizing comments as positive/neutral/negative, determining if an email is spam or not, or detecting the tone and style of a text, to more specific tasks like extracting the intent of a user, e.g., when chatting with an airline bot.
To have an LLM correctly and robustly classify your text, it is often not enough to just instruct it and provide the intended categories. The LLM might come up with a new category you didn't mention just because it makes sense in that context. And negative instructions, in general, don't work very well. In fact, LLMs are lousy at interpreting negative instructions precisely because of the underlying statistical model. We will see in @sec-reasoning why this is the case.
Instead of a dry, zero-shot instruction, you can improve the LLM's classification capabilities substantially with a few examples (also called a k-shot instruction). It works even better if you select the examples dynamically based on the input text, a procedure that is eerily similar to k-NN classification but in the world of LLMs. Furthermore, many LLMs tend to be chatty by design and will often fail to provide a single-word classification even if you instruct them to. You can mitigate this by using a structured response prompt, as seen in @sec-prompting.
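Here is a sketch of that dynamic example selection, assuming a placeholder embed(text) function that returns a sentence embedding for any text; in practice any embedding model would do.

```python
import numpy as np

# k-NN-style retrieval of the labeled examples most similar to the input,
# used to build the k-shot prompt. `embed` is a placeholder for any
# sentence-embedding model.
def select_examples(query, labeled_examples, embed, k=3):
    q = embed(query)
    scored = sorted(
        labeled_examples,
        key=lambda ex: -float(np.dot(embed(ex["text"]), q)),
    )
    return scored[:k]
```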
Structured data extraction
A generalization of text classification is the problem of structured data extraction from natural language. A common example is extracting mentions of people, dates, and tasks in a text, for example, a transcription from a video meeting. In the more general case, you can extract any entity-relation schema from natural text and build a structured representation of any domain.
But this capability goes much further. If you have any kind of structured input format--e.g., an API call for any online service--you can instruct (and probably k-shot) an LLM to produce the exact JSON-formatted input given a user query. This is often encapsulated by modern LLM providers in a functionality called "function calling", which we will explore in @sec-function-calling.
As usual, the main caveat with structured generation is the potential for subtle hallucinations. In this case, they can come in two forms. The simplest one is when the LLM fails to produce the expected format by, e.g., missing a key in a JSON object or providing an invalid type. This type of error is what we call a syntactic hallucination and, although annoying, is often trivial to detect and correct, even if just by retrying the prompt.
The second form of hallucination is much more insidious: the response can be in the right format, and all values can have the right type, but they don't match what's in the text. The LLM hallucinated some values. The reason this is a huge problem is that detecting this form of semantic hallucination is as hard as solving hallucinations in general. As we'll see in @sec-hallucinations, we simply have no idea how to ensure an LLM always produces truthful responses, and it might be impossible even in principle.
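The following sketch shows the easy half of the problem: extracting structured data and checking the output syntactically, retrying when the format is wrong. It assumes the hypothetical complete(prompt) helper; the semantic check (are the extracted values actually in the transcript?) is precisely what we don't know how to automate.

```python
import json

# Structured extraction with a syntactic check. Catching format errors is
# easy; catching hallucinated *values* is not.
EXPECTED_KEYS = {"person", "date", "task"}

def extract_action_items(transcript, complete, max_retries=3):
    prompt = (
        "Extract every action item from this meeting transcript as a JSON list "
        'of objects with keys "person", "date", and "task":\n' + transcript
    )
    for _ in range(max_retries):
        raw = complete(prompt)
        try:
            items = json.loads(raw)
            if all(EXPECTED_KEYS <= set(item) for item in items):
                return items  # syntactically valid; values may still be wrong
        except (json.JSONDecodeError, TypeError):
            pass  # syntactic hallucination: retry the prompt
    raise ValueError("Model failed to produce valid JSON")
```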
Question answering
Question answering is one of the most surprising capabilities of sufficiently large language models. To some extent, question answering can be seen as a form of retrieval, where you ask about some facts explicitly mentioned in the training set. For example, if you ask, "Who wrote The Iliad?", it is not surprising, given what we know of LLMs, that a fine-tuned model can easily generate "Homer" as the most plausible response. The sentence "Homer wrote The Iliad" must have appeared thousands of times in different ways in the training set.
But modern LLMs can go way beyond simply retrieving the right answer to a trivia question. You can ask questions that involve a small set of reasoning steps, combining facts here and there to produce a response that is not, at least explicitly, in the training set. This is rather surprising because there is no explicit reasoning mechanism implemented in LLMs. All forms of reasoning that can be said to happen are an emergent consequence of learning to predict the next token, and that is at least very intriguing.
In any case, as I’ve argued many times, the statistical modelling paradigm has some inherent limitations that restrict the types of reasoning that LLMs can do, even in principle. This doesn't mean that, in practice, it can't work for the types of problems you encounter. But in its most general form, long-term reasoning and planning are still an open problem in artificial intelligence. I don't think LLMs alone are equipped to solve it.
You can, however, plug LLMs into external tools to enhance their reasoning skills. One of the most fruitful research lines is to have them generate code to solve a problem, and then run it, effectively making LLMs Turing-complete, at least in principle, even if in practice they may fail to generate the right code. Which leads us to the next use case.
Code generation
Since LLMs are trained to autocomplete text, it may not be that surprising that, when fed enough training examples of code in several programming languages, they can generate small snippets of mostly correct code. However, for anyone who codes, it is evident that writing correct code is not as simple as concatenating plausible continuations. Programming languages have much stricter syntax rules that require, e.g., closing every parenthesis and referencing identifiers exactly. Failing to place even a single semicolon correctly can render a program unusable.
For this reason, it is at least a bit surprising that LLMs can code. More surprising still is that they can not only autocomplete existing code but generate code from scratch given natural language instructions. This is one of the most powerful capabilities in terms of integrating LLMs with external tools because code is, by definition, the most general type of external tool. There is nothing you can do on a computer that you can't do with code.
The simplest use case in this domain is, of course, using LLMs as coding assistants embedded in developer tools like code editors. But this is just scratching the surface. You can have an LLM generate code to solve a problem it would otherwise fail to answer correctly--e.g., perform some complex physics computations. Code generation allows an LLM to analyze large collections of data by computing statistics and running formulas. You can even have an LLM generate the code to output some chart, and voilà, you just taught the LLM to draw!
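A bare-bones sketch of this idea, once more assuming the hypothetical complete(prompt) helper; executing model-generated code is risky, so a real system would sandbox it.

```python
# Use the LLM to write code that answers a question, then run that code.
def solve_with_code(question, complete):
    code = complete(
        "Write a short Python script that computes the answer to the question "
        f"below and stores it in a variable named `result`:\n{question}"
    )
    namespace = {}
    exec(code, namespace)          # run the generated code (sandbox this!)
    return namespace.get("result")
```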
Code explanation
Code explanation is the inverse problem of code generation: given some existing code, produce a natural language explanation or, more generally, answer questions about it. In principle, this is a form of question-answering that involves all the caveats about complex reasoning we have already discussed. But it gets harder.
The problem is that the majority of the most interesting questions about code cannot be answered in general: they are undecidable, meaning no algorithm can exist that will always produce the correct response. The most poignant example is the question, "Does this function ever return?". This is the well-known Halting problem, the most famous problem in computability theory, and the grandfather of all undecidability results. Similar questions, such as whether a variable is ever assigned or a method is ever called, are also undecidable in the general case.
And this is not just a theoretical issue. The Halting problem highlights one crucial aspect of computation: in the general case, you cannot predict what an algorithm will do without running it. However, in practice, as anyone who codes knows, you can predict what lots of your code will do, if only because it is similar to code you've written before. And this is where LLMs shine: learning to extrapolate from patterns to novel specific instances, even if the general problem is unsolvable.
To top it all off, we can easily imagine an LLM that, when prompted with a question that seemingly cannot be answered from the code alone, could decide to run the code with some specific parameters and observe its results, drawing conclusions not from the syntax alone but from the execution logs. A debugging agent, if you will.
Final Remarks
These are the most essential high-level tasks where LLMs can be deployed, but they span hundreds, if not thousands, of potential applications. Text classification, for example, covers a wide range of applications, just changing the classification target.
One conclusion you can draw from this chapter is that LLMs are some of the most versatile digital technologies we've ever invented. While we don't know if artificial general intelligence is anywhere near, we're definitely one step closer to general-purpose AI—models that can be easily adapted to any new domain without research or costly training procedures.
However, language modelling is not magic. The above discussion has already given us a glimpse of some of this paradigm's fundamental limitations. In future posts, we will explore how these models learn compared to humans and what this difference entails regarding their limitations.
If you want to know more about language modelling in general, and LLMs in particular, feel free to check Mostly Harmless AI. It’s jam-packed with information (most of which is published in this blog already) on the good and the ugly parts of LLMs, and lots of advice on how to get the best out of them.