Why are LLMs so versatile?
Wait, isn't this just autocomplete on steroids? Well, it is, but that misses the whole point. Here's how LLMs really work.
In previous articles, we have explored many issues with artificial intelligence in general and language models in particular. Now let me say this out loud: although I've been critical and skeptical about current machine learning approaches, I am a huge believer in artificial intelligence. I teach, research, and work on AI projects, betting my career on the success of this field. My criticism stems from wanting AI to truly realize its full potential as a transformative technology and science.
So, to shift the focus a bit towards the optimistic side, let's begin discussing the positive aspects of AI and how existing paradigms can succeed in specific domains when applied wisely. In upcoming articles, I'll share with you examples from my own research and work, showcasing how we use AI to improve various aspects of human life.
But before diving into specific applications, let's explore why language models are versatile and effective despite their limitations. This will help dispel common misconceptions.
When interacting with powerful language models like GPT-4, Claude, or Gemini, it may seem that something truly intelligent is underneath. However, some argue that there is no real understanding or meaning behind these models. Skeptics might dismiss them as overly simplistic, but there's something intriguing about their effectiveness. Why do they feel so good and work so well in so many situations?
By the way, if you've only experienced the first wave of language models from a year or so ago, it's time for you to dive deeper. Try out the latest generative language models like GPT-4 with Microsoft’s Copilot or Google Gemini on their free tiers. It’s a whole different thing. GPT-3.5 was impressive, but pretty soon you would see its limitations, as it would fail to understand even simple semantic connections.
However, GPT-4 and other frontier models are much better at grasping what you're saying. Often, if they don't initially succeed in solving a task, you can rephrase the prompt or conversation to guide them toward the solution. The true limitations of these models are still unknown, and many tasks deemed "unsolvable" by GPT-4 might just need the right prompt.
Given any random cognitive task today, chances are, with enough tinkering, someone can find a way to make GPT-4 tackle it. This doesn't mean these models can do everything, of course. But it does mean that whenever you think you've found a task they can't solve, try harder with a better prompt before declaring defeat (or victory, depending on your brand of ego).
So, in this article, we'll explore the capabilities of language models and begin to understand why they excel at their tasks. We'll examine the components of a large language model as well as their training process. By understanding these aspects, we can appreciate the nuances of the statistical language modeling paradigm and what makes these tools so versatile.
The prowess of modern language models lies in a combination of factors. The first one is sheer scale. The enormous amount of data and computational power used in creating these models surpasses anything previously attempted in machine learning. This significant increase in scale largely accounts for the improvements seen in GPT-3 and GPT-4 era language models compared to earlier versions like GPT-2 and BERT.
Another crucial factor is the fine-tuning process. Simply training massive language models on humongous amounts of text doesn't produce high-performing models like Claude or GPT-4; instead, it yields brittle systems that are very difficult to use. Additional fine-tuning steps are necessary to enhance their effectiveness and usability.
Lastly, when deploying language models like ChatGPT in practical settings, certain techniques can be employed to optimize user interaction. Properly formatting user input and connecting with tools to filter both input and output helps improve overall performance. We will see examples of this aspect in future articles.
Let’s dive in!
How do LLMs work?
A generative language model, at its core, is just a statistical machine learning model trained to predict the continuation of a text sequence. Essentially, it’s a text completion algorithm. You run a text prefix through the model and get the most likely next token—a token is more or less a word or component of a word.1
To use an LLM, we start with the user input, like a query or text prefix, and run the model to get the next token. We append it to the sequence and repeat the whole process until reaching a maximum number of tokens or until the model outputs a special STOP token.
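To make the loop concrete, here is a minimal sketch in Python. The `model` and `tokenizer` objects are hypothetical stand-ins, not any specific library's API: assume the model maps a list of token ids to a probability for every possible next token.

```python
def generate(model, tokenizer, prompt, max_tokens=200):
    tokens = tokenizer.encode(prompt)             # the user input as a list of token ids
    for _ in range(max_tokens):
        probs = model(tokens)                     # probability of every possible next token
        next_token = max(range(len(probs)), key=lambda t: probs[t])  # pick the most likely one
        if next_token == tokenizer.stop_token:    # the special STOP token
            break
        tokens.append(next_token)                 # append and repeat
    return tokenizer.decode(tokens)
```

Picking the most likely token every time, as this sketch does, is only one possible strategy, which brings us to the choices below.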
There are choices to make in this process, though. Choosing only the most likely continuation can quickly lead to repetitive generation. Instead, you can choose from the top 50 or so most likely tokens at random, weighted by their probability. This injects some variety into the generated text and is the reason why the same prompt gets you different, albeit semantically similar, responses.
There are a few key parameters in this sampling process you should know about: the top K tokens to choose from, their cumulative probability, and the temperature, which is the most useful one to tweak. The temperature affects the weights of the candidate tokens, increasing or decreasing the probability of the top tokens. If the temperature is 0, you always choose the most likely token. If it’s higher, the probabilities are smoothed out, making less probable tokens more likely to be chosen. This increases the model’s variability.2
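Here is a sketch of such a sampling step, assuming we have the model's raw scores (logits) for every token in the vocabulary. The defaults are illustrative, not taken from any particular library.

```python
import numpy as np

def sample_next_token(logits, temperature=0.7, top_k=50):
    logits = np.asarray(logits, dtype=np.float64)
    if temperature == 0:
        return int(np.argmax(logits))           # deterministic: always the top token
    logits = logits / temperature               # low temp sharpens, high temp flattens
    top = np.argsort(logits)[-top_k:]           # indices of the k highest-scoring tokens
    probs = np.exp(logits[top] - logits[top].max())
    probs /= probs.sum()                        # softmax restricted to the top-k candidates
    return int(np.random.choice(top, p=probs))  # weighted random pick
```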
From this perspective, you can already see why some people say language models are “just autocomplete on steroids”. Certainly, that is the gist of their mechanics: you’re completing a text sequence by adding one token at a time until you decide to stop. However, this is just scratching the surface. There is so much more involved in getting these models to behave in a useful way, and we will talk about some of those aspects in the next section.
But before moving on, here is a key insight from this explanation of how LLMs work: A language model always performs a fixed amount of computation per token.
This means that, whatever limited form of “reasoning” can be said to happen in an LLM, the depth and complexity of that reasoning are directly proportional to the total number of tokens the model processes. This implies two things:
If the input prompt is larger, the model will perform more computation before starting to compute its answer. This is part of the reason why more detailed prompts are better. But crucially, if the output is larger, the model is also doing more computation.
This is why techniques like chain-of-thought—and basically anything that makes a model “talk more”—tend to improve their performance at some tasks. They have more computing power available to do whatever reasoning they can do.
If you ask a model a quick question and instruct it to give a one-word answer, the amount of computing power spent producing that answer is proportional to just the input size. Basically a fancy dictionary lookup. But if you ask the model to produce step-by-step reasoning before the final answer, there is a higher chance you’ll get a better answer just by spending more computation.
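As a toy illustration, compare these two ways of asking the same question through a hypothetical `complete()` helper that runs the model and returns its continuation. The second one forces the model to emit more tokens, and therefore to spend more computation, before committing to an answer.

```python
question = "Is 1,007 a prime number?"

short_answer = complete(question + " Answer with a single word.")
reasoned_answer = complete(question + " Think step by step, then give the final answer.")
```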
At the risk of anthropomorphizing too much, I like to summarize this insight as follows: LLMs only think out loud. If you want them to “think” better, get them to “talk” more.
So, this is how a language model works from a user perspective. Let’s see how you build one.
How to make an LLM
How do you make your own language model? There are three main steps.
Base training
The first step is called self-supervised pre-training. In this step, you take a raw transformer architecture with randomly initialized weights and train it on a massive amount of text to predict the next token. You use a large corpus of data, such as news, internet blog posts, articles, and books, and train the model on trillions of words.
The simplest training method is next token prediction. You show the model a random text and ask it what the next token is. Take a random substring from the dataset, remove the last token, show the prefix to the model, and ask for likely continuations. Compute a loss function to determine how mistaken the model was in its predictions and adjust it slightly to improve future predictions.
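Conceptually, a single pre-training step looks something like the following PyTorch-style sketch, where `model` stands in for the transformer (mapping token ids to next-token logits), `optimizer` is a standard optimizer, and `batch` is a chunk of token ids sampled from the corpus. This is an illustration of the idea, not anyone's actual training code.

```python
import torch.nn.functional as F

def training_step(model, optimizer, batch):
    # batch: (batch_size, seq_len) tensor of token ids from the corpus
    inputs, targets = batch[:, :-1], batch[:, 1:]   # each position predicts the token after it
    logits = model(inputs)                          # (batch_size, seq_len - 1, vocab_size)
    loss = F.cross_entropy(
        logits.reshape(-1, logits.size(-1)),        # flatten all positions
        targets.reshape(-1),                        # the "correct" next tokens
    )
    optimizer.zero_grad()
    loss.backward()                                 # how mistaken was the model?
    optimizer.step()                                # adjust the weights slightly
    return loss.item()
```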
So far this is a standard machine learning approach. We call it self-supervised learning because the targets are not given by humans, but chosen automatically from the input. But deep down, this is just supervised learning at scale.
Now, that being said, scaling this training process to billions of parameters and trillions of tokens presents a massive engineering challenge. No single supercomputer in the world can handle training GPT-4 from scratch, so you must resort to distributed systems that split the model and its training across hundreds or thousands of GPUs for extended periods, synchronizing the different parts across multiple computers for efficient training.
This just means that while the conceptual part of training an LLM is pretty straightforward, building something like GPT-4 is nothing short of an engineering feat.
Once pre-training is completed, you have what is called a “base model”, a language model that can continue any sentence in a way that closely resembles naturally occurring human text. This model is already extremely powerful. Give it any prefix of text with any content whatsoever and the model will complete it with a mostly coherent continuation. It’s really autocompletion on steroids!
What can base models do?
Autocompletion on steroids, as hyped as it may be, doesn’t really sound like anything smart, right? Well, it turns out that if you are very, very good at completing any text prefix, you must be good at a wide range of cognitive tasks.
For example, suppose you want to build a question-answering engine. Take a question like “Who is the current president of the United States?” and turn it into a prompt like “The current president of the United States is…”. If you feed this to a powerful base LLM, the most likely continuation is the correct answer to the question. This means autocomplete on steroids gives you mostly correct question answering for free.
And you can do this for a whole lot of tasks. Just turn them into an appropriate prefix and continuation. Do you want to translate a sentence? Use a prompt like “An English translation of the previous sentence is…” Do you want to summarize a text? Use a prompt like “A summary of the previous text is…” You get the point.
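In code, this amounts to little more than string formatting. The sketch below assumes a hypothetical `complete(prompt)` function that runs the base model and returns its continuation; the templates mirror the prompts above.

```python
def answer_question(question):
    return complete(f"Question: {question}\nAnswer:")

def translate_to_english(sentence):
    return complete(f"{sentence}\nAn English translation of the previous sentence is:")

def summarize(text):
    return complete(f"{text}\nA summary of the previous text is:")
```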
But it goes much further than that! The scientists at OpenAI discovered that models the size of GPT-3 and above were capable of inferring the semantics of a task from a few examples, without being explicitly told what the task is. This is called in-context learning, and it works wonders. For example, if you want to use an LLM for sentiment analysis, you can use a prompt like
Comment: This movie was so good!
Sentiment: Positive
Comment: This movie really sucks.
Sentiment: Negative
Comment: The book was better.
Sentiment: Neutral
Comment: I couldn't stop looking at the screen!
Sentiment:
That is, you build a prompt with a few examples of inputs and outputs and feed that to the LLM, leaving the last input unanswered. The most likely continuation is the right answer to the last input, so provided the base model has seen similar patterns in its training data, it will pick up the pattern and answer correctly most of the time.
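Programmatically, in-context learning is again just prompt construction. Here is a sketch, reusing the same hypothetical `complete()` helper as before.

```python
def few_shot_prompt(examples, query):
    lines = []
    for comment, sentiment in examples:
        lines.append(f"Comment: {comment}")
        lines.append(f"Sentiment: {sentiment}")
    lines.append(f"Comment: {query}")
    lines.append("Sentiment:")        # leave the last answer for the model to fill in
    return "\n".join(lines)

examples = [
    ("This movie was so good!", "Positive"),
    ("This movie really sucks.", "Negative"),
    ("The book was better.", "Neutral"),
]
label = complete(few_shot_prompt(examples, "I couldn't stop looking at the screen!"))
```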
However, these base models, as powerful as they are, are still very hard to prompt. Crucially, they do not understand precise instructions, mostly because their training data doesn’t contain a lot of examples of instructions. They are just wild stochastic parrots. The next step is to domesticate them.
Instruction tuning
At this point the LLM already has all the knowledge in the world somewhere hidden in its weights—metaphorically speaking—but it is very hard to locate any concrete piece of knowledge. You must fiddle with turning questions into just the right prompts to find a pattern that matches what the model has seen.
The solution to this problem is to include another training phase, but this time, it is much shorter and focused on a very well-curated dataset of instructions and responses. Here, the quality is crucial, much more than the quantity. You won’t teach the model anything new, you will just tune it to expect instruction-like inputs and produce answer-like outputs.
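A single training example in such a dataset might look like the record below. The exact schema varies from dataset to dataset, so take the field names and content as purely illustrative.

```python
example = {
    "instruction": "Summarize the following paragraph in one sentence.",
    "input": "Large language models are statistical models trained on huge text corpora ...",
    "response": "Large language models predict the next token in a sequence, "
                "which lets them complete and generate text.",
}
```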
Once finished, you have what’s called an instruction-tuned model. These models are much more robust and easy to prompt compared to the base model, and this is the point where most open-source models end. But this is not the end of the story.
Instruction-tuned models are still not suitable for user-facing products for one crucial reason: they can be coerced into answering anything at all, including producing biased, discriminatory, or hateful speech, as well as instructions on how to build anything from nuclear bombs to deadly viruses.
Given base models are trained on the whole Internet, they are full of all the good and bad you can read online—although some effort is put into cleaning the pre-training dataset, it’s never enough. We must teach the model that some questions are better left unanswered.
Preference tuning
The final step is to fine-tune the model to produce answers that are more closely aligned with user preferences. This is primarily used to avoid biased or hateful speech and to refuse questions deemed unethical by the developers training the model. However, it also has the effect of making the model more polite, more verbose or concise, and so on.
The way this process works is to turn the problem from supervised learning into the realm of reinforcement learning. In short, the main difference is that, while in supervised learning we give the model the correct answers (as in instruction tuning), in reinforcement learning we don’t have access to ground truth answers.
Instead, we use an evaluator that ranks different answers provided by the LLM and a feedback loop that teaches the LLM to approximate that ranking. In its original inception, this process was performed with human evaluators, thus giving rise to the term “reinforcement learning from human feedback”, but since including humans makes the process slower and more expensive, smaller organizations have turned to using other models as evaluators.
For example, if you have one strong model, like GPT-4, you can use it to rank the responses of a smaller, still-in-training model. This is one instance of a more general concept in machine learning called “knowledge distillation”, in which you attempt to compress the knowledge of a larger model into a smaller one, gaining efficiency without sacrificing too much performance.
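Whatever the evaluator (a human or a stronger model), the data it produces boils down to ranked answers for the same prompt, something like the illustrative record below, which the feedback loop then uses to nudge the model toward the preferred style. The field names are hypothetical, not any specific dataset's format.

```python
preference_example = {
    "prompt": "Explain why the sky is blue.",
    "chosen": "Sunlight scatters off air molecules, and shorter (blue) wavelengths "
              "scatter the most, so the sky looks blue to us.",
    "rejected": "It just is.",
}
```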
And finally, we now have something that works like GPT-4. The process was long and expensive: a massive pre-training followed by carefully curated instruction tuning and human-backed preference tuning. This is the reason why so few organizations have the resources to train a state-of-the-art large language model from scratch.
Beyond vanilla LLMs
Once an LLM is deployed into production, the most basic application you can implement is a ChatGPT-clone: a chat interface where you can interact with a powerful model and get it to work for you. But this is far from the limit of what current models can do.
With careful prompting and some augmentation techniques, you can integrate an LLM into more traditional applications to work either as a powerful natural language frontend or as a backend tool for language understanding. This is where LLMs can really shine, beyond the basic chatbot application.
You have to be careful, though. There are many common pitfalls to using these models, including some inherent limitations like the dreaded hallucinations, which, although they can be mitigated to a certain extent, are probably impossible to solve altogether without a paradigm shift.
However, despite their many limitations, large language models are one of the most transformative computational tools we’ve ever invented. Learning to harness their power will supercharge your skills, in whatever field you are working. The next article will explore the many techniques to integrate language models into existing or novel applications and bring truly transformative AI-powered apps to life.
And that’s all for today! If you want to learn more about LLMs and how to use them in practice, check out my in-progress book, How to Train your Chatbot.
1. Actually, you don’t really get just the most likely next token. The model provides a distribution across all possible tokens, giving you the probability of each one being the next continuation.
2. This is the reason why some people call low temperature a “precise mode” and high temperature a “creative mode”. There is nothing intrinsically more precise or creative either way; it’s just making the output more or less deterministic.