Code Generation is the Next Milestone Towards AGI
How to turn a stochastic parrot into a general-purpose problem solver.
Last week, we saw how function calling is one step towards enabling LLMs to interact with external systems by giving them more flexibility in choosing which operations to perform—i.e., which methods to call—based on the conversation context. But, in the end, function calling still limits the model's capabilities. It is restricted to the preprogrammed functionalities we give it access to.
This may be precisely what you want because it gives you total control over the action space of the LLM. However, for some of the most complex tasks, it may be too hard to develop a flexible enough API.
Suppose you’re making a data analysis bot that will read a CSV file and answer questions about it. You can ask it to group, filter, or otherwise process the data in myriad ways. You could conceivably come up with a set of functions that cover your entire question space, but you would end up coding something that resembles the pandas API. What you actually want is for your chatbot to write and run pandas code!
Enter code generation.
Instead of developing a very flexible and broad set of API functions, simply let your bot write Python code (or any other programming language) and run it. If the model is sufficiently trained in code generation, it can often solve most of the low-level coding tasks you would end up encapsulating anyway.
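To make the CSV example above concrete, here is a minimal sketch of the idea. The llm() helper is a hypothetical stand-in for whatever chat model API you use; everything else is standard pandas.

```python
import pandas as pd

def llm(prompt: str) -> str:
    """Hypothetical wrapper around your chat model of choice."""
    raise NotImplementedError

def answer_question(csv_path: str, question: str):
    df = pd.read_csv(csv_path)
    # Show the model the schema so it can write code that fits the data.
    prompt = (
        "You are a data analyst. A pandas dataframe `df` has these columns: "
        f"{list(df.columns)}.\n"
        f"Write a single pandas expression that answers: {question}\n"
        "Respond with the expression only, no explanations."
    )
    code = llm(prompt)
    # Evaluate the generated expression against the dataframe.
    # In production this must run inside a sandbox (see the caveats later on).
    return eval(code, {"df": df, "pd": pd})
```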
There are many flavors to code generation, ranging in complexity from single instructions to fully working applications. There is also the question of how that code gets used. One option is to execute it immediately, maybe even transparently to the user, to obtain a response. In this case, the result from running the code is what matters rather than the code itself. On the other hand, you might want the code as the end result, maybe to copy and paste it somewhere else.
In this article, we will explore the most interesting use cases for code generation, and some tips and strategies to get the most out of language models that can code.
This article is part of my in-progress book How to Train your Chatbot, currently over 150 pages of practical advice to build all sorts of cool stuff with LLMs, including prompt engineering principles, deep but intuitive theoretical explanations of how LLMs work, and hands-on coding tutorials building actual working applications.
You can get the early access today and help Mostly Harmless Ideas grow.
How code generation works
In the simplest case, you can think of code generation as a subset of text generation. If your model is trained on mixed natural language and code input and/or fine-tuned with coding instructions, it will naturally learn to answer some prompts with actual code. For example, you can train an LLM on programming contest problems, where the input is a problem statement and the output is a solution in some programming language.
It is at least somewhat surprising that vanilla LLMs, trained on code, can learn to write code at all. The reason to be skeptical is that programming languages have very strict syntax rules, which make it hard, at least in principle, for a purely statistical language model to produce code that even parses, let alone code that is semantically correct. A single misplaced semicolon can make an otherwise perfect piece of code invalid. Yet, LLMs learn to code, almost without additional effort.
In fact, most general-purpose models now available have at least some capability for code generation, if only because they are trained on vast datasets that contain, among many other types of text, lots and lots of code. And even if you don’t want an LLM explicitly for code generation, training on code and text (rather than just text) has been shown to improve the general reasoning capabilities of a model, even for text-only tasks!
But although you can get reasonably good code generation almost for free, the best coding LLMs are fine-tuned on carefully curated text-and-code datasets.
There are many reasons to prefer a model fine-tuned for coding over a general one. The most straightforward argument is that, unlike natural language, highly plausible code can still be incorrect. Fine-tuning a model specifically on code reinforces the syntax rules and makes it much less likely to generate almost-correct but still syntactically wrong code.
In the same vein, since programming languages are much more rigid in syntax than natural language, fine-tuning can make a smaller model as good or even better than larger, general models if focused on a specific language. Likewise, even if your general LLM can code Python, it may not know the specific framework you’re interested in or code with the exact style you want.
Code generation use cases
In this section, we’ll look at code generation from a high-level perspective to understand the most interesting use cases it unlocks. We won’t go deep into the technical details of how to make these use cases work in practice, as we will have plenty of time in Part 3 of the book to see concrete examples in action.
Code completion
The simplest use case for code generation is straightforward code completion. This can be as simple as asking ChatGPT for the code to solve a concrete problem without any additional context. However, this use case becomes much more interesting when you can provide the LLM with relevant context (existing code, class hierarchies, function declarations, etc.) and have it produce code that fits right into an existing codebase.
At its core, code completion is just a form of text completion, with all the usual quirks and caveats. The same prompt can produce different results, and slightly different prompts can vary widely in output quality. All the prompt techniques we’ve discussed so far apply: format your prompt carefully, be precise, don’t overshare, use examples whenever possible (no negative examples, please), and be intentional and cognizant of all the usual gotchas.
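As a rough illustration, a context-aware completion prompt could look like the sketch below; llm() is the same hypothetical wrapper as before.

```python
def complete_code(existing_code: str, instruction: str) -> str:
    # Show the model the surrounding code so the completion fits the codebase,
    # then state precisely what to produce, following the usual prompt advice.
    prompt = (
        "You are completing code in an existing Python project.\n"
        "Relevant context:\n\n"
        f"{existing_code}\n\n"
        f"Task: {instruction}\n"
        "Respond with only the new code, matching the style of the context."
    )
    return llm(prompt)  # hypothetical model wrapper, as before
```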
Code maintenance
Code maintenance is a slightly more advanced form of code completion, in which we ask the model not for code that supports new functionality but rather to modify existing code (or add to it) to improve its quality, maintainability, clarity, etc. A typical example is automatic refactoring: asking the model to, e.g., extract some functionality into its own function or otherwise abstract or encapsulate relevant fragments of code.
This process can be enhanced with prompt templates for everyday tasks, including generating boilerplate code or running typical refactors like splitting methods, encapsulating functionality, or changing style from, e.g., a recursive implementation to an iterative one.
Another form of code maintenance is adding unit tests. A careful explanation of the relevant functional requirements in natural language might be enough to have a model generate reasonably good unit tests for the most common use cases.
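As a sketch, a test-generation prompt can pair the function’s source with a plain-language description of its requirements, again using the hypothetical llm() helper:

```python
def generate_tests(function_source: str, requirements: str) -> str:
    prompt = (
        "Write pytest unit tests for the following function.\n\n"
        f"{function_source}\n\n"
        f"The function must satisfy these requirements:\n{requirements}\n"
        "Cover the common cases and at least one edge case. "
        "Respond with only the test code."
    )
    return llm(prompt)  # hypothetical model wrapper
```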
Translation and explanation
The previous use cases are mostly examples of language-to-code generation. On the other hand, we can have code-to-code and code-to-language generation.
The first case is helpful for code translation. A simple example is translating code from one programming language to another, perhaps because you found the solution to your problem online, but it’s not in the programming language you need. But you can also translate between two versions of the same language, say, from Python 2 to Python 3, to update an implementation. Or translate between different frameworks or different implementation styles.
The second case is helpful for automatically adding comments to existing code or otherwise generating explanations at any level of detail. As usual, the accuracy of the explanation depends on how powerful the model is and how complex the code is. In general, it is impossible to know what a given piece of code will do without executing it, but you can get pretty far, at least in the most common scenarios.
Using external tools
Now, we get into the domain of code as a means to an end rather than the end in itself. You can use code generation to interface with external tools that either don’t have a high-level functional API or that, by their nature, are code-based.
An example of the former is using any of the myriad utilities in your terminal via Bash scripts. If you want your LLM to be capable of, e.g., creating files, making commits, downloading stuff, compressing and moving things around, etc., it is very likely that a reasonably good LLM can generate a Bash one-liner or small script to work these tasks out.
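A minimal sketch of that workflow, with the same hypothetical llm() helper: ask for a one-liner, then run it with subprocess (and, as discussed in the caveats below, never outside a sandbox).

```python
import subprocess

def run_shell_task(task: str) -> str:
    prompt = (
        f"Write a single Bash one-liner to accomplish this task: {task}\n"
        "Respond with the command only."
    )
    command = llm(prompt)  # hypothetical model wrapper
    # Run the command in a shell, capture output, and kill runaway commands.
    # This should only ever happen inside a sandboxed environment.
    result = subprocess.run(
        command, shell=True, capture_output=True, text=True, timeout=30
    )
    return result.stdout if result.returncode == 0 else result.stderr
```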
In the latter case, you may want to interface with code-based tools, such as SQL databases, or any special-purpose computing engines, from Wolfram Alpha to a SAT solver or an in-house tool. If the language used by that tool is not mainstream—meaning, the LLM won’t be trained on it—you’ll need to fine-tune on it.
And finally, you can interface with programming frameworks that have, e.g., a Python-based API. Again, unless the framework is very well-known—e.g., sklearn—you may need to fine-tune your model to teach it how to use that concrete API. But in many cases, the model might generalize from its basic Python knowledge to specific APIs with a small set of carefully curated k-shot examples.
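Concretely, the few-shot approach is just a handful of curated question–code pairs pasted into the prompt. The quickplot API below is imaginary; the point is the prompt structure, not the library.

```python
# Hypothetical examples for an imaginary in-house API called `quickplot`.
FEW_SHOT_EXAMPLES = [
    ("Plot column 'price' over time.",
     "quickplot.line(df, x='date', y='price')"),
    ("Show a histogram of 'age' with 20 bins.",
     "quickplot.hist(df, column='age', bins=20)"),
]

def build_api_prompt(question: str) -> str:
    # Each curated example shows the model both the intent and the exact API style.
    shots = "\n".join(f"Q: {q}\nA: {a}" for q, a in FEW_SHOT_EXAMPLES)
    return f"{shots}\nQ: {question}\nA:"
```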
Enhanced reasoning
LLMs are lousy at mathematical and logical reasoning. This is somewhat surprising at first because computers are supposed to be precise mathematical machines. However, when you understand how LLMs “think”, you realize they don’t have any explicit mechanism for even the simplest mathematical operations. But you know what does? Python! (and any other programming language).
Code generation is the most effective way to enhance the mathematical skills of LLMs. Instead of having the model directly answer questions involving mathematical operations, make it generate a short code expression that computes the right formula, run it, and feed the result back to the LLM. This way, you can “teach” an LLM to solve complex mathematical problems by doing the same thing we humans do: using the right tool.
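In its simplest form this is a generate–evaluate–respond loop; the sketch below reuses the hypothetical llm() helper and restricts eval to the math module.

```python
import math

def solve_math_question(question: str) -> str:
    # Step 1: ask for a Python expression instead of the numeric answer.
    expression = llm(
        "Write a single Python expression (you may use the math module) "
        f"that computes the answer to: {question}\n"
        "Respond with the expression only."
    )
    # Step 2: compute the result ourselves, which the LLM cannot do reliably.
    result = eval(expression, {"math": math})
    # Step 3: feed the result back so the model can phrase the final answer.
    return llm(
        f"The question was: {question}\n"
        f"The computed result is: {result}\n"
        "Write a short, clear answer for the user."
    )
```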
But the possibilities go far beyond simple (or fairly involved) mathematical operations. You can pair an LLM with any of the many specialized inference engines the AI community has invented over the decades. Make your LLM generate Prolog code and voilà, you have a general-purpose logic reasoning engine driven by natural language. Or make it generate SymPy expressions, and you have a symbolic reasoner.
But this is no silver bullet, of course. Your LLM can simply fail to generate the right code. So, even if you have the perfect computing engine that solves the right problem, getting a language model to generate semantically correct code for that engine is an open problem and one which is ultimately unsolvable according to basic computability theory.
However, for many practical cases, given enough examples for k-shot or a small fine-tuning process, you can get an LLM to learn how to solve interesting problems reasonably well. This is an active area of research, so we can only expect these capabilities to improve in the near future.
Prompting tips for code generation
In many cases, you can make an LLM write code simply by asking. A prompt like “Generate a Python function to find the third maximum element from a list” will work almost flawlessly in any sufficiently capable language model you can find today. This works fine for many use cases where the code is all you need. At least, it is no worse than searching for a similar snippet of code online.
However, there are several drawbacks to this KISS approach. First, most commercial LLMs are fine-tuned for chat, so they are… chatty. Instead of the raw code, they might answer with something like “Sure, here is a function in Python to do ….” and then the code. This makes it hard to integrate them with external tools that need just the code because then you have to parse the response.
In many cases, you can get away by adding an explicit instruction like “Please answer just with the source code”, but still, some models may refuse to comply. And even if they comply, different models output code in different formats. Some will enclose the code in markdown-style code block annotations, while other models might indent the code. It depends heavily on their training data.
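A common workaround is to strip whatever formatting the model adds on your side. For example, here is a small sketch that pulls the first fenced code block out of a chatty response and falls back to the raw text:

```python
import re

# Matches a markdown-style fenced block: three backticks, an optional
# language tag, then the code itself.
FENCED_BLOCK = re.compile(r"`{3}\w*\s*\n(.*?)`{3}", re.DOTALL)

def extract_code(response: str) -> str:
    match = FENCED_BLOCK.search(response)
    if match:
        return match.group(1).strip()
    # No fences found: assume the whole response is already raw code.
    return response.strip()
```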
Another problem you may face is when asking for one-liners, i.e., single instructions or expressions that you want to evaluate with, e.g., the eval function in Python. If you ask for a single pandas expression to, say, group and filter a dataset, the model may sometimes produce a proper expression—e.g., df.groupby(...).agg(...)—and other times an instruction—e.g., df = df.groupby(...). You may work around these issues by doing some checking and post-processing of the response, like removing anything before the last = sign, but this is a very brittle approach.
In these cases, some of our well-known prompt techniques also apply. Be very intentional with the prompt and provide positive examples of the exact response format you expect. While none of this will 100% guarantee a response in the format you need, when paired with a try-and-repeat strategy, you can often reach the performance you need. For example, if the model makes a mistake 10% of the time, you’ll need to redo one in ten queries on average, which is not terrible.
In many cases, when retrying the same code generation task, it helps to include the previous answer and the error in the prompt. This can often be automated simply by trying to run the code, capturing any exceptions, and feeding the model back with the exception message, asking it to try to fix it.
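A rough sketch of that retry loop, again with the hypothetical llm() helper and a run_code() stand-in for whatever sandboxed executor you use:

```python
def generate_and_run(task: str, max_attempts: int = 3):
    code = llm(
        f"Write Python code for the following task:\n{task}\n"
        "Respond with code only."
    )
    for _ in range(max_attempts):
        try:
            return run_code(code)  # hypothetical sandboxed executor
        except Exception as error:
            # Feed the failing code and the error back so the model can fix it.
            code = llm(
                f"The task was:\n{task}\n\n"
                f"This code failed:\n{code}\n\n"
                f"The error was: {error}\n"
                "Return a corrected version, code only."
            )
    raise RuntimeError("Could not produce working code after several attempts")
```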
Finally, with some tricks, we can force the LLM to produce syntactically correct code—even if it is not guaranteed semantically valid. The trick is to restrict the sampling step to only select among the tokens that would be syntactically valid.
Some open-source LLM inference engines, like llama.cpp, support passing a formal grammar that defines the programming language's syntax. During sampling, the engine will select only those among the top-k tokens that are valid according to the grammar's production rules. This can be done efficiently with a linear-complexity automaton constructed automatically from the formal grammar definition. While this is a relatively novel and, arguably, rather advanced feature, some commercial APIs are starting to support it.
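Conceptually, the mechanism looks roughly like the sketch below. None of these helpers are a real API—they are hypothetical stand-ins for what an engine like llama.cpp does internally when given a grammar.

```python
def grammar_constrained_generation(model, grammar_state, prompt_tokens, max_tokens=256):
    tokens = list(prompt_tokens)
    for _ in range(max_tokens):
        logits = model.next_token_logits(tokens)   # hypothetical model call
        candidates = top_k_tokens(logits, k=40)    # hypothetical helper
        # Keep only the candidates the grammar can accept in its current state.
        valid = [t for t in candidates if grammar_state.accepts(t)]
        if not valid:
            break  # no legal token has any probability mass left
        chosen = sample_from(valid, logits)        # hypothetical sampler
        grammar_state = grammar_state.advance(chosen)
        tokens.append(chosen)
    return tokens
```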
Limitations and caveats
Needless to say, code generation is full of subtle and not-so-subtle problems. For starters, some hallucinations are going to happen, and this might result in several different types of problems. The simplest case is getting code that is not syntactically correct, that is, code that doesn’t parse. If this is your main problem, then you’re lucky because this is simple to check. Just run a linter for your target language and retry if you find any syntax errors.
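For Python, the syntax check is essentially free with the standard library, which makes a check-and-retry loop cheap:

```python
import ast

def is_valid_python(code: str) -> bool:
    """Return True if the code parses; says nothing about what it does."""
    try:
        ast.parse(code)
        return True
    except SyntaxError:
        return False
```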
A more complicated issue is when your model generates syntactically correct code that throws an exception. This is still not terrible because you can run the code and check for exceptions. However, running code generated by an LLM is a bad idea if you don’t have some guardrails. For all you know, the code may have an instruction to wipe out your hard drive. So, you must always run LLM-generated code in a sandboxed environment. This is especially true when running code generated by a user-facing LLM. Someone will hack that LLM to generate some system-breaking instruction.
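Real sandboxing means containers, virtual machines, or a dedicated execution service. But even then, the bare minimum is isolating the generated code in its own process with a timeout, roughly like this sketch:

```python
import subprocess
import sys
import tempfile

def run_isolated(code: str, timeout: int = 10) -> str:
    # Write the generated code to a temp file and run it in a separate process.
    # This is NOT a real sandbox: in production, run it inside a locked-down
    # container with no network access and a throwaway filesystem.
    with tempfile.NamedTemporaryFile("w", suffix=".py", delete=False) as f:
        f.write(code)
        path = f.name
    result = subprocess.run(
        [sys.executable, path], capture_output=True, text=True, timeout=timeout
    )
    return result.stdout if result.returncode == 0 else result.stderr
```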
The third level of problem is when your LLM generates code that runs without exceptions but doesn’t do the right thing. This is, in general, impossible to detect beforehand. However, depending on your specific use case, you may be able to check the result is what you expect and even roll back any potential side effects if that isn’t the case. For example, you can have your code work on a copy of the relevant data and check for any unexpected changes before merging that data.
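For example, with a dataframe you can apply the generated code to a copy and only accept the result after a few sanity checks. A rough sketch, assuming (for illustration) that the generated code modifies df in place:

```python
import pandas as pd

def apply_generated_code(df: pd.DataFrame, code: str) -> pd.DataFrame:
    working_copy = df.copy()
    # The generated code only ever sees the copy, never the original data.
    exec(code, {"df": working_copy, "pd": pd})
    # Sanity checks before accepting the result.
    if list(working_copy.columns) != list(df.columns):
        raise ValueError("Generated code changed the schema unexpectedly")
    if len(working_copy) == 0 and len(df) > 0:
        raise ValueError("Generated code dropped all rows")
    return working_copy
```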
However, this is the most important open problem in program synthesis from natural language. To fully solve it, I believe a new paradigm that goes beyond statistical language modeling is required.
Grammar-restricted output is one of the most effective ways to make code generation more robust. Still, this mechanism is ad hoc and not baked into the training process. Thus, the LLM can get stuck simply because it doesn’t assign a high probability to any valid token. If the LLM wouldn’t naturally produce the correct response, at least with some non-trivial probability, no ad-hoc filtering mechanism can force it to generate the correct code.
This means adequate prompting and possibly fine-tuning for specific domains will remain relevant strategies in the near term.
Conclusions
Code generation is one of the most surprising (weakly) emergent capabilities in the current language modeling paradigm. The fact that pure statistical correlations between tokens give rise to something that can mostly code—granted, at the level of maybe a 2nd year CS college student, in the best of cases—is something I wouldn’t have expected to be possible even three years ago. On the other hand, strong code generation is one of the most powerful and versatile capabilities in language models. It opens the door for a myriad of integrations with existing and newly created tools.
A general-purpose language model paired with a general-purpose code interpreter is, no doubt, one step closer to AGI. If Turing is right—and would you bet he isn’t?—any problem that can be solved with an effective, repeatable, systematic method can be solved with any modern general-purpose programming language. An LLM that can code at the level of the best human programmers would, almost by definition, have general intelligence.
The only gap that needs bridging is getting the model to produce the right code. But this might well be the hardest problem in computer science. We know that it is generally impossible to know what a program will do without running it. But this doesn’t mean machines are necessarily less capable than humans. Who says our brains aren’t just very powerful machines?
But all of that is hypothetical. In the meantime, even with the many limitations of modern LLMs, code generation is one of the most useful tools available for building practical applications with language models.
We will spend a lot of time in Part 3 of the book playing around with different strategies to turn LLMs into effective coders. Remember, you can get your early access copy today at a laughably low price. It is the best way to support these free educational articles.