Coding is Dead, Long Live Coding!

Code generators: how they work, what their limitations are, and do they spell an end to coding as we know it?

Sep 22, 2023

Few developments in the generative AI space have been as exciting lately as the rise of code generators.

Code generators are machine learning models that can take a natural language prompt and some contextual code and produce new code that mostly aligns with the prompt's intention. The simplest case involves defining a function, its arguments, and comments describing how it works; the code generator will then provide the function's body. For example, you can input a function signature and a comment saying, “This function finds the minimum of an unsorted list”. The generated code would fit these specifications.
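As a concrete illustration, the prompt could be the signature and docstring below, and the body is the kind of completion a generator might return. The name `find_minimum` and the exact body are assumptions made up for this sketch, not the output of any particular model.

```python
def find_minimum(values):
    """This function finds the minimum of an unsorted list."""
    # Everything below the docstring is what a code generator
    # might plausibly fill in, given the signature and the comment.
    if not values:
        raise ValueError("values must not be empty")
    smallest = values[0]
    for value in values[1:]:
        if value < smallest:
            smallest = value
    return smallest


print(find_minimum([7, 3, 9, 1, 4]))  # prints 1
```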

GitHub Copilot was one of the first major breakthroughs (or at least public breakthroughs) in this space, sparking both praise and controversy: it is impressively accurate in many situations, but it also sometimes reproduces code verbatim from private or open-source repositories without the proper licensing information. Code generators work similarly to language models, with an uncanny ability to comprehend and generate code based on human communication, but they also need some special considerations because code is not natural language.

In this first edition of Mostly Harmless, we’ll dive into this topic and examine how these systems work, their potential applications and limitations, and answer a fundamental question regarding whether or not they signal an end to coding: does anybody ever need to learn how to program again?

Building an LLM for code generation

Code generators are essentially large language models, like GPT, trained on vast amounts of code. For that reason alone, we should expect code generators to learn to generate code based on the previous context, i.e., the variables, methods, classes, and generally the structure of existing code.

For example, if you have variables declared in the context, it is more likely that the code you generate will contain references to those same variables because, in training code, you refer to existing variables more often than you define a new identifier.

Now, if you train on data where code is interspersed with natural language, like function documentation or inline comments, it is natural that these comments will refer to the semantics of the surrounding code. Thus, it is also somewhat natural that language models can learn to generate code from a natural language description.

Up to this point, all we have discussed is plain, unsupervised training of language models, but on code data. As we know, models like ChatGPT are first trained this way, then fine-tuned on instructions, and finally refined with reinforcement learning from human feedback.

And you can do the same thing with code. You can compile a set of instruction pairs that say, in natural language, "modify the previous code so that there are no bugs", or "in the previous code change this variable for this other variable". These examples will not appear naturally in training code taken from GitHub, so you need to bake them in with instruction fine-tuning.

And finally, you can do reinforcement learning from human feedback (RLHF), evaluating different versions of the same code. Some of these will have better style or naming conventions that adhere more closely to established standards among developers. Using RLHF, you can steer a model to not only generate syntactically correct code, but also respect some desired style. The same methodological approach leading to a world-class text-based chatbot can also lead you to a world-class code generator.

However, there are still limitations, since code is a formal language with much stricter syntactic rules than natural language. For instance, you cannot have unbalanced parentheses or ill-formed arithmetic expressions, and a model that learns only from data may occasionally break these rules. While such constraints may be harder to learn from data, we can bake them into our code generator in at least two different ways.

One of the simplest but most effective ways is to use trial and error. You can have a prompt, make your code generator produce a bunch of continuations, lint those continuations with, say, a Python linter, and reject the ones with parsing errors. You cannot do this with natural language, because checking that natural language is well-formed is as hard as generating natural language.
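A minimal sketch of this generate-and-filter loop, assuming the candidates have already been sampled from some model. It uses the standard library parser (`ast.parse`) as a stand-in for a full linter, and the candidate strings are invented for illustration.

```python
import ast


def keep_parseable(candidates):
    """Return only the candidates that parse as valid Python."""
    valid = []
    for source in candidates:
        try:
            ast.parse(source)
        except SyntaxError:
            continue  # reject: unbalanced parentheses, bad expressions, etc.
        valid.append(source)
    return valid


# Hypothetical continuations sampled from a code generator.
candidates = [
    "total = sum(x for x in items)",
    "total = sum(x for x in items",  # unbalanced parenthesis
    "total = 1 + * 2",               # ill-formed arithmetic expression
]

print(keep_parseable(candidates))  # prints ['total = sum(x for x in items)']
```

Parsing only checks syntax; a real linter would flag more, but the filtering logic stays the same.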

You can also reject the generated snippets with easy-to-see semantic errors, like referencing a nonexistent variable. You can even run them in a sandbox and reject the ones with a trivial runtime error, like a null dereference or a division by zero. You can do this during the inference phase, of course, but also during the training phase so that some of these rules bleed into the probabilistic learning and make the model less likely even to sample ill-formed code.
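The runtime filter could look roughly like the sketch below. The helper `runs_cleanly` and the candidate snippets are hypothetical, and plain `exec` is used only to keep the example short: it is not a real sandbox, so untrusted code would need proper isolation (a separate process, container, or similar).

```python
def runs_cleanly(source, test_call):
    """Execute `source`, then evaluate `test_call`; True if nothing raises.

    WARNING: exec/eval provide no isolation. This only illustrates the
    filtering idea and must not be used on untrusted code as-is.
    """
    namespace = {}
    try:
        exec(source, namespace)
        eval(test_call, namespace)
    except Exception:
        return False
    return True


# Hypothetical candidates for the same prompt.
candidates = [
    "def mean(xs):\n    return sum(xs) / len(xs)",
    "def mean(xs):\n    return total / len(xs)",  # undefined name `total`
    "def mean(xs):\n    return sum(xs) / 0",      # division by zero
]

survivors = [c for c in candidates if runs_cleanly(c, "mean([1, 2, 3])")]
print(len(survivors))  # prints 1
```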

Another way of enforcing syntactic rules is to pre-process the code so that what the language model generates cannot, by construction, be incorrect. For example, suppose you need your code generator to produce a loop that declares a variable and then uses it in the body: those two mentions must use exactly the same variable name.

This is a context-sensitive restriction. One way to solve cases like this is to have the language model operate at a syntactic level where variables are all called var0, var1, etc. This makes it much easier for the language model to learn to refer to existing variables, since the number of distinct tokens it must handle is far smaller. During training, you can rename all variables, function names, and symbols in general to this restricted format. Then, during inference, you run a post-processing step that renames symbols back into their actual names in the given context.
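Here is a rough sketch of that renaming round trip using Python's `ast` module (`ast.unparse` needs Python 3.9 or later). The `Normalizer` class and the `normalize` and `restore` helpers are illustrative names, and the sketch only handles plain variable references, not function parameters, attributes, or imports.

```python
import ast
import builtins


class Normalizer(ast.NodeTransformer):
    """Rename variables to var0, var1, ... and remember the mapping."""

    def __init__(self):
        self.mapping = {}

    def visit_Name(self, node):
        if hasattr(builtins, node.id):
            return node  # leave built-ins such as range or print alone
        if node.id not in self.mapping:
            self.mapping[node.id] = f"var{len(self.mapping)}"
        node.id = self.mapping[node.id]
        return node


def normalize(source):
    """Return (normalized source, original -> placeholder mapping)."""
    normalizer = Normalizer()
    tree = normalizer.visit(ast.parse(source))
    return ast.unparse(tree), normalizer.mapping


def restore(source, mapping):
    """Rename placeholders back to the original identifiers."""
    inverse = {placeholder: name for name, placeholder in mapping.items()}
    tree = ast.parse(source)
    for node in ast.walk(tree):
        if isinstance(node, ast.Name):
            node.id = inverse.get(node.id, node.id)
    return ast.unparse(tree)


source = "for item in items:\n    total = total + item"
normalized, mapping = normalize(source)
print(normalized)
print(restore(normalized, mapping) == source)  # prints True
```

In a real pipeline the model would be trained and sampled on the normalized form, and `restore` would run on its output.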

In general, there are many tricks you can use to leverage the fact that you are dealing with a very restricted syntax and, at the same time, make it easier for the language model to learn those syntactic rules.

What can code generators do

Now that we know broadly how these code generators work, we can ask ourselves how they can help us programmers.

The most basic use case is, of course, code generation. This was the first use case highlighted by GitHub Copilot, where you can ask the model to generate a snippet of code. You can generate a complete short method or a code fragment, like a loop or a snippet that uses an API. So it's basically code completion on steroids.

You can also ask for small modifications to the code. For example, you can say, "In the following code, change the inner loop so that it doesn't use this variable twice". This way, you can make small code refactorings via natural language.

That's the simplest use case, but it's not nearly the most interesting one. When you take the translation between natural language and programming code to its full extent, you realize it can be bidirectional. You don't need to go only from natural language to code; you can also go from code to natural language.

Code explanations

The code-to-language direction allows, for example, automatic documentation of functions, as well as getting some explanations for a fragment of code. Suddenly, you can ask questions about your code to a language model and get a natural language description of what the code is doing. It may not be entirely accurate, though, and it may not be as high-level as you would like.

For example, if you have a somewhat complex algorithm —like a sorting algorithm— the description you get could be something like "this variable is changed to this array position", and it probably won't get what you really want from the code, which is something like "in the internal loop, we are guaranteeing that the first part of the array is always sorted and the second part contains only elements larger than the first half".

That's the kind of explanation you would want for a sorting algorithm. And it's going to be hard to get this from a language model because it has to understand not just the syntax of the code (what the code is saying) but its semantics (what the code is doing in execution), and that's not something you can naturally learn from data on code and comments alone. Actually, this is hard even for humans: you have to debug the code in your head and essentially prove a theorem about what it is doing.

Code translation

The next thing that you can do is code translation. You can ask the model to translate code from one programming language to another. Of course, we can train the model specifically for this (e.g., to translate from Python to Rust), but you can also use natural language as an intermediate language.

If you have a language model that can generate back and forth between natural language and several programming languages, it already implicitly knows how to translate. Thus, you can say, "take this code to a high-level natural language description and then produce a code implementation in another language".

Again, this comes with the caveat that maybe the model isn't getting the exact semantics of the code, but it can get close enough for most practical purposes.

API documentation

But perhaps one of the most interesting use cases, and one I'm finding people rely on a lot, is answering questions about API documentation automatically. Instead of interacting with your own code base, imagine you have some library you're using, like the Python bindings for Elasticsearch. That library will have documentation with code examples and snippets interspersed with natural language.

A very simple use case is to do retrieval-augmented language modeling. This means we index the documentation in a vector store, and then we can ask questions about the documentation. You can say, "How do I log in and make a query that says this and this?"

It doesn't matter if the language model has no training data showing that specific library doing what you want to do. You can go to the documentation, find relevant examples, feed them to the language model with their natural language description and code, and then ask a question.

Because of how in-context learning works, you can expect the language model to respond with new code and more or less answer the intended semantics you want by copying and pasting, refactoring, and combining things from the documentation.
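The retrieval step can be sketched with nothing but the standard library. This toy version ranks documentation snippets by bag-of-words cosine similarity instead of using a real vector store or learned embeddings, and the `searchlib` snippets are invented for illustration (they are not the Elasticsearch API). The resulting prompt is what you would hand to whichever language model you use.

```python
import math
import re
from collections import Counter


def bag_of_words(text):
    """Lowercased word counts; a crude stand-in for an embedding."""
    return Counter(re.findall(r"[a-z_]+", text.lower()))


def cosine(a, b):
    dot = sum(a[word] * b[word] for word in a)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def build_prompt(question, docs, top_k=2):
    """Pick the top_k most similar snippets and assemble a prompt."""
    query = bag_of_words(question)
    ranked = sorted(
        docs, key=lambda doc: cosine(query, bag_of_words(doc)), reverse=True
    )
    context = "\n\n".join(ranked[:top_k])
    return f"Documentation:\n{context}\n\nQuestion: {question}\nAnswer with code:"


# Hypothetical documentation snippets for a made-up library.
docs = [
    "To connect and log in, call searchlib.connect(host, user, password).",
    "To run a query, call client.search(index, query) with a query string.",
    "To delete an index, call client.delete_index(name).",
]

print(build_prompt("How do I log in and run a query?", docs))
```

Swapping `bag_of_words` and `cosine` for an embedding model and a vector index changes the quality of retrieval, not the overall shape of the pipeline.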

This is, by far, the most interesting use case that I'm seeing right now. The reason is that this is one of the biggest pain points in software development. Programmers aren't usually wasting much time doing basic coding, like inserting a variable in a list or creating a dictionary with some format. If you've programmed for a while, these tasks are so ingrained in your muscle memory that you know how to do them automatically. I don't need anybody to remind me how to open a file in Python.

But 90% of the code you write for a consumer application is interface code with some external API or library. And a large part of that code is code you don't know how to write because it's maybe the first or one of the few times you interact with that specific library.

So now you can have a coding assistant that doesn't really need to be that good at getting the complex semantics of code. It just needs to be able to go to the documentation, find the relevant examples, and give you a snippet of code that is 80% extracted from those examples but uses your variables and is contextualized in the piece of code that you are writing. And that already is a huge boost to productivity.

Limitations

The most important limitation in language modeling, in general, has been called the problem of hallucinations. Broadly speaking, it is very hard to get these models grounded in factual information because it is very hard to separate, in natural language training data, syntactic knowledge from factual knowledge about the world.

A typical example is when you ask ChatGPT something, and it invents dates or people's names and locations. To alleviate this, we can use retrieval-augmented models to extract relevant factual information from Wikipedia and then use the language model to summarize that. But, at least until now, there is no principled way to guarantee that the model won't simply generate some weird, incorrect factual claim because the model doesn't know what part of its knowledge is syntactic and what part is factual.

I personally think this problem is ultimately unsolvable using the language modeling paradigm, where you generate text based on probabilistically picking the most likely continuation.

What do code hallucinations look like?

If you're going from natural language to code, the simplest hallucination you can see is generated code that uses a variable or method that doesn't exist anywhere in your codebase or in the libraries you use.

However, unlike natural language, if you hallucinate a wrong variable or function name, you can often detect it using a linter, so many of the more harmless hallucinations are irrelevant in the code generation case, as they won't introduce subtle bugs.

A slightly more difficult hallucination is using a wrong variable or a wrong function name that does exist in your codebase. In this case, you will not get a linter or a compiler error because you're using an existing symbol, but you will get the wrong behavior. This will be a lot harder to find because it has the same problem as most hallucinations: you have to review the code, so you have to be sufficiently knowledgeable to have been able to generate that code yourself.
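The difference between these first two levels can be shown with a crude undefined-name check. The helper `undefined_names`, the `known` symbol set, and the pricing snippets are assumptions for this sketch; a real linter does much more, but it has the same blind spot.

```python
import ast
import builtins


def undefined_names(source, known):
    """Names read in `source` that are never assigned and not in `known`."""
    tree = ast.parse(source)
    names = [n for n in ast.walk(tree) if isinstance(n, ast.Name)]
    assigned = {n.id for n in names if isinstance(n.ctx, ast.Store)}
    used = {n.id for n in names if isinstance(n.ctx, ast.Load)}
    return used - assigned - set(known) - set(dir(builtins))


# Symbols that already exist in the (hypothetical) codebase.
known = {"price", "tax_rate", "discount"}

invented = "total = price * (1 + vat_rate)"        # `vat_rate` does not exist
wrong_but_real = "total = price * (1 + discount)"  # exists, but is the wrong symbol

print(undefined_names(invented, known))        # prints {'vat_rate'}
print(undefined_names(wrong_but_real, known))  # prints set()
```

The second snippet passes the check even though the intended expression was `price * (1 + tax_rate)`, which is exactly why this kind of hallucination needs a human review or a test to catch.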

Then you have a third level in which your code doesn't do anything obviously wrong. It uses the right variables and function names, and it looks more or less OK, but it has some subtle logical mistake that leads to a bug. For example, looking at a double-nested loop and finding out that a variable is not updated at the right moment in the internal loop is a tricky question even for human experts, so these kinds of mistakes will introduce subtle bugs.

However, even if, in many cases, the bugs that they will introduce are not worse than the bugs a human would introduce, this does pose an important threat because of automation bias.

When you check code written by humans, you expect bugs in that code, and you expect the code to be wrong in specific ways in which humans make mistakes because you've been looking at humanly written code forever.

But when you're looking at machine-generated code, the only way programmers have ever interacted with it has been with rule-based machine-generated code —compiler- and template-generated code— and that is basically code that has no mistakes.

So even if the language model makes errors that are, on average, no worse than what an average programmer would make, they can still be harder to catch because they will not be the same mistakes that a human would make, so we could be less on guard.

Can we fix this?

We can try to fix hallucinations in code with better training data, post-processing, and many techniques that we still haven't figured out exactly but will come out in the next few months. But there is a fundamental limitation to what we can do automatically: Rice's theorem.

In short, it states that no algorithm can decide, for every possible program, whether that program has a given non-trivial semantic property. This is a very important limiting result in theoretical computer science, and it implies there is no general way to take an informal semantic description of the intended behavior and automatically verify that the generated code fits that behavior.

However, this doesn't mean it cannot be done sufficiently well in practice so that it already provides an incredible performance and productivity boost. It just says that, in general, we will never be able to formally guarantee with a proven theorem that a code generator produces code that does exactly what you tell it in natural language. That problem is unsolvable.
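One practical approximation is randomized testing against a trusted reference: it proves nothing in general, but it catches many wrong candidates cheaply. In this sketch, `generated_sort` is a stand-in for model-written code under review and `passes_random_checks` is a hypothetical helper; Python's built-in `sorted` plays the role of the reference.

```python
import random


def generated_sort(values):
    """Stand-in for model-generated code under review (insertion sort)."""
    result = list(values)
    for i in range(1, len(result)):
        current = result[i]
        j = i - 1
        while j >= 0 and result[j] > current:
            result[j + 1] = result[j]
            j -= 1
        result[j + 1] = current
    return result


def passes_random_checks(candidate, trials=200, seed=0):
    """Compare `candidate` with the built-in sorted() on random inputs."""
    rng = random.Random(seed)
    for _ in range(trials):
        data = [rng.randint(-50, 50) for _ in range(rng.randint(0, 20))]
        if candidate(data) != sorted(data):
            return False
    return True


print(passes_random_checks(generated_sort))  # prints True
```

Passing these checks raises confidence without giving a guarantee, which is the gap the theorem describes.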

This highlights the theoretical impossibility of perfect natural language to code generation, but engineering isn't about perfection. Engineering is about solving the average case in the best possible way and the most important edge cases fairly well.

I think that we can get pretty far with code generation technology. We have already made a huge leap, and even if these theoretical limitations show that we will never have a perfect programming companion, they don't tell us how far we can get in practice in more or less well-known domains with more or less well-defined rules. If 90% of the code that is written today for consumer applications can be offloaded to computers and then you can spend 10% of the time reviewing that code instead, and you have some sensible unit testing in place and some sensible quality assurance, this could be an incredible boost in productivity.

Even if for mission-critical code you will still need a very thorough formal verification procedure and careful review, that's only mission-critical code. Most of the code we write today is not mission-critical; it doesn't matter if it has a bug and somebody gets annoyed. That is already happening. We are shipping tons of buggy lines of code, making tons of users annoyed, and wasting a lot of money on shitty code today.

And still, the software industry has brought us here. It has brought us to a place where we have social media, YouTube, machine learning, and rockets going to the Moon and Mars.

So yeah, if you have a code assistant that gets you 90% there and you have to put an extra 10%, that automation is solving the larger, easier part of the problem and letting the human expert solve the smaller, harder part of the problem.

So, is coding dead?

The final question I want to tackle is whether this will spell an end to software development as we know it. You can probably guess from the way I've been talking about this that my answer is, of course, no, it won't.

The first argument is that we are not, and we probably will never be, in a place where you can ask in natural language for a complete application and get something deployed to the cloud. And even if we do get there, that will most likely work only for the easy application domains where everything is more or less already done, and that's a good thing. But there will always be a gap between what a team of humans using the best available technology and software engineering practices can achieve and what the AI alone can achieve.

But this is, again, a very broad idea. Yes, maybe the best software engineers in the world will always have a job because there will always be difficult problems.

But what about the average software engineer?

Will the average programmer get outworked by a slightly better-than-average programmer with an AI compiler? If now I can write 90% of the code in 10% of the time, does that mean that nine out of ten programmers lose their jobs because I can do the work of nine of my colleagues?

That is a real concern, and it's a concern that is not so easy to dismiss with sweeping arguments like the previous one. Every time automation reaches some industry, some jobs are destroyed, and some people are taken out of the picture because their skills become irrelevant.

One optimistic point of view is that software is an industry that is nowhere near the saturation point where supply meets consumer demand. We have way, way more need for software out there than the number of people currently writing software can satisfy.

So, there is still a lot of space for increased productivity, and we will have enough demand to meet that increased productivity. Thus, in the near term, I expect that increasing the amount and quality of the code we write, i.e., the number of problems we solve, will simply mean that more users are satisfied.

The other optimistic view is that any leap in software productivity has translated into a massive lowering of barriers, which hasn't been the case in other industries. For example, when we went from programming in assembly code to compilers or from programming in C to using OOP frameworks, we got more people into programming every time we had something that lowered the entry barrier.

Today, we have maybe 10 million programmers in the world, and we have almost 8 billion people. So there are a lot of people who aren't programming yet, and they could get into this.

You'll probably ask, but why would everybody get into programming?

Why would everybody write code?

Well, here's an argument. The vast majority of people in the world have at least a basic understanding of math, so that when they go to the supermarket and they have to, I don't know, buy two things, they don't need to hire a professional mathematician to add for them. That is, the vast majority of people know enough math to get by daily, and then we have professional people who do professional math in problems that are sufficiently hard so that the average person doesn't know what to do. And that's because the modern world runs on math.

Well, I think programming will become more or less the same. The modern world also runs on software, and our society will become even more software-dependent. In that future, everybody will know a little bit of programming, enough to say to their home computer, "When I get home, I want you to turn my lights on, but only if it's night and the electric bill is not above the average that I paid over the last three months." You will learn basic coding in school, and everywhere you will have interfaces with computers that you talk to, and they will generate some code and do something for you.
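The code behind that spoken request could be as small as the sketch below. The function name and parameters are invented for illustration, and a real home automation system would supply the sensor and billing data.

```python
def should_turn_lights_on(is_night, current_bill, last_three_bills):
    """Lights go on only at night and only if the bill is not above average."""
    average_bill = sum(last_three_bills) / len(last_three_bills)
    return is_night and current_bill <= average_bill


print(should_turn_lights_on(True, 40.0, [45.0, 50.0, 55.0]))  # prints True
print(should_turn_lights_on(True, 60.0, [45.0, 50.0, 55.0]))  # prints False
```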

Is making software really about coding?

Here is one final argument for optimism. So far, we've been talking about how massive the productivity boost from code generators is, but this boost is limited to one specific part of software development, which is actually writing code. Anybody who's done software development at some scale in the industry knows that writing code is by far not the hardest, the most time-consuming, or even the most important part of software development.

If you increase productivity a hundredfold in a part of the process that represents 10% of the overall effort, you, at best, reduce that 10% close to zero but still have to deal with the remaining 90%.
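This is the familiar Amdahl's law calculation. A quick sketch, taking the 10% share and the hundredfold speedup from the paragraph above as the inputs:

```python
def overall_speedup(fraction, factor):
    """Amdahl's law: speed up `fraction` of the work by `factor`."""
    return 1.0 / ((1.0 - fraction) + fraction / factor)


# Coding is 10% of the work and becomes 100x faster.
print(round(overall_speedup(0.10, 100), 2))  # prints 1.11
```

Even with an infinitely fast code generator, the overall speedup under these assumptions is capped at 1 / 0.9, or about 1.11x.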

And what else do we have in software development that is super hard to get? We must understand requirements or specifications, talk with customers, understand what they want, and guide them through designing a software product, knowing the user base, and finding a sustainable business model.

Yes, we can also improve some of that using language models because we could, e.g., get transcriptions of user interviews and produce a summary of all the bugs we have in Google Play comments.

But a fundamental threshold there is the human part of the process. We need to get some humans to use our application enough to discover whether our application is fulfilling their needs. And we need those humans to report it back. And that's a massive part of software development, testing your code with real users.

And you cannot simply replace that other human with a language model because your end user is a human. You want that human to interact with your code. You can replace the person listening to the human with a language model and maybe get a boost in productivity there, but you cannot replace the user with a language model. You'll still have a human user, and human users are slow. They get angry easily, don't understand your application, and don't know what it is they don't like about it.

Software development is not about writing code but about making useful software for somebody in the real world. And while code generators are producing and will continue to produce massive improvements in the code writing part (and maybe in the code reviewing and debugging part), all of that improvement is mostly isolated in the part of creating the software in the first place.

But there is so much more going on in software development beyond coding, from marketing to product development to talking to users to a lot of things that are still unscathed by language models --and some of them are probably always going to be outside of the realm of the language modeling paradigm because they involve interacting with real users in real-time and seeing them use your app and work through your app.

So this productivity boost will be huge in one part of the pipeline, but it's contained to a part that is far from the most important one.

And this is something that we've been seeing for a long time. We've had enormous boosts in productivity before. We used to need to write assembly, C, or C++ code, and now we can write Rust or Python. These languages are far more productive than C and C++ for writing code that works out of the box, and we still don't see software getting done 100x faster than 50 years ago.

The biggest progress in the software creation process has come from innovation in the people process: innovation in software engineering, management, and how you get people to work together and collaborate. And this will continue to be the most important part of the software pipeline for a long time.

Conclusions

So, should you learn to code? Definitely. There's going to be orders of magnitude more code written in the next few years than everything we've written in history.

So learn to code the traditional way. Even if you never end up writing a single line of code unaided by AI (just as I've never written a single line of production code unaided by syntax highlighting, a linter, or a type checker), knowing how code works, how algorithms work, and why a specific programming construct works the way it does is the same as knowing basic math. Coding changes how your brain is wired, makes you think more clearly, and increases your creativity.

Furthermore, even if you are not working in the software industry, learning to code is an enjoyable experience. Being able to create something that keeps working on its own, I think, is the ultimate toy. So I think learning to code is good if only for thinking through these puzzles and these problems and learning to do something new.

But if you want to make a dent in the software industry, to work at a software company, and you're wondering if AI will get you out of the picture, don't worry. That won't happen anytime soon.

Learn to code, learn the fundamentals, but also learn how to use and leverage these new tools, the same way we programmers of the previous decade learned how to use linters and IntelliSense and all of the super cool tools that didn't exist 20 years before. Learn to use all that and become the best programmer you can, using the best available technology to improve your work. As in every moment in human history, if you apply yourself and do your best, you will be at the top of the league, and there will be a spot for you.

Ernesto

Great article! Thanks! Actually, we just submitted a paper trying to start a deeper line of work. You mentioned documentation QA, which is a great application. However, if we look at a big project that changes rapidly due to agile practices, you will often find that the documentation is outdated with respect to the latest version of the code. So we are trying to do QA directly on the code, using static analysis to reduce the context we pass if the code is too big. Of course, this also has its limitations. For now, we tested it on a small microservices project, and we were surprised that, out of 33 complex questions we designed, it was able to answer the majority correctly. The questions ranged from the purpose of an endpoint to things like "give me the chain of calls from endpoint A to endpoint B" or "tell me if you see a cyclic dependency". Of course, what we submitted is in its baby phase; hopefully the monsters will be able to put it to great use.
