Mostly Harmless #1: Coding is Dead, Long Live Coding!
Code generators: how they work, what their limitations are, and do they spell an end to coding as we know it?
Few developments in the generative AI space have been as exciting lately as the rise of code generators.
Code generators are machine learning models that can take a natural language prompt and some contextual code and produce new code that mostly aligns with the prompt's intention. The simplest case involves defining a function, its arguments, and comments describing how it works; the code generator will then provide the function's body. For example, you can input a function signature and a comment saying, “This function finds the minimum of an unsorted list”. The generated code would fit these specifications.
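To make that concrete, here is what such a prompt and a plausible completion might look like. The function name and the loop-based implementation are illustrative, not the output of any particular model:

```python
# Prompt given to the code generator: a signature plus a descriptive comment.
def find_minimum(numbers):
    """This function finds the minimum of an unsorted list."""
    # --- everything below is what the model is expected to generate ---
    smallest = numbers[0]
    for value in numbers[1:]:
        if value < smallest:
            smallest = value
    return smallest
```

Given only the signature and the docstring, a good code generator produces a body like this one, so that `find_minimum([3, 1, 2])` returns `1`.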
GitHub Copilot was one of the first major breakthroughs (or at least the first public one) in this space, sparking both praise and controversy: its accuracy is impressive in many situations, but it has also sometimes reproduced verbatim code from private or open-source repositories without the proper licensing information. Code generators work much like language models, with an uncanny ability to comprehend and generate code based on human communication, but they also need some special considerations, because code is not natural language.
In this first edition of Mostly Harmless, we'll dive into this topic and examine how these systems work, their potential applications and limitations, and answer a fundamental question about whether they signal an end to coding: does anybody ever need to learn how to program again?
Building an LLM for code generation
Code generators are essentially large language models, like GPT, trained on vast amounts of code. For that reason alone, we should expect code generators to learn to generate code based on the previous context, i.e., the variables, methods, classes, and generally the structure of existing code.
For example, if you have variables declared in the context, it is more likely that the code you generate will contain references to those same variables because, in training code, you refer to existing variables more often than you define a new identifier.
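This bias toward reuse can be illustrated with a toy sketch (not a real language model): a function that ranks candidate identifiers for the next token, preferring those that already appear in the surrounding context.

```python
import re

def rank_identifiers(context, candidates):
    """Rank candidate identifiers, preferring ones already present in `context`.

    This is a toy stand-in for the statistical bias a trained model picks up:
    in real code, references to existing variables outnumber new definitions.
    """
    seen = set(re.findall(r"[A-Za-z_]\w*", context))
    # `name not in seen` is False (0) for known names, so they sort first.
    return sorted(candidates, key=lambda name: name not in seen)

context = "total_price = 0\nfor item in cart:"
ranked = rank_identifiers(context, ["discount", "total_price"])
```

Here the already-declared `total_price` outranks the unseen `discount`, which is the behavior a model internalizes from training data, only learned statistically rather than hard-coded.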
Now, if you train on data where code is interspersed with natural language comments, like function documentation or inline comments inside the code, those comments will naturally refer to the semantics of the surrounding code. Thus, it is also somewhat natural that language models can learn to generate code from a natural language description.
Up to this point, all we have discussed is plain, unsupervised training of language models, but on code data. As we know, models like ChatGPT are trained this way, fine-tuned on instructions, and then reinforced with human feedback.
And you can do the same thing with code. You can compile a set of instruction pairs that say, in natural language, "modify the previous code so that there are no bugs", or "in the previous code, replace this variable with that other one". These examples will not appear naturally in training code taken from GitHub, so you need to bake them in with instruction fine-tuning.
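Such an instruction-tuning dataset might look something like the following sketch. The field names are illustrative; real datasets vary in structure:

```python
# Hypothetical instruction-tuning examples for code: each pair combines a
# natural language instruction, the input code, and the desired output code.
instruction_pairs = [
    {
        "instruction": "Modify the previous code so that there are no bugs.",
        "input": "def add(a, b):\n    return a - b",
        "output": "def add(a, b):\n    return a + b",
    },
    {
        "instruction": "In the previous code, rename the variable x to count.",
        "input": "x = 0\nx += 1",
        "output": "count = 0\ncount += 1",
    },
]
```

Fine-tuning on thousands of pairs like these teaches the model to follow editing instructions it would never encounter in raw repository code.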
And finally, you can do reinforcement learning with human feedback, evaluating different versions of the same code. Some of these will have better style or naming conventions that more closely adhere to established standards among developers. Using RLHF, you can steer a model to generate not only syntactically correct code but also code that respects a desired style. The same methodological approach that leads to a world-class text-based chatbot can also lead you to a world-class code generator.
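The raw material for this step is human preference data: two candidate completions for the same prompt, with a label recording which one a reviewer preferred. A sketch of one such record, with illustrative field names:

```python
# One human-preference record for RLHF on code. Both completions are
# functionally equivalent; the reviewer prefers the one with clearer naming.
preference = {
    "prompt": "Write a function that doubles every number in a list.",
    "completion_a": "def f(l):\n    return [x * 2 for x in l]",
    "completion_b": "def double_all(numbers):\n    return [n * 2 for n in numbers]",
    "preferred": "completion_b",  # better names, same behavior
}
```

A reward model trained on many records like this one learns to score stylistic quality, and reinforcement learning then pushes the generator toward higher-scoring outputs.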