Beyond the Chatbot Revolution
Here's what a truly powerful generative AI could do, and it's so much more than fancy chatbots
It seems like everybody's integrating chatbots everywhere. Sure, there are many examples of applications where a conversational interface is better than any previous alternative. A classic example is improved non-professional search. If you want to search for something factual, asking a conversational agent may be a far better experience than typing keywords into Google because you can ask follow-up questions. So, barring hallucinations —a topic for another day— we can all agree there is merit to adopting a conversational interface, at least in some cases.
Improved search is a very exciting use case that works reasonably well, but there are many other cases in which natural language is far from the best input. When we’ve needed to give machines precise instructions, we have built task-specific interfaces to interact with them at a level of control beyond what natural language can provide.
Now that we can build super powerful conversational agents, there is a misconception that natural language is the best and ultimate interface for all tasks. We can forget about buttons, sliders, and dials; everything you can achieve by clicking some control on a traditional UI, you could do with language alone, right? This bias is part of why we are placing overly high expectations on chatbots as the fundamental engine of the AI revolution.
Language as an interface
One of the reasons we want to use chatbots for everything is that humans use language for everything. However, language didn't evolve specifically for giving instructions. It evolved as a means to communicate within a community, collaborate, agree, and sometimes provide instructions for activities such as hunting and building.
Yet most human language is not well suited for precise instructions. When communicating with machines, or even between humans, we resort to a subset of language that is better suited for instructions. For instance, we invented mathematical notation to give precise meaning to formal claims.
The ultimate embodiment of this idea is programming languages. They are highly restricted to prevent misinterpretation, as the syntax enforces the semantics, ensuring that a computer —or another human programmer— cannot interpret any part of the language in a way unintended by the original programmer. And this is why programming languages remain the primary way to give computers instructions.
Now, of course, all of that is changing with language models, because until now we couldn't give a computer complex instructions in natural language. The best we could do was some sort of keyword pattern matching, and the most successful pre-LLM application of natural language was search.
When searching in Google or any search engine, you write a query that resembles natural language, though it doesn't have to be a question or even a well-formed sentence. It is a semi-natural language request that triggers a process —one that doesn't require understanding the full meaning of the request, only some partial meaning of the words you're using— to instruct Google to search a vast database of the whole internet and return a very summarized subset of it.
Search is just a way to instruct a computer with natural language —or something close to it— to perform a concrete task: finding a given set of documents. But we all know that search is far from perfect, and sometimes it's tough to narrow down search terms to pinpoint the exact thing you want. This is why advanced search engines have filters for dates, topics, tags, sorting, and so on. You have many controls over the search beyond natural language alone, because it would be too cumbersome to say, "starting last week, sort by whatever field."
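To make that concrete, here is a minimal, self-contained Python sketch (a toy in-memory "search engine" invented for illustration, not a real API): the free-text query carries the rough meaning, while structured filters carry the constraints that would be clumsy to phrase in prose.

```python
# Toy sketch: keyword matching supplies the "meaning", explicit filters supply
# the precision. The documents and the search() helper are invented for
# illustration; real engines expose the same split between query and filters.
from datetime import date

DOCS = [
    {"title": "Context-aware image editing", "tags": ["ai"], "published": date(2024, 5, 2)},
    {"title": "Classic patch-removal tricks", "tags": ["graphics"], "published": date(2019, 3, 11)},
    {"title": "Latent-space walks with GANs", "tags": ["ai"], "published": date(2024, 4, 28)},
]

def search(query, *, published_after=None, tags=None, descending=True):
    words = query.lower().split()
    hits = [d for d in DOCS if any(w in d["title"].lower() for w in words)]
    if published_after is not None:
        hits = [d for d in hits if d["published"] >= published_after]
    if tags is not None:
        hits = [d for d in hits if set(tags) & set(d["tags"])]
    return sorted(hits, key=lambda d: d["published"], reverse=descending)

# "AI articles about editing, from this year, newest first"
for doc in search("image editing", published_after=date(2024, 1, 1), tags=["ai"]):
    print(doc["title"])
```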
Well, of course, now we have large language models, and it seems we are almost at the point where we can give very precise instructions to the computer in natural language and the computer will do what we want, as if it fully understood the semantics of our language.
Whether probabilistic language modeling allows for full natural language understanding or not, that's a question for another essay.1 For this article, let's assume that language models, either in the current paradigm or with a more advanced paradigm in the near future, reach a point where we have full natural language understanding. We can tell the computer exactly what we want it to do, it will be transformed flawlessly into instructions in some formal language, and the computer will execute those instructions.
Suppose you could ask Photoshop, your favorite video editing software, or any application to do whatever a human expert can. Would that be the best possible interface?
I claim it isn’t. Even perfect natural language understanding is far from the best possible interface for many tasks. There is something even better than perfect NLP, and we may be closer to achieving it than we are to perfect natural language understanding.
Let’s talk about the true power of generative AI.
Low, mid, and high-level interfaces
Okay, let's take a step back and discuss tools and interfaces.
Every time you want to solve a specific task using a tool, that tool has an interface, which is essentially how you interact with it. If we consider a physical tool, the interface could be a handle, cranks, and buttons. On the other hand, for a virtual software tool, the interface usually consists of virtual controls such as text boxes, buttons, and sliders.
So, when solving a particular task, we can arrange all the potential virtual interfaces on a scale from low level to high level, with low level meaning having more control over the details and high level being more goal-directed.
Another way to see it is that the low level is closer to an imperative approach, where you must instruct the tool on how you want things done. Moving to a higher level allows you to instruct the tool on what needs to be done, and it will be carried out. The farther away you are from the actual steps the program must take to solve the task, the higher the level of your interface.
Let's consider a specific example, such as creating images. The lowest possible level of a tool for making images is Microsoft Paint, where you can decide the color for each pixel. This level of detail requires significant effort and skill because you must know in advance the steps that need to be performed, but you can definitely paint the Mona Lisa this way.
Then, you have higher-level tools like gradients and fills, which allow you to instruct the computer to change the color between pixels smoothly. This approach involves some math, but the computer can handle it. Moving to an even higher level, software like Photoshop offers features like blurring, adjusting contrast, and transforming pixels collectively while maintaining their aspect ratio.
These involve more complex calculations that the computer can manage, bringing you closer to telling the computer what you want —make this image brighter— without specifying how it should be done. We even have old-school image processing tools, like patch removal, which employ clever non-AI algorithms to smooth out photo imperfections such as scratches and dimples.
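A small sketch with the Pillow imaging library (assuming it is installed) illustrates the jump between these levels: at the bottom you decide every pixel yourself, while one level up you ask for whole-image operations like brightening or blurring and let the library work out the pixels.

```python
# Contrast of interface levels in Pillow: pixel-by-pixel control versus
# higher-level, goal-directed operations.
from PIL import Image, ImageEnhance, ImageFilter

# Low level: choose the colour of every single pixel yourself (the MS Paint regime).
img = Image.new("RGB", (256, 256))
for x in range(256):
    for y in range(256):
        img.putpixel((x, y), (x, y, 128))  # a simple hand-computed gradient

# Mid level: say *what* you want; the library figures out *how*.
brighter = ImageEnhance.Brightness(img).enhance(1.3)            # "make this image brighter"
softened = brighter.filter(ImageFilter.GaussianBlur(radius=2))  # "smooth it out a bit"

softened.save("gradient.png")
```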
At the highest possible level, you can instruct an advanced program, such as Midjourney, to create an image detailing a scene with very abstract instructions, similar to what you might convey to a human artist. This level of instruction involves leaving many decisions up to the computer, much like a human artist interpreting vague directions.2
Thus, the higher the level of the interface, the more ease of use you gain, but the more control you lose. There's an unavoidable trade-off here. The more you expect the computer to do for you, the less control you have over the final result. That may be okay for many tasks, but it's not enough to achieve outstanding masterpieces. However, there is a sweet spot in the middle that we still haven't figured out, though I don't think it's so hard to design.
Let's explore that trade-off and see what we can find in the middle.
The sweet middle spot
Now that we have a good grasp of the difference between a low-level and high-level interface, let's try to imagine a mid-level interface in this context, while considering the example of image generation, which is very illustrative. Later on, we will explore different domains.
Continuing with the analogy of a human expert or editor that you can communicate with, imagine that, during the process, you ask the AI to generate an image of a man looking at the sun. Then, you want to alter the sky, adjusting the tint from blue to something more reddish. You can make such modifications using language, but there is a limit to how accurately you can describe the shade you want. At some point, you may want to specify the exact tint of the sky.
Now, let’s make it a bit harder. Suppose you decide that you want the man to be standing instead of sitting. You can adjust your prompt and attempt to get an image that closely matches your new request, but this will generate a new image, potentially losing all the progress you made on getting the sky the exact color you wanted.
What you really want is to give specific instructions to the image generator, such as "Retain the same image, but change the man to a standing position." Today, we have models that can be fine-tuned or instructed to perform such modifications to some extent. But the control is still far from perfect. You’d want to be able to click on the man and move him within the picture, allowing everything around him to change contextually, including his shadow and reflections.
Cool, right? Now, let’s make it even better. Imagine now being able to interact with the Sun in the image. By clicking on it and moving it, you would like the color of the sky and the shape of the shadows to adjust accordingly. Or you click on the clouds, a slider pops up, and you can change their density, which affects the lighting and shadows of the whole scene. Or even more abstractly, you move a slider that controls the season, changing smoothly from spring to summer, to autumn, to winter, all the time while keeping the exact same scene composition, but the light changes, the trees grow and fade, and the grass turns to mud and then snow.
These transformations possess a magical quality because they meticulously control specific dimensions while maintaining overall contextual consistency across the entire image. This is what I call context-aware transformations, and it's a critical aspect of the AI revolution for content creation.
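We are not quite there yet, but the closest widely available approximation of a local, context-aware edit today is mask-based inpainting: you mark a region, describe the change, and the model regenerates that region consistently with the rest of the scene. Below is a hedged sketch using the diffusers library; the model id and file names are only examples, a GPU and downloaded weights are assumed, and note that this is still mask-plus-prompt rather than the click-and-drag control imagined above.

```python
# Approximate a context-aware local edit via inpainting (not click-and-drag).
# Model id and file names are illustrative; requires a GPU and the model weights.
import torch
from PIL import Image
from diffusers import StableDiffusionInpaintPipeline

pipe = StableDiffusionInpaintPipeline.from_pretrained(
    "stabilityai/stable-diffusion-2-inpainting",  # example checkpoint
    torch_dtype=torch.float16,
).to("cuda")

scene = Image.open("man_at_sunset.png").convert("RGB")     # hypothetical input image
mask = Image.open("mask_around_the_man.png").convert("L")  # white = region to regenerate

edited = pipe(
    prompt="a man standing, same scene, same lighting",
    image=scene,
    mask_image=mask,
).images[0]
edited.save("man_standing.png")
```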
Merely instructing Midjourney to create an image is insufficient. In reality, we want a level of control akin to that of an artist. We require tools that can subtly alter the image, as well as more abstract controls. For instance, we might want to adjust the image's tone from somber to cheerful using a slider, producing comprehensive alterations across the entire image.
Here, we are dealing with two types of transformations: local, such as repositioning individuals and objects within the image, and global, which make sweeping changes across the entire image. But in all cases, any change has to be contextually consistent, so the whole image must be readapted to fit any sensible constraints.
Okay, so images are the quintessential example of this kind of semantic manipulation, which can be extremely powerful if you find the right balance between high and low levels, between expressivity and control. You can argue that this is just a specific niche where language is not that prevalent, but the same ideas apply to essentially all design tasks, whether 2D, 3D, architecture design, engineering design, websites, etc. They all involve design constraints on a space of objects with semantic relations among them.
Now, some tasks are inherently linguistic, such as technical writing, where it seems like a high-level chatbot is indeed the killer app. However, even in these tasks, a very good mid-level interface with a precise balance between control and expressivity still beats the chatbot interface.
In writing, the lowest-level operation is inserting and deleting characters in text editors, while the highest possible level is requesting, "ChatGPT, please write me an essay on this topic."
An interesting mid-level tool, for example, is rephrasing a text fragment to change the language style, which is something we can already do today with LLMs. But there are more semantically-oriented modifications that you may desire, which are not so simple at first glance. For instance, restructuring the order of the major claims in an article requires manipulating the underlying semantic structure of the text while maintaining the same overall tone and style. This is not something easily achievable today with prompt engineering.
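For reference, this is roughly what the style-rephrasing operation looks like today through an LLM API; the sketch below uses the OpenAI Python SDK, with the model name as a placeholder and an API key assumed to be set in the environment. The instruction is purely stylistic, which is exactly why it works; the structural edits described above do not reduce to a single prompt this cleanly.

```python
# Mid-level text operation we can already do: rephrase a fragment's style
# while preserving its meaning. Model name is a placeholder; assumes
# OPENAI_API_KEY is set in the environment.
from openai import OpenAI

client = OpenAI()

draft = "The experiment kinda worked, but the numbers were all over the place."

response = client.chat.completions.create(
    model="gpt-4o-mini",  # example model; any instruction-tuned model would do
    messages=[
        {"role": "system",
         "content": "Rewrite the user's text in a formal, technical register. Preserve the meaning exactly."},
        {"role": "user", "content": draft},
    ],
)
print(response.choices[0].message.content)
```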
So, I hope to have convinced you that there is a sweet spot between high-level declarative and low-level procedural interfaces where you can achieve the highest degree of precision with minimum effort.
Now, how do we implement those?
Final remarks
I don’t have any magical answer here, of course. Still, a key insight is that these operations involve manipulating the object not at a surface level —e.g., the pixels or the text characters— but at a semantic level. Every generative model implicitly defines a latent space of images, documents, or 3D scenes, where every point you take in this latent space produces a different surface-level object, but always with correct semantics.
Crucially, if this latent space is well ordered, small changes in it will also produce small, semantically correct changes in the surface object. We saw this first-hand with GANs —remember those?— in what now feels like centuries ago, when we could interpolate between a cat and a lion, moving through images that always looked like animals.
The critical point is that specific directions in that latent space map to semantically well-defined operations, such as “change the density of the clouds,” but it is far from trivial to find those directions for an arbitrary human-level operation.
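To make the idea concrete, here is a toy numpy sketch of latent-direction editing: once a direction is associated with an attribute (the invented "cloud density" direction below), the edit itself is just vector arithmetic followed by decoding. The decoder here is a random stand-in for a trained generator, and in practice finding such a direction, not applying it, is the hard part.

```python
# Toy illustration of latent-direction editing. The decoder is a random
# stand-in for a trained generator, and the "cloud density" direction is
# invented; in a real system both would come from a trained model.
import numpy as np

rng = np.random.default_rng(0)
latent_dim, image_dim = 64, 32 * 32

decoder_weights = rng.normal(size=(latent_dim, image_dim))  # stand-in for a generator

def decode(z):
    """Map a latent vector to a flattened 'image'."""
    return np.tanh(z @ decoder_weights)

z = rng.normal(size=latent_dim)              # the current scene
cloud_density = rng.normal(size=latent_dim)  # in reality: a learned direction
cloud_density /= np.linalg.norm(cloud_density)

for alpha in np.linspace(-2.0, 2.0, 5):      # the "slider"
    edited = decode(z + alpha * cloud_density)
    print(f"slider={alpha:+.1f}  mean pixel value={edited.mean():+.4f}")
```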
The challenge, thus, is to combine modern language models with pre-LLM latent-space manipulation to get the best of both worlds: an easy-to-use high-level interface to get you started, and a powerful mid-level interface for fine-tuning your creation. This is the true power of generative AI, and I think we’re far closer than it seems.
So forget about chatbots, or maybe don’t forget about them completely, but consider there are many more exciting and incredibly powerful applications of generative AI if you’re willing to think outside the box.
1. I'm skeptical of the capacity of probabilistic language modeling to model all of the complexity in natural language, especially because compositionality seems to be something that cannot be learned from examples alone. But that's a question for another article.
2. It's important to note that while advanced programs like Midjourney may produce impressive results, they are still incomparable to the work of a real human artist.