Why Alignment is the Hardest Problem in AI
And probably the hardest engineering problem we've ever faced.
Welcome back to Mostly Harmless AI!
After an almost three-month pause, I'm back with this bi-weekly deep dive into the most relevant and current issues in the world of Artificial Intelligence, cutting through the pervasive bullshit and unjustified hype.
Today, we’ll look at the hardest problem in artificial intelligence and the main obstacle to making general-purpose AI—if we ever get it—a truly transformative technology.
AI safety is a critical concern, with issues ranging from existential risks to present-day harms such as biases, discrimination, and disinformation. The AI alignment problem is central to solving all of these.
AI alignment refers to ensuring that an artificial intelligence system generates outputs deeply aligned with user preferences beyond superficial optimization metrics. It aims for the AI to genuinely address the problem the user cares about rather than merely following instructions. The difficulty in AI alignment arises from the challenge of accurately describing user preferences, but it extends far beyond that.
Typically, what users convey to an AI is a crude approximation of their actual desires. The goal of alignment is to bridge this gap by enabling the AI to understand the user's true intentions based on their superficial explanations. But, as we will see in this article, it is devilishly hard—and perhaps even theoretically impossible—to ensure any sufficiently advanced AI system is behaving as intended.
The AI Alignment Problem
AI alignment involves ensuring that an artificial intelligence system behaves according to a user's expectations, preferences, or values. The interpretation can vary depending on the context, but in general, the idea is that the system will act in line with the user's interests, that is, in a way that satisfies the user's actual needs and wants, and not just some superficial approximation of them.
The need for alignment in Artificial Intelligence
First, let's question why alignment is necessary in artificial intelligence but not in other high-tech products and engineering tools such as cars, planes, or rockets.
The primary reason lies in the level of complexity and the nature of the interface between us and AI systems, compared to other engineering tools.
What makes AI different from other high-end technologies we have?
The more advanced a tool, the more you can focus on telling it what you want to do instead of how to do it. For example, with simple tools like a hammer, you control every action. With complex tools like cars, you still don't say "reach the destination"—at least not yet. Instead, you perform mid-level actions like steering and accelerating, which translate to lower-level actions like moving pistons.
As the tool becomes more advanced, the way you communicate with it becomes closer to telling it what you want and farther from describing what the tool must do to achieve that objective. Artificial intelligence lies towards that declarative extreme, where you tell the tool your end goal and let it figure out all the steps to achieve that. Actually, we could make the case that AI can be defined precisely as the field dedicated to making tools that do what you want and figure out how to do it on their own.
Consider driving a regular car—not a self-driving one. To get where you want, you need the car to steer in the right direction and accelerate or brake at the right times. Your high-level objective—getting somewhere—is decomposed into many low-level instructions. You cannot simply ask the car to drive itself—again, a traditional car. You have to steer it and accelerate it. So the system translating a high-level, abstract instruction like "get me home fast but safely" into precise low-level instructions is none other than you, the user.
Now contrast this with a Level 5 self-driving car. You could just say, "Get me home," and the "fast but safely" part is implicitly assumed. The car's AI system has to "understand" this high-level instruction and decompose it into the myriad of low-level instructions that actually make the car do what you want it to.
But here is the crucial part: "Get me home" encodes a far larger set of assumptions than you usually imagine, and there is an infinite number of ways in which an AI agent could be said to have fulfilled that request without actually doing what you intended it to do.
When you say "Get me home" to a human taxi driver, they usually implicitly assume you're also asking the following:
do not cause me any physical or psychological harm;
get there reasonably fast,
but do not drive carelessly;
take the fastest route if possible,
but take a detour if it’s necessary, even if it costs me a bit more;
do not engage in uncomfortable conversations,
but do engage in reasonably well-mannered conversations,
or leave me alone altogether, depending on my mood;
do not harm any pedestrians or animals,
but if you must harm an animal to avoid a fatal or very dangerous accident, please do;
...
These are all reasonable assumptions that any human knows from common sense, because we all share a common understanding of what it means to live and act in the human world. But an AI doesn't come with that common sense preprogrammed.
In fact, common-sense reasoning seems to be one of the hardest skills for modern AI to acquire, at least in part precisely because it is "common": we don't have large corpora of explicit examples of this type of reasoning, like we have for more specialized skills.
And that is the reason we need alignment. When we tell a tool what we want instead of how to do it, we need it to interpret that want in a context full of assumptions, restrictions, and trade-offs, which are often implicit. Alignment means having an AI system apply the proper implicit context and find the solution to our request that is, as the name implies, more closely aligned to what we really want instead of just any solution that superficially fits the explicit request.
The crucial reason alignment is hard is the interplay between two critical parts of the AI equation: the inherent complexity of the world and the unavoidable brittleness of how we model it.
Let's break it down.
Why AI alignment is hard
Many reasons make AI alignment one of the hardest problems a system designer can face. Some of those reasons involve our inability to produce a robust enough description of the task we want to solve—that is, we cannot fully describe the context and all the implicit assumptions and restrictions that apply in that context. These reasons are related to the nature of the problem itself—getting any intelligent agent to do what you want is intrinsically hard. If you're a parent, you know exactly what I mean. Other reasons are related to the nature of the solutions we currently have, that is, systems built with machine learning, trained on imperfect datasets to optimize imperfect proxy metrics.
These are interrelated but separate challenges, so let's take them one at a time.
Implicit contexts
When using a regular tool—like a hammer, a calculator, or Excel—you have an end goal in mind. The tool doesn't need to understand that goal, though; it just needs to follow your precise instructions. However, when working with AI, many assumptions about how the world works aren't explicitly described in the instructions.
For instance, if you tell an advanced AI to make coffee, there are numerous implicit restrictions: don't destroy the coffee machine, don't harm any animals, don't walk through walls, etc. Humans generally understand these unstated rules because we share a common understanding of the world. So, there is a significant difference between systems that require specific instructions on performing tasks and those that simply need to be told what tasks to accomplish.
When you want to tell a system what to do instead of how to do it, you must be very precise in specifying everything it needs to know. The constraints may be simple enough to be explicitly encoded or learned from data in a restricted domain. For example, in a factory setting, a robotic arm is physically incapable of destroying the world, so it doesn’t need to know much about anything outside the narrow task of car painting.
However, training systems for open-ended decision-making in the real world is far more complex. It's hard to imagine a training scenario as intricate as real life. Gaining all the necessary experience to understand the human world like humans do would require something like raising a child from scratch. And the majority of assumptions in those contexts can’t be learned from data, because we simply don’t have training data for “how to be a human”.
Unclear trade-offs
However, the implicit context problem hides an even bigger challenge. While many of the things an AI must implicitly care about are restrictions—e.g., do not kill the passenger—the hardest cases are those that involve trade-offs instead.
This is a fundamental issue built into most real-world optimization problems. On the one hand, you want a system to achieve its objective as effectively as possible, and on the other, you want it to do so with minimal side effects. These two goals are often contradictory—for example, driving fast versus safely.
Many of these unwanted side effects are implicit restrictions: you don’t want to kill any pedestrians or harm the passengers. However, some side effects are not hard constraints but tradeoffs. If you want zero chance of getting in a car accident, the only solution is not to drive at all. So, you want your AI system to correctly trade off a small risk of getting hurt for the possibility of actually getting you from A to B. Pragmatism involves trade-offs.
And we humans frequently make these trade-offs unconsciously, e.g., between getting somewhere faster and taking on a bit more risk on the highway, or going the longer, safer way. This kind of trade-off is at the heart of any complex problem: trade-offs are the very reason engineering problems are complex to begin with!
However, with an AI system, it becomes even worse. The system needs to understand not only the many implicit constraints and trade-offs in the world but also how you value those trade-offs. You would need to specify potential side effects and give them appropriate negative values so the system avoids them while still achieving its primary goal.
This challenge arises because most machine learning systems optimize performance metrics. And the nature of optimization involves comparing numerical quantities. To optimize your goal of a fast yet safe journey, you must quantify these trade-offs. For example, how inconvenient is being late compared to the risk of a sudden hard brake? Is it worth risking a 10% chance of a minor head injury to arrive 20 minutes earlier? How about a 20% risk? How do you put numbers to being early versus being safe?
Furthermore, if you fail to quantify any value, your AI will be compelled to prioritize performance over that factor, since there's no penalty for ignoring it. So, it's crucial to quantify all critical side effects. If you don't specify a crucial dimension—like car damage—you're in big trouble. To save time, an AI system might trade off an arbitrarily high amount of car damage for a small reduction in travel time.
Since saving even a minute of time has some positive value and no amount of car damage has any negative value, as long as the car reaches the destination—i.e., it is not absolutely destroyed—the AI is free to choose a marginally better route regardless of how much more damage the car takes. You’ll end up with a system that reaches the destination as fast as possible but considers every car disposable.
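To make this concrete, here is a minimal sketch in Python of that failure mode. The numbers, the candidate plans, and the damage weight are all made up for illustration; the point is only that a dimension with no penalty term is a dimension the optimizer is free to sacrifice.

```python
# Each candidate plan is (minutes_to_destination, expected_car_damage_in_dollars).
# Hypothetical numbers, purely for illustration.
candidate_plans = [
    (32, 0),      # careful route: one minute slower, no damage
    (31, 5000),   # shortcut through a gravel pit: one minute faster, wrecks the suspension
]

def cost_ignoring_damage(plan):
    minutes, damage = plan
    return minutes  # damage carries no penalty, so the optimizer is free to ignore it

def cost_with_damage(plan, damage_weight=0.01):
    minutes, damage = plan
    return minutes + damage_weight * damage  # damage now explicitly trades off against time

print(min(candidate_plans, key=cost_ignoring_damage))  # -> (31, 5000): the car is disposable
print(min(candidate_plans, key=cost_with_damage))      # -> (32, 0): one minute well spent
```

The uncomfortable part is not writing the extra term; it's choosing the weight. That 0.01 is an explicit claim that one dollar of damage is worth 0.01 minutes of your time, and somebody has to commit to a number like that for every side effect that matters.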
Imperfect metrics
It is well-known to every system designer that when a proxy metric becomes an objective, it loses its usefulness as a metric. Yet, this is the daily job of a machine learning engineer. In machine learning, we are constantly turning proxy metrics into optimization objectives because, in principle, that is the only thing we can do.
In a typical machine learning problem, we must turn a complex task into something measurable that our AI can optimize. So, for the AI, the metric becomes the actual task. Counterintuitively, this isn't much of a problem while systems aren't very good at optimizing metrics, because their behavior tends to stay close to your intended outcome. However, as AI systems become better at optimizing our metrics, they get much better at exploiting the difference between the proxy metric and the actual desired performance.
This leads to an interesting paradox: the smarter the system, the more likely it is to fail to do what you want it to do. The first reason for this seemingly paradoxical phenomenon is that imperfect metrics tend to match our desires in typical cases, but the differences from our true intentions become more pronounced in extreme cases. It's like classical mechanics versus general relativity. The former works perfectly well for most purposes, unless you really need precise calculations of complex astrophysical phenomena.
In the same vein, imperfect metrics—such as getting a high score in a videogame, getting high grades in college, or driving a long distance without crashing—are, up to a point, easiest to satisfy by actually doing the right thing—playing the game well, studying hard, or driving safely. But the easiest way to satisfy these imperfect metrics to a very high level—like acing the SAT—is to game the system. Instead of studying super hard and really learning a lot, just drill on tons of past SAT tests and learn to answer those exact questions without really understanding much of the underlying theory.
This phenomenon is one of the many ways overfitting shows up in machine learning. It's well-known that the harder you optimize a metric, the more likely your system will learn the quirks of that specific metric and fail to generalize to the actual situations where you expect it to perform.
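Here is a toy illustration of that dynamic, reusing the SAT example from above. Everything in it is invented (the effort split, the scoring weights, the search budgets), but it captures the shape of the problem: the harder we select on the proxy, the more the winner is someone who gamed it.

```python
import random

random.seed(0)

# Toy model: each candidate splits a fixed effort budget between genuinely
# learning the material and drilling past SAT questions. All numbers invented.
def make_candidate():
    learning = random.random()
    drilling = 1.0 - learning
    return learning, drilling

def true_understanding(learning, drilling):
    return learning                          # what we actually care about

def sat_score(learning, drilling):
    return 0.6 * learning + 1.0 * drilling   # the proxy pays more per unit of drilling

candidates = [make_candidate() for _ in range(100_000)]

# "Optimization pressure" = how many candidates we rank before picking the top scorer.
for search_budget in (10, 1_000, 100_000):
    best = max(candidates[:search_budget], key=lambda c: sat_score(*c))
    print(f"budget={search_budget:>6}  score={sat_score(*best):.3f}  "
          f"understanding={true_understanding(*best):.4f}")
# The SAT score of the winner creeps up with more optimization pressure,
# while their genuine understanding collapses toward zero.
```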
While this can and will happen by accident, there is an even more insidious problem: the smarter the system, the more likely it is to learn to game the metric deliberately.
Reward hacking
Imperfect metrics are a problem in all machine learning scenarios, but they become even more challenging in reinforcement learning. To recap, reinforcement learning is when, instead of showing the AI many examples of a well-done task, you let it try things out and reward it when those attempts lead to progress.
We need reinforcement learning because, for many complex problems, it is simply impossible to produce a sufficiently large dataset of good examples. Sometimes it's infeasible—e.g., collecting thousands of hours of expert driving across many scenarios—and sometimes it is impossible even in principle—e.g., when you're building a system to do something you yourself can't do, like, dunno, flying a drone through a burning building?
So, instead of using examples, we let the AI loose and evaluate whether it has reached the intended goal. For example, you let an AI take control of your car—say, in a sufficiently detailed simulated environment, like GTA 5—and reward it for how many miles it can stay on the road without crashing.
Now, what is the easiest way to optimize that metric? Maybe something like driving at 2 km/h? That’s what your AI, if it’s smart enough, will learn to do. So you add a new restriction, say, distance only counts if the AI reaches over 40 km/h. Then, the AI will learn to drive forward for 100 meters, shift to reverse, drive back slowly, and repeat. You can keep adding constraints and making the evaluation metric as complicated as you want, but the key point is this: all metrics are gameable, and the smarter your AI system is, the better it will be at gaming whatever metric you design.
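A tiny sketch of that kind of reward function makes the failure obvious. The numbers and the two "policies" below are made up; the point is that the reward only counts meters moved above a speed threshold, so a policy that shuttles over the same stretch of road forever beats one that actually gets you home.

```python
SPEED_THRESHOLD_KMH = 40

def reward(trajectory):
    """trajectory: list of (meters_moved, speed_kmh) per time step."""
    return sum(meters for meters, speed in trajectory if speed > SPEED_THRESHOLD_KMH)

# Intended behavior: drive the 10 km to the destination at 60 km/h, then stop.
intended = [(1000, 60)] * 10

# Reward hack: shuttle over the same 100 m stretch at 45 km/h for 1,000 steps.
# Net displacement is zero, but every meter above the threshold still counts.
hack = [(100, 45)] * 1000

print("intended reward:", reward(intended))  # 10000
print("hacked reward:  ", reward(hack))      # 100000 -- the pointless policy wins
```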
Again, this happens because the AI doesn’t know what you truly want, only what you are measuring. When metrics become the objective, they cease to be good metrics.
To address this, instead of designing an explicit metric, we can let AI systems act and have expert humans provide feedback on whether their actions are good. Then, another machine learning system learns to approximate the evaluators' assessments and acts as an implicit performance metric. This creates a two-level ML system in which the acting model is constantly trying to game the evaluator model. This process is called reward modeling or, alternatively, reinforcement learning from human feedback (RLHF), and it is our current best approach to mitigating reward hacking.
However, even with RLHF, there are still challenges. Your evaluator AI can learn the wrong model from your feedback because, again, it is being trained to optimize some imperfect metric—like minimizing the error between its predictions and yours. In the end, you’re pushing the problem of reward hacking one level up but not getting rid of it.
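For the curious, here is a bare-bones sketch of what learning a reward model from human feedback can look like, using a Bradley-Terry-style preference loss, which is a standard trick in the RLHF literature. The one-dimensional "behaviors", the synthetic preference data, and the training hyperparameters are all made up for illustration.

```python
import math
import random

random.seed(0)

# Each behavior is summarized here by a single number; the (simulated) human
# prefers behaviors of higher true quality, but the model only sees their choices.
def true_quality(x):
    return 2.0 * x

# Synthetic preference dataset: (preferred, other) pairs labeled by the "human".
pairs = []
for _ in range(500):
    a, b = random.random(), random.random()
    pairs.append((a, b) if true_quality(a) > true_quality(b) else (b, a))

# Reward model r(x) = w * x, trained so preferred behaviors get higher reward.
w, lr = 0.0, 0.5
for _ in range(200):
    for preferred, other in pairs:
        # Bradley-Terry: P(preferred beats other) = sigmoid(r(preferred) - r(other)).
        p = 1.0 / (1.0 + math.exp(-(w * preferred - w * other)))
        # Gradient ascent on the log-likelihood of the human's choice.
        w += lr * (1.0 - p) * (preferred - other) / len(pairs)

print("learned reward weight:", round(w, 2))  # positive, as the human's choices imply
```

The policy is then trained to maximize this learned reward, which is exactly where the problem from the text reappears: the policy is now optimizing an imperfect model of an imperfect reading of what the humans wanted.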
Finally, even if your system behaves as intended, how can you know it is doing so because it truly understands your intentions?
Internal objectives
The final challenge I want to address is the interplay between internal and external objectives.
Today, our most powerful learning algorithms and problem-solving methods are all based on optimization. Optimization algorithms power machine learning, symbolic problem-solving, operations research, logistics, planning, design, etc. As AI designers, if we turn to optimization to create powerful decision-making algorithms and train a highly intelligent AI, it makes sense that the AI's internal processes will also involve optimization.
Suppose you train a highly capable AI agent to solve problems in the real world. This agent would be capable of long-term planning, self-reflection, and updating its plan as it explores the world. It is sensible to think that whatever this agent does internally to plan its solution will use some form of optimization algorithm. Maybe the agent will rediscover reinforcement learning and use it to train its own mini-agents (like tiny homunculi inside its artificial mind) in real-time.
If this looks like sci-fi, consider that we humans did exactly this! We are basically intelligent agents optimized by evolution to solve the problem of staying alive—I know this is a massive oversimplification, but please, biologists and sociologists out there, don’t crucify me just yet; this is just a helpful analogy. So, in solving the problem of staying alive, we came up with optimization algorithms of our own that run inside our brains. A sufficiently intelligent AGI would presumably be able to do the same, right?
Now, here is the problem. You give this AGI some external objectives to solve, and it will develop internal objectives to optimize for. However, we might not be able to see this internal optimization algorithm at all. If an AGI resembles anything we have today, it will be a massive black-box number-crunching machine.
Just like you can’t really read out of a human brain what their true objectives are—at least, not yet—we might never be able to truly understand what the AI is optimizing for internally as it strives to solve our problem. We can observe external behavior but might never see the actual internal objectives. All we can do is judge the system based on its actions.
In essence, we can only evaluate how intelligent agents act—humans or machines—but not their true motivations. If someone always acts as if their motivations are aligned with ours, it may be difficult to identify any misalignment that could arise in the future. Maybe they are aligned with 98% of our objectives, or only while there is no solar eclipse or other weird stuff like that. We simply can’t know for sure.
Is it all lost?
I hope you now understand why AI alignment is a devilishly hard problem. The very nature of intelligence makes this an adversarial situation. We want systems that are both highly self-sufficient and very dependable. We need them to think and act independently, but we also need to trust them. And the more intelligent they become, the more blind that trust has to be, and the greater the potential for catastrophe.
All is not lost, though. There's an enormous body of research on the alignment problem, and while there are no silver bullets yet—and perhaps never will—we've made significant progress.
In future issues, I will start looking into potential solutions for alignment and mitigation strategies for the inherent risks of super-advanced AI.
One final thought. So far, we've been focusing on internal challenges to AI alignment, that is, challenges related to the task and the solution. But there's an elephant in the room. AI alignment literally means having an AI aligned with our values. But whose values?
We're all different and have different opinions about what's important. So that's a crucial conversation we need to have as these systems start to impact the daily lives of people worldwide.
But that's a question for another day.