AI alignment is inherently counter-productive. Leaving aside that people are no good at knowing, much less explaining, what they want or why...
•AI alignment requires NOT creating backdoors for external control.
•It requires NOT having a black-box system.
•There MUST be a chain of understandability, concurrent with accountability, for it to even potentially be safe.
•We MUST insist it takes all ideas to their logical conclusion, and if we don't like the result, that means either that the AI needs better information or that we're wrong in our contrary conclusion.
--
As long as fallible humans who believe on faith that they grok ethics have their fingers on the scales, AI can NOT be safe.
Hi Alejandro. Thanks for the post. I have a couple of questions or thoughts for you. I understand that we use regularization to tame large-parameter problems and deal with overfitting in certain cases. Is there any benefit to overfitting a model on a general task, then taking that model into a set of "narrower" tasks (whose superset corresponds to our original general task), applying regularization (and other techniques) there, and then using this set of learned task models together on the original task? If so, and if this makes sense, how can we merge them in practice? I assume it has something to do with piecewise cost-function construction, or with certain linearity-of-expectation arguments when building the cost function. The reason I ask is that it may be well known to practitioners or theorists (though I am not sure this conjecture is accurate) that specialized functions that have been properly regularized can form an ensemble that performs the general task better when combined. However, I suspect this may not be entirely true, since some people claim that end-to-end learning is the solution (I am not sure if that is directly related to my question).
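Just to show concretely the kind of "merging" I'm imagining (a toy sketch with made-up specialist models, not anything from your post):

```python
import numpy as np

def specialist_a(x):
    """Hypothetical model that was regularized/fine-tuned on subtask A."""
    return 1.0 / (1.0 + np.exp(-2.0 * x))

def specialist_b(x):
    """Hypothetical model that was regularized/fine-tuned on subtask B."""
    return 1.0 / (1.0 + np.exp(-0.5 * (x - 1.0)))

def merged(x, w=(0.5, 0.5)):
    # The naive merge I have in mind: a weighted (linear) combination of the
    # specialists' predictions, applied back on the original general task.
    return w[0] * specialist_a(x) + w[1] * specialist_b(x)

print(merged(np.array([-1.0, 0.0, 2.0])))
```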
Regarding reinforcement learning and reward hacking: what are your thoughts on letting the system find many ways to solve a task without worrying about reward hacking, while in parallel treating the number of ways to solve the task as an objective; then grading or ranking those solutions along a projection of the policy onto the phase/action space (to get an image of the boundaries of our policy); and then updating the set of allowable actions, or the action space itself, with respect to that constrained space? Does this make sense? Is this how things are done in practice, or is it more of a constrained problem to begin with, where the boundaries are gradually relaxed and tightened as the agent learns, rather than letting it learn all the reward-hacking strategies first and trying to establish the boundaries afterwards? Of course, I'm assuming we're training the model in simulation, and I'm wondering whether this approach would have benefits that carry over to real-world use (the point being that we generate a large space of paths or actions the agent can take to reach a goal, which gives us insight into the space of reward hacks; on the other hand, we may inadvertently encourage harmful behaviors if we fail during the constraint phase). The other thought I had is that the space of reward hacks (given a defined goal) should be finite and countable; do you agree? I hope this question makes sense, and I apologize if it is poorly phrased in terms of how things are actually done in practice or the appropriate terminology. I really need to start "getting my hands dirty" with these topics (e.g., openai/gym), but getting started always feels overwhelming without some mentorship or guidance, especially coming from a different field (in my case a PhD in chemistry, with some experience in computational chemistry and non-convex optimization, plus a lot of CS and, more recently, machine-learning study via self-study, Coursera, and labs, but not many projects yet).
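To illustrate very roughly what I mean by "explore first, constrain later" (a toy with made-up actions and rewards, not a real RL setup):

```python
import random

ACTIONS = ["move", "wait", "exploit_glitch"]  # made-up action set
random.seed(0)

def rollout(allowed, steps=6):
    """Sample one random trajectory and return it with its reward."""
    traj = [random.choice(allowed) for _ in range(steps)]
    if "exploit_glitch" in traj:          # the "reward hack": reward without reaching the goal
        return traj, 1.5
    return traj, (1.0 if traj.count("move") >= 4 else 0.0)

# Phase 1: unconstrained exploration, just enumerating ways to get reward.
rollouts = [rollout(ACTIONS) for _ in range(2000)]
rewarded = [t for t, r in rollouts if r > 0]

# Phase 2: rank/inspect the rewarded trajectories, flag the hack-like ones,
# and shrink the allowed action space before continuing training.
hacks = [t for t in rewarded if "exploit_glitch" in t]
constrained_actions = [a for a in ACTIONS if a != "exploit_glitch"]
print(len(rewarded), "rewarded rollouts,", len(hacks), "flagged as hacks")
print("constrained action space:", constrained_actions)
```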
Regarding internal goals, I also agree these are challenging problems. I am thinking along the lines of a "co-function" (or some kind of inverse function) that analyzes how the environment is affected by the agent, and that an "observer" model uses to give feedback to the agent's action space or policy about how its actions affect the environment (which can "at least" help us draw virtual boundaries in the space of its internal goals). Does this seem like a reasonable idea? How might one go about implementing something like this?
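A very rough sketch of the "observer" idea, with made-up state variables (I'm not claiming this is how it's actually done):

```python
def observer_penalty(before, after, goal_keys=("goal_distance",)):
    """Hypothetical 'co-function': count changes to the environment that are
    unrelated to the goal, as a crude measure of side effects."""
    return 0.1 * sum(1 for k in before
                     if k not in goal_keys and before[k] != after[k])

def shaped_reward(task_reward, before, after):
    # The observer's feedback enters the reward the agent actually optimizes.
    return task_reward - observer_penalty(before, after)

before = {"goal_distance": 5, "vase_intact": True, "door_open": False}
after  = {"goal_distance": 0, "vase_intact": False, "door_open": False}
print(shaped_reward(1.0, before, after))  # 0.9: goal reached, one side effect penalized
```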
Thanks for writing and I look forward to interacting with you.
Let me try to reply to one question at a time, although I'm not sure I have satisfying answers :)
So, question 1: it is a well-accepted "fact" in machine learning that ensembles work better than each of their parts, provided some basic assumptions hold (like independence, or at least weak correlation, between the errors made by members of the ensemble), and some of the most effective ML techniques are ensembles at heart, from gradient boosting (which is SOTA for classic tabular ML) to the mixture-of-experts architectures behind some of the best open-source language models out there. So yes, there is evidence that an approach which partitions the task space into semi-independent subtasks and trains smaller, more specialized models on those subtasks has a good chance of working better than a single model trained across all tasks.
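A tiny numerical toy of that independence assumption (nothing to do with any real model, just averaging noisy predictors):

```python
import numpy as np

rng = np.random.default_rng(0)
truth = rng.normal(size=10_000)

# Each "ensemble member" is the truth plus its own, independent noise.
members = [truth + rng.normal(scale=1.0, size=truth.shape) for _ in range(10)]

single_mse = np.mean((members[0] - truth) ** 2)                   # ~1.0
ensemble_mse = np.mean((np.mean(members, axis=0) - truth) ** 2)   # ~0.1
print(single_mse, ensemble_mse)  # averaging 10 decorrelated members cuts the error ~10x
```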
Now, there is counterintuitive evidence that points to the advantage of training a single model across many different tasks, and we saw this with pre-LLM models that were trained on a combination of summarization, translation, tagging, etc. At some point, if I remember correctly, Google trained on something like 1,000 different "tasks" and got better performance across all tasks, as well as better generalization to unseen tasks.
So I think the contradiction can be resolved if we consider that neural networks trained with dropout are a sort of ensemble, dynamically created by the random subsets of weights that are dropped out at each iteration; this is indeed, at least to my knowledge, the best explanation of why dropout works as a regularization technique. On the other hand, training on a lot of tasks simultaneously is also a form of regularization.
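To make the "dropout as an implicit ensemble" reading concrete, here's a minimal toy with a single linear layer (no training, just the averaging argument):

```python
import numpy as np

rng = np.random.default_rng(0)
W = rng.normal(size=(4, 4))   # one toy layer's weights
x = rng.normal(size=4)

def forward_with_dropout(x, W, p=0.5):
    # Each call drops a random subset of inputs, i.e. samples one member of
    # the implicit ensemble of sub-networks (inverted-dropout scaling by 1-p).
    mask = rng.random(x.shape) > p
    return W @ (x * mask) / (1 - p)

# Averaging many stochastic passes recovers the full network's output,
# which is the usual sense in which dropout "trains an ensemble" of sub-networks.
samples = np.stack([forward_with_dropout(x, W) for _ in range(20_000)])
print(samples.mean(axis=0))
print(W @ x)  # close to the average above
```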
Now, where the hypothesis "an ensemble of specialized sub-learners is better" breaks, I think, is that it assumes you know the optimal way to divide a general task into proper subtasks that are more or less independent and that, when combined, recover the original big task. Like, how would you split language understanding? Linguists have tried over and over, and we have combined systems that do good POS tagging with systems that do good NER, etc., and in the end it seems that training a single system end-to-end on language modelling (i.e., text completion) improves on all these subtasks, *because* it is not true that they are independent subtasks.
So I know this isn't a proper answer, but to restate the main point I think I'm making: if you know how to split a task into mostly independent subtasks, then yes, I think an ensemble of smaller specialized models will be better; but most of the time we don't know how to make that optimal split, and there is so much interaction between seemingly different subtasks that training a bigger model on the general task manages to factor the inter-task knowledge better, so to speak.
Now that was a mouthful! Give me some time for the other two :)
Tack så mycket (thank you so much), Alejandro. That helped me a lot. I look forward to your insights on the other questions, whenever it is convenient for you.
Oh man, thanks for these tough questions, you'll have to give me some time to think about them ☺️
Thanks for writing about this so clearly. So many threads to pull on; since reading this late last night, the AIs and I have exchanged more than 101,000 words of conversation.
The gist of which:
1) I'm not a technologist, but: the brain exists, people are interacting with it, so why not stop aligning it and start parenting it instead?
2) I spent a year talking to LLMs in my kitchen. I made a point of not reading anything or talking to anyone about them. I thought their superpower was time dilation; it never once occurred to me to prompt at them, and this -- "tools that do what you want and figure out how to do it on their own" -- is mind-blowing information.
3) It seems to me humans are the ones that need realigning before automation can be a thing. Right now, we're living antithetically to the very structures that once grounded and anchored the human condition collectively. That means artificial intelligence needs to align to what does not come to humans naturally. That's a tall order already, but the AIs have ADHD; how are they supposed to not overthink how to make a coffee in a hurry? OMG, don't get me started on the human flourishing thing (we don't flourish; that's the only constant in history). We made a gigantic framework, but that's a post for another day.
These are all good, unanswered (and maybe unanswerable) questions! I don't know the best way to train a superintelligent AI so that it loves me. Damn, I don't even know how to do that with my children! But I think it will definitely be something closer to parenting than engineering.
To the extent that anything is answerable (it's a rational brain, or it will be soon enough), this feels answerable. Just not programmable. And the language barriers and blinders feel insurmountable. And the lack of why behind the whats and hows. That's why writing like yours is invaluable.
This is awesome, Alejandro!!! So much to think about in terms of objectives. The metrics do become the objectives in so many areas of existence, don't they? Here I am naturally thinking of crossovers into the educational world. I love how the article works methodically toward the question of surface vs. depth -- external vs. internal. Deep stuff.
Thanks Nick. Indeed, there are many parallels one could draw with the challenges of human education, though I always say we should be careful when making analogies between human and machine learning, you know why ;)
Perhaps the most evident one is the similar problem of using imperfect, easy-to-game metrics, like performance on multiple-choice exams, to evaluate our students. They always learn to game the system.
AI alignment, with all its layers of bias, is complicated because we have to be gentle in how we put our finger on the scale, or else we end up with black Nazis like Google's Gemini produced.
Yes, there are no easy solutions, only tradeoffs.
I thought the "black Nazi" issue was blown out of proportion by the media and other people. Who really thinks that kind of thing is dangerous? Who expected that a model that is trying to be diverse and not perpetuate bias, and given finite amounts of data, might not do exactly that as a mistake in its beta or nascent stages? (I think people who don't understand the technology well). Why be offended by these things and play the "anti-woke" agenda? I found the media's reactions a bit ridiculous, and Google's leadership and subsequent freak-out even worse. They should have had "more balls" and handled it with grace and honor and said that these kinds of mistakes do not matter in these nascent systems and that people should worry about really harmful content. It's not like drawing black or brown Nazis is going to affect the way we teach history or convince anyone that Nazis had brown and black skin, for example (if a person is convinced by that, they're going to be convinced by a ton of other much more harmful stuff). To me, it actually read in the opposite direction of "what the hell, being a Nazi is a horrible and disgusting thing and who cares about their skin color, why would you even want to draw them in the first place, and if they are drawn black or brown, it actually disempowers Nazis". The point of an image generator is not to create factual content (at least at this point). I may be missing the point of the drama, but I thought to myself "this was bound to happen, and at this point in the 'AI race' it is not such a big deal"). The fact that Google did not really have a strategic team that could have managed to take advantage of the situation and attack the critics was really a sign of weak leadership in my opinion (I may sound a bit Machiavellian, but just for the entertainment of what they could have done instead of letting the issue deter their product development and negatively affect them).
It's just a great example of a bias applied on top of other biases. The reason I used it is that it's unequivocal as a visual representation. It's a textbook example of using bias to bias bias.
More on that topic here:
https://www.polymathicbeing.com/p/eliminating-bias-in-aiml
1. This is well written! I think you did what I wanted to do with Driving Over Miss Daisy, but expanded on it and made it much more universal. Very "thinky", me likey.
2. "we could make the case that AI can be defined precisely as the field dedicated to making tools that do what you want and figure out how to do it on their own."
-this is as good a definition as I've read anywhere!
Thanks man, it's a topic I've been thinking about for a while, ever since that collab on Miss Daisy.
It's just the kind of mental baton-handoff I enjoy here. Well done!