16 Comments

Hi Alejandro. Thanks for the post. I have a couple of questions or thoughts for you. I understand that we use regularization to tame large-parameter problems and to deal with overfitting in certain cases. Is there any benefit to the following workflow: overfit a model on a general task, then adapt that model to a set of "narrower" tasks (whose superset corresponds to the original general task) with regularization and other techniques, and finally use the resulting set of learned task models together on the original task? If so, how would we merge them in practice? I assume it has something to do with piecewise cost-function construction, or with linearity-of-expectation arguments when building the cost function. The reason I ask is that practitioners or theorists may already know (though I am not sure this conjecture is accurate) that specialized, properly regularized functions can be combined into an ensemble that performs the general task better than a single model. However, I suspect this may not be entirely true, since some people claim that end-to-end learning is the answer (I am not sure whether that is directly related to my question).
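
A minimal sketch of the "merge the specialists" question, assuming the simplest answer: train each specialist with its own regularization on a narrower slice of the problem, then ensemble their predictions on the general task. The data, the `Ridge` specialists, and the split into sub-tasks below are all illustrative placeholders, not anything from the post; a learned gating (mixture-of-experts) weighting would be the more principled way to combine them.

```python
import numpy as np
from sklearn.linear_model import Ridge  # L2-regularized specialist models

rng = np.random.default_rng(0)
X = rng.normal(size=(300, 5))
y = X @ np.array([1.0, -2.0, 0.5, 0.0, 3.0]) + rng.normal(scale=0.1, size=300)

# Pretend each third of the data is one "narrower" sub-task (a stand-in for
# whatever decomposition of the general task you actually have).
specialists = []
for chunk in np.array_split(np.arange(300), 3):
    specialists.append(Ridge(alpha=1.0).fit(X[chunk], y[chunk]))

def ensemble_predict(X_new):
    # Simplest possible merge: average the specialists' predictions.
    return np.mean([m.predict(X_new) for m in specialists], axis=0)
```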

Regarding reinforcement learning: when dealing with reward hacking, what do you think of letting the system first find many ways to solve a task without worrying about reward hacking, while treating the number of distinct solutions as a parallel objective; then grading or ranking those solutions by projecting the policy onto the action space (to get an image of the policy's boundaries); and finally updating the set of allowable actions, or the action space itself, with respect to that constrained region? Does this make sense? Is this how things are done in practice, or is it more of a constrained problem from the start, where we gradually relax and tighten the boundaries as the agent learns, rather than letting it learn all the reward-hacking strategies first and establishing the boundaries afterwards? Of course, I'm assuming we train the model in simulation, and I wonder whether this approach would have benefits that transfer to real-world use (the point being that we generate a large space of paths or actions the agent can take to reach the goal, which gives us insight into the space of reward hacks; on the other hand, we may inadvertently encourage harmful behaviors if the constraint phase fails). My other thought is that the space of reward hacks (for a defined goal) should be finite and countable; do you agree? I hope this question makes sense, and I apologize if it is poorly phrased with respect to how things are actually done in practice or the standard terminology. I really need to start "getting my hands dirty" (e.g. openai/gym) with these topics, but getting started always feels overwhelming without some mentorship or guidance, especially for someone coming from a different field (in my case a PhD in chemistry, with some experience in computational chemistry and non-convex optimization, plus a lot of self-study of CS and, more recently, machine learning via Coursera and labs, but not many projects yet).
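
One very rough sketch of the "explore first, constrain later" step, using the gym/gymnasium wrapper interface the commenter mentions. The `allowed` list here is a hypothetical artifact of an earlier unconstrained exploration phase (actions whose trajectories were judged not to be reward hacks); everything else is just the mechanics of shrinking the action space, not a claim about how this is done in practice.

```python
import gymnasium as gym  # the classic openai/gym API has the same wrapper interface

class ConstrainedActions(gym.ActionWrapper):
    """Remap a Discrete action space onto a (possibly shrinking) allowed subset."""

    def __init__(self, env, allowed):
        super().__init__(env)
        self.allowed = list(allowed)  # original action ids we still permit
        self.action_space = gym.spaces.Discrete(len(self.allowed))

    def action(self, act):
        # The agent chooses an index into the allowed subset; translate it back
        # into the original action id before stepping the environment.
        return self.allowed[act]

# CartPole only has two actions, so this particular constraint is trivial;
# the point is the mechanism of tightening the space between training phases.
env = ConstrainedActions(gym.make("CartPole-v1"), allowed=[0, 1])
```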

Regarding internal goals, I also agree that these are challenging problems. I am thinking along the lines of a "co-function" (some kind of inverse function) that analyzes how the environment is affected by the agent, and which an "observer" model uses to feed back to the agent's action space or policy how its actions affect the environment (which could at least help us draw virtual boundaries in the space of its internal goals). Does this seem like a reasonable idea? How might one go about implementing something like it?
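
For what it's worth, the closest existing technique I know of to this "observer" idea is an impact penalty from the AI-safety literature: shape the reward so the agent is penalized for changing the environment more than the task requires. The sketch below is purely illustrative, and `state`, `baseline_state`, and `beta` are placeholder names, not anything from the post.

```python
import numpy as np

def shaped_reward(task_reward, state, baseline_state, beta=0.1):
    # "Observer" as an impact penalty: measure how far the environment state
    # has drifted from a no-op baseline and subtract a scaled penalty.
    impact = np.linalg.norm(np.asarray(state) - np.asarray(baseline_state))
    return task_reward - beta * impact
```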

Thanks for writing and I look forward to interacting with you.

Mar 28 · Liked by Alejandro Piad Morffis

Thanks for writing about this so clearly. So many threads to pull on; since reading this late last night, the AIs and I have exchanged more than 101,000 words of conversation.

The gist of which:

1) Not a technologist: the brain exists, people are interacting with it, so why not stop aligning it and start parenting it?

2) I spent a year talking to LLMs in my kitchen. Made a point of not reading anything or talking to anyone about them. I thought their superpower was time dilation; never once did it occur to me to prompt at them, and this -- "tools that do what you want and figure out how to do it on their own" -- is mind-blowing information.

3) Seems to me humans are the ones that need the realigning before automation can be a thing? Right now, we're living antithetically to the very structures that once grounded and anchored the human condition collectively. That means artificial intelligence needs to align to what does not come to humans naturally. That's a tall order already, but the AIs have ADHD; how are they supposed to not overthink how to make a coffee in a hurry? OMG, don't get me started on the human flourishing thing (we don't flourish, that's the only constant in history). We made a gigantic framework, but that's a post for another day.

Mar 27 · Liked by Alejandro Piad Morffis

This is awesome, Alejandro!!! So much to think about in terms of objectives. The metrics do become the objectives in so many areas of existence, don't they? Here I am thinking naturally of crossovers into the educational world. I love how the article works methodically toward the question of surface vs. depth, external vs. internal. Deep stuff.

Mar 25 · Liked by Alejandro Piad Morffis

AI alignment, with all the layers of bias, is complicated because we have to be gentle in how we put our finger on the scale, or else you end up with Black Nazis like Google's Gemini.

Mar 25 · Liked by Alejandro Piad Morffis

1. This is well written! I think you did what I wanted to do with Driving Over Miss Daisy, but expanded on it and made it much more universal. Very "thinky", me likey.

2. "we could make the case that AI can be defined precisely as the field dedicated to making tools that do what you want and figure out how to do it on their own."

- this is as good a definition as I've read anywhere!
