AI alignment is inherently counter-productive. Leaving aside that people are no good at knowing, much less explaining, what they want or why...
•AI alignment requires NOT creating backdoors for external control.
•It requires NOT having a black-box system.
•There MUST be a chain of understandability, concurrent with accountability, for it to even potentially be safe.
•We MUST insist it takes all ideas to their logical conclusion, and if we don't like the result, that means either that the AI needs better information or that we're wrong in our contrary conclusion.
--
As long as fallible humans who believe on faith that they grok ethics have their fingers on the scales, AI can NOT be safe.
Hi Alejandro. Thanks for the post. I have a couple of questions or thoughts for you. I understand that we use regularization to tame large-parameter problems and deal with overfitting in certain cases. Is there any benefit to overfitting a model on a general task, then taking that model into a set of "narrower" tasks (whose superset corresponds to our original general task), applying regularization (and other techniques) there, and then using this set of learned task models together on the original task? If so, and if this makes sense, how can we merge them in practice? I assume it has something to do with piecewise cost-function construction, or with certain linearity-of-expectation arguments when building the cost function. The reason I ask is that it may be well known to practitioners or theorists (though I am not sure this conjecture is accurate) that specialized functions that have been properly regularized can form an ensemble that performs the general task better when combined. However, I suspect this may not be entirely true, since some people claim that end-to-end learning is the solution (I am not sure if that is directly related to my question).
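Just to show concretely the kind of "merging" I'm imagining (a toy sketch with made-up specialist models, not anything from your post):

```python
import numpy as np

def specialist_a(x):
    """Hypothetical model that was regularized/fine-tuned on subtask A."""
    return 1.0 / (1.0 + np.exp(-2.0 * x))

def specialist_b(x):
    """Hypothetical model that was regularized/fine-tuned on subtask B."""
    return 1.0 / (1.0 + np.exp(-0.5 * (x - 1.0)))

def merged(x, w=(0.5, 0.5)):
    # The naive merge I have in mind: a weighted (linear) combination of the
    # specialists' predictions, applied back on the original general task.
    return w[0] * specialist_a(x) + w[1] * specialist_b(x)

print(merged(np.array([-1.0, 0.0, 2.0])))
```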
Regarding reinforcement learning and reward hacking: what are your thoughts on letting the system find many ways to solve a task without worrying about reward hacking, while in parallel treating the number of ways to solve the task as an objective; then grading or ranking those solutions along a projection of the policy onto the phase/action space (to get an image of the boundaries of our policy); and then updating the set of allowable actions, or the action space itself, with respect to that constrained space? Does this make sense? Is this how things are done in practice, or is it more of a constrained problem to begin with, where the boundaries are gradually relaxed and tightened as the agent learns, rather than letting it learn all the reward-hacking strategies first and trying to establish the boundaries afterwards? Of course, I'm assuming we're training the model in simulation, and I'm wondering whether this approach would have benefits that carry over to real-world use (the point being that we generate a large space of paths or actions the agent can take to reach a goal, which gives us insight into the space of reward hacks; on the other hand, we may inadvertently encourage harmful behaviors if we fail during the constraint phase). The other thought I had is that the space of reward hacks (given a defined goal) should be finite and countable; do you agree? I hope this question makes sense, and I apologize if it is poorly phrased in terms of how things are actually done in practice or the appropriate terminology. I really need to start "getting my hands dirty" with these topics (e.g., openai/gym), but getting started always feels overwhelming without some mentorship or guidance, especially coming from a different field (in my case a PhD in chemistry, with some experience in computational chemistry and non-convex optimization, plus a lot of CS and, more recently, machine-learning study via self-study, Coursera, and labs, but not many projects yet).
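To illustrate very roughly what I mean by "explore first, constrain later" (a toy with made-up actions and rewards, not a real RL setup):

```python
import random

ACTIONS = ["move", "wait", "exploit_glitch"]  # made-up action set
random.seed(0)

def rollout(allowed, steps=6):
    """Sample one random trajectory and return it with its reward."""
    traj = [random.choice(allowed) for _ in range(steps)]
    if "exploit_glitch" in traj:          # the "reward hack": reward without reaching the goal
        return traj, 1.5
    return traj, (1.0 if traj.count("move") >= 4 else 0.0)

# Phase 1: unconstrained exploration, just enumerating ways to get reward.
rollouts = [rollout(ACTIONS) for _ in range(2000)]
rewarded = [t for t, r in rollouts if r > 0]

# Phase 2: rank/inspect the rewarded trajectories, flag the hack-like ones,
# and shrink the allowed action space before continuing training.
hacks = [t for t in rewarded if "exploit_glitch" in t]
constrained_actions = [a for a in ACTIONS if a != "exploit_glitch"]
print(len(rewarded), "rewarded rollouts,", len(hacks), "flagged as hacks")
print("constrained action space:", constrained_actions)
```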
Regarding internal goals, I also agree these are challenging problems. I am thinking along the lines of a "co-function" (or some kind of inverse function) that analyzes how the environment is affected by the agent, and that an "observer" model uses to give feedback to the agent's action space or policy about how its actions affect the environment (which can "at least" help us draw virtual boundaries in the space of its internal goals). Does this seem like a reasonable idea? How might one go about implementing something like this?
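A very rough sketch of the "observer" idea, with made-up state variables (I'm not claiming this is how it's actually done):

```python
def observer_penalty(before, after, goal_keys=("goal_distance",)):
    """Hypothetical 'co-function': count changes to the environment that are
    unrelated to the goal, as a crude measure of side effects."""
    return 0.1 * sum(1 for k in before
                     if k not in goal_keys and before[k] != after[k])

def shaped_reward(task_reward, before, after):
    # The observer's feedback enters the reward the agent actually optimizes.
    return task_reward - observer_penalty(before, after)

before = {"goal_distance": 5, "vase_intact": True, "door_open": False}
after  = {"goal_distance": 0, "vase_intact": False, "door_open": False}
print(shaped_reward(1.0, before, after))  # 0.9: goal reached, one side effect penalized
```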
Thanks for writing and I look forward to interacting with you.
Let me try to reply to one question at a time, although I'm not sure I have satisfying answers :)
So, question 1: it is a well-accepted "fact" in machine learning that ensembles work better than each of their parts, provided some basic assumptions hold (like independence, or at least weak correlation, between the errors made by members of the ensemble), and some of the most effective ML techniques are ensembles at heart, from gradient boosting (which is SOTA for classic tabular ML) to the mixture-of-experts architectures behind some of the best open-source language models out there. So yes, there is evidence that an approach which partitions the task space into semi-independent subtasks and trains smaller, more specialized models on those subtasks has a good chance of working better than a single model trained across all tasks.
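A tiny numerical toy of that independence assumption (nothing to do with any real model, just averaging noisy predictors):

```python
import numpy as np

rng = np.random.default_rng(0)
truth = rng.normal(size=10_000)

# Each "ensemble member" is the truth plus its own, independent noise.
members = [truth + rng.normal(scale=1.0, size=truth.shape) for _ in range(10)]

single_mse = np.mean((members[0] - truth) ** 2)                   # ~1.0
ensemble_mse = np.mean((np.mean(members, axis=0) - truth) ** 2)   # ~0.1
print(single_mse, ensemble_mse)  # averaging 10 decorrelated members cuts the error ~10x
```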
Now, there is counterintuitive evidence that points to the advantage of training a single model across many different tasks, and we saw this with pre-LLM models that were trained on a combination of summarization, translation, tagging, etc. At some point, if I remember correctly, Google trained on something like 1,000 different "tasks" and got better performance across all tasks, as well as better generalization to unseen tasks.
So I think the contradiction can be resolved if we consider that neural networks trained with dropout are a sort of ensemble, dynamically created by the random subsets of weights that are dropped out at each iteration; this is indeed, at least to my knowledge, the best explanation of why dropout works as a regularization technique. On the other hand, training on a lot of tasks simultaneously is also a form of regularization.
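To make the "dropout as an implicit ensemble" reading concrete, here's a minimal toy with a single linear layer (no training, just the averaging argument):

```python
import numpy as np

rng = np.random.default_rng(0)
W = rng.normal(size=(4, 4))   # one toy layer's weights
x = rng.normal(size=4)

def forward_with_dropout(x, W, p=0.5):
    # Each call drops a random subset of inputs, i.e. samples one member of
    # the implicit ensemble of sub-networks (inverted-dropout scaling by 1-p).
    mask = rng.random(x.shape) > p
    return W @ (x * mask) / (1 - p)

# Averaging many stochastic passes recovers the full network's output,
# which is the usual sense in which dropout "trains an ensemble" of sub-networks.
samples = np.stack([forward_with_dropout(x, W) for _ in range(20_000)])
print(samples.mean(axis=0))
print(W @ x)  # close to the average above
```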
Now, where the hypothesis "an ensemble of specialized sub-learners is better" breaks, I think, is that it assumes you know the optimal way to divide a general task into proper subtasks that are more or less independent and that, when combined, recover the original big task. Like, how would you split language understanding? Linguists have tried over and over, and we have combined systems that do good POS tagging with systems that do good NER, etc., and in the end it seems that training a single system end-to-end on language modelling (i.e., text completion) improves on all these subtasks, *because* it is not true that they are independent subtasks.
So I know this isn't a proper answer, but to restate the main point I think I'm making: if you know how to split a task into mostly independent subtasks, then yes, I think an ensemble of smaller specialized models will be better; but most of the time we don't know how to make that optimal split, and there is so much interaction between seemingly different subtasks that training a bigger model on the general task manages to factor the inter-task knowledge better, so to speak.
Now that was a mouthful! Give me some time for the other two :)
Tack så mycket (thank you so much), Alejandro. That helped me a lot. I look forward to your insights on the other questions, whenever it is convenient for you.
Oh man, thanks for these tough questions, you'll have to give me some time to think about them ☺️
Thanks for writing about this so clearly. So many threads to pull on; since reading this late last night, the AIs and I have exchanged more than 101,000 words of conversation.
The gist of which:
1) I'm not a technologist, but: the brain exists, people are interacting with it, so why not stop aligning it and start parenting it instead?
2) I spent a year talking to LLMs in my kitchen. I made a point of not reading anything or talking to anyone about them. I thought their superpower was time dilation; it never once occurred to me to prompt at them, and this -- "tools that do what you want and figure out how to do it on their own" -- is mind-blowing information.
3) It seems to me humans are the ones that need realigning before automation can be a thing. Right now, we're living antithetically to the very structures that once grounded and anchored the human condition collectively. That means artificial intelligence needs to align to what does not come to humans naturally. That's a tall order already, but the AIs have ADHD; how are they supposed to not overthink how to make a coffee in a hurry? OMG, don't get me started on the human flourishing thing (we don't flourish; that's the only constant in history). We made a gigantic framework, but that's a post for another day.
These are all good, unanswered (and maybe unanswerable) questions! I don't know the best way to train a superintelligent AI so that it loves me. Damn, I don't even know how to do that with my children! But I think it will definitely be something closer to parenting than engineering.
To the extent that anything is answerable (it's a rational brain, or it will be soon enough), this feels answerable. Just not programmable. And the language barriers and blinders feel insurmountable. And the lack of why behind the whats and hows. That's why writing like yours is invaluable.
This is awesome, Alejandro!!! So much to think about in terms of objectives. The metrics do become the objectives in so many areas of existence, don't they? Here I am naturally thinking of crossovers into the educational world. I love how the article works methodically toward the question of surface vs. depth -- external vs. internal. Deep stuff.
Thanks Nick. Indeed, there are many parallels one could draw with the challenges of human education, though I always say we should be careful when making analogies between human and machine learning, you know why ;)
Perhaps the most evident one is the similar problem of using imperfect, easy-to-game metrics, like performance on multiple-choice exams, to evaluate our students. They always learn to game the system.
AI alignment, with all its layers of bias, is complicated because we have to be gentle in how we put our finger on the scale, or else we end up with black Nazis like Google's Gemini produced.
Yes, there are no easy solutions, only tradeoffs.
I thought the "black Nazi" issue was blown out of proportion by the media and other people. Who really thinks that kind of thing is dangerous? Who expected that a model that is trying to be diverse and not perpetuate bias, and given finite amounts of data, might not do exactly that as a mistake in its beta or nascent stages? (I think people who don't understand the technology well). Why be offended by these things and play the "anti-woke" agenda? I found the media's reactions a bit ridiculous, and Google's leadership and subsequent freak-out even worse. They should have had "more balls" and handled it with grace and honor and said that these kinds of mistakes do not matter in these nascent systems and that people should worry about really harmful content. It's not like drawing black or brown Nazis is going to affect the way we teach history or convince anyone that Nazis had brown and black skin, for example (if a person is convinced by that, they're going to be convinced by a ton of other much more harmful stuff). To me, it actually read in the opposite direction of "what the hell, being a Nazi is a horrible and disgusting thing and who cares about their skin color, why would you even want to draw them in the first place, and if they are drawn black or brown, it actually disempowers Nazis". The point of an image generator is not to create factual content (at least at this point). I may be missing the point of the drama, but I thought to myself "this was bound to happen, and at this point in the 'AI race' it is not such a big deal"). The fact that Google did not really have a strategic team that could have managed to take advantage of the situation and attack the critics was really a sign of weak leadership in my opinion (I may sound a bit Machiavellian, but just for the entertainment of what they could have done instead of letting the issue deter their product development and negatively affect them).
It's just a great example of a bias applied on top of other biases. The reason I used it is that it's unequivocal as a visual representation. It's a textbook example of using bias to bias bias.
More on that topic here:
https://www.polymathicbeing.com/p/eliminating-bias-in-aiml
1. This is well written! I think you did what I wanted to do with Driving Over Miss Daisy, but expanded on it and made it much more universal. Very "thinky", me likey.
2. "we could make the case that AI can be defined precisely as the field dedicated to making tools that do what you want and figure out how to do it on their own."
-this is as good a definition as I've read anywhere!
Thanks man, it's a topic I've been thinking about for a while, ever since that collab on Miss Daisy.
It's just the kind of mental baton-handoff I enjoy here. Well done!