It's Tokens all the Way Down
How language models understand image, audio, and video.
Part of the run-up to the second edition of Mostly Harmless AI — 50% off during early access — where this is the spine of a new chapter on generative and multimodal AI. You can also read the whole book online for free. More at the end.
One morning, not so long ago, perhaps you asked Claude (or Gemini, or ChatGPT) to do something for you, and decided it was easier to just show it a picture than explain the whole thing. Perhaps it was “how do I cook this thing?” or “what building is that?” or “do this homework for me, please, please, my life depends on it”. Then you uploaded the picture, and back came a textual response.
Not happy with what the bot understood, you decided a thorough explanation was owed. But, alas, since all we have for typing is a couple of fat fingers, you decided it was best to explain it in your own voice. And, uhms and ehms notwithstanding, you again got a full response back, this time with a voice-over.
Ten years ago, this simple dance of back-and-forth multimodal information would have required four separate research fields, each with its own conferences, its own vocabulary, and its own priesthood. They have quietly become one single thing. It’s all tokens all the way down. Language has subsumed all modalities. This is how.
The recipe never cared what it was eating
Strip “generative AI” down to the one idea doing all the work and you get a single sentence: look at a big pile of examples, learn the distribution that produced them, then draw new samples from it. That is the whole trick. It is what a language model does, and it is the only thing a language model does. It is also what an image model does, and an audio model, and a video model.
The recipe is indifferent to what the examples are. Text is a one-dimensional run of symbols. An image is a two-dimensional field of colour. Audio is a pressure wave sampled tens of thousands of times a second. Video is all of that, plus time, which is why it is the hardest. Four different shapes of data, one identical question asked of each: given what I have seen so far, what plausibly comes next? The machinery that answers that question does not need to know whether “next” means a word, a patch of pixels, or a slice of waveform. It only needs the data turned into a sequence of countable things.
Tokens.
So the thing we have been calling a language model was never really about language. Or, put better, it was never about written language. It turns out, language is something far more powerful.
Ask a formal-language theorist, and they’ll tell you that any set of sequences over an alphabet of distinct symbols (tokens) counts as a language. It doesn’t matter what your symbols are (letters, words, patches of images, numbers in a math formula, whatever); language is just the structure around them, what makes some sequences valid and others nonsense.
This is the key idea. All else is (incredibly good) engineering.
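To make the "recipe" concrete, here is a minimal sketch of the smallest possible version of it: count which token follows which, then sample from those counts. This is a toy bigram model, not how any production system works, and the names `fit_bigram`, `sample`, and the example data are made up for illustration. The point is only that the same code runs unchanged on words and on integer ids standing in for image patches.

```python
import random
from collections import Counter, defaultdict

def fit_bigram(sequences):
    """Learn the distribution: count how often each token follows each other token."""
    counts = defaultdict(Counter)
    for seq in sequences:
        for prev, nxt in zip(seq, seq[1:]):
            counts[prev][nxt] += 1
    return counts

def sample(counts, start, length, rng):
    """Draw a new sequence by repeatedly sampling a plausible next token."""
    out = [start]
    for _ in range(length - 1):
        followers = counts.get(out[-1])
        if not followers:  # this token was never followed by anything in the data
            break
        tokens = list(followers)
        weights = [followers[t] for t in tokens]
        out.append(rng.choices(tokens, weights=weights, k=1)[0])
    return out

# Same code, two "alphabets": words, and integers standing in for image-patch ids.
text_data = [["the", "dog", "barks"], ["the", "dog", "sleeps"], ["the", "cat", "sleeps"]]
patch_data = [[17, 4, 4, 9], [17, 4, 9, 9]]

rng = random.Random(0)
text_model = fit_bigram(text_data)
patch_model = fit_bigram(patch_data)
text_sample = sample(text_model, "the", 3, rng)
patch_sample = sample(patch_model, 17, 4, rng)
```

Nothing in `fit_bigram` or `sample` checks what a token is; it only needs tokens to be hashable and countable. A transformer replaces the count table with a far more capable predictor, but the interface is the same.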
A decade of building the same machine, separately
It did not look that way while it was happening. For about a decade every modality got its own bespoke contraption, and each one looked like its own discipline.
Image people had generative adversarial networks: a forger and a detective locked in a training duel, the forger getting better at faking until the detective could no longer tell. The beautiful idea buried in there — and the one that survived the technique itself — was the latent space: a compressed interior map of “all possible faces,” where walking in a straight line morphs one plausible face smoothly into another. GANs were temperamental, prone to collapsing into a single good fake and refusing to leave, and by the early 2020s they had lost the lead. The latent-space intuition outlived them and runs underneath everything that came after.
Then diffusion took over image generation with a trick that sounds like it shouldn’t work. Take a real photo, add a little static, add a little more, keep going until it is pure snow. Now train a network to undo one step of that. To make a new image, start from snow and run the undo, over and over, until something coherent surfaces. It is sculpture by removing noise instead of removing marble, and it is what powers essentially every image generator you have used.
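The "add static until it is snow" half of that trick fits in a few lines. The sketch below follows one common formulation (a linear noise schedule and a closed-form jump to any noise level, in the style of denoising diffusion models); the schedule values, the function names, and the four-number "image" are assumptions chosen for illustration, and the network that learns to undo the noise is left out entirely.

```python
import math
import random

def make_alpha_bars(num_steps, beta_start=1e-4, beta_end=0.02):
    """Fraction of the original signal's variance that survives after each step."""
    alpha_bars, running = [], 1.0
    for t in range(num_steps):
        beta = beta_start + (beta_end - beta_start) * t / (num_steps - 1)
        running *= 1.0 - beta
        alpha_bars.append(running)
    return alpha_bars

def add_noise(x0, alpha_bar, rng):
    """Jump straight to a noisy version of x0: sqrt(a) * x0 + sqrt(1 - a) * noise."""
    noise = [rng.gauss(0.0, 1.0) for _ in x0]
    xt = [math.sqrt(alpha_bar) * x + math.sqrt(1.0 - alpha_bar) * e
          for x, e in zip(x0, noise)]
    return xt, noise

rng = random.Random(0)
alpha_bars = make_alpha_bars(1000)
pixels = [0.9, -0.3, 0.5, 0.1]  # a toy "image" of four pixel values

slightly_noisy, _ = add_noise(pixels, alpha_bars[10], rng)
pure_snow, target_noise = add_noise(pixels, alpha_bars[-1], rng)
# Training would ask a network to recover target_noise given the noisy input and the step index.
```

By the last step almost none of the original signal is left, which is why generation can start from pure random noise and run the learned undo step backwards.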
Audio had its own separate lineage: speech-to-text built one way, text-to-speech another, music a third. Text had the large language models, off in their own enormous-budget corner of the field. Four communities, four sets of architectures, four sets of war stories. If you had asked, in 2021, whether the image people and the language people were building the same machine, both sides would have laughed.
CLIP quietly knocks out the wall
The crack in the wall came from a 2021 model whose job sounds almost too modest to matter: teach one system that the word dog and a photograph of a dog are talking about the same thing.
The way you do that is to train a text encoder and an image encoder together, on hundreds of millions of caption-and-picture pairs, with one instruction: put a picture and its true caption close together in a shared space, and shove mismatched pairs apart. What you get at the end is a single space where “a photo of a golden retriever” and an actual photo of a golden retriever land as neighbours. Text and pixels, in the same room, with the same coordinates.
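That one instruction can be written down as a loss function. What follows is a rough sketch of a symmetric contrastive loss over a small batch, assuming the encoders have already produced the vectors; the two-dimensional embeddings, the `temperature` value, and the function names are illustrative assumptions, not the published training setup.

```python
import math

def normalize(v):
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]

def cross_entropy_row(logits, target_index):
    """Negative log-softmax of the target entry, computed stably."""
    m = max(logits)
    log_sum = m + math.log(sum(math.exp(l - m) for l in logits))
    return log_sum - logits[target_index]

def contrastive_loss(image_vecs, text_vecs, temperature=0.07):
    """Low when image i is closest to caption i, high when pairs are mismatched."""
    images = [normalize(v) for v in image_vecs]
    texts = [normalize(v) for v in text_vecs]
    # sims[i][j] = cosine similarity of image i and caption j, sharpened by the temperature
    sims = [[sum(a * b for a, b in zip(img, txt)) / temperature for txt in texts]
            for img in images]
    n = len(images)
    image_to_text = sum(cross_entropy_row(sims[i], i) for i in range(n)) / n
    text_to_image = sum(cross_entropy_row([sims[i][j] for i in range(n)], j)
                        for j in range(n)) / n
    return (image_to_text + text_to_image) / 2

# Toy 2-d embeddings; in a real system these would come from the two encoders.
image_vecs = [[1.0, 0.1], [0.1, 1.0]]
matched_captions = [[0.9, 0.0], [0.0, 0.8]]
swapped_captions = [[0.0, 0.8], [0.9, 0.0]]

loss_matched = contrastive_loss(image_vecs, matched_captions)
loss_swapped = contrastive_loss(image_vecs, swapped_captions)
```

Here `loss_matched` comes out far smaller than `loss_swapped`. Training nudges both encoders to lower this number across the batch, which is what pulls true pairs together and pushes mismatched pairs apart.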
That sounds like a party trick for image search. It was the hinge the whole field turned on. Once text and images live in one space, text can steer image generation — point the diffusion process at the region of the space that means “golden retriever in a spacesuit,” and let it denoise toward there. Every text-to-image system you have used is, under the paint, that move. And the deeper implication was harder to ignore than the application: if you can put two modalities in one space, the wall between them was never structural. It was just a wall nobody had walked through yet.
Tokens all the way down
Here is where it lands. By the mid-2020s the bespoke machines stopped being separate machines.
The move is almost embarrassingly direct. Tokenise everything. Text already broke into tokens. Cut an image into a grid of patches and treat each patch as a token. Run audio through a neural codec that emits discrete chunks, and those are tokens too. Now you do not have a text stream and an image stream and an audio stream. You have one stream of tokens that happen to have come from different alphabets, and you train a single model on the only objective that was ever in play: predict the next token, whatever kind it is.
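As a toy illustration of "one stream, different alphabets", the sketch below turns a string, a tiny grayscale image, and a few audio samples into integer ids and concatenates them. Every tokenizer here is a deliberately crude stand-in: real systems use learned subword vocabularies, learned patch encoders or codebooks, and neural audio codecs. The vocabulary sizes, offsets, and function names are assumptions made up for the example, and the image size is assumed to be divisible by the patch size.

```python
def text_tokens(text):
    """Toy text tokenizer: one token per UTF-8 byte (ids 0..255)."""
    return list(text.encode("utf-8"))

def image_tokens(image, patch, levels=16):
    """Cut a grayscale image into patch x patch squares; each patch's mean brightness picks a code."""
    tokens = []
    for top in range(0, len(image), patch):
        for left in range(0, len(image[0]), patch):
            values = [image[r][c]
                      for r in range(top, top + patch)
                      for c in range(left, left + patch)]
            mean = sum(values) / len(values)
            tokens.append(min(int(mean * levels), levels - 1))
    return tokens

def audio_tokens(samples, levels=32):
    """Toy stand-in for a neural codec: quantise each sample in [-1, 1] into one of `levels` bins."""
    return [min(int((s + 1.0) / 2.0 * levels), levels - 1) for s in samples]

# Give each modality its own id range so ids never collide in the shared vocabulary.
TEXT_VOCAB, IMAGE_VOCAB, AUDIO_VOCAB = 256, 16, 32
IMAGE_OFFSET = TEXT_VOCAB
AUDIO_OFFSET = TEXT_VOCAB + IMAGE_VOCAB

image = [[0.0, 0.0, 1.0, 1.0],
         [0.0, 0.0, 1.0, 1.0],
         [0.5, 0.5, 0.2, 0.2],
         [0.5, 0.5, 0.2, 0.2]]
audio = [0.0, 0.5, -0.5, 1.0]

stream = (
    text_tokens("hi")
    + [IMAGE_OFFSET + t for t in image_tokens(image, patch=2)]
    + [AUDIO_OFFSET + t for t in audio_tokens(audio)]
)
# stream == [104, 105, 256, 271, 264, 259, 288, 296, 280, 303]
```

Once everything is a list of integers drawn from one shared vocabulary, a next-token model has no way to tell, and no need to know, which ids started life as letters, pixels, or sound.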
A model trained that way reads and writes everything, because to it there is no “everything” — there is just the sequence and the next position in it. You have used these. The one that holds a spoken conversation with sub-second latency, looks at the photo you paste in, and writes you a paragraph back is not a language model bolted to an image model bolted to a speech model. It is one model that was never told these were different problems.
Which is why the question that organised the field for a decade — is this a language model or an image model? — has quietly stopped having an answer. It is the same machine. The only thing that ever changed between text and pixels and sound was the alphabet, and the transformer emitting the next token has never cared which alphabet it is spelling in. It is tokens all the way down. “Language modelling” was a local name for something with no allegiance to language at all: modelling sequences of anything we can count.
The honest part
It would be easy to end on the astonishment, and the astonishment is real. One model, every modality, falling out of one stubbornly simple objective applied to a wider and wider definition of “token” — that is one of the genuinely beautiful results of the decade, and the kind of unification that does not come along often.
But unification is not the same as understanding, and I am not going to let the elegance smuggle that past you. A system that can place “dog” next to a dog in its latent space has learned the statistics of how dogs are described and depicted.
Whether it has learned what a dog is is a different question, and the convergence story does not answer it. It just makes the question apply to every modality at once instead of only to text. The machine got more general. It did not necessarily get more grounded. Both of those are true at once, and the interesting work of the next few years lives in the gap between them.
Until next time, stay curious.
This is the core argument of a new chapter in the second edition of Mostly Harmless AI — the full chapter walks GANs, diffusion, CLIP, audio, and native multimodality with the scenes and citations this post had to cut, and it is 50% off during early access. The whole book is also free to read online. If you want the rest of the argument — how these systems are trained, where they break, and what to actually do about it — that is what the book is for.




“Whether it has learned what a dog is is a different question, and the convergence story does not answer it. It just makes the question apply to every modality at once instead of only to text.”
really well put. interesting read the entire time. thanks!
What this brought to my mind is the fact the neocortex has pretty much the same structure everywhere, so, doing about the same thing everywhere. Hmmmm…