AI-Driven Storytelling with Multi-Agent LLMs - Part I
Results of Ongoing Research in the AI Group at the University of Havana
If you’ve ever tried to coax a language model into writing a long, coherent story using prompt engineering alone, you’ve probably hit a wall: characters lose their personalities, plots meander or stall, and the whole thing feels more like a sequence of clever paragraphs than a living narrative.
This is precisely one of the major research lines in my group at the University of Havana. Right now, three undergraduate students are wrapping up theses on AI-driven story generation, each from a different perspective and with a different strategy.
The common idea underlying this whole research line is that combining multi-agent systems and traditional symbolic AI with LLMs in well-designed workflows can overcome many of the limitations of pure LLM-based story generation. Crucially, we aim to explore ways of improving story generation without any form of fine-tuning or retraining—that is, no need to adjust model weights.
In this and follow-up articles, I'll partner with my students to bring you a few high-level summaries of what they've done and found.
This first article (and the corresponding thesis) is about story emergence. We set loose a few characters in an AI-driven world and let them interact, to see what kinds of stories come out. But there is a catch: we want some level of control, but not too much.
The motivating question, then, was this: how can we introduce a mechanism for indirect control of a story while still allowing characters to evolve more or less naturally and plot points to emerge?
Let's see what Franco came up with.
Why this matters
Let’s get something out of the way first: the purpose of this research is not to “solve” storytelling, nor to replace writers, artists, or the creative process itself. Storytelling is a deeply human craft, and no one here is under the illusion that a handful of LLM-driven agents will compose the next literary masterpiece.
So, why do we invest so much effort into building systems that generate stories?
The answer is both practical and strategic. Storytelling, especially in the form of simulated worlds with autonomous agents, is a uniquely demanding environment for testing the capabilities—and limits—of large language models. It’s a domain where long-term coherence, character consistency, planning, and subtle control all collide. In other words, it’s the perfect laboratory for exploring how to govern and steer the behavior of powerful generative models without sacrificing their creativity or flexibility.
This is not unlike the role that games like chess and Go played in the development of AI search and planning algorithms. Those domains were never the end goal; rather, they were controlled, well-understood environments where researchers could rigorously test new ideas. The techniques honed in those settings—like Monte Carlo Tree Search—eventually found their way into applications as far-reaching as protein folding and robotics.
In the same spirit, we use storytelling as a proving ground for strategies of indirect control, agent autonomy, and emergent behavior in LLMs. Here, we can measure and observe how different architectures balance autonomy and direction, how memory and planning affect long-term coherence, and how subtle interventions shape complex outcomes. The lessons we learn in this bounded, creative sandbox are directly relevant to much broader and higher-stakes domains: from AI assistants that must follow nuanced instructions, to multi-agent systems in logistics, education, or even critical infrastructure.
Ultimately, the goal is to develop robust, generalizable techniques for controllable and safe AI. By pushing the limits in a domain as rich and challenging as narrative simulation, we’re laying the groundwork for systems that can be trusted to act autonomously, adaptively, and in alignment with human intentions—no matter the context.
And if we get a few fun stories along the way, all the better.
Why pure LLM-driven storytelling falls short
Large Language Models (LLMs) have changed the game for natural language generation. They’re great at producing short, contextually rich responses and can even simulate dialogue or simple stories with impressive flair. But as soon as you ask them for something more ambitious—a novel-length mystery, a world populated by autonomous characters, or a story that evolves over dozens of turns—the cracks start to show.
The main issues? We identified several well-known limitations of LLM-driven storytelling:
Long-term coherence: LLMs lose track of what happened a few thousand tokens ago, even when their context window is nominally much larger.
Character consistency: Personalities drift, motivations vanish, and “out-of-character” moments abound.
Proactivity: Agents react, but rarely plan or pursue long-term goals in a believable way. There is no planning ahead of time; it's all reactive.
Narrative control vs. autonomy: Too much authorial intervention and characters turn into puppets; too little and the story meanders or stalls. It's hard to craft just the right prompt.
These are not just academic complaints. If we want LLMs to power the next generation of interactive fiction, virtual worlds, or even collaborative writing tools, we need architectures that can balance control, coherence, and genuine emergence.
Our idea? Agents, lots of LLM agents
Our approach borrows a page from both agent-based modeling and narrative AI research. Instead of a single omniscient narrator, we simulate a society of autonomous agents—each powered by its own LLM instance, each with its own identity, memory, and goals—interacting in a shared, dynamic environment.
But here’s the twist: rather than scripting the story or directly controlling the agents, we introduce a “Director” agent. The Director never tells the agents what to do. Instead, it manipulates the environment—e.g., changing the weather, introducing objects, staging chance events—and lets the agents interpret and react according to their personalities and memories.
Think of it as setting the stage and dropping hints, not pulling strings—more like a piece of emergent, postmodern theater than a traditional movie script. The actors operate under some constraints, but they are free to pursue whatever goals they desire.
Here are the key architectural components that make the whole system come together (a minimal code sketch follows the list):
LLM-driven agents: Each with its own memory (short-term and long-term reflection), planning, and perception modules.
World state: A mutable environment that records locations, objects, events, and global properties.
Action resolver: Ensures agent actions are valid and consistent with the world.
Event dispatcher: Manages what each agent perceives, maintaining a plausible flow of information.
Director: Observes the world and subtly nudges the narrative by changing the environment, not the agents.
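To make these components concrete, here is a minimal, illustrative sketch of how they might fit together in a single simulation turn. Everything below is a simplification written for this article: the class names, prompts, and the call_llm wrapper are placeholders, not the actual thesis code (the Director itself is sketched a bit further down).

```python
from dataclasses import dataclass, field

def call_llm(prompt: str) -> str:
    """Placeholder for whatever LLM API the system uses (e.g., Gemini)."""
    raise NotImplementedError

@dataclass
class WorldState:
    """Mutable environment: locations, objects, events, global properties."""
    locations: dict[str, list[str]] = field(default_factory=dict)
    events: list[str] = field(default_factory=list)
    weather: str = "clear"

@dataclass
class Agent:
    """An LLM-driven character with its own identity and memory."""
    name: str
    persona: str
    location: str
    memory: list[str] = field(default_factory=list)

    def act(self, observations: list[str]) -> str:
        # Decide the next action from persona, recent memory, and perception.
        prompt = (
            f"You are {self.name}. Persona: {self.persona}\n"
            f"Recent memories: {self.memory[-10:]}\n"
            f"You observe: {observations}\n"
            "Describe your next action in one sentence."
        )
        return call_llm(prompt)

class EventDispatcher:
    """Controls what each agent perceives, keeping information flow plausible."""

    def perceive(self, agent: Agent, world: WorldState) -> list[str]:
        # Only surface what the agent could plausibly notice at its location.
        local_objects = world.locations.get(agent.location, [])
        return [f"Weather: {world.weather}", f"Objects here: {local_objects}"]

class ActionResolver:
    """Validates attempted actions against the current world state."""

    def resolve(self, agent: Agent, action: str, world: WorldState) -> str:
        prompt = (
            f"World: {world}\nAgent {agent.name} attempts: {action}\n"
            "State the actual outcome, consistent with the world. If the "
            "action is impossible, describe the failed attempt instead."
        )
        return call_llm(prompt)

def run_turn(agents, world, director, resolver, dispatcher):
    """One simulation turn: the Director may nudge, then each agent acts."""
    for agent in agents:
        director.maybe_intervene(world)          # environment-only nudges
        obs = dispatcher.perceive(agent, world)  # filtered perception
        outcome = resolver.resolve(agent, agent.act(obs), world)
        world.events.append(outcome)
        agent.memory.append(outcome)
```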
To test all these ideas, we built a prototype implementation of the proposed architecture in Python, leveraging Google’s Gemini 2.0 Flash Lite for all LLM tasks. Each component—agent, action resolver, director—gets its own LLM instance and carefully tuned generation parameters. Memory is handled outside the LLM, with dual-level storage: agents remember both recent events and distilled reflections, which are periodically generated and used to inform future decisions.
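As an illustration of that dual-level storage, one possible organization keeps a raw short-term log and periodically distills it into long-term reflections with an extra LLM call. The threshold, prompt wording, and method names below are placeholder choices of ours, not the thesis implementation; call_llm is the same hypothetical wrapper as in the sketch above.

```python
class DualMemory:
    """Short-term event log plus periodically distilled long-term reflections."""

    def __init__(self, reflect_every: int = 20):
        self.recent: list[str] = []       # raw observations and action outcomes
        self.reflections: list[str] = []  # distilled, durable insights
        self.reflect_every = reflect_every

    def remember(self, event: str) -> None:
        self.recent.append(event)
        if len(self.recent) % self.reflect_every == 0:
            self._reflect()

    def _reflect(self) -> None:
        # Compress the latest events into a high-level reflection that keeps
        # informing decisions long after the raw events scroll out of context.
        prompt = (
            "Summarize the key insights, relationships, and goals implied by "
            f"these recent events:\n{self.recent[-self.reflect_every:]}"
        )
        self.reflections.append(call_llm(prompt))

    def context(self, k: int = 10) -> str:
        # What gets injected into the agent's prompt each turn: every
        # reflection, but only the last k raw events.
        return "\n".join(self.reflections + self.recent[-k:])
```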
The Director’s interventions are strictly limited: it can, for example, change the weather or add objects to locations here and there, but it can never force an agent’s hand.
Timing is granular—the Director considers whether to intervene before every agent’s turn, allowing for context-sensitive, minimally intrusive direction.
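Putting those two constraints together, the Director's decision point might look roughly like the following. The intervention vocabulary mirrors the examples above (weather, objects, or nothing at all); the JSON protocol and prompt wording are our own illustrative choices, not the thesis code.

```python
import json

class Director:
    """Nudges the narrative through the environment, never through the agents."""

    ALLOWED = {"change_weather", "add_object", "do_nothing"}

    def maybe_intervene(self, world: WorldState) -> None:
        # Invoked before every agent's turn; most of the time the right
        # answer is to do nothing.
        prompt = (
            f"You are a story director. Recent events: {world.events[-20:]}\n"
            "Intervene only if the story is stalling. Reply as JSON: "
            '{"action": "change_weather" | "add_object" | "do_nothing", '
            '"argument": "...", "location": "..."}'
        )
        decision = json.loads(call_llm(prompt))
        action = decision.get("action", "do_nothing")
        if action not in self.ALLOWED or action == "do_nothing":
            return
        if action == "change_weather":
            world.weather = decision["argument"]
        elif action == "add_object":
            world.locations.setdefault(decision["location"], []).append(
                decision["argument"]
            )
```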
How does this stack up?
To see whether this architecture actually improves narrative generation, we ran head-to-head comparisons between stories generated by the multi-agent system and those produced by a monolithic LLM given the same scenario. We evaluated on several axes (a sketch of how such a comparison can be automated follows the list):
Coherence and plot progression
Character consistency
Originality and emergent plot richness
Prose quality
Narrative pacing and suspense
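One lightweight way to run this kind of head-to-head comparison is an LLM-as-judge rubric that picks a winner axis by axis. This is an illustration of the general approach, not necessarily the exact protocol used in the thesis; call_llm is the same hypothetical wrapper as before.

```python
AXES = [
    "coherence and plot progression",
    "character consistency",
    "originality and emergent plot richness",
    "prose quality",
    "narrative pacing and suspense",
]

def judge(story_a: str, story_b: str) -> dict[str, str]:
    """Ask an LLM judge to pick a winner on each axis for a pair of stories."""
    verdicts = {}
    for axis in AXES:
        prompt = (
            f"Compare the two stories below on: {axis}.\n\n"
            f"Story A:\n{story_a}\n\nStory B:\n{story_b}\n\n"
            "Answer with exactly one character: 'A' or 'B'."
        )
        verdicts[axis] = call_llm(prompt).strip()
    return verdicts
```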
What did we find?
The multi-agent system produced far more believable, consistent characters and surprising plot developments. Because agents had their own memories and goals, their actions made sense and sometimes surprised even us. However, the monolithic LLM still wins on sheer polish—its stories are more linear, its prose more refined, and its pacing more controlled.
There’s an inherent tension here. You can have tight, well-structured prose (monolithic LLM) or you can have emergent, believable characters and plots (multi-agent simulation), but getting both at once remains a challenge.
Some of the most compelling moments came from the Director’s indirect interventions:
Dropping a key object in the right room led an agent to discover it and set off a chain of events that advanced the plot.
Changing the weather or introducing an environmental clue shifted the focus of a conversation or escalated a conflict, without ever breaking the agents’ autonomy.
Sometimes, interventions were ignored or interpreted in unexpected ways—an important reminder that true emergence is unpredictable.
Limitations and Next Steps
No system is perfect. The current prototype has its share of constraints:
Agents can only interact with objects in their current location.
There’s no persistent, explicit inventory management, so long-term planning with items is limited to what the LLM can remember in context.
Agent-to-agent interactions are mostly conversational; direct state changes (like “killing” another agent) aren’t supported yet (we'd need to improve the action resolver to understand these intentions).
The Director’s toolkit is intentionally minimal, which limits the subtlety and richness of its interventions.
Future work should expand environmental manipulation, improve memory persistence, and explore hybrid approaches that combine the strengths of both architectures.
Final Thoughts
This work is less about “solving” narrative generation and more about charting a new direction. By treating characters as autonomous agents and narrative direction as an emergent property of environmental context, we get stories that feel less like scripts and more like living worlds. The trade-off is real, but so is the potential: with the right architecture, LLMs can do more than just write—they can simulate societies, surprise their creators, and perhaps even teach us something about how stories really emerge.
If you’re interested in the technical details or want to see some sample stories, check out the full thesis, and read a few generated stories.
And if you have ideas for making these systems even richer, leave us a comment.