Towards Reliable, Consistent, and Safe LLM-based Agents
My vision for building LLM agents, which, I promise you, is not the usual deal.
Why is it so hard to build conversational AI agents that can reliably solve complex problems, follow strict safety rules, and coordinate seamlessly with other agents? The answer, I believe, lies in three foundational challenges that any robust AI system must overcome: reasoning, governance, and orchestration. These are not just abstract technical terms—they represent real, practical hurdles that block the path to trustworthy, domain-specific AI assistants capable of handling multi-turn conversations in business-critical settings.
Reasoning is about more than just generating fluent text. It demands the ability to solve complex problems that often require composing multiple tools and performing iterative reasoning loops. An AI agent must decide which tools to call and in what order, interpret intermediate results, and know when to stop. Without this capability, AI responses remain shallow, inconsistent, or outright wrong when faced with intricate tasks that humans solve by chaining logical steps and external resources.
Governance tackles a different but equally important problem: ensuring the AI follows strict guidelines for safety, reliability, and ethical behavior throughout the conversation. This means preventing harmful outputs, resisting jailbreak attempts, and maintaining alignment with domain-specific rules. Crucially, governance also requires transparency and auditability so that AI decisions can be inspected and held accountable—an absolute must in regulated or sensitive environments.
Orchestration is the glue that holds everything together in complex AI ecosystems. It involves managing interactions among multiple specialized agents and tools in loosely coupled, distributed systems. Effective orchestration enables these diverse components to collaborate autonomously, handling multi-turn workflows and evolving conversational states without brittle dependencies or centralized control. Without it, scaling AI assistants beyond simple single-agent setups becomes impractical.
In this article, I will explore these three core problems in detail, laying bare the challenges for building useful conversational AI systems. Understanding them precisely is the first step toward building systems that are not only intelligent but also safe, reliable, and scalable. I will also propose my own vision for how to tackle these challenges today, working within and around all the current limitations of LLMs. Finally, I will introduce a Python framework designed to tackle reasoning, governance, and orchestration head-on, providing principled building blocks for the next generation of conversational AI.
The problems
Let's start by analyzing these three core issues in detail: reasoning, governance, and orchestration. I will outline why I think these are fundamental roadblocks to building truly effective LLM-based agents and systems. Later, I will explain what I think are ways to move forward with the tools we have today.
Reasoning
At first glance, reasoning in the context of AI agents might seem like simply knowing what each individual tool or API endpoint does. But the real challenge lies in the higher-level, tacit knowledge of how to combine these atomic tools effectively to solve complex domain-specific problems. Domain experts intuitively understand sequences of tool calls, conditional logic, and iterative steps required to reach a solution.
However, this kind of procedural expertise is rarely captured in low-level tool descriptions, which typically focus on inputs, outputs, and basic functionality. Expecting a language model to infer the correct combination and sequencing of tools solely from these isolated descriptions is unrealistic. The nuanced decision-making about when and how to chain tools together is a form of knowledge that is difficult to extract from atomic tool specs alone, and unlikely to emerge from them naturally.
Beyond the difficulty of tool composition, reasoning also suffers from the lack of structure in the outputs generated by LLMs. When all you have is natural language, it becomes extremely challenging to reliably perform multi-step inference, especially when reasoning requires loops, conditional branches, or repeated evaluation of intermediate results. Natural language is inherently ambiguous and unstructured, making it hard to enforce consistency or correctness across iterative reasoning steps. Without explicit control flow or a formal mechanism to track state, attempts to encode complex reasoning purely in natural language prompts tend to be brittle, error-prone, and difficult to validate or debug.
This means reasoning in conversational AI takes more than knowledge of individual tools or fluent text generation. It requires bridging the gap between atomic tool capabilities and high-level procedural knowledge, as well as moving beyond unstructured natural language to support systematic, verifiable inference processes involving loops, conditions, and multi-step tool use.
Governance
Governance is one of the toughest nuts to crack when building conversational AI for real-world applications. At its core, governance means making sure your AI agent follows strict company policies, legal regulations, and ethical guidelines—all the time. These aren’t just vague suggestions; they’re hard constraints designed to keep users safe, protect sensitive data, and ensure compliance with industry standards. The problem is that maintaining this level of discipline over multi-turn conversations is incredibly difficult. Agents tend to drift from prescribed behavior as conversations grow longer, and even minor lapses can have outsized consequences in business-critical or regulated environments.
Add to this the constant threat of jailbreaking and prompt injection attacks—clever ways users or adversaries try to trick the AI into ignoring its guardrails or producing harmful outputs. It’s like trying to keep a fortress secure when the attackers keep inventing new siege engines. Because conversational agents operate in open-ended, adversarial settings, they must be robust against a wide spectrum of malicious inputs. Preventing these exploits isn’t just about patching holes; it requires a proactive, systematic approach to detect, block, and mitigate attempts to subvert the AI’s intended behavior.
Finally, governance demands transparency and auditability. It’s not enough for an AI to “just behave.” We need to understand why it made a particular decision or gave a certain answer. This is essential not only for debugging unexpected behavior but also for ensuring fairness, building user trust, and meeting regulatory requirements. Imagine trying to explain a loan denial or a medical recommendation without a clear, traceable rationale—this is where transparency becomes a non-negotiable. Without it, deploying AI in sensitive domains is a leap of faith rather than a calculated risk.
Orchestration
If reasoning is the brain, and governance the rulebook, orchestration is the central nervous system of a conversational AI system. At first glance, it might seem like the most straightforward problem—after all, distributed systems aren’t new. But here’s the catch: when you’re dealing with LLM-powered agents that combine stochastic reasoning with deterministic tool use, even “simple” coordination becomes deceptively complex.
Let’s break it down. Imagine you’re building an email assistant ecosystem with three agents: one fetches messages, another summarizes content and extracts actionable items, and a third adds events to calendars. Each agent operates autonomously, but they need to collaborate seamlessly. The fetch agent might process thousands of emails hourly, the summarizer needs to handle variable-length content, and the calendar agent must interface with multiple external APIs. Now scale this to hundreds of specialized agents working asynchronously across time zones and user bases.
The real challenge isn’t just making these agents work—it’s making them work together in a system that’s both flexible and bulletproof. You want to add new agents dynamically (say, integrating a billing system, or reading a new source like Slack) without destabilizing existing workflows. You need horizontal scaling: if email volume spikes, spinning up more fetch agents should be as easy as launching new processes. Crucially, the system must remain distributed, avoiding single points of failure while maintaining coherent state across interactions.
This demands a delicate balance. Traditional microservice architectures solve similar problems, but LLM-powered agents introduce new wrinkles. First, conversational agents often carry context across interactions, unlike stateless API calls. Also, asynchrony is crucial—an agent might need to pause mid-task waiting for user input or external service responses. All of this with agents that range from simple deterministic tools (like a calendar API wrapper) to complex reasoning modules making probabilistic decisions.
The solutions
Let’s get to the heart of the matter: how do we actually solve the thorny problems of reasoning, governance, and orchestration in conversational AI? My thesis is that the answer lies in three key ideas that work together to bridge the gap between high-level intelligence and low-level execution, while ensuring control and scalability.
First, skills act as a powerful abstraction that captures domain expertise on how to combine multiple tools into coherent workflows. Second, structured reasoning moves us beyond free-form language outputs to well-defined, machine-readable formats. Finally, asynchronous message passing provides the backbone for orchestrating multiple agents working on different tasks.
Together, these concepts form a principled foundation for building conversational AI systems that are not just smart, but safe, reliable, and scalable. In the sections ahead, I’ll unpack each idea and show how they collectively address the core challenges we’ve laid out.
Skills
A skill is a semi-structured workflow that captures domain knowledge about how to solve a specific problem by combining multiple tools and prompts. It acts as the crucial bridge between the high-level reasoning capabilities of an LLM and the granular, atomic operations exposed by individual APIs or services. Skills encode not only what needs to be done but how to do it—capturing procedural knowledge that’s often tacit and difficult to extract from isolated tool descriptions alone.
Skills can be as flexible or as restrictive as necessary, blending natural language prompts with traditional programming constructs like conditionals and loops. On one end of the spectrum, you might have a skill that’s essentially a general-purpose chat interface: it provides the LLM with some basic instructions and lets it generate free-form responses with minimal procedural control. This kind of skill is highly flexible but offers limited guarantees about behavior or output structure.
On the other end, consider a skill designed for a complex enterprise application, such as an ERP system. This skill might invoke several API endpoints in sequence, carefully checking each response with targeted prompts to verify correctness. It uses conditionals to decide which tools to call next, loops to handle iterative processes like paginated data fetching, and error handling to manage unexpected results. Here, the skill acts like a finely tuned program, encoding domain-specific workflows that ensure reliability, adherence to business logic, and precise control over the agent’s actions.
By combining prompts with executable code, skills provide a powerful abstraction that closes the gap between the conceptual reasoning of LLMs and the concrete, deterministic operations required to solve real-world problems.
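To make this concrete, here is a rough sketch of what a restrictive skill for that ERP scenario could look like in plain Python. To be clear, this is not any framework's actual API: the `erp` and `llm` objects and every function name are hypothetical stand-ins I'm using purely for illustration.

```python
# A hypothetical skill: reconcile overdue invoices and draft payment reminders.
# `erp` and `llm` are illustrative stand-ins for an API wrapper and an LLM client.

def reconcile_overdue_invoices(erp, llm, customer_id: str) -> list[dict]:
    """Fetch all overdue invoices for a customer and draft payment reminders."""
    reminders: list[dict] = []
    page = 1
    while True:  # loop: handle paginated data fetching
        batch = erp.list_invoices(customer_id=customer_id, status="overdue", page=page)
        if not batch:
            break
        for invoice in batch:
            # conditional: only escalate invoices past the grace period
            if invoice["days_overdue"] > 30:
                # targeted prompt: sanity-check the record before acting on it
                check = llm.complete(
                    "Does this invoice record look internally consistent? "
                    f"Answer yes or no. Record: {invoice}"
                )
                if check.strip().lower().startswith("yes"):
                    draft = llm.complete(
                        f"Draft a polite payment reminder for invoice {invoice['id']} "
                        f"({invoice['amount']} USD, {invoice['days_overdue']} days overdue)."
                    )
                    reminders.append({"invoice_id": invoice["id"], "email": draft})
        page += 1
    return reminders
```

Notice how the skill decides the control flow (pagination, the 30-day threshold, the verification step), while the LLM is only asked narrow, well-scoped questions. That division of labor is exactly the point.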
Structured Reasoning
Structured reasoning is about asking the LLM to respond not with free-form text, but with well-defined, machine-readable objects—think JSON with explicit fields. Why? Because this approach lets us do three crucial things. First, we can verify the output of each reasoning step before moving on, catching errors early instead of letting them cascade. Second, we can write procedural code that ties reasoning steps together using conditionals and loops, turning the AI’s output into a controlled, repeatable workflow rather than a one-shot guess. Third, structured outputs make it crystal clear what reasoning paths the LLM is exploring, which is invaluable for transparency, debugging, and governance.
This concept fits perfectly with skills, which encode the procedural knowledge of how to use specific tools in a specific order. Structured reasoning provides the scaffolding that lets skills define complex workflows reliably.
For example, consider the ReAct reasoning paradigm, where the LLM’s response includes distinct concepts like observation, thought, and action. Instead of parsing ambiguous text, we can get a structured object where procedural code can check if the agent should loop, invoke a tool, or stop. This makes multi-step reasoning systematic and auditable, rather than brittle and opaque. In short, structured reasoning transforms the AI’s “chain of thought” from a fuzzy narrative into a precise, verifiable sequence of steps.
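As a rough illustration, here is how a ReAct-style step could be represented as a structured object and driven by ordinary procedural code instead of text parsing. The schema and the `llm_step` and `run_tool` helpers are assumptions I made up for this example, not a prescribed interface.

```python
from dataclasses import dataclass
from typing import Optional

@dataclass
class ReActStep:
    thought: str                         # the model's reasoning for this step
    action: Optional[str] = None         # name of the tool to invoke, if any
    action_input: Optional[dict] = None  # arguments for that tool
    final_answer: Optional[str] = None   # set when the agent decides to stop

def react_loop(llm_step, run_tool, question: str, max_steps: int = 10) -> str:
    """Drive a ReAct loop with plain procedural code instead of parsing free text."""
    observations: list[str] = []
    for _ in range(max_steps):
        # llm_step is assumed to return a validated ReActStep object
        step: ReActStep = llm_step(question, observations)
        if step.final_answer is not None:
            return step.final_answer              # explicit, checkable stop condition
        if step.action is not None:
            result = run_tool(step.action, step.action_input or {})
            observations.append(f"{step.action} -> {result}")
    raise RuntimeError("Agent did not converge within the step budget")
```

Because every step is a typed object, the loop can verify, log, and bound the agent's behavior, which is precisely what free-form text makes so hard.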
Asynchronous Collaboration
To solve orchestration, let’s consider the goal of having multiple specialized agents working together to solve complex problems. These agents will be autonomous programs—not just chatbots—that fetch data, summarize content, generate insights, and update databases independently. To enable this kind of collaboration, we need an asynchronous architecture based on message passing. Why? Because it decouples agents in time and function, allowing each to operate at its own pace without waiting on others, making the system more robust, scalable, and flexible.
First, asynchronous message passing ensures robustness. If one agent fails or needs to restart, it doesn’t bring down the entire system. Other agents continue processing messages independently, so the system remains available and resilient. Second, scalability becomes straightforward: to handle increased load, you simply add more agents of the same type. For example, if the volume of requests spikes, you spin up more fetch agents without disrupting the rest of the workflow. Third, agents remain loosely coupled by communicating through typed messages—clearly defined requests and responses that ensure everyone understands what’s being asked or reported. This loose coupling means you can add or remove agents dynamically without breaking the system.
Returning to our email assistant example, the fetch agent pulls messages from various inboxes and posts them to a shared message board. The summarizer agent reads those messages, extracts key points, and posts summaries as new messages. The calendar agent listens for actionable items and schedules events accordingly. If the workload increases, more fetchers or summarizers can be added seamlessly. Introducing a new data source, like Slack, is as simple as adding an agent that posts Slack messages in the same format. Want to add sentiment analysis? Just introduce another agent that processes existing messages without disrupting the rest.
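Here is a deliberately simplified sketch of that flow, using asyncio queues as a stand-in for the shared message board. The message types and agent bodies are placeholders of my own invention; a real deployment would use a proper message broker, but the decoupling pattern is the same.

```python
import asyncio
from dataclasses import dataclass

# Typed messages: every agent knows exactly what it consumes and produces.
@dataclass
class EmailFetched:
    email_id: str
    body: str

@dataclass
class SummaryReady:
    email_id: str
    summary: str
    action_items: list[str]

async def fetch_agent(board: asyncio.Queue) -> None:
    # Placeholder: a real agent would poll inboxes; here we post two fake emails.
    for i in range(2):
        await board.put(EmailFetched(email_id=f"e{i}", body=f"Meet Friday at 10 (#{i})"))

async def summarizer_agent(board: asyncio.Queue, out: asyncio.Queue) -> None:
    while True:
        msg = await board.get()
        if isinstance(msg, EmailFetched):
            # Placeholder for an LLM call that summarizes and extracts action items.
            await out.put(SummaryReady(msg.email_id, msg.body[:20], ["add to calendar"]))
        board.task_done()

async def calendar_agent(out: asyncio.Queue) -> None:
    while True:
        summary = await out.get()
        print(f"Scheduling from {summary.email_id}: {summary.action_items}")
        out.task_done()

async def main() -> None:
    board, out = asyncio.Queue(), asyncio.Queue()
    workers = [
        asyncio.create_task(summarizer_agent(board, out)),
        asyncio.create_task(calendar_agent(out)),
    ]
    await fetch_agent(board)
    await board.join()   # wait until every fetched email has been summarized
    await out.join()     # ...and every summary has been scheduled
    for w in workers:
        w.cancel()

asyncio.run(main())
```

Scaling out is then just a matter of launching more summarizer or fetcher tasks against the same queues, and adding a new capability means adding a new consumer of the same typed messages.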
This asynchronous, message-driven architecture turns a complex multi-agent system into a flexible, scalable, and resilient network. It enables conversational AI ecosystems that can grow organically, adapt to changing demands, and maintain smooth operation even when individual components fail or change—exactly what real-world deployments require.
Putting it all together
Now let's put this whole vision together. My proposal is to tackle any sufficiently complex problem—of the type we've been talking about—with a tapestry of specialized agents, each built for a concrete task. Some of these agents may be conversational, handling user queries, managing dialogue, and providing explanations. Others are more traditional, working quietly in the background to fetch data, update records, or trigger workflows. Still others leverage LLMs not for chat, but for specific NLP tasks: summarizing documents, extracting structured information, or generating reports. What unites them all is a common foundation built on skills and structured reasoning.
Each agent, whether it’s orchestrating a conversation or crunching through a batch of emails, is powered by domain-specific skills that encode exactly how to combine tools and prompts to solve the task at hand. Structured reasoning ensures every step is explicit, verifiable, and traceable—so you always know not just what the system did, but why it did it. This transparency is invaluable for debugging, auditing, and demonstrating compliance, while the skills themselves serve as living documentation of domain expertise, written and maintained by experts.
These agents don’t operate in isolation. They’re coordinated via asynchronous message passing, exchanging typed requests and responses through shared message boards. This architecture allows each agent to function independently, scaling up or down as needed, and makes the entire system robust to failures—if one agent crashes or needs to be updated, the rest keep humming along. It’s a flexible, loosely coupled ecosystem where adding new capabilities is as simple as introducing a new agent that speaks the same message language, and where complex domain problems are solved through the emergent collaboration of many specialized parts.
The result is a system that's not just intelligent, but, hopefully, also safer, more reliable, and more adaptable than existing solutions. Traceability and transparency are baked in at every level, thanks to structured reasoning and explicit skill design, and not duct-taped on as an afterthought to ensure compliance. Strict governance is enforced through domain-specific skills authored by experts, ensuring that every action is aligned with business rules and regulatory requirements, and can be updated at any moment with ease.
If this vision resonates with you, consider ARGO—a Python framework for LLM agents built from the ground up around these key principles of agent-based reasoning, governance, and orchestration. ARGO is intentionally unopinionated: it has zero dependencies on any LLM framework, doesn’t tie you to specific backend or communication technologies, and lets you implement any agentic or reasoning paradigm you can imagine, from simple CoT and ReAct to all forms of dynamic planning and problem-solving. I like to think of it as the FastAPI of LLM agents: simple, modular, and designed for real-world flexibility.
Now, don't get me wrong: all of this is still in active development, so I'm not claiming the problems of reasoning, governance, and orchestration are solved. Furthermore, even if this vision crystallizes in its best possible form, there are still potentially insurmountable limitations in LLMs regarding explainability, reasoning, and control that may require some fundamental breakthrough. But, for all their current and inherent limitations, I still believe LLMs are one of the most powerful technologies we have today for building truly transformative computational systems.
If you’re interested in building the next generation of reliable, transparent, and governable AI systems, I humbly think the paradigm explained in this article is a reasonable path forward, and, by extension, ARGO may be a project worth watching—and perhaps even contributing to :)
If there is enough interest, I can write an article on concrete implementations of real-world use cases leveraging these principles. Just let me know in the comments.