Fabiana Fournier and her colleagues at IBM Research tackle one of the thorniest issues in agentic AI with their work on Agentic AI Process Observability.
In the authors' words: "We focus on the undesired form of variability that arises accidentally due to insufficiently rigorous specifications. These 'loose ends' in the design enable agents to perform unforeseen behaviors during execution."
One of the great promises of agentic AI is its autonomy, but this is also its greatest challenge from a design perspective. The non-deterministic behaviour of LLM-based agents means that giving an agent the same input twice can produce two startlingly different results. This creates a massive problem for building reliable systems. How can you debug, let alone trust, a system whose behaviour is fundamentally unpredictable?
This paper proposes a more rigorous approach, borrowing tools from the world of business process management. The core idea is to treat an agent's execution path as a process that can be mapped and analysed. By running an agent system hundreds of times and logging every single action, they create a comprehensive map of the observed behaviours. This allows developers to move from staring at a single, confusing execution trace to seeing a holistic picture of the agent's behavioural patterns, including the strange detours and outliers.
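The underlying idea is the same one process-mining tools are built on. A minimal sketch of it, using invented trace data and action names (the paper's actual tooling and log format are not shown here): group repeated runs into variants, count them, and build a directly-follows graph so rare detours stand out.

```python
from collections import Counter

def mine_variants(traces):
    """Group execution traces into variants and count how often each occurs."""
    variants = Counter(tuple(t) for t in traces)
    return variants.most_common()

def directly_follows(traces):
    """Build a directly-follows graph: edge (a, b) counts how often b follows a."""
    dfg = Counter()
    for trace in traces:
        for a, b in zip(trace, trace[1:]):
            dfg[(a, b)] += 1
    return dfg

# Hypothetical action logs from repeated runs of the same agent task.
traces = [
    ["plan", "search", "summarise", "answer"],
    ["plan", "search", "summarise", "answer"],
    ["plan", "search", "search", "summarise", "answer"],
    ["plan", "answer"],  # outlier: the agent skipped research entirely
]

for variant, count in mine_variants(traces):
    print(count, "x:", " > ".join(variant))
```

Run hundreds of times, the variant counts separate the dominant paths from the one-off detours that a single trace would never reveal.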
I think the most crucial insight here is the distinction between intended and unintended variability. Some variation is by design, a choice we want the agent to make. Much of it, however, is accidental, an emergent quirk of the model's black-box nature. The paper shows how their method can identify these "breaches of responsibility," such as a 'manager' agent that was never meant to use tools directly suddenly invoking them. This gives developers a concrete way to find and tighten the 'loose ends' in their agent specifications.
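Once the role specification is made explicit, such breaches are mechanically checkable. A toy sketch of that check, with an invented role-to-action spec and trace format (not the paper's actual schema): compare each logged action against what the agent's role permits.

```python
# Hypothetical specification of which actions each agent role may perform.
ALLOWED_ACTIONS = {
    "manager": {"delegate", "review"},    # coordination only, no direct tool use
    "worker": {"search", "summarise"},
}

def find_breaches(trace):
    """Return (step, agent, action) triples where an agent exceeded its role."""
    return [
        (i, agent, action)
        for i, (agent, action) in enumerate(trace)
        if action not in ALLOWED_ACTIONS.get(agent, set())
    ]

# One logged run: the manager directly invokes the search tool at step 2.
trace = [
    ("manager", "delegate"),
    ("worker", "search"),
    ("manager", "search"),   # breach of responsibility
    ("worker", "summarise"),
    ("manager", "review"),
]
print(find_breaches(trace))  # → [(2, 'manager', 'search')]
```

The point is less the check itself than the discipline it forces: writing down who may do what turns a vague prompt-level intention into a testable property of every execution log.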
This work points towards a necessary maturation in how we build with AI. It represents a shift away from pure prompt-craft and towards a more robust engineering discipline. If we are to build complex, multi-agent systems that we can safely deploy in the real world, we need observability tools like these. It’s less about seeking the one perfect prompt and more about architecting systems that are resilient to their own inherent messiness, which, I suspect, is a far more interesting and sustainable path forward.