From Chaos to Choreography: Multi-Agent Orchestration Patterns That Actually Work

Multi-agent AI systems are fundamentally distributed systems, and treating them as simple feature additions leads to exponential complexity and common pitfalls like race conditions.
Successful scaling requires applying established distributed system patterns for coordination, state management, and failure recovery to build production-grade architectures.
Prioritizing "systems thinking" over just "AI thinking" by implementing robust patterns like immutable state, data contracts, circuit breakers, and compensation is crucial for reliable, scalable multi-agent solutions.

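Multi-agent coordination complexity grows much faster than the agent count: five agents have at least 10 potential pairwise connections, and by the speaker's estimate a five-agent system can be 25 times more complex than a single agent because of the added coordination and failure points.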
Choose between Choreography (event-driven, decentralized, high autonomy, requires strong observability for debugging) and Orchestration (centralized coordinator, controlled, easy debugging, good for complex dependencies and rollbacks).
Avoid shared mutable state to prevent race conditions and lost updates; instead, use immutable state snapshots with versioning where agents append new, sealed versions of data.
Implement explicit data contracts (input/output schemas) between agents to enforce data quality and catch integration errors at boundaries, not downstream.
Employ the Circuit Breaker pattern for agent calls: it opens on repeated failures, preventing cascading failures by fast-failing requests to unhealthy agents and allowing graceful degradation.
Utilize the Compensation (Saga) pattern in orchestrated workflows, where each agent has execute and compensate methods, enabling transactional rollbacks for partial failures.
A robust production architecture positions the orchestrator as the single source of truth, manages state versions, and integrates comprehensive observability for control and debugging.

Distributed System: A system whose components are located on different networked computers, which communicate and coordinate their actions by passing messages.
Race Condition: A situation in which the outcome depends on the sequence or timing of uncontrollable events, often leading to incorrect or unpredictable results.
Multi-agent AI System: A system composed of multiple autonomous AI agents that interact with each other and their environment to achieve a common goal.
Choreography: A distributed coordination pattern where agents coordinate through events, operating autonomously without a central manager.
Orchestration: A distributed coordination pattern where a central component (the orchestrator) manages and directs the workflow of multiple agents.
Immutable State: Data or an object whose state cannot be modified after it is created, ensuring consistency and predictability.
Append-only Log: A data structure where new entries are always added to the end and existing entries are never modified or deleted; useful for auditing and versioning.
Data Contract: An explicit agreement or schema defining the expected input and output data structures and types between components or agents.
Circuit Breaker Pattern: A design pattern used in distributed systems to prevent a failing service from causing cascading failures by stopping requests to it when it is deemed unhealthy.
Compensation Pattern (Saga): A pattern for managing transactions across multiple distributed services, allowing previously completed actions to be rolled back if a subsequent step fails.

Hi everyone, I'm Sandy. I've spent 18 years building data systems, a major part of it focused on building and scaling distributed data systems in the cloud. I've done it for multi-tenant systems at software and SaaS companies, and then for scaling data and AI platforms in regulated industries like financial services and healthcare. I learned a great deal about production-grade distributed systems while working at AWS and now at Databricks. For the last two years, I've been deploying multi-agent AI systems in production, and I have watched brilliant engineers make the same mistakes over and over. They think adding more agents is just like adding more features. It's not. It's building a distributed system, and today I'm going to show you the patterns that actually work when you make the transition. These are lessons that I have learned working in the trenches, and today I'm here to share them with you.
Here's what we are covering today. First, the problem. I'll share a very basic real-world production story about a race condition, and why complexity explodes when you go from one agent to five agents. I'll talk about the patterns: choreography and orchestration patterns for coordinating agents. I'll talk about state management. I'll talk about failure recovery and how we can design for failure in production systems. Then I'll share what a production-grade architecture looks like, in as simple a way as possible. I'll also show you an example of how we build this on Databricks. So let's dive into it.
You see, one agent works beautifully. You have got your LLM, some prompts, maybe a retrieval-augmented generation pipeline, maybe some tool calls. It demos great. Leadership loves it. You feel happy and your team is happy. And then product comes back with a request that changes everything: they want five more agents. And here's what happens. You think, okay, I know how to build agents, and I will add five more. Except now you have coordination problems.
Agent A produces data that Agent B needs. Agent C is waiting on both Agent A and Agent B. Agent D just updated the shared state that Agent B was reading, and Agent E just crashed and took down the entire workflow. This is no longer an AI problem. This is a distributed systems problem, and most of you didn't sign up to be distributed systems engineers.
Let me tell you about a production deployment where this went very wrong. We built a credit decisioning system for a financial services company. The first agent, credit score calculation, worked perfectly. It worked great in demos: two weeks in production, zero issues. Then we added four more agents: income verification, risk assessment, fraud detection, and final approval. We deployed all five. Within three days, we started seeing weird approvals. 20% of decisions had incorrect risk ratings. Customers who should have been flagged were getting approved. The business team was panicking. It took us two days to find out what was happening.
The credit score agent calculated a score of 750 and wrote it to the database. The risk assessment agent, on the other hand, read from the database 500 milliseconds later and got a score of 680 for the same customer. Why did it happen? Because we had a caching layer for customer records. The write to PostgreSQL succeeded, but the cache was not invalidated. The risk agent read from the cache and got stale data. It used the wrong score and made the wrong decision. This is a classic distributed systems problem: we had a caching layer between the agents and the database, cache invalidation failed, and the agent was reading stale values. The race condition wasn't in the database; it was in the architecture: multiple agents, a shared cache, and no coordination on cache invalidation.
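The stale read in this story can be reproduced in a few lines. The sketch below is a toy model, not the system from the talk: plain dictionaries stand in for PostgreSQL and the caching layer, and the names ScoreStore and risk_agent_sees are hypothetical. It shows the same sequence of events with and without cache invalidation on write.

```python
class ScoreStore:
    """Toy read-through cache in front of a 'database' (both are plain dicts)."""

    def __init__(self, invalidate_on_write: bool) -> None:
        self.db: dict[str, int] = {}
        self.cache: dict[str, int] = {}
        self.invalidate_on_write = invalidate_on_write

    def read(self, customer_id: str) -> int:
        # Serve from the cache when possible; otherwise load from the database.
        if customer_id not in self.cache:
            self.cache[customer_id] = self.db[customer_id]
        return self.cache[customer_id]

    def write(self, customer_id: str, score: int) -> None:
        self.db[customer_id] = score
        if self.invalidate_on_write:
            # Drop the cached copy so the next read goes back to the database.
            self.cache.pop(customer_id, None)


def risk_agent_sees(store: ScoreStore) -> int:
    store.db["cust-1"] = 680        # old score already in the database
    store.read("cust-1")            # an earlier read warms the cache
    store.write("cust-1", 750)      # credit score agent writes the new score
    return store.read("cust-1")     # risk agent reads shortly afterwards


stale = risk_agent_sees(ScoreStore(invalidate_on_write=False))
fresh = risk_agent_sees(ScoreStore(invalidate_on_write=True))
print(stale, fresh)
```

Without invalidation the risk agent still sees the old score even though the write succeeded; with invalidation it sees the new one. Real caches add expiry and concurrency concerns that this sketch ignores.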
This took us quite a while to find the pattern. It created delays in delivery and led to wrong decisions, and here's the lesson we learned. The problem was, of course, not with the model. The problem wasn't with the prompts. The problem was that we built a distributed system without distributed systems thinking. And that's what kills multi-agent projects: not bad AI, but bad architecture. Now I will show you the architecture that works. We will look into a production-grade architecture, but first let's understand why this complexity explodes so quickly.
When you move from a one-agent system to a multi-agent system, let's say five agents, it doesn't get just five times harder. It gets 25 times more complex. Coordination complexity grows exponentially. One agent has zero coordination problems. Two agents have at least one connection. Five agents have at least 10 potential connections, and each connection is a failure point, a race condition, a state synchronization problem. You are not just building five agents. You are building a coordination problem across multiple relationships, with the possibility of multiple failure modes. That's why the complexity increases very, very quickly.
Now I'm going to show you a few critical patterns. The first pattern is about how to coordinate multiple agents. Then we will talk about how you can manage state, and then we'll talk about how we can recover and design for failure. These patterns come from multiple years of distributed systems work, and I can directly apply them to multi-agent AI systems. Once you get the basics, it's really hard to miss these patterns when you build a multi-agent AI architecture.
The first decision you need to make is about choreography or orchestration. These are the two fundamental patterns for distributed coordination. Choreography means agents coordinate through events. They are decentralized; they're autonomous. Orchestration means a central coordinator manages the workflow. This is centralized and controlled. Most teams pick one instinctively and regret it. Let me show you when to use each.
Let's start with choreography. Choreography is event-driven. The research agent finishes research and publishes a "research completed" event to a message bus. Agent B is subscribed to that message bus and listens for the event type it is interested in. The analysis agent subscribes to the event type, picks it up, does its analysis, and publishes "analysis ready". Then the report agent picks up that "analysis ready" event and generates the report. There is no central coordinator here. Each agent is autonomous, listening for the events it cares about and publishing when it is done. This is the beauty of choreography: agents are loosely coupled, and it is easy to add new agents and make them subscribe to the events that they're interested in. This drives high autonomy and scales really well.
However, the nightmare of choreography is debugging. When something fails, you're playing detective with no real clue. Which agent failed to publish? Did the event get consumed? Did the event get consumed twice? You need bulletproof observability to make choreography work. Even with event propagation, you need strong guarantees around the delivery of these events. Without this, debugging is really hard.
So when should you use choreography? You use choreography when your workflow is naturally event-driven, when agents need to operate independently, and when you're adding agents frequently and don't want to update a central coordinator. But it is important to understand that it is possible only if you have strong observability. If you can't trace events through your system, choreography will destroy you. I've seen teams choose choreography because it feels more agentic, more autonomous, and then spend months firefighting because they can't debug distributed event flows. Don't make that mistake.
Now let's look at the alternative: orchestration. Orchestration is centralized. You have a workflow orchestrator that calls each agent directly.
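The choreography flow described in this passage (research, then analysis, then report, with no coordinator) can be sketched with a tiny in-process publish/subscribe bus. This is an illustrative assumption, not the speaker's implementation: EventBus, the event names, and the agent functions are hypothetical, and a production system would use a real message broker with delivery guarantees. The trace list is a stand-in for the observability the speaker insists on.

```python
from collections import defaultdict
from typing import Callable

Handler = Callable[[dict], None]


class EventBus:
    """Minimal synchronous, in-process publish/subscribe bus."""

    def __init__(self) -> None:
        self._handlers: dict[str, list[Handler]] = defaultdict(list)
        self.trace: list[str] = []  # ordered record of published event types

    def subscribe(self, event_type: str, handler: Handler) -> None:
        self._handlers[event_type].append(handler)

    def publish(self, event_type: str, payload: dict) -> None:
        self.trace.append(event_type)
        for handler in self._handlers[event_type]:
            handler(payload)


bus = EventBus()
reports: list[str] = []


def analysis_agent(event: dict) -> None:
    # Reacts to finished research and announces its own result.
    bus.publish("analysis_ready", {"summary": f"analysis of {event['topic']}"})


def report_agent(event: dict) -> None:
    reports.append(f"Report: {event['summary']}")


bus.subscribe("research_completed", analysis_agent)
bus.subscribe("analysis_ready", report_agent)

# The research agent finishes and publishes; nobody calls the other agents directly.
bus.publish("research_completed", {"topic": "credit risk"})
print(bus.trace)
print(reports)
```

Adding a new agent here means adding one more subscribe call, which is the loose coupling the talk praises. The cost is that the only record of what happened is whatever the bus chooses to trace.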
Agent A runs first. The orchestrator calls Agent A, waits for the result, and gets the result back. Then the orchestrator calls Agents B and C in parallel. If there are agents that need to run in parallel, the orchestrator manages the parallelism, not the agents. B and C return their results to the orchestrator. Then the orchestrator calls Agent D with the combined results from B and C. Every call goes through the orchestrator. Agents never call each other. The orchestrator is the single source of truth. It knows the entire execution graph. It manages state. It handles retries. It logs every step. Agents are dumb: they just take the input, do the work, and return the output. The orchestrator does all the smart coordination. In Databricks, one way to implement this pattern would be with LangGraph wired into the Mosaic AI Agent Framework as the orchestrator, but any workflow engine that gives you directed acyclic graphs, state, and proper retry mechanisms would fit this kind of orchestrator pattern.
You use orchestration when you have complex dependencies that need central management, when you need to roll back or compensate for failures, when you want one dashboard showing the entire system state, and when your workflow is relatively stable. In financial services, for example, we use orchestration almost exclusively. Why? Because it provides easy debugging and the ability to roll back, and that matters more than autonomy in these kinds of industries. When something goes wrong with a credit decision, for example, we need to know exactly which agent made that call, in what order, and with what data. Orchestration gives us that. Choreography doesn't.
So how do you choose? Here is your decision framework. Two axes: workflow complexity, simple to complex, and autonomy requirements, low to high. Simple workflow, high autonomy: you go with choreography. Complex workflow with low autonomy tolerance: you go with orchestration. The interesting quadrant is the top right, where you need a complex workflow but agents need autonomy.
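The orchestrated flow described above (A first, then B and C in parallel, then D on the combined results) can be sketched with the standard library alone. The agent functions and their outputs are hypothetical placeholders, and this is not LangGraph code; it only illustrates that the orchestrator owns the ordering, the parallelism, and the log.

```python
from concurrent.futures import ThreadPoolExecutor


def agent_a(request: dict) -> dict:
    return {"score": 750, "customer": request["customer"]}


def agent_b(state: dict) -> dict:
    return {"income_verified": True}


def agent_c(state: dict) -> dict:
    return {"risk": "low" if state["score"] >= 700 else "high"}


def agent_d(state: dict) -> dict:
    approved = state["income_verified"] and state["risk"] == "low"
    return {"approved": approved}


def run_workflow(request: dict) -> tuple[dict, list[str]]:
    log: list[str] = []  # the orchestrator logs every step

    a_out = agent_a(request)
    log.append("A")

    # The orchestrator, not the agents, owns the parallelism for B and C.
    with ThreadPoolExecutor(max_workers=2) as pool:
        future_b = pool.submit(agent_b, a_out)
        future_c = pool.submit(agent_c, a_out)
        b_out, c_out = future_b.result(), future_c.result()
    log.append("B+C")

    # D receives the combined results; agents never call each other.
    d_out = agent_d({**a_out, **b_out, **c_out})
    log.append("D")
    return d_out, log


decision, steps = run_workflow({"customer": "cust-1"})
print(decision, steps)
```

Because every call passes through run_workflow, the step log answers the question the speaker cares about in regulated settings: which agent ran, in what order, with what data.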
That top-right quadrant is where you use hybrid patterns: choreography with saga patterns for compensation. I'll talk about this pattern later in this session as well. Tools like Agent Bricks on Databricks are starting to package these orchestration patterns for common multi-agent use cases, so you don't need to rebuild them every time. It makes building these patterns really easy in production environments. I use this decision matrix every time I make decisions with customers based on their use cases. It's worth taking a screenshot; I'm sure you will reference it. I'll show you what a production orchestration actually looks like at the tail end of this session.
All right, now we have chosen a coordination pattern. Let's talk about the thing that actually breaks when you scale: state. How do agents share data without race conditions, without stale reads, without mystery bugs? Here's what most people do first, and it's wrong: shared mutable state. Multiple agents writing to the same database records at the same time. Agent A reads the credit score, calculates the value, and writes it back. Agent B does the same thing at the same time. Both read 680. Agent A writes 750. Agent B writes 720. Last write wins. Agent A's update disappears. Lost update.
I understand, yes, modern databases have protections in place for this: row locks, isolation levels, et cetera. But you have to use them correctly: explicit transactions, serializable isolation, SELECT FOR UPDATE. And many teams don't. They use default isolation, they don't use explicit locks, and they ship race conditions to production. We made that mistake, and it resulted in delayed value to the business. We just assumed that the database would handle these conditions, but it doesn't. When it gets really complex, you have to handle them explicitly in the code.
Now here's what works: immutable state snapshots with versioning. Agent A produces a state version, let's say version 1. It's sealed. It's immutable. Nobody can modify it. State is stored in the orchestrator's database as an append-only log. These are insert operations, not updates. Agent A hands state version 1 to Agent B. Agent B validates the schema, checks that the data contract matches its expectations, processes it, and produces state version 2, also immutable. Agent B inserts version 2 as a new row. It doesn't update version 1. Then it hands it to Agent C. Same thing: schema validation, version tracking, and an immutability guarantee at each handoff. Now, if Agent C fails, you roll back to version 2. If you need to debug, you replay state evolution from version 1 through version N, and you can see exactly what each agent received and produced. This eliminates race conditions: no concurrent modification of the same record. Each agent appends a new version instead of updating the shared state. Of course, if you want to save the state snapshots, they can be logged in any sort of append-only storage for audit and replay, but they are never shared for read.
All right, here's how it looks in code. The AgentState class: "frozen" means immutable. It has a version number, the data payload, and who created it. The handoff function does three things. First, it validates the schema. This is the contract enforcement: we are checking that Agent A's output matches Agent B's input contract. This is critical, and we will come back to it. Second, increment the version: create a new immutable state object with version N plus one. Third, execute the next agent with that immutable state. The agent can't modify the input state. It can only produce a new state. This prevents an entire class of bugs. It prevents race conditions on shared state. No stale reads. It provides clear lineage: every state has a version, and you know who created it. When something goes wrong, you can trace back through state evolution. Version 7 produced weird output? Look into version 6, which went into the agent. Look at version 5 before that. You can binary search through your state history to find where things went wrong, and this becomes really, really powerful.
Now, state management is half the battle. Data contracts are the other half. Agent A can't just throw arbitrary data at Agent B and hope it works. It doesn't work that way. They need a contract in place. In this example, the research agent promises to output findings, a confidence score, sources, a timestamp, et cetera. The analysis agent declares that it requires research agent output with type 1. And it validates: if confidence is below 0.7, it will reject the handoff. This is the contract. If the research agent tries to hand off low-quality data, the contract catches it at the boundary. You find out immediately, not three agents downstream when it produces a garbage report. When we work with customers using Databricks, one way of doing it is registering these input/output schemas in Unity Catalog, so every agent's contract is versioned and governed in one place.
All right, we talked about coordination patterns. We talked about state management. Now let's talk about another thing that you need to keep in mind, and that's failure and recovery. The reason this is important is that agents will fail. That's inevitable. The LLM will time out, the API will rate limit you, the agent will crash mid-workflow. What happens then? That is what you need to plan for and design into the system. Let's talk about a few patterns.
The first pattern is the circuit breaker pattern, and this comes straight from distributed systems. When Agent A calls Agent B, it wraps that call in a circuit breaker. If Agent B fails repeatedly, say five times in a row, the circuit breaker opens. Now, instead of waiting for a timeout every single time, you basically fail fast: circuit open, Agent B is down. You just try again later. You are not bombarding Agent B with requests. You're protecting your system.
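Stepping back to the state handoff and data contract described earlier in this passage, here is a minimal sketch of both together. It follows the description (a frozen AgentState with version, payload, and creator; validation at the boundary; a new version per handoff; rejection below 0.7 confidence), but the field names, the ContractViolation exception, and the exact ordering inside handoff are assumptions made for illustration, not the code shown in the talk.

```python
from dataclasses import dataclass
from types import MappingProxyType
from typing import Any, Callable, Mapping


@dataclass(frozen=True)
class AgentState:
    version: int
    data: Mapping[str, Any]   # wrapped in MappingProxyType so it is read-only
    created_by: str


class ContractViolation(ValueError):
    """Raised when a handoff does not satisfy the receiving agent's contract."""


REQUIRED_FIELDS = ("findings", "confidence", "sources", "timestamp")
MIN_CONFIDENCE = 0.7


def validate_research_output(state: AgentState) -> None:
    missing = [name for name in REQUIRED_FIELDS if name not in state.data]
    if missing:
        raise ContractViolation(f"missing fields: {missing}")
    if state.data["confidence"] < MIN_CONFIDENCE:
        raise ContractViolation("confidence below 0.7, handoff rejected")


def handoff(state: AgentState,
            next_agent: Callable[[Mapping[str, Any]], dict],
            agent_name: str) -> AgentState:
    validate_research_output(state)      # enforce the contract at the boundary
    output = next_agent(state.data)      # the agent can read but not modify its input
    return AgentState(                   # a new sealed version; the old one is untouched
        version=state.version + 1,
        data=MappingProxyType(dict(output)),
        created_by=agent_name,
    )


def analysis_agent(data: Mapping[str, Any]) -> dict:
    return {"recommendation": "approve", "based_on": data["findings"]}


v1 = AgentState(1, MappingProxyType({
    "findings": "stable income", "confidence": 0.9,
    "sources": ["bureau"], "timestamp": "2024-01-01T00:00:00Z",
}), "research_agent")
v2 = handoff(v1, analysis_agent, "analysis_agent")
print(v2.version, v2.created_by)
```

Attempting to assign to v1.version or to mutate v1.data raises an error, which is the point: a later agent cannot silently change what an earlier agent produced.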
After a timeout period, let's say 60 seconds, the circuit goes half open. Then you test Agent B again with one request. If it succeeds, the circuit closes and normal operation resumes. If it fails, the circuit opens again and resets the timer. This prevents failures from cascading through the system. One agent going down doesn't bring your entire workflow down. You gracefully degrade. Maybe you skip that agent and continue with reduced functionality. Maybe you use cached results. Maybe you alert a human. But you don't crash the entire workflow. Circuit breakers are the single most important failure recovery pattern for multi-agent systems. Every agent call should be wrapped in a circuit breaker. We enforce these circuit breaker policies at the serving layer on Databricks, through Model Serving or through AI Gateway.
Here's how it looks in code. You track the failure count and you track the state. When you call an agent, you check the state first. If it is open, you fail fast. You don't even try. If it is closed, you make the call. If the call succeeds, you reset the failure count and stay closed. If it fails, you increment the failure count. If you hit the threshold, you open the circuit. After the timeout period, you transition to half open. You test one request. If it succeeds, you close the circuit. If it fails, you open it again. This is a simple pattern, but it has a massive impact. In Databricks, you can log every open/closed transition in MLflow. You can see when an agent started flaking out.
Now, let's talk about another pattern. We call it the compensation pattern, also called the saga pattern. Every agent has two methods: execute and compensate. Execute does the work; compensate rolls it back and undoes it. The orchestrator tracks which agents have executed. If the execution agent fails, the orchestrator walks backward through the executed agents and calls compensate for each one. The analysis agent compensates.
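The circuit breaker walk-through above (closed, open, half open, with a failure count and a timeout) maps onto a small class. This is a single-threaded sketch under stated assumptions: the class and exception names are hypothetical, the clock is injectable so the demo does not need to sleep, and it is not the policy mechanism of any serving layer.

```python
import time
from typing import Callable, TypeVar

T = TypeVar("T")


class CircuitOpenError(RuntimeError):
    """Raised when a call is rejected without contacting the agent."""


class CircuitBreaker:
    def __init__(self, failure_threshold: int = 5, reset_timeout: float = 60.0,
                 clock: Callable[[], float] = time.monotonic) -> None:
        self.failure_threshold = failure_threshold
        self.reset_timeout = reset_timeout
        self.clock = clock
        self.state = "closed"
        self.failures = 0
        self.opened_at = 0.0

    def call(self, agent: Callable[[], T]) -> T:
        if self.state == "open":
            if self.clock() - self.opened_at < self.reset_timeout:
                raise CircuitOpenError("circuit open: failing fast")
            self.state = "half_open"          # timeout elapsed: allow one trial call
        try:
            result = agent()
        except Exception:
            self.failures += 1
            if self.state == "half_open" or self.failures >= self.failure_threshold:
                self.state = "open"
                self.opened_at = self.clock()  # (re)start the timer
            raise
        self.failures = 0                      # success: reset and close
        self.state = "closed"
        return result


now = [0.0]  # fake clock so the demo needs no sleeping
breaker = CircuitBreaker(failure_threshold=2, reset_timeout=60.0, clock=lambda: now[0])


def failing_agent() -> str:
    raise TimeoutError("agent B timed out")


for _ in range(2):
    try:
        breaker.call(failing_agent)
    except TimeoutError:
        pass
print(breaker.state)
```

While the circuit is open, callers get CircuitOpenError immediately and can choose a fallback such as cached results or a human alert. Once the fake clock passes 60 seconds, the next call is the half-open trial described in the talk.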
It deletes the draft recommendation that it originally wrote to the system. Then the research agent compensates by clearing the cached research data that it gathered previously. So you're back to the initial state. No partial transactions, no stuck workflows. This is a simple rollback pattern that you can implement in a multi-agent system. Compensation gives you transactional semantics across distributed agents. It is not sexy, but it's how production systems handle partial failures. Every orchestrated workflow needs this kind of compensation pattern, and you need to plan for it depending on what you're doing with your workflows.
Here's how compensation looks in code. Every agent, as I mentioned earlier, has two methods: the execute method and the compensate method. Execute does the work; compensate undoes it. That's the contract. Every operation must be reversible. The orchestrator tracks which agents have run successfully, and it keeps the list. Agent A executes and gets added. Agent B executes and gets added. Agent C fails. Now we walk backward through the list in reverse order. Agent B compensates first: it undoes the work that it has done. Agent A compensates next: it undoes the work that Agent A has done, and we are back to the initial state. This is the saga pattern from distributed databases. Financial services requires this.
Now that we have covered these different patterns, I want to show you what a production architecture looks like when you bring these things together. You have the orchestrator on the left-hand side. It's the brain of the workflow. It contains the workflow engine. It contains the state store, holding versions 0 through N. And it feeds the observability layer: it handles the observability data. Every call goes through the orchestrator. The orchestrator calls Agent A. Agent A returns state version 1 to the orchestrator. The orchestrator then calls Agents B and C in parallel.
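The compensation walk-through above (track what executed, then compensate in reverse order on failure) can be sketched as follows. StepAgent and run_saga are hypothetical names, and the agents only record what they did instead of touching real systems; the point is the ordering of the rollback.

```python
class StepAgent:
    """Toy agent with the two-method contract: execute and compensate."""

    def __init__(self, name: str, fail: bool = False) -> None:
        self.name = name
        self.fail = fail

    def execute(self, log: list[str]) -> None:
        if self.fail:
            raise RuntimeError(f"{self.name} failed")
        log.append(f"execute:{self.name}")

    def compensate(self, log: list[str]) -> None:
        # A real agent would undo its side effects here (delete a draft, clear a cache).
        log.append(f"compensate:{self.name}")


def run_saga(agents: list[StepAgent], log: list[str]) -> bool:
    executed: list[StepAgent] = []
    for agent in agents:
        try:
            agent.execute(log)
        except Exception:
            # Walk backward through the agents that succeeded and undo each one.
            for done in reversed(executed):
                done.compensate(log)
            return False
        executed.append(agent)
    return True


log: list[str] = []
ok = run_saga(
    [StepAgent("research"), StepAgent("analysis"), StepAgent("execution", fail=True)],
    log,
)
print(ok, log)
```

The failed agent itself is not compensated because it never completed; only the steps recorded as executed are undone, most recent first. A production version would also need compensate to be safe to retry, which this sketch does not address.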
When B and C run in parallel, both receive state version 1 from the orchestrator. They return results. The orchestrator stores versions 2 and 3. Finally, the orchestrator calls D with these combined results. Agents never call each other. All coordination happens through the orchestrator. This is what gives us control, observability, and the capability to roll back. This runs 24/7 across billions of transactions because the orchestrator is the single source of truth.
All right, here's a production architecture that you could implement with the Databricks Data Intelligence Platform. For the orchestration layer, you can have LangGraph wired into the Mosaic AI Agent Framework. It handles multi-agent orchestration. It manages the workflow graph and knows which agents to call in what order. Each agent is implemented as a Unity Catalog function. It could be written in SQL or Python, or it could be a model registered in Unity Catalog. When you register these assets in Unity Catalog, they are discoverable centrally within the organization. They can be governed in one place and they can be versioned, which is really critical for operating these workflows in production. We expose these agents through Databricks Model Serving or function serving, and that's where we enforce circuit-breaker-style policies like retries, timeouts, or rate limits at the serving layer, typically via AI Gateway configuration.
Now, when we talk about the data layer, Delta Lake stores everything. It stores not only the state versions from the agents but also customer data and all the data that you need for your workflows to work. Talking about the state snapshots, a Delta table is immutable and versioned. For us, those state versions are just rows in a Delta table. We never update them in place. Each agent run is tied to a state version via MLflow traces, so we can step through the evolution when something breaks. Now, I just wanted to touch upon Unity Catalog.
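The idea of state versions as rows that are inserted and never updated can be illustrated with any table store. The sketch below uses the standard library's sqlite3 purely as a stand-in; it is not Delta Lake and does not use any Databricks API. The table layout and the function names append_state, read_state, and history are assumptions for illustration.

```python
import json
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE state_log ("
    " workflow_id TEXT NOT NULL,"
    " version INTEGER NOT NULL,"
    " created_by TEXT NOT NULL,"
    " payload TEXT NOT NULL,"
    " PRIMARY KEY (workflow_id, version))"
)


def append_state(workflow_id: str, created_by: str, payload: dict) -> int:
    """Insert the next version as a new row; existing rows are never updated."""
    with conn:  # commits on success, rolls back on error
        (latest,) = conn.execute(
            "SELECT COALESCE(MAX(version), 0) FROM state_log WHERE workflow_id = ?",
            (workflow_id,),
        ).fetchone()
        conn.execute(
            "INSERT INTO state_log VALUES (?, ?, ?, ?)",
            (workflow_id, latest + 1, created_by, json.dumps(payload)),
        )
    return latest + 1


def read_state(workflow_id: str, version: int) -> dict:
    row = conn.execute(
        "SELECT payload FROM state_log WHERE workflow_id = ? AND version = ?",
        (workflow_id, version),
    ).fetchone()
    return json.loads(row[0])


def history(workflow_id: str) -> list[tuple[int, str]]:
    return conn.execute(
        "SELECT version, created_by FROM state_log"
        " WHERE workflow_id = ? ORDER BY version",
        (workflow_id,),
    ).fetchall()


append_state("wf-1", "agent_a", {"score": 750})
append_state("wf-1", "agent_b", {"score": 750, "risk": "low"})
print(history("wf-1"))
print(read_state("wf-1", 1))
```

Rolling back to an earlier version is just reading an earlier row, and replaying the workflow is reading the rows in order. The primary key on (workflow_id, version) means two writers claiming the same version produce an error instead of a silent lost update.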
It governs everything: access control, lineage, and audit trail for both data and agents. MLflow gives us per-agent tracing and evaluation capabilities, with out-of-the-box LLM-as-judge evaluators and metrics on every call. And as I mentioned earlier, Agent Bricks is Databricks' higher-level way of packaging these orchestration patterns for common multi-agent use cases, so you don't need to rebuild them every time.
So, just to wrap up this workflow: the LangGraph orchestrator calls Agent A, a Unity Catalog function or model. It gets the result and writes version 1 of the state to Delta. It then calls Agent B with state version 1, writes version 2, and so on. MLflow traces every call: latency, inputs, outputs, token usage. A circuit breaker at the serving layer guards each call. If Agent C fails, LangGraph triggers compensation logic and walks backward, calling the compensate functions for the previous successful steps. These kinds of patterns run in production day in and day out.
So thank you for hearing me out. You can reach out to me on LinkedIn. You can scan this QR code, which will take you directly to my LinkedIn profile. I would like to leave you with three final thoughts. First of all, agent chaos is inevitable. When you scale past one agent, you will hit coordination problems, race conditions, and cascading failures. That's guaranteed. The complexity curve doesn't lie. Agent choreography is a choice. You can build systems with proper patterns: orchestration, choreography, immutable state, circuit breakers, compensation patterns, data contracts. Make sure you understand these patterns and bring them to your production architecture. Doing so will help you build systems, not demos. Demos are easy. You use an LLM to show something cool. Everyone can do it. These things don't work in production. In production, you have to build systems. And systems are hard. Systems are what create value for businesses.
Everything I showed you today, choreography versus orchestration, immutable state, circuit breakers, is unsexy infrastructure work. You don't get applause for implementing a circuit breaker. But you make your systems more reliable. They don't fail at 2 a.m. That is what people notice over time. Be a systems engineer. The patterns here work. Apply them in your production architecture. Thank you very much for watching. Bye.

From Chaos to Choreography: Multi-Agent Orchestration Patterns That Actually Work — Sandipan Bhaumik

TL;DR

Takeaways

Vocabulary

Transcript