Balaji Srinivasan

The Saga Pattern: "Undo" Buttons for AI Agents

6 minutes (1578 words)

We treat AI agents like magic wands. We ask them to "onboard this new employee," "process this claim," or "add a spouse to the insurance policy," and we expect them to figure it out.

And often, they do. Until step 4 of 5 fails.

Then we discover the hard truth about "agentic" workflows: They are distributed transactions in disguise.

Let's look at a real-world scenario in the Indian Insurtech space: Group Health Insurance (GHI) endorsements.

An employee asks an AI bot: "I just got married. Please add my spouse to my company insurance policy."

Behind the scenes, the agent must:

  1. Update HRMS (e.g., Darwinbox/Workday) to record the dependent.
  2. Calculate the prorated premium and set up a payroll deduction for the next cycle.
  3. Notify the Insurer (e.g., ICICI Lombard/Star) to endorse the policy.
  4. Notify the TPA (e.g., MediAssist/Vidal) to issue the e-card.

If the TPA API is down on Step 4, you can't just "error out."

You have a partial state failure. The system is now inconsistent: The employee pays for insurance they can't use.

In traditional microservices, we solved this with the Saga Pattern. In the AI world, we're mostly ignoring it.

Let's fix that.


🔗The "Happy Path" Trap

Here is the code most teams write for an "Endorsement Bot":

```python
async def process_endorsement(request):
    # 1. Ask LLM to extract spouse details
    details = await llm.extract_dependent_data(request)

    # 2. Execute tools blindly
    hrms_id = await hrms_tool.add_dependent(details)
    deduction_id = await payroll_tool.schedule_deduction(amount=2500, month="Feb")
    policy_no = await insurer_tool.add_member(details, coverage="5L")
    card_id = await tpa_tool.generate_ecard(policy_no, details)

    return f"Done! Here is your card: {card_id}"
```

It looks clean. It demos beautifully.

But what happens if the TPA API (tpa_tool) times out?

The first three calls have already succeeded, and nothing undoes them. This isn't a bug; it's a lack of distributed consistency.
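To make the partial state concrete, here is a toy simulation of that happy path, with hypothetical in-memory stubs standing in for the four systems and the TPA call timing out:

```python
import asyncio

# Hypothetical in-memory stubs for the four downstream systems.
state = {"hrms": [], "payroll": [], "insurer": [], "tpa": []}

async def add_dependent(details):
    state["hrms"].append(details)

async def schedule_deduction(amount):
    state["payroll"].append(amount)

async def add_member(details):
    state["insurer"].append(details)

async def generate_ecard(policy_no):
    raise TimeoutError("TPA API down")  # step 4 fails

async def happy_path():
    await add_dependent({"name": "spouse"})
    await schedule_deduction(2500)
    await add_member({"name": "spouse"})
    await generate_ecard("POL-123")

try:
    asyncio.run(happy_path())
except TimeoutError:
    pass

# Three systems mutated, one never confirmed, and nothing cleans up:
print({k: len(v) for k, v in state.items()})
# {'hrms': 1, 'payroll': 1, 'insurer': 1, 'tpa': 0}
```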

In a monolithic database, you’d wrap this in a transaction (BEGIN... ROLLBACK). But you can't wrap Darwinbox, the Insurer API, and the TPA API in a database transaction.
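For contrast, here is what that guarantee looks like when everything lives in one database: a single transaction (sketched with SQLite here) rolls back every write the moment any step fails.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE dependents (name TEXT)")
conn.execute("CREATE TABLE deductions (amount INTEGER)")

try:
    with conn:  # one atomic transaction: commit on success, rollback on error
        conn.execute("INSERT INTO dependents VALUES ('spouse')")
        conn.execute("INSERT INTO deductions VALUES (2500)")
        raise TimeoutError("downstream call failed")  # simulate a mid-flow failure
except TimeoutError:
    pass

# Both writes were rolled back together:
print(conn.execute("SELECT COUNT(*) FROM dependents").fetchone()[0])  # 0
```

This is the safety net we lose the moment the writes are spread across third-party APIs.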

You need a way to roll back the real world.


🔗Enter the Saga Pattern

Use cases like this are why the Saga Pattern exists.

The Core Idea: A long-running transaction is broken into a sequence of local transactions. If one fails, you execute a series of compensating transactions to undo the changes made by the preceding steps.

For every Action, you must define a Compensation.

| Step | Action | Compensation (Undo) |
|------|--------|---------------------|
| 1 | `hrms.add_dependent()` | `hrms.remove_dependent()` |
| 2 | `payroll.schedule_deduction()` | `payroll.cancel_deduction()` |
| 3 | `insurer.endorse_policy()` | `insurer.cancel_endorsement()` |
| 4 | `tpa.issue_card()` | N/A (this is the failure point) |

If Step 4 crashes, the system must reliably execute cancel_endorsement(), then cancel_deduction(), and finally remove_dependent().


🔗Implementing Sagas for Agents

You don't need a heavy framework to start. You need a data structure that remembers what it did.

Here is a mature pattern for an "Agentic Saga" in Python.

🔗1. The Saga Context

First, we need a way to track our "stack" of completed actions.

```python
import logging

logger = logging.getLogger(__name__)

class SagaContext:
    def __init__(self):
        self.compensations = []  # Stack of (undo function, kwargs) pairs

    def add_compensation(self, func, **kwargs):
        """Register a function to call if we need to roll back."""
        self.compensations.append((func, kwargs))

    async def rollback(self):
        """Execute compensations in LIFO order (Last-In, First-Out)."""
        logger.warning("🚨 Initiating rollback...")
        for func, kwargs in reversed(self.compensations):
            try:
                await func(**kwargs)
            except Exception as e:
                # CRITICAL: structured logging here.
                # If a compensation fails, you have a 'Zombie Transaction'
                # and a human MUST intervene.
                logger.critical(f"Compensation failed: {e}", extra={"context": kwargs})
```
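Here's a quick standalone check of the rollback ordering (a trimmed copy of the class, plus hypothetical undo functions that just record their calls):

```python
import asyncio

# Trimmed copy of SagaContext so this snippet runs standalone.
class SagaContext:
    def __init__(self):
        self.compensations = []

    def add_compensation(self, func, **kwargs):
        self.compensations.append((func, kwargs))

    async def rollback(self):
        for func, kwargs in reversed(self.compensations):
            await func(**kwargs)

undo_calls = []

async def remove_dependent(dep_id):
    undo_calls.append(f"remove_dependent({dep_id})")

async def cancel_deduction(txn_id):
    undo_calls.append(f"cancel_deduction({txn_id})")

async def demo():
    saga = SagaContext()
    saga.add_compensation(remove_dependent, dep_id="dep-1")  # registered first
    saga.add_compensation(cancel_deduction, txn_id="txn-9")  # registered second
    await saga.rollback()

asyncio.run(demo())
print(undo_calls)  # ['cancel_deduction(txn-9)', 'remove_dependent(dep-1)'] -- LIFO
```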

🔗2. The Step Execution

Now, we rewrite our Endorsement Bot to be Saga-aware.

```python
async def process_endorsement_safely(details):
    saga = SagaContext()

    try:
        # Step 1: Update HRMS (Darwinbox/Workday)
        emp_record = await hrms_tool.add_dependent(details)
        # REGISTER COMPENSATION IMMEDIATELY
        saga.add_compensation(hrms_tool.remove_dependent,
                              emp_id=details.emp_id,
                              dep_id=emp_record.id)

        # Step 2: Payroll deduction
        deduction = await payroll_tool.schedule_deduction(amount=2500)
        saga.add_compensation(payroll_tool.cancel_deduction,
                              txn_id=deduction.id)

        # Step 3: Insurer endorsement
        policy_update = await insurer_tool.add_member(details)
        saga.add_compensation(insurer_tool.cancel_endorsement,
                              endorsement_id=policy_update.id)

        # Step 4: TPA e-card (suppose this fails)
        card = await tpa_tool.generate_ecard(policy_update.policy_no)

        return "Endorsement Complete"

    except Exception as e:
        print(f"Endorsement failed: {e}")
        # TRIGGER THE UNDO STACK
        await saga.rollback()
        raise  # re-raise without losing the original traceback
```

🔗What actually happens now?

  1. HRMS adds the spouse. Undo added to stack.
  2. Payroll schedules the ₹2,500 deduction. Undo added to stack.
  3. Insurer API runs successfully. Undo added to stack.
  4. TPA API FAILS (Timeout/Down).
  5. Exception is caught.
  6. saga.rollback() runs:
    • Pops cancel_endorsement. Executes it. Insurer record removed.
    • Pops cancel_deduction. Executes it. Salary is safe.
    • Pops remove_dependent. Executes it. HRMS cleaned up.
  7. System returns to a clean state. The employee can try again later without financial loss.

🔗The "Agent" Twist: Hallucinated Parameters

In traditional software, we code the inputs rigidly. In AI, the model might hallucinate the parameters for the compensation.

Do not let the LLM decide how to compensate.

If the LLM says "Refund the user," it might hallucinate the amount or the transaction ID.

Architectural Rule: The Action can be probabilistic (AI deciding what to do), but the Compensation must be deterministic (Code knowing how to undo it).

When your tool returns success, it should return the exact artifact (ID, receipt, resource URI) needed to undo it. Store that in your code's memory (the Saga Stack), not in the LLM's context window.
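A sketch of that rule, using a hypothetical payroll tool: the undo parameters are captured from the tool's own response at the moment of success, so the LLM never gets a chance to invent them.

```python
import asyncio

compensations = []

async def payroll_schedule_deduction(amount):
    # Hypothetical tool: on success it returns the exact artifact
    # (transaction ID, amount) needed to undo itself.
    return {"txn_id": "txn-8841", "amount": amount}

async def payroll_cancel_deduction(txn_id):
    print(f"cancelled {txn_id}")

async def run_step():
    receipt = await payroll_schedule_deduction(2500)
    # The undo parameters come from the tool's own response, captured
    # deterministically in code -- never from the LLM's context window.
    compensations.append((payroll_cancel_deduction, {"txn_id": receipt["txn_id"]}))

asyncio.run(run_step())
```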


🔗Real-World Nuance: Distributed Logic

The simple try/catch block above works for a single script. But production agents run on queues, serverless functions, or across days.

If your process crashes (OOM, server restart) during the insurer API call, the in-memory compensations list is lost.

For true production resilience, you need Durable Execution.

Tools like Temporal, LittleHorse, or Orkes Conductor are built exactly for this. They persist the "Instruction Pointer" and the variable state to a database after every step.

If your node dies at, say, step 3, the engine spins up another node, restores the workflow's variables, and resumes at step 3.
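As a toy illustration of the idea (not a substitute for those engines), the completed-step journal can simply be persisted outside the process, so a fresh process can rebuild the undo stack after a crash:

```python
import json
import os
import tempfile

# Toy durability sketch: journal each completed step to disk. A real system
# would use a durable engine (Temporal, Conductor), not a JSON file.
JOURNAL = os.path.join(tempfile.gettempdir(), "saga_journal.json")
if os.path.exists(JOURNAL):
    os.remove(JOURNAL)  # start clean for this demo

def record_step(step_name, undo_args):
    log = []
    if os.path.exists(JOURNAL):
        with open(JOURNAL) as f:
            log = json.load(f)
    log.append({"step": step_name, "undo_args": undo_args})
    with open(JOURNAL, "w") as f:
        json.dump(log, f)

def recover_rollback_plan():
    """What a restarted process would compensate, newest step first."""
    if not os.path.exists(JOURNAL):
        return []
    with open(JOURNAL) as f:
        return list(reversed(json.load(f)))

record_step("hrms.add_dependent", {"dep_id": "dep-1"})
record_step("insurer.endorse_policy", {"endorsement_id": "end-9"})

# Simulate a crash: a brand-new process reads the journal and knows
# exactly which compensations still need to run, in LIFO order.
print(recover_rollback_plan())
```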

If you are building an "Enterprise Agent Platform," you shouldn't be writing while loops in Python. You should be defining workflows in a durable engine.

But if you are just starting? The stack-based Saga pattern above is 100x better than the "fire and forget" code currently powering most demos.


Models are probabilistic. Systems must be deterministic.

When you move from "Chat with PDF" to "Agent that does things," you leave the read-only world and enter the read-write world. In the read-write world, you are responsible for the mess you make.

Implement Sagas. Give your agents an Undo button. Your support team (and your finance team) will thank you.



Tags: #engineering #distributed-systems #ai #agentic #system-design #architecture #patterns