Balaji Srinivasan

Why Your AI Agents Need Circuit Breakers Before They Burn You


When you hear “circuit breaker,” you probably think of electrical panels. In software, the idea is the same: stop sending traffic to something that’s already failing, before it takes everything else down.

It’s a battle-tested pattern in traditional systems like e-commerce and payments. Now, in the AI era, it’s quickly becoming the difference between an AI demo and an AI system that actually survives in production.


Circuit Breakers in Traditional Systems

Let’s start with the basics. A circuit breaker in software sits between your service and a dependency (like a payment gateway). It tracks failures and decides whether to allow new requests, block them immediately, or probe occasionally to see if the dependency has recovered.

Classic E-Commerce Example

Imagine a checkout flow:

  1. Customer fills a cart.
  2. Checkout service calls Payment Gateway API.
  3. If payment succeeds, create order + ship item.

Now imagine the payment gateway goes down for 2 minutes during peak sales:

Key point: Circuit breakers contain the blast radius. They don’t fix the outage — they keep your system from collapsing with it.
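To make that containment concrete, here is a minimal sketch of a checkout loop talking to a gateway that is down. Everything in it is an assumption for illustration: `CheckoutDemo`, `SimpleBreaker`, `DownGateway`, the threshold of 3, and the "cart saved" fallback message are hypothetical, and the breaker deliberately has no recovery timer.

```java
import java.util.function.Supplier;

// Hypothetical checkout sketch: names, threshold, and fallback text are illustrative only.
class CheckoutDemo {

    /** Tiny breaker: opens after N consecutive failures (no recovery timer in this sketch). */
    static class SimpleBreaker {
        private final int threshold;
        private int consecutiveFailures = 0;

        SimpleBreaker(int threshold) {
            this.threshold = threshold;
        }

        boolean isOpen() {
            return consecutiveFailures >= threshold;
        }

        <T> T call(Supplier<T> action, Supplier<T> fallback) {
            if (isOpen()) {
                return fallback.get(); // fail fast: the dependency is not touched
            }
            try {
                T result = action.get();
                consecutiveFailures = 0; // a success clears the failure streak
                return result;
            } catch (RuntimeException ex) {
                consecutiveFailures++;
                return fallback.get();
            }
        }
    }

    /** Stand-in for a payment gateway that is down for the whole run. */
    static class DownGateway {
        int callsReceived = 0;

        String charge(String orderId) {
            callsReceived++;
            throw new RuntimeException("gateway timeout for " + orderId);
        }
    }

    /** Runs the given number of checkouts against the dead gateway; returns how many calls reached it. */
    static int runCheckouts(int attempts) {
        DownGateway gateway = new DownGateway();
        SimpleBreaker breaker = new SimpleBreaker(3);
        for (int i = 0; i < attempts; i++) {
            String orderId = "order-" + i;
            String outcome = breaker.call(
                    () -> gateway.charge(orderId),
                    () -> "payment unavailable, cart saved for " + orderId);
            System.out.println(outcome);
        }
        return gateway.callsReceived;
    }

    public static void main(String[] args) {
        System.out.println("Gateway calls: " + runCheckouts(10));
    }
}
```

With ten checkout attempts, only the first three reach the gateway. The remaining seven are answered locally by the fallback, which is the "blast radius" idea in miniature: the outage still hurts, but the checkout service stops piling load onto a dependency that cannot respond.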


Why This Matters More in AI Systems

In AI workflows, circuit breakers aren’t optional. They’re survival gear.

Here’s why:


AI Example: Multi-Agent Workflow Without Circuit Breakers

Picture a customer support AI built with multiple agents:

  1. Classifier Agent — decides if the issue is billing, tech support, or legal.
  2. Retriever Agent — queries a vector DB for relevant documents.
  3. Action Agent — updates account or creates a support ticket.

Now imagine the retriever (vector DB) is flaky:
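One way to keep a flaky retriever from dragging the whole pipeline down is sketched here. It is an illustration under stated assumptions, not a reference design: the class `SupportPipelineDemo`, the keyword-based classifier, the threshold of 2, and the "escalate to a human when no documents are available" policy are all hypothetical. The three agent roles come from the list above.

```java
import java.util.List;

// Hypothetical support pipeline: agent roles follow the article, all logic is assumed.
class SupportPipelineDemo {

    private static final int FAILURE_THRESHOLD = 2; // assumed value
    private int retrieverFailures = 0;
    int retrieverCalls = 0; // how many requests actually reached the vector DB

    /** Classifier Agent: a keyword check standing in for a model call. */
    String classify(String ticket) {
        return ticket.toLowerCase().contains("invoice") ? "billing" : "tech support";
    }

    /** Retriever Agent: simulates a vector DB that times out on every query. */
    List<String> retrieve(String category) {
        retrieverCalls++;
        throw new IllegalStateException("vector DB timeout");
    }

    /** Retriever behind a breaker: after repeated failures, skip the vector DB entirely. */
    List<String> retrieveGuarded(String category) {
        if (retrieverFailures >= FAILURE_THRESHOLD) {
            return List.of(); // circuit open: degrade instead of waiting on a dead dependency
        }
        try {
            List<String> docs = retrieve(category);
            retrieverFailures = 0;
            return docs;
        } catch (RuntimeException ex) {
            retrieverFailures++;
            return List.of();
        }
    }

    /** Action Agent: refuses to act without context and hands off to a human instead. */
    String act(String category, List<String> docs) {
        if (docs.isEmpty()) {
            return "escalated to human (" + category + ")";
        }
        return "auto-resolved (" + category + ") using " + docs.size() + " docs";
    }

    String handle(String ticket) {
        String category = classify(ticket);
        List<String> docs = retrieveGuarded(category);
        return act(category, docs);
    }

    public static void main(String[] args) {
        SupportPipelineDemo pipeline = new SupportPipelineDemo();
        for (int i = 0; i < 5; i++) {
            System.out.println(pipeline.handle("My invoice is wrong"));
        }
        System.out.println("Retriever calls: " + pipeline.retrieverCalls);
    }
}
```

The design choice worth noticing is in `act`: when the retriever is unavailable, the downstream agent takes an explicit degraded path rather than acting on empty context. Only the first two tickets ever touch the vector DB; the rest are escalated immediately.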


The Circuit Breaker Pattern Refresher

Like its electrical namesake, a software circuit breaker trips when failures pile up. Unlike a fuse, it can reset itself. It has three states:

  1. Closed (normal) — Requests flow through. Failures are counted.
  2. Open (tripped) — Once the failure threshold is reached, requests are blocked immediately.
  3. Half-Open (test mode) — After cooldown, a few requests are allowed through to check recovery.

Java Example

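import java.time.Duration;
import java.time.Instant;
import java.util.function.Supplier;

/**
 * Minimal circuit breaker that trips after consecutive failures.
 * Note: call() holds the lock while the dependency runs, so calls are serialized.
 * That keeps the example simple and means only one probe runs at a time.
 */
public class CircuitBreaker<T> {

    // Public so callers of getState() can name the type.
    public enum State { CLOSED, OPEN, HALF_OPEN }

    private final int failureThreshold;     // consecutive failures before tripping
    private final Duration recoveryTimeout; // how long to stay OPEN before probing

    private int failureCount = 0;
    private State state = State.CLOSED;
    private Instant lastFailureTime;

    public CircuitBreaker(int failureThreshold, Duration recoveryTimeout) {
        this.failureThreshold = failureThreshold;
        this.recoveryTimeout = recoveryTimeout;
    }

    public synchronized T call(Supplier<T> supplier) {
        if (state == State.OPEN) {
            if (Duration.between(lastFailureTime, Instant.now()).compareTo(recoveryTimeout) < 0) {
                // Still cooling down: reject without touching the dependency.
                throw new RuntimeException("Circuit is OPEN: failing fast");
            }
            // Cooldown elapsed: let this call through as a probe.
            state = State.HALF_OPEN;
        }

        try {
            T result = supplier.get();
            reset(); // any success closes the circuit
            return result;
        } catch (RuntimeException ex) {
            recordFailure();
            throw new RuntimeException("CircuitBreaker call failed", ex);
        }
    }

    private void reset() {
        failureCount = 0;
        state = State.CLOSED;
    }

    private void recordFailure() {
        failureCount++;
        lastFailureTime = Instant.now();
        // A failed probe re-opens immediately; otherwise trip at the threshold.
        if (state == State.HALF_OPEN || failureCount >= failureThreshold) {
            state = State.OPEN;
        }
    }

    // Synchronized so readers on other threads see the latest state.
    public synchronized State getState() {
        return state;
    }
}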

Usage

import java.time.Duration;

public class Example {
    public static void main(String[] args) {
        // Trip after 3 consecutive failures; stay open for 10 seconds.
        CircuitBreaker<String> breaker =
                new CircuitBreaker<>(3, Duration.ofSeconds(10));

        for (int i = 0; i < 5; i++) {
            try {
                String result = breaker.call(() -> unreliableService());
                System.out.println("Got: " + result);
            } catch (Exception e) {
                // Covers both a failed call and a fast-fail while the circuit is OPEN.
                System.out.println("Call failed: " + e.getMessage() +
                                   " | State: " + breaker.getState());
            }
        }
    }

    // Simulated dependency that fails roughly 70% of the time.
    private static String unreliableService() {
        if (Math.random() < 0.7) { // 70% failure
            throw new RuntimeException("Service error");
        }
        return "Success";
    }
}
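Because the usage example relies on `Math.random()`, the state transitions it prints differ from run to run. The following sketch makes the Closed, Open, Half-Open cycle deterministic by injecting a `java.time.Clock`. It is a separate, simplified variant rather than the class above: `BreakerClockDemo`, `FakeClock`, `ClockedBreaker`, and the `onSuccess`/`onFailure` reporting style are assumptions made for illustration.

```java
import java.time.Clock;
import java.time.Duration;
import java.time.Instant;
import java.time.ZoneId;
import java.time.ZoneOffset;
import java.util.ArrayList;
import java.util.List;

// Hypothetical sketch: a breaker variant with an injectable clock, for deterministic demos.
class BreakerClockDemo {

    /** Manually advanced clock so the demo never has to sleep. */
    static class FakeClock extends Clock {
        private Instant now = Instant.EPOCH;

        void advance(Duration d) { now = now.plus(d); }

        @Override public ZoneId getZone() { return ZoneOffset.UTC; }
        @Override public Clock withZone(ZoneId zone) { return this; }
        @Override public Instant instant() { return now; }
    }

    enum State { CLOSED, OPEN, HALF_OPEN }

    static class ClockedBreaker {
        private final int threshold;
        private final Duration cooldown;
        private final Clock clock;
        private int failures = 0;
        private State state = State.CLOSED;
        private Instant openedAt;

        ClockedBreaker(int threshold, Duration cooldown, Clock clock) {
            this.threshold = threshold;
            this.cooldown = cooldown;
            this.clock = clock;
        }

        /** Returns the current state, moving OPEN to HALF_OPEN once the cooldown has elapsed. */
        synchronized State state() {
            if (state == State.OPEN && !clock.instant().isBefore(openedAt.plus(cooldown))) {
                state = State.HALF_OPEN;
            }
            return state;
        }

        synchronized void onSuccess() {
            failures = 0;
            state = State.CLOSED;
        }

        synchronized void onFailure() {
            failures++;
            // A failed probe re-opens at once; otherwise open when the threshold is reached.
            if (state() == State.HALF_OPEN || failures >= threshold) {
                state = State.OPEN;
                openedAt = clock.instant();
            }
        }
    }

    /** Walks one full cycle and records the state after each step. */
    static List<String> walkthrough() {
        FakeClock clock = new FakeClock();
        ClockedBreaker breaker = new ClockedBreaker(3, Duration.ofSeconds(10), clock);
        List<String> trace = new ArrayList<>();

        for (int i = 0; i < 3; i++) {
            breaker.onFailure();
        }
        trace.add(breaker.state().name()); // threshold reached

        clock.advance(Duration.ofSeconds(10));
        trace.add(breaker.state().name()); // cooldown over, probing allowed

        breaker.onSuccess();
        trace.add(breaker.state().name()); // probe succeeded
        return trace;
    }

    public static void main(String[] args) {
        System.out.println(walkthrough()); // prints [OPEN, HALF_OPEN, CLOSED]
    }
}
```

Injecting the clock is a testing convenience: the same trick lets a unit test verify cooldown behavior in milliseconds instead of waiting out a real timeout.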

Design Tips for AI Systems


Business Translation

Circuit breakers aren’t just technical hygiene. They directly affect:

Without circuit breakers, your AI system isn’t resilient. It’s brittle. And brittle systems don’t survive in production.



Tags: #engineering #distributed-systems #ai #agentic #system-design #architecture