When you hear “circuit breaker,” you probably think of electrical panels. In software, the idea is the same: stop sending traffic to something that’s already failing, before it takes everything else down.
It’s a battle-tested pattern in traditional systems like e-commerce and payments. Now, in the AI era, it’s quickly becoming the difference between an AI demo and an AI system that actually survives in production.
🔗Circuit Breakers in Traditional Systems
Let’s start with the basics. A circuit breaker in software sits between your service and a dependency (like a payment gateway). It tracks failures and decides whether to allow new requests, block them immediately, or probe occasionally to see if the dependency has recovered.
🔗Classic E-Commerce Example
Imagine a checkout flow:
- Customer fills a cart.
- Checkout service calls Payment Gateway API.
- If payment succeeds, create order + ship item.
Now imagine the payment gateway goes down for 2 minutes during peak sales:
- Without a circuit breaker
  - Every checkout request still tries to hit the gateway.
  - Retries pile up.
  - Threads and DB connections clog waiting for timeouts.
  - Soon, your whole checkout service dies even though only the payment gateway was failing.
- With a circuit breaker
  - After a few failures, the breaker opens.
  - New requests fail fast with a controlled error (“Payment temporarily unavailable”).
  - System resources stay free, order DB still works, carts are preserved.
  - After a cooldown, the breaker lets a few test calls through. If they succeed, traffic resumes.
Key point: Circuit breakers contain the blast radius. They don’t fix the outage — they keep your system from collapsing with it.
🔗Why This Matters More in AI Systems
In AI workflows, circuit breakers aren’t optional. They’re survival gear.
Here’s why:
- Multi-step orchestration — AI agents call multiple tools in sequence. One flaky tool derails the entire chain.
- Expensive retries — Each retry isn’t cheap; it burns API quota, GPU cycles, or third-party credits.
- Amplified failures — Multiple agents can hammer the same broken dependency at once.
- Cascading effects — One failing tool can stall queues, cause duplicate actions, and confuse downstream agents.
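To put rough numbers on the retry and amplification points, here is a back-of-the-envelope sketch. The class and method names are made up for illustration, and the inputs (200 requests, 3 retries each, a breaker threshold of 5) are illustrative values, not measurements. It assumes every attempt fails, and the breaker figure is a simplified bound that ignores calls already in flight and later recovery probes.

```java
class RetryAmplification {
    /** Calls that reach a failing dependency when nothing stops the retries. */
    static long callsWithoutBreaker(long requests, int retriesPerRequest) {
        // Each request makes one first attempt plus its retries.
        return requests * (1L + retriesPerRequest);
    }

    /**
     * Simplified bound with a breaker: once failureThreshold failures have been
     * counted, later calls fail fast and never reach the dependency.
     */
    static long callsWithBreaker(long requests, int retriesPerRequest, int failureThreshold) {
        return Math.min(callsWithoutBreaker(requests, retriesPerRequest), failureThreshold);
    }

    public static void main(String[] args) {
        long requests = 200;   // illustrative
        int retries = 3;       // illustrative
        int threshold = 5;     // illustrative
        System.out.println("Without breaker: " + callsWithoutBreaker(requests, retries));
        System.out.println("With breaker:    " + callsWithBreaker(requests, retries, threshold));
    }
}
```

With those inputs, 800 calls reach the dependency without a breaker (200 first attempts plus 600 retries), versus about 5 with one. Multiply either figure by your per-call price to see the cost difference.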
🔗AI Example: Multi-Agent Workflow Without Circuit Breakers
Picture a customer support AI built with multiple agents:
- Classifier Agent — decides if the issue is billing, tech support, or legal.
- Retriever Agent — queries a vector DB for relevant documents.
- Action Agent — updates account or creates a support ticket.
Now imagine the retriever (vector DB) is flaky:
- Without a breaker
  - The orchestrator calls the retriever after classification. It times out.
  - The orchestrator retries 3 times per request.
  - With 200 concurrent tickets, that’s 600 wasted retries on top of the 200 original calls.
  - Action Agent never runs; customer gets silence.
  - Costs spike, and the system grinds to a halt.
- With a breaker
  - After N failures, retriever calls fail fast.
  - Orchestrator degrades gracefully:
    - Skip retrieval and fall back to canned responses.
    - Or queue the ticket for human handoff.
  - Action Agent still runs when possible (e.g., creates a support ticket).
  - Customers get something instead of silence.
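The degraded path described above can be sketched as follows. This is a minimal illustration, not a real framework API: `SupportOrchestrator`, `guardedRetriever`, `CANNED_REPLY`, and the `needsHuman` flag are all hypothetical names. It assumes the retriever has already been wrapped in a circuit breaker that throws a `RuntimeException` when the circuit is open or the call fails.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.function.Function;

class SupportOrchestrator {
    static final String CANNED_REPLY = "We received your request and will follow up shortly.";

    // Hypothetical: a retriever call already guarded by a circuit breaker.
    private final Function<String, List<String>> guardedRetriever;
    // Tickets waiting for a person. Not thread-safe; fine for a single-threaded sketch.
    private final List<String> humanQueue = new ArrayList<>();

    SupportOrchestrator(Function<String, List<String>> guardedRetriever) {
        this.guardedRetriever = guardedRetriever;
    }

    /** Handles one ticket, degrading instead of retrying when retrieval is unavailable. */
    String handle(String ticket, boolean needsHuman) {
        try {
            List<String> docs = guardedRetriever.apply(ticket);
            return "Answer based on " + docs.size() + " document(s)";
        } catch (RuntimeException e) {
            // Breaker open or call failed: do not retry here, pick a fallback.
            if (needsHuman) {
                humanQueue.add(ticket);
                return "Queued for a human agent";
            }
            return CANNED_REPLY;
        }
    }

    List<String> humanQueue() {
        return humanQueue;
    }
}
```

The orchestrator never retries on its own. Deciding when to try the retriever again is left to the breaker, so the fallback stays cheap no matter how many tickets arrive during the outage.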
🔗The Circuit Breaker Pattern Refresher
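Like an electrical breaker (and unlike a fuse, which blows once and must be replaced), a software circuit breaker can trip and later reset. It has three states: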
- Closed (normal) — Requests flow through. Failures are counted.
- Open (tripped) — After threshold exceeded, requests are blocked immediately.
- Half-Open (test mode) — After cooldown, a few requests are allowed through to check recovery.
🔗Java Example
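    import java.time.Duration;
    import java.time.Instant;
    import java.util.function.Supplier;

    /**
     * Minimal circuit breaker that counts consecutive failures.
     * call() is synchronized, so guarded calls run one at a time; that keeps the
     * example simple but would limit throughput in a real service.
     */
    public class CircuitBreaker<T> {
        // Public so callers can name the type returned by getState().
        public enum State { CLOSED, OPEN, HALF_OPEN }

        private final int failureThreshold;
        private final Duration recoveryTimeout;
        private int failureCount = 0;
        private State state = State.CLOSED;
        private Instant lastFailureTime;

        public CircuitBreaker(int failureThreshold, Duration recoveryTimeout) {
            this.failureThreshold = failureThreshold;
            this.recoveryTimeout = recoveryTimeout;
        }

        public synchronized T call(Supplier<T> supplier) {
            if (state == State.OPEN) {
                Duration sinceFailure = Duration.between(lastFailureTime, Instant.now());
                if (sinceFailure.compareTo(recoveryTimeout) < 0) {
                    // Still cooling down: reject without touching the dependency.
                    throw new RuntimeException("Circuit is OPEN: failing fast");
                }
                // Cooldown elapsed: let this call through as a probe.
                state = State.HALF_OPEN;
            }
            try {
                T result = supplier.get();
                reset(); // any success closes the circuit
                return result;
            } catch (Exception ex) {
                recordFailure();
                throw new RuntimeException("CircuitBreaker call failed", ex);
            }
        }

        private void reset() {
            failureCount = 0;
            state = State.CLOSED;
        }

        private void recordFailure() {
            failureCount++;
            lastFailureTime = Instant.now();
            // A failed probe reopens at once; otherwise open when the threshold is reached.
            if (state == State.HALF_OPEN || failureCount >= failureThreshold) {
                state = State.OPEN;
            }
        }

        // Synchronized so readers on other threads see the latest state.
        public synchronized State getState() {
            return state;
        }
    }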
Usage
    import java.time.Duration;

    public class Example {
        public static void main(String[] args) {
            // Open after 3 consecutive failures; allow a probe after 10 seconds.
            CircuitBreaker<String> breaker =
                new CircuitBreaker<>(3, Duration.ofSeconds(10));

            for (int i = 0; i < 5; i++) {
                try {
                    String result = breaker.call(() -> unreliableService());
                    System.out.println("Got: " + result);
                } catch (Exception e) {
                    System.out.println("Call failed: " + e.getMessage() +
                        " | State: " + breaker.getState());
                }
            }
        }

        // Simulated dependency that fails about 70% of the time.
        private static String unreliableService() {
            if (Math.random() < 0.7) {
                throw new RuntimeException("Service error");
            }
            return "Success";
        }
    }
🔗Design Tips for AI Systems
- Wrap every external call — vector DB, payment processor, LLM API, knowledge service.
- Set realistic thresholds — AI services naturally time out more often; tune failure counts accordingly.
- Provide graceful fallbacks — don’t just block; degrade intelligently.
- Instrument for observability — track when breakers trip, why, and recovery success rates.
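For the observability tip, one possible shape is a small recorder that the breaker calls on each state change. This is only a sketch: `BreakerMetrics` and its methods are hypothetical names, the reason labels are whatever strings you choose, and a real system would more likely export these counters through its existing metrics library.

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.atomic.AtomicLong;

class BreakerMetrics {
    private final Map<String, AtomicLong> tripsByReason = new ConcurrentHashMap<>();
    private final AtomicLong probes = new AtomicLong();
    private final AtomicLong successfulProbes = new AtomicLong();

    /** Call when the breaker moves to OPEN; reason is a short label such as "timeout". */
    void recordTrip(String reason) {
        tripsByReason.computeIfAbsent(reason, r -> new AtomicLong()).incrementAndGet();
    }

    /** Call after each HALF_OPEN probe with its outcome. */
    void recordProbe(boolean succeeded) {
        probes.incrementAndGet();
        if (succeeded) {
            successfulProbes.incrementAndGet();
        }
    }

    /** Number of trips recorded for a reason; 0 if that reason was never seen. */
    long trips(String reason) {
        AtomicLong count = tripsByReason.get(reason);
        return count == null ? 0 : count.get();
    }

    /** Fraction of probes that succeeded; 0.0 when no probe has run yet. */
    double recoverySuccessRate() {
        long total = probes.get();
        return total == 0 ? 0.0 : (double) successfulProbes.get() / total;
    }
}
```

A breaker would call `recordTrip` at the point where it switches to OPEN and `recordProbe` after each test call. A low recovery success rate suggests the cooldown is too short for that dependency.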
🔗Business Translation
Circuit breakers aren’t just technical hygiene. They directly affect:
- User trust — “System temporarily unavailable” beats infinite loading spinners.
- Cost — Fail fast prevents runaway retries against paid APIs.
- Stability — Prevents one dependency from taking your entire AI stack down.
Without circuit breakers, your AI system isn’t resilient. It’s brittle. And brittle systems don’t survive in production.