Balaji Srinivasan

Bulkheads: Don't Let One Agent Sink the Ship

5 minutes (1338 words)

We build AI platforms as if all agents are created equal. We deploy a single FastAPI service, throw it behind a load balancer, and let it handle everything from high-speed query resolution to massive document ingestion.

This works until it doesn't.

One Tuesday morning, a few users kick off a heavy operation. Suddenly, your simple, fast agents stop responding. Your health checks fail. The entire platform enters a death spiral.

You have just been hit by the Noisy Neighbor problem.

In traditional distributed systems, we solved this with the Bulkhead Pattern (named after the watertight compartments in a ship).

Bulkhead

The maritime analogy is worth understanding in depth. Before the widespread adoption of bulkheads, a single hull breach could flood the entire lower deck of a ship, sinking it within minutes. Naval architects solved this by dividing the hull into isolated, watertight compartments separated by reinforced steel walls. If a torpedo strikes Section 3, water floods Section 3 and only Section 3. The watertight doors slam shut. The engine room (Section 7) stays dry. The ship limps home instead of sinking to the ocean floor.

The Titanic famously had bulkheads, but they didn't extend high enough. The water spilled over the top of one compartment into the next, a tragic lesson in incomplete isolation. In software, we repeat this mistake constantly: we isolate thread pools but share a single database connection pool. We isolate containers but run them on the same node with shared memory limits.

In Agentic Systems, we need to be even more aggressive. We need to isolate not just by service, but by Agent Personality.


The Lesson from Web 2.0: The "Export to CSV" Button

To understand the fix, we must look at the history of web engineering.

Every senior engineer remembers the day they took down production with a simple report. In 2012, you built a dashboard. It was fast. Then, a user requested an "Export All Data to CSV" feature.

A developer implemented it naively: query the database, loop through 100,000 rows, build a string in memory, and stream it back.

The next day, an Operations Manager clicked that button. The server's CPU spiked to 100%. The Garbage Collector went into a frenzy trying to hold the massive CSV in RAM. And suddenly, the Login Page stopped working.

A background reporting task had starved the critical transactional flow.

We fixed this with architectural separation:

  1. Phase 1 (Queues): We moved "Reporting" to a background queue (Celery/Sidekiq). The user clicked "Export", a job was pushed to Redis, and a separate worker picked it up.
  2. Phase 2 (Evented / Kafka): As we scaled, we decoupled it further. The "Export" button simply emitted a ReportRequested event to a Kafka Topic. The Reporting Service (a completely separate microservice with its own database replica) consumed that event.

If the Reporting Service crashed or lagged by 5 hours, the Main App didn't care. It just kept emitting events. The decoupling was absolute.
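
In code, that Phase 2 shape looks roughly like the sketch below, assuming kafka-python, an illustrative topic name, and an illustrative event payload (all names here are assumptions, not from the original system):

# A minimal sketch of the Phase 2 decoupling (assumed names throughout).
# The web app only emits the event and returns; it never waits on reporting.
import json

from kafka import KafkaProducer  # kafka-python

producer = KafkaProducer(
    bootstrap_servers="localhost:9092",
    value_serializer=lambda event: json.dumps(event).encode("utf-8"),
)

def handle_export_click(user_id: str) -> None:
    # Fire-and-forget: this handler stays fast even if the Reporting Service
    # is down or lagging by hours.
    producer.send("report-requests", {"type": "ReportRequested", "user_id": user_id})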

Agents are the new "Export to CSV" button. Except now, instead of generating a CSV, they are loading 5GB models into VRAM and holding database connections open for 3 minutes while "reasoning."


The Specific Failure Mode: "PolicyBot" vs "IngestBot"

To understand why this happens, let's stick to our Insurance domain from the previous posts. Imagine you have two agents sharing the same platform:

  1. PolicyBot (The Fast Lane)

    • Job: Answer quick queries like "Is cataract surgery covered?" or "What is the status of Claim #9001?"
    • Resource Cost: 1 DB Query + 1 LLM Call.
    • Expected Latency: 500ms - 2 seconds.
    • Traffic Priority: CRITICAL (This is user-facing chat).
  2. IngestBot (The Heavy Lane)

    • Job: Read a 50-page "Hospital Discharge Summary" PDF, extract line items, and map them to tariff codes.
    • Resource Cost: Heavy OCR + Long-Context LLM Analysis + 50+ DB Inserts.
    • Expected Latency: 45 seconds - 3 minutes.
    • Traffic Priority: LOW (This is a background task).
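
For the code samples that follow, assume two synchronous entrypoints along these lines; the bodies below are illustrative stubs, not the real agent implementations:

# Illustrative stubs for the two agent personalities.
import time

def run_fast_agent(request) -> dict:
    # 1 DB query + 1 LLM call; typically well under 2 seconds.
    return {"answer": "Claim #9001 is approved and payable."}

def run_heavy_ocr_agent(request) -> dict:
    # Heavy OCR + long-context LLM analysis + 50+ inserts; can grind for minutes.
    time.sleep(120)  # stand-in for the real pipeline
    return {"line_items_mapped": 50}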

๐Ÿ”—The "Shared Pool" Trap

The naive architecture dumps both agents into the same application container with a shared thread pool.

# The "Naive" Implementation
# A single thread pool shares resources for EVERYONE
from concurrent.futures import ThreadPoolExecutor

# We have 50 works for the WHOLE container
executor = ThreadPoolExecutor(max_workers=50)

@app.post("/chat/policy-status")
async def policy_chat(request):
    # Low latency expectation
    return await loop.run_in_executor(executor, run_fast_agent, request)

@app.post("/claims/ingest-document")
async def ingest_document(request):
    # High computation cost
    return await loop.run_in_executor(executor, run_heavy_ocr_agent, request)

Here is exactly what happens on Monday morning:

  1. At 9:00 AM, the Operations team dumps a batch of 60 Discharge Summaries for processing.
  2. The /claims/ingest-document endpoint is hit 60 times.
  3. The executor (size 50) immediately fills up with 50 Heavy tasks.
  4. The remaining 10 Heavy tasks go into the queue.
  5. At 9:01 AM, the CEO opens the app and asks: "Is my policy active?"
  6. The request hits /chat/policy-status.
  7. The application tries to schedule run_fast_agent on the executor.
  8. There are no threads left. The 50 Heavy agents are grinding through OCR. They won't finish for 2 minutes.
  9. The CEO's request sits in the pending queue behind 10 other Heavy tasks.
  10. The request times out. The CEO is unhappy.

You have allowed a background batch process to starve your real-time user traffic.


The Solution: Application-Level Bulkheads

The first line of defense is Resource Isolation. You must explicitly partition your execution resources based on the workload profile.

If you have 50 worker threads, you don't give them to "whoever asks first." You allocate them strictly:

  1. Fast Lane: 40 threads, reserved exclusively for PolicyBot traffic.
  2. Heavy Lane: 10 threads, a hard cap for IngestBot no matter how deep its backlog grows.

If the Heavy Lane fills up, those specific requests get rejected (HTTP 429) or queued. The Fast Lane remains wide open.

# The "Bulkheaded" Implementation

# 1. Create ISOLATED pools
fast_executor = ThreadPoolExecutor(max_workers=40, thread_name_prefix="fast_lane")
heavy_executor = ThreadPoolExecutor(max_workers=10, thread_name_prefix="heavy_lane")

@app.post("/chat/policy-status")
async def policy_chat(request):
    # This pool is protected. 
    # Even if 1000 documents are being ingested, these 40 threads are empty.
    return await loop.run_in_executor(fast_executor, run_fast_agent, request)

@app.post("/claims/ingest-document")
async def ingest_document(request):
    try:
        # We explicitly CAP the heavy load.
        # If 10 documents are processing, we REJECT the 11th.
        # We protect the system stability over the batch throughput.
        return await loop.run_in_executor(heavy_executor, run_heavy_ocr_agent, request)
    except BlockingIOError:
        # Fail fast. Let the client retry later.
        raise HTTPException(status_code=429, detail="Ingestion queue full")

This is the software equivalent of a watertight door. The "Ingest" section of the ship can flood (fail/queue), but the "Chat" section stays dry.
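
A quick way to watch the watertight door hold is to flood the heavy endpoint and then time a chat request, sketched here with httpx; the base URL, payloads, and expected timings are assumptions for illustration:

# Flood the heavy lane, then time a fast-lane request. With the bulkhead in
# place, the chat call should still answer in well under a second.
import asyncio
import time

import httpx

async def main() -> None:
    async with httpx.AsyncClient(base_url="http://localhost:8000", timeout=300) as client:
        # 60 concurrent ingest requests: ~10 run, the rest are rejected with 429.
        ingest_tasks = [
            asyncio.create_task(client.post("/claims/ingest-document", json={"doc": i}))
            for i in range(60)
        ]
        start = time.perf_counter()
        reply = await client.post("/chat/policy-status", json={"q": "Is my policy active?"})
        print(f"chat -> {reply.status_code} in {time.perf_counter() - start:.2f}s")
        await asyncio.gather(*ingest_tasks, return_exceptions=True)

asyncio.run(main())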


The Advanced Solution: Infrastructure Bulkheads

For high-scale systems, thread isolation isn't enough. A heavy Ingest process might not just consume threads; it might also consume RAM (loading vector indices) or GPU VRAM (local models).

If IngestBot triggers an OOM (Out of Memory) crash, it kills the container, taking PolicyBot down with it.

At this stage, you must push the bulkhead down to the Infrastructure layer (Kubernetes).

Node Pools & Taints

Don't let heavy agents schedule on the same nodes as light agents.

  1. System Node Pool: Run FAST pods here (PolicyBot).
  2. Worker Node Pool: Run HEAVY pods here (IngestBot). Use a Taint (workload=heavy:NoSchedule) to forbid the Chat pods from ever drifting there.
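
Expressed as the pod-spec fields you would set on the IngestBot deployment, the arrangement looks roughly like this (the label and taint names are illustrative):

# Scheduling rules for the heavy pods (illustrative names). The toleration
# lets IngestBot onto the tainted worker pool; the nodeSelector pins it there.
ingest_scheduling = {
    "nodeSelector": {"workload": "heavy"},
    "tolerations": [
        {
            "key": "workload",
            "operator": "Equal",
            "value": "heavy",
            "effect": "NoSchedule",
        }
    ],
}
# PolicyBot's deployment carries no such toleration, so the taint keeps it off
# the heavy nodes entirely.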

Database Connection Pools

A "Report" query might lock a table row for 30 seconds. A "Chat" query takes 5ms. Never share the same SQLAlchemy connection pool.


In Agentic Systems, Isolation is the only guarantee.

You cannot "optimize" away the risk of a 5-minute RAG task starving a 200ms chat task. You have to physically or logically separate them.

If you don't build bulkheads, your least important user (the one running a batch job) will inevitably take down your most important user (the customer trying to get help).

Prioritize by partitioning.



Tags: #engineering #distributed-systems #ai #agentic #system-design #architecture #patterns