We are witnessing a strange regression in software architecture.
After a decade of decomposing monoliths into stateless microservices to survive the chaos of cloud infrastructure, we learned that servers are cattle, not pets. We accepted that any process can be killed at any time by the Kubernetes scheduler.
Yet, when building "Autonomous Agents," senior engineers seem to be reverting to patterns we abandoned fifteen years ago. We are building stateful, long-running Python scripts that assume 100% uptime. We write time.sleep(3600) inside a container that gets patched and rotated every week, holding critical business state, such as an employee's onboarding progress, in RAM.
If your agent represents a business process that lasts longer than a few minutes, it is no longer a script. It is a workflow, and it demands Durable Execution.
The "DIY State Machine" Trap
Consider the standard enterprise use case of Employee Onboarding. The requirement is deceptively simple: onboard a new hire named Balaji. That involves provisioning an email account (a two-second task), ordering a laptop (a five-minute task), and waiting for security clearance, which might take three to five days.
The "3-5 days" constraint is where the naive while loop breaks.
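To make the failure mode concrete, here is a minimal sketch of that naive loop. The helper names (`check_clearance`, the `progress` dict) are illustrative, not from any real system:

```python
import time

def onboard(employee_id, check_clearance, poll_seconds=3600):
    # Steps 1 and 2 run fine -- but the fact that they ran lives only in RAM.
    progress = {"email_created": True, "laptop_ordered": True}

    # Step 3: poll for 3-5 days. One pod rotation during this loop wipes
    # `progress`, and on restart the script re-runs everything from scratch.
    while not check_clearance(employee_id):
        time.sleep(poll_seconds)

    progress["clearance_granted"] = True
    return progress
```

Nothing here is wrong in the small; the problem is that the process holding `progress` is guaranteed to die before the loop exits.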
When engineers realize the while loop won't survive a restart, their instinct is often to build their own state machine backed by a database. They add a status column to the employees table (PENDING_LAPTOP, WAITING_CLEARANCE) and write a cron job that polls every minute:
```python
# The "DIY" approach (Don't do this)
def cron_job():
    users = db.query("SELECT * FROM employees WHERE status = 'WAITING_CLEARANCE'")
    for user in users:
        if check_clearance(user.id):
            user.status = 'COMPLETED'
            db.save(user)
            send_email(user)
```
This looks robust, but it is an architectural dead end.
You have just coupled your business logic to your database schema. Every new step in the agent's reasoning requires a new column or status enum. You now have to handle:
- Race Conditions: What if two cron jobs pick up the same user?
- Dual Writes: What if you update the DB but the email fails to send?
- Complex Retries: How do you implement exponential backoff for the API call within a cron job?
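Even the first of these problems drags in real machinery. A minimal sketch of an atomic row claim, using SQLite here purely for illustration (the `employees` schema and `claimed_by` column are hypothetical), shows the ceremony required to stop two workers grabbing the same user, and it still does nothing about dual writes or backoff:

```python
import sqlite3

def claim_user(conn, worker_id):
    """Atomically claim one WAITING_CLEARANCE row so a second cron worker
    cannot pick it up. This fixes only the race condition; dual writes and
    retry/backoff still need their own machinery on top."""
    cur = conn.execute(
        "UPDATE employees SET status = 'PROCESSING', claimed_by = ? "
        "WHERE id = (SELECT id FROM employees "
        "            WHERE status = 'WAITING_CLEARANCE' LIMIT 1)",
        (worker_id,),
    )
    conn.commit()
    return cur.rowcount  # 1 if we claimed a row, 0 if another worker beat us
```

Multiply this pattern by every status transition in the table and the "simple" cron job becomes a pile of hand-rolled concurrency control.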
You are accidentally trying to implement CQRS (Command Query Responsibility Segregation) and Event Sourcing from scratch, but poorly. You are building a bad workflow engine.
Do not build a bad workflow engine. Use a battle-tested one.
The Solution: Event Sourcing & Durable Execution
Durable Execution engines (like Temporal, LittleHorse, or Orkes Conductor) solve this by applying Event Sourcing for you, invisibly to your code.
They don't store "current status." They store a History of Events.
When your agent script runs, the engine intercepts your API calls. It doesn't just execute them; it writes an event to an append-only log: ActivityTaskScheduled, ActivityTaskStarted, ActivityTaskCompleted.
This is why durability matters. It allows the system to decouple the Application State (variables in memory) from the System State (events in the DB).
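The mechanics are easier to see in miniature. Below is a toy model of that interception, not any engine's real implementation: every activity call is bracketed by events appended to a history list before and after the side effect runs.

```python
# Toy model of durable interception: the engine records the intent, the
# start, and the result of every activity in an append-only event log.
def execute_activity(history, name, fn, *args):
    history.append(("ActivityTaskScheduled", name))
    history.append(("ActivityTaskStarted", name))
    result = fn(*args)  # the actual side effect
    history.append(("ActivityTaskCompleted", name, result))
    return result
```

The log, not the process's memory, is now the source of truth about how far the workflow has progressed.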
The architecturally sound version of our onboarding agent looks different. It defines a workflow where every activity is idempotent and retriable:
```python
# The "Architecturally Sound" Version (Temporal-style pseudocode)
from datetime import timedelta

from temporalio import workflow
from temporalio.common import RetryPolicy

@workflow.defn
class OnboardingWorkflow:
    def __init__(self):
        self.is_cleared = False

    @workflow.run
    async def run(self, employee_data):
        # Activity 1: Idempotent & Retriable
        await workflow.execute_activity(
            create_email,
            employee_data,
            start_to_close_timeout=timedelta(minutes=5),
            retry_policy=RetryPolicy(maximum_attempts=3),
        )

        # Activity 2
        await workflow.execute_activity(
            order_laptop,
            employee_data,
            start_to_close_timeout=timedelta(minutes=10),
        )

        # The Signal Pattern (Hybrid Event-Driven Architecture)
        # The workflow hibernates here. It consumes ZERO CPU/RAM.
        # It exists only as a row in the database.
        await workflow.wait_condition(lambda: self.is_cleared)

        # Resumes seamlessly days later
        await workflow.execute_activity(
            send_welcome_email,
            employee_data,
            start_to_close_timeout=timedelta(minutes=5),
        )

    @workflow.signal
    def update_clearance(self):
        self.is_cleared = True
```
Crucially, when the workflow reaches the wait_condition, it doesn't sleep. It hibernates. The engine persists the event history and evicts the workflow from the worker's memory entirely.
When the signal arrives four days later—or after a cluster restart—the worker wakes up. It doesn't load a "status" from a column. It replays the event history.
- Did we create the email? Yes, the history says so at Event 5. (Skip execution.)
- Did we order the laptop? Yes, Event 8. (Skip execution.)
- Are we waiting for clearance? Yes.
It reconstructs the entire memory state exactly as it was four days ago, purely from the event log. This is Event Sourcing applied to compute. It gives you the resilience of an event-driven architecture with the simplicity of imperative code.
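A stripped-down sketch of that replay, with hypothetical activity names and a plain list standing in for the durable event log, shows how re-running the same code from the top reconstructs state without repeating side effects:

```python
def replay_or_run(history, name, fn):
    # Replay: if this step already completed, its result comes from the log,
    # not from re-running the side effect.
    for step, result in history:
        if step == name:
            return result
    result = fn()  # first run: actually execute the side effect
    history.append((name, result))
    return result

def onboarding(history, signals):
    # Re-running this whole function after a crash rebuilds the same
    # in-memory state purely from the event log.
    email = replay_or_run(history, "create_email", lambda: "balaji@corp.example")
    laptop = replay_or_run(history, "order_laptop", lambda: "thinkpad-42")
    if not signals.get("cleared"):
        return None  # hibernate: nothing in RAM until the signal arrives
    welcome = replay_or_run(history, "send_welcome_email", lambda: "sent")
    return (email, laptop, welcome)
```

Running `onboarding` a second time after the clearance signal arrives completes the last step while executing the first two exactly zero additional times, which is the whole trick.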
The Strategic Shift for Agentic Systems
We need to stop thinking of Agents as "Chats that call tools" and start treating them as Process Owners.
A Support Agent isn't just generating text. It is orchestrating a distributed transaction that involves verifying a user, checking refund eligibility, waiting for a supervisor override, processing a refund, and updating a ticket. This is a Saga. This is a Workflow.
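The Saga shape is worth making explicit. A minimal sketch of a saga runner, with illustrative step names rather than any real refund API, pairs every forward action with a compensation and unwinds completed steps in reverse when one fails:

```python
def run_saga(steps):
    """Run (action, compensation) pairs in order; on failure, undo the
    completed steps in reverse -- the classic Saga pattern."""
    done = []
    try:
        for action, compensate in steps:
            action()
            done.append(compensate)
    except Exception:
        for compensate in reversed(done):
            compensate()
        return "rolled_back"
    return "committed"
```

A durable workflow engine gives you exactly this shape, except the `done` list survives crashes because it lives in the event history rather than in a local variable.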
If you build this with ephemeral scripts, you are restricting your agents to tasks that complete within a single HTTP request timeout. You are effectively crippling them. If you build this with Durable Execution, you give your agents an infinite attention span. They become reliable digital workers that can handle processes spanning minutes, days, or months, without operational fragility.
Reliability is not a feature; it is an architectural requirement. Stop writing while loops. Stop relying on memory for business state. If the process matters, put it in a durable workflow.