Your Workflow Engine Is Durable. Your Agent’s Decisions Aren’t.


Viren Baraiya

We built Conductor at Netflix because microservice workflows would not stay still.

Teams shipped daily. Workflow shapes changed constantly. A runtime that required the entire graph to be frozen before execution could not keep up. So Conductor took a different path: workflow structure could be influenced by data produced during execution. The engine could materialize new work, persist it as state, schedule it, retry it, and record it in the same execution history as everything authored up front.

Today, we call this property dynamic synthesis.

At the time, it was just a practical answer to a microservices problem. Now it looks like the missing primitive for agents.

Nearly ten years later, at Orkes — the company commercializing Conductor — we hear the same question from almost every customer running a serious AI initiative:

Can we run our agents on Conductor?

Yes.

But the more important question is why so many agent systems start to break when they meet the durability model of traditional workflow-as-code platforms. After watching teams try to run serious agents in production, we have reached one conclusion:

Agents do not just call tools. Agents synthesize control flow.

The model decides what to do next, which tools to call, which branches to create, which work to fan out, when to ask a human, what to retry, and what compensation should exist if an action has to be unwound later.

If that control flow is forced into pre-authored workflow code, the agent becomes narrower. If it is generated outside the workflow engine, the audit trail splits. If it is left in application logs, the system is not truly durable.

The runtime under an agent has to record the agent’s decisions as state — not merely record that a model call happened.

That is the gap.

What we keep seeing

A team picks a durable execution platform — usually a workflow-as-code one, because that is what their platform team operates. They build the agent. It works in development. They demo it. Leadership is happy.

Then the agent meets a security review, a compliance review, an audit, or a real production incident. The audit story is the one that breaks first.

When someone asks “what did the agent do, and why?”, the team reconstructs the answer from application logs, model gateway traces, tool logs, and workflow event history. Four systems. Four clocks. Four versions of the truth.

The workflow engine can tell you that a model call happened and what it returned. It cannot tell you what the agent actually planned — the structure it synthesized, the branches it introduced, the compensations it paired with actions, the validation gates it passed through. That work lives outside the durable state machine.

That is not an audit trail. That is forensics.

The teams that hit this wall do one of two things. They build a planning service next to the workflow engine — which doubles the surface area without making the agent’s decisions native to either runtime. Or they make the agent narrower, wrapping the model in an activity and branching on its return value from a pre-coded set of options. The second approach is fine when the branches are known ahead of time. It is not the same as letting the agent compose the execution graph.

The team ships something. It is not the agent they set out to build.

The replay contract is not a detail

Workflow-as-code platforms are built on a powerful contract: the workflow function is deterministic, and the runtime achieves durability by replaying that function over a persisted event history.

Side effects — clock reads, network calls, model calls, database writes — are pushed into activities. The first execution records the result. Later replays reuse the recorded result.
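A minimal sketch of that record-then-replay contract in Python. It is not any particular platform's API: `History`, `activity`, `call_model`, and `workflow` are hypothetical names, and an in-memory list stands in for the persisted event history.

```python
from typing import Any, Callable, List, Optional


class History:
    """Hypothetical event history: an ordered list of recorded activity results."""

    def __init__(self, events: Optional[List[Any]] = None) -> None:
        self.events: List[Any] = list(events or [])
        self.cursor = 0  # position of the next activity call during this run

    def activity(self, fn: Callable[[], Any]) -> Any:
        """Run a side effect once; on replay, return the recorded result instead."""
        if self.cursor < len(self.events):
            result = self.events[self.cursor]  # replay: reuse the recorded value
        else:
            result = fn()  # first execution: perform the side effect
            self.events.append(result)  # and record its result
        self.cursor += 1
        return result


calls = {"model": 0}


def call_model() -> str:
    """Stand-in for a non-deterministic side effect such as a model call."""
    calls["model"] += 1
    return "rollback"


def workflow(history: History) -> str:
    """Deterministic workflow function: every side effect goes through an activity."""
    decision = history.activity(call_model)
    return f"decision={decision}"


first = History()
out_first = workflow(first)

# Simulate recovery after a crash: replay the same function over the saved events.
replayed = History(first.events)
out_replayed = workflow(replayed)
```

The replayed run returns the same value without invoking `call_model` a second time, which is only safe because `workflow` itself is deterministic.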

This model has real strengths. It gives durable recovery from failures, a strong programming model, type-checked workflows, IDE ergonomics, and a clean way to reason about long-running code. For workflows whose shape is known ahead of time, it is excellent.

But the replay contract is not an implementation detail. It is the foundation of the durability model. And that foundation creates a hard boundary for agents.

To be clear: replay-based systems can call models from activities. They can persist the returned value. They can branch on that value. They can build useful agentic systems when the possible branches are known ahead of time.
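As a rough illustration of branching on a returned value, here is a Python sketch in which the model's output acts only as a selector over handlers written in advance. `BRANCHES` and `run_closed_world` are hypothetical names, not part of any workflow SDK.

```python
from typing import Callable, Dict

# Pre-authored branches: the only futures this workflow can take.
BRANCHES: Dict[str, Callable[[str], str]] = {
    "rollback": lambda service: f"rolled back {service}",
    "page_human": lambda service: f"paged on-call for {service}",
}


def run_closed_world(model_choice: str, service: str) -> str:
    """Branch on a model's returned value, limited to branches written in advance."""
    handler = BRANCHES.get(model_choice)
    if handler is None:
        # The model asked for something the deployed workflow code does not contain.
        raise ValueError(f"no pre-authored branch for {model_choice!r}")
    return handler(service)
```

Calling `run_closed_world("rollback", "checkout")` works; any choice outside `BRANCHES` is rejected, because the set of possible next steps was fixed when the code was deployed.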

The limitation appears when the model is not merely selecting among pre-authored branches, but synthesizing new workflow structure during execution.

New tasks. New dependencies. New fan-outs. New human gates. New compensations. New follow-up steps based on context the developer did not know when the workflow was authored.

That synthesized structure is the agent’s real control flow.

If the workflow function must remain deterministic, that control flow cannot simply emerge inside the function as arbitrary new structure. It has to be fenced into pre-authored branches, interpreted by application code, or generated into a separate workflow outside the runtime.

That is where durability stops.

The platform may durably record that a model was called. It may durably record the model’s returned value. But unless the runtime can accept the agent’s newly synthesized execution graph as a first-class state transition, the agent’s decisions are not native to the durable record.

The system is durable around the agent. Not through it.

Open-world execution

The runtime under an agent has to do one specific thing:

Accept new workflow structure as a legitimate state transition during execution.

Borrowing from the open-world / closed-world distinction in logic and databases, I call this property open-world execution.

Open-world execution means the runtime does not require the full workflow graph to be known before execution begins. New tasks, dependencies, branches, fan-outs, human approvals, and compensations can be introduced during execution, validated by policy, persisted as state, and scheduled by the same engine that runs pre-authored workflows.

The runtime’s answer to “what can happen next?” is not limited to what was committed in advance. It is defined by what has been safely recorded.

In a closed-world runtime, the set of possible futures is bounded by the workflow code deployed before the execution started.

In an open-world runtime, the execution record is allowed to grow. A worker — ordinary code or an LLM-driven worker — can return a payload that says:

“Here are the next tasks under this workflow.”

The runtime validates that payload, persists the tasks into the same execution record, schedules them, records the results, and advances the workflow.

There is no second code path for dynamically synthesized work. Once the new structure is recorded, it is just another durable state transition.

Replay does not need to re-derive the agent’s decision from a model. It does not need to ask the model to be deterministic. It does not need to reconstruct intent from logs.

The decision has already become state.
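The validate, persist, schedule loop described above can be sketched in a few lines of Python. This is a toy model under stated assumptions, not Conductor's implementation: `Execution`, `add_tasks`, `run`, and the `next_tasks` key are hypothetical, and the ledger is an in-memory list.

```python
from dataclasses import dataclass, field
from typing import Any, Callable, Dict, List

Worker = Callable[[Dict[str, Any]], Dict[str, Any]]


@dataclass
class Execution:
    """Hypothetical execution record: one append-only ledger plus a work queue."""

    ledger: List[Dict[str, Any]] = field(default_factory=list)
    pending: List[Dict[str, Any]] = field(default_factory=list)

    def add_tasks(self, tasks: List[Dict[str, Any]], source: str) -> None:
        """Accept new structure as a state transition: validate, persist, schedule."""
        for task in tasks:
            if "name" not in task:
                raise ValueError("task must have a name")
        self.ledger.append({"event": "tasks_added", "source": source, "tasks": tasks})
        self.pending.extend(tasks)

    def run(self, workers: Dict[str, Worker]) -> None:
        """Drain the queue; a worker may return more tasks under the same execution."""
        while self.pending:
            task = self.pending.pop(0)
            output = workers[task["name"]](task.get("input", {}))
            self.ledger.append(
                {"event": "task_completed", "task": task["name"], "output": output}
            )
            if output.get("next_tasks"):
                self.add_tasks(output["next_tasks"], source=task["name"])


def plan(_task_input: Dict[str, Any]) -> Dict[str, Any]:
    # Stand-in for an LLM-driven worker that proposes the next tasks.
    return {"next_tasks": [{"name": "metrics_lookup", "input": {"service": "checkout"}}]}


def metrics_lookup(task_input: Dict[str, Any]) -> Dict[str, Any]:
    # Stand-in for an ordinary tool worker.
    return {"status": "ok", "service": task_input["service"]}


execution = Execution()
execution.add_tasks([{"name": "plan"}], source="workflow_definition")
execution.run({"plan": plan, "metrics_lookup": metrics_lookup})
events = [entry["event"] for entry in execution.ledger]
```

Both the pre-authored `plan` task and the task it proposed go through the same `add_tasks` path, so the ledger alone is enough to see what was added, by whom, and what it returned.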

Why Conductor fits

Conductor was not created for agents. It was created because microservices needed a workflow runtime where execution could evolve without freezing the world every time a team shipped.

That led to primitives like FORK_JOIN_DYNAMIC, SUB_WORKFLOW, HUMAN, EVENT, WAIT, SWITCH (the successor to DECISION), and INLINE tasks.

The important part is not any one task type. The important part is the state model: Conductor can accept runtime-generated structure, persist it under the execution, schedule it through the same coordinator, and record it in the same ledger as the rest of the workflow.

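Human-authored workflows and runtime-authored agent plans feed the same underlying primitive: a workflow definition with a tasks array. Each task carries a name, a type, a reference name, and inputs. Ordering follows the array, and task types such as FORK_JOIN_DYNAMIC and SUB_WORKFLOW provide runtime fan-out. A dynamic fork also declares which input keys carry the generated tasks and their inputs, and it is paired with a JOIN task (join_lookups is an illustrative name):

{
  "name": "agent_investigation",
  "version": 1,
  "tasks": [
    {
      "name": "fan_out_lookups",
      "taskReferenceName": "fan_out_lookups_ref",
      "type": "FORK_JOIN_DYNAMIC",
      "inputParameters": {
        "dynamicTasks": "${agent_proposal.output.tasks}",
        "dynamicTasksInput": "${agent_proposal.output.inputs}"
      },
      "dynamicForkTasksParam": "dynamicTasks",
      "dynamicForkTasksInputParamName": "dynamicTasksInput"
    },
    {
      "name": "join_lookups",
      "taskReferenceName": "join_lookups_ref",
      "type": "JOIN"
    }
  ]
}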

Whether that structure was authored by a developer at deploy time or synthesized by an agent at runtime, the engine treats it as workflow state.

That is what makes Conductor interesting for agents. Not because it has an AI feature bolted onto the side. Because the core execution model already fits the agent workload.

Agentspan: reasoning plane and execution plane

Agentspan is our attempt to package this primitive for agents. It separates the system into two planes.

Reasoning plane: the model. The model receives a context projection: session state, available tools, prior results, conversation history, policy hints, and relevant memory. It proposes intent.

It does not execute tools directly. It does not mutate production systems directly. It does not become the runtime.

It proposes what should happen next.

Execution plane: Conductor. Conductor validates the proposal against the agent’s tool catalog and policy. It materializes the proposal as workflow tasks under the agent’s parent execution. It schedules the work, retries failures, records results, and enforces the same RBAC, traceability, and recovery semantics used for ordinary workflow tasks.

In that split, the model proposes structure, Conductor executes it under policy, and the durable ledger preserves the proposal, materialization, and results together.

That is the boundary we want in production AI systems. The LLM should be allowed to propose structure. It should not be allowed to become the invisible place where production control flow lives.
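To make the execution-plane check concrete, here is a small Python sketch of validating a proposal against a tool catalog and a policy. Everything in it is an assumption for illustration: `TOOL_CATALOG`, `REQUIRES_APPROVAL`, `validate_proposal`, and the `approved_by` field are hypothetical and do not describe Agentspan's actual interfaces.

```python
from typing import Any, Dict, List

# Hypothetical catalog: tool name -> required input keys.
TOOL_CATALOG: Dict[str, List[str]] = {
    "metrics_lookup": ["service", "window"],
    "deploy_history": ["service", "window"],
    "rollback": ["service"],
}

# Hypothetical policy: tools that need a human approval before they run.
REQUIRES_APPROVAL = {"rollback"}


def validate_proposal(proposal: Dict[str, Any]) -> List[str]:
    """Return a list of violations; an empty list means the proposal may be materialized."""
    violations: List[str] = []
    for index, task in enumerate(proposal.get("tasks", [])):
        tool = task.get("type")
        if tool not in TOOL_CATALOG:
            violations.append(f"task {index}: unknown tool {tool!r}")
            continue
        missing = [key for key in TOOL_CATALOG[tool] if key not in task.get("input", {})]
        if missing:
            violations.append(f"task {index}: missing input {missing}")
        if tool in REQUIRES_APPROVAL and not task.get("approved_by"):
            violations.append(f"task {index}: {tool!r} requires human approval")
    return violations
```

The point of the design is that the model's output is data to be checked, not code to be run: a proposal with an unknown tool or an unapproved rollback never reaches a worker.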

On-call: an agent investigating an incident

Suppose an on-call agent is investigating latency on the checkout service.

The agent has tools for metrics, deploy history, alert correlation, incident updates, rollback, roll-forward, and paging a human.

The model sees the page, reads the current context, and proposes a parallel investigation:

{
  "intent": "FORK_JOIN_DYNAMIC",
  "tasks": [
    {
      "type": "metrics_lookup",
      "input": { "service": "checkout", "window": "30m" }
    },
    {
      "type": "deploy_history",
      "input": { "service": "checkout", "window": "2h" }
    },
    {
      "type": "alert_correlate",
      "input": { "service": "checkout", "window": "30m" }
    }
  ]
}

Conductor validates the tools, checks the policy, materializes the fan-out as tasks, schedules them, records each result, and resumes the agent with a new context projection.
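A rough Python sketch of the materialization step, assuming the proposal shape shown above. It produces the two structures a Conductor dynamic fork consumes: a list of task entries and a map of inputs keyed by task reference name. The function name `materialize_fan_out`, the reference-name suffix, and the choice of `SIMPLE` worker tasks are assumptions for illustration, not Agentspan's implementation.

```python
from typing import Any, Dict, List, Tuple


def materialize_fan_out(
    proposal: Dict[str, Any],
) -> Tuple[List[Dict[str, Any]], Dict[str, Any]]:
    """Turn a proposed fan-out into (dynamic tasks, inputs keyed by reference name)."""
    if proposal.get("intent") != "FORK_JOIN_DYNAMIC":
        raise ValueError("only FORK_JOIN_DYNAMIC proposals are handled in this sketch")

    dynamic_tasks: List[Dict[str, Any]] = []
    dynamic_inputs: Dict[str, Any] = {}
    for index, task in enumerate(proposal["tasks"]):
        # Reference names must be unique within an execution; this suffix is an assumption.
        ref = f"{task['type']}_{index}_ref"
        dynamic_tasks.append(
            {"name": task["type"], "taskReferenceName": ref, "type": "SIMPLE"}
        )
        dynamic_inputs[ref] = task.get("input", {})
    return dynamic_tasks, dynamic_inputs


proposal = {
    "intent": "FORK_JOIN_DYNAMIC",
    "tasks": [
        {"type": "metrics_lookup", "input": {"service": "checkout", "window": "30m"}},
        {"type": "deploy_history", "input": {"service": "checkout", "window": "2h"}},
        {"type": "alert_correlate", "input": {"service": "checkout", "window": "30m"}},
    ],
}
dynamic_tasks, dynamic_inputs = materialize_fan_out(proposal)
```

Keying inputs by reference name rather than by tool name lets the same tool appear more than once in a single fan-out.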

The model did not execute anything; the runtime did. The execution ledger now contains the proposal, the materialized tasks, the tool results, and the next decision point in one durable record.

That is the difference between tool calling and durable agent execution.

Tool calling says: the model asked for tools.

Durable agent execution says: the model proposed a state transition, the runtime validated it, the engine materialized it, and the ledger recorded it.

The agent’s plan is not living in application logs. It is part of the durable execution record, alongside the workflow steps that run beside it. Whatever the agent decides to do next — fan out further, escalate to a human, retry under a different policy, hand off to another agent — flows through the same primitive and lands in the same ledger.

That is what production requires.
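With the proposal, the materialized tasks, and the results in one ledger, "what did the agent do, and why?" becomes a filter over a single ordered record. The Python sketch below shows the idea; the ledger entries, their field names, and `trace_task` are hypothetical and exist only to illustrate the query.

```python
from typing import Any, Dict, List

# Hypothetical ledger for one execution; field names are assumptions.
LEDGER: List[Dict[str, Any]] = [
    {"seq": 1, "event": "proposal", "proposal_id": "p1",
     "intent": "FORK_JOIN_DYNAMIC", "tasks": ["metrics_lookup", "deploy_history"]},
    {"seq": 2, "event": "task_scheduled", "proposal_id": "p1", "task": "metrics_lookup"},
    {"seq": 3, "event": "task_scheduled", "proposal_id": "p1", "task": "deploy_history"},
    {"seq": 4, "event": "task_completed", "proposal_id": "p1", "task": "deploy_history",
     "status": "COMPLETED"},
]


def trace_task(ledger: List[Dict[str, Any]], task: str) -> List[Dict[str, Any]]:
    """Return, in ledger order, the entries that explain why `task` ran and how it ended."""
    # Find the proposals that introduced this task.
    proposal_ids = {e["proposal_id"] for e in ledger if e.get("task") == task}
    # Keep those proposals plus every entry about the task itself.
    return [
        e for e in ledger
        if e["proposal_id"] in proposal_ids
        and (e["event"] == "proposal" or e.get("task") == task)
    ]


trace = trace_task(LEDGER, "deploy_history")
```

Here `trace` holds the originating proposal, the scheduling event, and the completion event in one sequence, with no joins across separate logging systems.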

Pick the runtime whose durability reaches the work

This is not an argument for rip-and-replace. If you operate a workflow-as-code platform today for code-defined workflows, keep it. Those platforms are good at what they were built for.

When the workflow is a program and the program is known ahead of time, replay-based durability is a strong model. It gives developers a clean abstraction for long-running code.

But agents are not ordinary long-running code. An agent’s control flow is produced at runtime. The model is not merely computing a value inside a fixed graph. It is proposing the graph.

That changes the durability requirement.

The question is no longer:

Can my workflow engine call an LLM?

Every serious platform can call an LLM. The real question is:

Can my workflow engine durably record the control flow the LLM synthesizes?

If the answer is no, the agent’s decisions live somewhere else: in logs, in a planning service, in generated workflow definitions, or inside application code that interprets model output. That may be acceptable for simple systems. It is not enough for agents that need governance, auditability, replay, compensation, human review, and production recovery.

The question is not whether your workflow engine is durable. The question is where that durability stops.

For code-defined workflows, closed-world workflow platforms can be excellent. The workflow is the program, and the program is known ahead of time.

For agents, the workflow is not fully known until the model begins reasoning. The agent’s decisions are the control flow.

If those decisions live outside the durable state machine, your agent is not durable. It is surrounded by durable infrastructure.

That distinction will matter more as agents move from demos to production systems that touch money, infrastructure, customer data, and security policy.

Conductor had open-world execution because microservices needed it first. Agentspan builds on that primitive for agents.

The model reasons. The runtime governs. The ledger records. The system recovers.

Pick the runtime whose durability extends to where the work actually happens.