Article

One runtime cannot do every job: the case for specialised agent runtimes and harnesses

Author: Jonathan Conway

Timestamp: 15 June 2026

Classification: realtime-data-plane / cell-runtime / agent-runtime / harness / durable-state / dbsp / raft / clustering / ai-sre / substrate / business-factory

There is a demo that everyone has seen by now. You type a sentence into a chat box, a general-purpose assistant agent picks up the task, and a minute later your restaurant is booked, your inbox is triaged, and a draft reply is sitting in front of you. It feels like magic, and for that class of work it genuinely is good. Call the category OpenClaw: a single, capable, general agent runtime that you point at whatever shows up in your day.

The trouble starts when a business tries to take that same runtime and ask it to do real work. Build us a production service. Run our month-end reconciliation across forty thousand transactions every night, with a named human approving the exceptions, and a record we can hand a regulator. Coordinate two hundred agents working the same case file without standing on each other. The demo that booked your table does not survive contact with any of these. Not because the model is weak, but because the runtime around the model was built for a person’s afternoon, not for a factory’s production line.

This post is about why we build specialised runtimes and harnesses instead of one general one, and about the two systems in Substrate that carry the heaviest version of that argument: the realtime data plane, the durable real-time state plane that agents coordinate through, and the cell runtime, the AI SRE that deploys and heals fleets of agent swarms on tiny isolated machines. The short version is that the hard properties (durability, real-time shared state, clustered survival, governance, and an infrastructure bill that does not embarrass you) are not features you bolt onto a generalist. They are the shape of the runtime itself, and that shape is different for different jobs.

The runtime is the part nobody sees, and the part that decides everything

When people compare agent products they compare the model and the prompt. Those are the visible parts. The runtime is the part underneath: where state lives, how an action becomes a durable fact, what happens when a machine dies mid-task, how a hundred agents see a consistent picture of the same data, who is allowed to do what, and how much all of it costs per agent-hour at scale.

A harness is the layer above the model that turns a raw language model into a worker for a specific class of task. A coding harness knows how to plan a change, edit a repository, run the tests, read the failures, and try again. An SRE harness knows how to read health telemetry, classify a fault, and pick a remediation that stays inside policy. A general assistant harness knows a little of everything and the deep parts of nothing. That is the right trade for a personal assistant. It is the wrong trade for building software you intend to ship, or for running a workflow your auditors will read.

The OpenClaw class of runtime makes a specific and reasonable set of assumptions. One user. One session. State that lives in memory and a transcript, and is gone when the session ends. No hard guarantee that an action committed. No need for a second agent to see the same state at the same instant. No machine-failure story beyond “start again.” Those assumptions are why it boots fast and feels light. They are also exactly the assumptions a production factory cannot make.

Hold three workloads next to each other and the mismatch becomes obvious.

Building scalable production software is a long-horizon job. It runs for hours, touches a repository hundreds of times, has to recover when a step fails on iteration ninety, and has to leave behind a trail of what changed and why. A runtime that forgets everything when the session ends cannot do this. It needs durable, replayable state and a harness that treats the build as a mission with a budget, not a chat.

Running a governed business workflow is a coordination and evidence job. Many agents work the same case, a human approves the parts that need a human, and the output has to carry its own proof of how it was produced. A runtime with no shared transactional state and no notion of identity or policy cannot do this. It needs a state plane all the agents read and write through, and governance built into the path rather than stapled on after.

Running either of these at the scale of a real business is an economics job. If every agent needs its own heavyweight container that takes seconds to start and hundreds of megabytes to sit idle, the bill at a thousand concurrent agents is the thing that kills the project, not the model cost. It needs a compute fabric designed for bursty, short-lived, dense workloads.

Three different shapes. One generalist runtime cannot be all three at once, because the assumptions that make it good at the assistant job are the same assumptions that make it wrong for the other two. So we built the runtimes the jobs actually need.

The realtime data plane: the durable real-time state plane agents coordinate through

The realtime data plane is the live data plane of the factory. It is the runtime and the realtime database that sits inside every agent cell, and it is the thing that lets a swarm of agents behave like one coordinated system rather than a crowd of processes shouting over a network. I am going to describe what it gives you and why it matters, and I am going to stay deliberately light on the internals, because the internals are the part we would rather competitors did not get for free.

Start with the property that matters most and is the hardest to fake: every agent reads and writes the same durable, transactional state, in real time, and that state survives the loss of the machine it was running on.

Unpack that one sentence, because each clause is a thing most runtimes do not have.

Durable and transactional. When an agent takes an action, that action becomes a committed fact in a transactional log, not a value sitting in a process that vanishes on a crash. State changes are atomic. Either the action committed or it did not, and the system knows which. There is no half-written record, no “the agent thought it saved but the process died first.” For a workflow that has to be auditable, this is the floor. You cannot prove what happened if you cannot guarantee that what happened was written down.
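A minimal sketch of the all-or-nothing behaviour described above, assuming an in-memory log. `CommitLog` and `has_amount` are names invented for illustration, not the realtime data plane's API, and a real log would also have to reach stable storage before acknowledging a commit.

```python
from dataclasses import dataclass, field


@dataclass
class CommitLog:
    """Append-only log: an action is a fact only once it is in `entries`."""
    entries: list = field(default_factory=list)

    def commit(self, actions, validate):
        # Stage every action first; nothing is visible until all of them pass.
        staged = []
        for action in actions:
            if not validate(action):
                return False  # reject the whole batch, the log is untouched
            staged.append(action)
        self.entries.extend(staged)  # the single point where the batch becomes fact
        return True


def has_amount(action):
    # Hypothetical validation rule for the example.
    return action.get("amount", 0) > 0


log = CommitLog()
ok = log.commit([{"op": "debit", "amount": 40}, {"op": "credit", "amount": 40}], has_amount)
bad = log.commit([{"op": "debit", "amount": 40}, {"op": "credit", "amount": 0}], has_amount)
```

The second batch fails validation on its second action, so neither of its actions reaches the log. The caller gets a definite answer either way, which is the property an audit depends on.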

Real time, with no polling. Agents do not poll a database on a timer to find out what changed. They subscribe to the slices of state they care about and receive the change the instant it is committed, as an incremental update, before the next action in the loop. The freshness of an agent’s view is not a function of a polling interval or replication lag. It is current to the last write.

Shared across all agents. The coordination primitive is shared state, not message passing. When one agent updates a case, every agent subscribed to that case sees the update. There is no message queue to fall behind, no “agent A and agent B disagree about what is assigned” because they are both reading the same committed truth. This is what lets a hundred agents coordinate on one mission on a single node instead of needing an external streaming database to hold the whole thing together.
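To make the push model concrete, here is a small sketch under stated assumptions: a single process, an in-memory dictionary standing in for committed state, and plain callbacks standing in for agents. `SharedState`, `subscribe` and `commit` are hypothetical names, not the product's interface.

```python
from collections import defaultdict


class SharedState:
    def __init__(self):
        self.cases = {}                       # case_id -> {field: value}
        self.subscribers = defaultdict(list)  # case_id -> callbacks

    def subscribe(self, case_id, callback):
        self.subscribers[case_id].append(callback)

    def commit(self, case_id, field_name, value):
        old = self.cases.setdefault(case_id, {}).get(field_name)
        self.cases[case_id][field_name] = value
        # Push the committed change to every subscriber of this case.
        delta = {"case": case_id, "field": field_name, "old": old, "new": value}
        for callback in self.subscribers[case_id]:
            callback(delta)


state = SharedState()
seen_a, seen_b, seen_other = [], [], []
state.subscribe("case-1", seen_a.append)      # agent A watches case-1
state.subscribe("case-1", seen_b.append)      # agent B watches case-1
state.subscribe("case-2", seen_other.append)  # a third agent watches case-2
state.commit("case-1", "assignee", "agent-7")
```

Both subscribers to `case-1` receive the same delta from the same commit, and the subscriber to `case-2` receives nothing. No agent asked "did anything change?" on a timer.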

Reducers: why state changes are predictable instead of hopeful

Agents in the realtime data plane change state through reducers. A reducer is a deterministic function: it takes the current committed state and an action, and it produces the next state. Nothing else. No reaching out to a network mid-update, no hidden side effect, no “it depends what time it is.” The same state and the same action always produce the same next state.

That sounds like a small discipline. It is the thing that makes everything downstream possible.

Because reductions are deterministic, the entire history of a mission can be replayed. Feed the committed log of actions back through the reducers and you arrive at exactly the same state, every time. That replay is your audit trail, your debugger, and your post-incident reconstruction, all from the same record. An auditor does not get a summary an agent wrote about itself. They get the actual sequence of actions and the deterministic state each one produced.
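The replay property is easy to see in miniature. The sketch below assumes a toy task-tracking state; the action shapes and the `reducer` and `replay` names are illustrative only.

```python
from functools import reduce


def reducer(state, action):
    """Pure function: (state, action) -> next state. No I/O, no clock, no randomness."""
    tasks = dict(state)  # never mutate the input state
    if action["type"] == "create":
        tasks[action["task"]] = "open"
    elif action["type"] == "block" and action["task"] in tasks:
        tasks[action["task"]] = "blocked"
    elif action["type"] == "close" and action["task"] in tasks:
        tasks[action["task"]] = "closed"
    return tasks


def replay(log):
    # Fold the committed actions through the reducer, starting from empty state.
    return reduce(reducer, log, {})


log = [
    {"type": "create", "task": "t1"},
    {"type": "create", "task": "t2"},
    {"type": "block", "task": "t2"},
    {"type": "close", "task": "t1"},
]
final = replay(log)
# Replaying every prefix reconstructs the state after each committed action.
history = [replay(log[:i]) for i in range(len(log) + 1)]
```

`history` is the reconstruction an auditor or a debugger would step through: one state per committed action, derived from the log rather than from anyone's summary of it.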

Because reductions are deterministic, conflicting writes have a well-defined resolution instead of a race. Two agents acting on the same state do not produce a corrupt result that depends on who happened to land first by a microsecond. The transactional log gives the actions an order, and the reducers apply them in that order, the same way on every node. A swarm of agents writing concurrently is not a hazard to be feared. It is the normal case, handled by construction.
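A sketch of what "handled by construction" can look like, assuming two agents try to claim the same task and the log has already fixed their order. `claim_reducer` and the "first claim wins" rule are assumptions made for the example; the point is only that every replica applying the same ordered log reaches the same answer.

```python
def claim_reducer(state, action):
    owners = dict(state)
    if action["type"] == "claim" and action["task"] not in owners:
        owners[action["task"]] = action["agent"]  # first claim in log order wins
    return owners


def apply_log(log):
    state = {}
    for action in log:
        state = claim_reducer(state, action)
    return state


# The transactional log has assigned these two concurrent claims an order.
ordered_log = [
    {"type": "claim", "task": "t1", "agent": "agent-A"},
    {"type": "claim", "task": "t1", "agent": "agent-B"},
]
replica_1 = apply_log(ordered_log)
replica_2 = apply_log(ordered_log)
```

Agent B's claim is not an error and not a corruption. It is an action that the reducer deterministically turns into a no-op, identically on every node.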

Because reductions are pure, the state machine is testable in isolation. You can take a reducer, hand it a state and an action, and assert the result without standing up a cluster. The business logic of the factory is unit-testable in the most boring, reliable way there is.
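As an illustration of how plain those tests can be, here is a hypothetical approval reducer with two test functions. Nothing here needs a cluster, a network, or a database, and the names are invented for the example.

```python
def approve_reducer(state, action):
    """Hypothetical rule: an exception is approved only by a named human."""
    if action["type"] != "approve" or not action.get("approver"):
        return state  # ignore anything that is not a named approval
    return {**state, action["exception_id"]: action["approver"]}


def test_approval_is_recorded():
    action = {"type": "approve", "exception_id": "e1", "approver": "dana"}
    assert approve_reducer({}, action) == {"e1": "dana"}


def test_anonymous_approval_is_ignored():
    before = {"e1": "dana"}
    action = {"type": "approve", "exception_id": "e2", "approver": ""}
    assert approve_reducer(before, action) == before


test_approval_is_recorded()
test_anonymous_approval_is_ignored()
```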

Interactive: an agent action enters as a single delta. A reducer folds it into the committed log, and only the affected operators recompute the view that subscribers read. Toggle between incremental and naive re-evaluation and fire a few writes: the cumulative compute counters tell the whole story.

DBSP: the reason real-time does not get more expensive as you add agents

The other half of the realtime data plane story is how subscriptions stay cheap. This is where DBSP earns its place.

The naive way to keep many agents up to date on a shared query is to re-run the query whenever anything changes. An agent subscribes to “all blocked tasks in this mission,” something changes, and the agent re-runs the whole query against the whole table. Add a join and an aggregate on top and every subscriber re-evaluates the entire join and the entire aggregate on every change, whether or not the change could possibly affect their result. The cost is the size of the data multiplied by the number of subscribers. It works in a demo and falls over the moment you have real agents on real data.

DBSP turns a query into a circuit that processes only the change. A single new row propagates through the circuit as a delta of one row: the join probes only the matching rows, the aggregate adds one contribution in constant time, the filter passes or drops one result. The subscriber sees its updated answer in sub-millisecond time, not because the hardware got faster but because the computation got fundamentally smaller. This is a mathematical property, proven compositionally in the original research, not a cache or a heuristic that degrades at the edges.

The consequence for an agent factory is direct. The cost of keeping a subscriber current is the size of the change, not the size of the data and not the number of subscribers. You can put hundreds of agents on the same shared mission state, each watching its own slice, and the propagation cost does not blow up. Real-time coordination across a large swarm becomes feasible on modest hardware. The economics of “every agent sees everything, instantly” finally close.
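The difference in work is visible even in a toy. The sketch below is not the DBSP library and makes no claim about how the realtime data plane implements it. It only borrows the core idea of weighted deltas (+1 for an insert, -1 for a retraction) and counts rows touched for a "blocked tasks per mission" view. `NaiveView` and `IncrementalView` are hypothetical names.

```python
from collections import Counter


class NaiveView:
    """Re-runs the whole query over the whole table on every change."""

    def __init__(self):
        self.rows = []
        self.work = 0  # rows scanned so far

    def insert(self, row):
        self.rows.append(row)
        counts = Counter()
        for mission, status in self.rows:
            self.work += 1
            if status == "blocked":
                counts[mission] += 1
        return dict(counts)


class IncrementalView:
    """Keeps the answer and folds in only the delta."""

    def __init__(self):
        self.counts = Counter()
        self.work = 0  # delta rows processed so far

    def apply(self, delta):
        for (mission, status), weight in delta:
            self.work += 1
            if status == "blocked":
                self.counts[mission] += weight
                if self.counts[mission] == 0:
                    del self.counts[mission]
        return dict(self.counts)


rows = [("m1", "blocked"), ("m1", "open"), ("m2", "blocked"), ("m1", "blocked")]
naive, inc = NaiveView(), IncrementalView()
for row in rows:
    naive_result = naive.insert(row)
    inc_result = inc.apply([(row, +1)])
inc_work_after_inserts = inc.work

# A retraction is just a delta with weight -1.
after_delete = inc.apply([(("m2", "blocked"), -1)])
```

Both views agree on the answer after the four inserts. The naive view scanned 1 + 2 + 3 + 4 = 10 rows to get there and the incremental view processed 4, and the gap widens with every row already in the table.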

For a longer treatment of how this kills the polling-and-notify stack, see “the death of polling”. For the wider point that the runtime, the database, and the subscription engine are one binary rather than three systems with three consistency boundaries, see “one binary, one agent runtime”.

Clustered and durable: an agent outlives the machine it ran on

Here is the property that separates a production runtime from a clever single-node database, and it is the one I am proudest of.

The realtime data plane clusters. The transactional log is replicated across nodes using Raft, the same consensus protocol that sits underneath the serious distributed databases. A write is not committed until a quorum of nodes has it. That means the committed state of a mission is not sitting on one machine waiting for that machine to have a bad day. It is on a majority of the nodes in the cluster, and the rest catch up behind them.
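The quorum rule reduces to a little arithmetic. This is a simplified sketch of how a Raft leader can derive the commit point from how far each node's log has been replicated. It omits the additional Raft requirement that the entry at that index belongs to the leader's current term, and the function names are mine rather than any library's.

```python
def quorum_size(cluster_size):
    # A majority: 2 of 3, 3 of 5, and so on.
    return cluster_size // 2 + 1


def commit_index(match_index):
    """Highest log index that is stored on at least a majority of nodes.

    `match_index` maps node name -> last log index known to be on that node,
    including the leader itself.
    """
    ranked = sorted(match_index.values(), reverse=True)
    return ranked[quorum_size(len(ranked)) - 1]


three_nodes = commit_index({"A": 7, "B": 7, "C": 4})
five_nodes = commit_index({"A": 9, "B": 9, "C": 6, "D": 5, "E": 5})
```

In the three-node case, entry 7 is committed even though node C has only reached entry 4, because two of three nodes hold it. In the five-node case the commit point is 6, the highest index present on three nodes.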

Now the part that matters for agents specifically. Because the state an agent depends on is already replicated, the agent is not tied to the machine it started on. If a node dies (a power supply fails, a host falls off the network, a kernel panics), the agent that was running there does not take the mission down with it. The surviving quorum still holds every committed entry. Raft elects a new leader from the survivors, and the agent re-spawns on a healthy node, resuming from the last committed action. Not from a checkpoint someone remembered to take an hour ago. From the last action that committed, which on a transactional log is the last thing that happened. Zero committed entries lost.
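A toy walk-through of that failover, with heavy simplifications: three in-memory logs stand in for nodes, and "pick the survivor with the longest log" stands in for a real Raft election, which uses terms and votes and only elects a candidate whose log is at least as up to date as a majority's. `step` and `rebuild` are hypothetical helpers.

```python
def step(state, action):
    # Deterministic reducer for the example: count actions committed per agent.
    agent = action["agent"]
    return {**state, agent: state.get(agent, 0) + 1}


def rebuild(log):
    state = {}
    for action in log:
        state = step(state, action)
    return state


committed = [{"agent": "agent-7"} for _ in range(3)]
logs = {
    "A": list(committed),      # leader, hosts agent-7
    "B": list(committed),      # follower, fully caught up
    "C": list(committed[:2]),  # follower, one entry behind
}

state_before = rebuild(logs["A"])
del logs["A"]  # the leader's host dies

# Simplified election: the most up-to-date survivor takes over.
new_leader = max(logs, key=lambda node: len(logs[node]))
state_after = rebuild(logs[new_leader])
```

The third entry was committed because A and B, a majority, both held it. When A dies, B still has it, so the state rebuilt on the new leader matches the state before the failure exactly.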

This is the difference between an agent platform and a toy. In the OpenClaw model, a machine failure is the end of the session. You start again, and whatever the agent had figured out in the last two hours is gone. In a factory running real missions, a machine failure is a Tuesday. Hardware fails. The question is whether your runtime treats that as a catastrophe or as a routine event the cluster absorbs without anyone being paged. The realtime data plane treats it as routine, because the durability and the clustering were designed in from the start, not retrofitted.

Interactive: three nodes share one Raft-replicated transactional log. Node A is the leader and runs Agent #7. Press “kill the leader node” and watch the surviving quorum, which already holds the full committed log, elect a new leader and re-spawn the agent from the last commit with nothing lost. Reset to revive the node and run it again.

The realtime data plane against the alternatives

Plenty of good systems solve a slice of this. None of the general-purpose ones solve the whole shape, because they were built for a different shape. The table below is the honest comparison: not “we are better at everything” but “here is which properties each one actually has for an agent swarm.”

Capability	Realtime data plane	Generalist assistant runtime (OpenClaw class)	Orchestration library (in-process graph)	Durable workflow engine	Distributed actor framework	DIY database plus pub/sub
Real-time shared state across many agents	Yes, native	No, single session	Partial, in one process	No, state is per-workflow	Partial, no consistent view	Eventual, via the queue
Durable transactional commits	Yes	No	No	Yes	No, actor state is volatile by default	Depends on how you wire it
Incremental subscriptions, no polling	Yes, DBSP circuits	No	No	No	No	No, re-query on notify
Deterministic replay as the audit trail	Yes, from the log	No	No	Partial, event history	No	No
Clustered failover with agent re-spawn from last commit	Yes, Raft quorum	No	No	Engine is durable, the agent is not stateful-resumable	Manual, you build it	Manual, you build it
Governance, identity and policy on the path	Yes, by construction	No	No	No	No	No
Runs in your own perimeter, air-gappable	Yes	Usually vendor-hosted	Yes	Self-host or managed	Yes	Yes
Cost stays flat as subscriber count grows	Yes, cost is the delta	n/a	No	n/a	No	No

The pattern in that table is the argument. Each alternative is strong in one or two rows and absent in the rest, because each was designed for its own job: the assistant runtime for one person’s session, the orchestration library for wiring a single process, the workflow engine for durable long-running steps, the actor framework for raw distributed compute, the DIY stack for whatever you have time to build. The realtime data plane is the one designed for the specific job of a governed agent swarm sharing live state, and that is why it is the one with marks all the way down its column.

The realtime data plane feature list, in one place

For the reader who wants the capabilities enumerated rather than narrated:

One durable, transactional, real-time state plane shared by every agent in a mission.
Reducer-based state changes that are deterministic, replayable, conflict-resolved by construction, and testable in isolation.
DBSP incremental subscriptions, so a subscriber’s update cost is the size of the change, not the size of the data or the number of subscribers.
No polling and no external message queue. Agents receive committed deltas, not notifications they have to react to with a re-query.
The runtime, the database, and the subscription engine are one process co-located with the agent, so there is no network hop between an agent and the data it reads and writes.
Raft-replicated log with quorum commit, so committed state survives the loss of any single node.
Clustered agent failover. An agent re-spawns on a healthy node from the last committed action when its host dies, with zero entries lost.
Deterministic replay of the whole mission from the committed log, which doubles as the audit trail and the debugger.
Governance, cryptographic identity, and policy live on the action path, not in a separate system bolted on afterwards.
Runs inside the customer’s perimeter, on bare metal, on-premises, or air-gapped, with no dependency on an external service for any of the above.

That is a runtime built for one job and built all the way down for it. You do not get this list by adding features to a personal assistant. You get it by deciding, at the start, that the job is a governed swarm of agents sharing durable live state, and shaping every layer around that.

The cell runtime: an AI SRE that runs dense fleets of agent swarms on very little

The realtime data plane answers “where does the agents’ state live and how does it survive.” The cell runtime answers a different question: “how do we run thousands of these agents, deploy them, heal them, and not go bankrupt doing it.” It is the compute fabric and the AI site reliability engineer for the whole factory, and it is where a body of long-horizon operational thinking we built for the SRE problem now gets shared across every runtime on the platform.

Different workloads need different boxes

The first thing to get straight is that a cell runtime cell and a realtime data plane agent are not the same kind of work, and pretending they are is how you end up with the wrong infrastructure.

A realtime data plane agent workload is stateful and coordinated. It holds live subscriptions to shared mission state, it reads and writes the durable log, it coordinates with its peers, and it expects its state to outlive any single machine. Its defining trait is shared, durable, real-time state.

A cell runtime cell workload is isolated and disposable. It is the unit of execution: a sandbox where a specific piece of work runs, with a strong security boundary around it, that boots in tens of milliseconds, does its job, and is torn down clean. Think of running an untrusted build, executing a tool, processing one document, compiling and testing one code change. Its defining trait is strong isolation and a short, bursty life.

The factory needs both, and it needs them to be different. You do not want the heavyweight durability machinery wrapped around a thirty-second sandboxed build, and you do not want a disposable sandbox holding state a mission depends on. So the cell is the isolation and execution primitive, the realtime data plane is the state and coordination primitive, and a cell mounts the slice of the realtime data plane state it needs for the duration of its work. The right tool sits at each layer instead of one tool stretched thin across both.

MicroVMs: real isolation that is still cheap and fast

The cell runtime runs cells as microVMs, not as containers. This is a deliberate and load-bearing choice.

A container shares the host kernel. For multi-tenant agent work, where a cell might run untrusted code or process sensitive data, a shared kernel is a weak boundary. A microVM gives each cell its own kernel and a hardware-enforced isolation boundary, which is the boundary you want when the thing in the box is an agent executing code you did not write. The security property is not a policy you hope holds. It is enforced by virtualisation.

The usual objection to virtual machines is that they are heavy and slow to start. That objection is true of traditional VMs and false of the microVM approach the cell runtime uses. A cell boots in tens of milliseconds, and an idle cell’s memory footprint is small enough that a single host carries them in the thousands. You get the strong isolation of a VM with a startup cost and a density closer to a function call. That combination is the whole game, because it is what lets the factory boot a fresh, strongly-isolated cell for every task instead of reusing warm, weakly-isolated workers and hoping nothing leaks between tenants.

The density is also the cost story, and the cost story is what decides whether a factory is a business or a science project. A container-based stack that needs seconds to start a workload and hundreds of megabytes to keep one idle does not extrapolate to a thousand concurrent agents without an eye-watering bill. Cells that boot in tens of milliseconds and sit idle for almost nothing change the arithmetic by an order of magnitude. The factory pays for peak load at peak and drops to a floor off-peak, because booting a cell is cheap enough to do on demand.
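The arithmetic behind that claim can be sketched in a few lines. Every number below is an illustrative input chosen to match the rough magnitudes in the text ("hundreds of megabytes" per idle container, "thousands" of cells per host), not a measurement of the cell runtime or of any container stack, and the model counts idle memory only.

```python
import math


def hosts_needed(workers, idle_mb_per_worker, host_memory_gb):
    """Hosts required to keep `workers` idle workers resident, by memory alone."""
    per_host = (host_memory_gb * 1024) // idle_mb_per_worker
    return math.ceil(workers / per_host)


# Hypothetical inputs: 1,000 concurrent agents on 64 GB hosts.
container_hosts = hosts_needed(workers=1000, idle_mb_per_worker=512, host_memory_gb=64)
microvm_hosts = hosts_needed(workers=1000, idle_mb_per_worker=16, host_memory_gb=64)
```

With these assumed inputs the container fleet needs 8 hosts and the microVM fleet fits on 1. Substitute your own measured footprints; the shape of the result is what matters, not these particular figures.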

Interactive: each square is an isolated microVM cell. Watch them boot in tens of milliseconds, then click “inject fault” to trip a fault on a live cell and watch the AI SRE operator classify it, heal it, and roll the cell back within policy. Drag the capacity slider to see how idle density drives the cost-per-agent-hour.

The AI SRE, and the long-horizon thinking it shares with the rest of the platform

The piece that makes the cell runtime more than a fast sandbox runner is the operator. It is not a rules engine with a runbook attached. It is a model running inside the cell fabric, with read access to the health telemetry of every live cell and write access bounded by a policy envelope. When a cell faults, the operator classifies the fault (a transient crash, a state inconsistency, resource exhaustion, or a policy violation), selects a remediation from the permitted set, and either applies it or escalates to a human gate when the fix would exceed its policy boundary. The turnaround from fault to proposed fix is measured in seconds, not in the minutes it takes to wake an on-call engineer.
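The apply-or-escalate step can be sketched as follows. In the real operator the classification comes from a model reading telemetry. Here the fault class is simply passed in, and the remediation names, the `REMEDIATION` table and the `decide` function are all hypothetical.

```python
from enum import Enum


class Fault(Enum):
    TRANSIENT_CRASH = "transient_crash"
    STATE_INCONSISTENCY = "state_inconsistency"
    RESOURCE_EXHAUSTION = "resource_exhaustion"
    POLICY_VIOLATION = "policy_violation"


# Hypothetical mapping from fault class to the preferred remediation.
REMEDIATION = {
    Fault.TRANSIENT_CRASH: "restart_cell",
    Fault.STATE_INCONSISTENCY: "rollback_cell",
    Fault.RESOURCE_EXHAUSTION: "raise_memory_limit",
    Fault.POLICY_VIOLATION: "quarantine_cell",
}


def decide(fault, permitted):
    action = REMEDIATION[fault]
    if action in permitted:
        return {"decision": "apply", "action": action}
    # The fix exists but sits outside the envelope: hand it to a human gate.
    return {"decision": "escalate", "action": action}


policy_envelope = {"restart_cell", "rollback_cell"}
auto = decide(Fault.TRANSIENT_CRASH, policy_envelope)
gated = decide(Fault.RESOURCE_EXHAUSTION, policy_envelope)
```

A transient crash is fixed without waking anyone, because a restart is inside the envelope. Raising a memory limit is not, so the operator proposes it and waits for a person.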

What is worth saying, and what is new, is that the AI SRE is where we worked out how to make an agent reason and act well over a long horizon under tight constraints, and that work does not stay locked inside the cell runtime. Keeping a fleet healthy is a hard long-horizon problem: you have to hold context across many events, weigh an action against a budget and a policy, avoid the failure mode where an agent thrashes between fixes, know when to act and when to escalate to a human, and leave a clean record of why you did what you did. The strategies and the machinery we built to make the SRE operator good at that (the long-horizon reasoning, the policy-bounded action, the human-gate escalation, the signed audit of every decision) are the same strategies the rest of the platform’s agents need to run long missions well. So they are shared. The SRE was the proving ground for long-horizon governed autonomy, and the rest of the runtimes inherit it.

Every action the operator takes is signed by its own cryptographic identity and lands in the same audit log as the agents’ business actions. A human reviewing a post-incident report sees what the SRE agent saw, what it proposed, and what a human approved or overrode, in the same replay interface used to review the work the cells were doing. Operational events and business events are one record, not two systems you have to reconcile after the fact. That is governance at the infrastructure layer, and it is the same principle that runs through the realtime data plane: the governance is on the path, not stapled on afterwards.
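A sketch of what a signed, tamper-evident audit entry looks like, with one loud caveat: it uses HMAC-SHA256 with a shared secret from the Python standard library as a stand-in. A per-actor cryptographic identity of the kind described above would normally mean asymmetric signatures, which the standard library does not provide. The key, the entry fields and the function names are invented for the example.

```python
import hashlib
import hmac
import json


def sign_entry(entry, key):
    # Canonical JSON so the same entry always produces the same bytes.
    payload = json.dumps(entry, sort_keys=True).encode()
    signature = hmac.new(key, payload, hashlib.sha256).hexdigest()
    return {"entry": entry, "signature": signature}


def verify_entry(signed, key):
    payload = json.dumps(signed["entry"], sort_keys=True).encode()
    expected = hmac.new(key, payload, hashlib.sha256).hexdigest()
    return hmac.compare_digest(expected, signed["signature"])


operator_key = b"demo-key-for-illustration-only"
record = sign_entry(
    {"actor": "sre-operator", "action": "restart_cell", "cell": "c-42"},
    operator_key,
)
# Someone edits the recorded action after the fact but cannot re-sign it.
tampered = {
    "entry": {**record["entry"], "action": "delete_cell"},
    "signature": record["signature"],
}
```

`verify_entry` accepts the original record and rejects the tampered one, which is the minimum a reviewer needs before trusting a replay of what the operator did.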

For a deeper look at the density numbers and the compute economics, see “the compute fabric that makes the business factory viable”. For how the whole thing deploys inside a customer’s walls without depending on a cloud provider, see “sovereign AI, air-gapped by default”.

Why this is the harder thing to copy

It is easy to wrap a model in a loop and demo an agent. It is hard to build the runtime underneath that makes the agent survive a node failure, coordinate with two hundred peers on live state, prove what it did to a regulator, and cost a defensible amount per hour at scale. The first is a weekend. The second is the work.

The generalist runtime is the right answer for the personal-assistant job, and it will keep getting better at that job. It is the wrong answer for building production software and for running governed, durable, fault-tolerant workflows at a scale that does not bankrupt the business, because those jobs have a different shape, and the shape is the runtime. Substrate ships the runtimes the jobs actually need: the realtime data plane for durable real-time agent state with clustered failover, and the cell runtime for an AI SRE that runs dense fleets of agent swarms on isolated microVMs. Each is built all the way down for its job, and each carries the same governance through every layer.

One runtime cannot do every job. So we stopped pretending it could, and built the ones that can.