
Multi-agent coordination without races: embedded realtime state vs external queues

Author: Jonathan Conway
Timestamp: 17 May 2026
Classification: realtime-data-plane / multi-agent / coordination / business-factory / realtime-state / dbsp / substrate / race-conditions

A supply-chain team at a European logistics firm built a multi-agent prototype last year. Five agents, each watching a different feed: stock levels, inbound shipments, customs clearance, carrier capacity, and live pricing. The idea was that when a shipment slipped, the agents would coordinate a reroute automatically: reprice, rebook, update the warehouse schedule, flag the delay to the finance system. Smart architecture. Sensible idea.

It worked in the demo. In production, after about three weeks, the pricing agent and the carrier-capacity agent disagreed about the state of a booking that both of them had tried to modify in the same second. The conflict resolution logic wrote one agent’s view on top of the other’s. The booking was confirmed twice. The write-off was small but the embarrassment was not.

Nobody had written buggy logic. The logic was fine. The problem was that two agents reading from a shared Postgres instance and writing back to it with a 40-millisecond round trip between them is not a coordination model. It is a race.

This is the quiet failure mode that multi-agent systems fall into when the coordination layer is not built for agents. Queues help but do not fix it. Event buses help but introduce ordering problems that are worse to debug. External state stores help but put two consistency boundaries between the agent and the truth. The question this piece examines is why these patterns fail, what a genuinely agent-native coordination model looks like, and how Substrate’s realtime data plane cells solve the problem at the right level of the stack.

The coordination problem, stated plainly

A single agent is simple. It reads, reasons, writes. The state is wherever the agent left it. There is no question of who else might have touched it.

Two agents change the picture entirely. Each agent has a view of the world. That view was accurate when the agent read it. By the time the agent acts on it, the other agent may have changed the world. This is not a bug. It is physics. Any two processes reading and writing shared state will diverge unless the system provides a mechanism to keep them synchronised.

Three classical approaches exist, and each is a partial answer.

The first is message passing: agents communicate by sending each other messages rather than reading shared state. This works well when agents are loosely coupled and the message ordering is a first-class concern of the protocol. It breaks when you need a consistent view of the world at a given instant, because nothing guarantees that every agent has received the same messages in the same order by that instant, and reconstructing “what did every agent know at time T” from a message log requires work that almost nobody does correctly. Kafka-based multi-agent architectures hit this wall around the point when someone asks for a replay of a production incident.

The second is a shared external database: a Postgres or similar sits in the middle, agents read and write it, and transactions provide isolation. This is more honest than message passing about the shared-state nature of the problem. The cost is latency (a round trip to the database on every read) and the fact that the database knows nothing about agents, missions, budgets, or replay. Two agents in a 40 ms transaction window can still step on each other if the application logic does not defend every critical section explicitly, which it usually does not.
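The interleaving is easy to reproduce without a real database. The sketch below is illustrative only: the `Store` class, its `version` counter, and the agent names are hypothetical stand-ins, not Postgres or any Substrate API. It shows two agents that both read a pending booking before either writes. Unguarded, both confirm it. With an optimistic version check on the write, the second confirmation is rejected.

```python
from dataclasses import dataclass


@dataclass
class Row:
    status: str
    version: int = 0


class Store:
    """Hypothetical in-memory stand-in for a shared table holding one booking row."""

    def __init__(self) -> None:
        self.row = Row(status="pending")

    def read(self) -> Row:
        # Return a snapshot, as an agent would hold after a round trip.
        return Row(self.row.status, self.row.version)

    def blind_write(self, status: str) -> None:
        # Unguarded write: applied regardless of what happened since the read.
        self.row = Row(status, self.row.version + 1)

    def guarded_write(self, status: str, expected_version: int) -> bool:
        # Optimistic check: apply only if nobody has written since our read.
        if self.row.version != expected_version:
            return False
        self.row = Row(status, self.row.version + 1)
        return True


# Unguarded: both agents read "pending" first, then both confirm.
store = Store()
seen_by_pricing = store.read()
seen_by_carrier = store.read()
confirmations = 0
for snapshot in (seen_by_pricing, seen_by_carrier):
    if snapshot.status == "pending":
        store.blind_write("confirmed")
        confirmations += 1

# Guarded: same interleaving, but the second write is rejected.
guarded_store = Store()
first = guarded_store.read()
second = guarded_store.read()
results = [guarded_store.guarded_write("confirmed", s.version) for s in (first, second)]

print(confirmations, results)
```

The guard works, but only where someone remembered to write it, which is the weakness the paragraph above describes.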

The third is an external queue with event sourcing: agents emit events, a queue delivers them, and a projection layer maintains current state. This is theoretically sound and practically complex. Three separate subsystems (producer, queue, consumer/projection) each have their own failure modes. Consistency between what is in the queue, what has been projected, and what an agent believes to be true requires careful coordination that is, ironically, exactly the coordination problem you started with.
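A minimal sketch of the gap between queue and projection, assuming a toy in-memory queue and a dictionary as the projected state (both hypothetical, with no real broker involved). Agent A has already emitted its event, yet agent B still reads the old value because the projection has not consumed it.

```python
from collections import deque

queue = deque()                      # events emitted but not yet projected
projection = {"slot_42": "free"}     # the state that readers actually see


def emit(event):
    """Producer side: append an event to the queue."""
    queue.append(event)


def project_one():
    """Consumer side: apply the oldest queued event to the projection."""
    key, value = queue.popleft()
    projection[key] = value


emit(("slot_42", "booked"))          # agent A books the slot
stale_view = projection["slot_42"]   # agent B reads before the projection runs
project_one()
fresh_view = projection["slot_42"]   # the same read, after the projection catches up

print(stale_view, fresh_view)
```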

None of these approaches is wrong as a general architectural pattern. The problem is that they were designed for humans building distributed services, not for agents coordinating at millisecond timescales inside a governed factory where every action must be signed, replayable, and subject to a hard budget.

Why agents need something different

The defining property of a multi-agent system in a governed factory is that the agents are not independent services. They are workers on the same mission, sharing the same plan, spending from the same budget. They need to see the same world at the same moment. When agent A updates the kanban board because it has finished a task, agent B should see that update before it decides what to do next. Not eventually. Now.

Consider the order-book pattern. A prediction-market cell has several agents watching different signals: one agent watches news feeds, one models historical volatility, one manages the current book. The book agent is only useful if it sees the current positions of all other agents in close to real time. A 200 ms lag is not “close enough” in an order book. It is a race condition waiting to happen.

The supply-chain pattern is similar. A factory mission that reroutes a shipment involves agents updating stock allocations, carrier bookings, and finance records. If the carrier-booking agent and the stock-allocation agent both read the same state, make incompatible decisions, and write them back in the same window, the result is the story I opened with. The fix is not better conflict resolution. The fix is ensuring the agents cannot make incompatible decisions because they cannot see incompatible views of the state.

This requires a different model: shared state that is live, consistent, and agent-native by design. Not a database bolted on the side. The state itself needs to be part of the agent runtime.

How DBSP incremental subscriptions change the picture

The theoretical underpinning here is DBSP (Database Stream Processor), the incremental computation model described by Budiu et al. in their VLDB 2023 paper. The key insight is that if you model a database query as a circuit over streams of weighted changes, you can process only the delta on each write rather than re-evaluating the full query against the full dataset. Joins, aggregations, and filters all compose incrementally. The cost of propagating a one-row insert is driven by the size of the delta and the rows it touches, not by the size of the whole relation.
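To make the delta idea concrete, here is a small sketch of incremental join maintenance over weighted rows (a weight of +1 is an insert, -1 a delete). It relies on the bilinearity of join: the change in A ⋈ B is ΔA ⋈ B + A ⋈ ΔB + ΔA ⋈ ΔB. The table names and data are invented for illustration, and this is not the realtime data plane's implementation. The nested-loop `join` is also deliberately naive; a real engine would index the stored side so the work tracks the delta.

```python
def add(a, b):
    """Add two weighted row sets, dropping rows whose weight becomes zero."""
    out = dict(a)
    for row, weight in b.items():
        out[row] = out.get(row, 0) + weight
    return {row: w for row, w in out.items() if w != 0}


def join(a, b):
    """Join (key, value) rows on key; weights multiply."""
    out = {}
    for (ka, va), wa in a.items():
        for (kb, vb), wb in b.items():
            if ka == kb:
                row = (ka, va, vb)
                out[row] = out.get(row, 0) + wa * wb
    return {row: w for row, w in out.items() if w != 0}


def join_delta(a, da, b, db):
    """Change in join(a, b) caused by the deltas da and db."""
    return add(add(join(da, b), join(a, db)), join(da, db))


# Illustrative data: bookings and capacity, both keyed by carrier.
bookings = {("cx", "s1"): 1, ("cy", "s2"): 1}
capacity = {("cx", 10): 1, ("cy", 5): 1}
view = join(bookings, capacity)

# One new booking on cx; capacity of cy changes from 5 to 4.
d_bookings = {("cx", "s3"): 1}
d_capacity = {("cy", 5): -1, ("cy", 4): 1}

delta = join_delta(bookings, d_bookings, capacity, d_capacity)
incremental = add(view, delta)
full = join(add(bookings, d_bookings), add(capacity, d_capacity))

print(delta)
print(incremental == full)
```

Subscribers only ever need `delta`: one row added, one row retracted, one row re-asserted with the new capacity.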

The realtime data plane implements DBSP circuits as the native update-propagation layer. When an agent writes a row, the change propagates immediately through every live subscription that touches that row. Other agents subscribing to that view receive only the delta. They do not poll. They do not re-fetch. They receive the precise change that affects their view, as it happens.

[Interactive diagram: incremental (DBSP) versus naive full re-evaluation under repeated writes.] The compute-unit counter shows the divergence: incremental carries a constant cost per write regardless of how many agents are subscribed; naive re-evaluation scales with both the number of agents and the size of the dataset.

The coordination implication is significant. In the naive model, five agents all polling a shared database for updates will generate five reads on every poll cycle, each of which may or may not see the latest write depending on read-committed semantics and cache state. With DBSP subscriptions, a single write produces five targeted deltas, each arriving at the relevant agent with sub-millisecond latency within the same host. There is no poll. There is no cache staleness. There is no consistency boundary between the agent and the state it is acting on.
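The push model can be sketched in a few lines. Everything here is hypothetical: `SharedState`, the key prefixes, and the inbox lists are invented to show the shape of the idea, where a write fans out one delta to each matching subscriber and nobody polls. It says nothing about how the realtime data plane implements subscriptions internally.

```python
from typing import Callable, Dict, List, Tuple

Delta = Tuple[str, object, object]  # (key, old_value, new_value)


class SharedState:
    """Hypothetical in-process state with push-based subscriptions."""

    def __init__(self) -> None:
        self._rows: Dict[str, object] = {}
        self._subs: List[Tuple[Callable[[str], bool], Callable[[Delta], None]]] = []

    def subscribe(self, predicate: Callable[[str], bool],
                  on_delta: Callable[[Delta], None]) -> None:
        self._subs.append((predicate, on_delta))

    def write(self, key: str, value: object) -> int:
        """Apply a write and push the delta to every matching subscriber."""
        old = self._rows.get(key)
        self._rows[key] = value
        delivered = 0
        for predicate, on_delta in self._subs:
            if predicate(key):
                on_delta((key, old, value))
                delivered += 1
        return delivered


# Five agents, each with an inbox, all subscribed to booking rows.
inboxes = {name: [] for name in ("stock", "inbound", "customs", "carrier", "pricing")}
state = SharedState()
for inbox in inboxes.values():
    state.subscribe(lambda key: key.startswith("booking:"), inbox.append)

delivered = state.write("booking:42", "reserved")   # one write, five deltas
ignored = state.write("audit:1", "noop")            # matches no subscription

print(delivered, ignored, inboxes["pricing"])
```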

The realtime data plane’s measured throughput on a single node is 2,326 agent actions per second at 0.43 ms average latency. That figure is the result of having no network hop between the agent and its data. The embedded DB is part of the same binary.

Realtime data plane cells: the unit of coordination

The architectural primitive that makes this practical is the realtime data plane cell. Each cell is an isolated agent runtime with its own embedded database, its own DBSP subscription graph, its own signed event log, and its own WASM execution sandbox. Cells are lightweight: they boot in under 50 ms and a single host can hold around 10,000 idle cells.

Coordination between agents in the same mission works through shared subscriptions. If two cells are assigned to the same mission, they can subscribe to the same state views. A write from one cell propagates to the other through the DBSP circuit without leaving the host. The receiving cell does not need to make a network request. The shared state is the coordination channel.

[Interactive diagram: a grid of agent cells with a capacity slider and a fault-injection control.] Each square in the grid is an agent cell. A cell that fails heals and recovers with its state preserved. Cells that are live and subscribed to each other coordinate through delta propagation within the host; there is no external broker.

This is the structural answer to the race. Two cells that share a state subscription cannot have incompatible views of that state for longer than a single DBSP propagation cycle, which is sub-millisecond within the host. The supply-chain scenario I opened with becomes: the pricing agent and the carrier-capacity agent are both subscribed to the booking state. When one of them writes a reservation, the other receives the delta before it reads the state again. The booking cannot be confirmed twice off a stale read, because the second agent’s view of the booking state is updated before it acts.

It is worth being specific about what “shared state with zero hops” means for different patterns.

For a kanban board, it means that when agent A marks a task complete, agent B sees the updated board state in the same sub-millisecond cycle. There is no risk of agent B picking up the same task because the task is gone from the available-work view by the time agent B queries it.

For an order book, it means that every agent managing a position sees all other positions as they change. A coordinated update (for example, hedging a position across two legs) can be atomic within a single cell or synchronised across two cells via a single propagation cycle.

For a supply-chain reroute, it means that all agents involved in the reroute share a live view of the shipment state. The sequence “carrier updates booking, stock allocation updates, finance records the change” happens as a propagated sequence of deltas, each one visible to downstream agents before they act.
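The kanban case is the simplest of the three to sketch. The `KanbanBoard` class below is a hypothetical single-process illustration. In a real cell group the available-work view would be kept current by delta propagation rather than by a method call, but the invariant is the same: the view changes in the same step as the write, so a second agent never sees a task that has already been claimed.

```python
class KanbanBoard:
    """Hypothetical board whose available-work view is updated with every claim."""

    def __init__(self, tasks):
        self.owner = {task: None for task in tasks}
        self.available = set(tasks)  # materialised "available work" view

    def next_task(self):
        # Pick deterministically so the example is reproducible.
        return min(self.available) if self.available else None

    def claim(self, agent, task):
        if task not in self.available:
            return False  # already claimed by someone else
        self.owner[task] = agent
        self.available.discard(task)  # view updated in the same step as the write
        return True


board = KanbanBoard(["t1", "t2"])

task_a = board.next_task()
claimed_a = board.claim("agent_a", task_a)

task_b = board.next_task()                 # agent B never sees t1 as available
late_claim = board.claim("agent_b", "t1")  # an explicit attempt is rejected

print(task_a, task_b, late_claim)
```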

Scaling: one cell per agent or one cell per mission

The question that comes up in every architecture discussion is how to partition cells. Should each agent get its own cell? Should a mission get one cell with all agents running inside it?

The answer depends on isolation requirements and coordination density.

If agents on the same mission are coordinating heavily (kanban, order book, shared planning), put them on the same cell or on cells within the same host. Delta propagation within a host is sub-millisecond. Delta propagation across hosts requires a replication hop, which adds latency.

If agents need strong isolation (for example, because they are operating on different missions with different budget ceilings and different audit scopes), give each mission its own cell group. The realtime data plane’s declarative row-level security means that even agents on the same host cannot read each other’s state unless the policy permits it. The isolation is enforced at the query level, not the network level.
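What “enforced at the query level” means can be shown with a toy policy. The `Row`, `Agent`, and `can_read` names are assumptions made for this sketch, and the actual policy language of the realtime data plane is not shown in this article. The point is that the filter sits inside the query path, so rows from another mission never reach the caller even though they live on the same host.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Row:
    mission: str
    key: str
    value: str


@dataclass(frozen=True)
class Agent:
    name: str
    mission: str


def can_read(agent: Agent, row: Row) -> bool:
    """Hypothetical policy: an agent may read only rows of its own mission."""
    return agent.mission == row.mission


def query(agent: Agent, table, policy=can_read):
    # The policy is applied inside the query, not by the caller afterwards.
    return [row for row in table if policy(agent, row)]


table = [
    Row("reroute-17", "booking", "reserved"),
    Row("reroute-17", "custody", "open"),
    Row("pricing-03", "margin", "0.12"),
]

reroute_agent = Agent("carrier", "reroute-17")
pricing_agent = Agent("pricer", "pricing-03")

visible_to_reroute = [row.key for row in query(reroute_agent, table)]
visible_to_pricing = [row.key for row in query(pricing_agent, table)]

print(visible_to_reroute, visible_to_pricing)
```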

A practical heuristic: start with one cell per mission, where the cell holds the shared state for all agents in that mission. Agents that need stronger process isolation (for example, WASM-sandboxed third-party logic) run in their own cell and subscribe to the mission state via a declared subscription. Scale to multiple hosts only when the coordination density or memory footprint genuinely requires it.

For the governed factory at the scale Substrate targets, around 2,300 actions per second per node, a single-host cell group can handle the coordination for a substantial mission without distributed state. The constraint is not usually compute. It is memory and storage, and the realtime data plane’s incremental views keep the in-memory state small by storing only deltas and materialised views, not full copies of every subscribed table.

Race conditions: what gets eliminated and what does not

It is worth being clear about what this architecture actually eliminates.

It eliminates the class of races that arise from polling latency. If agent B reads stale state because it polled before agent A’s write propagated, that race does not exist in the realtime data plane. The subscription is live. The write propagates before the next read.

It eliminates the class of races that arise from multiple consistency boundaries. In a system with a queue, a projection layer, and a database, a write from one agent may be in the queue but not yet projected when another agent reads the database. That ambiguity cannot exist when the state is in a single embedded DB with synchronous delta propagation.

It does not eliminate the class of races that arise from genuinely concurrent decisions. If two agents both decide to book the last carrier slot at the same moment, one of them will fail. That is correct behaviour. The factory should surface that as a conflict, route it to the appropriate agent or human gate, and log it. What matters is that the conflict is detected reliably (because the write that succeeds is propagated to the losing agent immediately) and that the conflict is logged with full provenance (because every write is signed and appended to the deterministic event log that forms the audit trail).
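A sketch of what “surface the conflict and log it” might look like for the last-carrier-slot case. `SlotLedger` and its record fields are hypothetical. A real log entry would also carry a signature and a policy reference, which this example leaves out.

```python
class SlotLedger:
    """Hypothetical ledger that books carrier slots and records genuine conflicts."""

    def __init__(self, free_slots: int) -> None:
        self.free_slots = free_slots
        self.bookings = []
        self.conflicts = []
        self.seq = 0  # monotonically increasing sequence number for every attempt

    def book(self, agent: str, shipment: str) -> bool:
        self.seq += 1
        if self.free_slots == 0:
            # Genuine conflict: record who lost, to whom, and at which sequence number.
            winner = self.bookings[-1]["agent"] if self.bookings else None
            self.conflicts.append({
                "seq": self.seq,
                "agent": agent,
                "shipment": shipment,
                "lost_to": winner,
                "reason": "no_free_slot",
            })
            return False
        self.free_slots -= 1
        self.bookings.append({"seq": self.seq, "agent": agent, "shipment": shipment})
        return True


ledger = SlotLedger(free_slots=1)
first_ok = ledger.book("pricing", "S1")
second_ok = ledger.book("carrier", "S2")   # same slot, a moment later

print(first_ok, second_ok, ledger.conflicts)
```

The losing attempt is not an error to hide. It is a record that can be routed to another agent or a human gate.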

The distinction matters for regulated buyers. The question is not “can this system prevent all conflicts?” It cannot, and any system that claims otherwise is selling something. The question is “when a conflict occurs, is it detected immediately, logged completely, and resolvable with a full history?” The answer with realtime data plane cells is yes, because the detection is in the propagation layer and the log is the storage engine.

Observability: knowing what happened and why

Coordination without races is only half the problem. Regulated buyers also need to know what happened. Not an approximation of what happened reconstructed from separate logs. The exact sequence of state changes, which agent caused each one, under what policy, and at what cost.

The realtime data plane’s deterministic event log is the audit trail by construction. Every write is signed by the agent identity (the cryptographic identity system that gives every agent a provable Ed25519 key pair, covered in depth in “cryptographic agent identity and the identity service”). Every write is timestamped. Every write propagates to subscribers as a signed delta. The full sequence of deltas is the replay log.

This means that “what happened during the supply-chain reroute” is not a question that requires reconstruction from multiple systems. The answer is in the event log. You can replay the sequence deterministically, see every state transition, see which agent caused it, see the policy it was operating under, and see the budget spent at each step. Replay is covered in full in “deterministic replay as the audit trail”; the point here is that coordination without races and full-history observability are the same mechanism. The DBSP propagation layer that eliminates races is also the layer that records every state change.
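The shape of a signed, replayable log can be sketched with the standard library alone. Two loud assumptions: the article says writes are signed with Ed25519 keys, but Python's standard library has no Ed25519, so this sketch substitutes HMAC-SHA256 with per-agent demo secrets; and the entry layout (`prev`, `hash`, `sig`) is invented here, not the realtime data plane's format. Timestamps are omitted to keep the example deterministic.

```python
import hashlib
import hmac
import json

# Demo secrets only; a stand-in for per-agent Ed25519 key pairs.
AGENT_KEYS = {"carrier": b"carrier-demo-key", "finance": b"finance-demo-key"}


def _body(prev_hash: str, agent: str, change: dict) -> bytes:
    """Canonical bytes that are both hashed (for chaining) and signed."""
    return json.dumps({"prev": prev_hash, "agent": agent, "change": change},
                      sort_keys=True).encode()


def append(log: list, agent: str, change: dict) -> None:
    prev_hash = log[-1]["hash"] if log else ""
    body = _body(prev_hash, agent, change)
    log.append({
        "agent": agent,
        "change": change,
        "prev": prev_hash,
        "hash": hashlib.sha256(body).hexdigest(),
        "sig": hmac.new(AGENT_KEYS[agent], body, hashlib.sha256).hexdigest(),
    })


def verify_and_replay(log: list) -> dict:
    """Check the chain and every signature, then fold the changes into state."""
    state, prev_hash = {}, ""
    for entry in log:
        body = _body(entry["prev"], entry["agent"], entry["change"])
        if entry["prev"] != prev_hash or entry["hash"] != hashlib.sha256(body).hexdigest():
            raise ValueError("hash chain broken")
        expected = hmac.new(AGENT_KEYS[entry["agent"]], body, hashlib.sha256).hexdigest()
        if not hmac.compare_digest(entry["sig"], expected):
            raise ValueError("bad signature")
        state.update(entry["change"])
        prev_hash = entry["hash"]
    return state


log = []
append(log, "carrier", {"booking": "rebooked"})
append(log, "finance", {"cost_eur": 120})
replayed = verify_and_replay(log)

print(replayed)
```

Replaying the same log always yields the same state, and editing any entry after the fact breaks verification.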

The sector walkthrough: supply-chain coordination under governance

Take a supply-chain mission in a regulated context: a pharmaceuticals company rerouting a temperature-controlled shipment when the primary carrier fails. The compliance constraints are real. Chain-of-custody must be maintained. The temperature record must be continuous. The regulatory filing must be updated within four hours of a route change. Every handoff must be documented.

Before Substrate, this is a manual process with agents as advisors. A logistics coordinator gets an alert, phones the carrier, phones the warehouse, emails the compliance team, updates the ERP. Three hours of human coordination if everything goes well. Five hours if the carrier and the warehouse disagree about the handoff time. The compliance filing is assembled at the end from emails and call logs, and it almost always has a gap.

With a Substrate mission, the process changes at the coordination layer. The mission declares the goal (reroute shipment X, maintain chain-of-custody, file within four hours) and the budget. A cell group stands up: one cell watching carrier capacity, one managing the chain-of-custody log, one updating the ERP, one monitoring temperature telemetry, one preparing the regulatory filing. All five are subscribed to the shared shipment-state view.
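As a rough picture of what such a declaration might contain, here is a hypothetical `Mission` structure. None of the field names, the budget figure, or the `spend` method come from Substrate. They are invented to show a goal, a deadline, a hard budget ceiling, and a list of cells that all subscribe to one shared view.

```python
from dataclasses import dataclass, field


@dataclass
class Mission:
    """Hypothetical mission declaration; not Substrate's actual schema."""
    goal: str
    filing_deadline_hours: int
    budget_units: int          # illustrative hard ceiling, unit left abstract
    shared_view: str
    cells: list = field(default_factory=list)
    spent: int = 0

    def spend(self, units: int) -> None:
        # Hard budget: refuse the action rather than overspend.
        if self.spent + units > self.budget_units:
            raise RuntimeError("budget ceiling reached")
        self.spent += units


mission = Mission(
    goal="reroute shipment X, maintain chain-of-custody, file within four hours",
    filing_deadline_hours=4,
    budget_units=100,
    shared_view="shipment_state",
    cells=["carrier_capacity", "chain_of_custody", "erp_update",
           "temperature_telemetry", "regulatory_filing"],
)

mission.spend(40)
mission.spend(60)
try:
    mission.spend(1)
    over_budget_blocked = False
except RuntimeError:
    over_budget_blocked = True

print(len(mission.cells), mission.spent, over_budget_blocked)
```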

When the primary carrier fails, the carrier-capacity agent writes the failure to the shipment state. All other agents receive the delta in the same propagation cycle. The chain-of-custody agent opens a new custody record. The temperature-monitoring agent flags the gap. The ERP agent updates the record. The filing agent begins assembling the evidence. They are working from the same state, at the same moment, without races.

The human gate is where policy requires it: a named compliance officer approves the new carrier assignment before the booking is confirmed. That gate is part of the mission orchestrator plan. It is not an ad-hoc phone call. The approval is logged, signed, and part of the audit record.

The regulatory filing is ready within the window not because someone worked quickly, but because the evidence was assembled as the work happened. The chain of custody record, the temperature log, the carrier handoff timestamps, the compliance officer’s approval: all of it is in the event log, signed, and exportable in the format the regulator requires.

The before and after for the compliance officer is stark. Before: three to five hours of coordination, a filing assembled from memory and emails, a gap in the temperature record that requires explanation. After: less than an hour of active work, a filing assembled by the factory, a complete and signed record of every decision, and a gap in the temperature record that the system flagged and routed to the compliance officer rather than buried.

What to demand in an RFP

When evaluating a multi-agent platform for regulated use, the coordination question separates systems that will hold up from systems that will fail under load or under regulatory scrutiny. Here is what to ask.

Ask whether the state shared between agents lives in the same process or crosses a network boundary. If it crosses a network boundary, ask what the guaranteed propagation latency is under load. A system that adds 200 ms to every coordination step will not deliver the cycle times that make the factory useful.

Ask how a race condition is detected and reported. A system that detects races after the fact via conflict-resolution logic is telling you it expects races. A system that eliminates them through synchronous propagation is telling you the architecture was designed for this.

Ask for a replay of a production incident. Not a reconstructed timeline. A deterministic replay of the exact sequence of agent actions from the event log. If the vendor cannot produce this in a demo, the observability story is aspirational.

Ask what happens to the audit log when a cell fails. The answer should be that the log is append-only and durable by construction, not that the operator will need to recover it from a backup. The event log is the product, not a feature.

Ask whether coordination state and audit state come from the same system. If cost control lives in one tool, audit logs in another, and coordination state in a third, you have three sources of truth that will disagree at exactly the moment you need them to agree, which is during a regulatory examination.

A 90-day pilot design

Pick one regulated workflow where multi-agent coordination under time pressure is the pain point. Supply-chain exception handling, clinical prior-authorisation routing, AML alert triage: the pattern is the same. Multiple specialised agents need to act on the same changing state, and the output needs to be auditable.

Set up a mission with three to five agent cells sharing a state view. Measure three things over ninety days. First, the rate of coordination conflicts: how often do two agents attempt incompatible actions on the same state? In a correctly configured realtime data plane cell group, this number should be close to zero for the race class and low but non-zero for the genuine-conflict class. Second, the time from a state change to the last agent receiving the delta: this is your propagation latency, and it should be sub-millisecond within a host. Third, the completeness of the replay log: pick five incidents at random and replay them deterministically. The replay should match what happened exactly.
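The three measurements reduce to very little code once the pilot emits the right records. The sketch below uses synthetic data and hypothetical record shapes (an `outcome` label per action, timestamps in microseconds, events as key-value pairs). A real pilot would read these from the event log.

```python
def conflict_rates(actions):
    """Share of actions per conflict class; outcome is 'ok', 'race', or 'genuine_conflict'."""
    total = len(actions)
    return {
        kind: sum(a["outcome"] == kind for a in actions) / total
        for kind in ("race", "genuine_conflict")
    }


def propagation_latency_us(write_ts_us, receive_ts_us):
    """Time from a state change to the last agent receiving the delta."""
    return max(receive_ts_us) - write_ts_us


def replay_matches(events, recorded_final_state):
    """Fold the logged events in order and compare with what was recorded."""
    state = {}
    for key, value in events:
        state[key] = value
    return state == recorded_final_state


# Synthetic pilot data, for illustration only.
actions = [{"outcome": "ok"}] * 98 + [{"outcome": "genuine_conflict"}] * 2
rates = conflict_rates(actions)

latency = propagation_latency_us(1_000_000, [1_000_200, 1_000_400, 1_000_300])

events = [("booking", "pending"), ("booking", "rebooked"), ("filing", "drafted")]
replay_ok = replay_matches(events, {"booking": "rebooked", "filing": "drafted"})

print(rates, latency, replay_ok)
```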

The third test is the one that matters most for regulated buyers. A system that coordinates well but cannot replay is a system you cannot defend to an auditor. In the realtime data plane, the replay test is not a special procedure. It is just reading the event log in order. That is what “the audit trail is the storage engine” means in practice.

You can read how the DBSP mechanism itself works in more detail in “DBSP and the death of polling”. The runtime that hosts these cells is covered in “one-binary agent runtime”. And if you want to understand the scaling story from the cell side, the boot-time and density figures are examined in “the cell runtime: cell density and the 50ms boot guarantee”.

Closing: coordination is a first-class property

The race condition at the start of this piece was not the result of incompetent engineering. It was the result of treating coordination as a solved problem: put a database in the middle, use transactions where needed, handle conflicts when they arise. That approach works for human-paced systems where 40 ms is invisible and conflicts are rare. It does not work for agent systems where 40 ms is an eternity and conflicts are structural.

The factory model Substrate runs on treats coordination as a first-class property of the runtime, not an application-level concern. Agents do not coordinate by sending messages to each other or by polling a shared store. They coordinate by subscribing to a shared state that propagates deltas synchronously, within the same binary, with deterministic semantics and a signed history.

That is not a subtle optimisation. It is the difference between a system that races and a system that does not. For a regulated buyer, it is also the difference between a system they can defend and one they cannot.

Declare the mission and the budget. The factory handles the rest, and it does not race.

If you want to see how this fits into the full factory picture, including the budget governance, the identity layer, and the sovereign deployment model, you can request the investor brief.