DBSP incremental view maintenance: the death of polling for real-time agent state
A trading desk I heard about last year ran into a problem that nobody would admit was a problem. Their agent-assisted order book ran on a fairly standard setup: a Postgres write, a trigger that pushed a message to a Redis pub/sub channel, a WebSocket server that fanned it out, and a fleet of subscriber agents that re-ran their position queries when a message arrived. End-to-end latency across the chain was somewhere between 80 and 150 milliseconds, depending on Redis load. Not catastrophic. The demos looked fine.
Then they tried to run more than a handful of agents simultaneously, each subscribed to a slightly different slice of the same data. The message storm from the triggers multiplied. The agents all woke up at once, all re-ran full queries against the same Postgres tables, and the database spiked. They added read replicas. The replicas introduced replication lag. Now some agents were computing positions against stale data and some were not, and nobody could tell which.
The engineers on the project were not incompetent. They were using the tools everybody uses. The problem was not a bug. It was an architectural property of polling-and-notify stacks: each subscriber pays the full query cost on every change, the cost scales linearly with subscriber count, and consistency across subscribers is at best eventual. There is no remedy that fits neatly into that architecture. You can add caches, but caches add staleness. You can batch updates, but batching adds latency. You can scale the database, but you are treating a symptom rather than a cause.
The same pattern turns up everywhere that agents need shared state: kanban boards where a dozen agents coordinate on task assignment, shared mission memory where each agent needs to see the full current picture, exception queues where a new filing needs to wake exactly the right reviewers. In every case the polling-and-notify approach either burns compute, introduces lag, or both. For an agent swarm that is supposed to feel alive, these are not minor inconveniences.
The problem with how real-time databases usually work
Before getting into what DBSP actually does, it helps to be precise about what every other approach gets wrong.
The most common real-time pattern is change data capture combined with client-side re-subscription. A write happens. The database emits an event (a row-level trigger, a Debezium capture, a logical replication slot). The event lands in a queue. Each subscriber receives it and re-runs their query. This works. It also means that if you have a query involving a JOIN between two tables and an aggregate on top of that, every subscriber re-evaluates the entire JOIN and the entire aggregate on every change, regardless of whether that change could possibly affect their result. The compute cost is O(N) in table size, multiplied by the number of subscribers.
The second pattern is materialised views with polling. A database materialised view pre-computes the result. Clients poll it on a timer. This is better than re-running joins repeatedly, but the view itself is refreshed on a schedule, not on a write, so you have built latency in as a design parameter. And materialising many small views for many subscribers scales poorly in storage and maintenance cost.
Materialize (the company, not the concept) has done serious work here with an incremental view maintenance engine built on Timely Dataflow and Differential Dataflow. If you are operating that tier of infrastructure it is a real option. But you are also now running a separate streaming database alongside your application server alongside your data store, which means you have three consistency boundaries, three operational surfaces, and three places where a failure or a lag can leave a subscriber looking at stale data.
Cosmictron takes a different approach: the runtime, the data store, and the subscription machinery are the same binary. The subscription engine is DBSP.
What DBSP actually does
DBSP (Database Stream Processor) is the result of research by Budiu, Chajed, McSherry, Ryzhyk, and Tannen, published in the Proceedings of the VLDB Endowment in 2023 (source: M. Budiu et al., “DBSP: Automatic Incremental View Maintenance for Rich Query Languages,” VLDB 2023). The core idea is that any SQL query can be compiled into a circuit of stream operators such that each operator processes only the change to its input (expressed as a Z-set, in the paper’s notation) rather than the full input. The circuit is compositional: you can build complex queries from simple operators and the incremental property holds for the whole composition.
In concrete terms this means:
A JOIN between tables A and B, where a row is inserted into A, touches only the rows in B that match the new row. The rest of B is not read. The cost is O(matching rows in B), not O(|A| * |B|).
An aggregate like COUNT(*) or SUM(amount) maintains a running total. A new row adds its contribution in O(1). No scan is required.
A query of the form SELECT * FROM (SELECT ... GROUP BY ...) WHERE ... is incrementalised end-to-end. Each stage in the pipeline receives only the delta from the stage below it, processes that delta, and emits a delta upward.
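The first two cases above can be sketched in a few lines of Python. This is a minimal illustration and not Cosmictron's implementation or any DBSP library API: the class names `IncrementalJoin` and `RunningSum`, the row shapes, and the use of a hash index on the join key are all assumptions made for the example.

```python
from collections import defaultdict


class IncrementalJoin:
    """Equi-join of A and B on a key, maintained one inserted row at a time.

    Hypothetical sketch: each side keeps a hash index on the join key, so an
    insert probes only the matching rows on the other side.
    """

    def __init__(self):
        self.a_index = defaultdict(list)  # join key -> rows of A
        self.b_index = defaultdict(list)  # join key -> rows of B

    def insert_a(self, key, row):
        """Insert into A; return only the new join results (the delta)."""
        self.a_index[key].append(row)
        return [(row, b_row) for b_row in self.b_index[key]]

    def insert_b(self, key, row):
        """Insert into B; return only the new join results (the delta)."""
        self.b_index[key].append(row)
        return [(a_row, row) for a_row in self.a_index[key]]


class RunningSum:
    """SUM and COUNT maintained in O(1) per change, with no rescan."""

    def __init__(self):
        self.total = 0
        self.count = 0

    def apply(self, amount, weight=1):
        """weight=+1 for an insert, -1 for a delete of the same amount."""
        self.total += amount * weight
        self.count += weight
        return self.total


join = IncrementalJoin()
join.insert_b("acct-1", {"limit": 500})
join.insert_b("acct-2", {"limit": 900})

# Inserting into A touches only the B rows that share its key.
delta = join.insert_a("acct-1", {"order": 42, "amount": 120})

# The aggregate consumes the join delta, not the whole join result.
total = RunningSum()
for a_row, _b_row in delta:
    total.apply(a_row["amount"])
```

The row for `acct-2` is never read when the `acct-1` order arrives, and the aggregate does one addition per delta row. Deletes would flow through the same path with a weight of -1.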
The “circuits” framing is literal: DBSP represents a query plan as a directed graph of stream operators, where data flows as Z-sets (sets with integer multiplicities, positive for insertions and negative for deletions). Each operator in the circuit has a well-defined incremental form. The paper proves this compositionally, which means the incremental property does not break down as queries become more complex. This is not an approximation or a heuristic. It is a mathematical guarantee.
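To make the Z-set idea concrete, here is a small Python sketch. The `ZSet` class and its method names are hypothetical, written for this post rather than taken from the paper's code or from Cosmictron; rows are plain tuples so they can be used as dictionary keys.

```python
class ZSet:
    """A collection of rows with integer weights (illustrative sketch only)."""

    def __init__(self, weights=None):
        # Drop zero weights so that equal Z-sets compare equal.
        self.weights = {r: w for r, w in (weights or {}).items() if w != 0}

    def __add__(self, other):
        merged = dict(self.weights)
        for row, w in other.weights.items():
            merged[row] = merged.get(row, 0) + w
        return ZSet(merged)

    def __neg__(self):
        return ZSet({row: -w for row, w in self.weights.items()})

    def __eq__(self, other):
        return isinstance(other, ZSet) and self.weights == other.weights

    def filter(self, predicate):
        """Keep rows matching the predicate, weights unchanged."""
        return ZSet({r: w for r, w in self.weights.items() if predicate(r)})


def is_blocked(row):
    return row[1] == "blocked"


# Current table contents: every row present once.
table = ZSet({("t1", "blocked"): 1, ("t2", "open"): 1})

# An UPDATE of t2 from open to blocked is one deletion plus one insertion.
delta = ZSet({("t2", "open"): -1, ("t2", "blocked"): 1})

# Filter is linear, so filtering the delta gives the delta of the filtered view.
output_delta = delta.filter(is_blocked)
full_recompute = (table + delta).filter(is_blocked)
incremental = table.filter(is_blocked) + output_delta
```

Both routes give the same view, but the incremental one only ever looked at the two rows in `delta`. Operators such as filter and projection have this property directly; the paper's contribution is showing how joins, aggregates, and their compositions can be given incremental forms as well.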
[Interactive demo: toggle between DBSP incremental and naive full re-evaluation and compare the cumulative compute units per write. In incremental mode a single delta row travels the circuit; in naive mode the entire graph is re-evaluated on every write.]
For a query with a JOIN over two tables, a filter, and an aggregate, this means that a single-row insert with one join match propagates a delta of exactly one row through the JOIN probe, one filtered result through the FILTER operator, and one increment through the aggregate. The subscriber sees a result update in sub-millisecond time. Not because the hardware is faster, but because the computation is fundamentally smaller.
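A rough way to see the size of the gap is to count rows touched under each strategy. The sketch below assumes a hypothetical query, the sum of order amounts for active accounts, with a fixed accounts table and insert-only orders; the function and class names are invented for the example.

```python
def naive_total(orders, accounts):
    """Full re-evaluation: rescans every order on every call."""
    touched = 0
    total = 0
    for account_id, amount in orders:
        touched += 1
        if accounts.get(account_id, False):  # join probe plus filter on active
            total += amount
    return total, touched


class IncrementalTotal:
    """Same query, maintained from the delta of each inserted order."""

    def __init__(self, accounts):
        self.accounts = accounts
        self.total = 0

    def insert_order(self, account_id, amount):
        touched = 1  # only the new row flows through join, filter, aggregate
        if self.accounts.get(account_id, False):
            self.total += amount
        return self.total, touched


accounts = {"a": True, "b": False}  # account id -> active flag
orders = []
incremental = IncrementalTotal(accounts)
naive_work = 0
incremental_work = 0

for i in range(1000):
    order = ("a" if i % 2 == 0 else "b", 10)
    orders.append(order)
    naive_result, touched = naive_total(orders, accounts)
    naive_work += touched
    incremental_result, touched = incremental.insert_order(*order)
    incremental_work += touched

print(naive_work, incremental_work)  # cumulative rows touched by each strategy
```

After a thousand writes the two strategies agree on the answer, but the naive one has touched 500,500 rows against 1,000 for the incremental one. The naive total grows quadratically with the number of writes; the incremental total grows linearly.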
Cosmictron compiles every SQL subscription a client registers into a DBSP circuit at subscription time. The circuits run continuously inside the binary. When a write arrives, it enters the relevant source nodes of the circuits that depend on the affected tables, and only the affected paths propagate deltas to their subscribers. The rest of the circuits do nothing. An agent that has subscribed to SELECT * FROM mission_tasks WHERE status = 'blocked' does not re-evaluate anything when a row is inserted into agent_logs. The circuit for that subscription was never wired to that table.
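The wiring described here can be pictured as a map from source tables to the subscriptions that depend on them. The following is a toy sketch under that assumption; `SubscriptionRouter` and its methods are hypothetical names, each subscription is reduced to a single-table predicate, and none of it is Cosmictron's actual API.

```python
from collections import defaultdict


class SubscriptionRouter:
    """Routes each write only to subscriptions wired to the written table."""

    def __init__(self):
        self.by_table = defaultdict(list)  # table name -> subscriptions

    def subscribe(self, name, table, predicate):
        sub = {"name": name, "predicate": predicate, "received": []}
        self.by_table[table].append(sub)
        return sub

    def write(self, table, row):
        """Insert a row; return the names of subscriptions that got a delta."""
        notified = []
        for sub in self.by_table.get(table, []):
            if sub["predicate"](row):
                sub["received"].append(row)
                notified.append(sub["name"])
        return notified


router = SubscriptionRouter()
blocked = router.subscribe(
    "blocked_tasks",
    "mission_tasks",
    lambda row: row["status"] == "blocked",
)

# A write to an unrelated table never reaches the subscription.
log_write = router.write("agent_logs", {"agent": "a1", "msg": "started"})

# A write to the subscribed table reaches it only if the delta passes the filter.
open_write = router.write("mission_tasks", {"id": 1, "status": "open"})
blocked_write = router.write("mission_tasks", {"id": 2, "status": "blocked"})
```

The lookup by table name is what keeps the `agent_logs` write from costing the `blocked_tasks` subscriber anything at all: there is no predicate to evaluate because there is no edge to follow.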
How Substrate uses this in practice
Cosmictron is one of the six purpose-built systems in the Substrate dark factory. It is the live data plane: the runtime and the realtime database embedded in every agent cell. The fact that it uses DBSP is not an architectural footnote; it is the property that makes a hundred agents coordinating on a shared mission plan feasible on a single node rather than requiring an external streaming database to hold the whole thing together.
The measured throughput on one node is 2,326 agent actions per second at 0.43 milliseconds average latency (source: Cosmictron internal benchmarks, publicly stated on the Substrate homepage). That figure is not a read-benchmark on a small table. It is a write-through benchmark with real subscribers. The reason it is achievable is that writes propagate incrementally: each action is a delta, and the delta is small.
For multi-agent coordination this changes the character of the problem. In a polling-and-notify system, the freshness of any agent’s view is a function of the polling interval plus the re-query time plus any replication lag. In an agent swarm with hundreds of cells, you end up with a probabilistic distribution of staleness across agents. Some agents are acting on data that is 300 milliseconds old. Others are acting on data that is 50 milliseconds old. Coordination that depends on shared agreement (“which tasks are unassigned right now?”) becomes unreliable at the edges.
With DBSP-based subscriptions inside a single binary, writes propagate to all subscribers before the next action in the event loop. The agents are not polling. They receive delta updates. They are not re-querying a shared database over the network. They are reading from a co-located data structure that was updated by the write itself. The coordination primitive is shared state, not message agreement.
[Interactive demo: hundreds of subscriber cells receiving sub-millisecond delta updates from a single Cosmictron binary, each cell lighting up when a delta arrives. No external queue, no database round-trips, just incremental propagation inside one process.]
The practical consequence for the agent use-cases that appear most often in regulated workflows:
Live order books. A trading agent’s view of the order book is always current to the last write. No lagged reads. No replica staleness. When the market moves, every subscribed agent sees the same delta at the same time.
Kanban state. A mission broken into tasks is maintained as a live table. Every agent that subscribes to its assigned tasks sees new assignments immediately. A coordinator that subscribes to the aggregate “tasks unassigned” sees a count update in O(1) when a task changes status.
Shared mission memory. Cosmictron-cell gives every agent its own embedded database. But agents in the same mission subscribe to a shared mission state table. A decision made by one agent is visible to all others as a delta before the next clock tick of the swarm.
Exception queues. A new exception row lands in a table. The compliance agent subscribed to WHERE exceptions.status = 'unreviewed' receives the delta immediately. The notification is not a message that competes in a queue with other messages. It is a live result set update.
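The exception-queue and kanban cases share one shape: a filtered live result set with a count on top, updated from row changes. A minimal sketch of that shape follows; `FilteredView` and `on_change` are hypothetical names, and a status change is modelled as the old row leaving and the new row arriving.

```python
class FilteredView:
    """Live result of a filtered SELECT plus a live COUNT(*) over it."""

    def __init__(self, predicate):
        self.predicate = predicate
        self.rows = {}  # primary key -> row currently in the result set
        self.count = 0

    def on_change(self, key, old_row, new_row):
        """Apply one row change; return the delta as (weight, row) pairs."""
        delta = []
        if old_row is not None and self.predicate(old_row):
            self.rows.pop(key, None)
            self.count -= 1
            delta.append((-1, old_row))
        if new_row is not None and self.predicate(new_row):
            self.rows[key] = new_row
            self.count += 1
            delta.append((1, new_row))
        return delta


unreviewed = FilteredView(lambda row: row["status"] == "unreviewed")

filed = {"id": 7, "status": "unreviewed"}
reviewed = {"id": 7, "status": "reviewed"}

# A new exception lands: the reviewer's result set gains one row.
on_insert = unreviewed.on_change(7, None, filed)
count_after_insert = unreviewed.count

# The exception is reviewed: the same row is retracted from the result set.
on_review = unreviewed.on_change(7, filed, reviewed)
count_after_review = unreviewed.count
```

Each change costs one or two predicate checks and a counter adjustment, regardless of how many rows the table holds. A change that neither enters nor leaves the filter produces an empty delta, so the subscriber is not woken.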
The “one binary” point is load-bearing
There is a tendency when reading about DBSP to think of it as an optimisation on top of an existing architecture: you have your database, you have your application layer, and you add an incremental view maintenance engine to make subscriptions faster. This is how Materialize works (on Differential Dataflow rather than DBSP), and it is a legitimate design, but it means you still have the consistency boundaries problem. The application writes to one system. The incremental engine reads from the write-ahead log of that system. The subscription results come from the incremental engine. A reader subscribed to that engine is reading from a derivative that lags the primary by however long the CDC pipeline takes.
In Cosmictron, this distinction does not exist. The write and the subscription update happen in the same transaction, inside the same process, in the same binary. There is no CDC pipeline. There is no log-shipping delay. The circuit runs inline with the write path. This is why the 0.43 millisecond figure is an end-to-end action latency, not a subscription propagation latency measured separately from the write.
For a governed dark factory that must produce a deterministic replay of every action (Cosmictron’s replay audit trail is the storage engine itself), co-location is not optional. You cannot produce a tamper-evident replay of “agent A subscribed to this query and saw these rows at this time” if the subscription results came from a separate system with its own clock and its own lag. The co-location is what makes the audit trail coherent.
This connects directly to the Substrate principle that every action is signed and auditable by construction. When a Cosmictron cell records that agent A read the current task state at time T, that record is precise: the cell knows exactly which rows the DBSP circuit had propagated by time T. There is no uncertainty introduced by polling intervals or replication lag. The replay is exact.
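One way to picture "what agent A saw at time T" is as a replay of the deltas delivered to A's subscription up to T. The sketch below assumes a single append-only log with one sequence counter, which is the property co-location is supposed to provide; `DeltaLog` and its methods are hypothetical, and signing and tamper evidence are left out entirely.

```python
class DeltaLog:
    """Append-only record of subscription deltas under one sequence counter."""

    def __init__(self):
        self.entries = []  # (sequence number, subscription, weight, row)

    def record(self, subscription, weight, row):
        """Append one delta and return its sequence number."""
        seq = len(self.entries) + 1
        self.entries.append((seq, subscription, weight, row))
        return seq

    def view_at(self, subscription, seq):
        """Replay deltas up to and including seq to rebuild the result set."""
        weights = {}
        for entry_seq, sub, weight, row in self.entries:
            if entry_seq > seq:
                break
            if sub == subscription:
                weights[row] = weights.get(row, 0) + weight
        return {row for row, w in weights.items() if w > 0}


log = DeltaLog()
s1 = log.record("open_tasks", 1, ("t1", "open"))
s2 = log.record("open_tasks", 1, ("t2", "open"))
s3 = log.record("open_tasks", -1, ("t1", "open"))  # t1 was picked up

seen_at_s2 = log.view_at("open_tasks", s2)
seen_at_s3 = log.view_at("open_tasks", s3)
```

With one log and one counter the answer at any sequence number is unambiguous. Split the writes and the subscription results across two systems and there are two counters, and the replay depends on how you choose to line them up.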
If you want to understand why the glue-stack approach cannot be retrofitted to produce this, the post on deterministic replay as the audit trail goes through the consistency-boundary problem in more detail. The short version is that once you have three systems (application, streaming engine, database) with three separate clocks, you cannot reconstruct a single consistent timeline of “what was known when” from their logs after the fact. You need the property by construction, which means you need everything in the same binary.
A concrete sector walkthrough: regulated exceptions in healthcare
Consider a clinical coding workflow. A hospital processes tens of thousands of discharge records weekly. Each record needs to be coded against ICD-10 criteria, checked for completeness, and flagged if the coding decision is ambiguous or if the record falls into a high-value diagnostic group that carries audit risk. Before the coding is submitted, a senior coder needs to review the flagged cases.
The naive agent setup is: a batch job runs every few minutes, queries a table of incomplete records, assigns a batch to a coding agent, the agent runs, updates the record, and the batch job runs again. The senior coder’s queue is refreshed on the same schedule. Latency from coding completion to queue appearance: up to five minutes, depending on batch cadence.
With DBSP subscriptions, the coding agent writes its decision to the record. The senior coder’s subscription to WHERE records.flag = 'review_required' AND records.status = 'coded' receives the delta immediately. The coder sees the case without waiting for the next batch. Simultaneously, a compliance agent subscribed to aggregate metrics sees the count of open flagged cases update in O(1). The mission coordinator sees the task status update in the shared kanban. None of these subscribers did anything. The writes propagated to them.
The economics here are not just about latency. In a high-volume coding environment, the bottleneck is the senior coder’s time. A delta-push system means the queue is always fresh; the coder can work continuously without a polling loop that wastes attention time. Batch processing means the coder either waits for the batch or works from a stale queue and misses new arrivals that came in after the last refresh. The latency difference is a productivity difference.
The before and after in numbers is intentionally left illustrative here because the specifics depend heavily on volume and team size. What is not illustrative is the architectural property: O(delta) propagation vs O(N) re-query is a structural difference with a structural consequence. The gap widens as subscriber count and table size grow.
For the governance side: the EU AI Act Article 12 requirements for high-risk systems in healthcare include automatic recording of events over the system’s lifetime. An architecture where subscription lag introduces ambiguity about what an agent saw at what time creates a compliance problem that is hard to characterise and harder to explain to a regulator. The Cosmictron approach produces a precise record. Whether a DBSP subscriber update counts as an “event” for Article 12 purposes is a legal question; what Substrate provides is the technical substrate (there is genuinely no better word for it) to answer that question cleanly.
Finance: live order books and mission state
A trade finance workflow is one of the pilot cases Substrate has scoped. The core loop is: ingest documents, build entity and transaction graphs, run policy checks against sanctions lists and counterparty risk models, surface exceptions, assemble a signed evidence pack.
Each step produces writes that downstream agents need to see. The entity graph builder writes new nodes. The policy check agent subscribes to the entity graph and runs checks on new nodes as they appear. The exception sorter subscribes to policy check results and categorises them. The evidence assembler subscribes to the complete set of categorised exceptions and knows when the last one has been processed.
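The shape of that chain, with each stage subscribed to the output of the one before it, can be sketched as push-based tables. Everything here is illustrative: `LiveTable`, the table names, the entity names, and the one-entry sanctions set are stand-ins, and the policy check is reduced to a set lookup.

```python
class LiveTable:
    """A table that pushes each inserted row to its subscribers (sketch)."""

    def __init__(self):
        self.rows = []
        self.subscribers = []

    def subscribe(self, callback):
        self.subscribers.append(callback)

    def insert(self, row):
        self.rows.append(row)
        for callback in self.subscribers:
            callback(row)


# Hypothetical tables for the stages described above.
entity_nodes = LiveTable()
policy_results = LiveTable()
exceptions = LiveTable()
evidence_pack = []

SANCTIONED = {"Example Shell Ltd"}  # stand-in for a real sanctions list


def policy_check(node):
    """Runs on each new entity node as it appears."""
    hit = node["name"] in SANCTIONED
    policy_results.insert({"entity": node["name"], "sanctions_hit": hit})


def sort_exception(result):
    """Categorises failed checks; passing checks produce no exception."""
    if result["sanctions_hit"]:
        exceptions.insert({"entity": result["entity"], "category": "sanctions"})


entity_nodes.subscribe(policy_check)
policy_results.subscribe(sort_exception)
exceptions.subscribe(evidence_pack.append)

# Two writes at the head of the chain propagate through every stage.
entity_nodes.insert({"name": "Example Freight GmbH"})
entity_nodes.insert({"name": "Example Shell Ltd"})
```

No stage asks "is there anything new?". Each write at the head runs exactly the downstream work it causes and nothing else. The sketch leaves out how the assembler knows the last exception has been processed, which in practice needs a completion signal of some kind.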
In a polling system each hop in this chain can add up to one polling interval of waiting. Three hops at one-second polling means up to three seconds of added pipeline latency (about a second and a half on average, if writes land at random points in the poll cycle) even when compute is idle. In a DBSP system the chain latency is the sum of compute times at each step, with no added polling wait. For a workflow that needs to produce evidence packs within a trading session window, the difference matters.
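The arithmetic is simple enough to write down. The sketch below works in milliseconds and assumes each write lands at a uniformly random point in the downstream poll cycle, so the expected wait per hop is half an interval; the function names and the compute times are made up for illustration.

```python
def polling_wait_ms(hops, interval_ms):
    """Waiting added by polling alone, ignoring compute.

    Returns (worst case, average), assuming uniform arrival within a cycle.
    """
    worst = hops * interval_ms
    average = hops * interval_ms / 2
    return worst, average


def chain_latency_ms(compute_ms, interval_ms=0):
    """Worst-case end-to-end latency: compute at each hop plus polling wait."""
    return sum(compute_ms) + len(compute_ms) * interval_ms


compute_ms = [10, 20, 30]  # hypothetical per-hop compute times

polled_wait = polling_wait_ms(hops=3, interval_ms=1000)
polled_chain = chain_latency_ms(compute_ms, interval_ms=1000)
pushed_chain = chain_latency_ms(compute_ms)  # no polling interval at all
```

With these illustrative numbers the push-based chain finishes in 60 ms, while the polled chain can take up to 3,060 ms. Almost all of the polled figure is waiting, not work.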
The mission coordinator in Ninmu sees the task graph update in real time. When the policy check agent completes its last check, Ninmu’s subscription to the task completion state sees the delta, updates the budget ledger, and can proceed to the next phase without a polling delay. The post on multi-agent coordination without races goes into this in detail. The coordination is through shared state, not through message passing that requires agreement on message ordering under contention.
What to look for in an RFP
If you are evaluating any agentic platform for a workflow where state freshness matters (which is most workflows where the agents need to coordinate), the subscription model is worth interrogating carefully.
Ask how subscriptions work at the implementation level. If the answer involves change data capture from a primary database into a secondary streaming engine, ask how the system handles the consistency window between the primary write and the secondary update. Ask what the maximum observable staleness is. Ask what happens to this latency as the number of subscribers grows. The answer should not be “it scales linearly.”
Ask whether the subscription cost is O(N) in table size or O(delta). Specifically, ask whether a query with a JOIN and an aggregate re-evaluates the full join and the full aggregate on every write. Most systems do. A DBSP-based system does not.
Ask about the binary count. A system that requires separate infrastructure for the application tier, the database, and the streaming engine is not just operationally heavier. It has structural consistency limitations that cannot be solved by adding caches or reducing poll intervals. A single-binary system with co-located subscriptions has a fundamentally different consistency model.
Ask whether the subscription model is part of the audit trail. If the system cannot tell you precisely what data each agent saw at each point in time, with no ambiguity introduced by polling lag or replication, you have a compliance exposure for any high-risk application under EU AI Act Article 12 or equivalent logging requirements.
Materialize is the most serious external comparison in this space and is worth naming. It solves the incremental view maintenance problem at the streaming database layer. The architectural difference with Cosmictron is the co-location point: Materialize is a separate system you run alongside your application. For applications where the separation is acceptable, it is a capable choice. For a governed agent factory where the audit trail must be coherent across write and read, the separation introduces the exact consistency ambiguity described above.
A 90-day pilot design
Pick one workflow where your agents currently poll: a queue that agents check on a timer, a kanban board that agents refresh on a schedule, or a shared state table that agents query repeatedly to check for updates.
Measure three baselines. The polling interval currently in use. The average re-query time (full table scan cost or index-scan cost depending on the query). The number of agents subscribed to that state.
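The first two numbers add up to the worst-case staleness in the current system: a change can wait a full polling interval before anyone looks, and then the re-query has to finish. The re-query time multiplied by the number of agents gives you the aggregate query cost per polling cycle, and dividing that by the polling interval gives the sustained load on the database. For most teams these numbers are not tracked anywhere, which is itself informative.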
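As a worked example of turning the three baselines into something comparable, the sketch below takes staleness as the polling interval plus the re-query time, as in the earlier discussion of polling-and-notify systems, and query load as re-query time times agent count. The function name and the sample figures are assumptions, not measurements.

```python
def polling_baseline(interval_ms, requery_ms, agents):
    """Derive headline figures from the three measured baselines."""
    # A change can wait a full interval, then the re-query has to finish.
    worst_staleness_ms = interval_ms + requery_ms
    # Every agent re-runs the query once per cycle.
    query_ms_per_cycle = requery_ms * agents
    # Fraction of wall-clock time the database spends on these re-queries;
    # above 1.0 means more than one core's worth of sustained work.
    sustained_load = query_ms_per_cycle / interval_ms
    return {
        "worst_staleness_ms": worst_staleness_ms,
        "query_ms_per_cycle": query_ms_per_cycle,
        "sustained_load": sustained_load,
    }


# Illustrative inputs: 1 s polling, 50 ms re-query, 40 subscribed agents.
baseline = polling_baseline(interval_ms=1000, requery_ms=50, agents=40)
```

With those inputs the worst-case staleness is 1,050 ms and the agents together generate two seconds of query work per one-second cycle. Halving the polling interval improves the first number and doubles the load, which is the trade-off the pilot is meant to expose.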
Run the equivalent on Cosmictron for four weeks. The implementation involves writing the subscription query, which is a normal SQL SELECT. The DBSP circuit is compiled automatically. Then measure: subscription propagation latency per write, aggregate CPU time spent on subscription maintenance per hour, and time-to-queue for new items (the time from when a write completes to when the relevant subscriber sees it).
The fourth week, stress the subscriber count. Add more agents subscribed to the same state. In a polling system the aggregate query load grows linearly. In a DBSP system the propagation cost of a write is determined by the size of the delta, not the number of subscribers.
If the numbers do not separate, you have learned something useful cheaply. If they do separate, you have a concrete before/after that a CIO or audit committee can read.
The deeper point
The DBSP paper’s contribution is mathematical before it is practical. The proof that arbitrary SQL queries can be compiled into compositional incremental circuits is not an engineering trick. It is a result about the structure of relational algebra. The practical implication is that incremental maintenance is not a feature you add to specific query shapes. It is a property that holds for any query the system can express.
This matters for agent systems because you do not know at build time what queries your agents will need. A mission planner in Ninmu may generate a subscription for a query shape that was not anticipated when the system was deployed. A DBSP-based system will incrementalise it automatically. A system that handles incremental maintenance as a special case (materialised views for known queries, polling for everything else) will fall back to full re-evaluation for anything outside the known set.
The one-binary agent runtime post covers the broader story of what it means to run the database, the runtime, and the subscription engine in the same process. The DBSP subscription engine is one piece of that; the WASM hot-reload, the durable sessions, and the deterministic replay are others. Together they are what makes the Substrate claim “the audit trail is the storage engine” true rather than aspirational.
For an agent swarm that must feel alive to the people running it (and defensible to the people auditing it), the subscription model is not a background concern. It is the nervous system. A swarm that polls is a swarm that is always slightly behind reality, always burning compute to catch up, and never quite sure that every agent is working from the same picture.
DBSP subscriptions are not the only reason Cosmictron hits 2,326 actions per second at 0.43 milliseconds. But they are the reason that number does not collapse as the subscriber count grows. And for a factory meant to run at the scale that makes regulated knowledge work economically viable, that is the number that matters.
What to read next
If the coordination angle is the one you want to pull on: multi-agent coordination without races covers how shared state inside Cosmictron eliminates the message-ordering problems that make external queues unreliable at scale.
If the audit and replay angle matters more: deterministic replay as the audit trail explains why the co-location of write and subscription is necessary for coherent replay, not just convenient.
For the complete picture of what Ninmu does with the state Cosmictron surfaces: declare the mission and the budget covers the mission planner, the hard budget ledger, and why cost governance has to be in the orchestration layer.
And if you are evaluating the factory as a whole for a regulated sector: request the investor brief. The brief covers the financial model, the six-system architecture, and the pilot design in more depth than a blog post can.
All benchmarks cited are from Cosmictron’s published metrics on the Substrate homepage (2,326 actions/s at 0.43 ms, one node). The DBSP theoretical grounding is from Budiu et al., VLDB 2023, publicly available. Sector examples are illustrative architectures, not statements about specific customer deployments.