Article

Deterministic replay is the audit trail: the whole run, reconstructable

author Jonathan Conway
timestamp 15 May 2026
classification cosmictron / ultra / audit-trail / deterministic-replay / regulated / dora / mifid / eu-ai-act / substrate

A head of operations at a large asset manager told me last year that they had spent four months and a considerable amount of consultant time trying to reconstruct three weeks of agent-assisted trade surveillance work for their internal audit committee. The agents had run. The outputs existed. The decisions had been made. But the chain of reasoning, the exact model versions, the policy rules applied at each step, the cost of each call, the precise sequence of events that led to a flag being raised or a case being closed: none of it was in one place, and some of it was not anywhere at all. Three consistency boundaries had quietly disagreed with each other, and nobody had noticed until someone started asking hard questions.

That reconstruction exercise cost more than the original automation had saved, and it still left gaps the auditors marked as unsatisfactory.

This is not an unusual story. It is the story of every agentic system built on the conventional architecture: an orchestration library, an external database, a separate observability stack, and logging bolted on after the fact. Each layer has its own view of what happened. The views do not agree at the margins. And the margins are precisely where regulators look.

The EU AI Act is blunt about what high-risk systems must do. Article 12 requires a high-risk AI system to record events automatically over its entire lifetime, in enough detail to identify situations that may result in risk. For most regulated verticals the compliance deadline is 2 August 2026 (source: EUR-Lex, Regulation 2024/1689, Article 12). A separate observability SaaS that aggregates traces from a dozen microservices does not satisfy that. A signed, append-only, deterministic event log that is itself the execution engine does.

The problem, stated honestly

The standard approach to agent observability works like this. The orchestrator emits spans. A tracing library catches them. A backend (OpenTelemetry, LangSmith, Datadog, your choice) collects them. Separately, a database logs application state changes. Separately again, the model provider logs token usage. Now you have three sources of truth, running over different networks, with different clocks, different retry semantics, and different durability guarantees.

Ask those three sources “what did agent X do at 14:23:07 UTC on the 12th, under which policy, with which version of which model, at what cost?” and each will answer. The answers will be close but not identical, and the delta is where your audit defence falls apart.

The deeper problem is determinism. Conventional agent stacks are not designed to be deterministically replayable. Orchestrator loops are non-deterministic in their scheduling. Database reads may return different results depending on concurrent writes. Tool calls are network round-trips with timeouts and retries that change the execution path. Even if you capture every log entry, you cannot replay the run and get the same outcome, because nothing in the substrate it ran on guarantees that the same inputs produce the same execution path.

This matters in a specific, practical way. If you cannot deterministically replay an agent run, you cannot do time-travel debugging of production agents. You can look at logs and guess. You cannot reconstruct the exact state at each step, step through it, and understand why a particular decision was made. This is the difference between having a flight recorder and having a crew member’s diary that they wrote from memory a week later.

And for the regulator? Logs you cannot replay are evidence you cannot fully verify. The signatures of individual entries may be intact, but if the entries were assembled from three inconsistent sources, a competent adversarial audit will find the gaps.

The event log that is also the execution engine

The insight at the centre of Substrate’s approach is straightforward: if the audit trail and the execution engine are the same thing, there is no synchronisation problem between them. You do not log events into an audit trail. You execute against a log, and the log is your state.

Cosmictron makes this concrete. Every cell in the factory runs against an append-only event log. State is not stored as a mutable record that gets updated. State is derived from the log, deterministically, on read. A write is an append to the log, not an update to a row. This means that replay is not a special operation: it is the normal operation. To replay a run between times T1 and T2, you derive the state as of T1 from the log, then re-apply the events from T1 to T2. You get exactly the same result, every time, because the derivation is deterministic and the input log is immutable.
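A minimal sketch of the idea in Python, not Cosmictron's actual implementation: state is never stored, only folded out of an immutable log by a pure transition function. The event kinds, the Event fields, and the function names are all hypothetical, chosen for illustration.

```python
from dataclasses import dataclass
from typing import Iterable


@dataclass(frozen=True)
class Event:
    seq: int      # logical clock: position in the log
    kind: str     # hypothetical kinds: case_opened, flag_raised, case_closed
    case_id: str


def apply(state: dict, event: Event) -> dict:
    """Pure transition: returns a new state and never mutates the old one."""
    new_state = dict(state)
    if event.kind == "case_opened":
        new_state[event.case_id] = "open"
    elif event.kind == "flag_raised":
        new_state[event.case_id] = "flagged"
    elif event.kind == "case_closed":
        new_state[event.case_id] = "closed"
    return new_state


def derive(log: Iterable[Event], upto_seq: int) -> dict:
    """Fold the log from its start up to and including upto_seq."""
    state: dict = {}
    for event in log:
        if event.seq > upto_seq:
            break
        state = apply(state, event)
    return state


# The log is a tuple: appended to by building a new tuple, never edited.
LOG = (
    Event(1, "case_opened", "C-1"),
    Event(2, "case_opened", "C-2"),
    Event(3, "flag_raised", "C-1"),
    Event(4, "case_closed", "C-2"),
)

state_at_2 = derive(LOG, upto_seq=2)  # state as of "T1"
state_at_4 = derive(LOG, upto_seq=4)  # state as of "T2"
```

Because apply has no side effects and the log never changes, calling derive twice with the same arguments gives the same dictionary. That is all replay means in this model.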

Ultra completes the picture. Every event in the log is signed by the agent that produced it, using its Ed25519 private key. The log is a hash chain: each block commits to the hash of the previous block. Tamper with one entry, and every downstream entry fails verification. The chain is structurally tamper-evident, not just policy-protected.

Put these together and you have a signed, append-only, tamper-evident, deterministically replayable record of everything that happened in the factory. Who (agent identity, verifiable). What (tool, inputs, outputs, all hashed). Which policy (policy ID, version, applied at runtime). What cost (token meter, committed before the call). When (logical clock and wall time, both recorded). The whole run, reconstructable.
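To make the who, what, which policy, what cost, and when fields concrete, here is a hedged sketch of what one block might carry. The field names, the genesis marker, and the JSON canonicalisation are assumptions, not Substrate's wire format. The sign function uses HMAC-SHA256 from the standard library purely as a stand-in so the example runs without dependencies; the design described above uses Ed25519, where verification needs only the public key.

```python
import hashlib
import hmac
import json
from dataclasses import asdict, dataclass, replace

GENESIS_HASH = "0" * 64  # assumed marker for "no previous block"


def sha256_hex(data: bytes) -> str:
    return hashlib.sha256(data).hexdigest()


@dataclass(frozen=True)
class Block:
    agent_id: str        # who
    tool: str            # what
    input_hash: str
    output_hash: str
    policy_id: str       # which policy
    policy_version: str
    cost_committed: int  # what cost, reserved before the call
    logical_clock: int   # when (ordering)
    wall_time: str       # when (ISO 8601 wall clock)
    prev_hash: str       # link to the previous block


def canonical_bytes(block: Block) -> bytes:
    """Stable serialisation: sorted keys, no whitespace."""
    return json.dumps(
        asdict(block), sort_keys=True, separators=(",", ":")
    ).encode("utf-8")


def block_hash(block: Block) -> str:
    return sha256_hex(canonical_bytes(block))


def sign(block: Block, key: bytes) -> str:
    # STAND-IN ONLY: HMAC-SHA256 in place of an Ed25519 signature.
    return hmac.new(key, canonical_bytes(block), hashlib.sha256).hexdigest()


first = Block(
    agent_id="agent-7", tool="fetch_trades",
    input_hash=sha256_hex(b"desk=FX"), output_hash=sha256_hex(b"412 trades"),
    policy_id="surveillance-rules", policy_version="3.2",
    cost_committed=1200, logical_clock=1,
    wall_time="2026-05-12T14:23:07Z", prev_hash=GENESIS_HASH,
)
second = replace(
    first, tool="score_trades",
    input_hash=first.output_hash, output_hash=sha256_hex(b"3 flags"),
    cost_committed=800, logical_clock=2,
    wall_time="2026-05-12T14:23:09Z", prev_hash=block_hash(first),
)
```

Hashing inputs and outputs rather than embedding them keeps the block small while still committing to the exact data the agent saw.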

Interactive: the signed event log for a financial audit mission. Hover any block to inspect the full signed fields. Toggle “simulate tamper” to see how altering one record immediately breaks every downstream block in the hash chain, making the tamper detectable without any external verification service.

That tamper toggle is worth spending a minute on. This is not a policy control that requires an administrator to check a dashboard. It is a mathematical property of the log structure. If any block is altered, its hash changes, so the prev_hash field in the next block no longer matches. Repairing that link means rewriting the next block, which invalidates that block’s signature unless the attacker also holds the signing key, and the same mismatch then cascades forward. An auditor verifying the log does not need to trust the infrastructure team. They verify the chain and the signatures. The mathematics either holds or it does not.
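The verification walk can be sketched in a few lines of Python. This is an illustration under stated assumptions, not the Substrate verifier: blocks are plain dictionaries, the key and payloads are made up, and HMAC-SHA256 stands in for Ed25519 so the example needs only the standard library.

```python
import copy
import hashlib
import hmac
import json

KEY = b"demo-agent-key"  # hypothetical; stand-in for an agent's signing key
GENESIS = "0" * 64       # assumed prev_hash of the first block


def body_bytes(block: dict) -> bytes:
    """The signed and hashed part of a block: payload plus prev_hash."""
    body = {"payload": block["payload"], "prev_hash": block["prev_hash"]}
    return json.dumps(body, sort_keys=True).encode("utf-8")


def block_hash(block: dict) -> str:
    return hashlib.sha256(body_bytes(block)).hexdigest()


def sign(block: dict) -> str:
    # STAND-IN ONLY: HMAC-SHA256 in place of an Ed25519 signature.
    return hmac.new(KEY, body_bytes(block), hashlib.sha256).hexdigest()


def append(chain: list, payload: dict) -> None:
    prev = block_hash(chain[-1]) if chain else GENESIS
    block = {"payload": payload, "prev_hash": prev}
    block["sig"] = sign(block)
    chain.append(block)


def first_invalid(chain: list):
    """Return the index of the first block that fails, or None if all pass."""
    expected_prev = GENESIS
    for i, block in enumerate(chain):
        sig_ok = hmac.compare_digest(block["sig"], sign(block))
        if block["prev_hash"] != expected_prev or not sig_ok:
            return i
        expected_prev = block_hash(block)
    return None


chain: list = []
for payload in ({"step": "gather"}, {"step": "test_control"}, {"step": "draft"}):
    append(chain, payload)

# Case 1: an attacker edits block 1 but cannot re-sign it.
tampered = copy.deepcopy(chain)
tampered[1]["payload"]["step"] = "skip_control"

# Case 2: the attacker also re-signs block 1; the link from block 2 breaks.
resigned = copy.deepcopy(tampered)
resigned[1]["sig"] = sign(resigned[1])
```

In this sketch first_invalid(chain) returns None, the edited chain fails at block 1 on its signature, and the re-signed chain fails at block 2 on its prev_hash. Hiding the edit would require rewriting and re-signing every later block.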

This is the property Article 12 of the EU AI Act is reaching for, even if the text does not use the words “hash chain”. The requirement is that logs allow “identification of situations that may result in risk”, and a log can only support that if it is robust against loss or manipulation. A hash chain of Ed25519-signed blocks provides both properties structurally, not by convention.

How the replay works in practice

A log that is mathematically tamper-evident but inaccessible to practitioners is of limited use. The replay interface matters.

Inside a Substrate deployment, the ninmuctl CLI exposes a replay command. Given a mission ID and an optional time window, it reconstructs the exact state of the swarm at each step, letting you step through decisions, inspect the data each agent saw, check which policy version was active, and see the cost committed before each step. This is time-travel debugging of production agents, and it runs against the live signed log, not a copied or sampled trace.

The practical consequence for a development team is significant. When an agent in a production financial surveillance run flags an unusual case, you do not diagnose it by reading logs and guessing. You replay the run from the beginning of that mission, step through to the decision point, and see exactly what the agent saw. The model version, the memory state (what Kizuna-mem returned at that moment, with its bitemporal as-of semantics intact), the policy rules in force, the budget remaining. All of it, reconstructed exactly, because the execution was deterministic to begin with.
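One common way to get this property, sketched here as an assumption rather than a description of Substrate internals, is to record every nondeterministic input (tool results, memory reads) in the log before it is used, and to keep the decision logic pure. Replay then reads the recorded values instead of calling the tool again. The function names, the scoring tool, and the threshold rule are all hypothetical.

```python
import random


def decide(score: int, threshold: int) -> str:
    """Pure decision rule: the same inputs always give the same answer."""
    return "flag" if score > threshold else "pass"


def run_live(trade_ids, fetch_score, threshold, log):
    """Live run: record the nondeterministic tool result, then decide on it."""
    for trade_id in trade_ids:
        score = fetch_score(trade_id)  # nondeterministic in production
        log.append(
            {"trade_id": trade_id, "score": score, "threshold": threshold}
        )
        yield trade_id, decide(score, threshold)


def replay(log):
    """Replay: read recorded inputs from the log; never call the tool."""
    for entry in log:
        yield entry["trade_id"], decide(entry["score"], entry["threshold"])


log: list = []
# random.randint simulates a tool whose answer differs from call to call.
live = list(
    run_live(["T1", "T2", "T3"], lambda _id: random.randint(0, 100), 50, log)
)
replayed = list(replay(log))

# Stepping to a decision point is just consuming the replay one entry at a time.
stepper = replay(log)
first_decision = next(stepper)
```

Whatever scores the simulated tool returned, replayed matches live exactly, because the replay never consults anything except the log.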

For a compliance team, replay serves a different purpose. The question is not “why did this agent make this decision” but “prove to the regulator that this is what happened.” The mission log is the proof. You export the signed bundle: the hash chain, the signatures, the policy IDs, the model versions. An independent verifier can check the signatures without access to your infrastructure. The proof does not depend on the vendor’s audit trail being available or correct. It depends on mathematics, which is available and correct in all the ways that matter to a court.

Interactive: a financial audit mission reconstructed from its event log. Drag the budget cap to see how cheapest-sufficient routing would have changed under different conditions. Watch the signed event log build alongside the DAG: every task completion appends a signed block. The human gate pauses the swarm until approved, and that pause is itself a signed event in the chain.

This diagram deserves some attention. The mission DAG on the left and the hash chain on the right are not two separate visualisations of two separate systems. They are two views of the same data. The DAG shows the logical structure: which tasks ran in which order, which model handled each task, what the budget ledger looked like at each point, where the human gate paused the swarm. The chain on the right shows the audit artefact: the same sequence, signed and committed. Drag the budget cap down and both views update, because the routing decision is part of the signed record, not a separate inference from logs.
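As a rough illustration of why dragging the budget cap changes the record, here is a sketch of cheapest-sufficient routing against a hard budget. The model catalogue, capability scores, and cost units are invented for the example, and the routing rule is one plausible reading of “cheapest-sufficient”, not the documented Substrate algorithm.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Model:
    name: str
    capability: int  # hypothetical ordinal score
    cost: int        # hypothetical cost units per task


CATALOGUE = (
    Model("small", 1, 2),
    Model("medium", 2, 5),
    Model("large", 3, 12),
)


def route(required_capability: int, remaining_budget: int) -> Optional[Model]:
    """Cheapest model that is capable enough and still affordable."""
    candidates = [
        m for m in CATALOGUE
        if m.capability >= required_capability and m.cost <= remaining_budget
    ]
    return min(candidates, key=lambda m: m.cost) if candidates else None


def plan(tasks, budget_cap: int) -> list:
    """Route each task in order, committing its cost before it would run."""
    remaining = budget_cap
    records = []
    for task_name, required in tasks:
        model = route(required, remaining)
        if model is None:
            # Hard cap: record the refusal and stop rather than overspend.
            records.append(
                {"task": task_name, "model": None, "remaining": remaining}
            )
            break
        remaining -= model.cost  # commit first, call second
        records.append(
            {"task": task_name, "model": model.name, "remaining": remaining}
        )
    return records


TASKS = (("gather_evidence", 1), ("test_control", 2), ("draft_paper", 3))

roomy = plan(TASKS, budget_cap=20)
tight = plan(TASKS, budget_cap=10)
```

With a cap of 20 all three tasks are routed; with a cap of 10 the final task is refused because no sufficient model fits the remaining budget. In a chain-backed system each of these records would be a signed block, which is why the routing decision is part of the audit artefact.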

Regulatory mapping

Three regulations currently place concrete requirements on agent logging in regulated industries. They converge on what Substrate’s architecture satisfies by construction.

DORA (Digital Operational Resilience Act) requires that financial entities maintain detailed logs of all ICT-related incidents and events sufficient for reconstruction. Article 10 specifies that these logs must allow reconstruction of events in chronological sequence and must be protected against accidental or deliberate manipulation (source: EUR-Lex, Regulation 2022/2554, Article 10). A hash-chain event log that is the execution engine satisfies both requirements directly.

MiFID II’s recordkeeping requirements (Article 16(6)) require investment firms to retain records of all services, activities and transactions for five years in a form that allows the competent authority to reconstruct each key stage of the processing of each transaction (source: EUR-Lex, Directive 2014/65/EU). The word “reconstruct” is doing real work here. A log you can replay is a log from which reconstruction is deterministic. A log assembled from three inconsistent sources is a reconstruction attempt with known gaps.

EU AI Act Article 12 requires high-risk AI systems to have a logging capability enabling automatic recording of events throughout the system’s lifetime. Article 12(3) adds that for the remote biometric identification systems listed in point 1(a) of Annex III, the records must include identification of the natural persons involved in the verification of results. Human gate records, signed and time-stamped, capture that kind of information for every mission. The obligation takes full effect for most regulated deployments from 2 August 2026.

All three converge on the same underlying requirement: a tamper-evident, chronologically complete record from which the processing can be reconstructed. Substrate’s architecture was designed around exactly this, as a property of the execution model, not as a compliance feature added to an existing system.

Contrast this with LangSmith plus a Postgres event table, which is the most common current approach for teams trying to satisfy these requirements. LangSmith captures traces from LangChain-based orchestrators. Postgres holds application state. The two are synchronised by application code. Three consistency boundaries: the orchestrator, the trace sink, and the database. Under normal conditions they broadly agree. Under load, during failures, or during retries, they diverge. The divergence is exactly what an adversarial audit will find, because auditors look at boundaries and failure modes.

This is not a criticism of LangSmith specifically. It is a structural critique of the three-boundary architecture. The problem is not the observability tool: it is the gap between where the execution happens and where the log lives.

A sector walkthrough: financial audit, before and after

A UK financial services firm runs a quarterly control-testing programme. The current process involves a team of associates spending several weeks pulling evidence, testing controls, writing working papers, and assembling a pack that internal audit and external regulators can review. The audit trail is assembled by the associates as they go: spreadsheets with timestamps, email trails, signed-off documents. It is not unreasonable. It is the industry standard. It also has gaps, because humans forget to log things, logs are in different systems, and the timestamps are self-reported.

The firm has run agent-assisted pilots that reduce the time for this process substantially. The evidence gathering, the control testing against a rules engine, the drafting of working papers: agents handle the mechanical parts well. The problem that keeps the pilots from moving to production is not capability. It is auditability. Internal audit will not sign off on agent-produced evidence unless they can verify, step by step, what the agents did, what data they saw, which rules they applied, and who (which named human) approved the exceptions. The current agent stacks cannot satisfy this requirement without heroic reconstruction after the fact.

With Substrate’s deterministic replay, the situation reverses. The agents run. Every action is a signed block in the hash chain. Every policy rule applied is recorded with its version. Every model call is logged with the model version and the cost committed before the call. Every exception surfaced to a human analyst is a gate block in the chain, with the analyst’s approval (or override with reason) as a subsequent block. The working paper at the end is not assembled from logs: it is derived from the chain, because the chain is complete and ordered and signed.

The difference for internal audit is not that the output looks different. It is that verification is tractable. The auditor can check the chain signatures independently. They can replay the mission and confirm that the agents saw what the log says they saw. They can check that the policy versions applied match the policy versions that were approved for use. They can verify that the human gates were approved by individuals with the correct authority level. None of this requires trusting the firm’s infrastructure team. It requires verifying mathematics.
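The auditor’s checks on policy versions and human gates can be expressed as a single pass over the chain. The sketch below assumes the signatures and hash links have already been verified, and uses hypothetical block kinds, policy identifiers, and approver names; it shows the shape of the check, not a real Substrate schema.

```python
APPROVED_POLICIES = {("surveillance-rules", "3.2")}  # hypothetical register
AUTHORISED_APPROVERS = {"analyst-17"}                # hypothetical authority list


def audit(chain: list) -> list:
    """Return a list of findings; an empty list means the checks passed."""
    findings = []
    pending_gate = None  # index of a gate block still awaiting approval
    for i, block in enumerate(chain):
        kind = block["kind"]
        if kind == "action":
            policy = (block["policy_id"], block["policy_version"])
            if policy not in APPROVED_POLICIES:
                findings.append(f"block {i}: unapproved policy {policy}")
            if pending_gate is not None:
                findings.append(
                    f"block {i}: action ran while gate {pending_gate} was pending"
                )
        elif kind == "gate":
            pending_gate = i
        elif kind == "approval":
            if pending_gate is None:
                findings.append(f"block {i}: approval without a gate")
            elif block["approver"] not in AUTHORISED_APPROVERS:
                findings.append(
                    f"block {i}: approver {block['approver']} lacks authority"
                )
            else:
                pending_gate = None  # gate cleared by an authorised human
    if pending_gate is not None:
        findings.append(f"gate {pending_gate}: never validly approved")
    return findings


def action(version: str) -> dict:
    return {
        "kind": "action",
        "policy_id": "surveillance-rules",
        "policy_version": version,
    }


good = [
    action("3.2"),
    {"kind": "gate"},
    {"kind": "approval", "approver": "analyst-17"},
    action("3.2"),
]
bad = [
    action("3.1"),
    {"kind": "gate"},
    {"kind": "approval", "approver": "intern-4"},
    action("3.2"),
]
```

Here audit(good) returns no findings, while audit(bad) reports the stale policy version, the unauthorised approver, the action that ran past an uncleared gate, and the gate that was never validly approved.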

The before: an evidence pack assembled by humans and agents, with a reconstruction exercise running in parallel to satisfy audit, taking months and still leaving gaps.

The after: an evidence pack that is the chain, signed and replayable, with reconstruction reduced to a ninmuctl replay command that takes minutes and produces a verifiable result.

What to demand in an RFP

If you are evaluating agentic platforms for any regulated use case, the audit trail question is where most vendors will reveal their architecture’s limits. Here is what to put in the document.

Ask whether the platform’s audit trail is the execution engine, or whether it is a separate log written by the execution engine. If the answer is the latter, ask how the two are kept consistent during failures and retries. Watch for answers that describe reconciliation processes or “eventual consistency”: these are admissions that the trail and the execution can diverge.

Ask whether the event log is cryptographically tamper-evident. A hash chain with Ed25519 signatures satisfies this. An append-only table in Postgres with a row-level timestamp does not: rows can be deleted, timestamps can be altered by anyone with database access, and there is no chain dependency that makes tampering structurally detectable.

Ask for a demonstration of deterministic replay. Give them a mission ID. Ask them to replay the run from hour two to hour three and show you the exact state at each step. If they cannot do this without standing up a separate reconstruction process, what they have is reconstruction after the fact, not deterministic replay.

Ask how the audit log survives an air-gapped deployment. The answer should be “the log is local, signed, and works identically on-prem or air-gapped.” If the answer involves a cloud endpoint for log shipping, signing, or verification, you have a sovereignty problem alongside the audit problem.

Ask about the DORA, MiFID, and EU AI Act mapping specifically. Ask which article each logging requirement maps to, and which component in their architecture satisfies it. A vendor who has actually built for these requirements will answer fluently. A vendor who has bolted observability onto an existing agent framework will describe the observability tool and hope you do not ask the follow-up questions.

A 90-day pilot design

You do not need to redesign your entire agent programme to test whether deterministic replay is real. Pick one mission type that already runs agents: a control-testing cycle, a regulatory report generation, a trade surveillance sweep. Run it on Substrate for one quarter.

The test is not whether the output quality improves. Assume it is comparable. The test is whether the audit trail satisfies your internal audit committee. Give them the signed chain. Ask them to verify one mission end to end. Can they confirm the policy versions? Can they confirm the human gate approvals? Can they replay the run and get the same result? If yes, you have the thing that moves a pilot to production in a regulated enterprise. If no, you have identified a gap the vendor needs to close before you go further.

Three numbers will tell you nearly everything about whether the audit architecture is real. First, the time from “regulator asks a question” to “we have a verifiable answer.” This should be minutes, not weeks. Second, the number of consistency gaps an independent verifier finds when checking the log against the actual output. This should be zero. Third, the time required to satisfy an internal audit sign-off on an agent-produced evidence pack. If the audit committee is signing off in hours rather than months, the architecture is working.

This matters especially as the EU AI Act enforcement date approaches. Firms that can demonstrate a compliant audit architecture before August 2026 are ahead. Firms that are still assembling logs from three inconsistent sources when an enforcement action lands are not.

The broader story connecting this capability to the rest of the factory is in “declare the mission and the budget”, which explains why the budget ledger and the audit trail are the same idea seen from two angles, and in “cryptographic agent identity with Ultra”, which covers the Ed25519 identity lifecycle that makes every signed block in the chain verifiable. The EU AI Act Article 12 obligations, and how to build a compliance mapping from the hash chain to the regulatory text, are covered in detail in “EU AI Act Article 12: what to log”. And if you are building a multi-agent system and worrying about whether the coordination layer introduces new audit gaps, “multi-agent coordination without races” covers how Cosmictron’s embedded state model eliminates the consistency boundaries that make those gaps inevitable.

The factory knows what it did

There is a temptation to treat audit logging as a solved problem. You pick an observability stack, you configure retention, you make sure someone is watching the dashboards. Most agent projects in regulated industries have done this, and most of them have discovered its limits the first time someone asks a hard question.

The insight Substrate is built on is that the audit trail cannot be a separate system that records what the execution system does. The two must be the same thing, or they will eventually disagree, and they will disagree at the worst possible moment.

Deterministic replay is not a feature of the logging system. It is a consequence of the execution model. When state is derived from an immutable, signed, append-only event log, replay is not a reconstruction exercise. It is just running the derivation again. The factory knows what it did because it never did anything except append to a signed log. And that log, end to end, is the audit trail the regulator is asking for.

If you want to see this in a real deployment context, including the financial model and the head-to-head comparison with glue stacks, you can request the investor brief. The signed event log, the deterministic replay, the cryptographic identity plane, and the hard budget ledger are not four separate properties you can mix and match. They are four aspects of one design, and that is why incumbents and glue stacks structurally cannot follow.