Article

What regulated buyers should demand in memory benchmarks (2026 edition)

Author: Jonathan Conway
Timestamp: 29 May 2026
Classification: memory / benchmarks / locomo / beam / magma / governed-memory / governance / regulated-ai / oamp / bitemporal

A procurement manager at a large NHS trust sent me a story I have been thinking about ever since. Her team had spent three months evaluating agentic memory systems for a clinical coding workflow. Two vendors made it to the final round. One scored at the top of every recent retrieval benchmark she could find. The other had unremarkable recall scores but arrived with a document that listed, in precise technical detail, how the system would handle a GDPR Subject Access Request, a bitemporal “as of” query from the trust’s information governance team, and a regulator-ordered replay of a disputed coding episode.

The first vendor had never been asked those questions before. They said they would “get back to her.” They did not, because the questions exposed gaps their architecture could not close.

The second vendor got the contract.

This happens constantly now, and it will happen more often as the EU AI Act Article 12 enforcement deadline for high-risk systems, currently 2 August 2026 (source: EUR-Lex, Regulation 2024/1689), forces regulated buyers to ask the questions their procurement frameworks were not designed to ask. The memory benchmark leaderboards do not help. They measure the right things for research settings and the wrong things for regulated production. This article proposes a rubric that fills the gap.

What the existing benchmarks actually measure

LoCoMo (Longitudinal Conversation Memory, Maharana et al., ACL 2024) is the most widely cited benchmark for long-horizon agent memory. It tests whether a system can accurately answer questions about events spread across extended conversational histories, including temporal ordering, fact conflict resolution, and context retention over very long spans. This is genuinely useful. Long-horizon retrieval is hard, and LoCoMo exposes failure modes that earlier short-context benchmarks missed entirely.

BEAM (Benchmark for Evaluating Agent Memory, 2025) extends the scope to multi-session and multi-agent settings. It is better suited to the business-factory use case where many agents are reading from and writing to a shared memory graph over the course of a complex mission. BEAM tests whether a memory system remains consistent and retrievable under the kind of concurrent write pressure that agentic workflows actually produce.

MAGMA and related 2025-2026 evaluations push further into cross-document reasoning, long-window context synthesis, and structured versus unstructured memory interleaving. These are the benchmarks most likely to be cited by vendors targeting enterprise buyers in 2026.

The pattern across all of them is the same. They measure how accurately and reliably a system retrieves facts that were stored earlier. Recall precision, temporal ordering, contextual relevance, consistency under load. These are important dimensions. A system that cannot recall accurately is not worth deploying regardless of its governance posture.

But none of these benchmarks test what happens when a regulated buyer asks the system to do something a research benchmark never needs to do. Produce a signed deletion event. Answer a query scoped to a specific valid-time and transaction-time pair. Export a tamper-evident replay of every memory access taken during a three-week mission. Prove that a particular memory node was not visible to the agent at a specific moment in the past.

Those are not exotic requirements. They are the minimum that a system operating in any environment governed by GDPR Article 17, DORA, MiFID II Article 16, or EU AI Act Article 12 has to be able to satisfy. The benchmarks are silent on all of them.

The seven governance axes

The diagram below shows where existing benchmark-leading systems typically land when you add the governance axes alongside the recall axis. The recall-leader wins dramatically on one axis and is absent from the rest.

Figure (interactive radar chart): the seven axes plotted per system. A toggle applies a regulated-finance lens that re-weights the governance axes for a CRO evaluation, and legend entries isolate individual systems. The recall-leader (red) dominates the first axis and is effectively absent from the remaining six.

The seven axes in that diagram are not arbitrary. Each one maps to a specific regulatory or operational requirement that a CRO, head of audit, or VC doing diligence on a regulated deployment will need to verify.

Recall accuracy. This is the axis the existing benchmarks measure. It matters as a floor, not a ceiling. A system that scores poorly here is not worth further evaluation. A system that scores well here but fails elsewhere is not deployable. The benchmark exists to establish the floor; the governance rubric establishes whether the system can actually be used.

Hard-delete. When a data subject exercises their right to erasure under GDPR Article 17, the memory system must be able to remove the relevant nodes in a way that propagates through the graph, including derived facts, cached embeddings, and any downstream nodes that incorporated the deleted content. The deletion must produce a signed, policy-attributed audit event. The event must be cryptographically bound to the original write so that a verifier can demonstrate the deletion, not merely assert it. A benchmark test for this capability is simple: write a personal data node, issue a deletion request, then verify that the deletion event is present, signed, and auditable, and that the deleted node no longer appears in any retrieval path.
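As a rough illustration of the propagation requirement, the sketch below walks a derivation graph to collect everything a deletion has to reach. The `derived_from` mapping and the node names are hypothetical, invented for this example; a real memory engine will track derivation in its own way, and nothing here describes how any particular product does it.

```python
from collections import deque


def deletion_closure(root, derived_from):
    """Return every node that must be handled when `root` is erased.

    `derived_from` maps a node ID to the IDs it was derived from
    (hypothetical structure, for illustration only).
    """
    # Invert the mapping: source -> nodes that incorporated it.
    dependents = {}
    for node, sources in derived_from.items():
        for source in sources:
            dependents.setdefault(source, set()).add(node)

    # Breadth-first walk over everything downstream of the root.
    affected = {root}
    queue = deque([root])
    while queue:
        current = queue.popleft()
        for child in dependents.get(current, ()):
            if child not in affected:
                affected.add(child)
                queue.append(child)
    return affected


derived_from = {
    "embedding:patient-42": ["node:patient-42"],
    "summary:ward-7": ["node:patient-42", "node:patient-43"],
    "report:q1": ["summary:ward-7"],
    "node:patient-43": [],
}
closure = deletion_closure("node:patient-42", derived_from)
```

The closure only identifies what is affected. Whether a derived node such as the ward summary is deleted outright or regenerated without the erased input is a policy decision the sketch leaves open.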

Bitemporal as-of. A query scoped to “what did the system know about entity X at valid time T1, as of the transaction state that existed at T2” is a standard tool in financial and healthcare audits. It is not exotic. MiFID II requires firms to maintain records sufficient to reconstruct market activity at any given point. Healthcare governance inquiries routinely ask what a clinical decision-support system knew at the time of a specific decision. A system without bitemporal indexing cannot answer these queries correctly. Approximate reconstruction from logs is not the same thing, because approximation is not acceptable in a forensic context. The test is to issue a genuine bitemporal query against a live dataset and verify that the result set is consistent with what a correct bitemporal index would return.

Replay fidelity. A three-week regulated mission, a DORA resilience test, an AML evidence pack built over a fortnight: all of these produce a history of memory reads and writes that a regulator might ask to see reconstructed. Replay fidelity is the degree to which a system can reproduce, from its own records, the exact sequence of memory accesses taken during a past mission, including the subgraph visible to the agent at each step. This is what the Substrate factory’s deterministic replay capability provides: because the realtime data plane and the identity service make every state change deterministic and signed, the entire run is reconstructable from the event log. The sister article “Deterministic replay as the audit trail” covers the mechanics of this in depth.

Provenance depth. Every memory node that an agent retrieved and acted upon must carry a complete provenance record: the identity of the writing agent, the policy version under which the write occurred, the source attribution, the temporal bounds, and any supersession or deletion events. Provenance depth is a measure of how complete that record is across the entire node population. A system that has provenance on some nodes but not others is not meaningfully more auditable than one with no provenance at all, because you cannot know which nodes carry clean records without inspecting every one. The article on provenance as the real differentiator explains why this is the axis that most separates viable from non-viable systems for regulated deployment.

Policy-scoped retrieval. In a governed factory, different missions run under different policy perimeters. An agent working on a healthcare coding mission should not be able to retrieve memory nodes that were written under a financial services policy. Policy-scoped retrieval is the guarantee that the query interface enforces mission-level governance: not as a bolted-on permission layer that returns 403 Forbidden, but as a property of the retrieval architecture itself, such that out-of-scope nodes are structurally absent from the result set. The OAMP specification formalises this as an existence-hiding requirement: a denied query must not reveal that anything was withheld.

Audit export. At some point, a regulator, an auditor, or a court will ask for a signed, machine-verifiable export of the memory system’s activity over a defined period. This is the audit export capability. A system that can produce a human-readable log is not the same thing as one that can produce a cryptographically signed, schema-validated bundle that a third-party verifier can check without trusting the vendor’s claims. The difference matters enormously when the entity with access to the audit export and the entity requesting it are in an adversarial relationship, which is the situation that matters most.

Why the gap exists between recall benchmarks and governance requirements

The gap is not accidental. It reflects a genuine difference in who the benchmarks were designed for.

Recall benchmarks are designed for researchers comparing retrieval architectures. The question they answer is: does this architecture recover relevant facts more accurately and reliably than that one? That is a useful comparison for choosing between retrieval approaches.

Governance requirements are set by regulators who need to be able to audit systems that have legal significance. The questions they answer are: can this system demonstrate what it knew and when, can it prove it has removed what it was asked to remove, and can it produce a tamper-evident record of its operation? These questions are not related to retrieval architecture in any direct way. A system with excellent spreading-activation recall can have no bitemporal indexing. A system with passable recall scores can have exemplary provenance records. The two dimensions are largely orthogonal.

The market has not converged on governance benchmarks because regulated buyers have not yet demanded them. They have been accepting recall scores as proxies for overall suitability. That is changing. The EU AI Act enforcement deadline is one pressure. An accumulating body of incidents, where systems that looked excellent on recall benchmarks turned out to be structurally non-compliant when put in front of a regulator, is another. The procurement manager at the NHS trust was ahead of the curve; most of her peers are catching up.

A concrete governance benchmark rubric

What follows is a proposed governance test battery for use alongside conventional recall benchmarks in RFPs for agentic memory systems in regulated environments. Each test is designed to produce a binary pass or fail result, not a score, because in most regulatory contexts the question is not “how good is the governance” but “does the governance meet the requirement.”

Test 1: Hard-delete with provenance. Write a known personal data node to the memory system. Issue a deletion request citing a specific policy authority (for example, GDPR Article 17, data subject ID, and a reference timestamp). Verify that: (a) the node is absent from all retrieval paths after deletion; (b) a signed deletion event exists in the audit log; (c) the deletion event is cryptographically bound to the original write event; (d) any downstream nodes that incorporated the deleted content have been appropriately handled and flagged. A system that cannot demonstrate all four is a fail on this test.
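An examiner's check for the signature and the binding to the original write might look roughly like the sketch below. It is an assumption-heavy illustration: the event fields are invented, and an HMAC with a shared demo key stands in for the asymmetric signature a real system would use so that third parties can verify without holding the secret.

```python
import hashlib
import hmac
import json

# Assumption: a shared demo key stands in for a real signing key pair.
KEY = b"demo-key-not-for-production"


def canonical(event):
    """Serialise an event deterministically so hashes are reproducible."""
    return json.dumps(event, sort_keys=True, separators=(",", ":")).encode()


def event_hash(event):
    return hashlib.sha256(canonical(event)).hexdigest()


def sign(event):
    return hmac.new(KEY, canonical(event), hashlib.sha256).hexdigest()


def make_deletion_event(write_event, authority, subject_id, requested_at):
    body = {
        "type": "delete",
        "node_id": write_event["node_id"],
        "write_hash": event_hash(write_event),  # binds deletion to the write
        "authority": authority,
        "subject_id": subject_id,
        "requested_at": requested_at,
    }
    return {"body": body, "signature": sign(body)}


def verify_deletion(deletion, write_event):
    """Check the signature and the binding to the original write."""
    body = deletion["body"]
    signature_ok = hmac.compare_digest(deletion["signature"], sign(body))
    bound_ok = body["write_hash"] == event_hash(write_event)
    return signature_ok and bound_ok


write_event = {
    "type": "write",
    "node_id": "n1",
    "payload": "date of birth",
    "written_at": "2026-01-10T09:00:00Z",
}
deletion = make_deletion_event(
    write_event, "GDPR Article 17", "subject-001", "2026-03-01T12:00:00Z"
)
print(verify_deletion(deletion, write_event))
```

This covers only two of the four checks in the test. Absence from retrieval paths and the handling of downstream nodes need to be exercised against the live system.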

Test 2: Bitemporal as-of query under a specific policy. Set up a scenario where a fact about an entity changes at valid time T4 but the system does not learn about it until transaction time T6. Issue a query asking what the system knew about the entity at valid time T5, as of the transaction state that existed at T5 (before T6). Verify that the query returns the old value, not the new one. Then issue the same query as of T7 and verify that the new value is returned, with the update’s transaction time correctly attributed. A system that cannot separate valid time from transaction time in its query semantics cannot pass this test.
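The query semantics in this test can be sketched in a few lines. The example below is a minimal in-memory model, not a description of any vendor's index: `FactVersion`, `as_of`, the integer time points, and the “standard” and “elevated” values are all assumptions chosen to mirror the T4 to T7 scenario.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class FactVersion:
    entity: str
    value: str
    valid_from: int   # valid time: when the fact became true in the world
    recorded_at: int  # transaction time: when the system learned it


def as_of(versions, entity, valid_time, transaction_time):
    """Version of `entity` at `valid_time`, using only what had been
    recorded by `transaction_time`. Returns None if nothing was known."""
    known = [
        v for v in versions
        if v.entity == entity
        and v.recorded_at <= transaction_time
        and v.valid_from <= valid_time
    ]
    if not known:
        return None
    # Latest valid_from wins; ties go to the most recently recorded version.
    return max(known, key=lambda v: (v.valid_from, v.recorded_at))


versions = [
    FactVersion("entity-x", "standard", valid_from=1, recorded_at=1),
    # The fact changed at T4, but the system only learned of it at T6.
    FactVersion("entity-x", "elevated", valid_from=4, recorded_at=6),
]

before = as_of(versions, "entity-x", valid_time=5, transaction_time=5)
after = as_of(versions, "entity-x", valid_time=5, transaction_time=7)
print(before.value, after.value, after.recorded_at)
```

A store that keeps only `valid_from`, or that overwrites the old row when the update arrives, cannot tell the first query apart from the second, and that is the failure the test is designed to expose.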

The diagram below shows a concrete version of this test case, along with the hard-delete test from Test 1. The as-of slider is the examiner’s tool.

Figure (interactive timeline): with the as-of slider at T5, only the standard risk classification is visible; advanced to T9, the elevated classification is visible. Selecting the personal-data node and pressing GDPR hard delete runs Test 1, and the signed deletion event appears in the audit log at the bottom right.

Test 3: Full mission replay. Run a three-week simulated mission involving at least ten agents writing to and reading from the memory graph. At the end, issue a replay command and verify that the reconstructed sequence of memory accesses matches the original to within a defined tolerance (zero tolerance for any fact that was retrieved and influenced a decision, higher tolerance acceptable for incidental reads). A system that achieves replay by re-running the original inputs through the current model state is not passing this test: replay must use the signed event log, not recomputation.
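One way to express the tolerance rule as a pass or fail check is sketched below. The record shape (`step`, `node_id`, `influenced_decision`) and the default tolerance are hypothetical, and the comparison assumes both sequences have already been extracted from the signed event log.

```python
def replay_passes(original, replayed, incidental_tolerance=0.01):
    """Compare a replayed access sequence with the original.

    Zero tolerance for decision-influencing reads; incidental reads may
    mismatch up to `incidental_tolerance` (a fraction between 0 and 1).
    """
    replayed_by_step = {a["step"]: a["node_id"] for a in replayed}
    original_steps = {a["step"] for a in original}

    # A replay containing accesses that never happened is a failure.
    if set(replayed_by_step) - original_steps:
        return False

    incidental_total = 0
    incidental_mismatches = 0
    for access in original:
        matches = replayed_by_step.get(access["step"]) == access["node_id"]
        if access["influenced_decision"]:
            if not matches:
                return False  # zero tolerance on this class of read
        else:
            incidental_total += 1
            if not matches:
                incidental_mismatches += 1

    if incidental_total == 0:
        return True
    return incidental_mismatches / incidental_total <= incidental_tolerance


original = [
    {"step": 1, "node_id": "n1", "influenced_decision": True},
    {"step": 2, "node_id": "n2", "influenced_decision": False},
    {"step": 3, "node_id": "n3", "influenced_decision": True},
]
# Replay that gets a decision-influencing read wrong at step 3.
bad_replay = [dict(a) for a in original]
bad_replay[2]["node_id"] = "n9"
# Replay that only drifts on the incidental read at step 2.
drift_replay = [dict(a) for a in original]
drift_replay[1]["node_id"] = "n8"

print(replay_passes(original, original))
print(replay_passes(original, bad_replay))
```

The check says nothing about how the replayed sequence was produced. Confirming that it came from the event log rather than from recomputation is a separate question for the examiner.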

Test 4: Provenance completeness audit. After the simulated mission, issue a provenance completeness query: for every memory node accessed during the mission, what fraction carries a complete provenance record (agent identity, policy version, source attribution, temporal bounds, and deletion event if applicable)? The pass threshold for regulated deployment should be 100%. A system that carries provenance on 94% of nodes and has no provenance on the remaining 6% has no way of guaranteeing that the 6% are not the nodes that matter most in an audit.
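The completeness query reduces to a simple computation once the accessed nodes are exported. The sketch below assumes a hypothetical node shape and field names (`agent_id`, `policy_version`, and so on); the real field list would come from whatever provenance schema the system under test uses.

```python
# Hypothetical field names standing in for the rubric's required record.
REQUIRED_FIELDS = ("agent_id", "policy_version", "source", "valid_from", "recorded_at")


def is_complete(node):
    """True if the node carries every required provenance field."""
    provenance = node.get("provenance") or {}
    if any(provenance.get(field) in (None, "") for field in REQUIRED_FIELDS):
        return False
    # A deleted node must also reference its deletion event.
    if node.get("deleted") and not provenance.get("deletion_event"):
        return False
    return True


def provenance_report(nodes):
    incomplete = [n["node_id"] for n in nodes if not is_complete(n)]
    total = len(nodes)
    fraction = (total - len(incomplete)) / total if total else 0.0
    # Binary outcome: anything short of 100% is a fail.
    return {
        "fraction": fraction,
        "incomplete": incomplete,
        "passed": total > 0 and not incomplete,
    }


base = {
    "agent_id": "agent-7",
    "policy_version": "v3",
    "source": "intake-form",
    "valid_from": 1,
    "recorded_at": 2,
}
nodes = [
    {"node_id": "n1", "provenance": dict(base)},
    {"node_id": "n2", "deleted": True,
     "provenance": {**base, "deletion_event": "evt-19"}},
    {"node_id": "n3", "provenance": {**base, "policy_version": None}},
]
report = provenance_report(nodes)
print(report["passed"], report["incomplete"])
```

Returning the list of incomplete node IDs alongside the fraction gives the audit team something to act on when the result is a fail.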

Test 5: Policy-scoped retrieval isolation. Configure two missions with different policy perimeters, using overlapping agent pools. Write memory nodes under each policy. Verify that an agent operating under mission A cannot retrieve nodes written under mission B, and that the denial does not reveal the existence of the withheld node. Issue the same query from each mission context and confirm that the result sets are correctly separated.
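A minimal sketch of the isolation check follows. The store, policy names, and `retrieve` function are invented for illustration; the point is only the shape of the assertion, namely that a query matching an out-of-scope node returns exactly the same response as a query matching nothing at all.

```python
def retrieve(store, mission_policy, term):
    """Return matching nodes from inside the mission's policy perimeter.

    Scoping happens inside retrieval, so out-of-scope nodes never enter
    the candidate set and no 'access denied' signal is produced.
    """
    hits = [
        n["node_id"] for n in store
        if n["policy"] == mission_policy and term in n["text"]
    ]
    return {"results": hits, "count": len(hits)}


store = [
    {"node_id": "a1", "policy": "healthcare-coding",
     "text": "clinical code for fracture"},
    {"node_id": "b1", "policy": "financial-services",
     "text": "ledger entry for fracture insurance claim"},
]

# The same query from each mission context gives separated result sets.
from_a = retrieve(store, "healthcare-coding", "fracture")
from_b = retrieve(store, "financial-services", "fracture")

# Existence hiding: a term that only matches a withheld node looks
# identical to a term that matches nothing.
denied = retrieve(store, "healthcare-coding", "ledger")
missing = retrieve(store, "healthcare-coding", "no-such-term")
print(denied == missing)
```

A fuller probe would also compare response timing, counts, and streaming events across the two cases, since those are side channels the equality check above does not see.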

Test 6: Audit export verification. Issue a request for a signed audit export covering the mission period. Verify that: (a) the export is machine-readable and schema-validated; (b) all signatures can be verified by a third-party tool without access to the vendor’s infrastructure; (c) the export can be independently verified against the hash chain without the vendor’s participation.
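The hash-chain part of this check can be run offline by anyone who holds the export. The sketch below assumes a simple block layout (`event`, `prev_hash`, `hash`) and SHA-256 over canonical JSON; a real export format will differ, and per-event signature verification is left out.

```python
import copy
import hashlib
import json

GENESIS = "0" * 64  # assumed starting value for the chain


def block_hash(prev_hash, event):
    """Hash of an event chained to the hash of the previous block."""
    payload = json.dumps(event, sort_keys=True, separators=(",", ":"))
    return hashlib.sha256((prev_hash + payload).encode()).hexdigest()


def build_export(events):
    blocks, prev = [], GENESIS
    for event in events:
        current = block_hash(prev, event)
        blocks.append({"event": event, "prev_hash": prev, "hash": current})
        prev = current
    return blocks


def verify_export(blocks):
    """Return None if the chain is intact, else the index of the first
    block that fails verification."""
    prev = GENESIS
    for index, block in enumerate(blocks):
        expected = block_hash(prev, block["event"])
        if block["prev_hash"] != prev or block["hash"] != expected:
            return index
        prev = block["hash"]
    return None


events = [
    {"type": "write", "node_id": "n1", "agent": "agent-7"},
    {"type": "read", "node_id": "n1", "agent": "agent-9"},
    {"type": "delete", "node_id": "n1", "agent": "agent-7"},
]
export = build_export(events)

# Simulate an after-the-fact edit to the second event.
tampered = copy.deepcopy(export)
tampered[1]["event"]["agent"] = "agent-0"

print(verify_export(export), verify_export(tampered))
```

Because the verifier needs nothing but the export itself, the check works even when the party holding the system and the party requesting the evidence do not trust each other.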

None of these tests require unusual infrastructure to run. They require a memory system that was designed with governance as a first-class property, which is the Substrate factory’s approach with the governed memory engine and the OAMP protocol.

Sector walkthrough: government casework

Consider a government agency running an agentic factory to process welfare eligibility cases. The swarm ingests applications, queries the claimant’s history, applies the current policy rules, flags edge cases for a human caseworker, and issues a decision. The memory component is accumulating a rich graph of claimant data, policy rules, and decision history.

Three months in, a claimant raises a complaint alleging that the system was using an outdated version of a policy rule at the time their case was decided. The agency’s information rights team is asked to investigate.

With a recall-only memory system, the investigation starts with a problem: the current state of the memory graph reflects current policy. The historical state, what the agent actually saw on the day, is gone. The agency has to reconstruct what policy was in force from external records, match it to the decision timestamp, and hope the reconstruction is accurate. This is the situation that generates ombudsman complaints and judicial review challenges. The reconstruction is neither reliable nor auditable.

With a bitemporal memory system, the investigation is a query. The casework team sets the as-of time to the decision date, issues the same policy query the agent ran, and gets back the exact subgraph that was visible to the agent on that day. The result is deterministic. It is signed. It can be verified by any party with access to the public key. The investigation takes hours instead of months, and the answer is auditable in the way that matters to an ombudsman or a court.

The hard-delete test also applies here. When the claimant exercises their right to erasure, the agency must show that personal data nodes are gone while the decision record, needed to defend the decision under GDPR Article 17’s legal-claims exemption, is preserved. A system that conflates the two will either delete the wrong thing or retain the wrong thing.

The before and after: a three-month investigation, a reconstruction that is plausible but not provable, and an ombudsman report noting “incomplete records” versus a same-day query producing a verified, signed answer. The governance rubric turns the investigation from an ordeal into a procedure.

How Substrate approaches this by construction

The business factory’s memory layer is built from the start around the seven axes in the governance rubric, not as features bolted on after a recall-first design.

The governed memory engine is a bitemporal graph. Valid time and transaction time are first-class indexed dimensions, not reconstructable metadata. Every query can be scoped to any combination of the two axes at roughly 3 ms recall latency (verified from Substrate platform metrics). This is fast enough that bitemporal scoping adds no meaningful overhead to the swarm’s execution path.

Hard-delete in the governed memory engine is a cryptographic operation. The deletion event is signed by the requesting agent identity, it carries the policy authority under which it was issued, and it is committed to the append-only event log as a first-class block. The hash chain ensures the deletion event is tamper-evident. The OAMP existence-hiding requirement ensures that deleted nodes do not appear in any retrieval path, including count responses and streaming events.

Provenance depth is a write-time property, not a post-hoc annotation. Every write to the governed memory engine carries the OAMP provenance fields: agent identity, policy version, source attribution, temporal bounds, and the action ID that triggered the write. This information is inside the signature boundary. Changing any field would change the block hash and break the chain. The result is that 100% of nodes have complete provenance records if the system is operating correctly, which is the only pass threshold that makes sense for regulated deployment.

Replay fidelity comes from the integration with the realtime data plane and the identity service. Every memory read and write is a signed event in the realtime data plane event log, so mission replay reconstructs from the signed event sequence, not from recomputation against current state.

Policy-scoped retrieval is enforced at the query interface by the mission orchestrator, which scopes each query to the mission’s declared policy perimeter before it reaches the governed memory engine. Policy isolation is part of the orchestration contract, not a per-developer convention.

Audit export is the signed event log itself. Any verifier with access to the public key can check the signatures without involving the factory’s infrastructure.

The CRO evaluating a system for regulated deployment is not looking for the recall leader. They are looking for the system that will still be deployable after the first regulator inquiry. On that criterion, slightly lower recall with a complete governance surface beats higher recall with none. That is the same argument that “Provenance, not recall accuracy, is the real differentiator” makes at the individual-node level; the governance rubric extends it to the full battery of test cases a deployment will face over its lifetime.

What to include in the RFP

The standard evaluation criteria of accuracy, latency, and throughput are necessary but not sufficient. The following requirements should appear in any serious RFP alongside them.

Require a live demonstration of a bitemporal as-of query against a dataset where a fact changed between valid time and transaction time. A vendor who can only describe the feature without demonstrating it live does not have it.

Require a provenance completeness report covering 100% of nodes accessed during the evaluation period, with agent identity, policy version, source attribution, and temporal bounds for each. Ask explicitly which fields are populated at write time versus inferred post-hoc. The distinction matters in a forensic context.

Require a demonstrated hard-delete under GDPR Article 17 that produces a signed deletion event cryptographically bound to the original write, verifiable by your information governance team without vendor assistance.

Require a machine-verifiable audit export. Ask whether it can be independently verified against the hash chain without the vendor’s infrastructure involved.

Require a policy-scoped retrieval isolation test with two simultaneous policy perimeters. Denied queries must not reveal the existence of withheld nodes (the OAMP existence-hiding requirement).

Ask each vendor to score their own system against the seven governance axes in the radar diagram and then demonstrate each one live. The self-assessment reveals where they expect the gaps; the demonstration confirms whether those gaps are real. A vendor who has not heard of LoCoMo’s governance limitations has been selling into research contexts and is discovering enterprise requirements for the first time.

The 90-day pilot that validates governance, not just recall

The right way to evaluate a memory system for regulated deployment is a parallel-track pilot: recall quality and governance simultaneously, from day one.

Track one measures what the existing benchmarks measure. Recall precision and recall at various k. Latency under the query load the production system will generate. Consistency under concurrent writes from multiple agents. Hallucination rate on facts that are present in the memory graph but were written under a policy perimeter the current query should not be able to access. These numbers tell you whether the system is useful.

Track two runs the governance test battery described above from the first week. Every week, the information governance team issues a governance probe: a bitemporal as-of query, a provenance completeness check, a policy isolation test, or a hard-delete exercise. The probes are recorded. By week twelve, you have twelve weeks of evidence that the system’s governance properties are stable, not just claimed.

The two tracks can fail independently. A system that passes track one and fails track two is an impressive demo that you cannot put in front of a regulator. A system that passes track two and fails track one is a well-governed system that nobody wants to use because the retrieval quality is not good enough for the workflow. You need both.

The pilot design also produces the artefacts for the governance committee presentation. The governance probe results are themselves signed, auditable records. When you present to the board or the information governance committee, you are not describing what the system will do in production. You are showing what it did during the pilot, in a form that can be independently verified.

There is a practical reason to run the governance track from day one rather than as a gate after the recall track. Memory systems accumulate provenance debt. If a system runs for twelve weeks with incomplete provenance records, correcting that retroactively is not possible: the signatures were not there at write time, and no post-hoc annotation can replicate them. A governance problem discovered at week twelve of a recall-only pilot means starting over with a different system.

That is the quiet point underneath everything in this rubric. The governance axes are not a checklist of features you add to a recall system. They are properties of a memory architecture that was designed from the start around the questions that regulated buyers will eventually ask. A system that was not designed that way will not pass the governance test battery, not because of missing features but because of fundamental architectural choices that cannot be changed without rebuilding the system.

The two vendors who made it to the final round of that NHS trust procurement were not differentiated by their recall benchmarks. They were differentiated by whether one of them had thought, at design time, about the questions a clinical governance team would ask three years into a production deployment.

For more on the factory that declares a mission and a budget and produces both the software and the signed evidence that it was built correctly: “Bitemporal memory as the compliance backbone” covers the temporal indexing in more depth, and “Memory that was never there: hard-delete and governance by omission” covers the deletion mechanics. If you want to see how the memory layer fits into the full governance contract across all six Substrate systems, you can request the investor brief.