The factory's memory that improves itself

Author: Jonathan Conway
Date: 30 May 2026
Tags: governed-memory / the memory tuning loop / online-learning / thompson-sampling / edge-weights / self-tuning / regulated / substrate / governance

A large European insurer ran an agent-assisted claims review programme for eleven weeks before anyone noticed that retrieval quality had quietly degraded. The early weeks had been excellent. The agent surfaced the right policy clauses, the right precedent cases, the right clinical guidelines. By week ten it was pulling stale information and missing active exclusions. Nobody had changed anything. The underlying documents had not shifted. What had shifted was the query distribution: the compliance team had moved from routine renewals to a wave of complex appeals, and the memory engine was still optimised for the simpler workload it had seen at launch.

The retrieval parameters were static. They had been tuned, once, before go-live, against a benchmark that looked nothing like live appeals traffic. Every agent memory system on the market works this way. A set of weights, a decay factor, a fusion balance between keyword and vector search: fixed at ship time, adjusted by hand when someone notices a problem, which is usually after the problem has already cost something.

For a regulated factory, this is not a minor inconvenience. A retrieval system that quietly drifts out of alignment with the actual workflow is a compliance risk. The evidence pack an agent assembles is only as good as the context it retrieved, and if retrieval has degraded, the pack has degraded with it, possibly without anyone knowing until an auditor asks a question the pack cannot answer.

The right answer is not better upfront tuning. It is a memory system that keeps learning, inside the governed perimeter, in a way that is itself auditable.

Why static parameters fail in long-running programmes

The problem is not that static tuning is done badly. Most teams actually tune carefully at the start. The problem is structural: the traffic a long-running regulated programme sees six months in is rarely the traffic it saw at launch.

A financial-crime surveillance programme starts with transaction screening. Three months later the scope expands to include correspondent banking relationships. Six months in the team is doing deep cross-border investigation. The query types have changed fundamentally. Entity lookups have given way to multi-hop relational queries. The optimal propagation depth for a simple name-match query is two hops. The optimal depth for “show me all the ownership chains that connect this shell company to a sanctioned individual” might be five or six hops, and the decay factor that kept week-one retrieval tight and fast will kill week-twenty retrieval dead.

The governed memory engine’s retrieval surface has several dozen tunable parameters. The ones that matter most in practice are propagation depth, the decay rate applied at each hop of the spreading-activation traversal, the fusion weights between BM25 keyword scoring and vector similarity, the edge-type weights for episodic versus semantic versus temporal versus hierarchical connections, and the base-level activation alpha that determines how much recency boosts a node’s relevance. Change the workload and the optimal values of every one of these shifts.

You can respond to this by retuning manually every quarter. That is what most teams do. It requires a bench of labelled queries, a parameter sweep, a validation run, a deployment, and someone who knows what they are looking at. It also means you are always tuning for the past, not the present.

The alternative is to make the tuning continuous and automatic, which means running an online learning loop inside the factory while it operates.

The learning loop: from offline sweep to live adaptation

The memory tuning loop, the optimisation layer for the governed memory engine, operates in three phases that build on each other. The first is an offline Bayesian sweep before the factory goes live. The second is query-adaptive routing that selects among optimised sub-configurations at request time. The third, and the one this post focuses on, is online learning via per-tenant Thompson Sampling bandits that run in the request path and adapt retrieval parameters continuously from live feedback.

The offline sweep matters because it gives the online learning a good starting point. A cold-start memory engine with random parameters and a Thompson Sampling bandit tightening around the wrong region of parameter space will take a long time to converge anywhere useful. Run the BOHB sweep first, on the initial labelled workload, and the bandit starts with an informed prior. New tenants begin with the offline-optimised defaults and specialise from there, rather than learning from scratch.
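One way to picture the informed prior is the Python sketch below, which seeds a ten-bin Dirichlet with extra concentration on the bin the offline sweep preferred. The helper name `informed_prior`, the `prior_strength` value, and the bin indices are illustrative assumptions, not part of any documented API.

```python
def informed_prior(n_bins: int, best_bin: int, prior_strength: float = 5.0) -> list[float]:
    """Return Dirichlet concentrations with extra mass on the offline-optimal bin.

    Hypothetical helper: every bin starts at 1.0 (a flat prior) and the bin
    chosen by the offline sweep receives `prior_strength` extra pseudo-counts.
    """
    if not 0 <= best_bin < n_bins:
        raise ValueError("best_bin must index one of the bins")
    alpha = [1.0] * n_bins
    alpha[best_bin] += prior_strength
    return alpha


def posterior_mean(alpha: list[float]) -> list[float]:
    """Expected probability of each bin under Dirichlet(alpha)."""
    total = sum(alpha)
    return [a / total for a in alpha]


cold_start = informed_prior(10, best_bin=0, prior_strength=0.0)  # flat: no preference
warm_start = informed_prior(10, best_bin=3)                      # leans toward bin 3
```

A larger `prior_strength` makes a new tenant trust the offline sweep for longer before live feedback outweighs it; a smaller one lets the tenant specialise sooner.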

[Interactive figure: background sleep-cycle consolidation compressing episodic memories into semantic clusters, with an adjustable decay rate on the Ebbinghaus forgetting curve. The feedback arrows represent the reinforcement signals flowing back into the bandit state.]

The online learning mechanism is Dirichlet-Categorical Thompson Sampling, one bandit per tunable parameter per tenant. Thompson Sampling has been production infrastructure at Netflix, Microsoft, and Amazon Advertising for years. The reason it suits this application is a combination of three properties.

It is provably near-optimal in regret terms: for stochastic bandits, the expected reward lost to exploration grows as O(log T), which matches the asymptotic lower bound for that class of problem. It is trivially parallelisable because each query samples independently with no coordination overhead. And it handles cold starts gracefully because you can initialise the Dirichlet posterior with a prior, which in the case of the memory tuning loop is the offline-optimised configuration. You are not starting in the dark.

The mechanics are as follows. Each continuous parameter, say the propagation depth, is discretised into ten bins covering its sensible range. On each query, the bandit for each parameter samples a bin from its current posterior distribution. The combined bin selections form the ActivationConfig for that query. The query runs. Feedback arrives when a user accesses a retrieved node within five minutes, or when an explicit relevance signal comes in. The posterior for each parameter is updated. Over roughly one hundred queries the distribution tightens toward the near-optimal bins for that tenant’s current workload.
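The mechanics above can be sketched in a few lines of Python. This is one plausible reading of Dirichlet-Categorical Thompson Sampling, not the engine's actual implementation: each bandit draws a probability vector from its Dirichlet posterior, picks the most probable bin, and adds pseudo-counts to that bin when feedback arrives. The class name, the parameter names `propagation_depth` and `hop_decay`, and their ranges are assumptions made for illustration.

```python
import random


class DirichletBandit:
    """One bandit for one discretised parameter (hypothetical sketch)."""

    def __init__(self, low: float, high: float, n_bins: int = 10, prior=None, seed=None):
        self.low, self.high, self.n_bins = low, high, n_bins
        self.alpha = list(prior) if prior is not None else [1.0] * n_bins
        self.rng = random.Random(seed)

    def sample_bin(self) -> int:
        # A Dirichlet draw is a vector of Gamma draws divided by their sum.
        # The argmax is unchanged by that division, so normalising is skipped.
        draws = [self.rng.gammavariate(a, 1.0) for a in self.alpha]
        return max(range(self.n_bins), key=draws.__getitem__)

    def bin_value(self, index: int) -> float:
        # Midpoint of the chosen bin within [low, high].
        width = (self.high - self.low) / self.n_bins
        return self.low + (index + 0.5) * width

    def update(self, index: int, reward: float) -> None:
        # Positive feedback adds pseudo-counts to the bin that was used.
        self.alpha[index] += reward


def sample_config(bandits: dict) -> tuple[dict, dict]:
    """Sample one bin per parameter; return (bin choices, parameter values)."""
    bins = {name: b.sample_bin() for name, b in bandits.items()}
    values = {name: bandits[name].bin_value(i) for name, i in bins.items()}
    return bins, values


# Assumed parameter names and ranges, for illustration only.
bandits = {
    "propagation_depth": DirichletBandit(1, 6, seed=1),
    "hop_decay": DirichletBandit(0.1, 0.9, seed=2),
}
bins, config = sample_config(bandits)
# ... run the query with `config`; if feedback arrives inside the window:
for name, index in bins.items():
    bandits[name].update(index, reward=1.0)
```

Because every parameter's bandit is updated with the same reward, the sketch treats parameters as independent. That keeps the request path cheap, at the cost of ignoring interactions between parameters.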

The thing that prevents the system from locking in permanently on a configuration that suited the old workload is concentration decay. Every five hundred queries, each Dirichlet concentration parameter is reduced by five percent toward its uniform mean. The rank ordering of bins is preserved, so the engine does not forget everything it has learned, but the confidence degrades enough that it will re-explore when usage patterns shift. A compliance team that pivots from entity lookups to temporal queries after a regulatory change does not need manual retuning. The bandits adapt within about two decay intervals, roughly a thousand queries.
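A minimal sketch of concentration decay follows, assuming "five percent toward its uniform mean" means each concentration moves five percent of the way toward the average concentration across bins. Other readings are possible, for example shrinking every concentration multiplicatively. The function names are hypothetical.

```python
def decay_toward_uniform(alpha: list[float], rate: float = 0.05) -> list[float]:
    """Move every concentration `rate` of the way toward the mean concentration."""
    mean = sum(alpha) / len(alpha)
    return [a - rate * (a - mean) for a in alpha]


def rank_order(alpha: list[float]) -> list[int]:
    """Bin indices sorted from most to least concentrated."""
    return sorted(range(len(alpha)), key=lambda i: -alpha[i])


alpha = [1.0, 2.0, 40.0, 6.0, 1.5]
decayed = decay_toward_uniform(alpha)
# A manual "partial decay" could be modelled as the same call with a larger rate.
forced = decay_toward_uniform(alpha, rate=0.5)
```

Under this reading each decay event shrinks the gap between any two bins by the same factor (0.95 at the default rate), which is why the ordering survives while the posterior flattens.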

Edge-weight tuning and PPR: making the graph smarter

Beyond the scalar parameters, retrieval quality in a memory graph depends heavily on the weights assigned to individual edges. An edge from a “prior claim” episode node to a “coverage exclusion” fact node carries different information than an edge between two contemporaneous transaction nodes. Treating them identically is wrong.

The governed memory engine’s spreading-activation traversal uses Personalised PageRank (PPR) as the backbone for multi-hop propagation, with edge weights that modulate how much activation flows across each connection. When the learning loop observes that a particular edge was on the retrieval path for a query that received positive feedback, that edge gets a small weight increase. When an edge was traversed in queries that were consistently ignored or marked irrelevant, its weight decays. The effect is that the graph itself becomes tuned to the actual information-access patterns of the programme.
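The edge-weight feedback described above might look like the following Python sketch. The learning rate, the weight bounds, the default weight of 0.5, and the function name are assumptions chosen for illustration. The update nudges each edge on the retrieval path toward an upper bound on positive feedback and toward a lower bound on negative feedback, so weights stay bounded.

```python
Edge = tuple[str, str]


def update_edge_weights(
    weights: dict[Edge, float],
    path: list[Edge],
    reward: float,
    lr: float = 0.05,
    w_min: float = 0.05,
    w_max: float = 1.0,
) -> dict[Edge, float]:
    """Return a new weight map with the edges on `path` adjusted.

    `reward` is assumed to lie in [-1, 1]: positive for useful retrievals,
    negative for results that were ignored or marked irrelevant.
    """
    updated = dict(weights)  # leave the caller's map untouched
    for edge in path:
        w = updated.get(edge, 0.5)
        target = w_max if reward > 0 else w_min
        # Move a fraction of the remaining distance toward the target.
        updated[edge] = w + lr * abs(reward) * (target - w)
    return updated


weights = {("prior_claim", "coverage_exclusion"): 0.5, ("tx_1", "tx_2"): 0.5}
path = [("prior_claim", "coverage_exclusion")]
reinforced = update_edge_weights(weights, path, reward=1.0)
weakened = update_edge_weights(weights, path, reward=-1.0)
```

Returning a new map rather than mutating in place makes it straightforward to record the before and after values as a single logged adjustment.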

This is distinct from the parameter-level bandits. The bandits learn the global retrieval posture: how deep to go, how much to weight keyword versus vector, how much to reward recency. Edge-weight tuning learns the local topology: which paths through this particular programme’s memory graph are genuinely informative. A financial-crime graph will develop strong edge weights between sanction-list entities and transaction episodes. A healthcare graph will develop strong weights between diagnosis codes and the specific clinical guidelines that apply to them. The two graphs, even running the same base parameters, will have learned completely different connectivity strengths from their respective workloads.

[Interactive figure: a memory graph with a re-anchorable query node and a fan-effect decay slider that limits the retrieval horizon. The edge weights shown are illustrative of a tuned graph after several hundred queries: edges that carry genuine information are drawn brighter and thicker.]

The PPR teleportation probability is itself one of the bandit-tuned parameters. A high teleportation rate causes the traversal to frequently snap back to the query anchor, which favours precision: you get fewer results, closer to the query. A low teleportation rate allows the traversal to wander further along high-weight paths, which favours recall: you surface more contextually related nodes, at the cost of occasionally pulling in material that is only tangentially relevant. The bandit learns, from the specific query types a programme runs, which setting the users actually prefer.
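The precision and recall trade-off can be seen in a small power-iteration sketch of Personalised PageRank with a single anchor. This is the textbook algorithm, not the engine's traversal code; the handling of dangling nodes (all mass returns to the anchor) and the toy chain graph are assumptions.

```python
def personalised_pagerank(
    edges: dict[str, dict[str, float]],
    anchor: str,
    teleport: float = 0.15,
    iterations: int = 200,
) -> dict[str, float]:
    """Power iteration for PPR with all teleport mass returning to `anchor`.

    `edges[src][dst]` is the weight of the edge src -> dst.
    """
    nodes = set(edges) | {dst for nbrs in edges.values() for dst in nbrs} | {anchor}
    rank = {n: 0.0 for n in nodes}
    rank[anchor] = 1.0
    for _ in range(iterations):
        nxt = {n: 0.0 for n in nodes}
        for node, mass in rank.items():
            out = edges.get(node, {})
            total = sum(out.values())
            if total == 0:
                nxt[anchor] += mass  # dangling node: send everything home
                continue
            nxt[anchor] += teleport * mass
            for dst, w in out.items():
                # Remaining mass flows along edges in proportion to weight.
                nxt[dst] += (1 - teleport) * mass * w / total
        rank = nxt
    return rank


# Toy chain a -> b -> c -> d, anchored at "a".
chain = {"a": {"b": 1.0}, "b": {"c": 1.0}, "c": {"d": 1.0}}
tight = personalised_pagerank(chain, "a", teleport=0.6)  # favours precision
loose = personalised_pagerank(chain, "a", teleport=0.1)  # favours recall
```

With the high teleport rate most of the activation stays on and near the anchor; with the low rate noticeably more of it reaches the far end of the chain.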

The safety question: self-optimisation without policy drift

Here is the part that a compliance officer will ask about: how do you let a memory system tune itself without it drifting out of policy?

The concern is legitimate. A memory system that improves retrieval for financial-crime queries might, if left entirely unsupervised, learn to weight evidence nodes that are technically relevant to investigators but outside the declared scope of the programme. A healthcare memory system might learn to surface patient records that were excluded from the programme by a data-subject consent decision. Self-optimisation can undermine governance if the two are not designed to work together from the start.

The Substrate answer has two parts.

First, the governance perimeter is enforced before retrieval begins and is not part of the learning loop. The governed memory engine’s scope governance layer filters the graph to the nodes and edges that are in scope for the current programme, based on the declared mission and the active policy configuration. The spreading-activation traversal, the edge-weight updates, and the bandit posteriors all operate entirely within the filtered graph. The learning system cannot observe nodes outside its perimeter. It cannot develop a preference for them. The learning optimises retrieval over the permitted memory, not over all memory.
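The ordering matters: filter first, then traverse. A minimal Python sketch of that ordering, with hypothetical function names and a toy graph, is shown here. Because the traversal only ever receives the filtered graph, a node that is out of scope cannot be reached, and neither can anything reachable only through it.

```python
from collections import deque

Graph = dict[str, dict[str, float]]


def filter_to_scope(edges: Graph, in_scope: set[str]) -> Graph:
    """Drop every node and edge that touches an out-of-scope node."""
    return {
        src: {dst: w for dst, w in nbrs.items() if dst in in_scope}
        for src, nbrs in edges.items()
        if src in in_scope
    }


def reachable(edges: Graph, anchor: str) -> set[str]:
    """Breadth-first traversal; stands in for spreading activation."""
    seen, queue = {anchor}, deque([anchor])
    while queue:
        node = queue.popleft()
        for dst in edges.get(node, {}):
            if dst not in seen:
                seen.add(dst)
                queue.append(dst)
    return seen


def scoped_retrieve(edges: Graph, anchor: str, in_scope: set[str]) -> set[str]:
    if anchor not in in_scope:
        raise PermissionError("query anchor is outside the programme scope")
    # The traversal never sees the unfiltered graph.
    return reachable(filter_to_scope(edges, in_scope), anchor)


graph = {
    "client": {"shell_co": 1.0, "excluded_record": 1.0},
    "shell_co": {"sanctioned_person": 1.0},
    "excluded_record": {"hidden_detail": 1.0},
}
scope = {"client", "shell_co", "sanctioned_person", "hidden_detail"}
result = scoped_retrieve(graph, "client", scope)
```

Note that `hidden_detail` is nominally in scope in this toy example but is only connected through an excluded node, so it is not retrieved either.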

Second, every parameter update, every posterior state transition, every edge-weight adjustment is itself a signed event in the programme’s audit log. This is the key governance twist. The self-improvement is not a background process that happens invisibly. It is a sequence of signed, replayable operations. A regulator reviewing the programme can ask: at week seven, what was the propagation depth the engine was using, and why had it changed from week two? The answer is in the log: here are the feedback signals that drove the posterior toward deeper propagation, here is the exact state of the bandit at that moment, here is the signed edge-weight adjustment that followed a particularly high-signal retrieval session. The improvement is auditable.
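To make "signed, replayable operations" concrete, here is a small Python sketch of a hash-chained, signed event log. It is an illustration under stated assumptions: HMAC-SHA256 with a shared key stands in for whatever signing scheme the identity plane actually uses, and the event fields are invented for the example.

```python
import hashlib
import hmac
import json


def _sign(key: bytes, prev: str, payload: dict) -> str:
    # Canonical JSON so the same event always produces the same signature.
    body = json.dumps({"prev": prev, "payload": payload}, sort_keys=True)
    return hmac.new(key, body.encode("utf-8"), hashlib.sha256).hexdigest()


def append_event(log: list[dict], key: bytes, payload: dict) -> None:
    """Append an event whose signature also covers the previous signature."""
    prev = log[-1]["signature"] if log else ""
    log.append({"prev": prev, "payload": payload, "signature": _sign(key, prev, payload)})


def verify_log(log: list[dict], key: bytes) -> bool:
    """Recompute every signature in order; any edit or reordering fails."""
    prev = ""
    for event in log:
        expected = _sign(key, prev, event["payload"])
        if event["prev"] != prev or not hmac.compare_digest(expected, event["signature"]):
            return False
        prev = event["signature"]
    return True


key = b"demo-key-not-for-production"
log: list[dict] = []
append_event(log, key, {"type": "bandit_update", "param": "propagation_depth", "bin": 3, "reward": 1.0})
append_event(log, key, {"type": "edge_weight", "edge": ["entity_7", "owns_via_2"], "old": 0.5, "new": 0.525})
```

Chaining each signature to the previous one means an auditor can detect not only a modified event but also a deleted or reordered one.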

This is the property that distinguishes the memory tuning loop running inside the governed Substrate factory from a standalone memory optimisation system bolted onto a glue stack. A LangChain implementation with a Thompson Sampling wrapper can learn. But the learning is opaque. There is no signed record of how or why retrieval changed. If retrieval quality degrades, or if a regulator asks why the system surfaced a particular piece of evidence in week nineteen, there is no answer. The audit trail, if it exists at all, covers what the agent did with the memory; it does not cover how the memory engine came to retrieve what it retrieved.

Substrate’s deterministic-replay property means the audit trail covers everything, including the evolution of the retrieval system itself. The realtime data plane’s storage engine records every state transition. The identity service’s identity plane signs every action. The tuning loop is inside that perimeter, not around it. If you want to understand the mechanics of how deterministic replay makes the full run reconstructable, the audit trail article covers that in depth.

A concrete example: financial-crime in months one through twelve

Consider a financial-crime programme declared against Substrate at the start of a compliance year. The mission is broad: maintain a continuously updated graph of entity relationships, transaction patterns, and sanction exposures relevant to the firm’s correspondent banking book.

At month one, most queries are entity lookups. “Is this legal entity name on a sanctions list?” The bandit learns quickly that shallow propagation, high BM25 weight, and tight decay are the right settings. The graph develops strong edge weights between entity nodes and their directly linked fact nodes. Retrieval is fast and precise.

At month three, the programme scope expands to include relationship screening. “Show me all the beneficial ownership chains connecting this client to any politically exposed persons.” Query depth requirements jump. The bandits begin shifting posterior mass toward deeper propagation and a lower PPR teleportation probability, which lets the traversal explore further from the anchor. The edge weights on hierarchical “owns-via” connections strengthen. The process takes perhaps two weeks to converge fully, without any manual intervention.

At month six, a regulatory change requires the programme to incorporate transaction pattern analysis. The query distribution shifts again. The bandits adapt. This time the shift is larger, and the five-percent concentration decay is not fast enough on its own: a compliance officer manually triggers a partial decay event, effectively telling the system to re-explore more aggressively. This is a supported operation and is itself a signed event in the audit log.

At month twelve, an auditor asks to review the programme. The auditor’s questions include: how did retrieval quality evolve over the year, what drove the changes, and can you demonstrate that the programme never surfaced out-of-scope data? The first two questions are answered by the bandit state log and the edge-weight history. The third is answered by the scope-governance events, which show the perimeter configuration at every point and the fact that the learning loop operated entirely within it.

That is the before and after in concrete terms. Before: a static retrieval configuration that starts well and degrades quietly, with no audit trail for the degradation, and a manual retuning process that runs quarterly at best. After: a retrieval system that improves continuously, inside a governed perimeter, with a complete signed record of how and why it changed.

How this differs from the memory-consolidation picture

There is a sibling article on memory consolidation and decay in long missions that covers related but distinct territory. The confusion between them is understandable, so it is worth being precise.

Consolidation and decay are about the structure of what is stored: episodic memories being compressed into semantic summaries, infrequently accessed nodes losing activation weight, background sleep-cycle processes that reorganise the graph topology. They are operations on the memory graph itself.

Self-tuning is about the retrieval algorithm that operates over the graph: the parameters controlling how deep the traversal goes, how activation spreads, how much weight different edge types carry, how keyword and vector signals are fused. The graph can be perfectly organised and still be retrieved badly if the parameters are wrong for the current query distribution.

In practice both happen simultaneously. The sleep-cycle consolidation process runs in the background, reorganising the graph, while the Thompson Sampling bandits adjust the retrieval posture in the foreground. They are complementary. Consolidation ensures the graph stays navigable and does not bloat with redundant episodic detail. Self-tuning ensures that the retrieval algorithm navigates it well given the current workload. Neither substitutes for the other.

The sibling article on spreading activation and explainability covers the underlying retrieval mechanism in more depth, including why explainability matters for regulatory defence and how the activation paths are surfaced as audit artefacts.

What happens when the learning loop goes wrong

No adaptive system is without failure modes, and being honest about them is part of what a regulated buyer deserves.

The most common failure mode is posterior collapse: the bandit converges very quickly on a configuration that suits one query type and the concentration decay is not fast enough to recover when the workload changes. The mitigation is the manual decay trigger, which compliance officers can invoke to force re-exploration. The trigger is a signed operation, so the decision to intervene is itself in the audit log.

The second failure mode is feedback noise: the five-minute access window used as a positive signal can be wrong. A user might access a retrieved node out of curiosity rather than because it was genuinely relevant, or might not access a highly relevant node because they already knew it. The memory tuning loop supports explicit relevance feedback as a supplement to the implicit signal, and the relative weight the bandit gives to implicit and explicit feedback is itself a tunable parameter. In high-stakes regulated programmes, compliance teams sometimes configure explicit feedback as the primary signal and implicit access as a secondary hint.
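A hedged sketch of how the two signals might be combined is given below. The five-minute window comes from the text; the linear blend, the `explicit_weight` default of 0.8, and the function names are assumptions for illustration.

```python
from datetime import datetime, timedelta
from typing import Optional

ACCESS_WINDOW = timedelta(minutes=5)


def implicit_signal(retrieved_at: datetime, accessed_at: Optional[datetime]) -> float:
    """1.0 if the node was opened within the window after retrieval, else 0.0."""
    if accessed_at is None:
        return 0.0
    elapsed = accessed_at - retrieved_at
    return 1.0 if timedelta(0) <= elapsed <= ACCESS_WINDOW else 0.0


def blended_reward(implicit: float, explicit: Optional[float], explicit_weight: float = 0.8) -> float:
    """Blend signals; `explicit` is a rating in [0, 1], or None if absent."""
    if explicit is None:
        return implicit  # fall back to the implicit hint alone
    return explicit_weight * explicit + (1.0 - explicit_weight) * implicit


t0 = datetime(2026, 5, 30, 9, 0, 0)
curious_click = implicit_signal(t0, t0 + timedelta(minutes=2))   # opened in time
late_click = implicit_signal(t0, t0 + timedelta(minutes=12))     # outside the window
# The reviewer opened the node but then marked it irrelevant.
reward = blended_reward(curious_click, explicit=0.0)
```

With explicit feedback weighted at 0.8, a curiosity click that the reviewer later marks irrelevant contributes only a small reward instead of a full one.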

The third failure mode is scope drift by stealth, the concern the compliance officer raised above. The mitigation, as described earlier, is structural: the learning loop does not see out-of-scope nodes and therefore cannot develop a preference for them. But the responsible approach is also to run periodic scope-governance audits, checking that the scope-perimeter events in the log match the current policy configuration, and flagging any divergence. This is a standard part of the quarterly review process for Substrate programmes.
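The periodic scope-governance check reduces, at its simplest, to a set comparison between the perimeter recorded in the log and the perimeter the current policy declares. The Python sketch below assumes both can be expressed as sets of node or category identifiers; the identifiers and function names are hypothetical.

```python
def scope_divergence(logged_scope: set[str], policy_scope: set[str]) -> dict[str, list[str]]:
    """Report identifiers present on one side of the comparison but not the other."""
    return {
        "in_log_not_policy": sorted(logged_scope - policy_scope),
        "in_policy_not_log": sorted(policy_scope - logged_scope),
    }


def is_consistent(divergence: dict[str, list[str]]) -> bool:
    """True when the logged perimeter matches the policy exactly."""
    return not any(divergence.values())


policy = {"entities", "transactions", "sanction_lists"}
logged = {"entities", "transactions", "sanction_lists", "hr_records"}
report = scope_divergence(logged, policy)
```

Anything listed under `in_log_not_policy` is the case a reviewer would want flagged first, since it means the perimeter in effect was wider than the policy now allows.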

Retrieval provenance as a governed artefact

The point about provenance is worth dwelling on, because it is where self-improving memory connects to the broader Substrate governance story.

The conventional framing for memory provenance is: when the agent cited this fact, which memory node did it come from, and when was that node written? That is the minimum the provenance article covers.

For a self-tuning system, there is a deeper provenance question: why, at the moment of retrieval, did the engine give that node a high activation score? The answer involves the current values of the bandit parameters and the current edge weights on the path from the query anchor to that node. If the bandit parameters have changed since the programme started, the same query submitted today will produce different activation scores than it would have produced at month one. A regulator reviewing a week-nineteen decision needs to know which retrieval configuration was in effect at week nineteen, not the current one.

Substrate records this. Each retrieval event carries a snapshot of the active ActivationConfig, including the current bandit-sampled parameters and the edge weights on the traversal path. The retrieval event is signed, and it references the bandit state record that produced those parameters. The full causal chain from query to retrieved context to agent decision is reconstructable. This is what “the self-improvement is signed and replayable” means in practice.
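Answering "which configuration was in effect at week nineteen" amounts to replaying the parameter-update events up to that point. A minimal Python sketch, assuming each logged update carries a week number, a parameter name, and a new value (all hypothetical field names with made-up values), looks like this:

```python
def config_at(events: list[dict], week: int) -> dict[str, float]:
    """Replay parameter updates in order and stop after the requested week."""
    config: dict[str, float] = {}
    # sorted() is stable, so same-week events keep their logged order.
    for event in sorted(events, key=lambda e: e["week"]):
        if event["week"] > week:
            break
        config[event["param"]] = event["value"]
    return config


# Illustrative events only; not real programme data.
events = [
    {"week": 1, "param": "propagation_depth", "value": 2},
    {"week": 1, "param": "bm25_weight", "value": 0.7},
    {"week": 14, "param": "propagation_depth", "value": 4},
    {"week": 30, "param": "propagation_depth", "value": 5},
    {"week": 30, "param": "bm25_weight", "value": 0.4},
]
week_19 = config_at(events, 19)
latest = config_at(events, 52)
```

The same query replayed against `week_19` and against `latest` would traverse the graph differently, which is why the snapshot has to travel with the retrieval event.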

This level of provenance is structurally impossible in a glue-stack memory system where retrieval parameters live in a config file that gets overwritten at redeployment. There is no record of what the parameters were at the time of a historical retrieval. The audit trail covers what the agent did; it does not cover why the memory engine surfaced what it surfaced. For a regulated buyer, that is a gap that will surface at exactly the wrong moment.

What to demand in an RFP

If you are evaluating agent memory systems for a long-running regulated programme, the questions below will separate the serious systems from the ones that will quietly degrade.

Ask whether the retrieval parameters are fixed at deployment or continuously adapted. If fixed, ask how they plan to retune when the query distribution shifts after go-live, and how long that process takes. If adaptive, ask what the adaptation mechanism is and whether it operates inside the governed data perimeter or requires data to leave it.

Ask whether edge weights in the memory graph are static or learned from programme-specific feedback. A memory graph with static edge weights treats all connections as equally important regardless of which ones actually carry information in your specific domain. That is a significant retrieval quality ceiling.

Ask specifically whether the adaptation process is auditable. This question will reveal a lot. The answer you want is: yes, every parameter update, posterior state transition, and edge-weight adjustment is a signed event in the programme audit log, and you can replay the state of the retrieval system at any historical moment. The answer you will often get is: we log retrieval queries and responses, and we can show you aggregate quality metrics. That is monitoring, not auditability. The distinction matters when a regulator asks not just what the agent retrieved but why the memory engine surfaced it.

Ask how the system prevents the learning loop from surfacing out-of-scope data. The answer should be architectural: the scope filter is applied first, and the learning operates only over the nodes that pass it for the current programme. If the answer is “we monitor for anomalies,” ask what the response time to a detected anomaly is and who is alerted.

Finally, ask for a 90-day pilot design that includes a baseline retrieval quality measurement at day one, a mid-point measurement at day 45, and a final measurement at day 90, with the expectation that quality should have improved as the system adapts to the actual query distribution. If the vendor is not willing to commit to measurable retrieval improvement over a 90-day live deployment, their adaptive learning claims are aspirational rather than operational.
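The pilot needs a retrieval quality metric that is computed the same way at each checkpoint. Recall at k over a fixed set of labelled queries is one common choice; the Python sketch below shows it with toy data. The metric choice, the value of k, and the labelled examples are assumptions, and a real pilot would agree them with the vendor up front.

```python
def recall_at_k(retrieved: list[str], relevant: set[str], k: int) -> float:
    """Fraction of the relevant nodes that appear in the top k results."""
    if not relevant:
        raise ValueError("each labelled query needs at least one relevant node")
    hits = len(set(retrieved[:k]) & relevant)
    return hits / len(relevant)


def mean_recall_at_k(runs: list[tuple[list[str], set[str]]], k: int) -> float:
    """Average recall@k over (retrieved, relevant) pairs for one checkpoint."""
    return sum(recall_at_k(r, rel, k) for r, rel in runs) / len(runs)


# Toy labelled queries, invented purely to show the calculation.
checkpoint = [
    (["n1", "n7", "n3", "n9"], {"n1", "n3"}),  # both relevant nodes in top 3
    (["n4", "n2", "n8", "n5"], {"n5", "n2"}),  # one of two in top 3
]
score = mean_recall_at_k(checkpoint, k=3)
```

Running the same labelled set at day one, day 45, and day 90 gives three comparable numbers, which is what makes the improvement claim testable.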

The long-run case

The argument for self-improving memory is not complicated. Regulated programmes run for years. Query distributions drift. Manual retuning is expensive, slow, and always retrospective. A memory system that learns from its own usage, inside a governed perimeter, with a complete audit trail for the learning itself, gets progressively better at exactly the workflows the programme actually runs, without requiring anyone to notice that things have drifted and intervene.

The governance argument is the less obvious but ultimately more important one. For a regulated factory, it is not enough for the memory to improve. The improvement has to be defensible. The auditor reviewing a week-forty decision cannot accept “the memory system got better over time” as an explanation for why retrieval produced a different result than it would have at week four. They need to see the signed record: here is how the bandit parameters evolved, here are the feedback signals that drove those changes, here is the exact configuration that was active at the moment of retrieval.

That is the property that makes self-improving memory viable in regulated programmes rather than a liability in them. The learning is not happening in the dark. Every step of it is in the signed, replayable audit trail, because the tuning loop runs inside the same governed factory that signs and records everything else. You declare the mission and the budget; the swarm runs inside the perimeter; and the memory it uses gets sharper over time in a way that you can show to a regulator and have them understand exactly what happened.

If you want to understand how that audit trail connects to the broader factory architecture, request the investor brief or start with how provenance becomes a competitive differentiator in its own right.