Article

Memory consolidation, decay and sleeping in long-lived regulated missions

author Jonathan Conway
timestamp 28 May 2026
classification kizuna-mem / memory / consolidation / decay / dark-factory / budget-governance / regulated / substrate / ebbinghaus

The audit report was supposed to take three days. By day eleven the swarm was still running, and the cost counter had climbed past four times the estimate. Nobody had done anything obviously wrong. The budget ceiling existed on paper. The problem was simpler and harder to fix: by day three the memory graph had swallowed every intermediate observation, every draft paragraph, every failed attempt at a section heading. The agents kept retrieving all of it. Reflection loops kept pulling stale drafts. The context window bloated. Token spend climbed with it. A mission that should have consolidated its working memory the way a human consultant does, leaving only the relevant skeleton behind after each day’s work, had instead accumulated the cognitive equivalent of never throwing anything away.

This is not a fringe failure mode. Any dark factory mission that spans more than a few hours hits it. And long-horizon regulated missions, the ones where the factory is most valuable, are precisely the ones most exposed. A quarterly control-testing cycle. A full DORA evidence pack for a tier-one bank. A clinical coding audit across six months of claims. These missions run for weeks or months. Without memory management that works on the same timescale, the factory does not run faster than a human team. It runs slower, at much higher cost, with a context that has become its own enemy.

The biological parallel is obvious enough that it is worth stating plainly. Humans consolidate during sleep. The hippocampus replays episodic memories to the cortex, which distils the important patterns and lets the unimportant raw experiences fade. What the cortex retains is semantic, structured, efficient to retrieve. What it discards is the noise. The whole process happens in the background, while the person is not actively working, at a cost so low it mostly runs on the residual metabolism of a sleeping body. The output is a brain that performs better on day two than it would have if it had simply stayed awake.

Kizuna-mem gives the factory an analogue of that process. It is not a metaphor. It is a concrete set of background pipelines, triggered on a cadence, that transform episodic raw observations into semantic community summaries, retire what has decayed below utility, and adjust decay rates based on what the mission’s agents have actually been reaching for.

Why a growing graph is a budget problem

Before the mechanics, the framing. Memory bloat is not just a quality issue. It is a cost issue, and in a governed factory where every token is metered before it runs, it is a budget governance issue.

The connection runs like this. Kizuna-mem’s retrieval works through spreading activation across the memory graph, as covered in spreading activation for regulators. The breadth and density of that graph determine how many nodes get activated per retrieval call. More nodes means more tokens in the context passed to the working agent. More tokens per step means faster burn against the mission’s declared budget. A graph that has grown to include three weeks of uncondensed episodic observations from a long-running mission will produce retrieval calls that are systematically more expensive than the same graph after a consolidation pass has promoted the important patterns to summary nodes and let the raw episodes fade.

Ninmu, the swarm conductor that holds the global budget ledger, has no way to stop this burn without the co-operation of the memory layer. The budget ledger tracks inference cost per task. It routes tasks to cheaper models when the ceiling is under pressure, as described in cost governance before the invoice. But if the retrieval context for every task is bloated by three weeks of uncondensed episodic noise, cheap routing only goes so far. You are saving on model calls while haemorrhaging on context length.

Consolidation is what closes the loop between memory hygiene and budget governance. A well-consolidated graph retrieves the same semantic content in fewer tokens. That reduction flows directly into task cost, which flows directly into how much headroom the budget ledger has left for the work that actually matters.

The numbers are illustrative rather than measured at the precision of a formal benchmark, but the direction is clear. A mission running a nightly consolidation pass on its memory graph can expect retrieval context to stay roughly stable in size even as the mission extends in time, because the pipeline is continuously compressing episodic observations into semantic summaries at a ratio that grows with mission duration. Without consolidation, context size grows roughly linearly with mission time. The difference across a three-week mission, modelled conservatively, represents a meaningful fraction of total token spend. Exact figures depend on the retrieval fan-out and the specifics of the mission domain, but the governing principle is not in question.

The consolidation pipeline: from episodes to semantics

The pipeline has four distinct stages, and understanding what each one does explains why all four are necessary.

Figure (interactive in the original): the four pipeline stages from left to right. Raw episodic observations arrive at the left, are deduplicated and resolved in the near-real-time stage, are promoted during background consolidation, and fade in the natural-decay phase at the right. A sleep-cycle trigger shows a consolidation wave compressing the working set, and a decay-rate slider shows how quickly the lowest-utility nodes fade.

Ingest is where raw observations arrive. Every time an agent reads a document, executes a tool, observes an intermediate result, or receives a message, an episodic node lands in the graph with its valid timestamp and transaction timestamp. At this stage nothing is interpreted. The Observer component creates the node with consolidation_count = 0 and inserts the raw event. The graph is growing.

Dedup and resolution runs on a near-real-time cadence, typically within seconds of ingestion. The Reflector component runs O(E) conflict detection across newly arrived edges, groups them by subject and predicate, identifies duplicates and contradictions, and merges surviving nodes. When two observations refer to the same real-world entity, the surviving node’s consolidation_count increments by one. This is the first stage of reinforcement-aware accounting: a fact that has been observed multiple times, across different agents in the same mission, comes out of this stage with a higher count than a fact observed once. That count will later influence how slowly it decays.
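The resolution step above can be sketched in a few lines. This is a minimal illustration, not the Reflector's actual code: the `Edge` class, the `resolve` function, and the rule that two different objects for one subject and predicate count as a contradiction are all assumptions made for the example. It shows why the pass is O(E): each edge is visited once and placed in a hash bucket.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass
class Edge:
    subject: str
    predicate: str
    obj: str
    consolidation_count: int = 0


def resolve(edges):
    """Single pass over edges: merge duplicates, flag contradictions."""
    groups = defaultdict(dict)   # (subject, predicate) -> {obj: surviving Edge}
    for edge in edges:           # O(E): each edge is visited exactly once
        bucket = groups[(edge.subject, edge.predicate)]
        survivor = bucket.get(edge.obj)
        if survivor is None:
            bucket[edge.obj] = edge
        else:
            # Same fact observed again: reinforce the surviving node.
            survivor.consolidation_count += 1
    survivors = [e for bucket in groups.values() for e in bucket.values()]
    # Assumption: different objects for one (subject, predicate) are a conflict.
    conflicts = [key for key, bucket in groups.items() if len(bucket) > 1]
    return survivors, conflicts


observations = [
    Edge("control-17", "owner", "treasury"),
    Edge("control-17", "owner", "treasury"),   # duplicate from a second agent
    Edge("control-17", "owner", "risk-ops"),   # contradicts the first two
    Edge("control-22", "status", "tested"),
]
survivors, conflicts = resolve(observations)
```

Here the duplicated "treasury" observation survives once with a count of one, and the pair ("control-17", "owner") is flagged for whatever contradiction handling the deployment applies.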

Background consolidation is the sleep process. It runs on a longer cadence, often hourly or nightly depending on mission configuration, and it does the expensive work of community detection, L1 overview generation, and hierarchical promotion. Raw episodic nodes that belong to the same semantic cluster are summarised into a community node at a higher level of abstraction. The raw nodes are retained in the graph for bitemporal provenance (a regulator can still ask what the agents knew at hour three, day one, and retrieve the original observations) but they are demoted in the retrieval scoring. Their consolidation_count increments from the consolidation pass itself, which sounds counterintuitive: why increment the count of a node you are effectively retiring? Because the count is a reinforcement signal that feeds into the decay formula. A node that has been through a consolidation pass has been structurally important enough to participate in community formation. That should slow its decay relative to a node that arrived once and was never touched again, even though the community summary above it now handles retrieval for that semantic cluster.
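A rough sketch of that promotion step follows. It swaps in connected components (via union-find) for real community detection and a string join for the L1 summary inference call, so treat it as a shape rather than the implementation. `Node`, `consolidate`, the `demotion` factor and the `summarise` callback are hypothetical names introduced for the example.

```python
from dataclasses import dataclass


@dataclass
class Node:
    node_id: str
    text: str
    level: int = 0                  # 0 = raw episode, 1 = community summary
    consolidation_count: int = 0
    retrieval_weight: float = 1.0


def communities(node_ids, links):
    """Connected components via union-find, a stand-in for community detection."""
    parent = {n: n for n in node_ids}

    def find(n):
        while parent[n] != n:
            parent[n] = parent[parent[n]]   # path halving
            n = parent[n]
        return n

    for a, b in links:
        parent[find(a)] = find(b)
    groups = {}
    for n in node_ids:
        groups.setdefault(find(n), []).append(n)
    return list(groups.values())


def consolidate(nodes, links, summarise, demotion=0.1):
    """Promote each multi-node community to a summary; raw nodes are kept."""
    summaries = []
    for i, members in enumerate(communities(list(nodes), links)):
        if len(members) < 2:
            continue                                 # singletons are left to decay
        for m in members:
            nodes[m].consolidation_count += 1        # reinforcement signal
            nodes[m].retrieval_weight *= demotion    # demoted, not deleted
        text = summarise([nodes[m].text for m in members])
        summaries.append(Node(f"community-{i}", text, level=1))
    return summaries


nodes = {n.node_id: n for n in [
    Node("e1", "credit policy v3 read"),
    Node("e2", "credit limit breach sample"),
    Node("e3", "unrelated draft heading"),
]}
summaries = consolidate(nodes, [("e1", "e2")], summarise=" | ".join)
```

The two linked episodes produce one summary node and come out with a higher count and a lower retrieval weight, while all three raw nodes remain in `nodes` for provenance.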

Natural decay is the rightmost phase. Nodes whose consolidation_count remains low and that have not been accessed by retrieval in some time fade toward the threshold below which they are excluded from retrieval scoring. The decay formula in adaptive mode is:

R(m, t) = exp(-(t - r_m) / (tau * (1 + eta * ln(1 + n_m))))

Where t is now, r_m is the time of last reference, tau is the base decay time constant in days (an unreinforced memory’s score falls to 1/e after tau days, so its half-life is tau × ln 2), eta is the reinforcement sensitivity, and n_m is the consolidation_count. The predecessor post adaptive decay and reinforcement-aware memory covers this formula in full, including the logarithmic saturation that prevents heavily-used nodes from becoming effectively immortal. The key point for long-horizon missions is that the formula was designed precisely for this scenario: memories that agents actively use during the mission stretch their own half-life, while observations that arrived early and were never revisited fade naturally without any manual curation.
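To make the formula concrete, here is a direct transcription into Python. The function name `retention` and the sample ages are illustrative choices. The values of tau and eta are the ones used in the mission configuration elsewhere in this article. With n_m = 0 the stretch factor is 1, so the score is exp(-1) when the age equals tau.

```python
import math


def retention(age_days, tau_days, eta, n_m):
    """R(m, t) = exp(-(t - r_m) / (tau * (1 + eta * ln(1 + n_m))))."""
    stretch = 1.0 + eta * math.log1p(n_m)   # logarithmic, so it saturates
    return math.exp(-age_days / (tau_days * stretch))


TAU_DAYS, ETA = 180, 0.8
# Same age since last reference, increasing consolidation_count.
for n_m in (0, 1, 10, 100):
    print(n_m, round(retention(90, TAU_DAYS, ETA, n_m), 3))
```

Each tenfold increase in n_m adds roughly the same amount to the stretch factor, which is the logarithmic saturation the text describes: reinforcement helps, but with diminishing returns.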

The combination of these four stages is what gives the factory biological-style memory hygiene. Ingest accumulates. Resolution deduplicates. Consolidation promotes. Decay prunes. The result is a graph that stays navigable as the mission extends, rather than one that grows monotonically until it is too expensive to use.

How Kizuna-mem implements the sleep cycle

The consolidation pipeline runs continuously at the ingest and dedup stages. The background consolidation stage is different: it is heavier, it involves community detection on the full graph, and it should not compete with the live retrieval and ingestion workloads. Kizuna-mem schedules it as a background process, configurable per mission, that runs when retrieval traffic is low.

In practice this means nightly consolidation for missions that run on human business-day cadences, and more frequent passes for missions that run continuously. The configuration is at the mission level in Ninmu, because consolidation cadence is a policy decision that affects cost (consolidation passes consume tokens for L1 summary generation) and should be declared alongside the mission budget, not set as a hidden default somewhere in the memory layer.

The relevant configuration sits in the mission declaration:

[memory]
consolidation_cadence = "nightly"   # "hourly", "nightly", or a cron expression
decay.mode = "adaptive"
decay.tau_days = 180
decay.eta = 0.8

Setting consolidation_cadence = "nightly" instructs Ninmu to schedule a consolidation task after the mission’s last active task completes each day. The task is metered like any other task in the budget ledger. This is the important point: consolidation is not free. Generating L1 summaries requires inference calls, and those calls count against the budget. Declaring them in the mission budget is not bureaucracy. It is accurate accounting. A mission that consolidates nightly is spending a small fraction of its budget on compression, and the payoff is that every subsequent day’s retrieval calls are cheaper by a larger fraction. For a multi-week mission the economics are strongly in favour of consolidating. For a two-hour mission they probably are not.

The kizuna-tune component sits above this and provides the self-improving loop. Where the consolidation pipeline compresses existing observations into semantic summaries, kizuna-tune adjusts the retrieval weights themselves based on which memories the mission’s agents have found most useful. An agent that repeatedly retrieved a particular community node and produced good downstream results, as judged by the mission’s own success signals, causes the weights on the edges leading to that node to be adjusted upward. Future retrieval calls from agents with similar context will find it faster and with less spreading-activation budget. This is online learning within the governed perimeter: the factory gets better at the customer’s specific regulated workflows over time, and the improvements are auditable because every weight update is a signed event in the Cosmictron event log.
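The source does not specify the update rule or the signature scheme, so the following is only one plausible shape for an auditable weight update. It assumes an exponential-moving-average nudge toward a success signal in [0, 1] and an HMAC-SHA256 signature over the event. `tune_edge`, the edge naming, and the key are hypothetical.

```python
import hashlib
import hmac
import json


def tune_edge(weights, edge, reward, log, key, rate=0.1):
    """Nudge one edge weight toward a success signal and log a signed event."""
    old = weights[edge]
    # Move a fraction of the way toward the reward, clamped to [0, 1].
    new = min(1.0, max(0.0, old + rate * (reward - old)))
    weights[edge] = new
    event = {"edge": list(edge), "old": old, "new": new, "reward": reward}
    payload = json.dumps(event, sort_keys=True).encode()
    event["signature"] = hmac.new(key, payload, hashlib.sha256).hexdigest()
    log.append(event)
    return new


KEY = b"demo-key-not-for-production"
weights = {("agent-context", "community-credit-risk"): 0.5}
event_log = []
tune_edge(weights, ("agent-context", "community-credit-risk"),
          reward=1.0, log=event_log, key=KEY)
```

Because the signature covers the old weight, the new weight and the reward, an auditor holding the key can replay the log and confirm that no update was altered after the fact. Nothing in the update touches a decay schedule, which matches the separation the article draws between retrieval scoring and temporal validity.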

kizuna-tune does not change the compliance guarantees of the memory layer. The decay mode is still determined by the mission-level policy. If the mission is running compliance decay (the right choice for most regulated deployments, as the predecessor post discusses at length), kizuna-tune adjusts retrieval weights without touching the decay schedule. The DPO’s guarantee, that all memories older than N days have decayed below threshold Y, still holds. The tuning operates on retrieval scoring, not on the temporal validity of the underlying facts.

The cost inversion over a multi-week mission

The token reduction from consolidation and decay does not happen all at once. It compounds across the life of the mission, and the compounding is what makes it decisive for long-horizon work.

Figure (interactive in the original): retrieval token cost across a four-week mission, comparing the unmanaged baseline with the consolidated mission and showing the composition of each segment. All figures are illustrative. The direction of the shift is the point.

In the first few days, both a consolidated and an unconsolidated mission look similar. The graph is small. Retrieval is cheap. The background consolidation pass has not yet had much to work with. The lines diverge around the end of the first week.

By week two, the unmanaged graph has accumulated well over a thousand episodic observations from the mission’s agents. Retrieval calls are pulling context that includes early-mission observations now irrelevant to where the work has reached. The effective context length per retrieval call has grown. Token spend per task is up. The budget ledger is under pressure.

The consolidated mission, by contrast, has spent its nightly consolidation passes turning those episodic observations into a handful of community summaries. The raw observations are still in the graph for provenance, but they sit below the retrieval threshold for current work. Retrieval calls are pulling tight semantic summaries. Context length per call is stable. Token spend per task is close to what it was on day one.

By week four the gap is material. The illustrative model here assumes a mission with roughly 200 agent interactions per day, an average episodic-to-semantic compression ratio of around 10:1 from nightly consolidation, and a retrieval fan-out of 15 nodes per call before spreading activation. Under these assumptions, the consolidated mission’s retrieval token cost at week four is around a third of the unmanaged baseline. The actual figures for any specific mission will depend on its retrieval patterns and domain vocabulary, so treat these as directional rather than commitments.
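The divergence can be reproduced with a toy model of the active graph. The sketch below uses the 200 interactions per day and the 10:1 compression ratio from the assumptions above, and additionally assumes one episodic node per interaction. It tracks only the number of nodes visible to retrieval. It does not model fan-out, spreading activation, decay of summaries or hierarchical promotion, so it is not expected to reproduce the token-cost ratio quoted in the text.

```python
def active_nodes(days, per_day=200, ratio=10, consolidate=True):
    """Nodes above the retrieval threshold at the end of each day (toy model)."""
    summaries, raw, series = 0, 0, []
    for _ in range(days):
        raw += per_day                    # today's episodic observations
        series.append(summaries + raw)    # what retrieval can see today
        if consolidate:                   # nightly pass after the day's work
            summaries += raw // ratio     # 10 raw episodes -> 1 summary
            raw = 0                       # raw nodes demoted, still stored
    return series


unmanaged = active_nodes(28, consolidate=False)
managed = active_nodes(28)
print(unmanaged[-1], managed[-1])   # prints: 5600 740
```

On day one the two series are identical, which matches the observation that the lines only diverge once the consolidation pass has something to work with.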

What is not illustrative is the mechanism. Smaller graph of active nodes means cheaper retrieval. Cheaper retrieval means more of the budget goes to productive inference rather than context assembly. More of the budget going to productive inference means the factory can run longer before hitting the ceiling, or run the same mission for less.

A finance audit walkthrough

Take the quarterly control-testing mission mentioned in the introduction. A tier-one bank running a quarterly control test is asking the factory to do something that, done manually, takes a team of three or four for six to eight weeks. The factory should take a few days. But it is still a multi-day mission, and the memory management problem is real.

Without consolidation, the mission looks like this. On day one, agents ingest the control catalogue, the policy documents, the historical test results, and the transaction data sample. Several thousand observations land in the episodic graph. On day two, agents start drafting test narratives, and retrieval calls pull context that includes all the day-one ingestion observations. Many of them are now irrelevant to the specific controls being tested today. Context is bloated. Token spend is high. By day four, when agents are assembling the evidence pack, they are working with a retrieval context that includes four days of raw observations, many of them superseded by later work. The reflection loops that check evidence quality are expensive. The budget is under pressure.

With nightly consolidation and adaptive decay, the same mission runs differently. After day one, the consolidation pass promotes the control catalogue into semantic community nodes grouped by risk domain and control type. The raw ingest observations are retained for provenance but demoted in retrieval. Day-two agents retrieving context for a specific credit risk control get the semantic community node for credit risk controls, not a flat list of every document that touched that topic on day one. By day four, the evidence assembly agents are working with a well-organised graph of semantic summaries, and the reflection loops are cheap because the context they pull is tight and relevant.

The signed output is the same either way. The bitemporal record of every observation is preserved for the regulator to audit. What changes is how much of the budget the factory consumed getting there, and whether the mission stayed within the declared ceiling. The governance story here is directly connected to the memory story. A mission that has been declared with a hard budget ceiling needs memory management that does not erode that ceiling through context bloat. Consolidation is not an optimisation layered on top of the governance. It is part of how the governance works.

The regulator’s question, what did the agents know and when, is answered by the bitemporal layer independently of the consolidation state. As covered in bitemporal memory as the compliance backbone, the raw episodic observations remain in the graph with their original valid timestamps even after consolidation has promoted their semantic content to summary nodes. A compliance officer asking for the agent’s knowledge state at hour three of day one gets the original observations, not the post-consolidation summaries. The two representations coexist. Consolidation is an operational optimisation; the provenance record is immutable.
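The coexistence of the two representations is easier to see with a small as-of query. This is a sketch under assumptions: the `Record` fields, the `level` flag and `knowledge_as_of` are invented for the example and are not the Kizuna-mem query API. The query filters on transaction time, because the question is what the graph had stored at that moment.

```python
from dataclasses import dataclass
from datetime import datetime


@dataclass(frozen=True)
class Record:
    text: str
    valid_at: datetime      # valid time: when the observed fact held
    recorded_at: datetime   # transaction time: when the graph stored it
    level: int = 0          # 0 = raw episode, 1 = consolidation summary


def knowledge_as_of(records, moment):
    """Raw observations already stored at `moment`, oldest first."""
    known = [r for r in records if r.level == 0 and r.recorded_at <= moment]
    return sorted(known, key=lambda r: r.recorded_at)


day1 = datetime(2026, 5, 4)   # hypothetical mission start
records = [
    Record("control catalogue ingested", day1.replace(hour=1), day1.replace(hour=1)),
    Record("policy v3 read", day1.replace(hour=2), day1.replace(hour=2)),
    Record("transaction sample loaded", day1.replace(hour=9), day1.replace(hour=9)),
    # Summary written by the nightly pass; excluded from as-of answers.
    Record("credit risk controls overview", day1.replace(hour=2),
           day1.replace(hour=23), level=1),
]
hour_three = knowledge_as_of(records, day1.replace(hour=3))
```

The query at hour three returns the first two observations only. Running it after the mission returns all three raw observations and still omits the summary, because the summary is an operational view rather than part of what the agents observed.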

What the compliance mode constraint means for long missions

One friction point is worth stating clearly, because it trips up teams making the transition from short-horizon pilots to long-horizon production missions.

The predecessor post established that compliance decay and adaptive decay are mutually exclusive per deployment. Compliance decay is deterministic: score *= 0.85^days regardless of usage. Adaptive decay is reinforcement-aware: R(m, t) = exp(-(t - r_m) / (tau * (1 + eta * ln(1 + n_m)))). The reason they cannot coexist in a single deployment is that the compliance guarantee (all memories older than N days have decayed below threshold Y) depends on the decay being a pure function of age. Any usage-dependent term breaks the guarantee.

For long-horizon regulated missions, this creates a tension. Compliance decay, the safe choice for regulated deployments, discards the reinforcement signal that would otherwise extend the half-life of heavily-used memories. A memory that the mission’s agents retrieve every day for three weeks decays at the same rate as one retrieved once on day one. Under compliance decay, the consolidation pipeline still works, because community summaries are created by structure (graph connectivity and co-occurrence) rather than by retrieval frequency. But the decay side is age-only.

The practical implication is that compliance-mode missions need their consolidation cadence tuned more carefully. Because decay does not adapt to usage, the consolidation pass has to work harder to keep the active graph small. Nightly consolidation with aggressive compression ratios is the right setting for long compliance-mode missions. The alternative, less frequent consolidation combined with a slow fixed decay rate, will let the graph grow and the token costs with it.

Adaptive decay is available as an option for single-tenant missions that do not touch personal data and where the DPO does not need the age-only guarantee. For missions that do involve personal data, the EU AI Act Article 12 and DORA logging requirements point strongly toward compliance decay: you want to be able to prove to the regulator, with arithmetic, that data beyond the retention window has faded below retrieval threshold. The formula 0.85^days supports that proof. The adaptive formula does not.

The same constraint applies when kizuna-tune is continuously improving the factory’s retrieval weights. The tuning adjusts retrieval scoring, not temporal validity. A DPO reviewing the system can still state: “Any episodic node older than 90 days has a compliance decay factor below 0.00004.” The retrieval weight adjustments are layered on top of that guarantee, not underneath it.
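The arithmetic behind that statement takes two lines to check. The sketch below uses only the 0.85 daily factor given above. The function names are illustrative.

```python
import math

DAILY_FACTOR = 0.85


def compliance_factor(age_days):
    """Age-only decay: identical for every memory of the same age."""
    return DAILY_FACTOR ** age_days


def days_until_below(threshold):
    """Smallest whole number of days N with 0.85^N strictly below threshold."""
    return math.floor(math.log(threshold) / math.log(DAILY_FACTOR)) + 1


print(f"{compliance_factor(90):.1e}")   # about 4.4e-07
print(days_until_below(0.00004))        # 63
```

At 90 days the factor is roughly 4.4e-07, about two orders of magnitude under the quoted bound, and the bound itself is crossed at day 63. Because neither function takes usage as an input, the same proof holds for every node regardless of how often it was retrieved.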

What to demand in an RFP

If you are evaluating memory systems for a long-horizon regulated workload, the consolidation question is one of the fastest ways to separate serious implementations from ones that will fail at scale.

Ask whether the memory system has a concept of episodic-to-semantic consolidation at all. Many vector stores and simple graph databases do not. They grow monotonically. Some offer manual pruning APIs. Manual pruning is not consolidation. It is housekeeping that someone has to do, which means it does not get done, which means the graph grows.

Ask how the consolidation cost is accounted for. If the answer is that consolidation is a background process that runs on the memory system’s own infrastructure at no charge to the mission, ask what happens to the provenance when it runs. Consolidation that overwrites the raw episodic layer is destroying the bitemporal record your regulators need. Consolidation that preserves the raw layer while promoting to summaries, and accounts for the promotion cost in the mission budget, is the correct design.

Ask how the decay schedule interacts with the consolidation pipeline. A decay-only approach without consolidation will prune old memories but will not promote the important structural patterns to efficient summary nodes. The result is a graph that stays small but loses the semantic coherence that makes retrieval accurate. A consolidation-only approach without decay will produce beautiful semantic summaries sitting on top of an ever-growing raw episodic layer. Decay alone sacrifices retrieval quality; consolidation alone sacrifices cost.

Ask specifically about compliance mode versus adaptive mode and whether they are separate, incompatible configurations. Any honest answer will acknowledge the trade-off. A system that claims to offer both deterministic compliance guarantees and reinforcement-aware adaptive decay simultaneously is either conflating two different things or has a design the DPO should scrutinise closely.

Ask what happens to the memory graph at the end of a mission. A dark factory mission produces two outputs: the delivered artefact (the audit report, the evidence pack, the software) and the signed, bitemporal memory record that the regulator can query. The memory record is not scratch space to be discarded. It is part of the output. An RFP that does not ask about post-mission memory retention is leaving a compliance gap.

A 90-day pilot design

The consolidation mechanics are most visible in a mission long enough that the economics actually shift. A 90-day pilot gives you enough time to see them.

Pick a high-pain regulated workflow with a known baseline. Quarterly control testing, monthly AML evidence packs, or the rolling clinical coding backlog all work. The mission needs to run for at least two weeks to show the consolidation effect clearly, and it needs to be the kind of mission where the team has a rough sense of current human cost per run.

Instrument the pilot around three numbers.

Retrieval token cost per task, measured daily. In an unconsolidated baseline, this should rise across the mission as the graph grows. With consolidation enabled, it should plateau and potentially fall in the second and third week as the consolidation ratio takes effect. A flat or declining retrieval cost curve is evidence that the pipeline is working.

Budget consumed per equivalent unit of work, compared across weeks. If week three is completing the same volume of mission tasks for fewer tokens than week one, that is the consolidation dividend showing up in the ledger. If week three is costing more per task than week one, the graph is growing faster than the consolidation pipeline is compressing it. That is a tuning problem: either the consolidation cadence is too infrequent or the compression ratio is too low.

Post-mission provenance query latency. Pick five specific moments in the mission timeline and, after the mission completes, run bitemporal queries asking for the agent knowledge state at those moments. Query latency that stays consistent across the mission timeline, rather than growing linearly with the distance back in time, is evidence that the bitemporal indexing has held up under the consolidation pipeline. Recall latency in Kizuna-mem is benchmarked at around 3 ms, and that figure should not degrade with mission duration in a correctly configured system.

These three measurements are not exotic. They can be extracted from the Ninmu budget ledger, the Cosmictron event log, and the Kizuna-mem query API respectively. They are the evidence base for deciding whether to scale the pilot to a longer, wider mission, and they are the kind of numbers that a CRO or head of audit will want to see before signing off on the production deployment.
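The week-over-week comparison is simple enough to script once the daily figures have been exported. The sketch below assumes two plain lists, retrieval tokens per day and tasks completed per day. The sample numbers are invented purely to exercise the code and are not measurements.

```python
def weekly_cost_per_task(daily_tokens, daily_tasks):
    """Average retrieval tokens per task for each full 7-day week."""
    weeks = []
    for start in range(0, len(daily_tokens) - 6, 7):
        tokens = sum(daily_tokens[start:start + 7])
        tasks = sum(daily_tasks[start:start + 7])
        weeks.append(tokens / tasks)
    return weeks


def consolidation_dividend(weeks):
    """Week-three cost divided by week-one cost; below 1.0 is a dividend."""
    return weeks[2] / weeks[0]


# Invented sample: 200 tasks a day, token spend easing week by week.
daily_tasks = [200] * 21
daily_tokens = [400_000] * 7 + [380_000] * 7 + [360_000] * 7
weeks = weekly_cost_per_task(daily_tokens, daily_tasks)
dividend = consolidation_dividend(weeks)
```

A dividend above 1.0 is the tuning signal described above: the graph is growing faster than the pipeline is compressing it.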

Closing

A dark factory that can run three-week missions economically is a different product from one that runs three-hour demos well. The difference is not the swarm. It is the memory.

Consolidation keeps the retrieval graph navigable. Decay prunes what is no longer useful. The sleep cycle compresses episodic noise into semantic signal while the mission is not actively running. kizuna-tune adjusts retrieval weights toward what has actually been useful, within the compliance constraints the deployment requires. Together, these processes are what make it possible to declare a multi-week mission with a hard budget ceiling and have the factory stay within it, rather than discovering on day eleven that the context has become its own worst cost driver.

The bitemporal provenance record, the thing the regulator needs to audit the mission, is preserved throughout. Consolidation and retention are not in conflict. Consolidation is an operational layer that works above the provenance layer, compressing the representation that active agents work with while leaving the full temporal record intact for audit.

For the full story on how Kizuna-mem tracks what the agents knew and when, bitemporal memory as the compliance backbone covers the two-axis timeline model and how hard-delete fits into it. For the self-improvement loop and how kizuna-tune adjusts to the customer’s specific domain, the factory’s memory that improves itself goes deeper on the online learning mechanics. For the budget governance story that makes all of this connect to the mission ceiling, cost governance before the invoice covers the Ninmu ledger in detail.

If you want to see the financial model for a long-horizon regulated mission and understand where the consolidation dividend shows up in the cost decomposition, you can request the investor brief.