The missing 80%: why glue-stack pilots die and the dark factory survives
A head of audit at a European bank sent me a short message earlier this year. Her team had spent six months piloting an agentic evidence-gathering system. It worked. The first three weeks were, by her account, the most exciting technology demonstration her department had seen in a decade. By month four the pilot was on life support, and by month six it was dead. Not because the model was wrong. Not because the orchestration was broken. Because when the external auditors arrived, she could not answer their first question: “Show us every decision the system made, who authorised it, and what data it had access to at that moment.”
There was no answer. The evidence was in three places: a LangSmith trace, a Postgres table that the team had hand-built for logging, and the inference provider’s dashboard. None of them agreed. The Postgres table had gaps. The LangSmith trace did not include the retrieval context. The inference dashboard showed cost but not decisions. Eighteen months of preparatory work, gone, because the stack had been assembled from tools that were never designed to answer that question together.
That story is a familiar one in the industry now. Everyone building agentic systems in regulated verticals has a version of it. The details change. The shape is always the same.
The 20% problem
The pitch for an agent pilot is usually the same 20%: a frontier model, an orchestration library, a vector store, some retrieval logic, maybe a thin wrapper. You can get something genuinely impressive running in two or three weeks. It demos beautifully. Stakeholders are excited. Budget is approved.
Then the hard questions arrive.
Can you stop the swarm if it starts overspending? Can you prove that no agent ever acted on the credentials of a human employee? Can you reconstruct the exact state the agent was in when it made a specific decision, six months after the fact? Can you ensure the system runs entirely within your own data centre, with no data transmitted to a third party? Can you delete a specific person’s data from the agent’s memory, with a cryptographic proof that it is gone? Does the system halt and surface to a human every time the decision crosses a risk threshold, without that gate becoming a bottleneck?
Most glue stacks answer none of those questions, and cannot. Not because the engineers were incompetent. Because the tools were not designed for them. LangChain and LangGraph give you loops and state machines. Pinecone gives you vector recall. Postgres gives you persistence. GitHub gives you version control. None of them has any concept of a governed agent swarm with a hard budget, a cryptographic identity per agent, a deterministic replay log, or sovereign deployment constraints. You can bolt monitoring around the outside of this stack, and many teams do. But monitoring is not governance. It tells you what happened after it happened. A governed system prevents the problem before it occurs, and produces the evidence as a byproduct of the work itself.
This is not a small gap. The hard 80% that regulated buyers actually require is the difference between a demo and a production system. It comprises seven things, and all seven have to be right for the system to survive contact with an external auditor, a regulator, a CRO who actually reads the risk reports, or a board that has already been burned once.
The seven things glue stacks cannot retrofit
A budget that actually stops the swarm. Not a dashboard that alerts a human after the money is spent. An actual ceiling inside the orchestration layer that the swarm cannot cross. This means a global budget ledger that reserves against the plan before work starts, reconciles estimates against actuals as tasks complete, and routes each task to the cheapest model that can actually do the work. Ninmu, the swarm conductor at the core of Substrate, does exactly this. Gartner predicted in June 2025 that over 40% of agentic AI projects will be cancelled by the end of 2027, citing escalating costs, unclear business value, and inadequate risk controls (source: Gartner press release, 25 June 2025). That prediction is not a criticism of agent technology. It is a description of what happens when you deploy agents on infrastructure that has no concept of a spending limit. The bank in my opening story did not have a spending limit. They had a Slack alert.
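To make the reserve-then-reconcile idea concrete, here is a minimal sketch of a hard budget ledger. It is not Ninmu's implementation; the class name, method names, and figures are all hypothetical, chosen only to show a reservation being refused before any work starts.

```python
from dataclasses import dataclass, field


class BudgetExceeded(Exception):
    """Raised when a reservation would push committed spend past the ceiling."""


@dataclass
class BudgetLedger:  # hypothetical name, for illustration only
    ceiling: float                                 # hard cap for the whole mission
    spent: float = 0.0                             # reconciled actual spend
    reserved: dict = field(default_factory=dict)   # task_id -> estimated cost

    def committed(self) -> float:
        # Actual spend plus every outstanding reservation.
        return self.spent + sum(self.reserved.values())

    def reserve(self, task_id: str, estimate: float) -> None:
        # Refuse before any work starts if the ceiling would be crossed.
        if self.committed() + estimate > self.ceiling:
            raise BudgetExceeded(f"{task_id}: would exceed ceiling")
        self.reserved[task_id] = estimate

    def reconcile(self, task_id: str, actual: float) -> None:
        # Swap the estimate for the actual cost once the task completes.
        self.reserved.pop(task_id)
        self.spent += actual


ledger = BudgetLedger(ceiling=10.0)
ledger.reserve("extract", 4.0)
ledger.reserve("summarise", 5.0)
ledger.reconcile("extract", 3.0)      # came in under estimate, frees headroom
try:
    ledger.reserve("review", 3.0)     # 3.0 spent + 5.0 reserved + 3.0 > 10.0
    started = True
except BudgetExceeded:
    started = False                   # the task never starts
```

The sketch trusts estimates: an actual that overshoots its reservation could still breach the ceiling, so a real ledger would presumably reserve a worst-case figure or cap the call itself.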
Cryptographic agent identity with no shared credentials. Glue stacks let agents act on the application’s existing credentials. That means a shared OAuth token or an API key that belongs to a human account. When something goes wrong, and in a swarm something will go wrong, you cannot answer the question “which agent did this?” without going through logs that were never designed to carry that information. Ultra, the separate authority plane in Substrate, gives every agent a provable Ed25519 identity. Every action is signed by that identity and lands in a tamper-evident log. Fake identities are rejected at the boundary. You can prove, to an auditor, that a specific agent with a specific identity took a specific action at a specific time, and that the log itself has not been altered since.
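A small sketch shows what “signed by that identity and lands in a tamper-evident log” can look like. To stay within the Python standard library it uses HMAC-SHA256 as a stand-in for Ed25519 signatures, which is a deliberate substitution: HMAC is symmetric, so the verifier holds the same secret as the agent, whereas Ed25519 lets anyone verify with a public key. The key table, function names, and actions are hypothetical.

```python
import hashlib
import hmac
import json

# Hypothetical per-agent secrets; a real system would hold Ed25519 private keys.
AGENT_KEYS = {"agent-7": b"key-for-agent-7"}
GENESIS = "0" * 64


def _digest(entry: dict) -> str:
    return hashlib.sha256(json.dumps(entry, sort_keys=True).encode()).hexdigest()


def _sign(key: bytes, entry: dict) -> str:
    # Stand-in signature: HMAC over the canonical entry with the agent's own key.
    return hmac.new(key, _digest(entry).encode(), hashlib.sha256).hexdigest()


def append(log: list, agent_id: str, action: str) -> None:
    entry = {"agent": agent_id, "action": action,
             "prev": log[-1]["hash"] if log else GENESIS}
    entry["sig"] = _sign(AGENT_KEYS[agent_id], entry)   # unknown agents raise KeyError
    entry["hash"] = _digest(entry)                      # covers agent, action, prev, sig
    log.append(entry)


def verify(log: list) -> bool:
    prev = GENESIS
    for entry in log:
        key = AGENT_KEYS.get(entry["agent"])
        if key is None or entry["prev"] != prev:
            return False                                # fake identity or broken chain
        unsigned = {k: entry[k] for k in ("agent", "action", "prev")}
        sealed = {k: entry[k] for k in ("agent", "action", "prev", "sig")}
        if not hmac.compare_digest(entry["sig"], _sign(key, unsigned)):
            return False
        if entry["hash"] != _digest(sealed):
            return False
        prev = entry["hash"]
    return True


log = []
append(log, "agent-7", "read:ledger")
append(log, "agent-7", "write:report")
intact = verify(log)
log[0]["action"] = "delete:ledger"    # someone edits history after the fact
still_valid = verify(log)
```

Because each entry's hash feeds the next entry's `prev` field, altering any past action invalidates that entry and breaks the chain for everything after it.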
Deterministic replay that is the audit trail. This is the one that kills most glue-stack pilots in regulated environments. Deterministic replay means that the entire swarm execution, every state change, every model call, every retrieval, every tool use, can be reconstructed exactly from the stored event log. Not approximately. Exactly. The audit trail is not a separate system that observes the production system and tries to record what happened. It is the production system’s storage engine. Because Cosmictron, the agent runtime and live data plane in Substrate, makes every state change deterministic and signed, replay is a first-class property of the system, not an afterthought. I go deeper on this in “deterministic replay as the audit trail”.
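The principle can be sketched in a few lines: state is only ever derived by folding an ordered event log through a pure function, and model outputs are stored in the log rather than regenerated. This is an illustrative sketch, not Cosmictron's design; the event types and field names are assumptions.

```python
import hashlib
import json
from typing import Optional


def apply(state: dict, event: dict) -> dict:
    """Pure transition: the same state and event always give the same new state."""
    new = dict(state)
    if event["type"] == "retrieved":
        new["context"] = new.get("context", []) + [event["doc"]]
    elif event["type"] == "model_output":
        # The recorded output lives in the event, so replay never re-calls the model.
        new["draft"] = event["text"]
    elif event["type"] == "decided":
        new["decision"] = event["value"]
    return new


def replay(events: list, upto: Optional[int] = None) -> dict:
    """Rebuild state from the log, optionally stopping at sequence number `upto`."""
    state: dict = {}
    for event in sorted(events, key=lambda e: e["seq"]):
        if upto is not None and event["seq"] > upto:
            break
        state = apply(state, event)
    return state


def fingerprint(state: dict) -> str:
    return hashlib.sha256(json.dumps(state, sort_keys=True).encode()).hexdigest()


EVENTS = [
    {"seq": 1, "type": "retrieved", "doc": "policy-v3"},
    {"seq": 2, "type": "model_output", "text": "control passes"},
    {"seq": 3, "type": "decided", "value": "approve"},
]

live = replay(EVENTS)
audit = replay(list(reversed(EVENTS)))   # storage order differs, sequence numbers rule
before_decision = replay(EVENTS, upto=2)
```

Comparing `fingerprint(live)` with `fingerprint(audit)` is the check an auditor would care about: two independent replays of the same log must produce byte-identical state.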
Bitemporal memory with governance by omission. Agents that operate on regulated data need memory that knows two things simultaneously: what was true in the world, and what the agent knew and when. A flat vector store cannot answer “what did the agent know at the moment it made this recommendation, three months ago?” A bitemporal graph can. Kizuna-mem, the agentic memory system in Substrate, carries both valid time and transaction time, which means you can reconstruct the exact memory state the agent had at any point in the past. Governance by omission means that out-of-scope data is simply not present in the agent’s memory: not redacted, not filtered, but absent. And hard-delete means that when a data subject exercises their right to erasure under GDPR Article 17, the deletion is a cryptographic operation logged in the audit trail, not a soft-delete and a hope that no backup surfaces the data later. The recall performance is around 3 ms on the bitemporal graph, which means governed memory does not impose meaningful latency costs.
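A toy example makes the two time axes easier to see. The sketch below is a flat list rather than a graph and says nothing about how Kizuna-mem is built; the `Fact` fields, the integer day numbers, and the customer record are all made up for illustration.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Fact:
    subject: str
    value: str
    valid_from: int            # valid time: when it became true in the world
    valid_to: Optional[int]    # None means still true
    recorded_at: int           # transaction time: when memory learned it


def known(facts, subject, valid_at, known_at):
    """What memory believed at `known_at` about the world at `valid_at`."""
    candidates = [
        f for f in facts
        if f.subject == subject
        and f.recorded_at <= known_at
        and f.valid_from <= valid_at
        and (f.valid_to is None or valid_at < f.valid_to)
    ]
    # The most recently recorded matching fact wins.
    return max(candidates, key=lambda f: f.recorded_at).value if candidates else None


FACTS = [
    Fact("cust-42", "low-risk", valid_from=0, valid_to=None, recorded_at=1),
    # Recorded on day 50: the rating had in fact been high since day 30.
    Fact("cust-42", "high-risk", valid_from=30, valid_to=None, recorded_at=50),
]

at_decision_time = known(FACTS, "cust-42", valid_at=40, known_at=45)
with_hindsight = known(FACTS, "cust-42", valid_at=40, known_at=60)
```

The two answers differ for the same real-world date: on day 45 the agent could only have known “low-risk”, even though the later correction shows the customer was already high-risk. That distinction is the one an investigator needs when judging a past recommendation.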
Sovereign deployment with owned models. For a growing number of regulated buyers, “sovereign” is not a selling point. It is a procurement requirement. The UK government’s data residency rules, the EU AI Act’s provisions on high-risk systems, DORA’s requirements for regulated financial entities, and national security considerations for defence and critical infrastructure: all of them point to the same constraint. The data cannot leave. The models cannot be hosted by a third party. The system must be deployable in an air-gapped environment with no dependency on any external API. Substrate deploys on the customer’s own hardware, runs open-weight models fine-tuned for the customer’s domain, and offers optional frontier model failover that leaves sovereignty intact when it is not used. A glue stack that calls the OpenAI API from inside a government data centre is not sovereign, regardless of what the vendor says about it.
Human gates that scale. The right answer to “how do you maintain human oversight of an autonomous swarm?” is not “have humans review every output.” That is just a slow, expensive, and inconsistent version of manual processing. The right answer is that the swarm handles the bulk of the work inside a strict policy and data perimeter, and surfaces only the decisions that actually require a human: the exceptions, the ambiguous cases, the high-risk actions. Those gates have to be fast to clear, carry full context, and record the human’s decision and reasoning in the same audit trail as everything else. A human gate in Ninmu stops the swarm, presents the context, captures the decision, and resumes. It is not a separate workflow. It is part of the same signed mission record.
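As a rough sketch of a gate that lives inside the same record as the agent's work, consider the loop below. The threshold value, the reviewer stub, and the log fields are hypothetical; a real gate would block on a person rather than call a function.

```python
RISK_THRESHOLD = 0.7  # hypothetical policy value


def run_mission(items, reviewer, log):
    """Process items, pausing for a human only when risk crosses the threshold."""
    for item in items:
        if item["risk"] >= RISK_THRESHOLD:
            approved, reason = reviewer(item)    # the gate receives the full item as context
            log.append({"item": item["id"], "actor": "human",
                        "approved": approved, "reason": reason})
            if not approved:
                continue                         # rejected: the agent does not act
        log.append({"item": item["id"], "actor": "agent", "action": "processed"})


def cautious_reviewer(item):
    # Stand-in for a person; approves anything below 0.93 risk.
    ok = item["risk"] < 0.93
    return ok, ("within appetite" if ok else "needs escalation")


mission_log = []
run_mission(
    [{"id": "a", "risk": 0.2}, {"id": "b", "risk": 0.9}, {"id": "c", "risk": 0.95}],
    cautious_reviewer,
    mission_log,
)
actors = [entry["actor"] for entry in mission_log]
```

Human decisions and agent actions land in one list in the order they happened, so there is no second system to reconcile later. Only two of the three items reached a person at all.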
A signed supply chain. Every agent in the swarm is running code that came from somewhere. In a glue stack, that code went through GitHub with a human’s access token, no provable agent identity, and branch rules designed for human developers. In Substrate, every commit, merge, and artifact in Kizuna is signed and policy-gated. Agents are first-class signed identities in the forge itself, with trust levels and declared capabilities. The supply chain for a Substrate mission is auditable from source to production. A glue stack’s supply chain is not.
Figure (interactive chart): capability comparison by axis, with a “regulated-finance lens” that shows how the axes shift in weight when the buyer is a tier-1 bank or asset manager rather than a startup. Substrate’s lead on auditability, replay fidelity, and identity strength is the shape of the hard 80%.
Why retrofitting does not work
Teams that have built a working glue-stack pilot often ask whether they can add the hard 80% to what they already have. The honest answer is that you cannot, and the reason is architectural rather than technical in the narrow sense.
The problem with building a governed system on top of an ungoverned one is that governance requires a single source of truth for every signed action. You cannot have the budget ledger in Ninmu, the audit log in Postgres, the identity records in AWS IAM, and the memory in Pinecone, and then assert that the resulting system is coherent. At the moment of an audit or an investigation, you need to produce a single, consistent, tamper-evident record. Four systems with four clocks and four failure modes cannot produce that.
The same applies to deterministic replay. You can only replay a system deterministically if every state change was recorded in a single ordered log by the same authority. If agents write to an external database, call an API, update a vector store, and write to a message queue, those four writes happened in some order but the order is not recorded anywhere. The replay is approximate at best. For a financial regulator or a court, approximate is the same as useless.
The reason Substrate has 790,000 lines of owned code across six purpose-built systems is precisely that you cannot buy the hard 80% off the shelf. Ninmu, Cosmictron, Kizuna-mem, Ultra, Kizuna, and Voxeltron were designed together against a single governance contract. The emergent properties (deterministic replay as the audit trail, cost control before inference runs, sovereign air-gapped operation, every action signed and auditable by construction) are not features that any one of the six systems has on its own. They are properties of the six systems working as one factory. That is why the homepage calls it a dark factory rather than a platform or a stack. A stack is components you assemble. A factory is a thing that works.
The buyer objections, and what to say to them
Three objections come up in almost every serious procurement conversation about governed agent systems. They are worth addressing directly.
“Our auditors will never accept this.” What auditors require, as the bank story at the opening of this piece illustrates with some force, is a coherent, consistent, tamper-evident record. An audit trail that lives across three inconsistent systems is precisely what auditors will not accept. A system with deterministic replay and cryptographic identity is closer to what they want than anything a glue stack can produce, because it gives them a single record they can verify. The conversation with auditors is not “will you accept an AI system?” It is “will you accept a system where every action is signed, every decision is reconstructable, and the evidence pack is produced as a byproduct of the work itself?” That is a different conversation, and one that has a better answer.
“We cannot have data leaving the country.” Substrate runs on the customer’s hardware. The inference runs on the customer’s models. The data does not leave. This is a constraint that rules out every hosted agent platform and every glue stack that calls a third-party inference API. It does not rule out Substrate. The sovereignty side of this story is in “sovereign AI, air-gapped by default”.
“We already blew the budget on the last agent experiment.” This is the most understandable objection, and the one that deserves the most direct answer. The previous experiment failed because the system had no budget control. It was impressive until it was expensive, and there was nothing in the infrastructure to prevent the transition. A system with a hard budget ceiling inside the orchestrator, cheapest-sufficient routing per task, and a swarm that stops itself before overspend is a fundamentally different economic proposition. Drag the cap in the diagram below to see what happens. That is not a dashboard alert. That is the machinery refusing to cross the line.
Figure (interactive budget-cap diagram): as the cap moves left, the routing changes. At a comfortable cap, each task runs on the best model for it. As the cap tightens, Ninmu routes to cheaper models. Tighten further and it halts tasks it cannot afford, showing exactly which work stopped and why. A glue-stack pilot has no equivalent control.
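The behaviour in that diagram can be approximated with a two-pass router: first secure the cheapest sufficient model for every task, then spend any leftover headroom on upgrades. This is one plausible policy under assumed numbers, not a description of Ninmu's actual routing; the model catalogue, tiers, and costs are invented.

```python
MODELS = [  # hypothetical catalogue: (name, capability tier, cost per task)
    ("small", 1, 1.0),
    ("medium", 2, 3.0),
    ("large", 3, 8.0),
]


def route(tasks: dict, cap: float) -> dict:
    """Map task name -> model name, or None when the task is halted."""
    plan, remaining = {}, cap
    # Pass 1: give every task the cheapest model that meets its required tier.
    for name, tier in tasks.items():
        cheapest = min((m for m in MODELS if m[1] >= tier), key=lambda m: m[2])
        if cheapest[2] <= remaining:
            plan[name] = cheapest
            remaining -= cheapest[2]
        else:
            plan[name] = None            # cannot afford: halt and report
    # Pass 2: spend leftover headroom upgrading tasks, most capable model first.
    for name, current in plan.items():
        if current is None:
            continue
        for model in sorted(MODELS, key=lambda m: -m[1]):
            extra = model[2] - current[2]
            if model[1] > current[1] and extra <= remaining:
                plan[name] = model
                remaining -= extra
                break
    return {name: (m[0] if m else None) for name, m in plan.items()}


TASKS = {"classify": 1, "draft": 2}      # task -> minimum capability tier

comfortable = route(TASKS, cap=20.0)     # {'classify': 'large', 'draft': 'large'}
tight = route(TASKS, cap=5.0)            # {'classify': 'small', 'draft': 'medium'}
starved = route(TASKS, cap=2.0)          # {'classify': 'small', 'draft': None}
```

In every case the plan's total cost stays at or below the cap, and a halted task is reported as `None` rather than silently downgraded to a model that cannot do the work.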
What to demand in an RFP
If you are evaluating agentic infrastructure for a regulated environment, the following questions will separate the systems that have addressed the hard 80% from those that have not. Most vendors will struggle with most of them.
Ask whether the system has a hard, global budget per mission that the orchestrator enforces, as opposed to monitoring that alerts after spend. Ask them to demonstrate the swarm declining to start a task because the budget ceiling would be crossed. Insist on a live demonstration, not a slide.
Ask how agent identity is established and proved. The answer should involve cryptographic keys, not OAuth bot accounts. Ask to see an action log where each entry is signed by the acting agent’s key, and ask how you would detect a tampered log.
Ask what happens when you need to prove, six months from now, exactly what data a specific agent had access to at a specific moment. Ask them to do a deterministic replay of a past run while you watch. If the system cannot do this, the audit trail is not a real audit trail.
Ask about the memory system’s support for bitemporal queries and hard-delete. “Soft-delete and vacuum later” is not GDPR compliance. Ask to see a deletion event in the signed log.
Ask about sovereign deployment. If the answer involves any call to an external API, the system is not sovereign. Ask what happens if the inference provider has an outage or changes its pricing.
Ask about the supply chain. What signed the artifact that is running in production right now? If the answer involves a human’s GitHub token and no agent identity, the supply chain is not governed.
And ask about human gates. Not whether they exist, but what happens when a gate is cleared: is the human’s decision recorded in the same signed log as the agent’s actions? If the human approval lives in a different system from the agent audit trail, you have two sources of truth that will disagree at an inconvenient moment.
The six systems are unpacked in “the six systems as one factory, not a stack”. The cost governance question, and the way a hard budget ledger inverts the economics of agent rework, is in “cost governance before the invoice arrives”.
A 90-day pilot design for one regulated workflow
You do not need to commit to a full deployment to find out whether the hard 80% actually holds. Pick one high-pain, well-defined regulated workflow with a clear before state. Good candidates include a quarterly control test that currently takes three analysts two weeks, an AML evidence pack that takes an associate team ten days, or a backlog of clinical coding appeals that has a measurable per-item cost and a known error rate.
Weeks one through four: declare the mission and the hard budget in Ninmu. Set the ceiling below your current per-run human cost for the equivalent work. Run the factory on a representative sample of the workflow, not the full estate. The point of this phase is to confirm that the system stops itself before it overspends, that the output carries a coherent audit trail, and that the human gates surface the right exceptions.
Weeks five through eight: have an external party examine the output. This does not need to be a formal regulatory submission. It needs to be someone with the same mindset as an auditor: give them the signed evidence pack and ask whether they could reconstruct the system’s decisions from it. Ask them to try to find a gap. This is the test that the bank in my opening story never ran before they went to production, and the omission cost them the pilot.
Weeks nine through twelve: run the full workflow under production conditions and measure three numbers. Cycle time from declared to delivered. Exception rate: what share of items required a human gate, and how long did each gate take to clear? Spend against the ceiling: it should never cross it. If it does, the thesis is wrong and you have learned that cheaply.
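The three numbers are simple enough to compute from per-item records, and a sketch helps pin down the definitions before the pilot starts. The record fields and sample values below are hypothetical; substitute whatever your own run log captures.

```python
from statistics import mean


def pilot_metrics(items: list, ceiling: float) -> dict:
    """Summarise a pilot run: cycle time, exception rate, spend against ceiling."""
    gated = [i for i in items if i["gate_minutes"] is not None]
    spend = sum(i["cost"] for i in items)
    return {
        # Hours from the first item declared to the last item delivered.
        "cycle_hours": max(i["delivered_h"] for i in items)
                       - min(i["declared_h"] for i in items),
        # Share of items that needed a human gate.
        "exception_rate": len(gated) / len(items),
        "mean_gate_minutes": mean(i["gate_minutes"] for i in gated) if gated else 0.0,
        "spend": spend,
        "within_ceiling": spend <= ceiling,
    }


RUN = [  # made-up sample records
    {"declared_h": 0, "delivered_h": 2, "gate_minutes": None, "cost": 1.5},
    {"declared_h": 0, "delivered_h": 5, "gate_minutes": 12, "cost": 2.0},
    {"declared_h": 1, "delivered_h": 3, "gate_minutes": None, "cost": 1.0},
    {"declared_h": 1, "delivered_h": 6, "gate_minutes": 18, "cost": 2.5},
]

summary = pilot_metrics(RUN, ceiling=8.0)
```

If `within_ceiling` ever comes back false, that is the cheap lesson the pilot is designed to surface.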
What you are testing is not whether an agent can draft an evidence pack. Plenty of systems can do that. You are testing whether the factory stops itself when it should, whether the output is signed and auditable without manual assembly, and whether the same machinery can be pointed at your next regulated workflow without a rebuild. The second and third properties are what turn a pilot into a production decision.
The 90 days also give you something useful for the RFP if you are in a formal procurement cycle: a concrete, measured proof point from your own environment, with your own data, on a workflow your auditors already understand. That evidence is worth considerably more than a vendor’s reference customers in a different jurisdiction.
The shape of the decision
The missing 80% is not a list of features. It is a question about what kind of system you are actually buying.
A glue stack is infrastructure assembled from tools that were designed for adjacent problems. It can produce impressive results in controlled conditions. The moment it encounters a serious auditor, a regulatory inquiry, a sovereignty requirement, or a security incident, the seams show.
A governed factory is a system designed from the first principle that regulated enterprises need the entire hard 80% before the first production run, and that you cannot retrofit that 80% onto a system that started without it. The 790,000 lines of owned code in Substrate’s six systems exist because every layer had to be built with governance as the primary constraint, not added later.
The conversation in the industry right now is mostly about the 20%: which model, which orchestration library, which vector store. Those are important questions for demos. They are the wrong questions for production in a regulated environment.
The right question is the one the head of audit asked on the day her pilot died: show me every decision the system made, who authorised it, and what data it had access to at that moment. The answer to that question either lives in the architecture or it does not. It cannot be added later.
If you want the full technical and financial picture, you can request the investor brief. And if you are designing the 90-day pilot for your own environment and want to go through the workflow design in detail, “declare the mission and the budget” covers the Ninmu side of the conversation, and “deterministic replay as the audit trail” covers what the auditor actually gets at the end.