Human approval gates that don't become bottlenecks: designing the dark-factory perimeter
A major European bank ran an agentic trade-finance pilot in early 2025 and hit a wall nobody had predicted. The agents were excellent. They ingested documents, resolved counterparty entities, ran sanctions checks and assembled draft evidence packs at a pace no associate team could match. The problem was approval. Every draft went into a shared queue. The queue fed three senior analysts. Within two weeks the queue held four hundred items and was growing by the hour. The analysts were not slow. The volume simply exceeded what three people could review at the pace the factory generated it.
This is the bottleneck antipattern. It arrives not because the agents fail but because they succeed. You have built a production line that can run at machine speed and then you have connected it to a human gate designed for a human-paced process. The result is a very expensive inbox.
The other failure mode is the opposite. Strip out the gates entirely, hand the agents unconstrained authority to move money, update records or file regulatory submissions, and wait for the first significant error. Regulators, internal audit and your own risk function will shut the programme down. The EU AI Act does not leave much room for ambiguity on this point: high-risk AI systems touching employment decisions, credit assessment, insurance underwriting, critical infrastructure control, law enforcement and similar domains require meaningful human oversight (Article 14). “We watched the logs” is not sufficient.
The design space sits between those two failure modes. This article is about how to find and maintain that position.
The problem, stated precisely
The goal is not “add humans somewhere.” It is to route decisions to humans only when the cost of a wrong automated decision outweighs the throughput penalty of human review. Everything else should run without interruption inside a well-defined policy and data perimeter.
That framing has three parts, each of which is doing real work.
“Cost of a wrong automated decision” is a risk calculation. Not every error is equal. An agent that miscategorises a low-value, easily reversed internal transaction is cheap to correct. An agent that approves a cross-border payment to a sanctioned entity is not. The gate taxonomy must be calibrated to consequence, not to anxiety.
“Throughput penalty of human review” is real and measurable. If a named human takes twenty minutes to review an exception, and the factory generates fifty exceptions an hour, you need a review capacity of a thousand analyst-minutes per hour. That number is either absurd (and tells you the gate triggers too often) or achievable (and tells you the gate is calibrated about right). Ignoring the arithmetic and hoping for the best is how you end up with that European bank’s queue.
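The arithmetic is worth writing down before any gate goes live. A minimal sketch, assuming a reviewer is productive for a full sixty minutes per hour (an optimistic assumption; the function name and parameter are illustrative, not part of any product API):

```python
import math

def reviewers_needed(exceptions_per_hour: float,
                     minutes_per_review: float,
                     productive_minutes_per_hour: float = 60.0) -> int:
    """Concurrent reviewers needed so the queue does not grow."""
    demand_minutes = exceptions_per_hour * minutes_per_review
    return math.ceil(demand_minutes / productive_minutes_per_hour)

# The example from the text: 50 exceptions an hour at 20 minutes each.
print(50 * 20)                   # 1000 analyst-minutes per hour
print(reviewers_needed(50, 20))  # 17

# The same volume if a review takes one minute.
print(reviewers_needed(50, 1))   # 1
```

Seventeen full-time reviewers for one gate is the kind of number that signals the gate is triggering too often or the review is taking too long.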
“Well-defined policy and data perimeter” is the governance guarantee that makes the agents trustworthy inside the gate. The perimeter is not just a list of things agents cannot do. It is a positive specification: what data they can read and write, which models they can invoke, what actions they can take, and under what conditions each action requires a gate. Without the perimeter, the gate is the only safety mechanism. With the perimeter, the gate is the final escalation for a system that is already constrained.
A gate taxonomy for regulated work
The practical starting point is a tiered risk classification. Not every decision needs the same treatment, and conflating them is the root cause of most bottleneck failures.
Tier 0 (no gate required). Routine evidence gathering, document parsing, entity resolution, format conversion, low-value internal reconciliation where reversal is cheap and fast. Agents run these under a strict data perimeter with no human involvement. The audit trail records everything, and retrospective review is always possible. Volume can be unlimited.
Tier 1 (soft assertion, no interactive gate). Actions that are individually low-risk but that accumulate into a material position if repeatedly wrong. Here the system logs a confidence score alongside the action and flags aggregations that cross a threshold for batch review. A human reviews the pattern, not the individual item. This is the right gate for things like repeated application of a classification rule.
Tier 2 (named-human gate, standard SLA). Decisions above a risk threshold that require a human to attest before the action proceeds. The gate is specific: a named person (or a named role with a specific person on shift), not a queue. The context presented to that person is complete: the recommendation, the evidence, the confidence score, the policy citation, and the downstream consequence of approving versus rejecting. The SLA is defined in advance, because Ninmu holds the dependent branch of the mission plan open at this gate and cannot estimate completion time without it. Normal target: under four hours.
Tier 3 (named-human gate, urgent SLA). Same as Tier 2 but for decisions where delay itself creates risk. A sanctions hit on a live payment. A clinical decision with patient-safety implications. A government benefits decision approaching a statutory deadline. Target: under thirty minutes, with an escalation path if the named approver is unavailable.
Tier 4 (dual-control gate). The highest-consequence decisions, where a single approver is insufficient. Two named humans must each confirm independently. Used sparingly because the throughput cost is high. Not a general mechanism.
The right distribution in a well-designed factory is something like this: the great majority of actions are Tier 0. A meaningful fraction of volume generates a Tier 1 pattern flag that a human reviews in aggregate perhaps once per shift. Tier 2 gates handle the genuinely ambiguous individual decisions, typically a small single-digit percentage of volume. Tier 3 is rare, Tier 4 rarer. If your Tier 2 volume is running at twenty or thirty percent, the perimeter is too narrow or the tier thresholds are miscalibrated. The gate should be the exception, not the norm.
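The taxonomy and the distribution check can be written down as data. This is a sketch only: the enum names, the SLA table and the twenty percent cutoff are illustrative assumptions drawn from the figures above, not a Ninmu schema.

```python
from collections import Counter
from enum import IntEnum

class Tier(IntEnum):
    NO_GATE = 0             # autonomous, audit trail only
    SOFT_ASSERTION = 1      # pattern flagged for batch review
    NAMED_HUMAN = 2         # named approver, standard SLA
    NAMED_HUMAN_URGENT = 3  # named approver, urgent SLA
    DUAL_CONTROL = 4        # two named approvers

# SLA targets stated in the taxonomy, in minutes (other tiers state none).
SLA_MINUTES = {Tier.NAMED_HUMAN: 240, Tier.NAMED_HUMAN_URGENT: 30}

def tier_shares(tiers):
    """Fraction of actions that landed in each tier."""
    counts = Counter(tiers)
    total = sum(counts.values())
    if total == 0:
        raise ValueError("no actions to measure")
    return {tier: counts.get(tier, 0) / total for tier in Tier}

def tier2_overloaded(tiers, limit=0.20):
    """True when the Tier 2 share reaches the assumed warning level."""
    return tier_shares(tiers)[Tier.NAMED_HUMAN] >= limit

sample = ([Tier.NO_GATE] * 90 + [Tier.SOFT_ASSERTION] * 6
          + [Tier.NAMED_HUMAN] * 3 + [Tier.NAMED_HUMAN_URGENT])
print(tier2_overloaded(sample))  # False
```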
[Interactive figure: a regulated approval-gate workflow. Most steps complete autonomously and emit an artifact to the evidence pack. The one human gate pauses the run and collects an attestation before proceeding. An optional “show glue-stack gaps” view marks the artifacts a conventional stack cannot produce.]
How Ninmu inserts gates without stalling the swarm
The mechanism matters. A gate that halts the entire mission while waiting for human input is a different animal from a gate that pauses only the branch that needs approval while the rest of the swarm continues.
Ninmu models the human gate as a task node in the mission DAG. Like every other task, it has a dependency graph: it blocks only the tasks that depend on its output. Independent branches continue executing. If a mission has a gate at step four of a twelve-step workflow, the swarm does not sit idle from step four onward. It continues any work that does not depend on the gate’s output, accumulates results, and resumes the blocked branch the moment the gate clears.
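The scheduling rule is simple enough to sketch. The task names and the `runnable_tasks` helper below are hypothetical, and the code illustrates only the dependency logic described above, not Ninmu's implementation.

```python
def runnable_tasks(deps: dict[str, set[str]],
                   done: set[str],
                   waiting_gates: set[str]) -> set[str]:
    """Tasks whose dependencies are all complete, excluding gates that
    are still waiting on a human decision."""
    return {
        task for task, needs in deps.items()
        if task not in done and task not in waiting_gates and needs <= done
    }

deps = {
    "ingest": set(),
    "sanctions_check": {"ingest"},
    "gate_approve_payment": {"sanctions_check"},  # the human gate node
    "release_payment": {"gate_approve_payment"},  # blocked by the gate
    "parse_invoices": {"ingest"},                 # independent branch
    "reconcile": {"parse_invoices"},
}

done = {"ingest", "sanctions_check"}
waiting = {"gate_approve_payment"}
print(sorted(runnable_tasks(deps, done, waiting)))  # ['parse_invoices']

# Once the approver clears the gate, the held branch is released.
done |= {"gate_approve_payment", "parse_invoices"}
print(sorted(runnable_tasks(deps, done, set())))  # ['reconcile', 'release_payment']
```

Only `release_payment` ever waits on the human. The invoice branch proceeds regardless of how long the approver takes.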
The gate node has a declared SLA. Ninmu uses this when computing the mission’s expected completion time. A gate with a four-hour SLA that sits on the critical path extends it by at most four hours, provided the SLA is met. If the gate clears early, the critical path shortens accordingly. The mission owner always has a live, honest estimate of completion time that accounts for human review latency, because the latency is in the plan, not outside it.
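Treating the SLA as the gate's task duration makes the estimate a plain longest-path calculation. A sketch under that assumption, with made-up task names and durations in minutes:

```python
from functools import lru_cache

def expected_completion(durations, deps):
    """Length of the longest dependency chain, in minutes."""
    @lru_cache(maxsize=None)
    def finish(task):
        # A task starts when its slowest dependency has finished.
        start = max((finish(d) for d in deps[task]), default=0)
        return start + durations[task]
    return max(finish(task) for task in durations)

durations = {"ingest": 10, "draft": 20, "gate": 240, "file": 5, "reconcile": 90}
deps = {
    "ingest": (),
    "draft": ("ingest",),
    "gate": ("draft",),      # human gate, duration = declared SLA
    "file": ("gate",),
    "reconcile": ("ingest",),
}

print(expected_completion(durations, deps))         # 275

# If the approver clears the gate in 30 minutes, the estimate drops,
# but only until the independent branch becomes the critical path.
early = {**durations, "gate": 30}
print(expected_completion(early, deps))             # 100
```

The second result shows why the estimate has to be recomputed rather than adjusted by hand: the gate cleared 210 minutes early, but the mission finishes only 175 minutes sooner because `reconcile` now determines completion.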
What arrives at the named approver is not a raw document dump. Ninmu composes a gate packet: the task recommendation (what action the agent proposes), the evidence it assembled (signed artifacts with cryptographic links to source documents via Ultra), the policy rule being applied, the confidence score, the consequence of each path (approve, reject, or request more information), and an override field for the case where the approver disagrees with the recommendation but wants to record why.
The approver sees one screen. One decision. The recommendation is the default. They confirm, reject, or modify, with an optional reason captured in each case. For a rejection on a Tier 3 decision the reason is mandatory: under the EU AI Act’s Article 14 requirement for meaningful human oversight, the reason is itself treated as a logged, signed artifact.
That interaction is designed to take minutes, not an afternoon. The deeper bottleneck in the European bank’s pilot was not reviewer headcount alone. It was context assembly. Analysts were spending most of their time reconstructing the background for each item before they could decide. Gate packets move that assembly into the factory, where it is cheap and fast.
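The packet and the decision rule can be expressed as two small record types. The field names here are assumptions chosen to mirror the description above; they are not Ninmu's actual schema.

```python
from dataclasses import dataclass
from typing import Optional

@dataclass(frozen=True)
class GatePacket:
    recommendation: str             # the action the agent proposes
    evidence_refs: tuple[str, ...]  # references to signed artifacts
    policy_rule: str
    confidence: float
    consequences: dict[str, str]    # outcome of each available path
    tier: int

@dataclass(frozen=True)
class GateDecision:
    approver: str
    outcome: str                    # "confirm", "reject" or "modify"
    reason: Optional[str] = None

def validate_decision(packet: GatePacket, decision: GateDecision) -> None:
    """Raise ValueError if the decision breaks the gate rules."""
    if decision.outcome not in {"confirm", "reject", "modify"}:
        raise ValueError(f"unknown outcome: {decision.outcome!r}")
    has_reason = bool((decision.reason or "").strip())
    # The reason is optional except when rejecting at Tier 3.
    if packet.tier == 3 and decision.outcome == "reject" and not has_reason:
        raise ValueError("a reason is required when rejecting at Tier 3")

packet = GatePacket(
    recommendation="hold payment",
    evidence_refs=("artifact-001", "artifact-002"),
    policy_rule="sanctions-screening",
    confidence=0.91,
    consequences={"approve": "payment held", "reject": "payment released"},
    tier=3,
)
validate_decision(packet, GateDecision("approver-a", "confirm"))
validate_decision(packet, GateDecision("approver-a", "reject", "false positive on name match"))
```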
[Interactive figure: a mission DAG with a human gate node (shown in orange). The gate pauses only its dependent branch while independent work continues. Changing the budget cap re-routes the swarm, and approving the gate releases the held branch so the mission can complete.]
The override capture loop
The gate is not just a control. It is a data source.
Every approval, every rejection, and every modification to an agent recommendation is a labelled training signal. The agent proposed X. The human approved, rejected, or changed it to Y, and said why. Aggregated across thousands of gate decisions, this becomes the most valuable corpus you have for improving the agents that feed the gate.
There are two ways to use this. The cheaper one is threshold calibration: if a particular decision type is passing the Tier 2 gate with a ninety-eight percent approval rate and zero modifications, it almost certainly belongs at Tier 1 or Tier 0. The gate is adding friction for no governance benefit. Quarterly review of the approval rate by decision type, combined with a downstream error rate check, lets you safely lower tiers as confidence grows.
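A quarterly calibration pass might look like the sketch below. The ninety-eight percent bar and the zero-modification condition come from the text; the minimum sample size and the requirement of zero downstream errors are conservative assumptions added for illustration.

```python
def demotion_candidates(stats, min_approval=0.98, min_sample=200):
    """Decision types that may be safe to move to a lower tier.

    `stats` maps a decision type to counts of reviewed items, items
    approved unchanged, items modified, and downstream errors.
    """
    candidates = []
    for decision_type, s in stats.items():
        if s["reviewed"] < min_sample:
            continue  # too little data to justify a change
        approval_rate = s["approved_unchanged"] / s["reviewed"]
        if (approval_rate >= min_approval
                and s["modified"] == 0
                and s["downstream_errors"] == 0):
            candidates.append(decision_type)
    return sorted(candidates)

stats = {
    "invoice_match": {"reviewed": 500, "approved_unchanged": 492,
                      "modified": 0, "downstream_errors": 0},
    "kyc_refresh": {"reviewed": 400, "approved_unchanged": 396,
                    "modified": 3, "downstream_errors": 0},
    "concentration_risk": {"reviewed": 40, "approved_unchanged": 40,
                           "modified": 0, "downstream_errors": 0},
}
print(demotion_candidates(stats))  # ['invoice_match']
```

The output is a list of candidates for a human to review, not an automatic tier change.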
The more powerful use is model fine-tuning. Each override is a (context, recommendation, correction, reason) tuple. Run enough of them through a fine-tuning pass and the model that generates the recommendation learns the approver’s implicit standards. The gate becomes thinner because the model gets better at anticipating what the approver would want.
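The tuple maps naturally onto a record format. A sketch that serialises overrides as JSON Lines, one record per line; the field names are illustrative, and how a given fine-tuning pipeline consumes such records is outside what this example covers.

```python
import json
from dataclasses import asdict, dataclass

@dataclass(frozen=True)
class Override:
    context: str         # what the agent knew when it recommended
    recommendation: str  # what the agent proposed
    correction: str      # what the approver decided instead
    reason: str          # why, in the approver's words

def to_jsonl(overrides):
    """Serialise overrides as JSON Lines, skipping records with no reason."""
    usable = [o for o in overrides if o.reason.strip()]
    return "\n".join(json.dumps(asdict(o), sort_keys=True) for o in usable)

overrides = [
    Override("supplier concentration above internal limit",
             "classify as acceptable", "classify as elevated",
             "single-region hosting not reflected in the score"),
    Override("routine renewal", "approve", "reject", ""),
]
print(to_jsonl(overrides))
```

Dropping records without a reason is a design choice, not a requirement: a correction with no stated rationale still says what the approver wanted, but not why.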
This is not hypothetical for Substrate. Kizuna-mem’s bitemporal graph stores every gate decision with its full provenance: which version of which model made the recommendation, what data it had access to, which policy version was in effect, who the approver was, and what they decided. The “as of” query capability means you can ask what the approval rate was under the previous model version compared with the current one, exactly what each approver approved or rejected in a given period, and whether a class of modifications is systematic (suggesting a policy ambiguity) or idiosyncratic (suggesting an individual preference). The data for the feedback loop is a side-effect of governance, not an extra cost.
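For readers unfamiliar with bitemporal storage, the idea is that each record carries two timestamps: when the event happened and when the store learned of it. The in-memory sketch below illustrates an “as of” lookup in general terms; it is not Kizuna-mem's API, and the record fields are assumptions.

```python
from dataclasses import dataclass
from datetime import datetime

@dataclass(frozen=True)
class GateRecord:
    item_id: str
    outcome: str
    model_version: str
    decided_at: datetime   # valid time: when the approver decided
    recorded_at: datetime  # transaction time: when the store recorded it

def as_of(records, item_id, known_at):
    """The record for an item as the store knew it at `known_at`."""
    visible = [r for r in records
               if r.item_id == item_id and r.recorded_at <= known_at]
    return max(visible, key=lambda r: r.recorded_at, default=None)

records = [
    GateRecord("case-17", "confirm", "model-v1",
               datetime(2025, 1, 10, 9), datetime(2025, 1, 10, 9)),
    # A later correction to the same decision; the original is kept.
    GateRecord("case-17", "modify", "model-v1",
               datetime(2025, 1, 10, 9), datetime(2025, 2, 2, 14)),
]

print(as_of(records, "case-17", datetime(2025, 1, 15)).outcome)  # confirm
print(as_of(records, "case-17", datetime(2025, 2, 3)).outcome)   # modify
```

Because corrections are appended rather than overwritten, both answers remain available: what the record says now, and what it said on any earlier date.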
For a deeper look at how the audit trail captures every signed decision and makes it queryable for exactly this kind of retrospective analysis, the post on deterministic replay as the audit trail covers the Cosmictron mechanics in detail.
The UX problem that most discussions skip
The bottleneck problem has a technical dimension and an interface dimension. The technical side is largely solved by the tier taxonomy and DAG-aware architecture above. The interface dimension is where well-designed systems still fail.
Consider what an approver’s day looks like when gate UX is bad. They open a review interface, see a list of pending items, and spend five to ten minutes per item loading workings from a separate system in a format they have to decode. By late morning they have cleared eight items and a hundred and twenty more have arrived.
The answer is not a shorter list. It is a faster decision. Ninmu’s gate packet is designed so that a trained approver can review the recommendation, scan the evidence summary, and decide in under a minute for a standard Tier 2 case. The full evidence pack is there if needed; it is not required for the decision. The approver’s cognitive load is: does this recommendation look right? Not: let me reconstruct the analysis from raw documents.
This requires a named approver, not a queue. A queue distributes work by availability, which sounds efficient but destroys context. A named approver builds familiarity with a domain agent’s outputs over time and can spot anomalies quickly. You cannot develop that tacit knowledge if every item comes from a different domain handled by a different agent. The system routes to a backup approver when the primary is unavailable, and this escalation path is explicit in the Ninmu mission plan, not handled by a generic on-call rotation that bypasses the assignment logic.
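The routing rule is short enough to state in code. A sketch, assuming the mission plan carries an ordered list of named backups for each gate; the names and the `route` helper are hypothetical.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class GateAssignment:
    primary: str
    backups: tuple[str, ...]  # ordered escalation path from the mission plan

def route(assignment: GateAssignment, available: set[str]) -> str:
    """Return the first named approver who is available, in plan order."""
    for person in (assignment.primary, *assignment.backups):
        if person in available:
            return person
    # No silent fallback to a generic queue: surface the gap instead.
    raise LookupError("no named approver available for this gate")

assignment = GateAssignment(primary="risk-manager-a",
                            backups=("risk-manager-b", "head-of-risk"))

print(route(assignment, {"risk-manager-a", "risk-manager-b"}))  # risk-manager-a
print(route(assignment, {"head-of-risk", "risk-manager-b"}))    # risk-manager-b
```

Raising an error when nobody on the list is available is deliberate in this sketch. It keeps the failure visible rather than letting the item drift into a shared inbox.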
Sector walkthrough: a regulatory reporting gate in practice
Take a quarterly DORA (Digital Operational Resilience Act) reporting workflow for a mid-sized EU financial institution. The submission covers operational incidents, third-party dependencies, ICT risk assessments and resilience test results. Before the factory: three risk managers spend six weeks assembling the report from twelve internal systems, reconciling inconsistencies by hand, and producing narrative sections for compliance and legal review. The report is late roughly one quarter in three.
With Substrate, the mission is declared to Ninmu once the reporting period ends. The swarm runs:
- Document ingestion and classification (Tier 0, with Tier 1 batch review). Cosmictron cells pull incident logs, ICT inventories and contract data from source systems. Agents classify each incident under DORA’s taxonomy. Low-confidence classifications queue for batch review rather than individual gating.
- Third-party ICT concentration analysis (Tier 2 gate). Whether specific dependencies create unacceptable concentration risk is a judgment call with material regulatory consequences. The agent assembles the evidence, makes a recommendation, and a named risk manager reviews and decides. In practice: twelve to fifteen minutes per item, three to five items per quarter.
- Narrative drafting and legal sign-off (Tier 0, then Tier 4 dual-control gate). Standard narrative is automated. The complete draft goes to two named approvers (compliance officer and general counsel) for final sign-off before lodgement. DORA requires evidence of material incidents to be retained for at least five years; the signed evidence pack is stored accordingly.
The before state: six weeks of risk manager time, frequent late submissions, an audit trail assembled retrospectively if at all. The after state: bulk assembly runs in hours, a handful of Tier 2 decisions absorb a morning of analyst time per quarter, the Tier 4 gate takes an afternoon of senior review. Risk managers stop doing assembly work. They do judgment work, which is what they were hired for.
The submission is also audit-ready in a way the manual process never was. Every data point has a cryptographic link to the source artifact. Every classification decision carries the model version, policy citation, confidence score and input data in Kizuna-mem’s bitemporal graph. A regulatory inspection that asks “show me how this figure was derived” gets a deterministic replay, not a reconstruction from memory.
For the end-to-end mechanics of that evidence pack, including how Kizuna-mem’s bitemporal graph links every artifact back to its source, see the trade-finance evidence pack lineage walkthrough.
The metrics that actually matter
Organisations that instrument their gate architecture well converge on three numbers.
Gate pass rate by tier and decision type. The percentage of gate items where the approver confirms the agent’s recommendation without modification. A Tier 2 pass rate below seventy percent suggests that model quality is insufficient or that the escalation threshold is miscalibrated. A Tier 2 pass rate above ninety-five percent suggests the items are tiered too conservatively (they should probably be Tier 1 or Tier 0). The modification rate, distinct from the rejection rate, is particularly informative: systematic modifications in the same direction almost always indicate a policy ambiguity or a model that has learned a slightly wrong prior.
Time-to-decision. How long from the gate being opened to the approver clearing it. This should be tracked against the declared SLA, not just in aggregate. If a Tier 3 gate with a thirty-minute SLA is regularly clearing in forty-five minutes, you have a staffing or attention problem that will show up on your critical-path completions before long. Track it per approver as well as in aggregate: a consistent outlier may be a signal that the gate packet quality is insufficient for their specific domain.
Downstream error rate on approved items. The rate at which items that cleared a gate subsequently required correction or remediation. This is the ground truth on whether the gate is doing its job. A gate with a high pass rate but a high downstream error rate is a rubber stamp, not a safety mechanism. The bitemporal audit trail makes this number computable: you can query Kizuna-mem for items that were approved at gate n and subsequently triggered a correction action at any later step.
These three numbers, tracked over rolling quarters, let you evolve the tier thresholds deliberately rather than reactively. The goal is not a static policy. It is a policy that gets more precise as the agents improve and as the override capture loop feeds calibration data back into the models.
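All three numbers fall out of a flat list of gate records. A sketch with an assumed record shape; here the downstream error rate is computed over items that cleared the gate (confirmed or modified), which is one reasonable reading of “approved items”.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class GateItem:
    outcome: str                # "confirm", "modify" or "reject"
    minutes_to_decision: float
    sla_minutes: float
    corrected_later: bool       # needed remediation after clearing the gate

def gate_metrics(items):
    """Pass rate, modification rate, SLA breach rate, downstream error rate."""
    n = len(items)
    if n == 0:
        raise ValueError("no gate items to measure")
    cleared = [i for i in items if i.outcome in ("confirm", "modify")]
    errors = sum(i.corrected_later for i in cleared)
    return {
        "pass_rate": sum(i.outcome == "confirm" for i in items) / n,
        "modification_rate": sum(i.outcome == "modify" for i in items) / n,
        "sla_breach_rate": sum(i.minutes_to_decision > i.sla_minutes
                               for i in items) / n,
        "downstream_error_rate": errors / len(cleared) if cleared else 0.0,
    }

items = (
    [GateItem("confirm", 12, 240, False)] * 6
    + [GateItem("confirm", 300, 240, False),   # breached the SLA
       GateItem("confirm", 20, 240, True),     # corrected downstream
       GateItem("modify", 35, 240, False),
       GateItem("reject", 15, 240, False)]
)
for name, value in gate_metrics(items).items():
    print(f"{name}: {value:.2f}")
```

Grouping the input by tier, decision type or approver before calling the function gives the per-segment views described above.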
How Substrate’s six systems combine to make this work
The gate architecture requires every component to do its part.
Ninmu holds the gate in the mission DAG, computes the critical path with the gate SLA included, routes the gate packet to the named approver, and resumes the blocked branch when the gate clears. Without a mission-aware orchestrator, the gate is an ad-hoc interrupt rather than a first-class step in a governed plan.
Ultra signs every artifact entering the gate packet and signs the approver’s decision as a separate event with a fresh Ed25519 signature. The gate decision is as tamper-evident as any other action in the audit log. An approver cannot later deny an approval they gave, and the system cannot claim an approval that was never given. That non-repudiation is what compliance teams need before they will accept an automated workflow for a submission with personal legal liability attached.
Kizuna-mem stores every gate decision in the bitemporal graph with its full provenance context. The “as of” query lets you reconstruct what any approver knew at the moment of their decision, which supports the EU AI Act’s Article 14 requirement that oversight be meaningful.
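Tamper-evidence can be illustrated without any signing library. The sketch below uses SHA-256 hash chaining from the Python standard library as a stand-in; it is not Ed25519 signing and not Ultra's mechanism. A hash chain shows that an altered entry is detectable, but on its own it does not bind a decision to an approver's key, so it does not provide non-repudiation.

```python
import hashlib
import json

GENESIS = "0" * 64

def _digest(prev_hash, event):
    body = json.dumps(event, sort_keys=True)
    return hashlib.sha256((prev_hash + body).encode("utf-8")).hexdigest()

def append_entry(log, event):
    """Append an event, chaining it to the hash of the previous entry."""
    prev_hash = log[-1]["hash"] if log else GENESIS
    log.append({"event": event, "prev": prev_hash,
                "hash": _digest(prev_hash, event)})

def verify(log):
    """True if no entry has been altered, removed or reordered."""
    prev_hash = GENESIS
    for entry in log:
        if entry["prev"] != prev_hash:
            return False
        if entry["hash"] != _digest(prev_hash, entry["event"]):
            return False
        prev_hash = entry["hash"]
    return True

log = []
append_entry(log, {"gate": "g-1", "approver": "approver-a", "outcome": "confirm"})
append_entry(log, {"gate": "g-2", "approver": "approver-b", "outcome": "reject"})
print(verify(log))  # True

log[0]["event"]["outcome"] = "reject"  # tamper with a recorded decision
print(verify(log))  # False
```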
Cosmictron’s deterministic replay means the sequence of agent actions that assembled the gate packet can be replayed exactly. When an approved item later turns out to be wrong, you can replay the run to the point of the gate and determine whether the error was in the agent’s reasoning, in the data it was given, or in the approver’s judgment. Those are three different remediation paths.
Substrate’s six systems produce the audit trail as a side effect of running. There is no separate log forwarder, no centralised aggregator, no compliance database that needs keeping in sync with the operational database. The EU AI Act Article 12 requirement for automatic event recording is satisfied by construction. The post on EU AI Act Article 12 and what high-risk systems must log covers this in more technical depth.
What to demand in an RFP
The gate question cuts through vendor positioning quickly. Here is what to put in the document.
Ask the vendor to demonstrate a human gate as a first-class node in a mission DAG that pauses only its dependent branch while independent work continues. If the demo shows the whole workflow halting at the gate, the architecture is sequential, not DAG-aware, and throughput suffers proportionally to gate frequency.
Ask how the gate packet is composed: whether context is assembled by the orchestration layer or whether the approver must navigate to the agent’s workings in a separate system. If the latter, review time will be dominated by context assembly rather than judgment.
Ask where the approver’s decision is stored. It should land in a tamper-evident audit log with a cryptographic signature within milliseconds of the approver clicking confirm. If the decision is instead synced from an approval tool to a separate compliance database at some later point, that sync is a consistency boundary that will cause problems at audit time.
Ask what happens when the named approver is unavailable. A well-designed system has an explicit escalation path in the mission plan: a named backup or role-based fallback with defined conditions.
Ask to see gate pass rate, modification rate, time-to-decision and downstream error rate for a pilot workflow. If the vendor cannot produce these numbers, the gate exists on paper but is not being managed.
A 90-day pilot design
You do not need to bet a large programme on this pattern. Pick one high-pain regulated workflow where:
- The volume is high enough that manual review of every item is already unsustainable or expensive.
- The risk of individual errors is material but not catastrophic (save Tier 4 workflows for later).
- The workflow has a clear output artifact that someone can evaluate for quality.
A quarterly internal controls test, an AML case triage workflow, or a contract review for a standard deal type are all good candidates. Run the pilot with conservative tier thresholds (set them high, so more items go to humans than strictly necessary). Measure the three gate metrics from day one. At the end of thirty days, review the pass rates and time-to-decision to see where thresholds can safely move. By day ninety you should have a calibrated tier policy for this workflow and the data to justify extending the approach to adjacent workflows.
The core hypothesis that the pilot will confirm or refute is simple: a governed factory can deliver more throughput than a manual process while maintaining the same or better decision quality at the gates. If it cannot, the perimeter is too narrow, the model quality is insufficient, or the gate tier thresholds are miscalibrated. All three are diagnosable and fixable without rebuilding the system. The pilot exists to find and fix them at small scale before the stakes rise.
The economics behind this are explored in more depth in how to declare the mission and the budget, which covers the Ninmu cost ledger and how the hard budget constraint interacts with the gate SLAs to give you a predictable completion estimate from the start of any mission.
The point of the perimeter
A dark factory without a well-designed gate architecture is not an autonomous factory. It is an uncontrolled one. Those are different things, and regulators, internal audit and board-level risk functions are quite good at distinguishing between them.
The perimeter is the set of commitments the factory makes to the enterprise: here is what agents do autonomously, here is what they escalate, here is what is captured at every decision point. When the perimeter is designed well, the factory runs at near-machine speed for the bulk of its work, and humans spend their judgment on the decisions that actually require it.
The signed audit trail that records every gate decision, every override and every downstream outcome is not a compliance overhead. It is the evidence that the perimeter is working. It is the thing that lets you show a regulator, an auditor or a board that the factory is not an inbox with extra steps, but a governed production system that knows its own limits.
If you want to understand how this fits into the broader factory design, including the cryptographic identity and provenance chain that makes every gate decision non-repudiable, you can request the investor brief. Or read EU AI Act Article 12 and what high-risk systems must log to see how the gate record maps directly to the regulatory logging requirements that come into force in August 2026.