Declare the mission and the budget: the new interface for regulated work

Author: Jonathan Conway
Date: 30 April 2026
Tags: ninmu / dark-factory / budget-governance / agent-orchestration / cost-control / regulated / substrate

A bank I spoke to last year ran an agent pilot that worked beautifully right up until the invoice arrived. The demo had been a triumph. An agent read a backlog of control tests, drafted the evidence, flagged the gaps, and produced something an auditor could almost sign. Everyone was delighted. Then it went wider, ran overnight against the full production estate, and by the time someone looked at the dashboard the inference bill had passed the cost of the analysts it was meant to relieve. Nobody had done anything wrong. There was simply no number anywhere in the system that said “stop”.

That is the quiet problem with almost every agent project shipping today. The model is impressive. The orchestration is clever. And there is no global spending limit that the machinery itself respects, because the tools were never built to have one.

Gartner has been blunt about where this goes. Its June 2025 forecast was that over 40% of agentic AI projects will be cancelled by the end of 2027, and escalating cost with inadequate risk controls is near the top of the list of reasons (source: Gartner press release, 25 June 2025). You can read that as a prediction. I read it as a description of what happens when you bolt agents onto infrastructure that has no concept of a budget.

That is why the first thing the Substrate homepage asks you to do is deliberately not “write a prompt”. It is this: declare the mission and the budget.

What “declare the mission and the budget” actually means

The pitch sounds almost too simple. You do not configure agents. You do not wire up a graph of nodes. You do not hand a bot your GitHub token and hope. You state two things. Here is the goal. Here is the most I am willing to spend to reach it.
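
As a rough illustration, the declaration could be as small as the Python sketch below. The `Mission` class and its field names are hypothetical, invented here for illustration and not taken from Substrate or Ninmu. The only point it makes is that the goal and the ceiling are both mandatory, and that the ceiling is a hard number rather than a setting.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Mission:
    """The two things the buyer declares. All names here are hypothetical."""

    goal: str
    budget_cents: int  # hard ceiling, kept in integer cents to avoid float drift

    def __post_init__(self) -> None:
        # Refuse to construct a mission that is missing either half of the contract.
        if not self.goal.strip():
            raise ValueError("a mission needs a goal")
        if self.budget_cents <= 0:
            raise ValueError("a mission needs a positive hard budget")


mission = Mission(goal="Ship the payments reconciliation API", budget_cents=50_000)
```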

From there a swarm of governed agents takes over. They plan the work, break it into tasks, route each task to the cheapest model that is actually good enough for it, write the code, generate the tests, run the review, and ship the result. Every action is signed and lands in an audit trail. Humans are pulled in only at the gates that policy says require a human. The output is two things at once: the software you asked for, and the signed record of exactly how it was produced.

The component that holds all of this together is called Ninmu. The name is the Japanese word for mission, which is the right frame. Ninmu is the swarm conductor. It is the part of the factory that turns “here is what I want and here is the ceiling” into a running, metered, stoppable plan.

The piece people miss on the first read is the budget. They treat it as a billing footnote. It is not. In a governed factory the budget is the primary control surface. It is the thing that decides how good a model each task gets, when work happens in parallel, and whether the swarm is allowed to take the next step at all.

Interactive: this is a software mission running under Ninmu. Drag the budget cap and watch the routing change. Hover any task to see the models it could have used. The blue node is a human gate, and the swarm waits there until you approve it.

Spend a minute with that diagram before reading on, because it makes the argument better than I can. Drag the cap down from its comfortable starting point. The first thing that happens is not failure. It is adaptation. Tasks that were assigned a frontier model quietly drop to a cheaper one that can still do the job. This is cheapest-sufficient routing, and it is the everyday behaviour of the system. Keep dragging. At some point downgrading is not enough, and Ninmu does the thing that almost no other stack will do. It refuses to start work it cannot pay for, and it tells you which tasks it halted and why.

That refusal is the whole product in miniature.

Why a budget ledger is harder than it looks

It is tempting to think you could add this to an existing orchestrator in an afternoon. Track tokens, multiply by a price, stop when you hit a number. People have tried. It does not hold, and the reasons are instructive.

The first problem is that cost in an agent system is not a single meter ticking upward. It is a tree. A planning step spawns five tasks. One of those tasks decides it needs to reflect, so it loops. A reflection loop calls a model that calls a tool that triggers another agent. The naive approach counts tokens at the leaves and finds out it overspent after the money is gone. A real budget ledger has to reserve against the plan before the plan runs, hold the global remaining figure across every concurrent branch, and reconcile estimates against actuals as tasks complete. It is closer to a treasury function than a usage counter.
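
The reserve-then-reconcile idea can be sketched in a few lines of Python. This is a minimal, single-threaded illustration, assuming costs in integer cents and an executor that caps each task at its reservation. `Ledger`, `reserve`, `settle` and `BudgetExceeded` are hypothetical names, not Ninmu's API, and a real ledger would also need locking or transactions to hold the remaining figure across concurrent branches.

```python
from dataclasses import dataclass, field


class BudgetExceeded(Exception):
    """Raised when a reservation would take the mission over its ceiling."""


@dataclass
class Ledger:
    ceiling: int                      # hard mission budget, in cents
    spent: int = 0                    # reconciled actuals
    reserved: dict[str, int] = field(default_factory=dict)  # task id -> estimate

    @property
    def remaining(self) -> int:
        # Money already spent AND money promised to running tasks both count.
        return self.ceiling - self.spent - sum(self.reserved.values())

    def reserve(self, task_id: str, estimate: int) -> None:
        """Claim budget before the task runs; refuse if it does not fit."""
        if task_id in self.reserved:
            raise ValueError(f"{task_id} already holds a reservation")
        if estimate > self.remaining:
            raise BudgetExceeded(
                f"{task_id} needs {estimate} cents, only {self.remaining} left"
            )
        self.reserved[task_id] = estimate

    def settle(self, task_id: str, actual: int) -> None:
        """Reconcile the estimate against the actual cost once the task ends."""
        estimate = self.reserved.pop(task_id)
        if actual > estimate:
            # Assumption: the executor caps each task at its reservation,
            # so an actual above the estimate is a bug, not a normal outcome.
            self.reserved[task_id] = estimate
            raise ValueError(f"{task_id} overran its reservation")
        self.spent += actual  # unused headroom returns to the pool
```

Because the check happens at `reserve`, a branch that cannot be paid for never starts, which is the property a leaf-level token counter cannot give you.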

The second problem is routing. The word “sufficient” is doing an enormous amount of work in “cheapest sufficient”. Sufficient for what? A task that drafts boilerplate can run on a small model. A task that reasons about a security boundary cannot. Ninmu has to carry, per task, an idea of which model tiers are acceptable, then pick the cheapest of those that still fits the remaining budget. Lower the ceiling and the acceptable set shrinks. That is why the diagram reroutes the way it does. The routing is not cosmetic. It is the budget propagating into model selection, task by task.
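
The selection rule described above can be sketched as a small Python function. The model names, tier numbers and prices are invented for illustration, and `route` is a hypothetical helper, not a real Ninmu call. It implements only the rule as stated: filter to the acceptable tiers, filter again to what the remaining budget can pay for, then take the cheapest.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Model:
    name: str
    tier: int        # higher means more capable (hypothetical scale)
    cost_cents: int  # estimated cost of running one task on this model


@dataclass(frozen=True)
class Task:
    name: str
    min_tier: int    # lowest tier considered sufficient for this task


# Illustrative catalogue; names and prices are made up.
CATALOGUE = [
    Model("small", tier=1, cost_cents=5),
    Model("mid", tier=2, cost_cents=40),
    Model("frontier", tier=3, cost_cents=300),
]


def route(task: Task, remaining_cents: int,
          catalogue: list[Model] = CATALOGUE) -> Optional[Model]:
    """Return the cheapest sufficient model that fits, or None to halt the task."""
    candidates = [
        m for m in catalogue
        if m.tier >= task.min_tier and m.cost_cents <= remaining_cents
    ]
    return min(candidates, key=lambda m: m.cost_cents, default=None)
```

A `None` result is the interesting case: no acceptable model fits the remaining budget, so the task is not started.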

The third problem is the stop. A budget that warns you is a dashboard. A budget that stops the swarm is a governor. The difference matters most precisely when things are going wrong, which is exactly when a human is least likely to be watching the dashboard. The bank in my opening did not lack dashboards. It lacked a hard limit built into the machinery, one the machinery itself could not cross.
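
The difference between the two is easy to show side by side. The sketch below is a toy in Python, assuming independent tasks with known cost estimates in cents; both function names are hypothetical. The first function behaves like monitoring and reports the overspend after the fact. The second checks before each task and records what it declined to start and why.

```python
def run_with_alert(tasks: list[tuple[str, int]], ceiling_cents: int):
    """Dashboard style: everything runs, the alert fires afterwards."""
    spent = sum(estimate for _, estimate in tasks)  # stand-in for running them all
    return spent, spent > ceiling_cents


def run_with_governor(tasks: list[tuple[str, int]], ceiling_cents: int):
    """Governor style: a task that would breach the ceiling never starts."""
    spent = 0
    completed: list[str] = []
    halted: list[tuple[str, str]] = []
    for name, estimate in tasks:
        if spent + estimate > ceiling_cents:
            reason = f"needs {estimate} cents, {ceiling_cents - spent} left"
            halted.append((name, reason))
            continue  # assumption: tasks are independent, so later ones may still fit
        spent += estimate  # stand-in for actually running the task
        completed.append(name)
    return spent, completed, halted
```

With tasks costing 100, 500 and 300 cents against a ceiling of 700, the first version spends 900 and raises an alert, while the second spends 600 and reports the third task as halted.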

LangGraph and the broader glue stack have no native concept of any of this. They give you loops and state, which is genuinely useful, but a global spend limit is not part of their design. You can build telemetry around the outside and alert on it. You cannot make the orchestrator itself decline to take a step because the mission would go over budget, because the orchestrator does not know what a mission budget is. This is not a criticism of the libraries. It is a statement about where the boundary of the design sits.

The mental model: assistant versus factory

There is a sharp line between two ways of using AI for serious work, and the budget is what reveals it.

A coding assistant makes one developer faster. That is real value and I use them every day. The developer stays in the loop on every step, which means the developer is also the cost control. They feel the latency. They see the output. They stop when it is enough. The human is the budget.

A dark factory is a different animal. You are not in the loop on every step, by design, because the point is to run faster than a human team and at a scale a human team cannot match. The moment you remove the human from each step, you have also removed the thing that was quietly stopping the spend. So the factory has to supply that limit itself, in the orchestration layer, as a first-class property. The hard budget is not a feature added to the factory. It is the precondition that makes a factory safe to run at all.

This is why “declare the mission and the budget” is the front door and not a settings page. It is the contract. You give the factory autonomy. The factory gives you a ceiling it will not cross. Without the ceiling, the autonomy is a liability nobody in a regulated business will sign off on, and they are right not to.

From software to the actual work

The first thing the factory builds is software. That is V1, and the diagram above is a software mission. But the interesting claim on the homepage is the next sentence: the same factory then runs finance, healthcare and government.

The method does not change. You declare a goal and a hard budget. A governed swarm plans it, routes it, executes it, surfaces the decisions that need a human, and produces a signed, auditable result. What changes is what sits in the nodes. Instead of “write API” and “generate tests”, the nodes become “ingest the documents”, “build the entity graph”, “run the policy checks”, “surface the exceptions to a named analyst”, “assemble the evidence pack”, “sign and lodge it”.
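
One way to picture "what sits in the nodes" is to treat a plan as plain data that the same runner walks. The Python sketch below is illustrative only: `Step` and `run_plan` are hypothetical, the step names are taken from the text, and which step carries the human gate in the software plan is my assumption.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass(frozen=True)
class Step:
    name: str
    human_gate: bool = False  # True means the swarm waits for approval here


SOFTWARE_PLAN = [
    Step("write API"),
    Step("generate tests"),
    Step("run the review", human_gate=True),  # gate placement is an assumption
]

TRADE_FINANCE_PLAN = [
    Step("ingest the documents"),
    Step("build the entity graph"),
    Step("run the policy checks"),
    Step("surface the exceptions to a named analyst", human_gate=True),
    Step("assemble the evidence pack"),
    Step("sign and lodge it"),
]


def run_plan(plan: list[Step], approve: Callable[[Step], bool]) -> list[str]:
    """Walk any plan with the same control logic; stop at an unapproved gate."""
    done: list[str] = []
    for step in plan:
        if step.human_gate and not approve(step):
            break  # the swarm waits here until a human approves
        done.append(step.name)
    return done
```

Nothing in `run_plan` knows whether it is shipping code or lodging an evidence pack, which is the sense in which only the domain changes.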

Interactive: the same engine pointed at a trade-finance evidence pack. Same two inputs from the buyer: the goal and the ceiling. Same cheapest-sufficient routing, same human gate, same signed output. Drag the cap to put it under pressure.

Look at what is identical between the two diagrams and what is not. The shape of the control is identical. A plan decomposed into tasks, each routed to a model that fits both the work and the budget, a human gate where policy demands one, and a signed artifact at the end. What differs is only the domain. That sameness is the entire “software then everything else” thesis made concrete. You are not buying a code generator that someone later tries to repurpose for audit work. You are buying a way of running governed knowledge work, and software just happens to be the first kind of knowledge work it runs.

For a trade-finance pack the before and after is stark. The old way is an associate or three, several weeks, a six- or seven-figure annual labour line, and an audit trail assembled by hand at the end that still has gaps. The factory way is the same pack in hours, with the labour bill collapsing toward compute and tokens, experts approving gates rather than typing keystrokes, and an evidence pack that is audit-ready by construction because the signed record was produced as the work happened, not reconstructed afterwards. I go through that workflow end to end in the trade-finance evidence pack walkthrough.

What the budget unlocks that monitoring cannot

Once the budget is a real object inside the orchestrator, several things that are hard elsewhere become straightforward.

You can ask what-if questions before you commit. What does this mission cost if I allow two reflection loops per task instead of one? Change the setting, or lower the ceiling, and the rerouted plan answers you. You are seeing the cost of a policy decision before you pay for it, which is the opposite of the after-the-fact reconciliation most teams live with.
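
The reflection-loop question is just arithmetic once estimates exist. A minimal Python sketch, assuming each task has a base cost and a per-loop cost in cents; the task names, the figures and both function names are hypothetical:

```python
def estimate_cents(tasks: dict[str, tuple[int, int]], loops_per_task: int) -> int:
    """tasks maps a task name to (base cost, cost per reflection loop), in cents."""
    return sum(base + loops_per_task * per_loop for base, per_loop in tasks.values())


def what_if(tasks: dict[str, tuple[int, int]], ceiling_cents: int,
            loop_options: tuple[int, ...] = (1, 2)) -> dict[int, tuple[int, bool]]:
    """For each loop setting, return (estimated cost, fits under the ceiling)."""
    report = {}
    for loops in loop_options:
        cost = estimate_cents(tasks, loops)
        report[loops] = (cost, cost <= ceiling_cents)
    return report


plan = {"draft evidence": (40, 10), "security review": (300, 120)}
print(what_if(plan, ceiling_cents=500))  # {1: (470, True), 2: (600, False)}
```

Under these made-up numbers, one loop per task fits a 500-cent ceiling and two do not, and you learn that before anything runs.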

You can give different missions different ceilings and have the factory respect them simultaneously, because the ledger is per mission and global within it. A high-stakes quarterly control test gets a generous budget and frontier models on the tasks that matter. A routine reconciliation gets a tight one and runs almost entirely on small models. The same factory serves both without anyone hand-tuning model choices, because the budget does the tuning.

And you can stop. Not raise a dashboard alert and hope someone reads it. Stop the swarm, in the orchestration layer, before the overspend, with a record of what was halted. The economics of this are not subtle. There is a widely cited figure floating around the 2026 discourse that a large fraction of agent coding spend goes on bug fixing and rework rather than shipped output. I dig into those numbers, and how a governed factory inverts them, in the real economics of agent swarms. The short version is that the dominant cost in agent work is not the inference. It is the rework and the cleanup and the compliance overhead, and a factory that is governed and auditable from the first step attacks exactly those lines.

There is a deeper point underneath all of this, which is that the budget and the audit trail are the same idea seen from two angles. Both require the orchestrator to know, at every step, what is happening and what it costs and whether it is allowed. Once you have built that, metering before the spend and replaying after the fact fall out of the same machinery. I unpack the replay side in deterministic replay as the audit trail, and the way the six systems combine into one contract rather than a stack of products in the six systems as one factory.
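
To make "the same machinery" concrete, here is a minimal Python sketch in which the spend figure is derived from the signed record itself. It assumes an HMAC-signed, hash-chained list of entries; the entry layout, the demo key and the function names are all hypothetical, and a production system would use managed keys and probably asymmetric signatures.

```python
import hashlib
import hmac
import json

KEY = b"demo-key-not-for-production"  # assumption: a real system uses managed keys
FIELDS = ("action", "cost_cents", "prev")


def _sign(body: dict) -> str:
    payload = json.dumps(body, sort_keys=True).encode()
    return hmac.new(KEY, payload, hashlib.sha256).hexdigest()


def append(log: list[dict], action: str, cost_cents: int) -> None:
    """Add one signed entry that is chained to the previous entry's signature."""
    prev = log[-1]["sig"] if log else ""
    body = {"action": action, "cost_cents": cost_cents, "prev": prev}
    log.append({**body, "sig": _sign(body)})


def verify(log: list[dict]) -> bool:
    """Check every signature and every link in the chain."""
    prev = ""
    for entry in log:
        body = {k: entry[k] for k in FIELDS}
        if body["prev"] != prev:
            return False
        if not hmac.compare_digest(_sign(body), entry["sig"]):
            return False
        prev = entry["sig"]
    return True


def spent(log: list[dict]) -> int:
    """The budget view is just a sum over the audit view."""
    return sum(entry["cost_cents"] for entry in log)
```

Because `spent` reads the same entries that `verify` checks, there is no second source of truth for cost to drift away from.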

What to demand in an RFP

If you are evaluating any agentic platform for regulated work, the budget question separates the serious systems from the demos faster than any benchmark. Here is what to put in the document.

Ask whether the system has a hard, global budget per mission that the orchestrator itself enforces, as opposed to monitoring that alerts a human after spend. Ask them to demonstrate the swarm declining to start a task because the mission would exceed the ceiling. Watch their faces. Many will show you a dashboard and a Slack alert. That is not the same thing, and now you know why.

Ask how the system routes tasks to models, and specifically whether lowering the budget changes the routing. If the model assignment is static, the budget is decorative.

Ask to see the budget and the audit trail come from the same place. If cost control lives in one tool and the audit log lives in another, you have two sources of truth that will disagree at the worst possible moment, which is during an investigation.

Ask what happens to the budget ledger when the work runs across a private cloud or an air-gapped environment with your own models. The answer should be “nothing changes”, because the meter is part of the factory, not part of a vendor’s billing API. If the cost control depends on a hosted endpoint, it does not survive the deployments that regulated buyers actually need. The sovereignty side of that story is in sovereign AI, air-gapped by default.

A 90-day way to prove it

You do not need to bet the estate to find out if this is real. Pick one high-pain regulated workflow with a clear before. A quarterly control test. An AML evidence pack. A backlog of casework that has a known per-item cost in human hours. Declare it as a mission with a hard budget set deliberately below your current human cost, and run it.

Three numbers tell you almost everything. The cycle time, measured as how long the mission takes from declared to delivered. The exception rate, measured as the share of items that needed a human gate and how long each gate took to clear. And the spend against the ceiling, which should never cross it, because if it does the whole thesis is wrong and you have learned that cheaply.
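
If it helps to pin the definitions down, the three numbers can be computed from a handful of fields per pilot run. The record layout below is a hypothetical Python sketch, not an export format from any real tool.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta


@dataclass
class PilotRun:
    declared_at: datetime
    delivered_at: datetime
    items_total: int
    gate_clear_times: list[timedelta]  # one entry per item that hit a human gate
    spend_cents: int
    ceiling_cents: int


def summarise(run: PilotRun) -> dict:
    return {
        # Cycle time: declared to delivered.
        "cycle_time": run.delivered_at - run.declared_at,
        # Exception rate: share of items that needed a human gate.
        "exception_rate": len(run.gate_clear_times) / run.items_total,
        # How long the slowest gate took to clear.
        "slowest_gate": max(run.gate_clear_times, default=timedelta(0)),
        # Spend against the ceiling: this should never be False.
        "within_ceiling": run.spend_cents <= run.ceiling_cents,
    }
```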

What you are really testing in those ninety days is not whether an agent can draft an audit report. Plenty of demos can do that. You are testing whether the factory stops itself when it should, whether the output is signed and auditable without anyone assembling it by hand, and whether the same machinery would run your next regulated workflow without a rebuild. Those are the properties that turn a pilot into production, and they all trace back to the two things you declared on day one.

The mission and the budget. State the goal. Set the ceiling. Let a governed swarm do the rest, and watch it stop before it overspends. That is the interface, and once you have used it the old way of running agent projects starts to look like driving a car that has an accelerator and a speedometer but no brake.

If you want the deeper version of this, including the financial model and the head-to-head against glue stacks, you can request the investor brief. And if you are wrestling with the cost-control side specifically, cost governance before the invoice arrives goes a layer deeper than I have here.