Article

Cost governance before the invoice arrives: the non-negotiable for 2027 agent projects

author Jonathan Conway
timestamp 18 May 2026
classification cost-governance / ninmu / dark-factory / budget-ledger / agent-orchestration / token-economics / substrate / regulated

A public sector technology team I know spent six months building an agent that processed benefit applications. The demo was compelling, the accuracy was acceptable, and the audit committee was cautiously interested. Then someone ran a projection of what the system would cost at production scale, factoring in the reflection loops the team had added to improve accuracy, the memory recall calls on every application, and the model they had chosen for the final decision step. The number that came back was four times what the programme office had budgeted. The project was shelved. The team cancelled it not because the technology failed, but because there was no mechanism in the orchestration layer that would have stopped the system spending four times the budget before anyone noticed.

That is the story Gartner was telling when they published their June 2025 guidance warning that more than 40% of agentic AI projects will be cancelled by the end of 2027 (source: Gartner press release, 25 June 2025, “Gartner Predicts Over 40% of Agentic AI Projects Will Be Canceled by End of 2027”). Escalating costs and inadequate risk controls are among the causes they cite, alongside unclear business value. You can read that as a prediction. What it actually describes is the natural outcome of bolting agent orchestration onto infrastructure that treats cost as a reporting concern rather than a runtime constraint.

The cancellation wave is not hypothetical. It is the logical terminus of the current architecture pattern: impressive model, clever orchestration, no hard ceiling, bill arrives at the end of the month, somebody panics.

This piece goes a layer deeper than the earlier post on declaring the mission and the budget. That post explains why the budget is the primary control surface. This one explains what it actually takes to build one that holds.

The problem, stated by someone who has debugged it

Most people who have thought about this for five minutes propose the same solution. Track tokens at each step, multiply by the model’s per-token price, accumulate a running total, alert when it crosses a threshold. You could wire this up in an afternoon with a middleware wrapper and a Slack webhook.

It does not work. Not because the accounting is wrong, but because it misses the structure of how cost actually propagates through an agent system.

An agent run is not a sequence of calls. It is a tree, and frequently a tree that grows while it is running. A planner decomposes a mission into seven tasks. Three of those tasks decide, independently, that the problem they have been handed is ambiguous and trigger a reflection loop. Each reflection loop fires three more model calls. One of those calls retrieves ten documents from memory, each retrieval costing tokens at the embedding and reranking step. A document retrieval triggers a sub-agent that does a policy check, and the policy check calls a different model entirely.

By the time the naive token counter reaches the leaf nodes, the inference is done and the bill is already committed. You did not overspend in one big step. You overspent in two hundred small steps, each of which looked reasonable to the node that took it, because no single node had visibility into the remaining budget for the whole tree.
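The failure mode is easy to reproduce. What follows is a minimal sketch, not anyone's production code: the `Node` type, the unit costs, and the tree shape (a planner, seven tasks, three of them with a three-call reflection loop) are all made up for illustration. The counter does exactly what the afternoon-with-a-webhook version does: it adds up spend after each call and raises an alert when the total crosses the budget.

```python
from dataclasses import dataclass, field


@dataclass
class Node:
    name: str
    cost: int                                  # cost of this node's own model call (made-up units)
    children: list = field(default_factory=list)


def run_with_naive_counter(root, budget):
    """Run every node in the tree, counting spend only after each call returns."""
    spent = 0
    first_alert = None
    stack = [root]
    while stack:
        node = stack.pop()
        spent += node.cost                     # the inference is already paid for here
        if first_alert is None and spent > budget:
            first_alert = node.name            # an alert is raised, but nothing is stopped
        stack.extend(node.children)            # children run regardless of the total
    return spent, first_alert


def build_mission():
    """Planner -> seven tasks; the first three each add a three-call reflection loop."""
    tasks = []
    for i in range(7):
        reflections = [Node(f"task{i}-reflect{j}", 1) for j in range(3)] if i < 3 else []
        tasks.append(Node(f"task{i}", 1, reflections))
    return Node("planner", 1, tasks)


spent, first_alert = run_with_naive_counter(build_mission(), budget=10)
print(spent, first_alert)
```

Every node here costs one unit, which looks reasonable to the node that spends it. The tree as a whole spends 17 against a budget of 10, and the alert only records where the line was crossed.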

The second structural problem is that cost is not a single price per token. It is a function of the model chosen, and the model choice is a routing decision made dynamically throughout the run. A system that counts tokens without knowing which model each task is using, and without the ability to change that choice mid-run, has tracking but not control.

The third problem is time. Reflection loops and memory consolidation cycles run asynchronously. The synchronous token counter is not counting them. By the time the asynchronous sub-processes complete and report back, the synchronous budget has already been allocated to other work.

These are not edge cases. They are the standard operating pattern of any agent system that does something more complex than a single prompt-and-response. And they are why “monitor after the fact” cannot substitute for “govern before the step.”

What a real budget ledger requires

The piece of machinery that can actually stop a swarm before it overspends looks less like a billing layer and more like a treasury function. Here is what it has to do.

Reserve before execution. When Ninmu decomposes a mission into tasks, it does not simply log the expected cost and then run the work. It reserves the projected cost of each task against the mission budget before any task starts. The reserve uses the cheapest acceptable model for that task, so it is a floor on what the task will cost rather than a worst case. The remaining budget visible to the next task accounts for all outstanding reservations. If the remaining budget cannot cover the cheapest possible execution of the next task, that task does not start.

Propagate globally across branches. The ledger is a single data structure for the entire mission, not one counter per task. When a branch spawns a sub-branch, the sub-branch draws from the same pool. The pool knows the total outstanding reserved spend across every concurrent branch. This is the part that breaks when you build it naively on top of an orchestration library that was not designed for it: you end up with per-branch counters that do not communicate, so the global budget overflows even though each branch thinks it is within its local limit.

Reconcile actuals against estimates. As tasks complete, the actual cost is reconciled against the reservation. If a task came in under estimate, the difference is returned to the pool and becomes available for later work. If it came in over (because a model was slower, or the task required more tokens than projected), the overage is charged immediately. The pool never goes negative. If a reconciliation step would take the pool to zero or below, the mission halts, not after the step, but before the next one is dispatched.

Route to cheapest sufficient. The budget cap is not just a ceiling the system tries to stay below. It actively shapes which model gets used for each task. Ninmu carries, per task, a stack of acceptable model tiers ordered from cheapest to most capable. Under a generous budget, the preferred tier runs. As the cap tightens, cheaper tiers are substituted for tasks where the cheaper model is still sufficient for the work. At the bottom of the stack, if even the cheapest option cannot be afforded, the task halts. The routing is not cosmetic; it is the mechanism by which the budget propagates into model selection continuously throughout the run.
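The reserve, share, and reconcile steps above fit in a few dozen lines once the ledger is treated as one object for the whole mission. This is a sketch under stated assumptions, not Ninmu's implementation: `MissionLedger`, `BudgetExceeded`, and the integer cost units are hypothetical names chosen for the example, and clamping the pool at zero when an overage exhausts it is my reading of "the pool never goes negative".

```python
import threading


class BudgetExceeded(Exception):
    """Raised when a reservation cannot be covered by the shared pool."""


class MissionLedger:
    """A single pool for the whole mission; every branch draws from it."""

    def __init__(self, ceiling):
        self._lock = threading.Lock()          # branches may run concurrently
        self.available = ceiling               # neither spent nor reserved
        self.reserved = {}                     # task_id -> outstanding reservation
        self.halted = False

    def reserve(self, task_id, projected_cost):
        """Set funds aside before the task starts, or refuse to start it."""
        with self._lock:
            if self.halted or projected_cost > self.available:
                raise BudgetExceeded(task_id)
            self.available -= projected_cost
            self.reserved[task_id] = projected_cost

    def reconcile(self, task_id, actual_cost):
        """Swap the estimate for the actual: refund underspend, charge overage."""
        with self._lock:
            estimate = self.reserved.pop(task_id)
            self.available += estimate - actual_cost
            if self.available <= 0:
                self.available = 0             # assumption: clamp, then stop dispatching
                self.halted = True
```

The design choice that matters is that sub-branches receive a reference to the same `MissionLedger` rather than a counter of their own. A per-branch counter can only ever tell a branch about itself.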

Interactive: drag the budget cap down and watch the swarm re-route to cheaper models, then start halting work. The meter drains as tasks run; the red halt line is the floor the factory cannot cross. Hover a task to see its model tiers and current routing.

Spend time with that before reading further. Drag the cap to the point where downgrading is no longer sufficient. The halt is not an error state and it is not a graceful degradation. It is the system working correctly. The alternative, which is letting the work continue and discovering the overrun later, is not a more capable system. It is a system with no floor.

The token economics of reflection and memory

The cost of a reflection loop is not obvious until you have run a few. A task that simply executes and returns might cost thirty tokens. The same task with two reflection cycles costs three hundred. The reflection model has to receive the original output, the reflection prompt, produce a critique, pass the critique back to the task, and the task model has to generate a revised output. If the memory system is being queried at each reflection step to provide relevant context, add another hundred tokens per step for retrieval. The cost multiplier for a single reflection cycle, under realistic conditions, sits somewhere between five and fifteen times the base task cost.

This matters because reflection is often where quality improvements live. A governed factory does not ban reflection. It accounts for it. Ninmu allows you to configure, per task type, whether reflection is permitted, and if so, how many cycles. The cost of the reflection policy is visible before it runs: you can see the budget impact of allowing two reflection cycles versus one before committing to the policy. That visibility is itself a cost-control mechanism, because it makes the trade-off between quality and spend concrete rather than implicit.
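To make that trade-off concrete, here is a small projection using the illustrative numbers from this section. It is a sketch, not a Ninmu API: `projected_tokens` is a hypothetical helper, and the per-cycle figure of 135 tokens is an assumption I derived by splitting the gap between thirty and three hundred evenly across the two cycles.

```python
def projected_tokens(base, cycles, per_cycle, retrieval_per_cycle=0):
    """Projected tokens for one task under a given reflection policy."""
    return base + cycles * (per_cycle + retrieval_per_cycle)


BASE = 30            # plain execute-and-return task, from the text
PER_CYCLE = 135      # assumption: (300 - 30) / 2, both cycles costing the same
RETRIEVAL = 100      # memory context per reflection step, from the text

for cycles in (0, 1, 2):
    plain = projected_tokens(BASE, cycles, PER_CYCLE)
    with_memory = projected_tokens(BASE, cycles, PER_CYCLE, RETRIEVAL)
    print(cycles, plain, with_memory, round(with_memory / BASE, 1))
```

Under these assumptions a single cycle costs 5.5 times the base task without memory and a little under nine times with it. Two cycles with memory at every step come to 500 tokens, just past the top of the five-to-fifteen range. That is the kind of number a policy owner should see before approving the second cycle.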

Memory retrieval has similar economics. A query against Kizuna-mem, at roughly three milliseconds per recall, looks cheap in isolation. On a mission where five hundred tasks each make three memory calls, the total retrieval cost, including the model tokens spent processing the retrieved context, is a meaningful line item. A governed system budgets this. An unbudgeted system discovers it later.
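The arithmetic is short enough to write down. In this sketch `retrieval_line_item` is a hypothetical helper, and the figure of one hundred context tokens per recall is an assumption borrowed from the reflection example earlier in this section; real retrieved contexts may well be larger.

```python
def retrieval_line_item(tasks, calls_per_task, context_tokens_per_call, recall_ms):
    """Aggregate memory cost for a mission: recall count, context tokens, recall time."""
    calls = tasks * calls_per_task
    return {
        "recalls": calls,
        "context_tokens": calls * context_tokens_per_call,   # tokens the model must then read
        "recall_time_ms": calls * recall_ms,
    }


item = retrieval_line_item(tasks=500, calls_per_task=3,
                           context_tokens_per_call=100, recall_ms=3)
print(item)
```

The recall time adds up to 4.5 seconds across the whole mission, which is negligible. The 150,000 context tokens are not, because every one of them is billed at the price of whichever model reads it.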

The reflection-and-memory cost multiplier is also where many enterprise agent pilots have come unstuck. The demo ran a small number of tasks against a clean dataset, with reflection set to two cycles and generous model selection, and produced an impressive accuracy number. The production run added fifty times more tasks, added memory access for context, and kept the same reflection and model settings because nobody had a mechanism to change them based on budget. The accuracy number held. The bill did not.

What the economics of agent work actually look like

There is a number floating through the 2026 discourse about how much agent coding spend goes on rework, debugging, and fixing output rather than shipped work. The figure cited most often, and the one the brief for this series refers to, describes a split in which a minority of spend produces output that actually ships while the majority covers correction. I am treating that as illustrative rather than quoting a specific figure I cannot independently verify, but the directional claim is consistent with what engineers running production agent systems report: the inference cost on first-pass work is a fraction of the total cost once you account for the cycles spent fixing what did not pass review.

The economics of agent swarms post goes into this at length. The point I want to make here is about the causal relationship between cost governance and that rework ratio.

The rework is largely downstream of two things. First, tasks running on models that were poorly matched to the work at hand, producing output that is slightly wrong in ways that require expensive debugging. Second, tasks running without sufficient context from memory, producing output that does not account for prior decisions in the mission, requiring reconciliation that generates more tasks. Both of these are routing failures. A governed factory addresses them at the routing step, before the rework is generated. Cheapest-sufficient routing does not just cut the per-task inference cost. It also cuts the per-task failure rate, because the task is being run by a model matched to the work rather than one selected by default or by the path of least resistance.

The ROI of cost governance is therefore not only the direct saving from lower per-token costs and prevented overruns. It is also the indirect saving from the rework that the better routing prevents.

Interactive: toggle between the “today” cost composition and a governed factory. The categories are illustrative, not measured. Hover a segment to see its proportion. The point is the direction of the shift, not the exact figures.

How Substrate builds this as a first-class property

Ninmu is the system that holds the budget ledger. It is not a billing wrapper around another orchestrator; it is the orchestrator, and the budget is a first-class data structure within it.

When you declare a mission to Ninmu, you supply two things. The goal, which gets decomposed into a task graph. The ceiling, which becomes the initial balance of the ledger. Nothing about the system is metaphorical. The ledger is a real data structure. The balance is a real number that cannot go below zero by design. The stop is a real halt, not a dashboard alert.

The routing algorithm Ninmu runs during decomposition is the same one the interactive diagram above implements. For each task in the plan, it queries the task’s tier stack, starts at the preferred tier, steps down through cheaper acceptable tiers until one fits the remaining budget after accounting for all reservations, and assigns it. When the cap is dragged down in the diagram and a task switches from Sonnet to Haiku, that is the algorithm making a real decision. When a task turns red and halts, that is the algorithm determining that even the cheapest acceptable tier would exceed the remaining budget.
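A minimal sketch of that selection, under assumptions of my own: every task first reserves its cheapest acceptable tier, and whatever headroom is left is then spent moving tasks up toward their preferred tier in plan order. The `Tier` type, the `route` function, and the costs are hypothetical; the model labels are borrowed from the diagram.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Tier:
    model: str
    cost: int        # projected cost of the task on this tier (made-up units)


def route(tasks, budget):
    """tasks maps task_id -> tiers ordered cheapest first, preferred last.
    Returns (assignment, remaining); a task assigned None has been halted."""
    remaining = budget
    assignment = {}
    # Pass 1: reserve the cheapest acceptable tier for each task, in plan order.
    for task_id, tiers in tasks.items():
        cheapest = tiers[0]
        if cheapest.cost <= remaining:
            assignment[task_id] = cheapest
            remaining -= cheapest.cost
        else:
            assignment[task_id] = None          # even the cheapest tier does not fit
    # Pass 2: use leftover headroom to move tasks toward their preferred tier.
    for task_id, tiers in tasks.items():
        current = assignment[task_id]
        if current is None:
            continue
        for tier in reversed(tiers):            # preferred first, then cheaper
            extra = tier.cost - current.cost
            if extra <= remaining:              # the current tier always passes (extra == 0)
                assignment[task_id] = tier
                remaining -= extra
                break
    return assignment, remaining


PLAN = {
    "summarise": [Tier("haiku", 2), Tier("sonnet", 10)],
    "decide": [Tier("haiku", 3), Tier("sonnet", 12)],
}
for cap in (30, 15, 4):
    chosen, left = route(PLAN, cap)
    print(cap, {t: (tier.model if tier else "HALT") for t, tier in chosen.items()}, left)
```

With a cap of 30 both tasks get the preferred tier. At 15 the second task is substituted down. At 4 the second task halts, because even its cheapest tier no longer fits what is left.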

The metering runs in Cosmictron. Because Cosmictron combines the agent runtime with the data layer in a single binary, every model call made by a task produces a signed event that is immediately visible to the budget subsystem. There is no asynchronous gap between “the inference happened” and “the ledger was updated”, because the ledger update is part of the same transaction as the inference dispatch. This is how the reservation-and-reconcile cycle works without race conditions: the ledger is not being updated from a separate process polling a billing API. It is being updated in the same event log that the inference writes to.
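The property being described is ordinary transactional atomicity, and it can be illustrated with any store that supports it. The sketch below uses Python's built-in `sqlite3` as a stand-in. It says nothing about how Cosmictron is actually built, and the table layout and function names are invented for the example. The dispatch event and the ledger debit are written in one transaction, and a `CHECK` constraint refuses any debit that would take the balance below zero, which rolls the event back with it.

```python
import sqlite3


def open_store(ceiling):
    """Create an in-memory store with a one-row ledger and an event log."""
    db = sqlite3.connect(":memory:")
    db.execute("CREATE TABLE ledger (id INTEGER PRIMARY KEY CHECK (id = 1), "
               "balance INTEGER NOT NULL CHECK (balance >= 0))")
    db.execute("CREATE TABLE events (seq INTEGER PRIMARY KEY AUTOINCREMENT, "
               "task TEXT, model TEXT, cost INTEGER)")
    db.execute("INSERT INTO ledger VALUES (1, ?)", (ceiling,))
    db.commit()
    return db


def dispatch(db, task, model, cost):
    """Log the dispatch and debit the ledger atomically; False means refused."""
    try:
        with db:     # commits if the block succeeds, rolls back if it raises
            db.execute("INSERT INTO events (task, model, cost) VALUES (?, ?, ?)",
                       (task, model, cost))
            db.execute("UPDATE ledger SET balance = balance - ? WHERE id = 1", (cost,))
        return True
    except sqlite3.IntegrityError:
        return False     # the debit violated balance >= 0; the event was rolled back too
```

Because the two writes succeed or fail together, there is never a moment where the log shows a dispatch that the ledger has not yet seen.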

The deterministic replay that the audit trail post covers in depth is also how the budget trace works. Every reservation, every reconciliation, every routing decision, every halt is a signed event in the same log as the rest of the mission execution. You can replay the budget history of a mission at any point in time, with full fidelity, without reconstructing it from separate billing and orchestration logs that were never designed to be consistent with each other.
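Replay of the budget history is then a fold over the log. A minimal sketch, assuming a hypothetical event shape (`kind`, `task`, `amount`) rather than the real log schema, and leaving signature verification out:

```python
def replay_balance(events, ceiling, upto=None):
    """Rebuild the available balance after the first `upto` events (all if None)."""
    balance = ceiling
    reserved = {}
    for event in events[:upto]:
        if event["kind"] == "reserve":
            reserved[event["task"]] = event["amount"]
            balance -= event["amount"]
        elif event["kind"] == "reconcile":
            # Return the reservation, then charge what the task actually cost.
            balance += reserved.pop(event["task"]) - event["amount"]
        # Routing and halt events carry no money, so they do not move the balance.
    return balance


LOG = [
    {"kind": "reserve", "task": "a", "amount": 40},
    {"kind": "reserve", "task": "b", "amount": 40},
    {"kind": "reconcile", "task": "a", "amount": 25},
    {"kind": "route", "task": "b", "model": "small"},
    {"kind": "reconcile", "task": "b", "amount": 50},
]
print([replay_balance(LOG, 100, upto=n) for n in range(len(LOG) + 1)])
```

Because the function reads nothing but the log, running it twice over the same events gives the same history, which is the property an auditor cares about.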

Ultra signs every agent action before it is committed. This means the budget trace is not just a record of what the system believes it spent. It is a cryptographically attested record of what each specific agent did, with which model, at what cost, under which policy. If an agent in the swarm tries to route a task to a model tier the budget cannot support, the unsigned dispatch is rejected before it reaches the inference endpoint. The budget is enforced at the identity boundary, not just at the accounting layer.
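The shape of that check can be sketched with a keyed hash. This is an illustration only: it uses HMAC-SHA256 from the Python standard library as a stand-in for whatever signature scheme Ultra actually uses, and the function names and dispatch fields are hypothetical. The signer refuses to sign a dispatch the budget cannot support, and the gateway refuses anything that arrives unsigned or altered.

```python
import hashlib
import hmac
import json


def canonical(dispatch):
    """Serialise a dispatch deterministically so signatures are reproducible."""
    return json.dumps(dispatch, sort_keys=True, separators=(",", ":")).encode()


def sign_if_affordable(dispatch, remaining_budget, key):
    """Return a signature, or None when the budget cannot support the dispatch."""
    if dispatch["projected_cost"] > remaining_budget:
        return None
    return hmac.new(key, canonical(dispatch), hashlib.sha256).hexdigest()


def gateway_accepts(dispatch, signature, key):
    """The inference gateway admits only dispatches carrying a valid signature."""
    if signature is None:
        return False
    expected = hmac.new(key, canonical(dispatch), hashlib.sha256).hexdigest()
    return hmac.compare_digest(expected, signature)
```

An agent that swaps in a more expensive model after signing changes the signed payload, so the gateway rejects it for the same reason it rejects an unsigned dispatch.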

The sector view: what this changes in practice

Finance. A tier-one bank running a quarterly control testing programme has a programme budget and a deadline. The relevant question for the agent factory is not “how much inference can we buy in total?” but “can this specific programme be completed within this specific ceiling, and if not, what work can be done and what gets escalated?” Ninmu answers that question before the run starts, not after. The programme office declares the ceiling. The factory tells them what fits, what gets downgraded, and what needs a human to triage. If the ceiling is sufficient, the run executes, halts at the human gate for review, and delivers a signed evidence pack. If it is not, the system proposes an adjusted scope rather than running to exhaustion and presenting a surprise invoice.

Healthcare. A clinical coding factory processing thousands of cases per day has highly variable per-case complexity. Simple cases are cheap. Complex multi-diagnosis cases with appeal history and policy exceptions are expensive. Without cost governance, the expensive cases silently consume the budget intended for the whole day’s run. With Ninmu’s per-mission ledger, each case is its own mission with its own ceiling derived from the case’s complexity tier, and the aggregate is monitored across the whole batch. A case that would overrun its ceiling is automatically flagged for a human coder rather than producing an expensive and possibly wrong automated output. The economics of the case-mix become predictable rather than volatile.

Government. A casework backlog programme has a different economic constraint: the per-case cost must remain below the per-case saving compared to human processing, or the whole programme does not make sense. Ninmu’s routing table can be configured with this constraint as the primary variable. Every task in the case plan is assigned the cheapest model that keeps the total case cost below the break-even point. If a case’s complexity requires spending more than break-even on inference, it routes to a human specialist rather than running an expensive automated process that costs more than the alternative. The break-even point is not a dashboard warning. It is the ledger ceiling for that case class.
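The per-case ceiling in the healthcare and government examples reduces to one comparison made before the case runs. A sketch with invented numbers: `CEILING_BY_CLASS`, the class names, and the costs are all hypothetical, and the ceiling stands for the complexity-tier budget in one setting and the break-even point in the other.

```python
CEILING_BY_CLASS = {"simple": 40, "complex": 150}    # hypothetical per-case ceilings


def triage_case(case_class, task_costs):
    """task_costs holds, per task, the projected cost on each acceptable model.
    Returns ("automate" | "human", cheapest_total)."""
    ceiling = CEILING_BY_CLASS[case_class]
    cheapest_total = sum(min(options) for options in task_costs)
    # If even the cheapest plan does not stay below the ceiling, a person takes the case.
    decision = "automate" if cheapest_total < ceiling else "human"
    return decision, cheapest_total


print(triage_case("simple", [[10, 30], [15, 50]]))
print(triage_case("simple", [[30, 60], [25, 45]]))
```

The first case fits under its ceiling and runs. The second cannot be completed below it even on the cheapest models, so it is routed to a human before any inference is bought.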

What to demand in an RFP

The budget question separates governed systems from monitored ones faster than any benchmark. Here is what to put in the document.

Ask the vendor to demonstrate the swarm declining to start a task because the mission would exceed the ceiling. Not alerting. Not logging. Declining. Watch what happens. If they show you a dashboard and a Slack notification, you now know the system has monitoring but not control. That is a meaningful difference when the system runs overnight on a production workload.

Ask how the routing changes when you lower the budget cap. If the model assignments are static regardless of the cap, the budget is decorative. If the cap changes the model selection for each task, the budget is functional. Only the second system is actually governing cost.

Ask where the budget ledger lives relative to the audit log. If the cost record is in a billing system and the execution record is in an orchestration log and the model assignment is in a configuration file, you have three sources of truth that will disagree during an incident. If they are all the same signed event stream, you have one.

Ask what happens to cost governance when the system runs on your own hardware with your own models. If the cost control is implemented via a vendor billing API, it disappears when the model is on-premises. If it is implemented in the orchestrator itself, as Ninmu implements it, it works identically regardless of where the inference runs. For any regulated buyer who needs sovereign deployment, this is not an edge case. It is the baseline.

Ask for the replay. Can they show you, for any completed mission, the exact routing decisions made at each task, the cost reserved and reconciled at each step, and the reason any task was downgraded or halted? If the answer requires assembling records from multiple systems, that is the audit story they will need to tell a regulator, and it is fragile. If the answer is a single replayable event log, the audit story holds.

A 90-day way to find out if it is real

Pick one high-cost regulated workflow where you have a clear current price. A quarterly control test with a known analyst cost. A clinical coding programme with a known per-case cost. A batch of applications with a known processing cost per item.

Declare it as a mission in Ninmu with a ceiling set at eighty percent of your current human cost. Not ninety. Eighty. The ten percent gap is there to give the routing algorithm room to work. If the system cannot fit a useful amount of the workflow into eighty percent of the human cost baseline, the economics are not there and you have learned that cheaply.

Three things tell you whether cost governance is working. The first is whether the spend ever crosses the ceiling. It should not. If it does once, ask why. If it does regularly, the ledger is not governing, it is tracking.

The second is whether the routing trace shows meaningful model substitution. A run where every task used the same model regardless of budget pressure is a run where the routing was not functioning. You want to see cheap models on the easy tasks and better models on the hard ones, with clear evidence that the allocation reflects the remaining budget rather than a static configuration.

The third is whether the escalation rate makes sense. A ceiling that forces too many halts is set too low, and the programme needs a larger budget. A ceiling that never forces a halt probably means the tasks are being over-resourced, and you could lower the ceiling and save more. The governed factory makes both of these visible in the event log, which means you can tune the economics on the basis of evidence rather than intuition.
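The three checks can be run mechanically over whatever run records the pilot produces. In this sketch the record shape and the `max_halt_rate` threshold are assumptions, and counting distinct models is only a crude proxy for meaningful substitution; the routing trace itself is the real evidence.

```python
def pilot_checks(runs, max_halt_rate=0.25):
    """Evaluate the three pilot signals over a list of run records."""
    tasks = [task for run in runs for task in run["tasks"]]
    models = {task["model"] for task in tasks if task["model"] is not None}
    halt_rate = sum(1 for task in tasks if task["halted"]) / len(tasks)
    return {
        # 1. Spend never crossed the ceiling on any run.
        "ceiling_held": all(run["spend"] <= run["ceiling"] for run in runs),
        # 2. More than one model was used, so the routing did something.
        "substitution_seen": len(models) > 1,
        # 3. Some halts, but not too many: zero suggests the ceiling is too generous.
        "escalation_sane": 0 < halt_rate <= max_halt_rate,
    }
```

A run that fails the first check is a ledger problem. A run that fails the second or third is usually a ceiling that needs tuning, which is cheaper to fix.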

If all three pass after ninety days on one workflow, you have a factory that can be trusted with the next one. That is the pilot design. Not proving that agents can do the work, which they can in almost every case. Proving that the cost governance holds when the work gets hard.

The missing eighty percent post makes this point at the stack level: the hard capability that separates a pilot from production is not the model performance. It is the governance infrastructure that makes the model’s work attributable, bounded, and trustworthy. Cost governance is the financial dimension of that same infrastructure requirement.

The longer view

The projects that will survive to 2027 are not the ones that found the best model. They are the ones that built the floor under the model. A hard budget ledger that the orchestrator enforces, a routing function that actively substitutes cheaper options under pressure, a metering system that updates synchronously with the work, and a halt that fires before the overrun rather than after it.

These are not afterthoughts. They are the preconditions. The model is easy. There are dozens to choose from. The floor is hard, and almost nobody has built it.

Ninmu and Cosmictron together form the floor. Every task's cost is reserved before it runs. Every model assignment reflects the remaining budget. Every halt is recorded and signed. The cost trace and the audit trail are the same event stream.

You can request the full financial model and the pilot structure in the investor brief. If you are already convinced and want to see the audit side of the same ledger, deterministic replay as the audit trail is the right next read. If you want to understand the rework economics in more depth, the real economics of agent swarms has the numbers.

The question is not whether your agent programme will face a cost crisis. It is whether the crisis happens before or after the budget ceiling.