The real economics of agent swarms: 82 cents on bugfix and rework today, factories tomorrow
A large European bank ran an AI coding pilot last year and asked a simple question at the end of it: of every pound we spent on this programme, how much of it produced something that went into production? Their answer, after accounting for the output that was vetted, merged, and actually running in the test environment, was roughly twenty pence. The other eighty had paid for things that did not survive: code that failed review, output that needed substantial rewriting by a senior engineer before it was acceptable, integration scaffolding that had to be rebuilt three times because the agent did not understand the boundary constraints, and the compliance officer’s time spent writing up why two of the agent’s decisions were not auditable.
Nobody involved was incompetent. The model was capable. The agent framework was one of the popular open-source ones with a respectable GitHub star count. The unit economics were simply pointing at a structural problem: the system was optimised for generating output, not for generating output that survives contact with a regulated production environment.
That story is not unusual. A figure circulating in 2026 AI commentary frames the same problem at industry scale: roughly 82 cents of every AI coding dollar goes on bugfix and rework rather than on shipped product, with only around 18 cents reaching something in production, and somewhere around 44 cents of the total going specifically on bug-fixing. (Source: widely circulated 2026 industry commentary, attributed in various forms across AI developer community discussions on X and in several 2026 engineering blogs. It is not the author’s own measurement, and the figures should be treated as directional rather than exact.) The percentages vary depending on who is citing the figure and what they are counting. The direction is not disputed by anyone who has run these programmes at scale.
This piece is the economics companion to “cost governance before the invoice arrives”. That post deals with the mechanics of hard budget ledgers and why they are architecturally different from monitoring dashboards. This one takes a step back and asks why the economics look the way they do, what the structural causes are, and what it actually takes to invert the picture.
The three cost categories nobody is measuring
When organisations calculate the cost of an AI coding programme, they typically measure inference spend. It is the number on the invoice. It is also, in almost every case I have seen, the smallest of the three real costs.
The first category is rework. An agent generates code. A human reads it. The human determines it is structurally wrong, misunderstands a constraint, or introduces a pattern the team will spend months unpicking. The human rewrites it. This is not a failure of the model in the abstract sense. It is a failure of the system to communicate context, to carry institutional knowledge, to know what “acceptable” means for this specific codebase and this specific team. The rework cost is a human senior engineer’s time. Senior engineer time in a regulated bank is not cheap. Neither is senior engineer time in a hospital technology department or a government digital service. The inference cost of the agent that generated the bad output was probably a few dollars. The cost of the senior engineer who spent a morning rewriting it is several hundred.
The second category is validation and compliance overhead. Every agent action in a regulated environment eventually has to be defensible. Someone has to be able to say, if asked by a regulator or an audit committee, what happened, when, why, and who approved it. In most current setups that record does not exist in any usable form. Log files exist. Timestamps exist. A coherent, signed narrative of what the agent decided and on what basis does not. The compliance overhead is the work that goes into reconstructing that narrative after the fact, and it is substantial. In finance this can mean days of work per significant agent action. In healthcare it can mean the difference between a claimable decision and an unclaimed one. The inference that produced the original output cost a dollar. The compliance reconstruction cost several orders of magnitude more.
The third category is integration failure. Most agent frameworks operate at the level of the file or the function. They do not have a durable, governed model of the system they are modifying. They do not know, when they change an API signature, that three downstream consumers exist and will break. They do not know, when they restructure a data model, that the migration path has compliance implications. The integration failure cost is the engineering time spent finding and repairing the things the agent did not know it was breaking.
Add these three together and the 82 cents framing, whether or not the specific numbers are exactly right, is pointing at something real. The inference is the cheap part. The dominant cost is everything that happens after the inference.
Figure (interactive in the original): today’s cost composition (assistant plus glue) compared with the governed factory view. The shift in where the spend goes is the argument in a single visual. Figures are illustrative and labelled as such.
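To make the arithmetic behind the three categories concrete, here is a minimal sketch that computes each category’s share of the total cost of one delivered output. The class and function names are hypothetical, and the amounts are illustrative assumptions chosen to match the rough magnitudes described above (a few dollars of inference, several hundred of senior engineer time). They are not measurements.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class CostLine:
    """One cost category for a single delivered output, in one currency."""
    name: str
    amount: float


def cost_shares(lines: list[CostLine]) -> dict[str, float]:
    """Return each category's share of the total as a fraction between 0 and 1."""
    total = sum(line.amount for line in lines)
    if total <= 0:
        raise ValueError("total cost must be positive")
    return {line.name: line.amount / total for line in lines}


# Illustrative figures only: assumptions, not measurements.
example = [
    CostLine("inference", 5.0),
    CostLine("rework", 300.0),
    CostLine("compliance_reconstruction", 500.0),
    CostLine("integration_repair", 195.0),
]

shares = cost_shares(example)
for name, share in shares.items():
    print(f"{name:28s}{share:6.1%}")
```

With these assumed inputs, inference is half a percent of the total. The point is not the specific split but that the invoice line is the smallest number in the table.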
Why the glue stack makes this worse, not better
There is a tempting explanation for the rework problem that goes like this: the models will get better, the context windows will grow, and eventually the agent will just understand what it is supposed to do. Inference quality is a real variable and it has been improving. But the rework problem is not fundamentally about model capability. It is about what the system knows and what it can prove.
A system assembled from LangGraph, Pinecone, a GitHub integration, and a monitoring dashboard has a fundamental structural constraint: none of the components were built to know each other’s state. The orchestrator loops over tasks but has no native model of the codebase’s constraint graph, the compliance requirements, or the production boundary conditions. The memory store recalls text but has no concept of what was valid at what time, which means the agent can retrieve context that was true six months ago and no longer is. The GitHub integration acts on a human’s token, which means the provenance chain collapses at the first step: you know a commit happened, but the record of which agent, with what authority, having made what decision for what reason, is not in the storage layer. It is at best in a log file somewhere.
When the output fails review, the system has no way to know why in any generalisable sense. It generates more output. The cycle repeats. The rework accumulates. This is not a criticism of the individual tools, which are good at what they were designed for. It is a statement about what happens when you assemble a governed agent system from components that were not designed for governance.
The missing 80 percent of the stack, as the case study of glue-stack pilot failures documents in detail, is not infrastructure. It is the hard properties: provable identity, governed memory, auditable action chains, hard budget enforcement, and a deployment fabric that treats agent cells as first-class compute. These properties have to be designed in from the beginning. You cannot retrofit them onto a working system without rebuilding the system.
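The stale-memory problem described above, where an agent retrieves context that was true six months ago and no longer is, is what a bitemporal record addresses: each fact carries both the date it became true and the date the store learned about it. The sketch below is an assumption-laden illustration of that idea only. The `Fact` type, the `as_known_at` function and the example data are hypothetical and do not represent any particular product’s API.

```python
from dataclasses import dataclass
from datetime import date
from typing import Optional


@dataclass(frozen=True)
class Fact:
    key: str
    value: str
    valid_from: date   # when the fact became true in the world
    recorded_at: date  # when the store learned about it


def as_known_at(facts: list[Fact], key: str,
                valid_on: date, known_on: date) -> Optional[str]:
    """Value of `key` that applied on `valid_on`, using only records
    the store already held on `known_on`."""
    candidates = [
        f for f in facts
        if f.key == key and f.valid_from <= valid_on and f.recorded_at <= known_on
    ]
    if not candidates:
        return None
    # The latest applicable validity wins; ties go to the most recent record.
    best = max(candidates, key=lambda f: (f.valid_from, f.recorded_at))
    return best.value


# Hypothetical data: a policy version that changed on 1 March
# but was only recorded in the store on 10 March.
facts = [
    Fact("policy_version", "v41", date(2025, 1, 1), date(2025, 1, 2)),
    Fact("policy_version", "v42", date(2025, 3, 1), date(2025, 3, 10)),
]

print(as_known_at(facts, "policy_version", date(2025, 3, 5), date(2025, 3, 5)))
print(as_known_at(facts, "policy_version", date(2025, 3, 5), date(2025, 3, 15)))
```

The first query returns the older version because that is all the store held on 5 March; the second returns the newer one. A plain text-recall store can answer neither question, because it keeps no record of either date.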
The “dark factory” mental model versus the “AI assistant” mental model
Two framings are worth keeping distinct because they lead to completely different economics.
In the AI assistant model, a human works alongside an agent. The human reviews every significant output. The human understands the context and corrects the agent when it is wrong. The human is the quality gate, the compliance record, and the budget. This is a genuinely useful model. Coding assistants in this frame produce real productivity gains. The economics work because the human is absorbing the rework and validation cost as part of their normal work; it never appears as a separate line on a budget.
In the dark factory model, the system runs at a speed and scale where human review of every output is not the assumption. The factory declares a mission, a governed swarm executes it, humans approve the gates that policy requires, and the output is signed and auditable by construction. The economics of this model are fundamentally different. The human is no longer absorbing the rework cost. The system either has structural properties that prevent the rework, or the rework appears as a distinct, visible, and large budget line.
This is why the homepage framing for Substrate is “declare the mission and the budget” rather than “turn one developer into ten.” The two models require different architectures. A system that makes one developer faster can be bolted together from capable components. A system that runs a governed factory for a regulated enterprise has to be built with the rework and compliance and validation costs as first-class concerns from the beginning, because in the factory model those costs are not being absorbed by the human reviewing every output. They land on the system itself. Either the system handles them by construction, or they land on your budget as surprise items after the fact.
The unit economics of the factory model, when the system handles those costs by construction, are genuinely different. Not because the inference gets cheaper. Because the dominant cost categories shrink or disappear.
What cheapest-sufficient routing does to the economics
One of the mechanisms that changes the picture is cheapest-sufficient model routing. It sounds like a cost-reduction feature. It is actually an economics-transformation feature.
In the current assistant model, model selection is typically done once at the start: choose a model that is capable enough for the hardest task in the workflow, and use it for everything. This is conservative and it works, but it means paying frontier-model prices for tasks that a smaller, cheaper model could handle correctly. The planning step that drafts a bullet list of subtasks does not need the same model as the step that reasons about a security boundary. Running both on the frontier model is expensive. Running both on a smaller model risks the security-boundary reasoning being wrong.
Ninmu, the swarm conductor in Substrate, carries a per-task model of which tiers are acceptable for each node in the mission plan, and it selects the cheapest acceptable tier given the remaining budget. The mission budget diagram makes this visible: drag the cap down and watch which tasks downgrade and which refuse. The routing is not cosmetic. It is the budget enforcing a policy across every step of the mission, simultaneously.
The economics of this are not subtle. In a mission where most tasks are document-processing or code-generation of moderate complexity, and a handful of tasks involve security-critical reasoning or compliance-relevant decisions, aggressive routing sends the majority of the work to cheaper models and reserves frontier capacity for the steps where it matters. The inference bill for a mission run this way is substantially lower than it would be with one frontier model used for everything. More importantly, the cheapest-sufficient property prevents the common failure mode where a team deploys a cost-controlled system in the pilot and then gradually shifts everything to frontier models because it is easier, and discovers at scale that the costs are no longer controlled.
Figure (interactive in the original): dragging the budget cap on an illustrative mission shows how Ninmu re-routes tasks to cheaper models and halts work it cannot pay for. The goal is not to show precise costs but to make the routing logic visible. Figures are illustrative.
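The routing logic can be sketched in a few lines. This is a minimal illustration under stated assumptions, not Ninmu’s actual algorithm: the `Tier` and `Task` types, the integer capability scale, the per-task costs and the greedy in-order assignment are all hypothetical. A real planner would presumably also reserve budget for critical steps before spending on cheap ones.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Tier:
    name: str
    cost: float      # assumed cost of running one task on this tier
    capability: int  # higher means more capable


@dataclass(frozen=True)
class Task:
    name: str
    min_capability: int  # lowest capability acceptable for this task


def route(tasks: list[Task], tiers: list[Tier],
          budget: float) -> dict[str, Optional[str]]:
    """Give each task the cheapest acceptable tier that still fits the
    remaining budget; a task with no affordable tier is refused (None)."""
    remaining = budget
    by_cost = sorted(tiers, key=lambda t: t.cost)
    plan: dict[str, Optional[str]] = {}
    for task in tasks:
        chosen = next(
            (t for t in by_cost
             if t.capability >= task.min_capability and t.cost <= remaining),
            None,
        )
        plan[task.name] = chosen.name if chosen else None
        if chosen:
            remaining -= chosen.cost
    return plan


# Hypothetical tiers and tasks, for illustration only.
tiers = [Tier("small", 0.1, 1), Tier("mid", 0.5, 2), Tier("frontier", 3.0, 3)]
tasks = [
    Task("draft_subtasks", 1),
    Task("generate_code", 2),
    Task("security_boundary_review", 3),
]

print(route(tasks, tiers, budget=4.0))
print(route(tasks, tiers, budget=2.0))
```

With a budget of 4.0 every task gets its cheapest sufficient tier. With a budget of 2.0 the security review is refused rather than quietly downgraded to a tier that is not acceptable for it, which is the behaviour that separates a routing policy from a cost-saving heuristic.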
The compliance overhead is the real rework problem
Here is a line of thinking that almost every team I have spoken to has not followed to its conclusion. The 82 cents framing applies to coding output. It is about code that gets generated and does not survive. But in a regulated context, the same problem applies to compliance output, which is much more expensive.
An agent in a healthcare setting produces a coding decision. The decision is auditable, or it is not. If it is not auditable, a human has to reconstruct the reasoning chain after the fact, or the decision is not usable, or the organisation accepts a compliance risk. In a high-volume processing environment, this is not a marginal cost. It is the dominant operational cost. The same applies to trade-finance evidence packs, to DORA resilience documentation, to EU AI Act Article 12 logging requirements, to government casework decisions.
In each case, the agent doing the work is the cheap part. The compliance overhead is the expensive part. And in current architectures, the compliance overhead is proportional to how un-auditable the system is by default. Every step that is not natively signed, every decision that is not natively recorded, every action whose provenance chain collapses back to a human’s credentials rather than a verifiable agent identity, becomes a compliance cost that appears later and is expensive to retrospectively address.
A governed factory inverts this by making the output of the compliance process a byproduct of the normal run rather than a separately expensive second step. Every action signed by Ultra’s cryptographic identity layer is already part of the evidence chain. Every task recorded in Cosmictron’s deterministic replay log is already reconstructable. Every memory access through Kizuna-mem is already bitemporal, meaning the regulator’s question “what did the agent know at this moment” is answerable from the storage layer without additional investigation work. The audit-ready property is not a feature bolted on after the fact. It is the floor the factory is built on.
This is where the cost inversion in the diagram above reflects something real rather than marketing arithmetic. The compliance overhead in the “before” bar is not a small slice. For regulated use cases it often exceeds the inference and rework costs combined. When the factory is governed by construction, that overhead does not disappear, but it collapses to the cost of the governance infrastructure itself rather than the cost of a manual compliance process applied retrospectively to an un-auditable system.
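The idea that evidence is a byproduct of the run can be illustrated with a signed action record. The sketch below is not Ultra’s scheme: it uses HMAC-SHA256 from the Python standard library as a simple stand-in, which relies on a shared secret, whereas a production identity layer would more plausibly use asymmetric signatures so that auditors can verify without holding the signing key. The function names, field names and key are hypothetical.

```python
import hashlib
import hmac
import json


def _canonical(action: dict) -> bytes:
    """Serialise the action deterministically so the same content signs the same way."""
    return json.dumps(action, sort_keys=True, separators=(",", ":")).encode("utf-8")


def sign_action(agent_key: bytes, action: dict) -> dict:
    """Return the action together with an HMAC-SHA256 tag over its canonical form."""
    tag = hmac.new(agent_key, _canonical(action), hashlib.sha256).hexdigest()
    return {"action": action, "signature": tag}


def verify_action(agent_key: bytes, record: dict) -> bool:
    """Recompute the tag and compare in constant time."""
    expected = hmac.new(agent_key, _canonical(record["action"]), hashlib.sha256).hexdigest()
    return hmac.compare_digest(expected, record["signature"])


# Hypothetical key and action, for illustration only.
key = b"demo-key-for-agent-7"
record = sign_action(key, {
    "agent": "agent-7",
    "step": "policy_check",
    "decision": "escalate_to_analyst",
})

tampered = {
    "action": {**record["action"], "decision": "approve"},
    "signature": record["signature"],
}

print(verify_action(key, record))
print(verify_action(key, tampered))
```

The untouched record verifies and the altered one does not. Stored alongside the action itself, a record like this is what lets an audit read the evidence directly instead of correlating log files.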
A sector walkthrough: the before and after in trade finance
Consider a trade-finance evidence pack, which is among the most documentation-intensive processes in regulated finance. The before state is well understood. An associate, sometimes two, works through a set of documents over several days or weeks, extracting the relevant counterparty and transaction information, running it against sanctions lists and AML policy rules, surfacing exceptions to a senior analyst, and assembling a final pack that is submitted with the relevant signatures. The total cost per pack, including the analyst time and the compliance check, is in the range of several hundred to several thousand pounds depending on complexity. The audit trail is assembled by hand at the end, which means it is incomplete and inconsistent.
The costs that appear in the before picture: a senior associate’s week, an analyst’s review time, a compliance officer’s sign-off time, and the non-trivial probability that the pack has to be revised because an error is found during review, at which point the loop restarts. The inference costs are zero because the system is entirely human.
Now introduce an agent system that is not governed by construction. It generates the document extracts quickly. The inference is cheap. But the compliance officer cannot sign the pack because the provenance of each extracted fact is not recorded. The analyst cannot approve it because the policy-check reasoning is not verifiable. The pack has to be reviewed item by item by a human anyway, which means the review cost has not decreased. What has decreased is the time the associate spent typing. What has increased is the time the senior review chain spends doing the compliance work that the system cannot do on its behalf.
This is the operational pattern that underlies the rework framing: the system makes the cheap part cheaper and the expensive part more expensive, because it generates more throughput without providing the audit properties that allow that throughput to clear the compliance gate.
The factory approach starts from a different place. The mission is declared with a hard budget. Ninmu plans the evidence pack workflow: ingest the documents, build the entity graph, run the policy checks, surface exceptions to a named analyst, assemble the pack, sign and lodge it. Each step runs at the cheapest sufficient model. Each action is signed by Ultra. Each memory access is through Kizuna-mem. When the pack is complete, the audit trail is the record of the run, not a document someone assembled after the fact. The analyst spends their time approving the exceptions that require human judgement, not re-reading every line of the output to check provenance.
The before-and-after on costs is not subtle. The direct labour cost collapses to the analyst gate time plus oversight. The inference cost is low because of cheapest-sufficient routing. The compliance overhead is built in. The revision loop still exists for genuine exceptions, but revisions caused by unrecorded provenance or an unverifiable reasoning chain go to zero. That is a large category.
What to demand in an RFP
Any agentic platform that cannot answer these questions clearly is probably solving a different problem than the one that determines your costs.
Ask where the compliance overhead goes. Specifically: if a regulator asks you to demonstrate the reasoning chain for a decision the system made ninety days ago, how long does it take and who does the work? If the answer involves a human reconstructing anything, the compliance overhead has not gone away. It has been deferred.
Ask how the system proves which agent made which decision. Not “we have logs.” Specifically: is every action cryptographically signed by a verifiable agent identity with a defined capability scope, and is that signature in the storage layer where the audit can read it directly? If the answer involves correlating logs across multiple systems, you have multiple sources of truth that will disagree at the worst moment.
Ask whether the budget is enforced in the orchestration layer or reported in a dashboard. The bank in the opening of this post had dashboards. What it lacked was an orchestrator that would refuse to take a step because the mission was about to exceed the ceiling. Those are different things.
Ask about the compliance cost per unit of output at scale. Not the inference cost. The compliance cost. For most current systems, this number is very hard to calculate because the compliance work is distributed across human reviewers and is not systematically tracked. For a governed factory it is a direct output of the operational model.
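The difference between enforcing a budget and reporting on one, raised in the budget question above, fits in a short sketch. This is a minimal illustration, not the actual ledger design: the `BudgetLedger` class, its `reserve` method and the figures are hypothetical. The essential property is that the check happens before the step runs and the step is refused outright if it would breach the ceiling.

```python
class BudgetExceeded(Exception):
    """Raised when a step would take the mission past its ceiling."""


class BudgetLedger:
    def __init__(self, ceiling: float) -> None:
        self.ceiling = ceiling
        self.spent = 0.0

    def reserve(self, step: str, estimated_cost: float) -> None:
        """Record the spend before the step runs, or refuse the step."""
        if estimated_cost < 0:
            raise ValueError("estimated_cost must be non-negative")
        projected = self.spent + estimated_cost
        if projected > self.ceiling:
            raise BudgetExceeded(
                f"refusing {step!r}: {projected:.2f} would exceed "
                f"ceiling {self.ceiling:.2f}"
            )
        self.spent = projected


# Hypothetical mission with a ceiling of 10 units.
ledger = BudgetLedger(ceiling=10.0)
ledger.reserve("extract_documents", 6.0)
try:
    ledger.reserve("policy_check", 5.0)
except BudgetExceeded as exc:
    print(exc)
print(ledger.spent)
```

A dashboard would have recorded both steps and shown the overspend afterwards. Here the second step never starts, and the ledger still shows only what was actually committed.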
A 90-day pilot design for the economics question
The point of a 90-day pilot is not to see whether agents can do the work. They can. The interesting question is whether the economics work at scale, which means whether the rework and compliance costs behave the way the theory says they should.
Pick one workflow with a clear before-state cost. A trade-finance pack, a quarterly control test, a casework backlog with a known per-item cost in human hours. Run it under the factory. Measure three numbers:
The total cost per output unit, counting inference, analyst gate time, compliance review time, and revision cycles. Compare to the before state. If the dominant cost categories have not shifted, the system is not structured as a factory.
The revision rate. What share of outputs required human revision after the initial run, and why? If the revision reasons are “provenance not recorded” or “reasoning chain not verifiable,” those are structural properties the system should have supplied. If they are “genuine domain exception,” those are expected and acceptable.
The ceiling behaviour. Run at least one test where the budget ceiling is deliberately set below the comfortable level and observe what happens. A factory stops before overspending. A dashboard system sends a Slack message after the fact.
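The first two numbers can be computed from a simple per-output record. The sketch below is one possible shape for that bookkeeping, under stated assumptions: the `OutputRecord` fields, the reason labels and the example amounts are hypothetical, and the split between structural and genuine revision reasons follows the distinction drawn above.

```python
from collections import Counter
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class OutputRecord:
    inference_cost: float
    analyst_gate_cost: float
    compliance_review_cost: float
    revision_cost: float
    revision_reason: Optional[str] = None  # None means no revision was needed


# Reasons the system itself should have prevented (labels are assumptions).
STRUCTURAL_REASONS = {"provenance_not_recorded", "reasoning_not_verifiable"}


def pilot_summary(records: list[OutputRecord]) -> dict[str, float]:
    """Total cost per output plus overall and structural revision rates."""
    if not records:
        raise ValueError("no records to summarise")
    n = len(records)
    total = sum(
        r.inference_cost + r.analyst_gate_cost
        + r.compliance_review_cost + r.revision_cost
        for r in records
    )
    reasons = Counter(
        r.revision_reason for r in records if r.revision_reason is not None
    )
    revised = sum(reasons.values())
    structural = sum(
        count for reason, count in reasons.items() if reason in STRUCTURAL_REASONS
    )
    return {
        "cost_per_output": total / n,
        "revision_rate": revised / n,
        "structural_revision_rate": structural / n,
    }


# Illustrative records only: assumptions, not pilot data.
records = [
    OutputRecord(1.0, 20.0, 10.0, 0.0),
    OutputRecord(1.0, 20.0, 10.0, 40.0, "genuine_domain_exception"),
    OutputRecord(1.0, 20.0, 10.0, 0.0),
    OutputRecord(1.0, 20.0, 10.0, 60.0, "provenance_not_recorded"),
]

print(pilot_summary(records))
```

The structural revision rate is the number to watch. If it stays above zero, the system is still pushing provenance and verification work onto the reviewers.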
Ninety days is enough to see whether the economics inversion is real or whether the compliance overhead has simply moved somewhere less visible. If the before costs were high and the pilot shows comparable total cost with better audit quality, you have found the thing. If the inference is cheaper but the compliance reviewers are working twice as hard, you have found a faster way to generate the problem.
The economics of agent swarms are not the inference costs. They never were. They are the rework and validation and compliance costs that inference-first framing obscures. A factory that is governed by construction from the first step attacks exactly those lines. The 82 cents figure, however you weight the exact numbers, points at the right problem. The answer to it is not cheaper inference. It is a system that does not produce the 82 cents of waste in the first place.
If you want the financial model and the head-to-head against glue stacks, you can request the investor brief. For the infrastructure angle on why these costs scale differently on owned hardware, see “hardware implications of the agentic wave”. And for the governance mechanics underneath the cost controls described here, “cost governance before the invoice arrives” covers what it actually takes to build a budget ledger that holds.