Article

Hardware implications of the agentic wave: DRAM, HBM, servers 2026-2027

author Jonathan Conway
timestamp 26 May 2026
classification hardware / dram / hbm / agentic-ai / sovereign-infra / voxeltron / on-prem / capex-opex / substrate

A data-centre procurement manager I spoke to earlier this year had a problem that would not have made sense two years ago. Her team had budgeted for GPU cycles. The invoices coming back were dominated by something else: DRAM. Lots of it. The agents were stressing the system in a way that GPU-optimised inference barely does: they were reading and writing memory in large, irregular bursts. Context windows ballooned as missions ran for hours rather than milliseconds; reflection loops loaded and reloaded the same graph segments; multi-agent coordination pulled shared state at random offsets. The GPUs were not idle. But they were not the bottleneck either.

This is the quiet hardware story of 2026. The agentic workload is different from training, and it is different from the single-turn inference that shaped most of the current infrastructure conversation. Understanding the difference matters whether you are a sovereign government deciding where to spend infrastructure budget, a regulated enterprise choosing between cloud and on-prem, or an analyst trying to make sense of why certain memory segments started drawing unusual attention in the first half of 2026.

What actually makes an agentic workload different

Start with what inference looks like when it is simple. A user submits a prompt. The model loads its weights into memory once, runs a forward pass, streams tokens out. The weights are enormous, which is why HBM (High Bandwidth Memory, the stacked memory packaged alongside GPUs such as the H100 and its successors) is so important for inference: every generated token requires streaming the weights and the cached attention state past the compute units, and that takes enormous bandwidth. But the operation is fairly regular. The access pattern looks like large sequential reads across weight matrices. Cache it right and the GPU stays fed.

Agentic workloads are different in four ways, and each one changes the hardware picture.

The first is context length. A single-turn chat inference might operate on a few thousand tokens. An agent working through a multi-step audit mission, reflecting on its own prior outputs, consulting memory graphs for facts it established three hours ago, and maintaining state across dozens of tool calls, can accumulate context windows that are orders of magnitude larger. Long context is not just more of the same. With standard attention, the compute cost grows with the square of sequence length, the key-value cache that must stay resident grows linearly with it, and the access pattern becomes less regular as the model attends to distant positions.
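To make the scaling concrete, here is a minimal back-of-envelope sketch. It assumes standard (non-sparse) attention and a hypothetical model shape that I have picked purely for illustration; none of the numbers describe a specific product or measured deployment.

```python
def kv_cache_bytes(seq_len, n_layers, n_kv_heads, head_dim, bytes_per_value=2):
    """Bytes needed to keep keys and values resident for one sequence."""
    # Factor of 2: one tensor for keys and one for values in every layer.
    return 2 * n_layers * n_kv_heads * head_dim * seq_len * bytes_per_value


def attention_score_flops(seq_len, n_layers, n_heads, head_dim):
    """Approximate FLOPs for QK^T plus the weighted sum over V in one full pass."""
    # Two seq_len x seq_len x head_dim matmuls per head, 2 FLOPs per multiply-add.
    return 2 * 2 * n_layers * n_heads * head_dim * seq_len * seq_len


# Hypothetical model shape, for illustration only.
SHAPE = dict(n_layers=32, n_kv_heads=8, head_dim=128)

if __name__ == "__main__":
    for tokens in (4_000, 400_000):
        gib = kv_cache_bytes(tokens, **SHAPE) / 2**30
        print(f"{tokens:>7} tokens: {gib:.2f} GiB of KV cache per sequence")
```

Under these assumptions a hundredfold increase in context means a hundredfold increase in resident cache per sequence and roughly a ten-thousandfold increase in attention-score compute. Real systems soften both with techniques such as cache quantisation and optimised attention kernels, so treat this as the shape of the curve rather than a sizing tool.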

The second is the reflection loop. Many agentic architectures deliberately give agents the ability to reason about their own reasoning: check a plan, critique an output, revise a decision before committing it. Each loop is a complete forward pass, or close to it. In a supervised batch training run you control loop count tightly. In a live agent mission, the number of reflection loops depends on how hard the problem is, and can spike sharply when the agent encounters ambiguity or error. From a hardware perspective, the burst profile is irregular and hard to smooth out.

The third is multi-agent coordination. When a governed factory like Substrate is running dozens of cells simultaneously under a single Ninmu mission, those agents are sharing state: a bitemporal memory graph in Kizuna-mem, signed event logs in Ultra, delta propagation from Cosmictron. Reading and writing that shared state is not a GPU operation. It is CPU and DRAM, and it involves a lot of random-access traffic: an agent needs to know what another agent decided three minutes ago, or check a policy constraint in the memory graph, or write a signed action to the audit log. These reads and writes are small, frequent, and spread across memory in a pattern that looks nothing like the large sequential reads that HBM is optimised for.

The fourth is the simulation and testing load. Governed factories do not ship untested. Every significant agent output in a well-designed factory goes through some form of automated checking: does this code compile? does it pass the test suite? does the evidence pack conform to the required schema? Running tests against agent outputs is itself a workload, and it runs in parallel with the primary agent mission. At high throughput it adds meaningfully to the overall memory footprint.

Put these four together and you get a workload profile that is: more memory-bound than simple inference, more reliant on conventional CPU-DRAM bandwidth for coordination and state management, burstier in a way that makes smooth provisioning harder, and more sensitive to random-access latency than sequential bandwidth.

Figure (interactive in the original): cost composition of the agentic workload profile compared with simple inference. The figures are illustrative, not measured data.

What this means for the memory market in 2026-2027

Analysts following the semiconductor sector have started to notice the signal. The Mizuho NDR commentary circulated in May 2026 around the memory sector noted explicitly that agentic AI deployment is beginning to lift demand for conventional CPU-DRAM, not just the GPU-stacked HBM that has dominated the AI infrastructure conversation since 2023 (source: Mizuho NDR takeaways, widely cited in semiconductor trade coverage, May 2026). The logic is exactly what the workload analysis above would predict: the coordination overhead, memory-graph traversal, and state management in agentic systems all run on CPU, and they need DRAM bandwidth.

This matters because CPU-DRAM and HBM are different supply chains. The HBM market has been constrained since 2022 by the sheer difficulty of stacking memory dies and packaging them alongside GPUs at yield. CPU-DRAM, while less exotic, is not immune to its own capacity constraints: the same DRAM manufacturers that make server DRAM are handling demand from automotive, consumer devices, and existing data-centre growth. If agentic deployments start to be a material demand driver for CPU-DRAM, the supply picture becomes interesting.

Be careful about forward-looking certainty here. Whether agentic workloads will move server DRAM pricing by a material amount in 2026-2027 depends on how fast enterprise agentic deployment actually scales, which is genuinely uncertain. The directional claim (agentic AI favours memory bandwidth, including CPU-DRAM) is well-grounded in the workload characteristics. The magnitude is not something I am going to put a precise number on without a citable source, and neither should you trust anyone who does.

The HBM story also continues. Long-context inference and large-model deployment remain GPU-bound, and the pipeline of sovereign AI investments globally points to continued pressure on GPU and HBM supply. The Mizuho commentary also notes that sovereign infrastructure projects are emerging as significant buyers of both, alongside hyperscalers (source: ibid.). India’s large-scale data-centre investments, including the Reliance and Netweb announcements that drew attention in early 2026, are buying real server capacity rather than just committing to cloud credits. When a national government decides it needs AI infrastructure inside its own borders, it becomes a buyer of CPUs, DRAM, GPUs, and networking at scale. Multiply that across several sovereign programmes and it is not a trivial demand signal.

The server design implication is also worth noting. Agentic workloads are more heterogeneous than training or simple inference: they need GPUs for the model forward passes, CPUs for orchestration and coordination logic, fast DRAM for state management, and fast storage for memory graphs and event logs. Optimal server configurations for agentic deployment may look different from the GPU-heavy configurations that dominated AI procurement from 2022 to 2025. Memory-expanded CPU configurations, high-bandwidth NVMe, and well-designed NUMA topology all become more relevant.

The sovereign imperative and its hardware logic

The sovereign AI narrative is not new, but it acquired sharper edges in 2026. A combination of regulatory pressure (EU AI Act enforcement beginning in August 2026, DORA requirements for financial firms, data residency rules that vary by jurisdiction), geopolitical concern about technology dependence, and straightforward procurement risk from hyperscaler concentration has pushed governments and large regulated enterprises to ask seriously whether they can run their AI infrastructure inside their own walls.

The hardware implication is direct. If you cannot send data to a cloud provider, you need the compute, the memory, and the storage sitting in a place you control. That means capital expenditure on actual hardware, which is a different procurement and financing decision from operational expenditure on cloud credits.

For governments this is largely a settled question. Sovereign infrastructure is a policy decision as much as an economic one, and the relevant programmes are buying hardware. For regulated enterprises, the calculus is more nuanced. Cloud is convenient, predictable on a per-call basis, and requires no upfront commitment. On-prem has higher upfront cost but offers predictable long-run spend, no data-egress risk, and the ability to run continuously without a vendor’s terms of service or pricing changing underneath you.

The agentic workload characteristics shift this calculus in the on-prem direction more than simple inference did. Consider the memory footprint. A governed factory running hundreds of agent cells simultaneously, with Kizuna-mem maintaining bitemporal graphs across all active missions, Cosmictron holding shared state for multi-agent coordination, and Ultra managing signed identity for every action, is holding a substantial amount of working state in memory at any given moment. Paying for that as cloud consumption, especially at the rates GPU cloud instances currently charge for memory-heavy configurations, can become expensive fast. Owning the hardware and paying once for the DRAM makes the economics look different over a three- to five-year horizon.

This is not a guarantee. Cloud has genuine advantages: elasticity for unpredictable burst loads, absence of maintenance overhead, access to the latest GPU generations without a capital cycle. The right answer depends on workload predictability, data-sovereignty requirements, and the buyer’s cost of capital. What has changed is that the agentic workload characteristics, specifically the memory intensity and the state management overhead, give on-prem a stronger cost argument than simple inference alone would.
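The break-even logic can be sketched in a few lines. This is a deliberately simple model under stated assumptions: flat monthly costs, no cost of capital, no hardware refresh, and every figure in the example is hypothetical rather than drawn from a real quote.

```python
import math


def cumulative_cloud_cost(months, monthly_cloud_cost):
    """Total cloud spend after a number of months at a flat monthly rate."""
    return months * monthly_cloud_cost


def cumulative_onprem_cost(months, capex, monthly_opex):
    """Upfront hardware cost plus power, maintenance, and staffing over time."""
    return capex + months * monthly_opex


def break_even_month(capex, monthly_opex, monthly_cloud_cost):
    """First whole month in which on-prem is no more expensive than cloud.

    Returns None when on-prem running costs are not lower than the cloud bill,
    in which case the upfront spend is never recovered.
    """
    monthly_saving = monthly_cloud_cost - monthly_opex
    if monthly_saving <= 0:
        return None
    return math.ceil(capex / monthly_saving)


if __name__ == "__main__":
    # Hypothetical figures, for illustration only.
    month = break_even_month(capex=600_000, monthly_opex=10_000,
                             monthly_cloud_cost=35_000)
    print(f"Break-even in month {month}")
```

A buyer with a high cost of capital, or a workload whose cloud bill swings widely from month to month, would need to discount the cash flows and model the variance before relying on a figure like this.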

How Substrate and Voxeltron sit in this picture

Substrate’s deployment model was designed around sovereign deployment from the start, not retrofitted onto a cloud-native architecture. The six systems (Ninmu, Cosmictron, Kizuna-mem, Ultra, Kizuna, Voxeltron) run on the customer’s hardware, inside the customer’s network perimeter, with frontier model failover available as an optional edge rather than a dependency for operation. This is not primarily a marketing position; it is an architectural constraint that shaped design decisions throughout the stack.

Voxeltron is the piece most directly relevant to the hardware question. Its job is to take a physical host and make it as dense with agent capacity as possible, without sacrificing isolation or operational reliability. The headline numbers are verified: approximately 10,000 idle cells per host, with isolated cell boot times of 50 milliseconds or less. To put that in terms a hardware procurement team can use: one well-configured server running Voxeltron can host a very large number of concurrently available agents, ready to be dispatched to a mission task within 50 milliseconds of the request.

Figure (interactive in the original): how agent-hours scale with physical host count, where each cell represents an isolated agent runtime. The fault-injection view shows how Voxeltron’s AI SRE operator handles a failed cell: state is preserved via the Cosmictron snapshot and the cell rolls back cleanly.

That density figure is what changes the on-prem economics. Traditional container orchestration was designed for persistent services, not ephemeral agent workloads. A Kubernetes cluster running Dockerised agents incurs cold-start overhead, scheduling latency, and a resource allocation granularity, none of which were ever optimised for thousands of short-lived agent tasks. Voxeltron’s architecture treats the cell as the unit: it boots cells fast, packs them densely, and recovers them automatically via the AI SRE operator if something fails. The result is more agent-hours per physical server, which is the number that actually appears in the denominator of the on-prem cost calculation.
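To show where agent-hours enter the calculation, here is a minimal sketch of cost per agent-hour for a single host. The function and all of its inputs are hypothetical. Note in particular that the count of concurrently active cells is an assumption the buyer must measure, and is not the same thing as the idle cell density quoted above.

```python
HOURS_PER_YEAR = 365 * 24


def cost_per_agent_hour(host_capex, lifetime_years, annual_running_cost,
                        active_cells, utilisation):
    """Lifetime host cost divided by the agent-hours it actually delivers.

    active_cells: cells doing work concurrently (assumed, not idle density).
    utilisation:  fraction of wall-clock time those cells are busy, 0 to 1.
    """
    if not 0 < utilisation <= 1:
        raise ValueError("utilisation must be in (0, 1]")
    if active_cells <= 0 or lifetime_years <= 0:
        raise ValueError("active_cells and lifetime_years must be positive")
    total_cost = host_capex + lifetime_years * annual_running_cost
    agent_hours = active_cells * utilisation * lifetime_years * HOURS_PER_YEAR
    return total_cost / agent_hours


if __name__ == "__main__":
    # Hypothetical host, for illustration only.
    for cells in (50, 200):
        c = cost_per_agent_hour(host_capex=80_000, lifetime_years=4,
                                annual_running_cost=12_000,
                                active_cells=cells, utilisation=0.5)
        print(f"{cells:>4} active cells: {c:.4f} per agent-hour")
```

Because agent-hours sit in the denominator, doubling the number of usefully active cells on the same host halves the cost per agent-hour, which is the whole of the density argument in one line.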

The memory implication is the other side of the density argument. Dense packing of cells is only economical if the per-cell memory footprint is well-managed. An idle Voxeltron cell is not consuming active DRAM for model weights; it is a lightweight runtime waiting for dispatch. When a cell activates, it loads what it needs. This is different from the pattern where every potential agent instance has its models fully resident in GPU memory at all times. The consequence is that a modest physical host, appropriately configured with fast DRAM and good NVMe, can support an agent density that a naive “allocate a full container per agent” approach could not match at the same price point.

Cosmictron’s embedded state model also matters here. Instead of every agent cell talking to an external Redis cluster or Postgres instance over the network for shared state, Cosmictron provides co-located state with zero network hops for cells on the same host. For the random-access, low-latency state reads that multi-agent coordination requires, this is a meaningful improvement. It also means the DRAM on the Cosmictron-bearing host is doing real work (holding the live state and delta buffer for all cells on that host) rather than that bandwidth going across the network to a separate data store.
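A rough way to see why hop count matters for coordination traffic is to multiply it out. The two latency constants below are generic order-of-magnitude assumptions (a local DRAM access versus a round trip to a networked data store), not measurements of Cosmictron, Redis, or Postgres.

```python
# Order-of-magnitude assumptions, not measurements of any named system.
LOCAL_DRAM_READ_S = 100e-9      # roughly 100 nanoseconds
NETWORK_ROUND_TRIP_S = 500e-6   # roughly half a millisecond


def coordination_seconds(reads, latency_seconds):
    """Time spent waiting if dependent reads are issued one after another."""
    return reads * latency_seconds


if __name__ == "__main__":
    reads = 1_000_000  # hypothetical small state reads during one mission
    local = coordination_seconds(reads, LOCAL_DRAM_READ_S)
    remote = coordination_seconds(reads, NETWORK_ROUND_TRIP_S)
    print(f"co-located: {local:.3f} s, networked: {remote:.1f} s")
```

Batching and pipelining narrow the gap in practice, which is why the latency numbers are worth asking a vendor to demonstrate on a realistic access pattern rather than inferring from arithmetic like this.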

A concrete sector walkthrough: government AI procurement

Consider a government department that needs to stand up a governed agent factory for processing benefit claims, or casework, or regulatory submissions. The mandate is clear: data must stay within national infrastructure, the audit trail must be inspectable by public accounts committees, and the system must be capable of running under partial network isolation if required.

Before: The department buys cloud credits, puts data into a hyperscaler’s sovereign region (which may or may not actually satisfy the data-residency requirements, depending on jurisdiction), and accepts that operational continuity depends on the vendor. The cost is opex: it varies with consumption, which is convenient until the agentic workload scales and the bills reflect the memory-intensive, bursty nature of what is actually running.

With Substrate on sovereign hardware: The department commissions servers. The upfront capex is real, and it requires a financing decision rather than a credit card. But from that point, running the governed factory costs electricity and maintenance rather than consumption-metered cloud charges. The Voxeltron density means the number of physical servers needed to support a given agent workload is lower than with alternatives that cannot match the cell packing. Ninmu’s hard budget ledger means the total compute consumption is bounded by the declared mission budget before inference runs, not discovered on an invoice three weeks later.

The audit trail is a further sovereign argument. When the agents run on sovereign hardware, the entire event log, every signed agent action, every bitemporal memory record, every human gate decision, sits in infrastructure the department controls. Regulators and audit committees can inspect it without going through a vendor’s export API. That is not a capability you can bolt on after the fact; it requires the audit trail to be the storage engine, which is what Cosmictron’s deterministic replay architecture provides.

The 90-day pilot design for this kind of buyer is worth being explicit about. Pick one high-pain workflow with a known current cost: the legal team’s case-file review backlog, the compliance team’s evidence-pack assembly, the finance team’s control testing cycle. Stand up a Voxeltron host (or a small cluster for redundancy). Declare the mission and a hard budget set at some fraction of the current human cost. Run it. Measure three numbers: cycle time, exception rate requiring human review, and actual spend against the ceiling. The ceiling should not be crossed; if it is, the factory’s governance model has a problem and you have found it cheaply.
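The three pilot numbers are simple enough to compute from a task log. The sketch below assumes a hypothetical record format with one entry per completed task; the field names are mine, not part of any Substrate interface.

```python
from dataclasses import dataclass
from statistics import median


@dataclass
class TaskRecord:
    cycle_seconds: float        # wall-clock time from dispatch to completion
    needed_human_review: bool   # True if the task was escalated as an exception
    spend: float                # cost attributed to this task


def pilot_summary(records, budget_ceiling):
    """Cycle time, exception rate, and spend against the declared ceiling."""
    if not records:
        raise ValueError("no task records to summarise")
    total_spend = sum(r.spend for r in records)
    exceptions = sum(1 for r in records if r.needed_human_review)
    return {
        "median_cycle_seconds": median(r.cycle_seconds for r in records),
        "exception_rate": exceptions / len(records),
        "total_spend": total_spend,
        "ceiling_breached": total_spend > budget_ceiling,
    }


if __name__ == "__main__":
    # Hypothetical records, for illustration only.
    log = [TaskRecord(42.0, False, 1.10), TaskRecord(55.0, True, 1.60),
           TaskRecord(38.0, False, 0.95)]
    print(pilot_summary(log, budget_ceiling=500.0))
```

I use the median rather than the mean for cycle time because a handful of long-running exceptions would otherwise dominate the figure; reporting both is a reasonable alternative.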

What to ask for in an RFP

If you are evaluating hardware-paired agentic infrastructure for sovereign or regulated deployment, a few questions cut through the marketing.

Ask the vendor what the cell density is on a standard 2U server and how that is measured. “Up to 10,000 idle cells per host” is a specific, testable claim. Vague answers about “thousands of containers” are not.

Ask what the cold-start latency is for a new agent cell under load. 50 milliseconds is the Voxeltron target; ask to see it demonstrated, not described.

Ask what the DRAM configuration assumes and what happens to density when cells are active rather than idle. An honest answer distinguishes between idle cell density (the lightweight footprint when the cell is waiting) and active cell memory (what the cell consumes when running a model forward pass). Both numbers matter for sizing a real deployment.
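The idle-versus-active distinction can be turned into a sizing sketch. Every per-cell footprint here is a placeholder to be replaced with the vendor’s measured figures; I am not asserting any particular footprint for Voxeltron cells.

```python
def host_dram_gib(idle_cells, idle_mib_per_cell, active_cells,
                  active_gib_per_cell, shared_state_gib, headroom=0.2):
    """DRAM needed for idle cells, active cells, and shared state, plus headroom."""
    idle_gib = idle_cells * idle_mib_per_cell / 1024
    active_gib = active_cells * active_gib_per_cell
    return (idle_gib + active_gib + shared_state_gib) * (1 + headroom)


def max_active_cells(total_dram_gib, idle_cells, idle_mib_per_cell,
                     active_gib_per_cell, shared_state_gib, headroom=0.2):
    """How many cells can be active at once on a host with a fixed DRAM size."""
    usable = total_dram_gib / (1 + headroom)
    usable -= shared_state_gib + idle_cells * idle_mib_per_cell / 1024
    return max(0, int(usable // active_gib_per_cell))


if __name__ == "__main__":
    # Placeholder footprints, for illustration only.
    n = max_active_cells(total_dram_gib=1024, idle_cells=10_000,
                         idle_mib_per_cell=8, active_gib_per_cell=2,
                         shared_state_gib=64)
    print(f"Active cells supported at once: {n}")
```

The point of running both directions is that the idle figure tells you how many agents can be ready, while the active figure tells you how many can be working, and a deployment sized on the first alone will stall under load.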

Ask how shared state is managed across cells on the same host. If the answer is “Redis” or “Postgres on a separate server”, you are being asked to fund the network latency and the separate infrastructure for every coordination operation. If the answer is “co-located embedded state with zero hops”, ask them to show the latency numbers.

Ask what the budget governance looks like at the hardware level. An agent factory that can consume all available DRAM and crash the host because a reflection loop went out of control is not a governed factory; it is a demo. The budget ceiling in Ninmu’s ledger and Voxeltron’s per-cell resource limits should together prevent runaway consumption before it affects other tenants on the same host.
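What “bounded before it runs” means can be illustrated with a reserve-before-spend ledger. This is a generic sketch of the pattern under my own assumptions; the class and function names are hypothetical and it is not Ninmu’s actual interface.

```python
class BudgetExceeded(Exception):
    """Raised when a reservation would take spend past the ceiling."""


class BudgetLedger:
    """Hypothetical hard-ceiling ledger: cost is reserved before work starts."""

    def __init__(self, ceiling):
        if ceiling < 0:
            raise ValueError("ceiling must be non-negative")
        self.ceiling = ceiling
        self.reserved = 0.0

    @property
    def remaining(self):
        return self.ceiling - self.reserved

    def reserve(self, estimated_cost):
        if estimated_cost < 0:
            raise ValueError("estimated_cost must be non-negative")
        # Refuse up front rather than discovering the overrun afterwards.
        if self.reserved + estimated_cost > self.ceiling:
            raise BudgetExceeded(
                f"need {estimated_cost}, only {self.remaining} remaining")
        self.reserved += estimated_cost


def run_reflection_loops(ledger, cost_per_loop, max_loops):
    """Run loops until the agent is done or the ledger refuses another one."""
    completed = 0
    for _ in range(max_loops):
        try:
            ledger.reserve(cost_per_loop)
        except BudgetExceeded:
            break  # stop cleanly instead of running away
        completed += 1  # the actual forward pass would happen here
    return completed


if __name__ == "__main__":
    ledger = BudgetLedger(ceiling=10)
    print(run_reflection_loops(ledger, cost_per_loop=3, max_loops=100))
```

The same reserve-then-run shape applies to memory: a per-cell limit enforced by the runtime plays the role of the ceiling, so a runaway loop fails its own reservation instead of exhausting the host.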

The capex-opex tension is real but often poorly framed

One thing worth being direct about: the argument for on-prem hardware is not always correct. There are genuine scenarios where cloud is the better answer, and intellectually honest evaluation requires acknowledging them.

If your agentic workload is genuinely unpredictable in volume, cloud elasticity is hard to replace without over-provisioning. If you do not have the infrastructure team to maintain a hardware deployment, the operational cost of on-prem can exceed the opex savings. If your data-residency requirements can actually be satisfied by a hyperscaler’s sovereign offering in your jurisdiction, the security argument for on-prem is weaker.

What has changed is not that on-prem is always right. It is that the agentic workload characteristics (memory intensity, density requirements, bursty coordination traffic) have shifted the break-even point. The types of deployment that are worth considering for on-prem now include medium-scale agentic factories that would have been marginal cases under a pure inference workload, precisely because the per-cell memory overhead at Voxeltron density makes physical hardware genuinely competitive in the right configuration.

The sovereign buyers, the national governments and critical infrastructure operators, have an additional dimension: the option of paying cloud opex indefinitely is not actually on the table for them. They need hardware. The question for them is which hardware, in what configuration, running what software. The agentic workload characteristics described here should inform those procurement decisions directly: weight memory-bandwidth configurations more heavily than pure GPU count; look for software that packs cells densely rather than wastes host capacity on container overhead; insist on a budget governance model that prevents runaway consumption before the invoice arrives.

Cross-links and further reading

The density economics described in this article build directly on the Voxeltron architecture documented in “Voxeltron: cell density and the 50ms boot”. If your interest is the sovereign deployment model rather than the hardware economics, “Sovereign AI, air-gapped by default” covers the operational and governance case in more depth.

The unit economics of agent swarms, including how a governed factory inverts the cost composition compared with unmanaged agentic pipelines, is the subject of “The real economics of agent swarms: 82 cents on rework”. And if you want to understand the geopolitical and capital dimension of sovereign infrastructure as a competitive battleground, “The global battle for sovereign agent infrastructure” picks up the narrative at the national level.


The hardware conversation around AI has been dominated by two things: how many H100s a training run needs, and which hyperscaler is cutting which deal. Agentic deployment changes the question. The memory bus, the random-access latency, the cell density, the cost per agent-hour on a physical host: these are the numbers that matter when a governed factory is running missions continuously rather than serving one-off inference requests.

The buyers who understand this in 2026 will make procurement decisions that hold up. The ones who do not will replicate the pattern of the previous cycle: optimise for the metric everyone talks about, and find the bottleneck somewhere else entirely.

If you want to understand how Substrate’s architecture was designed for exactly these hardware realities, including the budget governance, the cell density, and the sovereign deployment story, you can request the investor brief.