EU AI Act Article 12 and Annex III: what high-risk agent systems must log by August 2026

Author: Jonathan Conway
Timestamp: 3 May 2026
Classification: eu-ai-act / compliance / logging / high-risk-ai / audit-trail / substrate / identity / regulated

A compliance officer at a European bank asked me in April what their AML screening agent needed to log under the EU AI Act. The practical question was not abstract compliance. It was whether, months later, the bank could prove what the agent saw, did, and relied on. I gave them the short answer: everything, for at least six months, in a form that cannot be altered after the fact. They asked me to be more specific. This article is the longer version of that conversation.

Article 12 of the EU AI Act (Regulation (EU) 2024/1689, available in full on EUR-Lex) imposes automatic logging obligations on providers of high-risk AI systems. The obligations are not about keeping dashboards or writing to a database. They require that a high-risk AI system “technically allow for the automatic recording of events.” The word “technically” is doing a lot of work in that sentence. For you, the key point is whether the system can create the evidence automatically, without depending on someone remembering to switch logging on. An agent that can log but currently does not is not compliant. The capability must exist in the system itself, by design, before the system is placed on the market or put into service.

For most of the agents being deployed today in financial services, healthcare, and government, the infrastructure does not have this capability. That is usually not an oversight. It is a design inheritance from frameworks that were never built for regulatory evidence.

What makes a system “high-risk” under Annex III

The first question is whether your system is in scope. The Act defines high-risk AI systems in Article 6 and Annex III. The list is specific. Annex III covers eight areas: biometrics, critical infrastructure, education and vocational training, employment and worker management, essential private and public services (including credit scoring and life and health insurance), law enforcement, migration, asylum and border control, and administration of justice and democratic processes.

For regulated enterprise buyers, the two areas that cover most current agent deployments are employment decisions (HR agents that screen candidates or manage staff) and essential private services, which expressly includes creditworthiness assessment and life and health insurance risk classification. An AML screening agent that contributes to a decision about whether to maintain a customer relationship is almost certainly in scope. A claims processing agent that influences an approval decision is in scope. A casework agent in a government benefit service is in scope.

The Act lets a provider conclude that a system listed in Annex III is not high-risk if it does not pose a significant risk of harm to health, safety, or fundamental rights (Article 6(3)), but the bar for that self-assessment is high and the assessment must be documented. If you are in any doubt, assume you are in scope.

What Article 12 requires

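Article 12(2) names three purposes that the logs must serve. The first bundles two distinct things, risk situations and substantial modifications, so this article treats them as four categories of events that must be automatically recorded. Read it against the original text at EUR-Lex, because summaries often compress things in misleading ways.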

In plain language, the log has to help show when the system may be moving toward a risky outcome. First, the system must record events relevant to identifying situations that may result in the system presenting a risk within the meaning of Article 79(1) (Article 12(2)(a)). This is risk-situation logging: any output, internal state, or input pattern that could indicate the system is producing decisions with an elevated risk of harm or error.

It also has to preserve the history of what changed, not just what the system output today. Second, it must record events related to substantial modifications throughout the lifetime of the system (Article 12(2)(a)). A model version change can be a substantial modification. So can a change to the underlying training data, or a change to prompt templates that alters the system’s behaviour. The word “lifetime” is important: this runs from when the system is first placed into service through to decommissioning.

Third, post-market monitoring events (Article 12(2)(b)). Once a high-risk system is deployed, providers are required to actively monitor its real-world performance under Article 72. The logs must be capable of supporting that monitoring, which means recording enough operational detail to detect performance drift, unexpected outputs, and emerging failure modes.

Fourth, operational monitoring events (Article 12(2)(c)): the normal functioning of the system at a sufficient level of granularity to understand what the system did and on what basis.

This is where logging stops being a generic observability feature and becomes evidence. Article 12(3) sets a concrete minimum for remote biometric identification systems (Annex III, point 1(a)): the logs must record the period of each use, the reference database against which the input data was checked, the input data for which the search led to a match, and the natural persons who verified the results. That list binds only those systems, but it is a reasonable guide to what reconstruction-grade logging looks like elsewhere, and the input-data requirement deserves attention. Regulators need to reconstruct what the system knew when it made a decision. Vector search over an unversioned embedding store cannot provide that. A bitemporal memory graph can.

Retention turns the legal duty into a storage and deletion design problem. Retention: the minimum period sits outside Article 12 itself. Article 19(1) requires providers, and Article 26(6) requires deployers, to keep the automatically generated logs under their control for a period appropriate to the intended purpose of the system, and for at least six months unless other Union or national law provides otherwise. That carve-out matters, because sector rules can set a longer period, and certain biometric and law enforcement applications under Annex III may face one. Some financial services firms will also need to comply with DORA’s ICT incident reporting log retention requirements, which run to five years for major incidents, effectively creating a higher floor.

In everyday terms, if someone changes the record, the change has to be detectable. Tamper-evidence: the Act does not use the phrase “tamper-evident” directly, but the requirements are functionally equivalent to it. If a log can be modified after the fact without detection, it cannot reliably identify risk situations or support a regulatory investigation. The technical standard for what “automatic recording” means in practice will be elaborated through the harmonised standards the Commission is developing with CEN-CENELEC, but the direction of travel is clear.

On fines: the Act has a tiered penalty structure. Prohibited practices under Article 5 (facial recognition databases scraped from the internet, social scoring, and similar) carry fines of up to EUR 35 million or 7% of global annual turnover, whichever is higher. Non-compliance with obligations applicable to high-risk AI systems, including the Article 12 logging requirements, falls under the mid tier: up to EUR 15 million or 3% of global annual turnover. Fines for providing incorrect information to a notified body or national supervisory authority fall at EUR 7.5 million or 1%. These are maximums; actual penalties depend on the gravity of the infringement, the size of the operator, and national enforcement approaches.
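The tiered ceilings above reduce to simple arithmetic. The sketch below is illustrative only: the tier labels and function name are invented for this example, the turnover figures are hypothetical, and it assumes the “whichever is higher” rule applies to every tier for undertakings other than SMEs.

```python
# Statutory ceilings per tier: (fixed amount in EUR, percent of global annual turnover).
# Tier keys are labels invented for this sketch, not terms from the Act.
FINE_TIERS = {
    "prohibited_practice": (35_000_000, 7),
    "high_risk_obligation": (15_000_000, 3),   # includes the logging obligations
    "incorrect_information": (7_500_000, 1),
}


def fine_ceiling_eur(tier: str, global_turnover_eur: int) -> int:
    """Return the statutory maximum: fixed amount or turnover share, whichever is higher."""
    fixed_eur, percent = FINE_TIERS[tier]
    turnover_based = global_turnover_eur * percent // 100  # integer maths avoids float rounding
    return max(fixed_eur, turnover_based)


# Hypothetical firms: EUR 2bn and EUR 100m global annual turnover.
print(fine_ceiling_eur("high_risk_obligation", 2_000_000_000))  # 60000000
print(fine_ceiling_eur("high_risk_obligation", 100_000_000))    # 15000000
```

For a large bank the percentage is the figure that matters; for a smaller firm the fixed amount sets the ceiling. Either way these are maximums, not predicted penalties.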

One caveat worth stating clearly: at the time of writing, there are legislative proposals under what is being called the Digital Omnibus package (part of the Omnibus Simplification package published by the Commission in February 2025) that would delay some SME-facing obligations and adjust certain Annex III criteria. These proposals are not yet law. The 2 August 2026 applicability milestone for high-risk systems under Article 6(2) and Annex III stands unless and until a final legislative text amending the Act is published. Building compliance strategy on proposed delays is a risk most regulated enterprises should not take.

The technical gap in most agent stacks

What does Article 12 require technically? At a minimum: an append-only log that captures the identity of the acting system (which agent, at which version), the input it received, the decision or output it produced, the time, the policy or rule set that governed it, and any human gate invocations. For a buyer or operator, the test is simple: can you answer a regulator’s reconstruction question from the system itself? The log must be structured in a way that supports queries like “show me everything the system knew and did on this transaction at this timestamp.” And it must not be editable after the fact.
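To make that minimum concrete, here is a small Python sketch of a log record carrying those fields, plus the reconstruction query. Every name in it (`ActionRecord`, `AppendOnlyLog`, the field names) is hypothetical. The in-memory list is append-only only by interface; a real system would need storage-level guarantees that this sketch does not attempt.

```python
from dataclasses import dataclass
from datetime import datetime
from typing import Optional


@dataclass(frozen=True)  # frozen: a record cannot be mutated once created
class ActionRecord:
    transaction_id: str
    agent_id: str              # which agent acted
    agent_version: str         # at which version
    input_digest: str          # hash of the input it received
    output_digest: str         # hash of the decision or output
    policy_id: str             # the policy or rule set that governed it
    recorded_at: datetime
    human_gate: Optional[str] = None  # reviewer reference, if a human gate was invoked


class AppendOnlyLog:
    """Exposes append and read only; there is deliberately no update or delete."""

    def __init__(self) -> None:
        self._records: list = []

    def append(self, record: ActionRecord) -> None:
        self._records.append(record)

    def history(self, transaction_id: str, as_of: datetime) -> tuple:
        """Everything recorded for one transaction up to and including `as_of`."""
        return tuple(
            r for r in self._records
            if r.transaction_id == transaction_id and r.recorded_at <= as_of
        )
```

The `history` method is the regulator’s question expressed as a query: one transaction, one point in time, one ordered answer.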

Now look at a typical agent stack: LangChain or LangGraph for orchestration, Postgres or a vector store for memory, a cloud model API for inference, and a custom logging table bolted on after the fact. The logging table inherits the database’s write permissions, so an operator with access to the database can delete or alter rows. The memory store does not version its contents bitemporally, so “what the agent knew at time T” is not reconstructable after the memory is updated. The model version is probably not recorded per-call. The prompt template that shaped the output is not captured. The policy that governed the output does not exist as a formal artifact at all.

None of this is a criticism of the individual tools. Postgres is excellent. LangGraph is a reasonable orchestration choice. The problem is structural: governance, tamper-evidence, and bitemporal provenance were not design requirements for these components. You cannot bolt them on afterwards without rebuilding the parts that matter.

The AML screening workflow below shows what Article 12 logging looks like when you design it in from the start, and what a typical glue stack would be missing.

Interactive figure: the AML screening workflow, with a “show glue-stack gaps” view marking which Article 12 artifacts a LangGraph plus Postgres stack would fail to produce. Each workflow step lists the specific artifact and the Substrate system that generates it.

Five of the six artifacts in that workflow are absent from most agent stacks. The one they typically have is something in the neighbourhood of a rule-trace log, because rule engines tend to write outputs by design. Everything else, the signed ingest manifest, the bitemporal memory state record, the calibrated risk record with model version, the human gate attestation, the final tamper-evident bundle, requires an infrastructure that was built to produce it.

How Substrate produces what the regulator needs

To understand why this works, start before the log. When a Substrate mission is declared, the mission orchestrator decomposes it into a task graph and assigns each task a policy context. The policy context is not a comment in a config file. It is a formal artifact that travels with the task and is included in the signed record of what happened. From the first step, every action carries its governance provenance.

The identity service is the cryptographic authority plane. Every agent in the factory has a provable Ed25519 identity registered with the identity service. When an agent acts, it signs the action record with its private key. The action record includes: the agent identity, the timestamp, the tool invoked, the input (or a content-addressed hash of it for large payloads), the output hash, and the policy identifier. The identity service writes this to an append-only, HMAC-chained log. The HMAC chain means that any alteration to any record changes the hash, which invalidates every subsequent record in the chain.

That chain is what Article 12 means when it says the system must “technically allow for the automatic recording of events.” The log does not need a human to activate it. It is not a middleware wrapper that could be disabled. It is how the factory records anything at all.

The governed memory engine supplies the bitemporal memory that Article 12 requires for input reconstruction. When an AML agent queries the governed memory engine for context about a counterparty, the query and its result are recorded with both valid time (when the fact was true in the world) and transaction time (when the system learned it). At any future point, you can ask the governed memory engine to reconstruct exactly what the agent knew at a specific timestamp. The ~3 ms recall latency means this reconstruction is fast enough to happen in-line during a regulatory examination.
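The two time axes are easier to see in code. This is a minimal Python sketch of a bitemporal lookup, not the governed memory engine’s API: the names `FactVersion` and `known_value` are invented here, and it simplifies by giving each fact only a start of validity rather than a full validity interval.

```python
from dataclasses import dataclass
from datetime import datetime
from typing import Optional


@dataclass(frozen=True)
class FactVersion:
    subject: str
    attribute: str
    value: str
    valid_from: datetime   # valid time: when the fact became true in the world
    recorded_at: datetime  # transaction time: when the system learned it


def known_value(facts, subject: str, attribute: str, *,
                valid_at: datetime, known_at: datetime) -> Optional[str]:
    """What the system believed, as of `known_at`, to be true at `valid_at`."""
    candidates = [
        f for f in facts
        if f.subject == subject and f.attribute == attribute
        and f.valid_from <= valid_at    # already true in the world
        and f.recorded_at <= known_at   # and already known to the system
    ]
    if not candidates:
        return None
    # The most recent validity wins; ties go to the most recently recorded version.
    return max(candidates, key=lambda f: (f.valid_from, f.recorded_at)).value
```

Suppose a counterparty was sanctioned on 1 March but the system only learned of it on 10 March. A query with `known_at` set to 5 March returns the old status, which is exactly what the agent would have seen when it screened a transaction that day.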

The realtime data plane’s deterministic replay adds the execution layer. Because every state change in the realtime data plane is logged with the event that caused it, the entire agent run can be replayed. The replay is deterministic: given the same signed event log, the same outputs are produced. This is not a reconstruction from fragmentary application logs. It is the storage engine itself treated as an audit record, which is the right design when “what did the system do” and “what is the authoritative record of what the system did” need to be the same thing.
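Deterministic replay rests on one property: state is a pure function of the event log. The Python sketch below illustrates that property under simplifying assumptions. The event types and the function names `apply_event`, `replay`, and `state_digest` are hypothetical and do not describe the realtime data plane’s actual interface.

```python
import hashlib
import json


def apply_event(state: dict, event: dict) -> dict:
    """Pure transition: no clock, randomness, or I/O, so replays cannot diverge."""
    new_state = dict(state)
    if event["type"] == "score":
        new_state["risk_score"] = event["value"]
    elif event["type"] == "flag":
        new_state["flagged"] = True
    return new_state


def replay(events) -> dict:
    """Rebuild the final state by folding the event log from an empty state."""
    state: dict = {}
    for event in events:
        state = apply_event(state, event)
    return state


def state_digest(state: dict) -> str:
    """Canonical JSON (sorted keys) so equal states always hash identically."""
    canonical = json.dumps(state, sort_keys=True).encode()
    return hashlib.sha256(canonical).hexdigest()
```

Replaying the same log twice and comparing `state_digest` values is the check that matters: if the digests differ, something outside the log influenced the run.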

The diagram below shows the tamper-evident log structure that the identity service produces, and what happens when you attempt to alter any record after the fact.

Interactive figure: the tamper-evident log as a chain of blocks, with a “simulate tamper” view in which the HMAC chain breaks at the modified block and invalidates all downstream records. Each block lists the signed fields that Article 12 requires.

The tamper demonstration is not theatrical. It reflects the actual verification mechanism. An auditor or regulator presented with the signed log can run the hash verification themselves. If the chain is intact, the log has not been modified since the first block was written. If it is broken, the break points to exactly where and when an alteration occurred.
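The verification step can be sketched in a few lines of standard-library Python. This is an illustration of HMAC chaining in general, not the identity service’s implementation: the record layout, the function names, and the all-zero genesis value are assumptions made for the example, and the per-agent Ed25519 signatures described earlier are left out. Note that checking an HMAC chain requires the secret key, so how a verifier obtains it is a design question this sketch does not address.

```python
import hashlib
import hmac
import json
from typing import Optional

GENESIS = "0" * 64  # assumed placeholder "previous MAC" for the first record


def _mac(key: bytes, prev_mac: str, payload: dict) -> str:
    """MAC over the previous record's MAC plus this record's canonical JSON."""
    body = json.dumps(payload, sort_keys=True, separators=(",", ":")).encode()
    return hmac.new(key, prev_mac.encode() + body, hashlib.sha256).hexdigest()


def append_record(chain: list, key: bytes, payload: dict) -> None:
    """Append a record whose MAC binds it to everything written before it."""
    prev = chain[-1]["mac"] if chain else GENESIS
    chain.append({"payload": payload, "prev": prev, "mac": _mac(key, prev, payload)})


def first_broken_index(chain: list, key: bytes) -> Optional[int]:
    """Return the index of the first record that fails verification, or None."""
    prev = GENESIS
    for i, entry in enumerate(chain):
        expected = _mac(key, prev, entry["payload"])
        if entry["prev"] != prev or not hmac.compare_digest(entry["mac"], expected):
            return i
        prev = entry["mac"]
    return None
```

Editing the payload of any record makes its stored MAC wrong, so `first_broken_index` reports that position. Recomputing the MAC to hide the edit requires the key, and would still have to be repeated for every later record.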

A concrete walkthrough: AML screening in a European bank

Now put the architecture back into the bank’s workflow. Consider an AML screening workflow treated as in scope under Article 6(2) and Annex III (essential private services). The bank operates an agent that ingests transaction data, screens against sanctions lists and internal rules, scores each transaction for risk, flags exceptions to a compliance analyst for review, and produces a Suspicious Activity Report where required.

Under a glue-stack architecture, the logging situation looks roughly as follows. Transaction ingest is handled by an ETL pipeline that writes to a database. Whether the specific ingestion event is logged in a form that identifies which version of the pipeline ran and what it received depends on whether someone wrote a custom handler. Sanctions screening typically writes its result to a decisions table. The rule engine may produce a trace. Risk scoring probably does not record the model version. The human review gate is an entry in a ticketing system, not a cryptographic attestation. The SAR output is assembled and filed, but the cryptographic link between the filed SAR and the specific evidence that led to it does not exist.

If a regulator asks the bank to produce the Article 12 log for a specific transaction reviewed on a specific date, the bank has to reconstruct it from five separate systems, none of which were designed to be consistent with each other. This is what it means in practice to say the infrastructure does not have the capability. It is not that the data does not exist somewhere. It is that it cannot be produced in a form that satisfies “automatic recording” and tamper-evidence.

Under Substrate, the same workflow runs as a governed mission. When the AML agent ingests a document, the identity service signs the ingest event and writes it to the append-only log. When it queries the governed memory engine for counterparty context, the query, the result, and the memory state at query time are all recorded. When the realtime data plane rule engine evaluates the transaction, the delta is written to the deterministic log. When the risk score is calculated, the mission orchestrator records the model version, the cost, and the policy tier. When the exception is routed to the analyst, the human gate record is a signed attestation linked to the preceding event chain. When the SAR is filed, the entire chain is bundled and hash-signed by the identity service into a tamper-evident evidence pack.
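The final bundling step can be illustrated with plain hashing. The Python sketch below is an assumption-laden simplification: the function names and pack layout are invented, the events are hypothetical, and the identity service’s signature over the bundle is omitted, leaving only the digest that such a signature would cover.

```python
import hashlib
import json


def event_digest(event: dict) -> str:
    """SHA-256 over the event's canonical JSON form."""
    canonical = json.dumps(event, sort_keys=True, separators=(",", ":")).encode()
    return hashlib.sha256(canonical).hexdigest()


def build_pack(events: list) -> dict:
    """Bundle the ordered event digests under one digest covering all of them."""
    digests = [event_digest(e) for e in events]
    bundle = hashlib.sha256("".join(digests).encode()).hexdigest()
    return {"event_digests": digests, "bundle_digest": bundle}


def pack_matches(pack: dict, events: list) -> bool:
    """True only if the events reproduce the pack exactly, in the same order."""
    return build_pack(events) == pack
```

Because the bundle digest depends on every event digest in order, changing, dropping, or reordering any step of the workflow produces a pack that no longer matches the one filed with the SAR.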

The bank can produce the Article 12 log for that transaction in one query. The retention requirement is met automatically because the append-only log does not permit deletion of production records (GDPR right-to-erasure operates through a separate governance pathway that writes a cryptographic deletion record without altering the chain, as described in GDPR hard-delete and agent memory graphs).

The cryptographic identity side of this story, meaning how the identity service registers and manages agent identities over the system lifetime, is covered in detail in cryptographic agent identity and the identity service. The deterministic replay mechanism that makes the entire run reconstructable is covered in deterministic replay as the audit trail.

The full event taxonomy Article 12 requires

It is worth mapping the Article 12 requirements to specific Substrate artifacts in concrete terms. The point is to make the compliance story inspectable, not aspirational. The mapping below is illustrative but grounded in the archetype we just walked through.

Risk situations (Article 12(2)(a)): The mission orchestrator’s mission monitor watches the swarm’s output distribution and flags anomalies against the declared policy parameters. An anomaly triggers a risk-situation event in the identity service log, with the mission state at that point serialised and signed. The governed memory engine retains the memory state at anomaly time, so the input data that led to the output can be reconstructed bitemporally.

Substantial modifications (Article 12(2)(a)): When a new model version is provisioned through the agent forge, the version change is recorded as a signed commit with the previous and new version hashes. The agent forge also records prompt template changes through its policy-gated merge process. A regulator asking “what version of the model was running on this date” receives an answer from the agent forge commit log, not from a human’s recollection.
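Answering the “what version was running on this date” question from a commit log is a simple ordered lookup. The Python sketch below assumes a hypothetical `VersionCommit` record holding the previous and new version hashes, kept sorted by effective time. It is not the agent forge’s schema, and the signatures on each commit are left out.

```python
from bisect import bisect_right
from dataclasses import dataclass
from datetime import datetime
from typing import Optional


@dataclass(frozen=True)
class VersionCommit:
    effective_at: datetime
    previous_hash: Optional[str]  # None for the first deployment
    new_hash: str


def version_at(commits: list, when: datetime) -> Optional[str]:
    """Return the model version hash in force at `when`.

    Assumes `commits` is sorted by `effective_at` in ascending order.
    """
    times = [c.effective_at for c in commits]
    i = bisect_right(times, when)  # number of commits effective at or before `when`
    return commits[i - 1].new_hash if i else None
```

The previous-hash field also lets an auditor confirm the log is gap-free: each commit’s `previous_hash` should equal the `new_hash` of the commit before it.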

Post-market monitoring (Article 12(2)(b)): The realtime data plane’s incremental DBSP subscriptions allow the monitoring agent to receive real-time delta feeds on the AML agent’s output distribution without polling. Performance drift is detectable before it becomes a risk situation. The subscription itself is a signed event in the identity service log, so the monitoring regime is itself auditable.
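As a rough illustration of drift detection on an output distribution, the Python sketch below tracks the share of flagged transactions in a sliding window and compares it with a declared baseline. It does not reproduce DBSP or the realtime data plane’s subscriptions. The class name, the baseline, the tolerance, and the window size are all hypothetical choices for the example.

```python
from collections import deque


class FlagRateMonitor:
    """Signals drift when the recent flag rate strays too far from a baseline."""

    def __init__(self, baseline_rate: float, tolerance: float, window: int) -> None:
        self.baseline_rate = baseline_rate
        self.tolerance = tolerance
        self._recent = deque(maxlen=window)  # oldest observations fall out automatically

    def observe(self, flagged: bool) -> bool:
        """Record one decision; return True if the full window shows drift."""
        self._recent.append(flagged)
        if len(self._recent) < self._recent.maxlen:
            return False  # not enough data yet to judge
        rate = sum(self._recent) / len(self._recent)
        return abs(rate - self.baseline_rate) > self.tolerance
```

Each `True` returned by `observe` would be the point at which a monitoring agent writes a drift event to the log, so the detection itself becomes part of the record.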

Operational monitoring (Article 12(2)(c)): The identity service’s append-only action log provides this by construction. The log includes every tool invocation, every memory query, every model call, and every gate interaction, with timestamps, identity, and policy context. The granularity is not configurable downwards: it cannot be turned off, and it cannot be selectively filtered in production.

The 90-day pilot design

If you are a head of compliance or a CRO at a regulated firm evaluating whether your current agent infrastructure can meet the August 2026 deadline, the immediate task is to find out what evidence you can produce today. Here is a practical 90-day structure for producing it.

Days 1 to 30. Pick one high-risk workflow already in production or near production. AML screening is the natural choice for financial services because, for the reasons given earlier, it is very likely to fall within Annex III and you almost certainly have a system or a pilot running. Map every step in that workflow to the Article 12 event taxonomy. For each step, answer: does our current infrastructure produce a signed, append-only record of this event? Document the answer honestly. In almost every case, the answer for three or more steps will be no, or “we have an approximation that would not survive regulatory scrutiny.”

Days 31 to 60. Run the same workflow in a Substrate-governed environment, in parallel with the existing system. Do not replace the production system yet. Compare the two logs. The Substrate log is queryable as a single coherent artifact. The existing log requires joining across multiple systems. Produce a concrete answer to the regulator’s question: “show me everything the system knew and did on this transaction on this date.” Time how long that takes from each stack.

Days 61 to 90. Produce a draft Article 12 technical documentation appendix for the Substrate-governed version. Article 11 of the Act requires technical documentation for high-risk AI systems, and the logging architecture is part of that documentation. The appendix should include the event taxonomy, the retention configuration, the hash chain verification procedure, and the model version tracking mechanism. If this appendix can be completed in 90 days, you have a compliance-capable architecture. If it cannot, you have a gap that the August deadline will expose.

A note on the Digital Omnibus proposals. If the Commission’s proposals are eventually adopted in a form that delays high-risk obligations for SMEs or modifies Annex III, the technical debt of not having designed for Article 12 does not disappear. Regulators will have longer memories than the compliance calendar, and a post-market monitoring incident in a system that never had proper logging will be harder to defend than one that was built to produce the evidence.

What to put in an RFP

If you are evaluating AI infrastructure for a high-risk use case and you want the Article 12 compliance story to hold up, make the vendor show you the evidence path. These are the questions that separate capable systems from systems that will require expensive retrofitting.

Ask the vendor to demonstrate that every agent action is signed with a unique cryptographic identity. Watch them produce the key registration event and show you the verification chain. If the agent runs under a shared service account or an API key that any human or system could use, you do not have a signed record; you have a log entry.

Ask how the system reconstructs what the model knew at a specific timestamp. Not what the current memory contains, but what it contained then. If the answer involves querying a current-state store and acknowledging that older states are gone, you do not have the bitemporal capability Article 12 requires.

Ask to see the tamper-evident structure of the log. Ask them to simulate altering a record and show you how the verification detects it. If the demonstration is that someone would notice because the altered row would be visible in the database, you are looking at a manual detective control, not a tamper-evident log. The evidentiary purpose of Article 12 calls for the latter.

Ask how model version changes are recorded. Specifically, ask whether the model version that processed a specific transaction is recoverable from the log one year after the fact. If the answer involves application metrics dashboards rather than signed commit records, you have a gap.

Ask about air-gapped deployment. Some national supervisory authorities will eventually require that logs for certain categories of high-risk AI system reside on national or institutional infrastructure. If the logging mechanism depends on a cloud-hosted vendor service, you have an architectural constraint that no amount of SLA negotiation resolves. The sovereign deployment story for Substrate, including the air-gapped option where no data leaves the institutional perimeter, is covered in sovereign AI, air-gapped by default.

The “governed by construction” argument

There is a phrase on the Substrate homepage: “governed by construction.” It means that the governance properties, signed identity, tamper-evident logs, bitemporal memory, deterministic replay, are not features that can be switched off or that require a separate compliance overlay. They are how the system does anything at all.

This is why the phrase matters in compliance work: it describes the right architecture for Article 12. A system that can be configured to log is different from a system that cannot execute without logging. The former requires a human to maintain the logging configuration and verify it has not drifted. The latter makes the question of whether logging is happening identical to the question of whether the system is running.

For regulated enterprises whose board and audit committees need to sign off on high-risk AI systems, “governed by construction” is not a marketing position. It is the answer to the question “how do you know it is logging what Article 12 requires?” The answer is: because it cannot not log it.

The same architecture that produces the Article 12 compliance record also produces the operational data that makes the system trustworthy and improvable. The bitemporal memory that lets you reconstruct what the agent knew is also what lets you run spreading-activation explainability queries that show a regulator exactly which facts activated which decisions. The signed event log that satisfies Article 12 is also the complete mission history that lets you run a performance post-mortem on a quarterly basis. The governance overhead is not additional. It is inherent.

The August 2026 applicability date will pass. Some firms will have the log. Most will not, and will be deciding how to explain that to their national AI supervisory authority. The firms that built their agent infrastructure on a foundation designed to produce what Article 12 requires will spend that week running their agents. The others will be writing gap analysis memos.

If you want to request the investor brief, or to walk through what a Substrate-governed high-risk AI system looks like for your specific sector, the right place to start is the Substrate page. For the trade-finance evidence pack walkthrough that shows the complete Article 12 log for an AML workflow end to end, see trade finance and AML evidence pack lineage. And for the budget governance side, how the cost ceiling and the audit trail come from the same machinery, see declare the mission and the budget.

This article cites the EU AI Act (Regulation (EU) 2024/1689) as published on EUR-Lex. The 2 August 2026 applicability date refers to the obligations for high-risk AI systems under Article 6(2) and Annex III. Digital Omnibus proposals referenced are Commission proposals published February 2025 and had not entered into force at the time of writing. Penalty figures cited are statutory maximums under Article 99; actual penalties depend on national supervisory authority enforcement and case-specific factors.