
Your agent's memory is lying to you

Author: Jonathan Conway
Timestamp: 8 April 2026
Classification: ai-agents / memory / rag / vector-search / temporal-graphs

The confident lie

You only find out your agent has a memory problem when it answers with confidence and costs you trust. Your AI agent just told a paying customer they’re on the Free plan. It didn’t hallucinate. It didn’t make anything up. It retrieved a real document from your vector store, checked the similarity score, and served it with full confidence.

The problem? That document is from January. The user upgraded to Pro in March. Your vector store has both facts, and the January record scored higher because it’s surrounded by onboarding context: welcome emails, feature limitation explanations, setup guides. More text means richer embeddings. Richer embeddings mean higher similarity scores.

Your agent lied with a true document.

The stale facts problem

This is the failure mode you need to design for before the first angry ticket arrives. Any production agent system that relies on a vector store alone is vulnerable to it. The core issue is that vector databases treat memories as independent points in high-dimensional space. Each document exists in isolation. There are no relationships between them, no temporal ordering, no concept of “this superseded that.”

When you upsert a new document about the user’s Pro subscription, the old Free tier document doesn’t go away. It shouldn’t: you might need it for auditing, for understanding the user’s journey, for answering “when did they upgrade?” But at query time, both documents compete for the top-K slots on equal footing.

Metadata filters help, but they push the problem to the application layer. You end up writing bespoke timestamp comparison logic for every query type. Miss one, and your agent is lying again.
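To make the failure concrete, here is a minimal sketch in Python. The three-dimensional vectors, the exact January date, and the helper names are made up for illustration; real embeddings have hundreds of dimensions and a real store has its own API. It shows the stale record winning on similarity alone, and the kind of application-layer patch you end up writing to stop it.

```python
import math

def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm

# Toy vectors (assumed values), chosen so the older record sits closer to the query.
docs = [
    {"text": "User is on the Free plan", "recorded": "2026-01-10", "vec": [0.9, 0.4, 0.1]},
    {"text": "User upgraded to Pro", "recorded": "2026-03-03", "vec": [0.7, 0.2, 0.7]},
]
query_vec = [1.0, 0.4, 0.0]  # stands in for "what plan is the user on?"

def top_k(query, documents, k=1):
    """Plain similarity ranking: timestamps play no part."""
    return sorted(documents, key=lambda d: cosine(query, d["vec"]), reverse=True)[:k]

def top_k_latest(query, documents, k=1):
    """Application-layer patch: keep only the newest record, then rank."""
    newest = max(d["recorded"] for d in documents)
    return top_k(query, [d for d in documents if d["recorded"] == newest], k)

stale_answer = top_k(query_vec, docs)[0]["text"]
patched_answer = top_k_latest(query_vec, docs)[0]["text"]
print(stale_answer)    # the January record wins on similarity
print(patched_answer)  # the patch surfaces the March record
```

The patch only works because every document here describes the same fact. The moment the store holds unrelated facts, "keep the newest" is wrong, and you need a different rule per query type.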

The strawberry problem

Stale facts are not the only way memory breaks. Take an agent that manages both work and personal contexts.

The user has a project internally codenamed “Strawberry,” a critical Q2 initiative with weekly standups, design docs, and a Slack channel. The user also asked the agent to remind them to pick up strawberries at the farmer’s market this weekend.

In embedding space, these two concepts are neighbors. “Project Strawberry deadline moved to April” and “strawberries at the Saturday market” share enough lexical overlap to land within cosine distance ~0.3 of each other. When the user asks “what’s happening with Strawberry?”, your agent might helpfully inform them about both the project deadline and the weekend grocery run, blended into a single incoherent response.

This happens because embeddings encode lexical and semantic similarity, not domain membership. They don’t know that one “strawberry” is a codename in a work context and the other is a fruit in a personal context. A knowledge graph does, because it encodes relationships, not just content.

Why vector stores aren’t memory systems

A vector store is still useful. It answers one question well: “What documents are most similar to this query?” That’s search. But it’s not memory.

Memory needs to answer harder questions:

What’s true now? Not what was ever true.
What changed? Not what’s similar to what changed.
What did we believe on Tuesday? Point-in-time reconstruction.
How are these facts related? Graph traversal, not embedding proximity.

A flat vector store can’t answer any of these without bolting on application-level logic that grows more fragile with every new use case.

The fix: bitemporal data model

The fix starts with a simple requirement: your system has to know when a fact was true and when it learned that fact. The core insight comes from temporal databases, a concept well-established in the database world but mostly absent from AI memory systems. Every fact needs two independent timelines, which the database literature calls valid time and transaction time:

Event time: When something happened in the real world. The user upgraded on March 3rd; that’s the event time.

Ingestion time: When the system learned about it. Maybe we processed the webhook on March 3rd at 2:47 PM; that’s the ingestion time.

These two timelines are independent. You can learn about past events late (backfill). You can discover that something you recorded was wrong (correction). Having both timelines lets you answer questions that a single-timeline system cannot:

“What plan is the user on?” filters for currently-valid event time
“What did our system believe about the user at 2 PM yesterday?” filters by ingestion time
“When did we learn about the upgrade?” queries ingestion timestamps

In the governed memory engine, we implement this at the edge level of the knowledge graph. Every edge (relationship between nodes) carries four timestamps:

const std = @import("std");

pub const Edge = extern struct {
    id: u64,
    kind: EdgeKind,
    source_id: u64,
    target_id: u64,

    // Bitemporal timestamps (nanoseconds since epoch)
    t_event_valid: i64,     // when the fact became true
    t_event_invalid: i64,   // when the fact stopped being true
    t_ingest_created: i64,  // when we recorded it
    t_ingest_expired: i64,  // when we learned it was outdated

    confidence: f32,
    weight: f32,
    // ...

    pub fn isCurrentlyValid(self: Edge) bool {
        return self.t_event_invalid == std.math.maxInt(i64) and
            self.t_ingest_expired == std.math.maxInt(i64);
    }

    pub fn invalidate(self: *Edge, t_event_end: i64, t_ingest_now: i64) void {
        self.t_event_invalid = t_event_end;
        self.t_ingest_expired = t_ingest_now;
    }
};

A fact is “currently valid” only when both invalidation timestamps are still set to maxInt: the real-world fact has not ended, and we have not learned that it is outdated. Invalidation doesn’t delete data. It marks a temporal boundary.
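The same four timestamps can answer the three kinds of question listed earlier. The sketch below mirrors the Edge fields in Python. The query helpers (current, true_at, believed_at) are hypothetical names for illustration, not the engine's API, and timestamps are plain day numbers instead of nanoseconds to keep it readable. The upgrade happens on day 62 but is only ingested on day 64, so the two timelines disagree for a while.

```python
from dataclasses import dataclass

OPEN = 2**63 - 1  # plays the role of maxInt(i64): "no end recorded"

@dataclass
class Fact:
    value: str
    t_event_valid: int
    t_ingest_created: int
    t_event_invalid: int = OPEN
    t_ingest_expired: int = OPEN

def invalidate(fact, t_event_end, t_ingest_now):
    """Close both timelines; the record itself is kept."""
    fact.t_event_invalid = t_event_end
    fact.t_ingest_expired = t_ingest_now

def current(facts):
    """What is true now: both timelines still open."""
    return [f.value for f in facts
            if f.t_event_invalid == OPEN and f.t_ingest_expired == OPEN]

def true_at(facts, event_t):
    """What was true in the world at event_t, per everything known today."""
    return [f.value for f in facts
            if f.t_event_valid <= event_t < f.t_event_invalid]

def believed_at(facts, ingest_t):
    """What the system held as live at ingest_t."""
    return [f.value for f in facts
            if f.t_ingest_created <= ingest_t < f.t_ingest_expired]

free = Fact("plan=Free", t_event_valid=10, t_ingest_created=10)
invalidate(free, t_event_end=62, t_ingest_now=64)  # upgraded day 62, learned day 64
pro = Fact("plan=Pro", t_event_valid=62, t_ingest_created=64)
facts = [free, pro]

print(current(facts))          # only the Pro fact
print(true_at(facts, 63))      # the user was already on Pro on day 63
print(believed_at(facts, 63))  # but the system still believed Free that day
```

That gap between true_at and believed_at on day 63 is exactly what a single-timestamp model cannot express.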

How it works in practice

In practice, the model only helps if it is enforced at write time. When the Observer pipeline ingests a plan upgrade, two things happen in the graph store:

// 1. Invalidate the old fact
try graph_store.invalidateEdge(
    old_plan_edge_id,
    march_3rd_nanos,   // event time: when upgrade happened
    now_nanos,         // ingestion time: when we learned about it
);

// 2. Create the new fact
_ = try graph_store.createEdge(
    .semantic,          // edge kind
    user_node_id,       // source: the user
    pro_plan_node_id,   // target: the Pro plan
    march_3rd_nanos,    // event time
    now_nanos,          // ingestion time
);

Both writes go through the WAL (write-ahead log) first, then update in-memory state. That makes the update crash-safe by default.
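If write-ahead logging is new to you, the idea fits in a few lines. This is a toy Python sketch, not the engine's WAL: the TinyWal class, the JSON-lines record format, and the operation names are all assumptions for illustration. Each operation is appended and fsynced to disk before the in-memory state changes, so a restart can rebuild that state by replaying the log.

```python
import json
import os
import tempfile

class TinyWal:
    """Append-only log: a write reaches disk before memory changes."""

    def __init__(self, path):
        self.path = path
        self.state = {}  # in-memory view: edge id -> record
        self._replay()

    def _replay(self):
        """Rebuild in-memory state from the log, if one exists."""
        if not os.path.exists(self.path):
            return
        with open(self.path, "r", encoding="utf-8") as f:
            for line in f:
                self._apply(json.loads(line))

    def _apply(self, op):
        if op["op"] == "create":
            self.state[op["id"]] = {"target": op["target"], "valid": True}
        elif op["op"] == "invalidate":
            self.state[op["id"]]["valid"] = False

    def write(self, op):
        with open(self.path, "a", encoding="utf-8") as f:
            f.write(json.dumps(op) + "\n")
            f.flush()
            os.fsync(f.fileno())  # force the record to disk first
        self._apply(op)           # only then update memory

path = os.path.join(tempfile.mkdtemp(), "graph.wal")
wal = TinyWal(path)
wal.write({"op": "create", "id": 1, "target": "free_plan"})
wal.write({"op": "invalidate", "id": 1})
wal.write({"op": "create", "id": 2, "target": "pro_plan"})

recovered = TinyWal(path)  # simulates a restart: state comes from the log alone
print(recovered.state)
```

A production log would also checksum each record so a half-written final line can be detected and discarded on replay. The sketch skips that.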

From the Python SDK, the developer experience is straightforward:

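import time
# KizunaMem is provided by the Python SDK; its import path is not shown here.
# Run this inside an async function.

async with KizunaMem(endpoint, api_key, tenant_id=42) as client:
    # Observe the upgrade event
    await client.observe(
        speaker="system",
        text="User upgraded from Free plan to Pro plan",
        timestamp=int(time.time()),  # Unix time in whole seconds
    )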

    # Later: retrieve current billing context
    result = await client.retrieve("what plan is the user on?", top_k=5)

    # result.nodes contains only currently-valid facts
    # The old Free plan fact is still in the graph (for auditing)
    # but it's temporally invalidated and won't appear in results
    print(result.assembled_context)

Retrieval has to combine several signals without letting stale facts sneak back in. The retrieval pipeline uses a fusion approach: BM25 text search, temporal recency boosting, and graph traversal. For complex questions it can also decompose the query automatically: pass decompose=true in the retrieve call, and the system splits your query into sub-queries via an LLM, runs each independently, and merges the results. Critically, temporal filtering happens before ranking. Invalid facts never enter the candidate pool, so they can’t outscore current facts, no matter how rich their embeddings are.
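The ordering is the whole point, so here it is as a minimal Python sketch. The candidate dictionaries and the single score field are assumptions; score stands in for whatever the fusion step produces. The stale fact carries the higher score and still never reaches the ranker.

```python
OPEN = 2**63 - 1  # same sentinel idea as maxInt(i64)

def is_currently_valid(fact):
    """Both timelines must still be open."""
    return fact["t_event_invalid"] == OPEN and fact["t_ingest_expired"] == OPEN

def retrieve(candidates, top_k=5):
    # Step 1: hard temporal filter. Invalidated facts leave the pool here.
    live = [c for c in candidates if is_currently_valid(c)]
    # Step 2: rank only what survived the filter.
    return sorted(live, key=lambda c: c["score"], reverse=True)[:top_k]

candidates = [
    {"text": "User is on the Free plan", "score": 0.99,
     "t_event_invalid": 62, "t_ingest_expired": 64},
    {"text": "User is on the Pro plan", "score": 0.72,
     "t_event_invalid": OPEN, "t_ingest_expired": OPEN},
]

results = retrieve(candidates)
print([r["text"] for r in results])
```

Swap the two steps and the Free plan record takes the top slot, which is the bug this article opened with.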

The memory also needs to fit the context window you actually have. Results come back in content tiers: L0 is the full text, L1 is a condensed abstract, and L2 is a brief summary. The system picks the right tier based on your token budget, so you don’t blow half your context window on a single memory when a two-sentence summary would do. And if you’re debugging why certain results showed up, debug enrichment gives you activation traces: hop paths, trigger scores, the whole graph traversal receipt.
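Tier selection can be pictured as "most detail that still fits." The Python sketch below is an assumption about how such a picker might look, not the engine's logic: the pick_tier name, the word-count token estimate, and the sample texts are all illustrative.

```python
def estimate_tokens(text):
    """Crude stand-in for a tokenizer: one token per whitespace-separated word."""
    return len(text.split())

def pick_tier(memory, budget_tokens):
    """Return (tier, text) for the most detailed tier within budget, or None."""
    for tier in ("L0", "L1", "L2"):  # full text, then abstract, then summary
        text = memory[tier]
        if estimate_tokens(text) <= budget_tokens:
            return tier, text
    return None  # even the brief summary does not fit

memory = {
    "L0": "User upgraded from Free plan to Pro plan on March 3rd "
          "and the webhook was processed at 2:47 PM",
    "L1": "User upgraded from Free to Pro on March 3rd",
    "L2": "User is on Pro",
}

print(pick_tier(memory, 50))  # roomy budget: full text
print(pick_tier(memory, 10))  # tighter budget: condensed abstract
print(pick_tier(memory, 5))   # very tight budget: brief summary
```

A real implementation would count tokens with the model's own tokenizer and split the budget across several memories, not just one.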

The recency boost

Hard filtering decides what is allowed to compete. Beyond that, the governed memory engine applies a recency boost during fusion retrieval. Facts from the last 7 days get a 1.2x multiplier, facts between 8 and 30 days old are left unchanged, and facts older than 30 days get a 0.8x penalty. This is configurable, but the defaults reflect a straightforward observation: recent context is usually more relevant.

const age_days = @divTrunc(now - ts, DAY_NS);
const boost: f32 = if (age_days <= 7) 1.2
    else if (age_days <= 30) 1.0
    else 0.8;
entry.value_ptr.* *= boost;

This soft signal works alongside the hard bitemporal filtering. A fact from yesterday that’s been invalidated won’t appear at all. A valid fact from last week will rank above an equally relevant valid fact from three months ago.
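Here is the same boost worked through in Python with the default thresholds. The function name and the base score of 0.5 are illustrative assumptions. Two equally relevant facts start with the same score and end up separated by age alone.

```python
DAY_NS = 24 * 60 * 60 * 1_000_000_000  # nanoseconds per day

def recency_boost(now_ns, ts_ns):
    """Multiplier by age band: up to 7 days, 8 to 30 days, older than 30 days."""
    # Floor division matches truncating division for non-negative ages.
    age_days = (now_ns - ts_ns) // DAY_NS
    if age_days <= 7:
        return 1.2
    if age_days <= 30:
        return 1.0
    return 0.8

now = 100 * DAY_NS
last_week = now - 5 * DAY_NS          # 5 days old
three_months_ago = now - 90 * DAY_NS  # 90 days old

base_score = 0.5  # both facts are equally relevant before the boost
recent = base_score * recency_boost(now, last_week)
old = base_score * recency_boost(now, three_months_ago)
print(recent, old)  # the recent fact now outranks the old one
```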

What about the strawberry problem?

The same structure that protects time also protects context. “Project Strawberry” lives in a subgraph connected to work-related nodes: teammates, deadlines, Jira tickets, design docs. “Strawberries at the market” connects to personal nodes: grocery lists, weekend plans, the farmer’s market location.

This is how the system keeps the work strawberry away from the grocery strawberry. The retrieval engine uses spreading activation, a model of semantic memory retrieval from cognitive science (Collins & Loftus, 1975). When you query about “Project Strawberry,” activation starts at the matching node and flows outward along weighted edges. Work-adjacent nodes (teammates, deadlines, Jira tickets) reinforce each other through mutual activation. The grocery subgraph is structurally isolated from the work context, so it receives no activation from that side, and lateral inhibition suppresses it.

This isn’t an ad hoc heuristic. It mirrors how cognitive models explain disambiguating “bank” (financial) from “bank” (river): context primes one network, and inhibition suppresses the other. As far as we know, no other commercial agent memory system does this. We cover the algorithm in a separate deep dive.

This is the difference between searching for content and traversing a knowledge structure. The graph knows that these are different strawberries.
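A stripped-down version of spreading activation shows why the two strawberries stay apart. This Python sketch is a simplification under stated assumptions: the node names, edge weights, decay factor, and hop count are invented, lateral inhibition is left out entirely, and the query is assumed to have already been matched to the work node. Activation flows outward from the seed, and the grocery subgraph gets nothing because no edge leads there.

```python
from collections import defaultdict

def spread(graph, seeds, decay=0.5, hops=3):
    """graph: node -> list of (neighbor, weight). Returns node -> activation."""
    activation = defaultdict(float)
    for node, energy in seeds.items():
        activation[node] += energy
    frontier = dict(seeds)
    for _ in range(hops):
        next_frontier = defaultdict(float)
        for node, energy in frontier.items():
            for neighbor, weight in graph.get(node, []):
                # Each hop passes on a weighted, decayed share of the energy.
                next_frontier[neighbor] += energy * weight * decay
        for node, energy in next_frontier.items():
            activation[node] += energy
        frontier = next_frontier
    return dict(activation)

graph = {
    # Work subgraph
    "project_strawberry": [("deadline_april", 1.0), ("design_doc", 0.8), ("teammate", 0.8)],
    "teammate": [("project_strawberry", 0.8), ("design_doc", 0.5)],
    "design_doc": [("project_strawberry", 0.8)],
    "deadline_april": [("project_strawberry", 1.0)],
    # Personal subgraph: no edges connect it to the work nodes
    "strawberries_fruit": [("grocery_list", 1.0), ("farmers_market", 0.9)],
    "grocery_list": [("strawberries_fruit", 1.0)],
    "farmers_market": [("strawberries_fruit", 0.9)],
}

activation = spread(graph, {"project_strawberry": 1.0})
ranked = sorted(activation, key=activation.get, reverse=True)
print(ranked)  # work nodes only; the grocery nodes were never reached
```

Picking the right seed node is the part this sketch assumes away. In a real system that step has to use the surrounding context of the query, or the two strawberries collide before the graph can help.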

The bottom line

If your memory system can’t tell you what changed, it’s not a memory system. It’s a search engine wearing a trenchcoat.

The fix isn’t more sophisticated embeddings or better chunking strategies. It’s a data model that treats time as a first-class concept, where facts have lifespans, relationships encode structure, and “what’s true now” is a query the system can answer without application-level hacks.

And the retrieval algorithm matters just as much as the data model. Vector similarity finds documents that look like your query. But agent memory requires following relationships: from a billing question to a subscription to an expired credit card to the support ticket filed about it. That’s multi-hop associative retrieval, and it’s what spreading activation does. It’s one of the most influential models of human memory retrieval in cognitive science (Collins & Loftus, 1975; Anderson’s ACT-R architecture), and as far as we know, the governed memory engine is the only commercial system that uses it. We cover the algorithm in detail in our spreading activation deep dive.

That’s what we’re building with the governed memory engine. A temporal knowledge graph in Zig and Rust with spreading activation retrieval, because agents deserve better than confident lies.