
Your Agent's Memory Is Lying to You

author Jonathan Conway
timestamp 8 April 2026
classification ai-agents / memory / rag / vector-search / temporal-graphs

The Confident Lie

Your AI agent just told a paying customer they’re on the Free plan. It didn’t hallucinate. It didn’t make anything up. It retrieved a real document from your vector store, checked the similarity score, and served it with full confidence.

The problem? That document is from January. The user upgraded to Pro in March. Your vector store holds both facts, and the January record scored higher because it’s surrounded by onboarding context: welcome emails, feature limitation explanations, setup guides. All that surrounding text gives the chunk more ways to overlap with a question about plans, and the extra overlap shows up as a higher similarity score.

Your agent lied with a true document.

The Stale Facts Problem

This isn’t a hypothetical. Every production agent system built on top of a vector store is vulnerable to this. The core issue is that vector databases treat memories as independent points in high-dimensional space. Each document exists in isolation. There are no relationships between them, no temporal ordering, no concept of “this superseded that.”

When you upsert a new document about the user’s Pro subscription, the old Free tier document doesn’t go away. It shouldn’t: you might need it for auditing, for understanding the user’s journey, for answering “when did they upgrade?” But at query time, both documents compete for the top-K slots on equal footing.

Metadata filters help, but they push the problem to the application layer. You end up writing bespoke timestamp comparison logic for every query type. Miss one, and your agent is lying again.

FIG.01 / SIMILARITY-RANKED RETRIEVAL → STALE FACT (false confidence). Stale fact wins on similarity score: a query for the current plan (“what plan is the user on?”) retrieves both the January and March documents. JAN · DOC 0114 · ONBOARDING (welcome, free plan limits, setup, feature tour, FAQ, billing intro, upgrade prompts; 4,800 tokens) scores 0.847. MAR · DOC 0972 · UPGRADE EVENT (“user upgraded to pro plan”; 220 tokens) scores 0.612. The agent responds “you are on the free plan,” having retrieved 0114 with confidence 0.847; the vector store has no concept of supersession. Both facts live as independent points. The richer surrounding text wins on cosine similarity. Truth is not a function of token count.
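To make the failure concrete, here is a minimal sketch of score-only ranking next to the kind of application-layer patch described above. The document IDs and scores come from the figure; the `Hit` type, the `subject` key, and the year in the dates are assumptions made for illustration, not part of any vector store’s API.

```python
from dataclasses import dataclass
from datetime import date

@dataclass(frozen=True)
class Hit:
    doc_id: str
    subject: str       # hypothetical key naming what the fact is about
    value: str
    recorded: date
    similarity: float

def top_by_similarity(hits, k=1):
    """What a plain vector store does: rank by score alone."""
    return sorted(hits, key=lambda h: h.similarity, reverse=True)[:k]

def latest_per_subject(hits):
    """Application-layer patch: keep only the newest hit for each subject."""
    newest = {}
    for h in hits:
        current = newest.get(h.subject)
        if current is None or h.recorded > current.recorded:
            newest[h.subject] = h
    return list(newest.values())

hits = [
    Hit("0114", "user:42/plan", "free", date(2026, 1, 4), 0.847),
    Hit("0972", "user:42/plan", "pro", date(2026, 3, 3), 0.612),
]

naive = top_by_similarity(hits)[0]
patched = top_by_similarity(latest_per_subject(hits))[0]
print(naive.value, patched.value)  # free pro
```

The patch works only because every hit carries a correct `subject` and date, and because this particular query remembered to call it. That is the fragility the rest of this article is about.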

The Strawberry Problem

Stale facts aren’t the only failure mode. Take an agent that manages both work and personal contexts.

The user has a project internally codenamed “Strawberry,” a critical Q2 initiative with weekly standups, design docs, and a Slack channel. The user also asked the agent to remind them to pick up strawberries at the farmer’s market this weekend.

In embedding space, these two concepts are neighbors. “Project Strawberry deadline moved to April” and “strawberries at the Saturday market” share enough lexical overlap to land within cosine distance ~0.3 of each other. When the user asks “what’s happening with Strawberry?”, your agent might helpfully inform them about both the project deadline and the weekend grocery run, blended into a single incoherent response.

This happens because embeddings encode lexical and semantic similarity, not domain membership. They don’t know that one “strawberry” is a codename in a work context and the other is a fruit in a personal context. A knowledge graph does, because it encodes relationships, not just content.

FIG.02 / EMBEDDING SPACE ↔ GRAPH STRUCTURE (context separation). Embedding proximity vs graph membership: in embedding space the two strawberries collide. In the vector store, “project strawberry” and “grocery strawberry” sit at cosine ≈ 0.31 and the query retrieves both. In the knowledge graph they live in disjoint subgraphs, WORK (strawberry, Jira, Slack, team) and PERSONAL (strawberry, grocery, market, weekend), and activation stays inside WORK. The graph knows these are different strawberries because it knows what they are connected to.

Why Vector Stores Aren’t Memory Systems

A vector store answers one question well: “What documents are most similar to this query?” That’s search. It’s useful. But it’s not memory.

Memory needs to answer harder questions:

  • What’s true now? Not what was ever true.
  • What changed? Not what’s similar to what changed.
  • What did we believe on Tuesday? Point-in-time reconstruction.
  • How are these facts related? Graph traversal, not embedding proximity.

A flat vector store can’t answer any of these without bolting on application-level logic that grows more fragile with every new use case.

The Fix: Bitemporal Data Model

The core insight comes from temporal databases, a concept well established in the database world (where the two axes are usually called valid time and transaction time) but mostly absent from AI memory systems. Every fact needs two independent timelines:

Event time: When something happened in the real world. The user upgraded on March 3rd; that’s the event time.

Ingestion time: When the system learned about it. Maybe we processed the webhook on March 3rd at 2:47 PM; that’s the ingestion time.

These two timelines are independent. You can learn about past events late (backfill). You can discover that something you recorded was wrong (correction). Having both timelines lets you answer questions that a single-timeline system cannot:

  • “What plan is the user on?” filters for currently-valid event time
  • “What did our system believe about the user at 2 PM yesterday?” filters by ingestion time
  • “When did we learn about the upgrade?” queries ingestion timestamps

FIG.03 / BITEMPORAL EDGE TIMELINES (bitemporal model). Two independent timelines: event time tracks when facts became true in the world, ingestion time tracks when the system learned about them. Both axes run jan 04, feb 18, mar 03, apr 06, apr 28. On the event timeline, the edge plan = free runs from jan 04 until t_invalid = mar 03, where the edge plan = pro begins. On the ingestion timeline, plan = free is recorded on jan 04 and plan = pro on mar 03 at 14:47. Query for currently-valid facts: “what plan is the user on?” → pro, using the filter t_event_invalid = ∞ ∧ t_ingest_expired = ∞, so invalid edges never enter the candidate pool. Invalidation marks a temporal boundary. The Free fact is not deleted; it is bounded.
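The three questions above can be sketched in a few lines of Python. This is an illustrative model only, not Kizuna-Mem’s storage format: the `Fact` class, the `FOREVER` sentinel, the helper names, and the year in the dates are all assumptions.

```python
from dataclasses import dataclass
from datetime import datetime

FOREVER = datetime.max  # sentinel meaning "no end recorded yet"

@dataclass
class Fact:
    value: str
    event_valid: datetime              # when it became true in the world
    ingest_created: datetime           # when the system recorded it
    event_invalid: datetime = FOREVER  # when it stopped being true
    ingest_expired: datetime = FOREVER # when the system learned it was outdated

def current(facts):
    """Facts that are still true and that the system still believes."""
    return [f for f in facts
            if f.event_invalid == FOREVER and f.ingest_expired == FOREVER]

def believed_at(facts, t):
    """Facts the system held as live at ingestion time t."""
    return [f for f in facts if f.ingest_created <= t < f.ingest_expired]

free = Fact("free", datetime(2026, 1, 4), datetime(2026, 1, 4))
pro = Fact("pro", datetime(2026, 3, 3), datetime(2026, 3, 3, 14, 47))

# Recording the upgrade bounds the old fact on both timelines.
free.event_invalid = pro.event_valid
free.ingest_expired = pro.ingest_created

facts = [free, pro]
on_plan_now = [f.value for f in current(facts)]
morning_belief = [f.value for f in believed_at(facts, datetime(2026, 3, 3, 9, 0))]
learned_at = pro.ingest_created

print(on_plan_now)     # ['pro']
print(morning_belief)  # ['free']
print(learned_at)      # 2026-03-03 14:47:00
```

On the morning of March 3rd the system still believed the user was on Free, because the webhook had not been processed yet. A single-timeline model cannot reconstruct that answer.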

In Kizuna-Mem, we implement this at the edge level of the knowledge graph. Every edge (relationship between nodes) carries four timestamps:

const std = @import("std");

pub const Edge = extern struct {
    id: u64,
    kind: EdgeKind,
    source_id: u64,
    target_id: u64,

    // Bitemporal timestamps (nanoseconds since epoch)
    t_event_valid: i64,     // when the fact became true
    t_event_invalid: i64,   // when the fact stopped being true
    t_ingest_created: i64,  // when we recorded it
    t_ingest_expired: i64,  // when we learned it was outdated

    confidence: f32,
    weight: f32,
    // ...

    pub fn isCurrentlyValid(self: Edge) bool {
        return self.t_event_invalid == std.math.maxInt(i64) and
            self.t_ingest_expired == std.math.maxInt(i64);
    }

    pub fn invalidate(self: *Edge, t_event_end: i64, t_ingest_now: i64) void {
        self.t_event_invalid = t_event_end;
        self.t_ingest_expired = t_ingest_now;
    }
};

A fact is “currently valid” only when both invalidation timestamps are still set to maxInt(i64): the real-world fact has not ended, and we have not learned that our record of it is outdated. Invalidation doesn’t delete data. It marks a temporal boundary.

How It Works in Practice

When the Observer pipeline ingests a plan upgrade, two things happen in the graph store:

// 1. Invalidate the old fact
try graph_store.invalidateEdge(
    old_plan_edge_id,
    march_3rd_nanos,   // event time: when upgrade happened
    now_nanos,         // ingestion time: when we learned about it
);

// 2. Create the new fact
_ = try graph_store.createEdge(
    .semantic,          // edge kind
    user_node_id,       // source: the user
    pro_plan_node_id,   // target: the Pro plan
    march_3rd_nanos,    // event time
    now_nanos,          // ingestion time
);

Both writes go through the WAL (write-ahead log) first, then update in-memory state. Crash-safe by default.
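For readers who haven’t worked with a write-ahead log, here is a toy sketch of the ordering that sentence describes: append to disk, force the write to stable storage, and only then touch memory. The `TinyWal` class and its JSON-lines format are hypothetical and say nothing about how Kizuna-Mem’s WAL is actually laid out.

```python
import json
import os
import tempfile

class TinyWal:
    """Append each operation to disk before applying it to in-memory state."""

    def __init__(self, path):
        self.path = path
        self.state = {}
        if os.path.exists(path):
            self._replay()

    def _apply(self, op):
        self.state[op["key"]] = op["value"]

    def _replay(self):
        # On startup, rebuild memory by re-applying every logged operation.
        with open(self.path, "r", encoding="utf-8") as f:
            for line in f:
                self._apply(json.loads(line))

    def write(self, key, value):
        op = {"key": key, "value": value}
        with open(self.path, "a", encoding="utf-8") as f:
            f.write(json.dumps(op) + "\n")
            f.flush()
            os.fsync(f.fileno())  # durable on disk before memory changes
        self._apply(op)

path = os.path.join(tempfile.mkdtemp(), "graph.wal")
store = TinyWal(path)
store.write("edge:1", "invalidated")
store.write("edge:2", "user->pro")

recovered = TinyWal(path)  # simulate a restart
print(recovered.state)
```

A production log would also need checksums and handling for a partially written final record; the sketch leaves both out.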

From the Python SDK, the developer experience is straightforward:

import asyncio
import time

# KizunaMem comes from the Kizuna-Mem Python SDK; endpoint and api_key
# are the values for your own deployment.

async def main():
    async with KizunaMem(endpoint, api_key, tenant_id=42) as client:
        # Observe the upgrade event
        await client.observe(
            speaker="system",
            text="User upgraded from Free plan to Pro plan",
            timestamp=int(time.time()),
        )

        # Later: retrieve current billing context
        result = await client.retrieve("what plan is the user on?", top_k=5)

        # result.nodes contains only currently-valid facts
        # The old Free plan fact is still in the graph (for auditing)
        # but it's temporally invalidated and won't appear in results
        print(result.assembled_context)

asyncio.run(main())

The retrieval pipeline uses a fusion approach: BM25 text search, temporal recency boosting, and graph traversal. For complex questions it can also decompose the query automatically: pass decompose=true in the retrieve call, and the system splits your query into sub-queries via an LLM, runs each independently, and merges the results. Critically, temporal filtering happens before ranking. Invalid facts never enter the candidate pool, so they can’t outscore current facts, no matter how rich their embeddings are.
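The order of those two steps matters more than it might seem. The sketch below compares ranking first with filtering first on the Free/Pro example; the candidate dictionaries and function names are assumptions for illustration, and the scores are borrowed from the earlier similarity example.

```python
def rank_then_filter(candidates, k):
    """Take the top-k by score, then drop invalid facts."""
    top = sorted(candidates, key=lambda c: c["score"], reverse=True)[:k]
    return [c for c in top if c["valid"]]

def filter_then_rank(candidates, k):
    """Drop invalid facts first, then take the top-k of what remains."""
    live = [c for c in candidates if c["valid"]]
    return sorted(live, key=lambda c: c["score"], reverse=True)[:k]

candidates = [
    {"fact": "plan=free", "score": 0.847, "valid": False},
    {"fact": "plan=pro", "score": 0.612, "valid": True},
]

late = rank_then_filter(candidates, k=1)
early = filter_then_rank(candidates, k=1)
print(late)                          # []
print([c["fact"] for c in early])    # ['plan=pro']
```

Filtering after ranking lets the stale fact occupy the only slot and then vanish, leaving the agent with nothing. Filtering before ranking means the stale fact never competes.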

Results come back in content tiers: L0 is the full text, L1 is a condensed abstract, and L2 is a brief summary. The system picks the right tier based on your token budget — so you don’t blow half your context window on a single memory when a two-sentence summary would do. And if you’re debugging why certain results showed up, debug enrichment gives you activation traces: hop paths, trigger scores, the whole graph traversal receipt.
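One way to picture tier selection is “pick the richest tier that still fits.” The sketch below is an assumption about how such a rule could look, not the selection logic Kizuna-Mem ships: `pick_tier` and `count_tokens` are hypothetical names, and the whitespace word count is a crude stand-in for a real tokenizer.

```python
def count_tokens(text):
    # Crude stand-in for a real tokenizer: one token per whitespace-separated word.
    return len(text.split())

def pick_tier(tiers, budget_tokens):
    """Return the name of the richest tier that fits the budget, or None."""
    for name in ("L0", "L1", "L2"):  # richest first
        if count_tokens(tiers[name]) <= budget_tokens:
            return name
    return None

memory = {
    "L0": "User upgraded from Free plan to Pro plan. The upgrade happened "
          "on March 3rd and the webhook was processed at 2:47 PM.",
    "L1": "User upgraded from Free to Pro on March 3rd.",
    "L2": "User is on Pro.",
}

for budget in (50, 10, 5, 2):
    print(budget, pick_tier(memory, budget))
```

With a generous budget the full text is returned; as the budget shrinks, the same memory degrades to its abstract, then its summary, and finally drops out.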

The Recency Boost

Beyond hard temporal filtering, Kizuna-Mem applies a recency boost during fusion retrieval. Facts from the last 7 days get a 1.2x multiplier, facts between 8 and 30 days old keep their score unchanged, and facts older than 30 days get a 0.8x penalty. This is configurable, but the defaults reflect a straightforward observation: recent context is usually more relevant.

const age_days = @divTrunc(now - ts, DAY_NS);
const boost: f32 = if (age_days <= 7) 1.2
    else if (age_days <= 30) 1.0
    else 0.8;
entry.value_ptr.* *= boost;

This soft signal works alongside the hard bitemporal filtering. A fact from yesterday that’s been invalidated won’t appear at all. A valid fact from last week will rank above an equally-relevant valid fact from three months ago.
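Here is the same boost as a small Python function with a worked example. It mirrors the Zig snippet above; the function name and the sample timestamps are assumptions, and the floor division matches `@divTrunc` only because ages are assumed to be non-negative.

```python
DAY_NS = 24 * 60 * 60 * 1_000_000_000  # nanoseconds per day

def recency_boost(now_ns, ts_ns):
    """Multiplier for a fact recorded at ts_ns; assumes ts_ns <= now_ns."""
    age_days = (now_ns - ts_ns) // DAY_NS  # whole days, fraction discarded
    if age_days <= 7:
        return 1.2
    if age_days <= 30:
        return 1.0
    return 0.8

now = 100 * DAY_NS
last_week = recency_boost(now, now - 6 * DAY_NS)
last_month = recency_boost(now, now - 20 * DAY_NS)
three_months = recency_boost(now, now - 90 * DAY_NS)

# Two equally relevant valid facts with the same base score.
base = 0.5
print(base * last_week > base * three_months)  # True
```

Because the age is truncated to whole days, a fact that is seven days and 23 hours old still lands in the 1.2x band.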

What About the Strawberry Problem?

The graph structure handles this naturally. “Project Strawberry” lives in a subgraph connected to work-related nodes: teammates, deadlines, Jira tickets, design docs. “Strawberries at the market” connects to personal nodes: grocery lists, weekend plans, the farmer’s market location.

The retrieval engine uses spreading activation — an algorithm from cognitive science (Collins & Loftus, 1975) that models how human memory actually works. When you query about “Project Strawberry,” activation starts at the matching node and flows outward along weighted edges. Work-adjacent nodes (teammates, deadlines, Jira tickets) reinforce each other through mutual activation. The grocery subgraph is structurally isolated from the work context, so lateral inhibition suppresses it.

This isn’t a heuristic. It’s the same mechanism your brain uses to disambiguate “bank” (financial) from “bank” (river): context primes one network, and inhibition suppresses the other. No commercial agent memory system does this except Kizuna-Mem. We wrote a full deep dive on the algorithm in a separate post.

This is the difference between searching for content and traversing a knowledge structure. The graph knows that these are different strawberries.

FIG.04 / SPREADING ACTIVATION + LATERAL INHIBITION (spreading activation). Activation propagates from the matched node along weighted edges. Adjacent nodes reinforce each other; the disjoint subgraph is suppressed. Activation levels: strawberry α 1.00, deadline α 0.78, teammates α 0.74, jira α 0.62, followed by design doc, standup, Slack channel, on-call, and epic; market α 0.05, suppressed by inhibition. Parameters: decay 0.78^hop; edge weights of 5 kinds; BLA (α + temporal); cross-cluster inhibition. Activation flows along weighted edges. The grocery cluster never lights up: it is structurally disjoint.
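A stripped-down sketch of the spreading step shows why a disjoint subgraph stays dark. It models only decayed propagation along weighted edges, with the 0.78 per-hop decay taken from the figure; lateral inhibition is not modeled, and the node names, edge weights, cutoff, and function names are assumptions rather than Kizuna-Mem’s implementation.

```python
def undirected(edges):
    """Build an adjacency map {node: [(neighbor, weight)]} from edge triples."""
    graph = {}
    for a, b, w in edges:
        graph.setdefault(a, []).append((b, w))
        graph.setdefault(b, []).append((a, w))
    return graph

def spread(graph, seeds, decay=0.78, max_hops=3, floor=0.05):
    """Propagate activation outward from the seeds, keeping each node's maximum."""
    activation = dict(seeds)
    frontier = dict(seeds)
    for _ in range(max_hops):
        next_frontier = {}
        for node, level in frontier.items():
            for neighbor, weight in graph.get(node, []):
                passed = level * weight * decay
                if passed < floor:
                    continue  # too weak to keep spreading
                if passed > activation.get(neighbor, 0.0):
                    activation[neighbor] = passed
                    next_frontier[neighbor] = passed
        if not next_frontier:
            break
        frontier = next_frontier
    return activation

graph = undirected([
    ("project_strawberry", "deadline", 1.0),
    ("project_strawberry", "teammates", 0.9),
    ("project_strawberry", "jira", 0.8),
    ("jira", "design_doc", 0.9),
    ("grocery_strawberries", "market", 1.0),
    ("market", "weekend", 0.9),
])

result = spread(graph, {"project_strawberry": 1.0})
for node, level in sorted(result.items(), key=lambda kv: -kv[1]):
    print(f"{node:20s} {level:.2f}")
```

Seeding the work node activates deadline, teammates, jira, and (one hop further) the design doc. The grocery nodes never appear in the result, because no edge connects them to anything that was activated.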

The Bottom Line

If your memory system can’t tell you what changed, it’s not a memory system. It’s a search engine wearing a trenchcoat.

The fix isn’t more sophisticated embeddings or better chunking strategies. It’s a data model that treats time as a first-class concept, where facts have lifespans, relationships encode structure, and “what’s true now” is a query the system can answer without application-level hacks.

And the retrieval algorithm matters just as much as the data model. Vector similarity finds documents that look like your query. But agent memory requires following relationships — from a billing question to a subscription to an expired credit card to the support ticket filed about it. That’s multi-hop associative retrieval, and it’s what spreading activation does. It’s the dominant model of human memory retrieval in cognitive science (Collins & Loftus, 1975; Anderson’s ACT-R architecture), and Kizuna-Mem is the only commercial system that uses it. We cover the algorithm in detail in our spreading activation deep dive.

That’s what we’re building with Kizuna-Mem. A temporal knowledge graph in Zig and Rust with spreading activation retrieval, because agents deserve better than confident lies.