Your Agent's Memory Is Lying to You
The Confident Lie
Your AI agent just told a paying customer they’re on the Free plan. It didn’t hallucinate. It didn’t make anything up. It retrieved a real document from your vector store, checked the similarity score, and served it with full confidence.
The problem? That document is from January. The user upgraded to Pro in March. Your vector store has both facts, and the January record scored higher because it’s surrounded by onboarding context: welcome emails, feature limitation explanations, setup guides. All that extra text gives the stale record more ways to match a billing question, and nothing in a similarity score says which record is newer.
Your agent lied with a true document.
The Stale Facts Problem
This isn’t a hypothetical. Every production agent system built on top of a vector store is vulnerable to this. The core issue is that vector databases treat memories as independent points in high-dimensional space. Each document exists in isolation. There are no relationships between them, no temporal ordering, no concept of “this superseded that.”
When you upsert a new document about the user’s Pro subscription, the old Free tier document doesn’t go away. It shouldn’t: you might need it for auditing, for understanding the user’s journey, for answering “when did they upgrade?” But at query time, both documents compete for the top-K slots on equal footing.
Metadata filters help, but they push the problem to the application layer. You end up writing bespoke timestamp comparison logic for every query type. Miss one, and your agent is lying again.
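To make that concrete, here is a minimal Python sketch of the kind of filter you end up writing by hand. The Hit type, its attribute and recorded_at metadata fields, and the scores are all made up for illustration; they are not taken from any particular vector store's API.

```python
from dataclasses import dataclass


@dataclass
class Hit:
    text: str
    score: float      # similarity score (illustrative values)
    attribute: str    # hypothetical metadata field, e.g. "plan"
    recorded_at: int  # hypothetical metadata field, Unix seconds


def latest_per_attribute(hits):
    """Keep only the newest hit for each attribute, ignoring similarity."""
    newest = {}
    for hit in hits:
        current = newest.get(hit.attribute)
        if current is None or hit.recorded_at > current.recorded_at:
            newest[hit.attribute] = hit
    return list(newest.values())


hits = [
    Hit("User is on the Free plan", 0.91, "plan", 1_736_000_000),
    Hit("User upgraded to Pro", 0.84, "plan", 1_741_000_000),
]

# Ranking by similarity alone serves the stale record.
top_by_score = max(hits, key=lambda h: h.score)

# The hand-written timestamp rule picks the newer one.
top_after_filter = latest_per_attribute(hits)[0]

print(top_by_score.text)
print(top_after_filter.text)
```

This handles one attribute and one query shape. Each new fact type needs its own version of the rule, which is where the fragility comes from.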
The Strawberry Problem
Stale facts aren’t the only failure mode. Take an agent that manages both work and personal contexts.
The user has a project internally codenamed “Strawberry,” a critical Q2 initiative with weekly standups, design docs, and a Slack channel. The user also asked the agent to remind them to pick up strawberries at the farmer’s market this weekend.
In embedding space, these two concepts are neighbors. “Project Strawberry deadline moved to April” and “strawberries at the Saturday market” share enough lexical overlap to land within cosine distance ~0.3 of each other. When the user asks “what’s happening with Strawberry?”, your agent might helpfully inform them about both the project deadline and the weekend grocery run, blended into a single incoherent response.
This happens because embeddings encode lexical and semantic similarity, not domain membership. They don’t know that one “strawberry” is a codename in a work context and the other is a fruit in a personal context. A knowledge graph does, because it encodes relationships, not just content.
Why Vector Stores Aren’t Memory Systems
A vector store answers one question well: “What documents are most similar to this query?” That’s search. It’s useful. But it’s not memory.
Memory needs to answer harder questions:
- What’s true now? Not what was ever true.
- What changed? Not what’s similar to what changed.
- What did we believe on Tuesday? Point-in-time reconstruction.
- How are these facts related? Graph traversal, not embedding proximity.
A flat vector store can’t answer any of these without bolting on application-level logic that grows more fragile with every new use case.
The Fix: Bitemporal Data Model
The core insight comes from temporal databases, a concept well-established in the database world but mostly absent from AI memory systems. Every fact needs two independent timelines:
Event time: When something happened in the real world. The user upgraded on March 3rd; that’s the event time.
Ingestion time: When the system learned about it. Maybe we processed the webhook on March 3rd at 2:47 PM; that’s the ingestion time.
These two timelines are independent. You can learn about past events late (backfill). You can discover that something you recorded was wrong (correction). Having both timelines lets you answer questions that a single-timeline system cannot:
- “What plan is the user on?” filters for currently-valid event time
- “What did our system believe about the user at 2 PM yesterday?” filters by ingestion time
- “When did we learn about the upgrade?” queries ingestion timestamps
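The three questions above can be sketched in a few lines of Python. This is an illustration of the idea, not Kizuna-Mem code: the Fact type and helper names are assumptions, and times are plain day numbers rather than nanoseconds to keep the example readable. The scenario assumes the upgrade happened on day 62 but the webhook was only processed on day 64.

```python
from dataclasses import dataclass

FOREVER = 2**63 - 1  # sentinel for "interval still open"


@dataclass
class Fact:
    value: str
    event_valid: int     # when the fact became true in the real world
    event_invalid: int   # when it stopped being true
    ingest_created: int  # when the system recorded it
    ingest_expired: int  # when the system learned it was outdated


def current_facts(facts):
    """What is true now: both intervals are still open."""
    return [f.value for f in facts
            if f.event_invalid == FOREVER and f.ingest_expired == FOREVER]


def believed_at(facts, t):
    """What the system believed at time t: filter by ingestion time only."""
    return [f.value for f in facts
            if f.ingest_created <= t < f.ingest_expired]


def learned_at(facts, value):
    """When the system first recorded a given fact."""
    return min(f.ingest_created for f in facts if f.value == value)


facts = [
    Fact("Free", event_valid=10, event_invalid=62,
         ingest_created=10, ingest_expired=64),
    Fact("Pro", event_valid=62, event_invalid=FOREVER,
         ingest_created=64, ingest_expired=FOREVER),
]

print(current_facts(facts))      # ['Pro']
print(believed_at(facts, 63))    # ['Free']
print(learned_at(facts, "Pro"))  # 64
```

On day 63 the user was already on Pro, but the system still believed Free. A single timeline cannot represent that gap.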
In Kizuna-Mem, we implement this at the edge level of the knowledge graph. Every edge (relationship between nodes) carries four timestamps:
const std = @import("std");

pub const Edge = extern struct {
    id: u64,
    kind: EdgeKind,
    source_id: u64,
    target_id: u64,

    // Bitemporal timestamps (nanoseconds since epoch)
    t_event_valid: i64, // when the fact became true
    t_event_invalid: i64, // when the fact stopped being true
    t_ingest_created: i64, // when we recorded it
    t_ingest_expired: i64, // when we learned it was outdated

    confidence: f32,
    weight: f32,
    // ...

    // True only while both the event and ingestion intervals are still open.
    pub fn isCurrentlyValid(self: Edge) bool {
        return self.t_event_invalid == std.math.maxInt(i64) and
            self.t_ingest_expired == std.math.maxInt(i64);
    }

    // Closes both intervals; the edge itself stays in the graph.
    pub fn invalidate(self: *Edge, t_event_end: i64, t_ingest_now: i64) void {
        self.t_event_invalid = t_event_end;
        self.t_ingest_expired = t_ingest_now;
    }
};
A fact is “currently valid” only when both invalidation timestamps are set to maxInt, meaning neither the real-world event has ended, nor have we learned that it’s outdated. Invalidation doesn’t delete data. It marks a temporal boundary.
How It Works in Practice
When the Observer pipeline ingests a plan upgrade, two things happen in the graph store:
// 1. Invalidate the old fact
try graph_store.invalidateEdge(
    old_plan_edge_id,
    march_3rd_nanos, // event time: when upgrade happened
    now_nanos, // ingestion time: when we learned about it
);

// 2. Create the new fact
_ = try graph_store.createEdge(
    .semantic, // edge kind
    user_node_id, // source: the user
    pro_plan_node_id, // target: the Pro plan
    march_3rd_nanos, // event time
    now_nanos, // ingestion time
);
Both writes go through the WAL (write-ahead log) first, then update in-memory state. Crash-safe by default.
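If the log-first ordering is unfamiliar, this toy Python sketch shows the general write-ahead pattern. It is not how Kizuna-Mem's WAL is implemented; the TinyStore class, the JSON-lines format, and the key names are assumptions chosen to keep the example short.

```python
import json
import os
import tempfile


class TinyStore:
    """Toy key-value store that logs each write before applying it."""

    def __init__(self, wal_path):
        self.wal_path = wal_path
        self.state = {}
        self._replay()

    def _replay(self):
        # Rebuild in-memory state from the log after a restart or crash.
        if not os.path.exists(self.wal_path):
            return
        with open(self.wal_path, "r", encoding="utf-8") as wal:
            for line in wal:
                record = json.loads(line)
                self.state[record["key"]] = record["value"]

    def put(self, key, value):
        # 1. Append the record to the log and force it to disk.
        with open(self.wal_path, "a", encoding="utf-8") as wal:
            wal.write(json.dumps({"key": key, "value": value}) + "\n")
            wal.flush()
            os.fsync(wal.fileno())
        # 2. Only then update in-memory state.
        self.state[key] = value


wal_file = os.path.join(tempfile.mkdtemp(), "wal.jsonl")
store = TinyStore(wal_file)
store.put("edge:1", {"plan": "free", "valid": False})
store.put("edge:2", {"plan": "pro", "valid": True})

# Simulate a restart: a fresh instance recovers state from the log alone.
recovered = TinyStore(wal_file)
print(recovered.state == store.state)  # True
```

A real log also has to cope with a torn final record and with compaction, which this sketch leaves out.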
From the Python SDK, the developer experience is straightforward:
import time

# Runs inside an async function; endpoint and api_key are defined elsewhere.
async with KizunaMem(endpoint, api_key, tenant_id=42) as client:
    # Observe the upgrade event
    await client.observe(
        speaker="system",
        text="User upgraded from Free plan to Pro plan",
        timestamp=int(time.time()),
    )

    # Later: retrieve current billing context
    result = await client.retrieve("what plan is the user on?", top_k=5)

    # result.nodes contains only currently-valid facts
    # The old Free plan fact is still in the graph (for auditing)
    # but it's temporally invalidated and won't appear in results
    print(result.assembled_context)
The retrieval pipeline uses a fusion approach: BM25 text search, temporal recency boosting, graph traversal, and, for complex questions, automatic query decomposition (pass decompose=True in the retrieve call, and the system splits your query into sub-queries via LLM, runs each independently, and merges results). Critically, temporal filtering happens before ranking. Invalid facts never enter the candidate pool. They can’t outscore current facts, no matter how rich their embeddings are.
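The ordering matters more than the scoring function, so here is a small Python sketch of filter-then-rank. The Candidate type, the scores, and the retrieve helper are illustrative assumptions; the real pipeline fuses several signals rather than sorting on a single score.

```python
from dataclasses import dataclass

FOREVER = 2**63 - 1  # sentinel for "interval still open"


@dataclass
class Candidate:
    text: str
    score: float          # relevance score (illustrative values)
    event_invalid: int    # FOREVER while the fact is still true
    ingest_expired: int   # FOREVER while the record is still live


def is_currently_valid(c):
    return c.event_invalid == FOREVER and c.ingest_expired == FOREVER


def retrieve(candidates, top_k):
    # Hard filter first: invalidated facts never reach the ranking step.
    live = [c for c in candidates if is_currently_valid(c)]
    # Rank only the survivors.
    return sorted(live, key=lambda c: c.score, reverse=True)[:top_k]


candidates = [
    Candidate("User is on the Free plan", 0.95, 62, 64),
    Candidate("User is on the Pro plan", 0.80, FOREVER, FOREVER),
]

results = retrieve(candidates, top_k=1)
print(results[0].text)  # User is on the Pro plan
```

The stale record has the higher score, and it still loses, because it was removed before scores were compared.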
Results come back in content tiers: L0 is the full text, L1 is a condensed abstract, and L2 is a brief summary. The system picks the right tier based on your token budget — so you don’t blow half your context window on a single memory when a two-sentence summary would do. And if you’re debugging why certain results showed up, debug enrichment gives you activation traces: hop paths, trigger scores, the whole graph traversal receipt.
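One plausible way to pick a tier is to take the most detailed one that fits the budget. The Python sketch below assumes exactly that rule, plus a crude whitespace token count; the pick_tier function and the sample memory are hypothetical and not the system's actual selection logic.

```python
def pick_tier(tiers, token_budget, count_tokens=lambda s: len(s.split())):
    """Return (tier_name, text) for the most detailed tier within budget."""
    for name in ("L0", "L1", "L2"):  # most detailed first
        text = tiers[name]
        if count_tokens(text) <= token_budget:
            return name, text
    # Nothing fits: fall back to the shortest tier.
    return "L2", tiers["L2"]


memory = {
    "L0": "User upgraded from the Free plan to the Pro plan on March 3rd",
    "L1": "User upgraded to Pro on March 3rd",
    "L2": "User is on Pro",
}

print(pick_tier(memory, token_budget=20)[0])  # L0
print(pick_tier(memory, token_budget=8)[0])   # L1
print(pick_tier(memory, token_budget=5)[0])   # L2
```

A production version would count tokens with the target model's tokenizer instead of splitting on whitespace.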
The Recency Boost
Beyond hard temporal filtering, Kizuna-Mem applies a recency boost during fusion retrieval. Facts from the last 7 days get a 1.2x multiplier, facts between 8 and 30 days old are left unchanged, and facts older than 30 days get a 0.8x penalty. This is configurable, but the defaults reflect a straightforward observation: recent context is usually more relevant.
const age_days = @divTrunc(now - ts, DAY_NS);
const boost: f32 = if (age_days <= 7) 1.2
    else if (age_days <= 30) 1.0
    else 0.8;
entry.value_ptr.* *= boost;
This soft signal works alongside the hard bitemporal filtering. A fact from yesterday that’s been invalidated won’t appear at all. A valid fact from last week will rank above an equally-relevant valid fact from three months ago.
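Here is the same boost as a Python sketch with a worked example, using the default thresholds quoted above. The function name and the sample scores are illustrative, and ages are assumed to be non-negative.

```python
DAY_NS = 24 * 60 * 60 * 1_000_000_000  # nanoseconds in one day


def recency_boost(now_ns, ts_ns):
    """Multiplier for a fact's score based on its age in whole days."""
    age_days = (now_ns - ts_ns) // DAY_NS
    if age_days <= 7:
        return 1.2
    if age_days <= 30:
        return 1.0
    return 0.8


now = 100 * DAY_NS  # arbitrary reference point

# Two equally relevant, still-valid facts with the same base score.
base_score = 0.5
last_week = base_score * recency_boost(now, now - 3 * DAY_NS)
three_months = base_score * recency_boost(now, now - 90 * DAY_NS)

print(last_week > three_months)  # True
```

With equal base scores the newer fact wins, but a sufficiently stronger match from three months ago can still outrank it. That is the point of a soft signal.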
What About the Strawberry Problem?
The graph structure handles this naturally. “Project Strawberry” lives in a subgraph connected to work-related nodes: teammates, deadlines, Jira tickets, design docs. “Strawberries at the market” connects to personal nodes: grocery lists, weekend plans, the farmer’s market location.
The retrieval engine uses spreading activation, a model of semantic memory retrieval from cognitive science (Collins & Loftus, 1975). When you query about “Project Strawberry,” activation starts at the matching node and flows outward along weighted edges. Work-adjacent nodes (teammates, deadlines, Jira tickets) reinforce each other through mutual activation. The grocery subgraph is structurally isolated from the work context, so lateral inhibition suppresses it.
This isn’t an ad hoc heuristic. It follows the account cognitive science gives of how people disambiguate “bank” (financial) from “bank” (river): context primes one network, and inhibition suppresses the other. To our knowledge, no other commercial agent memory system does this. We wrote a full deep dive on the algorithm in a separate post.
This is the difference between searching for content and traversing a knowledge structure. The graph knows that these are different strawberries.
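A stripped-down Python sketch shows why graph structure separates the two strawberries. It implements only the basic spreading step (activation times edge weight times a decay factor, for a fixed number of hops); lateral inhibition is left out, and the node names, weights, and parameters are invented for the example rather than taken from Kizuna-Mem.

```python
from collections import defaultdict


def undirected(edges):
    """Build an adjacency map from (node_a, node_b, weight) triples."""
    graph = defaultdict(dict)
    for a, b, weight in edges:
        graph[a][b] = weight
        graph[b][a] = weight
    return graph


def spread(graph, seeds, decay=0.5, hops=2):
    """Propagate activation outward from the seed nodes."""
    activation = defaultdict(float)
    for node, energy in seeds.items():
        activation[node] += energy
    frontier = dict(seeds)
    for _ in range(hops):
        next_frontier = defaultdict(float)
        for node, energy in frontier.items():
            for neighbor, weight in graph.get(node, {}).items():
                next_frontier[neighbor] += energy * weight * decay
        for node, energy in next_frontier.items():
            activation[node] += energy
        frontier = next_frontier
    return dict(activation)


graph = undirected([
    # Work subgraph
    ("Project Strawberry", "Q2 deadline", 1.0),
    ("Project Strawberry", "weekly standup", 0.8),
    ("weekly standup", "design doc", 0.5),
    # Personal subgraph, with no edges into the work subgraph
    ("strawberries (fruit)", "farmer's market", 1.0),
    ("farmer's market", "weekend plans", 0.7),
])

activation = spread(graph, {"Project Strawberry": 1.0})
for node, energy in sorted(activation.items(), key=lambda kv: -kv[1]):
    print(f"{node}: {energy:.2f}")
```

Only work nodes appear in the output. The grocery nodes never receive activation because no path connects them to the seed, regardless of how similar their text is.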
The Bottom Line
If your memory system can’t tell you what changed, it’s not a memory system. It’s a search engine wearing a trenchcoat.
The fix isn’t more sophisticated embeddings or better chunking strategies. It’s a data model that treats time as a first-class concept, where facts have lifespans, relationships encode structure, and “what’s true now” is a query the system can answer without application-level hacks.
And the retrieval algorithm matters just as much as the data model. Vector similarity finds documents that look like your query. But agent memory requires following relationships: from a billing question to a subscription to an expired credit card to the support ticket filed about it. That’s multi-hop associative retrieval, and it’s what spreading activation does. It’s a long-established model of human memory retrieval in cognitive science (Collins & Loftus, 1975; Anderson’s ACT-R architecture), and to our knowledge Kizuna-Mem is the only commercial system that uses it. We cover the algorithm in detail in our spreading activation deep dive.
That’s what we’re building with Kizuna-Mem. A temporal knowledge graph in Zig and Rust with spreading activation retrieval, because agents deserve better than confident lies.