Article

Nobody Else Is Tuning Their Memory Engine

author Jonathan Conway
timestamp 11 April 2026
classification kizuna-tune / bohb / thompson-sampling / lambdamart / learning-to-rank / agent-memory / retrieval

Pick any agent memory system off the shelf and look at its retrieval configuration. You’ll find a handful of magic numbers: a fusion weight between BM25 and vector similarity, a top-K, maybe a recency decay. Where did those numbers come from? In almost every case, the answer is “someone on the team picked them, ran a few queries, and shipped.”

That’s how the field works right now. Retrieval quality in agent memory is governed by parameters nobody has actually optimized. We built kizuna-tune to do something about that, and the techniques we chose – BOHB, query-adaptive routing, Thompson Sampling bandits, and LambdaMART reranking – have decades of research behind them. None of them are novel. What’s novel is applying them to an agent memory engine.

This post is the background: where these optimizations come from, why they work, and what the rest of the field is doing instead.


The problem with hand-tuned defaults

Spreading activation has roughly 20 tunable parameters. Some are boring (BM25 k1 and b, which decades of IR research settled long ago). Some matter a lot. The ones that matter include:

  • Decay rate. The retention multiplier applied at each hop. Too low and the traversal dies after one step. Too high and activation floods irrelevant nodes.
  • Propagation depth. How many hops the traversal can take before it stops. Two hops is enough for entity lookup. Five hops is needed for multi-hop reasoning.
  • Fusion weights. How much BM25, vector similarity, and graph activation each contribute to the final score. These must sum to 1.0, and the optimal split depends on the query type and the graph shape.
  • Edge type weights. Kizuna-Mem has five edge types (episodic, semantic, temporal, hierarchical, associative). Treating them all the same is almost certainly wrong. Semantic edges carry different information than temporal ones.
  • BLA alpha. The interpolation between structural relevance and temporal recency, borrowed from ACT-R’s Base-Level Activation.

When we shipped Kizuna-Mem’s first version, these were set by running our param_sweep.py against the LoCoMo benchmark and picking values that looked good. It worked. It was also clearly a local optimum. A grid search with 5 values per dimension across 11 parameters would require 48 million benchmark runs. At 60 seconds per run, that’s 93 years of compute. Grid search isn’t an option.
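The arithmetic behind those figures is easy to check. A quick sketch, with variable names chosen here purely for illustration:

```python
# Grid-search cost, using the figures quoted above.
values_per_dim = 5
num_params = 11
seconds_per_run = 60

runs = values_per_dim ** num_params            # 48,828,125 full benchmark runs
total_seconds = runs * seconds_per_run
years = total_seconds / (365.25 * 24 * 3600)   # a little under 93 years

print(f"{runs:,} runs, about {years:.0f} years of compute")
```

Every extra parameter multiplies the run count by five, which is why the sweep needs an optimizer rather than a bigger cluster.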

Neither is “guess and check.” The parameter space has interactions: increasing propagation depth makes decay rate more important, because deep traversals amplify small decay differences. You can’t optimize these parameters one at a time.

What you can do is use a smart optimizer that knows how to explore structured spaces. That’s what Phase 1 of kizuna-tune is.


Why BOHB

The optimizer we picked is BOHB (Bayesian Optimization + HyperBand), from the AutoML Freiburg group. The original paper is from 2018, and the reference implementation (hpbandster) is battle-tested on exactly this kind of overnight search.

BOHB combines two ideas. The first is a KDE-based surrogate model, similar to TPE’s Parzen Estimators: it learns distributions over “good” and “bad” configurations and samples new candidates from the ratio. The second is HyperBand’s multi-fidelity scheduling: instead of running every trial at full cost, cheap evaluations are used to screen out obvious losers, and only promising configs get promoted to expensive evaluations.

The multi-fidelity part is what makes this affordable for us. Kizuna-Mem has three natural evaluation tiers already built into the test infrastructure:

Fidelity  What runs                  Time per trial
Low       5 smoke tests              ~5 seconds
Medium    ~20 activation unit tests  ~10 seconds
High      Full LoCoMo benchmark      ~60 seconds

BOHB samples 27 configurations, evaluates all 27 at low fidelity, promotes the top 9 to medium, promotes the top 3 to high. Roughly 550 test evaluations instead of 2,100 for equivalent exploration. At a 300-trial budget, that brings the full sweep in under four hours instead of the 12+ hours an equivalent exploration at full fidelity would cost. Overnight runs finish overnight.

FIG.05 / BOHB multi-fidelity promotion. 27 candidates are evaluated at low fidelity (5 smoke tests, ~5s each, ≈ 135s total), the top 9 are promoted to medium (20 unit tests, ~10s each, ≈ 90s total), and the top 3 are promoted to the high-fidelity LoCoMo evaluation (~60s each, ≈ 180s total), for ≈ 405s per band. Repeated across bands, a 300-trial budget fits in under 4 hours. Cheap evaluations screen out obvious losers. Only the survivors pay the LoCoMo cost.
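The 27 → 9 → 3 promotion can be sketched in a few lines. This covers only the successive-halving half of BOHB, not the KDE surrogate, and it is not the hpbandster implementation. The `evaluate` callback and the `run_band` name are hypothetical; the per-tier costs are the approximate times from the table above.

```python
# One promotion band: evaluate everyone cheaply, keep the top third, repeat.
# `evaluate` is a hypothetical callback: (config, tier_name) -> score, higher is better.
TIERS = [("low", 5), ("medium", 10), ("high", 60)]  # (name, approx seconds per trial)

def run_band(configs, evaluate, eta=3):
    survivors = list(configs)
    seconds_spent = 0
    for i, (tier, cost) in enumerate(TIERS):
        # Rank the current survivors at this fidelity and pay for each evaluation.
        survivors = sorted(survivors, key=lambda c: evaluate(c, tier), reverse=True)
        seconds_spent += cost * len(survivors)
        if i < len(TIERS) - 1:
            # Promote the top 1/eta of this tier to the next, more expensive one.
            survivors = survivors[: max(1, len(survivors) // eta)]
    return survivors, seconds_spent
```

With 27 starting configurations this spends 27 × 5 + 9 × 10 + 3 × 60 = 405 seconds per band, against 27 × 60 = 1,620 seconds if every candidate ran the full benchmark.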

We looked hard at Optuna with TPE + HyperBand pruning. Optuna is the more popular library in 2026 – easier install, nicer dashboard, bigger community. The reason we didn’t pick it: Optuna’s HyperBand pruner bolts pruning onto an optimizer that was never designed for it. BOHB was designed from the start for the cheap/expensive tier structure we have. The original BOHB benchmarks showed 3-9x speedup over vanilla Bayesian optimization on problems with similar dimensionality to ours. That margin is worth the rougher tooling.

One more thing BOHB handles better: our fusion weights must sum to 1.0 with a minimum floor of 0.05. That’s a constrained space, and the KDE surrogate handles it by rejection sampling without any special-case code. TPE needs custom constraint handling for this, which is an easy place to introduce bugs.
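To make the constraint concrete, here is a minimal sketch of rejection sampling over the weight simplex. It is not BOHB's or hpbandster's internal code: a flat Dirichlet stands in for the KDE proposal, and the function name and `max_tries` guard are assumptions made for this example.

```python
import random

def sample_fusion_weights(n=3, floor=0.05, rng=random, max_tries=10_000):
    """Draw n weights that sum to 1.0 with every weight >= floor, by rejection."""
    for _ in range(max_tries):
        # Normalised Gamma(1, 1) draws are uniform on the simplex (a flat Dirichlet).
        draws = [rng.gammavariate(1.0, 1.0) for _ in range(n)]
        total = sum(draws)
        weights = [d / total for d in draws]
        if min(weights) >= floor:
            return weights  # feasible: accept
        # otherwise discard the draw and try again
    raise RuntimeError("no feasible sample found; is floor * n > 1?")
```

The sum-to-one constraint is satisfied by construction, and the floor is enforced by throwing away draws that violate it. With three weights and a 0.05 floor, most draws are accepted, so the retry loop is cheap.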


Bi-objective optimization is where this gets real

Most tuning papers optimize a single scalar: LoCoMo mean accuracy, or NDCG@10. In production, that’s not what anyone actually wants. A memory engine that takes 500ms to answer a query at 92% accuracy is useless for a voice agent. The same engine at 85% accuracy and 30ms is fine.

kizuna-tune optimizes two objectives at once: maximize accuracy, minimize P99 retrieval latency. Instead of a single “best” config, the output is a Pareto frontier – the set of configurations where no other configuration is better on both axes. Different points on the frontier are appropriate for different workloads.
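Extracting the frontier from a list of finished trials is a small filter. The sketch below assumes each trial is an (accuracy, P99 latency in ms) pair and uses the standard dominance test, which is slightly stricter than the informal wording above: at least as good on both axes and strictly better on one.

```python
def pareto_frontier(trials):
    """trials: list of (accuracy, p99_latency_ms). Returns the non-dominated subset."""
    def dominates(a, b):
        # a dominates b: no worse on either axis, strictly better on at least one.
        return a[0] >= b[0] and a[1] <= b[1] and (a[0] > b[0] or a[1] < b[1])

    frontier = [t for t in trials if not any(dominates(o, t) for o in trials)]
    return sorted(frontier, key=lambda t: t[1])  # fastest first
```

The all-pairs comparison is quadratic in the number of trials, which is fine at a few hundred trials per sweep.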

The AI agent in Phase 2 reads the frontier and picks points based on domain priorities. A financial-compliance profile takes a more accuracy-biased point, with a 50ms latency budget and boosted temporal weight. A high-throughput profile takes a latency-biased point, with an 80% accuracy floor. A coding-assistant profile prioritizes single-hop entity lookup with a 20ms budget.
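As a rule-based stand-in for that selection step (the Phase 2 agent itself is not shown here), a profile's latency budget or accuracy floor can be applied to the frontier like this. The function name and tie-breaking rules are assumptions for illustration.

```python
def pick_point(frontier, max_latency_ms=None, min_accuracy=None):
    """frontier: list of (accuracy, p99_latency_ms). Returns one point or None."""
    feasible = [
        (acc, lat) for acc, lat in frontier
        if (max_latency_ms is None or lat <= max_latency_ms)
        and (min_accuracy is None or acc >= min_accuracy)
    ]
    if not feasible:
        return None
    if max_latency_ms is not None:
        # Latency-budgeted profile: the most accurate point within budget.
        return max(feasible, key=lambda p: p[0])
    # Accuracy-floored profile: the fastest point above the floor.
    return min(feasible, key=lambda p: p[1])
```

Given a frontier, a 50ms budget selects the most accurate point at or under 50ms, while an 80% accuracy floor selects the fastest point that still clears 80%.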

This matters because the single-scalar framing hides real tradeoffs. Zep publishes a 94.8% score on the DMR benchmark, which is great, but it’s a single number with no latency disclosure attached. We don’t know what their tail latency looks like. Neither do their users, until they deploy.

FIG.06 / Accuracy × P99 latency Pareto frontier. Each dot is a sweep trial, plotted as LoCoMo accuracy (68% to 96%) against P99 retrieval latency (10ms to 500ms). The frontier is the set where no other configuration beats it on both axes; the coding-assistant, high-throughput, and financial-compliance profile picks are highlighted on the curve. Each profile picks a different point on the frontier. The Phase 2 agent reads domain priors and selects accordingly.

Query-adaptive routing: one config can’t rule them all

Here’s an experiment you can run yourself. Take a retrieval engine tuned for multi-hop reasoning queries and point it at a simple entity lookup. “What is Sarah Chen’s role?” doesn’t need five hops of graph traversal. It needs one hop – the entity node – and pure BM25 ranking. Running deep activation on it wastes 20ms and sometimes finds the wrong answer because activation spreads to Sarah Chen’s coworkers and ranks them above Sarah herself.

The inverse is also true. A config tuned for entity lookup will fail on “how does the Blackstone acquisition relate to the regulatory action against Evergreen?” because it doesn’t propagate far enough to cover the intermediate nodes.

The answer is query-adaptive retrieval: detect the query type at request time and use a different config for each. Kizuna-Mem already has the skeleton for this (configForQueryType() in the retrieval pipeline). kizuna-tune extends it to five types and runs a separate sweep for each:

Query type      Detection                LoCoMo category  Key parameters
Entity lookup   Hash hit, short query    single-hop       Low depth, high W_BM25
Temporal        when/after/before/since  temporal         High BLA alpha, deep propagation
Multi-hop       why/how/relate/between   multi-hop        Deepest propagation, PPR active
Simple factual  Single-noun hash match   single-hop       Skip activation, pure BM25
Exploratory     Long query, no entities  open-domain      Wide fan, more nodes, higher noise

The classifier runs entirely in Zig with zero dependencies – no LLM call, no regex library, just token matching and entity hash lookups. It adds under 50 microseconds to the request path. Each query type gets its own optimized sub-config, bundled into the domain profile as overrides.
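The production classifier is Zig, but the rule structure is easy to show in Python. This is a sketch under assumptions: the `known_entities` set stands in for the entity hash, the length thresholds and the priority order between rules are invented for illustration, and only the keyword lists come from the table above.

```python
TEMPORAL = {"when", "after", "before", "since"}
MULTI_HOP = {"why", "how", "relate", "between"}

def classify(query, known_entities, short_len=8):
    """Token matching plus entity lookups only; no regex, no model call."""
    tokens = [t.strip("?.,!\"'").lower() for t in query.split()]
    tokens = [t for t in tokens if t]
    hits = [t for t in tokens if t in known_entities]
    if any(t in MULTI_HOP for t in tokens):
        return "multi-hop"
    if any(t in TEMPORAL for t in tokens):
        return "temporal"
    if len(hits) == 1 and len(tokens) <= 3:
        return "simple-factual"      # assumed cutoff: a bare noun or two
    if hits and len(tokens) <= short_len:
        return "entity-lookup"       # assumed cutoff for a "short query"
    return "exploratory"
```

The returned label would then index into the per-type overrides in the domain profile.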

Nobody else in the agent memory space is doing this. Mem0, Zep, Letta, LangMem all use a single retrieval path with a single set of parameters. We haven’t seen a production memory system that routes query types to different configurations.

FIG.07 / Query-adaptive routing. Inbound queries pass through a ~50-microsecond classifier (Zig, zero dependencies; hash, token, and entity matching; no LLM call, no regex) and dispatch to one of five sub-configurations: entity lookup (low depth, high W_BM25), temporal (high BLA α, deep propagation), multi-hop (deepest depth, PPR active), simple factual (skip activation, pure BM25), and exploratory (wide fan, open-domain). One config can't rule them all. Each query type gets its own sub-configuration, bundled into the domain profile.

Thompson Sampling: the engine that improves while running

Phase 1 gives you a good static configuration. Phase 3 makes the engine get better at each tenant’s specific workload while it’s in production.

The technique is Thompson Sampling with Dirichlet-Categorical bandits, one per parameter per tenant. Thompson Sampling is from 1933 – William Thompson’s paper on sampling from the posterior – and it’s become the standard bandit algorithm for contextual recommendation at places like Netflix and Microsoft over the last decade. It has three properties that matter for us:

  1. Provably near-optimal regret. The total lost reward grows as O(log T), which is the theoretical best rate for bandit problems.
  2. Trivially parallel. Every query samples independently. No coordination, no locks.
  3. Graceful cold start. You initialize the Dirichlet posterior with a prior, which in our case is the Phase 1 profile. New tenants start with the offline-optimized defaults and specialize from there.

Each continuous parameter gets discretized into 10 bins. On every query, each bandit samples a bin, and the combined selections produce the ActivationConfig for that query. After feedback arrives – the user accessed a retrieved node within 5 minutes, or sent explicit relevance feedback – the posterior updates. Over about 100 queries, the distribution tightens and the engine converges on near-optimal parameters for that tenant.
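One plausible reading of that loop, sketched for a single parameter. The class and method names are hypothetical, and the choice to play the bin with the largest sampled weight (rather than drawing a bin from the sampled distribution) is an assumption, not something taken from the kizuna-tune design spec.

```python
import random

class BinBandit:
    """One bandit for one discretised parameter (10 bins by default)."""

    def __init__(self, prior=None, bins=10, rng=None):
        # `prior` would come from the offline profile; a flat prior is used otherwise.
        self.alpha = list(prior) if prior is not None else [1.0] * bins
        self.rng = rng or random.Random()

    def sample_bin(self):
        # Gamma(alpha_i, 1) draws, once normalised, are a Dirichlet(alpha) sample.
        # The argmax is unchanged by normalisation, so it is skipped here.
        draws = [self.rng.gammavariate(a, 1.0) for a in self.alpha]
        return max(range(len(draws)), key=draws.__getitem__)

    def update(self, bin_index, reward=1.0):
        # Positive feedback adds pseudo-counts to the bin that was played.
        self.alpha[bin_index] += reward
```

One `BinBandit` per parameter per tenant, each sampled independently on every query, gives the lock-free behaviour described above: sampling reads the concentration vector and only the feedback path writes to it.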

The thing that keeps the system from locking in permanently is concentration decay. Every 500 queries (configurable), each Dirichlet concentration parameter shrinks toward its uniform mean by 5%. The ordering of the bins is preserved, but the confidence degrades, which lets the engine re-explore when usage patterns shift. A compliance team that pivots from entity queries to temporal queries after a regulatory change doesn’t need manual retuning – the bandits adapt within two decay intervals (~1,000 queries).
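The decay step itself is a one-liner. This sketch assumes "shrinks toward its uniform mean by 5%" means each parameter moves 5% of the way toward the mean of the concentration vector; the function name is hypothetical.

```python
def decay_concentration(alpha, rate=0.05):
    """Pull every concentration parameter `rate` of the way toward the mean."""
    mean = sum(alpha) / len(alpha)
    return [a - rate * (a - mean) for a in alpha]
```

Because every value moves toward the same mean by the same fraction, a larger parameter stays larger than a smaller one, so bin ordering is preserved while the peak flattens. Under this reading the total concentration is unchanged; only its spread across bins narrows.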

Per-tenant isolation is the other key property. A financial services tenant and a coding assistant tenant get independent bandit state. They don’t pollute each other’s posteriors. State is persisted per tenant (bandit_state_{tenant_id}.bin) and recovered from the WAL on restart.

FIG.08 / Thompson Sampling posterior. A 10-bin Dirichlet posterior over a single retrieval parameter, shown at Q = 0 (prior, uniform, α ≈ 1 in each bin), Q ≈ 100 (warming, peak emerging, α5 ≈ 14), and Q ≈ 500 (converged, α5 ≈ 92 on the near-optimal bin). Sampled bins update on every feedback signal. Concentration decays slowly (5% per 500 queries) so the engine re-explores when usage shifts.

LambdaMART reranking: fixed linear fusion has a ceiling

The final piece is the learned reranker. Once enough feedback accumulates (1,000+ labeled query-result pairs per tenant, or globally), kizuna-tune trains a LambdaMART model using LightGBM’s lambdarank objective.

LambdaMART comes from Chris Burges and collaborators at Microsoft Research, 2010. It’s a gradient-boosted tree ranker whose loss function directly optimizes NDCG. It won the Yahoo Learning to Rank Challenge in 2010 and it’s been a standard baseline for production search ranking at Microsoft, Yahoo, and others ever since.
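For readers who haven't met the metric, NDCG rewards putting highly relevant results near the top of the list. A minimal implementation, using the common exponential-gain form (the grading scale and the choice of gain function are assumptions; variants exist):

```python
import math

def dcg_at_k(relevances, k):
    # Gain (2^rel - 1), discounted by log2 of (1-based rank + 1).
    return sum((2 ** rel - 1) / math.log2(i + 2) for i, rel in enumerate(relevances[:k]))

def ndcg_at_k(relevances, k=10):
    """relevances: graded labels in the order the ranker returned the results."""
    ideal = dcg_at_k(sorted(relevances, reverse=True), k)
    return dcg_at_k(relevances, k) / ideal if ideal > 0 else 0.0
```

A perfectly ordered list scores 1.0, and the score drops as relevant results slide down the ranking.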

The reason to use it here is that tree-based rankers are fast. Tree evaluation is a chain of if/else branches, under 10 microseconds per candidate node, zero allocations. You can run it in the request path without worrying about tail latency. Neural rerankers like monoBERT or ColBERT add 50-100ms minimum and need a GPU.
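The "chain of if/else branches" point is easiest to see with a tree stored as flat arrays. The layout and numbers below are invented for illustration and are not LightGBM's model format; a real ensemble sums the outputs of many such trees.

```python
# A tiny tree stored as parallel arrays; a negative feature index marks a leaf.
FEATURE   = [0,   1,   -1,  -1,   -1]
THRESHOLD = [0.5, 0.3, 0.0, 0.0,  0.0]
LEFT      = [1,   3,   0,   0,    0]
RIGHT     = [2,   4,   0,   0,    0]
VALUE     = [0.0, 0.0, 0.8, -0.4, 0.1]

def score_tree(features):
    """Walk from the root to a leaf: one comparison and one index per level."""
    node = 0
    while FEATURE[node] >= 0:
        if features[FEATURE[node]] <= THRESHOLD[node]:
            node = LEFT[node]
        else:
            node = RIGHT[node]
    return VALUE[node]
```

Nothing is allocated during the walk, and the work per candidate is bounded by tree depth, which is what makes this kind of model safe to run in the request path.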

The features the model learns over are already computed by the retrieval engine: BM25 score, vector similarity, graph activation score, BLA boost, inhibition penalty, hop count, access count, age, last access recency, content tier, edge types on the activation path, fan-in. Twelve features total. The model learns interactions that linear fusion misses – “high BM25 with low vector similarity” is a keyword match without semantic backing and should be downweighted, while “high graph spread combined with recent access” is a strong signal that linear fusion underweights.

In published IR benchmarks, LambdaMART typically yields 5-15% NDCG improvement over hand-tuned linear fusion. We expect similar gains here, and deployment has automatic rollback: if a new model’s validation NDCG is worse than the current model’s, the deployment is rejected. The last three model versions are retained.
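The deployment gate can be sketched as a small registry. The class and method names are hypothetical, and treating a tie as acceptable is an assumption; the rule (reject if worse, keep three versions) is the one described above.

```python
from collections import deque

class ModelRegistry:
    def __init__(self, keep=3):
        self.versions = deque(maxlen=keep)  # oldest entries fall off automatically

    @property
    def current(self):
        return self.versions[-1] if self.versions else None

    def try_deploy(self, model, validation_ndcg):
        """Accept the candidate unless it scores worse than the live model."""
        if self.current is not None and validation_ndcg < self.current[1]:
            return False  # rejected: the live model stays in place
        self.versions.append((model, validation_ndcg))
        return True
```

A rejected candidate never becomes current, so the rollback is implicit: the serving path keeps reading the last accepted version.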


What the rest of the field does

We ran through the major agent memory systems to see what optimization story they tell. Here’s what we found.

System                         Parameter tuning                      Online learning                        Learned reranker
Mem0                           Hand-tuned defaults                   None                                   None
Zep / Graphiti                 Hand-tuned, some published ablations  None                                   None
Letta                          Model-managed (LLM decides)           LLM context only                       None
LangMem                        Hand-tuned                            None                                   None
Kizuna-Mem (with kizuna-tune)  BOHB sweep, bi-objective              Thompson Sampling bandits per tenant*  LambdaMART, auto-rollback*

*Phase 1 (BOHB sweep) ships in Track A today. Phase 3 (bandits + reranker) ships in Track B alongside the Zig core changes. See the end of this post for the track breakdown.

Mem0 ships with vector similarity + optional entity triplets, with fusion weights and top-K baked in. The config is tuned once, globally, by the Mem0 team. There’s no per-tenant adaptation and no learned ranker in the open-source codebase. This is fine for what Mem0 is optimizing for – simplicity, fast onboarding, a clean API – but it means retrieval quality has a ceiling that no amount of scale can break through.

Zep / Graphiti has the most sophisticated retrieval algorithm of the competitors (BFS + BM25 + cosine + reciprocal rank fusion), and they’ve published ablation studies. But the fusion weights are fixed, and there’s no online learning. Their DMR benchmark number comes from a single globally-tuned config. If a Zep customer’s query distribution differs from the benchmark distribution, they get the benchmark’s config and have to hope.

Letta takes the LLM-driven approach: the model itself decides what to retrieve from tiered stores. There’s no parameter sweep because there’s barely a retrieval algorithm in the classical sense – retrieval is a function call the model chooses to make. This is creative, but it means retrieval quality is nondeterministic and varies with model capability. It also makes latency unpredictable, because every memory operation is an LLM round trip.

LangMem is a LangChain library. It inherits LangGraph’s checkpoint system for storage and uses vector similarity with importance + recency scoring. The scoring weights are hyperparameters users set by hand. No online learning, no learned reranker. The documentation is thin on tuning guidance.

Nobody in this list is running an offline Bayesian sweep. Nobody is running per-tenant bandits. Nobody is training a LambdaMART reranker on production feedback. The techniques we’re using are between 8 and 90-plus years old. They’ve been production infrastructure at Google, Microsoft, Netflix, Yahoo, and every major search company. The agent memory field just hasn’t gotten around to adopting them yet.


Why this is a durable edge

The easy counter to this post is: “sure, but the other teams can just do the same thing.” True. BOHB is an open-source library. Thompson Sampling is a 90-year-old technique. LambdaMART has had a reference implementation in LightGBM since 2016.

The reason this is still a durable edge is that these techniques don’t drop onto a Python-and-LLM memory stack cleanly.

Thompson Sampling in the request path needs microsecond-latency bandit sampling, per-tenant state persistence to a WAL, and lock-free access across concurrent requests. That’s doable in Zig. It’s painful in Python with GIL contention and per-call allocator overhead.

LambdaMART inference at <10 microseconds per node requires a packed binary format you can evaluate from a systems language without allocating. LightGBM ships a C API that makes this possible, but calling it from Python adds marshaling overhead that eats the latency budget.

Query-adaptive routing with 50 microsecond classification time means the router runs in the hot path, not as a pre-processing step in a separate service. Kizuna-Mem’s Zig core makes this free. A Python memory system has to either skip the router or pay the latency cost.

And dogfooding the experimental memory for the optimization loop itself – using Kizuna-Mem as Instance A while kizuna-tune optimizes Instance B – requires a memory engine that can handle temporal, relational, multi-hop queries over hundreds of experiment nodes. That’s the problem we described in the autoresearch post. Flat logs break. A graph doesn’t.

The Python-first memory systems can bolt on BOHB for offline tuning. That’s straightforward. The harder parts – online bandits in the request path, sub-10-microsecond reranker inference, query-adaptive routing without latency regression – require a rewrite of the retrieval engine in a systems language. That’s a multi-year project, and by the time it ships, we’ll have moved further.


Where to look next

kizuna-tune ships in two tracks. Track A is the Python tooling: BOHB sweep, Phase 2 AI interpretation, LambdaMART training pipeline, profile generation. This ships first and requires no Zig changes. Track B adds the Zig/Rust changes needed for Phase 3 online learning: edge type weight multiplication in activation.zig, PPR teleportation, the Thompson Sampling module, feedback endpoint, reranker inference.

The full design spec is in the repo at docs/superpowers/specs/2026-04-06-kizuna-tune-design.md. If you want to see how we store experiment history in Instance A, how the AI agent queries the graph during Phase 2, or how the bandit state is persisted across restarts, it’s all there.

Memory engines that tune themselves aren’t a research idea. The techniques are old. Nobody has bothered to wire them together. We did.


kizuna-tune lives at tools/kizuna-tune/ in the Kizuna-Mem repository. The 24th post in this series covers the dogfooding architecture – using Kizuna-Mem as the experimental memory for the optimizer itself.