
Karpathy's Autoresearch and the Memory Problem for Autonomous AI Agents

Author: Jonathan Conway
Date: 10 April 2026
Tags: autoresearch, karpathy, autonomous-agents, agent-memory, ml-research, temporal-knowledge-graph

Andrej Karpathy released autoresearch in March 2026. The premise is simple: give an AI agent a real LLM training setup and let it experiment autonomously overnight. You go to sleep. It runs 100 experiments. You wake up to a TSV log of everything it tried and a git branch advanced past every improvement.

The project has 65,000+ stars for a reason. It points at something bigger than a clever script. It’s the first credible demo of what autonomous AI research agents actually look like in practice. And buried in its design is a problem: how these agents remember what they’ve tried.

How autoresearch Works

The system has three files. prepare.py handles data loading and evaluation and is off limits to the agent. train.py contains the full GPT model, optimizer, and training loop. This is the only file the agent can modify. program.md is the “research org code” that tells the agent how to behave.

The loop is tight:

1. The agent reads train.py and decides on a modification (architecture change, hyperparameter tweak, optimizer adjustment, batch size experiment)
2. It commits the change to git
3. It runs uv run train.py > run.log 2>&1
4. Training executes for exactly 5 minutes of wall-clock time, excluding startup and compilation
5. The agent extracts val_bpb (validation bits per byte, a compression-based loss metric where lower is better) from the log
6. If val_bpb improved: advance the branch. If it worsened: git reset
7. Log the result to results.tsv
8. Repeat. Never stop. Never ask the human for permission.

That last point matters. The program.md instructs the agent to iterate indefinitely: “NEVER STOP… Do NOT pause to ask the human if you should continue.” This is autonomous in the real sense. The agent decides what to try, tries it, evaluates the result, and moves on. The human’s role shifts from writing code to writing program.md, the document that defines experimental strategy and constraints.

Karpathy’s framing: “you’re not touching any of the Python files like you normally would as a researcher. Instead, you are programming the program.md Markdown files.”
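The loop described above can be sketched in Python. This is a minimal sketch, not how autoresearch is implemented: in the real project the agent is an LLM issuing shell commands itself. The log pattern in `VAL_BPB_RE`, the helper names, the exact git invocations, and the three-column row written to results.tsv are all assumptions for illustration. Only the training command, the keep-or-reset rule, and the existence of results.tsv come from the description above.

```python
import re
import subprocess
from typing import Optional

# ASSUMPTION: the training log contains a line such as "val_bpb: 1.2345".
# The real log format may differ; adjust the pattern to match it.
VAL_BPB_RE = re.compile(r"val_bpb\s*[:=]\s*([0-9]*\.?[0-9]+)")


def extract_val_bpb(log_text: str) -> Optional[float]:
    """Return the last val_bpb reported in the log, or None if absent (e.g. a crash)."""
    matches = VAL_BPB_RE.findall(log_text)
    return float(matches[-1]) if matches else None


def decide(best_so_far: float, new_value: Optional[float]) -> str:
    """Lower val_bpb is better: keep strict improvements, discard everything else."""
    if new_value is not None and new_value < best_so_far:
        return "keep"
    return "discard"


def run_one_experiment(best_so_far: float, description: str) -> float:
    """One iteration: commit, train, evaluate, keep or reset, log. Returns the new best."""
    # Hypothetical git invocations; the article only says "commit" and "git reset".
    subprocess.run(["git", "commit", "-am", description], check=True)
    with open("run.log", "w") as log:
        # Equivalent to: uv run train.py > run.log 2>&1
        # No check=True here: a crashed run should be logged, not abort the loop.
        subprocess.run(["uv", "run", "train.py"], stdout=log, stderr=subprocess.STDOUT)
    with open("run.log") as log:
        value = extract_val_bpb(log.read())
    verdict = decide(best_so_far, value)
    if verdict == "discard":
        subprocess.run(["git", "reset", "--hard", "HEAD~1"], check=True)
    with open("results.tsv", "a") as tsv:
        # ASSUMPTION: a three-column row; the real results.tsv layout may differ.
        tsv.write(f"{value}\t{verdict}\t{description}\n")
    return value if verdict == "keep" else best_so_far
```

The part worth noticing is how little state survives an iteration: one number (`best_so_far`) and one appended row. Everything else the agent learned has to be reconstructed later from the log and the diffs.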

The 5-Minute Constraint

Every experiment gets exactly 5 minutes of training time. This is a deliberate design choice. It makes results directly comparable across experiments regardless of model size, batch configuration, or architectural differences. An agent that discovers a smaller, faster architecture benefits because it gets more training steps within the same wall-clock budget. An agent that bloats the model with unnecessary parameters is penalized because it gets fewer steps.

This fixed-time evaluation selects for efficient architectures, not just accurate ones. At around 8 to 12 experiments per hour depending on startup overhead, the agent can explore 70 to 100 configurations overnight on a single H100.
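Two small sketches make the constraint concrete. The first shows why a fixed wall-clock budget rewards cheaper training steps. The second shows where the 8 to 12 experiments per hour range comes from. Both are illustrative only: `train_for_budget`, `experiments_per_hour`, and the injectable `clock` parameter are hypothetical names, not part of autoresearch, and the overhead figure used below is an assumed value.

```python
import time
from typing import Callable


def train_for_budget(step_fn: Callable[[], None],
                     budget_s: float = 300.0,
                     clock: Callable[[], float] = time.monotonic) -> int:
    """Run step_fn repeatedly until budget_s seconds have elapsed; return the step count."""
    start = clock()
    steps = 0
    while clock() - start < budget_s:
        step_fn()
        steps += 1
    return steps


def experiments_per_hour(overhead_s: float, budget_s: float = 300.0) -> float:
    """Experiments that fit in one hour when each costs budget_s plus a fixed overhead."""
    return 3600.0 / (budget_s + overhead_s)
```

A model whose step takes half as long gets twice as many steps out of `train_for_budget`. For throughput, `experiments_per_hour(0)` gives 12.0 and `experiments_per_hour(150)` gives 8.0, so the quoted range corresponds to somewhere between zero and roughly two and a half minutes of startup and compilation per run.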

The Trade-off Philosophy

The program.md embeds a pragmatic evaluation standard: “A 0.001 val_bpb improvement that adds 20 lines of hacky code? Probably not worth it.” Code deletion that maintains equivalent performance counts as a simplification win. The agent is instructed to optimize for the combination of metric improvement and code quality, not metric alone.

This is a more nuanced objective function than most automated hyperparameter search systems use. The agent isn’t just hill-climbing on a number. It’s making judgment calls about whether a marginal gain justifies added complexity.
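To see what that judgment call looks like when written down, here is a crude numeric stand-in. In autoresearch the decision is made by the agent reading program.md, not by a formula, so treat this purely as an illustration. The function name and both thresholds (`min_gain_per_line`, `noise_floor`) are assumptions chosen so that the quoted example comes out as "not worth it".

```python
def worth_keeping(bpb_gain: float, lines_added: int,
                  min_gain_per_line: float = 0.0001,
                  noise_floor: float = 0.0005) -> bool:
    """Crude stand-in for the agent's judgment (thresholds are hypothetical).

    bpb_gain: reduction in val_bpb (positive means the model improved).
    lines_added: net change in line count (negative means code was deleted).
    """
    if lines_added < 0:
        # Simplification win: code shrank and performance is no worse than noise.
        return bpb_gain >= -noise_floor
    # Otherwise demand a gain above noise and proportional to the added complexity.
    return bpb_gain > max(noise_floor, lines_added * min_gain_per_line)
```

With these defaults, `worth_keeping(0.001, 20)` is False (the quoted hacky 20-line change), while `worth_keeping(0.0, -15)` is True (a deletion that holds performance).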

Running it yourself

The setup is minimal if you have an NVIDIA GPU:

git clone https://github.com/karpathy/autoresearch.git
cd autoresearch
uv sync
uv run prepare.py

Then point your AI agent (Claude, GPT-4, etc.) at the repo and tell it to follow program.md. The agent will create a dated branch, run experiments, and build up a git history of improvements.

The constraint: NVIDIA GPU only, currently tested on H100. Community forks exist for Mac (Apple Silicon), Windows, and AMD. For smaller GPUs, reduce MAX_SEQ_LEN, VOCAB_SIZE, and DEPTH in train.py before starting.

Each experiment takes about 5 minutes. Budget 8 to 12 hours for a serious overnight run. Check results.tsv in the morning for a complete log of everything the agent tried.
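For the morning check, a few lines of Python are enough to pull the best run out of the log. This is a sketch under a stated assumption: it expects a tab-separated file with a header row that includes a `val_bpb` column. The other column names in the test data (`status`, `description`) are hypothetical, so check the header of your own results.tsv first.

```python
import csv
from typing import Iterable, Optional


def best_experiment(lines: Iterable[str]) -> Optional[dict]:
    """Return the row with the lowest val_bpb, skipping rows without a numeric value."""
    best = None
    # ASSUMPTION: tab-separated with a header row containing a "val_bpb" column.
    for row in csv.DictReader(lines, delimiter="\t"):
        try:
            value = float(row["val_bpb"])
        except (KeyError, TypeError, ValueError):
            continue  # crashed run or malformed row
        if best is None or value < float(best["val_bpb"]):
            best = row
    return best
```

To use it on a real run, open the file with `open("results.tsv", newline="")` and pass the file object straight in, since `csv.DictReader` accepts any iterable of lines.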

Why This Matters

The obvious value: a single engineer can run 100 experiments overnight instead of 8 during a workday. That’s a real productivity multiplier for small research teams that can’t afford to run 50 parallel jobs on a compute cluster.

The deeper value: autoresearch demonstrates a pattern that’s going to generalize far beyond ML training. The pattern is:

1. Define a task with a measurable objective
2. Give an autonomous agent access to the relevant code or data
3. Let it iterate without supervision
4. Track what worked

This loop applies to compiler optimization, database query tuning, infrastructure configuration, API performance testing, and dozens of other engineering tasks where the feedback cycle is fast and the search space is large. autoresearch is the prototype. The production version of this pattern needs better infrastructure than a TSV file and git commits.
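Stripped of anything ML-specific, the pattern is a greedy keep-if-better loop over proposals. The sketch below is a generic illustration, not code from autoresearch. The names `iterate`, `proposals`, and `measure` are hypothetical, and a real agent would generate proposals adaptively instead of reading them from a fixed list.

```python
from typing import Callable, Iterable, List, Tuple, TypeVar

C = TypeVar("C")  # whatever is being tuned: a config, a query plan, a source file


def iterate(baseline: C,
            proposals: Iterable[Callable[[C], C]],
            measure: Callable[[C], float]) -> Tuple[C, List[Tuple[C, float, bool]]]:
    """Greedy keep-if-better loop; lower scores are better. Returns (best, history)."""
    best, best_score = baseline, measure(baseline)
    history: List[Tuple[C, float, bool]] = []
    for propose in proposals:
        candidate = propose(best)      # step 2: the agent modifies the current best
        score = measure(candidate)     # step 1: the measurable objective
        kept = score < best_score
        history.append((candidate, score, kept))  # step 4: track what worked
        if kept:
            best, best_score = candidate, score
    return best, history
```

Note that `history` here is exactly the flat log the rest of this article argues is not enough: it records outcomes, not why a proposal was made or how it relates to the others.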

The Memory Problem

autoresearch tracks experimental history in two ways: git commits that survive evaluation (failed experiments are reset), and results.tsv for a flat log of all experiments including failures.

This works fine for 100 experiments. It breaks down at 1,000. And it becomes actively harmful at 10,000.

What the agent forgets

After 200 experiments, the agent has no efficient way to answer these questions:

“What architectural changes correlated with improvements when the optimizer was Muon?” The TSV has the val_bpb numbers. Git has the diffs. But the relationship between architecture type and optimizer choice lives in the code, not in the log. Answering this requires the agent to check out each commit, parse train.py, classify the architecture, match it with the optimizer configuration, and correlate with the metric. That’s expensive and fragile.

“I tried increasing depth to 12 layers three times and it failed every time. Should I try it again?” The TSV records that three experiments failed, but not what each one actually changed. To find out, the agent has to dig the diffs back out of git, and because failed experiments are reset, those commits are no longer on the branch. If the depth increase was combined with other changes, the agent can’t isolate the effect of depth alone from the flat log.

“Experiment 47 and experiment 183 both improved val_bpb by similar amounts but used completely different approaches. What do they have in common?” This requires semantic understanding of code diffs, not string matching. The agent would need to parse both versions of train.py, extract the architectural properties, and find shared characteristics. A flat log gives you nothing here.

“What’s the best configuration we’ve found for attention mechanisms specifically?” The TSV doesn’t categorize experiments by component. Every experiment is just a row with a number.
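For contrast, here is what those questions look like if each experiment is stored as a structured record instead of a TSV row. This is a hypothetical schema, not something autoresearch provides: the `ExperimentRecord` fields and the three query helpers are assumptions made for illustration, and filling in fields like `components` would itself require the agent to describe each change as it makes it.

```python
from dataclasses import dataclass
from typing import FrozenSet, List, Optional


@dataclass(frozen=True)
class ExperimentRecord:
    index: int                  # position in the run
    components: FrozenSet[str]  # parts of train.py touched, e.g. {"attention"}
    optimizer: str              # optimizer active during the run
    depth: int                  # number of layers in this configuration
    val_bpb: float              # measured metric, lower is better
    improved: bool              # whether the branch advanced


def improvements_with(records: List[ExperimentRecord], optimizer: str) -> List[ExperimentRecord]:
    """Experiments that improved val_bpb while the given optimizer was in use."""
    return [r for r in records if r.improved and r.optimizer == optimizer]


def best_for_component(records: List[ExperimentRecord], component: str) -> Optional[ExperimentRecord]:
    """Lowest-val_bpb experiment among those that touched the given component."""
    touching = [r for r in records if component in r.components]
    return min(touching, key=lambda r: r.val_bpb, default=None)


def failures_at_depth(records: List[ExperimentRecord], depth: int) -> int:
    """How many times a configuration at this depth failed to improve."""
    return sum(1 for r in records if r.depth == depth and not r.improved)
```

Each of the first, second, and fourth questions above becomes a one-line filter. The third, finding what two different approaches have in common, still needs more than a flat record.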

Why git isn’t memory

Git is a version control system. It tracks what changed in code. It doesn’t track why it changed, what the agent expected to happen, or how the result relates to prior experiments. A git history of 500 commits tells you the sequence of changes. It doesn’t tell you the causal structure of the research.

Consider what a human researcher maintains in their head during an experimental campaign: a mental model of which hypotheses have been tested, which variables interact, which approaches are dead ends, and which regions of the search space remain unexplored. They build this model incrementally and use it to guide the next experiment.

An autonomous agent using git + TSV rebuilds this understanding from scratch every iteration by re-reading the log and the current code. It has no persistent model of the experimental landscape. Every decision is made with the same flat context regardless of whether it’s experiment 5 or experiment 500.

What Autonomous Agents Actually Need

The experimental history of an autoresearch run is naturally modeled as a temporal knowledge graph. Each experiment is an event with a timestamp. Each experiment modifies entities (model architecture, optimizer, hyperparameters). Each experiment produces metrics. Relationships exist between experiments: “experiment 47 reverted experiment 46”, “experiments 12 through 18 all explored learning rate variations”, “the depth increase in experiment 83 was combined with the batch size change from experiment 79.”
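A minimal in-memory sketch of that structure follows. It is an assumption-laden toy, not a real graph store: the class name `ExperimentGraph`, its methods, and the relation labels such as `"reverted"` and `"combined_with"` are all hypothetical, chosen to mirror the examples in the paragraph above.

```python
from collections import defaultdict
from typing import Any, Dict, List, Tuple


class ExperimentGraph:
    """Experiments as timestamped nodes, relationships as typed directed edges."""

    def __init__(self) -> None:
        self.nodes: Dict[int, Dict[str, Any]] = {}
        self.edges: Dict[int, List[Tuple[str, int]]] = defaultdict(list)

    def add_experiment(self, exp_id: int, timestamp: float, **attrs: Any) -> None:
        """Store an experiment with its timestamp plus arbitrary attributes (metrics, entities)."""
        self.nodes[exp_id] = {"timestamp": timestamp, **attrs}

    def relate(self, source: int, relation: str, target: int) -> None:
        """Record a typed edge such as (47, "reverted", 46)."""
        self.edges[source].append((relation, target))

    def related(self, source: int, relation: str) -> List[int]:
        """Targets reachable from source via edges of the given relation type."""
        return [t for r, t in self.edges.get(source, []) if r == relation]

    def between(self, start: float, end: float) -> List[int]:
        """Experiment ids whose timestamp falls in [start, end], in id order."""
        return sorted(i for i, n in self.nodes.items() if start <= n["timestamp"] <= end)
```

With this in place, "experiment 47 reverted experiment 46" is `graph.relate(47, "reverted", 46)`, and the question "what did experiment 83 build on?" is `graph.related(83, "combined_with")`.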

This is structured, temporal, relational data. The same kind of data that powers any agent memory system that handles more than simple recall.

What the agent needs:

Temporal queries. “What was the best val_bpb we achieved before we switched from AdamW to Muon?” This requires point-in-time reasoning over the experimental timeline.

Multi-hop retrieval. “Which optimizer settings produced improvements when combined with architectures that had fewer than 8 layers?” This requires traversing relationships between experiments, architectural properties, and optimizer configurations.

Decay and relevance. Early experiments on a baseline architecture become less relevant as the codebase evolves. An experiment from 400 iterations ago that tested a feature the agent has since removed shouldn’t carry the same weight as a recent experiment on the current architecture.

Entity resolution. The agent might call the same concept different things across experiments. “Increasing model depth” and “adding more transformer layers” and “stacking additional blocks” are the same modification. The memory system needs to recognize this.

Conflict detection. If experiment 47 concluded “wider layers help” and experiment 189 concluded “wider layers hurt”, the agent needs to understand that the context changed between those two experiments (different optimizer, different learning rate, different dataset size) rather than treating them as contradictory.
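The first two requirements, temporal queries and multi-hop retrieval, can be sketched over plain Python data. Everything here is a hypothetical layout for illustration: the `(index, optimizer, val_bpb)` timeline tuples, the `(source, relation, target)` triples, the relation names `uses_architecture` and `uses_optimizer`, and both function names are assumptions, not an existing API.

```python
from typing import Dict, List, Optional, Set, Tuple

# Each entry: (experiment index, optimizer name, val_bpb). Hypothetical layout.
Timeline = List[Tuple[int, str, float]]
# Edges as (source, relation, target). Hypothetical schema.
Triple = Tuple[str, str, str]


def best_before_switch(timeline: Timeline, new_optimizer: str) -> Optional[float]:
    """Temporal query: lowest val_bpb among experiments before the first use of new_optimizer."""
    switch = next((i for i, opt, _ in timeline if opt == new_optimizer), None)
    if switch is None:
        return None  # the switch never happened
    earlier = [bpb for i, _, bpb in timeline if i < switch]
    return min(earlier) if earlier else None


def optimizers_for_shallow_wins(triples: List[Triple],
                                depth: Dict[str, int],
                                improved: Set[str],
                                max_depth: int = 8) -> Set[str]:
    """Multi-hop query: optimizer settings of improving experiments on shallow architectures.

    Hop 1 goes experiment -> architecture (then reads its depth);
    hop 2 goes experiment -> optimizer setting.
    """
    arch_of = {s: t for s, r, t in triples if r == "uses_architecture"}
    opt_of = {s: t for s, r, t in triples if r == "uses_optimizer"}
    return {
        opt_of[e]
        for e in improved
        if e in opt_of and e in arch_of
        # Architectures with unknown depth are treated as not shallow.
        and depth.get(arch_of[e], max_depth) < max_depth
    }
```

Neither query is hard once the data is structured. The point is that a TSV row plus a git diff gives the agent none of these fields directly.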

These are the same retrieval problems that any production agent memory system has to solve. autoresearch just makes them visible because the experimental loop is fast enough that you hit the limits within hours rather than months.
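The remaining three requirements, decay, entity resolution, and conflict detection, reduce to small functions in their simplest form. The sketch below is deliberately naive and every name in it is hypothetical. In particular, the hand-written `ALIASES` table stands in for what a production system would likely do with embeddings or a learned matcher, and the 100-experiment half-life is an arbitrary assumed value.

```python
from typing import Any, Dict, Tuple

# ASSUMPTION: a hand-maintained alias table; real systems would resolve entities
# with something more robust than exact string lookup.
ALIASES = {
    "increasing model depth": "depth_increase",
    "adding more transformer layers": "depth_increase",
    "stacking additional blocks": "depth_increase",
}


def recency_weight(age: int, half_life: float = 100.0) -> float:
    """Decay: an experiment's weight halves every half_life iterations."""
    return 0.5 ** (age / half_life)


def canonical(description: str) -> str:
    """Entity resolution: map known phrasings of a change to one canonical name."""
    key = description.strip().lower()
    return ALIASES.get(key, key)


def context_diff(ctx_a: Dict[str, Any], ctx_b: Dict[str, Any]) -> Dict[str, Tuple[Any, Any]]:
    """Keys whose values differ between two experiment contexts."""
    keys = ctx_a.keys() | ctx_b.keys()
    return {k: (ctx_a.get(k), ctx_b.get(k)) for k in sorted(keys) if ctx_a.get(k) != ctx_b.get(k)}


def classify_disagreement(ctx_a: Dict[str, Any], ctx_b: Dict[str, Any]) -> str:
    """Opposite conclusions are a true contradiction only if the contexts were identical."""
    return "context_changed" if context_diff(ctx_a, ctx_b) else "contradiction"
```

An experiment from 400 iterations ago gets `recency_weight(400)`, which is 0.0625 with the default half-life. And if experiments 47 and 189 disagree about wider layers but ran under different optimizers, `classify_disagreement` reports `"context_changed"` instead of flagging a contradiction.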

From Research Agent to Production Pattern

autoresearch is a single-agent, single-GPU, single-objective system. It makes the memory problem visible. The production form is more demanding.

Imagine a fleet of agents optimizing different components of a production ML pipeline: one agent tunes the data preprocessing, another optimizes the model architecture, a third experiments with serving configurations. Each agent’s changes affect the others. The preprocessing agent’s tokenizer choice affects what architectures perform well. The architecture agent’s model size affects what serving configurations are feasible.

In this scenario, each agent needs access to the experimental history of the others. The memory system isn’t just a log per agent. It’s a shared temporal knowledge graph where changes by one agent become context for decisions by another.

This is the production form of the memory problem. And it requires the same capabilities that the governed memory engine was built for: temporal awareness (when did this change happen relative to other changes), multi-hop graph traversal (how does this agent’s experiment relate to that agent’s experiment), spreading activation retrieval (which prior experiments are most relevant to the current state, weighted by recency and causal proximity), and entity resolution across different representations of the same concept.

Git gives you a linear commit history. A TSV gives you a flat table. A temporal knowledge graph gives you the structure that autonomous agents need to make informed decisions over long experimental campaigns.
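Spreading activation, mentioned above, is the least familiar of those capabilities, so a small sketch may help. This is a generic textbook-style version, not the implementation of any particular memory engine. The function name, the `decay`, `max_hops`, and `threshold` parameters, and the idea of encoding recency and causal proximity into edge weights are all assumptions for illustration.

```python
from collections import defaultdict
from typing import Dict, List, Tuple

# node -> [(neighbor, edge weight in (0, 1])]. In a shared graph, nodes could be
# experiments from different agents; weights are assumed to encode relatedness.
Graph = Dict[str, List[Tuple[str, float]]]


def spread_activation(graph: Graph,
                      seeds: Dict[str, float],
                      decay: float = 0.5,
                      max_hops: int = 3,
                      threshold: float = 0.01) -> Dict[str, float]:
    """Propagate activation outward from seed nodes, attenuating at every hop."""
    activation: Dict[str, float] = defaultdict(float)
    for node, value in seeds.items():
        activation[node] += value
    frontier = dict(seeds)
    for _ in range(max_hops):  # the hop limit also guarantees termination on cycles
        next_frontier: Dict[str, float] = defaultdict(float)
        for node, value in frontier.items():
            for neighbor, weight in graph.get(node, []):
                passed = value * weight * decay
                if passed >= threshold:  # drop signals too weak to matter
                    next_frontier[neighbor] += passed
        if not next_frontier:
            break
        for node, value in next_frontier.items():
            activation[node] += value
        frontier = next_frontier
    return dict(activation)
```

Seeding the current state and ranking nodes by the returned activation gives the agent a shortlist of prior experiments to consult. Experiments reachable by several paths accumulate activation from each, which is how an indirect but well-connected result can outrank a direct but weak one.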

What Comes Next

Karpathy framed autoresearch with a fictional 2026 retrospective: “Research is now entirely the domain of autonomous swarms of AI agents running across compute cluster megastructures.” He tagged it as documenting “how it all began.”

The framing is provocative but the trajectory is real. Autonomous agents that iterate on code, evaluate results, and compound improvements without human supervision are going to become standard infrastructure for ML teams. The question is whether the memory systems backing those agents can keep up.

A hundred experiments with a TSV log is a demo. A thousand experiments across multiple agents with shared context, temporal reasoning, and causal structure is production infrastructure. The gap between those two is the same gap between a chatbot that forgets your name and an agent that builds institutional knowledge over months of operation.

The agents are getting autonomous. Their memory systems need to catch up.