Setup
The standard objection to AI-driven research has always been that agents lack the intuition to form good hypotheses. That objection got harder to hold in March 2026, when Andrej Karpathy pointed an autoresearch agent at nanochat — a small, clean language-model codebase — and let it loop. The agent logged 700 experiments and surfaced an 11% improvement that human researchers working the same codebase had missed. The gap wasn’t creativity. It was throughput.
What They Found
The agent identified a configuration change — not a novel architecture, but a tuning adjustment in the training loop — that produced an 11% gain on the model’s benchmark suite. The result took 700 runs to surface. A human researcher running sequential experiments would have needed weeks; the agent ran them in parallel with no context-switching cost. Karpathy’s conclusion was direct: every serious AI lab will run autoresearch pipelines within 18 months.
How It Works
Autoresearch agents follow a simple loop: propose a change, run a training or evaluation job, record the result, update a hypothesis table, repeat. The intelligence is in the proposal function — which change to try next, given what’s worked before. This is structured search over a configuration space, not reasoning from first principles. It works because ML improvement is largely empirical: the answer already exists in the search space; finding it just takes enough runs. Compute replaces intuition.
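The loop above can be sketched in a few dozen lines. This is a minimal illustration, not Karpathy's actual pipeline: the search space, the toy scoring function, and the mutate-the-best proposal strategy are all assumptions standing in for a real training job and a real hypothesis engine.

```python
import random

random.seed(0)  # deterministic for demonstration

# Illustrative search space (assumed, not from the nanochat run)
SEARCH_SPACE = {
    "learning_rate": [1e-4, 3e-4, 1e-3],
    "warmup_steps": [100, 500, 1000],
    "weight_decay": [0.0, 0.01, 0.1],
}

def run_experiment(config):
    """Stand-in for a real training/eval job: returns a benchmark score.
    A toy objective so the loop runs end to end."""
    return (-abs(config["learning_rate"] - 3e-4) * 1000
            - abs(config["weight_decay"] - 0.01))

def propose(history):
    """Proposal function: usually mutate the best config seen so far,
    occasionally explore at random."""
    if history and random.random() < 0.7:
        best_config, _ = max(history, key=lambda h: h[1])
        config = dict(best_config)
        key = random.choice(list(SEARCH_SPACE))
        config[key] = random.choice(SEARCH_SPACE[key])
        return config
    return {k: random.choice(v) for k, v in SEARCH_SPACE.items()}

# The hypothesis table: every (config, score) pair ever tried
history = []
for _ in range(200):
    cfg = propose(history)
    history.append((cfg, run_experiment(cfg)))

best = max(history, key=lambda h: h[1])
print("best config:", best[0])
```

Everything the article describes is here in miniature: the intelligence lives entirely in `propose`, and the loop body is embarrassingly parallel, which is why throughput, not insight, sets the pace.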
Why It Matters
- ML research teams at mid-size AI companies now face a productivity gap: labs with autoresearch pipelines iterate faster on the same hardware, compressing the gap between a new idea and a validated result from weeks to hours
- Academic ML researchers working on small compute budgets are structurally disadvantaged against teams that can run hundreds of parallel experiments — the field’s output will increasingly concentrate at compute-rich institutions
- Founders building research tooling (experiment tracking, hyperparameter search, evals infrastructure) have a new wedge: autoresearch agents need better tooling for logging, deduplication, and hypothesis management
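One of the tooling gaps named above, deduplication, comes down to recognizing when a proposed configuration has already been run. A common approach is hashing a canonical serialization of the config; the sketch below is illustrative, and all names in it are assumptions.

```python
import hashlib
import json

def config_key(config):
    """Canonical hash of a config dict, stable under key ordering,
    so repeated proposals can be detected and skipped."""
    canonical = json.dumps(config, sort_keys=True)
    return hashlib.sha256(canonical.encode()).hexdigest()

seen = {}  # hash -> cached score

def record(config, score):
    """Store a result, or return the cached score for a duplicate
    config instead of paying for a redundant run."""
    key = config_key(config)
    if key not in seen:
        seen[key] = score
    return seen[key]

record({"learning_rate": 3e-4, "weight_decay": 0.01}, 0.91)
```

Hashing the sorted JSON means two proposals that differ only in dict ordering collapse to one entry, which matters when an agent generates hundreds of near-identical hypotheses.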
Source: Karpathy on Autoresearch — X