Optimizing a ML model for who's not familiar used to be a human research process of trial and error. Karpathy just released a repo that automate the research and test with parallel agents running 5 minute experiments.
It’s built on a stripped-down version of his earlier nanochat training core — a self-contained ~630-line Python file (train.py) that includes a full GPT model, Muon+AdamW optimizer, and training loop.
The setup is deliberately simple:
- prepare.py handles fixed data prep, tokenization, and evaluation (don’t touch it).
- The human only edits a high-level Markdown file (program.md) with research instructions or ideas.
- An AI coding agent (Claude, etc.) takes over: it edits only train.py, runs a training experiment for exactly 5 minutes (fixed wall-clock budget), measures validation bits-per-byte (val_bpb — lower is better), and decides whether to keep the change.
- Everything happens on a git feature branch. Improvements become commits; failures are discarded. The loop repeats indefinitely.

As Karpathy said it runs 100+ experiments while you sleep overnight. Karpathy ran ~650 over a weekend and confirmed the gains transferred to larger models, improving nanochat’s “time-to-GPT-2” leaderboard score.
Stay Updated
Get the latest AI insights delivered to your inbox. No spam, unsubscribe anytime.
Comments
Sign in as a member to join the conversation.
Loading comments…