Try an idea, measure it, keep what works, revert what doesn't, repeat.
autoresearch is a Claude Code / Claude skill that turns Claude into a disciplined optimization agent. Point it at a measurable target — test runtime, build/bundle size, latency, memory, ML training loss, a Lighthouse score — and it hill-climbs the metric through many small, measured changes: edit → benchmark → keep the wins (as commits) → revert the regressions → repeat, leaving an auditable trail the whole way.
It's a single git branch with many commits. Every kept experiment is one commit, so
git log is a clean record of what actually moved the number. Failed experiments are
reverted automatically (your run log is preserved). A confidence score tells you when a
"win" is real versus within benchmark noise.
This skill is an independent reimplementation for Claude of pi-autoresearch by davebcn87 — an extension that brought the autonomous experiment loop to the pi coding agent. pi-autoresearch is in turn inspired by karpathy/autoresearch.
pi-autoresearch is MIT licensed. No source code was copied — pi-autoresearch is a TypeScript pi extension (custom tools + a live TUI widget), whereas this is a Claude skill (Markdown instructions + bundled shell/Python scripts) that reproduces the same workflow and semantics in Claude's very different runtime. All credit for the original idea, design, and the "try → measure → keep/revert → repeat, forever" framing goes to the pi-autoresearch and karpathy/autoresearch authors. If you use the pi coding agent, use the original — it's excellent.
This skill is distributed for the skills.sh ecosystem
(npx skills, by Vercel). Requires Node.js 18+.
# Install into the current project (.claude/skills/)
npx skills add agusmdev/autoresearch-skill
# Or install globally for all projects (~/.claude/skills/)
npx skills add agusmdev/autoresearch-skill -gManual install (no skills.sh)
git clone https://github.com/agusmdev/autoresearch-skill ~/.claude/skills/autoresearchClaude Code picks up SKILL.md files in ~/.claude/skills/ (global) or
.claude/skills/ (project) on the next session.
After installing, the skill triggers automatically on requests like the ones below — or
you can invoke it explicitly with /autoresearch.
Just describe an iterative optimization with a metric. Examples that trigger it:
- "Optimize the runtime of our test suite in a loop — keep correctness, commit each win."
- "This JSON parser is too slow on big_input.json. Make it faster, but measure it properly."
- "Tune the hyperparameters in config.py to lower val_loss — it's noisy, so be rigorous."
- "Set up experiments to shrink the production bundle."
Claude will: confirm the goal/command/metric/scope → branch from a clean tree → write a benchmark and a session doc → take a baseline → then loop autonomously, committing improvements and reverting regressions, until it's exhausted promising ideas or you stop it. When it stops it reports baseline → best, % improvement, and what worked.
The loop is a resumable manual loop, not a background daemon — Claude drives it within
a turn. To run hands-off across context limits, wrap it with Claude Code's /loop or the
ralph-loop plugin, using a resume prompt like:
continue the autoresearch session in this repo: read autoresearch.md, run scripts/ar_status.py, then run the next experiment
All state lives in autoresearch.md + autoresearch.jsonl, so any fresh session can resume.
The skill is Markdown instructions plus three small scripts that handle the error-prone bookkeeping so every iteration is consistent and the result is auditable:
| Script | Role |
|---|---|
scripts/ar_run.sh |
Runs the benchmark (./autoresearch.sh), times it, parses METRIC name=value lines, runs optional correctness checks. |
scripts/ar_log.py |
Records a run: commits on keep, reverts code on discard/crash/checks_failed (preserving autoresearch.*), appends to autoresearch.jsonl, computes a confidence score. |
scripts/ar_status.py |
Prints a dashboard from the log: baseline vs best, % gain, confidence, recent runs. |
| File | Purpose |
|---|---|
autoresearch.md |
Living session doc — objective, metrics, scope, off-limits, constraints, "what's been tried". A fresh agent resumes from this alone. |
autoresearch.sh |
The benchmark. Prints METRIC name=value lines. Kept fast; editable mid-loop to add signal. |
autoresearch.jsonl |
Append-only log: one config header per segment, one line per run. The machine-readable history. |
autoresearch.checks.sh |
(optional) Correctness checks (tests/types/lint). Run after a passing benchmark; failure blocks keep. |
autoresearch.ideas.md |
(optional) Backlog of promising-but-deferred ideas. |
These are deliberately named autoresearch.* so the auto-revert never deletes them.
- keep — primary metric improved →
git commit(the change becomes a commit). - discard — worse or unchanged → code reverted, no commit.
- crash — benchmark failed to run → reverted.
- checks_failed — metric was valid but correctness checks failed → reverted; can't be kept.
After 3+ measured runs, ar_log.py reports confidence = |best_improvement| / MAD
(Median Absolute Deviation of the metric — a robust noise floor): ≥2.0× likely real,
1.0–2.0× marginal, <1.0× within noise. It's advisory — it never auto-discards.
A regression never scores as "likely real", and the noise floor is estimated from kept
runs so exploratory failures don't bury a genuine win.
Full mechanics, the JSONL format, and resume/recovery steps are in
references/mechanics.md.
A heavyweight measured loop is the wrong tool for a trivial change. Skip it and just edit directly when: you already know the fix; the metric is trivial to eyeball and correctness is obvious; you'll try only one or two things and no reviewable history is needed; or there's no stable, repeatable way to measure the target. Reach for it when the work is genuinely iterative and measurement is the point. (The skill itself says this up front.)
SKILL.md # the skill: setup, the loop, decision rules, stop conditions
scripts/
ar_run.sh # benchmark runner (timing + METRIC parsing + checks)
ar_log.py # keep/revert + jsonl log + confidence
ar_status.py # dashboard from the log
references/
mechanics.md # deep reference: semantics, jsonl format, confidence, resume
assets/
autoresearch.md.template # session-doc starter
autoresearch.sh.template # benchmark starter
Requirements: git, bash, python3 (stdlib only — no third-party packages).
MIT. Inspired by and modeled on pi-autoresearch (MIT) and karpathy/autoresearch.