autoresearch — an autonomous optimization-loop skill for Claude

Try an idea, measure it, keep what works, revert what doesn't, repeat.

autoresearch is a Claude Code / Claude skill that turns Claude into a disciplined optimization agent. Point it at a measurable target — test runtime, build/bundle size, latency, memory, ML training loss, a Lighthouse score — and it hill-climbs the metric through many small, measured changes: edit → benchmark → keep the wins (as commits) → revert the regressions → repeat, leaving an auditable trail the whole way.

Each session lives on a single git branch with many commits. Every kept experiment is one commit, so git log is a clean record of what actually moved the number. Failed experiments are reverted automatically (your run log is preserved). A confidence score tells you whether a "win" is real or just within benchmark noise.

Credit / inspiration

This skill is an independent reimplementation for Claude of pi-autoresearch by davebcn87 — an extension that brought the autonomous experiment loop to the pi coding agent. pi-autoresearch is in turn inspired by karpathy/autoresearch.

pi-autoresearch is MIT licensed. No source code was copied — pi-autoresearch is a TypeScript pi extension (custom tools + a live TUI widget), whereas this is a Claude skill (Markdown instructions + bundled shell/Python scripts) that reproduces the same workflow and semantics in Claude's very different runtime. All credit for the original idea, design, and the "try → measure → keep/revert → repeat, forever" framing goes to the pi-autoresearch and karpathy/autoresearch authors. If you use the pi coding agent, use the original — it's excellent.

Install

This skill is distributed for the skills.sh ecosystem (npx skills, by Vercel). Requires Node.js 18+.

# Install into the current project (.claude/skills/)
npx skills add agusmdev/autoresearch-skill

# Or install globally for all projects (~/.claude/skills/)
npx skills add agusmdev/autoresearch-skill -g

Manual install (no skills.sh)

git clone https://github.com/agusmdev/autoresearch-skill ~/.claude/skills/autoresearch

Claude Code picks up SKILL.md files in ~/.claude/skills/ (global) or .claude/skills/ (project) on the next session.

After installing, the skill triggers automatically on requests like the ones below — or you can invoke it explicitly with /autoresearch.

Usage

Just describe an iterative optimization with a metric. Examples that trigger it:

"Optimize the runtime of our test suite in a loop — keep correctness, commit each win."
"This JSON parser is too slow on big_input.json. Make it faster, but measure it properly."
"Tune the hyperparameters in config.py to lower val_loss — it's noisy, so be rigorous."
"Set up experiments to shrink the production bundle."

Claude will: confirm the goal/command/metric/scope → branch from a clean tree → write a benchmark and a session doc → take a baseline → then loop autonomously, committing improvements and reverting regressions, until it's exhausted promising ideas or you stop it. When it stops it reports baseline → best, % improvement, and what worked.

Running unattended

The loop is a resumable manual loop, not a background daemon — Claude drives it within a turn. To run hands-off across context limits, wrap it with Claude Code's /loop or the ralph-loop plugin, using a resume prompt like:

continue the autoresearch session in this repo: read autoresearch.md, run scripts/ar_status.py, then run the next experiment

All state lives in autoresearch.md + autoresearch.jsonl, so any fresh session can resume.

How it works

The skill is Markdown instructions plus three small scripts that handle the error-prone bookkeeping so every iteration is consistent and the result is auditable:

Script	Role
`scripts/ar_run.sh`	Runs the benchmark (`./autoresearch.sh`), times it, parses `METRIC name=value` lines, runs optional correctness checks.
`scripts/ar_log.py`	Records a run: commits on `keep`, reverts code on `discard`/`crash`/`checks_failed` (preserving `autoresearch.*`), appends to `autoresearch.jsonl`, computes a confidence score.
`scripts/ar_status.py`	Prints a dashboard from the log: baseline vs best, % gain, confidence, recent runs.
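
To make the `METRIC name=value` convention concrete, here is a minimal Python sketch of how such lines could be pulled out of benchmark output. It is not the code in scripts/ar_run.sh (which is a shell script); the regular expression, the function name `parse_metrics`, and the choice to skip non-numeric values are all assumptions made for illustration.

```python
import re

# Assumed line shape: "METRIC <name>=<number>". The exact grammar accepted by
# scripts/ar_run.sh is not documented here, so this pattern is a guess.
METRIC_RE = re.compile(r"^METRIC\s+([A-Za-z_][\w.-]*)=(\S+)\s*$")


def parse_metrics(output: str) -> dict[str, float]:
    """Collect METRIC name=value pairs from benchmark output, ignoring other lines."""
    metrics: dict[str, float] = {}
    for line in output.splitlines():
        match = METRIC_RE.match(line.strip())
        if match is None:
            continue
        name, raw = match.groups()
        try:
            metrics[name] = float(raw)
        except ValueError:
            # Skip values that are not numeric rather than failing the run.
            continue
    return metrics


sample = """\
running 120 tests...
METRIC runtime_s=12.84
METRIC peak_mem_mb=512
all tests passed
"""
print(parse_metrics(sample))  # {'runtime_s': 12.84, 'peak_mem_mb': 512.0}
```

Because unrelated output is ignored, a benchmark can keep its normal logging and simply add one `METRIC` line per number it wants tracked.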

Session files (created in your project during a run)

File	Purpose
`autoresearch.md`	Living session doc — objective, metrics, scope, off-limits, constraints, "what's been tried". A fresh agent resumes from this alone.
`autoresearch.sh`	The benchmark. Prints `METRIC name=value` lines. Kept fast; editable mid-loop to add signal.
`autoresearch.jsonl`	Append-only log: one config header per segment, one line per run. The machine-readable history.
`autoresearch.checks.sh`	(optional) Correctness checks (tests/types/lint). Run after a passing benchmark; failure blocks `keep`.
`autoresearch.ideas.md`	(optional) Backlog of promising-but-deferred ideas.

These are deliberately named autoresearch.* so the auto-revert never deletes them.
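
The benchmark only has to print `METRIC name=value` lines, so autoresearch.sh can delegate the measurement to any program. The sketch below shows one way a Python helper might do that, assuming a wall-clock metric; `time_workload`, `format_metric`, `example_workload`, the metric name `runtime_s`, and the median-of-repeats strategy are hypothetical choices, not part of the skill.

```python
import time
from statistics import median
from typing import Callable


def time_workload(workload: Callable[[], object], repeats: int = 5) -> float:
    """Return the median wall-clock seconds over several runs of the workload."""
    samples = []
    for _ in range(repeats):
        start = time.perf_counter()
        workload()
        samples.append(time.perf_counter() - start)
    return median(samples)


def format_metric(name: str, value: float) -> str:
    """Render one metric in the METRIC name=value shape."""
    return f"METRIC {name}={value:.6f}"


def example_workload() -> int:
    # Stand-in for the code under test (hypothetical).
    return sum(i * i for i in range(100_000))


if __name__ == "__main__":
    print(format_metric("runtime_s", time_workload(example_workload)))
```

Taking the median of a few repeats is one simple way to damp noise while keeping the benchmark fast; the right number of repeats depends on how long one run takes.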

Status semantics

keep — primary metric improved → git commit (the change becomes a commit).
discard — worse or unchanged → code reverted, no commit.
crash — benchmark failed to run → reverted.
checks_failed — metric was valid but correctness checks failed → reverted; can't be kept.
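
The four statuses above can be read as a small decision rule. The following Python sketch is one possible encoding of it, not the logic in scripts/ar_log.py: the function name `decide_status`, its parameters, and the order in which the conditions are tested are assumptions.

```python
from typing import Optional


def decide_status(
    best: float,
    candidate: Optional[float],
    lower_is_better: bool = True,
    checks_passed: bool = True,
) -> str:
    """Map one experiment's outcome to keep / discard / crash / checks_failed."""
    if candidate is None:
        # The benchmark did not produce a metric at all.
        return "crash"
    if not checks_passed:
        # A valid metric cannot rescue a change that breaks correctness.
        return "checks_failed"
    improved = candidate < best if lower_is_better else candidate > best
    # Worse or unchanged results are discarded.
    return "keep" if improved else "discard"


print(decide_status(best=12.8, candidate=12.1))                       # keep
print(decide_status(best=12.8, candidate=12.8))                       # discard
print(decide_status(best=12.8, candidate=None))                       # crash
print(decide_status(best=12.8, candidate=11.0, checks_passed=False))  # checks_failed
```

Only `keep` produces a commit; the other three outcomes all end with the code reverted.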

Confidence score

After 3+ measured runs, ar_log.py reports confidence = |best_improvement| / MAD (Median Absolute Deviation of the metric — a robust noise floor): ≥2.0× likely real, 1.0–2.0× marginal, <1.0× within noise. It's advisory — it never auto-discards. A regression never scores as "likely real", and the noise floor is estimated from kept runs so exploratory failures don't bury a genuine win.
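
As a rough illustration of the formula above, the sketch below computes a MAD-based score and maps it to the three bands. It is written under assumptions and is not the code in ar_log.py: the function names, returning `None` when there are fewer than three samples or the MAD is zero, and scoring a regression as 0.0 are all illustrative choices. Which runs feed the noise estimate is passed in explicitly; the text above says kept runs are used.

```python
from statistics import median
from typing import Optional


def mad(values: list[float]) -> float:
    """Median Absolute Deviation: the median of |x - median(x)|."""
    center = median(values)
    return median([abs(v - center) for v in values])


def confidence(
    baseline: float,
    best: float,
    noise_samples: list[float],
    lower_is_better: bool = True,
) -> Optional[float]:
    """Return |improvement| / MAD, or None when it cannot be estimated."""
    if len(noise_samples) < 3:
        return None  # the score is only reported after 3+ measured runs
    noise = mad(noise_samples)
    if noise == 0:
        return None  # assumption: treat a zero noise floor as "undefined"
    improvement = baseline - best if lower_is_better else best - baseline
    if improvement <= 0:
        return 0.0  # assumption: a regression can never look "likely real"
    return improvement / noise


def label(score: Optional[float]) -> str:
    """Translate a score into the advisory bands described above."""
    if score is None:
        return "not enough data"
    if score >= 2.0:
        return "likely real"
    if score >= 1.0:
        return "marginal"
    return "within noise"


kept_runs = [12.8, 12.5, 12.1, 11.9, 11.2]  # made-up metric values
score = confidence(baseline=12.8, best=11.2, noise_samples=kept_runs)
print(label(score))  # likely real
```

In this made-up data the improvement is about 1.6 and the MAD is about 0.4, so the score is roughly 4, well above the 2.0 threshold.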

Full mechanics, the JSONL format, and resume/recovery steps are in references/mechanics.md.

When NOT to use it

A heavyweight measured loop is the wrong tool for a trivial change. Skip it and just edit directly when: you already know the fix; the metric is trivial to eyeball and correctness is obvious; you'll try only one or two things and no reviewable history is needed; or there's no stable, repeatable way to measure the target. Reach for it when the work is genuinely iterative and measurement is the point. (The skill itself says this up front.)

What's in this repo

SKILL.md                      # the skill: setup, the loop, decision rules, stop conditions
scripts/
  ar_run.sh                   # benchmark runner (timing + METRIC parsing + checks)
  ar_log.py                   # keep/revert + jsonl log + confidence
  ar_status.py                # dashboard from the log
references/
  mechanics.md                # deep reference: semantics, jsonl format, confidence, resume
assets/
  autoresearch.md.template    # session-doc starter
  autoresearch.sh.template    # benchmark starter

Requirements: git, bash, python3 (stdlib only — no third-party packages).

License

MIT. Inspired by and modeled on pi-autoresearch (MIT) and karpathy/autoresearch.
