Skip to content

agusmdev/autoresearch-skill

Folders and files

NameName
Last commit message
Last commit date

Latest commit

 

History

1 Commit
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 

Repository files navigation

autoresearch — an autonomous optimization-loop skill for Claude

Try an idea, measure it, keep what works, revert what doesn't, repeat.

autoresearch is a Claude Code / Claude skill that turns Claude into a disciplined optimization agent. Point it at a measurable target — test runtime, build/bundle size, latency, memory, ML training loss, a Lighthouse score — and it hill-climbs the metric through many small, measured changes: edit → benchmark → keep the wins (as commits) → revert the regressions → repeat, leaving an auditable trail the whole way.

It's a single git branch with many commits. Every kept experiment is one commit, so git log is a clean record of what actually moved the number. Failed experiments are reverted automatically (your run log is preserved). A confidence score tells you when a "win" is real versus within benchmark noise.


Credit / inspiration

This skill is an independent reimplementation for Claude of pi-autoresearch by davebcn87 — an extension that brought the autonomous experiment loop to the pi coding agent. pi-autoresearch is in turn inspired by karpathy/autoresearch.

pi-autoresearch is MIT licensed. No source code was copied — pi-autoresearch is a TypeScript pi extension (custom tools + a live TUI widget), whereas this is a Claude skill (Markdown instructions + bundled shell/Python scripts) that reproduces the same workflow and semantics in Claude's very different runtime. All credit for the original idea, design, and the "try → measure → keep/revert → repeat, forever" framing goes to the pi-autoresearch and karpathy/autoresearch authors. If you use the pi coding agent, use the original — it's excellent.


Install

This skill is distributed for the skills.sh ecosystem (npx skills, by Vercel). Requires Node.js 18+.

# Install into the current project (.claude/skills/)
npx skills add agusmdev/autoresearch-skill

# Or install globally for all projects (~/.claude/skills/)
npx skills add agusmdev/autoresearch-skill -g
Manual install (no skills.sh)
git clone https://github.com/agusmdev/autoresearch-skill ~/.claude/skills/autoresearch

Claude Code picks up SKILL.md files in ~/.claude/skills/ (global) or .claude/skills/ (project) on the next session.

After installing, the skill triggers automatically on requests like the ones below — or you can invoke it explicitly with /autoresearch.


Usage

Just describe an iterative optimization with a metric. Examples that trigger it:

  • "Optimize the runtime of our test suite in a loop — keep correctness, commit each win."
  • "This JSON parser is too slow on big_input.json. Make it faster, but measure it properly."
  • "Tune the hyperparameters in config.py to lower val_loss — it's noisy, so be rigorous."
  • "Set up experiments to shrink the production bundle."

Claude will: confirm the goal/command/metric/scope → branch from a clean tree → write a benchmark and a session doc → take a baseline → then loop autonomously, committing improvements and reverting regressions, until it's exhausted promising ideas or you stop it. When it stops it reports baseline → best, % improvement, and what worked.

Running unattended

The loop is a resumable manual loop, not a background daemon — Claude drives it within a turn. To run hands-off across context limits, wrap it with Claude Code's /loop or the ralph-loop plugin, using a resume prompt like:

continue the autoresearch session in this repo: read autoresearch.md, run scripts/ar_status.py, then run the next experiment

All state lives in autoresearch.md + autoresearch.jsonl, so any fresh session can resume.


How it works

The skill is Markdown instructions plus three small scripts that handle the error-prone bookkeeping so every iteration is consistent and the result is auditable:

Script Role
scripts/ar_run.sh Runs the benchmark (./autoresearch.sh), times it, parses METRIC name=value lines, runs optional correctness checks.
scripts/ar_log.py Records a run: commits on keep, reverts code on discard/crash/checks_failed (preserving autoresearch.*), appends to autoresearch.jsonl, computes a confidence score.
scripts/ar_status.py Prints a dashboard from the log: baseline vs best, % gain, confidence, recent runs.

Session files (created in your project during a run)

File Purpose
autoresearch.md Living session doc — objective, metrics, scope, off-limits, constraints, "what's been tried". A fresh agent resumes from this alone.
autoresearch.sh The benchmark. Prints METRIC name=value lines. Kept fast; editable mid-loop to add signal.
autoresearch.jsonl Append-only log: one config header per segment, one line per run. The machine-readable history.
autoresearch.checks.sh (optional) Correctness checks (tests/types/lint). Run after a passing benchmark; failure blocks keep.
autoresearch.ideas.md (optional) Backlog of promising-but-deferred ideas.

These are deliberately named autoresearch.* so the auto-revert never deletes them.

Status semantics

  • keep — primary metric improved → git commit (the change becomes a commit).
  • discard — worse or unchanged → code reverted, no commit.
  • crash — benchmark failed to run → reverted.
  • checks_failed — metric was valid but correctness checks failed → reverted; can't be kept.

Confidence score

After 3+ measured runs, ar_log.py reports confidence = |best_improvement| / MAD (Median Absolute Deviation of the metric — a robust noise floor): ≥2.0× likely real, 1.0–2.0× marginal, <1.0× within noise. It's advisory — it never auto-discards. A regression never scores as "likely real", and the noise floor is estimated from kept runs so exploratory failures don't bury a genuine win.

Full mechanics, the JSONL format, and resume/recovery steps are in references/mechanics.md.


When NOT to use it

A heavyweight measured loop is the wrong tool for a trivial change. Skip it and just edit directly when: you already know the fix; the metric is trivial to eyeball and correctness is obvious; you'll try only one or two things and no reviewable history is needed; or there's no stable, repeatable way to measure the target. Reach for it when the work is genuinely iterative and measurement is the point. (The skill itself says this up front.)


What's in this repo

SKILL.md                      # the skill: setup, the loop, decision rules, stop conditions
scripts/
  ar_run.sh                   # benchmark runner (timing + METRIC parsing + checks)
  ar_log.py                   # keep/revert + jsonl log + confidence
  ar_status.py                # dashboard from the log
references/
  mechanics.md                # deep reference: semantics, jsonl format, confidence, resume
assets/
  autoresearch.md.template    # session-doc starter
  autoresearch.sh.template    # benchmark starter

Requirements: git, bash, python3 (stdlib only — no third-party packages).


License

MIT. Inspired by and modeled on pi-autoresearch (MIT) and karpathy/autoresearch.

About

Autonomous optimization-loop skill for Claude (try, measure, keep, revert, repeat). A Claude reimplementation of pi-autoresearch. Install via: npx skills add agusmdev/autoresearch-skill

Topics

Resources

License

Stars

Watchers

Forks

Releases

No releases published

Packages

 
 
 

Contributors