🧭 OpenSkill

Open-World Self-Evolution for LLM Agents

An agent that builds both its skills and its own verification signals from scratch — using only a task prompt and open-world resources, with no target-task supervision.

Note

Code is on the way. This repository currently hosts the project overview and release plan. Star ⭐ and watch 👀 to be notified when the code, skills, and benchmark drop. See the roadmap.

TL;DR

Self-evolving agents need to adapt after deployment — but existing methods assume a usable learning loop is already there: curated skills, successful trajectories, or verifier signals. Real open-world deployments may offer none of these, only a task prompt.

OpenSkill studies open-world self-evolution: an agent must build both its skills and its own verification signals from scratch, drawing on open-world resources but no target-task supervision. Target-task supervision is reserved strictly for final evaluation.

📈 Scalable
Skills are sourced from the open world, not bounded by a human's or model's prior knowledge.

🌐 Grounded
Knowledge and verification anchors come from real docs, repositories, and the web.

🔒 Supervision-free
No gold answers, rewards, or verifier outputs during learning; a leakage barrier keeps them out.

The Idea — a new paradigm for self-evolving skills

Unlike human-curated, LLM-generated, or supervised self-evolution, OpenSkill acquires skills from the open world and verifies them with self-built virtual tasks — making it simultaneously scalable, grounded, and supervision-free. Prior paradigms each miss at least one of these properties.

Four paradigms for self-evolving agent skills: Human-Curated, LLM-Generated, Supervised Self-Evolution, and Ours: Open-World.

How OpenSkill works

Given only a task prompt, a base model, tool access, and open-world resources, OpenSkill bootstraps a learning loop from scratch in three stages.

Stage	Name	What happens
01	Open-world knowledge acquisition	Retrieves task-relevant knowledge and independent verification anchors from docs, repos, papers, and the web — then drafts a structured skill plan.
02	Leakage-free skill evolution	Drafts skills and refines them in a sandbox against self-built virtual tests grounded in the anchors, fixing bugs and knowledge gaps over up to three rounds.
03	Zero-shot target evaluation	Deploys the frozen skill to the target agent. Ground-truth tests are unlocked only here, at final evaluation — never during construction.

OpenSkill framework overview: open-world knowledge acquisition, leakage-free evolution loop with a virtual-task verifier and diagnostic retriever, and final evaluation.

A leakage barrier keeps target supervision out of skill construction, unlocking it only for final evaluation.
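The three stages and the leakage barrier can be sketched in a few lines. The code has not been released, so this is only an illustration under stated assumptions: `LockedTests`, `LeakageError`, `evolve_skill`, and `toy_refine` are hypothetical names, a "skill" is reduced to a plain string, and each test is a keyword check. Only the cap of three refinement rounds and the rule that ground-truth tests unlock at final evaluation come from the overview above.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List

MAX_ROUNDS = 3  # the overview says refinement runs for up to three rounds


class LeakageError(RuntimeError):
    """Raised when target supervision is touched before final evaluation."""


@dataclass
class LockedTests:
    """Hypothetical wrapper: ground-truth tests stay locked until evaluation."""
    tests: List[Callable[[str], bool]]
    unlocked: bool = False

    def unlock(self) -> None:
        self.unlocked = True

    def pass_rate(self, skill: str) -> float:
        if not self.unlocked:
            raise LeakageError("ground-truth tests are locked during construction")
        return sum(test(skill) for test in self.tests) / len(self.tests)


def evolve_skill(
    draft: str,
    virtual_tests: Dict[str, Callable[[str], bool]],
    refine: Callable[[str, List[str]], str],
) -> str:
    """Stage 02 (sketch): refine the draft against self-built virtual tests only."""
    skill = draft
    for _ in range(MAX_ROUNDS):
        failures = [name for name, test in virtual_tests.items() if not test(skill)]
        if not failures:
            break  # every virtual test passes, so stop early
        skill = refine(skill, failures)
    return skill


# Toy stand-ins (assumptions): each virtual test checks for one keyword.
virtual_tests = {
    "header": lambda s: "header" in s,
    "encoding": lambda s: "encoding" in s,
}


def toy_refine(skill: str, failures: List[str]) -> str:
    # Patch one failing check per round, mimicking incremental fixes.
    return f"{skill}; handle {failures[0]}"


frozen_skill = evolve_skill("parse csv", virtual_tests, toy_refine)

ground_truth = LockedTests([lambda s: "header" in s])
ground_truth.unlock()  # Stage 03: only final evaluation unlocks the tests
print(frozen_skill, ground_truth.pass_rate(frozen_skill))
```

Calling `pass_rate` before `unlock()` raises `LeakageError`, which is one simple way to make accidental use of target supervision fail loudly during construction.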

Results — best automated pass rate on every setting

On SkillsBench (11 domains), OpenSkill beats the strongest closed-world baseline by +8.9 points on Opus 4.6 and +8.8 points on GPT 5.2, and lands within 3 points of the human upper bound (0.9 and 2.7 points, respectively), all while honoring the no-supervision constraint.

Metric	Value
Overall pass rate on Opus 4.6	43.6% (+8.9 over best baseline)
Overall pass rate on GPT 5.2	42.1% (+8.8 over best baseline)
GT test intents covered by self-built verifier	88.9%
Domains best / tied-best on Opus 4.6	8 / 11

SkillsBench — overall average pass rate (%) (Human = reference upper bound, excluded from ranking)

Target agent	No Skill	Self-Gen	CoT	Skill-Creator	AutoSkill	Memento	OpenSkill	Human
Opus 4.6 (Claude Code)	25.5	23.9	23.9	34.7	24.7	30.1	43.6	44.5
GPT 5.2 (Codex)	25.0	32.2	33.3	29.2	11.2	15.6	42.1	44.8
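The headline margins follow directly from the table. As a quick check, the sketch below recomputes them from the pass rates listed above; the dictionary layout and the `margins` helper are illustrative and not part of any released code.

```python
# Pass rates (%) copied from the SkillsBench table above.
rates = {
    "Opus 4.6 (Claude Code)": {
        "No Skill": 25.5, "Self-Gen": 23.9, "CoT": 23.9, "Skill-Creator": 34.7,
        "AutoSkill": 24.7, "Memento": 30.1, "OpenSkill": 43.6, "Human": 44.5,
    },
    "GPT 5.2 (Codex)": {
        "No Skill": 25.0, "Self-Gen": 32.2, "CoT": 33.3, "Skill-Creator": 29.2,
        "AutoSkill": 11.2, "Memento": 15.6, "OpenSkill": 42.1, "Human": 44.8,
    },
}


def margins(row):
    """Return (best baseline, OpenSkill gain over it, gap to the human reference)."""
    # Human is a reference upper bound and is excluded from the ranking.
    baselines = {k: v for k, v in row.items() if k not in ("OpenSkill", "Human")}
    best = max(baselines, key=baselines.get)
    gain = round(row["OpenSkill"] - baselines[best], 1)
    gap = round(row["Human"] - row["OpenSkill"], 1)
    return best, gain, gap


for agent, row in rates.items():
    print(agent, margins(row))
# Opus 4.6 (Claude Code) ('Skill-Creator', 8.9, 0.9)
# GPT 5.2 (Codex) ('CoT', 8.8, 2.7)
```

The strongest baseline differs by target agent: Skill-Creator on Opus 4.6 and CoT on GPT 5.2.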

Beyond SkillsBench, OpenSkill is also the best automated method on SocialMaze (82.7% / 70.7%) and ScienceWorld (90.0% / 85.3%) across both target agents.

Analysis — skills transfer, the verifier aligns, every component matters

RQ1 — Transferability. Skills generated by Opus 4.6 transfer as-is to four weaker models, improving by +5.5 to +14.8 points over the no-skill baseline with no model-specific adaptation.

RQ2 — Virtual verifier quality. Without ever seeing ground-truth tests, the verifier reaches 80.5% recall against GT-positive outcomes, 60.7% overall agreement, and covers 88.9% of GT test intents.

RQ3 — Component contribution. On SocialMaze, reward peaks at three refinement rounds; open-world query and the virtual verifier each improve over a parametric-only baseline and are largely complementary.
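To make the RQ2 metrics concrete, here is a minimal sketch of how recall against GT-positive outcomes and overall agreement could be computed from paired verdicts. The definitions are an assumed reading of the overview rather than the authors' evaluation code, `verifier_alignment` is a hypothetical helper, and the verdict lists are made-up toy values, not results from the paper.

```python
from typing import Sequence, Tuple


def verifier_alignment(
    virtual: Sequence[bool], ground_truth: Sequence[bool]
) -> Tuple[float, float]:
    """Return (recall on GT-positive outcomes, overall agreement).

    Assumed definitions:
      recall    = share of GT-passing outcomes the virtual verifier also passes
      agreement = share of outcomes where both verdicts match
    """
    if len(virtual) != len(ground_truth) or not ground_truth:
        raise ValueError("verdict lists must be non-empty and of equal length")
    pairs = list(zip(virtual, ground_truth))
    on_gt_positive = [v for v, g in pairs if g]
    recall = (
        sum(on_gt_positive) / len(on_gt_positive) if on_gt_positive else float("nan")
    )
    agreement = sum(v == g for v, g in pairs) / len(pairs)
    return recall, agreement


# Toy verdicts for five outcomes (illustrative only).
virtual_verdicts = [True, True, False, True, False]
gt_verdicts = [True, True, True, False, False]

recall, agreement = verifier_alignment(virtual_verdicts, gt_verdicts)
print(f"recall={recall:.3f} agreement={agreement:.3f}")
# recall=0.667 agreement=0.600
```

Under these definitions recall can exceed agreement, as it does in the reported numbers, because recall ignores the cases where the ground truth fails.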

Transferability of Opus 4.6-generated skills to four weaker models.

Ablations on SocialMaze: reward vs refinement iterations, and component contributions.

🗺️ Roadmap

Releases ship in phases. ⭐ the repo to get notified as each lands.

🟢 Now

Project page & overview — openlair.github.io/openskill
Paper preprint (arXiv) — coming soon

🟡 Next

Core OpenSkill framework code (knowledge acquisition → skill evolution → evaluation)
Reproduction scripts for the SkillsBench main results

Citation

@article{openskill2026,
  title   = {OpenSkill: Open-World Self-Evolution for LLM Agents},
  author  = {Yan, Zhiling and Song, Dingjie and Zhang, Hanrong and Liang, Wei and
             Zhang, Yuxuan and Dai, Yutong and He, Lifang and Yu, Philip S. and
             Xu, Ran and Li, Xiang and Sun, Lichao},
  journal = {arXiv preprint},
  year    = {2026}
}

Authors & Affiliations

Zhiling Yan¹,*, Dingjie Song¹,*, Hanrong Zhang², Wei Liang¹, Yuxuan Zhang³,⁴, Yutong Dai⁵, Lifang He¹, Philip S. Yu², Ran Xu⁵, Xiang Li⁶, Lichao Sun¹,†

¹ Lehigh University · ² University of Illinois Chicago · ³ University of British Columbia · ⁴ Vector Institute · ⁵ Salesforce AI Research · ⁶ Massachusetts General Hospital & Harvard Medical School

* Equal contribution · † Corresponding author

OpenSkill · Open-World Self-Evolution for LLM Agents · 2026 · OpenLAIR
