An agent that builds both its skills and its own verification signals from scratch — using only a task prompt and open-world resources, with no target-task supervision.
> **Note:** Code is on the way. This repository currently hosts the project overview and release plan. Star ⭐ and watch 👀 to be notified when the code, skills, and benchmark drop. See the roadmap.
Self-evolving agents need to adapt after deployment — but existing methods assume a usable learning loop is already there: curated skills, successful trajectories, or verifier signals. Real open-world deployments may offer none of these, only a task prompt.
OpenSkill studies open-world self-evolution: an agent must build both its skills and its own verification signals from scratch, drawing on open-world resources but no target-task supervision. Target-task supervision is reserved strictly for final evaluation.
| Property | What it means |
|---|---|
| 📈 Scalable | Skills are sourced from the open world, not bounded by a human's or model's prior knowledge. |
| 🌐 Grounded | Knowledge and verification anchors come from real docs, repositories, and the web. |
| 🔒 Supervision-free | No gold answers, rewards, or verifier outputs during learning; a leakage barrier keeps them out. |
Unlike human-curated, LLM-generated, or supervised self-evolution, OpenSkill acquires skills from the open world and verifies them with self-built virtual tasks — making it simultaneously scalable, grounded, and supervision-free. Prior paradigms each miss at least one of these properties.
Given only a task prompt, a base model, tool access, and open-world resources, OpenSkill bootstraps a learning loop from scratch in three stages.
| Stage | Name | What happens |
|---|---|---|
| 01 | Open-world knowledge acquisition | Retrieves task-relevant knowledge and independent verification anchors from docs, repos, papers, and the web — then drafts a structured skill plan. |
| 02 | Leakage-free skill evolution | Drafts skills and refines them in a sandbox against self-built virtual tests grounded in the anchors, fixing bugs and knowledge gaps over up to three rounds. |
| 03 | Zero-shot target evaluation | Deploys the frozen skill to the target agent. Ground-truth tests are unlocked only here, at final evaluation — never during construction. |
A leakage barrier keeps target supervision out of skill construction, unlocking it only for final evaluation.
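
Since the code is not released yet, the following is only a toy sketch of how the three stages and the leakage barrier could fit together. Every name here (`LeakageBarrier`, `acquire_anchors`, `build_virtual_tests`, `evolve_skill`, `evaluate`) is hypothetical and not taken from the OpenSkill codebase, and the "retrieval" and "refinement" steps are trivial string operations standing in for real search and model calls. The only details taken from the overview are the three-round cap and the rule that ground-truth tests stay locked until final evaluation.

```python
from typing import Callable, List

MAX_ROUNDS = 3  # the overview says refinement runs for "up to three rounds"

Test = Callable[[str], bool]  # a test takes the skill text and returns pass/fail


class LeakageBarrier:
    """Hypothetical guard: holds ground-truth tests and refuses access until unlocked."""

    def __init__(self, ground_truth_tests: List[Test]) -> None:
        self._tests = list(ground_truth_tests)
        self._unlocked = False

    def unlock_for_final_evaluation(self) -> None:
        self._unlocked = True

    def tests(self) -> List[Test]:
        if not self._unlocked:
            raise PermissionError("ground-truth tests are locked during skill construction")
        return list(self._tests)


def acquire_anchors(task_prompt: str) -> List[str]:
    """Stage 01 stand-in: a real system would search docs, repos, papers, and the web."""
    return ["parse input", "validate output"]  # toy anchors, not real retrieval


def build_virtual_tests(anchors: List[str]) -> List[Test]:
    """Stage 02 helper: one self-built test per anchor (toy check: anchor is mentioned)."""
    return [lambda skill, a=a: a in skill for a in anchors]


def evolve_skill(draft: str, anchors: List[str], max_rounds: int = MAX_ROUNDS) -> str:
    """Stage 02: refine the draft against virtual tests only, never ground truth."""
    tests = build_virtual_tests(anchors)
    skill = draft
    for _ in range(max_rounds):
        failing = [a for a, t in zip(anchors, tests) if not t(skill)]
        if not failing:
            break  # all virtual tests pass, stop early
        skill += f"\n- {failing[0]}"  # toy refinement: patch one gap per round
    return skill


def evaluate(skill: str, barrier: LeakageBarrier) -> float:
    """Stage 03: unlock ground truth only now and score the frozen skill."""
    barrier.unlock_for_final_evaluation()
    gt = barrier.tests()
    return sum(t(skill) for t in gt) / len(gt)


barrier = LeakageBarrier([lambda s: "parse input" in s, lambda s: "validate output" in s])
anchors = acquire_anchors("toy task prompt")
skill = evolve_skill("Skill draft:", anchors)
pass_rate = evaluate(skill, barrier)
```

The point of the design is that `evolve_skill` never receives the barrier at all, so there is no code path through which ground-truth tests could influence skill construction; calling `barrier.tests()` before `evaluate` raises `PermissionError`.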
On SkillsBench (11 domains), OpenSkill beats the strongest closed-world baseline by +8.9 points on Opus 4.6 and +8.8 points on GPT 5.2, and lands within 0.9 and 2.7 points of the human upper bound respectively, all while honoring the no-supervision constraint.
| Metric | Value |
|---|---|
| Overall pass rate on Opus 4.6 | 43.6% (+8.9 over best baseline) |
| Overall pass rate on GPT 5.2 | 42.1% (+8.8 over best baseline) |
| GT test intents covered by self-built verifier | 88.9% |
| Domains best / tied-best on Opus 4.6 | 8 / 11 |
SkillsBench — overall average pass rate (%) (Human = reference upper bound, excluded from ranking)
| Target agent | No Skill | Self-Gen | CoT | Skill-Creator | AutoSkill | Memento | OpenSkill | Human |
|---|---|---|---|---|---|---|---|---|
| Opus 4.6 (Claude Code) | 25.5 | 23.9 | 23.9 | 34.7 | 24.7 | 30.1 | 43.6 | 44.5 |
| GPT 5.2 (Codex) | 25.0 | 32.2 | 33.3 | 29.2 | 11.2 | 15.6 | 42.1 | 44.8 |
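
The headline margins can be re-derived from the table itself. The short check below copies the numbers from the two rows above; the `margins` helper is just an illustrative name, and "best baseline" is assumed to mean the highest-scoring column other than OpenSkill and Human.

```python
# Pass rates (%) copied from the SkillsBench table above.
scores = {
    "Opus 4.6 (Claude Code)": {
        "No Skill": 25.5, "Self-Gen": 23.9, "CoT": 23.9, "Skill-Creator": 34.7,
        "AutoSkill": 24.7, "Memento": 30.1, "OpenSkill": 43.6, "Human": 44.5,
    },
    "GPT 5.2 (Codex)": {
        "No Skill": 25.0, "Self-Gen": 32.2, "CoT": 33.3, "Skill-Creator": 29.2,
        "AutoSkill": 11.2, "Memento": 15.6, "OpenSkill": 42.1, "Human": 44.8,
    },
}


def margins(row):
    """Return (best baseline name, OpenSkill lead over it, gap to Human)."""
    # Assumption: every column except OpenSkill and Human counts as a baseline.
    baselines = {k: v for k, v in row.items() if k not in ("OpenSkill", "Human")}
    best = max(baselines, key=baselines.get)
    lead = round(row["OpenSkill"] - baselines[best], 1)
    gap_to_human = round(row["Human"] - row["OpenSkill"], 1)
    return best, lead, gap_to_human


for agent, row in scores.items():
    print(agent, margins(row))
```

On these numbers the strongest baseline differs by target agent (Skill-Creator on Opus 4.6, CoT on GPT 5.2), so the two margins are measured against different methods.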
Beyond SkillsBench, OpenSkill is also the best automated method on SocialMaze (82.7% / 70.7%) and ScienceWorld (90.0% / 85.3%) across both target agents.
- **RQ1: Transferability.** Skills generated by Opus 4.6 transfer as-is to four weaker models, improving by +5.5 to +14.8 points over no-skill with no model-specific adaptation.
- **RQ2: Virtual verifier quality.** Without ever seeing ground-truth tests, the verifier reaches 80.5% recall against GT-positive outcomes, 60.7% overall agreement, and covers 88.9% of GT test intents.
- **RQ3: Component contribution.** On SocialMaze, reward peaks at three refinement rounds; open-world query and the virtual verifier each improve over a parametric-only baseline and are largely complementary.
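
For readers unfamiliar with the two agreement figures quoted under RQ2, here is one plausible reading of them, sketched on made-up data. This assumes "recall against GT-positive outcomes" means the share of ground-truth passes that the virtual verifier also passes, and "overall agreement" means the share of cases where both verdicts match; the paper may define them differently. The `verifier_metrics` function and the sample `pairs` are illustrative only and are not results from the project.

```python
from typing import Dict, List, Tuple


def verifier_metrics(pairs: List[Tuple[bool, bool]]) -> Dict[str, float]:
    """pairs holds (virtual_verdict, ground_truth_verdict) for each evaluated run."""
    gt_positive = sum(gt for _, gt in pairs)
    if gt_positive == 0:
        raise ValueError("recall is undefined without any ground-truth passes")
    both_pass = sum(virtual and gt for virtual, gt in pairs)
    matching = sum(virtual == gt for virtual, gt in pairs)
    return {
        "recall": both_pass / gt_positive,    # GT passes the verifier also accepts
        "agreement": matching / len(pairs),   # verdicts identical, pass or fail
    }


# Made-up verdicts for five runs, purely to exercise the function.
pairs = [(True, True), (True, True), (False, True), (True, False), (False, False)]
metrics = verifier_metrics(pairs)
```

Under this reading, recall can sit well above overall agreement when the verifier tends to accept runs that ground truth rejects, since those false accepts lower agreement but leave recall untouched.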
Releases ship in phases. ⭐ the repo to get notified as each lands.
- Project page & overview — openlair.github.io/openskill
- Paper preprint (arXiv) — coming soon
- Core OpenSkill framework code (knowledge acquisition → skill evolution → evaluation)
- Reproduction scripts for the SkillsBench main results
@article{openskill2026,
title = {OpenSkill: Open-World Self-Evolution for LLM Agents},
author = {Yan, Zhiling and Song, Dingjie and Zhang, Hanrong and Liang, Wei and
Zhang, Yuxuan and Dai, Yutong and He, Lifang and Yu, Philip S. and
Xu, Ran and Li, Xiang and Sun, Lichao},
journal = {arXiv preprint},
year = {2026}
}

Zhiling Yan1,*, Dingjie Song1,*, Hanrong Zhang2, Wei Liang1, Yuxuan Zhang3,4, Yutong Dai5, Lifang He1, Philip S. Yu2, Ran Xu5, Xiang Li6, Lichao Sun1,†

1 Lehigh University · 2 University of Illinois Chicago · 3 University of British Columbia · 4 Vector Institute · 5 Salesforce AI Research · 6 Massachusetts General Hospital & Harvard Medical School

\* Equal contribution · † Corresponding author


