OpenLAIR/OpenSkill

🧭 OpenSkill

Open-World Self-Evolution for LLM Agents

An agent that builds both its skills and its own verification signals from scratch — using only a task prompt and open-world resources, with no target-task supervision.

Note

Code is on the way. This repository currently hosts the project overview and release plan. Star ⭐ and watch 👀 to be notified when the code, skills, and benchmark drop. See the roadmap.


TL;DR

Self-evolving agents need to adapt after deployment — but existing methods assume a usable learning loop is already there: curated skills, successful trajectories, or verifier signals. Real open-world deployments may offer none of these, only a task prompt.

OpenSkill studies open-world self-evolution: an agent must build both its skills and its own verification signals from scratch, drawing on open-world resources but no target-task supervision. Target-task supervision is reserved strictly for final evaluation.

📈 Scalable: Skills are sourced from the open world, not bounded by a human's or model's prior knowledge.
🌐 Grounded: Knowledge and verification anchors come from real docs, repositories, and the web.
🔒 Supervision-free: No gold answers, rewards, or verifier outputs are used during learning; a leakage barrier keeps them out.

The Idea — a new paradigm for self-evolving skills

Unlike human-curated, LLM-generated, or supervised self-evolution, OpenSkill acquires skills from the open world and verifies them with self-built virtual tasks — making it simultaneously scalable, grounded, and supervision-free. Prior paradigms each miss at least one of these properties.

Four paradigms for self-evolving agent skills: Human-Curated, LLM-Generated, Supervised Self-Evolution, and Ours: Open-World.

How OpenSkill works

Given only a task prompt, a base model, tool access, and open-world resources, OpenSkill bootstraps a learning loop from scratch in three stages.

| Stage | Name | What happens |
| --- | --- | --- |
| 01 | Open-world knowledge acquisition | Retrieves task-relevant knowledge and independent verification anchors from docs, repos, papers, and the web, then drafts a structured skill plan. |
| 02 | Leakage-free skill evolution | Drafts skills and refines them in a sandbox against self-built virtual tests grounded in the anchors, fixing bugs and knowledge gaps over up to three rounds. |
| 03 | Zero-shot target evaluation | Deploys the frozen skill to the target agent. Ground-truth tests are unlocked only here, at final evaluation, and never during construction. |

OpenSkill framework overview: open-world knowledge acquisition, leakage-free evolution loop with a virtual-task verifier and diagnostic retriever, and final evaluation.
A leakage barrier keeps target supervision out of skill construction, unlocking it only for final evaluation.
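
Since the code is not yet released, the three stages and the leakage barrier can only be illustrated, not reproduced. The following is a minimal toy sketch of the control flow. Every name in it (`Skill`, `LeakageBarrier`, `acquire_anchors`, `failing_anchors`, `evolve_skill`, `final_evaluation`) is hypothetical and not taken from the OpenSkill codebase. Retrieval and refinement are replaced by string stand-ins so the example runs on its own.

```python
from dataclasses import dataclass
from typing import Callable, List

MAX_ROUNDS = 3  # the overview states "up to three rounds" of refinement


@dataclass(frozen=True)
class Skill:
    body: str
    revisions: int = 0


class LeakageBarrier:
    """Holds ground-truth tests and refuses access until final evaluation."""

    def __init__(self, ground_truth_tests: List[Callable[[Skill], bool]]):
        self._tests = ground_truth_tests
        self._unlocked = False

    def unlock_for_final_evaluation(self) -> None:
        self._unlocked = True

    def tests(self) -> List[Callable[[Skill], bool]]:
        if not self._unlocked:
            raise PermissionError("ground-truth tests are locked during construction")
        return self._tests


def acquire_anchors(task_prompt: str) -> List[str]:
    # Stage 01 stand-in: a real system would retrieve from docs, repos, papers, web.
    return [f"doc-fact({task_prompt})", f"repo-example({task_prompt})"]


def failing_anchors(skill: Skill, anchors: List[str]) -> List[str]:
    # Toy virtual verifier: a virtual test passes if the skill cites its anchor.
    return [a for a in anchors if a not in skill.body]


def evolve_skill(task_prompt: str, max_rounds: int = MAX_ROUNDS) -> Skill:
    anchors = acquire_anchors(task_prompt)
    skill = Skill(body=f"draft for {task_prompt}")  # Stage 02: initial draft
    for _ in range(max_rounds):
        failures = failing_anchors(skill, anchors)
        if not failures:
            break  # all virtual tests pass, stop refining early
        # Toy refinement: repair one diagnosed gap per round.
        skill = Skill(skill.body + " | " + failures[0], skill.revisions + 1)
    return skill


def final_evaluation(skill: Skill, barrier: LeakageBarrier) -> float:
    barrier.unlock_for_final_evaluation()  # Stage 03 is the only unlock point
    results = [test(skill) for test in barrier.tests()]
    return sum(results) / len(results)
```

The design point the sketch tries to capture is that `evolve_skill` never receives the barrier object at all, so construction cannot read ground-truth tests even by accident. Only `final_evaluation` unlocks them, and it operates on a frozen `Skill`.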

Results — best automated pass rate on every setting

On SkillsBench (11 domains), OpenSkill beats the strongest closed-world baseline by +8.9 points on Opus 4.6 and +8.8 points on GPT 5.2, and lands within 1–3 points of the human upper bound, all while honoring the no-supervision constraint.

| Metric | Value |
| --- | --- |
| Overall pass rate on Opus 4.6 | 43.6% (+8.9 over best baseline) |
| Overall pass rate on GPT 5.2 | 42.1% (+8.8 over best baseline) |
| GT test intents covered by self-built verifier | 88.9% |
| Domains best / tied-best on Opus 4.6 | 8 / 11 |

SkillsBench — overall average pass rate (%)  (Human = reference upper bound, excluded from ranking)

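| Target agent | No Skill | Self-Gen | CoT | Skill-Creator | AutoSkill | Memento | OpenSkill | Human |
| --- | --- | --- | --- | --- | --- | --- | --- | --- |
| Opus 4.6 (Claude Code) | 25.5 | 23.9 | 23.9 | 34.7 | 24.7 | 30.1 | 43.6 | 44.5 |
| GPT 5.2 (Codex) | 25.0 | 32.2 | 33.3 | 29.2 | 11.2 | 15.6 | 42.1 | 44.8 |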
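
The headline margins can be rechecked directly from the table. The short script below is a sketch that copies the overall pass rates from the table above and computes, for each target agent, the strongest automated baseline, OpenSkill's margin over it, and the remaining gap to the human reference. The names `PASS_RATES` and `margins` are illustrative only.

```python
# Overall average pass rates (%) copied from the SkillsBench table.
PASS_RATES = {
    "Opus 4.6 (Claude Code)": {
        "No Skill": 25.5, "Self-Gen": 23.9, "CoT": 23.9, "Skill-Creator": 34.7,
        "AutoSkill": 24.7, "Memento": 30.1, "OpenSkill": 43.6, "Human": 44.5,
    },
    "GPT 5.2 (Codex)": {
        "No Skill": 25.0, "Self-Gen": 32.2, "CoT": 33.3, "Skill-Creator": 29.2,
        "AutoSkill": 11.2, "Memento": 15.6, "OpenSkill": 42.1, "Human": 44.8,
    },
}


def margins(row):
    """Return (best baseline name, OpenSkill margin over it, gap to Human)."""
    # Human is a reference upper bound and OpenSkill is the method under test,
    # so neither counts as a baseline.
    baselines = {k: v for k, v in row.items() if k not in ("OpenSkill", "Human")}
    best = max(baselines, key=baselines.get)
    over_best = round(row["OpenSkill"] - baselines[best], 1)
    to_human = round(row["Human"] - row["OpenSkill"], 1)
    return best, over_best, to_human


for agent, row in PASS_RATES.items():
    print(agent, margins(row))
# Opus 4.6 (Claude Code) ('Skill-Creator', 8.9, 0.9)
# GPT 5.2 (Codex) ('CoT', 8.8, 2.7)
```

Note that the strongest baseline differs by target agent: Skill-Creator on Opus 4.6 and CoT on GPT 5.2.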

Beyond SkillsBench, OpenSkill is also the best automated method on SocialMaze (82.7% / 70.7%) and ScienceWorld (90.0% / 85.3%) across both target agents.


Analysis: skills transfer, the verifier aligns, every component matters

RQ1 (Transferability). Skills generated by Opus 4.6 transfer as-is to four weaker models, improving by +5.5 to +14.8 points over the no-skill setting with no model-specific adaptation.

RQ2 (Virtual verifier quality). Without ever seeing ground-truth tests, the verifier reaches 80.5% recall against GT-positive outcomes and 60.7% overall agreement, and it covers 88.9% of GT test intents.

RQ3 (Component contribution). On SocialMaze, reward peaks at three refinement rounds; open-world query and the virtual verifier each improve over a parametric-only baseline and are largely complementary.

Figure: Transferability of Opus 4.6-generated skills to four weaker models.
Figure: Ablations on SocialMaze: reward vs. refinement iterations, and component contributions.
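
For readers who want to see how verifier-alignment numbers of this kind are typically computed, here is a small sketch. It assumes the standard definitions: recall is the fraction of ground-truth-positive outcomes that the virtual verifier also marks positive, and agreement is the fraction of all outcomes where the two verdicts match. The overview does not spell out the exact definitions used, so treat this as an assumption. The function name `verifier_alignment` and the toy counts are made up for illustration and are not the paper's data.

```python
from typing import Iterable, Tuple


def verifier_alignment(pairs: Iterable[Tuple[bool, bool]]) -> Tuple[float, float]:
    """Compute (recall, agreement) from (virtual_verdict, ground_truth_verdict) pairs."""
    pairs = list(pairs)
    if not pairs:
        raise ValueError("need at least one outcome")
    # Virtual verdicts restricted to outcomes the ground truth marks as passing.
    on_gt_positive = [virtual for virtual, truth in pairs if truth]
    if not on_gt_positive:
        raise ValueError("recall is undefined without GT-positive outcomes")
    recall = sum(on_gt_positive) / len(on_gt_positive)
    agreement = sum(virtual == truth for virtual, truth in pairs) / len(pairs)
    return recall, agreement


# Made-up outcomes: 8 true positives, 2 false negatives,
# 3 false positives, 7 true negatives.
toy = (
    [(True, True)] * 8
    + [(False, True)] * 2
    + [(True, False)] * 3
    + [(False, False)] * 7
)
print(verifier_alignment(toy))  # (0.8, 0.75)
```

High recall with lower overall agreement, as in the toy data, means the verifier rarely rejects skills that truly pass but sometimes accepts skills that fail.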

🗺️ Roadmap

Releases ship in phases. ⭐ the repo to get notified as each lands.

🟢 Now

🟡 Next

  • Core OpenSkill framework code (knowledge acquisition → skill evolution → evaluation)
  • Reproduction scripts for the SkillsBench main results

Citation

@article{openskill2026,
  title   = {OpenSkill: Open-World Self-Evolution for LLM Agents},
  author  = {Yan, Zhiling and Song, Dingjie and Zhang, Hanrong and Liang, Wei and
             Zhang, Yuxuan and Dai, Yutong and He, Lifang and Yu, Philip S. and
             Xu, Ran and Li, Xiang and Sun, Lichao},
  journal = {arXiv preprint},
  year    = {2026}
}

Authors & Affiliations

Zhiling Yan1,*, Dingjie Song1,*, Hanrong Zhang2, Wei Liang1, Yuxuan Zhang3,4, Yutong Dai5, Lifang He1, Philip S. Yu2, Ran Xu5, Xiang Li6, Lichao Sun1,†

1 Lehigh University  ·  2 University of Illinois Chicago  ·  3 University of British Columbia  ·  4 Vector Institute  ·  5 Salesforce AI Research  ·  6 Massachusetts General Hospital & Harvard Medical School

* Equal contribution    † Corresponding author


              

OpenSkill · Open-World Self-Evolution for LLM Agents · 2026 · OpenLAIR
