A test runner for agentskills.io-style AI agent skills
Updated May 20, 2026 - TypeScript
SWE-bench for your codebase — mine your merged PRs into local, contamination-free coding-agent benchmarks. Adapters: claude-code, aider (Opus 4.7 / GPT-5.5 / Sonnet 4.6 / Gemini 3.1 Pro).
An implementation of Anthropic's paper and essay "A statistical approach to model evaluations"
AI agent evolving strategies through automated self-play overnight. Generic framework with GEPA-inspired feedback loop and Elo tracking.
Create your self-hosted, open-source Operator model.
Isomorphic agent tooling: author once, run anywhere. Build, lint, route, fan out, eval, trace, guard, contract, and ledger AI-agent workflows across Cursor, Claude Code, Codex, and OpenCode.
LOAB: A benchmark for evaluating LLM agents on end-to-end mortgage lending operations under real regulatory constraints.
Build a private evaluation dataset to optimize your organization's token costs.
Legal Action Boundary Eval (LABE): public proxy eval for legal AI workflows at the action boundary
Reproducible evaluation harness for hidden coordination variables in multi-agent LLM systems.
Portable evaluation bundles for agents and agent-shaped workflows: bounded, reproducible, regression-aware proof surfaces for quality claims.
Alpha benchmark for repo continuation intelligence
A Multi-Agent System for Cross-Checking Phishing URLs.
The node-level tracing library for agentic software.
Agent evaluation sketches for banking due diligence and research: classification, context verification, pruning, and test-driven workflows.
Long-context quality probes and KV-cache research on local GPUs: retrieval is not utilization.
Verification-native local coding agent runtime with eval gates, memory, subagents, and model profiles.
Credential-free agent production lab with deterministic planning fixtures, traces, cost accounting, eval assertions, and HTML reports.
Eval and promotion harness for AI agents: compare baseline vs candidate behaviour and promote only changes that survive evidence.
Replay, regression, trace packaging, failure analysis, and dataset slicing for LLM agents