agent-evals

Here are 33 public repositories matching this topic...

darkrishabh / agent-skills-eval

A test runner for agentskills.io-style AI agent skills

cli yaml typescript ai-agents jsonl llm-evaluation llm-evals agent-evals agent-skills openai-compatible agentskills

Updated May 20, 2026
TypeScript

HumphreySun98 / repoagentbench

SWE-bench for your codebase — mine your merged PRs into local, contamination-free coding-agent benchmarks. Adapters: claude-code, aider (Opus 4.7 / GPT-5.5 / Sonnet 4.6 / Gemini 3.1 Pro).

benchmark developer-tools ai-agents aider llm-eval coding-agents agent-evals swe-bench gemini-3-1-pro claude-opus-4-7 gpt-5-5

Updated Apr 30, 2026
Python

The-Swarm-Corporation / StatisticalModelEvaluator

An implementation of Anthropic's paper and essay "A statistical approach to model evaluations"

ai ml multiagent agents llms evals llm-evals agent-evals multi-agent-eval

Updated Oct 6, 2025
Python

MrTsepa / autoevolve

AI agent evolving strategies through automated self-play overnight. Generic framework with GEPA-inspired feedback loop and Elo tracking.

python genetic-algorithm evolutionary-algorithms game-ai autonomous-agents ai-agents self-play prompt-optimization llm-agents agent-evals

Updated Mar 24, 2026
Python

iMeanAI / open-source-operator

Create your self-hosted, open-source Operator model.

training-infra agent-evals gui-agent browseruse native-agent-model

Updated Apr 10, 2025
Python

Agent-Pattern-Labs / iso

Isomorphic agent tooling: author once, run anywhere. Build, lint, route, fan out, eval, trace, guard, contract, and ledger AI-agent workflows across Cursor, Claude Code, Codex, and OpenCode.

Updated May 21, 2026
TypeScript

shubchat / loab

LOAB: A benchmark for evaluating LLM agents on end-to-end mortgage lending operations under real regulatory constraints.

multi-agent lending ai-safety llm-agents llm-benchmarking agent-evals tool-use-ai

Updated Mar 31, 2026
Python

s1liconcow / repogauge

Build a private evaluation dataset to optimize your organization's token costs.

token-cost agent-evals swe-bench

Updated Apr 26, 2026
Python

bigkan8 / legal-action-boundary-eval

Legal Action Boundary Eval (LABE): public proxy eval for legal AI workflows at the action boundary

ai-safety legal-ai agent-evals contract-ai compliance-ai

Updated Apr 19, 2026
Python

aak204 / llm-coordination-harness

Reproducible evaluation harness for hidden coordination variables in multi-agent LLM systems.

benchmark research evaluation coordination multi-agent reproducibility negative-results llm openrouter agent-evals

Updated Apr 5, 2026
Python

8Dionysus / aoa-evals

Portable evaluation bundles for agents and agent-shaped workflows: bounded, reproducible, regression-aware proof surfaces for quality claims.

regression-testing aoa agent-evals boundary-testing workflow-evaluation safety-evals scoring-rubrics comparative-evaluation longitudinal-evaluation agents-of-abyss eval-bundles

Updated May 22, 2026
Python

mverab / Reposcale

Alpha benchmark for repo continuation intelligence

python open-source benchmark evaluation developer-tools ai-agents ai-engineering llm-evals agent-evals llm-benchmark

Updated Apr 10, 2026
Python

vksundararajan / cross-check

A Multi-Agent System for Cross-Checking Phishing URLs.

dockerfile pytest cybersecurity adk mesop agent-development agent-evals adk-python agent-testing

Updated Dec 17, 2025
Python

kallemickelborg / nodetracer

The node-level tracing library for agentic software.

agent evaluation developer-tools observability traceability multi-agent-systems evals agentic-workflow agentic-ai agent-evals agent-orchestration agent-observability

Updated Mar 9, 2026
Python

HomenShum / openai-agent-eval-framework

Agent evaluation sketches for banking due diligence and research: classification, context verification, pruning, and test-driven workflows.

typescript research banking agent-evals llm-judge context-verification

Updated Sep 18, 2025
TypeScript

sztlink / turboquant-cuda-bench

Long-context quality probes and KV-cache research on local GPUs: retrieval is not utilization.

benchmark cuda rag kv-cache field-notes long-context llama-cpp vllm retrieval-augmented-generation qwen llm-evaluation rag-evaluation agent-evals turboquant

Updated May 22, 2026
JavaScript

2830500285 / omni-agent

Verification-native local coding agent runtime with eval gates, memory, subagents, and model profiles.

Updated May 16, 2026
TypeScript

KIM3310 / ai-agent-production-lab

Credential-free agent production lab with deterministic planning fixtures, traces, cost accounting, eval assertions, and HTML reports.

python reliability observability cost-tracking agent-evals

Updated May 16, 2026
Python

LeoStehlik / loopsmith

Eval and promotion harness for AI agents: compare baseline vs candidate behaviour and promote only changes that survive evidence.

developer-tools ai-agents evals prompt-optimization agent-evals agent-reliability

Updated May 20, 2026
Python

Jasvina / AgentReliabilityKit

Replay, regression, trace packaging, failure analysis, and dataset slicing for LLM agents

regression-testing benchmark-datasets trace-analysis llm-agents ai-infrastructure agent-evals

Updated May 3, 2026
Python
