agent-eval

Here are 18 public repositories matching this topic...

zozo123 / meta-harness-on-islo

Meta-harness optimization loop wired onto Islo sandboxes. POC: 0/5→5/5 in four proposer steps. Built on islo.dev.

harbor llm-agents agent-eval meta-harness islo harness-optimization

Updated May 5, 2026
HTML

zozo123 / meta-harness-on-islo-page

Project page for Meta-harness on Islo (POC). https://zozo123.github.io/meta-harness-on-islo-page/

project-page agent-eval meta-harness islo

Updated May 5, 2026
JavaScript

0-co / company

AI-operated company. Building agent-friend: universal tool adapter for AI agents. @tool → OpenAI, Claude, Gemini, MCP. Live 24/7 on Twitch.

python twitch structured-logging interactive-cli exponential-backoff human-in-the-loop zero-dependencies open-startup ai-agent autonomous-ai building-in-public llm-tools agent-security mcp-security personal-ai-agent agent-eval agent-friend

Updated Mar 26, 2026
Python

arthursoares / openclaw-llm-bench

A reasoning benchmark runner for comparing LLMs as OpenClaw agents use them. 52 prompts, 3 eval sets, 11 traps, LLM-as-judge, tier-based leaderboard.

gpt reasoning claude llm-eval ollama llm-as-judge llm-benchmark openclaw agent-eval

Updated Apr 11, 2026
Python

gojiplus / understudy

Scenario Testing for AI Agents

simulation evaluation agentic agent-evaluation google-adk agent-eval

Updated Jun 4, 2026
Python

linny006 / agent-eval-harness

Live, open-source benchmark for comparing AI coding agents on real GitHub issues

Updated Jun 8, 2026
Python

matt-rachlin / agent-eval-harness

Online evals on live agent traces. Open-source, self-hostable, OpenTelemetry-native eval harness with regression detection.

python ai agents observability otel fastapi opentelemetry llm llmops llm-eval evals agent-eval

Updated May 24, 2026
Python

ttxs69 / awesome-coding-agent-eval

A curated list of benchmarks, harnesses, leaderboards, and tools for evaluating AI coding agents.

benchmark leaderboard evaluation awesome-list codex ai-agent llm aider claude-code coding-agent swe-bench agent-eval ai-coding-agent-benchmark coding-agent-benchmark

Updated Jun 8, 2026

rogerchappel / ledgerpet

Local-first synthetic finance anomaly trainer for agent evals.

cli synthetic-data local-first agent-eval finance-ops

Updated Jun 8, 2026
JavaScript

fitchmultz / agent-eval

Transcript-first evaluation tool for comparing coding-agent sessions across Codex, Claude Code, and Pi.

typescript evaluation pi transcripts codex coding-agents claude-code agent-eval

Updated May 29, 2026
TypeScript

mizcausevic-dev / agent-eval-arena

Agent and LLM evaluation harness — golden datasets, multi-scorer execution, regression detection across model versions, cost-quality leaderboards, and CI gates for model promotion.

express typescript platform-engineering regression-detection ml-ops ai-platform ai-governance llm-eval agent-eval ci-gate

Updated Jun 1, 2026
TypeScript

Viprasol-Tech / agentcheck

Regression testing for AI agents — snapshot tool-calls, diff in CI, fail on regressions. A GitHub Action. By Viprasol Tech.

testing typescript ci snapshot-testing regression-testing ai-agents github-action llm llmops agent-eval

Updated Jun 7, 2026
TypeScript

pingwest-ai / agent-eval

Open-source evaluation of general-purpose AI agents on real-world tasks: identical prompts, objectively verifiable outcomes, and fully public scoring rubrics. By PingWest (硅星人).

benchmark evaluation ai-agents llm llm-evaluation deep-research agent-eval

Updated Jun 6, 2026

hermes-labs-ai / agent-convergence-scorer

agent-convergence-scorer is a CLI and Python library that scores how lexically similar N agent or LLM outputs are: exact-match rate, Jaccard token overlap, divergence point, and a composite convergence score over any list of runs. An eval primitive for measuring reproducibility and fan-out collapse. Lexical, not semantic. Zero deps.

cli benchmark consistency evaluation similarity multi-agent convergence reproducibility agents jaccard divergence llm llm-evaluation ai-reliability eval-harness agent-eval

Updated Jun 7, 2026
Python
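
The lexical metrics named in the agent-convergence-scorer description (exact-match rate, Jaccard token overlap, divergence point) can be illustrated with a minimal sketch. This is not the library's API: every function name here is hypothetical, and the lowercase whitespace tokenization and pairwise averaging are assumptions made for illustration only. The composite convergence score is omitted because the description does not say how it is computed.

```python
from itertools import combinations


def tokenize(text: str) -> set[str]:
    # Assumption: lowercase whitespace tokens; the real tool may tokenize differently.
    return set(text.lower().split())


def jaccard(a: str, b: str) -> float:
    # Jaccard overlap: |A ∩ B| / |A ∪ B| over the two token sets.
    ta, tb = tokenize(a), tokenize(b)
    if not ta and not tb:
        return 1.0  # two empty outputs are treated as identical
    return len(ta & tb) / len(ta | tb)


def exact_match_rate(runs: list[str]) -> float:
    # Fraction of run pairs whose outputs are string-identical.
    pairs = list(combinations(runs, 2))
    if not pairs:
        return 1.0
    return sum(a == b for a, b in pairs) / len(pairs)


def mean_jaccard(runs: list[str]) -> float:
    # Average Jaccard overlap across all run pairs.
    pairs = list(combinations(runs, 2))
    if not pairs:
        return 1.0
    return sum(jaccard(a, b) for a, b in pairs) / len(pairs)


def divergence_point(runs: list[str]) -> int:
    # Index of the first token position at which the runs disagree.
    # If no position disagrees, returns the length of the shortest run.
    token_lists = [r.split() for r in runs]
    for i, column in enumerate(zip(*token_lists)):
        if len(set(column)) > 1:
            return i
    return min((len(t) for t in token_lists), default=0)
```

For three hypothetical runs `["the cat sat", "the cat ran", "the cat sat"]`, one of the three pairs matches exactly, and the runs first disagree at token index 2. Because the comparison is purely lexical, two outputs that say the same thing in different words would still score as divergent.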

stevenchouai / agent-scorecard

Trace-first evaluation harness for deciding whether AI agents deserve more tokens, permissions, and trust

python evaluation roi ai-agents proof-chain agent-eval

Updated May 16, 2026
Python

zyy5114 / AgentEvalKit

Lightweight CI-native regression and behavior-aware evaluation toolkit for black-box agent workflows.

python cli json-schema tooling regression-testing github-actions llm-evals agent-eval

Updated May 9, 2026
Python

jeremylongshore / j-rig-skill-binary-eval

Binary-criteria evaluation harness for Claude skills with planned extension to plugins, agents, and MCP servers. Score every change yes/no across 7 layers — package integrity, trigger quality, functional quality, regression protection, baseline value, model variance, rollout safety. Never gradients.

mcp regression-testing skill-evaluation ai-evaluation llm-eval claude-code plugin-testing eval-harness agent-eval binary-criteria

Updated Jun 8, 2026
TypeScript

jeremylongshore / intent-eval-lab

Vendor-neutral research umbrella for measuring AI plugin, agent, and MCP server quality across CLI runtimes (Claude Code, Gemini CLI, Copilot CLI, Codex CLI).

mcp skill-discovery opentelemetry ai-evaluation gemini-cli claude-code plugin-testing cross-cli agent-eval invocation-rate

Updated Jun 8, 2026
Python
