Experiment/reliability at k by sharonxz · Pull Request #25 · anote-ai/Research-CodeBench

sharonxz · 2026-06-24T01:02:25Z

No description provided.

- Add make_rollout_benchmark() to data.py: generates n independent rollouts per (task, agent) with per-rollout variance around a true skill level; sets execution_success via pr >= 0.95 threshold
- Add reliability_at_k() to evaluate.py: correctly groups results by (task_id, agent_name), uses rollout counts as n/c in pass_at_k
- Add single_rollout_proxy() to evaluate.py: cheap H3 proxy combining pass_rate, regression penalty, and tool efficiency
- Add regression tests proving H1 category error and reliability@k behavior
- Add scripts/run_experiments.py running all four experiments with results
- Add scipy and matplotlib to dependencies

Co-Authored-By: Claude Sonnet 4.6 <[email protected]>
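The grouping described for reliability_at_k() can be illustrated with a minimal sketch. This is not the code from the PR: it assumes pass_at_k is the standard unbiased estimator 1 - C(n - c, k) / C(n, k), that each rollout is a dict with task_id, agent_name, and pr keys (the dict shape is hypothetical), and that per-agent scores are a plain mean over tasks. Only the pr >= 0.95 threshold and the (task_id, agent_name) grouping come from the commit message.

```python
from collections import defaultdict
from math import comb

SUCCESS_THRESHOLD = 0.95  # from the commit message: execution_success = pr >= 0.95


def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k estimate from n rollouts with c successes (assumed form)."""
    if not 0 <= c <= n or not 1 <= k <= n:
        raise ValueError("need 0 <= c <= n and 1 <= k <= n")
    # comb(n - c, k) is 0 when there are fewer than k failures.
    return 1.0 - comb(n - c, k) / comb(n, k)


def reliability_at_k(rollouts, k):
    """Count rollouts and successes per (task_id, agent_name), then average per agent."""
    counts = defaultdict(lambda: [0, 0])  # (task_id, agent_name) -> [n, c]
    for r in rollouts:
        key = (r["task_id"], r["agent_name"])
        counts[key][0] += 1
        counts[key][1] += int(r["pr"] >= SUCCESS_THRESHOLD)

    per_agent = defaultdict(list)
    for (_task_id, agent_name), (n, c) in counts.items():
        per_agent[agent_name].append(pass_at_k(n, c, k))
    # Averaging over tasks is an assumption made for this sketch.
    return {agent: sum(scores) / len(scores) for agent, scores in per_agent.items()}


# Hypothetical data: one agent, two tasks, five rollouts each.
rollouts = [
    {"task_id": "t1", "agent_name": "a", "pr": pr}
    for pr in (1.0, 0.96, 0.5, 0.9, 0.95)
] + [
    {"task_id": "t2", "agent_name": "a", "pr": pr}
    for pr in (0.2, 0.3, 0.1, 0.4, 0.5)
]

scores = reliability_at_k(rollouts, k=2)
```

Here n and c are counts of rollouts within a single (task, agent) group, so a task the agent never solves contributes 0 no matter how many other tasks it solves.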

Generates 4 matplotlib figures saved to figures/:
- fig1_baseline.png: Pass@1 vs current Pass@5 artifact
- fig2_h1_proof.png: H1 category-error proof (same reliability, different scores)
- fig3_h2_comparison.png: current pass@5 vs reliability@5 magnitude gap
- fig4_h3_correlation.png: single-rollout proxy vs reliability@5 scatter

Co-Authored-By: Claude Sonnet 4.6 <[email protected]>

sharonxz and others added 2 commits June 20, 2026 15:41

sharonxz wants to merge 2 commits into main from experiment/reliability-at-k
