Experiment/reliability at k #25

Open
sharonxz wants to merge 2 commits into main from experiment/reliability-at-k

Conversation

@sharonxz

Contributor

No description provided.

sharonxz and others added 2 commits June 20, 2026 15:41
- Add make_rollout_benchmark() to data.py: generates n independent
  rollouts per (task, agent) with per-rollout variance around a true
  skill level; sets execution_success via pr >= 0.95 threshold
- Add reliability_at_k() to evaluate.py: correctly groups results by
  (task_id, agent_name), uses rollout counts as n/c in pass_at_k
- Add single_rollout_proxy() to evaluate.py: cheap H3 proxy combining
  pass_rate, regression penalty, and tool efficiency
- Add regression tests proving H1 category error and reliability@k behavior
- Add scripts/run_experiments.py running all four experiments with results
- Add scipy and matplotlib to dependencies

Co-Authored-By: Claude Sonnet 4.6 <[email protected]>
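
For readers without the diff open, the rollout generation and per-group scoring described in this commit message can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not the PR's code: the function names ending in `_sketch`, the dictionary field names, the Gaussian noise model, and the reading of `pr` as a per-rollout pass rate are all hypothetical. It also assumes the standard unbiased pass@k estimator, 1 - C(n-c, k) / C(n, k), with n as the rollout count and c as the successful-rollout count per (task_id, agent_name) group; the PR's actual reliability@k definition may differ.

```python
import random
from collections import defaultdict
from math import comb


def pass_at_k(n: int, c: int, k: int) -> float:
    """Standard unbiased pass@k estimator: 1 - C(n - c, k) / C(n, k)."""
    if not 0 <= c <= n:
        raise ValueError("c must satisfy 0 <= c <= n")
    if not 1 <= k <= n:
        raise ValueError("k must satisfy 1 <= k <= n")
    if n - c < k:
        # Fewer than k failures exist, so any k samples include a success.
        return 1.0
    return 1.0 - comb(n - c, k) / comb(n, k)


def make_rollouts_sketch(true_skill, n_rollouts, noise_sd=0.05,
                         threshold=0.95, seed=0):
    """Generate n_rollouts noisy rollouts per (task_id, agent_name).

    true_skill maps (task_id, agent_name) -> skill in [0, 1].
    The Gaussian noise and the field names are assumptions for illustration.
    """
    rng = random.Random(seed)
    rollouts = []
    for (task_id, agent_name), skill in true_skill.items():
        for _ in range(n_rollouts):
            # Per-rollout pass rate: noise around the true skill, clamped.
            pr = min(1.0, max(0.0, rng.gauss(skill, noise_sd)))
            rollouts.append({
                "task_id": task_id,
                "agent_name": agent_name,
                "pass_rate": pr,
                "execution_success": pr >= threshold,
            })
    return rollouts


def reliability_at_k_sketch(results, k):
    """Group rollouts by (task_id, agent_name) and score each group.

    n is the number of rollouts in the group, c the number that succeeded.
    """
    counts = defaultdict(lambda: [0, 0])  # key -> [n, c]
    for r in results:
        key = (r["task_id"], r["agent_name"])
        counts[key][0] += 1
        counts[key][1] += int(r["execution_success"])
    return {key: pass_at_k(n, c, k) for key, (n, c) in counts.items()}
```

Grouping before counting is the point of the sketch: n and c come from the rollouts of one (task, agent) pair, so a score reflects repeated attempts at the same task rather than a count pooled across different tasks.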
Generates 4 matplotlib figures saved to figures/:
- fig1_baseline.png: Pass@1 vs current Pass@5 artifact
- fig2_h1_proof.png: H1 category-error proof (same reliability, different scores)
- fig3_h2_comparison.png: current pass@5 vs reliability@5 magnitude gap
- fig4_h3_correlation.png: single-rollout proxy vs reliability@5 scatter

Co-Authored-By: Claude Sonnet 4.6 <[email protected]>
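
The quantity behind fig4 (single-rollout proxy vs reliability@5) can be illustrated with a small sketch. Everything specific here is an assumption: the commit message only says the proxy combines pass_rate, a regression penalty, and tool efficiency, so the linear form, the weights, the parameter names, and the choice of a Pearson coefficient are hypothetical stand-ins rather than what `single_rollout_proxy()` or the plotting script actually do.

```python
from math import sqrt


def single_rollout_proxy_sketch(pass_rate, regressions, tool_calls,
                                min_tool_calls, w_regression=0.1,
                                w_efficiency=0.1):
    """Hypothetical proxy: pass rate, minus a regression penalty,
    plus a bonus for using close to the minimum number of tool calls."""
    if tool_calls > 0:
        efficiency = min(1.0, min_tool_calls / tool_calls)
    else:
        efficiency = 0.0
    return pass_rate - w_regression * regressions + w_efficiency * efficiency


def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(xs)
    if n != len(ys) or n < 2:
        raise ValueError("need two sequences of equal length >= 2")
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for a constant sequence")
    return cov / sqrt(var_x * var_y)
```

In use, one proxy score and one reliability@5 score would be computed per (task, agent) pair and the two lists passed to `pearson`; a rank correlation may be a better fit if the relationship is monotone but not linear.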