Docs/strengthen design doc by sharonxz · Pull Request #26 · anote-ai/Research-CodeBench

sharonxz · 2026-06-24T15:59:31Z

No description provided.

Rewrites DESIGN_DOC.md to focus on the core measurement error (applying pass@k to within-submission test counts violates the i.i.d. assumption of the Chen et al. 2021 estimator).

Adds:
- Formal i.i.d. assumption statement and proof-by-counterexample in the intro (H1 moved up to motivate the entire document)
- "What This Is (and Is Not)" subsection framing reliability@k as the correct operationalization, not a new formula
- Related Work section covering Chen et al., SWE-bench, EvalPlus, LiveCodeBench, and Wang et al. 2023 (self-consistency)
- Simulation Validity section explaining synthetic data design, what the simulation validates, and its limitations
- Explicit proxy formula (pass_rate × (1−regression_rate) × tool_eff) with ρ ≥ 0.70 threshold and fallback strategy
- Leaderboard flip analysis subsection under Experiment 2 showing how agent rankings can invert between current and correct metric

Co-Authored-By: Claude Sonnet 4.6 <[email protected]>
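For readers without DESIGN_DOC.md at hand, the two formulas this commit refers to can be sketched as follows. This is a minimal illustration, not the repository's code: the function names `pass_at_k` and `proxy_score` are hypothetical, and the input values in the usage lines are made up. The first function is the standard unbiased pass@k estimator from Chen et al. 2021, which assumes `n` independently drawn samples of which `c` pass; the second simply multiplies the three factors named in the proxy formula above.

```python
import math


def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k estimator (Chen et al. 2021).

    n: number of independently sampled solutions
    c: number of those samples that pass
    k: budget of samples considered
    Only meaningful when the n samples are i.i.d. draws.
    """
    if not 0 <= c <= n or not 1 <= k <= n:
        raise ValueError("require 0 <= c <= n and 1 <= k <= n")
    if n - c < k:
        # Fewer than k failures: every size-k subset contains a pass.
        return 1.0
    return 1.0 - math.comb(n - c, k) / math.comb(n, k)


def proxy_score(pass_rate: float, regression_rate: float, tool_eff: float) -> float:
    """Proxy as written in the commit: pass_rate * (1 - regression_rate) * tool_eff.

    Argument names mirror the commit message; their exact definitions
    live in DESIGN_DOC.md and are not reproduced here.
    """
    return pass_rate * (1.0 - regression_rate) * tool_eff


if __name__ == "__main__":
    # Illustrative inputs only, not results from the benchmark.
    print(pass_at_k(n=10, c=3, k=1))
    print(proxy_score(pass_rate=0.8, regression_rate=0.25, tool_eff=0.5))
```

Feeding within-submission test counts into `pass_at_k` as if they were `n` and `c` is exactly the misuse the design doc targets, since tests inside one submission are not independent samples.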

…ion (issue #5)

Adds H4 (security-adjusted reliability@k decorrelated from functional reliability@k) as a second primary contribution alongside the pass@k operationalization error (H1–H3).

Changes:
- Vision statement reframed around two independent benchmark failures: measurement error and security blindspot
- New "Security Blindspot" subsection: CWE-89/78/502/259 context, Veracode 2025 (2.74× vulnerability rate), OWASP Top 10 framing
- H4 hypothesis with Kendall τ < 0.6 falsifiability condition and security threshold rationale
- Experiment 4 protocol: security_adjusted_reliability@k, dual leaderboard flip table, heuristic scanner caveats
- Metrics Reference: security_score and security_adjusted_reliability@k implementations documented
- Related Work: added CyberSecEval, SecurityEval, Veracode 2025; updated gap statement to joint metric novelty
- Simulation Validity: explicit note that H4 requires real agent code
- Why This Matters: enterprise deployment and Anote product angle
- Implementation Plan: run_security_exp.py, fig5 scoped to Phase 2
- Timeline: updated to ICSE 2027 primary target (~Aug 2026)
- Open Questions: H4 risks + security scanner validation risk
- Related Issues: #5 and #10 explicitly closed out

Co-Authored-By: Claude Sonnet 4.6 <[email protected]>
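The Kendall τ condition in H4 can be illustrated with a small rank-correlation check. This is a sketch under assumptions, not the repository's `run_security_exp.py`: it assumes τ is computed between per-agent functional reliability@k scores and security-adjusted reliability@k scores, uses the tie-free τ-a variant, and treats 0.6 purely as the threshold quoted in the commit. The names `kendall_tau` and `below_h4_threshold` are hypothetical, and the scores in the usage lines are invented for illustration.

```python
from itertools import combinations
from typing import Sequence


def kendall_tau(x: Sequence[float], y: Sequence[float]) -> float:
    """Kendall tau-a: (concordant - discordant) / total pairs.

    Tied pairs count as neither concordant nor discordant.
    """
    if len(x) != len(y) or len(x) < 2:
        raise ValueError("need two equal-length sequences with at least 2 items")
    concordant = discordant = 0
    for i, j in combinations(range(len(x)), 2):
        sign = (x[i] - x[j]) * (y[i] - y[j])
        if sign > 0:
            concordant += 1
        elif sign < 0:
            discordant += 1
    total_pairs = len(x) * (len(x) - 1) // 2
    return (concordant - discordant) / total_pairs


def below_h4_threshold(tau: float, threshold: float = 0.6) -> bool:
    """True when tau falls under the threshold named in the commit (assumed reading)."""
    return tau < threshold


if __name__ == "__main__":
    # Made-up scores for four hypothetical agents, not benchmark results.
    functional = [0.9, 0.8, 0.7, 0.6]
    security_adjusted = [0.5, 0.8, 0.4, 0.7]
    tau = kendall_tau(functional, security_adjusted)
    print(tau, below_h4_threshold(tau))
```

A library routine such as `scipy.stats.kendalltau` (which computes τ-b by default and handles ties) would be the more usual choice in practice; the pure-Python version is used here only to keep the sketch dependency-free.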

sharonxz and others added 2 commits June 23, 2026 22:18

sharonxz wants to merge 2 commits into main from docs/strengthen-design-doc
