Docs/strengthen design doc #26
Open
sharonxz wants to merge 2 commits into
Rewrites DESIGN_DOC.md to focus on the core measurement error (applying pass@k to within-submission test counts violates the i.i.d. assumption of the Chen et al. 2021 estimator). Adds:
- Formal i.i.d. assumption statement and proof-by-counterexample in the intro (H1 moved up to motivate the entire document)
- "What This Is (and Is Not)" subsection framing reliability@k as the correct operationalization, not a new formula
- Related Work section covering Chen et al., SWE-bench, EvalPlus, LiveCodeBench, and Wang et al. 2023 (self-consistency)
- Simulation Validity section explaining synthetic data design, what the simulation validates, and its limitations
- Explicit proxy formula (pass_rate × (1−regression_rate) × tool_eff) with ρ ≥ 0.70 threshold and fallback strategy
- Leaderboard flip analysis subsection under Experiment 2 showing how agent rankings can invert between current and correct metric

Co-Authored-By: Claude Sonnet 4.6 <[email protected]>
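For reviewers who want the two formulas named in this commit side by side, here is a minimal sketch. It assumes the standard unbiased pass@k estimator from Chen et al. 2021, 1 − C(n−c, k)/C(n, k), and the proxy product exactly as written above. The function names `pass_at_k` and `proxy_score` are hypothetical and are not taken from the repository. The ρ ≥ 0.70 check is left out because the commit message does not say which correlation coefficient is meant.

```python
from math import comb


def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k estimator (Chen et al. 2021): 1 - C(n-c, k) / C(n, k).

    n: number of independent samples drawn for one problem
    c: number of those samples that are correct
    k: evaluation budget, 1 <= k <= n
    The estimator assumes the n samples are i.i.d. draws from the model.
    """
    if not (0 <= c <= n) or not (1 <= k <= n):
        raise ValueError("require 0 <= c <= n and 1 <= k <= n")
    # math.comb returns 0 when fewer than k incorrect samples exist,
    # so the estimate is exactly 1.0 in that case.
    return 1.0 - comb(n - c, k) / comb(n, k)


def proxy_score(pass_rate: float, regression_rate: float, tool_eff: float) -> float:
    """Proxy product from the commit message (hypothetical helper name)."""
    return pass_rate * (1.0 - regression_rate) * tool_eff


if __name__ == "__main__":
    # Ten independent samples, three correct, budget of one attempt.
    print(pass_at_k(n=10, c=3, k=1))
    # Illustrative inputs only; these values are not results from the project.
    print(proxy_score(pass_rate=0.8, regression_rate=0.25, tool_eff=0.5))
```

Passing the test counts of a single submission as `n` and `c` would run without error, which is the misuse the rewritten design doc argues against: tests within one submission are not independent samples.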
…ion (issue #5)

Adds H4 (security-adjusted reliability@k decorrelated from functional reliability@k) as a second primary contribution alongside the pass@k operationalization error (H1–H3). Changes:
- Vision statement reframed around two independent benchmark failures: measurement error and security blindspot
- New "Security Blindspot" subsection: CWE-89/78/502/259 context, Veracode 2025 (2.74× vulnerability rate), OWASP Top 10 framing
- H4 hypothesis with Kendall τ < 0.6 falsifiability condition and security threshold rationale
- Experiment 4 protocol: security_adjusted_reliability@k, dual leaderboard flip table, heuristic scanner caveats
- Metrics Reference: security_score and security_adjusted_reliability@k implementations documented
- Related Work: added CyberSecEval, SecurityEval, Veracode 2025; updated gap statement to joint metric novelty
- Simulation Validity: explicit note that H4 requires real agent code
- Why This Matters: enterprise deployment and Anote product angle
- Implementation Plan: run_security_exp.py, fig5 scoped to Phase 2
- Timeline: updated to ICSE 2027 primary target (~Aug 2026)
- Open Questions: H4 risks + security scanner validation risk
- Related Issues: #5 and #10 explicitly closed out

Co-Authored-By: Claude Sonnet 4.6 <[email protected]>
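The H4 condition compares two leaderboards using Kendall τ. As a rough sketch of that comparison, the snippet below computes τ-a (no tie correction) between two score lists and checks it against the 0.6 threshold. The function name, the agent scores, and the choice of τ-a are all assumptions made for illustration. The commit message does not say which τ variant the design doc uses, and it does not define security_adjusted_reliability@k, so that metric is not reimplemented here.

```python
from itertools import combinations
from typing import Sequence


def kendall_tau_a(x: Sequence[float], y: Sequence[float]) -> float:
    """Kendall tau-a: (concordant - discordant) / number of pairs.

    Ties count as neither concordant nor discordant (no tie correction).
    """
    if len(x) != len(y) or len(x) < 2:
        raise ValueError("need two equal-length sequences with at least 2 items")
    concordant = discordant = 0
    for i, j in combinations(range(len(x)), 2):
        sign = (x[i] - x[j]) * (y[i] - y[j])
        if sign > 0:
            concordant += 1
        elif sign < 0:
            discordant += 1
    n_pairs = len(x) * (len(x) - 1) // 2
    return (concordant - discordant) / n_pairs


if __name__ == "__main__":
    # Hypothetical per-agent scores, invented for illustration only.
    functional = [0.9, 0.8, 0.7, 0.6]
    security_adjusted = [0.5, 0.8, 0.6, 0.7]
    tau = kendall_tau_a(functional, security_adjusted)
    THRESHOLD = 0.6  # threshold named in the H4 bullet above
    print(f"tau = {tau:.3f}, below threshold: {tau < THRESHOLD}")
```

With real leaderboards that contain tied scores, a tie-corrected variant such as τ-b may be the better fit.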
No description provided.