Docs/strengthen design doc #26

Open
sharonxz wants to merge 2 commits into main from docs/strengthen-design-doc
@sharonxz

No description provided.

sharonxz and others added 2 commits June 23, 2026 22:18
Rewrites DESIGN_DOC.md to focus on the core measurement error
(applying pass@k to within-submission test counts violates the i.i.d.
assumption of the Chen et al. 2021 estimator). Adds:

- Formal i.i.d. assumption statement and proof-by-counterexample in the
  intro (H1 moved up to motivate the entire document)
- "What This Is (and Is Not)" subsection framing reliability@k as the
  correct operationalization, not a new formula
- Related Work section covering Chen et al., SWE-bench, EvalPlus,
  LiveCodeBench, and Wang et al. 2023 (self-consistency)
- Simulation Validity section explaining synthetic data design, what the
  simulation validates, and its limitations
- Explicit proxy formula (pass_rate × (1−regression_rate) × tool_eff)
  with ρ ≥ 0.70 threshold and fallback strategy
- Leaderboard flip analysis subsection under Experiment 2 showing how
  agent rankings can invert between the current and the corrected metric

Co-Authored-By: Claude Sonnet 4.6 <[email protected]>
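
For readers of this thread, a minimal sketch of the two formulas the commit message above refers to. `pass_at_k` is the unbiased estimator from Chen et al. 2021, which assumes the `n` samples are i.i.d. draws for the same problem. `proxy_score` is only a literal reading of the proxy formula in the message (pass_rate × (1−regression_rate) × tool_eff). The function names, argument names, and the assumption that every input lies in [0, 1] are illustrative and are not taken from DESIGN_DOC.md.

```python
from math import comb


def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k estimator (Chen et al. 2021): 1 - C(n-c, k) / C(n, k).

    n: number of i.i.d. samples drawn for one problem
    c: number of those samples that pass
    k: budget of samples considered
    """
    if not (0 <= c <= n) or not (1 <= k <= n):
        raise ValueError("require 0 <= c <= n and 1 <= k <= n")
    # comb(n - c, k) is 0 when fewer than k samples fail, giving pass@k = 1.
    return 1.0 - comb(n - c, k) / comb(n, k)


def proxy_score(pass_rate: float, regression_rate: float, tool_eff: float) -> float:
    """Hypothetical helper: pass_rate * (1 - regression_rate) * tool_eff.

    Assumes (not confirmed by the design doc) that all inputs are in [0, 1].
    """
    for name, value in (("pass_rate", pass_rate),
                        ("regression_rate", regression_rate),
                        ("tool_eff", tool_eff)):
        if not 0.0 <= value <= 1.0:
            raise ValueError(f"{name} must be in [0, 1]")
    return pass_rate * (1.0 - regression_rate) * tool_eff
```

The estimator is only meaningful when the `n` attempts are independent samples. Feeding it the number of passing tests inside a single submission, as the commit message describes, breaks that premise because tests within one submission are not independent draws.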
…ion (issue #5)

Adds H4 (security-adjusted reliability@k decorrelated from functional
reliability@k) as a second primary contribution alongside the pass@k
operationalization error (H1–H3). Changes:

- Vision statement reframed around two independent benchmark failures:
  measurement error and security blindspot
- New "Security Blindspot" subsection: CWE-89/78/502/259 context,
  Veracode 2025 (2.74× vulnerability rate), OWASP Top 10 framing
- H4 hypothesis with Kendall τ < 0.6 falsifiability condition and
  security threshold rationale
- Experiment 4 protocol: security_adjusted_reliability@k, dual
  leaderboard flip table, heuristic scanner caveats
- Metrics Reference: security_score and security_adjusted_reliability@k
  implementations documented
- Related Work: added CyberSecEval, SecurityEval, Veracode 2025;
  updated gap statement to joint metric novelty
- Simulation Validity: explicit note that H4 requires real agent code
- Why This Matters: enterprise deployment and Anote product angle
- Implementation Plan: run_security_exp.py, fig5 scoped to Phase 2
- Timeline: updated to ICSE 2027 primary target (~Aug 2026)
- Open Questions: H4 risks + security scanner validation risk
- Related Issues: #5 and #10 explicitly closed out

Co-Authored-By: Claude Sonnet 4.6 <[email protected]>
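
For readers of this thread, a minimal sketch of how a Kendall τ comparison between two leaderboards, and the pairwise "flips" behind it, could be computed. This is the tau-a variant (no tie correction), chosen here for simplicity. The commit message does not say which variant the design doc uses, and the function names are hypothetical. The 0.6 threshold is the one quoted in the message, and how H4 interprets it is left to the design doc.

```python
from itertools import combinations
from typing import Sequence


def _pair_signs(x: Sequence[float], y: Sequence[float]) -> tuple[int, int]:
    """Count (concordant, discordant) agent pairs across two score lists."""
    if len(x) != len(y) or len(x) < 2:
        raise ValueError("need two equal-length score lists with >= 2 agents")
    concordant = discordant = 0
    for i, j in combinations(range(len(x)), 2):
        s = (x[i] - x[j]) * (y[i] - y[j])
        if s > 0:
            concordant += 1
        elif s < 0:
            discordant += 1  # the two metrics order this pair differently
    return concordant, discordant


def kendall_tau_a(x: Sequence[float], y: Sequence[float]) -> float:
    """Kendall tau-a: (concordant - discordant) / (n * (n - 1) / 2)."""
    concordant, discordant = _pair_signs(x, y)
    n_pairs = len(x) * (len(x) - 1) // 2
    return (concordant - discordant) / n_pairs


def count_flips(x: Sequence[float], y: Sequence[float]) -> int:
    """Number of agent pairs whose relative order inverts between metrics."""
    return _pair_signs(x, y)[1]


if __name__ == "__main__":
    # Made-up scores for four agents, for illustration only.
    functional = [0.9, 0.8, 0.7, 0.6]
    security_adjusted = [0.5, 0.8, 0.7, 0.6]
    tau = kendall_tau_a(functional, security_adjusted)
    print(f"tau={tau:.2f}, flips={count_flips(functional, security_adjusted)}, "
          f"below 0.6 threshold: {tau < 0.6}")
```

If ties are expected in real leaderboard scores, a tie-corrected variant such as tau-b would be the safer choice.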