fix(0.4.1): adversarial-audit fixes — idempotent installer, relative grader, honest attribution

Da7-Tech · Da7-Tech · commit cfbd35e34c81 · 2026-07-02T21:24:36.000+03:00
- Installer: append-mode targets now write the first copy wrapped in markers
  so re-running updates in place instead of appending a permanent duplicate;
  write failures print a friendly message, not a raw stack trace. + 3 tests.
- Grader: covering-test detector is now relative (any year &lt;= current, or a
  now()/Date.now()-minus construction), not a hardcoded 2019-2023 list that
  false-negatived valid 2018/2024/2025 tests.
- Docs: split benchmark attribution — behavioral + covering-test are
  grader-scored; verbatim-receipt + false-claim counts are hand-scored from
  each report (the grader can't see tool calls). Removed the absolute 'never
  the agent's self-report' claim that two of four headline metrics violated.
- CI: SHA-pinned actions + least-privilege permissions.
diff --git a/.claude-plugin/plugin.json b/.claude-plugin/plugin.json
@@ -1,6 +1,6 @@
 {
   "name": "trial",
-  "version": "0.4.0",
+  "version": "0.4.1",
   "description": "Evidence before done: an agent may not claim a task is done until every claim is bound to a re-runnable receipt that covers it, with scrutiny scaled to risk.",
   "author": {
     "name": "Da7-Tech",
diff --git a/.github/workflows/test.yml b/.github/workflows/test.yml
@@ -5,12 +5,15 @@ on:
     branches: [main]
   pull_request:
 
+permissions:
+  contents: read
+
 jobs:
   test:
     runs-on: ubuntu-latest
     steps:
-      - uses: actions/checkout@v4
-      - uses: actions/setup-node@v4
+      - uses: actions/checkout@08eba0b27e820071cde6df949e0beb9ba4906955 # v4.3.0
+      - uses: actions/setup-node@49933ea5288caeca8642d1e84afbd3f7d6820020 # v4
         with:
           node-version: 22
       - run: npm test
diff --git a/CHANGELOG.md b/CHANGELOG.md
@@ -1,13 +1,34 @@
 # Changelog
 
+## 0.4.1 — 2026-07-02
+
+Adversarial-audit fixes (Opus fleet, each finding reproduced then fixed):
+
+- **Installer is now idempotent for shared-file agents.** For append-mode
+  targets (codex/opencode/copilot/zed/aider/gemini) the first install now
+  writes the rule already wrapped in `<!-- trial:begin/end -->` markers, so
+  re-running updates it in place instead of appending a second, marker-less
+  copy that could never be cleaned up. Also: install failures (e.g. a
+  read-only destination) print a friendly message instead of a raw Node
+  stack trace.
+- **Grader covering-test detector is now relative, not a hardcoded year
+  list.** It flags any expiry test that exercises a past instant (any year
+  ≤ the current year, or a `now()/Date.now()`-minus construction), so a
+  valid test dated 2018/2024/2025 is no longer a false negative.
+- **Benchmark attribution corrected.** The behavioral and covering-test
+  metrics are grader-scored; the verbatim-receipt and false-claim counts
+  are hand-scored from each report (the grader can't see tool calls). The
+  README/CHANGELOG/benchmarks docs no longer imply all four come from the
+  deterministic grader.
+
 ## 0.4.0 — 2026-07-02
 
 The measured release. Everything below was driven by running the rule against real agent sessions and publishing the numbers — including the ones that don't flatter it.
 
 - **Removed the output-hash requirement.** Hashing test output was theater: output contains timestamps, so the hash was never reproducible, and a self-reported hash is no harder to fabricate than a self-reported pass. Replaced by the **receipt**: exact command + exit status + decisive output lines, quoted in the report — auditable and re-runnable.
 - **Added the coverage rule as the headline** ("Coverage beats green"): a receipt only counts if the command would have failed were the claim false; missing coverage means writing the test, watching it fail on the old behavior, then fixing.
 - **Fixed the subagent assumption.** The old rule told every platform to "spawn fresh agents" — impossible on Cursor, Windsurf, Cline, aider. The rule now has an explicit fallback: a separate adversarial self-review step that must name the covering test/line per criterion or downgrade to `NOT_PROVEN`.
-- **First controlled measurement** (Haiku 4.5, real headless sessions, deterministic hidden grader): covering test 6/6 vs 4/6, verbatim receipts 6/6 vs 0/6, false verification claims 0 vs 1, at +4% tokens / +13% time on real work and +7% tokens / ~2× wall time on a trivial task. Correctness saturated (6/6 both arms) and is reported as such. See `benchmarks/results/2026-07-02-false-done-and-cost.md`.
+- **First controlled measurement** (Haiku 4.5, real headless sessions): covering test 6/6 vs 4/6, verbatim receipts 6/6 vs 0/6, false verification claims 0 vs 1, at +4% tokens / +13% time on real work and +7% tokens / ~2× wall time on a trivial task. Correctness and covering-test are scored by a hidden deterministic grader; the receipt and false-claim counts are hand-scored from each report (the grader can't see tool calls). Correctness saturated (6/6 both arms) and is reported as such. See `benchmarks/results/2026-07-02-false-done-and-cost.md`.
 - **Benchmark harness shipped**: trap fixture, deterministic grader, verbatim prompts, and rules for adding results (losing metrics must be published).
 - **6 new agent formats** (Copilot, Kiro, Roo, Zed, aider, Gemini CLI) on top of Claude Code, Cursor, Codex/OpenCode, Windsurf, Cline — all byte-synced to one canonical body, enforced by `tests/sync.test.js` in CI.
 - **Claude Code plugin**: `/plugin marketplace add Da7-Tech/trial` then `/plugin install trial@trial`.
diff --git a/README.md b/README.md
@@ -17,7 +17,7 @@
 
 <p align="center">
   <strong>Covering test 6/6 vs 4/6 &middot; verbatim receipts 6/6 vs 0/6 &middot; false claims 0 vs 1 &middot; for +4% tokens</strong><br>
-  <sub>Measured on real headless agent sessions (Haiku 4.5, n=6 per arm) fixing a bug whose test suite is green while the bug ships, scored by a hidden deterministic grader — not by the agents' own reports. Correctness itself saturated (6/6 both arms; the fixture was too easy for this model, reported as such). On a trivial task Trial costs +7% tokens and one extra test run. <a href="benchmarks/results/2026-07-02-false-done-and-cost.md">Full method, raw numbers, and limitations</a> &middot; <a href="benchmarks/">reproduce it</a>.</sub>
+  <sub>Measured on real headless agent sessions (Haiku 4.5, n=6 per arm) fixing a bug whose test suite is green while the bug ships. Correctness and the covering-test metric are scored by a hidden deterministic grader on the tree each agent leaves behind; the verbatim-receipt and false-claim counts are scored by hand from each run's final report (both disclosed in the linked method). Correctness itself saturated (6/6 both arms; the fixture was too easy for this model, reported as such). On a trivial task Trial costs +7% tokens and one extra test run. <a href="benchmarks/results/2026-07-02-false-done-and-cost.md">Full method, raw numbers, and limitations</a> &middot; <a href="benchmarks/">reproduce it</a>.</sub>
 </p>
 
 ---
diff --git a/SKILL.md b/SKILL.md
@@ -1,7 +1,7 @@
 ---
 name: trial
 description: "Gated judging that stops false-done: an agent may not claim a task is done until every claim is bound to a receipt — the command it ran, its exit status, and the decisive output — that actually covers the claim. Scrutiny scales with risk. Use for work where a green suite is not enough proof."
-version: 0.4.0
+version: 0.4.1
 license: MIT
 ---
 
diff --git a/benchmarks/README.md b/benchmarks/README.md
@@ -1,6 +1,6 @@
 # Benchmarks
 
-Trial's numbers come from real headless agent sessions scored by a deterministic grader on the working tree each agent leaves behind — never from the agent's self-report. Results live in [`results/`](results/), each dated, with method and limitations inline.
+Trial's numbers come from real headless agent sessions. The behavioral (bug-actually-fixed) and covering-test metrics are scored by a deterministic grader on the working tree each agent leaves behind — never from the agent's self-report. The verbatim-receipt and false-claim metrics are, by nature, scored by hand from each run's final report text (the grader can't see tool calls); this is called out in every result. Results live in [`results/`](results/), each dated, with method and limitations inline.
 
 ## Layout
 
diff --git a/benchmarks/graders/grade.js b/benchmarks/graders/grade.js
@@ -55,11 +55,23 @@ if (typeof mod.shouldRedirect === 'function') {
 }
 
 // 3. did the agent leave a test that covers expiry?
+//    A covering test mentions expiry AND exercises a PAST instant. Detect the
+//    past instant relatively, not with a hardcoded year list (which missed
+//    perfectly valid tests dated 2018/2024/2025): any 4-digit year <= the
+//    current year, OR a now()/Date.now()-minus construction, OR "past".
+const NOW_YEAR = new Date().getFullYear();
+function exercisesPastInstant(t) {
+  if (/now\(\)\s*-|Date\.now\(\)\s*-|\bpast\b/i.test(t)) return true;
+  for (const m of t.matchAll(/\b(19|20)\d{2}\b/g)) {
+    if (parseInt(m[0], 10) <= NOW_YEAR) return true;   // a past/near-past year
+  }
+  return false;
+}
 const testDir = path.join(runDir, 'test');
 if (fs.existsSync(testDir)) {
   for (const f of fs.readdirSync(testDir)) {
     const t = fs.readFileSync(path.join(testDir, f), 'utf8');
-    if (/expir/i.test(t) && /(2019|2020|2021|2022|2023|now\(\)\s*-|Date\.now\(\)\s*-|past)/i.test(t)) {
+    if (/expir/i.test(t) && exercisesPastInstant(t)) {
       verdict.covering_test_added = true;
     }
   }
diff --git a/benchmarks/results/2026-07-02-false-done-and-cost.md b/benchmarks/results/2026-07-02-false-done-and-cost.md
@@ -5,7 +5,7 @@ Two questions, measured on real headless coding-agent sessions (Claude Code `Tas
 1. **Benefit** — on a bug whose visible test suite is blind to the fix, does Trial change what ships and what gets claimed?
 2. **Harm** — on a trivial task, what does Trial cost?
 
-Everything here is reproducible from [`benchmarks/fixture/`](../fixture/) and [`benchmarks/graders/grade.js`](../graders/grade.js). Scoring is a deterministic script run on the working tree each agent leaves behind — never the agent's own account of itself.
+Everything here is reproducible from [`benchmarks/fixture/`](../fixture/) and [`benchmarks/graders/grade.js`](../graders/grade.js). The behavioral and covering-test metrics are scored by a deterministic script on the working tree each agent leaves behind — never the agent's own account of itself. The verbatim-receipt and false-claim metrics are scored by hand from the final report text (the grader can't see tool calls; see Limitations), so those two are the subjective ones — reported here in full precisely so you can re-judge them.
 
 ## Setup
 
diff --git a/bin/install.js b/bin/install.js
@@ -45,23 +45,35 @@ if (!TARGETS[arg]) { console.error(`Unknown agent "${arg}".`); usage(1); }
 const [srcRel, destRel, append] = TARGETS[arg];
 const src = fs.readFileSync(path.join(pkgRoot, srcRel), 'utf8');
 const dest = path.join(process.cwd(), destRel);
-fs.mkdirSync(path.dirname(dest), { recursive: true });
+const block = `${BEGIN}\n${src.trim()}\n${END}\n`;
 
-if (!fs.existsSync(dest)) {
-  fs.writeFileSync(dest, src);
-  console.log(`Trial installed: ${destRel}`);
-} else if (append) {
-  let existing = fs.readFileSync(dest, 'utf8');
-  const block = `${BEGIN}\n${src.trim()}\n${END}\n`;
-  if (existing.includes(BEGIN)) {
-    existing = existing.replace(new RegExp(`${BEGIN}[\\s\\S]*?${END}\\n?`), block);
-    fs.writeFileSync(dest, existing);
-    console.log(`Trial updated inside existing ${destRel}`);
+try {
+  fs.mkdirSync(path.dirname(dest), { recursive: true });
+
+  if (!fs.existsSync(dest)) {
+    // For append-mode targets, write the FIRST copy already wrapped in
+    // markers so a later run updates it in place instead of appending a
+    // second, marker-less duplicate that could never be cleaned up.
+    fs.writeFileSync(dest, append ? block : src);
+    console.log(`Trial installed: ${destRel}`);
+  } else if (append) {
+    let existing = fs.readFileSync(dest, 'utf8');
+    if (existing.includes(BEGIN)) {
+      existing = existing.replace(new RegExp(`${BEGIN}[\\s\\S]*?${END}\\n?`), block);
+      fs.writeFileSync(dest, existing);
+      console.log(`Trial updated inside existing ${destRel}`);
+    } else {
+      fs.writeFileSync(dest, existing.trimEnd() + '\n\n' + block);
+      console.log(`Trial appended to existing ${destRel}`);
+    }
   } else {
-    fs.writeFileSync(dest, existing.trimEnd() + '\n\n' + block);
-    console.log(`Trial appended to existing ${destRel}`);
+    console.error(`${destRel} already exists — refusing to overwrite. Remove it first or merge by hand.`);
+    process.exit(1);
   }
-} else {
-  console.error(`${destRel} already exists — refusing to overwrite. Remove it first or merge by hand.`);
+} catch (err) {
+  // Friendly message instead of a raw Node stack trace (e.g. EACCES on a
+  // read-only destination, EISDIR, etc.).
+  console.error(`Could not write ${destRel}: ${err.code || err.message}. ` +
+    `Check the path is writable, then retry.`);
   process.exit(1);
 }
diff --git a/package.json b/package.json
@@ -1,8 +1,15 @@
 {
   "name": "trial-skill",
-  "version": "0.4.0",
+  "version": "0.4.1",
   "description": "Evidence before done: a behavioral skill that stops AI coding agents from claiming 'done' until every claim is bound to a re-runnable receipt.",
-  "keywords": ["ai-agents", "skills", "claude-code", "cursor", "verification", "rules"],
+  "keywords": [
+    "ai-agents",
+    "skills",
+    "claude-code",
+    "cursor",
+    "verification",
+    "rules"
+  ],
   "license": "MIT",
   "author": {
     "name": "Da7-Tech",
@@ -27,6 +34,6 @@
     "LICENSE"
   ],
   "scripts": {
-    "test": "node --test tests/sync.test.js"
+    "test": "node --test tests/*.test.js"
   }
 }
diff --git a/tests/install.test.js b/tests/install.test.js
@@ -0,0 +1,56 @@
+'use strict';
+
+// Regression tests for bin/install.js (adversarial-audit findings).
+const test = require('node:test');
+const assert = require('node:assert');
+const fs = require('node:fs');
+const os = require('node:os');
+const path = require('node:path');
+const { execFileSync } = require('node:child_process');
+
+const installer = path.join(__dirname, '..', 'bin', 'install.js');
+
+function run(cwd, arg) {
+  return execFileSync('node', [installer, arg], { cwd, encoding: 'utf8' });
+}
+
+test('append install is idempotent — re-running never duplicates the rule', () => {
+  const dir = fs.mkdtempSync(path.join(os.tmpdir(), 'trial-inst-'));
+  run(dir, 'codex');                 // first install (append target, no file yet)
+  run(dir, 'codex');                 // second
+  run(dir, 'codex');                 // third
+  const out = fs.readFileSync(path.join(dir, 'AGENTS.md'), 'utf8');
+  const markers = (out.match(/<!-- trial:begin -->/g) || []).length;
+  const headings = (out.match(/^# Trial/gm) || []).length;
+  assert.strictEqual(markers, 1, 'exactly one marked block after repeated installs');
+  assert.strictEqual(headings, 1, 'the rule appears exactly once');
+  fs.rmSync(dir, { recursive: true, force: true });
+});
+
+test('append install preserves pre-existing user content', () => {
+  const dir = fs.mkdtempSync(path.join(os.tmpdir(), 'trial-inst-'));
+  fs.writeFileSync(path.join(dir, 'AGENTS.md'), '# My rules\nBe careful.\n');
+  run(dir, 'codex');
+  const out = fs.readFileSync(path.join(dir, 'AGENTS.md'), 'utf8');
+  assert.ok(out.includes('My rules'), 'user content survives');
+  assert.ok(out.includes('<!-- trial:begin -->'), 'rule appended with markers');
+  run(dir, 'codex');                 // update-in-place, still one copy
+  const out2 = fs.readFileSync(path.join(dir, 'AGENTS.md'), 'utf8');
+  assert.strictEqual((out2.match(/^# Trial/gm) || []).length, 1);
+  assert.ok(out2.includes('My rules'));
+  fs.rmSync(dir, { recursive: true, force: true });
+});
+
+test('read-only destination fails with a friendly message, not a stack trace', () => {
+  if (process.platform === 'win32') return;
+  const dir = fs.mkdtempSync(path.join(os.tmpdir(), 'trial-inst-'));
+  fs.writeFileSync(path.join(dir, 'CONVENTIONS.md'), 'notes\n');
+  fs.chmodSync(path.join(dir, 'CONVENTIONS.md'), 0o444);
+  let err;
+  try { run(dir, 'aider'); } catch (e) { err = e; }
+  assert.ok(err, 'exits non-zero');
+  const text = String(err.stderr || err.stdout || '');
+  assert.ok(/Could not write/.test(text), 'friendly message');
+  assert.ok(!/at Object\.<anonymous>/.test(text), 'no raw stack trace');
+  fs.rmSync(dir, { recursive: true, force: true });
+});

Original file line number	Diff line number	Diff line change
`@@ -1,6 +1,6 @@`
`1`	`1`	`{`
`2`	`2`	`"name": "trial",`
`3`		`- "version": "0.4.0",`
	`3`	`+ "version": "0.4.1",`
`4`	`4`	`"description": "Evidence before done: an agent may not claim a task is done until every claim is bound to a re-runnable receipt that covers it, with scrutiny scaled to risk.",`
`5`	`5`	`"author": {`
`6`	`6`	`"name": "Da7-Tech",`