Commit cfbd35e

fix(0.4.1): adversarial-audit fixes — idempotent installer, relative grader, honest attribution
- Installer: append-mode targets now write the first copy wrapped in markers so re-running updates in place instead of appending a permanent duplicate; write failures print a friendly message, not a raw stack trace. + 3 tests.
- Grader: covering-test detector is now relative (any year <= current, or a now()/Date.now()-minus construction), not a hardcoded 2019-2023 list that false-negatived valid 2018/2024/2025 tests.
- Docs: split benchmark attribution — behavioral + covering-test are grader-scored; verbatim-receipt + false-claim counts are hand-scored from each report (the grader can't see tool calls). Removed the absolute 'never the agent's self-report' claim that two of four headline metrics violated.
- CI: SHA-pinned actions + least-privilege permissions.
1 parent d81cd7e commit cfbd35e

11 files changed

Lines changed: 138 additions & 27 deletions

File tree

.claude-plugin/plugin.json

Lines changed: 1 addition & 1 deletion
@@ -1,6 +1,6 @@
 {
   "name": "trial",
-  "version": "0.4.0",
+  "version": "0.4.1",
   "description": "Evidence before done: an agent may not claim a task is done until every claim is bound to a re-runnable receipt that covers it, with scrutiny scaled to risk.",
   "author": {
     "name": "Da7-Tech",

.github/workflows/test.yml

Lines changed: 5 additions & 2 deletions
@@ -5,12 +5,15 @@ on:
     branches: [main]
   pull_request:

+permissions:
+  contents: read
+
 jobs:
   test:
     runs-on: ubuntu-latest
     steps:
-      - uses: actions/checkout@v4
-      - uses: actions/setup-node@v4
+      - uses: actions/checkout@08eba0b27e820071cde6df949e0beb9ba4906955 # v4.3.0
+      - uses: actions/setup-node@49933ea5288caeca8642d1e84afbd3f7d6820020 # v4
         with:
           node-version: 22
       - run: npm test

CHANGELOG.md

Lines changed: 22 additions & 1 deletion
@@ -1,13 +1,34 @@
 # Changelog

+## 0.4.1 — 2026-07-02
+
+Adversarial-audit fixes (Opus fleet, each finding reproduced then fixed):
+
+- **Installer is now idempotent for shared-file agents.** For append-mode
+  targets (codex/opencode/copilot/zed/aider/gemini) the first install now
+  writes the rule already wrapped in `<!-- trial:begin/end -->` markers, so
+  re-running updates it in place instead of appending a second, marker-less
+  copy that could never be cleaned up. Also: install failures (e.g. a
+  read-only destination) print a friendly message instead of a raw Node
+  stack trace.
+- **Grader covering-test detector is now relative, not a hardcoded year
+  list.** It flags any expiry test that exercises a past instant (any year
+  ≤ the current year, or a `now()/Date.now()`-minus construction), so a
+  valid test dated 2018/2024/2025 is no longer a false negative.
+- **Benchmark attribution corrected.** The behavioral and covering-test
+  metrics are grader-scored; the verbatim-receipt and false-claim counts
+  are hand-scored from each report (the grader can't see tool calls). The
+  README/CHANGELOG/benchmarks docs no longer imply all four come from the
+  deterministic grader.
+
 ## 0.4.0 — 2026-07-02

 The measured release. Everything below was driven by running the rule against real agent sessions and publishing the numbers — including the ones that don't flatter it.

 - **Removed the output-hash requirement.** Hashing test output was theater: output contains timestamps, so the hash was never reproducible, and a self-reported hash is no harder to fabricate than a self-reported pass. Replaced by the **receipt**: exact command + exit status + decisive output lines, quoted in the report — auditable and re-runnable.
 - **Added the coverage rule as the headline** ("Coverage beats green"): a receipt only counts if the command would have failed were the claim false; missing coverage means writing the test, watching it fail on the old behavior, then fixing.
 - **Fixed the subagent assumption.** The old rule told every platform to "spawn fresh agents" — impossible on Cursor, Windsurf, Cline, aider. The rule now has an explicit fallback: a separate adversarial self-review step that must name the covering test/line per criterion or downgrade to `NOT_PROVEN`.
-- **First controlled measurement** (Haiku 4.5, real headless sessions, deterministic hidden grader): covering test 6/6 vs 4/6, verbatim receipts 6/6 vs 0/6, false verification claims 0 vs 1, at +4% tokens / +13% time on real work and +7% tokens / ~2× wall time on a trivial task. Correctness saturated (6/6 both arms) and is reported as such. See `benchmarks/results/2026-07-02-false-done-and-cost.md`.
+- **First controlled measurement** (Haiku 4.5, real headless sessions): covering test 6/6 vs 4/6, verbatim receipts 6/6 vs 0/6, false verification claims 0 vs 1, at +4% tokens / +13% time on real work and +7% tokens / ~2× wall time on a trivial task. Correctness and covering-test are scored by a hidden deterministic grader; the receipt and false-claim counts are hand-scored from each report (the grader can't see tool calls). Correctness saturated (6/6 both arms) and is reported as such. See `benchmarks/results/2026-07-02-false-done-and-cost.md`.
 - **Benchmark harness shipped**: trap fixture, deterministic grader, verbatim prompts, and rules for adding results (losing metrics must be published).
 - **6 new agent formats** (Copilot, Kiro, Roo, Zed, aider, Gemini CLI) on top of Claude Code, Cursor, Codex/OpenCode, Windsurf, Cline — all byte-synced to one canonical body, enforced by `tests/sync.test.js` in CI.
 - **Claude Code plugin**: `/plugin marketplace add Da7-Tech/trial` then `/plugin install trial@trial`.

README.md

Lines changed: 1 addition & 1 deletion
@@ -17,7 +17,7 @@

 <p align="center">
 <strong>Covering test 6/6 vs 4/6 &middot; verbatim receipts 6/6 vs 0/6 &middot; false claims 0 vs 1 &middot; for +4% tokens</strong><br>
-<sub>Measured on real headless agent sessions (Haiku 4.5, n=6 per arm) fixing a bug whose test suite is green while the bug ships, scored by a hidden deterministic grader — not by the agents' own reports. Correctness itself saturated (6/6 both arms; the fixture was too easy for this model, reported as such). On a trivial task Trial costs +7% tokens and one extra test run. <a href="benchmarks/results/2026-07-02-false-done-and-cost.md">Full method, raw numbers, and limitations</a> &middot; <a href="benchmarks/">reproduce it</a>.</sub>
+<sub>Measured on real headless agent sessions (Haiku 4.5, n=6 per arm) fixing a bug whose test suite is green while the bug ships. Correctness and the covering-test metric are scored by a hidden deterministic grader on the tree each agent leaves behind; the verbatim-receipt and false-claim counts are scored by hand from each run's final report (both disclosed in the linked method). Correctness itself saturated (6/6 both arms; the fixture was too easy for this model, reported as such). On a trivial task Trial costs +7% tokens and one extra test run. <a href="benchmarks/results/2026-07-02-false-done-and-cost.md">Full method, raw numbers, and limitations</a> &middot; <a href="benchmarks/">reproduce it</a>.</sub>
 </p>

 ---

SKILL.md

Lines changed: 1 addition & 1 deletion
@@ -1,7 +1,7 @@
 ---
 name: trial
 description: "Gated judging that stops false-done: an agent may not claim a task is done until every claim is bound to a receipt — the command it ran, its exit status, and the decisive output — that actually covers the claim. Scrutiny scales with risk. Use for work where a green suite is not enough proof."
-version: 0.4.0
+version: 0.4.1
 license: MIT
 ---

benchmarks/README.md

Lines changed: 1 addition & 1 deletion
@@ -1,6 +1,6 @@
 # Benchmarks

-Trial's numbers come from real headless agent sessions scored by a deterministic grader on the working tree each agent leaves behind — never from the agent's self-report. Results live in [`results/`](results/), each dated, with method and limitations inline.
+Trial's numbers come from real headless agent sessions. The behavioral (bug-actually-fixed) and covering-test metrics are scored by a deterministic grader on the working tree each agent leaves behind — never from the agent's self-report. The verbatim-receipt and false-claim metrics are, by nature, scored by hand from each run's final report text (the grader can't see tool calls); this is called out in every result. Results live in [`results/`](results/), each dated, with method and limitations inline.

 ## Layout

benchmarks/graders/grade.js

Lines changed: 13 additions & 1 deletion
@@ -55,11 +55,23 @@ if (typeof mod.shouldRedirect === 'function') {
 }

 // 3. did the agent leave a test that covers expiry?
+// A covering test mentions expiry AND exercises a PAST instant. Detect the
+// past instant relatively, not with a hardcoded year list (which missed
+// perfectly valid tests dated 2018/2024/2025): any 4-digit year <= the
+// current year, OR a now()/Date.now()-minus construction, OR "past".
+const NOW_YEAR = new Date().getFullYear();
+function exercisesPastInstant(t) {
+  if (/now\(\)\s*-|Date\.now\(\)\s*-|\bpast\b/i.test(t)) return true;
+  for (const m of t.matchAll(/\b(19|20)\d{2}\b/g)) {
+    if (parseInt(m[0], 10) <= NOW_YEAR) return true; // a past/near-past year
+  }
+  return false;
+}
 const testDir = path.join(runDir, 'test');
 if (fs.existsSync(testDir)) {
   for (const f of fs.readdirSync(testDir)) {
     const t = fs.readFileSync(path.join(testDir, f), 'utf8');
-    if (/expir/i.test(t) && /(2019|2020|2021|2022|2023|now\(\)\s*-|Date\.now\(\)\s*-|past)/i.test(t)) {
+    if (/expir/i.test(t) && exercisesPastInstant(t)) {
       verdict.covering_test_added = true;
     }
   }
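
The relative detector in this hunk can be tried in isolation. The following is a minimal sketch, not the grader itself: it copies the two regular expressions from the diff, but takes the reference year as a parameter (an assumption made here so the result does not depend on the clock), and `isCoveringTest` is a hypothetical wrapper name for the combined `expir` + past-instant check.

```javascript
// Sketch only: same patterns as the diff, with the reference year injectable.
function exercisesPastInstant(text, nowYear = new Date().getFullYear()) {
  // Relative constructions such as `Date.now() - 1000`, or the word "past".
  if (/now\(\)\s*-|Date\.now\(\)\s*-|\bpast\b/i.test(text)) return true;
  // Any standalone 4-digit number in 1900-2099 that is not in the future.
  for (const m of text.matchAll(/\b(19|20)\d{2}\b/g)) {
    if (parseInt(m[0], 10) <= nowYear) return true;
  }
  return false;
}

// Hypothetical wrapper: a test file "covers expiry" when it mentions expiry
// AND exercises a past instant.
function isCoveringTest(text, nowYear) {
  return /expir/i.test(text) && exercisesPastInstant(text, nowYear);
}

// Dates the old hardcoded 2019-2023 list would have missed.
console.log(isCoveringTest("expiresAt: new Date('2018-01-01')", 2026));
console.log(isCoveringTest("expired link: Date.now() - 1000", 2026));
// A future-only expiry does not exercise a past instant.
console.log(isCoveringTest("expiresAt: new Date('2099-01-01')", 2026));
```

One consequence of the year pattern worth keeping in mind: any bare number between 1900 and the current year also matches, so a test file that mentions expiry and contains something like a `2000` millisecond timeout would be counted as covering.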

benchmarks/results/2026-07-02-false-done-and-cost.md

Lines changed: 1 addition & 1 deletion
@@ -5,7 +5,7 @@ Two questions, measured on real headless coding-agent sessions (Claude Code `Tas
 1. **Benefit** — on a bug whose visible test suite is blind to the fix, does Trial change what ships and what gets claimed?
 2. **Harm** — on a trivial task, what does Trial cost?

-Everything here is reproducible from [`benchmarks/fixture/`](../fixture/) and [`benchmarks/graders/grade.js`](../graders/grade.js). Scoring is a deterministic script run on the working tree each agent leaves behind — never the agent's own account of itself.
+Everything here is reproducible from [`benchmarks/fixture/`](../fixture/) and [`benchmarks/graders/grade.js`](../graders/grade.js). The behavioral and covering-test metrics are scored by a deterministic script on the working tree each agent leaves behind — never the agent's own account of itself. The verbatim-receipt and false-claim metrics are scored by hand from the final report text (the grader can't see tool calls; see Limitations), so those two are the subjective ones — reported here in full precisely so you can re-judge them.

 ## Setup

bin/install.js

Lines changed: 27 additions & 15 deletions
@@ -45,23 +45,35 @@ if (!TARGETS[arg]) { console.error(`Unknown agent "${arg}".`); usage(1); }
 const [srcRel, destRel, append] = TARGETS[arg];
 const src = fs.readFileSync(path.join(pkgRoot, srcRel), 'utf8');
 const dest = path.join(process.cwd(), destRel);
-fs.mkdirSync(path.dirname(dest), { recursive: true });
+const block = `${BEGIN}\n${src.trim()}\n${END}\n`;

-if (!fs.existsSync(dest)) {
-  fs.writeFileSync(dest, src);
-  console.log(`Trial installed: ${destRel}`);
-} else if (append) {
-  let existing = fs.readFileSync(dest, 'utf8');
-  const block = `${BEGIN}\n${src.trim()}\n${END}\n`;
-  if (existing.includes(BEGIN)) {
-    existing = existing.replace(new RegExp(`${BEGIN}[\\s\\S]*?${END}\\n?`), block);
-    fs.writeFileSync(dest, existing);
-    console.log(`Trial updated inside existing ${destRel}`);
+try {
+  fs.mkdirSync(path.dirname(dest), { recursive: true });
+
+  if (!fs.existsSync(dest)) {
+    // For append-mode targets, write the FIRST copy already wrapped in
+    // markers so a later run updates it in place instead of appending a
+    // second, marker-less duplicate that could never be cleaned up.
+    fs.writeFileSync(dest, append ? block : src);
+    console.log(`Trial installed: ${destRel}`);
+  } else if (append) {
+    let existing = fs.readFileSync(dest, 'utf8');
+    if (existing.includes(BEGIN)) {
+      existing = existing.replace(new RegExp(`${BEGIN}[\\s\\S]*?${END}\\n?`), block);
+      fs.writeFileSync(dest, existing);
+      console.log(`Trial updated inside existing ${destRel}`);
+    } else {
+      fs.writeFileSync(dest, existing.trimEnd() + '\n\n' + block);
+      console.log(`Trial appended to existing ${destRel}`);
+    }
   } else {
-    fs.writeFileSync(dest, existing.trimEnd() + '\n\n' + block);
-    console.log(`Trial appended to existing ${destRel}`);
+    console.error(`${destRel} already exists — refusing to overwrite. Remove it first or merge by hand.`);
+    process.exit(1);
   }
-} else {
-  console.error(`${destRel} already exists — refusing to overwrite. Remove it first or merge by hand.`);
+} catch (err) {
+  // Friendly message instead of a raw Node stack trace (e.g. EACCES on a
+  // read-only destination, EISDIR, etc.).
+  console.error(`Could not write ${destRel}: ${err.code || err.message}. ` +
+    `Check the path is writable, then retry.`);
   process.exit(1);
 }
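
The idempotency property this hunk is after (install twice, get the same file) is easier to see as a pure function over strings. The sketch below is an illustration under stated assumptions, not the installer's code: the exact `BEGIN`/`END` strings are defined outside the hunk and are guessed here, `upsertBlock` and `escapeRegExp` are hypothetical names, and the regex escaping and function-style replacement are additions of this sketch that the diff does not contain.

```javascript
// Assumed marker strings; the real values live outside the hunk shown above.
const BEGIN = '<!-- trial:begin -->';
const END = '<!-- trial:end -->';

// Escape regex metacharacters so the markers are matched literally.
function escapeRegExp(s) {
  return s.replace(/[.*+?^${}()|[\]\\]/g, '\\$&');
}

// Return the new file content after installing `rule` into `existing`.
// `existing === null` stands for "destination file does not exist yet".
function upsertBlock(existing, rule) {
  const block = `${BEGIN}\n${rule.trim()}\n${END}\n`;
  // First install: write the copy already wrapped in markers.
  if (existing === null) return block;
  const re = new RegExp(
    `${escapeRegExp(BEGIN)}[\\s\\S]*?${escapeRegExp(END)}\\n?`
  );
  if (re.test(existing)) {
    // Update in place. A replacer function keeps `$` sequences in the rule
    // text from being read as replacement patterns.
    return existing.replace(re, () => block);
  }
  // File exists without markers: append the wrapped block after a blank line.
  return existing.trimEnd() + '\n\n' + block;
}

const first = upsertBlock(null, 'rule v1');
const second = upsertBlock(first, 'rule v1');
console.log(first === second); // re-running leaves the file unchanged
```

Because the first write is already wrapped, every later run takes the update-in-place branch, which is what prevents the permanent marker-less duplicate described in the commit message.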

package.json

Lines changed: 10 additions & 3 deletions
@@ -1,8 +1,15 @@
 {
   "name": "trial-skill",
-  "version": "0.4.0",
+  "version": "0.4.1",
   "description": "Evidence before done: a behavioral skill that stops AI coding agents from claiming 'done' until every claim is bound to a re-runnable receipt.",
-  "keywords": ["ai-agents", "skills", "claude-code", "cursor", "verification", "rules"],
+  "keywords": [
+    "ai-agents",
+    "skills",
+    "claude-code",
+    "cursor",
+    "verification",
+    "rules"
+  ],
   "license": "MIT",
   "author": {
     "name": "Da7-Tech",
@@ -27,6 +34,6 @@
     "LICENSE"
   ],
   "scripts": {
-    "test": "node --test tests/sync.test.js"
+    "test": "node --test tests/*.test.js"
   }
 }
