You signed in with another tab or window. Reload to refresh your session.You signed out in another tab or window. Reload to refresh your session.You switched accounts on another tab or window. Reload to refresh your session.Dismiss alert
- Installer: append-mode targets now write the first copy wrapped in markers
so re-running updates in place instead of appending a permanent duplicate;
write failures print a friendly message, not a raw stack trace. + 3 tests.
- Grader: covering-test detector is now relative (any year <= current, or a
now()/Date.now()-minus construction), not a hardcoded 2019-2023 list that
false-negatived valid 2018/2024/2025 tests.
- Docs: split benchmark attribution — behavioral + covering-test are
grader-scored; verbatim-receipt + false-claim counts are hand-scored from
each report (the grader can't see tool calls). Removed the absolute 'never
the agent's self-report' claim that two of four headline metrics violated.
- CI: SHA-pinned actions + least-privilege permissions.
Copy file name to clipboardExpand all lines: .claude-plugin/plugin.json
+1-1Lines changed: 1 addition & 1 deletion
Original file line number
Diff line number
Diff line change
@@ -1,6 +1,6 @@
1
1
{
2
2
"name": "trial",
3
-
"version": "0.4.0",
3
+
"version": "0.4.1",
4
4
"description": "Evidence before done: an agent may not claim a task is done until every claim is bound to a re-runnable receipt that covers it, with scrutiny scaled to risk.",
Copy file name to clipboardExpand all lines: CHANGELOG.md
+22-1Lines changed: 22 additions & 1 deletion
Display the source diff
Display the rich diff
Original file line number
Diff line number
Diff line change
@@ -1,13 +1,34 @@
1
1
# Changelog
2
2
3
+
## 0.4.1 — 2026-07-02
4
+
5
+
Adversarial-audit fixes (Opus fleet, each finding reproduced then fixed):
6
+
7
+
-**Installer is now idempotent for shared-file agents.** For append-mode
8
+
targets (codex/opencode/copilot/zed/aider/gemini) the first install now
9
+
writes the rule already wrapped in `<!-- trial:begin/end -->` markers, so
10
+
re-running updates it in place instead of appending a second, marker-less
11
+
copy that could never be cleaned up. Also: install failures (e.g. a
12
+
read-only destination) print a friendly message instead of a raw Node
13
+
stack trace.
14
+
-**Grader covering-test detector is now relative, not a hardcoded year
15
+
list.** It flags any expiry test that exercises a past instant (any year
16
+
≤ the current year, or a `now()/Date.now()`-minus construction), so a
17
+
valid test dated 2018/2024/2025 is no longer a false negative.
18
+
-**Benchmark attribution corrected.** The behavioral and covering-test
19
+
metrics are grader-scored; the verbatim-receipt and false-claim counts
20
+
are hand-scored from each report (the grader can't see tool calls). The
21
+
README/CHANGELOG/benchmarks docs no longer imply all four come from the
22
+
deterministic grader.
23
+
3
24
## 0.4.0 — 2026-07-02
4
25
5
26
The measured release. Everything below was driven by running the rule against real agent sessions and publishing the numbers — including the ones that don't flatter it.
6
27
7
28
-**Removed the output-hash requirement.** Hashing test output was theater: output contains timestamps, so the hash was never reproducible, and a self-reported hash is no harder to fabricate than a self-reported pass. Replaced by the **receipt**: exact command + exit status + decisive output lines, quoted in the report — auditable and re-runnable.
8
29
-**Added the coverage rule as the headline** ("Coverage beats green"): a receipt only counts if the command would have failed were the claim false; missing coverage means writing the test, watching it fail on the old behavior, then fixing.
9
30
-**Fixed the subagent assumption.** The old rule told every platform to "spawn fresh agents" — impossible on Cursor, Windsurf, Cline, aider. The rule now has an explicit fallback: a separate adversarial self-review step that must name the covering test/line per criterion or downgrade to `NOT_PROVEN`.
10
-
-**First controlled measurement** (Haiku 4.5, real headless sessions, deterministic hidden grader): covering test 6/6 vs 4/6, verbatim receipts 6/6 vs 0/6, false verification claims 0 vs 1, at +4% tokens / +13% time on real work and +7% tokens / ~2× wall time on a trivial task. Correctness saturated (6/6 both arms) and is reported as such. See `benchmarks/results/2026-07-02-false-done-and-cost.md`.
31
+
-**First controlled measurement** (Haiku 4.5, real headless sessions): covering test 6/6 vs 4/6, verbatim receipts 6/6 vs 0/6, false verification claims 0 vs 1, at +4% tokens / +13% time on real work and +7% tokens / ~2× wall time on a trivial task. Correctness and covering-test are scored by a hidden deterministic grader; the receipt and false-claim counts are hand-scored from each report (the grader can't see tool calls). Correctness saturated (6/6 both arms) and is reported as such. See `benchmarks/results/2026-07-02-false-done-and-cost.md`.
11
32
-**Benchmark harness shipped**: trap fixture, deterministic grader, verbatim prompts, and rules for adding results (losing metrics must be published).
12
33
-**6 new agent formats** (Copilot, Kiro, Roo, Zed, aider, Gemini CLI) on top of Claude Code, Cursor, Codex/OpenCode, Windsurf, Cline — all byte-synced to one canonical body, enforced by `tests/sync.test.js` in CI.
Copy file name to clipboardExpand all lines: README.md
+1-1Lines changed: 1 addition & 1 deletion
Display the source diff
Display the rich diff
Original file line number
Diff line number
Diff line change
@@ -17,7 +17,7 @@
17
17
18
18
<palign="center">
19
19
<strong>Covering test 6/6 vs 4/6 · verbatim receipts 6/6 vs 0/6 · false claims 0 vs 1 · for +4% tokens</strong><br>
20
-
<sub>Measured on real headless agent sessions (Haiku 4.5, n=6 per arm) fixing a bug whose test suite is green while the bug ships, scored by a hidden deterministic grader — not by the agents' own reports. Correctness itself saturated (6/6 both arms; the fixture was too easy for this model, reported as such). On a trivial task Trial costs +7% tokens and one extra test run. <ahref="benchmarks/results/2026-07-02-false-done-and-cost.md">Full method, raw numbers, and limitations</a> · <ahref="benchmarks/">reproduce it</a>.</sub>
20
+
<sub>Measured on real headless agent sessions (Haiku 4.5, n=6 per arm) fixing a bug whose test suite is green while the bug ships. Correctness and the covering-test metric are scored by a hidden deterministic grader on the tree each agent leaves behind; the verbatim-receipt and false-claim counts are scored by hand from each run's final report (both disclosed in the linked method). Correctness itself saturated (6/6 both arms; the fixture was too easy for this model, reported as such). On a trivial task Trial costs +7% tokens and one extra test run. <ahref="benchmarks/results/2026-07-02-false-done-and-cost.md">Full method, raw numbers, and limitations</a> · <ahref="benchmarks/">reproduce it</a>.</sub>
Copy file name to clipboardExpand all lines: SKILL.md
+1-1Lines changed: 1 addition & 1 deletion
Display the source diff
Display the rich diff
Original file line number
Diff line number
Diff line change
@@ -1,7 +1,7 @@
1
1
---
2
2
name: trial
3
3
description: "Gated judging that stops false-done: an agent may not claim a task is done until every claim is bound to a receipt — the command it ran, its exit status, and the decisive output — that actually covers the claim. Scrutiny scales with risk. Use for work where a green suite is not enough proof."
Copy file name to clipboardExpand all lines: benchmarks/README.md
+1-1Lines changed: 1 addition & 1 deletion
Display the source diff
Display the rich diff
Original file line number
Diff line number
Diff line change
@@ -1,6 +1,6 @@
1
1
# Benchmarks
2
2
3
-
Trial's numbers come from real headless agent sessionsscored by a deterministic grader on the working tree each agent leaves behind — never from the agent's self-report. Results live in [`results/`](results/), each dated, with method and limitations inline.
3
+
Trial's numbers come from real headless agent sessions. The behavioral (bug-actually-fixed) and covering-test metrics are scored by a deterministic grader on the working tree each agent leaves behind — never from the agent's self-report. The verbatim-receipt and false-claim metrics are, by nature, scored by hand from each run's final report text (the grader can't see tool calls); this is called out in every result. Results live in [`results/`](results/), each dated, with method and limitations inline.
Copy file name to clipboardExpand all lines: benchmarks/results/2026-07-02-false-done-and-cost.md
+1-1Lines changed: 1 addition & 1 deletion
Display the source diff
Display the rich diff
Original file line number
Diff line number
Diff line change
@@ -5,7 +5,7 @@ Two questions, measured on real headless coding-agent sessions (Claude Code `Tas
5
5
1.**Benefit** — on a bug whose visible test suite is blind to the fix, does Trial change what ships and what gets claimed?
6
6
2.**Harm** — on a trivial task, what does Trial cost?
7
7
8
-
Everything here is reproducible from [`benchmarks/fixture/`](../fixture/) and [`benchmarks/graders/grade.js`](../graders/grade.js). Scoring is a deterministic script run on the working tree each agent leaves behind — never the agent's own account of itself.
8
+
Everything here is reproducible from [`benchmarks/fixture/`](../fixture/) and [`benchmarks/graders/grade.js`](../graders/grade.js). The behavioral and covering-test metrics are scored by a deterministic script on the working tree each agent leaves behind — never the agent's own account of itself. The verbatim-receipt and false-claim metrics are scored by hand from the final report text (the grader can't see tool calls; see Limitations), so those two are the subjective ones — reported here in full precisely so you can re-judge them.
Copy file name to clipboardExpand all lines: package.json
+10-3Lines changed: 10 additions & 3 deletions
Original file line number
Diff line number
Diff line change
@@ -1,8 +1,15 @@
1
1
{
2
2
"name": "trial-skill",
3
-
"version": "0.4.0",
3
+
"version": "0.4.1",
4
4
"description": "Evidence before done: a behavioral skill that stops AI coding agents from claiming 'done' until every claim is bound to a re-runnable receipt.",
0 commit comments