autobrowse: add opt-in --browser-trace integration #121
Wire the sibling browser-trace skill into the autobrowse loop behind a default-off --browser-trace flag. When set, each iteration pre-creates a Browserbase --keep-alive session, attaches bb-capture as a passive CDP observer, and runs evaluate.mjs with --connect-url so every inner browse call is rewritten to attach via --cdp $connectUrl --session autobrowse-main. browse stop is suppressed under trace mode so the named daemon survives the iteration. The --cdp attach mode is the key: --remote routes Network.*/Console.* events only to the driving client, so the trace observer sees nothing useful. --cdp gives the observer the full per-page firehose, matching the canonical pattern documented in browser-trace's SKILL.md. Default path is unchanged; --browser-trace implies --env remote. Co-Authored-By: Claude Opus 4.7 (1M context) <[email protected]>
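A minimal sketch of the argv rewrite described above, assuming the inner command has already been tokenized into an array. The function name matches the PR's rewriteArgsForTrace, but the body, the null return for a suppressed browse stop, and the choice to append the flags at the end are illustrative assumptions, not the code in evaluate.mjs.

```javascript
// Hypothetical sketch of the trace-mode argv rewrite; not the actual evaluate.mjs code.
// Assumption: commands arrive as token arrays such as ["browse", "open", url, "--remote"].
function rewriteArgsForTrace(argv, connectUrl, sessionName = "autobrowse-main") {
  if (argv[0] !== "browse") return argv; // only browse calls are rewritten
  if (argv[1] === "stop") return null; // suppressed so the named daemon survives
  // Drop the environment flags; --cdp replaces them as the attach mode.
  const kept = argv.filter((arg) => arg !== "--remote" && arg !== "--local");
  return [...kept, "--cdp", connectUrl, "--session", sessionName];
}

const rewritten = rewriteArgsForTrace(
  ["browse", "open", "https://example.com", "--remote"],
  "wss://connect.example.invalid/session", // placeholder connect URL
);
console.log(rewritten.join(" "));
```

Returning null rather than throwing lets the caller report the stop as a no-op to the inner agent instead of surfacing an error.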
- Add a fail-fast preflight in the traced-path block: if
${CLAUDE_SKILL_DIR}/../browser-trace/scripts/bb-capture.mjs is missing,
bail with an install hint pointing at github.com/browserbase/skills.
Prevents the harness from blowing up halfway through with "no such
file or directory" on bb-capture.mjs.
- Make explicit in "Read the trace" + "Form one hypothesis" that under
--browser-trace there are now TWO complementary traces: trace.json
(what the agent did, per turn) and cdp/summary.json (what the browser
did, per page). Document the cross-reference pattern: find the failing
page in cdp/summary.json, then jump back to trace.json for the turn
whose command produced it. Hypotheses should cite both — "the click
went through but /api/checkout returned 403" beats "the click failed."
Co-Authored-By: Claude Opus 4.7 (1M context) <[email protected]>
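The fail-fast preflight could look roughly like this when expressed in Node. It is a sketch under assumptions: the function name, the return shape, and the hint wording are hypothetical, and skillDir stands in for ${CLAUDE_SKILL_DIR}; the real check lives in SKILL.md's traced-path block.

```javascript
import { existsSync } from "node:fs";
import { resolve } from "node:path";

// Hypothetical preflight; skillDir plays the role of ${CLAUDE_SKILL_DIR}.
function preflightBrowserTrace(skillDir) {
  const capture = resolve(skillDir, "..", "browser-trace", "scripts", "bb-capture.mjs");
  if (existsSync(capture)) return { ok: true, capture };
  return {
    ok: false,
    capture,
    // Hint wording is illustrative; only the repository location comes from the commit message.
    hint: `bb-capture.mjs not found at ${capture}; install the browser-trace skill from github.com/browserbase/skills`,
  };
}

const check = preflightBrowserTrace(process.env.CLAUDE_SKILL_DIR ?? ".");
if (!check.ok) console.error(check.hint);
```

A caller would bail with a non-zero exit when ok is false, before any Browserbase session is created, so nothing needs to be released.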
| \`\`\` | ||
| browse open <url> --local | ||
| \`\`\``; | ||
| \`\`\``); |
System prompt contradicts itself about browse stop in trace mode
Medium Severity
When connectUrl is set, envDesc explicitly tells the inner agent "Do not run browse stop", but the Workflow Pattern (steps 1 and 7) and Critical Rule 1, all part of the same system prompt, still instruct the agent to run browse stop. Setting openFlag to "" also leaves a trailing space in those template lines (e.g. "browse open <url> " instead of "browse open <url>"). While the executeCommand suppression prevents functional breakage, the contradictory instructions cause the inner agent to waste turns on suppressed browse stop calls, burning through the 30-turn budget unnecessarily.
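One way to remove both problems is to assemble the workflow lines from parts instead of interpolating into a fixed template. The sketch below is hypothetical: buildWorkflowLines and the step wording are illustrative, and only connectUrl and openFlag are names taken from the review.

```javascript
// Hypothetical prompt builder; the step wording is illustrative, not the real system prompt.
function buildWorkflowLines({ connectUrl, openFlag }) {
  const traced = Boolean(connectUrl);
  // filter(Boolean) drops an empty openFlag, so no trailing space is emitted.
  const openCmd = ["browse", "open", "<url>", openFlag].filter(Boolean).join(" ");
  if (traced) {
    // Trace mode: never mention browse stop as a step to run.
    return [openCmd, "Do not run browse stop; the outer harness owns the session."];
  }
  // Default mode keeps the stop steps at the start and end of the workflow.
  return ["browse stop", openCmd, "browse stop"];
}

console.log(buildWorkflowLines({ connectUrl: "wss://connect.example.invalid", openFlag: "" }));
console.log(buildWorkflowLines({ connectUrl: undefined, openFlag: "--local" }));
```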
Reviewed by Cursor Bugbot for commit d602d44.
Closes the gap flagged in PR #121 review: /autobrowse --tasks a,b --browser-trace would silently fall back to the untraced path inside each spawned sub-agent because the sub-agent prompt template said only "Use --env <env>" without mentioning --browser-trace. The sub-agent prompt now explicitly requires the traced-path loop block (pre-create session, bb-capture, --connect-url to evaluate.mjs, stop + bisect, release) when the parent invocation passed --browser-trace. Co-Authored-By: Claude Opus 4.7 (1M context) <[email protected]>
| Task | Config | Status | Turns | Cost | CDP events | First page captured |
|---|---|---|---|---|---|---|
| hn-top-story | no-trace | ✅ completed | 4 | $0.09 | — | — |
| hn-top-story | with-trace | ✅ completed | 3 | $0.08 | 81 | news.ycombinator.com |
| httpbin-form | no-trace | ✅ completed | 16 | $0.40 | — | — |
| httpbin-form | with-trace | ✅ completed | 17 | $0.45 | 89 (2 pages) | httpbin.org/forms/post + /post |
| flights-search | no-trace | ✅ completed | 6 | $0.15 | — | — |
| flights-search | with-trace | ✅ completed | 3 | $0.07 | 605 | google.com/travel/flights |
Per-task convergence delta (with-trace minus without)
| Task | Turns Δ | Cost Δ | Interpretation |
|---|---|---|---|
| hn-top-story | −1 | −$0.01 | Noise — both passed in ≤4 turns either way |
| httpbin-form | +1 | +$0.05 | Trace slightly slower/more expensive |
| flights-search | −3 | −$0.08 | Trace ran in half the turns, but single-run nondeterminism dominates at this sample size |
No regression: every cell completed; trace mode never failed. Cost differences are within Sonnet 4.6 noise.
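The deltas in the table above can be reproduced from the results table. This is a small worked example, not part of the eval harness; the data structure and function name are made up for illustration, and the figures are copied from the tables.

```javascript
// Figures copied from the results table: [turns, costUSD] per configuration.
const results = {
  "hn-top-story": { noTrace: [4, 0.09], withTrace: [3, 0.08] },
  "httpbin-form": { noTrace: [16, 0.4], withTrace: [17, 0.45] },
  "flights-search": { noTrace: [6, 0.15], withTrace: [3, 0.07] },
};

// Delta is with-trace minus without, as in the table heading.
function delta(task) {
  const { noTrace, withTrace } = results[task];
  return {
    turns: withTrace[0] - noTrace[0],
    // Subtract in whole cents to avoid floating-point residue in the dollar amounts.
    cost: (Math.round(withTrace[1] * 100) - Math.round(noTrace[1] * 100)) / 100,
  };
}

for (const task of Object.keys(results)) console.log(task, delta(task));
```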
Trace content quality (with-trace cells)
| Task | Network reqs | Failed | Distinct request types | Per-page breakdown useful? |
|---|---|---|---|---|
| hn-top-story | 7 | 0 | Document, Image, Script, Stylesheet, Other | Marginal — single page, no XHRs, all GETs return 200. |
| httpbin-form | 5 (across 2 pages) | 1 failed (page 2 — the POST) | Document, Image, Other | Yes — page 2's failed request encodes the form-submit roundtrip; this is exactly the kind of evidence an outer agent could cite ("submit returned non-200, switch to FormData / wait for redirect"). |
| flights-search | 54 | 0 | XHR(4), Fetch(8), Script(17), Image(13), Stylesheet(2), Font(4), Document(2), Manifest, Other | Yes — XHRs against gstatic/google.com with 54 reqs in a single page; this is the network-heavy case where the trace's per-page bisect would let an outer agent say "the search API returned 200 with N results, so the layout selector is wrong" instead of guessing. |
Limitations & caveats
- Single-iteration runs only. Every cell passed on its first iteration, so the most interesting signal — does the cross-trace pattern actually help an outer agent form a better hypothesis on iter 2+? — wasn't tested. We only verified the trace populates; we did not observe the cross-reference loop.
- N=1 per cell. Sonnet 4.6 takes different paths between runs; the −3 turns delta on flights could vanish in a re-run. The plan budgeted for re-runs but they weren't needed for the "does it crash?" sanity pass.
- No task forced ≥2 iterations. All three tasks were beatable in one shot. To observe the "trace informs strategy.md" path empirically, we'd need a corpus where the first iter is expected to fail (e.g. a site that requires a wait timeout heuristic the inner agent doesn't know upfront, or a CAPTCHA-gated page).
- Subjective "strategy quality" rating from the plan wasn't applied because there are no multi-iter strategy.md updates to rate.
Next steps
- The single-iter data proves the integration works across three diverse profiles (simple read, form submit, heavy JS / XHR) and doesn't regress completion. That's enough to ship v1 with confidence.
- The single-iter data does not answer whether the outer agent benefits from two complementary files vs. one unified log. That question only matters when the outer agent actually has to read both — which only happens on iter 2+ when forming a strategy update.
- Therefore: the right next probe is a failure-first eval. Pick 2–3 tasks where iter 1 reliably fails (so iter 2 has to use the trace to recover) and compare:
  - With current two-trace shape: does the outer agent actually consult both trace.json and cdp/summary.json? Quality of resulting strategy.md heuristic?
  - With a quick unified-log prototype (Derek's suggestion): same iter-2 attempt, but with the events merged. Easier for the outer agent? Does it form a better/faster heuristic?
If unified wins on iter-2 hypothesis quality, do the refactor. If two-trace wins or it's a wash, keep v1 and the cross-reference docs from PR #121 are sufficient.
Per design feedback (Derek): the two-trace cross-reference UX from the
prior commits puts friction on the outer agent. Replace it with a single
time-ordered NDJSON stream that joins the agent's turn log
(trace.json) and the browser-trace CDP firehose (cdp/raw.ndjson) into
one source-tagged file the outer agent reads first each iteration.
New: skills/autobrowse/scripts/unify-trace.mjs
- Joins trace.json + cdp/raw.ndjson by wall-clock timestamp
- Reuses bisect-cdp.mjs's monotonic→wall-clock conversion logic
(anchor on first monotonic CDP timestamp + manifest.started_at)
- Filters out noisy/redundant CDP methods (dataReceived, ExtraInfo,
frame Started/Stopped Loading, executionContextCreated, ...) so
the unified log stays scannable
- Inline summary fields per browser-event family (url+status+mime
for Network.*, level+text for Console.*, name for Page.lifecycle,
etc.); full payloads remain in the cdp/ drill-down files
- Annotates each browser row with page_id, derived by walking
top-level Page.frameNavigated events in the same order bisect-cdp
uses
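The join and filter steps listed above can be sketched as follows. This is an illustration under assumptions, not the code in unify-trace.mjs: the row fields (ts_ms, source), the function name, and the exact contents of the skip list are hypothetical, and CDP timestamps are assumed to be already converted to wall-clock milliseconds.

```javascript
// Hypothetical skip list; the commit names these families but the exact set is an assumption.
const SKIP_METHODS = new Set([
  "Network.dataReceived",
  "Network.requestWillBeSentExtraInfo",
  "Network.responseReceivedExtraInfo",
  "Page.frameStartedLoading",
  "Page.frameStoppedLoading",
  "Runtime.executionContextCreated",
]);

// Merge agent turns and CDP events into one source-tagged, time-ordered list.
function unify(agentTurns, cdpEvents) {
  const agentRows = agentTurns.map((turn) => ({ ...turn, source: "agent" }));
  const browserRows = cdpEvents
    .filter((ev) => !SKIP_METHODS.has(ev.method))
    .map((ev) => ({ ts_ms: ev.ts_ms, source: "browser", method: ev.method }));
  // Array.prototype.sort is stable, so rows with equal timestamps keep agent-before-browser order.
  return [...agentRows, ...browserRows].sort((a, b) => a.ts_ms - b.ts_ms);
}

const rows = unify(
  [{ ts_ms: 1000, turn: 1, command: "browse open <url>" }],
  [
    { ts_ms: 1200, method: "Network.requestWillBeSent" },
    { ts_ms: 1250, method: "Network.dataReceived" },
    { ts_ms: 1300, method: "Network.responseReceived" },
  ],
);
// One JSON object per line, the NDJSON shape the commit describes.
console.log(rows.map((row) => JSON.stringify(row)).join("\n"));
```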
SKILL.md changes:
- Traced-path block now invokes unify-trace.mjs after bisect-cdp
- "Read the trace" section rewritten: unified-events.jsonl is the
primary path; structured files (cdp/summary.json, trace.json,
cdp/network/failed.jsonl, etc.) are documented as agent-consumable
drill-downs for queries the stream doesn't naturally answer
- "Form one hypothesis" updated: hypothesis must cite a specific
line/timestamp in unified-events.jsonl or name the drill-down file
Verified end-to-end on news.ycombinator.com → first story's article:
6 turns, $0.31, 126-event unified log (113 browser + 13 agent),
strictly time-ordered, no leaked sessions.
No changes to evaluate.mjs or browser-trace. Default path
(no --browser-trace) is unchanged — no unify step runs.
Co-Authored-By: Claude Opus 4.7 (1M context) <[email protected]>
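The page_id annotation from this commit can be illustrated with a short walk over the event list. This is a sketch, not the shipped implementation: it assumes events are in capture order and that page numbering starts at 0 before the first top-level navigation, which may differ from how bisect-cdp numbers its pages. It relies on the CDP convention that a frame without parentId is the top-level frame.

```javascript
// Hypothetical page_id walk; the numbering scheme is an assumption.
function annotatePageIds(events) {
  let pageId = 0; // 0 = before the first top-level navigation
  return events.map((ev) => {
    const frame = ev.method === "Page.frameNavigated" ? ev.params?.frame : undefined;
    // A navigated frame with no parentId is the main frame, so a new page begins here.
    if (frame && !frame.parentId) pageId += 1;
    return { ...ev, page_id: pageId };
  });
}

const annotated = annotatePageIds([
  { method: "Network.requestWillBeSent", params: {} },
  { method: "Page.frameNavigated", params: { frame: { id: "A", url: "https://news.ycombinator.com/" } } },
  { method: "Page.frameNavigated", params: { frame: { id: "B", parentId: "A", url: "about:blank" } } },
  { method: "Page.frameNavigated", params: { frame: { id: "A", url: "https://example.com/" } } },
]);
console.log(annotated.map((ev) => ev.page_id));
```

Child-frame navigations (iframes) leave the counter untouched, so their events stay attributed to the enclosing page.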
Update: consolidating to a unified event log
Per Derek's nudge, the two-trace cross-reference UX shipped in earlier commits added friction the outer agent didn't need. The commit above replaces it with a single unified event log; what changed and the end-to-end verification are listed in its commit message. This closes the design question raised in the prior eval comment about whether to unify the logs.
Two small fixes surfaced by v3.1 verification:
1. Static bug: --browser-trace hardcoded the local browse-daemon name as
"autobrowse-main". Two parallel evaluate.mjs invocations (SKILL.md
Step-3 multi-task sub-agents) would collide on that socket. Derive a
per-process daemon name from a sha1 hash of the connectUrl
(autobrowse-<8 hex chars>) so each Browserbase session gets its own
local daemon. Deterministic, no deps beyond node:crypto.
2. QOL: MAX_TURNS is now overridable via the MAX_TURNS env var (default
30). Useful for testing failure-first recovery without patching
source. No effect on default invocations.
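Both fixes are small enough to sketch. The naming scheme (autobrowse- plus the first 8 hex characters of sha1(connectUrl)) and the default of 30 come from this commit; the function names and the handling of invalid MAX_TURNS values are assumptions.

```javascript
import { createHash } from "node:crypto";

// Daemon name: autobrowse-<first 8 hex chars of sha1(connectUrl)>.
function daemonNameFor(connectUrl) {
  const digest = createHash("sha1").update(connectUrl).digest("hex");
  return `autobrowse-${digest.slice(0, 8)}`;
}

// Assumption: missing, non-numeric, or non-positive MAX_TURNS falls back to the default of 30.
function resolveMaxTurns(env = process.env) {
  const parsed = Number.parseInt(env.MAX_TURNS ?? "", 10);
  return Number.isInteger(parsed) && parsed > 0 ? parsed : 30;
}

// Placeholder connect URLs; real ones come from the pre-created Browserbase session.
console.log(daemonNameFor("wss://connect.example.invalid/session-a"));
console.log(daemonNameFor("wss://connect.example.invalid/session-b"));
console.log(resolveMaxTurns({ MAX_TURNS: "15" }));
```

Because the name is a pure function of the connect URL, repeated browse calls within one iteration resolve to the same daemon without any shared state.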
Verification (v3.1, /tmp/autobrowse-eval-v2):
- Part 1: joiner's page_id walk matches bisect-cdp exactly (no drift).
- Part 2: SKIP_METHODS audit — Network.dataReceived + *ExtraInfo etc.
are filtered correctly; ExtraInfo headers remain accessible via
drill-down (cdp/raw.ndjson, query.mjs). Known limitation: header-based
hypotheses need drill-down today.
- Part 3: parallel-daemon contention fixed (this commit). Unit-level
verified: different connectUrls → different daemon names.
- Part 4: failure-first single-arm on httpbin-form-incomplete
- iter 1 (MAX_TURNS=15, --browser-trace): max_turns failure at $0.38;
agent burned 4 turns on verification checks (turns 11-14) before
submit clicked at turn 15. POST went through (200 OK in unified log)
but agent ran out of turns before reading response.
- Manual outer-agent step: read unified-events.jsonl, cited specific
events (POST line + 4 verification turns), wrote evidence-grounded
strategy.md update telling iter 2 to skip the verification loop.
- iter 2: 12 turns, $0.27, status=completed, echoed_custname matches.
- Net: unified-log evidence drove a real recovery on a failure-first
task. ~$0.65 total v3.1 spend.
Co-Authored-By: Claude Opus 4.7 (1M context) <[email protected]>
v3.1 verification (4/4 PASS)
| Part | What it tested | Result |
|---|---|---|
| 1 | unify-trace.mjs's page_id walk matches bisect-cdp.mjs | ✅ PASS: no drift; same per-page URL assignment as bisect's pages[] |
| 2 | SKIP_METHODS list isn't filtering load-bearing signal | ✅ PASS with note: Network.dataReceived (178 events) and *ExtraInfo (74 events) correctly filtered as noise; header-based hypotheses would need drill-down (cdp/raw.ndjson, query.mjs). Acceptable as documented limitation |
| 3 | Multi-task --browser-trace daemon contention | evaluate.mjs hardcoded daemon name autobrowse-main; parallel sub-agents would collide. Fixed in 921697c: daemon name now derived from sha1(connectUrl).slice(0,8), e.g. autobrowse-69ff4e4e. Unit-verified: different connectUrls → different names |
| 4 | Unified log actually drives multi-iter recovery | ✅ PASS — failure-first single-arm on httpbin-form-incomplete: iter 1 hit max_turns=15 ($0.38, agent burned 4 turns on verification before submit). I (as outer agent) read unified-events.jsonl, cited specific events (Network.requestWillBeSent POST /post + 4 wasted verification turns), wrote evidence-grounded strategy.md. Iter 2: 12 turns, $0.27, status=completed, pass-check matches. The unified-log evidence drove a real recovery. |
Also bundled a small QOL fix in the same commit: MAX_TURNS env-overridable (default 30 unchanged). Used for forcing iter-1 failure in Part 4; useful for any future failure-first testing.
PR commits now:
- 78cee35: initial integration
- d602d44: preflight + cross-trace docs (superseded by unified)
- 5ad0a44: multi-task sub-agent prompt edit
- b67906f: unified event log + drill-down
- 921697c: daemon-name uniqueness + MAX_TURNS env override
Ready for review.
Cursor Bugbot has reviewed your changes and found 1 potential issue.
There are 2 total unresolved issues (including 1 from previous review).
Reviewed by Cursor Bugbot for commit 921697c.
| return Math.floor((ts - anchorCdp) * 1000 + startedMs); | ||
| } | ||
| return Math.floor(ts); | ||
| } |
CDP events without params.timestamp silently dropped from unified stream
Medium Severity
cdpTsToMs reads only ev.params.timestamp, but several CDP event types handled in summarizeCdp have no timestamp field in params — including Page.frameNavigated, Console.messageAdded, Page.navigatedWithinDocument, Network.webSocketCreated, and Target.* events. These return null from cdpTsToMs and are filtered out at the if (ts_ms == null) continue guard. The summarizeCdp switch cases for these events are effectively dead code, and the unified stream silently loses navigation and console events that the file's own header comment claims to include.
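One possible fix, sketched under assumptions rather than taken from the PR: events without params.timestamp inherit the wall-clock time of the nearest preceding timestamped event, so they stay in capture order instead of being dropped. The sketch assumes events are read in capture order and that params.timestamp, when present, is CDP monotonic time in seconds; assignTimestamps and ts_inferred are hypothetical names, while anchorCdp and startedMs mirror the snippet under review.

```javascript
// Hypothetical fallback for events that carry no params.timestamp.
function assignTimestamps(events, anchorCdp, startedMs) {
  let lastMs = startedMs; // used for untimestamped events seen before any timestamped one
  return events.map((ev) => {
    const ts = ev.params?.timestamp;
    if (typeof ts === "number") {
      // Same monotonic-to-wall-clock conversion as the reviewed snippet.
      lastMs = Math.floor((ts - anchorCdp) * 1000 + startedMs);
      return { ...ev, ts_ms: lastMs, ts_inferred: false };
    }
    // No timestamp (e.g. Page.frameNavigated): inherit the previous event's time.
    return { ...ev, ts_ms: lastMs, ts_inferred: true };
  });
}

const stamped = assignTimestamps(
  [
    { method: "Target.attachedToTarget", params: {} },
    { method: "Network.requestWillBeSent", params: { timestamp: 100.5 } },
    { method: "Page.frameNavigated", params: { frame: { id: "A" } } },
  ],
  100, // anchorCdp: first monotonic CDP timestamp, in seconds
  1_700_000_000_000, // startedMs: wall-clock start of the capture, in milliseconds
);
console.log(stamped.map((ev) => [ev.method, ev.ts_ms, ev.ts_inferred]));
```

Marking inferred timestamps keeps the stream honest: a reader can tell which rows are ordered by measurement and which by position.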
Pair to the descriptor-capture change landing in the browse CLI: every
page-driving browse call (click/fill/select/goto/wait/etc.) now
side-channels a rich node descriptor to
$O11Y_ROOT/$O11Y_RUN_ID/cdp/descriptors.ndjson when both env vars are
set. The CLI was already gated on O11Y_ROOT — adding O11Y_RUN_ID lets
it know which run subdirectory to write into without inferring from
the most-recently-modified dir.
Two lines change in the traced-path loop block:
- export O11Y_RUN_ID="$RUN_ID" alongside O11Y_ROOT
- documentation note on what descriptors.ndjson is and that the
outer agent should skip it when reading the trace (it feeds
downstream codegen, not hypothesis formation)
No effect when --browser-trace is unset or when the CLI doesn't yet
have the descriptor-capture utility — the new env var is harmless.
Co-Authored-By: Claude Opus 4.7 (1M context) <[email protected]>
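The gating described in this commit can be illustrated with a small helper. It is a hypothetical sketch, not the browse CLI's code: the function name is made up, and only the env var names and the path layout come from the commit message.

```javascript
// Hypothetical helper: resolve the descriptor file only when both env vars are set.
function descriptorsPath(env = process.env) {
  const { O11Y_ROOT: root, O11Y_RUN_ID: runId } = env;
  if (!root || !runId) return null; // capture stays off unless both are present
  return `${root}/${runId}/cdp/descriptors.ndjson`;
}

console.log(descriptorsPath({ O11Y_ROOT: "/tmp/o11y", O11Y_RUN_ID: "run-001" }));
console.log(descriptorsPath({ O11Y_ROOT: "/tmp/o11y" }));
```

With an explicit run ID the writer never has to guess the target directory from modification times.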


Summary
- Wire the sibling browser-trace skill into the autobrowse loop behind a default-off --browser-trace flag (remote-only; implies --env remote).
- evaluate.mjs gains a --connect-url <wss> flag that injects --cdp $connect_url --session autobrowse-main into every inner browse call and suppresses browse stop so the named daemon survives the iteration.
- --cdp attach mode is the key fix: --remote routes Network.*/Console.* events only to the driving client, leaving the trace observer with just a few Page.lifecycleEvents. --cdp gives the observer the full per-page firehose, matching the canonical pattern in browser-trace's SKILL.md.
What you get
Verified end-to-end on news.ycombinator.com:
- … completed.
- … news.ycombinator.com, news.css, hn.js, …); per-page bisect shows {"requests":7,"failed":0,"byType":{"Document":1,"Image":3,...}}.
Compare to default mode (without --browser-trace), which is byte-for-byte unchanged.
Files changed
- skills/autobrowse/scripts/evaluate.mjs (+65): new flag, rewriteArgsForTrace argv injector, browse stop suppression, conditional system prompt.
- skills/autobrowse/SKILL.md (+55): opt-in flag wired into the loop (six surgical edits: entry points, args, run-the-inner-agent split, read-the-trace subsection, hypothesis example, rules).
- browser-trace skill: unchanged.
Test plan
- /autobrowse --task <existing> --env remote (no flag): behavior identical to today.
- --browser-trace requires remote: evaluate.mjs --connect-url … --env local errors out cleanly.
- … pages[0].url == "https://news.ycombinator.com/", not (initial).
- browse stop defense: harness suppresses it so the named daemon survives.
🤖 Generated with Claude Code
Note
Medium Risk
Changes remote Browserbase session lifecycle and browse CLI invocation; mistakes could leak sessions or break parallel runs, but default path is unchanged and teardown order is documented.
Overview
Adds an opt-in --browser-trace path to autobrowse that pairs each remote iteration with the sibling browser-trace skill: pre-create a Browserbase session, run bb-capture, drive the inner agent via evaluate.mjs --connect-url, then stop → bisect → unify → release (session must stay alive until bisect finishes).
evaluate.mjs rewrites page-driving browse calls to attach with --cdp and a per-connectUrl session name (parallel-safe), strips --remote/--local, and no-ops browse stop so the outer harness owns the session. Default runs without the flag are unchanged.
New unify-trace.mjs merges agent trace.json with CDP into unified-events.jsonl for time-ordered, evidence-based strategy.md updates; SKILL.md documents the traced loop, drill-downs, and stricter hypotheses citing that file.
Reviewed by Cursor Bugbot for commit 2c33cbd. Bugbot is set up for automated code reviews on this repo.