autobrowse: add opt-in --browser-trace integration by aq17 · Pull Request #121 · browserbase/skills

aq17 · 2026-05-29T00:23:43Z

Summary

Wire the sibling browser-trace skill into the autobrowse loop behind a default-off --browser-trace flag (remote-only; implies --env remote).
evaluate.mjs gains a --connect-url <wss> flag that injects --cdp $connect_url --session autobrowse-main into every inner browse call and suppresses browse stop so the named daemon survives the iteration.
The --cdp attach mode is the key fix: --remote routes Network.*/Console.* events only to the driving client, leaving the trace observer with just a few Page.lifecycleEvents. --cdp gives the observer the full per-page firehose — matching the canonical pattern in browser-trace's SKILL.md.

What you get

Verified end-to-end on news.ycombinator.com:

Inner agent: 3 turns, $0.08, status completed.
Trace: 78 CDP events, 7 Network requests captured with real URLs (news.ycombinator.com, news.css, hn.js, …); per-page bisect shows {"requests":7,"failed":0,"byType":{"Document":1,"Image":3,...}}.
No leftover Browserbase sessions; clean release each iter.

Default mode (without --browser-trace) is byte-for-byte unchanged.

Files changed

skills/autobrowse/scripts/evaluate.mjs — +65: new flag, rewriteArgsForTrace argv injector, browse stop suppression, conditional system prompt.
skills/autobrowse/SKILL.md — +55: opt-in flag wired into the loop (six surgical edits — entry points, args, run-the-inner-agent split, read-the-trace subsection, hypothesis example, rules).
browser-trace skill — unchanged.

Test plan

Default-path regression: /autobrowse --task <existing> --env remote (no flag) — behavior identical to today.
--browser-trace requires remote: evaluate.mjs --connect-url … --env local errors out cleanly.
Single traced iteration on HN: trace captures all 7 Network requests + lifecycle events; pages[0].url == "https://news.ycombinator.com/", not (initial).
browse stop defense: harness suppresses it so the named daemon survives.
No session leaks after the loop.

🤖 Generated with Claude Code

Note

Medium Risk
Changes remote Browserbase session lifecycle and browse CLI invocation; mistakes could leak sessions or break parallel runs, but default path is unchanged and teardown order is documented.

Overview
Adds an opt-in --browser-trace path to autobrowse that pairs each remote iteration with the sibling browser-trace skill: pre-create a Browserbase session, run bb-capture, drive the inner agent via evaluate.mjs --connect-url, then stop → bisect → unify → release (session must stay alive until bisect finishes).

evaluate.mjs rewrites page-driving browse calls to attach with --cdp and a per-connectUrl session name (parallel-safe), strips --remote/--local, and no-ops browse stop so the outer harness owns the session. Default runs without the flag are unchanged.
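The rewriting described above can be sketched roughly as follows. The name `rewriteArgsForTrace` comes from the PR's file list, but the signature, the argv-array representation, the `sessionName` parameter, and the choice to return `null` for a suppressed command are assumptions made for illustration. This sketch also rewrites every `browse` call rather than only page-driving ones, so it is a simplification and not the code in evaluate.mjs.

```javascript
// Hypothetical sketch: the real rewriteArgsForTrace in evaluate.mjs may differ.
// `argv` is a browse command split into tokens, e.g. ['browse', 'open', url, '--remote'].
function rewriteArgsForTrace(argv, connectUrl, sessionName) {
  // Default path (no connect URL) and non-browse commands pass through untouched.
  if (!connectUrl || argv[0] !== 'browse') return argv;

  // In trace mode the outer harness owns the session, so `browse stop` is suppressed.
  if (argv[1] === 'stop') return null;

  // Drop the environment flags; the attach target comes from --cdp instead.
  const kept = argv.filter((arg) => arg !== '--remote' && arg !== '--local');
  return [...kept, '--cdp', connectUrl, '--session', sessionName];
}
```

Returning the original array unchanged when no connect URL is given mirrors the claim that untraced runs behave exactly as before.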

New unify-trace.mjs merges agent trace.json with CDP into unified-events.jsonl for time-ordered, evidence-based strategy.md updates; SKILL.md documents the traced loop, drill-downs, and stricter hypotheses citing that file.

Reviewed by Cursor Bugbot for commit 2c33cbd.

Wire the sibling browser-trace skill into the autobrowse loop behind a default-off --browser-trace flag. When set, each iteration pre-creates a Browserbase --keep-alive session, attaches bb-capture as a passive CDP observer, and runs evaluate.mjs with --connect-url so every inner browse call is rewritten to attach via --cdp $connectUrl --session autobrowse-main. browse stop is suppressed under trace mode so the named daemon survives the iteration.

The --cdp attach mode is the key: --remote routes Network.*/Console.* events only to the driving client, so the trace observer sees nothing useful. --cdp gives the observer the full per-page firehose, matching the canonical pattern documented in browser-trace's SKILL.md.

Default path is unchanged; --browser-trace implies --env remote.

Co-Authored-By: Claude Opus 4.7 (1M context) <[email protected]>

- Add a fail-fast preflight in the traced-path block: if ${CLAUDE_SKILL_DIR}/../browser-trace/scripts/bb-capture.mjs is missing, bail with an install hint pointing at github.com/browserbase/skills. Prevents the harness from blowing up halfway through with "no such file or directory" on bb-capture.mjs.
- Make explicit in "Read the trace" + "Form one hypothesis" that under --browser-trace there are now TWO complementary traces: trace.json (what the agent did, per turn) and cdp/summary.json (what the browser did, per page). Document the cross-reference pattern: find the failing page in cdp/summary.json, then jump back to trace.json for the turn whose command produced it. Hypotheses should cite both: "the click went through but /api/checkout returned 403" beats "the click failed."

Co-Authored-By: Claude Opus 4.7 (1M context) <[email protected]>
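A minimal sketch of the fail-fast preflight, assuming it were written in Node rather than in the SKILL.md loop block where the commit places it. The helper name `preflightBrowserTrace` and the wording of the hint are hypothetical; only the relative path to bb-capture.mjs and the repository pointer come from the commit message.

```javascript
import { existsSync } from 'node:fs';
import { join } from 'node:path';

// Hypothetical helper: returns an error message when bb-capture.mjs is missing, else null.
function preflightBrowserTrace(skillDir) {
  const capture = join(skillDir, '..', 'browser-trace', 'scripts', 'bb-capture.mjs');
  if (existsSync(capture)) return null;
  return `browser-trace skill not found (expected ${capture}); install it from github.com/browserbase/skills`;
}
```

A caller would run this before creating any Browserbase session and exit non-zero on a non-null result, so a missing sibling skill never leaves a half-started iteration behind.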

cursor · 2026-05-29T00:34:36Z

 \`\`\`
 browse open <url> --local
-\`\`\``;
+\`\`\``);


System prompt contradicts itself about browse stop in trace mode

Medium Severity

When connectUrl is set, envDesc explicitly tells the inner agent "Do not run browse stop", but the Workflow Pattern (steps 1 and 7) and Critical Rule 1, all part of the same system prompt, still instruct the agent to run browse stop. Setting openFlag to "" also leaves a trailing space in those template lines (e.g. "browse open <url> " instead of "browse open <url>"). The executeCommand suppression prevents functional breakage, but the contradictory instructions cause the inner agent to waste turns on suppressed browse stop calls, burning through the 30-turn budget unnecessarily.
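One way the two symptoms in this finding could be avoided is sketched below. It is not the fix applied in the PR: `commandLine` and `workflowSteps` are hypothetical helpers, and the real system prompt in evaluate.mjs has more steps than shown here.

```javascript
// Hypothetical sketch: join a command template and an optional flag without
// leaving a trailing space when the flag is empty (as openFlag is in trace mode).
function commandLine(base, flag) {
  return [base, flag].filter(Boolean).join(' ');
}

// Only mention `browse stop` when the inner agent owns the session,
// i.e. when no connectUrl was supplied.
function workflowSteps(connectUrl, openFlag) {
  const steps = [commandLine('browse open <url>', openFlag)];
  if (!connectUrl) steps.push('browse stop');
  return steps;
}
```

Building the step list conditionally keeps the prompt consistent with the "Do not run browse stop" instruction instead of relying on command suppression alone.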

Additional Locations (1)

skills/autobrowse/scripts/evaluate.mjs#L381-L392

Reviewed by Cursor Bugbot for commit d602d44.

Closes the gap flagged in PR #121 review: /autobrowse --tasks a,b --browser-trace would silently fall back to the untraced path inside each spawned sub-agent because the sub-agent prompt template said only "Use --env <env>" without mentioning --browser-trace. The sub-agent prompt now explicitly requires the traced-path loop block (pre-create session, bb-capture, --connect-url to evaluate.mjs, stop + bisect, release) when the parent invocation passed --browser-trace.

Co-Authored-By: Claude Opus 4.7 (1M context) <[email protected]>

aq17 · 2026-05-29T17:49:05Z

`--browser-trace` v1 — Empirical Evaluation Report

Goal

Determine whether the v1 two-trace shape (separate trace.json + .o11y/<run-id>/cdp/) delivers useful evidence — and inform Derek's open question: should we unify these into a single event log?

Matrix — 3 tasks × 2 configurations × 1 iteration

| Task | Config | Status | Turns | Cost | CDP events | First page captured |
|---|---|---|---|---|---|---|
| hn-top-story | no-trace | ✅ completed | 4 | $0.09 | — | — |
| hn-top-story | with-trace | ✅ completed | 3 | $0.08 | 81 | news.ycombinator.com |
| httpbin-form | no-trace | ✅ completed | 16 | $0.40 | — | — |
| httpbin-form | with-trace | ✅ completed | 17 | $0.45 | 89 (2 pages) | httpbin.org/forms/post + /post |
| flights-search | no-trace | ✅ completed | 6 | $0.15 | — | — |
| flights-search | with-trace | ✅ completed | 3 | $0.07 | 605 | google.com/travel/flights |

Per-task convergence delta (with-trace minus without)

| Task | Turns Δ | Cost Δ | Interpretation |
|---|---|---|---|
| hn-top-story | −1 | −$0.01 | Noise: both passed in ≤4 turns either way |
| httpbin-form | +1 | +$0.05 | Trace slightly slower/more expensive |
| flights-search | −3 | −$0.08 | Trace ran in half the turns, but single-run nondeterminism dominates at this sample size |
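The deltas can be recomputed from the matrix with a few lines of arithmetic. The sketch below copies the turn and cost figures from the matrix; the object layout and the `delta` helper are illustrative only.

```javascript
// Turn and cost figures copied from the evaluation matrix.
const runs = {
  'hn-top-story': { noTrace: { turns: 4, cost: 0.09 }, withTrace: { turns: 3, cost: 0.08 } },
  'httpbin-form': { noTrace: { turns: 16, cost: 0.40 }, withTrace: { turns: 17, cost: 0.45 } },
  'flights-search': { noTrace: { turns: 6, cost: 0.15 }, withTrace: { turns: 3, cost: 0.07 } },
};

// Delta is with-trace minus without; cost is rounded to whole cents to avoid float noise.
function delta({ noTrace, withTrace }) {
  return {
    turns: withTrace.turns - noTrace.turns,
    cost: Math.round((withTrace.cost - noTrace.cost) * 100) / 100,
  };
}

for (const [task, cells] of Object.entries(runs)) {
  const d = delta(cells);
  console.log(task, d.turns, d.cost.toFixed(2));
}
```

With one run per cell these differences say nothing about variance, which is the caveat the report itself raises.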

No regression: every cell completed; trace mode never failed. Cost differences are within Sonnet 4.6 noise.

Trace content quality (with-trace cells)

| Task | Network reqs | Failed | Distinct request types | Per-page breakdown useful? |
|---|---|---|---|---|
| hn-top-story | 7 | 0 | Document, Image, Script, Stylesheet, Other | Marginal: single page, no XHRs, all GETs return 200. |
| httpbin-form | 5 (across 2 pages) | 1 failed (page 2, the POST) | Document, Image, Other | Yes: page 2's failed request encodes the form-submit roundtrip; this is exactly the kind of evidence an outer agent could cite ("submit returned non-200, switch to FormData / wait for redirect"). |
| flights-search | 54 | 0 | XHR(4), Fetch(8), Script(17), Image(13), Stylesheet(2), Font(4), Document(2), Manifest, Other | Yes: XHRs against gstatic/google.com with 54 reqs in a single page; this is the network-heavy case where the trace's per-page bisect would let an outer agent say "the search API returned 200 with N results, so the layout selector is wrong" instead of guessing. |

Limitations & caveats

Single-iteration runs only. Every cell passed on its first iteration, so the most interesting signal — does the cross-trace pattern actually help an outer agent form a better hypothesis on iter 2+? — wasn't tested. We only verified the trace populates; we did not observe the cross-reference loop.
N=1 per cell. Sonnet 4.6 takes different paths between runs; the −3 turns delta on flights could vanish in a re-run. The plan budgeted for re-runs but they weren't needed for the "does it crash?" sanity pass.
No task forced ≥2 iterations. All three tasks were beatable in one shot. To observe the "trace informs strategy.md" path empirically, we'd need a corpus where the first iter is expected to fail (e.g. a site that requires a wait timeout heuristic the inner agent doesn't know upfront, or a CAPTCHA-gated page).
Subjective "strategy quality" rating from the plan wasn't applied because there are no multi-iter strategy.md updates to rate.

Next steps

The single-iter data proves the integration works across three diverse profiles (simple read, form submit, heavy JS / XHR) and doesn't regress completion. That's enough to ship v1 with confidence.
The single-iter data does not answer whether the outer agent benefits from two complementary files vs. one unified log. That question only matters when the outer agent actually has to read both — which only happens on iter 2+ when forming a strategy update.
Therefore: the right next probe is a failure-first eval. Pick 2–3 tasks where iter 1 reliably fails (so iter 2 has to use the trace to recover) and compare:
- With current two-trace shape: does the outer agent actually consult both trace.json and cdp/summary.json? Quality of resulting strategy.md heuristic?
- With a quick unified-log prototype (Derek's suggestion): same iter-2 attempt, but with the events merged. Easier for the outer agent? Does it form a better/faster heuristic?

If unified wins on iter-2 hypothesis quality, do the refactor. If two-trace wins or it's a wash, keep v1 and the cross-reference docs from PR #121 are sufficient.

Per design feedback (Derek): the two-trace cross-reference UX from the prior commits puts friction on the outer agent. Replace it with a single time-ordered NDJSON stream that joins the agent's turn log (trace.json) and the browser-trace CDP firehose (cdp/raw.ndjson) into one source-tagged file the outer agent reads first each iteration.

New: skills/autobrowse/scripts/unify-trace.mjs
- Joins trace.json + cdp/raw.ndjson by wall-clock timestamp
- Reuses bisect-cdp.mjs's monotonic→wall-clock conversion logic (anchor on first monotonic CDP timestamp + manifest.started_at)
- Filters out noisy/redundant CDP methods (dataReceived, ExtraInfo, frame Started/Stopped Loading, executionContextCreated, ...) so the unified log stays scannable
- Inline summary fields per browser-event family (url+status+mime for Network.*, level+text for Console.*, name for Page.lifecycle, etc.); full payloads remain in the cdp/ drill-down files
- Annotates each browser row with page_id, derived by walking top-level Page.frameNavigated events in the same order bisect-cdp uses

SKILL.md changes:
- Traced-path block now invokes unify-trace.mjs after bisect-cdp
- "Read the trace" section rewritten: unified-events.jsonl is the primary path; structured files (cdp/summary.json, trace.json, cdp/network/failed.jsonl, etc.) are documented as agent-consumable drill-downs for queries the stream doesn't naturally answer
- "Form one hypothesis" updated: hypothesis must cite a specific line/timestamp in unified-events.jsonl or name the drill-down file

Verified end-to-end on news.ycombinator.com → first story's article: 6 turns, $0.31, 126-event unified log (113 browser + 13 agent), strictly time-ordered, no leaked sessions.

No changes to evaluate.mjs or browser-trace. Default path (no --browser-trace) is unchanged; no unify step runs.

Co-Authored-By: Claude Opus 4.7 (1M context) <[email protected]>

aq17 · 2026-05-29T18:26:39Z

Update — consolidating to a unified event log

Per Derek's nudge, the two-trace cross-reference UX shipped in earlier commits added friction the outer agent didn't need. New commit b67906f replaces it with a single time-ordered NDJSON stream as the primary trace artifact, while keeping the existing structured files as agent-consumable drill-downs.

What changed:

New: skills/autobrowse/scripts/unify-trace.mjs — joins trace.json (agent turns) + cdp/raw.ndjson (browser firehose) into unified-events.jsonl at the run root. Each line is source-tagged ("agent" | "browser"), time-ordered, with inline summary fields per event family.
SKILL.md "Read the trace": now points outer agents at unified-events.jsonl first. The structured drill-down files (cdp/summary.json, cdp/network/failed.jsonl, trace.json, etc.) are documented as agent tools for queries the stream doesn't naturally answer (per-page totals, grouped failure lists, full payloads).
SKILL.md "Form one hypothesis": hypothesis must cite a specific line/timestamp in unified-events.jsonl or name the drill-down file. Keeps updates evidence-grounded.
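To make the join concrete, here is a rough sketch of merging the two sources into one source-tagged, time-ordered stream. It is not unify-trace.mjs: the agent-event fields (`ts`, `kind`, `text`), the single-entry skip list, and the `unify` function are assumptions, and only the monotonic branch of the timestamp conversion is shown.

```javascript
// Hypothetical sketch of a two-source join; field names are assumptions.
const SKIP_METHODS = new Set(['Network.dataReceived']); // noisy methods stay in the drill-down files

// Convert a monotonic CDP timestamp (seconds) to wall-clock milliseconds,
// anchored on the first CDP timestamp seen and the run's start time.
function cdpTsToMs(ts, anchorCdp, startedMs) {
  return Math.floor((ts - anchorCdp) * 1000 + startedMs);
}

function unify(agentEvents, cdpEvents, startedAtIso) {
  const startedMs = Date.parse(startedAtIso);
  // Note: events without params.timestamp are dropped in this sketch.
  const timed = cdpEvents.filter((ev) => typeof ev.params?.timestamp === 'number');
  const anchorCdp = timed.length ? timed[0].params.timestamp : 0;

  const rows = [
    ...agentEvents.map((ev) => ({ ts_ms: Date.parse(ev.ts), source: 'agent', kind: ev.kind, text: ev.text })),
    ...timed
      .filter((ev) => !SKIP_METHODS.has(ev.method))
      .map((ev) => ({ ts_ms: cdpTsToMs(ev.params.timestamp, anchorCdp, startedMs), source: 'browser', kind: ev.method })),
  ];

  // Array.prototype.sort is stable, so rows with equal timestamps keep their input order.
  rows.sort((a, b) => a.ts_ms - b.ts_ms);
  return rows.map((row) => JSON.stringify(row)).join('\n');
}
```

Emitting one JSON object per line keeps the output greppable by timestamp or source, which is what lets a hypothesis cite a specific line.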

Verified end-to-end on news.ycombinator.com (extract first article paragraph): 6 turns, $0.31, 126-event unified log (113 browser + 13 agent), strictly time-ordered. Sample first lines show interleaved sources:

2026-05-29T18:24:59.001Z browser Page.lifecycleEvent      load
2026-05-29T18:25:00.034Z browser Runtime.consoleAPICalled [v3-piercer] installed Object
2026-05-29T18:25:02.783Z agent   reasoning                I'll navigate to Hacker News, find the top story...
2026-05-29T18:25:04.702Z browser Runtime.consoleAPICalled [v3-piercer] installed Object

No changes to evaluate.mjs or the browser-trace skill. Default path (no --browser-trace) is unchanged — no unify step runs.

Closes the design question raised in the prior eval comment about whether to unify the logs.

Two small fixes surfaced by v3.1 verification:

1. Static bug: --browser-trace hardcoded the local browse-daemon name as "autobrowse-main". Two parallel evaluate.mjs invocations (SKILL.md Step-3 multi-task sub-agents) would collide on that socket. Derive a per-process daemon name from a sha1 hash of the connectUrl (autobrowse-<8 hex chars>) so each Browserbase session gets its own local daemon. Deterministic, no deps beyond node:crypto.
2. QOL: MAX_TURNS is now overridable via the MAX_TURNS env var (default 30). Useful for testing failure-first recovery without patching source. No effect on default invocations.

Verification (v3.1, /tmp/autobrowse-eval-v2):
- Part 1: joiner's page_id walk matches bisect-cdp exactly (no drift).
- Part 2: SKIP_METHODS audit. Network.dataReceived + *ExtraInfo etc. are filtered correctly; ExtraInfo headers remain accessible via drill-down (cdp/raw.ndjson, query.mjs). Known limitation: header-based hypotheses need drill-down today.
- Part 3: parallel-daemon contention fixed (this commit). Unit-level verified: different connectUrls → different daemon names.
- Part 4: failure-first single-arm on httpbin-form-incomplete
  - iter 1 (MAX_TURNS=15, --browser-trace): max_turns failure at $0.38; agent burned 4 turns on verification checks (turns 11-14) before submit clicked at turn 15. POST went through (200 OK in unified log) but agent ran out of turns before reading response.
  - Manual outer-agent step: read unified-events.jsonl, cited specific events (POST line + 4 verification turns), wrote evidence-grounded strategy.md update telling iter 2 to skip the verification loop.
  - iter 2: 12 turns, $0.27, status=completed, echoed_custname matches.
  - Net: unified-log evidence drove a real recovery on a failure-first task.

~$0.65 total v3.1 spend.

Co-Authored-By: Claude Opus 4.7 (1M context) <[email protected]>
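The naming scheme in fix 1 can be sketched with node:crypto as follows. The function name `daemonNameFor` is hypothetical; the sha1 digest, the 8-hex-character prefix, and the `autobrowse-` prefix are taken from the commit message.

```javascript
import { createHash } from 'node:crypto';

// Sketch of the described scheme: autobrowse-<first 8 hex chars of sha1(connectUrl)>.
function daemonNameFor(connectUrl) {
  const digest = createHash('sha1').update(connectUrl).digest('hex');
  return `autobrowse-${digest.slice(0, 8)}`;
}
```

Because the name is a pure function of the connect URL, two commands in the same iteration resolve to the same daemon, while sessions with different URLs get different names except in the unlikely event of an 8-character prefix collision.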

aq17 · 2026-05-29T19:15:11Z

v3.1 verification (4/4 PASS) — `921697c`

Spent ~$0.65 validating the four unproven points flagged in the prior gut-check:

| Part | What it tested | Result |
|---|---|---|
| 1 | `unify-trace.mjs`'s `page_id` walk matches `bisect-cdp.mjs` | ✅ PASS: no drift; same per-page URL assignment as bisect's `pages[]` |
| 2 | `SKIP_METHODS` list isn't filtering load-bearing signal | ✅ PASS with note: `Network.dataReceived` (178 events) and `ExtraInfo` (74 events) correctly filtered as noise; header-based hypotheses would need drill-down (`cdp/raw.ndjson`, `query.mjs`). Acceptable as documented limitation |
| 3 | Multi-task `--browser-trace` daemon contention | ⚠️ STATIC BUG FOUND: `evaluate.mjs` hardcoded daemon name `autobrowse-main`; parallel sub-agents would collide. Fixed in `921697c`: daemon name now derived from `sha1(connectUrl).slice(0,8)`, e.g. `autobrowse-69ff4e4e`. Unit-verified: different connectUrls → different names |
| 4 | Unified log actually drives multi-iter recovery | ✅ PASS: failure-first single-arm on `httpbin-form-incomplete`: iter 1 hit `max_turns=15` ($0.38, agent burned 4 turns on verification before submit). I (as outer agent) read `unified-events.jsonl`, cited specific events (`Network.requestWillBeSent POST /post` + 4 wasted verification turns), wrote evidence-grounded strategy.md. Iter 2: 12 turns, $0.27, status=completed, pass-check matches. The unified-log evidence drove a real recovery. |

Also bundled a small QOL fix in the same commit: MAX_TURNS env-overridable (default 30 unchanged). Used for forcing iter-1 failure in Part 4; useful for any future failure-first testing.
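A small sketch of an env-overridable turn budget. The PR only states that MAX_TURNS is overridable with a default of 30; the helper `resolveMaxTurns` and its fallback on invalid values are assumptions.

```javascript
// Hypothetical parsing helper: falls back to 30 when MAX_TURNS is unset or not a positive integer.
function resolveMaxTurns(env = process.env) {
  const parsed = Number.parseInt(env.MAX_TURNS ?? '', 10);
  return Number.isInteger(parsed) && parsed > 0 ? parsed : 30;
}
```

Setting `MAX_TURNS=15` in the environment, as Part 4 did, would then force an early `max_turns` failure without touching the source.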

PR commits now:

78cee35 — initial integration
d602d44 — preflight + cross-trace docs (superseded by unified)
5ad0a44 — multi-task sub-agent prompt edit
b67906f — unified event log + drill-down
921697c — daemon-name uniqueness + MAX_TURNS env override

Ready for review.

cursor

Cursor Bugbot has reviewed your changes and found 1 potential issue.

There are 2 total unresolved issues (including 1 from previous review).


Reviewed by Cursor Bugbot for commit 921697c.

cursor · 2026-05-29T19:23:25Z

+    return Math.floor((ts - anchorCdp) * 1000 + startedMs);
+  }
+  return Math.floor(ts);
+}


CDP events without params.timestamp silently dropped from unified stream

Medium Severity

cdpTsToMs reads only ev.params.timestamp, but several CDP event types handled in summarizeCdp have no timestamp field in params — including Page.frameNavigated, Console.messageAdded, Page.navigatedWithinDocument, Network.webSocketCreated, and Target.* events. These return null from cdpTsToMs and are filtered out at the if (ts_ms == null) continue guard. The summarizeCdp switch cases for these events are effectively dead code, and the unified stream silently loses navigation and console events that the file's own header comment claims to include.

Additional Locations (1)

skills/autobrowse/scripts/unify-trace.mjs#L145-L153

Reviewed by Cursor Bugbot for commit 921697c.
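One possible mitigation for the finding above, sketched under the assumption that cdp/raw.ndjson is stored in arrival order: let an event with no `params.timestamp` inherit the most recent converted timestamp instead of being dropped. `assignTimestamps`, the `toMs` callback, and the `ts_inferred` flag are hypothetical and do not appear in the PR.

```javascript
// Hypothetical mitigation sketch, not the PR's code.
// `toMs` converts a raw CDP timestamp to wall-clock milliseconds.
function assignTimestamps(events, toMs) {
  let lastMs = null;
  const out = [];
  for (const ev of events) {
    const raw = ev.params?.timestamp;
    if (typeof raw === 'number') lastMs = toMs(raw);
    if (lastMs == null) continue; // nothing to anchor on yet, so the event is still skipped
    out.push({ ...ev, ts_ms: lastMs, ts_inferred: typeof raw !== 'number' });
  }
  return out;
}
```

Flagging inferred timestamps keeps the approximation visible to whoever reads the stream; events that arrive before the first timestamped event would still be lost under this approach.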

Pair to the descriptor-capture change landing in the browse CLI: every page-driving browse call (click/fill/select/goto/wait/etc.) now side-channels a rich node descriptor to $O11Y_ROOT/$O11Y_RUN_ID/cdp/descriptors.ndjson when both env vars are set. The CLI was already gated on O11Y_ROOT; adding O11Y_RUN_ID lets it know which run subdirectory to write into without inferring from the most-recently-modified dir.

Two lines change in the traced-path loop block:
- export O11Y_RUN_ID="$RUN_ID" alongside O11Y_ROOT
- documentation note on what descriptors.ndjson is and that the outer agent should skip it when reading the trace (it feeds downstream codegen, not hypothesis formation)

No effect when --browser-trace is unset or when the CLI doesn't yet have the descriptor-capture utility; the new env var is harmless.

Co-Authored-By: Claude Opus 4.7 (1M context) <[email protected]>
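The gating on both environment variables might look roughly like this. The browse CLI's actual logic is not part of this PR, so `descriptorsPath` is a hypothetical helper; only the variable names and the path layout come from the commit message.

```javascript
import { join } from 'node:path';

// Hypothetical sketch: descriptor capture is active only when both variables are set.
function descriptorsPath(env = process.env) {
  const { O11Y_ROOT: root, O11Y_RUN_ID: runId } = env;
  if (!root || !runId) return null; // capture stays off
  return join(root, runId, 'cdp', 'descriptors.ndjson');
}
```

Requiring the run ID explicitly avoids guessing the target directory from modification times, which is the motivation the commit gives for the second variable.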

aq17 and others added 2 commits May 28, 2026 17:22

aq17 requested review from shubh24 and ziruihao and removed request for shubh24 May 29, 2026 00:28

cursor Bot reviewed May 29, 2026

aq17 wants to merge 6 commits into main from autobrowse-browser-trace


2 participants
