autobrowse: add opt-in --browser-trace integration#121

Open: aq17 wants to merge 6 commits into main from autobrowse-browser-trace

aq17 commented May 29, 2026

Summary

  • Wire the sibling browser-trace skill into the autobrowse loop behind a default-off --browser-trace flag (remote-only; implies --env remote).
  • evaluate.mjs gains a --connect-url <wss> flag that injects --cdp $connect_url --session autobrowse-main into every inner browse call and suppresses browse stop so the named daemon survives the iteration.
  • The --cdp attach mode is the key fix: --remote routes Network.*/Console.* events only to the driving client, leaving the trace observer with just a few Page.lifecycleEvents. --cdp gives the observer the full per-page firehose — matching the canonical pattern in browser-trace's SKILL.md.

What you get

Verified end-to-end on news.ycombinator.com:

  • Inner agent: 3 turns, $0.08, status completed.
  • Trace: 78 CDP events, 7 Network requests captured with real URLs (news.ycombinator.com, news.css, hn.js, …); per-page bisect shows {"requests":7,"failed":0,"byType":{"Document":1,"Image":3,...}}.
  • No leftover Browserbase sessions; clean release each iter.

Default mode (without --browser-trace) is byte-for-byte unchanged.

Files changed

  • skills/autobrowse/scripts/evaluate.mjs+65: new flag, rewriteArgsForTrace argv injector, browse stop suppression, conditional system prompt.
  • skills/autobrowse/SKILL.md+55: opt-in flag wired into the loop (six surgical edits — entry points, args, run-the-inner-agent split, read-the-trace subsection, hypothesis example, rules).
  • browser-trace skill — unchanged.

Test plan

  • Default-path regression: /autobrowse --task <existing> --env remote (no flag) — behavior identical to today.
  • --browser-trace requires remote: evaluate.mjs --connect-url … --env local errors out cleanly.
  • Single traced iteration on HN: trace captures all 7 Network requests + lifecycle events; pages[0].url == "https://news.ycombinator.com/", not (initial).
  • browse stop defense: harness suppresses it so the named daemon survives.
  • No session leaks after the loop.
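The "--browser-trace requires remote" check in the test plan could be expressed roughly as below. This is a minimal sketch, not the code in evaluate.mjs: the helper name `validateTraceArgs`, the returned shape, and the error wording are all hypothetical, and the scheme check is an assumption based on the flag being documented as `--connect-url <wss>`.

```javascript
// Hypothetical helper (not from the PR): validates the flag combination
// described in the test plan before any browser work starts.
function validateTraceArgs({ connectUrl, env }) {
  // Untraced path: nothing to validate, leave env as the caller gave it.
  if (!connectUrl) return { ok: true, env };

  // Assumption: a connect URL is a WebSocket endpoint (ws:// or wss://).
  if (!/^wss?:\/\//.test(connectUrl)) {
    return { ok: false, error: `--connect-url must be a WebSocket URL, got: ${connectUrl}` };
  }

  // Tracing is remote-only, so an explicit local env is a hard error.
  if (env === "local") {
    return { ok: false, error: "--connect-url requires --env remote" };
  }

  // Tracing implies the remote environment when env was not given.
  return { ok: true, env: "remote" };
}
```

Returning a result object rather than throwing keeps the "errors out cleanly" behavior in the caller's hands, which can print the message and exit with a non-zero status.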

🤖 Generated with Claude Code


Note

Medium Risk
Changes remote Browserbase session lifecycle and browse CLI invocation; mistakes could leak sessions or break parallel runs, but default path is unchanged and teardown order is documented.

Overview
Adds an opt-in --browser-trace path to autobrowse that pairs each remote iteration with the sibling browser-trace skill: pre-create a Browserbase session, run bb-capture, drive the inner agent via evaluate.mjs --connect-url, then stop → bisect → unify → release (session must stay alive until bisect finishes).

evaluate.mjs rewrites page-driving browse calls to attach with --cdp and a per-connectUrl session name (parallel-safe), strips --remote/--local, and no-ops browse stop so the outer harness owns the session. Default runs without the flag are unchanged.
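To make the argv rewrite concrete, here is a rough sketch of what an injector like `rewriteArgsForTrace` might do. The function name comes from the PR, but the signature, the return shape, the choice to append the flags at the end, and treating every non-stop `browse` call as page-driving are assumptions for illustration; the real implementation in evaluate.mjs may differ.

```javascript
// Hypothetical reconstruction; the real rewriteArgsForTrace may differ.
const ENV_FLAGS = new Set(["--remote", "--local"]);

function rewriteArgsForTrace(argv, connectUrl, sessionName) {
  // argv is the inner command split into tokens,
  // e.g. ["browse", "open", "https://example.com", "--remote"].
  if (argv[0] !== "browse") return { argv, suppressed: false };

  // `browse stop` becomes a no-op so the outer harness keeps owning the session.
  if (argv[1] === "stop") return { argv: [], suppressed: true };

  // Strip the environment flags, then attach to the pre-created session.
  const kept = argv.filter((token) => !ENV_FLAGS.has(token));
  return {
    argv: [...kept, "--cdp", connectUrl, "--session", sessionName],
    suppressed: false,
  };
}
```

For example, `rewriteArgsForTrace(["browse", "open", "https://news.ycombinator.com/", "--remote"], url, name)` yields the same command with `--remote` removed and `--cdp <url> --session <name>` appended.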

New unify-trace.mjs merges agent trace.json with CDP into unified-events.jsonl for time-ordered, evidence-based strategy.md updates; SKILL.md documents the traced loop, drill-downs, and stricter hypotheses citing that file.

Reviewed by Cursor Bugbot for commit 2c33cbd.

aq17 and others added 2 commits May 28, 2026 17:22
Wire the sibling browser-trace skill into the autobrowse loop behind a
default-off --browser-trace flag. When set, each iteration pre-creates a
Browserbase --keep-alive session, attaches bb-capture as a passive CDP
observer, and runs evaluate.mjs with --connect-url so every inner browse
call is rewritten to attach via --cdp $connectUrl --session
autobrowse-main. browse stop is suppressed under trace mode so the named
daemon survives the iteration.

The --cdp attach mode is the key: --remote routes Network.*/Console.*
events only to the driving client, so the trace observer sees nothing
useful. --cdp gives the observer the full per-page firehose, matching
the canonical pattern documented in browser-trace's SKILL.md.

Default path is unchanged; --browser-trace implies --env remote.

Co-Authored-By: Claude Opus 4.7 (1M context) <[email protected]>
- Add a fail-fast preflight in the traced-path block: if
  ${CLAUDE_SKILL_DIR}/../browser-trace/scripts/bb-capture.mjs is missing,
  bail with an install hint pointing at github.com/browserbase/skills.
  Prevents the harness from blowing up halfway through with "no such
  file or directory" on bb-capture.mjs.

- Make explicit in "Read the trace" + "Form one hypothesis" that under
  --browser-trace there are now TWO complementary traces: trace.json
  (what the agent did, per turn) and cdp/summary.json (what the browser
  did, per page). Document the cross-reference pattern: find the failing
  page in cdp/summary.json, then jump back to trace.json for the turn
  whose command produced it. Hypotheses should cite both — "the click
  went through but /api/checkout returned 403" beats "the click failed."

Co-Authored-By: Claude Opus 4.7 (1M context) <[email protected]>
@aq17 aq17 requested review from shubh24 and ziruihao and removed request for shubh24 May 29, 2026 00:28
\`\`\`
browse open <url> --local
\`\`\``;
\`\`\``);

System prompt contradicts itself about browse stop in trace mode

Medium Severity

When connectUrl is set, envDesc explicitly tells the inner agent "Do not run browse stop", but the Workflow Pattern (steps 1 and 7) and Critical Rule 1, all part of the same system prompt, still instruct the agent to run browse stop. Setting openFlag to "" also leaves a trailing space in those template lines (e.g. "browse open <url> " instead of "browse open <url>"). The executeCommand suppression prevents functional breakage, but the contradictory instructions cause the inner agent to waste turns on suppressed browse stop calls, burning through the 30-turn budget unnecessarily.
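One way to address both points is to build the prompt lines conditionally. The sketch below is illustrative only: `openFlag` and `connectUrl` mirror the names in the review comment, but `openCommand`, `workflowSteps`, and the step wording are hypothetical and do not reproduce the actual system prompt.

```javascript
// Hypothetical prompt helpers; the step text is placeholder wording.
function openCommand(openFlag) {
  // Join only non-empty parts so an empty openFlag leaves no trailing space.
  return ["browse open <url>", openFlag].filter(Boolean).join(" ");
}

function workflowSteps(connectUrl, openFlag) {
  const traced = Boolean(connectUrl);
  const steps = [];
  // Only the untraced path is told to stop the daemon up front.
  if (!traced) steps.push("Run `browse stop` to clear any stale session.");
  steps.push("Run `" + openCommand(openFlag) + "` to load the page.");
  steps.push("Complete the task and write the result.");
  steps.push(
    traced
      ? "Do not run `browse stop`; the outer harness owns the session."
      : "Run `browse stop` when finished."
  );
  return steps;
}
```

With this shape the traced prompt carries a single, consistent instruction about `browse stop`, so the inner agent has no reason to spend turns on calls the harness will suppress.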

Reviewed by Cursor Bugbot for commit d602d44.

Closes the gap flagged in PR #121 review: /autobrowse --tasks a,b
--browser-trace would silently fall back to the untraced path inside
each spawned sub-agent because the sub-agent prompt template said only
"Use --env <env>" without mentioning --browser-trace.

The sub-agent prompt now explicitly requires the traced-path loop block
(pre-create session, bb-capture, --connect-url to evaluate.mjs, stop+
bisect, release) when the parent invocation passed --browser-trace.
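The ordering of the traced-path block (create, capture, run, stop and bisect, release) can be sketched as a small orchestrator. Everything here is hypothetical: the function `runTracedIteration` and the injected step names stand in for the shell commands in SKILL.md and are not part of the PR. The point of the sketch is only the ordering guarantee, with the session released last even if the inner agent fails.

```javascript
// Hypothetical orchestration; every step function is injected by the caller.
async function runTracedIteration(steps) {
  const session = await steps.createSession();
  let capture = null;
  try {
    capture = await steps.startCapture(session); // passive CDP observer
    await steps.runInnerAgent(session);          // evaluate.mjs --connect-url ...
  } finally {
    try {
      if (capture) {
        await steps.stopCapture(capture);
        // The session has to stay alive until bisect finishes.
        await steps.bisect(session);
      }
    } finally {
      // Release runs last, whether or not the earlier steps succeeded.
      await steps.releaseSession(session);
    }
  }
}
```

Nesting the two `finally` blocks means a failed bisect still releases the session, which is what guards against leaked sessions.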

Co-Authored-By: Claude Opus 4.7 (1M context) <[email protected]>
aq17 commented May 29, 2026

--browser-trace v1 — Empirical Evaluation Report

Goal

Determine whether the v1 two-trace shape (separate trace.json + .o11y/<run-id>/cdp/) delivers useful evidence — and inform Derek's open question: should we unify these into a single event log?

Matrix — 3 tasks × 2 configurations × 1 iteration

| Task | Config | Status | Turns | Cost | CDP events | First page captured |
| --- | --- | --- | --- | --- | --- | --- |
| hn-top-story | no-trace | ✅ completed | 4 | $0.09 | | |
| hn-top-story | with-trace | ✅ completed | 3 | $0.08 | 81 | news.ycombinator.com |
| httpbin-form | no-trace | ✅ completed | 16 | $0.40 | | |
| httpbin-form | with-trace | ✅ completed | 17 | $0.45 | 89 (2 pages) | httpbin.org/forms/post + /post |
| flights-search | no-trace | ✅ completed | 6 | $0.15 | | |
| flights-search | with-trace | ✅ completed | 3 | $0.07 | 605 | google.com/travel/flights |

Per-task convergence delta (with-trace minus without)

| Task | Turns Δ | Cost Δ | Interpretation |
| --- | --- | --- | --- |
| hn-top-story | −1 | −$0.01 | Noise: both passed in ≤4 turns either way |
| httpbin-form | +1 | +$0.05 | Trace slightly slower/more expensive |
| flights-search | −3 | −$0.08 | Trace ran in half the turns, but single-run nondeterminism dominates at this sample size |

No regression: every cell completed; trace mode never failed. Cost differences are within Sonnet 4.6 noise.
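The deltas are plain with-trace minus no-trace subtractions over the matrix. A small sketch that reproduces them from the reported turns and costs follows; the object layout and the `delta` helper are illustrative, and only the numbers come from the tables above.

```javascript
// Turns and cost per cell, copied from the evaluation matrix.
const matrix = {
  "hn-top-story":   { noTrace: { turns: 4,  cost: 0.09 }, withTrace: { turns: 3,  cost: 0.08 } },
  "httpbin-form":   { noTrace: { turns: 16, cost: 0.40 }, withTrace: { turns: 17, cost: 0.45 } },
  "flights-search": { noTrace: { turns: 6,  cost: 0.15 }, withTrace: { turns: 3,  cost: 0.07 } },
};

// Delta is with-trace minus no-trace; cost is rounded to whole cents
// to avoid floating-point residue such as -0.009999999999999995.
function delta(cell) {
  return {
    turns: cell.withTrace.turns - cell.noTrace.turns,
    cost: Math.round((cell.withTrace.cost - cell.noTrace.cost) * 100) / 100,
  };
}

const deltas = Object.fromEntries(
  Object.entries(matrix).map(([task, cell]) => [task, delta(cell)])
);
```

With one run per cell these values describe this sample only; they are not an estimate of the expected difference.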

Trace content quality (with-trace cells)

| Task | Network reqs | Failed | Distinct request types | Per-page breakdown useful? |
| --- | --- | --- | --- | --- |
| hn-top-story | 7 | 0 | Document, Image, Script, Stylesheet, Other | Marginal: single page, no XHRs, all GETs return 200. |
| httpbin-form | 5 (across 2 pages) | 1 failed (page 2, the POST) | Document, Image, Other | Yes: page 2's failed request encodes the form-submit roundtrip; this is exactly the kind of evidence an outer agent could cite ("submit returned non-200, switch to FormData / wait for redirect"). |
| flights-search | 54 | 0 | XHR(4), Fetch(8), Script(17), Image(13), Stylesheet(2), Font(4), Document(2), Manifest, Other | Yes: XHRs against gstatic/google.com with 54 reqs in a single page; this is the network-heavy case where the trace's per-page bisect would let an outer agent say "the search API returned 200 with N results, so the layout selector is wrong" instead of guessing. |

Limitations & caveats

  1. Single-iteration runs only. Every cell passed on its first iteration, so the most interesting signal — does the cross-trace pattern actually help an outer agent form a better hypothesis on iter 2+? — wasn't tested. We only verified the trace populates; we did not observe the cross-reference loop.
  2. N=1 per cell. Sonnet 4.6 takes different paths between runs; the −3 turns delta on flights could vanish in a re-run. The plan budgeted for re-runs but they weren't needed for the "does it crash?" sanity pass.
  3. No task forced ≥2 iterations. All three tasks were beatable in one shot. To observe the "trace informs strategy.md" path empirically, we'd need a corpus where the first iter is expected to fail (e.g. a site that requires a wait timeout heuristic the inner agent doesn't know upfront, or a CAPTCHA-gated page).
  4. Subjective "strategy quality" rating from the plan wasn't applied because there are no multi-iter strategy.md updates to rate.

Next steps

  • The single-iter data proves the integration works across three diverse profiles (simple read, form submit, heavy JS / XHR) and doesn't regress completion. That's enough to ship v1 with confidence.
  • The single-iter data does not answer whether the outer agent benefits from two complementary files vs. one unified log. That question only matters when the outer agent actually has to read both — which only happens on iter 2+ when forming a strategy update.
  • Therefore: the right next probe is a failure-first eval. Pick 2–3 tasks where iter 1 reliably fails (so iter 2 has to use the trace to recover) and compare:
    • With current two-trace shape: does the outer agent actually consult both trace.json and cdp/summary.json? Quality of resulting strategy.md heuristic?
    • With a quick unified-log prototype (Derek's suggestion): same iter-2 attempt, but with the events merged. Easier for the outer agent? Does it form a better/faster heuristic?

If unified wins on iter-2 hypothesis quality, do the refactor. If two-trace wins or it's a wash, keep v1 and the cross-reference docs from PR #121 are sufficient.

Per design feedback (Derek): the two-trace cross-reference UX from the
prior commits puts friction on the outer agent. Replace it with a single
time-ordered NDJSON stream that joins the agent's turn log
(trace.json) and the browser-trace CDP firehose (cdp/raw.ndjson) into
one source-tagged file the outer agent reads first each iteration.

New: skills/autobrowse/scripts/unify-trace.mjs
  - Joins trace.json + cdp/raw.ndjson by wall-clock timestamp
  - Reuses bisect-cdp.mjs's monotonic→wall-clock conversion logic
    (anchor on first monotonic CDP timestamp + manifest.started_at)
  - Filters out noisy/redundant CDP methods (dataReceived, ExtraInfo,
    frame Started/Stopped Loading, executionContextCreated, ...) so
    the unified log stays scannable
  - Inline summary fields per browser-event family (url+status+mime
    for Network.*, level+text for Console.*, name for Page.lifecycle,
    etc.); full payloads remain in the cdp/ drill-down files
  - Annotates each browser row with page_id, derived by walking
    top-level Page.frameNavigated events in the same order bisect-cdp
    uses

SKILL.md changes:
  - Traced-path block now invokes unify-trace.mjs after bisect-cdp
  - "Read the trace" section rewritten: unified-events.jsonl is the
    primary path; structured files (cdp/summary.json, trace.json,
    cdp/network/failed.jsonl, etc.) are documented as agent-consumable
    drill-downs for queries the stream doesn't naturally answer
  - "Form one hypothesis" updated: hypothesis must cite a specific
    line/timestamp in unified-events.jsonl or name the drill-down file

Verified end-to-end on news.ycombinator.com → first story's article:
  6 turns, $0.31, 126-event unified log (113 browser + 13 agent),
  strictly time-ordered, no leaked sessions.

No changes to evaluate.mjs or browser-trace. Default path
(no --browser-trace) is unchanged — no unify step runs.
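The noise filter and the page_id annotation described in this commit message could look roughly like the sketch below. The CDP method names are real protocol events, but the exact contents of the skip list, the output row shape, and the page numbering scheme (0-based, advancing on each top-level navigation after the first) are assumptions; bisect-cdp.mjs may number pages differently.

```javascript
// Assumed subset of the noisy methods named in the commit message.
const SKIP_METHODS = new Set([
  "Network.dataReceived",
  "Network.requestWillBeSentExtraInfo",
  "Network.responseReceivedExtraInfo",
  "Page.frameStartedLoading",
  "Page.frameStoppedLoading",
  "Runtime.executionContextCreated",
]);

// Walks raw CDP events in capture order, drops noise, and tags each
// surviving row with the page it belongs to.
function annotateBrowserEvents(rawEvents) {
  let pageId = 0;
  let sawNavigation = false;
  const rows = [];
  for (const ev of rawEvents) {
    // A top-level navigation is a frameNavigated whose frame has no parent.
    const isTopLevelNav =
      ev.method === "Page.frameNavigated" && !ev.params?.frame?.parentId;
    if (isTopLevelNav) {
      if (sawNavigation) pageId += 1; // assumption: first navigation is page 0
      sawNavigation = true;
    }
    if (SKIP_METHODS.has(ev.method)) continue;
    rows.push({ source: "browser", method: ev.method, page_id: pageId });
  }
  return rows;
}
```

Filtering after the page walk, rather than before it, keeps the page boundaries correct even if a navigation event were ever added to the skip list.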

Co-Authored-By: Claude Opus 4.7 (1M context) <[email protected]>
aq17 commented May 29, 2026

Update — consolidating to a unified event log

Per Derek's nudge, the two-trace cross-reference UX shipped in earlier commits added friction the outer agent didn't need. New commit b67906f replaces it with a single time-ordered NDJSON stream as the primary trace artifact, while keeping the existing structured files as agent-consumable drill-downs.

What changed:

  • New: skills/autobrowse/scripts/unify-trace.mjs — joins trace.json (agent turns) + cdp/raw.ndjson (browser firehose) into unified-events.jsonl at the run root. Each line is source-tagged ("agent" | "browser"), time-ordered, with inline summary fields per event family.
  • SKILL.md "Read the trace": now points outer agents at unified-events.jsonl first. The structured drill-down files (cdp/summary.json, cdp/network/failed.jsonl, trace.json, etc.) are documented as agent tools for queries the stream doesn't naturally answer (per-page totals, grouped failure lists, full payloads).
  • SKILL.md "Form one hypothesis": hypothesis must cite a specific line/timestamp in unified-events.jsonl or name the drill-down file. Keeps updates evidence-grounded.
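As an illustration of the join, the sketch below merges two already-timestamped event lists into one source-tagged, time-ordered NDJSON string. It is not the code in unify-trace.mjs: the function name `unifyEvents` and the field names (`ts_ms`, `ts`, `kind`, `summary`) are assumptions, and the real script also filters and summarizes CDP payloads.

```javascript
// Hypothetical join of agent turns and browser events by wall-clock time.
function unifyEvents(agentEvents, browserEvents) {
  const tagged = [
    ...agentEvents.map((e) => ({ ...e, source: "agent" })),
    ...browserEvents.map((e) => ({ ...e, source: "browser" })),
  ];
  // Array.prototype.sort is stable, so events with equal timestamps
  // keep their input order (agent rows before browser rows).
  tagged.sort((a, b) => a.ts_ms - b.ts_ms);
  // One JSON object per line (NDJSON).
  return tagged
    .map((e) =>
      JSON.stringify({
        ts: new Date(e.ts_ms).toISOString(),
        source: e.source,
        kind: e.kind,
        summary: e.summary,
      })
    )
    .join("\n");
}
```

Called with one agent "reasoning" event stamped 18:25:02.783Z and browser events stamped 18:24:59.001Z, 18:25:00.034Z, and 18:25:04.702Z, the output interleaves them as browser, browser, agent, browser.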

Verified end-to-end on news.ycombinator.com (extract first article paragraph): 6 turns, $0.31, 126-event unified log (113 browser + 13 agent), strictly time-ordered. Sample first lines show interleaved sources:

2026-05-29T18:24:59.001Z browser Page.lifecycleEvent      load
2026-05-29T18:25:00.034Z browser Runtime.consoleAPICalled [v3-piercer] installed Object
2026-05-29T18:25:02.783Z agent   reasoning                I'll navigate to Hacker News, find the top story...
2026-05-29T18:25:04.702Z browser Runtime.consoleAPICalled [v3-piercer] installed Object

No changes to evaluate.mjs or the browser-trace skill. Default path (no --browser-trace) is unchanged — no unify step runs.

Closes the design question raised in the prior eval comment about whether to unify the logs.

Two small fixes surfaced by v3.1 verification:

1. Static bug: --browser-trace hardcoded the local browse-daemon name as
   "autobrowse-main". Two parallel evaluate.mjs invocations (SKILL.md
   Step-3 multi-task sub-agents) would collide on that socket. Derive a
   per-process daemon name from a sha1 hash of the connectUrl
   (autobrowse-<8 hex chars>) so each Browserbase session gets its own
   local daemon. Deterministic, no deps beyond node:crypto.

2. QOL: MAX_TURNS is now overridable via the MAX_TURNS env var (default
   30). Useful for testing failure-first recovery without patching
   source. No effect on default invocations.
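Both fixes are small enough to sketch. The helper names `daemonNameFor` and `maxTurnsFrom` are hypothetical, and falling back to 30 for missing or non-numeric values is an assumption about how invalid input is handled. The snippet uses `require` so it runs as a plain script; in an .mjs file the equivalent would be `import { createHash } from "node:crypto"`.

```javascript
const { createHash } = require("node:crypto");

// Derives a per-session daemon name: "autobrowse-" plus the first
// 8 hex characters of sha1(connectUrl). Same URL, same name.
function daemonNameFor(connectUrl) {
  const digest = createHash("sha1").update(connectUrl).digest("hex");
  return `autobrowse-${digest.slice(0, 8)}`;
}

// Reads MAX_TURNS from an env-like object, defaulting to 30.
// Assumption: anything that is not a positive integer falls back to 30.
function maxTurnsFrom(env) {
  const parsed = Number.parseInt(env.MAX_TURNS ?? "", 10);
  return Number.isInteger(parsed) && parsed > 0 ? parsed : 30;
}
```

Because the name is a pure function of the connect URL, two parallel evaluate.mjs processes attached to different Browserbase sessions get different local daemons without any coordination between them.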

Verification (v3.1, /tmp/autobrowse-eval-v2):
- Part 1: joiner's page_id walk matches bisect-cdp exactly (no drift).
- Part 2: SKIP_METHODS audit — Network.dataReceived + *ExtraInfo etc.
  are filtered correctly; ExtraInfo headers remain accessible via
  drill-down (cdp/raw.ndjson, query.mjs). Known limitation: header-based
  hypotheses need drill-down today.
- Part 3: parallel-daemon contention fixed (this commit). Unit-level
  verified: different connectUrls → different daemon names.
- Part 4: failure-first single-arm on httpbin-form-incomplete
  - iter 1 (MAX_TURNS=15, --browser-trace): max_turns failure at $0.38;
    agent burned 4 turns on verification checks (turns 11-14) before
    submit clicked at turn 15. POST went through (200 OK in unified log)
    but agent ran out of turns before reading response.
  - Manual outer-agent step: read unified-events.jsonl, cited specific
    events (POST line + 4 verification turns), wrote evidence-grounded
    strategy.md update telling iter 2 to skip the verification loop.
  - iter 2: 12 turns, $0.27, status=completed, echoed_custname matches.
  - Net: unified-log evidence drove a real recovery on a failure-first
    task. ~$0.65 total v3.1 spend.

Co-Authored-By: Claude Opus 4.7 (1M context) <[email protected]>
aq17 commented May 29, 2026

v3.1 verification (4/4 PASS) — 921697c

Spent ~$0.65 validating the four unproven assumptions flagged in the prior gut-check:

| Part | What it tested | Result |
| --- | --- | --- |
| 1 | unify-trace.mjs's page_id walk matches bisect-cdp.mjs | ✅ PASS: no drift; same per-page URL assignment as bisect's pages[] |
| 2 | SKIP_METHODS list isn't filtering load-bearing signal | ✅ PASS with note: Network.dataReceived (178 events) and *ExtraInfo (74 events) correctly filtered as noise; header-based hypotheses would need drill-down (cdp/raw.ndjson, query.mjs). Acceptable as documented limitation |
| 3 | Multi-task --browser-trace daemon contention | ⚠️ STATIC BUG FOUND: evaluate.mjs hardcoded daemon name autobrowse-main; parallel sub-agents would collide. Fixed in 921697c: daemon name now derived from sha1(connectUrl).slice(0,8), e.g. autobrowse-69ff4e4e. Unit-verified: different connectUrls → different names |
| 4 | Unified log actually drives multi-iter recovery | ✅ PASS: failure-first single-arm on httpbin-form-incomplete: iter 1 hit max_turns=15 ($0.38, agent burned 4 turns on verification before submit). I (as outer agent) read unified-events.jsonl, cited specific events (Network.requestWillBeSent POST /post + 4 wasted verification turns), wrote evidence-grounded strategy.md. Iter 2: 12 turns, $0.27, status=completed, pass-check matches. The unified-log evidence drove a real recovery. |

Also bundled a small QOL fix in the same commit: MAX_TURNS env-overridable (default 30 unchanged). Used for forcing iter-1 failure in Part 4; useful for any future failure-first testing.

PR commits now:

  • 78cee35 — initial integration
  • d602d44 — preflight + cross-trace docs (superseded by unified)
  • 5ad0a44 — multi-task sub-agent prompt edit
  • b67906f — unified event log + drill-down
  • 921697c — daemon-name uniqueness + MAX_TURNS env override

Ready for review.

cursor bot left a comment

Cursor Bugbot has reviewed your changes and found 1 potential issue.

There are 2 total unresolved issues (including 1 from previous review).

Reviewed by Cursor Bugbot for commit 921697c.

return Math.floor((ts - anchorCdp) * 1000 + startedMs);
}
return Math.floor(ts);
}

CDP events without params.timestamp silently dropped from unified stream

Medium Severity

cdpTsToMs reads only ev.params.timestamp, but several CDP event types handled in summarizeCdp have no timestamp field in params — including Page.frameNavigated, Console.messageAdded, Page.navigatedWithinDocument, Network.webSocketCreated, and Target.* events. These return null from cdpTsToMs and are filtered out at the if (ts_ms == null) continue guard. The summarizeCdp switch cases for these events are effectively dead code, and the unified stream silently loses navigation and console events that the file's own header comment claims to include.
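One possible mitigation, sketched under stated assumptions, is to let events without `params.timestamp` inherit the most recently resolved time instead of being dropped. This is not the PR's fix. The factory `makeTimestamper` is hypothetical, it assumes events are processed in capture order, and the threshold that separates monotonic seconds from epoch milliseconds is a guess rather than the condition used in unify-trace.mjs.

```javascript
// Assumption: timestamps at or above this value are already epoch milliseconds.
const EPOCH_MS_THRESHOLD = 1e11;

// startedMs is the run's wall-clock start (e.g. parsed manifest.started_at).
function makeTimestamper(startedMs) {
  let anchorCdp = null;   // first monotonic CDP timestamp, in seconds
  let lastMs = startedMs; // most recently resolved wall-clock time

  return function tsFor(ev) {
    const ts = ev.params?.timestamp;
    if (typeof ts === "number") {
      if (ts >= EPOCH_MS_THRESHOLD) {
        lastMs = Math.floor(ts);
      } else {
        if (anchorCdp === null) anchorCdp = ts;
        // Same conversion as the snippet above: offset from the anchor, in ms.
        lastMs = Math.floor((ts - anchorCdp) * 1000 + startedMs);
      }
    }
    // Events with no timestamp keep their place right after the previous event.
    return lastMs;
  };
}
```

The inherited time is approximate, but it keeps navigation and console rows in the stream at roughly the right position, which matters more for hypothesis formation than millisecond accuracy.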

Pair to the descriptor-capture change landing in the browse CLI: every
page-driving browse call (click/fill/select/goto/wait/etc.) now
side-channels a rich node descriptor to
$O11Y_ROOT/$O11Y_RUN_ID/cdp/descriptors.ndjson when both env vars are
set. The CLI was already gated on O11Y_ROOT — adding O11Y_RUN_ID lets
it know which run subdirectory to write into without inferring from
the most-recently-modified dir.

Two lines change in the traced-path loop block:
  - export O11Y_RUN_ID="$RUN_ID" alongside O11Y_ROOT
  - documentation note on what descriptors.ndjson is and that the
    outer agent should skip it when reading the trace (it feeds
    downstream codegen, not hypothesis formation)

No effect when --browser-trace is unset or when the CLI doesn't yet
have the descriptor-capture utility — the new env var is harmless.
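The two-variable gate can be sketched as a small path resolver. The function name `descriptorsPath` is hypothetical and the browse CLI's actual logic is not shown in this PR; only the env var names and the `cdp/descriptors.ndjson` location come from the commit message.

```javascript
// Hypothetical resolver: returns the descriptor file path, or null when
// capture should stay off because either variable is missing.
function descriptorsPath(env) {
  const root = env.O11Y_ROOT;
  const runId = env.O11Y_RUN_ID;
  if (!root || !runId) return null;
  // Trim trailing slashes on the root so the joined path has no "//".
  return `${root.replace(/\/+$/, "")}/${runId}/cdp/descriptors.ndjson`;
}
```

Requiring both variables means an explicit run id selects the subdirectory, so nothing has to be inferred from directory modification times.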

Co-Authored-By: Claude Opus 4.7 (1M context) <[email protected]>