agent-benchmarks

Benchmark suites for the Lightpanda agent. Each suite lives under src/agent_benchmarks/<suite>/ and ships its own runner, grader, and README.

Suites

assistantbench — 214 live-web QA tasks from Yoran et al., EMNLP 2024. We run the 33-task validation split.
gaia — GAIA Level-1 validation split (53 tasks), graded with the paper's exact-match rubric. Requires an HF_TOKEN (gated dataset). 11 of the 53 tasks have attached PDFs/images/audio that Lightpanda can't read and will score 0 — matches the published-baseline scope.
webbench — WebBench (Halluminate × Skyvern), READ subset only (1,637 tasks across 448 sites). Text-only LLM judge (llm_judge) in the shape of Browserbase's v3Evaluator, judging task + agent answer + visited URLs (no screenshots). Default sample is one task per site (448 tasks). CREATE/UPDATE/DELETE/ FILE_MANIPULATION categories are out of scope (need real account auth on every site, side effects unverifiable from text snapshots).

Current results

Lightpanda native agent via zenai (built-in agent loop, prompt-cached, talking to the model directly — no MCP). Best run to date: single-run numbers captured 2026-06-10 with claude-sonnet-4-6, 4 workers, 1800s per-task timeout, on a lightpanda-agent build with improved tools and a faster goto (no networkidle wait).

Suite	Split	n	Strict	Notes
AssistantBench	validation	33	69.7% (soft 57.1%)	Paper's GPT-4 baseline ≈ 25% strict
GAIA Level 1	validation	53	83.0%	Paper's GPT-4+tools baseline ≈ 30%; published Claude Sonnet SOTA ≈ 82%

Zero timeouts on both. Per-task cost, computed from token counters at Sonnet 4.6 rates ($3 / $0.30 / $3.75 / $15 per M for input / cache-read / cache-write / output): GAIA $0.34, AssistantBench $1.94. This is a follow-up to the Benchmarking Lightpanda for agents post — see agent-benchmarks.md for the full iteration history (branch → cached → leaner → verify → fastgoto) and the engine/tool-surface comparison.

WebBench (READ), one-per-site sample (448), scored 25.4% strict in a separate 2026-05-20 gemini-3.5-flash run (LLM judge: gemini-3.1-pro-preview over task + final answer + visited URLs — not a canonical WebBench number; canonical uses HITL or a multimodal judge). Not re-run on Sonnet 4.6.

GAIA Level 1 includes all 11 attachment tasks: PNG/MP3/PY/TXT fed via --attach; DOCX/XLSX/PPTX extracted to text by the runner first.

Strict counts an answer correct iff its per-task score clears the suite's threshold (≥ 0.5 for AssistantBench's token-F1; ≡ 1.0 for GAIA's exact match; YES from the WebBench LLM judge). Each run writes a manifest.json next to predictions.jsonl recording the agent provider/model and the full launch argv, and graders splat that into scores.json for provenance.

Reproducing (native agent, Sonnet 4.6): uv run <suite>-run --provider anthropic --model claude-sonnet-4-6 --lightpanda <lightpanda-agent> --workers 4 --timeout 1800 from the browser repo root.

Cross-framework comparison: Lightpanda vs agent-browser vs browser-use, Claude+MCP

To isolate the browser tool surface as the only variable in a head-to-head against other agentic browsers, each one is driven by the same Claude Code session over MCP. The LLM (Claude Sonnet 4.6) is held constant; only the browser engine + its MCP tool surface differs. Currently wired up:

Lightpanda native MCP — built-in lightpanda mcp server, text-only, no rendering. Tool surface tailored to agent workflows (search / markdown / extract / structuredData).
agent-browser + Chromium — Vercel agent-browser ships no MCP server, so this repo includes a thin wrapper at competitors/agent-browser-mcp/ that exposes its CDP-style CLI commands as MCP tools, driving headless Chrome.
agent-browser + Lightpanda engine — same wrapper, same agent-browser tool surface, but AGENT_BROWSER_ENGINE=lightpanda swaps Chrome for Lightpanda underneath. Isolates "browser engine" from "tool surface" — comparing this row to the Chromium row above tells you what the engine swap costs/saves; comparing it to the Lightpanda native MCP row tells you what the tool-surface change costs/saves.
browser-use (Chromium) — ships its own stdio MCP server (python -m browser_use.mcp.server), drives full Chrome via cdp-use. The runner skips retry_with_browser_use_agent (would delegate to browser-use's own LLM and contaminate the comparison) and browser_screenshot (multimodal input, unfair vs the text-only competitors).

Single-run numbers captured 2026-05-19..2026-05-21 with claude-sonnet-4-6, 4 workers, 1800s per-task timeout, OAuth auth (Max subscription), aggressive system prompt with <ANSWER>...</ANSWER> envelope:

Suite	Lightpanda native MCP	agent-browser + Chromium	agent-browser + Lightpanda	browser-use (Chromium)
AssistantBench (33) strict	66.7%	57.6%	57.6%	39.4%
AssistantBench avg duration	765 s	1121 s	1035 s	1167 s
AssistantBench timeouts	4 / 33	10 / 33	7 / 33	6 / 33
GAIA Level 1 (53) strict	86.8%	84.9%	81.1%	47.2%
GAIA Level 1 avg duration	228 s	321 s	447 s	837 s
GAIA Level 1 timeouts	1 / 53	2 / 53	5 / 53	9 / 53

By AssistantBench difficulty (Lightpanda native / AB+Chromium / AB+Lightpanda / browser-use): Medium 85.7% / 85.7% / 85.7% / 71.4%; Hard 52.6% / 36.8% / 36.8% / 15.8%. The Lightpanda↔agent-browser gap on Hard (+15.8 pp for Lightpanda's native MCP) is concentrated on multi-source aggregation, where Lightpanda's search / markdown / extract / structuredData primitives are more efficient than agent-browser's lower-level CDP surface (open / snapshot / click / get) — and that gap survives the engine swap: running agent-browser's tool surface against Lightpanda-the-engine produces the same 36.8% on Hard as running it against Chrome. The tool surface, not the engine, is what moves the AssistantBench needle.

What the engine swap shows. On AssistantBench, swapping Chrome → Lightpanda inside agent-browser is a free speed/reliability win: identical 57.6% strict, but −86 s per task on average and 3 fewer timeouts. On GAIA, the swap pays a small accuracy tax (−3.8 pp, 81.1% vs 84.9%) and runs 126 s/task slower with 5 timeouts vs 2. The extra GAIA failures are concentrated on pages where the rendered text payload differs from Chrome's — archive.org snapshots, BBC, Reddit, and similar JS-dependent pages where Lightpanda's text-only rendering misses content Chrome would have shown. AssistantBench tasks lean on structured authoritative sources (Wikipedia, retailer pages, Yelp), where this gap doesn't bite.

The browser-use gap on both suites is wider. Two factors visible in the traces: (1) per-tool latency — browser_get_state / browser_navigate return larger payloads and run against full Chrome with extensions, capping useful tool calls around ~300 per 1800s budget vs ~600–800 for Lightpanda; (2) the timeout rate is much higher (AB 6/33 = 18%, GAIA 9/53 = 17%), concentrated on the same multi-source aggregation tasks where agent-browser also struggles but mostly still answers in time. GAIA tasks that fit a single source-page (Wikipedia, IMDB, Cornell LII) land well; the multi-hop academic-citation hunts hit the wall.

How the comparison is locked down

For the Lightpanda-MCP vs agent-browser-MCP comparison to mean anything, Claude must be restricted to ONLY the browser MCP — no WebSearch, WebFetch, Bash, Read, etc. We learned this the hard way: an earlier "clean" Lightpanda+MCP run actually used 67× WebSearch + 48× WebFetch + 14× Bash alongside the MCP tools, inflating scores by 20+ pp. The runner now enforces:

--strict-mcp-config — blocks other MCP sources (including the user's own ~/.claude.json).
--disallowed-tools <comprehensive-list> — explicit deny-list of every built-in Claude Code tool. Lockdown is load-bearing: --allowed-tools alone is not a hard restriction in -p mode (it's a permission-prompt bypass list, not a tool restriction).
--system-prompt (full replace, not append) — replaces Claude Code's default tool-use guidance so the model isn't reminded that Bash exists.
Subprocess env scrubbing — strips ANTHROPIC_API_KEY so claude -p uses the keychain OAuth token (Max subscription) instead of API-key billing.

Each prediction row's trace field captures every tool call ({tool, args}) from claude --output-format stream-json so the lockdown can be audited per-task. Across the 86 clean tasks: zero non-MCP tool calls beyond ToolSearch (the built-in MCP-schema loader Claude Code uses internally; harmless because it can't bypass --disallowed-tools).

What's comparable and what isn't

Comparison	Clean?
Lightpanda MCP vs agent-browser MCP under the same Claude session	✓ Yes — same brain, same prompt, same envelope, same parsing
Sonnet+MCP vs Flash+native-agent (e.g. 86.8% vs 75.5% GAIA)	✗ No — three variables change at once: model, agent loop, prompt+envelope

Per-model prompt design matters more than absolute prompt quality — the published Flash baseline above uses the prompt that's optimal for Flash (simple "Output ONLY the answer" + early-bail), and the Claude+MCP numbers use the prompt that's optimal for Sonnet (strict <ANSWER> envelope + persistence guidance).

Reproducing

# Prerequisites
npm install -g agent-browser && agent-browser install   # binary + Chrome
export HF_TOKEN=hf_...                                   # GAIA gated dataset
# claude CLI on PATH using Max OAuth — do NOT export ANTHROPIC_API_KEY

LP=/path/to/lightpanda/zig-out/bin/lightpanda

# Lightpanda MCP backend (built-in `lightpanda mcp`)
uv run assistantbench-mcp-run --backend lightpanda --lightpanda $LP \
  --workers 4 --model sonnet --timeout 1800
uv run gaia-mcp-run --backend lightpanda --lightpanda $LP \
  --workers 4 --model sonnet --timeout 1800

# agent-browser MCP backend (wrapper in competitors/agent-browser-mcp/)
uv run assistantbench-mcp-run --backend agent-browser \
  --agent-browser $(which agent-browser) \
  --workers 4 --model sonnet --timeout 1800
uv run gaia-mcp-run --backend agent-browser --workers 4 \
  --model sonnet --timeout 1800

# agent-browser MCP backend, but with Lightpanda as the underlying engine
# instead of Chrome. The wrapper sets AGENT_BROWSER_ENGINE=lightpanda and
# AGENT_BROWSER_EXECUTABLE_PATH=$LP on the per-worker agent-browser daemon,
# so every CDP call goes to Lightpanda. Same Claude brain, same MCP tool
# surface, just a different engine — isolates engine vs tool-surface.
uv run assistantbench-mcp-run --backend agent-browser-lightpanda \
  --agent-browser $(which agent-browser) --lightpanda $LP \
  --workers 4 --model sonnet --timeout 1800
uv run gaia-mcp-run --backend agent-browser-lightpanda \
  --agent-browser $(which agent-browser) --lightpanda $LP \
  --workers 4 --model sonnet --timeout 1800

# browser-use MCP backend (built-in `python -m browser_use.mcp.server`).
# Needs `uv sync` (browser-use is a project dep) and a system Chrome/Chromium
# on PATH. Each worker gets its own BROWSER_USE_CONFIG_DIR=/tmp/browser-use-mcp-<sess>
# tempdir for profile isolation (cleaned up at run end).
uv run assistantbench-mcp-run --backend browser-use \
  --workers 4 --model sonnet --timeout 1800
uv run gaia-mcp-run --backend browser-use \
  --workers 4 --model sonnet --timeout 1800

Results land in results/<suite>-mcp/<backend>/<timestamp>/.

Running agent-browser standalone (no MCP)

A second comparison path runs agent-browser's own chat mode directly, which uses agent-browser's internal LLM loop instead of Claude. Useful for comparing against agent-browser-as-shipped:

# Through Vercel AI Gateway
export AI_GATEWAY_API_KEY=vck_...
uv run assistantbench-ab-run --model google/gemini-3.5-flash
uv run gaia-ab-run --model google/gemini-3.5-flash

# Or direct to Google's OpenAI-compat endpoint (bypasses Vercel)
export GEMINI_DIRECT=1
uv run assistantbench-ab-run --model gemini-3.5-flash

Comparability across frameworks

Of the three suites, only GAIA and AssistantBench-strict produce numbers that can be compared directly to other frameworks' published results — both grade with deterministic rubrics from their respective papers (exact-match / token-F1) over a fixed gold answer.

WebBench is in the suite for internal Lightpanda comparison, not cross-framework comparison. Its canonical evaluation is human-in-the-loop: per the Halluminate technical report, "In order to evaluate the results of each agent trajectory, we employ a team of human annotators" who review the task, the agent's output, and a screen recording. There is no published text-only LLM-judge protocol, so our webbench-text-only variant is something we made up. It is comparable across Lightpanda runs that share judge_model and variant; it is not comparable to any number on webbench.ai or in another framework's blog.

Why WebBench (READ) but not WebVoyager

WebVoyager's canonical grader is GPT-4V over the last k screenshots of the trajectory, per their README: "We provide the task, the responses from WebVoyager, and the last k screenshots to the GPT-4V and ask it to judge whether the agent has successfully completed the task." Lightpanda has no rendering, so we can't run that grader. WebVoyager does ship a --text_only agent mode, but there is no documented text-only evaluation variant — only the multimodal one.

The thing that makes WebBench's READ subset still useful under a text-only grader, where WebVoyager would not be, is what the tasks ask the agent to produce:

WebBench READ tasks are data-extraction questions ("what is the price of X on this site?", "which results are listed on this page?"). The agent's deliverable is a textual answer, and that answer alone is what HITL annotators check too. Halluminate reports top automated agents at >75% on READ vs only 46.6% on non-READ — separating READ out is something the canonical evaluation already does.
WebVoyager tasks are mostly navigation / end-state tasks ("subscribe to this newsletter", "find and add this item to cart"). The deliverable is a UI state, not a string, and the canonical GPT-4V grader checks the screenshots to see whether that state was actually reached. A text-only judge looking at "task + answer + visited URLs" can only confirm that the agent claims it did the thing — not whether the page reflects it.

So for WebBench READ, the text-only judge is a defensible (if non-canonical) approximation of what HITL would check on the same trajectory. For WebVoyager it would be a meaningful step away from what the canonical grader actually scores. Combined with the redundancy (same judge shape, fewer tasks, narrower site set than WebBench: 643 / 15 vs 1,637 / 448), there was no reason to keep it.

If you want a number that is canonical-comparable to a published WebVoyager leaderboard, you'd need to either render screenshots (out of scope for Lightpanda) or import another framework's exact text-only judge prompt as a new llm_judge variant.

Setup

# Sync deps and create the venv
uv sync

# Build the lightpanda binary first (run from the browser repo root)
zig build -Doptimize=ReleaseFast

# Set the API key for whichever provider you'll use
export GOOGLE_API_KEY=...

# Strongly recommended: route the lightpanda `search` tool through Tavily.
# When unset, search falls back to scraping DuckDuckGo's HTML endpoint —
# noisier results, may rate-limit under concurrency. Google scraping was
# tried and dropped because Lightpanda's User-Agent and TLS fingerprint
# get blocked or fed a consent wall on essentially every query.
export TAVILY_API_KEY=tvly-...

Running

Each suite exposes <suite>-run and <suite>-grade as console scripts:

# From the browser repo root (so zig-out/bin/lightpanda resolves)
uv run assistantbench-run --limit 3
uv run assistantbench-grade results/assistantbench/<timestamp>/predictions.jsonl

Results land in results/<suite>/<UTC-timestamp>/.

Adding a new suite

Create src/agent_benchmarks/<suite>/ with __init__.py, run.py, grade.py, and a README.md.
Expose console scripts in pyproject.toml under [project.scripts]: <suite>-run = "agent_benchmarks.<suite>.run:main" and <suite>-grade = "agent_benchmarks.<suite>.grade:main".
uv sync to re-resolve.

Development

uv run ruff check src/
uv run ruff format src/

Name		Name	Last commit message	Last commit date
Latest commit History 34 Commits
competitors/agent-browser-mcp		competitors/agent-browser-mcp
src/agent_benchmarks		src/agent_benchmarks
.gitignore		.gitignore
.python-version		.python-version
README.md		README.md
pyproject.toml		pyproject.toml
uv.lock		uv.lock

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Uh oh!

Repository files navigation

agent-benchmarks

Suites

Current results

Cross-framework comparison: Lightpanda vs agent-browser vs browser-use, Claude+MCP

How the comparison is locked down

What's comparable and what isn't

Reproducing

Running agent-browser standalone (no MCP)

Comparability across frameworks

Why WebBench (READ) but not WebVoyager

Setup

Running

Adding a new suite

Development

About

Uh oh!

Releases

Packages

Uh oh!

Contributors

Uh oh!

Languages

Uh oh!

Folders and files

Latest commit

History

Repository files navigation

agent-benchmarks

Suites

Current results

Cross-framework comparison: Lightpanda vs agent-browser vs browser-use, Claude+MCP

How the comparison is locked down

What's comparable and what isn't

Reproducing

Running agent-browser standalone (no MCP)

Comparability across frameworks

Why WebBench (READ) but not WebVoyager

Setup

Running

Adding a new suite

Development

About

Resources

Uh oh!

Stars

Watchers

Forks

Releases

Packages 0

Uh oh!

Contributors

Uh oh!

Languages

Packages