
CUJBench: Benchmarking LLM Agents on Cross-Modal Failure Diagnosis from Browser to Backend

arXiv:2604.23455 · Apache-2.0 license

CUJBench teaser showing scenario generation and evaluation harness

CUJBench is a benchmark for diagnosing failed critical user journeys by connecting browser-visible symptoms with backend observability evidence. Each scenario is packaged as a deterministic multi-modal snapshot behind a fixed 12-tool interface, so every agent run sees the same screenshots, network requests, logs, traces, metrics, and operational context.

The benchmark contains 87 labeled scenarios with layered ground truth for root-cause prediction, evidence citation, and reference-trajectory evaluation. The paper shows that six frontier models remain far from saturation on this task: overall accuracy is 19.7% with a ceiling of 52%.

Why CUJBench

  • First benchmark in this area to combine browser-visible failure evidence with backend observability in a diagnosis task
  • Deterministic snapshot evaluation with a fixed read-only tool surface for cross-run comparability
  • Layered ground truth with root-cause labels, evidence IDs, and reference diagnostic trajectories

The released artifact expands into a scenarios/ directory containing the active registry, taxonomy metadata, evaluation-set metadata, and 87 scenario directories with cached tool responses and ground-truth annotations.

Quick Start

git clone git@github.com:haoming29/CUJBench.git
cd CUJBench
git lfs pull

Restore the corpus into a local workspace:

uv run cujbench restore --dest /path/to/workspace

This extracts a scenarios/ directory into the destination path. If you want to restore a non-default archive, pass --bundle /path/to/archive.tar.gz.

Browse Example Scenarios

If you want to inspect a few released scenarios without restoring the full bundle, see examples/. It contains five representative scenarios copied verbatim from the released review archive, together with a subset registry.json for navigation.

Repository Layout

CUJBench/
├── artifacts/        # Corpus bundle and manifest
├── docs/             # Documentation and evaluation guide
├── harness/          # Source code of the evaluation harness
├── examples/         # Small browseable subset of released scenarios
├── scripts/          # Shell helper for bundle restore compatibility
├── tests/            # Python test suite for the harness
├── .env.example      # Environment variable template
└── batch_config.json # Batch-eval configuration

Run Evaluation

Install the evaluation harness and configure your API key:

uv sync --group dev
cp .env.example .env

Set OPENROUTER_API_KEY in .env, then run a single evaluation:

uv run cujbench eval \
  --scenario-id SCEN-005-cart-failure \
  --baseline agent-full \
  --model anthropic/claude-sonnet-4.6

Other supported baselines:

  • agent-full
  • agent-browser-only
  • baseline-retrieval
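
The single-run command above is easy to script across baselines. A minimal sketch, assuming only the CLI flags documented here (the actual invocation is left commented out, since it shells out to the harness):

```python
import subprocess

BASELINES = ["agent-full", "agent-browser-only", "baseline-retrieval"]

def eval_cmd(scenario_id, baseline, model):
    """Build the argv list for one `uv run cujbench eval` invocation."""
    if baseline not in BASELINES:
        raise ValueError(f"unknown baseline: {baseline}")
    return [
        "uv", "run", "cujbench", "eval",
        "--scenario-id", scenario_id,
        "--baseline", baseline,
        "--model", model,
    ]

# Run every baseline against one scenario, sequentially:
# for b in BASELINES:
#     subprocess.run(
#         eval_cmd("SCEN-005-cart-failure", b, "anthropic/claude-sonnet-4.6"),
#         check=True,
#     )
```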

Evaluation Baselines

| Baseline | Description |
| --- | --- |
| `agent-full` | Tool-calling agent with access to the full CUJBench tool surface. |
| `agent-browser-only` | Tool-calling agent restricted to browser and frontend evidence only. |
| `baseline-retrieval` | BM25 retrieval over cached evidence followed by a single LLM diagnosis call. |
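
To illustrate the retrieval step of baseline-retrieval, here is a toy Okapi BM25 ranker in pure Python. This is a generic sketch, not the harness's implementation, and the parameter defaults (k1=1.5, b=0.75) are conventional choices rather than values taken from CUJBench:

```python
import math
from collections import Counter

def bm25_rank(query_terms, docs, k1=1.5, b=0.75):
    """Rank documents (lists of tokens) by Okapi BM25 score for the query.

    Returns document indices sorted from most to least relevant.
    """
    n = len(docs)
    avgdl = sum(len(d) for d in docs) / n  # average document length
    df = Counter()                         # document frequency per term
    for d in docs:
        for t in set(d):
            df[t] += 1

    def score(doc):
        tf = Counter(doc)
        s = 0.0
        for t in query_terms:
            if t not in tf:
                continue
            idf = math.log(1 + (n - df[t] + 0.5) / (df[t] + 0.5))
            norm = k1 * (1 - b + b * len(doc) / avgdl)
            s += idf * tf[t] * (k1 + 1) / (tf[t] + norm)
        return s

    return sorted(range(n), key=lambda i: score(docs[i]), reverse=True)
```

In the real baseline, the top-ranked evidence chunks would be concatenated into a single diagnosis prompt.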

For batch runs and more evaluation details, see docs/evaluation.md.

Restored Scenario Layout

scenarios/
├── registry.json
├── taxonomy/
├── eval_sets/
└── SCEN-xxx-.../
    ├── browser/
    │   └── screenshots/
    ├── context/
    │   └── alert.json
    ├── ground_truth/
    │   ├── root_cause.json
    │   └── reference_trajectory.json
    └── tool_cache/

Each scenario directory contains the frozen evidence exposed through the deterministic tool interface, together with the annotations used for scoring.
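
A quick sanity check over a restored scenario directory can be sketched against this layout. The function below only knows about the annotation files named in the tree above; screenshot and tool-cache contents vary per scenario:

```python
# Relative paths every restored scenario is expected to contain,
# per the layout above.
EXPECTED = [
    "context/alert.json",
    "ground_truth/root_cause.json",
    "ground_truth/reference_trajectory.json",
]

def missing_annotations(present):
    """Return the expected annotation paths absent from `present`,
    an iterable of relative file paths inside one scenario directory."""
    present = set(present)
    return [p for p in EXPECTED if p not in present]
```

For a `pathlib.Path` scenario directory, the file list can be produced with `(p.relative_to(scen_dir).as_posix() for p in scen_dir.rglob("*") if p.is_file())`.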

Scenario Taxonomy

| Family | Apps | Representative signal | Easy | Medium | Hard | N |
| --- | --- | --- | ---: | ---: | ---: | ---: |
| Baseline | Both | Healthy end-to-end CUJs with no injected failure | 2 | 0 | 0 | 2 |
| Browser proxy faults | Both | HAR, screenshots, DOM, console, and browser-visible timing or content anomalies | 5 | 48 | 3 | 56 |
| Backend flag faults | OpenTelemetry Demo | Recent changes, traces, logs, metrics, and backend service symptoms | 0 | 4 | 0 | 4 |
| Compound faults | OpenTelemetry Demo | Cross-modal evidence linking browser symptoms with backend change or telemetry signal | 0 | 0 | 18 | 18 |
| Frontend mutations | Tractor Store | Event-flow drift, listener failures, and browser state inconsistencies | 0 | 3 | 4 | 7 |

Scenario ID conventions:

  • SCEN-0xx-* corresponds to OpenTelemetry Demo
  • SCEN-1xx-* corresponds to Tractor Store
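
The convention can be expressed directly; the mapping below assumes only the two prefixes listed above:

```python
import re

# Per the ID convention: SCEN-0xx → OpenTelemetry Demo, SCEN-1xx → Tractor Store.
_APP_BY_PREFIX = {"0": "OpenTelemetry Demo", "1": "Tractor Store"}

def app_for_scenario(scenario_id):
    """Map a scenario ID like 'SCEN-005-cart-failure' to its target app."""
    m = re.match(r"SCEN-(\d)\d{2}-", scenario_id)
    if not m or m.group(1) not in _APP_BY_PREFIX:
        raise ValueError(f"unrecognized scenario ID: {scenario_id}")
    return _APP_BY_PREFIX[m.group(1)]
```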

For the full metadata map, see scenarios/registry.json after restoring the bundle.

Available Tools

| Tool | What it provides |
| --- | --- |
| `view_screenshot` | Captured browser screenshots for the failed or degraded page state |
| `get_browser_console` | Browser console logs, warnings, and JavaScript errors |
| `get_network_requests` | Network request and response records, including timing and status details |
| `get_dom_snapshot` | The captured DOM or HTML state of the rendered page |
| `search_logs` | Backend service logs relevant to the scenario |
| `get_traces` | Distributed tracing data across services involved in the failing flow |
| `get_error_rate` | Error-rate metrics for relevant services or endpoints |
| `get_recent_changes` | Recent deploy or change context associated with the scenario |
| `get_service_topology` | Service dependency and call-topology context |
| `get_cuj_report` | The execution report for the critical user journey test |
| `get_browser_storage` | Browser storage state such as cookies and local storage |
| `get_frontend_spans` | Browser-side spans or frontend telemetry associated with the run |
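
The determinism contract (same call, same evidence, every run) can be illustrated with an in-memory dispatcher. The on-disk format of tool_cache/ is not documented here, so the cache keying below is a hypothetical sketch, not the harness's scheme:

```python
import json

# The 12 read-only tools listed above.
TOOLS = {
    "view_screenshot", "get_browser_console", "get_network_requests",
    "get_dom_snapshot", "search_logs", "get_traces", "get_error_rate",
    "get_recent_changes", "get_service_topology", "get_cuj_report",
    "get_browser_storage", "get_frontend_spans",
}

class CachedToolSurface:
    """Serve tool calls from a pre-built cache so repeated runs see
    identical evidence. `cache` maps (tool_name, canonical_args) to a
    response, where canonical_args is the key-sorted JSON encoding of
    the argument dict (a hypothetical keying scheme for this sketch)."""

    def __init__(self, cache):
        self.cache = cache

    @staticmethod
    def key(tool, **args):
        return (tool, json.dumps(args, sort_keys=True))

    def call(self, tool, **args):
        if tool not in TOOLS:
            raise ValueError(f"unknown tool: {tool}")
        return self.cache[self.key(tool, **args)]
```

Because every response is frozen at snapshot time, two agents issuing the same call sequence observe byte-identical evidence.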

Benchmark Components

  • Scenario Corpus: Released scenarios with cached tool responses and scoring annotations.
  • Evaluation Harness: Deterministic evaluation with agent-full, agent-browser-only, and baseline-retrieval.
  • CLI Workflows: Restore, single-run, and batch-evaluation commands for the public release.
  • 🚧 Scenario Generation And Packaging Pipeline (coming soon): Public capture, packaging, and bundle-construction workflow.

Citation

If you use CUJBench in research, cite the paper:

@misc{cujbench2026,
  title        = {CUJBench: Benchmarking LLM Agents on Cross-Modal Failure Diagnosis from Browser to Backend},
  author       = {Haoming Meng},
  year         = {2026},
  eprint       = {2604.23455},
  archivePrefix = {arXiv},
  primaryClass = {cs.SE},
  url          = {https://arxiv.org/abs/2604.23455v1}
}

License

Apache License 2.0. See LICENSE for details.
