
CUJBench: Benchmarking LLM Agents on Cross-Modal Failure Diagnosis from Browser to Backend

arXiv:2604.23455 · Apache-2.0 license

CUJBench teaser showing scenario generation and evaluation harness

CUJBench is a benchmark for diagnosing failed critical user journeys by connecting browser-visible symptoms with backend observability evidence. Each scenario is packaged as a deterministic multi-modal snapshot behind a fixed 12-tool interface, so every agent run sees the same screenshots, network requests, logs, traces, metrics, and operational context.

The benchmark contains 87 labeled scenarios with layered ground truth for root-cause prediction, evidence citation, and reference-trajectory evaluation. The paper shows that six frontier models remain far from saturation on this task: overall accuracy is 19.7% with a ceiling of 52%.

Why CUJBench

  • First benchmark in this area to combine browser-visible failure evidence with backend observability in a diagnosis task
  • Deterministic snapshot evaluation with a fixed read-only tool surface for cross-run comparability
  • Layered ground truth with root-cause labels, evidence IDs, and reference diagnostic trajectories

The released artifact expands into a scenarios/ directory containing the active registry, taxonomy metadata, evaluation-set metadata, and 87 scenario directories with cached tool responses and ground-truth annotations.

Quick Start

git clone git@github.com:haoming29/CUJBench.git
cd CUJBench
git lfs pull

Restore the corpus into a local workspace:

uv run cujbench restore --dest /path/to/workspace

This extracts a scenarios/ directory into the destination path. If you want to restore a non-default archive, pass --bundle /path/to/archive.tar.gz.

Browse Example Scenarios

If you want to inspect a few released scenarios without restoring the full bundle, see examples/. It contains five representative scenarios copied verbatim from the released review archive, together with a subset registry.json for navigation.

Repository Layout

CUJBench/
├── artifacts/        # Corpus bundle and manifest
├── docs/             # Documentation and evaluation guide
├── harness/          # Source code of the evaluation harness
├── examples/         # Small browseable subset of released scenarios
├── scripts/          # Shell helper for bundle restore compatibility
├── tests/            # Python test suite for the harness
├── .env.example      # Environment variable template
└── batch_config.json # Batch-eval configuration

Run Evaluation

Install the evaluation harness and configure your API key:

uv sync --group dev
cp .env.example .env

Set OPENROUTER_API_KEY in .env, then run a single evaluation:

uv run cujbench eval \
  --scenario-id SCEN-005-cart-failure \
  --baseline agent-full \
  --model anthropic/claude-sonnet-4.6

Other supported baselines:

  • agent-full
  • agent-browser-only
  • baseline-retrieval
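
The single-run command above is easy to script across baselines. A minimal sketch, assuming only the CLI flags documented here (the actual invocation is left commented out, since it shells out to the harness):

```python
import subprocess

BASELINES = ["agent-full", "agent-browser-only", "baseline-retrieval"]

def eval_cmd(scenario_id, baseline, model):
    """Build the argv list for one `uv run cujbench eval` invocation."""
    if baseline not in BASELINES:
        raise ValueError(f"unknown baseline: {baseline}")
    return [
        "uv", "run", "cujbench", "eval",
        "--scenario-id", scenario_id,
        "--baseline", baseline,
        "--model", model,
    ]

# Run every baseline against one scenario, sequentially:
# for b in BASELINES:
#     subprocess.run(
#         eval_cmd("SCEN-005-cart-failure", b, "anthropic/claude-sonnet-4.6"),
#         check=True,
#     )
```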

Evaluation Baselines

| Baseline | Description |
| --- | --- |
| `agent-full` | Tool-calling agent with access to the full CUJBench tool surface. |
| `agent-browser-only` | Tool-calling agent restricted to browser and frontend evidence only. |
| `baseline-retrieval` | BM25 retrieval over cached evidence followed by a single LLM diagnosis call. |
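
To illustrate the retrieval step of baseline-retrieval, here is a toy Okapi BM25 ranker in pure Python. This is a generic sketch, not the harness's implementation, and the parameter defaults (k1=1.5, b=0.75) are conventional choices rather than values taken from CUJBench:

```python
import math
from collections import Counter

def bm25_rank(query_terms, docs, k1=1.5, b=0.75):
    """Rank documents (lists of tokens) by Okapi BM25 score for the query.

    Returns document indices sorted from most to least relevant.
    """
    n = len(docs)
    avgdl = sum(len(d) for d in docs) / n  # average document length
    df = Counter()                         # document frequency per term
    for d in docs:
        for t in set(d):
            df[t] += 1

    def score(doc):
        tf = Counter(doc)
        s = 0.0
        for t in query_terms:
            if t not in tf:
                continue
            idf = math.log(1 + (n - df[t] + 0.5) / (df[t] + 0.5))
            norm = k1 * (1 - b + b * len(doc) / avgdl)
            s += idf * tf[t] * (k1 + 1) / (tf[t] + norm)
        return s

    return sorted(range(n), key=lambda i: score(docs[i]), reverse=True)
```

In the real baseline, the top-ranked evidence chunks would be concatenated into a single diagnosis prompt.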

For batch runs and more evaluation details, see docs/evaluation.md.

Restored Scenario Layout

scenarios/
├── registry.json
├── taxonomy/
├── eval_sets/
└── SCEN-xxx-.../
    ├── browser/
    │   └── screenshots/
    ├── context/
    │   └── alert.json
    ├── ground_truth/
    │   ├── root_cause.json
    │   └── reference_trajectory.json
    └── tool_cache/

Each scenario directory contains the frozen evidence exposed through the deterministic tool interface, together with the annotations used for scoring.
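
A quick sanity check over a restored scenario directory can be sketched against this layout. The function below only knows about the annotation files named in the tree above; screenshot and tool-cache contents vary per scenario:

```python
# Relative paths every restored scenario is expected to contain,
# per the layout above.
EXPECTED = [
    "context/alert.json",
    "ground_truth/root_cause.json",
    "ground_truth/reference_trajectory.json",
]

def missing_annotations(present):
    """Return the expected annotation paths absent from `present`,
    an iterable of relative file paths inside one scenario directory."""
    present = set(present)
    return [p for p in EXPECTED if p not in present]
```

For a `pathlib.Path` scenario directory, the file list can be produced with `(p.relative_to(scen_dir).as_posix() for p in scen_dir.rglob("*") if p.is_file())`.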

Scenario Taxonomy

| Family | Apps | Representative signal | Easy | Medium | Hard | N |
| --- | --- | --- | ---: | ---: | ---: | ---: |
| Baseline | Both | Healthy end-to-end CUJs with no injected failure | 2 | 0 | 0 | 2 |
| Browser proxy faults | Both | HAR, screenshots, DOM, console, and browser-visible timing or content anomalies | 5 | 48 | 3 | 56 |
| Backend flag faults | OpenTelemetry Demo | Recent changes, traces, logs, metrics, and backend service symptoms | 0 | 4 | 0 | 4 |
| Compound faults | OpenTelemetry Demo | Cross-modal evidence linking browser symptoms with backend change or telemetry signal | 0 | 0 | 18 | 18 |
| Frontend mutations | Tractor Store | Event-flow drift, listener failures, and browser state inconsistencies | 0 | 3 | 4 | 7 |

Scenario ID conventions:

  • SCEN-0xx-* corresponds to OpenTelemetry Demo
  • SCEN-1xx-* corresponds to Tractor Store
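
The convention can be expressed directly; the mapping below assumes only the two prefixes listed above:

```python
import re

# Per the ID convention: SCEN-0xx → OpenTelemetry Demo, SCEN-1xx → Tractor Store.
_APP_BY_PREFIX = {"0": "OpenTelemetry Demo", "1": "Tractor Store"}

def app_for_scenario(scenario_id):
    """Map a scenario ID like 'SCEN-005-cart-failure' to its target app."""
    m = re.match(r"SCEN-(\d)\d{2}-", scenario_id)
    if not m or m.group(1) not in _APP_BY_PREFIX:
        raise ValueError(f"unrecognized scenario ID: {scenario_id}")
    return _APP_BY_PREFIX[m.group(1)]
```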

For the full metadata map, see scenarios/registry.json after restoring the bundle.

Available Tools

| Tool | What it provides |
| --- | --- |
| `view_screenshot` | Captured browser screenshots for the failed or degraded page state |
| `get_browser_console` | Browser console logs, warnings, and JavaScript errors |
| `get_network_requests` | Network request and response records, including timing and status details |
| `get_dom_snapshot` | The captured DOM or HTML state of the rendered page |
| `search_logs` | Backend service logs relevant to the scenario |
| `get_traces` | Distributed tracing data across services involved in the failing flow |
| `get_error_rate` | Error-rate metrics for relevant services or endpoints |
| `get_recent_changes` | Recent deploy or change context associated with the scenario |
| `get_service_topology` | Service dependency and call-topology context |
| `get_cuj_report` | The execution report for the critical user journey test |
| `get_browser_storage` | Browser storage state such as cookies and local storage |
| `get_frontend_spans` | Browser-side spans or frontend telemetry associated with the run |
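
The determinism contract (same call, same evidence, every run) can be illustrated with an in-memory dispatcher. The on-disk format of tool_cache/ is not documented here, so the cache keying below is a hypothetical sketch, not the harness's scheme:

```python
import json

# The 12 read-only tools listed above.
TOOLS = {
    "view_screenshot", "get_browser_console", "get_network_requests",
    "get_dom_snapshot", "search_logs", "get_traces", "get_error_rate",
    "get_recent_changes", "get_service_topology", "get_cuj_report",
    "get_browser_storage", "get_frontend_spans",
}

class CachedToolSurface:
    """Serve tool calls from a pre-built cache so repeated runs see
    identical evidence. `cache` maps (tool_name, canonical_args) to a
    response, where canonical_args is the key-sorted JSON encoding of
    the argument dict (a hypothetical keying scheme for this sketch)."""

    def __init__(self, cache):
        self.cache = cache

    @staticmethod
    def key(tool, **args):
        return (tool, json.dumps(args, sort_keys=True))

    def call(self, tool, **args):
        if tool not in TOOLS:
            raise ValueError(f"unknown tool: {tool}")
        return self.cache[self.key(tool, **args)]
```

Because every response is frozen at snapshot time, two agents issuing the same call sequence observe byte-identical evidence.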

Benchmark Components

  • Scenario Corpus: Released scenarios with cached tool responses and scoring annotations.
  • Evaluation Harness: Deterministic evaluation with agent-full, agent-browser-only, and baseline-retrieval.
  • CLI Workflows: Restore, single-run, and batch-evaluation commands for the public release.
  • 🚧 Scenario Generation And Packaging Pipeline (coming soon): Public capture, packaging, and bundle-construction workflow.

Citation

If you use CUJBench in research, cite the paper:

@misc{cujbench2026,
  title        = {CUJBench: Benchmarking LLM Agents on Cross-Modal Failure Diagnosis from Browser to Backend},
  author       = {Haoming Meng},
  year         = {2026},
  eprint       = {2604.23455},
  archivePrefix = {arXiv},
  primaryClass = {cs.SE},
  url          = {https://arxiv.org/abs/2604.23455v1}
}

License

Apache License 2.0. See LICENSE for details.
