CUJBench is a benchmark for diagnosing failed critical user journeys by connecting browser-visible symptoms with backend observability evidence. Each scenario is packaged as a deterministic multi-modal snapshot behind a fixed 12-tool interface, so every agent run sees the same screenshots, network traces, logs, traces, metrics, and operational context.
The benchmark contains 87 labeled scenarios with layered ground truth for root-cause prediction, evidence citation, and reference-trajectory evaluation. The paper shows that six frontier models remain far from saturation on this task: mean accuracy across models is 19.7%, and the best model reaches only 52%.
- First benchmark in this area to combine browser-visible failure evidence with backend observability in a diagnosis task
- Deterministic snapshot evaluation with a fixed read-only tool surface for cross-run comparability
- Layered ground truth with root-cause labels, evidence IDs, and reference diagnostic trajectories
The released artifact expands into a `scenarios/` directory containing the active registry, taxonomy metadata, evaluation-set metadata, and 87 scenario directories with cached tool responses and ground-truth annotations.
git clone [email protected]:haoming29/CUJBench.git
cd CUJBench
git lfs pullRestore the corpus into a local workspace:
uv run cujbench restore --dest /path/to/workspaceThis extracts a scenarios/ directory into the destination path. If you want to
restore a non-default archive, pass --bundle /path/to/archive.tar.gz.
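After restoring, a quick way to sanity-check the corpus is to list what the registry contains. This is a minimal sketch, assuming `registry.json` exposes a top-level `"scenarios"` mapping keyed by scenario ID; the actual schema may differ, so consult the restored file.

```python
import json
from pathlib import Path

# Minimal post-restore sanity check. Assumes registry.json has a top-level
# "scenarios" mapping keyed by scenario ID; adjust to the actual schema.
registry = json.loads(Path("scenarios/registry.json").read_text())

for scenario_id in sorted(registry["scenarios"]):
    print(scenario_id)

print(f"{len(registry['scenarios'])} scenarios registered")
```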
If you want to inspect a few released scenarios without restoring the full bundle, see `examples/`. It contains five representative scenarios copied verbatim from the released review archive, together with a subset `registry.json` for navigation.
```text
CUJBench/
├── artifacts/          # Corpus bundle and manifest
├── docs/               # Documentation and evaluation guide
├── harness/            # Source code of the evaluation harness
├── examples/           # Small browseable subset of released scenarios
├── scripts/            # Shell helper for bundle restore compatibility
├── tests/              # Python test suite for the harness
├── .env.example        # Environment variable template
└── batch_config.json   # Batch-eval configuration
```
Install the evaluation harness and configure your API key:

```bash
uv sync --group dev
cp .env.example .env
```

Set `OPENROUTER_API_KEY` in `.env`, then run a single evaluation:

```bash
uv run cujbench eval \
  --scenario-id SCEN-005-cart-failure \
  --baseline agent-full \
  --model anthropic/claude-sonnet-4.6
```

Supported baselines:
| Baseline | Description |
|---|---|
| `agent-full` | Tool-calling agent with access to the full CUJBench tool surface. |
| `agent-browser-only` | Tool-calling agent restricted to browser and frontend evidence only. |
| `baseline-retrieval` | BM25 retrieval over cached evidence followed by a single LLM diagnosis call. |
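To make the `baseline-retrieval` row concrete, here is a self-contained BM25 sketch in the spirit of that baseline. It is illustrative only: the harness's actual tokenization, parameters, and evidence-corpus construction live in `harness/` and are not reproduced here, and the example evidence strings are invented.

```python
import math
from collections import Counter

def bm25_rank(query: str, docs: list[str], k1: float = 1.5, b: float = 0.75) -> list[int]:
    """Rank document indices by Okapi BM25 score against the query."""
    tokenized = [doc.lower().split() for doc in docs]
    n = len(tokenized)
    avgdl = sum(len(toks) for toks in tokenized) / n
    # Document frequency of each term, used for IDF.
    df = Counter(term for toks in tokenized for term in set(toks))

    def score(toks: list[str]) -> float:
        tf = Counter(toks)
        total = 0.0
        for term in query.lower().split():
            if tf[term] == 0:
                continue
            idf = math.log((n - df[term] + 0.5) / (df[term] + 0.5) + 1)
            denom = tf[term] + k1 * (1 - b + b * len(toks) / avgdl)
            total += idf * tf[term] * (k1 + 1) / denom
        return total

    return sorted(range(n), key=lambda i: score(tokenized[i]), reverse=True)

# Invented evidence snippets, purely for illustration.
evidence = [
    "checkout service returned HTTP 500 on POST /api/cart",
    "frontend rendered empty cart after add-to-cart click",
    "payment service latency within normal bounds",
]
print(bm25_rank("cart checkout 500", evidence))  # best-matching index first
```

The top-ranked evidence would then be packed into a single diagnosis prompt, which is why this baseline issues exactly one LLM call per scenario.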
For batch runs and more evaluation details, see `docs/evaluation.md`.
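The batch-evaluation command documented there is the supported path. For a quick ad-hoc sweep, the single-run command shown above can also be looped from Python; this sketch assumes nothing beyond those documented flags, and the scenario list is a placeholder.

```python
import subprocess

# Ad-hoc sweep using the documented single-run command; the supported
# batch workflow is the CLI's batch command (see docs/evaluation.md).
scenario_ids = ["SCEN-005-cart-failure"]  # placeholder; take real IDs from registry.json

for sid in scenario_ids:
    subprocess.run(
        [
            "uv", "run", "cujbench", "eval",
            "--scenario-id", sid,
            "--baseline", "agent-full",
            "--model", "anthropic/claude-sonnet-4.6",
        ],
        check=True,  # stop on the first failing run
    )
```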
```text
scenarios/
├── registry.json
├── taxonomy/
├── eval_sets/
└── SCEN-xxx-.../
    ├── browser/
    │   └── screenshots/
    ├── context/
    │   └── alert.json
    ├── ground_truth/
    │   ├── root_cause.json
    │   └── reference_trajectory.json
    └── tool_cache/
```
Each scenario directory contains the frozen evidence exposed through the deterministic tool interface, together with the annotations used for scoring.
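As one concrete illustration of how the layered annotations could drive scoring, consider the hedged sketch below. The file paths come from the tree above, but the JSON field names (`root_cause_id`, `evidence_ids`) are assumptions, not the released schema.

```python
import json
from pathlib import Path

def score_prediction(scenario_dir: Path, predicted_cause: str, cited_ids: set[str]) -> dict:
    """Compare an agent's answer to a scenario's ground truth.

    Field names "root_cause_id" and "evidence_ids" are illustrative
    assumptions; consult the restored annotations for the real schema.
    """
    gt = json.loads((scenario_dir / "ground_truth" / "root_cause.json").read_text())
    gold_evidence = set(gt["evidence_ids"])
    return {
        "root_cause_correct": predicted_cause == gt["root_cause_id"],
        # Fraction of gold evidence items the agent actually cited.
        "evidence_recall": len(cited_ids & gold_evidence) / max(len(gold_evidence), 1),
    }
```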
| Family | Apps | Representative signal | Easy | Medium | Hard | N |
|---|---|---|---|---|---|---|
| Baseline | Both | Healthy end-to-end CUJs with no injected failure | 2 | 0 | 0 | 2 |
| Browser proxy faults | Both | HAR, screenshots, DOM, console, and browser-visible timing or content anomalies | 5 | 48 | 3 | 56 |
| Backend flag faults | OpenTelemetry Demo | Recent changes, traces, logs, metrics, and backend service symptoms | 0 | 4 | 0 | 4 |
| Compound faults | OpenTelemetry Demo | Cross-modal evidence linking browser symptoms with backend change or telemetry signal | 0 | 0 | 18 | 18 |
| Frontend mutations | Tractor Store | Event-flow drift, listener failures, and browser state inconsistencies | 0 | 3 | 4 | 7 |
Scenario ID conventions:

- `SCEN-0xx-*` corresponds to OpenTelemetry Demo
- `SCEN-1xx-*` corresponds to Tractor Store

For the full metadata map, see `scenarios/registry.json` after restoring the bundle.
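A small illustration of applying the convention, reusing the hypothetical registry shape from the restore snippet above:

```python
import json
from pathlib import Path

# Split the corpus by application using the scenario-ID prefix convention.
registry = json.loads(Path("scenarios/registry.json").read_text())
ids = registry["scenarios"]

otel_demo = [sid for sid in ids if sid.startswith("SCEN-0")]
tractor_store = [sid for sid in ids if sid.startswith("SCEN-1")]
print(len(otel_demo), "OpenTelemetry Demo scenarios")
print(len(tractor_store), "Tractor Store scenarios")
```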
| Tool | What it provides |
|---|---|
| `view_screenshot` | Captured browser screenshots for the failed or degraded page state |
| `get_browser_console` | Browser console logs, warnings, and JavaScript errors |
| `get_network_requests` | Network request and response records, including timing and status details |
| `get_dom_snapshot` | The captured DOM or HTML state of the rendered page |
| `search_logs` | Backend service logs relevant to the scenario |
| `get_traces` | Distributed tracing data across services involved in the failing flow |
| `get_error_rate` | Error-rate metrics for relevant services or endpoints |
| `get_recent_changes` | Recent deploy or change context associated with the scenario |
| `get_service_topology` | Service dependency and call-topology context |
| `get_cuj_report` | The execution report for the critical user journey test |
| `get_browser_storage` | Browser storage state such as cookies and local storage |
| `get_frontend_spans` | Browser-side spans or frontend telemetry associated with the run |
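To make the deterministic-snapshot idea concrete, the sketch below replays this tool surface from a scenario's `tool_cache/` directory. The one-JSON-file-per-tool layout and file naming are illustrative assumptions; the released cache format may differ.

```python
import json
from pathlib import Path

class CachedToolSurface:
    """Replay frozen tool responses for one scenario.

    Assumes tool_cache/ holds one JSON file per tool (e.g. get_traces.json);
    the released cache layout may differ.
    """

    def __init__(self, scenario_dir: str | Path):
        self.cache_dir = Path(scenario_dir) / "tool_cache"

    def call(self, tool_name: str) -> dict:
        # Every agent run reads the same frozen response, which is what
        # makes runs comparable across models and repetitions.
        return json.loads((self.cache_dir / f"{tool_name}.json").read_text())

surface = CachedToolSurface("scenarios/SCEN-005-cart-failure")
console = surface.call("get_browser_console")
```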
- ✅ Scenario Corpus: Released scenarios with cached tool responses and scoring annotations.
- ✅ Evaluation Harness: Deterministic evaluation with `agent-full`, `agent-browser-only`, and `baseline-retrieval`.
- ✅ CLI Workflows: Restore, single-run, and batch-evaluation commands for the public release.
- 🚧 Scenario Generation And Packaging Pipeline (coming soon): Public capture, packaging, and bundle-construction workflow.
If you use CUJBench in research, cite the paper:

```bibtex
@misc{cujbench2026,
  title         = {CUJBench: Benchmarking LLM Agents on Cross-Modal Failure Diagnosis from Browser to Backend},
  author        = {Haoming Meng},
  year          = {2026},
  eprint        = {2604.23455},
  archivePrefix = {arXiv},
  primaryClass  = {cs.SE},
  url           = {https://arxiv.org/abs/2604.23455v1}
}
```

Apache License 2.0. See LICENSE for details.
