WebRISE: Requirement-Induced State Evaluation for MLLM-Generated Web Artifacts

WebRISE is a benchmark for evaluating whether MLLM-generated web artifacts actually work, rather than whether they merely look plausible. It compiles task requirements into Interaction Contract Graphs (ICGs) of observable UI states, user-intent transitions, and DOM/visual assertions, then evaluates generated HTML pages through browser execution.

Overview

WebRISE targets executable, interactive web artifacts generated from multimodal specifications. It evaluates behavior at the level of requirement-induced state transitions, supporting diagnosis of explicit user-facing functions and implicit product-level constraints such as state synchronization, boundary feedback, stale-state cleanup, and state preservation.

The released benchmark includes:

442 web tasks across diverse domains and scenarios.
5 input modalities: Text, Markdown, Sketch, Image, and Video.
5,495 interaction transitions.
5,271 requirement checks.
Contract-guided browser evaluation with DOM and visual evidence.

Benchmark Design

Requirement-induced state contracts: Each task is represented by an ICG that links explicit and implicit requirements to observable states, user-intent transitions, DOM/visual assertions, and coverage mappings.
Contract-guided adaptive execution: The ICG specifies what to verify, while an adaptive browser agent decides how to execute each transition on diverse generated pages.
Implementation-agnostic observations: The agent acts over indexed DOM observations instead of fixed CSS selectors, reference DOM paths, or hand-written scripts.
DOM/visual dual oracle: DOM assertions capture process and element-level evidence, while visual postconditions verify user-visible state changes from screenshots.
Diagnostic metrics: WebRISE reports state reachability, transition validity, explicit/implicit requirement coverage, and auxiliary visual quality diagnostics.
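
To make the contract structure concrete, here is a minimal sketch of how an ICG could be modeled in Python. The class and field names (Assertion, Transition, ICG, covers, reachable_states) and the shopping-cart states are hypothetical illustrations; they are not taken from the released icg.json schema, and the reachability computation is one plausible reading of "state reachability", not the benchmark's scoring code.

```python
from collections import deque
from dataclasses import dataclass, field


@dataclass(frozen=True)
class Assertion:
    kind: str          # "dom" or "visual" (hypothetical labels)
    description: str   # human-readable postcondition to verify


@dataclass(frozen=True)
class Transition:
    source: str             # id of the state the user starts in
    target: str             # id of the state expected after acting
    intent: str             # user intent, e.g. "add an item"
    assertions: tuple = ()  # Assertion objects checked after the action
    covers: tuple = ()      # requirement ids this transition provides evidence for


@dataclass
class ICG:
    initial: str
    transitions: list = field(default_factory=list)

    def reachable_states(self, passed):
        """Return states reachable from `initial` using only transitions
        whose list index is in `passed` (breadth-first search)."""
        seen = {self.initial}
        queue = deque([self.initial])
        while queue:
            state = queue.popleft()
            for i, t in enumerate(self.transitions):
                if i in passed and t.source == state and t.target not in seen:
                    seen.add(t.target)
                    queue.append(t.target)
        return seen


# Hypothetical three-transition contract for a shopping-cart page.
icg = ICG(initial="empty_cart", transitions=[
    Transition("empty_cart", "cart_with_item", "add an item", covers=("R1",)),
    Transition("cart_with_item", "checkout", "open checkout", covers=("R2",)),
    Transition("cart_with_item", "empty_cart", "remove the item", covers=("R3",)),
])

# Suppose the browser agent validated only the first and third transitions.
reached = icg.reachable_states(passed={0, 2})
```

In this sketch the checkout state is never reached because the only transition leading to it did not pass, which is the kind of diagnosis a state-level contract supports.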

Main Results

Each modality cell reports T / R / V, where T is transition validity, R is overall requirement coverage, and V is the auxiliary visual score. MD denotes the Markdown modality.

Model	Text	MD	Sketch	Image	Video	Overall
Qwen3.6-35B-A3B	26.8 / 30.5 / 78.2	15.5 / 19.2 / 80.8	41.2 / 45.4 / 77.0	46.6 / 49.6 / 71.7	49.5 / 52.2 / 72.8	50.5
Qwen3.5-122B-A10B	38.0 / 41.2 / 56.8	42.5 / 45.9 / 72.0	38.0 / 42.3 / 74.0	40.2 / 43.8 / 70.7	42.8 / 47.1 / 71.3	51.1
Qwen3.5-27B	36.3 / 40.0 / 59.9	41.7 / 45.5 / 72.1	38.6 / 42.7 / 76.8	42.6 / 46.7 / 70.6	43.1 / 46.9 / 71.8	51.7
Qwen3.5-397B-A17B	45.7 / 49.2 / 64.8	51.1 / 54.5 / 75.7	46.8 / 50.5 / 78.9	48.4 / 51.4 / 72.8	49.3 / 52.8 / 72.1	57.6
Kimi-K2.5	48.5 / 51.9 / 68.9	57.0 / 59.6 / 73.8	47.8 / 50.4 / 79.9	56.9 / 59.1 / 72.6	58.6 / 60.3 / 72.9	61.2
Qwen3.6-27B	47.9 / 50.9 / 75.3	57.5 / 60.1 / 83.0	50.4 / 53.3 / 87.2	55.2 / 57.8 / 74.1	54.2 / 57.2 / 74.1	62.5
Kimi-K2.6	44.6 / 47.3 / 83.1	51.7 / 54.9 / 87.1	47.8 / 51.5 / 86.3	58.5 / 60.4 / 73.2	63.7 / 65.4 / 73.5	63.3
Claude Opus 4.6	43.3 / 45.5 / 56.6	54.3 / 56.3 / 73.9	52.3 / 55.0 / 72.2	57.7 / 59.5 / 70.2	52.6 / 54.9 / 70.7	58.3
Gemini 3 Flash	44.7 / 48.2 / 71.9	50.0 / 54.1 / 79.3	46.1 / 49.3 / 85.4	54.1 / 57.5 / 72.4	45.6 / 48.5 / 70.8	58.5
Claude Opus 4.7	48.8 / 50.9 / 68.3	54.5 / 56.5 / 76.2	49.7 / 52.4 / 77.4	57.0 / 58.5 / 70.5	65.0 / 66.1 / 72.7	61.6
Gemini 3.1 Pro	50.7 / 53.6 / 69.7	58.9 / 61.5 / 79.2	52.2 / 54.9 / 84.8	54.5 / 57.1 / 72.2	52.0 / 54.9 / 71.6	61.9
Qwen3.6-Plus	49.3 / 51.9 / 68.2	51.7 / 54.6 / 74.5	53.8 / 56.4 / 86.3	57.5 / 59.4 / 73.8	61.7 / 63.4 / 74.8	62.5
GPT-5.4	59.7 / 61.4 / 78.4	60.5 / 62.2 / 79.8	57.8 / 60.3 / 86.6	60.0 / 62.1 / 71.5	63.1 / 64.8 / 73.7	66.8
GPT-5.5	60.3 / 62.3 / 85.6	64.4 / 66.1 / 83.3	60.6 / 62.9 / 86.1	61.8 / 63.4 / 74.1	65.6 / 66.3 / 73.9	69.1
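
For readers who want to analyze these numbers programmatically, the following sketch parses one tab-separated row of the table into structured values. The names Cell, parse_cell, and parse_row are hypothetical helpers written for this example, and the row literal is copied from the first row above.

```python
from typing import NamedTuple


class Cell(NamedTuple):
    transition_validity: float   # T
    requirement_coverage: float  # R
    visual_score: float          # V


def parse_cell(text: str) -> Cell:
    """Parse a 'T / R / V' cell such as '26.8 / 30.5 / 78.2'."""
    parts = [float(p) for p in text.split("/")]
    if len(parts) != 3:
        raise ValueError(f"expected three values, got {text!r}")
    return Cell(*parts)


def parse_row(line: str):
    """Split one tab-separated results row into (model, cells, overall)."""
    fields = line.rstrip("\n").split("\t")
    model, *cells, overall = fields
    return model, [parse_cell(c) for c in cells], float(overall)


row = ("Qwen3.6-35B-A3B\t26.8 / 30.5 / 78.2\t15.5 / 19.2 / 80.8\t"
       "41.2 / 45.4 / 77.0\t46.6 / 49.6 / 71.7\t49.5 / 52.2 / 72.8\t50.5")
model, cells, overall = parse_row(row)

# Unweighted mean of all fifteen T, R, and V values in the row.
mean_of_all = sum(v for cell in cells for v in cell) / (3 * len(cells))
```

This README does not define how the Overall column is computed. For the row above, the unweighted mean of the fifteen cell values is about 50.47, which agrees with the reported 50.5 after rounding; treat that as an observation about this row, not as the benchmark's definition.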

Repository Structure

.
├── generation/
│   ├── icg_pipeline.py
│   └── gen_input/
├── inference/
│   └── generate_html.py
├── evaluation/
│   ├── eval_agentmode.py
│   ├── dom_observation.py
│   ├── dom_assert.py
│   ├── dom_scorer.py
│   ├── scorer.py
│   ├── metrics.py
│   ├── test_assets/
│   └── browser-use/
├── requirements.txt
└── .env.example

Dataset

The WebRISE data release is hosted on Hugging Face:

Dataset: IIGroup/WebRISE

Download the dataset with the Hugging Face CLI:

hf download IIGroup/WebRISE \
  --repo-type dataset

The dataset is organized as:

<DATASET_DIR>/
├── requirements_full.json
└── data/
    └── <TASK_ID>/
        ├── icg.json
        ├── <artifact>.html
        └── ...
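
Given this layout, a short Python sketch can enumerate tasks and locate each task's ICG and HTML artifacts. The function name iter_tasks is hypothetical, and the sketch assumes only what the tree above shows: one directory per task under data/, each containing an icg.json and one or more .html files.

```python
from pathlib import Path


def iter_tasks(data_root):
    """Yield (task_id, icg_path, html_paths) for every task directory
    under `data_root` that contains an icg.json file."""
    for task_dir in sorted(Path(data_root).iterdir()):
        icg_path = task_dir / "icg.json"
        if task_dir.is_dir() and icg_path.is_file():
            yield task_dir.name, icg_path, sorted(task_dir.glob("*.html"))
```

For example, `for task_id, icg, pages in iter_tasks("path/to/dataset/data"): ...` walks the tasks in sorted order and skips stray files or directories without a contract.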

Setup

git clone https://github.com/IIGROUP/WebRISE.git
cd WebRISE

pip install -r requirements.txt
python -m playwright install chromium

cp .env.example .env

Set API keys and judge model options in .env or in your shell. Common settings include:

OPENAI_API_KEY=your_api_key
OPENAI_BASE_URL=your_optional_base_url
MODEL_NAME=your_generation_model
WEBRISE_DATA_ROOT=path/to/dataset/data
WEB_EVAL_MODEL_AGENT=your_agent_model
WEB_EVAL_MODEL_SCORER=your_scoring_model
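
Before launching a long run, it can help to check that these settings are present in the environment. The sketch below uses the variable names listed above; the helper name missing_settings is hypothetical, and treating OPENAI_BASE_URL as optional while requiring the rest is an assumption based on the placeholder text, not on the scripts themselves. It inspects the process environment only and does not parse the .env file.

```python
import os

SETTINGS = (
    "OPENAI_API_KEY", "OPENAI_BASE_URL", "MODEL_NAME",
    "WEBRISE_DATA_ROOT", "WEB_EVAL_MODEL_AGENT", "WEB_EVAL_MODEL_SCORER",
)
# Assumption: only the base URL is optional, as its placeholder suggests.
OPTIONAL = {"OPENAI_BASE_URL"}


def missing_settings(env=None):
    """Return the names of non-optional settings that are unset or empty."""
    env = os.environ if env is None else env
    return [name for name in SETTINGS
            if name not in OPTIONAL and not env.get(name)]
```

Calling `missing_settings()` with no argument checks the current shell environment; passing a dictionary makes the check easy to test.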

Run Model Inference

Generate HTML artifacts from released modality inputs:

MODEL_NAME=your_model_name \
WEBRISE_DATA_ROOT=path/to/dataset/data \
TASK_MODE_FILTER=TASK_ID:Text \
python inference/generate_html.py

By default, outputs are written to:

inference_outputs/<RUN_NAME>/<MODEL>/<MODALITY>/

Run Evaluation

Evaluate a generated HTML artifact against its ICG:

python evaluation/eval_agentmode.py \
  --html path/to/generated.html \
  --icg path/to/icg.json \
  --output eval_runs/example

You can also use the lightweight shell wrapper:

bash evaluation/eval_agentmode.sh \
  path/to/generated.html \
  path/to/icg.json
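
To evaluate many artifacts, the single-file command can be wrapped in a small driver. The sketch below only uses the --html, --icg, and --output flags shown above; the function names build_eval_command and evaluate_many, the dry_run switch, and the choice to name each output directory after the HTML file stem are assumptions made for this example.

```python
import subprocess
import sys
from pathlib import Path


def build_eval_command(html, icg, output):
    """Return the argv list for one eval_agentmode.py run."""
    return [
        sys.executable, "evaluation/eval_agentmode.py",
        "--html", str(html),
        "--icg", str(icg),
        "--output", str(output),
    ]


def evaluate_many(pairs, output_root="eval_runs", dry_run=True):
    """Build, and optionally run, one evaluation per (html, icg) pair."""
    commands = []
    for html, icg in pairs:
        # Assumption: one output directory per HTML file, named by its stem.
        out_dir = Path(output_root) / Path(html).stem
        cmd = build_eval_command(html, icg, out_dir)
        commands.append(cmd)
        if not dry_run:
            # Run from the repository root so the script path resolves.
            subprocess.run(cmd, check=True)
    return commands
```

With dry_run left at its default the function only returns the commands, which is useful for inspecting a batch before spending API calls. Two HTML files with the same stem would share an output directory, so adjust the naming if that can occur.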

Utilities: Build Input Modalities

The released dataset already contains modality inputs. These scripts are mainly useful when regenerating or auditing modality assets.

DATA_ROOT=path/to/dataset/data
TASK_ID=your_task_id

python generation/gen_input/build_text_md_sketch_inputs.py \
  --data-root "$DATA_ROOT" \
  --tasks "$TASK_ID"

python generation/gen_input/build_image_inputs.py \
  "$TASK_ID" \
  --data-root "$DATA_ROOT"

python generation/gen_input/build_video_inputs.py \
  "$TASK_ID" \
  --data-root "$DATA_ROOT" \
  --passed-only

Citation

If you use WebRISE in your research, please cite our paper:

@misc{meng2026webriserequirementinducedstateevaluation,
      title={WebRISE: Requirement-Induced State Evaluation for MLLM-Generated Web Artifacts}, 
      author={Yuxin Meng and Yuhan Suo and Junjie Wang and Yuhan Sun and Yiyao Yu and Ruixu Zhang and Ruining Hu and Yubin Wang and Shouwei Ruan and Bin Wang and Yuxiang Zhang and Yujiu Yang},
      year={2026},
      eprint={2606.03220},
      archivePrefix={arXiv},
      primaryClass={cs.CL},
      url={https://arxiv.org/abs/2606.03220}, 
}
