ci: split scheduled pipelines into weekly Eval Report and daily E2E Test by KayMKM · Pull Request #756 · microsoft/winml-cli

KayMKM · 2026-05-26T09:38:58Z

Summary

Refactor the previous Modelkit E2E Test pipeline (which actually runs the full model registry and produces a markdown report) into two distinct pipelines with different cadences and scopes.

Renamed (no behaviour change)

Modelkit E2E Test.yml → Modelkit Eval Report.yml
templates/e2e-eval-jobs.yml → templates/eval-report-jobs.yml
Stage displayNames: E2E Eval — {QNN, OV, AMD} → Eval Report — …
Continues to run weekly on Friday 08:00 (UTC+8) against the full model registry with sharding, --list-json, --continue, --retry-failed, and report generation.

New: `Modelkit E2E Test.yml` (daily scheduled)

Schedule: 0 16 * * * UTC = 00:00 UTC+8 every day, staggered 8 h away from the weekly Eval Report cron.
Three parallel stages (QNN / OV / AMD), each running on its dedicated self-hosted agent.
Two phases per stage, both gated by queue-time parameters so a one-off run can be trimmed easily:
1. winml perf phase — runs winml perf once per (model × EP/device pair) against an inline models parameter. Default list covers one small representative model per supported task (P0 first, P1/P2 filling the remainder).
2. pytest e2e phase — runs a configurable list of tests/e2e/test_<name>_e2e.py suites (default: all 11). Tests use require_ep() to self-skip when the target EP is absent, so the same list is safe to run on all three agents.
Each winml perf step uses condition: always() so every combination runs and the stage fails on any non-zero exit. No matrix sharding, no report generation.
Reuses the eval-report setup helpers (parquet copy, uv venv, PipAuthenticate, pip install -e .[dev]).
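The schedule arithmetic above (daily cron at 16:00 UTC versus the weekly Friday 08:00 UTC+8 run) can be checked with a short sketch. The concrete dates are arbitrary illustrations, and a fixed UTC+8 offset is assumed rather than a named time zone.

```python
from datetime import datetime, timedelta, timezone

UTC8 = timezone(timedelta(hours=8))  # fixed offset, assumed for illustration

# Daily E2E Test: cron "0 16 * * *" is evaluated in UTC.
daily_utc = datetime(2026, 6, 4, 16, 0, tzinfo=timezone.utc)  # a Thursday
daily_local = daily_utc.astimezone(UTC8)

# Weekly Eval Report: Friday 08:00 in UTC+8 (2026-06-05 is a Friday).
weekly_local = datetime(2026, 6, 5, 8, 0, tzinfo=UTC8)
weekly_utc = weekly_local.astimezone(timezone.utc)

# Distance from the weekly run to the daily runs on either side of it.
hours_after_previous_daily = (weekly_utc - daily_utc) / timedelta(hours=1)
hours_before_next_daily = (daily_utc + timedelta(days=1) - weekly_utc) / timedelta(hours=1)

print(daily_local.isoformat())   # local start time of the daily run
print(weekly_utc.isoformat())    # UTC start time of the weekly run
print(hours_after_previous_daily, hours_before_next_daily)
```

Under these assumptions the weekly run starts 8 h after the preceding daily run and 16 h before the next one, so the two schedules never start at the same time.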

Why not PR-gating?

E2E runs on self-hosted hardware are too slow and too flaky (driver / firmware variance) to gate every PR. The daily cadence surfaces regressions within ~24 h without blocking developer throughput. Per-PR validation continues to rely on the existing unit / integration suites.

Portal actions (not YAML-controllable)

Repoint the existing pipeline definition to Modelkit Eval Report.yml.
Create a new pipeline definition for Modelkit E2E Test.yml.
Do not add the new pipeline as a required branch-policy check on main — it is informational only.

Files

.pipelines/Modelkit Eval Report.yml (renamed)
.pipelines/templates/eval-report-jobs.yml (renamed)
.pipelines/Modelkit E2E Test.yml (new, daily)
.pipelines/templates/e2e-test-jobs.yml (new)

Refactor the previous 'Modelkit E2E Test' pipeline (which actually runs the full model registry and produces reports) into two pipelines with distinct purposes:

Renamed (no behavior change):
- 'Modelkit E2E Test.yml' -> 'Modelkit Eval Report.yml'
- 'templates/e2e-eval-jobs.yml' -> 'templates/eval-report-jobs.yml'
- Stage displayNames: 'E2E Eval -- {QNN,OV,AMD}' -> 'Eval Report -- ...'

New (PR-gating e2e test):
- 'Modelkit E2E Test.yml': pr trigger on main with drafts:false; three parallel stages (QNN/OV/AMD) running an inline 'models' parameter (prototype: facebook/convnext-tiny-224) across every EP/device pair on each agent.
- 'templates/e2e-test-jobs.yml': single job per agent; reuses the eval-report env setup (parquet copy, uv venv, PipAuthenticate, install -e .[dev]); one 'winml perf' step per (model x pair) with condition: always() so all combinations run and the job fails on any non-zero exit. No matrix sharding, --list-json, --continue, --retry-failed, or report generation.

Portal actions still required (not YAML-controllable):
- Repoint existing pipeline definition to 'Modelkit Eval Report.yml'.
- Create new pipeline definition for the new 'Modelkit E2E Test.yml'.
- Enable 'Automatically cancel existing validation builds for previous iterations of a pull request' on the new pipeline.
- During the prototype phase, do NOT add the new pipeline as a required branch-policy check on main -- failures show red on PR but do not block merge.
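The "run every (model x pair) combination, then fail if any exited non-zero" behaviour that condition: always() gives the template can be sketched in plain Python. This is only an illustration of the control flow, not the pipeline's implementation: `run_all`, `fake_step`, and the model and pair names are hypothetical, and the step runner is injected so that no real `winml perf` invocation or flag is assumed.

```python
from itertools import product
from typing import Callable, Iterable, List, Tuple

Pair = Tuple[str, str]  # (EP, device); the values used below are placeholders


def run_all(
    models: Iterable[str],
    pairs: Iterable[Pair],
    run_step: Callable[[str, Pair], int],
) -> List[Tuple[str, Pair, int]]:
    """Run every combination and collect failures instead of stopping early."""
    failures = []
    for model, pair in product(models, pairs):
        exit_code = run_step(model, pair)
        if exit_code != 0:
            failures.append((model, pair, exit_code))
    return failures


# Hypothetical runner: one combination fails, the rest succeed.
calls = []


def fake_step(model: str, pair: Pair) -> int:
    calls.append((model, pair))
    return 1 if (model, pair) == ("model-a", ("ep1", "npu")) else 0


failures = run_all(["model-a", "model-b"], [("ep1", "cpu"), ("ep1", "npu")], fake_step)
overall_exit_code = 1 if failures else 0
```

Collecting failures rather than raising on the first one is what lets a single run report every broken combination at once.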

Convert the new "Modelkit E2E Test" pipeline from a PR gate to a daily scheduled run, and broaden its scope from winml perf only to winml perf plus a configurable list of pytest e2e suites.

Pipeline (.pipelines/Modelkit E2E Test.yml):
- Drop the `pr:` trigger.
- Add daily schedule cron '0 16 * * *' (00:00 Beijing time, branch main, always: true), staggered 8h from the weekly Eval Report cron.
- Add `runEval` (boolean, default true) so the winml perf phase can be toggled off from the queue UI.
- Add `pytestTargets` (object, default = all 11 e2e files: analyze, inspect, build, compile, config, export, optimize, quantize, sys, perf, eval). Edit at queue time to do a minimal run; empty list skips the pytest phase.
- Add `pytestTimeout` (number, default 1000) forwarded to pytest --timeout.
- All 3 stages (QNN/OV/AMD) forward the new params into the template.

Template (.pipelines/templates/e2e-test-jobs.yml):
- Bump `timeoutInMinutes` 60 -> 360 to accommodate both phases.
- Wrap the existing per-(model x pair) winml perf loop in `${{ if eq(parameters.runEval, true) }}`.
- Replace per-model failure log prefix "E2E test" with "Eval" to disambiguate from pytest e2e steps.
- Add a `${{ each target in parameters.pytestTargets }}` loop that runs `uv run --no-sync python -m pytest tests/e2e/test_<name>_e2e.py -m e2e --timeout=<pytestTimeout> --junitxml=...` with `condition: always()`. Tests use `require_ep()` to self-skip on irrelevant EPs, so it is safe to run all of them on every agent.
- Append a `PublishTestResults@2` task (`condition: always()`, JUnit format, `mergeTestResults: true`, `failTaskOnFailedTests: false`) so junit XMLs surface in the ADO Tests tab without becoming a second source of failure on top of the pytest step itself.
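To make the `pytestTargets` expansion concrete, here is a rough sketch of how the per-target command line described above could be assembled. The helper `pytest_commands` and the `junit_dir` argument are hypothetical: the commit message elides the real `--junitxml` path, so the path used here is only a placeholder.

```python
from typing import List, Sequence

# Default targets as listed in the commit message (11 e2e files).
DEFAULT_TARGETS = [
    "analyze", "inspect", "build", "compile", "config", "export",
    "optimize", "quantize", "sys", "perf", "eval",
]


def pytest_commands(
    targets: Sequence[str],
    timeout: int = 1000,
    junit_dir: str = "junit",  # placeholder; the real path is not given in the source
) -> List[List[str]]:
    """Build one pytest command per target; an empty list yields no commands."""
    return [
        [
            "uv", "run", "--no-sync", "python", "-m", "pytest",
            f"tests/e2e/test_{name}_e2e.py",
            "-m", "e2e",
            f"--timeout={timeout}",
            f"--junitxml={junit_dir}/{name}.xml",
        ]
        for name in targets
    ]


all_commands = pytest_commands(DEFAULT_TARGETS)
minimal_run = pytest_commands(["config"], timeout=300)
skipped = pytest_commands([])
```

An empty target list produces no commands at all, which mirrors the "empty list skips the pytest phase" behaviour.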

Replace the single facebook/convnext-tiny-224 seed with 13 curated (model, task) entries covering the 13 tasks in hub_models.json.

Selection rules:
- optimum_supported == true (run_eval needs ORT export)
- P0 priority preferred; P1/P2 used to fill tasks where no P0 exists
- Within a task, prefer smaller / canonical / well-downloaded models
- Avoid niche or personal fine-tunes

Final 13 rows (image-classification keeps the original convnext-tiny seed):

  image-classification            facebook/convnext-tiny-224
  feature-extraction              openai/clip-vit-base-patch32
  zero-shot-classification        openai/clip-vit-base-patch32
  zero-shot-image-classification  openai/clip-vit-base-patch32
  object-detection                facebook/detr-resnet-50
  fill-mask                       google-bert/bert-base-multilingual-cased
  masked-lm                       google-bert/bert-base-multilingual-cased
  depth-estimation                Intel/dpt-hybrid-midas
  image-feature-extraction        facebook/dinov2-small
  question-answering              deepset/roberta-base-squad2
  sentence-similarity             sentence-transformers/all-MiniLM-L6-v2
  text-classification             cross-encoder/ms-marco-MiniLM-L4-v2
  token-classification            dslim/bert-base-NER
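The selection rules above amount to a filter followed by a per-task ranking. The sketch below shows one way that could look. The entry schema (`task`, `priority`, `optimum_supported`, `size_mb`) and the model names are assumptions made for illustration, not the real layout of hub_models.json, and "smaller" stands in for the broader "smaller / canonical / well-downloaded" preference, which in practice involved manual judgement.

```python
from typing import Dict, Iterable, Tuple

PRIORITY_RANK = {"P0": 0, "P1": 1, "P2": 2}


def pick_one_per_task(entries: Iterable[dict]) -> Dict[str, str]:
    """Pick one model per task: exportable only, best priority, then smallest."""
    best: Dict[str, Tuple[Tuple[int, int], str]] = {}
    for entry in entries:
        if not entry["optimum_supported"]:
            continue  # the eval path needs an ORT export
        key = (PRIORITY_RANK[entry["priority"]], entry["size_mb"])
        task = entry["task"]
        if task not in best or key < best[task][0]:
            best[task] = (key, entry["model"])
    return {task: model for task, (_, model) in best.items()}


# Hypothetical registry rows; names and sizes are invented for the example.
entries = [
    {"model": "org/big-p0", "task": "fill-mask", "priority": "P0",
     "optimum_supported": True, "size_mb": 900},
    {"model": "org/small-p0", "task": "fill-mask", "priority": "P0",
     "optimum_supported": True, "size_mb": 300},
    {"model": "org/tiny-unsupported", "task": "fill-mask", "priority": "P0",
     "optimum_supported": False, "size_mb": 50},
    {"model": "org/only-p1", "task": "depth-estimation", "priority": "P1",
     "optimum_supported": True, "size_mb": 400},
]
selected = pick_one_per_task(entries)
```

Here the unsupported model is skipped even though it is the smallest, and the task with no P0 entry falls back to its P1 candidate.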

DingmaomaoBJTU

Nice refactoring — splitting scheduled eval-report from a fast PR-gating e2e test is a clear win. Well-documented YAML. A few suggestions below.

…peline

The autouse fixture in test_config_e2e.py patched winml.modelkit.sysinfo.resolve_device (re-export), but resolve_check_device_ep in sysinfo/device.py calls its module-local resolve_device, so the mock was a no-op on that path. EPs not installed on the host (qnn, vitisai, migraphx, nv_tensorrt_rtx) hit the real availability check and failed. Patch winml.modelkit.sysinfo.device.resolve_device as well, and make the mock EP-aware by returning the EP's supported devices from EP_SUPPORTED_DEVICES.
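The patching pitfall described above can be reproduced in isolation. In this sketch, two throwaway modules stand in for winml.modelkit.sysinfo (the re-exporting package) and sysinfo/device.py; the function bodies, the return type of `resolve_device`, and the contents of `EP_SUPPORTED_DEVICES` are assumptions made for the demonstration only.

```python
import types
from unittest import mock

# Stand-in for sysinfo/device.py: the checker calls its module-local name.
device = types.ModuleType("demo_device")
exec(
    "def resolve_device(ep):\n"
    "    raise RuntimeError(f'{ep} is not installed on this host')\n"
    "\n"
    "def resolve_check_device_ep(ep):\n"
    "    return resolve_device(ep)\n",
    device.__dict__,
)

# Stand-in for the package that re-exports the same function objects.
pkg = types.ModuleType("demo_pkg")
pkg.resolve_device = device.resolve_device
pkg.resolve_check_device_ep = device.resolve_check_device_ep

# Hypothetical mapping; the real EP_SUPPORTED_DEVICES values are not shown in the PR.
EP_SUPPORTED_DEVICES = {"qnn": ["npu"], "openvino": ["cpu", "gpu", "npu"]}


def fake_resolve_device(ep):
    """EP-aware mock: report the devices the EP supports."""
    return EP_SUPPORTED_DEVICES[ep]


def outcome(patched_module):
    """Patch resolve_device on one module and report which code path ran."""
    with mock.patch.object(patched_module, "resolve_device", fake_resolve_device):
        try:
            return pkg.resolve_check_device_ep("qnn")
        except RuntimeError:
            return "real check ran"


patched_reexport = outcome(pkg)    # only rebinds the package attribute
patched_source = outcome(device)   # rebinds the name the checker looks up
```

Patching the re-export leaves the checker's own global untouched, so the real check still runs; patching the defining module is what takes effect.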

…osoft/winml-cli into yuesu/refactor_e2e_pipeline

…tion

TestPerfHuggingFace.test_benchmark_ep_gpu and test_benchmark_ep_npu now pass require_utilization=False, matching the existing exemption used by test_benchmark_gpu_monitor, test_benchmark_npu_monitor, and test_benchmark_ep_device_{gpu,npu}. PDH GPU/NPU engine counters are not bumped reliably by every EP for short runs (e.g. OpenVINO on Intel iGPU routes compute via its own path, bypassing DXGI command queues that PDH samples). The structural checks (section present, device_kind, adapter_luid) still run; only the strict mean_pct > 0 check is dropped.
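As a rough picture of what the require_utilization switch changes, the sketch below keeps the structural checks unconditional and makes only the utilization check optional. The helper name `check_monitor_section` and the dictionary layout are hypothetical; only the field names device_kind, adapter_luid, and mean_pct come from the commit message.

```python
from typing import Optional


def check_monitor_section(section: Optional[dict], require_utilization: bool = True) -> None:
    """Validate a monitor section; utilization is only enforced on request."""
    # Structural checks always run.
    assert section is not None, "monitor section missing"
    assert "device_kind" in section
    assert "adapter_luid" in section
    # Strict check: skipped when counters are known to be unreliable.
    if require_utilization:
        assert section["mean_pct"] > 0, "no utilization recorded"


# Hypothetical result from a short run where the counters never moved.
idle_section = {"device_kind": "gpu", "adapter_luid": "0x0", "mean_pct": 0.0}

check_monitor_section(idle_section, require_utilization=False)  # passes

try:
    check_monitor_section(idle_section, require_utilization=True)
    strict_check_failed = False
except AssertionError:
    strict_check_failed = True
```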

Comment out cross-encoder/nli-deberta-v3-small (zero-shot-classification) and facebook/detr-resnet-50 (object-detection): neither passes on every supported EP yet, so they are out of scope for this round of the Modelkit E2E Test pipeline. Also comment out the entire P1/P2 block (depth-estimation, image-feature-extraction, question-answering, sentence-similarity, text-classification, token-classification) — not in scope for this round; re-enable as coverage expands.

…osoft/winml-cli into yuesu/refactor_e2e_pipeline

KayMKM marked this pull request as ready for review May 26, 2026 09:39

KayMKM requested a review from a team as a code owner May 26, 2026 09:39

KayMKM marked this pull request as draft May 26, 2026 09:40

KayMKM added 4 commits May 27, 2026 14:48

Merge branch 'main' into yuesu/refactor_e2e_pipeline

d1a2c05

Merge branch 'main' into yuesu/refactor_e2e_pipeline

c70fc07

DingmaomaoBJTU reviewed May 29, 2026

Comment thread .pipelines/templates/e2e-test-jobs.yml

Comment thread .pipelines/Modelkit E2E Test.yml

Comment thread .pipelines/templates/e2e-test-jobs.yml

KayMKM added 2 commits June 1, 2026 13:40

Merge remote-tracking branch 'origin/main' into yuesu/refactor_e2e_pipeline

7c2fdb8

fix ep error

4b13191

zhenchaoni approved these changes Jun 1, 2026

KayMKM changed the title ~~ci: split eval-report pipeline from PR-gating e2e test~~ ci: split scheduled pipelines into weekly Eval Report and daily E2E Test Jun 1, 2026

KayMKM added 7 commits June 1, 2026 14:54

add comments

cd66089

update model list

f802af7

add clean cache

ed10e90

fix

b69b270

fix

6c83421

Merge remote-tracking branch 'origin/main' into yuesu/refactor_e2e_pipeline

454890a

KayMKM commented Jun 2, 2026

Comment thread .pipelines/Modelkit E2E Test.yml

KayMKM marked this pull request as ready for review June 2, 2026 07:51

Merge branch 'main' into yuesu/refactor_e2e_pipeline

ba39aed