ci: split scheduled pipelines into weekly Eval Report and daily E2E Test #756
Merged
Refactor the previous 'Modelkit E2E Test' pipeline (which actually runs the full model registry and produces reports) into two pipelines with distinct purposes:
Renamed (no behavior change):
- 'Modelkit E2E Test.yml' -> 'Modelkit Eval Report.yml'
- 'templates/e2e-eval-jobs.yml' -> 'templates/eval-report-jobs.yml'
- Stage displayNames: 'E2E Eval -- {QNN,OV,AMD}' -> 'Eval Report -- ...'
New (PR-gating e2e test):
- 'Modelkit E2E Test.yml': pr trigger on main with drafts:false; three parallel stages (QNN/OV/AMD) running an inline 'models' parameter (prototype: facebook/convnext-tiny-224) across every EP/device pair on each agent.
- 'templates/e2e-test-jobs.yml': single job per agent; reuses the eval-report env setup (parquet copy, uv venv, PipAuthenticate, install -e .[dev]); one 'winml perf' step per (model x pair) with condition: always() so all combinations run and the job fails on any non-zero exit. No matrix sharding, --list-json, --continue, --retry-failed, or report generation.
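The "run every combination, fail on any non-zero exit" behaviour that `condition: always()` gives the job can be sketched in Python roughly as follows. This is only an illustration of the control flow, not the pipeline's implementation: `run_all_combinations` and the `run_step` callback are hypothetical names, and the callback stands in for whatever actually invokes `winml perf`.

```python
from itertools import product
from typing import Callable, Iterable


def run_all_combinations(
    models: Iterable[str],
    pairs: Iterable[tuple[str, str]],
    run_step: Callable[[str, str, str], int],
) -> list[tuple[str, str, str, int]]:
    """Run one step per (model, EP/device pair) and never stop early.

    `run_step` is a hypothetical callback that returns a process exit code.
    The returned list holds every failing combination; a non-empty list
    corresponds to a failed job.
    """
    failures = []
    for model, (ep, device) in product(models, pairs):
        exit_code = run_step(model, ep, device)
        if exit_code != 0:
            # Record the failure but keep going, like condition: always().
            failures.append((model, ep, device, exit_code))
    return failures


def fake_step(model: str, ep: str, device: str) -> int:
    """Stand-in runner: pretend every qnn combination fails."""
    return 1 if ep == "qnn" else 0


failed = run_all_combinations(
    ["model-a", "model-b"], [("qnn", "npu"), ("cpu", "cpu")], fake_step
)
job_failed = bool(failed)
```

All four combinations run even though the first one fails, and the job result is derived from the collected failures at the end.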
Portal actions still required (not YAML-controllable):
- Repoint existing pipeline definition to 'Modelkit Eval Report.yml'.
- Create new pipeline definition for the new 'Modelkit E2E Test.yml'.
- Enable 'Automatically cancel existing validation builds for previous iterations of a pull request' on the new pipeline.
- During the prototype phase, do NOT add the new pipeline as a required branch-policy check on main -- failures show red on PR but do not block merge.
Convert the new "Modelkit E2E Test" pipeline from a PR gate to a daily
scheduled run, and broaden its scope from winml perf only to winml perf
plus a configurable list of pytest e2e suites.
Pipeline (.pipelines/Modelkit E2E Test.yml):
- Drop the `pr:` trigger.
- Add daily schedule cron '0 16 * * *' (16:00 UTC = 00:00 Beijing time;
branch main, always: true), staggered 8h from the weekly Eval Report cron.
- Add `runEval` (boolean, default true) so the winml perf phase can be
toggled off from the queue UI.
- Add `pytestTargets` (object, default = all 11 e2e files: analyze,
inspect, build, compile, config, export, optimize, quantize, sys,
perf, eval). Edit at queue time to do a minimal run; empty list
skips the pytest phase.
- Add `pytestTimeout` (number, default 1000) forwarded to pytest
--timeout.
- All 3 stages (QNN/OV/AMD) forward the new params into the template.
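As a quick check of the schedule arithmetic above: Azure Pipelines cron expressions are evaluated in UTC, so hour 16 lands on midnight of the following day in UTC+8. A small sketch, where the helper name and the reference date are arbitrary assumptions:

```python
from datetime import datetime, timedelta, timezone

# Fixed UTC+8 offset (Beijing time has no daylight saving).
UTC_PLUS_8 = timezone(timedelta(hours=8))


def local_fire_time(hour_utc, minute_utc=0, tz=UTC_PLUS_8):
    """Convert a cron hour/minute given in UTC to the target timezone."""
    # Arbitrary reference date; only the clock arithmetic matters.
    fire_utc = datetime(2026, 6, 1, hour_utc, minute_utc, tzinfo=timezone.utc)
    return fire_utc.astimezone(tz)


fire = local_fire_time(16)
print(fire.isoformat())  # 2026-06-02T00:00:00+08:00
```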
Template (.pipelines/templates/e2e-test-jobs.yml):
- Bump `timeoutInMinutes` 60 -> 360 to accommodate both phases.
- Wrap the existing per-(model x pair) winml perf loop in
`${{ if eq(parameters.runEval, true) }}`.
- Replace per-model failure log prefix "E2E test" with "Eval" to
disambiguate from pytest e2e steps.
- Add a `${{ each target in parameters.pytestTargets }}` loop that
runs `uv run --no-sync python -m pytest tests/e2e/test_<name>_e2e.py
-m e2e --timeout=<pytestTimeout> --junitxml=...` with
`condition: always()`. Tests use `require_ep()` to self-skip on
irrelevant EPs, so it is safe to run all of them on every agent.
- Append a `PublishTestResults@2` task (`condition: always()`, JUnit
format, `mergeTestResults: true`, `failTaskOnFailedTests: false`)
so junit XMLs surface in the ADO Tests tab without becoming a
second source of failure on top of the pytest step itself.
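The `${{ each target in parameters.pytestTargets }}` expansion can be pictured as the following Python sketch, which builds one pytest command line per target. The function name, the `junit_dir` argument and the per-target XML file name are assumptions for illustration; the commit message elides the actual `--junitxml` path.

```python
def build_pytest_commands(targets, timeout, junit_dir):
    """Return one argv list per e2e target; an empty list yields no steps."""
    commands = []
    for name in targets:
        commands.append([
            "uv", "run", "--no-sync", "python", "-m", "pytest",
            f"tests/e2e/test_{name}_e2e.py",
            "-m", "e2e",
            f"--timeout={timeout}",
            # Hypothetical output location; the real path is not shown in the PR.
            f"--junitxml={junit_dir}/{name}.xml",
        ])
    return commands


minimal_run = build_pytest_commands(["perf", "eval"], 1000, "junit")
skipped_run = build_pytest_commands([], 1000, "junit")
```

Passing a shortened list mirrors editing `pytestTargets` at queue time, and the empty list mirrors skipping the pytest phase entirely.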
Replace the single facebook/convnext-tiny-224 seed with 13 curated (model, task) entries covering the 13 tasks in hub_models.json.
Selection rules:
- optimum_supported == true (run_eval needs ORT export)
- P0 priority preferred; P1/P2 used to fill tasks where no P0 exists
- Within a task, prefer smaller / canonical / well-downloaded models
- Avoid niche or personal fine-tunes
Final 13 rows (image-classification keeps the original convnext-tiny seed):
  image-classification             facebook/convnext-tiny-224
  feature-extraction               openai/clip-vit-base-patch32
  zero-shot-classification         openai/clip-vit-base-patch32
  zero-shot-image-classification   openai/clip-vit-base-patch32
  object-detection                 facebook/detr-resnet-50
  fill-mask                        google-bert/bert-base-multilingual-cased
  masked-lm                        google-bert/bert-base-multilingual-cased
  depth-estimation                 Intel/dpt-hybrid-midas
  image-feature-extraction         facebook/dinov2-small
  question-answering               deepset/roberta-base-squad2
  sentence-similarity              sentence-transformers/all-MiniLM-L6-v2
  text-classification              cross-encoder/ms-marco-MiniLM-L4-v2
  token-classification             dslim/bert-base-NER
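The selection rules could be approximated in code along these lines. This is a sketch under assumptions: the entry keys `task`, `model`, `priority` and `downloads` are guesses at the shape of hub_models.json (only `optimum_supported` is named above), "well-downloaded" is reduced to a single download count, and the "canonical" and "avoid personal fine-tunes" rules are left to manual curation.

```python
PRIORITY_RANK = {"P0": 0, "P1": 1, "P2": 2}


def pick_seed_models(entries):
    """Pick one model per task: exportable, best priority, most downloads."""
    best = {}
    for entry in entries:
        if not entry.get("optimum_supported"):
            continue  # run_eval needs ORT export, so skip unsupported models
        # Lower tuple sorts first: priority rank, then higher download count.
        key = (PRIORITY_RANK.get(entry["priority"], 99), -entry.get("downloads", 0))
        task = entry["task"]
        if task not in best or key < best[task][0]:
            best[task] = (key, entry["model"])
    return {task: model for task, (_, model) in best.items()}


# Made-up entries purely to exercise the rules; not real registry data.
sample = [
    {"task": "fill-mask", "model": "org/a", "priority": "P1", "optimum_supported": True, "downloads": 10},
    {"task": "fill-mask", "model": "org/b", "priority": "P0", "optimum_supported": False, "downloads": 99},
    {"task": "fill-mask", "model": "org/c", "priority": "P1", "optimum_supported": True, "downloads": 50},
    {"task": "object-detection", "model": "org/d", "priority": "P0", "optimum_supported": True, "downloads": 1},
    {"task": "object-detection", "model": "org/e", "priority": "P2", "optimum_supported": True, "downloads": 1000},
]
seeds = pick_seed_models(sample)
```

In the sample, `org/b` is dropped despite being P0 because it cannot be exported, so the fill-mask slot falls back to the better-downloaded P1 model.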
Collaborator
DingmaomaoBJTU
left a comment
Nice refactoring — splitting scheduled eval-report from a fast PR-gating e2e test is a clear win. Well-documented YAML. A few suggestions below.
zhenchaoni
approved these changes
Jun 1, 2026
The autouse fixture in test_config_e2e.py patched winml.modelkit.sysinfo.resolve_device (re-export), but resolve_check_device_ep in sysinfo/device.py calls its module-local resolve_device, so the mock was a no-op on that path. EPs not installed on the host (qnn, vitisai, migraphx, nv_tensorrt_rtx) hit the real availability check and failed. Patch winml.modelkit.sysinfo.device.resolve_device as well, and make the mock EP-aware by returning the EP's supported devices from EP_SUPPORTED_DEVICES.
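The underlying rule is the usual `unittest.mock` one: patch the name in the module where it is looked up, not where it is re-exported. The sketch below reproduces the situation with toy stand-in modules; `toy_sysinfo`, the helper function names and the contents of `EP_SUPPORTED_DEVICES` are illustrative assumptions, not the real winml code.

```python
import types
from unittest import mock

# Illustrative mapping only; the real EP_SUPPORTED_DEVICES contents are not shown in this PR.
EP_SUPPORTED_DEVICES = {"qnn": ["npu"], "openvino": ["cpu", "gpu", "npu"]}

# Toy stand-in for sysinfo/device.py: resolve_check_device_ep calls the
# module-local resolve_device, which fails when the EP is not installed.
device = types.ModuleType("toy_sysinfo.device")
exec(
    "def resolve_device(ep):\n"
    "    raise RuntimeError(f'{ep} is not installed on this host')\n"
    "\n"
    "def resolve_check_device_ep(ep):\n"
    "    return resolve_device(ep)\n",
    device.__dict__,
)

# Toy stand-in for the package __init__ that re-exports resolve_device.
package = types.ModuleType("toy_sysinfo")
package.resolve_device = device.resolve_device


def fake_resolve_device(ep):
    """EP-aware mock: return the devices the EP supports."""
    return EP_SUPPORTED_DEVICES[ep]


def reexport_patch_is_noop(ep="qnn"):
    """Patch only the re-export and report whether the real check still ran."""
    with mock.patch.object(package, "resolve_device", fake_resolve_device):
        try:
            device.resolve_check_device_ep(ep)
        except RuntimeError:
            return True
    return False


def module_local_patch_result(ep="qnn"):
    """Patch the name in the module where it is actually looked up."""
    with mock.patch.object(device, "resolve_device", fake_resolve_device):
        return device.resolve_check_device_ep(ep)
```

Here `reexport_patch_is_noop()` returns `True` because the real function still runs, while `module_local_patch_result("qnn")` returns the mocked device list.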
KayMKM
commented
Jun 2, 2026
…osoft/winml-cli into yuesu/refactor_e2e_pipeline
…tion
TestPerfHuggingFace.test_benchmark_ep_gpu and test_benchmark_ep_npu now pass require_utilization=False, matching the existing exemption used by test_benchmark_gpu_monitor, test_benchmark_npu_monitor, and test_benchmark_ep_device_{gpu,npu}.
PDH GPU/NPU engine counters are not bumped reliably by every EP for short runs (e.g. OpenVINO on Intel iGPU routes compute via its own path, bypassing DXGI command queues that PDH samples). The structural checks (section present, device_kind, adapter_luid) still run; only the strict mean_pct > 0 check is dropped.
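The effect of `require_utilization=False` can be sketched as a helper that always runs the structural checks and only applies the strict utilization check on request. The helper name and the dictionary layout of the report are assumptions; only the field names `device_kind`, `adapter_luid` and `mean_pct` come from the description above.

```python
def check_monitor_section(report, section, device_kind, require_utilization=True):
    """Validate one monitor section of a benchmark report (hypothetical layout)."""
    # Structural checks always run.
    assert section in report, f"missing section: {section}"
    data = report[section]
    assert data["device_kind"] == device_kind
    assert data.get("adapter_luid"), "adapter_luid should be populated"
    # Strict check is optional: short runs may not bump the PDH counters.
    if require_utilization:
        assert data["mean_pct"] > 0, "expected non-zero utilization"


# Made-up report in which the counters never moved.
idle_report = {"gpu": {"device_kind": "gpu", "adapter_luid": "0x1", "mean_pct": 0.0}}

check_monitor_section(idle_report, "gpu", "gpu", require_utilization=False)  # passes

try:
    check_monitor_section(idle_report, "gpu", "gpu")
    strict_failed = False
except AssertionError:
    strict_failed = True
```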
Comment out cross-encoder/nli-deberta-v3-small (zero-shot-classification) and facebook/detr-resnet-50 (object-detection): neither passes on every supported EP yet, so they are out of scope for this round of the Modelkit E2E Test pipeline. Also comment out the entire P1/P2 block (depth-estimation, image-feature-extraction, question-answering, sentence-similarity, text-classification, token-classification) — not in scope for this round; re-enable as coverage expands.
…osoft/winml-cli into yuesu/refactor_e2e_pipeline
zhenchaoni
approved these changes
Jun 4, 2026
Summary
Refactor the previous Modelkit E2E Test pipeline (which actually runs the full model registry and produces a markdown report) into two distinct pipelines with different cadences and scopes.
Renamed (no behaviour change)
- `Modelkit E2E Test.yml` → `Modelkit Eval Report.yml`
- `templates/e2e-eval-jobs.yml` → `templates/eval-report-jobs.yml`
- Stage displayNames: `E2E Eval -- {QNN, OV, AMD}` → `Eval Report -- …`
- Keeps `--list-json`, `--continue`, `--retry-failed`, and report generation.
New: `Modelkit E2E Test.yml` (daily scheduled)
- Schedule: `0 16 * * *` UTC = 00:00 UTC+8 every day, staggered 8 h away from the weekly Eval Report cron.
- Three parallel stages (QNN / OV / AMD), each running on its dedicated self-hosted agent.
- Two phases per stage, both gated by queue-time parameters so a one-off run can be trimmed easily:
  - `winml perf` phase: runs `winml perf` once per (model × EP/device pair) against an inline `models` parameter. Default list covers one small representative model per supported task (P0 first, P1/P2 filling the remainder).
  - pytest phase: runs the `tests/e2e/test_<name>_e2e.py` suites (default: all 11). Tests use `require_ep()` to self-skip when the target EP is absent, so the same list is safe to run on all three agents.
- Each `winml perf` step uses `condition: always()` so every combination runs and the stage fails on any non-zero exit. No matrix sharding, no report generation.
- Reuses the eval-report setup helpers (parquet copy, `uv` venv, `PipAuthenticate`, `pip install -e .[dev]`).
Why not PR-gating?
E2E runs on self-hosted hardware are too long and too flaky (driver / firmware variance) to gate every PR. The daily cadence keeps regressions surfaced within ~24 h without blocking developer throughput. Per-PR validation continues to rely on the existing unit / integration suites.
Portal actions (not YAML-controllable)
- Repoint the existing pipeline definition to `Modelkit Eval Report.yml`.
- Create a new pipeline definition for `Modelkit E2E Test.yml`.
- Do not add the new pipeline as a required branch-policy check on `main`; it is informational only.
Files
- `.pipelines/Modelkit Eval Report.yml` (renamed)
- `.pipelines/templates/eval-report-jobs.yml` (renamed)
- `.pipelines/Modelkit E2E Test.yml` (new, daily)
- `.pipelines/templates/e2e-test-jobs.yml` (new)