
fix(perf): unify HF and ONNX paths through PerfBenchmark #659

Open
xieofxie wants to merge 8 commits into main from hualxie/unify_perf


@xieofxie
Contributor

Summary

Fixes #596.

  • winml perf -m hf/model and winml perf -m model.onnx previously ran two completely different pipelines: HF went through the full AOT build (export → optimize → quantize → compile) via PerfBenchmark, while .onnx files bypassed the pipeline entirely and ran a raw ORT JIT load through _run_onnx_benchmark. Same user-facing command, non-comparable numbers, and several flags (--no-quantize, --rebuild, --ignore-cache, --precision) silently no-oped on the ONNX path.
  • Both inputs now flow through PerfBenchmark, which dispatches to WinMLAutoModel.from_pretrained or .from_onnx. The is_onnx branch in _load_model (previously dead code) is now the live entry point, so an .onnx file runs optimize → [quantize] → [compile] like the HF flow minus export.
  • Delete _run_onnx_benchmark and the duplicate hardware-monitor / stats-collection logic it carried. The CLI keeps is_onnx only for the file-exists check, the --shape-config warning (shapes are baked into pre-exported ONNX), and feeding --op-tracing the raw input path.
  • Refresh docstrings on the perf command, PerfBenchmark._load_model, and the loop helpers to drop stale references; update the CLI test to assert ONNX inputs route through PerfBenchmark.run.
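
The dispatch described above can be sketched roughly as follows. This is a minimal illustration, not the PR's code: `WinMLAutoModelStub`, `is_onnx_input`, and `load_model` are hypothetical stand-ins, and detecting ONNX inputs by file suffix is an assumption.

```python
from pathlib import Path


class WinMLAutoModelStub:
    """Hypothetical stand-in for WinMLAutoModel; it only records which loader ran."""

    def __init__(self, source: str, loader: str) -> None:
        self.source = source
        self.loader = loader

    @classmethod
    def from_pretrained(cls, model_id: str) -> "WinMLAutoModelStub":
        # In the real flow this would cover export -> optimize -> [quantize] -> [compile].
        return cls(model_id, "from_pretrained")

    @classmethod
    def from_onnx(cls, path: str) -> "WinMLAutoModelStub":
        # Same flow minus export, since the graph is already exported.
        return cls(path, "from_onnx")


def is_onnx_input(model: str) -> bool:
    # Assumption: an ONNX input is recognised purely by its ".onnx" suffix.
    return Path(model).suffix.lower() == ".onnx"


def load_model(model: str) -> WinMLAutoModelStub:
    """Single entry point: choose the loader, then share everything downstream."""
    if is_onnx_input(model):
        return WinMLAutoModelStub.from_onnx(model)
    return WinMLAutoModelStub.from_pretrained(model)
```

With one entry point, flags handled after loading apply to both input kinds instead of being skipped on a separate ONNX path.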

`winml perf -m hf/model` and `winml perf -m model.onnx` previously ran
two completely different pipelines: HF went through the full AOT build
(export -> optimize -> quantize -> compile) via PerfBenchmark, while
.onnx files bypassed the pipeline entirely and ran a raw ORT JIT load
through _run_onnx_benchmark. Same user-facing command, different code
path, non-comparable numbers, and several CLI flags (--no-quantize,
--rebuild, --ignore-cache, --precision) silently no-oped on the ONNX
path.

Both paths now flow through PerfBenchmark, which dispatches to
WinMLAutoModel.from_pretrained or .from_onnx based on the input. The
ONNX branch in _load_model (previously dead code) is now the live entry
point, so an .onnx file goes through optimize -> [quantize] -> [compile]
just like the HF flow, minus the export stage.
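
As a rough sketch of the stage ordering described above, reading the bracketed stages as optional: `pipeline_stages` and its parameters are hypothetical names for illustration, and mapping `--no-quantize` to `quantize=False` is an assumption.

```python
def pipeline_stages(is_onnx: bool, quantize: bool = True, compile_model: bool = True) -> list[str]:
    """Return the build stages for an input, in execution order."""
    # A pre-exported ONNX file skips export; an HF model ID starts with it.
    stages = [] if is_onnx else ["export"]
    stages.append("optimize")
    if quantize:
        stages.append("quantize")
    if compile_model:
        stages.append("compile")
    return stages
```

For example, `pipeline_stages(is_onnx=True, quantize=False)` gives `["optimize", "compile"]`, while the HF default gives all four stages.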

- Delete _run_onnx_benchmark and its private helpers' stale references.
- Drop the is_onnx dispatcher branch in the CLI; keep is_onnx only for
  the file-exists check, the --shape-config warning (shapes are baked
  into a pre-exported ONNX), and feeding --op-tracing the raw input.
- Refresh docstrings on the perf command and PerfBenchmark._load_model.
- Update the CLI test to assert ONNX inputs route through
  PerfBenchmark.run; refresh e2e docstrings.
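
A test asserting that ONNX inputs reach `PerfBenchmark.run` might look something like the sketch below. The `PerfBenchmark` class and `perf_command` function here are simplified hypothetical stand-ins so the example is self-contained; the real CLI test in tests/unit/commands/test_perf_cli.py may be structured differently.

```python
from unittest import mock


class PerfBenchmark:
    """Hypothetical stand-in for the real benchmark class."""

    def __init__(self, model: str) -> None:
        self.model = model

    def run(self) -> dict:
        raise RuntimeError("the real benchmark should not run in a unit test")


def perf_command(model: str) -> dict:
    # Hypothetical CLI body: every input kind goes through PerfBenchmark.run.
    return PerfBenchmark(model).run()


def check_onnx_routes_through_run() -> dict:
    # Patch run on the class so no model is loaded or benchmarked.
    with mock.patch.object(PerfBenchmark, "run", return_value={"mean_ms": 1.0}) as fake_run:
        result = perf_command("model.onnx")
    # The mock replaces the method on the class, so it is called without arguments.
    fake_run.assert_called_once_with()
    return result
```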
xieofxie requested a review from a team as a code owner May 19, 2026 03:10
@xieofxie
Contributor Author

could wait for perf e2e

@xieofxie
Contributor Author

> could wait for perf e2e

Done, and tested on QNN.

Collaborator

@DingmaomaoBJTU left a comment


Clean, well-motivated change. Unifying both paths through PerfBenchmark removes ~100 lines of duplicated benchmark logic and makes latency numbers directly comparable. Solid improvement.

A few inline comments below, mostly nits and one suggestion for robustness.

Comment thread src/winml/modelkit/commands/perf.py
Comment thread src/winml/modelkit/commands/perf.py
Comment thread tests/unit/commands/test_perf_cli.py
@xieofxie
Contributor Author

--ep cpu still needs a fix.

@xieofxie
Contributor Author

WinMLAutoModel.from_onnx has at least two issues:

  • When perf runs with --device gpu and no --ep, it analyzes on all EPs but benchmarks on only one.
  • For the same model path, if the first run uses --device cpu, it caches a CPU model (OpenVINO, for example). A later run with --device gpu still loads that CPU cache, which then cannot run (on DML, for example).
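
One possible direction for the second issue is to make the cache key depend on the target device and EP as well as the model path. The sketch below is only an illustration under that assumption: `cache_key` is a hypothetical helper, and how `WinMLAutoModel.from_onnx` actually keys its cache is not shown in this thread.

```python
import hashlib
from typing import Optional


def cache_key(model_path: str, device: str, ep: Optional[str] = None) -> str:
    """Derive a cache key that changes when the device or EP changes."""
    # "auto" marks the case where no EP was requested explicitly (an assumption).
    parts = [model_path, device.lower(), (ep or "auto").lower()]
    digest = hashlib.sha256("\0".join(parts).encode("utf-8")).hexdigest()
    return digest[:16]
```

With a key like this, a run with --device gpu would miss the artifact cached by an earlier --device cpu run for the same path and rebuild it, rather than loading a model it cannot execute.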

Development

Successfully merging this pull request may close these issues.

perf: HF model ID uses AOT build pipeline, ONNX file uses raw JIT — inconsistent results and divergent code paths
