refactor(session): unify op-tracing under EPMonitor; fix WinML EP by tezheng · Pull Request #397 · microsoft/winml-cli

tezheng · 2026-04-27T04:37:31Z

Summary

Fixes the broken --op-tracing path under onnxruntime-windowsml. Previously QNNProfiler created its own ort.InferenceSession via the explicit-providers API, which only finds the QNN DLL when bundled by onnxruntime-qnn. Under onnxruntime-windowsml (where the DLL is registered via WinML, not bundled) the profiler silently fell back to CPU and produced no profiling data.
Collapses two parallel per-EP hierarchies into one. EPMonitor ABC gains two optional hooks (get_session_options, get_provider_options); QNNMonitor becomes the real implementation; the optracing/ package is deleted entirely. WinMLSession is now the sole owner of ORT session creation; monitors plug in via hooks, never own sessions.
WinMLSession.perf() accepts monitor=EPMonitor and yields a PerfContext(stats, monitor). The CLI benchmark loop and standalone scripts share one primitive.

Design

Full design at docs/design/session/monitor/:

1_prd.md v2.2 — requirements (SC-1..6, FR-1..12, NFR-1..7, C-1..6, R-1..6 + mitigations)
2_coreloop.md v2.2 — module structure, API design, lifecycle walkthroughs
iterations/01.md–11.md — brainstorming trail

Architectural pattern: Hook-based Plugin + Template Method + Observer.

Load-bearing invariants:

Singular monitor= per perf() call (no multi-monitor)
Monitor factory is explicit if/elif dispatch in commands/perf.py — no registry, no plugin loader
Teardown ordering: session reset → gc.collect() → monitor.__exit__ (QNN flushes the profiling CSV only on ort.InferenceSession destruction; reversing this produces an empty CSV)
OpTraceResult.to_dict() nested schema (metadata/summary/operators/statistics/artifacts) preserved exactly; status and error added as additive top-level keys (downstream report consumers unchanged)
EPMonitor.__exit__ MUST NOT suppress caller exceptions; auto-reset of a compiled session emits a WARNING (no silent mutation)

Migration footprint

Path	Action
`src/winml/modelkit/optracing/`	DELETED entirely
`optracing/result.py` (`OpTraceResult`, `OperatorMetrics`)	→ `session/monitor/op_metrics.py`
`optracing/report.py`	→ `session/monitor/report.py`
`optracing/qnn/{csv_parser,qhas_parser,viewer}.py`	→ `session/monitor/qnn/`
`WinMLSession._init_winml_eps_once`	extracted to module-level `ep_registry.ensure_initialized()`
`tests/unit/optracing/` (incl. `fixtures/`)	→ `tests/unit/session/monitor/`

Branch composition (notes for reviewers)

The branch contains 34 commits, structured in three layers:

1bea4cf8 (1 commit) — pre-existing cherry-pick from feat/mvp-v2. Provides scaffolding the op-tracing work plugs into: _run_simple_loop / _run_monitored_loop / _print_model_info helpers in commands/perf.py, _suppress_native_output ctx mgr in session.py, and the live_chart → _live_chart rename. Op-tracing commits are hard-dependent on these.
fbe0da64..3adee047 (19 commits) — the op-tracing refactor proper: design docs land, helpers relocate, hooks added, QNNMonitor rewritten from placeholder, optracing/ package demolished, stale CI matrix entry removed.
356ba5b4..a0003e46 (14 commits) — review-driven fixes addressing 6 critical and many important findings from a multi-agent code review:
- 356ba5b4 drop hardcoded SDK paths; require QNN_SDK_ROOT (Cardinal Rule 1)
- e20f0dea __init_subclass__ guard against requires_session_teardown shadowing
- 65544029 ensure session/monitor/__init__.py exists
- 43d4a7e9 OpTraceResult.status as Literal[TraceStatus] (closed set)
- a44613fe unify to_dict() schema between pre/post __exit__
- 754fa02b mtime-gate _find_schematic CWD fallback (rejects stale schematics in CI)
- af48c886 document tempdir no-cleanup contract; expose output_dir property
- d8160fe8 surface WinML EP probe failures at WARNING (NFR-2)
- 0c0d1257 ep_registry: surface registration failures with EP name + exc class
- 3e5c5b5b match NFR-3 auto-reset log wording verbatim
- 46d767e8 auto-infer QNN monitor from --device npu; case-insensitive --ep
- d9d798bc route op-tracing through monitored path; hard-fail on unsupported inputs (--op-tracing with ONNX file → click.UsageError; status="no_data"/"parse_failed" → sys.exit(4))
- c5cbee98 close test coverage gaps from review
- a0003e46 add hardware-gated CLI E2E for SC-1

Reviewers focused on the op-tracing delta should read the 33 commits on top of 1bea4cf8; the first commit is scaffolding from a separate effort.

Verification

uv run pytest tests/unit/session/ tests/unit/session/monitor/ tests/unit/commands/: 433 passed, 7 skipped (1 hardware-gated SC-1 E2E + 6 pre-existing skips)
uv run pytest tests/unit/: 3877 passed, 74 skipped, 2 xfailed — no regressions vs main
uv run ruff check: clean across all touched files
Cardinal Rule 1 audit (grep -r "D:\QC\|C:\Qualcomm" src/): zero matches

Notable behavior changes

onnxruntime-windowsml users can now run wmk perf --device npu --op-tracing basic directly — no --ep qnn needed (auto-inferred from device).
ONNX direct file inputs (-m foo.onnx --op-tracing ...) raise click.UsageError at parse time. HF model IDs and built model directories are supported. Direct-ONNX op-tracing is a future PR.
Op-tracing failures are now hard errors (exit 4 with descriptive message) rather than soft yellow warnings. status="basic_fallback" (detail mode degrading to basic CSV when QNN SDK absent) is the only acceptable degraded-success.
Detail mode (--op-tracing detail) requires QNN_SDK_ROOT env var pointing to a QAIRT SDK. Without it, degrades cleanly to status="basic_fallback" with a yellow notice.

Out of scope

DML / OpenVINO / TensorRT EPMonitor implementations (this PR reshapes the base class and reworks QNN only)
Multiple simultaneous EP monitors per session
--op-tracing for direct ONNX file inputs

Pre-existing failure (not introduced)

tests/regression/test_design_gaps.py::TestB4ConfigModelTypeNoTask::test_model_type_bert_no_task_succeeds fails on main and on this branch identically — unrelated to this PR.

Cherry-picked from feat/mvp-v2 (11 commits squashed): - Lazy imports: _LAZY_IMPORTS dict pattern across 7 modules, 0.13s CLI startup - LazyGroup CLI: filesystem scan + AST help parsing, no module imports - Rich Live analyze: per-node animated stacked bars with callbacks - Rich config output: CONFIG GENERATION header with I/O specs - Build StageLive: per-stage progress with EP analyzer bars - Inspect redesign: Rich formatted _inspect_model_v2 - Perf ONNX direct path: bypasses HF build for .onnx files - Config registry short-circuit: skips Optimum when input_tensors registered - Circular import fix (onnx <-> compiler) via lazy detection - CodeQL alerts resolved, test adaptations, README rewrite

Breaks reverse-coupling for QNNMonitor.is_available() by providing a module-level entry point that wraps the WinMLEPRegistry singleton. Idempotent; safe to call multiple times.

Adds three optional members with safe defaults: - requires_session_teardown: ClassVar[bool] = False - get_session_options() -> {} - get_provider_options() -> {} Existing subclasses (VitisAI, OpenVINO, QNN placeholder, NullEPMonitor) inherit defaults unchanged.

Move dataclasses from optracing/result.py. Additive changes: - model: str -> str | None (allows None for standalone profiling) - New fields: status (default 'ok'), error (default None) - to_dict() preserves nested schema; adds top-level status/error keys Old import path retained as re-export shim; removed in later task.

Moves display_op_trace_report + write_op_trace_json. Old path retained as re-export shim. Existing test file moved with git mv (history preserved).

Moves csv_parser.py, qhas_parser.py, viewer.py + test fixtures from optracing/qnn/ to session/monitor/qnn/, the canonical home for QNNMonitor support code. Old paths retained as backward-compat shims. Also updates test_integration.py fixture path and test_qnn_profiler.py import + patch targets to reference the new module location. Constraint: Old import paths must remain importable for any code already referencing optracing.qnn.* Scope-risk: narrow Confidence: high

Frozen dataclass aggregating PerfStats + EPMonitor. Prep for session.perf(monitor=...) signature change in a later task.

Infrastructure for session.perf(monitor=...) to contribute session-level config entries via add_session_config_entry. Empty by default; populated transiently during perf() context (wired in a later task).

- perf(warmup, monitor=None) yields PerfContext(stats, monitor) - Auto-reset on option conflict (WARNING log) - Teardown ordering: reset -> gc.collect -> monitor.__exit__ -> restore - Exception transparency via sys.exc_info() - Nested perf() raises RuntimeError - Existing callers migrated to ctx.stats Constraint: Teardown order is load-bearing - reset must fire before monitor.__exit__ so QNN CSV flushes before parsing. Confidence: high Scope-risk: moderate (touches benchmark hot path)

Previously perf() mutated _perf_stats and _provider_options BEFORE calling mon.__enter__(). In a @contextmanager generator, code before 'yield' runs outside any try/finally, so an __enter__ exception propagated without restoring state — leaving the session stuck (nested-perf error on every subsequent perf() call). Fix: move mutations into the try block; track mon_entered; skip mon.__exit__ if __enter__ never succeeded; skip requires_session_teardown reset if no data was produced. Directive: setup mutations MUST remain inside the try block; do not refactor back to pre-yield mutations without a reentrant state manager. Confidence: high Scope-risk: narrow (one method, preserved spec-compliant teardown ordering)

- get_session_options() contributes disable_cpu_ep_fallback, ep.context_* - get_provider_options() contributes backend_path + profiling_level (user overrides honored for non-profiling keys; profiling_level and profiling_file_path are owner-enforced via explicit post-merge set) - is_available() works with both onnxruntime-qnn AND onnxruntime-windowsml (latter via ep_registry.ensure_initialized + get_ep_devices) - __exit__ parses CSV -> OpTraceResult; retries once on Windows file-handle lag - detail mode: runs QHAS viewer if SDK available; falls back to basic (status='basic_fallback') with WARNING log - No os.chdir (glob fallback to locate schematic.bin) - Double __enter__ raises RuntimeError Legacy placeholder tests in tests/unit/session/test_ep_monitor.py are removed (class TestQNNMonitor); their replacement lives at tests/unit/session/monitor/test_qnn_monitor.py (20 tests). Constraint: profiling_level and profiling_file_path MUST NOT be user- overridable (C-3 in PRD) -- enforced via pop-then-set, not dict literal duplicate keys (which would trip ruff F601) Confidence: high Scope-risk: narrow (single module + obsolete legacy test class)

The single test (test_is_qnn_profiling_available_returns_bool) is fully subsumed by three new tests in tests/unit/session/monitor/test_qnn_monitor.py (test_is_available_via_bundled, test_is_available_via_winml, test_is_available_neither).

- Add _resolve_ep_monitor(ep, op_tracing, output_dir) dispatch helper to commands/perf.py; explicit if/elif dispatch, no registry - Add op_tracing field to BenchmarkConfig to thread the level through _run_benchmark_monitored without changing the CLI signature - Integrate monitor= into main session.perf() call; EP monitor now observes the user's actual benchmark iterations (not a separate pass) - Remove the third context-manager (ep_monitor as ep_mon) from the monitored-loop with-block; session.perf() owns the lifecycle - Delete the standalone op-tracing block (~55 lines) that called QNNProfiler/get_tracer via the old optracing registry - Add post-benchmark report block that reads ctx.monitor.result (OpTraceResult) via session.monitor.report.display/write helpers - Move test file: tests/unit/optracing/test_perf_optracing_cli.py to tests/unit/commands/test_perf_optracing.py - Expand test coverage: add TestResolveEpMonitor (8 tests) covering NullEPMonitor fallback, VitisAI availability, QNN hard-fail (NFR-2), unsupported-EP RuntimeError, and level propagation Op-tracing now observes the user's actual benchmark iterations rather than a separate synthetic profiling pass. Deliberate semantic change per US-5. Constraint: _run_onnx_benchmark does not yet integrate _resolve_ep_monitor (op-tracing on ONNX direct path is out of scope for this task) Rejected: Add monitor= to BenchmarkConfig | BenchmarkConfig is a pure data struct; using a separate field (op_tracing) is cleaner Confidence: high Scope-risk: moderate Directive: The ep_monitor is entered/exited by session.perf() now; do not add a third context manager for it

- QNNProfiler replaced by QNNMonitor (session/monitor/qnn_monitor.py) - OpTracer ABC + registry collapsed into the EPMonitor hierarchy - Tests migrated to tests/unit/session/monitor/ - Clean optracing/__init__.py: removed OpTracer, get_tracer, register_tracer, is_qnn_profiling_available exports; retains shims for result/report only Remaining optracing/ shims (result.py, report.py, qnn/*.py) stay for now to preserve external import paths during transition; removed in the next commit once all imports are verified clean. Confidence: high Scope-risk: narrow (deletions only, no behavior change)

All functionality relocated to session/monitor/. Shims removed now that no caller imports from the old paths. Also removes 'winml.modelkit.optracing' from tests/cli/test_import_time.py's parametrize list — the module no longer exists to import-test. The op-tracing refactor's source delete phase is now complete. Remaining tasks: rewire WinMLSession._init_winml_eps_once (Task 14), relocate design docs (Task 15), final E2E verification (Task 16). Confidence: high Scope-risk: narrow (deletion + single test param removal)

Redundant with the module-level ensure_initialized() function in session/ep_registry.py added in an earlier task. WinMLSession.__init__ now calls the module function directly, eliminating the classmethod and its associated _eps_initialized class attribute. Updated three patch sites (tests/conftest.py, test_qairt_session.py, test_is_compatible.py) from the deleted classmethod target to winml.modelkit.session.ep_registry.ensure_initialized. Confidence: high Scope-risk: narrow

Per docs/standards/design-doc-spec.md §1.5.1 transitional commitment — implementation has landed (optracing/ package deleted in Task 13), so docs move to their spec-compliant location under the target module directory. - Version bumped 2.1 -> 2.2 on both PRD and coreloop - Transitional Location note removed (no longer applicable) - design-doc-spec.md §7.4 exemplar path updated - Related docs in plan file updated Closes spec §1.5.1 transitional exception for the op-tracing refactor. Confidence: high Scope-risk: narrow (docs-only, no source code impact)

The tests/unit/optracing directory was deleted in Task 13 of this refactor; our new tests live under tests/unit/session/monitor/ which is already covered by the 'commands' group's tests/unit/session path. Leaving the dead reference in the 'remaining' group either fails the CI job (pytest errors on missing path) or leaks dead config. Remove it. Confidence: high Scope-risk: narrow (CI config only)

find_qnn_sdk() previously walked _COMMON_SDK_PATHS that included a developer-machine path (D:\QC) and an installer default (C:\Qualcomm\AIStack\qairt). Both violate Cardinal Rule 1 (no hardcoded paths) and could mask misconfiguration on other machines. Resolve only from the QNN_SDK_ROOT env var. Update the WARNING messages in run_basic_viewer and run_qhas_viewer so users learn what to set when the viewer cannot be located. Detail mode degrades to basic CSV post-processing per design FR-5 / status='basic_fallback'. Add unit tests covering env unset, env points to dir, env points to nonexistent path.

The requires_session_teardown ClassVar is load-bearing per design C-2: WinMLSession.perf() destroys the ORT InferenceSession before collecting monitor data when this flag is True (QNN flushes its CSV only on session destroy). A subclass silently shadowing it with a non-bool would break the invariant without a runtime error. Add EPMonitor.__init_subclass__ that rejects non-bool class-level shadowing at class-definition time. Runtime instance shadowing in __init__ is not catchable here without metaclass tricks, but making the invariant *visible* is the goal. Add a unit test that constructs a bad subclass and asserts TypeError.

The session/monitor/ package directory was missing __init__.py, making it an implicit namespace package. Add a lightweight init matching the project's existing __init__.py style (no re-exports; external code imports submodules directly).

Introduce TraceStatus = Literal[...] alias covering the closed set of values emitted by EP monitors (ok | no_data | parse_failed | basic_fallback | not_run). Annotate OpTraceResult.status with the alias so static checkers reject typos like 'no_dada' that would currently slip through as a successful status and silently misclassify in commands/perf.py. Runtime behavior is unchanged — Literal is purely a static narrowing, and to_dict()/JSON serialization continue to emit the bare string. Also propagate the alias to QNNMonitor: _make_failure_result(status) parameter and the local 'status' var in _parse_artifacts are typed as TraceStatus.

Pre-exit, QNNMonitor.to_dict() previously returned a flat {ep, device, status} stub while post-exit it returned the nested OpTraceResult schema (metadata/summary/operators/...). Consumers that key on 'metadata' would KeyError if they probed the monitor before the context manager exited. Delegate the pre-exit branch to _make_failure_result(status='not_run'), which is already the post-exit failure path. Same nested shape, just empty body and status='not_run'. Update test_to_dict_before_enter to assert the new shape and add test_to_dict_pre_exit_returns_nested_schema to lock the contract.

A *_schematic.bin from a prior CI run sitting in the process CWD would previously be silently consumed when the QNN SDK failed to drop one in the output dir. QHAS would then report metrics for the WRONG graph with status='ok' — silent data corruption. Gate the CWD fallback by mtime against the profiling CSV: the schematic must be at least as new as the CSV (with a 5s tolerance for filesystem clock skew) to be accepted. Otherwise return None and let QHAS bail with status='basic_fallback', which is the honest answer. Add three tests: - test_find_schematic_rejects_stale_cwd_candidate - test_find_schematic_accepts_fresh_cwd_candidate - test_find_schematic_prefers_output_dir_over_cwd

…t_dir The tempdir created when output_dir=None (qnn_profile_*) was undocumented and unreachable by callers, even though we deliberately leak it so QNN profiling artifacts can be inspected post-run. Document the no-cleanup contract in the class docstring and __init__ docstring, and expose a read-only output_dir property so callers can locate / clean / archive the artifacts they care about. Tests: - test_output_dir_property_exposes_path - test_output_dir_property_for_default_tempdir (asserts qnn_profile_ prefix) - test_output_dir_property_is_read_only (no setter)

The bare 'except Exception' around ensure_initialized() + get_ep_devices() swallowed broken Windows App SDK installs, denied registry access, and missing-DLL failures, downgrading them to a silent 'feature unavailable'. This is exactly the D-1 silent-failure problem the PR was meant to fix. Narrow the inner ImportError catch to the import statement only, and promote the catch-all around ensure_initialized()/get_ep_devices() from DEBUG to WARNING with the exception class included. Callers can still probe availability safely (no exception escapes), but real environmental problems are now visible. Constraint: is_available must remain non-throwing per design (callers probe) Confidence: high Scope-risk: narrow

Promote silent-fallback DEBUG logs to WARNING so users can diagnose real environmental failures (broken Windows App SDK, denied registration, missing DLL): - _discover_eps catch-all: include exception class in WARNING. - _fix_winrt_runtime: DEBUG -> WARNING (function only runs in known-needed init path, so a failure here matters). - register_to_ort: include EP name + exception class in WARNING; track per-EP failures in new _registration_failures dict + registration_failures property so callers can introspect. - ensure_initialized: DEBUG -> WARNING with exception class. No latch on failure — subsequent calls retry registration. - get_ort_available_providers: DEBUG -> WARNING (same rationale). Constraint: ensure_initialized must remain non-throwing (NFR-2 callers probe) Confidence: high Scope-risk: narrow Directive: registration_failures property returns a copy — do not expose internal dict directly

Drop the redundant 'session.perf(): ' prefix so the verbatim NFR-3 phrase ('auto-resetting compiled session to apply monitor session/provider options') appears as a substring of the WARNING log. The (monitor=...) suffix is retained as load-bearing diagnostic context. Update test_auto_reset_fires_when_options_contributed to pin to the exact NFR-3 substring so future regressions are caught immediately. Confidence: high Scope-risk: narrow

The headline SC-1 invocation `wmk perf --device npu --op-tracing basic` must engage QNNMonitor without requiring an explicit `--ep qnn`. Two gaps caused the silent no-op: * `_resolve_ep_monitor` matched `ep == "qnn"` literally and case-sensitively, so `--ep QNN` and `--ep Qnn` fell through to the unsupported-EP raise. * With no `--ep` at all, `config.ep` was `None` so the helper raised "Op-tracing not available for EP 'None'" -- masking the real intent ("--device npu means QNN"). Fix: normalize ep with `(ep or "").lower()`, accept device='npu' as auto-inference for QNN when no ep is given, and split the unavailable-vs-unsupported error paths so the QNN-missing case points the user at install steps. Tests cover auto-inference, case-insensitive matching across `qnn / QNN / Qnn / qNN`, descriptive errors for both branches.

…pported inputs Three converging gaps caused `wmk perf --op-tracing basic` to be a silent no-op or a soft-warn instead of producing the SC-1 artifacts: 1. Dispatch in `_run_benchmark` only routed to the monitored path when `--monitor` was set. Without `--monitor`, `session.perf` was called with no monitor=, so op-tracing never engaged. Fix: route to the monitored path whenever op_tracing OR monitor. 2. `_run_benchmark_monitored` early-returned to the simple path when HWMonitor was unavailable -- defeating op-tracing too. Fix: run with the EP monitor only (no live HW chart) when HW telemetry is missing; keep the warning for the --monitor case. 3. Post-benchmark report read `benchmark._perf_ctx` but only the monitored path set it. Fix: populate `_perf_ctx` on every path so reporting works regardless of which dispatcher was hit. Additionally: * Hard-fail `--op-tracing` with `.onnx` file inputs at parse time via `click.UsageError`. `_run_onnx_benchmark` does not yet thread the EP monitor through session.perf, so allowing the run would waste a benchmark and end with "No profiling data". Fast clear error is better than slow silent failure. * Replace the post-benchmark soft "yellow warning" with NFR-2 hard-fail (exit 4) on `status='no_data'` and `status='parse_failed'`. `status='basic_fallback'` remains exit 0 with a yellow notice (degraded-success). Each branch carries an actionable diagnostic. Tests cover: ONNX+op-tracing UsageError, no_data exit 4, parse_failed exit 4 with parser message, basic_fallback exit 0 with degraded notice.

Audit-driven test tightening on the op-tracing refactor. Each addition covers a code path that the PR review (A2-I1, A2-I3, A2-I7, A2-I8) called out as silent-regression-prone: * test_to_dict_preserves_nested_schema (A2-I1): assert num_samples and timestamp are present in metadata. A regression dropping either field would have passed the previous "check the easy keys" assertion. * test_ensure_initialized_calls_registry_once (A2-I3): tighten from call_count >= 1 to == 3 across both get_instance and register_to_ort. Pins the wrapper-side contract; the singleton's _registered_eps skip-list provides the actual no-op idempotency. * test_detail_mode_falls_back_to_basic_when_qhas_unavailable (A2-I7): cover FR-5 — detail mode with no *_qnn.log produces status=basic_fallback while preserving CSV-derived operators and summary. Without this test, a regression that raised or downgraded to status=ok would slip through. * test_parse_artifacts_retries_when_csv_absent + test_parse_artifacts_no_retry_when_csv_present_on_first_check (A2-I8): pin R-2 — the 50ms time.sleep retry must fire on missing CSV and must NOT fire when the CSV is already present. Constraint: tests stay pytest-only with code-generated fixtures (Cardinal Rule 2); CSV content is read from the existing optrace_resnet50.csv fixture, never hand-rolled. Confidence: high Scope-risk: narrow

PRD §10.5 / coreloop §8.4 mandate this test: test_cli_op_tracing_basic_on_qnn (skip if no QNN NPU): runs wmk perf -m resnet50 --device npu --op-tracing basic, asserts CSV produced, *_op_trace.json written, at least one operator entry. This is the only end-to-end proof that SC-1 holds: the headline invocation produces real per-operator trace data on a QNN NPU rather than silently falling back to CPU (the original SC-1 bug). Doubly-gated: * @pytest.mark.skipif on WINML_TEST_NPU=1 — explicit opt-in matching the existing project pattern (see tests/unit/models/auto/conftest.py). * QNNMonitor.is_available() runtime probe — pytest.skip cleanly on machines that opted in but do not actually have the EP installed. Constraint: Cardinal Rule 3 forbids skip/xfail to hide failures; both gates are hardware-availability gates, not failure suppression. Constraint: only HF model IDs go through the integrated op-tracing path (commands/perf.py rejects --op-tracing on direct .onnx inputs); the test uses microsoft/resnet-50 to match existing perf E2E coverage. Confidence: high Scope-risk: narrow Not-tested: detail-mode CLI on real hardware (covered by the unit-level test_basic_fallback_status_exits_0_with_notice plus the new test_detail_mode_falls_back_to_basic_when_qhas_unavailable).

_resolve_ep_monitor previously only auto-inferred QNN when device == "npu" literal. The default --device auto and programmatic device=None / device="" callers fell through to the hard-fail branch with a cryptic "EP None on device 'auto'" message, even though --op-tracing is itself a strong intent signal that the user wants QNN-only profiling. Expand the auto-infer set from {"npu"} to {"npu", "auto", ""}. Explicit --device cpu / --device gpu still hard-fails — those are deliberate user choices that should not be silently overridden. Constraint: --op-tracing currently only supports QNN; auto-resolution must not contradict an explicit user device choice. Confidence: high Scope-risk: narrow Directive: do NOT add "cpu" or "gpu" to the auto-infer set — they must remain explicit no-NPU signals.

QNNMonitor was setting backend_path="QnnHtp.dll" (relative) and disable_cpu_ep_fallback=1 as "sensible defaults". Both broke the WinML-registered QNN path: 1. backend_path override REPLACED WinML's absolute C:\Program Files\WindowsApps\...\QnnHtp.dll path with a bare filename, so QNN's library loader couldn't find the DLL. EPContext nodes then had no implementation -> NotImplemented errors. This is structurally identical to the original D-1 silent CPU fallback — different mechanism, same outcome. 2. disable_cpu_ep_fallback=1 rejected the LEGITIMATE QDQ partitioning that EPContext-wrapped quantized models use: Q/DQ at boundaries genuinely need to run on CPU; the EPContext (the actual model) runs on QNN. Forbidding any CPU node killed all such models. The fix: get_provider_options() now passes through extra_provider_options plus only the two C-3 owner-enforced profiling keys. Callers needing backend_path / htp_* (e.g. bundled onnxruntime-qnn users) supply via extra_provider_options at construction. WinML-registered QNN keeps its correct registered defaults. get_session_options() retains ep.context_enable + ep.context_embed_mode (EPContext caching) but drops session.disable_cpu_ep_fallback. The "no silent CPU fallback" guarantee is provided by add_provider_for_devices upstream — if the QNN device is absent, session creation fails loudly there. Verified empirically: facebook/convnext-base-224 now runs on QNN with 73% avg NPU util at 3.12ms latency under onnxruntime-windowsml. Constraint: WinML-registered QNN provides its own absolute backend_path and HTP defaults; supplying our own would overwrite (last-writer-wins in ORT's option merge). Constraint: EPContext + QDQ models legitimately partition Q/DQ to CPU. Confidence: high Scope-risk: moderate (changes hook contributions; tests updated) Directive: do NOT add backend_path / htp_* / disable_cpu_ep_fallback back as defaults. They must remain pure pass-through from extra_provider_options.

WinMLSession.compile() takes one of two paths: 1. Explicit-EP: when self._ep is set -> add_provider_for_devices([dev], opts) This passes self._provider_options through to ORT. 2. Policy-based: when self._ep is None -> set_provider_selection_policy(policy) This ignores self._provider_options entirely. When op-tracing is engaged via QNNMonitor, the monitor contributes profiling_level + profiling_file_path through get_provider_options(). But WinMLAutoModel constructs WinMLSession without an explicit ep, so self._ep is None -> path 2 is taken -> profiling options are silently dropped -> no profiling CSV is produced even though the model runs on QNN. This is the third regression of the same root pattern in this PR: "monitor contributes config the session doesn't consume." Fix: introduce EPMonitor.ep_name ClassVar (default None). When the monitor declares an ep_name AND self._ep is None, perf() pins self._ep for the duration of the perf() context (saved in the same try/finally chain that already restores _provider_options). QNNMonitor.ep_name = "qnn"; NullEPMonitor / VitisAIMonitor / OpenVinoMonitor inherit None and behave unchanged. Verified empirically: facebook/convnext-base-224 with --op-tracing basic now produces a 1 MB profiling CSV and a populated op_trace.json with per-operator cycle counts (LayerNormalization, Gelu, Conv, etc.) under onnxruntime-windowsml. Constraint: ORT's set_provider_selection_policy and add_provider_for_devices are mutually exclusive; only the latter consumes provider options. Constraint: WinMLAutoModel constructs WinMLSession without ep, so the monitor must inform the session at perf-time (not at construction). Confidence: high Scope-risk: narrow (additive ClassVar; restoration in existing teardown chain) Directive: when adding new EPMonitor subclasses that need their own provider options to take effect, set ep_name to the corresponding short EP name.

…it value Op-tracing produces a usable per-operator trace from a single inference; running the normal default of 100 iterations just inflates the profiling CSV proportionally (operators are averaged across iterations regardless). The wall-clock cost of the extra 99 iterations is non-trivial on QNN (~3 ms each on convnext-base-224 -> ~300 ms saved) and the larger CSV slows the post-hoc parse step too. When --op-tracing is set AND the user did not explicitly pass --iterations, collapse to 1. Detection uses click.Context.get_parameter_source so the help text still shows [default: 100] (which is what stands without op-tracing) and a user explicitly passing --iterations 100 alongside --op-tracing is honored. Warmup is intentionally NOT smart-defaulted: it serves a distinct purpose (letting QNN HTP warm its compiled-binary cache before measurement) and that need persists with op-tracing. Default warmup=10 + iterations=1 = 11 total inferences, all profiled. Verified empirically: facebook/convnext-base-224 --op-tracing basic without --iterations now produces a 450KB CSV (was 1MB at 20 iter / 5MB at 100 iter) with the same op-level cycle data. Constraint: must distinguish "user didn't specify" from "user passed the default value explicitly" — handled via click.ParameterSource.DEFAULT. Confidence: high Scope-risk: narrow (CLI-only behaviour change; gated on --op-tracing) Directive: do NOT add similar smart defaults for --warmup; warmup serves a different purpose (cache stabilisation) that op-tracing needs too.

tezheng · 2026-04-28T06:40:15Z

C:/Program Files/Git/azp run

DingmaomaoBJTU

Code Review: refactor(session): unify op-tracing under EPMonitor

Good architectural refactoring. The EPMonitor ABC pattern with hook-based plugin integration into WinMLSession.perf() is clean and extensible. The teardown ordering invariant (session reset -> gc.collect() -> monitor.exit) is well-documented and correctly implemented.

Strengths:

Clear separation of concerns: monitors contribute options via hooks, session owns ORT lifecycle
NullObject pattern eliminates null checks in benchmark loop
Comprehensive status/error reporting via TraceStatus enum
Well-documented class vars with init_subclass validation

Concerns (see inline comments):

Potential resource leak in QNNMonitor temp directory lifecycle
Race condition window in _find_schematic CWD fallback
Missing retry robustness for Windows file-handle lag
Large scope: 23k+ additions including docs/design artifacts that should likely be separate PRs

DingmaomaoBJTU

Inline findings (could not attach as line comments due to diff resolution):

src/winml/modelkit/session/monitor/qnn_monitor.py L108 - Resource Leak: \ empfile.mkdtemp\ creates a directory that is explicitly never cleaned up. In CI or automated benchmarking this could accumulate unbounded temp dirs. Consider adding a \cleanup()\ method or emitting a \logger.info\ with the temp path.

src/winml/modelkit/session/session.py L89 - Hardcoded fallback: \�uto\ maps to \PREFER_NPU\ which assumes Qualcomm hardware. This couples the generic session class to a specific vendor. Consider adding a comment explaining why or making it configurable.

src/winml/modelkit/session/session.py L664 - Teardown ordering: The nested try/finally is correct but subtle. Suggest adding a brief inline comment (e.g. # C-2 invariant: teardown before parse) to help future maintainers understand why this order matters.

src/winml/modelkit/session/monitor/qnn_monitor.py L243 - Fragile retry: The 50ms sleep for Windows file-handle flush lag is a fixed heuristic. Consider exponential backoff (50ms, 100ms, 200ms) or documenting the NVMe/SSD assumption.

src/winml/modelkit/session/monitor/qnn_monitor.py L277 - TOCTOU concern: The mtime comparison in _find_schematic\ CWD fallback has a race window in CI with parallel runs. Consider logging both timestamps when the fallback triggers.

src/winml/modelkit/session/monitor/ep_monitor.py L65 - Incomplete validation: _init_subclass_\ catches class-level shadowing but instance-level shadowing bypasses this. Since this flag governs the teardown invariant, consider _slots_\ or _setattr_\ guard.

src/winml/modelkit/session/monitor/qnn/csv_parser.py L59 - Silent data loss risk: \csv.DictReader\ silently handles schema changes. Consider validating the header row against expected column names.

src/winml/modelkit/session/monitor/qnn/viewer.py L140 - nit: Consider adding \ imeout=120\ to \subprocess.run\ to prevent indefinite hangs.

Zheng Te added 30 commits April 14, 2026 15:10

docs: add op-tracing refactor design docs + spec

fbe0da6

feat(session): add ensure_initialized() module function to ep_registry

75ac922

Breaks reverse-coupling for QNNMonitor.is_available() by providing a module-level entry point that wraps the WinMLEPRegistry singleton. Idempotent; safe to call multiple times.

feat(monitor): relocate report helpers → session/monitor/report

301d1f8

Moves display_op_trace_report + write_op_trace_json. Old path retained as re-export shim. Existing test file moved with git mv (history preserved).

feat(session): add PerfContext dataclass

0f9e951

Frozen dataclass aggregating PerfStats + EPMonitor. Prep for session.perf(monitor=...) signature change in a later task.

feat(session): add _active_session_option_entries state

5f7b6b9

Infrastructure for session.perf(monitor=...) to contribute session-level config entries via add_session_config_entry. Empty by default; populated transiently during perf() context (wired in a later task).

Zheng Te added 4 commits April 27, 2026 11:47

tezheng requested a review from a team as a code owner April 27, 2026 04:37

xieofxie mentioned this pull request Apr 27, 2026

fix(perf): op-tracing preflight check before benchmark run #330

Closed

1 task

Zheng Te added 4 commits April 27, 2026 14:46

ci: empty commit to retrigger workflows

5eeb7c4

DingmaomaoBJTU reviewed May 29, 2026

View reviewed changes

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

refactor(session): unify op-tracing under EPMonitor; fix WinML EP#397

refactor(session): unify op-tracing under EPMonitor; fix WinML EP#397
tezheng wants to merge 39 commits into
mainfrom
feat/op-tracing-refactor

tezheng commented Apr 27, 2026

Uh oh!

tezheng commented Apr 28, 2026

Uh oh!

DingmaomaoBJTU left a comment

Uh oh!

DingmaomaoBJTU left a comment

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

Conversation

tezheng commented Apr 27, 2026

Summary

Design

Migration footprint

Branch composition (notes for reviewers)

Verification

Notable behavior changes

Out of scope

Pre-existing failure (not introduced)

Uh oh!

tezheng commented Apr 28, 2026

Uh oh!

DingmaomaoBJTU left a comment

Choose a reason for hiding this comment

Code Review: refactor(session): unify op-tracing under EPMonitor

Uh oh!

DingmaomaoBJTU left a comment

Choose a reason for hiding this comment

Inline findings (could not attach as line comments due to diff resolution):

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants