Skip to content

refactor(session): unify op-tracing under EPMonitor; fix WinML EP#397

Open
tezheng wants to merge 39 commits into
mainfrom
feat/op-tracing-refactor
Open

refactor(session): unify op-tracing under EPMonitor; fix WinML EP#397
tezheng wants to merge 39 commits into
mainfrom
feat/op-tracing-refactor

Conversation

@tezheng
Copy link
Copy Markdown
Collaborator

@tezheng tezheng commented Apr 27, 2026

Summary

  • Fixes the broken --op-tracing path under onnxruntime-windowsml. Previously QNNProfiler created its own ort.InferenceSession via the explicit-providers API, which only finds the QNN DLL when bundled by onnxruntime-qnn. Under onnxruntime-windowsml (where the DLL is registered via WinML, not bundled) the profiler silently fell back to CPU and produced no profiling data.
  • Collapses two parallel per-EP hierarchies into one. EPMonitor ABC gains two optional hooks (get_session_options, get_provider_options); QNNMonitor becomes the real implementation; the optracing/ package is deleted entirely. WinMLSession is now the sole owner of ORT session creation; monitors plug in via hooks, never own sessions.
  • WinMLSession.perf() accepts monitor=EPMonitor and yields a PerfContext(stats, monitor). The CLI benchmark loop and standalone scripts share one primitive.

Design

Full design at docs/design/session/monitor/:

Architectural pattern: Hook-based Plugin + Template Method + Observer.

Load-bearing invariants:

  • Singular monitor= per perf() call (no multi-monitor)
  • Monitor factory is explicit if/elif dispatch in commands/perf.py — no registry, no plugin loader
  • Teardown ordering: session reset → gc.collect()monitor.__exit__ (QNN flushes the profiling CSV only on ort.InferenceSession destruction; reversing this produces an empty CSV)
  • OpTraceResult.to_dict() nested schema (metadata/summary/operators/statistics/artifacts) preserved exactly; status and error added as additive top-level keys (downstream report consumers unchanged)
  • EPMonitor.__exit__ MUST NOT suppress caller exceptions; auto-reset of a compiled session emits a WARNING (no silent mutation)

Migration footprint

Path Action
src/winml/modelkit/optracing/ DELETED entirely
optracing/result.py (OpTraceResult, OperatorMetrics) session/monitor/op_metrics.py
optracing/report.py session/monitor/report.py
optracing/qnn/{csv_parser,qhas_parser,viewer}.py session/monitor/qnn/
WinMLSession._init_winml_eps_once extracted to module-level ep_registry.ensure_initialized()
tests/unit/optracing/ (incl. fixtures/) tests/unit/session/monitor/

Branch composition (notes for reviewers)

The branch contains 34 commits, structured in three layers:

  1. 1bea4cf8 (1 commit) — pre-existing cherry-pick from feat/mvp-v2. Provides scaffolding the op-tracing work plugs into: _run_simple_loop / _run_monitored_loop / _print_model_info helpers in commands/perf.py, _suppress_native_output ctx mgr in session.py, and the live_chart_live_chart rename. Op-tracing commits are hard-dependent on these.
  2. fbe0da64..3adee047 (19 commits) — the op-tracing refactor proper: design docs land, helpers relocate, hooks added, QNNMonitor rewritten from placeholder, optracing/ package demolished, stale CI matrix entry removed.
  3. 356ba5b4..a0003e46 (14 commits) — review-driven fixes addressing 6 critical and many important findings from a multi-agent code review:
    • 356ba5b4 drop hardcoded SDK paths; require QNN_SDK_ROOT (Cardinal Rule 1)
    • e20f0dea __init_subclass__ guard against requires_session_teardown shadowing
    • 65544029 ensure session/monitor/__init__.py exists
    • 43d4a7e9 OpTraceResult.status as Literal[TraceStatus] (closed set)
    • a44613fe unify to_dict() schema between pre/post __exit__
    • 754fa02b mtime-gate _find_schematic CWD fallback (rejects stale schematics in CI)
    • af48c886 document tempdir no-cleanup contract; expose output_dir property
    • d8160fe8 surface WinML EP probe failures at WARNING (NFR-2)
    • 0c0d1257 ep_registry: surface registration failures with EP name + exc class
    • 3e5c5b5b match NFR-3 auto-reset log wording verbatim
    • 46d767e8 auto-infer QNN monitor from --device npu; case-insensitive --ep
    • d9d798bc route op-tracing through monitored path; hard-fail on unsupported inputs (--op-tracing with ONNX file → click.UsageError; status="no_data"/"parse_failed"sys.exit(4))
    • c5cbee98 close test coverage gaps from review
    • a0003e46 add hardware-gated CLI E2E for SC-1

Reviewers focused on the op-tracing delta should read the 33 commits on top of 1bea4cf8; the first commit is scaffolding from a separate effort.

Verification

  • uv run pytest tests/unit/session/ tests/unit/session/monitor/ tests/unit/commands/: 433 passed, 7 skipped (1 hardware-gated SC-1 E2E + 6 pre-existing skips)
  • uv run pytest tests/unit/: 3877 passed, 74 skipped, 2 xfailed — no regressions vs main
  • uv run ruff check: clean across all touched files
  • Cardinal Rule 1 audit (grep -r "D:\QC\|C:\Qualcomm" src/): zero matches

Notable behavior changes

  • onnxruntime-windowsml users can now run wmk perf --device npu --op-tracing basic directly — no --ep qnn needed (auto-inferred from device).
  • ONNX direct file inputs (-m foo.onnx --op-tracing ...) raise click.UsageError at parse time. HF model IDs and built model directories are supported. Direct-ONNX op-tracing is a future PR.
  • Op-tracing failures are now hard errors (exit 4 with descriptive message) rather than soft yellow warnings. status="basic_fallback" (detail mode degrading to basic CSV when QNN SDK absent) is the only acceptable degraded-success.
  • Detail mode (--op-tracing detail) requires QNN_SDK_ROOT env var pointing to a QAIRT SDK. Without it, degrades cleanly to status="basic_fallback" with a yellow notice.

Out of scope

  • DML / OpenVINO / TensorRT EPMonitor implementations (this PR reshapes the base class and reworks QNN only)
  • Multiple simultaneous EP monitors per session
  • --op-tracing for direct ONNX file inputs

Pre-existing failure (not introduced)

tests/regression/test_design_gaps.py::TestB4ConfigModelTypeNoTask::test_model_type_bert_no_task_succeeds fails on main and on this branch identically — unrelated to this PR.

Zheng Te added 30 commits April 14, 2026 15:10
Cherry-picked from feat/mvp-v2 (11 commits squashed):
- Lazy imports: _LAZY_IMPORTS dict pattern across 7 modules, 0.13s CLI startup
- LazyGroup CLI: filesystem scan + AST help parsing, no module imports
- Rich Live analyze: per-node animated stacked bars with callbacks
- Rich config output: CONFIG GENERATION header with I/O specs
- Build StageLive: per-stage progress with EP analyzer bars
- Inspect redesign: Rich formatted _inspect_model_v2
- Perf ONNX direct path: bypasses HF build for .onnx files
- Config registry short-circuit: skips Optimum when input_tensors registered
- Circular import fix (onnx <-> compiler) via lazy detection
- CodeQL alerts resolved, test adaptations, README rewrite
Breaks reverse-coupling for QNNMonitor.is_available() by providing a
module-level entry point that wraps the WinMLEPRegistry singleton.
Idempotent; safe to call multiple times.
Adds three optional members with safe defaults:
- requires_session_teardown: ClassVar[bool] = False
- get_session_options() -> {}
- get_provider_options() -> {}

Existing subclasses (VitisAI, OpenVINO, QNN placeholder, NullEPMonitor)
inherit defaults unchanged.
Move dataclasses from optracing/result.py. Additive changes:
- model: str -> str | None (allows None for standalone profiling)
- New fields: status (default 'ok'), error (default None)
- to_dict() preserves nested schema; adds top-level status/error keys

Old import path retained as re-export shim; removed in later task.
Moves display_op_trace_report + write_op_trace_json. Old path retained
as re-export shim. Existing test file moved with git mv (history preserved).
Moves csv_parser.py, qhas_parser.py, viewer.py + test fixtures from
optracing/qnn/ to session/monitor/qnn/, the canonical home for
QNNMonitor support code. Old paths retained as backward-compat shims.

Also updates test_integration.py fixture path and test_qnn_profiler.py
import + patch targets to reference the new module location.

Constraint: Old import paths must remain importable for any code already
referencing optracing.qnn.*
Scope-risk: narrow
Confidence: high
Frozen dataclass aggregating PerfStats + EPMonitor. Prep for
session.perf(monitor=...) signature change in a later task.
Infrastructure for session.perf(monitor=...) to contribute session-level
config entries via add_session_config_entry. Empty by default; populated
transiently during perf() context (wired in a later task).
- perf(warmup, monitor=None) yields PerfContext(stats, monitor)
- Auto-reset on option conflict (WARNING log)
- Teardown ordering: reset -> gc.collect -> monitor.__exit__ -> restore
- Exception transparency via sys.exc_info()
- Nested perf() raises RuntimeError
- Existing callers migrated to ctx.stats

Constraint: Teardown order is load-bearing - reset must fire before
monitor.__exit__ so QNN CSV flushes before parsing.
Confidence: high
Scope-risk: moderate (touches benchmark hot path)
Previously perf() mutated _perf_stats and _provider_options BEFORE calling
mon.__enter__(). In a @contextmanager generator, code before 'yield' runs
outside any try/finally, so an __enter__ exception propagated without
restoring state — leaving the session stuck (nested-perf error on every
subsequent perf() call).

Fix: move mutations into the try block; track mon_entered; skip
mon.__exit__ if __enter__ never succeeded; skip requires_session_teardown
reset if no data was produced.

Directive: setup mutations MUST remain inside the try block; do not refactor
back to pre-yield mutations without a reentrant state manager.
Confidence: high
Scope-risk: narrow (one method, preserved spec-compliant teardown ordering)
- get_session_options() contributes disable_cpu_ep_fallback, ep.context_*
- get_provider_options() contributes backend_path + profiling_level
  (user overrides honored for non-profiling keys; profiling_level and
  profiling_file_path are owner-enforced via explicit post-merge set)
- is_available() works with both onnxruntime-qnn AND onnxruntime-windowsml
  (latter via ep_registry.ensure_initialized + get_ep_devices)
- __exit__ parses CSV -> OpTraceResult; retries once on Windows file-handle lag
- detail mode: runs QHAS viewer if SDK available; falls back to basic
  (status='basic_fallback') with WARNING log
- No os.chdir (glob fallback to locate schematic.bin)
- Double __enter__ raises RuntimeError

Legacy placeholder tests in tests/unit/session/test_ep_monitor.py are
removed (class TestQNNMonitor); their replacement lives at
tests/unit/session/monitor/test_qnn_monitor.py (20 tests).

Constraint: profiling_level and profiling_file_path MUST NOT be user-
  overridable (C-3 in PRD) -- enforced via pop-then-set, not dict literal
  duplicate keys (which would trip ruff F601)
Confidence: high
Scope-risk: narrow (single module + obsolete legacy test class)
The single test (test_is_qnn_profiling_available_returns_bool) is fully
subsumed by three new tests in tests/unit/session/monitor/test_qnn_monitor.py
(test_is_available_via_bundled, test_is_available_via_winml,
test_is_available_neither).
- Add _resolve_ep_monitor(ep, op_tracing, output_dir) dispatch helper
  to commands/perf.py; explicit if/elif dispatch, no registry
- Add op_tracing field to BenchmarkConfig to thread the level through
  _run_benchmark_monitored without changing the CLI signature
- Integrate monitor= into main session.perf() call; EP monitor now
  observes the user's actual benchmark iterations (not a separate pass)
- Remove the third context-manager (ep_monitor as ep_mon) from the
  monitored-loop with-block; session.perf() owns the lifecycle
- Delete the standalone op-tracing block (~55 lines) that called
  QNNProfiler/get_tracer via the old optracing registry
- Add post-benchmark report block that reads ctx.monitor.result
  (OpTraceResult) via session.monitor.report.display/write helpers
- Move test file: tests/unit/optracing/test_perf_optracing_cli.py
  to tests/unit/commands/test_perf_optracing.py
- Expand test coverage: add TestResolveEpMonitor (8 tests) covering
  NullEPMonitor fallback, VitisAI availability, QNN hard-fail (NFR-2),
  unsupported-EP RuntimeError, and level propagation

Op-tracing now observes the user's actual benchmark iterations rather
than a separate synthetic profiling pass. Deliberate semantic change
per US-5.

Constraint: _run_onnx_benchmark does not yet integrate _resolve_ep_monitor
  (op-tracing on ONNX direct path is out of scope for this task)
Rejected: Add monitor= to BenchmarkConfig | BenchmarkConfig is a pure
  data struct; using a separate field (op_tracing) is cleaner
Confidence: high
Scope-risk: moderate
Directive: The ep_monitor is entered/exited by session.perf() now;
  do not add a third context manager for it
- QNNProfiler replaced by QNNMonitor (session/monitor/qnn_monitor.py)
- OpTracer ABC + registry collapsed into the EPMonitor hierarchy
- Tests migrated to tests/unit/session/monitor/
- Clean optracing/__init__.py: removed OpTracer, get_tracer, register_tracer,
  is_qnn_profiling_available exports; retains shims for result/report only

Remaining optracing/ shims (result.py, report.py, qnn/*.py) stay for
now to preserve external import paths during transition; removed in the
next commit once all imports are verified clean.

Confidence: high
Scope-risk: narrow (deletions only, no behavior change)
All functionality relocated to session/monitor/. Shims removed now that
no caller imports from the old paths.

Also removes 'winml.modelkit.optracing' from tests/cli/test_import_time.py's
parametrize list — the module no longer exists to import-test.

The op-tracing refactor's source delete phase is now complete.
Remaining tasks: rewire WinMLSession._init_winml_eps_once (Task 14),
relocate design docs (Task 15), final E2E verification (Task 16).

Confidence: high
Scope-risk: narrow (deletion + single test param removal)
All functionality relocated to session/monitor/. Shims removed now that
no caller imports from the old paths.

Also removes 'winml.modelkit.optracing' from tests/cli/test_import_time.py's
parametrize list — the module no longer exists to import-test.

The op-tracing refactor's source delete phase is now complete.
Remaining tasks: rewire WinMLSession._init_winml_eps_once (Task 14),
relocate design docs (Task 15), final E2E verification (Task 16).

Confidence: high
Scope-risk: narrow (deletion + single test param removal)
Redundant with the module-level ensure_initialized() function in
session/ep_registry.py added in an earlier task. WinMLSession.__init__
now calls the module function directly, eliminating the classmethod and
its associated _eps_initialized class attribute.

Updated three patch sites (tests/conftest.py, test_qairt_session.py,
test_is_compatible.py) from the deleted classmethod target to
winml.modelkit.session.ep_registry.ensure_initialized.

Confidence: high
Scope-risk: narrow
Per docs/standards/design-doc-spec.md §1.5.1 transitional commitment —
implementation has landed (optracing/ package deleted in Task 13), so docs
move to their spec-compliant location under the target module directory.

- Version bumped 2.1 -> 2.2 on both PRD and coreloop
- Transitional Location note removed (no longer applicable)
- design-doc-spec.md §7.4 exemplar path updated
- Related docs in plan file updated

Closes spec §1.5.1 transitional exception for the op-tracing refactor.

Confidence: high
Scope-risk: narrow (docs-only, no source code impact)
The tests/unit/optracing directory was deleted in Task 13 of this
refactor; our new tests live under tests/unit/session/monitor/ which
is already covered by the 'commands' group's tests/unit/session path.

Leaving the dead reference in the 'remaining' group either fails the
CI job (pytest errors on missing path) or leaks dead config. Remove it.

Confidence: high
Scope-risk: narrow (CI config only)
find_qnn_sdk() previously walked _COMMON_SDK_PATHS that included a
developer-machine path (D:\QC) and an installer default
(C:\Qualcomm\AIStack\qairt). Both violate Cardinal Rule 1
(no hardcoded paths) and could mask misconfiguration on other machines.

Resolve only from the QNN_SDK_ROOT env var. Update the WARNING messages
in run_basic_viewer and run_qhas_viewer so users learn what to set when
the viewer cannot be located. Detail mode degrades to basic CSV
post-processing per design FR-5 / status='basic_fallback'.

Add unit tests covering env unset, env points to dir, env points to
nonexistent path.
The requires_session_teardown ClassVar is load-bearing per design C-2:
WinMLSession.perf() destroys the ORT InferenceSession before
collecting monitor data when this flag is True (QNN flushes its CSV
only on session destroy). A subclass silently shadowing it with a
non-bool would break the invariant without a runtime error.

Add EPMonitor.__init_subclass__ that rejects non-bool class-level
shadowing at class-definition time. Runtime instance shadowing in
__init__ is not catchable here without metaclass tricks, but making
the invariant *visible* is the goal.

Add a unit test that constructs a bad subclass and asserts TypeError.
The session/monitor/ package directory was missing __init__.py,
making it an implicit namespace package. Add a lightweight init
matching the project's existing __init__.py style (no re-exports;
external code imports submodules directly).
Introduce TraceStatus = Literal[...] alias covering the closed set of
values emitted by EP monitors (ok | no_data | parse_failed |
basic_fallback | not_run). Annotate OpTraceResult.status with the alias
so static checkers reject typos like 'no_dada' that would currently
slip through as a successful status and silently misclassify in
commands/perf.py.

Runtime behavior is unchanged — Literal is purely a static narrowing,
and to_dict()/JSON serialization continue to emit the bare string.

Also propagate the alias to QNNMonitor: _make_failure_result(status)
parameter and the local 'status' var in _parse_artifacts are typed as
TraceStatus.
Pre-exit, QNNMonitor.to_dict() previously returned a flat
{ep, device, status} stub while post-exit it returned the nested
OpTraceResult schema (metadata/summary/operators/...). Consumers that
key on 'metadata' would KeyError if they probed the monitor before the
context manager exited.

Delegate the pre-exit branch to _make_failure_result(status='not_run'),
which is already the post-exit failure path. Same nested shape, just
empty body and status='not_run'.

Update test_to_dict_before_enter to assert the new shape and add
test_to_dict_pre_exit_returns_nested_schema to lock the contract.
A *_schematic.bin from a prior CI run sitting in the process CWD would
previously be silently consumed when the QNN SDK failed to drop one in
the output dir. QHAS would then report metrics for the WRONG graph
with status='ok' — silent data corruption.

Gate the CWD fallback by mtime against the profiling CSV: the schematic
must be at least as new as the CSV (with a 5s tolerance for filesystem
clock skew) to be accepted. Otherwise return None and let QHAS bail
with status='basic_fallback', which is the honest answer.

Add three tests:
  - test_find_schematic_rejects_stale_cwd_candidate
  - test_find_schematic_accepts_fresh_cwd_candidate
  - test_find_schematic_prefers_output_dir_over_cwd
…t_dir

The tempdir created when output_dir=None (qnn_profile_*) was undocumented
and unreachable by callers, even though we deliberately leak it so QNN
profiling artifacts can be inspected post-run.

Document the no-cleanup contract in the class docstring and __init__
docstring, and expose a read-only output_dir property so callers can
locate / clean / archive the artifacts they care about.

Tests:
  - test_output_dir_property_exposes_path
  - test_output_dir_property_for_default_tempdir (asserts qnn_profile_ prefix)
  - test_output_dir_property_is_read_only (no setter)
The bare 'except Exception' around ensure_initialized() + get_ep_devices()
swallowed broken Windows App SDK installs, denied registry access, and
missing-DLL failures, downgrading them to a silent 'feature unavailable'.
This is exactly the D-1 silent-failure problem the PR was meant to fix.

Narrow the inner ImportError catch to the import statement only, and
promote the catch-all around ensure_initialized()/get_ep_devices() from
DEBUG to WARNING with the exception class included. Callers can still
probe availability safely (no exception escapes), but real environmental
problems are now visible.

Constraint: is_available must remain non-throwing per design (callers probe)
Confidence: high
Scope-risk: narrow
Promote silent-fallback DEBUG logs to WARNING so users can diagnose real
environmental failures (broken Windows App SDK, denied registration,
missing DLL):

- _discover_eps catch-all: include exception class in WARNING.
- _fix_winrt_runtime: DEBUG -> WARNING (function only runs in known-needed
  init path, so a failure here matters).
- register_to_ort: include EP name + exception class in WARNING; track
  per-EP failures in new _registration_failures dict + registration_failures
  property so callers can introspect.
- ensure_initialized: DEBUG -> WARNING with exception class. No latch on
  failure — subsequent calls retry registration.
- get_ort_available_providers: DEBUG -> WARNING (same rationale).

Constraint: ensure_initialized must remain non-throwing (NFR-2 callers probe)
Confidence: high
Scope-risk: narrow
Directive: registration_failures property returns a copy — do not expose
internal dict directly
Drop the redundant 'session.perf(): ' prefix so the verbatim NFR-3 phrase
('auto-resetting compiled session to apply monitor session/provider
options') appears as a substring of the WARNING log. The (monitor=...)
suffix is retained as load-bearing diagnostic context.

Update test_auto_reset_fires_when_options_contributed to pin to the exact
NFR-3 substring so future regressions are caught immediately.

Confidence: high
Scope-risk: narrow
Zheng Te added 4 commits April 27, 2026 11:47
The headline SC-1 invocation `wmk perf --device npu --op-tracing basic`
must engage QNNMonitor without requiring an explicit `--ep qnn`. Two
gaps caused the silent no-op:

  * `_resolve_ep_monitor` matched `ep == "qnn"` literally and
    case-sensitively, so `--ep QNN` and `--ep Qnn` fell through to
    the unsupported-EP raise.
  * With no `--ep` at all, `config.ep` was `None` so the helper
    raised "Op-tracing not available for EP 'None'" -- masking the
    real intent ("--device npu means QNN").

Fix: normalize ep with `(ep or "").lower()`, accept device='npu'
as auto-inference for QNN when no ep is given, and split the
unavailable-vs-unsupported error paths so the QNN-missing case
points the user at install steps.

Tests cover auto-inference, case-insensitive matching across
`qnn / QNN / Qnn / qNN`, descriptive errors for both branches.
…pported inputs

Three converging gaps caused `wmk perf --op-tracing basic` to be a
silent no-op or a soft-warn instead of producing the SC-1 artifacts:

  1. Dispatch in `_run_benchmark` only routed to the monitored path
     when `--monitor` was set. Without `--monitor`, `session.perf`
     was called with no monitor=, so op-tracing never engaged.
     Fix: route to the monitored path whenever op_tracing OR monitor.

  2. `_run_benchmark_monitored` early-returned to the simple path
     when HWMonitor was unavailable -- defeating op-tracing too. Fix:
     run with the EP monitor only (no live HW chart) when HW telemetry
     is missing; keep the warning for the --monitor case.

  3. Post-benchmark report read `benchmark._perf_ctx` but only the
     monitored path set it. Fix: populate `_perf_ctx` on every path
     so reporting works regardless of which dispatcher was hit.

Additionally:

  * Hard-fail `--op-tracing` with `.onnx` file inputs at parse time
    via `click.UsageError`. `_run_onnx_benchmark` does not yet thread
    the EP monitor through session.perf, so allowing the run would
    waste a benchmark and end with "No profiling data". Fast clear
    error is better than slow silent failure.

  * Replace the post-benchmark soft "yellow warning" with NFR-2
    hard-fail (exit 4) on `status='no_data'` and `status='parse_failed'`.
    `status='basic_fallback'` remains exit 0 with a yellow notice
    (degraded-success). Each branch carries an actionable diagnostic.

Tests cover: ONNX+op-tracing UsageError, no_data exit 4, parse_failed
exit 4 with parser message, basic_fallback exit 0 with degraded notice.
Audit-driven test tightening on the op-tracing refactor. Each addition
covers a code path that the PR review (A2-I1, A2-I3, A2-I7, A2-I8) called
out as silent-regression-prone:

* test_to_dict_preserves_nested_schema (A2-I1): assert num_samples and
  timestamp are present in metadata. A regression dropping either field
  would have passed the previous "check the easy keys" assertion.

* test_ensure_initialized_calls_registry_once (A2-I3): tighten from
  call_count >= 1 to == 3 across both get_instance and register_to_ort.
  Pins the wrapper-side contract; the singleton's _registered_eps
  skip-list provides the actual no-op idempotency.

* test_detail_mode_falls_back_to_basic_when_qhas_unavailable (A2-I7):
  cover FR-5 — detail mode with no *_qnn.log produces status=basic_fallback
  while preserving CSV-derived operators and summary. Without this test,
  a regression that raised or downgraded to status=ok would slip through.

* test_parse_artifacts_retries_when_csv_absent +
  test_parse_artifacts_no_retry_when_csv_present_on_first_check (A2-I8):
  pin R-2 — the 50ms time.sleep retry must fire on missing CSV and must
  NOT fire when the CSV is already present.

Constraint: tests stay pytest-only with code-generated fixtures (Cardinal
Rule 2); CSV content is read from the existing optrace_resnet50.csv
fixture, never hand-rolled.
Confidence: high
Scope-risk: narrow
PRD §10.5 / coreloop §8.4 mandate this test:

  test_cli_op_tracing_basic_on_qnn (skip if no QNN NPU): runs
  wmk perf -m resnet50 --device npu --op-tracing basic, asserts CSV
  produced, *_op_trace.json written, at least one operator entry.

This is the only end-to-end proof that SC-1 holds: the headline
invocation produces real per-operator trace data on a QNN NPU rather
than silently falling back to CPU (the original SC-1 bug).

Doubly-gated:
  * @pytest.mark.skipif on WINML_TEST_NPU=1 — explicit opt-in matching
    the existing project pattern (see tests/unit/models/auto/conftest.py).
  * QNNMonitor.is_available() runtime probe — pytest.skip cleanly on
    machines that opted in but do not actually have the EP installed.

Constraint: Cardinal Rule 3 forbids skip/xfail to hide failures; both
gates are hardware-availability gates, not failure suppression.
Constraint: only HF model IDs go through the integrated op-tracing
path (commands/perf.py rejects --op-tracing on direct .onnx inputs);
the test uses microsoft/resnet-50 to match existing perf E2E coverage.
Confidence: high
Scope-risk: narrow
Not-tested: detail-mode CLI on real hardware (covered by the unit-level
test_basic_fallback_status_exits_0_with_notice plus the new
test_detail_mode_falls_back_to_basic_when_qhas_unavailable).
Zheng Te added 4 commits April 27, 2026 14:46
_resolve_ep_monitor previously only auto-inferred QNN when
device == "npu" literal. The default --device auto and programmatic
device=None / device="" callers fell through to the hard-fail branch
with a cryptic "EP None on device 'auto'" message, even though
--op-tracing is itself a strong intent signal that the user wants
QNN-only profiling.

Expand the auto-infer set from {"npu"} to {"npu", "auto", ""}.
Explicit --device cpu / --device gpu still hard-fails — those are
deliberate user choices that should not be silently overridden.

Constraint: --op-tracing currently only supports QNN; auto-resolution
must not contradict an explicit user device choice.
Confidence: high
Scope-risk: narrow
Directive: do NOT add "cpu" or "gpu" to the auto-infer set — they
must remain explicit no-NPU signals.
QNNMonitor was setting backend_path="QnnHtp.dll" (relative) and
disable_cpu_ep_fallback=1 as "sensible defaults". Both broke the
WinML-registered QNN path:

1. backend_path override REPLACED WinML's absolute
   C:\Program Files\WindowsApps\...\QnnHtp.dll path with a bare
   filename, so QNN's library loader couldn't find the DLL. EPContext
   nodes then had no implementation -> NotImplemented errors. This is
   structurally identical to the original D-1 silent CPU fallback —
   different mechanism, same outcome.

2. disable_cpu_ep_fallback=1 rejected the LEGITIMATE QDQ partitioning
   that EPContext-wrapped quantized models use: Q/DQ at boundaries
   genuinely need to run on CPU; the EPContext (the actual model) runs
   on QNN. Forbidding any CPU node killed all such models.

The fix: get_provider_options() now passes through extra_provider_options
plus only the two C-3 owner-enforced profiling keys. Callers needing
backend_path / htp_* (e.g. bundled onnxruntime-qnn users) supply via
extra_provider_options at construction. WinML-registered QNN keeps its
correct registered defaults.

get_session_options() retains ep.context_enable + ep.context_embed_mode
(EPContext caching) but drops session.disable_cpu_ep_fallback. The
"no silent CPU fallback" guarantee is provided by add_provider_for_devices
upstream — if the QNN device is absent, session creation fails loudly
there.

Verified empirically: facebook/convnext-base-224 now runs on QNN with
73% avg NPU util at 3.12ms latency under onnxruntime-windowsml.

Constraint: WinML-registered QNN provides its own absolute backend_path
and HTP defaults; supplying our own would overwrite (last-writer-wins
in ORT's option merge).
Constraint: EPContext + QDQ models legitimately partition Q/DQ to CPU.
Confidence: high
Scope-risk: moderate (changes hook contributions; tests updated)
Directive: do NOT add backend_path / htp_* / disable_cpu_ep_fallback
back as defaults. They must remain pure pass-through from extra_provider_options.
WinMLSession.compile() takes one of two paths:
  1. Explicit-EP: when self._ep is set -> add_provider_for_devices([dev], opts)
     This passes self._provider_options through to ORT.
  2. Policy-based: when self._ep is None -> set_provider_selection_policy(policy)
     This ignores self._provider_options entirely.

When op-tracing is engaged via QNNMonitor, the monitor contributes
profiling_level + profiling_file_path through get_provider_options(). But
WinMLAutoModel constructs WinMLSession without an explicit ep, so self._ep
is None -> path 2 is taken -> profiling options are silently dropped ->
no profiling CSV is produced even though the model runs on QNN.

This is the third regression of the same root pattern in this PR:
"monitor contributes config the session doesn't consume."

Fix: introduce EPMonitor.ep_name ClassVar (default None). When the monitor
declares an ep_name AND self._ep is None, perf() pins self._ep for the
duration of the perf() context (saved in the same try/finally chain that
already restores _provider_options).

QNNMonitor.ep_name = "qnn"; NullEPMonitor / VitisAIMonitor / OpenVinoMonitor
inherit None and behave unchanged.

Verified empirically: facebook/convnext-base-224 with --op-tracing basic
now produces a 1 MB profiling CSV and a populated op_trace.json with
per-operator cycle counts (LayerNormalization, Gelu, Conv, etc.) under
onnxruntime-windowsml.

Constraint: ORT's set_provider_selection_policy and add_provider_for_devices
are mutually exclusive; only the latter consumes provider options.
Constraint: WinMLAutoModel constructs WinMLSession without ep, so the
monitor must inform the session at perf-time (not at construction).
Confidence: high
Scope-risk: narrow (additive ClassVar; restoration in existing teardown chain)
Directive: when adding new EPMonitor subclasses that need their own provider
options to take effect, set ep_name to the corresponding short EP name.
…it value

Op-tracing produces a usable per-operator trace from a single inference;
running the normal default of 100 iterations just inflates the profiling
CSV proportionally (operators are averaged across iterations regardless).
The wall-clock cost of the extra 99 iterations is non-trivial on QNN
(~3 ms each on convnext-base-224 -> ~300 ms saved) and the larger CSV
slows the post-hoc parse step too.

When --op-tracing is set AND the user did not explicitly pass --iterations,
collapse to 1. Detection uses click.Context.get_parameter_source so the
help text still shows [default: 100] (which is what stands without
op-tracing) and a user explicitly passing --iterations 100 alongside
--op-tracing is honored.

Warmup is intentionally NOT smart-defaulted: it serves a distinct purpose
(letting QNN HTP warm its compiled-binary cache before measurement) and
that need persists with op-tracing. Default warmup=10 + iterations=1 =
11 total inferences, all profiled.

Verified empirically: facebook/convnext-base-224 --op-tracing basic
without --iterations now produces a 450KB CSV (was 1MB at 20 iter / 5MB
at 100 iter) with the same op-level cycle data.

Constraint: must distinguish "user didn't specify" from "user passed the
default value explicitly" — handled via click.ParameterSource.DEFAULT.
Confidence: high
Scope-risk: narrow (CLI-only behaviour change; gated on --op-tracing)
Directive: do NOT add similar smart defaults for --warmup; warmup
serves a different purpose (cache stabilisation) that op-tracing needs too.
@tezheng
Copy link
Copy Markdown
Collaborator Author

tezheng commented Apr 28, 2026

C:/Program Files/Git/azp run

Copy link
Copy Markdown
Collaborator

@DingmaomaoBJTU DingmaomaoBJTU left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Code Review: refactor(session): unify op-tracing under EPMonitor

Good architectural refactoring. The EPMonitor ABC pattern with hook-based plugin integration into WinMLSession.perf() is clean and extensible. The teardown ordering invariant (session reset -> gc.collect() -> monitor.exit) is well-documented and correctly implemented.

Strengths:

  • Clear separation of concerns: monitors contribute options via hooks, session owns ORT lifecycle
  • NullObject pattern eliminates null checks in benchmark loop
  • Comprehensive status/error reporting via TraceStatus enum
  • Well-documented class vars with init_subclass validation

Concerns (see inline comments):

  • Potential resource leak in QNNMonitor temp directory lifecycle
  • Race condition window in _find_schematic CWD fallback
  • Missing retry robustness for Windows file-handle lag
  • Large scope: 23k+ additions including docs/design artifacts that should likely be separate PRs

Copy link
Copy Markdown
Collaborator

@DingmaomaoBJTU DingmaomaoBJTU left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Inline findings (could not attach as line comments due to diff resolution):

src/winml/modelkit/session/monitor/qnn_monitor.py L108 - Resource Leak: \ empfile.mkdtemp\ creates a directory that is explicitly never cleaned up. In CI or automated benchmarking this could accumulate unbounded temp dirs. Consider adding a \cleanup()\ method or emitting a \logger.info\ with the temp path.

src/winml/modelkit/session/session.py L89 - Hardcoded fallback: \�uto\ maps to \PREFER_NPU\ which assumes Qualcomm hardware. This couples the generic session class to a specific vendor. Consider adding a comment explaining why or making it configurable.

src/winml/modelkit/session/session.py L664 - Teardown ordering: The nested try/finally is correct but subtle. Suggest adding a brief inline comment (e.g. # C-2 invariant: teardown before parse) to help future maintainers understand why this order matters.

src/winml/modelkit/session/monitor/qnn_monitor.py L243 - Fragile retry: The 50ms sleep for Windows file-handle flush lag is a fixed heuristic. Consider exponential backoff (50ms, 100ms, 200ms) or documenting the NVMe/SSD assumption.

src/winml/modelkit/session/monitor/qnn_monitor.py L277 - TOCTOU concern: The mtime comparison in _find_schematic\ CWD fallback has a race window in CI with parallel runs. Consider logging both timestamps when the fallback triggers.

src/winml/modelkit/session/monitor/ep_monitor.py L65 - Incomplete validation: _init_subclass_\ catches class-level shadowing but instance-level shadowing bypasses this. Since this flag governs the teardown invariant, consider _slots_\ or _setattr_\ guard.

src/winml/modelkit/session/monitor/qnn/csv_parser.py L59 - Silent data loss risk: \csv.DictReader\ silently handles schema changes. Consider validating the header row against expected column names.

src/winml/modelkit/session/monitor/qnn/viewer.py L140 - nit: Consider adding \ imeout=120\ to \subprocess.run\ to prevent indefinite hangs.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants