update eval scripts: add ONNX size tracking and output sanitization by DingmaomaoBJTU · Pull Request #755 · microsoft/winml-cli

DingmaomaoBJTU · 2026-05-26T09:36:34Z

Summary

Add _compute_onnx_size() to measure combined ONNX + .data file sizes and include onnx_size_bytes in eval results
Add _sanitize_output() to strip CLI chrome (Rich tables, device/IO banners) from eval_result.json, keeping only error-relevant content
Minor formatting fixes in reporter.py
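
For readers who want a concrete picture of the size measurement, here is a minimal sketch. It assumes the external weights sit next to the model as `<model file name>.data`. The function name `compute_onnx_size` and that naming convention are illustrative assumptions, not the PR's actual code.

```python
from pathlib import Path


def compute_onnx_size(model_path: Path) -> int:
    """Return the size in bytes of an ONNX file plus its companion .data file, if any."""
    total = model_path.stat().st_size
    # Assumption: external weights are stored beside the model as "<name>.data",
    # e.g. model.onnx -> model.onnx.data.
    data_path = model_path.with_name(model_path.name + ".data")
    if data_path.is_file():
        total += data_path.stat().st_size
    return total
```

The returned integer is the kind of value that would be stored under `onnx_size_bytes`.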

DingmaomaoBJTU

Nice additions - ONNX size tracking and output sanitization are useful for keeping eval_result.json lean. A few observations below.

DingmaomaoBJTU

Useful additions - ONNX size tracking and output sanitization will make eval results much cleaner. A few suggestions below.

- Add _compute_onnx_size() to measure combined ONNX + .data file sizes
- Add _sanitize_output() to strip CLI chrome (Rich tables, banners) from eval_result.json, keeping only error-relevant content
- Pass onnx_size_bytes and sanitize_fn through to build_eval_result()
- Minor formatting fixes in reporter.py

xieofxie · 2026-06-04T02:20:26Z

+# Lines that carry no diagnostic value in eval_result.json.
+# Matching is case-insensitive, applied per-line.
+_NOISE_PATTERNS = (
+    "benchmarking onnx",


This is a little strange. Could we add a parameter to the eval command to just drop them?

timenick · 2026-06-04T02:26:21Z

+)
+
+# Box-drawing characters used by Rich tables.
+_BOX_CHARS = frozenset("─│┌┐└┘├┤┬┴┼")


_BOX_CHARS only covers Unicode's LIGHT box-drawing style (─│┌┐└┘├┤┬┴┼), but winml perf uses Rich's default Table(), which renders with box.HEAVY_HEAD. I rendered a default Rich table locally and 3 of the 5 lines start with chars not in this set:

| Row | First char | In `_BOX_CHARS`? |
| --- | --- | --- |
| top border `┏━━━━━┳━━━━━┓` | `┏` (U+250F) | ❌ |
| header row `┃ Avg ┃ P90 ┃` | `┃` (U+2503) | ❌ |
| head separator `┡━━━━━╇━━━━━┩` | `┡` (U+2521) | ❌ |
| data row `│ 12.5 │ 15.2 │` | `│` | ✅ |
| bottom border `└─────┴─────┘` | `└` | ✅ |

Net result: eval_result.json keeps the top border + header text + head separator while stripping data rows and the bottom border — arguably uglier than no sanitization at all (orphaned half-table chrome).

Cheap fix — use a Unicode block range check instead of a hand-curated set:

    if 0x2500 <= ord(stripped[0]) <= 0x257F:  # Unicode "Box Drawing" block
        continue

That covers all four Rich styles (light, heavy, double, rounded) in one rule and won't silently drift the next time someone changes the table style.
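
The range check can be exercised in isolation. The helper below is a small sketch of it. The name `is_box_drawing_line` is hypothetical and does not come from the PR.

```python
def is_box_drawing_line(line: str) -> bool:
    """True when the first non-blank character is in the Unicode Box Drawing block."""
    stripped = line.strip()
    if not stripped:
        return False
    # U+2500..U+257F covers light, heavy, double, and rounded box characters.
    return 0x2500 <= ord(stripped[0]) <= 0x257F
```

Every row of the heavy-head table above, including the top border and header row, is caught by this single rule, while ordinary error text is left alone.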

🤖 Generated with GitHub Copilot CLI

timenick · 2026-06-04T02:26:21Z

+        low = stripped.lower()
+        if any(low.startswith(pat) for pat in _NOISE_PATTERNS):
+            continue
+        kept.append(stripped)


Appending stripped (post-.strip()) discards leading indentation, which destroys the structure of multi-line errors — the very content the docstring promises to preserve ("All classifier patterns are error-related and survive this filter"). For example, a traceback:

      File "foo.py", line 5, in bar
        raise RuntimeError("x")

becomes a visually-flat:

    File "foo.py", line 5, in bar
    raise RuntimeError("x")

which is noticeably harder to read.

Suggested fix — keep stripped only for the box/noise classifier checks, then append the original line (lightly trimmed):

    if not stripped:
        continue
    if 0x2500 <= ord(stripped[0]) <= 0x257F:
        continue
    low = stripped.lower()
    if any(low.startswith(pat) for pat in _NOISE_PATTERNS):
        continue
    kept.append(line.rstrip())
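
To see the effect on a traceback, the suggested loop can be wrapped in a complete function. This is a sketch only. The function name `sanitize_output` is illustrative, and `_NOISE_PATTERNS` here holds just a subset of the patterns quoted in this review.

```python
# Subset of the patterns quoted in the review; the real tuple in the PR is longer.
_NOISE_PATTERNS = (
    "benchmarking onnx",
    "latency (ms)",
    "throughput:",
    "results saved to",
)


def sanitize_output(text: str) -> str:
    """Drop table chrome and perf noise while keeping the original indentation."""
    kept = []
    for line in text.splitlines():
        stripped = line.strip()
        if not stripped:
            continue
        # Classify on the stripped text only.
        if 0x2500 <= ord(stripped[0]) <= 0x257F:
            continue
        low = stripped.lower()
        if any(low.startswith(pat) for pat in _NOISE_PATTERNS):
            continue
        # Append the original line so nested traceback frames stay indented.
        kept.append(line.rstrip())
    return "\n".join(kept)
```

Because classification uses `stripped` but the stored value is `line.rstrip()`, the `File ...` and `raise ...` lines keep their relative indentation.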

🤖 Generated with GitHub Copilot CLI

timenick · 2026-06-04T02:26:21Z

+    "latency (ms)",
+    "throughput:",
+    "results saved to",
+    "inputs:",


A couple of patterns in _NOISE_PATTERNS can swallow legitimate diagnostic content with the current low.startswith(pat) matching:

"inputs:" / "outputs:" — these silently strip lines like Inputs: expected (1, 224, 224, 3), got (1, 3, 224, 224) (a real shape-mismatch error), which is exactly the kind of "error-relevant" content a sanitizer is supposed to keep. Same concern for "device:" if a downstream error ever reads something like Device: <name> is not available, falling back to CPU.

"samples/sec" — dead pattern under startswith semantics. Throughput: 80 samples/sec is already covered by "throughput:" above; no winml perf line literally starts with samples/sec.

Tightening options (cheapest first):

1. Just drop inputs: / outputs: / samples/sec. The remaining patterns are unambiguous CLI chrome.

2. Switch to exact-prefix-with-boundary: only strip when the line is pat followed by space or end, e.g. low == pat or low.startswith(pat + " "), so error lines that happen to start with Inputs: but carry non-trivial content survive.
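
One caveat is worth checking before adopting the boundary rule. As literally written, `low.startswith(pat + " ")` still matches `Inputs: expected ...`, because a space follows the prefix. The sketch below contrasts the current prefix match, the boundary rule, and a stricter whole-line match. All three function names are hypothetical, and the whole-line variant is an assumed alternative, not something the PR implements.

```python
from typing import Sequence


def matches_prefix(line: str, patterns: Sequence[str]) -> bool:
    """Current behaviour: any line starting with a pattern counts as noise."""
    low = line.strip().lower()
    return any(low.startswith(pat) for pat in patterns)


def matches_boundary(line: str, patterns: Sequence[str]) -> bool:
    """Boundary rule from the review: the pattern alone, or followed by a space."""
    low = line.strip().lower()
    return any(low == pat or low.startswith(pat + " ") for pat in patterns)


def matches_whole_line(line: str, patterns: Sequence[str]) -> bool:
    """Assumed stricter variant: only a line that is exactly the pattern is noise."""
    return line.strip().lower() in patterns
```

With `patterns = ("inputs:",)`, only the whole-line variant keeps the shape-mismatch message, while still stripping a bare `Inputs:` banner.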

🤖 Generated with GitHub Copilot CLI

- Improve _compute_onnx_size to parse ONNX proto for all external data files instead of only checking the conventional .data suffix
- Add debug logging when winml sys times out or fails (replaces bare pass)
- Add --raw-output flag to skip output sanitization in eval_result.json
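
The `--raw-output` flag could be wired roughly as follows. This is a sketch under assumptions: only the flag name comes from the commit message, while `build_parser` and `pick_sanitizer` are hypothetical helpers and the real script's argument parsing may look different.

```python
import argparse
from typing import Callable, Optional


def build_parser() -> argparse.ArgumentParser:
    """Illustrative parser exposing only the flag discussed here."""
    parser = argparse.ArgumentParser(description="eval runner (illustrative)")
    parser.add_argument(
        "--raw-output",
        action="store_true",
        help="skip output sanitization in eval_result.json",
    )
    return parser


def pick_sanitizer(
    args: argparse.Namespace,
    sanitize_fn: Callable[[str], str],
) -> Optional[Callable[[str], str]]:
    """Return None when raw output was requested, otherwise the sanitizer."""
    return None if args.raw_output else sanitize_fn
```

Passing `None` downstream lets the result builder treat "no sanitizer" as "store the output unchanged".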

DingmaomaoBJTU requested a review from a team as a code owner May 26, 2026 09:36

github-advanced-security AI found potential problems May 26, 2026

Comment thread scripts/e2e_eval/run_eval.py Fixed

DingmaomaoBJTU commented May 29, 2026

Comment thread scripts/e2e_eval/run_eval.py

Comment thread scripts/e2e_eval/run_eval.py

Comment thread scripts/e2e_eval/utils/reporter.py

Comment thread scripts/e2e_eval/run_eval.py

DingmaomaoBJTU commented May 30, 2026

Comment thread scripts/e2e_eval/run_eval.py Outdated

Comment thread scripts/e2e_eval/run_eval.py

Comment thread scripts/e2e_eval/run_eval.py

Comment thread scripts/e2e_eval/utils/reporter.py

DingmaomaoBJTU added 4 commits June 2, 2026 20:40

add winml sys output to environment.json

9c152a0

Capture hardware details (devices, EPs, backends) by running `winml sys --format json` and embedding the result under the `winml_sys` key in environment.json.
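
A minimal sketch of the capture step follows. The command and the `winml_sys` key come from the commit message. The helper names, the 30-second timeout, and the choice to return `None` on failure are assumptions made for illustration.

```python
import json
import logging
import subprocess
from typing import Any, Optional, Sequence

logger = logging.getLogger(__name__)


def capture_sys_info(
    cmd: Sequence[str] = ("winml", "sys", "--format", "json"),
    timeout: float = 30.0,  # assumed value, not taken from the PR
) -> Optional[Any]:
    """Run the command and parse its stdout as JSON; return None on any failure."""
    try:
        proc = subprocess.run(
            list(cmd), capture_output=True, text=True, timeout=timeout, check=True
        )
        return json.loads(proc.stdout)
    except (OSError, subprocess.SubprocessError, json.JSONDecodeError) as exc:
        # Log at debug level instead of swallowing the error silently.
        logger.debug("could not capture system info: %s", exc)
        return None


def add_sys_info(environment: dict, info: Optional[Any]) -> dict:
    """Embed the captured data under the winml_sys key when it is available."""
    if info is not None:
        environment["winml_sys"] = info
    return environment
```

A missing executable, a timeout, a non-zero exit code, and malformed JSON all take the same path, so environment.json is still written without the `winml_sys` key.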

fix: preserve raw perf data before output sanitization

7d00b37

The sanitize_fn strips perf metric lines (latency, throughput, etc.) from stdout/stderr. Store the original output in raw_stdout/raw_stderr fields so downstream consumers can still access the full perf data.
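
The idea can be sketched as below. `build_eval_result`, `sanitize_fn`, `onnx_size_bytes`, `raw_stdout`, and `raw_stderr` are names from this PR, but the signature and the rest of the result layout are assumptions.

```python
from typing import Callable, Optional


def build_eval_result(
    stdout: str,
    stderr: str,
    sanitize_fn: Optional[Callable[[str], str]] = None,
    onnx_size_bytes: Optional[int] = None,
) -> dict:
    """Build a result dict, keeping the unsanitized output alongside the clean one."""
    result = {
        "stdout": stdout,
        "stderr": stderr,
        "onnx_size_bytes": onnx_size_bytes,
    }
    if sanitize_fn is not None:
        # Preserve the originals first so perf lines remain available downstream.
        result["raw_stdout"] = stdout
        result["raw_stderr"] = stderr
        result["stdout"] = sanitize_fn(stdout)
        result["stderr"] = sanitize_fn(stderr)
    return result
```

Consumers that need latency or throughput numbers read the `raw_*` fields, while error classification can work on the sanitized ones.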

fix: add explanatory comment to except clause (CodeQL)

f21a123

DingmaomaoBJTU force-pushed the qiowu/update-eval-scripts branch 3 times, most recently from 52fa381 to a7b8a0c Compare June 2, 2026 13:07

fix: address PR review comments

b3e0cf9

- Fix displaced docstring in generate_html_report (was after import)
- Anchor _sanitize_output patterns to line start to avoid stripping legitimate error messages containing pattern substrings
- Use idiomatic pathlib for .data companion file construction

DingmaomaoBJTU force-pushed the qiowu/update-eval-scripts branch from a7b8a0c to b3e0cf9 Compare June 2, 2026 13:08

Merge branch 'main' into qiowu/update-eval-scripts

137ea51

xieofxie reviewed Jun 4, 2026

timenick reviewed Jun 4, 2026

DingmaomaoBJTU added 2 commits June 4, 2026 10:34

perf: compile noise patterns into a single regex

2652af1

Replaces per-line linear scan with a pre-compiled regex for better performance on large outputs.
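
One way such a compiled pattern might look is sketched here. The tuple is a subset of the patterns quoted in the review, and the names `_NOISE_RE` and `is_noise` are hypothetical.

```python
import re

# Subset of the patterns quoted in the review; the real tuple in the PR is longer.
_NOISE_PATTERNS = (
    "benchmarking onnx",
    "latency (ms)",
    "throughput:",
    "results saved to",
)

# One alternation, anchored at line start by re.match, after optional whitespace.
_NOISE_RE = re.compile(
    r"\s*(?:" + "|".join(re.escape(pat) for pat in _NOISE_PATTERNS) + r")",
    re.IGNORECASE,
)


def is_noise(line: str) -> bool:
    """True when the line starts with one of the noise patterns (case-insensitive)."""
    return _NOISE_RE.match(line) is not None
```

`re.escape` keeps the parentheses in `latency (ms)` literal, and `re.match` only succeeds at the start of the string, so the line-start anchoring from the earlier review fix is preserved.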

DingmaomaoBJTU wants to merge 8 commits into main from qiowu/update-eval-scripts

4 participants
