Skip to content

update eval scripts: add ONNX size tracking and output sanitization#755

Open
DingmaomaoBJTU wants to merge 8 commits into
mainfrom
qiowu/update-eval-scripts
Open

update eval scripts: add ONNX size tracking and output sanitization#755
DingmaomaoBJTU wants to merge 8 commits into
mainfrom
qiowu/update-eval-scripts

Conversation

@DingmaomaoBJTU
Copy link
Copy Markdown
Collaborator

Summary

  • Add _compute_onnx_size() to measure combined ONNX + .data file sizes and include onnx_size_bytes in eval results
  • Add _sanitize_output() to strip CLI chrome (Rich tables, device/IO banners) from eval_result.json, keeping only error-relevant content
  • Minor formatting fixes in reporter.py

@DingmaomaoBJTU DingmaomaoBJTU requested a review from a team as a code owner May 26, 2026 09:36
Comment thread scripts/e2e_eval/run_eval.py Fixed
Copy link
Copy Markdown
Collaborator Author

@DingmaomaoBJTU DingmaomaoBJTU left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nice additions - ONNX size tracking and output sanitization are useful for keeping eval_result.json lean. A few observations below.

Comment thread scripts/e2e_eval/run_eval.py
Comment thread scripts/e2e_eval/run_eval.py
Comment thread scripts/e2e_eval/utils/reporter.py
Comment thread scripts/e2e_eval/run_eval.py
Copy link
Copy Markdown
Collaborator Author

@DingmaomaoBJTU DingmaomaoBJTU left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Useful additions - ONNX size tracking and output sanitization will make eval results much cleaner. A few suggestions below.

Comment thread scripts/e2e_eval/run_eval.py Outdated
Comment thread scripts/e2e_eval/run_eval.py
Comment thread scripts/e2e_eval/run_eval.py
Comment thread scripts/e2e_eval/utils/reporter.py
- Add _compute_onnx_size() to measure combined ONNX + .data file sizes
- Add _sanitize_output() to strip CLI chrome (Rich tables, banners) from
  eval_result.json, keeping only error-relevant content
- Pass onnx_size_bytes and sanitize_fn through to build_eval_result()
- Minor formatting fixes in reporter.py
Capture hardware details (devices, EPs, backends) by running
`winml sys --format json` and embedding the result under the
`winml_sys` key in environment.json.
The sanitize_fn strips perf metric lines (latency, throughput, etc.)
from stdout/stderr. Store the original output in raw_stdout/raw_stderr
fields so downstream consumers can still access the full perf data.
@DingmaomaoBJTU DingmaomaoBJTU force-pushed the qiowu/update-eval-scripts branch 3 times, most recently from 52fa381 to a7b8a0c Compare June 2, 2026 13:07
- Fix displaced docstring in generate_html_report (was after import)
- Anchor _sanitize_output patterns to line start to avoid stripping
  legitimate error messages containing pattern substrings
- Use idiomatic pathlib for .data companion file construction
@DingmaomaoBJTU DingmaomaoBJTU force-pushed the qiowu/update-eval-scripts branch from a7b8a0c to b3e0cf9 Compare June 2, 2026 13:08
# Lines that carry no diagnostic value in eval_result.json.
# Matching is case-insensitive, applied per-line.
_NOISE_PATTERNS = (
"benchmarking onnx",
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

a little strange.. any way we add a parameter in eval command to just drop them?

)

# Box-drawing characters used by Rich tables.
_BOX_CHARS = frozenset("─│┌┐└┘├┤┬┴┼")
Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

_BOX_CHARS only covers Unicode's LIGHT box-drawing style (─│┌┐└┘├┤┬┴┼), but winml perf uses Rich's default Table(), which renders with box.HEAVY_HEAD. I rendered a default Rich table locally and 3 of the 5 lines start with chars not in this set:

Row First char In _BOX_CHARS?
top border ┏━━━━━┳━━━━━┓ (U+250F)
header row ┃ Avg ┃ P90 ┃ (U+2503)
head separator ┡━━━━━╇━━━━━┩ (U+2521)
data row │ 12.5 │ 15.2 │
bottom border └─────┴─────┘

Net result: eval_result.json keeps the top border + header text + head separator while stripping data rows and the bottom border — arguably uglier than no sanitization at all (orphaned half-table chrome).

Cheap fix — use a Unicode block range check instead of a hand-curated set:

if 0x2500 <= ord(stripped[0]) <= 0x257F:  # Unicode "Box Drawing" block
    continue

That covers all four Rich styles (light, heavy, double, rounded) in one rule and won''t silently drift the next time someone changes the table style.

🤖 Generated with GitHub Copilot CLI

low = stripped.lower()
if any(low.startswith(pat) for pat in _NOISE_PATTERNS):
continue
kept.append(stripped)
Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Appending stripped (post-.strip()) discards leading indentation, which destroys the structure of multi-line errors — the very content the docstring promises to preserve ("All classifier patterns are error-related and survive this filter"). For example, a traceback:

  File "foo.py", line 5, in bar
    raise RuntimeError("x")

becomes a visually-flat:

File "foo.py", line 5, in bar
raise RuntimeError("x")

which is noticeably harder to read.

Suggested fix — keep stripped only for the box/noise classifier checks, then append the original line (lightly trimmed):

if not stripped:
    continue
if 0x2500 <= ord(stripped[0]) <= 0x257F:
    continue
low = stripped.lower()
if any(low.startswith(pat) for pat in _NOISE_PATTERNS):
    continue
kept.append(line.rstrip())

🤖 Generated with GitHub Copilot CLI

"latency (ms)",
"throughput:",
"results saved to",
"inputs:",
Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

A couple of patterns in _NOISE_PATTERNS can swallow legitimate diagnostic content with the current low.startswith(pat) matching:

  • "inputs:" / "outputs:" — these silently strip lines like Inputs: expected (1, 224, 224, 3), got (1, 3, 224, 224) (a real shape-mismatch error), which is exactly the kind of "error-relevant" content a sanitizer is supposed to keep. Same concern for "device:" if a downstream error ever reads something like Device: <name> is not available, falling back to CPU.
  • "samples/sec" — dead pattern under startswith semantics. Throughput: 80 samples/sec is already covered by "throughput:" above; no winml perf line literally starts with samples/sec.

Tightening options (cheapest first):

  1. Just drop inputs: / outputs: / samples/sec. The remaining patterns are unambiguous CLI chrome.
  2. Switch to exact-prefix-with-boundary: only strip when the line is pat followed by space or end, e.g. low == pat or low.startswith(pat + " "), so error lines that happen to start with Inputs: but carry non-trivial content survive.

🤖 Generated with GitHub Copilot CLI

- Improve _compute_onnx_size to parse ONNX proto for all external data
  files instead of only checking the conventional .data suffix
- Add debug logging when winml sys times out or fails (replaces bare pass)
- Add --raw-output flag to skip output sanitization in eval_result.json
Replaces per-line linear scan with a pre-compiled regex for better
performance on large outputs.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

4 participants