[BUG] Fix char-level latency crash when hypothesis has trailing whitespace #39

Open
sarapapi wants to merge 3 commits into main from fix_space_tokenization_issues

@sarapapi sarapapi commented Jul 3, 2026

Contributor

When computing latency at character level, _tokenize was calling .strip() on final_text before encoding with CJSegmenter. This caused a character count mismatch: ideal_delays is built from the original final_text (one entry per character, including trailing whitespace), but mweralign only received the stripped version. After resegmentation, the total character count across segments was 1 less than the length of ideal_delays, triggering the assertion in _split_delays_by_segmented_text.

The fix trims ideal_delays and computational_aware_delays in score() to exclude the entries corresponding to leading/trailing whitespace stripped by _tokenize, keeping the delay list consistent with what mweralign actually processes.
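The trimming step can be sketched as follows. This is a minimal illustration, not the code in this PR: the helper name `trim_delays_to_stripped_text` is hypothetical, and it assumes the delay list holds exactly one entry per character of `final_text`.

```python
from typing import List, Tuple


def trim_delays_to_stripped_text(
    final_text: str, delays: List[float]
) -> Tuple[str, List[float]]:
    """Drop delay entries for whitespace that str.strip() removes.

    Hypothetical helper: assumes one delay entry per character of final_text.
    """
    assert len(delays) == len(final_text), "expected one delay per character"
    stripped = final_text.strip()
    if not stripped:
        # Nothing survives stripping, so no delays survive either.
        return "", []
    # Number of whitespace characters removed from the left side.
    lead = len(final_text) - len(final_text.lstrip())
    return stripped, delays[lead:lead + len(stripped)]


text = " 你好世界 "
delays = [0.1, 0.2, 0.3, 0.4, 0.5, 0.6]
stripped, trimmed = trim_delays_to_stripped_text(text, delays)
print(len(stripped), len(trimmed))
```

After trimming, the stripped text and the delay list have the same length, which is the invariant the assertion in `_split_delays_by_segmented_text` relies on.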

An empty-hypothesis guard is also added: if the hypothesis is empty (or reduces to empty after stripping), mweralign is skipped and each reference sentence is matched to an empty output with no delays. This is handled gracefully downstream by _do_score, which already skips sentences with empty delay lists.
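A rough sketch of such a guard is shown below. The function `resegment_or_skip` and the `resegment` callable are hypothetical stand-ins (the latter for the mweralign-based resegmentation); the actual signatures in the repository may differ.

```python
from typing import Callable, List, Tuple

# Hypothetical type: (hypothesis, references, delays) -> one (segment, delays) pair per reference.
Resegmenter = Callable[[str, List[str], List[float]], List[Tuple[str, List[float]]]]


def resegment_or_skip(
    hypothesis: str,
    references: List[str],
    delays: List[float],
    resegment: Resegmenter,
) -> List[Tuple[str, List[float]]]:
    """Skip alignment when the hypothesis is empty after stripping."""
    if not hypothesis.strip():
        # Pair every reference sentence with an empty output and no delays.
        return [("", []) for _ in references]
    return resegment(hypothesis, references, delays)


def failing_resegmenter(hyp, refs, dl):
    # Stand-in that must never be reached for an empty hypothesis.
    raise AssertionError("resegmenter should not be called")


result = resegment_or_skip("  ", ["ref one", "ref two"], [0.1, 0.2], failing_resegmenter)
print(result)
```

Returning empty delay lists rather than raising keeps the output shape stable (one entry per reference), so a downstream scorer that skips sentences with no delays needs no special case.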

@sarapapi sarapapi requested a review from mgaido91 July 3, 2026 08:57
@sarapapi sarapapi self-assigned this Jul 3, 2026
@sarapapi sarapapi added the bug Something isn't working label Jul 3, 2026
Comment thread simulstream/metrics/scorers/latency/mwersegmenter.py
@sarapapi

sarapapi commented Jul 3, 2026

Contributor Author

Additional info: I tested this on the same logs that were failing before, and it now works.

@mgaido91

mgaido91 commented Jul 3, 2026

Contributor

cc @owaski do you have a better idea for how to handle this? It is probably an edge case, since it only concerns spaces at the end of the hypothesis.

@owaski

owaski commented Jul 3, 2026

Contributor

> cc @owaski do you have a better idea for how to handle this? It is probably an edge case, since it only concerns spaces at the end of the hypothesis.

I think this fix is fine, as trailing whitespace in CJK text is meaningless.

3 participants