[BUG] Fix char-level latency crash when hypothesis has trailing whitespace#39
Open
sarapapi wants to merge 3 commits into
Open
[BUG] Fix char-level latency crash when hypothesis has trailing whitespace#39sarapapi wants to merge 3 commits into
sarapapi wants to merge 3 commits into
Conversation
mgaido91
reviewed
Jul 3, 2026
Contributor
Author
|
Additional info: tested on the same logs for which it was failing before, and now it works |
This reverts commit 9358b21.
Contributor
|
cc @owaski do you have better ideas how to handle this? It is probably an extreme case, as this is for spaces at the end of the hypothesis. |
Contributor
I think this fix is fine. As trailing space in CJK is meaningless. |
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
When computing latency at character level,
_tokenizewas calling.strip()onfinal_textbefore encoding withCJSegmenter. This caused a character count mismatch:ideal_delaysis built from the originalfinal_text(one entry per character, including trailing whitespace), butmweralignonly received the stripped version. After resegmentation, the total character count across segments was 1 less than the length ofideal_delays, triggering the assertion in_split_delays_by_segmented_text.The fix trims
ideal_delaysandcomputational_aware_delaysinscore()to exclude the entries corresponding to leading/trailing whitespace stripped by_tokenize, keeping the delay list consistent with whatmweralignactually processes.An empty-hypothesis guard is also added: if the hypothesis is empty (or reduces to empty after stripping),
mweralignis skipped and each reference sentence is matched to an empty output with no delays. This is handled gracefully downstream by_do_score, which already skips sentences with empty delay lists.