Add support for fixed character-level StreamAtt history selection by sarapapi · Pull Request #40 · hlt-mt/simulstream

sarapapi · 2026-07-03T11:05:37Z

FixedWordsTextHistory counts words using the BOW prefix (▁) to identify word boundaries. This works well for space-separated languages but not for character-level languages (e.g., Chinese, Japanese), where BOW markers are sparse. When few tokens carry a BOW prefix, the word counter never reaches history_words, so the text history is never trimmed and the audio buffer grows without bound, which causes AlignAtt to progressively cut all newly generated tokens.
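The failure mode described above can be reproduced with a small, self-contained sketch. It is not the simulstream implementation: `trim_by_bow_words` is a hypothetical helper, and both token sequences are illustrative tokenizations rather than real segmenter output.

```python
BOW = "\u2581"  # the "▁" beginning-of-word marker


def trim_by_bow_words(tokens, history_words):
    """Keep the tokens of the last `history_words` BOW-delimited words.

    Hypothetical helper: scans from the end and counts BOW-prefixed tokens.
    If the count never reaches `history_words`, nothing is trimmed.
    """
    words_seen = 0
    for i in range(len(tokens) - 1, -1, -1):
        if tokens[i].startswith(BOW):
            words_seen += 1
            if words_seen == history_words:
                return tokens[i:]
    return list(tokens)  # threshold never reached: history is left untrimmed


# Illustrative tokenizations (assumed, not produced by a real segmenter).
english = ["▁the", "▁cat", "▁sat", "▁on", "▁the", "▁mat"]
chinese = ["▁", "猫", "坐", "在", "垫", "子", "上"]

print(trim_by_bow_words(english, 2))  # ['▁the', '▁mat']
print(trim_by_bow_words(chinese, 2))  # all 7 tokens: only one BOW marker exists
```

With only one BOW-prefixed token in the CJK sequence, the counter stops at 1 and the whole history is retained, however long it grows.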

This PR extends the FixedWords-style approach to character-level languages by introducing FixedCharsTextHistory, which counts every token as one unit (mirroring how character-level segmenters treat individual CJK characters). A dedicated config file (seamless_streamatt_char.yaml) is provided, which pairs FixedCharsTextHistory with word_level_postprocess: False; this setting is required to prevent _strip_incomplete_words from discarding valid CJK tokens.
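A minimal sketch of the per-token counting idea follows, under the assumption that the history is a plain list of tokens. The function name `trim_by_token_count` and the parameter `history_tokens` are hypothetical and do not reflect the actual FixedCharsTextHistory interface.

```python
def trim_by_token_count(tokens, history_tokens):
    """Keep only the last `history_tokens` tokens, one unit per token.

    Hypothetical helper: no BOW markers are needed, so the history is always
    bounded regardless of how the text is segmented.
    """
    if history_tokens <= 0:
        return []
    return list(tokens[-history_tokens:])


# Illustrative character-level tokens (assumed, not real segmenter output).
chinese = ["猫", "坐", "在", "垫", "子", "上"]

print(trim_by_token_count(chinese, 2))   # ['子', '上']
print(trim_by_token_count(chinese, 10))  # shorter than the limit: unchanged
```

Because every token counts, the retained history can never exceed the configured size, which is the property that keeps the audio buffer bounded.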


mgaido91 · 2026-07-03T16:55:29Z

LGTM apart from one minor comment, thanks.

Add support for FixedWords StreamAtt history selection for character-level languages

c193425

sarapapi requested a review from mgaido91 July 3, 2026 11:05

sarapapi self-assigned this Jul 3, 2026

sarapapi added the enhancement New feature or request label Jul 3, 2026

sarapapi added 2 commits July 3, 2026 13:56

Adding UTs for FixedWords and FixedChars

30e441a

Fix UT

39bd3b4

Address comment

d2a69d9

sarapapi wants to merge 4 commits into main from fix_streamatt_char
