Skip to content

fix(acceptance): CJK phrase boundaries and decode stop, so Japanese accepts advance clause by clause#669

Merged
FuJacob merged 3 commits into
mainfrom
fix/cjk-phrase-boundaries
Jun 11, 2026
Merged

fix(acceptance): CJK phrase boundaries and decode stop, so Japanese accepts advance clause by clause#669
FuJacob merged 3 commits into
mainfrom
fix/cjk-phrase-boundaries

Conversation

@FuJacob

@FuJacob FuJacob commented Jun 11, 2026

Copy link
Copy Markdown
Owner

Summary

A Japanese phrase accept could arrive as one giant chunk (reported tail: 理解し、その内容を自分の言葉で表現する。 in a single Tab) because every layer only knew ASCII punctuation: phrase mode never stopped at or , a punctuation-led chunk skipped ICU segmentation and swallowed everything to the next whitespace, and SentenceBoundaryClassifier never registered a CJK sentence end so the decode stop policy let generations run to the full token budget. Phrase boundaries now include the CJK terminators (。!?。) and clause commas (、,、), word chunks bind trailing CJK punctuation to their word (読み、 is one Tab), punctuation-led runs peel as their own chunk (including opening brackets, so flat quoted runs like 「分かった」と言った no longer swallow whole), and generation stops at the end of a Japanese sentence like it does for English. All added codepoints occur only in CJK text and ASCII ,/brackets stay non-boundaries, so space-delimited scripts are byte-for-byte unchanged.

Validation

  • xcodebuild build-for-testing ... CODE_SIGNING_ALLOWED=NO then test-without-building for the full unit bundle: ** TEST SUCCEEDED **, 993 tests, 0 failures (4 pre-existing skips). New cases cover the reported clause (理解し、 stops at the comma), // stops, the 終わり。」 closer walk-back, the leading-punctuation cliff, the opening-bracket peel (「分かった」と言った), mixed closer-opener runs, halfwidth 、 。 」 parity, auto-accept-off re-peeling, and ASCII comma/bracket non-boundaries for English.
  • swiftlint lint --quiet on all changed files: exit 0, no findings.
  • Not yet validated end-to-end on device with a live Japanese session; the chunking rules are pure Support/ logic locked by the unit tests above.

Linked issues

None filed; follow-up to the Japanese IME report behind #668 (same user feedback thread, separate root cause).

Risk / rollout notes

  • Behavior change for CJK phrase-granularity users: Tab now advances clause by clause () and stops at , instead of accepting an entire space-less sentence per press. Word granularity gains the punctuation binding (資料、 is one chunk), loses the punctuation cliff, and peels opening brackets as their own chunk.
  • Generation for CJK text now early-stops at sentence ends via the existing decode stop policy, so Japanese/Chinese suggestions get shorter and cheaper at the source; English generation is unchanged.
  • Same pre-existing CI caveat as fix(insertion): IME-safe accept so Japanese/CJK suggestions land on Tab #668: main requires a cotabbyinference revision that may not be resolvable by CI until the package push lands.

Greptile Summary

This PR fixes CJK (Japanese/Chinese) text acceptance in phrase mode by threading phrase-boundary awareness through three layers: the decode stop policy, the phrase accumulator, and the word-chunk extractor. Before this change, a Japanese phrase could arrive as one giant Tab because every layer only knew ASCII punctuation.

  • SentenceBoundaryClassifier gains CJK sentence terminators (。!?。) so the decode stop policy fires at the end of a Japanese sentence instead of running to the token budget, and the closing-punctuation walk-back now skips CJK brackets so 終わり。」 registers as a sentence end.
  • SuggestionSessionReconciler adds endOfCJKPunctuationRun to bind trailing to the preceding word (読み、 is one chunk), a punctuation-led peel so opening brackets like don't swallow flat quoted runs, and endsAtPhraseBoundary replaces endsInSentenceTerminator to stop the phrase accumulator at ideographic commas as well as terminators.
  • CJK primitives are consolidated into a single Character extension so a future codepoint addition updates all three policies in one edit.

Confidence Score: 5/5

Safe to merge; all changes are additive, CJK-only string-parsing logic with no impact on ASCII/space-delimited text, validated by 993 passing unit tests.

All changes are additive CJK-only string-parsing logic, ASCII/space-delimited paths are byte-for-byte unchanged and covered by existing tests, and every new code path has dedicated unit tests including edge cases and policy-flag interactions.

No files require special attention beyond the minor doc-comment omission in SuggestionEngineModels.swift.

Important Files Changed

Filename Overview
Cotabby/Support/SuggestionSessionReconciler.swift Core chunking and phrase-accumulation logic updated; adds endOfCJKPunctuationRun, trailing-binding, punctuation-led peel, and endsAtPhraseBoundary. Interaction between CJK binding and autoAcceptTrailingPunctuation=false is intentional and now tested.
Cotabby/Support/SentenceBoundaryClassifier.swift Adds CJK sentence terminators to the decode-stop switch and extends isSentenceClosingPunctuation to include CJK closers; both changes are additive and well-scoped, with full test coverage.
Cotabby/Models/SuggestionEngineModels.swift Doc-comment update only; no logic change.
CotabbyTests/SentenceBoundaryClassifierTests.swift Adds 5 new tests for CJK terminators, closer walk-back, halfwidth parity, and ideographic comma non-termination; all appear correct.
CotabbyTests/SuggestionSessionReconcilerTests.swift Adds 17 new tests covering all new chunking rules, phrase-boundary stops, halfwidth parity, ASCII non-regression, and autoAcceptTrailingPunctuation interactions; coverage looks thorough.

Fix All in Codex Fix All in Claude Code

Reviews (3): Last reviewed commit: "Address Greptile review on #669: single-..." | Re-trigger Greptile

Comment thread Cotabby/Support/SuggestionSessionReconciler.swift
FuJacob added 3 commits June 11, 2026 07:15
A Japanese phrase accept arrived as one giant chunk: phrase mode only knew
the ASCII terminators (. ! ?), so the ideographic full stop never ended a
phrase and the clause comma was no boundary at all, and the same ASCII-only
assumption in SentenceBoundaryClassifier meant the decode stop policy never
fired for CJK text, so generations always ran to the token budget. A flat
Japanese tail also had a punctuation cliff: a chunk that starts with CJK
punctuation skips ICU word segmentation (punctuation does not begin a
space-less-script word) and swallowed everything up to the next whitespace
in a single accept, in word mode too.

Phrase boundaries now include the CJK sentence terminators and treat the
ideographic and fullwidth commas as clause stops, so Tab advances clause by
clause the way Japanese prose reads. Word chunking binds a trailing CJK
punctuation run to the word it follows and peels a punctuation-led run as
its own chunk, removing the cliff. The classifier recognizes the CJK
terminators (which, unlike the ASCII period, are unambiguous) and the CJK
closing brackets for its walk-back, so generation stops at the end of a
Japanese sentence like it does for English. All added codepoints occur only
in CJK text and ASCII "," stays a non-boundary, so space-delimited scripts
are byte-for-byte unchanged.
…unctuation

Auditing the CJK chunking surfaced one remaining cliff: a chunk starting at a
CJK opening bracket neither begins a space-less-script word nor binds to the
preceding one, so a flat quoted run like the tail of 彼は「分かった」と言った
was still swallowed whole by the whitespace scan. The punctuation-led peel now
takes opening brackets too (the trailing binding still stops before them,
since an opener belongs to the next word), and the halfwidth kana forms 、 」 「
join their fullwidth counterparts in the clause, closer, and opener sets, with
the halfwidth corner also added to the classifier's closer walk.

ASCII brackets and quotes are untouched by the peel (the sets stay CJK-only),
locked in by regression tests alongside the opener, mixed-run, and halfwidth
cases. Full unit bundle: 993 tests, 0 failures.
The CJK terminator and closer codepoint lists were restated in the
reconciler's phrase policy and again in SentenceBoundaryClassifier's closer
walk, so adding a codepoint required parallel edits with no compiler
enforcement. The two primitive sets are now one internal Character extension
in the reconciler file, and both the phrase policy and the classifier compose
from them.
@FuJacob

FuJacob commented Jun 11, 2026

Copy link
Copy Markdown
Owner Author

Addressed the review finding: the CJK terminator and closer sets are now declared once (internal Character extension) and composed by both the phrase policy and SentenceBoundaryClassifier, so a future codepoint is a single edit. Local validation on the rebased branch (top of merged #668): xcodegen no drift, swiftlint clean, build-for-testing succeeded, full unit bundle 1001 tests / 0 failures.

@FuJacob FuJacob force-pushed the fix/cjk-phrase-boundaries branch from 472af39 to 2ab9bb3 Compare June 11, 2026 14:38
@FuJacob FuJacob merged commit d36b29f into main Jun 11, 2026
@FuJacob FuJacob deleted the fix/cjk-phrase-boundaries branch June 11, 2026 14:38
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant