
OPENNLP-1850: UAX #29 word tokenizer — WordSegmenter, WordTokenizer, WordType (2a/7)#1110

Draft
krickert wants to merge 3 commits into OPENNLP-1850-1b-alignment from OPENNLP-1850-2a-tokenizer



@krickert

Contributor

Part 2a of the OPENNLP-1850 stack. Splits the former tokenizer PR (#1104) into the UAX #29 tokenizer (this PR), the Term model (2b), and the NormalizationProfile registry (2c), as requested in review.

Self-contained: the Unicode-conformant WordSegmenter/WordTokenizer/WordType/WordToken over the bundled Word_Break and Extended_Pictographic data, the official WordBreakTest.txt conformance suite (1944/1944), and the Unicode data LICENSE/NOTICE/rat-excludes.

WordBreakProperty and ExtendedPictographic load their data lazily and recoverably (double-checked accessor, no classpath resource I/O in a static {} block), per the same review point raised on the foundation. A resource the loader cannot see therefore surfaces as a catchable exception at call time, not as a class-poisoning ExceptionInInitializerError.
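
For reviewers unfamiliar with the pattern, here is a minimal sketch of a recoverable double-checked accessor. It is not the code in this PR: the class name `PropertyTable`, the resource path, and the `byte[]` payload are all hypothetical stand-ins. The point it illustrates is that a failed load caches nothing, so every call fails with an ordinary exception and the class itself stays usable.

```java
import java.io.IOException;
import java.io.InputStream;
import java.io.UncheckedIOException;

/** Hypothetical loader sketch; names and resource path are illustrative only. */
final class PropertyTable {
  // Hypothetical classpath location, deliberately not a real bundled file.
  static final String RESOURCE = "/hypothetical/unicode/WordBreakProperty.txt";

  // Stays null until a load succeeds; volatile makes the publication safe.
  private static volatile byte[] table;

  private PropertyTable() {}

  /** Double-checked accessor: loads on first use, retries if a prior load failed. */
  static byte[] table() {
    byte[] local = table;
    if (local == null) {
      synchronized (PropertyTable.class) {
        local = table;
        if (local == null) {
          local = load(RESOURCE); // throws before anything is cached
          table = local;
        }
      }
    }
    return local;
  }

  /** Reads the resource fully, or throws a catchable runtime exception. */
  static byte[] load(String resource) {
    try (InputStream in = PropertyTable.class.getResourceAsStream(resource)) {
      if (in == null) {
        throw new IllegalStateException("Resource not found on classpath: " + resource);
      }
      return in.readAllBytes();
    } catch (IOException e) {
      throw new UncheckedIOException(e);
    }
  }
}
```

Had the same load run inside a `static {}` block, the first failure would be wrapped in ExceptionInInitializerError and every later use of the class would fail with NoClassDefFoundError, with no chance to retry.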

Base: OPENNLP-1850-1b-alignment (#1109). Stack: 1a → 1b → 2a (this) → 2b → 2c → DL → docs.

krickert added 3 commits June 24, 2026 07:45
…rdType (2a)

Splits the former tokenizer PR (#1104) into the UAX #29 tokenizer (this PR), the Term model (2b),
and the NormalizationProfile registry (2c), on review request. Self-contained: the conformant
WordSegmenter/WordTokenizer/WordType/WordToken over the bundled Word_Break and Extended_Pictographic
data (loaded lazily and recoverably via a double-checked accessor, no static-init resource I/O), the
official WordBreakTest conformance suite, and the Unicode data LICENSE/NOTICE/rat-excludes. Builds on
the alignment layer in 1b.

WordBreakProperty.parse threw an opaque StringIndexOutOfBoundsException on a non-comment line with
no ';' (substring(0, -1)), unlike the sibling ExtendedPictographic.parse which guards it. It now
throws IllegalStateException naming the offending line. Exposed parse() package-visibly;
WordBreakPropertyTest proves the red->green. Real Word_Break data still loads (conformance suite).
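
A rough sketch of the guard described above, under assumptions: the class `PropertyLineParser`, the method `parseLine`, and its `String[]` return shape are hypothetical and do not mirror the actual `WordBreakProperty.parse` signature. It only shows the check for a missing ';' before calling `substring`, so a malformed line produces an IllegalStateException that names the line.

```java
/** Hypothetical parser sketch for "range ; Property # comment" data lines. */
final class PropertyLineParser {
  private PropertyLineParser() {}

  /**
   * Returns {range, property} for a data line, or null for a blank or
   * comment-only line. Throws IllegalStateException when a data line has no ';'.
   */
  static String[] parseLine(String line) {
    // Drop the trailing comment, if any, then surrounding whitespace.
    int hash = line.indexOf('#');
    String body = (hash >= 0 ? line.substring(0, hash) : line).trim();
    if (body.isEmpty()) {
      return null;
    }
    int semi = body.indexOf(';');
    if (semi < 0) {
      // Without this guard, body.substring(0, semi) would be substring(0, -1)
      // and fail with StringIndexOutOfBoundsException.
      throw new IllegalStateException("Malformed line, missing ';': " + line);
    }
    return new String[] {
      body.substring(0, semi).trim(),
      body.substring(semi + 1).trim()
    };
  }
}
```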
… shared Lazy holder

Both loaders' hand-rolled double-checked lazy accessor is replaced by the shared opennlp.tools.util.Lazy
(introduced with Confusables in the engine), so the recoverable lazy-load pattern lives in one place
instead of three near-identical copies. Behavior-preserving; conformance + loader tests green.
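
The source does not show the API of `opennlp.tools.util.Lazy`, so the following is only a guess at what a shared recoverable holder of this kind might look like. The class name `LazySketch` and its `get()` method are hypothetical. It illustrates the idea of moving the double-checked logic into one reusable type: a failed load is not cached, and a successful one is computed once.

```java
import java.util.Objects;
import java.util.function.Supplier;

/** Hypothetical shared lazy holder; not the actual opennlp.tools.util.Lazy. */
final class LazySketch<T> {
  private final Supplier<? extends T> loader;

  // Null until the loader has returned successfully once.
  private volatile T value;

  LazySketch(Supplier<? extends T> loader) {
    this.loader = Objects.requireNonNull(loader, "loader");
  }

  /** Returns the cached value, loading it on first use; failures are not cached. */
  T get() {
    T local = value;
    if (local == null) {
      synchronized (this) {
        local = value;
        if (local == null) {
          // If the loader throws, value stays null and the next call retries.
          local = Objects.requireNonNull(loader.get(), "loader returned null");
          value = local;
        }
      }
    }
    return local;
  }
}
```

With a holder like this, each loader would keep a single field such as `new LazySketch<>(SomeLoader::load)` (a hypothetical call) and drop its own copy of the synchronization code.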
@krickert krickert force-pushed the OPENNLP-1850-1b-alignment branch from 9af6d92 to 9dc7d51 Compare June 24, 2026 11:54
@krickert krickert force-pushed the OPENNLP-1850-2a-tokenizer branch from dc02b9e to dd1906d Compare June 24, 2026 11:54

1 participant