OPENNLP-1850: UAX #29 word tokenizer — WordSegmenter, WordTokenizer, WordType (2a/7) by krickert · Pull Request #1110 · apache/opennlp

krickert · 2026-06-23T15:18:40Z

Part 2a of the OPENNLP-1850 stack. Splits the former tokenizer PR (#1104) into the UAX #29 tokenizer (this PR), the Term model (2b), and the NormalizationProfile registry (2c), as requested in review.

Self-contained: the Unicode-conformant WordSegmenter/WordTokenizer/WordType/WordToken over the bundled Word_Break and Extended_Pictographic data, the official WordBreakTest.txt conformance suite (1944/1944), and the Unicode data LICENSE/NOTICE/rat-excludes.

WordBreakProperty and ExtendedPictographic load their data lazily and recoverably (double-checked accessor, no classpath resource I/O in a static {} block), per the same review point as on the foundation — so a resource the loader cannot see is a catchable exception at call time, not a class-poisoning ExceptionInInitializerError.
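To illustrate the distinction drawn above, here is a minimal sketch of a double-checked lazy accessor. It is not the PR's code: the class name `LazyTable` and the resource path are hypothetical, and the real loaders parse their data rather than returning it as a string. The point is only that a missing resource surfaces as an ordinary exception on each call instead of failing class initialization.

```java
import java.io.IOException;
import java.io.InputStream;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;

// Hypothetical stand-in for a data loader; not the PR's WordBreakProperty class.
final class LazyTable {

  // volatile is required for double-checked locking to be safe.
  private static volatile String data;

  private LazyTable() {
  }

  /** Loads the table on first use; a failed load is not cached, so a later call can retry. */
  static String data() {
    String local = data;
    if (local == null) {
      synchronized (LazyTable.class) {
        local = data;
        if (local == null) {
          // Hypothetical resource path, used only for illustration.
          local = load("/hypothetical/WordBreakProperty.txt");
          data = local; // published only after a successful load
        }
      }
    }
    return local;
  }

  private static String load(String resource) {
    try (InputStream in = LazyTable.class.getResourceAsStream(resource)) {
      if (in == null) {
        // Thrown at call time, so the caller can catch it and the class stays usable.
        throw new IllegalStateException("Resource not found on classpath: " + resource);
      }
      return new String(in.readAllBytes(), StandardCharsets.UTF_8);
    } catch (IOException e) {
      throw new UncheckedIOException(e);
    }
  }
}
```

Had the same load run inside a `static {}` block, the first failure would raise ExceptionInInitializerError and every later use of the class would fail with NoClassDefFoundError.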

Base: OPENNLP-1850-1b-alignment (#1109). Stack: 1a → 1b → 2a (this) → 2b → 2c → DL → docs.

…rdType (2a) Splits the former tokenizer PR (#1104) into the UAX #29 tokenizer (this PR), the Term model (2b), and the NormalizationProfile registry (2c), on review request. Self-contained: the conformant WordSegmenter/WordTokenizer/WordType/WordToken over the bundled Word_Break and Extended_Pictographic data (loaded lazily and recoverably via a double-checked accessor, no static-init resource I/O), the official WordBreakTest conformance suite, and the Unicode data LICENSE/NOTICE/rat-excludes. Builds on the alignment layer in 1b.

WordBreakProperty.parse threw an opaque StringIndexOutOfBoundsException on a non-comment line with no ';' (the result of substring(0, -1)), unlike the sibling ExtendedPictographic.parse, which guards against it. It now throws an IllegalStateException naming the offending line. parse() is now package-visible so that WordBreakPropertyTest can exercise it directly; the test fails before the fix and passes after it. The real Word_Break data still loads, as the conformance suite confirms.
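The guard described in this commit can be sketched as follows. This is an illustration under assumptions, not the PR's implementation: `PropertyLineParser` and `parseLine` are hypothetical names, and the sketch handles one line of the `range ; Property # comment` form used by the Unicode property data files.

```java
import java.util.AbstractMap.SimpleImmutableEntry;
import java.util.Map;

// Hypothetical helper; the PR's actual WordBreakProperty.parse may be shaped differently.
final class PropertyLineParser {

  private PropertyLineParser() {
  }

  /**
   * Parses one line such as "0041..005A    ; ALetter # comment".
   * Returns null for blank and comment-only lines.
   */
  static Map.Entry<String, String> parseLine(String line) {
    int hash = line.indexOf('#');
    String content = (hash >= 0 ? line.substring(0, hash) : line).trim();
    if (content.isEmpty()) {
      return null; // nothing but whitespace or a comment
    }
    int semi = content.indexOf(';');
    if (semi < 0) {
      // Without this check, substring(0, -1) would throw StringIndexOutOfBoundsException.
      throw new IllegalStateException("Malformed line (missing ';'): " + line);
    }
    String range = content.substring(0, semi).trim();
    String property = content.substring(semi + 1).trim();
    return new SimpleImmutableEntry<>(range, property);
  }
}
```

A caller that hits a malformed line then gets an error message containing the line itself, which is usually enough to locate the problem in the data file.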

… shared Lazy holder Both loaders' hand-rolled double-checked lazy accessor is replaced by the shared opennlp.tools.util.Lazy (introduced with Confusables in the engine), so the recoverable lazy-load pattern lives in one place instead of three near-identical copies. Behavior-preserving; conformance + loader tests green.
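For readers unfamiliar with the pattern, a reusable holder of this kind might look like the sketch below. The API of opennlp.tools.util.Lazy is not shown in this PR, so `LazySketch`, its constructor, and `get()` are assumptions made for illustration only.

```java
import java.util.function.Supplier;

// Hypothetical generic holder; not the actual opennlp.tools.util.Lazy.
final class LazySketch<T> {

  private final Supplier<? extends T> loader;
  private volatile T value; // volatile for safe double-checked locking

  LazySketch(Supplier<? extends T> loader) {
    this.loader = loader;
  }

  /** Returns the cached value, computing it on first use. A loader failure is not cached. */
  T get() {
    T local = value;
    if (local == null) {
      synchronized (this) {
        local = value;
        if (local == null) {
          local = loader.get(); // may throw; value stays null so the next call retries
          value = local;        // a null result is treated as "not loaded" in this sketch
        }
      }
    }
    return local;
  }
}
```

Each loader would then keep a single field such as `new LazySketch<>(SomeLoader::load)` (hypothetical names) instead of its own copy of the locking code.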

This was referenced Jun 23, 2026

OPENNLP-1850: Layered Term model — Term, TermAnalyzer (2b/7) #1111 (Draft)

OPENNLP-1850: UAX #29 word tokenizer and the layered Term model (2/4) #1104 (Closed)

krickert force-pushed the OPENNLP-1850-1b-alignment branch from 08de0d3 to 9af6d92 on June 24, 2026 11:20

krickert force-pushed the OPENNLP-1850-2a-tokenizer branch from a450069 to dc02b9e on June 24, 2026 11:20

krickert added 3 commits June 24, 2026 07:45

krickert force-pushed the OPENNLP-1850-1b-alignment branch from 9af6d92 to 9dc7d51 on June 24, 2026 11:54

krickert force-pushed the OPENNLP-1850-2a-tokenizer branch from dc02b9e to dd1906d on June 24, 2026 11:54

krickert wants to merge 3 commits into OPENNLP-1850-1b-alignment from OPENNLP-1850-2a-tokenizer
