OPENNLP-1850: UAX #29 word tokenizer and the layered Term model (2/4)#1104
OPENNLP-1850: UAX #29 word tokenizer and the layered Term model (2/4)#1104krickert wants to merge 7 commits into
Conversation
|
OPENNLP-1850 stacked PRs (review independently; merge bottom-up, re-targeting each base to
Supersedes #1101. |
There was a problem hiding this comment.
Pull request overview
Adds the next slice of OPENNLP-1850 by introducing a Unicode UAX #29 word segmenter/tokenizer implementation and a new layered normalization “Term” model (Term + TermAnalyzer), plus a language-to-normalization profile registry and the associated Unicode data/license attributions.
Changes:
- Implement UAX #29 word boundary segmentation (
WordSegmenter) and a word tokenizer (WordTokenizer) with typed tokens (WordType,WordToken), including Extended_Pictographic support. - Introduce the layered Term normalization stack (
Term,TermAnalyzer,Dimension) and a language-based registry (NormalizationProfile,NormalizationProfiles). - Add comprehensive JUnit tests (including official Unicode conformance) and update NOTICE/LICENSE/RAT exclusions for bundled Unicode data.
Reviewed changes
Copilot reviewed 25 out of 27 changed files in this pull request and generated 4 comments.
Show a summary per file
| File | Description |
|---|---|
| src/license/NOTICE.template | Expands Unicode data attribution text for additional bundled UCD/UTS resources. |
| rat-excludes | Excludes newly bundled Unicode data files from RAT header checks. |
| opennlp-core/opennlp-runtime/src/test/java/opennlp/tools/util/normalizer/TermAnalyzerTest.java | Tests for TermAnalyzer layering, ordering, lazy dimensions, and tokenization behavior. |
| opennlp-core/opennlp-runtime/src/test/java/opennlp/tools/util/normalizer/NormalizationProfilesTest.java | Tests language-to-profile resolution and search analyzer behavior. |
| opennlp-core/opennlp-runtime/src/test/java/opennlp/tools/util/normalizer/ConfusablesTest.java | Tests confusable skeleton folding behavior. |
| opennlp-core/opennlp-runtime/src/test/java/opennlp/tools/tokenize/uax29/WordTokenizerTest.java | Tests tokenizer output, typed tokens, and max-length chopping behavior. |
| opennlp-core/opennlp-runtime/src/test/java/opennlp/tools/tokenize/uax29/WordSegmenterTest.java | Tests segmentation boundaries on representative UAX #29 cases. |
| opennlp-core/opennlp-runtime/src/test/java/opennlp/tools/tokenize/uax29/WordBreakPropertyTest.java | Tests Word_Break property lookup behavior and edge cases. |
| opennlp-core/opennlp-runtime/src/test/java/opennlp/tools/tokenize/uax29/WordBoundaryConformanceTest.java | Runs the official Unicode WordBreakTest.txt conformance suite against WordSegmenter. |
| opennlp-core/opennlp-runtime/src/test/java/opennlp/tools/tokenize/uax29/ExtendedPictographicTest.java | Tests Extended_Pictographic membership checks and bounds safety. |
| opennlp-core/opennlp-runtime/src/main/resources/opennlp/tools/tokenize/uax29/ExtendedPictographic.txt | Bundled derived Unicode data for Extended_Pictographic property membership. |
| opennlp-core/opennlp-runtime/src/main/java/opennlp/tools/util/normalizer/TermAnalyzer.java | Implements configurable token segmentation + ordered normalization dimension pipeline. |
| opennlp-core/opennlp-runtime/src/main/java/opennlp/tools/util/normalizer/Term.java | Represents a token with cached/lazy normalization layers. |
| opennlp-core/opennlp-runtime/src/main/java/opennlp/tools/util/normalizer/NormalizationProfiles.java | Registry mapping language codes to normalization/stemming profiles with detection dispatch. |
| opennlp-core/opennlp-runtime/src/main/java/opennlp/tools/util/normalizer/NormalizationProfile.java | Per-language profile record and searchAnalyzer() builder. |
| opennlp-core/opennlp-runtime/src/main/java/opennlp/tools/util/normalizer/Dimension.java | Javadoc updates aligning Dimension docs with the new Term/TermAnalyzer model. |
| opennlp-core/opennlp-runtime/src/main/java/opennlp/tools/tokenize/uax29/WordType.java | Adds token categorization for downstream handling (scripts, numeric, emoji, etc.). |
| opennlp-core/opennlp-runtime/src/main/java/opennlp/tools/tokenize/uax29/WordTokenizer.java | Implements UAX #29-based word tokenization with spans and optional typed streaming. |
| opennlp-core/opennlp-runtime/src/main/java/opennlp/tools/tokenize/uax29/WordToken.java | Typed token record (span + type) produced by WordTokenizer. |
| opennlp-core/opennlp-runtime/src/main/java/opennlp/tools/tokenize/uax29/WordSegmenter.java | Implements the UAX #29 word boundary algorithm with fast-path transition tables. |
| opennlp-core/opennlp-runtime/src/main/java/opennlp/tools/tokenize/uax29/WordBreakProperty.java | Loads and looks up Unicode Word_Break property values from bundled data. |
| opennlp-core/opennlp-runtime/src/main/java/opennlp/tools/tokenize/uax29/WordBreak.java | Enum for Word_Break property values + parser for property names in the data file. |
| opennlp-core/opennlp-runtime/src/main/java/opennlp/tools/tokenize/uax29/ExtendedPictographic.java | Loads Extended_Pictographic membership from bundled data for WB3c behavior. |
| NOTICE | Updates top-level NOTICE with expanded Unicode attribution details. |
| LICENSE | Updates top-level LICENSE to include Unicode License V3 applicability for added data files. |
💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.
dab5605 to
67c922a
Compare
|
Thx for the PR. Here are some suggestions:
|
81aa6c5 to
36de08f
Compare
|
Term.at() ordering footgun. WordType.IDEOGRAPHIC javadoc overstates. WordTokenizer implements Tokenizer directly. Nit: IntList capacity doubling overflow. |
8c2451a to
3f06095
Compare
|
Status: rebased onto the updated foundation, no content change. |
| } catch (IOException e) { | ||
| throw new UncheckedIOException("Unable to read Extended_Pictographic data resource", e); | ||
| } |
3f06095 to
0ec5a36
Compare
090593f to
9f2622e
Compare
0ec5a36 to
b150056
Compare
| final int passRate = total == 0 ? 0 : passed * 100 / total; | ||
| System.out.println("UAX#29 word-break conformance: " + passed + "/" + total | ||
| + " (" + passRate + "%)"); | ||
| assertTrue(total > 1900, "expected the full conformance suite to load, ran only " + total); |
| private static void assign(int start, int end, byte ordinal, List<int[]> supplementary) { | ||
| final int bmpEnd = Math.min(end, 0xFFFF); | ||
| for (int codePoint = start; codePoint <= bmpEnd; codePoint++) { | ||
| BMP[codePoint] = ordinal; | ||
| } | ||
| if (end > 0xFFFF) { | ||
| supplementary.add(new int[] {Math.max(start, 0x10000), end, ordinal}); | ||
| } | ||
| } |
| public static int ordinalOf(int codePoint) { | ||
| if (codePoint >= 0 && codePoint <= 0xFFFF) { | ||
| return BMP[codePoint]; | ||
| } | ||
| return ordinalOfSupplementary(codePoint); | ||
| } |
| // A builder override wins; otherwise the dimension's own default normalizer. | ||
| final CharSequenceNormalizer normalizer = transforms.containsKey(dimension) | ||
| ? transforms.get(dimension) : dimension.defaultNormalizer(); | ||
| return normalizer.normalize(input).toString(); | ||
| } |
| @ParameterizedTest | ||
| @ValueSource(ints = {0x0021, 0x0040, 0x2014}) | ||
| void testUnassignedCodePointsAreOther(int codePoint) { | ||
| assertSame(WordBreak.OTHER, WordBreakProperty.of(codePoint)); | ||
| } |
e0ea17c to
7a3c25a
Compare
|
Same resource-loading concern as in #1103, and here it shows up twice. Both static {
try (InputStream in = WordBreakProperty.class.getResourceAsStream(RESOURCE)) {
...In an application container the classloader that loads these classes is often not the loader the app runs with (servlet containers, OSGi, shaded/relocated jars). When the class's loader can't see the resource, the static block throws and the class is permanently poisoned: Suggestion is the same as on the foundation PR: move the load behind a lazy, recoverable accessor (double-checked On review sizeLike the foundation PR, this one is also a lot for a human to QM-review in one pass. Of the +6,762 lines, 3,976 are bundled data (
Smaller PRs over one large one give the reviewer a real chance to read each layer rather than skim a 6.7k diff. If re-stacking is more churn than it's worth, at minimum the description should point reviewers past the data files to the ~2,800 lines that actually need eyes. Generally I'd favour the step-by-step route: smaller, conceptually-scoped PRs over one big one, even when they have to stack. |
Builds on the normalization foundation. - opennlp-runtime tokenize/uax29: the UAX #29 word segmenter and Tokenizer implementation (WordSegmenter, WordTokenizer, WordType, WordBreak, boundary engine) with bundled Unicode WordBreakProperty and emoji ExtendedPictographic data, validated against the official WordBreakTest conformance suite (1944/1944). - The layered Term model (Term, TermAnalyzer) that tokenizes then normalizes per token over the Dimension ladder, the per-language NormalizationProfile registry, and the confusable-fold coverage. - Extends the bundled-Unicode attribution (NOTICE, NOTICE.template, LICENSE, rat-excludes) to the WordBreakProperty / ExtendedPictographic / WordBreakTest data files, and restores Dimension's javadoc cross-links now that the Term layer is present.
- WordBoundaryConformanceTest: guard the conformance resource stream with Objects.requireNonNull and a clear message instead of an opaque NPE in InputStreamReader, and remove the unused NO_BOUNDARY constant. - NormalizationProfiles.forLanguage: fail loud on a null language argument at the public entry point, with a null-rejection test.
- Term.at: document that an unconfigured dimension is applied on top of normalized()
rather than in canonical pipeline order, with a non-commutative example.
- WordType.IDEOGRAPHIC: soften javadoc ('a token containing a Han ideograph', not 'a
single Han ideograph').
- WordTokenizer: note the deliberate choice to implement Tokenizer directly instead of
extending AbstractTokenizer.
- WordSegmenter.IntList: overflow-aware 1.5x growth instead of length*2.
…moji WordType classifies every Extended_Pictographic code point as EMOJI, which includes symbol-like characters (copyright, trademark, double-exclamation, arrows), so the word tokenizer keeps them rather than dropping them as punctuation. State this in the WordTokenizer javadoc and add a test.
NormalizationProfiles.detect now rejects a null text or detector with a clear NullPointerException instead of failing deeper inside language detection. The TermAnalyzer caseFold(Locale) builder step rejects a null locale up front. ExtendedPictographic names the missing resource in its read-failure message, matching WordBreakProperty.
…ordBreakProperty Throw a targeted IllegalStateException when a requested character-level Dimension has no default normalizer instead of NPEing. Mask the byte-backed Word_Break ordinals as unsigned on read (defensive for >127 ordinals) and bulk-fill the BMP range with Arrays.fill. Drop a stdout print from the conformance test and rename the punctuation/symbols test to match what it asserts.
…ndedPictographic Load the bundled Word_Break and Extended_Pictographic tables lazily on first use via a double-checked accessor instead of a static initializer, so a missing or unreadable resource surfaces as a catchable exception at call time rather than an ExceptionInInitializerError that permanently poisons the class for the lifetime of the classloader (a real risk under container/OSGi/shaded/modular setups). Mirrors the Confusables change in 1a.
7a3c25a to
0bf7f6c
Compare
|
Superseded by #1110 (UAX #29 tokenizer — 2a), #1111 (Term model — 2b), and #1112 (NormalizationProfile registry — 2c), splitting this PR into three smaller stacked layers as suggested in review. #1105 (DL) now bases on #1112. The resource-loading point is also addressed: |
|
@rzo1 Both points on the tokenizer PR are addressed. Resource loading (already done). Split into 2a / 2b / 2c (done). Split the tokenizer PR along the three concepts you identified:
Each layer builds and tests green on its own ( |
…rdType (2a) Splits the former tokenizer PR (#1104) into the UAX #29 tokenizer (this PR), the Term model (2b), and the NormalizationProfile registry (2c), on review request. Self-contained: the conformant WordSegmenter/WordTokenizer/WordType/WordToken over the bundled Word_Break and Extended_Pictographic data (loaded lazily and recoverably via a double-checked accessor, no static-init resource I/O), the official WordBreakTest conformance suite, and the Unicode data LICENSE/NOTICE/rat-excludes. Builds on the alignment layer in 1b.
The token analysis layer split out of the former tokenizer PR (#1104) on review request. A Term is one token projected through the ordered Dimension stack (original, NFC, NFKC, whitespace, dash, case fold, accent fold, confusable fold, stem, lemma), keeping its source Span and every intermediate form; TermAnalyzer segments with the UAX #29 WordTokenizer (from 2a) and applies the configured dimension prefix. Restores Dimension's {@link Term}/{@link TermAnalyzer} javadoc now that they exist. Builds on the tokenizer in 2a.
The language-to-settings registry split out of the former tokenizer PR (#1104) on review request. NormalizationProfiles maps a language code to its stemmer and diacritic fold (the way OpenNLP already selects a Snowball stemmer by language) and builds a search-oriented TermAnalyzer; NormalizationProfile is the per-language record. Builds on the Term model in 2b.
…rdType (2a) Splits the former tokenizer PR (#1104) into the UAX #29 tokenizer (this PR), the Term model (2b), and the NormalizationProfile registry (2c), on review request. Self-contained: the conformant WordSegmenter/WordTokenizer/WordType/WordToken over the bundled Word_Break and Extended_Pictographic data (loaded lazily and recoverably via a double-checked accessor, no static-init resource I/O), the official WordBreakTest conformance suite, and the Unicode data LICENSE/NOTICE/rat-excludes. Builds on the alignment layer in 1b.
The token analysis layer split out of the former tokenizer PR (#1104) on review request. A Term is one token projected through the ordered Dimension stack (original, NFC, NFKC, whitespace, dash, case fold, accent fold, confusable fold, stem, lemma), keeping its source Span and every intermediate form; TermAnalyzer segments with the UAX #29 WordTokenizer (from 2a) and applies the configured dimension prefix. Restores Dimension's {@link Term}/{@link TermAnalyzer} javadoc now that they exist. Builds on the tokenizer in 2a.
The language-to-settings registry split out of the former tokenizer PR (#1104) on review request. NormalizationProfiles maps a language code to its stemmer and diacritic fold (the way OpenNLP already selects a Snowball stemmer by language) and builds a search-oriented TermAnalyzer; NormalizationProfile is the per-language record. Builds on the Term model in 2b.
…rdType (2a) Splits the former tokenizer PR (#1104) into the UAX #29 tokenizer (this PR), the Term model (2b), and the NormalizationProfile registry (2c), on review request. Self-contained: the conformant WordSegmenter/WordTokenizer/WordType/WordToken over the bundled Word_Break and Extended_Pictographic data (loaded lazily and recoverably via a double-checked accessor, no static-init resource I/O), the official WordBreakTest conformance suite, and the Unicode data LICENSE/NOTICE/rat-excludes. Builds on the alignment layer in 1b.
The token analysis layer split out of the former tokenizer PR (#1104) on review request. A Term is one token projected through the ordered Dimension stack (original, NFC, NFKC, whitespace, dash, case fold, accent fold, confusable fold, stem, lemma), keeping its source Span and every intermediate form; TermAnalyzer segments with the UAX #29 WordTokenizer (from 2a) and applies the configured dimension prefix. Restores Dimension's {@link Term}/{@link TermAnalyzer} javadoc now that they exist. Builds on the tokenizer in 2a.
The language-to-settings registry split out of the former tokenizer PR (#1104) on review request. NormalizationProfiles maps a language code to its stemmer and diacritic fold (the way OpenNLP already selects a Snowball stemmer by language) and builds a search-oriented TermAnalyzer; NormalizationProfile is the per-language record. Builds on the Term model in 2b.
Part 2/4 of OPENNLP-1850. Stacked on the foundation branch (base is OPENNLP-1850-1-foundation, so the diff is only this slice).
UAX #29 word segmenter and Tokenizer impl with bundled WordBreakProperty/ExtendedPictographic data (conformance 1944/1944), the layered Term model (Term, TermAnalyzer), the NormalizationProfile registry, and the WordBreak data's License V3 attribution.