OPENNLP-1850: Unicode normalization foundation — CharClass engine, rungs, Dimension (1/4) by krickert · Pull Request #1103 · apache/opennlp

krickert · 2026-06-20T12:36:34Z

Part 1/4 of OPENNLP-1850, splitting #1101 into reviewable stacked PRs.

Dependency-free normalization layer: opennlp-api value types (NormalizedText, OffsetMap) plus the opennlp-runtime util/normalizer engine — CharClass/CodePointSet, the Unicode White_Space/Dash sets, the normalizer rungs, the Dimension step ladder, the TextNormalizer builder, and the bundled Unicode confusables.txt (UTS #39) with its License V3 attribution. No tokenizer, Term, or DL changes.

Stack: foundation (this) <- tokenizer <- DL <- docs.

krickert · 2026-06-20T12:37:29Z

OPENNLP-1850 stacked PRs (review independently; merge bottom-up, re-targeting each base to main as the one below lands):

OPENNLP-1850: Unicode normalization foundation — CharClass engine, rungs, Dimension (1/4) #1103 — Unicode normalization foundation (CharClass engine, rungs, Dimension)
OPENNLP-1850: UAX #29 word tokenizer and the layered Term model (2/4) #1104 — UAX OPENNLP-910: Add checkstyle #29 word tokenizer + layered Term model
OPENNLP-1850: Offset-safe input normalization in the DL components (3/4) #1105 — Offset-safe input normalization in the DL components
OPENNLP-1850: Document Unicode normalization and the UAX #29 tokenizer (4/4) #1106 — Documentation

Supersedes #1101.

krickert · 2026-06-20T12:42:48Z

@rzo1 split it up like we discussed. OPENNLP-1850 is now four stacked PRs instead of the one big #1101 (closed):

OPENNLP-1850: Unicode normalization foundation — CharClass engine, rungs, Dimension (1/4) #1103, normalization foundation (base main): the dependency-free layer, meaning the CharClass engine, the normalizer rungs, Dimension, TextNormalizer, and confusables. Lands first, lowest risk.
OPENNLP-1850: UAX #29 word tokenizer and the layered Term model (2/4) #1104, UAX OPENNLP-910: Add checkstyle #29 tokenizer + Term model (on OPENNLP-1850: Unicode normalization foundation — CharClass engine, rungs, Dimension (1/4) #1103).
OPENNLP-1850: Offset-safe input normalization in the DL components (3/4) #1105, DL input normalization (on OPENNLP-1850: UAX #29 word tokenizer and the layered Term model (2/4) #1104): the behavioral change, isolated for the focused review you wanted. It only depends on the foundation, so once OPENNLP-1850: Unicode normalization foundation — CharClass engine, rungs, Dimension (1/4) #1103 lands it can re-target straight to main, no need to wait on the tokenizer.
OPENNLP-1850: Document Unicode normalization and the UAX #29 tokenizer (4/4) #1106, docs (on OPENNLP-1850: Offset-safe input normalization in the DL components (3/4) #1105).

All your licensing points are addressed, riding with whichever stack ships the data file: attribution in NOTICE.template so it survives regen, full Unicode License V3 text in LICENSE, the four paths in rat-excludes, and the ExtendedPictographic.txt wording fixed to "filtered subset of emoji-data.txt," not "unmodified."

Each stack builds and tests green on its own. Thanks again for steering this, the split is much cleaner to review.

Copilot

Pull request overview

This PR introduces the “foundation” layer for OPENNLP-1850’s Unicode normalization stack: new value types in opennlp-api (normalized text + offset mapping) and a runtime normalization engine built around CodePointSet/CharClass, Unicode whitespace/dash reference tables, and a TextNormalizer builder, along with legal/NOTICE updates for bundling Unicode UTS #39 confusables.txt.

Changes:

Add core normalization primitives (CodePointSet, CharClass) and multiple CharSequenceNormalizer implementations (whitespace, dashes, quotes, digits, ellipsis, bullets, invisibles, case/accent folds, confusable skeleton).
Add API value types for normalization output and offset mapping (NormalizedText, OffsetMap).
Add extensive JUnit tests for the new normalization components and update legal/NOTICE/RAT configuration for the bundled Unicode data file.

Reviewed changes

Copilot reviewed 38 out of 39 changed files in this pull request and generated 6 comments.

Show a summary per file

File	Description
src/license/NOTICE.template	Adds NOTICE template entry for bundled Unicode UTS #39 data file.
rat-excludes	Adds RAT exclusion entry for `confusables.txt`.
opennlp-core/opennlp-runtime/src/test/java/opennlp/tools/util/normalizer/UnicodeWhitespaceTest.java	Tests Unicode whitespace reference data and membership behavior.
opennlp-core/opennlp-runtime/src/test/java/opennlp/tools/util/normalizer/UnicodeDashTest.java	Tests Unicode dash reference data and normalization defaults.
opennlp-core/opennlp-runtime/src/test/java/opennlp/tools/util/normalizer/UnicodeCharSequenceNormalizerTest.java	Tests composition/behavior of multiple normalizers in pipelines.
opennlp-core/opennlp-runtime/src/test/java/opennlp/tools/util/normalizer/TextNormalizerTest.java	Tests `TextNormalizer` builder and default search pipeline.
opennlp-core/opennlp-runtime/src/test/java/opennlp/tools/util/normalizer/SetBasedNormalizerTest.java	Tests quote/digit/invisible/ellipsis/bullet normalizers.
opennlp-core/opennlp-runtime/src/test/java/opennlp/tools/util/normalizer/GermanUmlautCharSequenceNormalizerTest.java	Tests German umlaut/eszett transliteration normalizer.
opennlp-core/opennlp-runtime/src/test/java/opennlp/tools/util/normalizer/DimensionTest.java	Tests `Dimension` default normalizer wiring.
opennlp-core/opennlp-runtime/src/test/java/opennlp/tools/util/normalizer/CodePointSetTest.java	Tests `CodePointSet` creation, parsing, and set operations.
opennlp-core/opennlp-runtime/src/test/java/opennlp/tools/util/normalizer/CharClassTest.java	Tests `CharClass` operations including offset-mapped outputs.
opennlp-core/opennlp-runtime/src/test/java/opennlp/tools/util/normalizer/CaseFoldCharSequenceNormalizerTest.java	Tests locale-safe and locale-specific case folding behavior.
opennlp-core/opennlp-runtime/src/test/java/opennlp/tools/util/normalizer/AccentFoldCharSequenceNormalizerTest.java	Tests script-gated accent folding and stroke/ligature mapping.
opennlp-core/opennlp-runtime/src/main/java/opennlp/tools/util/normalizer/WhitespaceCharSequenceNormalizer.java	Adds Unicode whitespace collapsing+trimming normalizer.
opennlp-core/opennlp-runtime/src/main/java/opennlp/tools/util/normalizer/UnicodeWhitespace.java	Adds Unicode whitespace reference table + O(1) membership.
opennlp-core/opennlp-runtime/src/main/java/opennlp/tools/util/normalizer/UnicodeDash.java	Adds Unicode dash reference table + O(1) membership and defaults.
opennlp-core/opennlp-runtime/src/main/java/opennlp/tools/util/normalizer/TextNormalizer.java	Adds fluent builder for composing normalizer pipelines.
opennlp-core/opennlp-runtime/src/main/java/opennlp/tools/util/normalizer/QuoteCharSequenceNormalizer.java	Adds typographic quote folding to ASCII.
opennlp-core/opennlp-runtime/src/main/java/opennlp/tools/util/normalizer/NfkcCharSequenceNormalizer.java	Adds NFKC normalizer wrapper.
opennlp-core/opennlp-runtime/src/main/java/opennlp/tools/util/normalizer/NfcCharSequenceNormalizer.java	Adds NFC normalizer wrapper.
opennlp-core/opennlp-runtime/src/main/java/opennlp/tools/util/normalizer/InvisibleCharSequenceNormalizer.java	Adds stripping of invisible/bidi control characters.
opennlp-core/opennlp-runtime/src/main/java/opennlp/tools/util/normalizer/GermanUmlautCharSequenceNormalizer.java	Adds DIN-style German umlaut/eszett expansion normalizer.
opennlp-core/opennlp-runtime/src/main/java/opennlp/tools/util/normalizer/EllipsisCharSequenceNormalizer.java	Adds ellipsis/leader expansion to ASCII dots.
opennlp-core/opennlp-runtime/src/main/java/opennlp/tools/util/normalizer/Dimension.java	Adds normalization “dimension” ladder with lazy defaults.
opennlp-core/opennlp-runtime/src/main/java/opennlp/tools/util/normalizer/DigitCharSequenceNormalizer.java	Adds Unicode decimal digit folding to ASCII digits.
opennlp-core/opennlp-runtime/src/main/java/opennlp/tools/util/normalizer/DashCharSequenceNormalizer.java	Adds Unicode dash folding to ASCII hyphen-minus.
opennlp-core/opennlp-runtime/src/main/java/opennlp/tools/util/normalizer/ConfusableSkeletonCharSequenceNormalizer.java	Adds UTS #39 confusable skeleton normalizer wrapper.
opennlp-core/opennlp-runtime/src/main/java/opennlp/tools/util/normalizer/Confusables.java	Adds confusables resource loader and skeleton algorithm implementation.
opennlp-core/opennlp-runtime/src/main/java/opennlp/tools/util/normalizer/CodePointSet.java	Adds immutable O(1) membership code point set + parser/fromFile.
opennlp-core/opennlp-runtime/src/main/java/opennlp/tools/util/normalizer/CharClass.java	Adds cursor-based normalization/splitting/collapsing + offset mapping.
opennlp-core/opennlp-runtime/src/main/java/opennlp/tools/util/normalizer/CaseFoldCharSequenceNormalizer.java	Adds locale-aware lowercasing normalizer with ROOT singleton.
opennlp-core/opennlp-runtime/src/main/java/opennlp/tools/util/normalizer/BulletCharSequenceNormalizer.java	Adds bullet replacement normalizer (bullets -> space).
opennlp-core/opennlp-runtime/src/main/java/opennlp/tools/util/normalizer/AccentFoldCharSequenceNormalizer.java	Adds script-gated diacritic fold with optional stroke/ligature mapping.
opennlp-api/src/main/java/opennlp/tools/util/normalizer/OffsetMap.java	Adds normalized↔original offset mapping structure and builder.
opennlp-api/src/main/java/opennlp/tools/util/normalizer/NormalizedText.java	Adds `NormalizedText` record bundling original/normalized + `OffsetMap`.
opennlp-api/pom.xml	Adds a test-scoped JUnit dependency.
NOTICE	Adds NOTICE entry for bundled Unicode UTS #39 data file.
LICENSE	Adds Unicode License V3 text for bundled confusables data file.

💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

rzo1 · 2026-06-21T17:02:52Z

Thx for the PR. Here are some suggestions:

Dimension javadoc forward-references Term/TermAnalyzer, which only arrive in OPENNLP-1850: UAX #29 word tokenizer and the layered Term model (2/4) #1104. Standalone javadoc on this branch emits unresolved-reference warnings. Either downgrade those {@link} to {@code}, or confirm the stack always merges as a unit.
Offset mapping isn't reachable through the builder. normalizeMapped/OffsetMap exist only directly on CharClass, while TextNormalizer.build() composes plain CharSequenceNormalizers and discards offsets, so the NormalizedText/OffsetMap capability can't be reached via TextNormalizer. Is that intentional for now?
OffsetMap buffer growth uses Arrays.copyOf(buffer, buffer.length * 2), which overflows to NegativeArraySizeException past ~2^30 chars instead of a clean OOM. Overflow-aware growth would be tidier. Edge case, per-document use.
Confusables.load() has no per-line guard, so a malformed bundled line surfaces as ExceptionInInitializerError, whereas the user-facing CodePointSet.fromFile reports the offending line number. Minor inconsistency. A checksum or version assertion would also back the "unmodified upstream 17.0.0" NOTICE claim.
Nit: serialVersionUID = 1L on the new rungs vs random longs on the legacy ones, and TextNormalizer.builder() returns a TextNormalizer that is itself the mutable builder (a nested Builder would read cleaner).

krickert · 2026-06-21T18:01:30Z

You pointed out a gap in the API that I'm going to fix right now re: NormalizedText/OffsetMap.. Reworking it now - and I think the new way would be a much better shape that matches how it works in other packages.

Addressing all your points now - thanks for the feedback - keep it coming.

krickert · 2026-06-21T19:18:13Z

Dimension javadoc forward-references Term/TermAnalyzer.
Dimension references Term/TermAnalyzer with {@code}, not {@link}, so standalone javadoc on this branch produces no unresolved-reference warnings.

Offset mapping isn't reachable through the builder.
You found an offset in my impl (pun intended), and the root cause was the missing composition primitive: there was no way to combine the per-stage offset maps. I got rid of the OffsetMapping and added Alignment.andThen so an offset-carrying pipeline is now possible. Wiring it through TextNormalizer.build() for arbitrary
CharSequenceNormalizers is a follow-up (only the CharClass-family transforms can emit an alignment cheaply; java.text.Normalizer-based stages would need ICU-style edit callbacks), but the primitive it depends on is in place.

OffsetMap buffer growth overflows past ~2^30.
OffsetMap is removed. Its replacement, Alignment.Builder, grows overflow-aware
(length + (length >> 1), clamped to Integer.MAX_VALUE - 8), so it degrades to a clean OutOfMemoryError instead of NegativeArraySizeException. WordSegmenter.IntList got the same treatment (see #1104).

Confusables.load() has no per-line guard.
Fixed. The per-line parse is wrapped and rethrows an IllegalStateException naming the offending line number, mirroring CodePointSet.fromFile, instead of surfacing a raw ExceptionInInitializerError. (A bundled-file checksum/version assertion is a reasonable follow-up but is left out here.)

Nit: serialVersionUID = 1L vs random longs; builder() returns its own mutable builder.
Although I'm camp "1L" for various reasons, I don't mind either way. Changing that now.

krickert · 2026-06-22T02:54:57Z

Status since the last review. Offset-model items addressed; additive commits, so inline threads stay anchored.

buildAligned() + OffsetAwareNormalizer give the *Aligned API a real consumer: every per-code-point fold (whitespace, line-break-preserving whitespace, dashes, invisible-strip, quotes, digits, ellipsis, bullets, umlaut) composes into one AlignedText. Folds that route through java.text.Normalizer or JDK case mapping are rejected loudly, naming the rung.
Capital eszett U+1E9E folds to SS. buildAligned() reject text states the rule instead of a stale list. Confusables javadoc scoped to the skeleton plus equality test (restriction-level, mixed-script, bidi out of scope). Empty aligned pipeline normalizes input to one String.
Alignment.andThen leading-insertion is not a bug: Math.max(start, end) already yields the zero-width span. Added a test that proves it.
New tests: CharClass plain-vs-aligned parity battery, leading-insertion compose, capital-eszett offsets, buildAligned() rejection at every index and fold type, toNormalizedSpan no over-cover across deletions.

Copilot

Pull request overview

Copilot reviewed 44 out of 45 changed files in this pull request and generated no new comments.

…s, Dimension) Adds the dependency-free normalization layer that the tokenizer, Term model, and DL integration build on in later stacks. No segmentation, no behavioral changes. - opennlp-api: the offset-preserving value types NormalizedText and OffsetMap alongside the existing CharSequenceNormalizer contract. - opennlp-runtime util/normalizer: the CharClass / CodePointSet engine and the Unicode White_Space / Dash sets; the normalizer rungs (NFC, NFKC, whitespace, dash, case fold, accent fold, invisible, quote, digit, ellipsis, bullet, German umlaut, confusable skeleton); the Dimension enum that single-sources the char-level steps; and the TextNormalizer builder over them. - Bundles the Unicode confusables.txt (UTS #39) data file with its Unicode License V3 attribution across NOTICE, NOTICE.template, LICENSE, and rat-excludes.

- OffsetMap.toNormalizedOffset: use floor semantics so an original offset that falls inside a collapsed run maps to that run's single normalized character instead of the following one; correct the javadoc to match. - CharClass.collapse: document that each run collapses to one replacement with no trimming (an all-whitespace string becomes one space; compose with trim for emptiness). - Confusables: parse the prototype hex tokens with a cursor scan instead of a regex split, honoring the stated no-regex contract and avoiding Pattern compilation for every line during static init. - CaseFoldCharSequenceNormalizer: fail loud with a clear message on a null locale in both the factory and the constructor instead of defaulting; the locale-independent default is the no-arg getInstance(). - LICENSE: align the embedded Unicode License V3 copyright year with the bundled data file header and NOTICE (1991-2025). - Tests: add OffsetMapTest (the api module's first test) and ConfusableSkeletonTest, extend CharClassTest with offset cases for collapsed runs of mixed Unicode whitespace, tabs, newlines, single-char/all-whitespace/empty inputs, and surrogate pairs, and cover null-locale rejection.

…ennlp-api Replaces the never-released OffsetMap/NormalizedText (introduced earlier in this same unmerged work, so there is nothing to keep compatible) with Alignment, a bidirectional edit-sequence model (equal/replace runs) that maps span to span (toOriginalSpan/ toNormalizedSpan), composes with andThen, and represents deletions and length-growing folds correctly. AlignedText carries it. The CharClass engine (CharClass, CodePointSet, UnicodeWhitespace, UnicodeDash) moves from opennlp-runtime to opennlp-api so opennlp-dl can reach it without depending on the whole tools runtime. CharClass exposes normalize/collapse/collapsePreserving/trim/removeAll Aligned variants; the per-character offset machinery and appendMapped helper are gone.

Confusables.load parsed each line's hex inside the static initializer with no per-line guard, so a malformed bundled line surfaced as an opaque ExceptionInInitializerError. Wrap the per-line parse and rethrow with the line number, mirroring CodePointSet.fromFile.

Restore the coverage the removed *Mapped tests had, now against the Aligned variants, and extend it: mixed Unicode-whitespace / tab / newline collapse runs, empty/single/all-whitespace inputs, surrogate-pair offsets, normalize identity, supplementary pass-through and expansion to a supplementary replacement, leading/trailing deletions in removeAll, all-whitespace trim, and a collapsePreserving run without a kept line break. Alignment gains builder-growth, three-stage andThen composition, and reverse-span-across-expansion tests.

… TextNormalizer.Builder Give the new CharSequenceNormalizer rungs generated-style serialVersionUID values to match the existing classes instead of 1L, and split TextNormalizer's mutable builder state into a nested TextNormalizer.Builder so builder() returns a dedicated Builder rather than a TextNormalizer that is its own builder. Fluent usage (builder().nfc()...build()) is unchanged.

Add OffsetAwareNormalizer, a capability interface for rungs that can report an Alignment from their normalized output back to the input, and TextNormalizer.Builder.buildAligned() to compose a chain of them into one AlignedText that maps a span found in the fully normalized text back to original character offsets. The cursor-based rungs implement it: whitespace (collapse then trim), dashes, and invisible-control stripping, plus a new line-break-preserving whitespace rung that collapses horizontal runs to a space while keeping line and paragraph breaks as a single newline. Rungs that delegate to java.text.Normalizer (NFC/NFKC) or a fold table cannot report their edits and are rejected loudly by buildAligned(). This gives the previously engine-only Alignment.andThen and the CharClass *Aligned operations (collapseAligned, trimAligned, removeAllAligned, collapsePreservingAligned) real production consumers.

Quote, Digit, Ellipsis, Bullet, and GermanUmlaut now implement OffsetAwareNormalizer, so an offset-aware TextNormalizer.buildAligned() pipeline can include them. Each is a plain per-code-point substitution (the expanding ones such as the ellipsis to three dots, the eszett to ss, and the contracting supplementary digit to one ASCII digit record a replace edit), so the alignment maps a match back to the exact source code point. Only the folds that route through java.text.Normalizer (NFC, NFKC, accent fold, confusable skeleton) or JDK case mapping (case fold) still cannot report their edits and remain rejected by buildAligned().

Add the capital eszett (U+1E9E) to the German umlaut transliteration so it expands to SS, matching the lowercase eszett to ss. Generalize the buildAligned() rejection message and javadoc so they describe the rule (per-code-point folds are offset-aware; folds that route through java.text.Normalizer or JDK case mapping are not) instead of an enumerated list that fell out of date as more folds became offset-aware. Scope the Confusables javadoc to the skeleton transform and skeleton-equality test it implements, noting that the restriction-level, mixed-script, and bidirectional mechanisms of UTS #39 are out of scope. Use a single String for both sides of the identity AlignedText produced by an empty aligned pipeline so the alignment lengths cannot diverge from the stored original for a CharSequence whose length() differs from its toString(). Tests: capital-eszett fold and its 1->2 offset mapping; a leading-insertion andThen composition proving the zero-width-at-origin branch maps correctly; a systematic plain-equals-aligned parity battery across every CharClass operation; and buildAligned() rejection when the non-alignable rung is not first and for each kind of non-alignable fold.

Soften the Confusables class-javadoc opening to say it follows the skeleton algorithm of UTS #39, consistent with the paragraph scoping the rest of the report out. Normalize the input of an aligned pipeline to a String once, so the stored original and the per-stage alignment lengths agree even for a CharSequence whose length() differs from its toString(); the empty pipeline already did this. Add a forward-mapping test that a toNormalizedSpan over a span ending inside a deleted run stops at the last kept character rather than over-covering the next.

The fold matches the precomposed umlaut and eszett code points; a base letter followed by a combining diaeresis (U+0308) is not a member and passes through unchanged. Note in the javadoc that the input should be NFC-composed first if it may contain decomposed forms, and add a test that pins both behaviors.

Copilot

Pull request overview

Copilot reviewed 44 out of 45 changed files in this pull request and generated 1 comment.

Comments suppressed due to low confidence (2)

rat-excludes:1

The excluded path for confusables.txt appears to be incorrect relative to where the PR documents the bundled file (opennlp-core/opennlp-runtime/src/main/resources/opennlp/tools/util/normalizer/confusables.txt). If the exclude path doesn't match the actual location scanned by RAT, the release/build RAT check will likely fail. Update this entry to the correct module-relative path (or add both paths if RAT runs from multiple roots).
opennlp-core/opennlp-runtime/src/main/java/opennlp/tools/util/normalizer/TextNormalizer.java:1
The exception message reports the offending rung index as zero-based (rung 0, rung 1, ...). That’s easy to misinterpret in user-facing diagnostics. Consider either switching to one-based numbering in the message, or explicitly labeling it as zero-based (e.g., rung index 2 (0-based)) to avoid ambiguity.

+    for (int f = 0; f < finalLength; f++) {
+      final int middleStart = next.originalStart[f];
+      final int middleEnd = next.originalEnd[f];
+      final int start = middleStart < normalizedLength() ? originalStart[middleStart] : originalLength;
+      final int end = middleEnd > 0 ? originalEnd[middleEnd - 1] : 0;
+      starts[f] = start;
+      ends[f] = Math.max(start, end);
+    }


…index Document why andThen attributes an insertion that lands inside an expansion to that atomic original block (the only mapping that keeps originalStart/originalEnd sorted, so toOriginalSpan/toNormalizedSpan keep their binary search; a zero-width point there would corrupt the reverse mapping). Add tests for an interior insertion in a copied region (zero-width) and an insertion inside an expansion (block-attributed, consistent both ways). Label the buildAligned() rejection rung index as 0-based.

rzo1 · 2026-06-23T13:22:53Z

@mawiesne heads up on the diff size before you dive in: of the +15,543 lines, 9,994 are the bundled confusables.txt data file (the UTS #39 Unicode confusables table) — that's a license-checked data blob, not a code change to review line by line. The actual reviewable Java is ~5.5k lines, and roughly half of that is tests.

So the real surface area is a lot smaller than the headline number suggests.

rzo1 · 2026-06-23T13:24:49Z

One design point worth addressing here (and a convention I'd like us to hold to generally): Confusables loads its data eagerly in a static initializer:

private static final Map<Integer, String> PROTOTYPES = load();   // runs at class-load time

In an application container the classloader that loads this class is often not the loader the app runs with (servlet containers, OSGi, shaded/relocated jars, modular setups). When that happens, getResourceAsStream can't see the bundled resource, load() throws, and the failure surfaces as an ExceptionInInitializerError which then permanently poisons the class: every subsequent use throws NoClassDefFoundError with the original cause gone. It's eager and unrecoverable, and it happens at class-load rather than at first real use.

Suggestion: make it lazy and recoverable instead; a double-checked prototypes() accessor that calls load() on first use, so a failure throws a normal exception you can catch/retry rather than killing the class for the lifetime of the loader.

To be clear this isn't specific to Confusables: it's a general rule for any resource/model loading: we should never do classpath resource loading inside a static {} block / static-final initializer. (The inline List.of(...) static blocks in UnicodeWhitespace/UnicodeDash are fine since no I/O, no classloader risk.)

Let's keep new code lazy here, and treat existing eager-static loaders elsewhere as something to migrate over time.

rzo1 · 2026-06-23T13:28:47Z

Still a huge PR. If we want an actual human-based QM review rather than a rubber stamp, this is far too big as it stands.

@mawiesne on reviewability: even after subtracting the 10k data file, this "foundation" PR still welds together two independent concepts, and I think we can make it noticeably easier to review by splitting it into two stacked PRs.

1a, normalizer engine + data. CharClass/CodePointSet/UnicodeWhitespace/UnicodeDash, the rungs (NFC, NFKC, dash, quote, digit, German umlaut, confusable skeleton, etc.), Dimension, the non-aligned TextNormalizer, plus confusables.txt and all its LICENSE/NOTICE/rat-excludes bookkeeping. This is where the license review belongs and it's mostly mechanical per-code-point substitution, so it's quick to read.
1b, the offset/alignment layer. Alignment, AlignedText, OffsetAwareNormalizer, buildAligned(), and the *Aligned CharClass variants with their tests. This is the conceptually hard part (binary-search span mapping, expansion/deletion edge cases) and carries the densest tests, so it deserves a focused read on its own ~800 lines instead of being skimmed inside a 15k diff. It was also added after the initial submission, so it splits out cleanly along the history.

That keeps each PR well under ~1.5k lines of real code and lets the reviewer spend their attention budget where it actually matters. If re-stacking the branch is more churn than it's worth, a lighter alternative is to keep it as one PR but call out in the description "skip confusables.txt; the alignment layer is files X/Y/Z" so the review path is obvious.

Curious which you'd prefer, @mawiesne and @krickert?

krickert · 2026-06-23T13:40:06Z

I'm making the extra branches now. I wanna make the review as digestive as possible.

krickert · 2026-06-23T14:04:39Z

Superseded by #1108 (engine — 1a) and #1109 (offset/alignment layer — 1b), splitting this foundation PR in two as suggested in review. #1104 (tokenizer) now bases on #1109. All three bundled-data loaders (Confusables, WordBreakProperty, ExtendedPictographic) are now lazy and recoverable (no classpath resource I/O in a static initializer). Each layer builds and tests green on its own.

krickert · 2026-06-23T14:08:39Z

@rzo1 Thanks — both of your points on the foundation are addressed.

Split into 1a + 1b (done). I split the foundation along the history exactly where you suggested:

OPENNLP-1850: Unicode normalization engine — CharClass, rungs, Dimension, confusables (1a/5) #1108 — engine (1a): CharClass/CodePointSet/UnicodeWhitespace/UnicodeDash, the per-code-point rungs, Dimension, the non-aligned TextNormalizer, and confusables.txt with all its LICENSE/NOTICE/rat-excludes bookkeeping. Mostly mechanical substitution, and where the license review belongs.
OPENNLP-1850: Offset/alignment layer — Alignment, AlignedText, buildAligned (1b/5) #1109 — offset/alignment layer (1b): Alignment, AlignedText, OffsetAwareNormalizer, buildAligned(), the *Aligned CharClass variants, and the dense span-mapping tests (binary-search mapping, expansion/deletion edge cases). The conceptually hard ~800 lines, isolated for a focused read.

#1104 (tokenizer) now bases on #1109. So the stack is now 1a → 1b → tokenizer → DL → docs, each well under your ~1.5k-real-code target, and the 10k-line confusables.txt data file is contained in 1a. I closed #1103 pointing at the two replacements.

Static-initializer resource loading (done, and generalized). Agreed on the rule. All three bundled-data loaders that did classpath I/O in a static {} block now load lazily on first use through a double-checked accessor, so a resource the loader can't see surfaces as a catchable exception at call time rather than an ExceptionInInitializerError that poisons the class:

Confusables (1a / OPENNLP-1850: Unicode normalization engine — CharClass, rungs, Dimension, confusables (1a/5) #1108)
WordBreakProperty and ExtendedPictographic (tokenizer / OPENNLP-1850: UAX #29 word tokenizer and the layered Term model (2/4) #1104) — the latter two wrap their tables in a small immutable holder loaded via the same pattern.

The List.of(...) static blocks in UnicodeWhitespace/UnicodeDash are left as-is (no I/O, no classloader risk), as you noted.

Each layer builds and tests green on its own (mvn -pl … -am verify, plus checkstyle + forbiddenapis across the full reactor).

krickert mentioned this pull request Jun 20, 2026

OPENNLP-1850: Offset-safe input normalization in the DL components (3/4) #1105

Draft

This was referenced Jun 20, 2026

OPENNLP-1850: UAX #29 word tokenizer and the layered Term model (2/4) #1104

Closed

OPENNLP-1850: Document Unicode normalization and the UAX #29 tokenizer (4/4) #1106

Draft

OPENNLP-1850: Improve Whitespace UTF normalization #1101

Closed

krickert marked this pull request as draft June 20, 2026 14:42

krickert requested a review from Copilot June 20, 2026 14:56

Copilot started reviewing on behalf of krickert June 20, 2026 14:57 View session

krickert requested review from atarora, jzonthemtn, mawiesne and rzo1 June 20, 2026 14:57

Copilot AI reviewed Jun 20, 2026

View reviewed changes

krickert requested a review from Copilot June 22, 2026 03:00

Copilot started reviewing on behalf of krickert June 22, 2026 03:00 View session

Copilot AI reviewed Jun 22, 2026

View reviewed changes

krickert added 9 commits June 21, 2026 23:48

krickert added 2 commits June 21, 2026 23:48

krickert force-pushed the OPENNLP-1850-1-foundation branch from 090593f to 9f2622e Compare June 22, 2026 03:51

rzo1 requested a review from Copilot June 23, 2026 10:48

Copilot AI reviewed Jun 23, 2026

View reviewed changes

Copilot started reviewing on behalf of rzo1 June 23, 2026 11:19 View session

This was referenced Jun 23, 2026

OPENNLP-1850: Unicode normalization engine — CharClass, rungs, Dimension, confusables (1a/5) #1108

Draft

OPENNLP-1850: Offset/alignment layer — Alignment, AlignedText, buildAligned (1b/5) #1109

Draft

krickert closed this Jun 23, 2026

Uh oh!

Conversation

krickert commented Jun 20, 2026

Uh oh!

krickert commented Jun 20, 2026

Uh oh!

krickert commented Jun 20, 2026

Uh oh!

Copilot AI left a comment

Choose a reason for hiding this comment

Pull request overview

Reviewed changes

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

rzo1 commented Jun 21, 2026

Uh oh!

krickert commented Jun 21, 2026

Uh oh!

krickert commented Jun 21, 2026

Uh oh!

krickert commented Jun 22, 2026

Uh oh!

Copilot AI left a comment

Choose a reason for hiding this comment

Pull request overview

Uh oh!

Copilot AI left a comment

Choose a reason for hiding this comment

Pull request overview

Uh oh!

rzo1 commented Jun 23, 2026

Uh oh!

rzo1 commented Jun 23, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

rzo1 commented Jun 23, 2026

Uh oh!

krickert commented Jun 23, 2026

Uh oh!

krickert commented Jun 23, 2026

Uh oh!

krickert commented Jun 23, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

3 participants

rzo1 commented Jun 23, 2026 •

edited

Loading