Skip to content

OPENNLP-1850: Unicode normalization foundation — CharClass engine, rungs, Dimension (1/4)#1103

Closed
krickert wants to merge 12 commits into
mainfrom
OPENNLP-1850-1-foundation
Closed

OPENNLP-1850: Unicode normalization foundation — CharClass engine, rungs, Dimension (1/4)#1103
krickert wants to merge 12 commits into
mainfrom
OPENNLP-1850-1-foundation

Conversation

@krickert

Copy link
Copy Markdown
Contributor

Part 1/4 of OPENNLP-1850, splitting #1101 into reviewable stacked PRs.

Dependency-free normalization layer: opennlp-api value types (NormalizedText, OffsetMap) plus the opennlp-runtime util/normalizer engine — CharClass/CodePointSet, the Unicode White_Space/Dash sets, the normalizer rungs, the Dimension step ladder, the TextNormalizer builder, and the bundled Unicode confusables.txt (UTS #39) with its License V3 attribution. No tokenizer, Term, or DL changes.

Stack: foundation (this) <- tokenizer <- DL <- docs.

@krickert

Copy link
Copy Markdown
Contributor Author

OPENNLP-1850 stacked PRs (review independently; merge bottom-up, re-targeting each base to main as the one below lands):

  1. OPENNLP-1850: Unicode normalization foundation — CharClass engine, rungs, Dimension (1/4) #1103 — Unicode normalization foundation (CharClass engine, rungs, Dimension)
  2. OPENNLP-1850: UAX #29 word tokenizer and the layered Term model (2/4) #1104 — UAX OPENNLP-910: Add checkstyle #29 word tokenizer + layered Term model
  3. OPENNLP-1850: Offset-safe input normalization in the DL components (3/4) #1105 — Offset-safe input normalization in the DL components
  4. OPENNLP-1850: Document Unicode normalization and the UAX #29 tokenizer (4/4) #1106 — Documentation

Supersedes #1101.

@krickert

Copy link
Copy Markdown
Contributor Author

@rzo1 split it up like we discussed. OPENNLP-1850 is now four stacked PRs instead of the one big #1101 (closed):

  1. OPENNLP-1850: Unicode normalization foundation — CharClass engine, rungs, Dimension (1/4) #1103, normalization foundation (base main): the dependency-free layer, meaning the CharClass engine, the normalizer rungs, Dimension, TextNormalizer, and confusables. Lands first, lowest risk.
  2. OPENNLP-1850: UAX #29 word tokenizer and the layered Term model (2/4) #1104, UAX OPENNLP-910: Add checkstyle #29 tokenizer + Term model (on OPENNLP-1850: Unicode normalization foundation — CharClass engine, rungs, Dimension (1/4) #1103).
  3. OPENNLP-1850: Offset-safe input normalization in the DL components (3/4) #1105, DL input normalization (on OPENNLP-1850: UAX #29 word tokenizer and the layered Term model (2/4) #1104): the behavioral change, isolated for the focused review you wanted. It only depends on the foundation, so once OPENNLP-1850: Unicode normalization foundation — CharClass engine, rungs, Dimension (1/4) #1103 lands it can re-target straight to main, no need to wait on the tokenizer.
  4. OPENNLP-1850: Document Unicode normalization and the UAX #29 tokenizer (4/4) #1106, docs (on OPENNLP-1850: Offset-safe input normalization in the DL components (3/4) #1105).

All your licensing points are addressed, riding with whichever stack ships the data file: attribution in NOTICE.template so it survives regen, full Unicode License V3 text in LICENSE, the four paths in rat-excludes, and the ExtendedPictographic.txt wording fixed to "filtered subset of emoji-data.txt," not "unmodified."

Each stack builds and tests green on its own. Thanks again for steering this, the split is much cleaner to review.

@krickert krickert marked this pull request as draft June 20, 2026 14:42
@krickert krickert requested a review from Copilot June 20, 2026 14:56

Copilot AI left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

This PR introduces the “foundation” layer for OPENNLP-1850’s Unicode normalization stack: new value types in opennlp-api (normalized text + offset mapping) and a runtime normalization engine built around CodePointSet/CharClass, Unicode whitespace/dash reference tables, and a TextNormalizer builder, along with legal/NOTICE updates for bundling Unicode UTS #39 confusables.txt.

Changes:

  • Add core normalization primitives (CodePointSet, CharClass) and multiple CharSequenceNormalizer implementations (whitespace, dashes, quotes, digits, ellipsis, bullets, invisibles, case/accent folds, confusable skeleton).
  • Add API value types for normalization output and offset mapping (NormalizedText, OffsetMap).
  • Add extensive JUnit tests for the new normalization components and update legal/NOTICE/RAT configuration for the bundled Unicode data file.

Reviewed changes

Copilot reviewed 38 out of 39 changed files in this pull request and generated 6 comments.

Show a summary per file
File Description
src/license/NOTICE.template Adds NOTICE template entry for bundled Unicode UTS #39 data file.
rat-excludes Adds RAT exclusion entry for confusables.txt.
opennlp-core/opennlp-runtime/src/test/java/opennlp/tools/util/normalizer/UnicodeWhitespaceTest.java Tests Unicode whitespace reference data and membership behavior.
opennlp-core/opennlp-runtime/src/test/java/opennlp/tools/util/normalizer/UnicodeDashTest.java Tests Unicode dash reference data and normalization defaults.
opennlp-core/opennlp-runtime/src/test/java/opennlp/tools/util/normalizer/UnicodeCharSequenceNormalizerTest.java Tests composition/behavior of multiple normalizers in pipelines.
opennlp-core/opennlp-runtime/src/test/java/opennlp/tools/util/normalizer/TextNormalizerTest.java Tests TextNormalizer builder and default search pipeline.
opennlp-core/opennlp-runtime/src/test/java/opennlp/tools/util/normalizer/SetBasedNormalizerTest.java Tests quote/digit/invisible/ellipsis/bullet normalizers.
opennlp-core/opennlp-runtime/src/test/java/opennlp/tools/util/normalizer/GermanUmlautCharSequenceNormalizerTest.java Tests German umlaut/eszett transliteration normalizer.
opennlp-core/opennlp-runtime/src/test/java/opennlp/tools/util/normalizer/DimensionTest.java Tests Dimension default normalizer wiring.
opennlp-core/opennlp-runtime/src/test/java/opennlp/tools/util/normalizer/CodePointSetTest.java Tests CodePointSet creation, parsing, and set operations.
opennlp-core/opennlp-runtime/src/test/java/opennlp/tools/util/normalizer/CharClassTest.java Tests CharClass operations including offset-mapped outputs.
opennlp-core/opennlp-runtime/src/test/java/opennlp/tools/util/normalizer/CaseFoldCharSequenceNormalizerTest.java Tests locale-safe and locale-specific case folding behavior.
opennlp-core/opennlp-runtime/src/test/java/opennlp/tools/util/normalizer/AccentFoldCharSequenceNormalizerTest.java Tests script-gated accent folding and stroke/ligature mapping.
opennlp-core/opennlp-runtime/src/main/java/opennlp/tools/util/normalizer/WhitespaceCharSequenceNormalizer.java Adds Unicode whitespace collapsing+trimming normalizer.
opennlp-core/opennlp-runtime/src/main/java/opennlp/tools/util/normalizer/UnicodeWhitespace.java Adds Unicode whitespace reference table + O(1) membership.
opennlp-core/opennlp-runtime/src/main/java/opennlp/tools/util/normalizer/UnicodeDash.java Adds Unicode dash reference table + O(1) membership and defaults.
opennlp-core/opennlp-runtime/src/main/java/opennlp/tools/util/normalizer/TextNormalizer.java Adds fluent builder for composing normalizer pipelines.
opennlp-core/opennlp-runtime/src/main/java/opennlp/tools/util/normalizer/QuoteCharSequenceNormalizer.java Adds typographic quote folding to ASCII.
opennlp-core/opennlp-runtime/src/main/java/opennlp/tools/util/normalizer/NfkcCharSequenceNormalizer.java Adds NFKC normalizer wrapper.
opennlp-core/opennlp-runtime/src/main/java/opennlp/tools/util/normalizer/NfcCharSequenceNormalizer.java Adds NFC normalizer wrapper.
opennlp-core/opennlp-runtime/src/main/java/opennlp/tools/util/normalizer/InvisibleCharSequenceNormalizer.java Adds stripping of invisible/bidi control characters.
opennlp-core/opennlp-runtime/src/main/java/opennlp/tools/util/normalizer/GermanUmlautCharSequenceNormalizer.java Adds DIN-style German umlaut/eszett expansion normalizer.
opennlp-core/opennlp-runtime/src/main/java/opennlp/tools/util/normalizer/EllipsisCharSequenceNormalizer.java Adds ellipsis/leader expansion to ASCII dots.
opennlp-core/opennlp-runtime/src/main/java/opennlp/tools/util/normalizer/Dimension.java Adds normalization “dimension” ladder with lazy defaults.
opennlp-core/opennlp-runtime/src/main/java/opennlp/tools/util/normalizer/DigitCharSequenceNormalizer.java Adds Unicode decimal digit folding to ASCII digits.
opennlp-core/opennlp-runtime/src/main/java/opennlp/tools/util/normalizer/DashCharSequenceNormalizer.java Adds Unicode dash folding to ASCII hyphen-minus.
opennlp-core/opennlp-runtime/src/main/java/opennlp/tools/util/normalizer/ConfusableSkeletonCharSequenceNormalizer.java Adds UTS #39 confusable skeleton normalizer wrapper.
opennlp-core/opennlp-runtime/src/main/java/opennlp/tools/util/normalizer/Confusables.java Adds confusables resource loader and skeleton algorithm implementation.
opennlp-core/opennlp-runtime/src/main/java/opennlp/tools/util/normalizer/CodePointSet.java Adds immutable O(1) membership code point set + parser/fromFile.
opennlp-core/opennlp-runtime/src/main/java/opennlp/tools/util/normalizer/CharClass.java Adds cursor-based normalization/splitting/collapsing + offset mapping.
opennlp-core/opennlp-runtime/src/main/java/opennlp/tools/util/normalizer/CaseFoldCharSequenceNormalizer.java Adds locale-aware lowercasing normalizer with ROOT singleton.
opennlp-core/opennlp-runtime/src/main/java/opennlp/tools/util/normalizer/BulletCharSequenceNormalizer.java Adds bullet replacement normalizer (bullets -> space).
opennlp-core/opennlp-runtime/src/main/java/opennlp/tools/util/normalizer/AccentFoldCharSequenceNormalizer.java Adds script-gated diacritic fold with optional stroke/ligature mapping.
opennlp-api/src/main/java/opennlp/tools/util/normalizer/OffsetMap.java Adds normalized↔original offset mapping structure and builder.
opennlp-api/src/main/java/opennlp/tools/util/normalizer/NormalizedText.java Adds NormalizedText record bundling original/normalized + OffsetMap.
opennlp-api/pom.xml Adds a test-scoped JUnit dependency.
NOTICE Adds NOTICE entry for bundled Unicode UTS #39 data file.
LICENSE Adds Unicode License V3 text for bundled confusables data file.

💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

Comment thread opennlp-api/src/main/java/opennlp/tools/util/normalizer/OffsetMap.java Outdated
Comment thread opennlp-api/src/main/java/opennlp/tools/util/normalizer/OffsetMap.java Outdated
Comment thread rat-excludes
Comment thread LICENSE Outdated
@rzo1

rzo1 commented Jun 21, 2026

Copy link
Copy Markdown
Contributor

Thx for the PR. Here are some suggestions:

  • Dimension javadoc forward-references Term/TermAnalyzer, which only arrive in OPENNLP-1850: UAX #29 word tokenizer and the layered Term model (2/4) #1104. Standalone javadoc on this branch emits unresolved-reference warnings. Either downgrade those {@link} to {@code}, or confirm the stack always merges as a unit.
  • Offset mapping isn't reachable through the builder. normalizeMapped/OffsetMap exist only directly on CharClass, while TextNormalizer.build() composes plain CharSequenceNormalizers and discards offsets, so the NormalizedText/OffsetMap capability can't be reached via TextNormalizer. Is that intentional for now?
  • OffsetMap buffer growth uses Arrays.copyOf(buffer, buffer.length * 2), which overflows to NegativeArraySizeException past ~2^30 chars instead of a clean OOM. Overflow-aware growth would be tidier. Edge case, per-document use.
  • Confusables.load() has no per-line guard, so a malformed bundled line surfaces as ExceptionInInitializerError, whereas the user-facing CodePointSet.fromFile reports the offending line number. Minor inconsistency. A checksum or version assertion would also back the "unmodified upstream 17.0.0" NOTICE claim.
  • Nit: serialVersionUID = 1L on the new rungs vs random longs on the legacy ones, and TextNormalizer.builder() returns a TextNormalizer that is itself the mutable builder (a nested Builder would read cleaner).

@krickert

Copy link
Copy Markdown
Contributor Author

You pointed out a gap in the API that I'm going to fix right now re: NormalizedText/OffsetMap.. Reworking it now - and I think the new way would be a much better shape that matches how it works in other packages.

Addressing all your points now - thanks for the feedback - keep it coming.

@krickert

Copy link
Copy Markdown
Contributor Author

Dimension javadoc forward-references Term/TermAnalyzer.
Dimension references Term/TermAnalyzer with {@code}, not {@link}, so standalone javadoc on this branch produces no unresolved-reference warnings.

Offset mapping isn't reachable through the builder.
You found an offset in my impl (pun intended), and the root cause was the missing composition primitive: there was no way to combine the per-stage offset maps. I got rid of the OffsetMapping and added Alignment.andThen so an offset-carrying pipeline is now possible. Wiring it through TextNormalizer.build() for arbitrary
CharSequenceNormalizers is a follow-up (only the CharClass-family transforms can emit an alignment cheaply; java.text.Normalizer-based stages would need ICU-style edit callbacks), but the primitive it depends on is in place.

OffsetMap buffer growth overflows past ~2^30.
OffsetMap is removed. Its replacement, Alignment.Builder, grows overflow-aware
(length + (length >> 1), clamped to Integer.MAX_VALUE - 8), so it degrades to a clean OutOfMemoryError instead of NegativeArraySizeException. WordSegmenter.IntList got the same treatment (see #1104).

Confusables.load() has no per-line guard.
Fixed. The per-line parse is wrapped and rethrows an IllegalStateException naming the offending line number, mirroring CodePointSet.fromFile, instead of surfacing a raw ExceptionInInitializerError. (A bundled-file checksum/version assertion is a reasonable follow-up but is left out here.)

Nit: serialVersionUID = 1L vs random longs; builder() returns its own mutable builder.
Although I'm camp "1L" for various reasons, I don't mind either way. Changing that now.

@krickert

Copy link
Copy Markdown
Contributor Author

Status since the last review. Offset-model items addressed; additive commits, so inline threads stay anchored.

  • buildAligned() + OffsetAwareNormalizer give the *Aligned API a real consumer: every per-code-point fold (whitespace, line-break-preserving whitespace, dashes, invisible-strip, quotes, digits, ellipsis, bullets, umlaut) composes into one AlignedText. Folds that route through java.text.Normalizer or JDK case mapping are rejected loudly, naming the rung.
  • Capital eszett U+1E9E folds to SS. buildAligned() reject text states the rule instead of a stale list. Confusables javadoc scoped to the skeleton plus equality test (restriction-level, mixed-script, bidi out of scope). Empty aligned pipeline normalizes input to one String.
  • Alignment.andThen leading-insertion is not a bug: Math.max(start, end) already yields the zero-width span. Added a test that proves it.
  • New tests: CharClass plain-vs-aligned parity battery, leading-insertion compose, capital-eszett offsets, buildAligned() rejection at every index and fold type, toNormalizedSpan no over-cover across deletions.

Copilot AI left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

Copilot reviewed 44 out of 45 changed files in this pull request and generated no new comments.

krickert added 9 commits June 21, 2026 23:48
…s, Dimension)

Adds the dependency-free normalization layer that the tokenizer, Term model, and
DL integration build on in later stacks. No segmentation, no behavioral changes.

- opennlp-api: the offset-preserving value types NormalizedText and OffsetMap
  alongside the existing CharSequenceNormalizer contract.
- opennlp-runtime util/normalizer: the CharClass / CodePointSet engine and the
  Unicode White_Space / Dash sets; the normalizer rungs (NFC, NFKC, whitespace,
  dash, case fold, accent fold, invisible, quote, digit, ellipsis, bullet,
  German umlaut, confusable skeleton); the Dimension enum that single-sources the
  char-level steps; and the TextNormalizer builder over them.
- Bundles the Unicode confusables.txt (UTS #39) data file with its Unicode
  License V3 attribution across NOTICE, NOTICE.template, LICENSE, and rat-excludes.
- OffsetMap.toNormalizedOffset: use floor semantics so an original offset that
  falls inside a collapsed run maps to that run's single normalized character
  instead of the following one; correct the javadoc to match.
- CharClass.collapse: document that each run collapses to one replacement with no
  trimming (an all-whitespace string becomes one space; compose with trim for
  emptiness).
- Confusables: parse the prototype hex tokens with a cursor scan instead of a
  regex split, honoring the stated no-regex contract and avoiding Pattern
  compilation for every line during static init.
- CaseFoldCharSequenceNormalizer: fail loud with a clear message on a null locale
  in both the factory and the constructor instead of defaulting; the
  locale-independent default is the no-arg getInstance().
- LICENSE: align the embedded Unicode License V3 copyright year with the bundled
  data file header and NOTICE (1991-2025).
- Tests: add OffsetMapTest (the api module's first test) and ConfusableSkeletonTest,
  extend CharClassTest with offset cases for collapsed runs of mixed Unicode
  whitespace, tabs, newlines, single-char/all-whitespace/empty inputs, and
  surrogate pairs, and cover null-locale rejection.
…ennlp-api

Replaces the never-released OffsetMap/NormalizedText (introduced earlier in this same
unmerged work, so there is nothing to keep compatible) with Alignment, a bidirectional
edit-sequence model (equal/replace runs) that maps span to span (toOriginalSpan/
toNormalizedSpan), composes with andThen, and represents deletions and length-growing
folds correctly. AlignedText carries it.

The CharClass engine (CharClass, CodePointSet, UnicodeWhitespace, UnicodeDash) moves from
opennlp-runtime to opennlp-api so opennlp-dl can reach it without depending on the whole
tools runtime. CharClass exposes normalize/collapse/collapsePreserving/trim/removeAll
Aligned variants; the per-character offset machinery and appendMapped helper are gone.
Confusables.load parsed each line's hex inside the static initializer with no per-line
guard, so a malformed bundled line surfaced as an opaque ExceptionInInitializerError.
Wrap the per-line parse and rethrow with the line number, mirroring CodePointSet.fromFile.
Restore the coverage the removed *Mapped tests had, now against the Aligned variants, and
extend it: mixed Unicode-whitespace / tab / newline collapse runs, empty/single/all-whitespace
inputs, surrogate-pair offsets, normalize identity, supplementary pass-through and expansion to a
supplementary replacement, leading/trailing deletions in removeAll, all-whitespace trim, and a
collapsePreserving run without a kept line break. Alignment gains builder-growth, three-stage
andThen composition, and reverse-span-across-expansion tests.
… TextNormalizer.Builder

Give the new CharSequenceNormalizer rungs generated-style serialVersionUID values to match the
existing classes instead of 1L, and split TextNormalizer's mutable builder state into a nested
TextNormalizer.Builder so builder() returns a dedicated Builder rather than a TextNormalizer that
is its own builder. Fluent usage (builder().nfc()...build()) is unchanged.
Add OffsetAwareNormalizer, a capability interface for rungs that can report an
Alignment from their normalized output back to the input, and
TextNormalizer.Builder.buildAligned() to compose a chain of them into one
AlignedText that maps a span found in the fully normalized text back to original
character offsets.

The cursor-based rungs implement it: whitespace (collapse then trim), dashes, and
invisible-control stripping, plus a new line-break-preserving whitespace rung that
collapses horizontal runs to a space while keeping line and paragraph breaks as a
single newline. Rungs that delegate to java.text.Normalizer (NFC/NFKC) or a fold
table cannot report their edits and are rejected loudly by buildAligned().

This gives the previously engine-only Alignment.andThen and the CharClass *Aligned
operations (collapseAligned, trimAligned, removeAllAligned, collapsePreservingAligned)
real production consumers.
Quote, Digit, Ellipsis, Bullet, and GermanUmlaut now implement OffsetAwareNormalizer,
so an offset-aware TextNormalizer.buildAligned() pipeline can include them. Each is a
plain per-code-point substitution (the expanding ones such as the ellipsis to three dots,
the eszett to ss, and the contracting supplementary digit to one ASCII digit record a
replace edit), so the alignment maps a match back to the exact source code point.

Only the folds that route through java.text.Normalizer (NFC, NFKC, accent fold,
confusable skeleton) or JDK case mapping (case fold) still cannot report their edits and
remain rejected by buildAligned().
Add the capital eszett (U+1E9E) to the German umlaut transliteration so it
expands to SS, matching the lowercase eszett to ss.

Generalize the buildAligned() rejection message and javadoc so they describe
the rule (per-code-point folds are offset-aware; folds that route through
java.text.Normalizer or JDK case mapping are not) instead of an enumerated
list that fell out of date as more folds became offset-aware.

Scope the Confusables javadoc to the skeleton transform and skeleton-equality
test it implements, noting that the restriction-level, mixed-script, and
bidirectional mechanisms of UTS #39 are out of scope.

Use a single String for both sides of the identity AlignedText produced by an
empty aligned pipeline so the alignment lengths cannot diverge from the stored
original for a CharSequence whose length() differs from its toString().

Tests: capital-eszett fold and its 1->2 offset mapping; a leading-insertion
andThen composition proving the zero-width-at-origin branch maps correctly;
a systematic plain-equals-aligned parity battery across every CharClass
operation; and buildAligned() rejection when the non-alignable rung is not
first and for each kind of non-alignable fold.
krickert added 2 commits June 21, 2026 23:48
Soften the Confusables class-javadoc opening to say it follows the skeleton
algorithm of UTS #39, consistent with the paragraph scoping the rest of the
report out.

Normalize the input of an aligned pipeline to a String once, so the stored
original and the per-stage alignment lengths agree even for a CharSequence whose
length() differs from its toString(); the empty pipeline already did this.

Add a forward-mapping test that a toNormalizedSpan over a span ending inside a
deleted run stops at the last kept character rather than over-covering the next.
The fold matches the precomposed umlaut and eszett code points; a base letter
followed by a combining diaeresis (U+0308) is not a member and passes through
unchanged. Note in the javadoc that the input should be NFC-composed first if it
may contain decomposed forms, and add a test that pins both behaviors.
@krickert krickert force-pushed the OPENNLP-1850-1-foundation branch from 090593f to 9f2622e Compare June 22, 2026 03:51
@rzo1 rzo1 requested a review from Copilot June 23, 2026 10:48

Copilot AI left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

Copilot reviewed 44 out of 45 changed files in this pull request and generated 1 comment.

Comments suppressed due to low confidence (2)

rat-excludes:1

  • The excluded path for confusables.txt appears to be incorrect relative to where the PR documents the bundled file (opennlp-core/opennlp-runtime/src/main/resources/opennlp/tools/util/normalizer/confusables.txt). If the exclude path doesn't match the actual location scanned by RAT, the release/build RAT check will likely fail. Update this entry to the correct module-relative path (or add both paths if RAT runs from multiple roots).
    opennlp-core/opennlp-runtime/src/main/java/opennlp/tools/util/normalizer/TextNormalizer.java:1
  • The exception message reports the offending rung index as zero-based (rung 0, rung 1, ...). That’s easy to misinterpret in user-facing diagnostics. Consider either switching to one-based numbering in the message, or explicitly labeling it as zero-based (e.g., rung index 2 (0-based)) to avoid ambiguity.

Comment on lines +142 to +149
for (int f = 0; f < finalLength; f++) {
final int middleStart = next.originalStart[f];
final int middleEnd = next.originalEnd[f];
final int start = middleStart < normalizedLength() ? originalStart[middleStart] : originalLength;
final int end = middleEnd > 0 ? originalEnd[middleEnd - 1] : 0;
starts[f] = start;
ends[f] = Math.max(start, end);
}
…index

Document why andThen attributes an insertion that lands inside an expansion to that
atomic original block (the only mapping that keeps originalStart/originalEnd sorted, so
toOriginalSpan/toNormalizedSpan keep their binary search; a zero-width point there would
corrupt the reverse mapping). Add tests for an interior insertion in a copied region
(zero-width) and an insertion inside an expansion (block-attributed, consistent both ways).
Label the buildAligned() rejection rung index as 0-based.
@rzo1

rzo1 commented Jun 23, 2026

Copy link
Copy Markdown
Contributor

@mawiesne heads up on the diff size before you dive in: of the +15,543 lines, 9,994 are the bundled confusables.txt data file (the UTS #39 Unicode confusables table) — that's a license-checked data blob, not a code change to review line by line. The actual reviewable Java is ~5.5k lines, and roughly half of that is tests.

So the real surface area is a lot smaller than the headline number suggests.

@rzo1

rzo1 commented Jun 23, 2026

Copy link
Copy Markdown
Contributor

One design point worth addressing here (and a convention I'd like us to hold to generally): Confusables loads its data eagerly in a static initializer:

private static final Map<Integer, String> PROTOTYPES = load();   // runs at class-load time

In an application container the classloader that loads this class is often not the loader the app runs with (servlet containers, OSGi, shaded/relocated jars, modular setups). When that happens, getResourceAsStream can't see the bundled resource, load() throws, and the failure surfaces as an ExceptionInInitializerError which then permanently poisons the class: every subsequent use throws NoClassDefFoundError with the original cause gone. It's eager and unrecoverable, and it happens at class-load rather than at first real use.

Suggestion: make it lazy and recoverable instead; a double-checked prototypes() accessor that calls load() on first use, so a failure throws a normal exception you can catch/retry rather than killing the class for the lifetime of the loader.

To be clear this isn't specific to Confusables: it's a general rule for any resource/model loading: we should never do classpath resource loading inside a static {} block / static-final initializer. (The inline List.of(...) static blocks in UnicodeWhitespace/UnicodeDash are fine since no I/O, no classloader risk.)

Let's keep new code lazy here, and treat existing eager-static loaders elsewhere as something to migrate over time.

@rzo1

rzo1 commented Jun 23, 2026

Copy link
Copy Markdown
Contributor

Still a huge PR. If we want an actual human-based QM review rather than a rubber stamp, this is far too big as it stands.

@mawiesne on reviewability: even after subtracting the 10k data file, this "foundation" PR still welds together two independent concepts, and I think we can make it noticeably easier to review by splitting it into two stacked PRs.

  • 1a, normalizer engine + data. CharClass/CodePointSet/UnicodeWhitespace/UnicodeDash, the rungs (NFC, NFKC, dash, quote, digit, German umlaut, confusable skeleton, etc.), Dimension, the non-aligned TextNormalizer, plus confusables.txt and all its LICENSE/NOTICE/rat-excludes bookkeeping. This is where the license review belongs and it's mostly mechanical per-code-point substitution, so it's quick to read.
  • 1b, the offset/alignment layer. Alignment, AlignedText, OffsetAwareNormalizer, buildAligned(), and the *Aligned CharClass variants with their tests. This is the conceptually hard part (binary-search span mapping, expansion/deletion edge cases) and carries the densest tests, so it deserves a focused read on its own ~800 lines instead of being skimmed inside a 15k diff. It was also added after the initial submission, so it splits out cleanly along the history.

That keeps each PR well under ~1.5k lines of real code and lets the reviewer spend their attention budget where it actually matters. If re-stacking the branch is more churn than it's worth, a lighter alternative is to keep it as one PR but call out in the description "skip confusables.txt; the alignment layer is files X/Y/Z" so the review path is obvious.

Curious which you'd prefer, @mawiesne and @krickert?

@krickert

Copy link
Copy Markdown
Contributor Author

I'm making the extra branches now. I wanna make the review as digestive as possible.

@krickert

Copy link
Copy Markdown
Contributor Author

Superseded by #1108 (engine — 1a) and #1109 (offset/alignment layer — 1b), splitting this foundation PR in two as suggested in review. #1104 (tokenizer) now bases on #1109. All three bundled-data loaders (Confusables, WordBreakProperty, ExtendedPictographic) are now lazy and recoverable (no classpath resource I/O in a static initializer). Each layer builds and tests green on its own.

@krickert krickert closed this Jun 23, 2026
@krickert

Copy link
Copy Markdown
Contributor Author

@rzo1 Thanks — both of your points on the foundation are addressed.

Split into 1a + 1b (done). I split the foundation along the history exactly where you suggested:

#1104 (tokenizer) now bases on #1109. So the stack is now 1a → 1b → tokenizer → DL → docs, each well under your ~1.5k-real-code target, and the 10k-line confusables.txt data file is contained in 1a. I closed #1103 pointing at the two replacements.

Static-initializer resource loading (done, and generalized). Agreed on the rule. All three bundled-data loaders that did classpath I/O in a static {} block now load lazily on first use through a double-checked accessor, so a resource the loader can't see surfaces as a catchable exception at call time rather than an ExceptionInInitializerError that poisons the class:

The List.of(...) static blocks in UnicodeWhitespace/UnicodeDash are left as-is (no I/O, no classloader risk), as you noted.

Each layer builds and tests green on its own (mvn -pl … -am verify, plus checkstyle + forbiddenapis across the full reactor).

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants