
OPENNLP-1850: Unicode normalization engine — CharClass, rungs, Dimension, confusables (1a/5)#1108

Draft
krickert wants to merge 3 commits into main from OPENNLP-1850-1a-engine


@krickert

Contributor

Part 1a of the OPENNLP-1850 stack. Splits the former foundation PR (#1103) into a mechanical engine layer (this PR) and the offset/alignment layer (1b), as requested in review.

The dependency-free engine:

  • CharClass/CodePointSet over the Unicode White_Space and Dash properties (cursor-based, no regex)
  • the per-code-point substitution/collapse/strip rungs, the Dimension ladder, and the non-aligned TextNormalizer builder
  • the bundled UTS #39 confusables.txt with its LICENSE/NOTICE/rat-excludes bookkeeping
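To illustrate what "cursor-based, no regex" can look like for the White_Space property, here is a minimal sketch. The class and method names (`WhiteSpaceSketch`, `isWhiteSpace`, `collapse`) are hypothetical and are not the PR's CharClass/CodePointSet API; the code point ranges are the Unicode White_Space set, hard-coded here for brevity.

```java
/** Hypothetical sketch: a code-point cursor pass over White_Space, no regex. */
final class WhiteSpaceSketch {

  /** True for the code points carrying the Unicode White_Space property. */
  static boolean isWhiteSpace(int cp) {
    return (cp >= 0x0009 && cp <= 0x000D) || cp == 0x0020 || cp == 0x0085
        || cp == 0x00A0 || cp == 0x1680 || (cp >= 0x2000 && cp <= 0x200A)
        || cp == 0x2028 || cp == 0x2029 || cp == 0x202F || cp == 0x205F
        || cp == 0x3000;
  }

  /** Collapses each run of White_Space code points into one ASCII space. */
  static String collapse(String text) {
    StringBuilder out = new StringBuilder(text.length());
    boolean inRun = false;
    for (int i = 0; i < text.length(); ) {
      int cp = text.codePointAt(i);      // reads a full code point, including surrogate pairs
      i += Character.charCount(cp);      // advance the cursor by 1 or 2 chars
      if (isWhiteSpace(cp)) {
        if (!inRun) {
          out.append(' ');               // emit one space per run
        }
        inRun = true;
      } else {
        out.appendCodePoint(cp);
        inRun = false;
      }
    }
    return out.toString();
  }
}
```

Iterating by code point rather than by `char` keeps supplementary characters intact, which is the main reason to prefer a cursor over a `char` loop in this kind of pass.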

Confusables now loads lazily and recoverably (double-checked accessor; no classpath resource I/O in a static initializer), so a missing/unreadable resource is a catchable exception at call time rather than a class-poisoning ExceptionInInitializerError.
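The lazy, recoverable loading described above can be sketched roughly as follows. This is an illustration under assumptions, not the PR's code: the class name `LazySketch` and its shape are hypothetical. The point it shows is that a loader failure leaves the holder empty, so the caller sees an ordinary exception and a later call can retry.

```java
import java.util.Objects;
import java.util.function.Supplier;

/** Hypothetical sketch of a recoverable, double-checked lazy holder. */
final class LazySketch<T> {
  private final Supplier<T> loader;
  private volatile T value;              // volatile gives safe publication across threads

  LazySketch(Supplier<T> loader) {
    this.loader = Objects.requireNonNull(loader);
  }

  T get() {
    T v = value;                         // first check, no lock
    if (v == null) {
      synchronized (this) {
        v = value;                       // second check, under the lock
        if (v == null) {
          // If the loader throws, 'value' stays null: the exception reaches
          // the caller and the next get() attempts the load again.
          v = Objects.requireNonNull(loader.get(), "loader returned null");
          value = v;
        }
      }
    }
    return v;
  }
}
```

By contrast, an exception thrown from a `static {}` block surfaces as `ExceptionInInitializerError`, and every later use of the class fails with `NoClassDefFoundError`, so there is no way to retry within the same class loader.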

Merge bottom-up: 1a (this) → 1b alignment → tokenizer → DL → docs.

…n, confusables (1a)

Splits the former foundation PR into engine (1a) and the offset/alignment layer (1b) on review
request. This is the dependency-free engine: CharClass/CodePointSet over the White_Space and Dash
UCD sets, the per-code-point substitution/collapse/strip rungs, the Dimension ladder, the
non-aligned TextNormalizer builder, and the bundled UTS #39 confusables.txt with its
LICENSE/NOTICE/rat-excludes bookkeeping. Confusables now loads lazily and recoverably (no classpath
resource I/O in a static initializer), so a missing resource is a catchable exception rather than a
class-poisoning ExceptionInInitializerError. The Alignment offset layer follows in 1b.
@krickert

Contributor Author

@rzo1 Thanks — both of your points on the foundation are addressed.

Split into 1a + 1b (done). I split the foundation along the history exactly where you suggested:

#1104 (tokenizer) is now based on #1109. So the stack is now 1a → 1b → tokenizer → DL → docs, each well under your ~1.5k-real-code target, and the 10k-line confusables.txt datafile is contained in 1a. I closed #1103 with a pointer to the two replacements.

Static-initializer resource loading (done, and generalized). Agreed on the rule. All three bundled-data loaders that did classpath I/O in a static {} block now load lazily on first use through a double-checked accessor, so a resource the loader can't see surfaces as a catchable exception at call time rather than an ExceptionInInitializerError that poisons the class:

The List.of(...) static blocks in UnicodeWhitespace/UnicodeDash are left as-is (no I/O, no classloader risk), as you noted.

Each layer builds and tests green on its own (mvn -pl … -am verify, plus checkstyle + forbiddenapis across the full reactor).

krickert added 2 commits June 24, 2026 07:10
…engine)

A confusables.txt data line with fewer than two ';' was silently skipped (continue), unlike the
malformed-hex path right below it and the sibling UCD loaders, so a corrupted resource would yield a
quietly-incomplete prototype map and wrong confusable() results with no signal. It now throws
IllegalStateException naming the line, consistent with the stack's fail-loud convention. Extracted a
package-visible parse(InputStream) seam; ConfusablesLoadTest proves the red->green. Bundled data still
loads clean (ConfusableSkeletonTest).
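A fail-loud parse seam of the kind described in this commit might look like the sketch below. It is an assumption-laden illustration: the class name, return type, and error messages are hypothetical, and only the `source ; target ; type # comment` line layout of confusables.txt is taken as given. A data line with fewer than two ';' throws instead of being skipped.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.InputStream;
import java.io.InputStreamReader;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;
import java.util.HashMap;
import java.util.Map;

/** Hypothetical sketch of a fail-loud confusables.txt parser. */
final class ConfusablesParseSketch {

  /** Maps each source code point to its prototype string. */
  static Map<Integer, String> parse(InputStream in) {
    Map<Integer, String> prototypes = new HashMap<>();
    try (BufferedReader reader =
        new BufferedReader(new InputStreamReader(in, StandardCharsets.UTF_8))) {
      String raw;
      int lineNo = 0;
      while ((raw = reader.readLine()) != null) {
        lineNo++;
        int hash = raw.indexOf('#');     // drop the trailing comment, if any
        String line = (hash >= 0 ? raw.substring(0, hash) : raw)
            .replace("\uFEFF", "").trim(); // tolerate a leading byte order mark
        if (line.isEmpty()) {
          continue;                      // blank and comment-only lines are legitimate
        }
        String[] fields = line.split(";", -1); // -1 keeps trailing empty fields
        if (fields.length < 3) {         // fewer than two ';' on a data line
          throw new IllegalStateException(
              "Malformed confusables line " + lineNo + ": " + raw);
        }
        try {
          int source = Integer.parseInt(fields[0].trim(), 16);
          StringBuilder target = new StringBuilder();
          for (String hex : fields[1].trim().split("\\s+")) {
            target.appendCodePoint(Integer.parseInt(hex, 16));
          }
          prototypes.put(source, target.toString());
        } catch (IllegalArgumentException e) { // bad hex or invalid code point
          throw new IllegalStateException(
              "Bad hex on confusables line " + lineNo + ": " + raw, e);
        }
      }
    } catch (IOException e) {
      throw new UncheckedIOException(e);
    }
    return prototypes;
  }
}
```

Taking an `InputStream` rather than a resource name is what makes the red-to-green test cheap: a test can feed a deliberately corrupted line from a byte array without touching the bundled data.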
…Lazy holder (engine)

Adds CharClass.substitute(text, mapper): the three expanding folds (ellipsis, German umlaut, digit)
now supply only a per-code-point mapper instead of each re-implementing the cursor pass. Adds a
shared, recoverable Lazy<T> double-checked holder and routes Confusables' lazy load through it
(WordBreakProperty/ExtendedPictographic follow in 2a). Behavior-preserving; existing tests green.
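The idea of a shared cursor pass with a per-code-point mapper can be sketched as below. The signature is an assumption (the PR only names `CharClass.substitute(text, mapper)`), as are the `null`-means-keep convention and the specific replacement strings used for the ellipsis and umlaut folds.

```java
import java.util.function.IntFunction;

/** Hypothetical sketch of a shared expanding-substitution pass. */
final class SubstituteSketch {

  /**
   * Walks the text one code point at a time. The mapper returns the
   * replacement string, or null to keep the code point unchanged.
   */
  static String substitute(CharSequence text, IntFunction<String> mapper) {
    StringBuilder out = new StringBuilder(text.length());
    for (int i = 0; i < text.length(); ) {
      int cp = Character.codePointAt(text, i);
      i += Character.charCount(cp);
      String replacement = mapper.apply(cp);
      if (replacement == null) {
        out.appendCodePoint(cp);
      } else {
        out.append(replacement);         // may be longer than the input code point
      }
    }
    return out.toString();
  }

  /** Assumed fold: HORIZONTAL ELLIPSIS becomes three full stops. */
  static String foldEllipsis(CharSequence text) {
    return substitute(text, cp -> cp == 0x2026 ? "..." : null);
  }

  /** Assumed fold: lowercase German umlauts become their two-letter spellings. */
  static String foldUmlaut(CharSequence text) {
    return substitute(text, cp -> {
      switch (cp) {
        case 0x00E4: return "ae";
        case 0x00F6: return "oe";
        case 0x00FC: return "ue";
        default: return null;
      }
    });
  }
}
```

With this shape, each fold reduces to a small mapper, and the surrogate-pair handling lives in one place instead of being repeated per fold.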