OPENNLP-1850: Unicode normalization engine — CharClass, rungs, Dimension, confusables (1a/5)#1108
krickert wants to merge 3 commits into
…n, confusables (1a) Splits the former foundation PR into engine (1a) and the offset/alignment layer (1b) on review request. This is the dependency-free engine: CharClass/CodePointSet over the White_Space and Dash UCD sets, the per-code-point substitution/collapse/strip rungs, the Dimension ladder, the non-aligned TextNormalizer builder, and the bundled UTS #39 confusables.txt with its LICENSE/NOTICE/rat-excludes bookkeeping. Confusables now loads lazily and recoverably (no classpath resource I/O in a static initializer), so a missing resource is a catchable exception rather than a class-poisoning ExceptionInInitializerError. The Alignment offset layer follows in 1b.
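A minimal sketch of the lazy, recoverable loading described above, assuming a generic holder around a `Supplier`. The class name `LazySketch` is hypothetical and is not the PR's actual type. The point it illustrates: a failed load is never cached, so the caller sees an ordinary exception and a later call can retry, whereas a failure inside a static initializer leaves the class unusable for the rest of the JVM's lifetime.

```java
import java.util.Objects;
import java.util.function.Supplier;

/**
 * Hypothetical sketch of a recoverable, double-checked lazy holder.
 * Not the PR's implementation; names are illustrative only.
 */
final class LazySketch<T> {
    private final Supplier<T> loader;
    private volatile T value; // stays null until a load succeeds

    LazySketch(Supplier<T> loader) {
        this.loader = Objects.requireNonNull(loader, "loader");
    }

    T get() {
        T local = value;                 // first check, no lock
        if (local == null) {
            synchronized (this) {
                local = value;           // second check, under the lock
                if (local == null) {
                    // If the loader throws, value is left null, so the
                    // failure propagates to this caller and the next
                    // call simply tries again.
                    local = Objects.requireNonNull(loader.get(), "loaded value");
                    value = local;
                }
            }
        }
        return local;
    }
}
```

A caller would wrap the resource read in the supplier, for example `new LazySketch<>(() -> readBundledTable())` with a hypothetical `readBundledTable()`, and handle the exception at the call site.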
@rzo1 Thanks — both of your points on the foundation are addressed. Split into 1a + 1b (done). I split the foundation along the history exactly where you suggested:
Static-initializer resource loading (done, and generalized). Agreed on the rule. All three bundled-data loaders that did classpath I/O in a
Each layer builds and tests green on its own.
…engine) A confusables.txt data line with fewer than two ';' was silently skipped (continue), unlike the malformed-hex path right below it and the sibling UCD loaders, so a corrupted resource would yield a quietly-incomplete prototype map and wrong confusable() results with no signal. It now throws IllegalStateException naming the line, consistent with the stack's fail-loud convention. Extracted a package-visible parse(InputStream) seam; ConfusablesLoadTest proves the red->green. Bundled data still loads clean (ConfusableSkeletonTest).
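The fail-loud rule in this commit can be sketched roughly as follows. This is an illustration under assumptions, not the PR's code: the class name `ConfusablesParseSketch`, the `Map<Integer, String>` return type, and the exact message text are hypothetical. It assumes the UTS #39 data-line layout of `source ; target ; type`, with `#` starting a comment and the source being a single hex code point.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.InputStream;
import java.io.InputStreamReader;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;
import java.util.HashMap;
import java.util.Map;

/** Hypothetical sketch: reject malformed data lines instead of skipping them. */
final class ConfusablesParseSketch {

    static Map<Integer, String> parse(InputStream in) {
        Map<Integer, String> prototypes = new HashMap<>();
        try (BufferedReader reader = new BufferedReader(
                new InputStreamReader(in, StandardCharsets.UTF_8))) {
            String raw;
            int lineNo = 0;
            while ((raw = reader.readLine()) != null) {
                lineNo++;
                String line = raw;
                if (lineNo == 1 && line.startsWith("\uFEFF")) {
                    line = line.substring(1);        // drop a leading BOM
                }
                int hash = line.indexOf('#');
                if (hash >= 0) {
                    line = line.substring(0, hash);  // drop the comment
                }
                line = line.trim();
                if (line.isEmpty()) {
                    continue;                        // blank or comment-only
                }
                String[] fields = line.split(";", -1);
                if (fields.length < 3) {             // fewer than two ';'
                    throw new IllegalStateException(
                            "Malformed confusables line " + lineNo + ": " + raw);
                }
                try {
                    int source = Integer.parseInt(fields[0].trim(), 16);
                    StringBuilder target = new StringBuilder();
                    for (String hex : fields[1].trim().split("\\s+")) {
                        target.appendCodePoint(Integer.parseInt(hex, 16));
                    }
                    prototypes.put(source, target.toString());
                } catch (IllegalArgumentException e) { // bad hex or code point
                    throw new IllegalStateException(
                            "Malformed confusables line " + lineNo + ": " + raw, e);
                }
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return prototypes;
    }
}
```

Taking an `InputStream` rather than a resource name is what lets a test feed a deliberately corrupted line and assert on the exception, without touching the bundled file.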
…Lazy holder (engine) Adds CharClass.substitute(text, mapper): the three expanding folds (ellipsis, German umlaut, digit) now supply only a per-code-point mapper instead of each re-implementing the cursor pass. Adds a shared, recoverable Lazy<T> double-checked holder and routes Confusables' lazy load through it (WordBreakProperty/ExtendedPictographic follow in 2a). Behavior-preserving; existing tests green.
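To make the shape of that refactor concrete, here is a minimal sketch of a shared cursor pass that takes a per-code-point mapper. The signature, the `null` means "keep" convention, and the specific ellipsis and umlaut mappings are assumptions for illustration; the actual `CharClass.substitute` may differ.

```java
import java.util.function.IntFunction;

/** Hypothetical sketch of one cursor pass shared by several expanding folds. */
final class SubstituteSketch {

    /**
     * Walks the text by code point. The mapper returns a replacement
     * string, or null to keep the code point unchanged (assumed convention).
     */
    static String substitute(CharSequence text, IntFunction<String> mapper) {
        StringBuilder out = null;            // allocated on the first change
        int n = text.length();
        int i = 0;
        while (i < n) {
            int cp = Character.codePointAt(text, i);
            String replacement = mapper.apply(cp);
            if (replacement != null) {
                if (out == null) {
                    // Copy the untouched prefix once, then keep appending.
                    out = new StringBuilder(n + 8).append(text, 0, i);
                }
                out.append(replacement);
            } else if (out != null) {
                out.appendCodePoint(cp);
            }
            i += Character.charCount(cp);    // 1 or 2 chars per code point
        }
        return out == null ? text.toString() : out.toString();
    }

    /** Example fold: U+2026 HORIZONTAL ELLIPSIS to three ASCII dots. */
    static String foldEllipsis(CharSequence text) {
        return substitute(text, cp -> cp == 0x2026 ? "..." : null);
    }

    /** Example fold: an assumed German transliteration table. */
    static String foldGermanUmlauts(CharSequence text) {
        return substitute(text, cp -> {
            switch (cp) {
                case 0x00E4: return "ae";
                case 0x00F6: return "oe";
                case 0x00FC: return "ue";
                case 0x00C4: return "Ae";
                case 0x00D6: return "Oe";
                case 0x00DC: return "Ue";
                case 0x00DF: return "ss";
                default:     return null;
            }
        });
    }
}
```

Each fold is then only a mapping table. The surrogate handling and the lazy allocation of the output buffer live in one place.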
Part 1a of the OPENNLP-1850 stack. Splits the former foundation PR (#1103) into a mechanical engine layer (this PR) and the offset/alignment layer (1b), as requested in review.
The dependency-free engine:
- CharClass/CodePointSet over the Unicode White_Space and Dash properties (cursor-based, no regex)
- the per-code-point substitution/collapse/strip rungs, the Dimension ladder, and the non-aligned TextNormalizer builder
- the bundled UTS #39 confusables.txt with its LICENSE/NOTICE/rat-excludes bookkeeping

Confusables now loads lazily and recoverably (double-checked accessor; no classpath resource I/O in a static initializer), so a missing/unreadable resource is a catchable exception at call time rather than a class-poisoning ExceptionInInitializerError.

Merge bottom-up: 1a (this) → 1b alignment → tokenizer → DL → docs.
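For readers unfamiliar with the "cursor-based, no regex" approach named in the engine summary, a rough sketch follows. It assumes a set stored as sorted inclusive ranges and a single forward pass over code points. `CodePointSetSketch` and `collapseRuns` are hypothetical names, and only the White_Space property is spelled out here (the 25 code points listed in the UCD's PropList.txt).

```java
import java.util.Arrays;

/** Hypothetical sketch of a range-based code point set with a cursor pass. */
final class CodePointSetSketch {
    private final int[] starts; // sorted range starts (inclusive)
    private final int[] ends;   // matching range ends (inclusive)

    CodePointSetSketch(int[][] ranges) {
        starts = new int[ranges.length];
        ends = new int[ranges.length];
        for (int i = 0; i < ranges.length; i++) {
            starts[i] = ranges[i][0];
            ends[i] = ranges[i][1];
        }
    }

    /** Unicode White_Space property, written out as inclusive ranges. */
    static final CodePointSetSketch WHITE_SPACE = new CodePointSetSketch(new int[][] {
        {0x0009, 0x000D}, {0x0020, 0x0020}, {0x0085, 0x0085}, {0x00A0, 0x00A0},
        {0x1680, 0x1680}, {0x2000, 0x200A}, {0x2028, 0x2029}, {0x202F, 0x202F},
        {0x205F, 0x205F}, {0x3000, 0x3000},
    });

    boolean contains(int cp) {
        int idx = Arrays.binarySearch(starts, cp);
        if (idx >= 0) {
            return true;                   // cp is exactly a range start
        }
        int insertion = -idx - 1;          // index of the first start > cp
        return insertion > 0 && cp <= ends[insertion - 1];
    }

    /** Replaces each maximal run of member code points with one char. */
    String collapseRuns(CharSequence text, char replacement) {
        StringBuilder out = new StringBuilder(text.length());
        boolean inRun = false;
        for (int i = 0; i < text.length(); ) {
            int cp = Character.codePointAt(text, i);
            i += Character.charCount(cp);
            if (contains(cp)) {
                if (!inRun) {
                    out.append(replacement);
                }
                inRun = true;
            } else {
                out.appendCodePoint(cp);
                inRun = false;
            }
        }
        return out.toString();
    }
}
```

For instance, `CodePointSetSketch.WHITE_SPACE.collapseRuns(text, ' ')` would fold a mix of NBSP, tabs, and ideographic spaces into single ASCII spaces. U+200B ZERO WIDTH SPACE is deliberately absent because it does not carry the White_Space property.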