OPENNLP-1850: Offset-safe input normalization in the DL components (3/4) by krickert · Pull Request #1105 · apache/opennlp

krickert · 2026-06-20T12:36:51Z

Part 3/4 of OPENNLP-1850. Behavioral DL integration, isolated for focused review.

opennlp-dl compile-depends on opennlp-runtime for CharClass (input chunking on Unicode whitespace/dash); InferenceOptions opt-ins; AbstractDL applies them offset-safely so NameFinderDL/DocumentCategorizerDL decode spans back to the original text.

Note: this only depends on the foundation (#1103), not the tokenizer — once #1103 merges, this PR can be re-targeted to main and merged independently of #2.

krickert · 2026-06-20T12:37:31Z

OPENNLP-1850 stacked PRs (review independently; merge bottom-up, re-targeting each base to main as the one below lands):

OPENNLP-1850: Unicode normalization foundation — CharClass engine, rungs, Dimension (1/4) #1103 — Unicode normalization foundation (CharClass engine, rungs, Dimension)
OPENNLP-1850: UAX #29 word tokenizer and the layered Term model (2/4) #1104 — UAX OPENNLP-910: Add checkstyle #29 word tokenizer + layered Term model
OPENNLP-1850: Offset-safe input normalization in the DL components (3/4) #1105 — Offset-safe input normalization in the DL components
OPENNLP-1850: Document Unicode normalization and the UAX #29 tokenizer (4/4) #1106 — Documentation

Supersedes #1101.

Copilot

Pull request overview

This PR (OPENNLP-1850, part 3/4) updates the DL components to handle Unicode whitespace/dashes more robustly and to locate decoded entity spans in the original source text without relying on regex-based matching, while adding opt-in, offset-aware input normalization controls via InferenceOptions.

Changes:

Add Unicode-aware whitespace chunking in AbstractDL and use it in NameFinderDL / DocumentCategorizerDL instead of text.split("\\s+").
Replace regex-based span localization in NameFinderDL with a cursor-based matcher that treats span spaces as flexible Unicode whitespace and matches other code points case-insensitively.
Introduce opt-in InferenceOptions toggles for whitespace/dash folding and document the behavior; add targeted regression tests.

Reviewed changes

Copilot reviewed 8 out of 8 changed files in this pull request and generated 1 comment.

Show a summary per file

File	Description
opennlp-core/opennlp-ml/opennlp-dl/src/main/java/opennlp/dl/AbstractDL.java	Adds shared Unicode whitespace/dash classes, optional input folding, and whitespace chunking helper used by DL components.
opennlp-core/opennlp-ml/opennlp-dl/src/main/java/opennlp/dl/InferenceOptions.java	Adds opt-in flags to normalize whitespace and dashes before inference.
opennlp-core/opennlp-ml/opennlp-dl/src/main/java/opennlp/dl/namefinder/NameFinderDL.java	Applies optional normalization, switches chunking to Unicode whitespace, and replaces regex span matching with a cursor matcher.
opennlp-core/opennlp-ml/opennlp-dl/src/main/java/opennlp/dl/doccat/DocumentCategorizerDL.java	Applies optional normalization and switches chunking to Unicode whitespace.
opennlp-core/opennlp-ml/opennlp-dl/src/test/java/opennlp/dl/AbstractDLChunkingTest.java	New model-free tests covering Unicode whitespace chunking and opt-in normalization behavior.
opennlp-core/opennlp-ml/opennlp-dl/src/test/java/opennlp/dl/namefinder/NameFinderDLTest.java	Adds regression tests for decoding spans across NBSP/ideographic spaces and updates comments for the new matcher.
opennlp-core/opennlp-ml/opennlp-dl/README.md	Documents Unicode whitespace chunking, cursor-based span localization, and new normalization options.
opennlp-core/opennlp-ml/opennlp-dl/pom.xml	Makes `opennlp-runtime` a compile dependency to use `CharClass` at runtime.

💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

rzo1 · 2026-06-21T17:04:22Z

Thx for the PR. Here are some suggestions:

opennlp-dl now compile-depends on all of opennlp-runtime (the pom promotes it from test to compile) solely to reach CharClass. Before this, the module compiled against opennlp-api plus onnxruntime only; it now transitively bundles the whole tools runtime.
@deprecated on NameFinderDL.find(String[]) is too broad. It's the canonical TokenNameFinder entry point used by every existing caller, but it only returns wrong offsets when normalizeDashes=true and the input contains a supplementary-plane dash (whitespace folding is length-preserving, so find() stays correct there). As written, every caller gets a deprecation warning for an edge case that can't affect them. Suggest dropping @deprecated, documenting the caveat in javadoc, and delegating the mapping (or logging once) only when a length-changing fold is active.
findInOriginal is only on the concrete type, not on TokenNameFinder, so interface-typed callers (UIMA, eval harness) can't reach the offset-correct path while the interface path is deprecated. Consider exposing an offset-mapped variant on the interface, or document the asymmetry.
Missing end-to-end test. The full normalize to decode to map-back path with a length-changing fold active is only covered piecemeal. Add one test asserting a span from findInOriginal covers the correct original text when normalizeDashes is on with a non-BMP dash. Also, NameFinderDLEval still calls the now-deprecated find(...).
Nit: README and javadoc say dash folding is "offset preserving for BMP". Worth stating the stronger guarantee that findInOriginal stays correct even for non-BMP dashes via the offset map.

krickert · 2026-06-21T21:37:20Z

opennlp-dl now compile-depends on all of opennlp-runtime.
Fixed at the root. The minimal normalization engine (CharClass, CodePointSet, UnicodeWhitespace, UnicodeDash) moved to opennlp-api, which already ships small concrete implementations (e.g. BertTokenizer, WhitespaceTokenizer). opennlp-dl now compiles against opennlp-api plus onnxruntime only; opennlp-runtime is back to a test-only dependency.

@deprecated on find(String[]) is too broad.
Fixed. The @Deprecated is removed. find's javadoc instead documents the single case it diverges: whitespace folding is length-preserving, so find is only affected when normalizeDashes is enabled and a non-BMP dash is present, in which case findInOriginal gives the exact original mapping. Every other caller is correct and no longer sees a warning. (This also resolves NameFinderDLEval calling a deprecated method.)

findInOriginal is only on the concrete type, not on TokenNameFinder.
Added it as a capability interface: OffsetMappingNameFinder extends TokenNameFinder in opennlp-api declares findInOriginal, and NameFinderDL implements it, so an interface-typed caller (UIMA, eval harness) reaches the offset-correct path with finder instanceof OffsetMappingNameFinder — a plain type check, no reflection. I went with a separate interface rather than a default method on TokenNameFinder for a concrete reason: TokenNameFinder.find is documented to return token spans (indices into the token array, as NameFinderME does), while findInOriginal returns character offsets in the original input, so a default findInOriginal(t) { return find(t); } would hand back token indices where the method promises character offsets. The only mismatch-free way to put it directly on TokenNameFinder is a default that throws UnsupportedOperationException, which callers have to guard against anyway, so it buys nothing over the capability check. (For full honesty, NameFinderDL.find itself already returns character offsets rather than token spans, so the DL coordinate story is a pre-existing deviation from the interface contract.) It's your interface, so if you'd still rather it live on TokenNameFinder in some form, say the word and I'll switch it.

Missing end-to-end test.
Added a real one. NameFinderDLEval.findInOriginalMapsSpansAcrossNonBmpDash runs the actual dslim/bert-base-NER ONNX model with normalizeDashes enabled and a non-BMP dash (Yezidi hyphen, U+10EAD) before the entity, and asserts the findInOriginal span covers "George Washington" at its true original offset rather than the one-unit-shorter normalized offset. It exercises the full normalize/tokenize/infer/decode/map-back path with no production test seam, and it passes. The offset-mapping math is additionally unit-tested model-free in AbstractDLChunkingTest. This also moots the "NameFinderDLEval still calls deprecated find()" point, since find() is no longer deprecated.

Nit: README/javadoc say offset-preserving "for BMP".
Fixed. The README now states the stronger guarantee: whitespace folding never moves offsets, and findInOriginal maps spans back through the Alignment so they stay correct in the original input even for non-BMP dashes.

…indInOriginal end-to-end test Address the two outstanding #1105 review points without a production test seam: - Capability interface: add OffsetMappingNameFinder (extends TokenNameFinder) in opennlp-api, declaring findInOriginal. NameFinderDL implements it, so an interface-typed caller (UIMA, eval harness) can reach the offset-correct path via 'instanceof OffsetMappingNameFinder' instead of the concrete type. It is a separate capability interface because the classic TokenNameFinder contract reports token-index spans, for which an original-character mapping is not meaningful. - End-to-end test: NameFinderDLEval.findInOriginalMapsSpansAcrossNonBmpDash runs the real model with normalizeDashes enabled and a non-BMP dash (U+10EAD) before an entity, and asserts the findInOriginal span covers the entity at its true original offset, not the one-unit-shorter normalized offset. This exercises the full normalize/tokenize/infer/decode/map-back path.

krickert · 2026-06-22T02:55:47Z

Status since the last review. Public API unchanged: find/findInOriginal signatures, OffsetMappingNameFinder, opennlp-dl compiles against opennlp-api only.

Chunk-boundary fix: decode each chunk bounded to its own character region, then mergeOverlappingSpans keeps the longer span (probability tie-break). The earlier "duplicate spans" framing did not hold; the forward cursor already serializes output. The real bug was a fuller boundary entity being dropped, now fixed. whitespaceChunkSpans carries the char span; whitespaceChunks delegates to it.
normalizeInputAligned composes whitespace and dash via Alignment.andThen, dropping the "whitespace is length-preserving" assumption. InferenceOptions javadoc documents 1:1 whitespace vs the runtime collapse-and-trim fold, and the supplementary-dash offset shift to findInOriginal().
Removed dead public I_PER/B_PER. mergeOverlappingSpans documented as length-dominant, not type-aware.
New tests: merge unit cases (containment, partial, tie, disjoint), two-chunk assembly, whitespaceChunkSpans offsets, composed whitespace+dash mapping, and two real-model NameFinderDLEval tests (boundary entity survives in full; overlap entity reported once). NameFinderDLEval is 13/13 on dslim/bert-base-NER.

Copilot

Copilot encountered an error and was unable to review this pull request. You can try again by re-requesting a review.

…indInOriginal end-to-end test Address the two outstanding #1105 review points without a production test seam: - Capability interface: add OffsetMappingNameFinder (extends TokenNameFinder) in opennlp-api, declaring findInOriginal. NameFinderDL implements it, so an interface-typed caller (UIMA, eval harness) can reach the offset-correct path via 'instanceof OffsetMappingNameFinder' instead of the concrete type. It is a separate capability interface because the classic TokenNameFinder contract reports token-index spans, for which an original-character mapping is not meaningful. - End-to-end test: NameFinderDLEval.findInOriginalMapsSpansAcrossNonBmpDash runs the real model with normalizeDashes enabled and a non-BMP dash (U+10EAD) before an entity, and asserts the findInOriginal span covers the entity at its true original offset, not the one-unit-shorter normalized offset. This exercises the full normalize/tokenize/infer/decode/map-back path.

… components Chunk on Unicode whitespace (not text.split), localize decoded entity spans with a cursor-based matcher that treats span spaces as flexible Unicode whitespace, and add opt-in InferenceOptions toggles to fold whitespace and dashes before inference. Entity spans stay offset-safe under those folds: NameFinderDL.findInOriginal maps each decoded span back to original-input coordinates through the normalization Alignment (Alignment.toOriginalSpan), so a length-changing fold (a supplementary dash shrinking) does not shift reported offsets. find() is the classic TokenNameFinder entry point and stays correct except when normalizeDashes is on and a non-BMP dash is present, which its javadoc documents. opennlp-dl compiles against opennlp-api plus onnxruntime only; CharClass lives in opennlp-api, so opennlp-runtime is a test-only dependency.

…indInOriginal end-to-end test Address the two outstanding #1105 review points without a production test seam: - Capability interface: add OffsetMappingNameFinder (extends TokenNameFinder) in opennlp-api, declaring findInOriginal. NameFinderDL implements it, so an interface-typed caller (UIMA, eval harness) can reach the offset-correct path via 'instanceof OffsetMappingNameFinder' instead of the concrete type. It is a separate capability interface because the classic TokenNameFinder contract reports token-index spans, for which an original-character mapping is not meaningful. - End-to-end test: NameFinderDLEval.findInOriginalMapsSpansAcrossNonBmpDash runs the real model with normalizeDashes enabled and a non-BMP dash (U+10EAD) before an entity, and asserts the findInOriginal span covers the entity at its true original offset, not the one-unit-shorter normalized offset. This exercises the full normalize/tokenize/infer/decode/map-back path.

…ignment Decode each whitespace chunk bounded to its own character region instead of threading one forward cursor across a sentence's overlapping chunks. The forward cursor kept whichever decode it reached first and silently dropped the other, so a boundary entity that one chunk truncates and the overlapping chunk covers in full could lose its fuller span. Now both chunks emit a candidate and mergeOverlappingSpans reconciles overlaps by keeping the longer span (the more complete decode), breaking ties toward the higher probability; adjacent but disjoint spans and repeated surface forms at distinct offsets are preserved. whitespaceChunkSpans carries each chunk's character span (whitespaceChunks now delegates to it, single-sourcing the chunking), so a chunk's decoded spans can be located within the region it covers. Compose the optional whitespace and dash folds in normalizeInputAligned through Alignment.andThen rather than relying on whitespace folding staying length-preserving, so findInOriginal() stays correct regardless of the whitespace fold's behavior. Document on InferenceOptions that whitespace folding is a one-for-one replacement (not the runtime collapse-and-trim fold) and that a supplementary-plane dash shifts find() offsets, with findInOriginal() as the exact-offset path. Tests: mergeOverlappingSpans containment, partial overlap, probability tie, and disjoint cases; a two-overlapping-chunk assembly that keeps the fuller boundary span; whitespaceChunkSpans character offsets; and a composed whitespace+dash aligned mapping back to the original.

…l constants Add two NameFinderDLEval tests that exercise the chunk-boundary span resolution end to end with the real ONNX model: one where "United States" straddles a chunk boundary and must survive in full (not a truncated copy), and one where it sits inside the chunk overlap and is decoded by both chunks but must be reported once. Both also assert the output has no overlapping spans. Remove the unused public I_PER and B_PER label constants (decoding handles any B-/I- type, and they had no callers). Document that mergeOverlappingSpans is length-dominant rather than type-aware.

DocumentCategorizerDL.categorize now rejects a tokenless document (empty or only whitespace, including Unicode whitespace) with IllegalArgumentException, instead of building an empty score list and letting the scoring strategy fail with an IndexOutOfBoundsException. infer() reports a null model output as "null" rather than dereferencing it for getClass(). Both DocumentCategorizerDL and NameFinderDL constructors now reject a null InferenceOptions explicitly, alongside the other collaborators, for a clear early failure. NameFinderDL.buildSpanText skips RoBERTa <s>/</s> markers as well as BERT [CLS]/[SEP], so a special token can never leak into a reconstructed span when a RoBERTa-style tokenizer is used. Tests: tokenless content rejection (empty, ASCII whitespace, no-break space); null InferenceOptions rejection at construction; and buildSpanText skipping the RoBERTa special tokens.

…tions NameFinderDL.find and findInOriginal now reject a null token array up front via locate(), instead of failing deeper inside String.join. Add an eval test that both entry points throw on null. The disabled categorizeWithGpu eval test built the categorizer with a fresh InferenceOptions instead of the one it had configured with the GPU flags, so it would not have exercised the GPU path if enabled; pass the configured options.

DocumentCategorizerDL.categorize now rejects a model whose output length does not match the configured category count, instead of letting getBestCategory or scoreMap index past the score array. softmax rejects a NaN logit rather than silently returning an all-NaN distribution. Add a softmax NaN test.

@see

Drop the import of TokenNameFinder, which was referenced only from javadoc, and fully-qualify the {@link}/@see references instead. Removes any ambiguity for tools that flag the import as unused, with no behavioral change.

The fail-loud guard added for corrupt classifier output caught NaN but not Infinity: a +Infinity logit (which is not NaN) made max == +Inf, so value - max == NaN, every exp() == NaN, and categorize() silently returned an all-NaN distribution instead of failing loud. Guard !Float.isFinite(value); the numerically-stable max-subtraction still handles merely-large finite logits. testSoftmaxRejectsInfiniteLogit proves the red->green.

The overlap check scanned every kept span per candidate (O(n^2)). Kept spans never overlap each other, so a candidate can only intersect the kept span at/just-before its start (floor) or the next one starting within it (ceiling); a TreeMap keyed by start checks both in O(log n) and yields the result already in document order. Same longest-wins greedy and same outputs; merge unit tests green.

krickert mentioned this pull request Jun 20, 2026

OPENNLP-1850: Unicode normalization foundation — CharClass engine, rungs, Dimension (1/4) #1103

Closed

This was referenced Jun 20, 2026

OPENNLP-1850: UAX #29 word tokenizer and the layered Term model (2/4) #1104

Closed

OPENNLP-1850: Document Unicode normalization and the UAX #29 tokenizer (4/4) #1106

Draft

OPENNLP-1850: Improve Whitespace UTF normalization #1101

Closed

krickert marked this pull request as draft June 20, 2026 14:43

krickert requested a review from Copilot June 20, 2026 14:56

Copilot started reviewing on behalf of krickert June 20, 2026 14:57 View session

Copilot AI reviewed Jun 20, 2026

View reviewed changes

Comment thread opennlp-core/opennlp-ml/opennlp-dl/src/main/java/opennlp/dl/AbstractDL.java

krickert force-pushed the OPENNLP-1850-2-tokenizer branch from dab5605 to 67c922a Compare June 20, 2026 20:16

krickert force-pushed the OPENNLP-1850-3-dl branch from 1c17110 to 8534bb3 Compare June 20, 2026 20:16

krickert force-pushed the OPENNLP-1850-2-tokenizer branch from 67c922a to 81aa6c5 Compare June 21, 2026 19:00

krickert force-pushed the OPENNLP-1850-3-dl branch 2 times, most recently from 5154da4 to c51f37d Compare June 21, 2026 19:21

krickert force-pushed the OPENNLP-1850-2-tokenizer branch from 81aa6c5 to 36de08f Compare June 21, 2026 19:21

krickert force-pushed the OPENNLP-1850-2-tokenizer branch from 36de08f to d76a28b Compare June 21, 2026 22:59

krickert force-pushed the OPENNLP-1850-3-dl branch from 40698dc to 001ac01 Compare June 21, 2026 22:59

krickert force-pushed the OPENNLP-1850-2-tokenizer branch from d76a28b to 40cd729 Compare June 22, 2026 00:19

krickert force-pushed the OPENNLP-1850-3-dl branch from 001ac01 to 038e23d Compare June 22, 2026 00:19

krickert force-pushed the OPENNLP-1850-2-tokenizer branch from 40cd729 to 8c2451a Compare June 22, 2026 01:52

krickert force-pushed the OPENNLP-1850-3-dl branch from 038e23d to bc401d3 Compare June 22, 2026 01:52

krickert force-pushed the OPENNLP-1850-2-tokenizer branch from 8c2451a to 3f06095 Compare June 22, 2026 02:09

krickert force-pushed the OPENNLP-1850-3-dl branch from bc401d3 to 4c12897 Compare June 22, 2026 02:10

krickert requested a review from Copilot June 22, 2026 02:59

Copilot started reviewing on behalf of krickert June 22, 2026 03:00 View session

rzo1 requested a review from Copilot June 23, 2026 10:48

Copilot AI reviewed Jun 23, 2026

Copilot started reviewing on behalf of rzo1 June 23, 2026 11:19 View session

krickert force-pushed the OPENNLP-1850-2-tokenizer branch from e0ea17c to 7a3c25a Compare June 23, 2026 13:14

krickert force-pushed the OPENNLP-1850-3-dl branch from 3f77034 to 1ea12ea Compare June 23, 2026 13:14

krickert force-pushed the OPENNLP-1850-2-tokenizer branch from 7a3c25a to 0bf7f6c Compare June 23, 2026 14:02

krickert force-pushed the OPENNLP-1850-3-dl branch from 1ea12ea to e7f3c59 Compare June 23, 2026 14:02

krickert force-pushed the OPENNLP-1850-3-dl branch from e7f3c59 to b6dc241 Compare June 23, 2026 15:17

krickert changed the base branch from OPENNLP-1850-2-tokenizer to OPENNLP-1850-2c-profiles June 23, 2026 15:19

krickert force-pushed the OPENNLP-1850-2c-profiles branch from b6cd173 to a345f48 Compare June 24, 2026 11:20

krickert force-pushed the OPENNLP-1850-3-dl branch from b6dc241 to be98bd6 Compare June 24, 2026 11:20

krickert force-pushed the OPENNLP-1850-2c-profiles branch from a345f48 to 93618ce Compare June 24, 2026 11:54

krickert force-pushed the OPENNLP-1850-3-dl branch from be98bd6 to 5a1114c Compare June 24, 2026 11:54

krickert added 10 commits June 25, 2026 04:23

krickert force-pushed the OPENNLP-1850-2c-profiles branch from 93618ce to 70ec7a3 Compare June 25, 2026 08:26

krickert force-pushed the OPENNLP-1850-3-dl branch from 5a1114c to 57ceefd Compare June 25, 2026 08:26

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Uh oh!

OPENNLP-1850: Offset-safe input normalization in the DL components (3/4)#1105

OPENNLP-1850: Offset-safe input normalization in the DL components (3/4)#1105
krickert wants to merge 10 commits into
OPENNLP-1850-2c-profilesfrom
OPENNLP-1850-3-dl

krickert commented Jun 20, 2026

Uh oh!

krickert commented Jun 20, 2026

Uh oh!

Copilot AI left a comment

Uh oh!

Uh oh!

rzo1 commented Jun 21, 2026

Uh oh!

krickert commented Jun 21, 2026

Uh oh!

krickert commented Jun 22, 2026 •

edited

Loading

Uh oh!

Copilot AI left a comment

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

3 participants

Uh oh!

Conversation

krickert commented Jun 20, 2026

Uh oh!

krickert commented Jun 20, 2026

Uh oh!

Copilot AI left a comment

Choose a reason for hiding this comment

Pull request overview

Reviewed changes

Uh oh!

Uh oh!

rzo1 commented Jun 21, 2026

Uh oh!

krickert commented Jun 21, 2026

Uh oh!

krickert commented Jun 22, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

Copilot AI left a comment

Choose a reason for hiding this comment

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

3 participants

krickert commented Jun 22, 2026 •

edited

Loading