Skip to content

OPENNLP-1850: Offset-safe input normalization in the DL components (3/4)#1105

Draft
krickert wants to merge 10 commits into
OPENNLP-1850-2c-profilesfrom
OPENNLP-1850-3-dl
Draft

OPENNLP-1850: Offset-safe input normalization in the DL components (3/4)#1105
krickert wants to merge 10 commits into
OPENNLP-1850-2c-profilesfrom
OPENNLP-1850-3-dl

Conversation

@krickert

Copy link
Copy Markdown
Contributor

Part 3/4 of OPENNLP-1850. Behavioral DL integration, isolated for focused review.

opennlp-dl compile-depends on opennlp-runtime for CharClass (input chunking on Unicode whitespace/dash); InferenceOptions opt-ins; AbstractDL applies them offset-safely so NameFinderDL/DocumentCategorizerDL decode spans back to the original text.

Note: this only depends on the foundation (#1103), not the tokenizer — once #1103 merges, this PR can be re-targeted to main and merged independently of #2.

@krickert

Copy link
Copy Markdown
Contributor Author

OPENNLP-1850 stacked PRs (review independently; merge bottom-up, re-targeting each base to main as the one below lands):

  1. OPENNLP-1850: Unicode normalization foundation — CharClass engine, rungs, Dimension (1/4) #1103 — Unicode normalization foundation (CharClass engine, rungs, Dimension)
  2. OPENNLP-1850: UAX #29 word tokenizer and the layered Term model (2/4) #1104 — UAX OPENNLP-910: Add checkstyle #29 word tokenizer + layered Term model
  3. OPENNLP-1850: Offset-safe input normalization in the DL components (3/4) #1105 — Offset-safe input normalization in the DL components
  4. OPENNLP-1850: Document Unicode normalization and the UAX #29 tokenizer (4/4) #1106 — Documentation

Supersedes #1101.

Copilot AI left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

This PR (OPENNLP-1850, part 3/4) updates the DL components to handle Unicode whitespace/dashes more robustly and to locate decoded entity spans in the original source text without relying on regex-based matching, while adding opt-in, offset-aware input normalization controls via InferenceOptions.

Changes:

  • Add Unicode-aware whitespace chunking in AbstractDL and use it in NameFinderDL / DocumentCategorizerDL instead of text.split("\\s+").
  • Replace regex-based span localization in NameFinderDL with a cursor-based matcher that treats span spaces as flexible Unicode whitespace and matches other code points case-insensitively.
  • Introduce opt-in InferenceOptions toggles for whitespace/dash folding and document the behavior; add targeted regression tests.

Reviewed changes

Copilot reviewed 8 out of 8 changed files in this pull request and generated 1 comment.

Show a summary per file
File Description
opennlp-core/opennlp-ml/opennlp-dl/src/main/java/opennlp/dl/AbstractDL.java Adds shared Unicode whitespace/dash classes, optional input folding, and whitespace chunking helper used by DL components.
opennlp-core/opennlp-ml/opennlp-dl/src/main/java/opennlp/dl/InferenceOptions.java Adds opt-in flags to normalize whitespace and dashes before inference.
opennlp-core/opennlp-ml/opennlp-dl/src/main/java/opennlp/dl/namefinder/NameFinderDL.java Applies optional normalization, switches chunking to Unicode whitespace, and replaces regex span matching with a cursor matcher.
opennlp-core/opennlp-ml/opennlp-dl/src/main/java/opennlp/dl/doccat/DocumentCategorizerDL.java Applies optional normalization and switches chunking to Unicode whitespace.
opennlp-core/opennlp-ml/opennlp-dl/src/test/java/opennlp/dl/AbstractDLChunkingTest.java New model-free tests covering Unicode whitespace chunking and opt-in normalization behavior.
opennlp-core/opennlp-ml/opennlp-dl/src/test/java/opennlp/dl/namefinder/NameFinderDLTest.java Adds regression tests for decoding spans across NBSP/ideographic spaces and updates comments for the new matcher.
opennlp-core/opennlp-ml/opennlp-dl/README.md Documents Unicode whitespace chunking, cursor-based span localization, and new normalization options.
opennlp-core/opennlp-ml/opennlp-dl/pom.xml Makes opennlp-runtime a compile dependency to use CharClass at runtime.

💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

@krickert krickert force-pushed the OPENNLP-1850-2-tokenizer branch from dab5605 to 67c922a Compare June 20, 2026 20:16
@krickert krickert force-pushed the OPENNLP-1850-3-dl branch from 1c17110 to 8534bb3 Compare June 20, 2026 20:16
@rzo1

rzo1 commented Jun 21, 2026

Copy link
Copy Markdown
Contributor

Thx for the PR. Here are some suggestions:

  • opennlp-dl now compile-depends on all of opennlp-runtime (the pom promotes it from test to compile) solely to reach CharClass. Before this, the module compiled against opennlp-api plus onnxruntime only; it now transitively bundles the whole tools runtime.
  • @deprecated on NameFinderDL.find(String[]) is too broad. It's the canonical TokenNameFinder entry point used by every existing caller, but it only returns wrong offsets when normalizeDashes=true and the input contains a supplementary-plane dash (whitespace folding is length-preserving, so find() stays correct there). As written, every caller gets a deprecation warning for an edge case that can't affect them. Suggest dropping @deprecated, documenting the caveat in javadoc, and delegating the mapping (or logging once) only when a length-changing fold is active.
  • findInOriginal is only on the concrete type, not on TokenNameFinder, so interface-typed callers (UIMA, eval harness) can't reach the offset-correct path while the interface path is deprecated. Consider exposing an offset-mapped variant on the interface, or document the asymmetry.
  • Missing end-to-end test. The full normalize to decode to map-back path with a length-changing fold active is only covered piecemeal. Add one test asserting a span from findInOriginal covers the correct original text when normalizeDashes is on with a non-BMP dash. Also, NameFinderDLEval still calls the now-deprecated find(...).
  • Nit: README and javadoc say dash folding is "offset preserving for BMP". Worth stating the stronger guarantee that findInOriginal stays correct even for non-BMP dashes via the offset map.

@krickert krickert force-pushed the OPENNLP-1850-2-tokenizer branch from 67c922a to 81aa6c5 Compare June 21, 2026 19:00
@krickert krickert force-pushed the OPENNLP-1850-3-dl branch 2 times, most recently from 5154da4 to c51f37d Compare June 21, 2026 19:21
@krickert krickert force-pushed the OPENNLP-1850-2-tokenizer branch from 81aa6c5 to 36de08f Compare June 21, 2026 19:21
@krickert

Copy link
Copy Markdown
Contributor Author

opennlp-dl now compile-depends on all of opennlp-runtime.
Fixed at the root. The minimal normalization engine (CharClass, CodePointSet, UnicodeWhitespace, UnicodeDash) moved to opennlp-api, which already ships small concrete implementations (e.g. BertTokenizer, WhitespaceTokenizer). opennlp-dl now compiles against opennlp-api plus onnxruntime only; opennlp-runtime is back to a test-only dependency.

@deprecated on find(String[]) is too broad.
Fixed. The @Deprecated is removed. find's javadoc instead documents the single case it diverges: whitespace folding is length-preserving, so find is only affected when normalizeDashes is enabled and a non-BMP dash is present, in which case findInOriginal gives the exact original mapping. Every other caller is correct and no longer sees a warning. (This also resolves NameFinderDLEval calling a deprecated method.)

findInOriginal is only on the concrete type, not on TokenNameFinder.
Added it as a capability interface: OffsetMappingNameFinder extends TokenNameFinder in opennlp-api declares findInOriginal, and NameFinderDL implements it, so an interface-typed caller (UIMA, eval harness) reaches the offset-correct path with finder instanceof OffsetMappingNameFinder — a plain type check, no reflection. I went with a separate interface rather than a default method on TokenNameFinder for a concrete reason: TokenNameFinder.find is documented to return token spans (indices into the token array, as NameFinderME does), while findInOriginal returns character offsets in the original input, so a default findInOriginal(t) { return find(t); } would hand back token indices where the method promises character offsets. The only mismatch-free way to put it directly on TokenNameFinder is a default that throws UnsupportedOperationException, which callers have to guard against anyway, so it buys nothing over the capability check. (For full honesty, NameFinderDL.find itself already returns character offsets rather than token spans, so the DL coordinate story is a pre-existing deviation from the interface contract.) It's your interface, so if you'd still rather it live on TokenNameFinder in some form, say the word and I'll switch it.

Missing end-to-end test.
Added a real one. NameFinderDLEval.findInOriginalMapsSpansAcrossNonBmpDash runs the actual dslim/bert-base-NER ONNX model with normalizeDashes enabled and a non-BMP dash (Yezidi hyphen, U+10EAD) before the entity, and asserts the findInOriginal span covers "George Washington" at its true original offset rather than the one-unit-shorter normalized offset. It exercises the full normalize/tokenize/infer/decode/map-back path with no production test seam, and it passes. The offset-mapping math is additionally unit-tested model-free in AbstractDLChunkingTest. This also moots the "NameFinderDLEval still calls deprecated find()" point, since find() is no longer deprecated.

Nit: README/javadoc say offset-preserving "for BMP".
Fixed. The README now states the stronger guarantee: whitespace folding never moves offsets, and findInOriginal maps spans back through the Alignment so they stay correct in the original input even for non-BMP dashes.

@krickert krickert force-pushed the OPENNLP-1850-2-tokenizer branch from 36de08f to d76a28b Compare June 21, 2026 22:59
@krickert krickert force-pushed the OPENNLP-1850-3-dl branch from 40698dc to 001ac01 Compare June 21, 2026 22:59
@krickert krickert force-pushed the OPENNLP-1850-2-tokenizer branch from d76a28b to 40cd729 Compare June 22, 2026 00:19
@krickert krickert force-pushed the OPENNLP-1850-3-dl branch from 001ac01 to 038e23d Compare June 22, 2026 00:19
@krickert krickert force-pushed the OPENNLP-1850-2-tokenizer branch from 40cd729 to 8c2451a Compare June 22, 2026 01:52
krickert added a commit that referenced this pull request Jun 22, 2026
…indInOriginal end-to-end test

Address the two outstanding #1105 review points without a production test seam:

- Capability interface: add OffsetMappingNameFinder (extends TokenNameFinder) in opennlp-api,
  declaring findInOriginal. NameFinderDL implements it, so an interface-typed caller (UIMA, eval
  harness) can reach the offset-correct path via 'instanceof OffsetMappingNameFinder' instead of
  the concrete type. It is a separate capability interface because the classic TokenNameFinder
  contract reports token-index spans, for which an original-character mapping is not meaningful.

- End-to-end test: NameFinderDLEval.findInOriginalMapsSpansAcrossNonBmpDash runs the real model
  with normalizeDashes enabled and a non-BMP dash (U+10EAD) before an entity, and asserts the
  findInOriginal span covers the entity at its true original offset, not the one-unit-shorter
  normalized offset. This exercises the full normalize/tokenize/infer/decode/map-back path.
@krickert krickert force-pushed the OPENNLP-1850-3-dl branch from 038e23d to bc401d3 Compare June 22, 2026 01:52
@krickert krickert force-pushed the OPENNLP-1850-2-tokenizer branch from 8c2451a to 3f06095 Compare June 22, 2026 02:09
krickert added a commit that referenced this pull request Jun 22, 2026
…indInOriginal end-to-end test

Address the two outstanding #1105 review points without a production test seam:

- Capability interface: add OffsetMappingNameFinder (extends TokenNameFinder) in opennlp-api,
  declaring findInOriginal. NameFinderDL implements it, so an interface-typed caller (UIMA, eval
  harness) can reach the offset-correct path via 'instanceof OffsetMappingNameFinder' instead of
  the concrete type. It is a separate capability interface because the classic TokenNameFinder
  contract reports token-index spans, for which an original-character mapping is not meaningful.

- End-to-end test: NameFinderDLEval.findInOriginalMapsSpansAcrossNonBmpDash runs the real model
  with normalizeDashes enabled and a non-BMP dash (U+10EAD) before an entity, and asserts the
  findInOriginal span covers the entity at its true original offset, not the one-unit-shorter
  normalized offset. This exercises the full normalize/tokenize/infer/decode/map-back path.
@krickert krickert force-pushed the OPENNLP-1850-3-dl branch from bc401d3 to 4c12897 Compare June 22, 2026 02:10
@krickert

krickert commented Jun 22, 2026

Copy link
Copy Markdown
Contributor Author

Status since the last review. Public API unchanged: find/findInOriginal signatures, OffsetMappingNameFinder, opennlp-dl compiles against opennlp-api only.

  • Chunk-boundary fix: decode each chunk bounded to its own character region, then mergeOverlappingSpans keeps the longer span (probability tie-break). The earlier "duplicate spans" framing did not hold; the forward cursor already serializes output. The real bug was a fuller boundary entity being dropped, now fixed. whitespaceChunkSpans carries the char span; whitespaceChunks delegates to it.
  • normalizeInputAligned composes whitespace and dash via Alignment.andThen, dropping the "whitespace is length-preserving" assumption. InferenceOptions javadoc documents 1:1 whitespace vs the runtime collapse-and-trim fold, and the supplementary-dash offset shift to findInOriginal().
  • Removed dead public I_PER/B_PER. mergeOverlappingSpans documented as length-dominant, not type-aware.
  • New tests: merge unit cases (containment, partial, tie, disjoint), two-chunk assembly, whitespaceChunkSpans offsets, composed whitespace+dash mapping, and two real-model NameFinderDLEval tests (boundary entity survives in full; overlap entity reported once). NameFinderDLEval is 13/13 on dslim/bert-base-NER.

@rzo1 rzo1 requested a review from Copilot June 23, 2026 10:48

Copilot AI left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Copilot encountered an error and was unable to review this pull request. You can try again by re-requesting a review.

krickert added a commit that referenced this pull request Jun 23, 2026
…indInOriginal end-to-end test

Address the two outstanding #1105 review points without a production test seam:

- Capability interface: add OffsetMappingNameFinder (extends TokenNameFinder) in opennlp-api,
  declaring findInOriginal. NameFinderDL implements it, so an interface-typed caller (UIMA, eval
  harness) can reach the offset-correct path via 'instanceof OffsetMappingNameFinder' instead of
  the concrete type. It is a separate capability interface because the classic TokenNameFinder
  contract reports token-index spans, for which an original-character mapping is not meaningful.

- End-to-end test: NameFinderDLEval.findInOriginalMapsSpansAcrossNonBmpDash runs the real model
  with normalizeDashes enabled and a non-BMP dash (U+10EAD) before an entity, and asserts the
  findInOriginal span covers the entity at its true original offset, not the one-unit-shorter
  normalized offset. This exercises the full normalize/tokenize/infer/decode/map-back path.
@krickert krickert force-pushed the OPENNLP-1850-2-tokenizer branch from e0ea17c to 7a3c25a Compare June 23, 2026 13:14
@krickert krickert force-pushed the OPENNLP-1850-3-dl branch from 3f77034 to 1ea12ea Compare June 23, 2026 13:14
@krickert krickert force-pushed the OPENNLP-1850-2-tokenizer branch from 7a3c25a to 0bf7f6c Compare June 23, 2026 14:02
krickert added a commit that referenced this pull request Jun 23, 2026
…indInOriginal end-to-end test

Address the two outstanding #1105 review points without a production test seam:

- Capability interface: add OffsetMappingNameFinder (extends TokenNameFinder) in opennlp-api,
  declaring findInOriginal. NameFinderDL implements it, so an interface-typed caller (UIMA, eval
  harness) can reach the offset-correct path via 'instanceof OffsetMappingNameFinder' instead of
  the concrete type. It is a separate capability interface because the classic TokenNameFinder
  contract reports token-index spans, for which an original-character mapping is not meaningful.

- End-to-end test: NameFinderDLEval.findInOriginalMapsSpansAcrossNonBmpDash runs the real model
  with normalizeDashes enabled and a non-BMP dash (U+10EAD) before an entity, and asserts the
  findInOriginal span covers the entity at its true original offset, not the one-unit-shorter
  normalized offset. This exercises the full normalize/tokenize/infer/decode/map-back path.
@krickert krickert force-pushed the OPENNLP-1850-3-dl branch from 1ea12ea to e7f3c59 Compare June 23, 2026 14:02
krickert added a commit that referenced this pull request Jun 23, 2026
…indInOriginal end-to-end test

Address the two outstanding #1105 review points without a production test seam:

- Capability interface: add OffsetMappingNameFinder (extends TokenNameFinder) in opennlp-api,
  declaring findInOriginal. NameFinderDL implements it, so an interface-typed caller (UIMA, eval
  harness) can reach the offset-correct path via 'instanceof OffsetMappingNameFinder' instead of
  the concrete type. It is a separate capability interface because the classic TokenNameFinder
  contract reports token-index spans, for which an original-character mapping is not meaningful.

- End-to-end test: NameFinderDLEval.findInOriginalMapsSpansAcrossNonBmpDash runs the real model
  with normalizeDashes enabled and a non-BMP dash (U+10EAD) before an entity, and asserts the
  findInOriginal span covers the entity at its true original offset, not the one-unit-shorter
  normalized offset. This exercises the full normalize/tokenize/infer/decode/map-back path.
@krickert krickert force-pushed the OPENNLP-1850-3-dl branch from e7f3c59 to b6dc241 Compare June 23, 2026 15:17
@krickert krickert changed the base branch from OPENNLP-1850-2-tokenizer to OPENNLP-1850-2c-profiles June 23, 2026 15:19
@krickert krickert force-pushed the OPENNLP-1850-2c-profiles branch from b6cd173 to a345f48 Compare June 24, 2026 11:20
krickert added a commit that referenced this pull request Jun 24, 2026
…indInOriginal end-to-end test

Address the two outstanding #1105 review points without a production test seam:

- Capability interface: add OffsetMappingNameFinder (extends TokenNameFinder) in opennlp-api,
  declaring findInOriginal. NameFinderDL implements it, so an interface-typed caller (UIMA, eval
  harness) can reach the offset-correct path via 'instanceof OffsetMappingNameFinder' instead of
  the concrete type. It is a separate capability interface because the classic TokenNameFinder
  contract reports token-index spans, for which an original-character mapping is not meaningful.

- End-to-end test: NameFinderDLEval.findInOriginalMapsSpansAcrossNonBmpDash runs the real model
  with normalizeDashes enabled and a non-BMP dash (U+10EAD) before an entity, and asserts the
  findInOriginal span covers the entity at its true original offset, not the one-unit-shorter
  normalized offset. This exercises the full normalize/tokenize/infer/decode/map-back path.
@krickert krickert force-pushed the OPENNLP-1850-3-dl branch from b6dc241 to be98bd6 Compare June 24, 2026 11:20
@krickert krickert force-pushed the OPENNLP-1850-2c-profiles branch from a345f48 to 93618ce Compare June 24, 2026 11:54
krickert added a commit that referenced this pull request Jun 24, 2026
…indInOriginal end-to-end test

Address the two outstanding #1105 review points without a production test seam:

- Capability interface: add OffsetMappingNameFinder (extends TokenNameFinder) in opennlp-api,
  declaring findInOriginal. NameFinderDL implements it, so an interface-typed caller (UIMA, eval
  harness) can reach the offset-correct path via 'instanceof OffsetMappingNameFinder' instead of
  the concrete type. It is a separate capability interface because the classic TokenNameFinder
  contract reports token-index spans, for which an original-character mapping is not meaningful.

- End-to-end test: NameFinderDLEval.findInOriginalMapsSpansAcrossNonBmpDash runs the real model
  with normalizeDashes enabled and a non-BMP dash (U+10EAD) before an entity, and asserts the
  findInOriginal span covers the entity at its true original offset, not the one-unit-shorter
  normalized offset. This exercises the full normalize/tokenize/infer/decode/map-back path.
@krickert krickert force-pushed the OPENNLP-1850-3-dl branch from be98bd6 to 5a1114c Compare June 24, 2026 11:54
krickert added 10 commits June 25, 2026 04:23
… components

Chunk on Unicode whitespace (not text.split), localize decoded entity spans with a
cursor-based matcher that treats span spaces as flexible Unicode whitespace, and add opt-in
InferenceOptions toggles to fold whitespace and dashes before inference.

Entity spans stay offset-safe under those folds: NameFinderDL.findInOriginal maps each decoded
span back to original-input coordinates through the normalization Alignment
(Alignment.toOriginalSpan), so a length-changing fold (a supplementary dash shrinking) does not
shift reported offsets. find() is the classic TokenNameFinder entry point and stays correct
except when normalizeDashes is on and a non-BMP dash is present, which its javadoc documents.

opennlp-dl compiles against opennlp-api plus onnxruntime only; CharClass lives in opennlp-api,
so opennlp-runtime is a test-only dependency.
…indInOriginal end-to-end test

Address the two outstanding #1105 review points without a production test seam:

- Capability interface: add OffsetMappingNameFinder (extends TokenNameFinder) in opennlp-api,
  declaring findInOriginal. NameFinderDL implements it, so an interface-typed caller (UIMA, eval
  harness) can reach the offset-correct path via 'instanceof OffsetMappingNameFinder' instead of
  the concrete type. It is a separate capability interface because the classic TokenNameFinder
  contract reports token-index spans, for which an original-character mapping is not meaningful.

- End-to-end test: NameFinderDLEval.findInOriginalMapsSpansAcrossNonBmpDash runs the real model
  with normalizeDashes enabled and a non-BMP dash (U+10EAD) before an entity, and asserts the
  findInOriginal span covers the entity at its true original offset, not the one-unit-shorter
  normalized offset. This exercises the full normalize/tokenize/infer/decode/map-back path.
…ignment

Decode each whitespace chunk bounded to its own character region instead of
threading one forward cursor across a sentence's overlapping chunks. The
forward cursor kept whichever decode it reached first and silently dropped the
other, so a boundary entity that one chunk truncates and the overlapping chunk
covers in full could lose its fuller span. Now both chunks emit a candidate and
mergeOverlappingSpans reconciles overlaps by keeping the longer span (the more
complete decode), breaking ties toward the higher probability; adjacent but
disjoint spans and repeated surface forms at distinct offsets are preserved.

whitespaceChunkSpans carries each chunk's character span (whitespaceChunks now
delegates to it, single-sourcing the chunking), so a chunk's decoded spans can
be located within the region it covers.

Compose the optional whitespace and dash folds in normalizeInputAligned through
Alignment.andThen rather than relying on whitespace folding staying
length-preserving, so findInOriginal() stays correct regardless of the
whitespace fold's behavior. Document on InferenceOptions that whitespace folding
is a one-for-one replacement (not the runtime collapse-and-trim fold) and that a
supplementary-plane dash shifts find() offsets, with findInOriginal() as the
exact-offset path.

Tests: mergeOverlappingSpans containment, partial overlap, probability tie, and
disjoint cases; a two-overlapping-chunk assembly that keeps the fuller boundary
span; whitespaceChunkSpans character offsets; and a composed whitespace+dash
aligned mapping back to the original.
…l constants

Add two NameFinderDLEval tests that exercise the chunk-boundary span resolution
end to end with the real ONNX model: one where "United States" straddles a chunk
boundary and must survive in full (not a truncated copy), and one where it sits
inside the chunk overlap and is decoded by both chunks but must be reported once.
Both also assert the output has no overlapping spans.

Remove the unused public I_PER and B_PER label constants (decoding handles any
B-/I- type, and they had no callers). Document that mergeOverlappingSpans is
length-dominant rather than type-aware.
DocumentCategorizerDL.categorize now rejects a tokenless document (empty or only
whitespace, including Unicode whitespace) with IllegalArgumentException, instead
of building an empty score list and letting the scoring strategy fail with an
IndexOutOfBoundsException. infer() reports a null model output as "null" rather
than dereferencing it for getClass(). Both DocumentCategorizerDL and NameFinderDL
constructors now reject a null InferenceOptions explicitly, alongside the other
collaborators, for a clear early failure.

NameFinderDL.buildSpanText skips RoBERTa <s>/</s> markers as well as BERT
[CLS]/[SEP], so a special token can never leak into a reconstructed span when a
RoBERTa-style tokenizer is used.

Tests: tokenless content rejection (empty, ASCII whitespace, no-break space);
null InferenceOptions rejection at construction; and buildSpanText skipping the
RoBERTa special tokens.
…tions

NameFinderDL.find and findInOriginal now reject a null token array up front via
locate(), instead of failing deeper inside String.join. Add an eval test that
both entry points throw on null.

The disabled categorizeWithGpu eval test built the categorizer with a fresh
InferenceOptions instead of the one it had configured with the GPU flags, so it
would not have exercised the GPU path if enabled; pass the configured options.
DocumentCategorizerDL.categorize now rejects a model whose output length does not
match the configured category count, instead of letting getBestCategory or
scoreMap index past the score array. softmax rejects a NaN logit rather than
silently returning an all-NaN distribution. Add a softmax NaN test.
Drop the import of TokenNameFinder, which was referenced only from javadoc, and fully-qualify
the {@link}/@see references instead. Removes any ambiguity for tools that flag the import as
unused, with no behavioral change.
The fail-loud guard added for corrupt classifier output caught NaN but not Infinity: a +Infinity
logit (which is not NaN) made max == +Inf, so value - max == NaN, every exp() == NaN, and categorize()
silently returned an all-NaN distribution instead of failing loud. Guard !Float.isFinite(value); the
numerically-stable max-subtraction still handles merely-large finite logits. testSoftmaxRejectsInfiniteLogit
proves the red->green.
The overlap check scanned every kept span per candidate (O(n^2)). Kept spans never overlap each
other, so a candidate can only intersect the kept span at/just-before its start (floor) or the next
one starting within it (ceiling); a TreeMap keyed by start checks both in O(log n) and yields the
result already in document order. Same longest-wins greedy and same outputs; merge unit tests green.
@krickert krickert force-pushed the OPENNLP-1850-2c-profiles branch from 93618ce to 70ec7a3 Compare June 25, 2026 08:26
@krickert krickert force-pushed the OPENNLP-1850-3-dl branch from 5a1114c to 57ceefd Compare June 25, 2026 08:26
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants