OPENNLP-1850: Document Unicode normalization and the UAX #29 tokenizer (4/4) by krickert · Pull Request #1106 · apache/opennlp

krickert · 2026-06-20T12:36:52Z

Part 4/4 of OPENNLP-1850. New normalizer manual chapter plus tokenizer/doccat/namefinder/introduction updates and the master opennlp.xml.

krickert · 2026-06-20T12:37:31Z

OPENNLP-1850 stacked PRs (review independently; merge bottom-up, re-targeting each base to main as the one below lands):

#1103 (1/4): Unicode normalization foundation (CharClass engine, rungs, Dimension)
#1104 (2/4): UAX #29 word tokenizer and the layered Term model
#1105 (3/4): Offset-safe input normalization in the DL components
#1106 (4/4): Documentation of Unicode normalization and the UAX #29 tokenizer

Supersedes #1101.

Copilot

Pull request overview

Adds/updates Developer Manual documentation for Unicode-aware normalization and UAX #29 tokenization, and connects those docs to the DL (ONNX) components’ Unicode chunking/normalization behavior.

Changes:

Adds a new “Text Normalization” manual chapter and includes it in the master DocBook.
Extends the tokenizer chapter with guidance about Unicode preprocessing and a new UAX #29 tokenizer/segmenter section.
Updates NameFinderDL/DocumentCategorizerDL and introduction docs to reference Unicode-aware DL chunking and normalization options.

Reviewed changes

Copilot reviewed 6 out of 6 changed files in this pull request and generated 4 comments.

Show a summary per file

File	Description
opennlp-docs/src/docbkx/tokenizer.xml	Adds Unicode preprocessing guidance and documents the UAX #29 tokenizer/segmenter APIs.
opennlp-docs/src/docbkx/opennlp.xml	Includes the new normalizer chapter in the book build.
opennlp-docs/src/docbkx/normalizer.xml	New “Text Normalization” chapter covering normalizers, pipelines, term model, and reference data.
opennlp-docs/src/docbkx/namefinder.xml	Updates NameFinderDL constructor usage and documents Unicode-aware DL chunking and normalization options.
opennlp-docs/src/docbkx/introduction.xml	Links DL inference Unicode handling to the normalizer documentation.
opennlp-docs/src/docbkx/doccat.xml	Documents Unicode-aware DL chunking/normalization and adds ONNX usage snippet updates.

rzo1 · 2026-06-21T17:04:48Z

Thanks for the PR. Here are some suggestions:

Declare xmlns:xlink explicitly on the chapter roots of normalizer.xml and tokenizer.xml. Both use xlink:href (normalizer.xml:457, tokenizer.xml:461) but rely on the DocBook 5.0 DTD's #FIXED default to bind the prefix. The Maven build resolves the DTD, so it works there, but parsing breaks under any non-validating, namespace-aware tool that does not load the DTD (IDE linters, xmllint --nonet), and every other chapter declares it explicitly:
normalizer.xml: <chapter xml:id="tools.normalizer" xmlns:xlink="http://www.w3.org/1999/xlink">
tokenizer.xml: <chapter xml:id="tools.tokenizer" xmlns:xlink="http://www.w3.org/1999/xlink">
Note that tokenizer.xml did not use xlink before this PR, so the declaration is new for that file.

Otherwise the content is accurate.
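
The failure mode in the suggestion above can be reproduced without DocBook tooling. The following is a minimal sketch, not part of the PR: it parses two tiny stand-in documents with the JDK's namespace-aware, non-validating DOM parser. The class name, the `parses` helper, and the sample XML strings are illustrative assumptions; only the `xmlns:xlink` URI and the `xml:id` value come from the comment.

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import org.xml.sax.ErrorHandler;
import org.xml.sax.InputSource;
import org.xml.sax.SAXException;
import org.xml.sax.SAXParseException;

public class XlinkPrefixCheck {

  /** Returns true if the XML parses with a namespace-aware, non-validating parser. */
  public static boolean parses(String xml) throws Exception {
    DocumentBuilderFactory factory = DocumentBuilderFactory.newInstance();
    factory.setNamespaceAware(true);  // every prefix must be bound to a namespace
    factory.setValidating(false);     // no DTD validation, so no #FIXED defaults from a DTD
    DocumentBuilder builder = factory.newDocumentBuilder();
    // Rethrow parse errors instead of letting the default handler print to stderr.
    builder.setErrorHandler(new ErrorHandler() {
      @Override public void warning(SAXParseException e) { }
      @Override public void error(SAXParseException e) throws SAXException { throw e; }
      @Override public void fatalError(SAXParseException e) throws SAXException { throw e; }
    });
    try {
      builder.parse(new InputSource(new StringReader(xml)));
      return true;
    } catch (SAXException e) {
      return false;
    }
  }

  public static void main(String[] args) throws Exception {
    // Stand-in for a chapter that uses xlink:href without declaring the prefix.
    String undeclared = "<chapter xml:id=\"tools.tokenizer\">"
        + "<link xlink:href=\"https://example.org/\"/></chapter>";
    // Same chapter with the explicit declaration suggested in the review.
    String declared = "<chapter xml:id=\"tools.tokenizer\" "
        + "xmlns:xlink=\"http://www.w3.org/1999/xlink\">"
        + "<link xlink:href=\"https://example.org/\"/></chapter>";
    System.out.println(parses(undeclared)); // false: the xlink prefix is not bound
    System.out.println(parses(declared));   // true
  }
}
```

Neither sample has a DOCTYPE, which mirrors a tool that never loads the DTD: the prefix binding has to come from the document itself.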

krickert · 2026-06-22T02:58:20Z

Status since the last review: the normalizer chapter gains an "Offset-aware pipelines" section covering buildAligned() and the capability interface, with a worked dash-fold span example. It also gains the line-break-preserving rung in the fold table and the supplementary-dash offset note in the DL fold options. The name-finder chapter now names OffsetMappingNameFinder as the interface behind findInOriginal. The docbkx HTML build is clean. Rebased onto the updated stack.
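
To make the supplementary-dash offset note concrete, here is a minimal, self-contained sketch of why folding such a dash shifts later offsets and how a per-index map brings a span back to the original input. It does not use the OpenNLP API: `DashFoldSketch`, `Folded`, `isDash`, and the small dash set are hypothetical names and choices for illustration only, and the real rung's dash set and alignment types may differ. U+10EAD (YEZIDI HYPHENATION MARK) stands in for a dash outside the Basic Multilingual Plane.

```java
import java.util.Arrays;

public class DashFoldSketch {

  /** Folded text plus, for each UTF-16 index in it, the matching index in the original. */
  public static final class Folded {
    public final String text;
    // One entry per folded index, plus a final entry holding the original length.
    private final int[] toOriginal;

    Folded(String text, int[] toOriginal) {
      this.text = text;
      this.toOriginal = toOriginal;
    }

    public int toOriginal(int foldedIndex) {
      return toOriginal[foldedIndex];
    }
  }

  // Hypothetical dash set: en dash, em dash, and one supplementary-plane dash.
  static boolean isDash(int codePoint) {
    return codePoint == 0x2013 || codePoint == 0x2014 || codePoint == 0x10EAD;
  }

  /** Replaces each dash code point with '-' and records where every output unit came from. */
  public static Folded fold(String original) {
    StringBuilder out = new StringBuilder(original.length());
    int[] map = new int[original.length() + 1];
    int i = 0;
    while (i < original.length()) {
      int cp = original.codePointAt(i);
      int width = Character.charCount(cp); // 2 for supplementary code points
      if (isDash(cp)) {
        map[out.length()] = i;             // one output unit, even if width is 2
        out.append('-');
      } else {
        for (int k = 0; k < width; k++) {
          map[out.length() + k] = i + k;
        }
        out.appendCodePoint(cp);
      }
      i += width;
    }
    map[out.length()] = original.length(); // end-of-text sentinel for exclusive span ends
    return new Folded(out.toString(), Arrays.copyOf(map, out.length() + 1));
  }

  public static void main(String[] args) {
    String original = "a" + new String(Character.toChars(0x10EAD)) + "b";
    Folded folded = fold(original);
    System.out.println(folded.text);          // a-b
    // "b" sits at [2, 3) in the folded text but at [3, 4) in the original.
    System.out.println(folded.toOriginal(2)); // 3
    System.out.println(folded.toOriginal(3)); // 4
  }
}
```

A dash inside the BMP folds one UTF-16 unit to one, so offsets stay identical; only the two-unit case needs the map.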

Copilot

Pull request overview

Copilot reviewed 6 out of 6 changed files in this pull request and generated 2 comments.

+// Maps the model's output indices to its BIO labels, e.g. "O", "B-PER", "I-PER".
+Map<Integer, String> ids2Labels = new HashMap<>();
+SentenceDetector sentenceDetector =
+    new SentenceDetectorME(new SentenceModel(new File("/path/to/en-sent.bin")));
+String[] tokens = {"George", "Washington", "was", "president", "of", "the", "United", "States", "."};
+NameFinderDL nameFinderDL = new NameFinderDL(model, vocab, ids2Labels, sentenceDetector);
+// findInOriginal returns spans in the original input's coordinates.
+Span[] spans = nameFinderDL.findInOriginal(tokens);]]>


+			rules WB1 through WB999. It is rule based and needs no trained model, it works directly over
+			a <code>CharSequence</code>, and it reports character offsets so the original text is


…nd DL handling Add the Text Normalization manual chapter (CharClass engine, normalizer pipeline, the Term model, and the Aligned offset variants that return an AlignedText carrying an Alignment), extend the tokenizer chapter with the UAX #29 segmenter, and document the DL components' Unicode-aware chunking and opt-in whitespace/dash folding with offset-safe findInOriginal. All embedded ONNX snippets are self-contained and compile.

…ligned) Add an "Offset-aware pipelines" section to the normalizer chapter covering TextNormalizer.Builder.buildAligned(), the OffsetAwareNormalizer capability interface, mapping a match back to the source with AlignedText/Alignment, and the fail-loud rejection of rungs that cannot report edits (NFC/NFKC). List the new line-break-preserving whitespace rung in the normalizer family table.

… the manual Document that NameFinderDL.findInOriginal comes from the OffsetMappingNameFinder capability interface, detectable with a plain instanceof check, so the name-finder chapter matches how the normalizer chapter presents OffsetAwareNormalizer.

…gits, ellipsis, bullets, umlaut)

…old options Note in the normalizer manual that, with dash folding enabled, a dash in the supplementary planes shrinks from two UTF-16 units to one and shifts later offsets, so find reports offsets into the normalized text in that case while findInOriginal maps them back to the original input. The one-for-one whitespace fold versus the run-collapsing whitespace rung is already covered in the same section.

Scope the "never relies on Character.isWhitespace" statement to the normalization engine rather than the whole library. Note that getInstance() gives the default shared instance and that case and accent folding also offer configured forms. Refer to the conformance file by its full name WordBreakTest.txt.

…enizer manual The Word Tokenizer section said it drops punctuation and keeps emoji without noting that emoji means any Extended_Pictographic code point, so symbol-like characters such as the copyright, trademark, and double-exclamation signs are kept. Match the WordTokenizer class javadoc.
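
For readers unfamiliar with rule-based word segmentation that reports character offsets, the sketch below shows the general idea with the JDK's own `java.text.BreakIterator`. This is a stand-in, not the WordTokenizer from this PR stack: the JDK's word rules are related to UAX #29 but are not guaranteed to match WordBreakTest.txt, and the filter here keeps only segments containing a letter or digit, so it does not reproduce the emoji (Extended_Pictographic) retention that the commit describes. The class and method names are illustrative.

```java
import java.text.BreakIterator;
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;

public class WordBoundarySketch {

  /** Returns [start, end) offsets of boundary-delimited segments that contain a letter or digit. */
  public static List<int[]> wordSpans(String text) {
    BreakIterator boundaries = BreakIterator.getWordInstance(Locale.ROOT);
    boundaries.setText(text);
    List<int[]> spans = new ArrayList<>();
    int start = boundaries.first();
    for (int end = boundaries.next(); end != BreakIterator.DONE;
         start = end, end = boundaries.next()) {
      // Whitespace and punctuation also form segments; drop those here.
      boolean wordLike = text.substring(start, end)
          .codePoints()
          .anyMatch(Character::isLetterOrDigit);
      if (wordLike) {
        spans.add(new int[] {start, end});
      }
    }
    return spans;
  }

  public static void main(String[] args) {
    String text = "Hello, world!";
    for (int[] span : wordSpans(text)) {
      // Offsets index into the original text, so the surface form is recoverable.
      System.out.println(span[0] + "-" + span[1] + " " + text.substring(span[0], span[1]));
    }
    // prints:
    // 0-5 Hello
    // 7-12 world
  }
}
```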

…d hyphenation Show a concrete, exhaustive ids2Labels BIO mapping in the ONNX name-finder example instead of an empty map (an unmapped predicted index raises IllegalStateException at runtime), and note the exhaustiveness requirement. Hyphenate 'rule-based' and split the comma splice in the UAX #29 tokenizer section.
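
As an illustration of the exhaustiveness requirement in the last commit message, the sketch below builds a BIO index-to-label map and checks that every output index has a label before the map is handed to a name finder. The label set and the index order (`O` first, then `B-`/`I-` pairs) are assumptions for the example; the real order has to match the model's own label configuration. The class and helper names are hypothetical and are not part of OpenNLP.

```java
import java.util.LinkedHashMap;
import java.util.Map;

public class BioLabelMapSketch {

  /** Builds 0 -> "O", then consecutive B-/I- pairs for each entity type (assumed order). */
  public static Map<Integer, String> bioLabels(String... entityTypes) {
    Map<Integer, String> ids2Labels = new LinkedHashMap<>();
    ids2Labels.put(0, "O");
    int next = 1;
    for (String type : entityTypes) {
      ids2Labels.put(next++, "B-" + type);
      ids2Labels.put(next++, "I-" + type);
    }
    return ids2Labels;
  }

  /** True if every index in [0, numLabels) has a label, i.e. no predicted index is unmapped. */
  public static boolean coversAllIndices(Map<Integer, String> ids2Labels, int numLabels) {
    for (int i = 0; i < numLabels; i++) {
      if (!ids2Labels.containsKey(i)) {
        return false;
      }
    }
    return true;
  }

  public static void main(String[] args) {
    Map<Integer, String> ids2Labels = bioLabels("PER", "LOC");
    System.out.println(ids2Labels);                      // {0=O, 1=B-PER, 2=I-PER, 3=B-LOC, 4=I-LOC}
    System.out.println(coversAllIndices(ids2Labels, 5)); // true
    System.out.println(coversAllIndices(ids2Labels, 7)); // false: indices 5 and 6 are unmapped
  }
}
```

Running a check like this up front turns the runtime IllegalStateException mentioned above into a configuration error caught before inference.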

krickert mentioned this pull request Jun 20, 2026

OPENNLP-1850: Unicode normalization foundation — CharClass engine, rungs, Dimension (1/4) #1103

Closed

This was referenced Jun 20, 2026

OPENNLP-1850: UAX #29 word tokenizer and the layered Term model (2/4) #1104

Closed

OPENNLP-1850: Offset-safe input normalization in the DL components (3/4) #1105

Draft

OPENNLP-1850: Improve Whitespace UTF normalization #1101

Closed

krickert marked this pull request as draft June 20, 2026 14:43

krickert requested a review from Copilot June 20, 2026 14:56

Copilot started reviewing on behalf of krickert June 20, 2026 14:57 View session

krickert requested review from mawiesne and rzo1 June 20, 2026 14:58

Copilot AI reviewed Jun 20, 2026

Comment thread opennlp-docs/src/docbkx/namefinder.xml Outdated

Comment thread opennlp-docs/src/docbkx/namefinder.xml Outdated

Comment thread opennlp-docs/src/docbkx/doccat.xml Outdated

Comment thread opennlp-docs/src/docbkx/doccat.xml

krickert force-pushed the OPENNLP-1850-3-dl branch from 1c17110 to 8534bb3 Compare June 20, 2026 20:16

krickert force-pushed the OPENNLP-1850-4-docs branch from 3037db7 to 9a71f28 Compare June 20, 2026 20:16

krickert force-pushed the OPENNLP-1850-3-dl branch from 8534bb3 to 5154da4 Compare June 21, 2026 19:00

krickert force-pushed the OPENNLP-1850-4-docs branch 2 times, most recently from d7d316f to 0ff5d07 Compare June 21, 2026 19:21

krickert force-pushed the OPENNLP-1850-3-dl branch from 5154da4 to c51f37d Compare June 21, 2026 19:21

krickert force-pushed the OPENNLP-1850-4-docs branch 2 times, most recently from 667e850 to d71e472 Compare June 21, 2026 22:59

krickert force-pushed the OPENNLP-1850-3-dl branch from 40698dc to 001ac01 Compare June 21, 2026 22:59

krickert force-pushed the OPENNLP-1850-4-docs branch from b65c0de to 0022bc1 Compare June 22, 2026 00:19

krickert force-pushed the OPENNLP-1850-3-dl branch 2 times, most recently from 038e23d to bc401d3 Compare June 22, 2026 01:52

krickert force-pushed the OPENNLP-1850-4-docs branch from 0022bc1 to 2fd9543 Compare June 22, 2026 01:52

krickert force-pushed the OPENNLP-1850-3-dl branch from bc401d3 to 4c12897 Compare June 22, 2026 02:10

krickert force-pushed the OPENNLP-1850-4-docs branch from 2fd9543 to 743a955 Compare June 22, 2026 02:10

krickert requested a review from Copilot June 22, 2026 02:59

Copilot started reviewing on behalf of krickert June 22, 2026 02:59 View session

krickert force-pushed the OPENNLP-1850-4-docs branch from 743a955 to 213ab50 Compare June 22, 2026 03:31

krickert force-pushed the OPENNLP-1850-3-dl branch from 2006d1d to fdd329f Compare June 22, 2026 03:51

krickert force-pushed the OPENNLP-1850-4-docs branch from 213ab50 to 8475b41 Compare June 22, 2026 03:51

krickert force-pushed the OPENNLP-1850-3-dl branch from fdd329f to 47a39bf Compare June 22, 2026 03:59

krickert force-pushed the OPENNLP-1850-4-docs branch 3 times, most recently from def4b08 to d7838a5 Compare June 22, 2026 05:49

rzo1 requested a review from Copilot June 23, 2026 10:48

Copilot AI reviewed Jun 23, 2026

Copilot started reviewing on behalf of rzo1 June 23, 2026 11:21 View session

krickert force-pushed the OPENNLP-1850-3-dl branch from 3f77034 to 1ea12ea Compare June 23, 2026 13:14

krickert force-pushed the OPENNLP-1850-4-docs branch from d7838a5 to f75627b Compare June 23, 2026 13:14

krickert force-pushed the OPENNLP-1850-3-dl branch from 1ea12ea to e7f3c59 Compare June 23, 2026 14:02

krickert force-pushed the OPENNLP-1850-4-docs branch from f75627b to c83701e Compare June 23, 2026 14:02

krickert force-pushed the OPENNLP-1850-3-dl branch from e7f3c59 to b6dc241 Compare June 23, 2026 15:17

krickert force-pushed the OPENNLP-1850-4-docs branch from c83701e to e0011e2 Compare June 23, 2026 15:17

krickert force-pushed the OPENNLP-1850-3-dl branch from b6dc241 to be98bd6 Compare June 24, 2026 11:20

krickert force-pushed the OPENNLP-1850-4-docs branch from e0011e2 to 4a48643 Compare June 24, 2026 11:20

krickert force-pushed the OPENNLP-1850-3-dl branch from be98bd6 to 5a1114c Compare June 24, 2026 11:54

krickert force-pushed the OPENNLP-1850-4-docs branch from 4a48643 to 0bb8b2d Compare June 24, 2026 11:54

krickert added 8 commits June 25, 2026 04:23

8bb3011 OPENNLP-1850 Document the offset-aware substitution folds (quotes, digits, ellipsis, bullets, umlaut)

krickert force-pushed the OPENNLP-1850-3-dl branch from 5a1114c to 57ceefd Compare June 25, 2026 08:26

krickert force-pushed the OPENNLP-1850-4-docs branch from 0bb8b2d to c0fb2eb Compare June 25, 2026 08:26

krickert wants to merge 8 commits into OPENNLP-1850-3-dl from OPENNLP-1850-4-docs

3 participants
