OPENNLP-1850: Document Unicode normalization and the UAX #29 tokenizer (4/4)#1106
krickert wants to merge 8 commits into
OPENNLP-1850 stacked PRs (review independently; merge bottom-up, re-targeting each base to
Supersedes #1101.
Pull request overview
Adds/updates Developer Manual documentation for Unicode-aware normalization and UAX #29 tokenization, and connects those docs to the DL (ONNX) components’ Unicode chunking/normalization behavior.
Changes:
- Adds a new “Text Normalization” manual chapter and includes it in the master DocBook.
- Extends the tokenizer chapter with guidance about Unicode preprocessing and a new UAX #29 tokenizer/segmenter section.
- Updates NameFinderDL/DocumentCategorizerDL and introduction docs to reference Unicode-aware DL chunking and normalization options.
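For readers unfamiliar with the normalization forms these chapters refer to, here is a minimal sketch using only the JDK's `java.text.Normalizer`. It does not use the OpenNLP normalizer API documented in this PR; the class name `NormalizationFormsSketch` is hypothetical and exists only for illustration.

```java
import java.text.Normalizer;

// Illustration only: JDK normalization forms, not the OpenNLP normalizer pipeline.
final class NormalizationFormsSketch {

    /** Canonical composition: combining sequences become precomposed characters. */
    static String nfc(String s) {
        return Normalizer.normalize(s, Normalizer.Form.NFC);
    }

    /** Compatibility composition: additionally folds compatibility characters such as ligatures. */
    static String nfkc(String s) {
        return Normalizer.normalize(s, Normalizer.Form.NFKC);
    }

    public static void main(String[] args) {
        String decomposed = "e\u0301"; // 'e' followed by COMBINING ACUTE ACCENT (two chars)
        String ligature = "\uFB01";    // LATIN SMALL LIGATURE FI (one char)

        // NFC shortens the decomposed sequence; NFKC lengthens the ligature.
        System.out.println(decomposed.length() + " -> " + nfc(decomposed).length());
        System.out.println(ligature.length() + " -> " + nfkc(ligature).length());
    }
}
```

Both forms can change the length of the text, which is why offsets computed on normalized text do not automatically apply to the original input.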
Reviewed changes
Copilot reviewed 6 out of 6 changed files in this pull request and generated 4 comments.
| File | Description |
|---|---|
| opennlp-docs/src/docbkx/tokenizer.xml | Adds Unicode preprocessing guidance and documents the UAX #29 tokenizer/segmenter APIs. |
| opennlp-docs/src/docbkx/opennlp.xml | Includes the new normalizer chapter in the book build. |
| opennlp-docs/src/docbkx/normalizer.xml | New “Text Normalization” chapter covering normalizers, pipelines, term model, and reference data. |
| opennlp-docs/src/docbkx/namefinder.xml | Updates NameFinderDL constructor usage and documents Unicode-aware DL chunking and normalization options. |
| opennlp-docs/src/docbkx/introduction.xml | Links DL inference Unicode handling to the normalizer documentation. |
| opennlp-docs/src/docbkx/doccat.xml | Documents Unicode-aware DL chunking/normalization and adds ONNX usage snippet updates. |
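As a rough illustration of rule-based word segmentation that reports character offsets, the sketch below uses the JDK's `java.text.BreakIterator`. This is a different implementation from the UAX #29 tokenizer documented in `tokenizer.xml`, and its boundaries may differ from that tokenizer's; the class and method names are hypothetical.

```java
import java.text.BreakIterator;
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;

// Illustration only: JDK word boundaries, not the OpenNLP UAX #29 tokenizer.
final class WordOffsetsSketch {

    /** Returns {start, end} offsets of segments that begin with a letter or digit. */
    static List<int[]> wordSpans(String text) {
        BreakIterator it = BreakIterator.getWordInstance(Locale.ROOT);
        it.setText(text);
        List<int[]> spans = new ArrayList<>();
        int start = it.first();
        for (int end = it.next(); end != BreakIterator.DONE; start = end, end = it.next()) {
            // Skip segments that are whitespace or punctuation.
            if (Character.isLetterOrDigit(text.codePointAt(start))) {
                spans.add(new int[] {start, end});
            }
        }
        return spans;
    }

    public static void main(String[] args) {
        String text = "Hello, world";
        for (int[] span : wordSpans(text)) {
            System.out.println(span[0] + ".." + span[1] + " " + text.substring(span[0], span[1]));
        }
    }
}
```

Because each span is a pair of offsets into the input, the original text can always be recovered with `substring`.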
Thx for the PR. Here are some suggestions:
Otherwise the content is accurate.
Status since the last review. Normalizer chapter gains an "Offset-aware pipelines" section for
// Maps the model's output indices to its BIO labels, e.g. "O", "B-PER", "I-PER".
Map<Integer, String> ids2Labels = new HashMap<>();
SentenceDetector sentenceDetector =
    new SentenceDetectorME(new SentenceModel(new File("/path/to/en-sent.bin")));
String[] tokens = {"George", "Washington", "was", "president", "of", "the", "United", "States", "."};
NameFinderDL nameFinderDL = new NameFinderDL(model, vocab, ids2Labels, sentenceDetector);
// findInOriginal returns spans in the original input's coordinates.
Span[] spans = nameFinderDL.findInOriginal(tokens);]]>
rules WB1 through WB999. It is rule based and needs no trained model, it works directly over
a <code>CharSequence</code>, and it reports character offsets so the original text is
…nd DL handling
Add the Text Normalization manual chapter (CharClass engine, normalizer pipeline, the Term model, and the Aligned offset variants that return an AlignedText carrying an Alignment), extend the tokenizer chapter with the UAX #29 segmenter, and document the DL components' Unicode-aware chunking and opt-in whitespace/dash folding with offset-safe findInOriginal. All embedded ONNX snippets are self-contained and compile.
…ligned)
Add an "Offset-aware pipelines" section to the normalizer chapter covering TextNormalizer.Builder.buildAligned(), the OffsetAwareNormalizer capability interface, mapping a match back to the source with AlignedText/Alignment, and the fail-loud rejection of rungs that cannot report edits (NFC/NFKC). List the new line-break-preserving whitespace rung in the normalizer family table.
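The idea of mapping a match in normalized text back to the source can be sketched with a plain offset table. The code below is a self-contained illustration under stated assumptions: `AlignmentSketch` and its methods are hypothetical and are not the `AlignedText`/`Alignment` API named in the commit message, whose actual signatures are not shown in this PR page.

```java
import java.util.Arrays;

// Hypothetical sketch: a whitespace-collapsing step that records where each output char came from.
final class AlignmentSketch {
    final String normalized;
    /** toOriginal[i] is the source index of normalized char i; the last entry is the source length. */
    final int[] toOriginal;

    private AlignmentSketch(String normalized, int[] toOriginal) {
        this.normalized = normalized;
        this.toOriginal = toOriginal;
    }

    /** Collapses runs of spaces and tabs into one space while recording source offsets. */
    static AlignmentSketch collapseWhitespace(String source) {
        StringBuilder out = new StringBuilder();
        int[] map = new int[source.length() + 1];
        boolean inRun = false;
        for (int i = 0; i < source.length(); i++) {
            char c = source.charAt(i);
            boolean ws = c == ' ' || c == '\t';
            if (ws && inRun) {
                continue; // later chars of a run are dropped and leave no entry
            }
            map[out.length()] = i; // record the source position before appending
            out.append(ws ? ' ' : c);
            inRun = ws;
        }
        map[out.length()] = source.length();
        return new AlignmentSketch(out.toString(), Arrays.copyOf(map, out.length() + 1));
    }

    /** Maps a [start, end) span in the normalized text to a span in the source. */
    int[] originalSpan(int start, int end) {
        return new int[] {toOriginal[start], toOriginal[end]};
    }

    public static void main(String[] args) {
        String source = "New   York  City";
        AlignmentSketch aligned = collapseWhitespace(source);
        int start = aligned.normalized.indexOf("York");
        int[] span = aligned.originalSpan(start, start + "York".length());
        System.out.println(source.substring(span[0], span[1]));
    }
}
```

A step like this can report its edits because every output character has a single known source position. That is harder for transformations such as NFC/NFKC, which is consistent with the commit's note that those rungs are rejected rather than silently misaligned.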
… the manual
Document that NameFinderDL.findInOriginal comes from the OffsetMappingNameFinder capability interface, detectable with a plain instanceof check, so the name-finder chapter matches how the normalizer chapter presents OffsetAwareNormalizer.
…gits, ellipsis, bullets, umlaut)
…old options
Note in the normalizer manual that, with dash folding enabled, a dash in the supplementary planes shrinks from two UTF-16 units to one and shifts later offsets, so find reports offsets into the normalized text in that case while findInOriginal maps them back to the original input. The one-for-one whitespace fold versus the run-collapsing whitespace rung is already covered in the same section.
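The offset shift described here can be reproduced with the JDK alone. The sketch below assumes, purely for illustration, that U+10EAD (YEZIDI HYPHENATION MARK, a dash-punctuation code point outside the BMP) is among the folded dashes; which code points the DL components actually fold is not stated on this page, and `DashFoldOffsetSketch` is a hypothetical name.

```java
// Hypothetical sketch: folding a supplementary-plane dash to '-' shortens the UTF-16 string.
final class DashFoldOffsetSketch {
    /** Assumed example of a dash in the supplementary planes (U+10EAD). */
    static final int SUPPLEMENTARY_DASH = 0x10EAD;

    /** Replaces the example dash with ASCII hyphen-minus, code point by code point. */
    static String foldDash(String s) {
        StringBuilder out = new StringBuilder();
        s.codePoints().forEach(cp -> out.appendCodePoint(cp == SUPPLEMENTARY_DASH ? '-' : cp));
        return out.toString();
    }

    public static void main(String[] args) {
        String original = "a" + new String(Character.toChars(SUPPLEMENTARY_DASH)) + "b";
        String folded = foldDash(original);
        // The dash occupies two UTF-16 units before folding and one after,
        // so every later offset moves left by one.
        System.out.println(original.indexOf('b') + " -> " + folded.indexOf('b'));
    }
}
```

Offsets found in the folded string therefore need to be mapped back before they are used against the original input.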
Scope the "never relies on Character.isWhitespace" statement to the normalization engine rather than the whole library. Note that getInstance() gives the default shared instance and that case and accent folding also offer configured forms. Refer to the conformance file by its full name WordBreakTest.txt.
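For context on `Character.isWhitespace`, the JDK method excludes the no-break spaces by specification, so it is not a complete test for Unicode space characters. The sketch below only demonstrates that JDK behaviour; whether this is the motivation for the normalization engine's choice is an assumption, and the class and method names are hypothetical.

```java
// Illustration of JDK behaviour only; not the OpenNLP CharClass engine.
final class WhitespaceCheckSketch {

    /** The JDK's notion of whitespace, which excludes U+00A0, U+2007 and U+202F. */
    static boolean jdkWhitespace(int codePoint) {
        return Character.isWhitespace(codePoint);
    }

    /** A broader check that also accepts every Unicode space, line and paragraph separator. */
    static boolean anySpace(int codePoint) {
        return Character.isWhitespace(codePoint) || Character.isSpaceChar(codePoint);
    }

    public static void main(String[] args) {
        int[] samples = {0x0020, 0x00A0, 0x2007, 0x202F, 0x3000};
        for (int cp : samples) {
            System.out.printf("U+%04X jdkWhitespace=%b anySpace=%b%n",
                    cp, jdkWhitespace(cp), anySpace(cp));
        }
    }
}
```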
…enizer manual
The Word Tokenizer section said it drops punctuation and keeps emoji without noting that emoji means any Extended_Pictographic code point, so symbol-like characters such as the copyright, trademark, and double-exclamation signs are kept. Match the WordTokenizer class javadoc.
…d hyphenation
Show a concrete, exhaustive ids2Labels BIO mapping in the ONNX name-finder example instead of an empty map (an unmapped predicted index raises IllegalStateException at runtime), and note the exhaustiveness requirement. Hyphenate 'rule-based' and split the comma splice in the UAX #29 tokenizer section.
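A minimal sketch of what an exhaustive mapping looks like, assuming a model with exactly three output indices and the example labels quoted earlier in this conversation ("O", "B-PER", "I-PER"). The real label set depends on the model, and `labelFor` is a hypothetical helper that only mimics the failure mode described in the commit message; it is not NameFinderDL code.

```java
import java.util.Map;

// Hypothetical sketch: an exhaustive index-to-label map and a lookup that fails loudly.
final class LabelMapSketch {
    /** Assumes the model emits exactly the indices 0, 1 and 2. */
    static final Map<Integer, String> IDS_2_LABELS = Map.of(
            0, "O",
            1, "B-PER",
            2, "I-PER");

    /** Returns the label for a predicted index, or throws if the index has no entry. */
    static String labelFor(Map<Integer, String> ids2Labels, int predictedIndex) {
        String label = ids2Labels.get(predictedIndex);
        if (label == null) {
            throw new IllegalStateException("No label mapped for output index " + predictedIndex);
        }
        return label;
    }

    public static void main(String[] args) {
        System.out.println(labelFor(IDS_2_LABELS, 1));
        try {
            labelFor(Map.of(), 1); // an empty map cannot resolve any prediction
        } catch (IllegalStateException e) {
            System.out.println(e.getMessage());
        }
    }
}
```

With an empty map, the first predicted index already fails, which is why the manual example needs one entry per index the model can emit.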
Part 4/4 of OPENNLP-1850. New normalizer manual chapter plus tokenizer/doccat/namefinder/introduction updates and the master opennlp.xml.