OPENNLP-1850: Layered Term model — Term, TermAnalyzer (2b/7) by krickert · Pull Request #1111 · apache/opennlp

krickert · 2026-06-23T15:18:43Z

Part 2b of the OPENNLP-1850 stack: the token-analysis layer, split out of the former tokenizer PR (#1104) on review request.

A Term is one token projected through the ordered Dimension stack (original, NFC, NFKC, whitespace, dash, case fold, accent fold, confusable fold, stem, lemma), keeping its source Span and every intermediate form. TermAnalyzer segments with the UAX #29 WordTokenizer (from 2a) and applies the configured dimension prefix. Restores Dimension's {@link Term}/{@link TermAnalyzer} javadoc now that those types exist.

Base: OPENNLP-1850-2a-tokenizer (#1110). Stack: 1a → 1b → 2a → 2b (this) → 2c → DL → docs.
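
For readers who have not opened the diff, here is a minimal sketch of the idea described above: a token is pushed through an ordered stack of dimensions, and each intermediate form is kept alongside the source span. It is not the code in this PR. Every name in it (`TermSketch`, `Dim`, `Term`, `analyze`) is hypothetical, only five of the ten dimensions are modeled, and the JDK's `BreakIterator` stands in for the UAX #29 WordTokenizer from 2a.

```java
import java.text.BreakIterator;
import java.text.Normalizer;
import java.util.ArrayList;
import java.util.EnumMap;
import java.util.List;
import java.util.Locale;
import java.util.Map;
import java.util.function.UnaryOperator;

// Illustrative only: all names here are assumptions, not the API added by this PR.
final class TermSketch {

  // Hypothetical subset of the ordered dimension stack (declaration order = application order).
  enum Dim {
    ORIGINAL(s -> s),
    NFC(s -> Normalizer.normalize(s, Normalizer.Form.NFC)),
    NFKC(s -> Normalizer.normalize(s, Normalizer.Form.NFKC)),
    CASE_FOLD(s -> s.toLowerCase(Locale.ROOT)), // simple lowercasing, not full Unicode case folding
    ACCENT_FOLD(s -> Normalizer.normalize(s, Normalizer.Form.NFD).replaceAll("\\p{Mn}+", ""));

    final UnaryOperator<String> step;

    Dim(UnaryOperator<String> step) {
      this.step = step;
    }
  }

  // One token: its source span [start, end) plus every intermediate form that was computed.
  record Term(int start, int end, Map<Dim, String> forms) {
    String form(Dim d) {
      return forms.get(d); // null if d lies beyond the configured prefix
    }
  }

  // Segments text into word tokens and applies the dimension prefix ORIGINAL..last to each.
  static List<Term> analyze(String text, Dim last) {
    List<Term> terms = new ArrayList<>();
    BreakIterator words = BreakIterator.getWordInstance(Locale.ROOT); // stand-in segmenter
    words.setText(text);
    int start = words.first();
    for (int end = words.next(); end != BreakIterator.DONE; start = end, end = words.next()) {
      String token = text.substring(start, end);
      // Skip whitespace and punctuation-only segments.
      if (token.codePoints().noneMatch(Character::isLetterOrDigit)) {
        continue;
      }
      Map<Dim, String> forms = new EnumMap<>(Dim.class);
      String current = token;
      for (Dim d : Dim.values()) {
        if (d.ordinal() > last.ordinal()) {
          break;
        }
        current = d.step.apply(current); // each layer builds on the previous one
        forms.put(d, current);
      }
      terms.add(new Term(start, end, forms));
    }
    return terms;
  }

  public static void main(String[] args) {
    for (Term t : analyze("Caf\u00e9 \ufb01le", Dim.ACCENT_FOLD)) {
      System.out.println(t);
    }
  }
}
```

Keeping each layer in the map, rather than only the final string, is what allows a caller to choose the form it needs (for example the case-folded form for matching and the original for display) without re-running the analysis.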


This was referenced Jun 23, 2026

OPENNLP-1850: Per-language NormalizationProfile registry (2c/7) #1112 (Draft)

OPENNLP-1850: UAX #29 word tokenizer and the layered Term model (2/4) #1104 (Closed)

krickert force-pushed the OPENNLP-1850-2a-tokenizer branch from a450069 to dc02b9e Compare June 24, 2026 11:20

krickert force-pushed the OPENNLP-1850-2b-term branch from 57e2b58 to 82cb041 Compare June 24, 2026 11:20

krickert force-pushed the OPENNLP-1850-2a-tokenizer branch from dc02b9e to dd1906d Compare June 24, 2026 11:54

krickert force-pushed the OPENNLP-1850-2b-term branch from 82cb041 to e35e859 Compare June 24, 2026 11:54

krickert force-pushed the OPENNLP-1850-2a-tokenizer branch from dd1906d to 3fae8aa Compare June 25, 2026 08:26

krickert force-pushed the OPENNLP-1850-2b-term branch from e35e859 to 55dbeb4 Compare June 25, 2026 08:26

krickert wants to merge 1 commit into OPENNLP-1850-2a-tokenizer from OPENNLP-1850-2b-term
