
OPENNLP-1850: Layered Term model — Term, TermAnalyzer (2b/7)#1111

Draft
krickert wants to merge 1 commit into OPENNLP-1850-2a-tokenizer from OPENNLP-1850-2b-term



@krickert

Contributor

Part 2b of the OPENNLP-1850 stack: the token-analysis layer, split out of the former tokenizer PR (#1104) on review request.

A Term is one token projected through the ordered Dimension stack (original, NFC, NFKC, whitespace, dash, case fold, accent fold, confusable fold, stem, lemma), keeping its source Span and every intermediate form. TermAnalyzer segments with the UAX #29 WordTokenizer (from 2a) and applies the configured dimension prefix. Restores Dimension's {@link Term}/{@link TermAnalyzer} javadoc now that those types exist.
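The layering described above can be pictured with a small, JDK-only sketch. It is not the code in this PR: `TermSketch`, `Dim`, `Term`, `apply`, and `analyze` are hypothetical names, only five of the ten dimensions are modeled, `java.text.BreakIterator` stands in for the UAX #29 WordTokenizer from 2a, and `toLowerCase(Locale.ROOT)` stands in for real case folding. It assumes each dimension consumes the previous dimension's output and that a "dimension prefix" means "every dimension up to and including a chosen one".

```java
import java.text.BreakIterator;
import java.text.Normalizer;
import java.util.ArrayList;
import java.util.EnumMap;
import java.util.List;
import java.util.Locale;
import java.util.Map;

public class TermSketch {

    /** Hypothetical subset of the ordered dimension stack (declaration order = application order). */
    enum Dim { ORIGINAL, NFC, NFKC, CASE_FOLD, ACCENT_FOLD }

    /** One token: its source offsets plus the form produced at every applied dimension. */
    record Term(int start, int end, Map<Dim, String> forms) {
        String form(Dim d) { return forms.get(d); } // null if d was not in the applied prefix
    }

    /** Applies a single dimension to the output of the previous one. */
    static String apply(Dim d, String s) {
        return switch (d) {
            case ORIGINAL -> s;
            case NFC -> Normalizer.normalize(s, Normalizer.Form.NFC);
            case NFKC -> Normalizer.normalize(s, Normalizer.Form.NFKC);
            // Assumption: lowercasing approximates case folding for this sketch.
            case CASE_FOLD -> s.toLowerCase(Locale.ROOT);
            // Decompose, then drop combining marks.
            case ACCENT_FOLD -> Normalizer.normalize(s, Normalizer.Form.NFD).replaceAll("\\p{M}+", "");
        };
    }

    /** Segments text into words and projects each through the dimensions up to and including upTo. */
    static List<Term> analyze(String text, Dim upTo) {
        List<Term> terms = new ArrayList<>();
        BreakIterator words = BreakIterator.getWordInstance(Locale.ROOT);
        words.setText(text);
        int start = words.first();
        for (int end = words.next(); end != BreakIterator.DONE; start = end, end = words.next()) {
            String token = text.substring(start, end);
            // Skip whitespace and punctuation-only segments.
            if (token.codePoints().noneMatch(Character::isLetterOrDigit)) {
                continue;
            }
            Map<Dim, String> forms = new EnumMap<>(Dim.class);
            String current = token;
            for (Dim d : Dim.values()) {
                current = apply(d, current);
                forms.put(d, current); // keep every intermediate form
                if (d == upTo) {
                    break;
                }
            }
            terms.add(new Term(start, end, forms));
        }
        return terms;
    }

    public static void main(String[] args) {
        for (Term t : analyze("Caf\u00e9 Latte", Dim.ACCENT_FOLD)) {
            System.out.println(t.start() + "-" + t.end() + " " + t.forms());
        }
    }
}
```

Keeping the forms in an `EnumMap` keyed by dimension is one way to retain "every intermediate form" alongside the source offsets; the actual `Term` in this PR may store them differently.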

Base: OPENNLP-1850-2a-tokenizer (#1110). Stack: 1a → 1b → 2a → 2b (this) → 2c → DL → docs.

@krickert force-pushed the OPENNLP-1850-2a-tokenizer branch from dd1906d to 3fae8aa on June 25, 2026 08:26
@krickert force-pushed the OPENNLP-1850-2b-term branch from e35e859 to 55dbeb4 on June 25, 2026 08:26