OPENNLP-1850: Per-language NormalizationProfile registry (2c/7) by krickert · Pull Request #1112 · apache/opennlp

krickert · 2026-06-23T15:18:45Z

Part 2c of the OPENNLP-1850 stack: the language-to-settings registry, split out of the former tokenizer PR (#1104) on review request.

NormalizationProfiles maps a language code to its stemmer and a language-appropriate diacritic fold, the same way OpenNLP already selects a Snowball stemmer by language, and builds a search-oriented TermAnalyzer. NormalizationProfile is the per-language record.

Base: OPENNLP-1850-2b-term (#1111). Stack: 1a → 1b → 2a → 2b → 2c (this) → DL → docs.

This is the language-to-settings registry, split out of the former tokenizer PR (#1104) on review request. NormalizationProfiles maps a language code to its stemmer and diacritic fold (the way OpenNLP already selects a Snowball stemmer by language) and builds a search-oriented TermAnalyzer; NormalizationProfile is the per-language record. Builds on the Term model in 2b.
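To make the shape of such a registry concrete, here is a minimal sketch of a language-code-to-profile map. It is not the code in this PR: `ProfileRegistrySketch`, `Profile`, `FOLD_ALL`, `TOY_STEMMER`, and the choice of which languages fold diacritics are all hypothetical, and the one-line "stemmer" is a toy stand-in for a real Snowball stemmer.

```java
import java.text.Normalizer;
import java.util.Locale;
import java.util.Map;
import java.util.Optional;
import java.util.function.UnaryOperator;

public final class ProfileRegistrySketch {

  /** Hypothetical per-language record: a stemmer plus a diacritic fold. */
  public record Profile(String language,
                        UnaryOperator<String> stemmer,
                        UnaryOperator<String> fold) {

    /** Lowercases, folds diacritics, then stems a single token. */
    public String normalize(String token) {
      return stemmer.apply(fold.apply(token.toLowerCase(Locale.ROOT)));
    }
  }

  // Strips every combining mark after canonical decomposition (U+00E9 becomes "e").
  static final UnaryOperator<String> FOLD_ALL =
      s -> Normalizer.normalize(s, Normalizer.Form.NFD).replaceAll("\\p{M}+", "");

  // Toy stand-in for a real stemmer: drops one trailing "s" from longer tokens.
  static final UnaryOperator<String> TOY_STEMMER =
      s -> s.length() > 3 && s.endsWith("s") ? s.substring(0, s.length() - 1) : s;

  // Assumption for illustration: English folds all marks, Swedish keeps them
  // because marked letters such as U+00E5 are distinct letters there.
  private static final Map<String, Profile> PROFILES = Map.of(
      "eng", new Profile("eng", TOY_STEMMER, FOLD_ALL),
      "swe", new Profile("swe", UnaryOperator.identity(), UnaryOperator.identity()));

  /** Returns the profile registered for an ISO 639-3 code, or empty if none. */
  public static Optional<Profile> forLanguage(String iso3Code) {
    return Optional.ofNullable(PROFILES.get(iso3Code.toLowerCase(Locale.ROOT)));
  }

  public static void main(String[] args) {
    // "Caf\u00e9s" is "Cafes" with an acute accent on the e.
    System.out.println(forLanguage("eng").orElseThrow().normalize("Caf\u00e9s")); // cafe
    // "Spr\u00e5k" keeps its ring-above under the Swedish sketch profile.
    System.out.println(forLanguage("swe").orElseThrow().normalize("Spr\u00e5k"));
    System.out.println(forLanguage("xx").isPresent()); // false
  }
}
```

Returning `Optional` rather than a default profile keeps "no profile registered" visible to the caller, which is the condition the Norwegian fix in this PR is about.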

forLanguage("nb") and forLanguage("nn") returned empty. The registry keyed Norwegian under the macrolanguage code "nor", but the standard written-form codes nb and nn convert to ISO 639-3 nob and nno, which were absent from the map. As a result, Norwegian text got no profile, and the LanguageDetector (which emits nob/nno) hit the same gap for a language that supportedLanguages() advertises. This change adds nob and nno as aliases for the Norwegian stemmer.
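The lookup gap and the alias fix can be illustrated with a small sketch. This is an assumption-laden illustration, not the PR's code: `NorwegianAliasSketch`, `stemmerFor`, the hand-written two-letter-to-ISO 639-3 table, and the use of plain stemmer names as map values are all hypothetical.

```java
import java.util.HashMap;
import java.util.Locale;
import java.util.Map;
import java.util.Optional;

public final class NorwegianAliasSketch {

  // Hypothetical stand-in for the code-normalization step:
  // two-letter codes mapped to their ISO 639-3 equivalents.
  private static final Map<String, String> ISO_639_1_TO_3 = Map.of(
      "no", "nor",   // Norwegian (macrolanguage)
      "nb", "nob",   // Norwegian Bokmal
      "nn", "nno");  // Norwegian Nynorsk

  // Registry values are plain stemmer names here, for brevity.
  private static final Map<String, String> STEMMER_BY_CODE = new HashMap<>();

  static {
    STEMMER_BY_CODE.put("nor", "NORWEGIAN");
    // The aliases: without these two entries, "nb", "nn", "nob" and "nno"
    // all normalize to a key that is missing from the map.
    STEMMER_BY_CODE.put("nob", "NORWEGIAN");
    STEMMER_BY_CODE.put("nno", "NORWEGIAN");
  }

  /** Normalizes the code to ISO 639-3, then looks up the stemmer name. */
  public static Optional<String> stemmerFor(String code) {
    String lower = code.toLowerCase(Locale.ROOT);
    String iso3 = ISO_639_1_TO_3.getOrDefault(lower, lower);
    return Optional.ofNullable(STEMMER_BY_CODE.get(iso3));
  }

  public static void main(String[] args) {
    System.out.println(stemmerFor("nb"));  // Optional[NORWEGIAN]
    System.out.println(stemmerFor("nno")); // Optional[NORWEGIAN]
    System.out.println(stemmerFor("xx"));  // Optional.empty
  }
}
```

Registering the aliases at the ISO 639-3 level covers both entry points named above: two-letter codes that are converted first, and three-letter codes passed in directly.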

krickert mentioned this pull request Jun 23, 2026

OPENNLP-1850: UAX #29 word tokenizer and the layered Term model (2/4) #1104

Closed

krickert force-pushed the OPENNLP-1850-2b-term branch from 57e2b58 to 82cb041 Compare June 24, 2026 11:20

krickert force-pushed the OPENNLP-1850-2c-profiles branch from b6cd173 to a345f48 Compare June 24, 2026 11:20

krickert added 2 commits June 24, 2026 07:48

krickert force-pushed the OPENNLP-1850-2b-term branch from 82cb041 to e35e859 Compare June 24, 2026 11:54

krickert force-pushed the OPENNLP-1850-2c-profiles branch from a345f48 to 93618ce Compare June 24, 2026 11:54

krickert wants to merge 2 commits into OPENNLP-1850-2b-term from OPENNLP-1850-2c-profiles
