
OPENNLP-1850: Per-language NormalizationProfile registry (2c/7)#1112

Draft: krickert wants to merge 2 commits into OPENNLP-1850-2b-term from OPENNLP-1850-2c-profiles

@krickert

Part 2c of the OPENNLP-1850 stack: the language-to-settings registry, split out of the former tokenizer PR (#1104) on review request.

NormalizationProfiles maps a language code to its stemmer and a language-appropriate diacritic fold, the same way OpenNLP already selects a Snowball stemmer by language, and builds a search-oriented TermAnalyzer from those settings. NormalizationProfile is the per-language record.
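
To make the shape of such a registry concrete, here is a minimal sketch of a language-code lookup that returns a per-language settings record. Everything in it is an assumption for illustration: the class name `ProfileRegistrySketch`, the `Profile` fields, the `forLanguage` signature, and the example entries are hypothetical and are not taken from the PR's actual code.

```java
import java.util.Locale;
import java.util.Map;
import java.util.Optional;

// Hypothetical sketch: names, fields, and entries are illustrative, not the PR's API.
final class ProfileRegistrySketch {

  /** Per-language settings: a stemmer name and whether diacritics are folded. */
  record Profile(String language, String stemmer, boolean foldDiacritics) { }

  // Assumed example entries, keyed by ISO 639-3 code. The values are placeholders.
  private static final Map<String, Profile> PROFILES = Map.of(
      "eng", new Profile("eng", "ENGLISH", true),
      "deu", new Profile("deu", "GERMAN", false),
      "fra", new Profile("fra", "FRENCH", true));

  /** Returns the profile for a language code, or empty when none is registered. */
  static Optional<Profile> forLanguage(String code) {
    if (code == null) {
      return Optional.empty();
    }
    // Case-insensitive lookup; Locale.ROOT avoids locale-specific lowercasing.
    return Optional.ofNullable(PROFILES.get(code.toLowerCase(Locale.ROOT)));
  }

  public static void main(String[] args) {
    System.out.println(forLanguage("eng").map(Profile::stemmer).orElse("(none)"));
    System.out.println(forLanguage("xx").map(Profile::stemmer).orElse("(none)"));
  }
}
```

Returning `Optional` rather than throwing lets a caller fall back to a language-neutral analyzer when a code is unknown, which is one plausible design and not necessarily the one the PR chose.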

Base: OPENNLP-1850-2b-term (#1111). Stack: 1a → 1b → 2a → 2b → 2c (this) → DL → docs.

krickert added 2 commits June 24, 2026 07:48
The language-to-settings registry split out of the former tokenizer PR (#1104) on review request.
NormalizationProfiles maps a language code to its stemmer and diacritic fold (the way OpenNLP already
selects a Snowball stemmer by language) and builds a search-oriented TermAnalyzer; NormalizationProfile
is the per-language record. Builds on the Term model in 2b.

forLanguage("nb") and forLanguage("nn") returned empty. The registry keyed Norwegian under the macrolanguage
code "nor", but the standard written codes nb/nn convert to ISO 639-3 nob/nno, which were absent from the
map. As a result Norwegian text got no profile, and the LanguageDetector (which emits nob/nno) hit the same
gap for a language that supportedLanguages() advertises. Added nob/nno as aliases for the Norwegian stemmer.
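
The lookup gap described in this commit message can be sketched as follows. This is an illustration under stated assumptions, not the PR's code: the class `NorwegianAliasSketch`, the `stemmerFor` method, the hand-written two-letter to three-letter table, and the `"NORWEGIAN"` stemmer name are all hypothetical.

```java
import java.util.HashMap;
import java.util.Locale;
import java.util.Map;
import java.util.Optional;

// Hypothetical sketch of the alias fix; names are illustrative, not the PR's code.
final class NorwegianAliasSketch {

  // Assumed subset of ISO 639-1 to ISO 639-3 conversions relevant to Norwegian.
  private static final Map<String, String> TO_ISO3 = Map.of(
      "no", "nor",   // macrolanguage
      "nb", "nob",   // Bokmal
      "nn", "nno");  // Nynorsk

  private static final Map<String, String> STEMMER_BY_ISO3 = new HashMap<>();
  static {
    // Before the fix, only the macrolanguage key existed.
    STEMMER_BY_ISO3.put("nor", "NORWEGIAN");
    // The fix: register both written standards as aliases for the same stemmer.
    STEMMER_BY_ISO3.put("nob", "NORWEGIAN");
    STEMMER_BY_ISO3.put("nno", "NORWEGIAN");
  }

  /** Resolves a two- or three-letter code to a stemmer name, if one is registered. */
  static Optional<String> stemmerFor(String code) {
    if (code == null) {
      return Optional.empty();
    }
    String lower = code.toLowerCase(Locale.ROOT);
    // Two-letter codes are converted first; three-letter codes pass through unchanged.
    String iso3 = TO_ISO3.getOrDefault(lower, lower);
    return Optional.ofNullable(STEMMER_BY_ISO3.get(iso3));
  }
}
```

Without the two alias entries, `stemmerFor("nb")` would convert to `nob`, miss the map, and return empty, which mirrors the failure the commit describes for both user-supplied codes and detector output.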
krickert force-pushed the OPENNLP-1850-2b-term branch from 82cb041 to e35e859 June 24, 2026 11:54
krickert force-pushed the OPENNLP-1850-2c-profiles branch from a345f48 to 93618ce June 24, 2026 11:54