fix: keep Russian root match from absorbing conjunctions #146

Merged
alyldas merged 2 commits into main from issue-144-russian-boundary on Jul 1, 2026

@alyldas alyldas commented Jul 1, 2026

Summary

  • Keep the Russian huy base rule from treating a following standalone conjunction as part of the same loose stretched root.
  • Replace the final root character class with explicit base variants so loose stretching repeats the selected variant rather than the whole variant set.
  • Add regression coverage for match ranges, censor output, and Telegram HTML spoiler rendering.
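
The difference between the two rule shapes can be illustrated with a small sketch. It assumes a simplified model of loose stretching in which every atom may repeat and repeats may be separated by whitespace or hyphens; `SEP`, `stretchAtom`, `looseSource`, and `firstMatch` are hypothetical names for illustration, not the package's actual scanner code.

```typescript
// Hypothetical model of loose stretching, not the library's implementation:
// each atom may repeat, and repeats may be separated by whitespace or hyphens.
const SEP = "[\\s-]*";

function stretchAtom(atom: string): string {
  return `${atom}(?:${SEP}${atom})*`;
}

function looseSource(atoms: string[]): string {
  return atoms.map(stretchAtom).join(SEP);
}

// Old shape: the final atom is the whole character class, so a "repeat"
// may be any letter of the class.
const classRule = new RegExp(looseSource(["х", "у", "[йяиёею]"]));

// New shape: one alternative per base variant; each alternative can only
// repeat its own final letter.
const splitRule = new RegExp(
  ["й", "я", "и", "ё", "е", "ю"]
    .map((last) => looseSource(["х", "у", last]))
    .join("|"),
);

function firstMatch(rule: RegExp, text: string): string | null {
  const m = rule.exec(text);
  return m ? m[0] : null;
}

const sample = "хуй и мир";
console.log(firstMatch(classRule, sample));
console.log(firstMatch(splitRule, sample));
```

Under this assumed model, the class-based rule extends the match across the space into the standalone conjunction, while the split rule stops after the obscene token.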

Validation

  • npm run check with a temporary npm cache
  • npm run benchmark:profanity on origin/main
  • npm run benchmark:profanity on this branch

Benchmark Evidence

Baseline origin/main:

  • createProfanityFilter(): 7.6038 avg ms
  • check short clean: 0.0275 avg ms
  • check cyrillic clean: 0.0615 avg ms
  • check long clean: 4.5850 avg ms
  • check loose match: 0.0229 avg ms
  • analyze short match: 0.0309 avg ms
  • censor short match: 0.0356 avg ms

This branch:

  • createProfanityFilter(): 7.7734 avg ms
  • check short clean: 0.0263 avg ms
  • check cyrillic clean: 0.0611 avg ms
  • check long clean: 4.3767 avg ms
  • check loose match: 0.0218 avg ms
  • analyze short match: 0.0315 avg ms
  • censor short match: 0.0325 avg ms

Runtime behavior changes only for the Russian base root boundary case where a following standalone conjunction was previously absorbed by loose stretching.
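
For readers who want relative deltas rather than raw averages, the two runs above can be compared with a short script. This is only a sketch over the figures quoted in this description; `runs` and `percentChange` are illustrative names and not part of the repository's benchmark tooling.

```typescript
// Figures copied from the two benchmark runs above (avg ms):
// [benchmark name, baseline origin/main, this branch].
const runs: Array<[string, number, number]> = [
  ["createProfanityFilter()", 7.6038, 7.7734],
  ["check short clean", 0.0275, 0.0263],
  ["check cyrillic clean", 0.0615, 0.0611],
  ["check long clean", 4.585, 4.3767],
  ["check loose match", 0.0229, 0.0218],
  ["analyze short match", 0.0309, 0.0315],
  ["censor short match", 0.0356, 0.0325],
];

// Relative change in percent; a negative value means the branch is faster.
function percentChange(baseline: number, branch: number): number {
  return ((branch - baseline) / baseline) * 100;
}

for (const [name, baseline, branch] of runs) {
  const delta = percentChange(baseline, branch);
  const sign = delta >= 0 ? "+" : "";
  console.log(`${name}: ${sign}${delta.toFixed(2)}%`);
}
```

The description does not report variance or run counts, so small deltas in either direction should be read cautiously.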

Compatibility Notes

  • Public APIs, exports, scanner contracts, and package metadata are unchanged.
  • Existing Russian loose stretching remains available for the explicit base variants; for input such as хуй и, the match range now ends after the obscene token.
  • No workflow, permission, secret, or packaging compatibility changes are introduced.

Closes #144

No publish, no merge, no tag/release.

@alyldas alyldas marked this pull request as ready for review July 1, 2026 16:32
@alyldas alyldas force-pushed the issue-144-russian-boundary branch from 23ba5f0 to 3a00c9d Compare July 1, 2026 16:34
@alyldas alyldas changed the title Keep Russian root match from absorbing conjunctions fix: keep Russian root match from absorbing conjunctions Jul 1, 2026
@alyldas alyldas merged commit 0b67a55 into main Jul 1, 2026
4 checks passed
@alyldas alyldas deleted the issue-144-russian-boundary branch July 1, 2026 16:37

@chatgpt-codex-connector chatgpt-codex-connector Bot left a comment

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: 4956df1f95

ℹ️ About Codex in GitHub

Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you

  • Open a pull request for review
  • Mark a draft as ready
  • Comment "@codex review".

If Codex has suggestions, it will comment; otherwise it will react with 👍.

Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".

  "category": "OBSCENE_MAT",
  "severity": "high",
- "source": "ху[йяиёею]",
+ "source": "(?:хуй|хуя|хуи|хуё|хуе|хую)",

P1: Restore reviewed хуей/ху-ей loose matches

With this source split into fixed alternatives, stretch: true can only repeat the selected final letter, so inputs like хуей or ху-ей now only produce a хуе prefix match; the loose boundary check then rejects it because the same token still has a trailing й. These are existing reviewed loose corpus cases in tests/loose-corpus.spec.ts that should be fully masked, so this change creates a bypass while fixing the following-conjunction case.
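
The regression described here can be reproduced in a simplified model. The sketch assumes that stretching lets each atom repeat across whitespace or hyphens, and that the loose boundary check rejects a match immediately followed by another Cyrillic letter; `loose`, `classRule`, `splitRule`, and `acceptedMatch` are hypothetical stand-ins, not the package's real scanner or boundary logic.

```typescript
// Hypothetical model, not the library's implementation.
const SEP = "[\\s-]*";
const stretch = (atom: string): string => `${atom}(?:${SEP}${atom})*`;
const loose = (atoms: string[]): string => atoms.map(stretch).join(SEP);

// Previous rule shape: final atom is the character class.
const classRule = new RegExp(loose(["х", "у", "[йяиёею]"]));

// Rule shape from this PR: explicit alternatives, each stretching its own final letter.
const splitRule = new RegExp(
  ["й", "я", "и", "ё", "е", "ю"].map((last) => loose(["х", "у", last])).join("|"),
);

// Assumed boundary check: accept a loose match only when the character right
// after it is not a Cyrillic letter (end of input counts as a boundary).
function acceptedMatch(rule: RegExp, text: string): string | null {
  const m = rule.exec(text);
  if (!m) return null;
  const next = text.charAt(m.index + m[0].length);
  return /[а-яё]/i.test(next) ? null : m[0];
}

for (const input of ["хуей", "ху-ей"]) {
  console.log(input, acceptedMatch(classRule, input), acceptedMatch(splitRule, input));
}
```

In this model the class-based rule accepts both inputs in full, while the split rule finds only a prefix ending in е and the boundary check then discards it because й follows.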

  "category": "OBSCENE_MAT",
  "severity": "high",
- "source": "ху[йяиёею]",
+ "source": "(?:хуй|хуя|хуи|хуё|хуе|хую)",

P2: Stop absorbing same-letter following words

Because each explicit alternative is still stretched, a following standalone word that starts with the same final letter is consumed as another repeat of that final atom. For example, привет хуи и мир still gets a loose range over хуи и rather than just хуи (and хуя я ... has the same shape), so the range leak this patch is meant to fix remains for base variants whose last letter matches the next word.
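
A minimal reproduction of this remaining leak, again under an assumed model where each atom may repeat across whitespace or hyphens. `loose`, `splitRule`, and `looseRange` are illustrative names only and do not reflect the package's actual API.

```typescript
// Hypothetical model, not the library's implementation.
const SEP = "[\\s-]*";
const stretch = (atom: string): string => `${atom}(?:${SEP}${atom})*`;
const loose = (atoms: string[]): string => atoms.map(stretch).join(SEP);

// Rule shape from this PR: explicit alternatives, each still stretched.
const splitRule = new RegExp(
  ["й", "я", "и", "ё", "е", "ю"].map((last) => loose(["х", "у", last])).join("|"),
);

interface LooseRange {
  start: number;
  end: number; // exclusive
  text: string;
}

function looseRange(text: string): LooseRange | null {
  const m = splitRule.exec(text);
  if (!m) return null;
  return { start: m.index, end: m.index + m[0].length, text: m[0] };
}

// The following word starts with the variant's own final letter,
// so it is consumed as one more repeat of that letter.
console.log(looseRange("привет хуи и мир"));
console.log(looseRange("хуя я"));

// Contrast: the conjunction differs from the final letter, so the range stops.
console.log(looseRange("хуй и мир"));
```

In this model the first two inputs still yield a range that covers the following one-letter word, whereas the third ends after the obscene token.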

Development

Successfully merging this pull request may close these issues.

Keep Russian obscene root match from absorbing a following conjunction
