fix+perf: lettered sub-items, ~10x faster unmask (v2.6.3, v2.6.4) #37

Merged
click0 merged 2 commits into main from claude/refactor-data-masking-lIcWN on Jun 28, 2026

click0 (Owner) commented Jun 28, 2026

Two deferred audit items closed

2.6.3 — false-positive initials

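  • 'п. В. Петренко' (clause item B), 'ст. А.', 'абз. Б.', 'див. п. В.' are no longer masked as a PIB (ПІБ, a personal name written as initials plus surname)
  • When a service abbreviation (п., пп., ч., ст., абз., гл., розд., …; that is, clause, sub-clause, part, article, paragraph, chapter, section) directly precedes an initials+surname pattern, the leading letter is treated as a clause marker, not a name
  • A real PIB that follows an ordinary word is still masked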
  • Backfilled CHANGELOG entries for 2.6.1 / 2.6.2
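
The 2.6.3 rule can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not the project's code: the names `SERVICE_ABBREVIATIONS`, `INITIALS_SURNAME` and `find_pib` are hypothetical, the abbreviation list holds only the ones named above, and the initials+surname pattern is a simplification.

```python
import re

# Hypothetical list: only the abbreviations named in this PR description.
SERVICE_ABBREVIATIONS = ("розд", "абз", "пп", "ст", "гл", "п", "ч")

UPPER = "А-ЯІЇЄҐ"
LOWER = "а-яіїєґ"

# Optional service-abbreviation prefix, then one or two initials and a surname.
# The prefix is captured so the caller can tell a clause marker from a name.
INITIALS_SURNAME = re.compile(
    r"(?P<prefix>(?:^|\s)(?:" + "|".join(SERVICE_ABBREVIATIONS) + r")\.\s*)?"
    r"(?P<pib>[" + UPPER + r"]\.\s*(?:[" + UPPER + r"]\.\s*)?"
    r"[" + UPPER + r"][" + LOWER + r"']+)"
)


def find_pib(text):
    """Return initials+surname matches that are NOT preceded by a service abbreviation."""
    found = []
    for match in INITIALS_SURNAME.finditer(text):
        if match.group("prefix") is None:
            # Ordinary word (or nothing) before the initials: treat as a real name.
            found.append(match.group("pib"))
        # Otherwise the leading letter is a clause marker, so the match is skipped.
    return found


if __name__ == "__main__":
    print(find_pib("Згідно з п. В. Петренко"))
    print(find_pib("Підписав В. Петренко"))
```

Capturing the prefix as an optional group, rather than using a lookbehind, sidesteps the fixed-width restriction that Python's `re` places on lookbehind assertions.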

2.6.4 — unmask performance

  • unmask_other_data previously scanned the full text once per mask: O(masks × text)
  • It now builds one alternation regex (longer masks first) and walks the text once
  • A per-mask occurrence counter keeps instance tracking exact for collisions (two originals → same mask)
  • 278 KB doc: ~30 s → ~2.8 s (~10×), roundtrip verified
  • The slow per-mask path is kept as a fallback for masks that break regex compilation
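
The single-pass approach described in this list might look roughly like the sketch below. It is an assumption-laden illustration rather than the actual `unmask_other_data`: the function name `unmask_single_pass` and the `originals_by_mask` shape (mask → originals in order of appearance) are hypothetical, and because the sketch escapes every mask with `re.escape`, it omits the per-mask fallback path.

```python
import re
from collections import defaultdict


def unmask_single_pass(text, originals_by_mask):
    """Restore originals in one walk over the text.

    originals_by_mask is a hypothetical structure: each mask maps to the
    list of original strings it replaced, in order of appearance.
    """
    # Longer masks first, so a mask that is a prefix of another never wins.
    masks = sorted((m for m in originals_by_mask if m), key=len, reverse=True)
    if not masks:
        return text

    pattern = re.compile("|".join(re.escape(m) for m in masks))
    seen = defaultdict(int)  # per-mask occurrence counter

    def restore(match):
        mask = match.group(0)
        originals = originals_by_mask[mask]
        if not originals:
            return mask  # nothing recorded for this mask: leave it in place
        index = seen[mask]
        seen[mask] += 1
        # Assumption: if a mask occurs more often than recorded, reuse the last original.
        return originals[min(index, len(originals) - 1)]

    return pattern.sub(restore, text)


if __name__ == "__main__":
    print(unmask_single_pass("ID12 and ID1", {"ID1": ["alpha"], "ID12": ["beta"]}))
    print(unmask_single_pass("X, X", {"X": ["first", "second"]}))
```

Python's `re` tries alternatives left to right and takes the first that matches, which is why sorting by length matters: with `ID1` listed before `ID12`, the shorter mask would match first and leave a stray `2` behind.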

Test plan

  • 433 tests pass (+6 new: lettered sub-items, collision, substring priority, case, empty)

https://claude.ai/code/session_01XT6iUWaQgahXDB9TWX9Bq7


Generated by Claude Code

claude added 2 commits June 28, 2026 16:33
'п. В. Петренко' (clause item B), 'ст. А.', 'абз. Б.' etc. no longer
masked as PIB — when a service abbreviation precedes an initials+surname
pattern, the leading letter is a clause marker, not a name. Real PIB
after ordinary words still masked. Backfill CHANGELOG for 2.6.1/2.6.2.

https://claude.ai/code/session_01XT6iUWaQgahXDB9TWX9Bq7

unmask_other_data scanned the whole text once per mask (O(masks x text)).
Now builds one alternation regex (longer masks first) and walks the text
once. Per-mask occurrence counter preserves exact instance tracking for
collisions. 278KB doc: ~30s -> ~2.8s. Fallback to per-mask scan kept for
masks that fail regex compilation. +4 collision/case/substring tests.

https://claude.ai/code/session_01XT6iUWaQgahXDB9TWX9Bq7
@click0 click0 merged commit 3361638 into main Jun 28, 2026
1 check passed