feat(regex): add German structured PII detection #138
Open
pranjalparmar wants to merge 5 commits into
Add deterministic German-specific PII entity types to the regex engine:

- DE_VAT_ID: German VAT identification number (USt-IdNr)
- DE_IBAN: German IBAN for payments (DE + 20 digits)
- DE_TAX_ID: German tax ID (Steuer-ID, 11 digits)
- DE_SOCIAL_SECURITY_NUMBER: German pension insurance number (11 characters)
- DE_PHONE: German phone numbers (+49 country code)
- DE_POSTAL_CODE: German postal code with prefix (PLZ/DE/D + 5 digits)
- DE_PASSPORT_NUMBER: German passport (1 letter + 8 digits)
- DE_RESIDENCE_PERMIT_NUMBER: German residence permit (AT + 7 digits)

Changes:

- Added regex patterns and labels to RegexAnnotator
- Registered canonical entity types in engine.py and core.py
- Expanded structured_pii.json corpus with test cases
- Created comprehensive test_de_pii_regex.py with positive/negative cases
- Updated STRUCTURED_TYPES in accuracy tests
- No setup.py or dependency changes (regex-only, deterministic)

Test results:

- 381 tests passed (includes 18 new German PII tests)
- All regex and accuracy tests pass
- No regressions in existing functionality
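As a rough illustration of the formats listed above, the following is a minimal sketch covering four of the entity types. The expressions, the `SKETCH_PATTERNS` dictionary, and the `find_de_pii` helper are hypothetical and are not taken from the PR; the actual patterns in RegexAnnotator may differ. The VAT ID is assumed to be `DE` followed by 9 digits.

```python
import re

# Hypothetical patterns for illustration only; not the PR's actual regexes.
# Each pattern requires that the token is not embedded in a longer
# alphanumeric run.
SKETCH_PATTERNS = {
    "DE_VAT_ID": re.compile(r"(?<![A-Za-z0-9])DE\d{9}(?![A-Za-z0-9])"),
    "DE_IBAN": re.compile(r"(?<![A-Za-z0-9])DE\d{20}(?![A-Za-z0-9])"),
    "DE_PASSPORT_NUMBER": re.compile(r"(?<![A-Za-z0-9])[A-Z]\d{8}(?![A-Za-z0-9])"),
    "DE_RESIDENCE_PERMIT_NUMBER": re.compile(r"(?<![A-Za-z0-9])AT\d{7}(?![A-Za-z0-9])"),
}


def find_de_pii(text):
    """Return (label, matched_text) pairs ordered by position in the text."""
    hits = []
    for label, pattern in SKETCH_PATTERNS.items():
        for match in pattern.finditer(text):
            hits.append((match.start(), label, match.group()))
    return [(label, value) for _, label, value in sorted(hits)]
```

Calling `find_de_pii("USt-IdNr: DE123456789, IBAN DE89370400440532013000")` should report one DE_VAT_ID hit and one DE_IBAN hit, since the 9-digit pattern cannot match inside the longer IBAN.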
Replace the digit-only lookahead with alphanumeric boundaries to prevent false-positive prefix matches. For example, the pattern now rejects the longer token DE123456789A instead of matching its prefix DE123456789. All 363 tests pass with zero regressions.
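The difference between the two boundary styles can be sketched as follows. Both expressions are hypothetical reconstructions for DE_VAT_ID (assumed here to be `DE` plus 9 digits) and are not copied from the PR; `matches` is an illustrative helper.

```python
import re

# Hypothetical "before" pattern: only refuses a trailing digit.
DIGIT_LOOKAHEAD = re.compile(r"DE\d{9}(?!\d)")

# Hypothetical "after" pattern: refuses any adjacent letter or digit
# on either side of the token.
ALNUM_BOUNDARY = re.compile(r"(?<![A-Za-z0-9])DE\d{9}(?![A-Za-z0-9])")


def matches(pattern, text):
    """Return every substring of text matched by pattern."""
    return [m.group() for m in pattern.finditer(text)]
```

With these definitions, `matches(DIGIT_LOOKAHEAD, "DE123456789A")` returns `["DE123456789"]`, the false-positive prefix match, while `matches(ALNUM_BOUNDARY, "DE123456789A")` returns an empty list.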
DE_PHONE overlaps with the generic PHONE pattern, causing the redaction system to apply both replacements and corrupt the output. Since German phone numbers are already detected by the generic PHONE pattern, remove DE_PHONE as a separate entity type.

Removes:

- DE_PHONE from LABELS and regex patterns
- DE_PHONE from ALL_ENTITY_TYPES in engine
- DE_PHONE from supported entities in core
- DE_PHONE test cases from test_de_pii_regex.py
- DE_PHONE corpus entry from structured_pii.json

Also updates the expected label count from 15 to 14.

German PII detection still covers 7 entity types: DE_VAT_ID, DE_IBAN, DE_TAX_ID, DE_SOCIAL_SECURITY_NUMBER, DE_POSTAL_CODE, DE_PASSPORT_NUMBER, DE_RESIDENCE_PERMIT_NUMBER

All 361 tests pass with zero regressions.
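One way such corruption can arise is sketched below, assuming a redactor that applies every detected span by character offset. The `redact` function, the sample text, and the span tuples are all hypothetical; the project's real redaction code is not shown in this PR and may work differently.

```python
def redact(text, spans):
    """Naive offset-based redaction (hypothetical, for illustration).

    Each span is (start, end, label). Spans are applied right to left,
    which is only safe when no two spans overlap.
    """
    for start, end, label in sorted(spans, reverse=True):
        text = text[:start] + f"[{label}]" + text[end:]
    return text


TEXT = "Call +49 30 1234567 now"

# Both detectors flag the same characters (offsets 5..19).
PHONE_SPAN = (5, 19, "PHONE")
DE_PHONE_SPAN = (5, 19, "DE_PHONE")
```

Here `redact(TEXT, [PHONE_SPAN])` yields `"Call [PHONE] now"`, but `redact(TEXT, [PHONE_SPAN, DE_PHONE_SPAN])` yields `"Call [DE_PHONE]"`: the second replacement reuses offsets that no longer fit the already-shortened string, so the trailing text is lost. Removing the duplicate detector avoids the overlap at its source.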
…erage

- Replace the exact LABELS length check with subset validation to avoid breakage on future label additions
- Add positive and negative test cases for the DE_VAT_ID and DE_IBAN regex patterns
- Ensure the label tests stay valid when new entity types are added, without modifying existing tests
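The subset check described above might look like the sketch below. The `LABELS` list is a hypothetical stand-in for the annotator's label list (the `EMAIL` entry is invented for illustration), and the test name is likewise an assumption.

```python
# Hypothetical stand-in for the annotator's LABELS list.
LABELS = [
    "EMAIL",
    "PHONE",
    "DE_VAT_ID",
    "DE_IBAN",
    "DE_TAX_ID",
    "DE_SOCIAL_SECURITY_NUMBER",
    "DE_POSTAL_CODE",
    "DE_PASSPORT_NUMBER",
    "DE_RESIDENCE_PERMIT_NUMBER",
]

REQUIRED_DE_LABELS = {
    "DE_VAT_ID",
    "DE_IBAN",
    "DE_TAX_ID",
    "DE_SOCIAL_SECURITY_NUMBER",
    "DE_POSTAL_CODE",
    "DE_PASSPORT_NUMBER",
    "DE_RESIDENCE_PERMIT_NUMBER",
}


def test_de_labels_registered():
    # Subset validation: passes as long as every required label exists,
    # regardless of how many other labels are added later.
    missing = REQUIRED_DE_LABELS - set(LABELS)
    assert not missing, f"missing labels: {sorted(missing)}"
```

Unlike an assertion of the form `len(LABELS) == 14`, this test keeps passing when an unrelated label is appended, and it names the missing labels when it fails.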
Target Branch: dev