feat(regex): add German structured PII detection #138

Open
pranjalparmar wants to merge 5 commits into DataFog:dev from pranjalparmar:pranjalparmar/feat-german-structured-pii

@pranjalparmar

Add deterministic German-specific PII entity types to the regex engine:

  • DE_VAT_ID: German VAT identification number (USt-IdNr)
  • DE_IBAN: German IBAN for payments (DE + 20 digits)
  • DE_TAX_ID: German tax ID (Steuer-ID, 11 digits)
  • DE_SOCIAL_SECURITY_NUMBER: German pension insurance number (11 characters)
  • DE_POSTAL_CODE: German postal code with prefix (PLZ/DE/D + 5 digits)
  • DE_PASSPORT_NUMBER: German passport (1 letter + 8 digits)
  • DE_RESIDENCE_PERMIT_NUMBER: German residence permit (AT + 7 digits)
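To make the listed formats concrete, here is a minimal sketch of what such patterns could look like. The expressions, the `DE_PATTERNS` dict, and the `find_de_pii` helper are assumptions derived only from the descriptions above and are not the PR's actual code. The optional space or hyphen after the postal-code prefix is also an assumption. DE_TAX_ID and DE_SOCIAL_SECURITY_NUMBER are omitted because the list does not specify their layout beyond length.

```python
import re

# Hypothetical patterns built from the formats listed above.
# The real expressions in RegexAnnotator may differ.
NOT_ALNUM_BEFORE = r"(?<![A-Za-z0-9])"
NOT_ALNUM_AFTER = r"(?![A-Za-z0-9])"

DE_PATTERNS = {
    # DE + 9 digits
    "DE_VAT_ID": re.compile(NOT_ALNUM_BEFORE + r"DE\d{9}" + NOT_ALNUM_AFTER),
    # DE + 20 digits
    "DE_IBAN": re.compile(NOT_ALNUM_BEFORE + r"DE\d{20}" + NOT_ALNUM_AFTER),
    # Prefix PLZ/DE/D + 5 digits (separator is an assumption)
    "DE_POSTAL_CODE": re.compile(
        NOT_ALNUM_BEFORE + r"(?:PLZ|DE|D)[ -]?\d{5}" + NOT_ALNUM_AFTER
    ),
    # 1 letter + 8 digits
    "DE_PASSPORT_NUMBER": re.compile(
        NOT_ALNUM_BEFORE + r"[A-Z]\d{8}" + NOT_ALNUM_AFTER
    ),
    # AT + 7 digits
    "DE_RESIDENCE_PERMIT_NUMBER": re.compile(
        NOT_ALNUM_BEFORE + r"AT\d{7}" + NOT_ALNUM_AFTER
    ),
}


def find_de_pii(text):
    """Return a sorted list of (label, matched_text) for every pattern hit."""
    hits = []
    for label, pattern in DE_PATTERNS.items():
        for match in pattern.finditer(text):
            hits.append((label, match.group()))
    return sorted(hits)
```

Because each pattern is bounded on both sides, the VAT and postal-code patterns do not fire on the leading characters of a longer IBAN.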

Changes

  • Added regex patterns and labels to RegexAnnotator
  • Registered canonical entity types in engine.py and core.py
  • Expanded structured_pii.json corpus with test cases
  • Created comprehensive test_de_pii_regex.py with positive/negative cases
  • Updated STRUCTURED_TYPES in accuracy tests
  • No setup.py or dependency changes (regex-only, deterministic)

Test Results

  • 381 tests passed (includes 18 new German PII tests)
  • All regex and accuracy tests pass
  • No regressions in existing functionality

Type

  • Feature

Target Branch

  • This PR targets dev

Add deterministic German-specific PII entity types to the regex engine:
- DE_VAT_ID: German VAT identification number (USt-IdNr)
- DE_IBAN: German IBAN for payments (DE + 20 digits)
- DE_TAX_ID: German tax ID (Steuer-ID, 11 digits)
- DE_SOCIAL_SECURITY_NUMBER: German pension insurance number (11 characters)
- DE_PHONE: German phone numbers (+49 country code)
- DE_POSTAL_CODE: German postal code with prefix (PLZ/DE/D + 5 digits)
- DE_PASSPORT_NUMBER: German passport (1 letter + 8 digits)
- DE_RESIDENCE_PERMIT_NUMBER: German residence permit (AT + 7 digits)

Changes:
- Added regex patterns and labels to RegexAnnotator
- Registered canonical entity types in engine.py and core.py
- Expanded structured_pii.json corpus with test cases
- Created comprehensive test_de_pii_regex.py with positive/negative cases
- Updated STRUCTURED_TYPES in accuracy tests
- No setup.py or dependency changes (regex-only, deterministic)

Test results:
- 381 tests passed (includes 18 new German PII tests)
- All regex and accuracy tests pass
- No regressions in existing functionality

Replace digit-only lookahead with alphanumeric boundaries to prevent
false positive prefix matches. For example, the longer token DE123456789A
is now rejected instead of being matched as DE123456789.
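
The difference between the two boundary styles can be sketched as follows. Both expressions are assumptions written for illustration (using the DE + 9 digits VAT format), not the patterns from the diff.

```python
import re

# Assumed "before" style: only a trailing digit blocks the match.
digit_lookahead = re.compile(r"\bDE\d{9}(?!\d)")

# Assumed "after" style: any adjacent letter or digit blocks the match.
alnum_boundary = re.compile(r"(?<![A-Za-z0-9])DE\d{9}(?![A-Za-z0-9])")

token = "DE123456789A"

# The digit-only lookahead accepts the prefix of the longer token.
prefix_hit = digit_lookahead.search(token)

# The alphanumeric boundary rejects the token entirely.
bounded_hit = alnum_boundary.search(token)

# A well-formed ID followed by punctuation is still detected.
clean_hit = alnum_boundary.search("VAT: DE123456789.")
```

Here `prefix_hit` matches `DE123456789`, `bounded_hit` is `None`, and `clean_hit` matches `DE123456789`.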

All 363 tests pass with zero regressions.

DE_PHONE overlaps with the generic PHONE pattern, causing the redaction
system to apply both replacements and corrupt output. Since German phone
numbers are already detected by the generic PHONE pattern, remove the
DE_PHONE pattern as a separate entity type.
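
One way overlapping spans can corrupt output is sketched below. The `redact_naive` function is a hypothetical stand-in, and the assumption that replacements are applied by original character offsets is mine. The project's actual redaction code is not shown in this PR description.

```python
def redact_naive(text, spans):
    """Replace each (start, end, label) span using offsets from the original text.

    This is a hypothetical illustration, not DataFog's redaction logic.
    Offsets are only valid for the first replacement, so a second span
    covering the same range slices the already-shortened string.
    """
    out = text
    for start, end, label in sorted(spans, reverse=True):
        out = out[:start] + f"[{label}]" + out[end:]
    return out


text = "Call +49 30 1234567 now"
number = "+49 30 1234567"
start = text.index(number)
end = start + len(number)

# Two entity types reporting the same span, as with PHONE and DE_PHONE.
both = redact_naive(text, [(start, end, "PHONE"), (start, end, "DE_PHONE")])

# A single entity type for the span.
single = redact_naive(text, [(start, end, "PHONE")])
```

With both labels, the trailing `" now"` is lost and `both` becomes `"Call [DE_PHONE]"`. With one label, `single` is `"Call [PHONE] now"`.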

Removes:
- DE_PHONE from LABELS and regex patterns
- DE_PHONE from ALL_ENTITY_TYPES in engine
- DE_PHONE from supported entities in core
- DE_PHONE test cases from test_de_pii_regex.py
- DE_PHONE corpus entry from structured_pii.json

Also updates the label count from 15 to 14.

German PII detection still covers 7 entity types:
DE_VAT_ID, DE_IBAN, DE_TAX_ID, DE_SOCIAL_SECURITY_NUMBER,
DE_POSTAL_CODE, DE_PASSPORT_NUMBER, DE_RESIDENCE_PERMIT_NUMBER

All 361 tests pass with zero regressions.

…erage

- Replace exact LABELS length check with subset validation to avoid breakage on future label additions
- Add positive and negative test cases for DE_VAT_ID and DE_IBAN regex patterns
- Keeps existing tests passing when new entity types are added, without further edits to those tests
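
The subset check described in the first bullet could look roughly like this. The `LABELS` list is a shortened stand-in for `RegexAnnotator.LABELS`, and the test function name is hypothetical.

```python
# Shortened stand-in for RegexAnnotator.LABELS; the real list has more entries.
LABELS = [
    "PHONE",
    "DE_VAT_ID",
    "DE_IBAN",
    "DE_TAX_ID",
    "DE_SOCIAL_SECURITY_NUMBER",
    "DE_POSTAL_CODE",
    "DE_PASSPORT_NUMBER",
    "DE_RESIDENCE_PERMIT_NUMBER",
]

EXPECTED_DE_LABELS = {
    "DE_VAT_ID",
    "DE_IBAN",
    "DE_TAX_ID",
    "DE_SOCIAL_SECURITY_NUMBER",
    "DE_POSTAL_CODE",
    "DE_PASSPORT_NUMBER",
    "DE_RESIDENCE_PERMIT_NUMBER",
}


def test_labels_include_german_types(labels=LABELS):
    # Brittle alternative: assert len(labels) == 8 breaks on every new label.
    # The subset check only requires that the expected labels are present.
    missing = EXPECTED_DE_LABELS - set(labels)
    assert not missing, f"missing labels: {sorted(missing)}"
```

Adding a new label to the list leaves this test green, while removing one of the German labels fails it with the missing names in the message.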
Copilot AI review requested due to automatic review settings May 23, 2026 10:53
Copilot AI left a comment

Copilot encountered an error and was unable to review this pull request. You can try again by re-requesting a review.
