[codex] Integrate German regex support #146

Open
sidmohan0 wants to merge 7 commits into dev from codex/dfpy-78-german-regex-support

Conversation

@sidmohan0
Contributor

Summary

  • Adapt external PR #138 ("feat(regex): add German structured PII detection") for the 4.5 lightweight regex path without adding dependencies.
  • Add German VAT ID and German IBAN detection to the default regex set.
  • Add broader German structured identifiers behind explicit locales=["de"] or explicit entity_types, with context guards so that ordinary ticket, SKU, and order IDs do not produce false positives.
  • Propagate locale support through scan, redact, guardrail helpers, DataFog, TextService, and the core text CLI commands.
  • Document German locale behavior in README and user docs.
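
To make the default-on versus locale-gated split above concrete, here is a minimal sketch. It is not the code in this PR: the patterns, the labels (DE_VAT_ID, DE_IBAN, DE_TAX_ID), the 40-character context window, and the helper names are all assumptions chosen for illustration. Only the locales=["de"] opt-in mirrors the description. The sketch relies on two well-known formats: a German VAT ID is "DE" plus 9 digits, and a German IBAN is "DE" plus 20 digits with an ISO 13616 mod-97 checksum.

```python
import re
from typing import List, Sequence, Tuple

# All patterns and labels below are illustrative assumptions, not DataFog's own.
DE_VAT_RE = re.compile(r"\bDE\d{9}\b")                        # "DE" + 9 digits
DE_IBAN_RE = re.compile(r"\bDE\d{2}(?: ?\d{4}){4} ?\d{2}\b")  # "DE" + 20 digits
DE_TAX_ID_RE = re.compile(r"\b\d{11}\b")                      # broad: any 11 digits
TAX_CONTEXT_RE = re.compile(r"steuer-?id|idnr", re.IGNORECASE)

Hit = Tuple[str, int, int, str]  # (label, start, end, matched text)


def iban_checksum_ok(iban: str) -> bool:
    """ISO 13616 check: move the first 4 chars to the end, map A=10..Z=35, mod 97 == 1."""
    compact = iban.replace(" ", "").upper()
    rearranged = compact[4:] + compact[:4]
    digits = "".join(str(int(ch, 36)) for ch in rearranged)
    return int(digits) % 97 == 1


def find_de_entities(text: str, locales: Sequence[str] = ()) -> List[Hit]:
    hits: List[Hit] = []
    # Narrow, low-noise shapes are scanned by default.
    for m in DE_VAT_RE.finditer(text):
        hits.append(("DE_VAT_ID", m.start(), m.end(), m.group()))
    for m in DE_IBAN_RE.finditer(text):
        if iban_checksum_ok(m.group()):
            hits.append(("DE_IBAN", m.start(), m.end(), m.group()))
    # Broad shapes need an explicit locale AND a nearby context keyword.
    if "de" in locales:
        for m in DE_TAX_ID_RE.finditer(text):
            window = text[max(0, m.start() - 40):m.start()]
            if TAX_CONTEXT_RE.search(window):
                hits.append(("DE_TAX_ID", m.start(), m.end(), m.group()))
    return sorted(hits, key=lambda h: h[1])
```

With this sketch, find_de_entities("Ticket 12345678901", locales=["de"]) returns an empty list because no context keyword precedes the number, while the same digits after "Steuer-ID: " are reported. The real guards in this PR may use different keywords and window sizes.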

Review notes

  • This follows the DFPY-77 review decision: proceed by adapting PR #138 ("feat(regex): add German structured PII detection"), but avoid merging the original PR as-is because several broad German identifiers were too noisy when enabled by default.
  • Regex overlap suppression now prefers the longer/more specific German VAT match over an inner generic SSN-shaped substring, preventing bad redaction output.
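
The overlap rule described above can be sketched as a greedy longest-match-first filter. This is an assumed reconstruction for illustration only: the Span type, suppress_overlaps, and redact are hypothetical names, and the actual annotator may break ties or rank specificity differently.

```python
from typing import List, NamedTuple


class Span(NamedTuple):
    label: str
    start: int  # inclusive
    end: int    # exclusive


def suppress_overlaps(spans: List[Span]) -> List[Span]:
    """Keep the longest spans first; drop any span that overlaps a kept one."""
    kept: List[Span] = []
    by_length = sorted(spans, key=lambda s: (-(s.end - s.start), s.start))
    for span in by_length:
        if all(span.end <= k.start or span.start >= k.end for k in kept):
            kept.append(span)
    return sorted(kept, key=lambda s: s.start)


def redact(text: str, spans: List[Span]) -> str:
    """Replace surviving spans right to left so earlier offsets stay valid."""
    out = text
    for span in reversed(suppress_overlaps(spans)):
        out = out[:span.start] + f"[{span.label}]" + out[span.end:]
    return out
```

For the text "VAT DE123456789 ok", a VAT span covering offsets 4 to 15 and an SSN-shaped span covering the inner digits at offsets 6 to 15 collapse to the VAT span alone, so the redacted text contains a single [DE_VAT_ID] placeholder rather than "DE" followed by an SSN placeholder.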

Verification

  • DATAFOG_NO_TELEMETRY=1 DO_NOT_TRACK=1 .venv312/bin/python -m pytest tests/test_de_pii_regex.py tests/test_regex_annotator.py -q
  • DATAFOG_NO_TELEMETRY=1 DO_NOT_TRACK=1 .venv312/bin/python -m pytest tests/test_detection_accuracy.py::test_structured_pii_detection_fast tests/test_detection_accuracy.py::test_negative_cases_fast tests/test_main.py::test_lean_datafog_detect tests/test_main.py::test_lean_datafog_process tests/test_client.py::test_scan_text_success tests/test_cli_smoke.py::test_redact_text_command -q
  • .venv312/bin/pre-commit run --files README.md datafog/__init__.py datafog/agent.py datafog/client.py datafog/core.py datafog/engine.py datafog/main.py datafog/processing/text_processing/regex_annotator/regex_annotator.py datafog/services/text_service.py docs/cli.rst docs/getting-started.rst docs/python-sdk.rst tests/corpus/structured_pii.json tests/test_detection_accuracy.py tests/test_regex_annotator.py tests/test_de_pii_regex.py --show-diff-on-failure
  • DATAFOG_NO_TELEMETRY=1 DO_NOT_TRACK=1 .venv312/bin/python -m sphinx -b html docs docs/_build/html
  • DATAFOG_NO_TELEMETRY=1 DO_NOT_TRACK=1 .venv312/bin/python -m pytest tests/test_runtime_dependency_safety.py tests/test_no_network_core.py -q
  • git diff --check
  • DATAFOG_NO_TELEMETRY=1 DO_NOT_TRACK=1 .venv312/bin/python -m pytest -m "not slow" -q -> 583 passed, 4 skipped, 295 deselected, 19 xfailed

Refs DFPY-78.
