Skip to content

feat(#112): locale-aware number normalization (supersedes #319)#320

Merged
jqnatividad merged 4 commits into
mainfrom
fix-112-locale-aware-numbers
May 19, 2026
Merged

feat(#112): locale-aware number normalization (supersedes #319)#320
jqnatividad merged 4 commits into
mainfrom
fix-112-locale-aware-numbers

Conversation

@jqnatividad
Copy link
Copy Markdown
Collaborator

Closes #112. Supersedes the closed #319.

Summary

Issue #112 reporter (German locale): qsv inferred numeric columns as String when values used , as the decimal separator (57,957, 41,84699). qsv's number parser only recognizes . as the decimal separator — CSV files don't carry locale metadata, so this is correct default behavior. The fix is opt-in.

#319 used a global boolean flag (decimal_comma) + an anchored qsv replace. Per review feedback, the boolean had three structural problems: hardcoded shape (only <int>,<frac>), global toggle (every ingestion treated the same), and ignored existing metadata. This PR replaces it with a CLDR-locale-driven design.

Resolution order (per resource, per ingestion)

# Source Path
1 context.resource.get('dpp_locale') babel CLDR parse
2 ckanext.datapusher_plus.default_locale babel CLDR parse
3 ckanext.datapusher_plus.decimal_separator (single char) anchored regex
4 (none of the above) no-op

First non-empty wins. The locale paths use babel.numbers.parse_decimal, which handles 57,957 (de_DE), 1.234,56 (de_DE thousands+decimal), 57 957,12 (fr_FR space-thousands), and any other CLDR-supported shape. The separator path is the escape hatch for operators with non-CLDR data (single-char anchored regex, no thousands support).

Components

  • config.py + config_declaration.yamlDEFAULT_LOCALE (str) and DECIMAL_SEPARATOR (str), both default "", both editable: true. The YAML description spells out the resolution order, the babel-vs-regex split, and the known v1 limits.
  • dataset-druf.yaml — new dpp_locale resource_field (optional text, BCP 47 / POSIX-style id). Sits in the dpp_* namespace alongside dpp_spatial_extent/dpp_suggestions.
  • analysis.py_normalize_locale_numbers called from process() right after _sanitize_headers. Pure-Python csv.reader/csv.writer (validation upstream already normalizes to RFC 4180). Per-column conversion counts logged at INFO so operators can spot wrong-locale typos.
  • requirements.txtbabel>=2.9 declared explicitly (was transitive via CKAN).

Skip rules

  • babel.numbers.get_decimal_symbol(locale_id) == '.' → log INFO, short-circuit. Locales that already use dot-decimal (en_US, en_GB, ja_JP, ...) get nothing done. NOTE: de_CH falls here; Swiss thousands-apostrophe stripping is a v1 limit — use decimal_separator = \"'\" as the workaround.
  • UnknownLocaleError → log WARNING, short-circuit without falling back to default_locale. Operators need to see the typo, not get silently masked.

Known v1 limitations (documented in YAML description)

  • de_CH apostrophe-thousands not stripped (above)
  • Mixed-locale columns fall back to String (no worse than today)
  • Accounting parens (57,957) not parsed (babel raises NumberFormatError)
  • Per-column locale override deferred to a follow-up if a real case appears

Test plan

  • 9 new tests in tests/test_issue_112_decimal_comma.py, all passing:
    1. config.py inline fallbacks both \"\" (opt-in only)
    2. YAML declaration defaults match (drift guard — Date without timestamp #179-style failure mode)
    3. dpp_locale declared in dataset-druf.yaml as a resource_field
    4. Helper is a no-op when nothing resolves (no temp file, no tmp mutation)
    5. default_locale='de_DE' normalizes both 57,95757.957 AND 1.234,561234.56 (pins the babel-vs-regex distinction)
    6. Per-resource dpp_locale='fr_FR' overrides global default; French space-thousands 57 957,1257957.12
    7. decimal_separator=',' with no locale converts 57,95757.957 and intentionally leaves 1.234,56 verbatim (anchored regex doesn't match)
    8. Unknown locale xx_YY → WARNING + no-op (doesn't crash, doesn't fall back)
    9. End-to-end with real qsv on the reporter's exact German sample: baseline pins the Does not handle locale #112 failure mode, post-conversion the six numeric columns infer as Float and Kennziffer/Einheit stay String
  • Full unit suite: 243 passed (234 baseline + 9 new). No regressions.
  • Scheming round-trip verification in the integration stack: upload a resource with dpp_locale set, confirm resource_show returns it as a top-level dict key, confirm ingestion picks it up
  • Reporter reproducer: upload the issue's German sample via the integration stack with default_locale = de_DE, confirm Postgres column types are numeric/double precision

🤖 Generated with Claude Code

jqnatividad and others added 2 commits May 18, 2026 23:20
Issue #112 reporter (German locale): qsv infers numeric columns as
``String`` when values use ``,`` as the decimal separator. qsv's
number parser only recognizes ``.`` as the decimal separator — CSV
files don't carry locale metadata, so this is correct default
behavior. The fix is opt-in: declare a locale and DP+ rewrites the
numeric cells before stats inference.

This supersedes the closed PR #319, which used a global boolean flag
``decimal_comma`` + a single anchored ``qsv replace``. The boolean
approach hardcoded one specific shape (``<int>,<frac>``) and ignored
all the locale metadata CKAN already carries; this design uses CLDR
locale identifiers and routes through ``babel.numbers.parse_decimal``
for full coverage (``1.234,56`` de_DE thousands+decimal,
``57 957,12`` fr_FR space-thousands, etc.).

Resolution order (per resource, per ingestion):

  1. ``context.resource.get('dpp_locale')``  → babel CLDR parse
  2. ``conf.DEFAULT_LOCALE``                 → babel CLDR parse
  3. ``conf.DECIMAL_SEPARATOR`` (single chr) → anchored regex pass
  4. (none of the above)                     → no-op

First non-empty wins. Locale paths use babel; the separator path is
the escape hatch for operators who know the separator but don't have
or want CLDR locale info (synthetic data, non-CLDR formats, Swiss
thousands-apostrophe workaround).

Implementation:

* ``config.py`` + ``config_declaration.yaml``: two new keys —
  ``default_locale`` (str, default ``""``) and ``decimal_separator``
  (str, default ``""``). Both ``editable: true``. The YAML
  description documents the resolution order, the babel-vs-regex
  paths, and the known v1 limitations (``de_CH`` skipped because
  ``get_decimal_symbol`` returns ``.``; mixed-locale columns fall
  back to ``String``; accounting parens not parsed;
  ``dpp_column_locales`` deferred).
* ``dataset-druf.yaml``: new ``dpp_locale`` resource_field. Optional
  text, BCP 47 / POSIX-style identifier. Materializes on the
  resource dict via scheming so ``context.resource.get('dpp_locale')``
  reads it directly. ``dpp_*`` namespace matches the existing
  ``dpp_spatial_extent`` / ``dpp_suggestions`` precedent.
* ``analysis.py``: new ``_normalize_locale_numbers`` called from
  ``process()`` right after ``_sanitize_headers``, before
  ``_create_index``. Pure-Python ``csv.reader`` + ``csv.writer`` (the
  validation stage already normalizes upstream CSVs to RFC 4180, so
  no dialect plumbing needed). Cell-by-cell try-parse via babel; on
  ``NumberFormatError`` or empty cell, the cell stays verbatim.
  ``format(value, 'f')`` forces fixed-point output (avoids ``str()``
  emitting scientific notation for tiny Decimals — caught by the
  Plan agent). Per-column conversion counts logged at INFO so
  operators can spot wrong-locale misconfigurations (zero
  conversions on a column they expected to convert).
* Skip rules: ``get_decimal_symbol(locale_id) == '.'`` → log INFO and
  short-circuit (the locale already uses dot-decimal — en_US, en_GB,
  ja_JP, de_CH, en_IN, ...). ``UnknownLocaleError`` → log WARNING
  and short-circuit without falling back to ``DEFAULT_LOCALE``
  (operators need to see the typo, not get silently masked).
* ``requirements.txt``: ``babel>=2.9`` declared explicitly. Babel was
  already transitive via CKAN; DP+ now imports ``babel.numbers``
  directly so the dep is declared.

Tests (9, all passing — full unit suite at 243 / +9 net):

  1. ``DEFAULT_LOCALE`` + ``DECIMAL_SEPARATOR`` Python fallbacks are ``""``
  2. YAML declaration defaults agree with Python fallback (drift guard
     — the #179-style failure mode)
  3. ``dpp_locale`` is declared in ``dataset-druf.yaml`` as a
     resource_field (so scheming surfaces it and ``resource_show``
     returns it)
  4. Helper is a no-op when nothing resolves (no temp file written,
     no ``context.tmp`` mutation) — the most important property:
     existing operators see byte-identical pre-#112 behavior
  5. ``default_locale='de_DE'`` normalizes both ``57,957`` →
     ``57.957`` AND ``1.234,56`` → ``1234.56`` (the second pins the
     babel-vs-regex difference)
  6. Per-resource ``dpp_locale='fr_FR'`` overrides global
     ``default_locale='de_DE'``, parsing French space-thousands
     ``57 957,12`` correctly
  7. ``decimal_separator=','`` with no locale converts ``57,957`` →
     ``57.957`` and (intentionally) leaves ``1.234,56`` verbatim
     (anchored regex doesn't match)
  8. Unknown locale ``xx_YY`` logs WARNING and short-circuits to
     no-op (does NOT crash, does NOT fall back)
  9. **End-to-end with real qsv**: baseline pins the #112 failure
     mode (all comma-decimal columns inferred as ``String``); after
     conversion, ``Float`` for numeric columns and ``String``
     preserved for ``Kennziffer`` / ``Einheit``

Co-Authored-By: Claude Opus 4.7 (1M context) <[email protected]>
CI on first push of #320 failed at the ``Initialize CKAN database``
step with::

    ckan.logic.ValidationError: {'groups': [{'options': [...,
      {'type': ["Value must be one of ['base', 'bool', 'int',
                'dynamic', 'list']"]},
      {'type': [...]}, ...]}]}

CKAN's config_declaration loader only accepts those five ``type``
values — there's no ``str`` type. The convention for string-valued
settings is to OMIT ``type`` entirely; the loader then defaults to
``base`` which validates as a free-form string. See existing
precedents: ``download_proxy`` and ``file_hash_algorithm`` both
declare strings without a ``type`` key.

Changes:

* ``config_declaration.yaml`` — drop ``type: str`` from both new
  entries (``default_locale``, ``decimal_separator``).
* ``tests/test_issue_112_decimal_comma.py::
  test_config_declaration_defaults_match_python_fallback`` — flip the
  ``type == 'str'`` assertion to ``'type' not in entry``, plus a
  docstring NOTE explaining the constraint so the next person who
  reaches for ``type: str`` finds the precedent.

All 9 #112 tests still pass locally; the integration CI failure was
purely on the YAML schema.

Co-Authored-By: Claude Opus 4.7 (1M context) <[email protected]>
Copy link
Copy Markdown
Contributor

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

This PR resolves issue #112 by introducing locale-aware preprocessing of CSV numeric cells so that comma-decimal and other CLDR-format numbers (e.g. German 57,957, 1.234,56; French 57 957,12) are normalized to dot-decimal before qsv stats infers column types. It supersedes the simpler decimal_comma boolean flag in PR #319 with a per-resource/global/single-separator resolution chain driven primarily by babel.numbers.parse_decimal.

Changes:

  • New AnalysisStage._normalize_locale_numbers runs between header sanitization and stats/index; resolves dpp_locale (per-resource) → DEFAULT_LOCALEDECIMAL_SEPARATOR → no-op.
  • New config keys default_locale and decimal_separator (both default "", opt-in) plus new dpp_locale resource field declared in dataset-druf.yaml.
  • Adds explicit babel>=2.9 dependency and 9 new regression tests pinning the resolution order, drift between YAML/Python defaults, and end-to-end qsv inference on the reporter's German sample.

Reviewed changes

Copilot reviewed 5 out of 6 changed files in this pull request and generated 2 comments.

Show a summary per file
File Description
ckanext/datapusher_plus/jobs/stages/analysis.py Adds _normalize_locale_numbers and _rewrite_cell helpers and wires them into process().
ckanext/datapusher_plus/config.py Declares DEFAULT_LOCALE and DECIMAL_SEPARATOR config getters (both default "").
ckanext/datapusher_plus/config_declaration.yaml YAML declarations for the two new keys with detailed description and v1 limitations.
ckanext/datapusher_plus/dataset-druf.yaml Declares dpp_locale resource field for per-resource override.
requirements.txt Adds explicit babel>=2.9 dependency.
tests/test_issue_112_decimal_comma.py Nine new tests covering defaults, drift, gating, locale paths, separator path, unknown locale, and an end-to-end qsv test.
Comments suppressed due to low confidence (1)

ckanext/datapusher_plus/jobs/stages/analysis.py:398

  • babel.numbers.parse_decimal always returns a Decimal (it constructs one internally and raises NumberFormatError otherwise), so the else str(value) branch is unreachable. Either drop the isinstance guard and call format(value, "f") directly, or — if you want to be defensive against a future babel API change — assert/cast accordingly. As written the dead branch implies parse_decimal might return some non-Decimal type, which can confuse future readers.
            return format(value, "f") if isinstance(value, Decimal) else str(value)

💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

Comment thread ckanext/datapusher_plus/jobs/stages/analysis.py Outdated
Comment thread ckanext/datapusher_plus/jobs/stages/analysis.py Outdated
jqnatividad and others added 2 commits May 18, 2026 23:49
Two findings from the Copilot review of #320, both substantive:

**1. ``parse_decimal(strict=False)`` silently corrupts date-like
   strings under comma-decimal locales (critical).**

Verified the repro in dpp-test::

    babel.numbers.parse_decimal('12.06.1994',
                                locale='de_DE', strict=False)
    # → Decimal('12061994')   ← SILENT DATA CORRUPTION

Under ``strict=False``, babel treats ``.`` as a permissive thousands
separator and rewrites German date strings (``DD.MM.YYYY`` is the
standard) to enormous integers. A German CSV with dates in DATA
columns — extremely common — would have every date silently rewritten.
The PR's test sample only had date-like strings in the HEADER row
(correctly skipped), so the suite missed this exact failure mode.

Fix: ``strict=True`` in ``_rewrite_cell``. Babel then enforces
canonical 3-digit grouping; non-canonical dates raise
``NumberFormatError`` and the cell is left verbatim. Valid forms
still parse cleanly (``57,957`` → ``57.957``, ``1.234,56`` →
``1234.56``, ``1234,56`` → ``1234.56``).

Regression test ``test_de_de_does_not_corrupt_date_like_data_cells``
puts ``12.06.1994`` / ``13.06.1999`` / ``26.05.2019`` in DATA cells
under ``de_DE`` and asserts they survive verbatim while neighboring
comma-decimal numbers still convert.

**2. Per-resource ``dpp_locale='en_US'`` silently activated global
   ``decimal_separator`` (UX surprise).**

The prior code, when a resolved locale was dot-decimal, cleared
``locale_id`` and fell through to the separator path. That conflated
"operator declared en_US" with "operator declared nothing" — applying
a global separator to a resource the operator explicitly marked as
en_US is surprising.

Fix: when a locale resolves dot-decimal, return unconditionally
regardless of ``decimal_separator``. Per-resource intent wins. The
INFO log was reworded to guide operators who want the separator
fallback to apply: it now explicitly says to unset ``dpp_locale``
(or ``default_locale`` for global).

Regression test
``test_dpp_locale_dot_decimal_does_not_fall_through_to_separator``:
per-resource ``dpp_locale='en_US'`` + global ``decimal_separator=','``
is a no-op, no file written, guidance text appears in the INFO log.

Test counts: 9 → 11 on #112; 243 → 245 overall. All passing.

Co-Authored-By: Claude Opus 4.7 (1M context) <[email protected]>
Low-confidence Copilot suggestion on PR #320, but a fair one once
verified. ``babel.numbers.parse_decimal`` is explicitly annotated
``-> decimal.Decimal`` and the only failure mode is raising
``NumberFormatError`` / ``UnsupportedNumberingSystemError`` (both
caught one line above). So::

    return format(value, "f") if isinstance(value, Decimal) else str(value)

…had a dead ``else`` branch. Worse, if babel ever did return a
non-Decimal, the ``str(value)`` fallback would silently emit
float-imprecision garbage (``str(0.1 + 0.2)`` → ``'0.30000000000000004'``)
— exactly the format qsv stats won't infer cleanly as Float. Better
to ``format`` directly and let any future API drift fail loudly.

Dropped the guard and the now-unused ``from decimal import Decimal``.
Updated the comment to explain the babel contract + why we still need
``format(d, 'f')`` (forces fixed-point; without it, very small
Decimals would render in scientific notation).

11/11 #112 tests still pass.

Co-Authored-By: Claude Opus 4.7 (1M context) <[email protected]>
@jqnatividad jqnatividad merged commit a00e3e7 into main May 19, 2026
1 check passed
@jqnatividad jqnatividad deleted the fix-112-locale-aware-numbers branch May 19, 2026 04:26
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

Does not handle locale

2 participants