feat(#112): locale-aware number normalization (supersedes #319)#320
Merged
Conversation
Issue #112 reporter (German locale): qsv infers numeric columns as ``String`` when values use ``,`` as the decimal separator. qsv's number parser only recognizes ``.`` as the decimal separator — CSV files don't carry locale metadata, so this is correct default behavior. The fix is opt-in: declare a locale and DP+ rewrites the numeric cells before stats inference. This supersedes the closed PR #319, which used a global boolean flag ``decimal_comma`` + a single anchored ``qsv replace``. The boolean approach hardcoded one specific shape (``<int>,<frac>``) and ignored all the locale metadata CKAN already carries; this design uses CLDR locale identifiers and routes through ``babel.numbers.parse_decimal`` for full coverage (``1.234,56`` de_DE thousands+decimal, ``57 957,12`` fr_FR space-thousands, etc.). Resolution order (per resource, per ingestion): 1. ``context.resource.get('dpp_locale')`` → babel CLDR parse 2. ``conf.DEFAULT_LOCALE`` → babel CLDR parse 3. ``conf.DECIMAL_SEPARATOR`` (single chr) → anchored regex pass 4. (none of the above) → no-op First non-empty wins. Locale paths use babel; the separator path is the escape hatch for operators who know the separator but don't have or want CLDR locale info (synthetic data, non-CLDR formats, Swiss thousands-apostrophe workaround). Implementation: * ``config.py`` + ``config_declaration.yaml``: two new keys — ``default_locale`` (str, default ``""``) and ``decimal_separator`` (str, default ``""``). Both ``editable: true``. The YAML description documents the resolution order, the babel-vs-regex paths, and the known v1 limitations (``de_CH`` skipped because ``get_decimal_symbol`` returns ``.``; mixed-locale columns fall back to ``String``; accounting parens not parsed; ``dpp_column_locales`` deferred). * ``dataset-druf.yaml``: new ``dpp_locale`` resource_field. Optional text, BCP 47 / POSIX-style identifier. Materializes on the resource dict via scheming so ``context.resource.get('dpp_locale')`` reads it directly. ``dpp_*`` namespace matches the existing ``dpp_spatial_extent`` / ``dpp_suggestions`` precedent. * ``analysis.py``: new ``_normalize_locale_numbers`` called from ``process()`` right after ``_sanitize_headers``, before ``_create_index``. Pure-Python ``csv.reader`` + ``csv.writer`` (the validation stage already normalizes upstream CSVs to RFC 4180, so no dialect plumbing needed). Cell-by-cell try-parse via babel; on ``NumberFormatError`` or empty cell, the cell stays verbatim. ``format(value, 'f')`` forces fixed-point output (avoids ``str()`` emitting scientific notation for tiny Decimals — caught by the Plan agent). Per-column conversion counts logged at INFO so operators can spot wrong-locale misconfigurations (zero conversions on a column they expected to convert). * Skip rules: ``get_decimal_symbol(locale_id) == '.'`` → log INFO and short-circuit (the locale already uses dot-decimal — en_US, en_GB, ja_JP, de_CH, en_IN, ...). ``UnknownLocaleError`` → log WARNING and short-circuit without falling back to ``DEFAULT_LOCALE`` (operators need to see the typo, not get silently masked). * ``requirements.txt``: ``babel>=2.9`` declared explicitly. Babel was already transitive via CKAN; DP+ now imports ``babel.numbers`` directly so the dep is declared. Tests (9, all passing — full unit suite at 243 / +9 net): 1. ``DEFAULT_LOCALE`` + ``DECIMAL_SEPARATOR`` Python fallbacks are ``""`` 2. YAML declaration defaults agree with Python fallback (drift guard — the #179-style failure mode) 3. ``dpp_locale`` is declared in ``dataset-druf.yaml`` as a resource_field (so scheming surfaces it and ``resource_show`` returns it) 4. Helper is a no-op when nothing resolves (no temp file written, no ``context.tmp`` mutation) — the most important property: existing operators see byte-identical pre-#112 behavior 5. ``default_locale='de_DE'`` normalizes both ``57,957`` → ``57.957`` AND ``1.234,56`` → ``1234.56`` (the second pins the babel-vs-regex difference) 6. Per-resource ``dpp_locale='fr_FR'`` overrides global ``default_locale='de_DE'``, parsing French space-thousands ``57 957,12`` correctly 7. ``decimal_separator=','`` with no locale converts ``57,957`` → ``57.957`` and (intentionally) leaves ``1.234,56`` verbatim (anchored regex doesn't match) 8. Unknown locale ``xx_YY`` logs WARNING and short-circuits to no-op (does NOT crash, does NOT fall back) 9. **End-to-end with real qsv**: baseline pins the #112 failure mode (all comma-decimal columns inferred as ``String``); after conversion, ``Float`` for numeric columns and ``String`` preserved for ``Kennziffer`` / ``Einheit`` Co-Authored-By: Claude Opus 4.7 (1M context) <[email protected]>
CI on first push of #320 failed at the ``Initialize CKAN database`` step with:: ckan.logic.ValidationError: {'groups': [{'options': [..., {'type': ["Value must be one of ['base', 'bool', 'int', 'dynamic', 'list']"]}, {'type': [...]}, ...]}]} CKAN's config_declaration loader only accepts those five ``type`` values — there's no ``str`` type. The convention for string-valued settings is to OMIT ``type`` entirely; the loader then defaults to ``base`` which validates as a free-form string. See existing precedents: ``download_proxy`` and ``file_hash_algorithm`` both declare strings without a ``type`` key. Changes: * ``config_declaration.yaml`` — drop ``type: str`` from both new entries (``default_locale``, ``decimal_separator``). * ``tests/test_issue_112_decimal_comma.py:: test_config_declaration_defaults_match_python_fallback`` — flip the ``type == 'str'`` assertion to ``'type' not in entry``, plus a docstring NOTE explaining the constraint so the next person who reaches for ``type: str`` finds the precedent. All 9 #112 tests still pass locally; the integration CI failure was purely on the YAML schema. Co-Authored-By: Claude Opus 4.7 (1M context) <[email protected]>
Contributor
There was a problem hiding this comment.
Pull request overview
This PR resolves issue #112 by introducing locale-aware preprocessing of CSV numeric cells so that comma-decimal and other CLDR-format numbers (e.g. German 57,957, 1.234,56; French 57 957,12) are normalized to dot-decimal before qsv stats infers column types. It supersedes the simpler decimal_comma boolean flag in PR #319 with a per-resource/global/single-separator resolution chain driven primarily by babel.numbers.parse_decimal.
Changes:
- New
AnalysisStage._normalize_locale_numbersruns between header sanitization and stats/index; resolvesdpp_locale(per-resource) →DEFAULT_LOCALE→DECIMAL_SEPARATOR→ no-op. - New config keys
default_localeanddecimal_separator(both default"", opt-in) plus newdpp_localeresource field declared indataset-druf.yaml. - Adds explicit
babel>=2.9dependency and 9 new regression tests pinning the resolution order, drift between YAML/Python defaults, and end-to-end qsv inference on the reporter's German sample.
Reviewed changes
Copilot reviewed 5 out of 6 changed files in this pull request and generated 2 comments.
Show a summary per file
| File | Description |
|---|---|
| ckanext/datapusher_plus/jobs/stages/analysis.py | Adds _normalize_locale_numbers and _rewrite_cell helpers and wires them into process(). |
| ckanext/datapusher_plus/config.py | Declares DEFAULT_LOCALE and DECIMAL_SEPARATOR config getters (both default ""). |
| ckanext/datapusher_plus/config_declaration.yaml | YAML declarations for the two new keys with detailed description and v1 limitations. |
| ckanext/datapusher_plus/dataset-druf.yaml | Declares dpp_locale resource field for per-resource override. |
| requirements.txt | Adds explicit babel>=2.9 dependency. |
| tests/test_issue_112_decimal_comma.py | Nine new tests covering defaults, drift, gating, locale paths, separator path, unknown locale, and an end-to-end qsv test. |
Comments suppressed due to low confidence (1)
ckanext/datapusher_plus/jobs/stages/analysis.py:398
babel.numbers.parse_decimalalways returns aDecimal(it constructs one internally and raisesNumberFormatErrorotherwise), so theelse str(value)branch is unreachable. Either drop theisinstanceguard and callformat(value, "f")directly, or — if you want to be defensive against a future babel API change — assert/cast accordingly. As written the dead branch implies parse_decimal might return some non-Decimal type, which can confuse future readers.
return format(value, "f") if isinstance(value, Decimal) else str(value)
💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.
Two findings from the Copilot review of #320, both substantive: **1. ``parse_decimal(strict=False)`` silently corrupts date-like strings under comma-decimal locales (critical).** Verified the repro in dpp-test:: babel.numbers.parse_decimal('12.06.1994', locale='de_DE', strict=False) # → Decimal('12061994') ← SILENT DATA CORRUPTION Under ``strict=False``, babel treats ``.`` as a permissive thousands separator and rewrites German date strings (``DD.MM.YYYY`` is the standard) to enormous integers. A German CSV with dates in DATA columns — extremely common — would have every date silently rewritten. The PR's test sample only had date-like strings in the HEADER row (correctly skipped), so the suite missed this exact failure mode. Fix: ``strict=True`` in ``_rewrite_cell``. Babel then enforces canonical 3-digit grouping; non-canonical dates raise ``NumberFormatError`` and the cell is left verbatim. Valid forms still parse cleanly (``57,957`` → ``57.957``, ``1.234,56`` → ``1234.56``, ``1234,56`` → ``1234.56``). Regression test ``test_de_de_does_not_corrupt_date_like_data_cells`` puts ``12.06.1994`` / ``13.06.1999`` / ``26.05.2019`` in DATA cells under ``de_DE`` and asserts they survive verbatim while neighboring comma-decimal numbers still convert. **2. Per-resource ``dpp_locale='en_US'`` silently activated global ``decimal_separator`` (UX surprise).** The prior code, when a resolved locale was dot-decimal, cleared ``locale_id`` and fell through to the separator path. That conflated "operator declared en_US" with "operator declared nothing" — applying a global separator to a resource the operator explicitly marked as en_US is surprising. Fix: when a locale resolves dot-decimal, return unconditionally regardless of ``decimal_separator``. Per-resource intent wins. The INFO log was reworded to guide operators who want the separator fallback to apply: it now explicitly says to unset ``dpp_locale`` (or ``default_locale`` for global). Regression test ``test_dpp_locale_dot_decimal_does_not_fall_through_to_separator``: per-resource ``dpp_locale='en_US'`` + global ``decimal_separator=','`` is a no-op, no file written, guidance text appears in the INFO log. Test counts: 9 → 11 on #112; 243 → 245 overall. All passing. Co-Authored-By: Claude Opus 4.7 (1M context) <[email protected]>
Low-confidence Copilot suggestion on PR #320, but a fair one once verified. ``babel.numbers.parse_decimal`` is explicitly annotated ``-> decimal.Decimal`` and the only failure mode is raising ``NumberFormatError`` / ``UnsupportedNumberingSystemError`` (both caught one line above). So:: return format(value, "f") if isinstance(value, Decimal) else str(value) …had a dead ``else`` branch. Worse, if babel ever did return a non-Decimal, the ``str(value)`` fallback would silently emit float-imprecision garbage (``str(0.1 + 0.2)`` → ``'0.30000000000000004'``) — exactly the format qsv stats won't infer cleanly as Float. Better to ``format`` directly and let any future API drift fail loudly. Dropped the guard and the now-unused ``from decimal import Decimal``. Updated the comment to explain the babel contract + why we still need ``format(d, 'f')`` (forces fixed-point; without it, very small Decimals would render in scientific notation). 11/11 #112 tests still pass. Co-Authored-By: Claude Opus 4.7 (1M context) <[email protected]>
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
Closes #112. Supersedes the closed #319.
Summary
Issue #112 reporter (German locale): qsv inferred numeric columns as
Stringwhen values used,as the decimal separator (57,957,41,84699). qsv's number parser only recognizes.as the decimal separator — CSV files don't carry locale metadata, so this is correct default behavior. The fix is opt-in.#319 used a global boolean flag (
decimal_comma) + an anchoredqsv replace. Per review feedback, the boolean had three structural problems: hardcoded shape (only<int>,<frac>), global toggle (every ingestion treated the same), and ignored existing metadata. This PR replaces it with a CLDR-locale-driven design.Resolution order (per resource, per ingestion)
context.resource.get('dpp_locale')ckanext.datapusher_plus.default_localeckanext.datapusher_plus.decimal_separator(single char)First non-empty wins. The locale paths use
babel.numbers.parse_decimal, which handles57,957(de_DE),1.234,56(de_DE thousands+decimal),57 957,12(fr_FR space-thousands), and any other CLDR-supported shape. The separator path is the escape hatch for operators with non-CLDR data (single-char anchored regex, no thousands support).Components
config.py+config_declaration.yaml—DEFAULT_LOCALE(str) andDECIMAL_SEPARATOR(str), both default"", botheditable: true. The YAML description spells out the resolution order, the babel-vs-regex split, and the known v1 limits.dataset-druf.yaml— newdpp_localeresource_field (optional text, BCP 47 / POSIX-style id). Sits in thedpp_*namespace alongsidedpp_spatial_extent/dpp_suggestions.analysis.py—_normalize_locale_numberscalled fromprocess()right after_sanitize_headers. Pure-Pythoncsv.reader/csv.writer(validation upstream already normalizes to RFC 4180). Per-column conversion counts logged at INFO so operators can spot wrong-locale typos.requirements.txt—babel>=2.9declared explicitly (was transitive via CKAN).Skip rules
babel.numbers.get_decimal_symbol(locale_id) == '.'→ log INFO, short-circuit. Locales that already use dot-decimal (en_US, en_GB, ja_JP, ...) get nothing done. NOTE:de_CHfalls here; Swiss thousands-apostrophe stripping is a v1 limit — usedecimal_separator = \"'\"as the workaround.UnknownLocaleError→ log WARNING, short-circuit without falling back todefault_locale. Operators need to see the typo, not get silently masked.Known v1 limitations (documented in YAML description)
de_CHapostrophe-thousands not stripped (above)String(no worse than today)(57,957)not parsed (babel raisesNumberFormatError)Test plan
tests/test_issue_112_decimal_comma.py, all passing:config.pyinline fallbacks both\"\"(opt-in only)dpp_localedeclared indataset-druf.yamlas a resource_fieldtmpmutation)default_locale='de_DE'normalizes both57,957→57.957AND1.234,56→1234.56(pins the babel-vs-regex distinction)dpp_locale='fr_FR'overrides global default; French space-thousands57 957,12→57957.12decimal_separator=','with no locale converts57,957→57.957and intentionally leaves1.234,56verbatim (anchored regex doesn't match)xx_YY→ WARNING + no-op (doesn't crash, doesn't fall back)FloatandKennziffer/EinheitstayStringdpp_localeset, confirmresource_showreturns it as a top-level dict key, confirm ingestion picks it updefault_locale = de_DE, confirm Postgres column types arenumeric/double precision🤖 Generated with Claude Code