Parse Date/Time/TimeZone triplets in samples and WQP#272
Draft
thodson-usgs wants to merge 10 commits intoDOI-USGS:mainfrom
Draft
Parse Date/Time/TimeZone triplets in samples and WQP#272thodson-usgs wants to merge 10 commits intoDOI-USGS:mainfrom
thodson-usgs wants to merge 10 commits intoDOI-USGS:mainfrom
Conversation
Add a shared utils.attach_datetime_columns helper that scans a CSV-derived DataFrame for <prefix>Date / <prefix>Time / <prefix>TimeZone triplets and appends a derived <prefix>DateTime UTC column for each one, leaving the original triplet columns intact. Recognizes both the WQX3 / Samples naming (Activity_StartDate, Activity_StartTime, Activity_StartTimeZone) and the legacy WQP naming (ActivityStartDate, ActivityStartTime/Time, ActivityStartTime/TimeZoneCode). Mirrors R dataRetrieval's create_dateTime. Wired into waterdata.get_samples and wqp.get_results. Closes DOI-USGS#266. Co-Authored-By: Claude Opus 4.7 (1M context) <[email protected]>
- Replace the lambda-laden _DATETIME_TRIPLET_PATTERNS dict with a flat
_TIME_TZ_SUFFIXES tuple of (time_suffix, tz_suffix) pairs; the unused
date_suffix field is gone.
- Use str.removesuffix("Date") for the prefix swap and resolve the target
column name once before iterating patterns, hoisting the existence check
out of the inner loop.
- Drop the redundant <NA>-mask in _build_utc_datetime; errors="coerce"
already turns rows with missing inputs into NaT.
- Switch pd.to_datetime from format="mixed" to a fixed
"%Y-%m-%d %H:%M:%S %z" so pandas doesn't probe formats per row.
- Trim WHAT-comments and per-test docstrings so the new tests match the
noise level of the surrounding Test_to_str class.
Co-Authored-By: Claude Opus 4.7 (1M context) <[email protected]>
Contributor
There was a problem hiding this comment.
Pull request overview
This PR adds a utility to derive UTC Timestamp columns from <prefix>Date / <prefix>Time / <prefix>TimeZone triplets returned by Samples and WQP CSV endpoints, and wires it into waterdata.get_samples() and wqp.get_results() to address #266.
Changes:
- Added
dataretrieval.utils.attach_datetime_columns(df)to detect WQX3 and legacy WQP triplet naming patterns and append<prefix>DateTime(UTC) columns without overwriting existing ones. - Integrated the helper into
dataretrieval.waterdata.api.get_samples()anddataretrieval.wqp.get_results(). - Updated unit/integration tests and NEWS entry to reflect the new derived columns.
Reviewed changes
Copilot reviewed 7 out of 7 changed files in this pull request and generated 1 comment.
Show a summary per file
| File | Description |
|---|---|
dataretrieval/utils.py |
Introduces attach_datetime_columns() and supporting parsing helpers for Date/Time/TimeZone triplets. |
dataretrieval/waterdata/api.py |
Applies attach_datetime_columns() to Samples CSV responses returned by get_samples(). |
dataretrieval/wqp.py |
Applies attach_datetime_columns() to WQP CSV responses returned by get_results() and documents the behavior. |
tests/utils_test.py |
Adds unit tests covering triplet detection, UTC conversion, and edge cases (unknown TZ, missing values, no overwrite). |
tests/waterdata_test.py |
Updates Samples mocked/live tests to account for new derived datetime columns and validate a fixture timestamp. |
tests/wqp_test.py |
Updates WQP mocked tests to account for new columns and asserts derived datetime presence. |
NEWS.md |
Documents the new derived datetime behavior for Samples and WQP. |
💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.
The /simplify pass introduced col.removesuffix("Date"), but the project
declares requires-python = ">=3.8" (and the ruff target is py38), and
removesuffix was added in Python 3.9 — so the helper would AttributeError
at first call on a 3.8 interpreter. Revert to the slice form.
Reported by Copilot review on PR DOI-USGS#272.
Co-Authored-By: Claude Opus 4.7 (1M context) <[email protected]>
CI's matrix already tests only Python 3.9 / 3.13 / 3.14 (and the
waterdata test module skips itself on <3.10), but pyproject.toml still
declared requires-python = ">=3.8" and ruff was targeting py38. Bring
the manifest in line with reality:
- requires-python = ">=3.9"
- [tool.ruff] target-version = "py39"
That unblocks col.removesuffix("Date") in attach_datetime_columns
(restored), and surfaces two pre-existing pyupgrade fixes that ruff now
applies under the py39 target:
- dataretrieval/waterdata/ratings.py: import Iterable from
collections.abc instead of typing.
- dataretrieval/waterdata/utils.py: zoneinfo is stdlib on 3.9+, so the
ZoneInfo import moves into the stdlib group.
Co-Authored-By: Claude Opus 4.7 (1M context) <[email protected]>
Bumping ruff target-version from py38 to py39 made the formatter prefer parenthesized context managers (a 3.9-PEG-parser feature). The CI lint job picked up the resulting drift in tests/waterdata_filters_test.py; apply the formatter to bring it back in line. Co-Authored-By: Claude Opus 4.7 (1M context) <[email protected]>
Co-authored-by: Copilot Autofix powered by AI <[email protected]>
The previous commit (3a6cad9) raised requires-python from >=3.8 to >=3.9 to align pyproject with what CI actually tested. That is a breaking change for any downstream user still on 3.8, so call it out in the changelog. Reported by Copilot review on PR DOI-USGS#272. Co-Authored-By: Claude Opus 4.7 (1M context) <[email protected]>
The Copilot autofix in c635619 reverted one of four parenthesized-with blocks back to the chained form, leaving the file inconsistent under the project's ruff target (py39 prefers the parenthesized form per the 3.9 PEG parser). Re-running ruff format restores all four blocks to the canonical form so ruff format --check passes again. The "parenthesized with is 3.10+ only" concern is technically incorrect on this codebase: the 3.9 PEG parser accepts it, and the last CI run on 3a6cad9 / 2f7cf10 passed test (ubuntu-latest, 3.9) and test (windows-latest, 3.9) with this syntax in place. Co-Authored-By: Claude Opus 4.7 (1M context) <[email protected]>
Contributor
There was a problem hiding this comment.
Pull request overview
Copilot reviewed 11 out of 11 changed files in this pull request and generated 1 comment.
Comments suppressed due to low confidence (1)
tests/waterdata_test.py:9
- This module is still skipped entirely on Python < 3.10, but the project now declares
requires-python >= 3.9and CI includes a 3.9 job. As a result,waterdatabehavior (including the newActivity_*DateTimederivation asserted below) is not tested on one of the declared supported versions. Consider either removing/relaxing the module-level skip or updating the declared minimum Python version to match the versions that can run thewaterdatatest suite.
import pandas as pd
import pytest
from pandas import DataFrame
if sys.version_info < (3, 10):
pytest.skip("Skip entire module on Python < 3.10", allow_module_level=True)
Comment on lines
5
to
10
| [project] | ||
| name = "dataretrieval" | ||
| description = "Discover and retrieve water data from U.S. federal hydrologic web services." | ||
| readme = "README.md" | ||
| requires-python = ">=3.8" | ||
| requires-python = ">=3.9" | ||
| keywords = ["USGS", "water data"] |
The helper is purely an internal post-processing step inside get_samples / get_results — users have no reason to call it directly, and dataretrieval/__init__.py's `from dataretrieval.utils import *` was leaking it into the public API surface as `dataretrieval.attach_datetime_columns`. Underscore-prefix it and update the two call sites plus the unit tests. Also annotate _attach_datetime_columns and _build_utc_datetime with pd.DataFrame / pd.Series / pd.Series → pd.Series signatures, matching the typing style already used in dataretrieval/waterdata/utils.py. Addresses self-review of PR DOI-USGS#272. Co-Authored-By: Claude Opus 4.7 (1M context) <[email protected]>
The two-tuple constant identifying the WQP/Samples Date/Time/TimeZone naming patterns was previously labeled "WQX3 / Samples" and "legacy WQP" — accurate for someone steeped in USGS jargon, opaque for a maintainer reading the file cold. Spell out an example column-name triplet next to each entry so the constant is self-explanatory. Co-Authored-By: Claude Opus 4.7 (1M context) <[email protected]>
Comment on lines
+129
to
+133
| def _attach_datetime_columns(df: pd.DataFrame) -> pd.DataFrame: | ||
| """Add ``<prefix>DateTime`` UTC columns for any Date/Time/TimeZone triplets. | ||
|
|
||
| Detects two naming patterns that appear in USGS Samples and Water Quality | ||
| Portal CSV responses: |
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
Summary
Closes #266.
dataretrieval.utils.attach_datetime_columns(df)— scans a CSV-derived DataFrame for<prefix>Date/<prefix>Time/<prefix>TimeZonetriplets and appends a derived<prefix>DateTimeUTC column for each one. Recognizes both the WQX3 / Samples shape (Activity_StartDate,Activity_StartTime,Activity_StartTimeZone) and the legacy WQP shape (ActivityStartDate,ActivityStartTime/Time,ActivityStartTime/TimeZoneCode).waterdata.get_samples()andwqp.get_results()(per the issue title's "samples/WQP" scope).<prefix>DateTimecolumn on the response is never overwritten. Unknown timezone abbreviations resolve toNaT.dataretrieval.codes.tzmapping (EST/EDT/CST/CDT/MST/MDT/PST/PDT/AKST/AKDT/HST/UTC/GMT and many more), the same source the existingformat_datetime()helper uses. Mirrors RdataRetrieval'screate_dateTimeinimportWQP.R.The worked example in the issue:
Test plan
tests/utils_test.py::Test_attach_datetime_columnscover: WQX3 triplet → UTC, legacy WQP triplet → UTC, unknown TZ → NaT, partial-row missing time/tz → NaT, existing target column not overwritten, multiple triplets in one frame, lone*Datecolumn without time/tz left alone.tests/waterdata_test.py::test_mock_get_samplesupdated to assert the newActivity_StartDateTimecolumn appears with the correct UTC timestamp on thesamples_results.txtfixture.tests/wqp_test.py::test_get_results(legacy) andtest_get_results_WQX3updated to assert the appropriate*DateTimecolumn appears and parses on each fixture.tests/waterdata_test.py::test_samples_activity(live) column-count assertion bumped from 95 → 97 to account forActivity_StartDateTimeandActivity_EndDateTime.🤖 Generated with Claude Code