Fix strptime failure with non-zero-padded format codes #6

Open
stephen-zhao wants to merge 1 commit into main from claude/fix-github-issue-xSBsq

Owner

@stephen-zhao stephen-zhao commented Mar 29, 2026

Summary

  • Fixes Non-zero padding formats #4: DatetimeExtractor fails to parse datetimes when using non-zero-padded format codes like %-d, %-m, etc. on Linux CPython, where strptime doesn't accept the - modifier.
  • Normalizes format codes by stripping the - modifier (e.g. %-d → %d) before passing to strptime, since strptime can already handle non-zero-padded values with the standard directives.
  • Adds unit tests for non-zero-padded date extraction using TEST_MINUS_SIGNS and TEST_DATE_LONG_FORM pipelines.
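
A minimal sketch of the normalization described above. The function name `normalize_format_code` is hypothetical (the PR implements this as a private helper on DatetimeExtractor, and its exact body is not reproduced here); the sketch assumes a plain string replacement.

```python
from datetime import datetime


def normalize_format_code(format_code: str) -> str:
    """Strip the '-' modifier so '%-d' becomes '%d' (hypothetical helper name)."""
    return format_code.replace("%-", "%")


# strptime's standard directives accept both padded and non-padded input,
# so one normalized format code parses either form.
fmt = normalize_format_code("%-d/%-m/%Y")
padded = datetime.strptime("01/11/2017", fmt)
unpadded = datetime.strptime("1/11/2017", fmt)
```

Both calls should yield the same date (1 November 2017), which is why dropping the modifier loses nothing on the parsing side.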

Test plan

  • All 32 tests pass, including 5 new tests covering non-zero-padded format codes
  • Verified %-m/%-d extraction works (e.g. 1/11/2017 with %-d/%m/%Y)
  • Verified %-d in long-form dates works (e.g. Wednesday January 5, 2022)
  • Existing tests unchanged and still passing

https://claude.ai/code/session_0137rpSUUxos1kRYsMtH8jiE

Summary by Sourcery

Handle non-zero-padded datetime format codes in DatetimeExtractor and add regression tests to cover them.

Bug Fixes:

  • Fix datetime parsing failures when using non-zero-padded strptime format codes (e.g. '%-d', '%-m') on platforms that do not support the '-' modifier.

Tests:

  • Add tests for non-zero-padded date components in filename-based extraction using the TEST_MINUS_SIGNS pipeline.
  • Add tests for non-zero-padded day components in long-form date strings using the TEST_DATE_LONG_FORM pipeline.

On some platforms (notably Linux CPython), strptime does not accept the
'-' modifier in format codes like %-d. Since strptime's %d can already
parse non-zero-padded values, we normalize format codes by stripping
the '-' modifier before passing them to strptime.

Fixes #4

https://claude.ai/code/session_0137rpSUUxos1kRYsMtH8jiE

sourcery-ai Bot commented Mar 29, 2026

Reviewer's Guide

Normalizes datetime format tokens by stripping unsupported '%-' modifiers before calling strptime, and adds regression tests to ensure non-zero-padded date formats are correctly parsed in existing pipelines.

Sequence diagram for datetime parsing with format code normalization

sequenceDiagram
    actor Client
    participant DatetimeExtractor
    participant Match
    participant DfregexToken
    participant Strptime

    Client->>DatetimeExtractor: extract_datetime(text, pipeline)
    DatetimeExtractor->>Match: finditer on text
    loop for each match
        DatetimeExtractor->>Match: groupdict()
        Match-->>DatetimeExtractor: groups
        DatetimeExtractor->>DfregexToken: get df_tokens[datetime_group_num]
        DfregexToken-->>DatetimeExtractor: format_code
        DatetimeExtractor->>DatetimeExtractor: __normalize_format_code(format_code)
        DatetimeExtractor-->>DatetimeExtractor: normalized_format_code
        DatetimeExtractor->>Strptime: strptime(datetime_string_value, normalized_format_code)
        Strptime-->>DatetimeExtractor: datetime_object or error
    end
    DatetimeExtractor-->>Client: parsed datetimes
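
The flow in the diagram can be approximated in a few lines. This is a rough sketch, not the library's code: `extract_datetimes`, its parameters, and the idea of joining captured groups with spaces are all assumptions made for illustration.

```python
import re
from datetime import datetime
from typing import Iterator, List


def extract_datetimes(pattern: str, format_codes: List[str], text: str) -> Iterator[datetime]:
    """Hypothetical stand-in for the extractor loop shown in the diagram.

    `pattern` must have one capture group per entry in `format_codes`.
    """
    # Normalize every format code before it reaches strptime.
    normalized = [code.replace("%-", "%") for code in format_codes]
    for match in re.finditer(pattern, text):
        value = " ".join(match.groups())
        try:
            yield datetime.strptime(value, " ".join(normalized))
        except ValueError:
            # The "datetime_object or error" branch: skip unparsable matches.
            continue


found = list(
    extract_datetimes(
        r"(\d{1,2})/(\d{1,2})/(\d{4})",
        ["%-d", "%-m", "%Y"],
        "report 1/11/2017 and bogus 40/1/2020",
    )
)
```

Here `found` should hold a single datetime for 1 November 2017; the second match is dropped because day 40 does not parse.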

Class diagram for updated DatetimeExtractor format normalization

classDiagram
    class DatetimeExtractor {
        +__finditer_with_limit(pattern, text, limit)
        +__parse_match_into_maybe_datetime(match, df_tokens)
        +__normalize_format_code(format_code) static
    }

    class DfregexToken {
        +value
    }

    class Match {
        +groupdict()
    }

    DatetimeExtractor ..> DfregexToken : uses
    DatetimeExtractor ..> Match : parses

File-Level Changes

Change: Normalize datetime format codes before passing them to strptime to support non-zero-padded directives on platforms that reject '%-' modifiers.
Details:
  • Introduce a private static helper to strip the '-' modifier from all '%-' sequences in format codes
  • Use the new normalizer when building the list of datetime format codes from df_tokens in __parse_match_into_maybe_datetime
  • Preserve existing error handling for problematic tokens while ensuring normalized codes are passed to strptime
Files: src/datetime_matcher/datetime_extractor.py

Change: Add regression tests verifying extraction of non-zero-padded dates in existing pipelines.
Details:
  • Add parametrized tests covering non-zero-padded month/day parsing in TEST_MINUS_SIGNS pipeline
  • Add parametrized tests covering non-zero-padded day parsing in long-form date strings in TEST_DATE_LONG_FORM pipeline
  • Assert that only a single datetime is produced per input and that iteration stops afterward to match existing extractor behavior
Files: test/test_datetime_extractor.py
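
The "single datetime, then iteration stops" assertion might look roughly like the sketch below. The `extract` generator is a hypothetical stand-in for the real extractor, the long-form format string `%A %B %-d, %Y` is an assumption (the PR does not show the pipeline definitions), and the day and month names assume an English locale.

```python
from datetime import datetime
from typing import Iterator


def extract(fmt: str, text: str) -> Iterator[datetime]:
    """Hypothetical stand-in for the extractor: yields one datetime per input."""
    yield datetime.strptime(text, fmt.replace("%-", "%"))


# Inputs taken from the PR's test plan; the format strings are illustrative.
CASES = [
    ("%-d/%m/%Y", "1/11/2017", datetime(2017, 11, 1)),
    ("%A %B %-d, %Y", "Wednesday January 5, 2022", datetime(2022, 1, 5)),
]


def check_single_datetime(fmt: str, text: str, expected: datetime) -> None:
    results = extract(fmt, text)
    assert next(results) == expected
    # A second next() must raise StopIteration: exactly one datetime per input.
    try:
        next(results)
    except StopIteration:
        return
    raise AssertionError("extractor yielded more than one datetime")


for case in CASES:
    check_single_datetime(*case)
```

In a pytest suite the loop would typically be replaced by `@pytest.mark.parametrize` over `CASES`.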

Assessment against linked issues

Issue Objective Addressed Explanation
#4 Ensure DatetimeExtractor can parse datetimes when using non-zero-padded format codes (e.g. '%-d', '%-m') by adjusting the format string before calling strptime so it works on platforms where strptime does not accept the '-' modifier.
#4 Add automated tests that cover extraction of dates using non-zero-padded format codes to prevent regressions.

@sourcery-ai sourcery-ai Bot left a comment

Hey - I've left some high-level feedback:

  • The __normalize_format_code implementation currently does a blanket replace("%-", "%"); if any format token can contain %- in a non-directive context this will silently alter the semantics. Consider constraining the replacement (e.g., only when followed by known directive characters) or adding a brief comment explaining why a global replace is safe here.
  • Since __normalize_format_code is logically a pure helper, you might consider making it a module-level function or a @staticmethod with single underscore naming to avoid name mangling and keep it more easily testable/reusable if other components ever need the same normalization.
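
One way to act on both points is sketched below, under stated assumptions: a module-level function (no name mangling) that strips the modifier only from real `%-X` directives. The name `normalize_format_code` is hypothetical, and treating any ASCII letter as a "known directive character" is a simplification rather than a list taken from the library.

```python
import re

# Matches either an escaped percent ("%%") or a "%-X" directive.
# Consuming "%%" first leaves a literal percent followed by "-d" untouched.
_MINUS_DIRECTIVE = re.compile(r"%%|%-([A-Za-z])")


def normalize_format_code(format_code: str) -> str:
    """Strip '-' only from real '%-X' directives (hypothetical module-level helper)."""

    def _strip(match: re.Match) -> str:
        letter = match.group(1)
        # group(1) is None when the match was the escaped "%%"; keep it as-is.
        return match.group(0) if letter is None else "%" + letter

    return _MINUS_DIRECTIVE.sub(_strip, format_code)


constrained = normalize_format_code("%%-d")   # escaped percent, then literal "-d"
blanket = "%%-d".replace("%-", "%")           # the blanket replace rewrites it
```

The two results differ on this input: the constrained version leaves `%%-d` alone, while the blanket replace turns it into `%%d`, which is the semantic change the first bullet warns about.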