Fix South Kesteven binday scraper by Dozi3 · Pull Request #2121 · robbrad/UKBinCollectionData

Dozi3 · 2026-06-05T23:10:23Z

Summary

- Updates the South Kesteven scraper to use the current live binday flow and normalize the returned bin names.
- Adds --artifact-dir support to the data collection CLI so targeted council validation can retain debugging artifacts when needed.
- Adds South Kesteven fixture coverage and targeted tests for the scraper behavior.
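
For readers unfamiliar with how such a flag is usually wired, here is a minimal sketch of an optional `--artifact-dir` argument using `argparse`. It is an illustration only: `build_parser` and the help text are hypothetical and are not taken from `collect_data.py`, which defines many more arguments.

```python
import argparse
from pathlib import Path


def build_parser() -> argparse.ArgumentParser:
    # Hypothetical parser: only the --artifact-dir flag name comes from this PR.
    parser = argparse.ArgumentParser(description="Bin collection CLI (sketch)")
    parser.add_argument(
        "--artifact-dir",
        type=Path,
        default=None,  # No artifacts are written unless a directory is given.
        help="Directory for debug artifacts saved when collection fails",
    )
    return parser


# argparse exposes the flag as args.artifact_dir (dashes become underscores).
with_dir = build_parser().parse_args(["--artifact-dir", "debug-artifacts"])
without_dir = build_parser().parse_args([])
```

Defaulting to `None` keeps the flag opt-in, so normal runs leave no files behind.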

Why

The previous South Kesteven implementation no longer matched the live council collection service, which made the returned collection dates and bin labels unreliable.

Validation

python -m pytest --basetemp=C:\Users\mattm\AppData\Local\Temp\CodexPytestSkdcBindayPr --confcutdir=uk_bin_collection\uk_bin_collection\councils\tests uk_bin_collection\uk_bin_collection\councils\tests\test_south_kesteven_district_council.py
python -m compileall uk_bin_collection/uk_bin_collection/councils/SouthKestevenDistrictCouncil.py uk_bin_collection/uk_bin_collection/collect_data.py

Summary by CodeRabbit

New Features
- Added optional --artifact-dir CLI argument to save debug artifacts when collection encounters errors
- Enhanced South Kesteven District Council collection with improved live checker lookup and validation
Improvements
- Better error diagnostics with captured page state and debugging information on collection failures

Related to #1907.

This PR addresses the South Kesteven issue discussed here:
#1907 (comment)

coderabbitai · 2026-06-05T23:10:35Z

Walkthrough

This PR completely rewrites South Kesteven bin collection parsing from OCR-based calendar image analysis to live Selenium webdriver automation of the council's binday checker form, including CLI infrastructure for artifact capture, environment-driven test fixtures, rewritten unit and integration tests, and updated configuration documentation.

Changes

South Kesteven Binday Checker Implementation

Layer / File(s)	Summary
CLI and test fixture infrastructure `uk_bin_collection/uk_bin_collection/collect_data.py`, `uk_bin_collection/uk_bin_collection/councils/tests/conftest.py`	Added `--artifact-dir` CLI argument for debug artifact capture, and updated pytest fixtures to read `UKBC_TEST_*` environment variables (postcode, paon, URL, web driver, headless mode) with sensible defaults for South Kesteven integration tests.
Core Selenium binday checker implementation `uk_bin_collection/uk_bin_collection/councils/SouthKestevenDistrictCouncil.py`	Replaced entire CouncilClass with new Selenium-driven parse_data that navigates the binday checker form: resolves checker URL from landing page, enters postcode, selects address, waits for results table, parses collection dates with type normalization. Added helpers for webdriver waits, DOM readiness checks, address selection, and debug artifact capture (HTML, screenshot, metadata) on failure.
Unit tests for parsing and selection helpers `uk_bin_collection/uk_bin_collection/councils/tests/test_south_kesteven_district_council.py`	Rewrote from unittest to pytest with focused tests for URL extraction, address dropdown readiness, address selection, bin row parsing with type mapping, debug artifact file writing, and full parse_data flow with mocked webdriver interactions.
Integration tests with live Selenium binday checker `uk_bin_collection/uk_bin_collection/councils/tests/test_south_kesteven_integration.py`	Rewrote from requests-based tests to Selenium-driven integration tests that validate real binday checker queries, assert correct bin types and `DD/MM/YYYY` date format, handle unknown property errors with RuntimeError, and skip gracefully when Selenium is unavailable.
Configuration and documentation update `uk_bin_collection/tests/input.json`	Updated South Kesteven entry with new example postcode/house_number, updated binday checker URL, added web_driver configuration, and rewrote wiki_note to describe Selenium-driven "Your Collections" table parsing instead of previous OCR calendar image approach.
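
The environment-driven fixtures described in the table could look roughly like the sketch below. Only the `UKBC_TEST_*` prefix is stated in the walkthrough; the individual variable names, the helper functions, and the choice of defaults (borrowed from the updated `input.json` entry) are assumptions made for illustration.

```python
import os
from typing import Mapping

# Hypothetical variable names: the source only gives the UKBC_TEST_* prefix.
# Default values are assumed to mirror the updated input.json entry.
DEFAULTS = {
    "UKBC_TEST_POSTCODE": "NG31 8XG",
    "UKBC_TEST_PAON": "43",
    "UKBC_TEST_URL": "https://www.southkesteven.gov.uk/binday",
    "UKBC_TEST_WEB_DRIVER": "http://selenium:4444",
    "UKBC_TEST_HEADLESS": "true",
}


def read_setting(name: str, env: Mapping[str, str] = os.environ) -> str:
    """Return the override from the environment, or the documented default."""
    return env.get(name, DEFAULTS[name])


def read_headless(env: Mapping[str, str] = os.environ) -> bool:
    """Interpret the headless flag as a boolean, accepting common spellings."""
    return read_setting("UKBC_TEST_HEADLESS", env).strip().lower() in {"1", "true", "yes"}
```

In a `conftest.py`, each helper would typically be wrapped in a pytest fixture so that tests can request `postcode`, `paon`, and so on by name.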

Estimated code review effort

🎯 4 (Complex) | ⏱️ ~60 minutes

The PR involves a substantial rewrite of the core South Kesteven implementation (~1150 lines removed, ~300 lines added in the council class alone), complete replacement of two test modules with new paradigms, and updates to shared infrastructure (CLI, fixtures). The logic is dense with webdriver interactions, DOM state detection, and error handling paths. However, the changes are focused on a single council and follow a clear pattern, and many test assertions are straightforward validations.

Possibly related PRs

robbrad/UKBinCollectionData#1652: This PR is a direct successor that replaces the requests/OCR-based implementation introduced in #1652 with a Selenium-driven binday checker approach, rewriting all corresponding tests.

Suggested reviewers

dp247

Poem

🐰 Hopping through the bins, a tale unfolds—
From calendars scanned to webforms bold,
Selenium clicks where OCR once strode,
South Kesteven's path, a cleaner code!
Debug artifacts saved, when things go wrong,
Here's to collectors, shiny and strong! ✨🗑️

🚥 Pre-merge checks | ✅ 4 | ❌ 1

❌ Failed checks (1 warning)

Check name	Status	Explanation	Resolution
Docstring Coverage	⚠️ Warning	Docstring coverage is 18.18% which is insufficient. The required threshold is 80.00%.	Write docstrings for the functions missing them to satisfy the coverage threshold.

✅ Passed checks (4 passed)

Check name	Status	Explanation
Title check	✅ Passed	The title 'Fix South Kesteven binday scraper' directly addresses the main objective of the PR: fixing the South Kesteven scraper to follow the current live binday flow and normalize bin names.
Linked Issues check	✅ Passed	Check skipped because no linked issues were found for this pull request.
Out of Scope Changes check	✅ Passed	Check skipped because no linked issues were found for this pull request.
Description Check	✅ Passed	Check skipped - CodeRabbit’s high-level summary is enabled.

Copilot

Pull request overview

Note

Copilot was unable to run its full agentic suite in this review.

Updates the South Kesteven District Council scraper to use the council’s live “binday” checker via Selenium, adds debug artifact capture on failures, and refreshes test coverage + sample config accordingly.

Changes:

- Replaced the previous requests/OCR-based approach with a Selenium-driven binday flow that parses “Your Collections” results tables.
- Added artifact capture (HTML/screenshot/metadata) and a new CLI flag (--artifact-dir) to control where artifacts are written.
- Reworked unit/integration tests and updated tests/input.json to reflect the new required inputs (postcode + house number/name).

Reviewed changes

Copilot reviewed 6 out of 6 changed files in this pull request and generated 6 comments.

File	Description
uk_bin_collection/uk_bin_collection/councils/SouthKestevenDistrictCouncil.py	Implements Selenium binday navigation, parsing, normalization, and failure artifact capture.
uk_bin_collection/uk_bin_collection/councils/tests/test_south_kesteven_integration.py	Updates integration coverage to exercise the Selenium flow and artifact output.
uk_bin_collection/uk_bin_collection/councils/tests/test_south_kesteven_district_council.py	Replaces older unit tests with Selenium-flow unit tests and parsing/artifact tests.
uk_bin_collection/uk_bin_collection/councils/tests/conftest.py	Adds env-driven fixtures for postcode/paon/url/webdriver/headless.
uk_bin_collection/uk_bin_collection/collect_data.py	Adds `--artifact-dir` arg and passes it through to the scraper.
uk_bin_collection/tests/input.json	Updates SKDC example configuration to use new inputs + Selenium URL.

+    def _select_address(self, address_select, paon: str) -> None:
+        target = str(paon).strip().lower()
+        select = Select(address_select)
+
+        for option in select.options:
+            option_text = option.text.strip().lower()
+            if target in option_text:
+                select.select_by_visible_text(option.text)
+                return
+
+        raise RuntimeError(
+            f"Unable to find the property '{paon}' in the address dropdown."
+        )


+    def _capture_debug_artifacts(
+        self, driver, artifact_root: Path, context: dict[str, str]
+    ) -> Path | None:
+        if not driver:
+            return None

-    def get_bin_type_from_calendar(self, collection_date, calendar_data=None):
-        """Determine the specific bin type from the parsed calendar data."""
-        try:
-            # Parse the date
-            date_obj = datetime.strptime(collection_date, "%d/%m/%Y")
-            year = str(date_obj.year)
-            month = str(date_obj.month)
-            day = date_obj.day
-
-            # Determine which week of the month this is
-            week_of_month = str(((day - 1) // 7) + 1)
-
-            # Use provided calendar data or get it if not provided
-            if calendar_data is None:
-                calendar_data = self.parse_calendar_images()
-
-            # Look up the bin type from the calendar data
-            if year in calendar_data and month in calendar_data[year] and week_of_month in calendar_data[year][month]:
-                return calendar_data[year][month][week_of_month]
-            else:
-                # Raise error if not found in calendar instead of fallback
-                raise ValueError(f"No bin type found for {collection_date} (Week {week_of_month} of {month}/{year})")
-
-        except Exception as e:
-            print(f"Error determining bin type for {collection_date}: {e}")
-            raise
+        timestamp = datetime.now().strftime("%Y%m%d-%H%M%S")
+        artifact_path = artifact_root / timestamp
+        artifact_path.mkdir(parents=True, exist_ok=True)

-    def parse_data(self, page: str, **kwargs) -> dict:
-        try:
-            user_postcode = kwargs.get("postcode")
+        metadata = {
+            "current_url": str(getattr(driver, "current_url", "")) or None,
+            **context,
+        }

-            # Validate postcode
-            if not user_postcode:
-                raise ValueError("Postcode is required for South Kesteven")
+        screenshot_path = artifact_path / "page.png"
+        html_path = artifact_path / "page.html"
+        metadata_path = artifact_path / "metadata.json"

-            # No WebDriver needed - using requests-based approach
-
-            # Get collection day for regular bins
-            collection_day = self.get_collection_day_from_postcode(None, user_postcode)
-            if not collection_day:
-                raise ValueError(f"Could not determine collection day for postcode {user_postcode}")
+        try:
+            html_path.write_text(driver.page_source, encoding="utf-8")
+        except Exception as exc:
+            metadata["page_html_error"] = str(exc)

-            # Get green bin info
-            green_bin_info = self.get_green_bin_info_from_postcode(None, user_postcode)
+        try:
+            metadata["screenshot_saved"] = bool(driver.save_screenshot(str(screenshot_path)))
+        except Exception as exc:
+            metadata["screenshot_error"] = str(exc)
+
+        metadata_path.write_text(json.dumps(metadata, indent=4), encoding="utf-8")
+        return artifact_path.resolve()


+        assert "bins" in result
+        assert isinstance(result["bins"], list)
+        assert result["bins"]
+        assert {bin_entry["type"] for bin_entry in result["bins"]} <= self.EXPECTED_BIN_TYPES


+        "house_number": "43",
+        "postcode": "NG31 8XG",
        "skip_get_url": true,
-        "url": "https://pre.southkesteven.gov.uk/skdcNext/tempforms/checkmybin.aspx",
+        "url": "https://www.southkesteven.gov.uk/binday",
+        "web_driver": "http://selenium:4444",


coderabbitai

Actionable comments posted: 3

🤖 Prompt for all review comments with AI agents

Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Inline comments:
In
`@uk_bin_collection/uk_bin_collection/councils/SouthKestevenDistrictCouncil.py`:
- Around line 249-261: The _select_address function currently uses substring
matching which can pick the wrong option; change it to find exact/anchored
matches against the normalized paon: build a list of candidate options by
normalizing option.text and matching either equality or a regex anchored to the
start/end (e.g., full token match) to avoid substring hits, then if exactly one
candidate select it with select.select_by_visible_text(option.text), if zero or
>1 candidates raise a RuntimeError describing no match or ambiguous multiple
matches (include the list of matching option texts) so the caller can see the
ambiguity; keep references to the Select instance, option.text and the
_select_address(paon) signature when making the change.

In
`@uk_bin_collection/uk_bin_collection/councils/tests/test_south_kesteven_district_council.py`:
- Around line 79-86: Update the pytest.raises match arguments to use raw/escaped
regex strings so metacharacters are treated literally: change occurrences like
match="Property number or name \\(paon\\) is required for South Kesteven." to
raw-string form match=r"Property number or name \(paon\) is required for South
Kesteven." (do the same for the postcode test and the other occurrences noted);
locate these in the test function test_parse_data_requires_paon and any other
tests invoking council.parse_data and replace the match="..." with match=r"...",
escaping literal parentheses and other regex metacharacters as needed.

In
`@uk_bin_collection/uk_bin_collection/councils/tests/test_south_kesteven_integration.py`:
- Around line 44-48: Replace the fragile slash/length checks on
bin_entry["collectionDate"] with strict parsing using the datetime parser: call
datetime.datetime.strptime on the collection date string (format "%d/%m/%Y") in
the test (e.g., inside the test_south_kesteven_integration test) and assert that
parsing succeeds (or that the returned datetime has expected day/month/year
properties) instead of asserting string lengths; add the necessary import for
datetime at the top of the test file.

ℹ️ Review info

⚙️ Run configuration

Configuration used: defaults

Review profile: CHILL

Plan: Pro

Run ID: f3663b32-c409-4d9c-9168-80e1196eded9

📥 Commits

Reviewing files that changed from the base of the PR and between b65502c and bbb31d8.

📒 Files selected for processing (6)

uk_bin_collection/tests/input.json
uk_bin_collection/uk_bin_collection/collect_data.py
uk_bin_collection/uk_bin_collection/councils/SouthKestevenDistrictCouncil.py
uk_bin_collection/uk_bin_collection/councils/tests/conftest.py
uk_bin_collection/uk_bin_collection/councils/tests/test_south_kesteven_district_council.py
uk_bin_collection/uk_bin_collection/councils/tests/test_south_kesteven_integration.py

coderabbitai · 2026-06-05T23:16:04Z

+    def _select_address(self, address_select, paon: str) -> None:
+        target = str(paon).strip().lower()
+        select = Select(address_select)
+
+        for option in select.options:
+            option_text = option.text.strip().lower()
+            if target in option_text:
+                select.select_by_visible_text(option.text)
+                return
+
+        raise RuntimeError(
+            f"Unable to find the property '{paon}' in the address dropdown."
+        )


⚠️ Potential issue | 🟠 Major | ⚡ Quick win

Avoid ambiguous substring matching when selecting the property.

Line 255 currently matches paon using substring containment, which can select the wrong address and return another property's collections. Match exact/anchored candidates and fail when multiple entries match.

Proposed fix

     def _select_address(self, address_select, paon: str) -> None:
         target = str(paon).strip().lower()
         select = Select(address_select)
-
-        for option in select.options:
-            option_text = option.text.strip().lower()
-            if target in option_text:
-                select.select_by_visible_text(option.text)
-                return
+        exact_or_anchored_matches = []
+        for option in select.options:
+            option_text = " ".join(option.text.split()).lower()
+            if (
+                option_text == target
+                or option_text.startswith(f"{target},")
+                or option_text.startswith(f"{target} ")
+            ):
+                exact_or_anchored_matches.append(option.text)
+
+        if len(exact_or_anchored_matches) == 1:
+            select.select_by_visible_text(exact_or_anchored_matches[0])
+            return
+        if len(exact_or_anchored_matches) > 1:
+            raise RuntimeError(
+                f"Property '{paon}' matched multiple addresses; provide a more specific value."
+            )
         raise RuntimeError(
             f"Unable to find the property '{paon}' in the address dropdown."
         )
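
To make the difference concrete without a browser, the sketch below applies the same anchored rule to plain strings. The helper name `find_address_matches` and the example addresses are made up for illustration; the real dropdown text format on the council site is not shown in this PR, so treat the comma/space anchoring as an assumption.

```python
def find_address_matches(option_texts: list[str], paon: str) -> list[str]:
    """Return options whose text equals the paon or starts with it as a whole token."""
    # Collapse runs of whitespace and lowercase both sides before comparing.
    target = " ".join(str(paon).split()).lower()
    matches = []
    for text in option_texts:
        normalized = " ".join(text.split()).lower()
        if normalized == target or normalized.startswith((f"{target},", f"{target} ")):
            matches.append(text)
    return matches


# Invented dropdown contents, ordered so that substring matching goes wrong.
options = [
    "143 Example Road, Grantham",
    "43 Example Road, Grantham",
    "4 Example Road, Grantham",
]

# Substring containment (the current behaviour) hits "143 ..." first for paon "43".
substring_hits = [text for text in options if "43" in text.lower()]

# Anchored matching returns only the intended property.
anchored_hits = find_address_matches(options, "43")
```

A caller can then select the option only when exactly one candidate is returned and raise otherwise, as in the proposed fix.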

coderabbitai · 2026-06-05T23:16:04Z

+    with pytest.raises(ValueError, match="Postcode is required for South Kesteven."):
+        council.parse_data("", paon="43")
+
+
+def test_parse_data_requires_paon(council):
+    with pytest.raises(
+        ValueError,
+        match="Property number or name \\(paon\\) is required for South Kesteven.",


⚠️ Potential issue | 🟡 Minor | ⚡ Quick win

Use raw/escaped regex strings in pytest.raises(..., match=...).

These match= patterns include regex metacharacters and trigger RUF043; tighten them with raw strings and escaped literals to keep assertions precise.

Proposed fix

-    with pytest.raises(ValueError, match="Postcode is required for South Kesteven."):
+    with pytest.raises(ValueError, match=r"Postcode is required for South Kesteven\."):
@@
-        match="Property number or name \\(paon\\) is required for South Kesteven.",
+        match=r"Property number or name \(paon\) is required for South Kesteven\.",
@@
-        match="Unable to find the property '99' in the address dropdown.",
+        match=r"Unable to find the property '99' in the address dropdown\.",
@@
-        match="Unable to find the address dropdown after searching for the postcode.",
+        match=r"Unable to find the address dropdown after searching for the postcode\.",

Also applies to: 125-125, 301-301
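
As a quick illustration of why this matters: `pytest.raises(..., match=...)` applies the pattern with `re.search`, so an unescaped `.` matches any character. The sketch below uses only the standard `re` module; the `lookalike` message is invented to show the looser match.

```python
import re

message = "Postcode is required for South Kesteven."
lookalike = "Postcode is required for South Kesteven!"  # Invented near-miss message.

loose = "Postcode is required for South Kesteven."  # Trailing "." matches any character.
strict = r"Postcode is required for South Kesteven\."  # Trailing "." must be a literal dot.

loose_accepts_lookalike = re.search(loose, lookalike) is not None
strict_accepts_lookalike = re.search(strict, lookalike) is not None
strict_accepts_message = re.search(strict, message) is not None

# re.escape is an alternative to hand-escaping when the whole message is literal.
paon_message = "Property number or name (paon) is required for South Kesteven."
escaped_accepts_message = re.search(re.escape(paon_message), paon_message) is not None
```

Using `re.escape` avoids having to remember which characters are regex metacharacters, at the cost of a slightly less readable assertion.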

🧰 Tools

🪛 Ruff (0.15.15)

[warning] 79-79: Pattern passed to match= contains metacharacters but is neither escaped nor raw

(RUF043)

[warning] 86-86: Pattern passed to match= contains metacharacters but is neither escaped nor raw

(RUF043)

Source: Linters/SAST tools

coderabbitai · 2026-06-05T23:16:04Z

+            date_parts = bin_entry["collectionDate"].split("/")
+            assert len(date_parts) == 3
+            assert len(date_parts[0]) == 2
+            assert len(date_parts[1]) == 2
+            assert len(date_parts[2]) == 4


⚠️ Potential issue | 🟡 Minor | ⚡ Quick win

Use strict date parsing in the integration assertion.

The current slash/length checks allow invalid dates (for example, 99/99/9999) to pass, which can hide parser regressions.

Proposed change

+from datetime import datetime
 import pytest
 from selenium.common.exceptions import WebDriverException
 from urllib3.exceptions import MaxRetryError
@@
         for bin_entry in result["bins"]:
             assert "type" in bin_entry
             assert "collectionDate" in bin_entry
-            date_parts = bin_entry["collectionDate"].split("/")
-            assert len(date_parts) == 3
-            assert len(date_parts[0]) == 2
-            assert len(date_parts[1]) == 2
-            assert len(date_parts[2]) == 4
+            datetime.strptime(bin_entry["collectionDate"], "%d/%m/%Y")
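
The gap between the two checks can be seen in isolation with the sketch below. The helper names are hypothetical. One caveat worth knowing: `datetime.strptime` with `%d/%m/%Y` also accepts non-padded values such as `5/6/2026`, so the sketch adds a round-trip comparison as one possible way to keep the zero-padding guarantee that the length checks provided.

```python
from datetime import datetime

DATE_FORMAT = "%d/%m/%Y"


def passes_length_check(value: str) -> bool:
    """Mimic the original assertion: three slash-separated parts of length 2, 2, 4."""
    parts = value.split("/")
    return [len(part) for part in parts] == [2, 2, 4]


def is_strict_collection_date(value: str) -> bool:
    """Accept only real calendar dates written exactly as zero-padded DD/MM/YYYY."""
    try:
        parsed = datetime.strptime(value, DATE_FORMAT)
    except ValueError:
        return False
    # Round-trip so that non-padded input like "5/6/2026" is rejected as well.
    return parsed.strftime(DATE_FORMAT) == value
```

With these helpers, `99/99/9999` passes the length check but fails the strict check, while `05/06/2026` passes both.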


fix: update South Kesteven binday scraper

bbb31d8

Copilot AI review requested due to automatic review settings June 5, 2026 23:10

Copilot AI reviewed Jun 5, 2026

coderabbitai Bot reviewed Jun 5, 2026

Dozi3 mentioned this pull request Jun 5, 2026

South Kesteven District Council - Incorrect URL #1907

Open

Dozi3 wants to merge 1 commit into robbrad:master from Dozi3:codex/skdc-binday-fix-pr

2 participants
