Fix South Kesteven binday scraper#2121

Open
Dozi3 wants to merge 1 commit into robbrad:master from Dozi3:codex/skdc-binday-fix-pr

Conversation

@Dozi3 Dozi3 commented Jun 5, 2026

Summary

  • Updates the South Kesteven scraper to use the current live binday flow and normalize the returned bin names.
  • Adds --artifact-dir support to the data collection CLI so targeted council validation can retain debugging artifacts when needed.
  • Adds South Kesteven fixture coverage and targeted tests for the scraper behavior.
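
As a rough illustration of the `--artifact-dir` bullet above, an optional directory flag of this kind can be wired up with `argparse`. This is a minimal sketch, not the code in `collect_data.py`: the `build_parser` helper, the `artifact_dir` destination name, and the help text are assumptions made for the example.

```python
import argparse
from pathlib import Path


def build_parser() -> argparse.ArgumentParser:
    """Build a small parser with an optional --artifact-dir flag (hypothetical helper)."""
    parser = argparse.ArgumentParser(description="Sketch of a data collection CLI")
    parser.add_argument(
        "--artifact-dir",
        dest="artifact_dir",
        type=Path,
        default=None,  # None means: do not keep debugging artifacts
        help="Directory in which to keep debugging artifacts when a run fails",
    )
    return parser


if __name__ == "__main__":
    # Parse an explicit argument list instead of sys.argv so the sketch runs anywhere.
    args = build_parser().parse_args(["--artifact-dir", "artifacts/skdc"])
    print(args.artifact_dir)
```

Defaulting to `None` keeps the flag opt-in, so a caller that never passes it sees no change in behaviour.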

Why

The previous South Kesteven implementation no longer matched the live council collection service, which made the returned collection dates and bin labels unreliable.

Validation

  • python -m pytest --basetemp=C:\Users\mattm\AppData\Local\Temp\CodexPytestSkdcBindayPr --confcutdir=uk_bin_collection\uk_bin_collection\councils\tests uk_bin_collection\uk_bin_collection\councils\tests\test_south_kesteven_district_council.py
  • python -m compileall uk_bin_collection/uk_bin_collection/councils/SouthKestevenDistrictCouncil.py uk_bin_collection/uk_bin_collection/collect_data.py

Summary by CodeRabbit

  • New Features

    • Added optional --artifact-dir CLI argument to save debug artifacts when collection encounters errors
    • Enhanced South Kesteven District Council collection with improved live checker lookup and validation
  • Improvements

    • Better error diagnostics with captured page state and debugging information on collection failures

Related to #1907.

This PR addresses the South Kesteven issue discussed here:
#1907 (comment)

Copilot AI review requested due to automatic review settings June 5, 2026 23:10

coderabbitai Bot commented Jun 5, 2026

Contributor

📝 Walkthrough

This PR completely rewrites South Kesteven bin collection parsing from OCR-based calendar image analysis to live Selenium webdriver automation of the council's binday checker form, including CLI infrastructure for artifact capture, environment-driven test fixtures, rewritten unit and integration tests, and updated configuration documentation.

Changes

South Kesteven Binday Checker Implementation

| Layer | File(s) | Summary |
|---|---|---|
| CLI and test fixture infrastructure | uk_bin_collection/uk_bin_collection/collect_data.py, uk_bin_collection/uk_bin_collection/councils/tests/conftest.py | Added --artifact-dir CLI argument for debug artifact capture, and updated pytest fixtures to read UKBC_TEST_* environment variables (postcode, paon, URL, web driver, headless mode) with sensible defaults for South Kesteven integration tests. |
| Core Selenium binday checker implementation | uk_bin_collection/uk_bin_collection/councils/SouthKestevenDistrictCouncil.py | Replaced entire CouncilClass with new Selenium-driven parse_data that navigates the binday checker form: resolves checker URL from landing page, enters postcode, selects address, waits for results table, parses collection dates with type normalization. Added helpers for webdriver waits, DOM readiness checks, address selection, and debug artifact capture (HTML, screenshot, metadata) on failure. |
| Unit tests for parsing and selection helpers | uk_bin_collection/uk_bin_collection/councils/tests/test_south_kesteven_district_council.py | Rewrote from unittest to pytest with focused tests for URL extraction, address dropdown readiness, address selection, bin row parsing with type mapping, debug artifact file writing, and full parse_data flow with mocked webdriver interactions. |
| Integration tests with live Selenium binday checker | uk_bin_collection/uk_bin_collection/councils/tests/test_south_kesteven_integration.py | Rewrote from requests-based tests to Selenium-driven integration tests that validate real binday checker queries, assert correct bin types and DD/MM/YYYY date format, handle unknown property errors with RuntimeError, and skip gracefully when Selenium is unavailable. |
| Configuration and documentation update | uk_bin_collection/tests/input.json | Updated South Kesteven entry with new example postcode/house_number, updated binday checker URL, added web_driver configuration, and rewrote wiki_note to describe Selenium-driven "Your Collections" table parsing instead of previous OCR calendar image approach. |
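
The environment-driven fixtures mentioned in the first row could look roughly like the helpers below. This is only a sketch under stated assumptions: the exact variable names (`UKBC_TEST_POSTCODE`, `UKBC_TEST_HEADLESS`), the helper names, and the default values are hypothetical, since the summary only gives the `UKBC_TEST_*` prefix and the list of settings.

```python
import os


def env_or_default(name: str, default: str) -> str:
    """Return the environment variable's value, or the default when unset or blank."""
    value = os.environ.get(name, "").strip()
    return value or default


def env_flag(name: str, default: bool) -> bool:
    """Interpret common truthy spellings; fall back to the default when unset or blank."""
    raw = os.environ.get(name)
    if raw is None or not raw.strip():
        return default
    return raw.strip().lower() in {"1", "true", "yes", "on"}


# Hypothetical variable names and defaults, chosen only for illustration.
postcode = env_or_default("UKBC_TEST_POSTCODE", "NG31 8XG")
headless = env_flag("UKBC_TEST_HEADLESS", True)
```

In a `conftest.py`, each helper call would typically sit inside a `@pytest.fixture` function so that tests receive the resolved value by name.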

Estimated code review effort

🎯 4 (Complex) | ⏱️ ~60 minutes

The PR involves a substantial rewrite of the core South Kesteven implementation (~1150 lines removed, ~300 lines added in the council class alone), complete replacement of two test modules with new paradigms, and updates to shared infrastructure (CLI, fixtures). The logic is dense with webdriver interactions, DOM state detection, and error handling paths. However, the changes are focused on a single council and follow a clear pattern, and many test assertions are straightforward validations.

Possibly related PRs

  • robbrad/UKBinCollectionData#1652: This PR is a direct successor that replaces the requests/OCR-based implementation introduced in #1652 with a Selenium-driven binday checker approach, rewriting all corresponding tests.

Suggested reviewers

  • dp247

Poem

🐰 Hopping through the bins, a tale unfolds—
From calendars scanned to webforms bold,
Selenium clicks where OCR once strode,
South Kesteven's path, a cleaner code!
Debug artifacts saved, when things go wrong,
Here's to collectors, shiny and strong! ✨🗑️

🚥 Pre-merge checks | ✅ 4 | ❌ 1

❌ Failed checks (1 warning)

| Check name | Status | Explanation | Resolution |
|---|---|---|---|
| Docstring Coverage | ⚠️ Warning | Docstring coverage is 18.18% which is insufficient. The required threshold is 80.00%. | Write docstrings for the functions missing them to satisfy the coverage threshold. |

✅ Passed checks (4 passed)

| Check name | Status | Explanation |
|---|---|---|
| Title check | ✅ Passed | The title 'Fix South Kesteven binday scraper' directly addresses the main objective of the PR: fixing the South Kesteven scraper to follow the current live binday flow and normalize bin names. |
| Linked Issues check | ✅ Passed | Check skipped because no linked issues were found for this pull request. |
| Out of Scope Changes check | ✅ Passed | Check skipped because no linked issues were found for this pull request. |
| Description Check | ✅ Passed | Check skipped - CodeRabbit’s high-level summary is enabled. |

Copilot AI left a comment

Pull request overview

Note

Copilot was unable to run its full agentic suite in this review.

Updates the South Kesteven District Council scraper to use the council’s live “binday” checker via Selenium, adds debug artifact capture on failures, and refreshes test coverage + sample config accordingly.

Changes:

  • Replaced the previous requests/OCR-based approach with a Selenium-driven binday flow that parses “Your Collections” results tables.
  • Added artifact capture (HTML/screenshot/metadata) and a new CLI flag (--artifact-dir) to control where artifacts are written.
  • Reworked unit/integration tests and updated tests/input.json to reflect the new required inputs (postcode + house number/name).

Reviewed changes

Copilot reviewed 6 out of 6 changed files in this pull request and generated 6 comments.

| File | Description |
|---|---|
| uk_bin_collection/uk_bin_collection/councils/SouthKestevenDistrictCouncil.py | Implements Selenium binday navigation, parsing, normalization, and failure artifact capture. |
| uk_bin_collection/uk_bin_collection/councils/tests/test_south_kesteven_integration.py | Updates integration coverage to exercise the Selenium flow and artifact output. |
| uk_bin_collection/uk_bin_collection/councils/tests/test_south_kesteven_district_council.py | Replaces older unit tests with Selenium-flow unit tests and parsing/artifact tests. |
| uk_bin_collection/uk_bin_collection/councils/tests/conftest.py | Adds env-driven fixtures for postcode/paon/url/webdriver/headless. |
| uk_bin_collection/uk_bin_collection/collect_data.py | Adds --artifact-dir arg and passes it through to the scraper. |
| uk_bin_collection/tests/input.json | Updates SKDC example configuration to use new inputs + Selenium URL. |

Comment on lines +249 to +261
def _select_address(self, address_select, paon: str) -> None:
    target = str(paon).strip().lower()
    select = Select(address_select)

    for option in select.options:
        option_text = option.text.strip().lower()
        if target in option_text:
            select.select_by_visible_text(option.text)
            return

    raise RuntimeError(
        f"Unable to find the property '{paon}' in the address dropdown."
    )
Comment on lines +282 to +312
+def _capture_debug_artifacts(
+    self, driver, artifact_root: Path, context: dict[str, str]
+) -> Path | None:
+    if not driver:
+        return None
+
-def get_bin_type_from_calendar(self, collection_date, calendar_data=None):
-    """Determine the specific bin type from the parsed calendar data."""
-    try:
-        # Parse the date
-        date_obj = datetime.strptime(collection_date, "%d/%m/%Y")
-        year = str(date_obj.year)
-        month = str(date_obj.month)
-        day = date_obj.day
-
-        # Determine which week of the month this is
-        week_of_month = str(((day - 1) // 7) + 1)
-
-        # Use provided calendar data or get it if not provided
-        if calendar_data is None:
-            calendar_data = self.parse_calendar_images()
-
-        # Look up the bin type from the calendar data
-        if year in calendar_data and month in calendar_data[year] and week_of_month in calendar_data[year][month]:
-            return calendar_data[year][month][week_of_month]
-        else:
-            # Raise error if not found in calendar instead of fallback
-            raise ValueError(f"No bin type found for {collection_date} (Week {week_of_month} of {month}/{year})")
-
-    except Exception as e:
-        print(f"Error determining bin type for {collection_date}: {e}")
-        raise
+    timestamp = datetime.now().strftime("%Y%m%d-%H%M%S")
+    artifact_path = artifact_root / timestamp
+    artifact_path.mkdir(parents=True, exist_ok=True)
+
-def parse_data(self, page: str, **kwargs) -> dict:
-    try:
-        user_postcode = kwargs.get("postcode")
+    metadata = {
+        "current_url": str(getattr(driver, "current_url", "")) or None,
+        **context,
+    }
+
-        # Validate postcode
-        if not user_postcode:
-            raise ValueError("Postcode is required for South Kesteven")
+    screenshot_path = artifact_path / "page.png"
+    html_path = artifact_path / "page.html"
+    metadata_path = artifact_path / "metadata.json"
+
-        # No WebDriver needed - using requests-based approach
-
-        # Get collection day for regular bins
-        collection_day = self.get_collection_day_from_postcode(None, user_postcode)
-        if not collection_day:
-            raise ValueError(f"Could not determine collection day for postcode {user_postcode}")
+    try:
+        html_path.write_text(driver.page_source, encoding="utf-8")
+    except Exception as exc:
+        metadata["page_html_error"] = str(exc)
+
-        # Get green bin info
-        green_bin_info = self.get_green_bin_info_from_postcode(None, user_postcode)
+    try:
+        metadata["screenshot_saved"] = bool(driver.save_screenshot(str(screenshot_path)))
+    except Exception as exc:
+        metadata["screenshot_error"] = str(exc)
+
+    metadata_path.write_text(json.dumps(metadata, indent=4), encoding="utf-8")
+    return artifact_path.resolve()
Comment on lines +288 to +289
timestamp = datetime.now().strftime("%Y%m%d-%H%M%S")
artifact_path = artifact_root / timestamp
Comment on lines +292 to +295
metadata = {
    "current_url": str(getattr(driver, "current_url", "")) or None,
    **context,
}

assert "bins" in result
assert isinstance(result["bins"], list)
assert result["bins"]
assert {bin_entry["type"] for bin_entry in result["bins"]} <= self.EXPECTED_BIN_TYPES
Comment on lines +2327 to +2331
     "house_number": "43",
     "postcode": "NG31 8XG",
     "skip_get_url": true,
-    "url": "https://pre.southkesteven.gov.uk/skdcNext/tempforms/checkmybin.aspx",
+    "url": "https://www.southkesteven.gov.uk/binday",
     "web_driver": "http://selenium:4444",

@coderabbitai coderabbitai Bot left a comment

Contributor

Actionable comments posted: 3

🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Inline comments:
In
`@uk_bin_collection/uk_bin_collection/councils/SouthKestevenDistrictCouncil.py`:
- Around line 249-261: The _select_address function currently uses substring
matching which can pick the wrong option; change it to find exact/anchored
matches against the normalized paon: build a list of candidate options by
normalizing option.text and matching either equality or a regex anchored to the
start/end (e.g., full token match) to avoid substring hits, then if exactly one
candidate select it with select.select_by_visible_text(option.text), if zero or
>1 candidates raise a RuntimeError describing no match or ambiguous multiple
matches (include the list of matching option texts) so the caller can see the
ambiguity; keep references to the Select instance, option.text and the
_select_address(paon) signature when making the change.

In
`@uk_bin_collection/uk_bin_collection/councils/tests/test_south_kesteven_district_council.py`:
- Around line 79-86: Update the pytest.raises match arguments to use raw/escaped
regex strings so metacharacters are treated literally: change occurrences like
match="Property number or name \\(paon\\) is required for South Kesteven." to
raw-string form match=r"Property number or name \(paon\) is required for South
Kesteven." (do the same for the postcode test and the other occurrences noted);
locate these in the test function test_parse_data_requires_paon and any other
tests invoking council.parse_data and replace the match="..." with match=r"...",
escaping literal parentheses and other regex metacharacters as needed.

In
`@uk_bin_collection/uk_bin_collection/councils/tests/test_south_kesteven_integration.py`:
- Around line 44-48: Replace the fragile slash/length checks on
bin_entry["collectionDate"] with strict parsing using the datetime parser: call
datetime.datetime.strptime on the collection date string (format "%d/%m/%Y") in
the test (e.g., inside the test_south_kesteven_integration test) and assert that
parsing succeeds (or that the returned datetime has expected day/month/year
properties) instead of asserting string lengths; add the necessary import for
datetime at the top of the test file.
ℹ️ Review info
⚙️ Run configuration

Configuration used: defaults

Review profile: CHILL

Plan: Pro

Run ID: f3663b32-c409-4d9c-9168-80e1196eded9

📥 Commits

Reviewing files that changed from the base of the PR and between b65502c and bbb31d8.

📒 Files selected for processing (6)
  • uk_bin_collection/tests/input.json
  • uk_bin_collection/uk_bin_collection/collect_data.py
  • uk_bin_collection/uk_bin_collection/councils/SouthKestevenDistrictCouncil.py
  • uk_bin_collection/uk_bin_collection/councils/tests/conftest.py
  • uk_bin_collection/uk_bin_collection/councils/tests/test_south_kesteven_district_council.py
  • uk_bin_collection/uk_bin_collection/councils/tests/test_south_kesteven_integration.py

Comment on lines +249 to +261
def _select_address(self, address_select, paon: str) -> None:
    target = str(paon).strip().lower()
    select = Select(address_select)

    for option in select.options:
        option_text = option.text.strip().lower()
        if target in option_text:
            select.select_by_visible_text(option.text)
            return

    raise RuntimeError(
        f"Unable to find the property '{paon}' in the address dropdown."
    )

Contributor

⚠️ Potential issue | 🟠 Major | ⚡ Quick win

Avoid ambiguous substring matching when selecting the property.

Line 255 currently matches paon using substring containment, which can select the wrong address and return another property's collections. Match exact/anchored candidates and fail when multiple entries match.

Proposed fix
 def _select_address(self, address_select, paon: str) -> None:
     target = str(paon).strip().lower()
     select = Select(address_select)
-
-    for option in select.options:
-        option_text = option.text.strip().lower()
-        if target in option_text:
-            select.select_by_visible_text(option.text)
-            return
+    exact_or_anchored_matches = []
+    for option in select.options:
+        option_text = " ".join(option.text.split()).lower()
+        if (
+            option_text == target
+            or option_text.startswith(f"{target},")
+            or option_text.startswith(f"{target} ")
+        ):
+            exact_or_anchored_matches.append(option.text)
+
+    if len(exact_or_anchored_matches) == 1:
+        select.select_by_visible_text(exact_or_anchored_matches[0])
+        return
+    if len(exact_or_anchored_matches) > 1:
+        raise RuntimeError(
+            f"Property '{paon}' matched multiple addresses; provide a more specific value."
+        )
 
     raise RuntimeError(
         f"Unable to find the property '{paon}' in the address dropdown."
     )
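
To see why substring containment is risky here, the matching logic can be exercised on plain strings without a browser. The sketch below mirrors the anchored rules from the proposed fix but is not the PR's code: the function names and the example addresses are made up for illustration.

```python
def first_substring_match(option_texts, paon):
    """Substring containment, as in the lines under review: returns the first hit."""
    target = str(paon).strip().lower()
    for text in option_texts:
        if target in text.strip().lower():
            return text
    raise RuntimeError(f"Unable to find the property '{paon}' in the address dropdown.")


def anchored_match(option_texts, paon):
    """Accept an option only when it equals the target or starts with it as a whole token."""
    target = " ".join(str(paon).split()).lower()
    matches = []
    for text in option_texts:
        normalized = " ".join(text.split()).lower()
        if (
            normalized == target
            or normalized.startswith(f"{target},")
            or normalized.startswith(f"{target} ")
        ):
            matches.append(text)
    if len(matches) == 1:
        return matches[0]
    if len(matches) > 1:
        raise RuntimeError(f"Property '{paon}' matched multiple addresses: {matches}")
    raise RuntimeError(f"Unable to find the property '{paon}' in the address dropdown.")


# Invented addresses: "43" is a substring of "143", so containment picks the wrong row.
options = ["143 Example Road, Grantham", "43 Example Road, Grantham"]
wrong = first_substring_match(options, "43")
right = anchored_match(options, "43")
```

With these inputs `wrong` is the `143` entry and `right` is the `43` entry. Two options that both begin with `43 ` would raise the ambiguity error instead of silently selecting one.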

Comment on lines +79 to +86
    with pytest.raises(ValueError, match="Postcode is required for South Kesteven."):
        council.parse_data("", paon="43")


def test_parse_data_requires_paon(council):
    with pytest.raises(
        ValueError,
        match="Property number or name \\(paon\\) is required for South Kesteven.",

Contributor

⚠️ Potential issue | 🟡 Minor | ⚡ Quick win

Use raw/escaped regex strings in pytest.raises(..., match=...).

These match= patterns include regex metacharacters and trigger RUF043; tighten them with raw strings and escaped literals to keep assertions precise.

Proposed fix
-    with pytest.raises(ValueError, match="Postcode is required for South Kesteven."):
+    with pytest.raises(ValueError, match=r"Postcode is required for South Kesteven\."):
@@
-        match="Property number or name \\(paon\\) is required for South Kesteven.",
+        match=r"Property number or name \(paon\) is required for South Kesteven\.",
@@
-            match="Unable to find the property '99' in the address dropdown.",
+            match=r"Unable to find the property '99' in the address dropdown\.",
@@
-            match="Unable to find the address dropdown after searching for the postcode.",
+            match=r"Unable to find the address dropdown after searching for the postcode\.",

Also applies to: 125-125, 301-301
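
The practical effect of the unescaped `.` can be shown with `re.search`, which is what `pytest.raises(..., match=...)` applies to the string form of the exception. The snippet is a small standalone illustration; the `similar` message is invented to show a string the loose pattern would wrongly accept.

```python
import re

message = "Postcode is required for South Kesteven."
similar = "Postcode is required for South Kesteven!"  # invented near-miss message

loose = "Postcode is required for South Kesteven."     # "." matches any character
strict = r"Postcode is required for South Kesteven\."  # "\." matches a literal dot

loose_accepts_similar = re.search(loose, similar) is not None
strict_accepts_similar = re.search(strict, similar) is not None
strict_accepts_message = re.search(strict, message) is not None

# re.escape builds an equivalent literal pattern without hand-written backslashes.
escaped_accepts_message = re.search(re.escape(message), message) is not None
escaped_accepts_similar = re.search(re.escape(message), similar) is not None
```

Here the loose pattern accepts the near-miss message while the strict and `re.escape` patterns reject it. `re.escape` is an alternative to a raw string when the expected message is already held in a variable.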

🧰 Tools
🪛 Ruff (0.15.15)

[warning] 79-79: Pattern passed to match= contains metacharacters but is neither escaped nor raw

(RUF043)


[warning] 86-86: Pattern passed to match= contains metacharacters but is neither escaped nor raw

(RUF043)

Source: Linters/SAST tools

Comment on lines +44 to +48
date_parts = bin_entry["collectionDate"].split("/")
assert len(date_parts) == 3
assert len(date_parts[0]) == 2
assert len(date_parts[1]) == 2
assert len(date_parts[2]) == 4

Contributor

⚠️ Potential issue | 🟡 Minor | ⚡ Quick win

Use strict date parsing in the integration assertion.

The current slash/length checks allow invalid dates (for example, 99/99/9999) to pass, which can hide parser regressions.

Proposed change
+from datetime import datetime
 import pytest
 from selenium.common.exceptions import WebDriverException
 from urllib3.exceptions import MaxRetryError
@@
         for bin_entry in result["bins"]:
             assert "type" in bin_entry
             assert "collectionDate" in bin_entry
-            date_parts = bin_entry["collectionDate"].split("/")
-            assert len(date_parts) == 3
-            assert len(date_parts[0]) == 2
-            assert len(date_parts[1]) == 2
-            assert len(date_parts[2]) == 4
+            datetime.strptime(bin_entry["collectionDate"], "%d/%m/%Y")
📝 Committable suggestion

‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

Suggested change
-            date_parts = bin_entry["collectionDate"].split("/")
-            assert len(date_parts) == 3
-            assert len(date_parts[0]) == 2
-            assert len(date_parts[1]) == 2
-            assert len(date_parts[2]) == 4
+            datetime.strptime(bin_entry["collectionDate"], "%d/%m/%Y")
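
The difference between the two checks is easy to reproduce outside the test suite. The helper names below are hypothetical and exist only to put the shape check and the `strptime` check side by side.

```python
from datetime import datetime


def has_date_shape(value: str) -> bool:
    """The slash/length check: three parts of 2, 2 and 4 characters."""
    parts = value.split("/")
    return (
        len(parts) == 3
        and len(parts[0]) == 2
        and len(parts[1]) == 2
        and len(parts[2]) == 4
    )


def is_real_date(value: str) -> bool:
    """The strict check: the string must parse as a calendar date in DD/MM/YYYY order."""
    try:
        datetime.strptime(value, "%d/%m/%Y")
    except ValueError:
        return False
    return True


shape_accepts_nonsense = has_date_shape("99/99/9999")
strict_accepts_nonsense = is_real_date("99/99/9999")
strict_accepts_feb_31 = is_real_date("31/02/2026")
strict_accepts_unpadded = is_real_date("5/6/2026")
```

The shape check accepts `99/99/9999`, while `strptime` rejects it and also rejects `31/02/2026`. One caveat: `%d` and `%m` accept unpadded values such as `5/6/2026`, so if zero-padding is part of the contract, a length check may still be worth keeping alongside the parse.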
