Fix South Kesteven binday scraper#2121
Conversation
|
Need an answer fast? Review this PR in Change Stack to ask focused questions about the PR or a changed range. 📝 WalkthroughWalkthroughThis PR completely rewrites South Kesteven bin collection parsing from OCR-based calendar image analysis to live Selenium webdriver automation of the council's binday checker form, including CLI infrastructure for artifact capture, environment-driven test fixtures, rewritten unit and integration tests, and updated configuration documentation. ChangesSouth Kesteven Binday Checker Implementation
Estimated code review effort🎯 4 (Complex) | ⏱️ ~60 minutes The PR involves a substantial rewrite of the core South Kesteven implementation (~1150 lines removed, ~300 lines added in the council class alone), complete replacement of two test modules with new paradigms, and updates to shared infrastructure (CLI, fixtures). The logic is dense with webdriver interactions, DOM state detection, and error handling paths. However, the changes are focused on a single council and follow a clear pattern, and many test assertions are straightforward validations. Possibly related PRs
Suggested reviewers
Poem
🚥 Pre-merge checks | ✅ 4 | ❌ 1❌ Failed checks (1 warning)
✅ Passed checks (4 passed)
✏️ Tip: You can configure your own custom pre-merge checks in the settings. ✨ Finishing Touches🧪 Generate unit tests (beta)
Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out. Comment |
There was a problem hiding this comment.
Pull request overview
Note
Copilot was unable to run its full agentic suite in this review.
Updates the South Kesteven District Council scraper to use the council’s live “binday” checker via Selenium, adds debug artifact capture on failures, and refreshes test coverage + sample config accordingly.
Changes:
- Replaced the previous requests/OCR-based approach with a Selenium-driven binday flow that parses “Your Collections” results tables.
- Added artifact capture (HTML/screenshot/metadata) and a new CLI flag (
--artifact-dir) to control where artifacts are written. - Reworked unit/integration tests and updated
tests/input.jsonto reflect the new required inputs (postcode + house number/name).
Reviewed changes
Copilot reviewed 6 out of 6 changed files in this pull request and generated 6 comments.
Show a summary per file
| File | Description |
|---|---|
| uk_bin_collection/uk_bin_collection/councils/SouthKestevenDistrictCouncil.py | Implements Selenium binday navigation, parsing, normalization, and failure artifact capture. |
| uk_bin_collection/uk_bin_collection/councils/tests/test_south_kesteven_integration.py | Updates integration coverage to exercise the Selenium flow and artifact output. |
| uk_bin_collection/uk_bin_collection/councils/tests/test_south_kesteven_district_council.py | Replaces older unit tests with Selenium-flow unit tests and parsing/artifact tests. |
| uk_bin_collection/uk_bin_collection/councils/tests/conftest.py | Adds env-driven fixtures for postcode/paon/url/webdriver/headless. |
| uk_bin_collection/uk_bin_collection/collect_data.py | Adds --artifact-dir arg and passes it through to the scraper. |
| uk_bin_collection/tests/input.json | Updates SKDC example configuration to use new inputs + Selenium URL. |
💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.
| def _select_address(self, address_select, paon: str) -> None: | ||
| target = str(paon).strip().lower() | ||
| select = Select(address_select) | ||
|
|
||
| for option in select.options: | ||
| option_text = option.text.strip().lower() | ||
| if target in option_text: | ||
| select.select_by_visible_text(option.text) | ||
| return | ||
|
|
||
| raise RuntimeError( | ||
| f"Unable to find the property '{paon}' in the address dropdown." | ||
| ) |
| def _capture_debug_artifacts( | ||
| self, driver, artifact_root: Path, context: dict[str, str] | ||
| ) -> Path | None: | ||
| if not driver: | ||
| return None | ||
|
|
||
| def get_bin_type_from_calendar(self, collection_date, calendar_data=None): | ||
| """Determine the specific bin type from the parsed calendar data.""" | ||
| try: | ||
| # Parse the date | ||
| date_obj = datetime.strptime(collection_date, "%d/%m/%Y") | ||
| year = str(date_obj.year) | ||
| month = str(date_obj.month) | ||
| day = date_obj.day | ||
|
|
||
| # Determine which week of the month this is | ||
| week_of_month = str(((day - 1) // 7) + 1) | ||
|
|
||
| # Use provided calendar data or get it if not provided | ||
| if calendar_data is None: | ||
| calendar_data = self.parse_calendar_images() | ||
|
|
||
| # Look up the bin type from the calendar data | ||
| if year in calendar_data and month in calendar_data[year] and week_of_month in calendar_data[year][month]: | ||
| return calendar_data[year][month][week_of_month] | ||
| else: | ||
| # Raise error if not found in calendar instead of fallback | ||
| raise ValueError(f"No bin type found for {collection_date} (Week {week_of_month} of {month}/{year})") | ||
|
|
||
| except Exception as e: | ||
| print(f"Error determining bin type for {collection_date}: {e}") | ||
| raise | ||
| timestamp = datetime.now().strftime("%Y%m%d-%H%M%S") | ||
| artifact_path = artifact_root / timestamp | ||
| artifact_path.mkdir(parents=True, exist_ok=True) | ||
|
|
||
| def parse_data(self, page: str, **kwargs) -> dict: | ||
| try: | ||
| user_postcode = kwargs.get("postcode") | ||
| metadata = { | ||
| "current_url": str(getattr(driver, "current_url", "")) or None, | ||
| **context, | ||
| } | ||
|
|
||
| # Validate postcode | ||
| if not user_postcode: | ||
| raise ValueError("Postcode is required for South Kesteven") | ||
| screenshot_path = artifact_path / "page.png" | ||
| html_path = artifact_path / "page.html" | ||
| metadata_path = artifact_path / "metadata.json" | ||
|
|
||
| # No WebDriver needed - using requests-based approach | ||
|
|
||
| # Get collection day for regular bins | ||
| collection_day = self.get_collection_day_from_postcode(None, user_postcode) | ||
| if not collection_day: | ||
| raise ValueError(f"Could not determine collection day for postcode {user_postcode}") | ||
| try: | ||
| html_path.write_text(driver.page_source, encoding="utf-8") | ||
| except Exception as exc: | ||
| metadata["page_html_error"] = str(exc) | ||
|
|
||
| # Get green bin info | ||
| green_bin_info = self.get_green_bin_info_from_postcode(None, user_postcode) | ||
| try: | ||
| metadata["screenshot_saved"] = bool(driver.save_screenshot(str(screenshot_path))) | ||
| except Exception as exc: | ||
| metadata["screenshot_error"] = str(exc) | ||
|
|
||
| metadata_path.write_text(json.dumps(metadata, indent=4), encoding="utf-8") | ||
| return artifact_path.resolve() |
| timestamp = datetime.now().strftime("%Y%m%d-%H%M%S") | ||
| artifact_path = artifact_root / timestamp |
| metadata = { | ||
| "current_url": str(getattr(driver, "current_url", "")) or None, | ||
| **context, | ||
| } |
| assert "bins" in result | ||
| assert isinstance(result["bins"], list) | ||
| assert result["bins"] | ||
| assert {bin_entry["type"] for bin_entry in result["bins"]} <= self.EXPECTED_BIN_TYPES |
| "house_number": "43", | ||
| "postcode": "NG31 8XG", | ||
| "skip_get_url": true, | ||
| "url": "https://pre.southkesteven.gov.uk/skdcNext/tempforms/checkmybin.aspx", | ||
| "url": "https://www.southkesteven.gov.uk/binday", | ||
| "web_driver": "http://selenium:4444", |
There was a problem hiding this comment.
Actionable comments posted: 3
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
Inline comments:
In
`@uk_bin_collection/uk_bin_collection/councils/SouthKestevenDistrictCouncil.py`:
- Around line 249-261: The _select_address function currently uses substring
matching which can pick the wrong option; change it to find exact/anchored
matches against the normalized paon: build a list of candidate options by
normalizing option.text and matching either equality or a regex anchored to the
start/end (e.g., full token match) to avoid substring hits, then if exactly one
candidate select it with select.select_by_visible_text(option.text), if zero or
>1 candidates raise a RuntimeError describing no match or ambiguous multiple
matches (include the list of matching option texts) so the caller can see the
ambiguity; keep references to the Select instance, option.text and the
_select_address(paon) signature when making the change.
In
`@uk_bin_collection/uk_bin_collection/councils/tests/test_south_kesteven_district_council.py`:
- Around line 79-86: Update the pytest.raises match arguments to use raw/escaped
regex strings so metacharacters are treated literally: change occurrences like
match="Property number or name \\(paon\\) is required for South Kesteven." to
raw-string form match=r"Property number or name \(paon\) is required for South
Kesteven." (do the same for the postcode test and the other occurrences noted);
locate these in the test function test_parse_data_requires_paon and any other
tests invoking council.parse_data and replace the match="..." with match=r"...",
escaping literal parentheses and other regex metacharacters as needed.
In
`@uk_bin_collection/uk_bin_collection/councils/tests/test_south_kesteven_integration.py`:
- Around line 44-48: Replace the fragile slash/length checks on
bin_entry["collectionDate"] with strict parsing using the datetime parser: call
datetime.datetime.strptime on the collection date string (format "%d/%m/%Y") in
the test (e.g., inside the test_south_kesteven_integration test) and assert that
parsing succeeds (or that the returned datetime has expected day/month/year
properties) instead of asserting string lengths; add the necessary import for
datetime at the top of the test file.
🪄 Autofix (Beta)
Fix all unresolved CodeRabbit comments on this PR:
- Push a commit to this branch (recommended)
- Create a new PR with the fixes
ℹ️ Review info
⚙️ Run configuration
Configuration used: defaults
Review profile: CHILL
Plan: Pro
Run ID: f3663b32-c409-4d9c-9168-80e1196eded9
📒 Files selected for processing (6)
uk_bin_collection/tests/input.jsonuk_bin_collection/uk_bin_collection/collect_data.pyuk_bin_collection/uk_bin_collection/councils/SouthKestevenDistrictCouncil.pyuk_bin_collection/uk_bin_collection/councils/tests/conftest.pyuk_bin_collection/uk_bin_collection/councils/tests/test_south_kesteven_district_council.pyuk_bin_collection/uk_bin_collection/councils/tests/test_south_kesteven_integration.py
| def _select_address(self, address_select, paon: str) -> None: | ||
| target = str(paon).strip().lower() | ||
| select = Select(address_select) | ||
|
|
||
| for option in select.options: | ||
| option_text = option.text.strip().lower() | ||
| if target in option_text: | ||
| select.select_by_visible_text(option.text) | ||
| return | ||
|
|
||
| raise RuntimeError( | ||
| f"Unable to find the property '{paon}' in the address dropdown." | ||
| ) |
There was a problem hiding this comment.
Avoid ambiguous substring matching when selecting the property.
Line 255 currently matches paon using substring containment, which can select the wrong address and return another property's collections. Match exact/anchored candidates and fail when multiple entries match.
Proposed fix
def _select_address(self, address_select, paon: str) -> None:
target = str(paon).strip().lower()
select = Select(address_select)
-
- for option in select.options:
- option_text = option.text.strip().lower()
- if target in option_text:
- select.select_by_visible_text(option.text)
- return
+ exact_or_anchored_matches = []
+ for option in select.options:
+ option_text = " ".join(option.text.split()).lower()
+ if (
+ option_text == target
+ or option_text.startswith(f"{target},")
+ or option_text.startswith(f"{target} ")
+ ):
+ exact_or_anchored_matches.append(option.text)
+
+ if len(exact_or_anchored_matches) == 1:
+ select.select_by_visible_text(exact_or_anchored_matches[0])
+ return
+ if len(exact_or_anchored_matches) > 1:
+ raise RuntimeError(
+ f"Property '{paon}' matched multiple addresses; provide a more specific value."
+ )
raise RuntimeError(
f"Unable to find the property '{paon}' in the address dropdown."
)🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
In `@uk_bin_collection/uk_bin_collection/councils/SouthKestevenDistrictCouncil.py`
around lines 249 - 261, The _select_address function currently uses substring
matching which can pick the wrong option; change it to find exact/anchored
matches against the normalized paon: build a list of candidate options by
normalizing option.text and matching either equality or a regex anchored to the
start/end (e.g., full token match) to avoid substring hits, then if exactly one
candidate select it with select.select_by_visible_text(option.text), if zero or
>1 candidates raise a RuntimeError describing no match or ambiguous multiple
matches (include the list of matching option texts) so the caller can see the
ambiguity; keep references to the Select instance, option.text and the
_select_address(paon) signature when making the change.
| with pytest.raises(ValueError, match="Postcode is required for South Kesteven."): | ||
| council.parse_data("", paon="43") | ||
|
|
||
|
|
||
| def test_parse_data_requires_paon(council): | ||
| with pytest.raises( | ||
| ValueError, | ||
| match="Property number or name \\(paon\\) is required for South Kesteven.", |
There was a problem hiding this comment.
Use raw/escaped regex strings in pytest.raises(..., match=...).
These match= patterns include regex metacharacters and trigger RUF043; tighten them with raw strings and escaped literals to keep assertions precise.
Proposed fix
- with pytest.raises(ValueError, match="Postcode is required for South Kesteven."):
+ with pytest.raises(ValueError, match=r"Postcode is required for South Kesteven\."):
@@
- match="Property number or name \\(paon\\) is required for South Kesteven.",
+ match=r"Property number or name \(paon\) is required for South Kesteven\.",
@@
- match="Unable to find the property '99' in the address dropdown.",
+ match=r"Unable to find the property '99' in the address dropdown\.",
@@
- match="Unable to find the address dropdown after searching for the postcode.",
+ match=r"Unable to find the address dropdown after searching for the postcode\.",Also applies to: 125-125, 301-301
🧰 Tools
🪛 Ruff (0.15.15)
[warning] 79-79: Pattern passed to match= contains metacharacters but is neither escaped nor raw
(RUF043)
[warning] 86-86: Pattern passed to match= contains metacharacters but is neither escaped nor raw
(RUF043)
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
In
`@uk_bin_collection/uk_bin_collection/councils/tests/test_south_kesteven_district_council.py`
around lines 79 - 86, Update the pytest.raises match arguments to use
raw/escaped regex strings so metacharacters are treated literally: change
occurrences like match="Property number or name \\(paon\\) is required for South
Kesteven." to raw-string form match=r"Property number or name \(paon\) is
required for South Kesteven." (do the same for the postcode test and the other
occurrences noted); locate these in the test function
test_parse_data_requires_paon and any other tests invoking council.parse_data
and replace the match="..." with match=r"...", escaping literal parentheses and
other regex metacharacters as needed.
Source: Linters/SAST tools
| date_parts = bin_entry["collectionDate"].split("/") | ||
| assert len(date_parts) == 3 | ||
| assert len(date_parts[0]) == 2 | ||
| assert len(date_parts[1]) == 2 | ||
| assert len(date_parts[2]) == 4 |
There was a problem hiding this comment.
Use strict date parsing in the integration assertion.
The current slash/length checks allow invalid dates (for example, 99/99/9999) to pass, which can hide parser regressions.
Proposed change
+from datetime import datetime
import pytest
from selenium.common.exceptions import WebDriverException
from urllib3.exceptions import MaxRetryError
@@
for bin_entry in result["bins"]:
assert "type" in bin_entry
assert "collectionDate" in bin_entry
- date_parts = bin_entry["collectionDate"].split("/")
- assert len(date_parts) == 3
- assert len(date_parts[0]) == 2
- assert len(date_parts[1]) == 2
- assert len(date_parts[2]) == 4
+ datetime.strptime(bin_entry["collectionDate"], "%d/%m/%Y")📝 Committable suggestion
‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.
| date_parts = bin_entry["collectionDate"].split("/") | |
| assert len(date_parts) == 3 | |
| assert len(date_parts[0]) == 2 | |
| assert len(date_parts[1]) == 2 | |
| assert len(date_parts[2]) == 4 | |
| datetime.strptime(bin_entry["collectionDate"], "%d/%m/%Y") |
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
In
`@uk_bin_collection/uk_bin_collection/councils/tests/test_south_kesteven_integration.py`
around lines 44 - 48, Replace the fragile slash/length checks on
bin_entry["collectionDate"] with strict parsing using the datetime parser: call
datetime.datetime.strptime on the collection date string (format "%d/%m/%Y") in
the test (e.g., inside the test_south_kesteven_integration test) and assert that
parsing succeeds (or that the returned datetime has expected day/month/year
properties) instead of asserting string lengths; add the necessary import for
datetime at the top of the test file.
Summary
--artifact-dirsupport to the data collection CLI so targeted council validation can retain debugging artifacts when needed.Why
The previous South Kesteven implementation no longer matched the live council collection service, which made the returned collection dates and bin labels unreliable.
Validation
python -m pytest --basetemp=C:\Users\mattm\AppData\Local\Temp\CodexPytestSkdcBindayPr --confcutdir=uk_bin_collection\uk_bin_collection\councils\tests uk_bin_collection\uk_bin_collection\councils\tests\test_south_kesteven_district_council.pypython -m compileall uk_bin_collection/uk_bin_collection/councils/SouthKestevenDistrictCouncil.py uk_bin_collection/uk_bin_collection/collect_data.pySummary by CodeRabbit
New Features
--artifact-dirCLI argument to save debug artifacts when collection encounters errorsImprovements
Related to #1907.
This PR addresses the South Kesteven issue discussed here:
#1907 (comment)