Skip to content

fix: BaberghDistrictCouncil - support new collection-day table format#2122

Open
pacso wants to merge 1 commit into
robbrad:masterfrom
pacso:fix/babergh-new-table-format
Open

fix: BaberghDistrictCouncil - support new collection-day table format#2122
pacso wants to merge 1 commit into
robbrad:masterfrom
pacso:fix/babergh-new-table-format

Conversation

@pacso

@pacso pacso commented Jun 6, 2026

Copy link
Copy Markdown
Contributor

Summary

The Babergh District Council collection-day finder page changed significantly, breaking the existing parser (the integration test fails with a TimeoutException on the very first step). This PR rewrites the parser to handle the new page.

What changed on the council site

  • Form fields now use Liferay portlet ids (e.g. ..._postcode, ..._btnAddressLookup, ..._uprn) instead of aria-label attributes / lfr-btn-label class.
  • A new step was added — after selecting an address you must now click a "Find collection days" submit button (..._fcd_submit); previously the results loaded automatically.
  • Results format changed from div.collection-cardsdiv.card<h3>/<p> markup to a <table class="table"> inside div.collection-days-page.
  • Date format changed from Mon 27 Apr 2026 (%a %d %b %Y) to Monday 08 Jun 2026 (%A %d %b %Y).

What this PR does

  • Drives the new flow using the portlet element ids, waits for the async address dropdown to populate, and clicks the new "Find collection days" button.
  • Parses the new results table row-by-row, taking the type from the first column and dates from the "Next collection" (and optional "Following Collection Date") columns.
  • Because it iterates every row generically, it captures all bin types automatically, including the newly distributed ones (alongside the existing refuse/recycling/paper-card/food/garden services).

Testing

Verified against the live site with Selenium + headless Chrome:

  • Before: TimeoutException (test fails).
  • After: the BaberghDistrictCouncil integration test passes; output validates against output.schema.

Returns 12 collections across 5 types for the test address (Little Changes, CO6 4RA): Food Waste, Paper And Card, Recycling, Garden Waste (Brown Bin), Refuse (General Rubbish). Multiple future dates per type are fine — the Home Assistant coordinator already collapses them to the earliest future date per type.

🤖 Generated with Claude Code

Summary by CodeRabbit

  • Bug Fixes
    • Improved Babergh District Council collection-day lookup to work with the site’s updated address lookup form, including more reliable postcode and house-number matching.
    • Fixed results parsing to read table-based schedules, properly handle multiple dates per collection type, and wait for address options to populate for consistent lookups.

@coderabbitai

coderabbitai Bot commented Jun 6, 2026

Copy link
Copy Markdown
Contributor

Review Change Stack

No actionable comments were generated in the recent review. 🎉

ℹ️ Recent review info
⚙️ Run configuration

Configuration used: defaults

Review profile: CHILL

Plan: Pro

Run ID: 148bca5f-684a-4306-b9f8-1c7f93876c49

📥 Commits

Reviewing files that changed from the base of the PR and between 27a3be4 and 0261c41.

📒 Files selected for processing (1)
  • uk_bin_collection/uk_bin_collection/councils/BaberghDistrictCouncil.py
🚧 Files skipped from review as they are similar to previous changes (1)
  • uk_bin_collection/uk_bin_collection/councils/BaberghDistrictCouncil.py

📝 Walkthrough

Walkthrough

Babergh scraper now uses a module-level Liferay portlet namespace for element IDs, waits for and selects namespaced address dropdown options by postcode/paon patterns, submits the namespaced form, and parses collection days from a results table using full-weekday date parsing.

Changes

Babergh Portlet Integration and Parsing Overhaul

Layer / File(s) Summary
Imports and Portlet Namespace Setup
uk_bin_collection/uk_bin_collection/councils/BaberghDistrictCouncil.py
Import changed to from datetime import datetime, removed sleep-based delay, and new PORTLET_NS constant introduced for Liferay-namespaced element IDs.
parse_data docstring update
uk_bin_collection/uk_bin_collection/councils/BaberghDistrictCouncil.py
Expanded docstring documents Selenium postcode/address lookup flow, expected kwargs (postcode, paon, optional driver/headless), and returned bins structure.
Address Lookup and Selection Workflow
uk_bin_collection/uk_bin_collection/councils/BaberghDistrictCouncil.py
Postcode input, lookup button, and address dropdown now use namespaced IDs ({PORTLET_NS}postcode, {PORTLET_NS}btnAddressLookup, {PORTLET_NS}uprn); code waits for dropdown options, skips placeholders, and matches option text by postcode plus paon/house-number regex (optional single-letter suffix).
Collection Days Submission and Table Parsing
uk_bin_collection/uk_bin_collection/councils/BaberghDistrictCouncil.py
Namespaced submit ({PORTLET_NS}fcd_submit) clicked; waits for div.collection-days-page table; parses rows extracting collection type from first cell and one-or-more date cells, parsing dates with %A %d %b %Y and formatting to date_format.

Sequence Diagram(s)

sequenceDiagram
  participant Client_parse_data as parse_data
  participant Browser_WebDriver as WebDriver
  participant Address_Dropdown as AddressDropdown
  participant Results_Page as ResultsTable
  parse_data->>WebDriver: enter `{PORTLET_NS}postcode` and click `{PORTLET_NS}btnAddressLookup`
  WebDriver->>AddressDropdown: wait for `{PORTLET_NS}uprn` options to populate
  parse_data->>AddressDropdown: select best-matching option (postcode + paon regex)
  parse_data->>WebDriver: click `{PORTLET_NS}fcd_submit`
  WebDriver->>ResultsTable: wait for `div.collection-days-page table`
  ResultsTable->>parse_data: return rows of (collection type, date cells) parsed with `%A %d %b %Y`
Loading

Estimated code review effort

🎯 4 (Complex) | ⏱️ ~45 minutes

Possibly related PRs

Poem

🐰 I tap the portlet namespace by moonlight,
Dropdowns fill and options line up right,
I pick the house and click to view the days,
Tables give full-week dates in tidy ways,
Hoppy scraper — bins scheduled for delight.

🚥 Pre-merge checks | ✅ 5
✅ Passed checks (5 passed)
Check name Status Explanation
Description Check ✅ Passed Check skipped - CodeRabbit’s high-level summary is enabled.
Title check ✅ Passed The title clearly summarizes the main change: rewriting the Babergh District Council parser to support a new collection-day table format after the website was updated.
Docstring Coverage ✅ Passed Docstring coverage is 100.00% which is sufficient. The required threshold is 80.00%.
Linked Issues check ✅ Passed Check skipped because no linked issues were found for this pull request.
Out of Scope Changes check ✅ Passed Check skipped because no linked issues were found for this pull request.

✏️ Tip: You can configure your own custom pre-merge checks in the settings.

✨ Finishing Touches
🧪 Generate unit tests (beta)
  • Create PR with unit tests

Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out.

❤️ Share

Comment @coderabbitai help to get the list of available commands and usage tips.

@coderabbitai coderabbitai Bot left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 1

🧹 Nitpick comments (2)
uk_bin_collection/uk_bin_collection/councils/BaberghDistrictCouncil.py (2)

1-1: ⚡ Quick win

Remove unused time import.

The time module is imported but never used in this file. All waits are handled by Selenium's WebDriverWait.

♻️ Proposed fix
-import time
 from datetime import datetime
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@uk_bin_collection/uk_bin_collection/councils/BaberghDistrictCouncil.py` at
line 1, Remove the unused import of the time module in the
BaberghDistrictCouncil.py file: locate the top-level import statement "import
time" and delete it since no functions or methods (e.g., any class or function
definitions within BaberghDistrictCouncil.py) reference time and all waits use
Selenium's WebDriverWait; run linting/tests to confirm no further references
remain.

119-154: ⚡ Quick win

Consider adding logging for skipped table rows and cells.

The parser silently skips rows/cells that don't meet expected criteria (lines 127, 131, 143) using continue statements. While this defensive approach prevents crashes, it could hide site format changes without alerting developers or users.

Based on learnings, council parsers consumed by Home Assistant should prefer defensive fallbacks with logger.warning over raising exceptions, ensuring resilience while still surfacing unexpected markup changes during debugging.

♻️ Proposed improvement with logging
+import logging
+
+logger = logging.getLogger(__name__)
+
 # ... (in parse_data method)
             results_page = soup.find("div", class_="collection-days-page")
             if results_page:
                 table = results_page.find("table")
                 if table:
                     body = table.find("tbody") or table
                     for row in body.find_all("tr"):
                         cells = row.find_all("td")
                         if len(cells) < 2:
+                            logger.warning(f"Skipping table row with fewer than 2 cells: {row}")
                             continue
 
                         collection_type = cells[0].get_text(strip=True)
                         if not collection_type:
+                            logger.warning(f"Skipping row with empty collection type: {row}")
                             continue
 
                         # ... date extraction ...
                         for date_cell in date_cells:
                             date_text = date_cell.get_text(strip=True)
                             if not date_text:
+                                logger.warning(f"Skipping empty date cell in {collection_type} row")
                                 continue
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@uk_bin_collection/uk_bin_collection/councils/BaberghDistrictCouncil.py`
around lines 119 - 154, The parser in BaberghDistrictCouncil (the loop that
iterates rows in the results_page table) silently skips malformed rows/cells at
the three continue points; add logger.warning calls instead of silent continues
so unexpected markup changes are surfaced. Specifically, in the loop inside the
method that builds data["bins"] (the for row in body.find_all("tr") loop),
replace the silent continues after: a) if len(cells) < 2, b) if not
collection_type, and c) if not date_text with logger.warning messages that
include identifying context (e.g., number of cells, row HTML snippet or an
index, the raw cells content, and the problematic date_text) so operators can
see when rows are skipped but maintain the same continue behavior after logging.
Ensure you use the module logger instance used elsewhere in the file (logger)
and keep messages concise and consistent.

Source: Learnings

🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Inline comments:
In `@uk_bin_collection/uk_bin_collection/councils/BaberghDistrictCouncil.py`:
- Line 89: The current string match in option_text that appends f",
{paon_str}A," only handles an "A" suffix and should be fixed; update the
construction that uses paon_str and option_text (the spot building the ",
{paon_str}A," fragment) to either remove this special-case fragment entirely or
generalize it to match any single-letter suffix (e.g., replace the literal "A"
with a character class like [A-Za-z] or [A-Z] in the pattern used), and ensure
the matching code treats letters consistently (case if relevant) so addresses
like "1B", "1C", etc. will match when paon="1".

---

Nitpick comments:
In `@uk_bin_collection/uk_bin_collection/councils/BaberghDistrictCouncil.py`:
- Line 1: Remove the unused import of the time module in the
BaberghDistrictCouncil.py file: locate the top-level import statement "import
time" and delete it since no functions or methods (e.g., any class or function
definitions within BaberghDistrictCouncil.py) reference time and all waits use
Selenium's WebDriverWait; run linting/tests to confirm no further references
remain.
- Around line 119-154: The parser in BaberghDistrictCouncil (the loop that
iterates rows in the results_page table) silently skips malformed rows/cells at
the three continue points; add logger.warning calls instead of silent continues
so unexpected markup changes are surfaced. Specifically, in the loop inside the
method that builds data["bins"] (the for row in body.find_all("tr") loop),
replace the silent continues after: a) if len(cells) < 2, b) if not
collection_type, and c) if not date_text with logger.warning messages that
include identifying context (e.g., number of cells, row HTML snippet or an
index, the raw cells content, and the problematic date_text) so operators can
see when rows are skipped but maintain the same continue behavior after logging.
Ensure you use the module logger instance used elsewhere in the file (logger)
and keep messages concise and consistent.
🪄 Autofix (Beta)

Fix all unresolved CodeRabbit comments on this PR:

  • Push a commit to this branch (recommended)
  • Create a new PR with the fixes

ℹ️ Review info
⚙️ Run configuration

Configuration used: defaults

Review profile: CHILL

Plan: Pro

Run ID: e7951fdc-e9ab-4ece-bd80-c2ed3c7f4f86

📥 Commits

Reviewing files that changed from the base of the PR and between b65502c and 7b3f6ff.

📒 Files selected for processing (1)
  • uk_bin_collection/uk_bin_collection/councils/BaberghDistrictCouncil.py

Comment thread uk_bin_collection/uk_bin_collection/councils/BaberghDistrictCouncil.py Outdated
pacso added a commit to pacso/UKBinCollectionData that referenced this pull request Jun 6, 2026
Address review feedback on PR robbrad#2122:
- Replace the brittle, "A"-only house-number suffix special case with a
  regex that matches the house name/number as a whole token with an
  optional single-letter suffix (e.g. "1A", "1B") and either a space or
  comma separator.
- Remove the now-unused `import time` (all waits use WebDriverWait).

Co-Authored-By: Claude Opus 4.8 <[email protected]>
@codecov

codecov Bot commented Jun 7, 2026

Copy link
Copy Markdown

Codecov Report

✅ All modified and coverable lines are covered by tests.
✅ Project coverage is 86.67%. Comparing base (b65502c) to head (0261c41).

Additional details and impacted files
@@           Coverage Diff           @@
##           master    #2122   +/-   ##
=======================================
  Coverage   86.67%   86.67%           
=======================================
  Files           9        9           
  Lines        1141     1141           
=======================================
  Hits          989      989           
  Misses        152      152           

☔ View full report in Codecov by Harness.
📢 Have feedback on the report? Share it here.

The Babergh collection-day finder page changed significantly:
- Form fields now use Liferay portlet ids instead of aria labels
- A new "Find collection days" submit step is required after selecting
  an address
- Results moved from card markup to a table in div.collection-days-page
- Dates are now formatted like "Monday 08 Jun 2026"

Rewrite the parser to drive the new flow and read the results table
row-by-row, which captures all bin types including the newly added
ones. Match the address by whole-token house name/number with an
optional single-letter suffix, and fail loud if the results table
yields no collections (an empty parse means the format changed rather
than there being no upcoming collections). Verified against the live
site; output validates against the schema and the integration test
passes.

Co-Authored-By: Claude Opus 4.8 <[email protected]>
@pacso pacso force-pushed the fix/babergh-new-table-format branch from 27a3be4 to 0261c41 Compare June 7, 2026 07:47
@pacso

pacso commented Jun 7, 2026

Copy link
Copy Markdown
Contributor Author

@coderabbitai review

@coderabbitai

coderabbitai Bot commented Jun 7, 2026

Copy link
Copy Markdown
Contributor
✅ Action performed

Review finished.

Note: CodeRabbit is an incremental review system and does not re-review already reviewed commits. This command is applicable only when automatic reviews are paused.

@pacso

pacso commented Jun 7, 2026

Copy link
Copy Markdown
Contributor Author

@coderabbitai resume

@coderabbitai

coderabbitai Bot commented Jun 7, 2026

Copy link
Copy Markdown
Contributor
✅ Action performed

Reviews resumed.

@pacso

pacso commented Jun 7, 2026

Copy link
Copy Markdown
Contributor Author

@coderabbitai review

@coderabbitai

coderabbitai Bot commented Jun 7, 2026

Copy link
Copy Markdown
Contributor
✅ Action performed

Review finished.

Note: CodeRabbit is an incremental review system and does not re-review already reviewed commits. This command is applicable only when automatic reviews are paused.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant