feat(pipeline): GH-never-scrapes guardrail (scraping_enabled switch) #212
Merged
UNIFY-PR-04 — first half of the GH/local partition. GitHub Actions' datacenter
IPs are bot-blocked by Israeli news sites, so CI must never fetch article bodies
or source pages; the canonical state is scraped by local runs, and GH only
searches, classifies, and runs the deterministic phases.
Adds a top-level `Config.scraping_enabled` master switch (default True), set
`false` in agents/news/ci.overlay.yaml. Both body-fetch choke points respect it:
- `scrape_candidates()` no-ops (returns an empty batch, leaving candidates for
a local run to materialize) — covers the ingest source-native, scrape_candidates,
and backfill_scrape paths, which all route through it;
- the ingest job's `fetch_all_sources` call is skipped (records a
`scraping_disabled=true` warning).
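The gate described in the two bullets above can be sketched roughly as follows. This is a minimal illustration, not the repository's code: `Config`, `Candidate`, and `ScrapeBatch` are simplified stand-ins and the `fetch` callable is hypothetical. Only the `scraping_enabled` flag and the empty-batch no-op come from the description above.

```python
from dataclasses import dataclass, field
from typing import Callable


@dataclass
class Config:
    # Master switch; the CI overlay is described as setting this to False.
    scraping_enabled: bool = True


@dataclass
class Candidate:
    url: str
    status: str = "NEW"


@dataclass
class ScrapeBatch:
    # Simplified stand-in for the real batch type (assumption).
    selected: list[Candidate] = field(default_factory=list)
    scraped: list[Candidate] = field(default_factory=list)


def scrape_candidates(
    config: Config,
    candidates: list[Candidate],
    fetch: Callable[[str], str],
) -> ScrapeBatch:
    """Fetch bodies for candidates unless scraping is disabled."""
    if not config.scraping_enabled:
        # No-op: nothing is fetched and candidate status is left untouched,
        # so a later local run can still materialize these candidates.
        return ScrapeBatch()
    batch = ScrapeBatch(selected=list(candidates))
    for candidate in candidates:
        fetch(candidate.url)
        candidate.status = "SCRAPED"
        batch.scraped.append(candidate)
    return batch
```

Placing the check at the top of the shared function means every caller that routes through it inherits the guardrail without its own flag check.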
The discover / backfill_discover jobs only search, so they are unaffected — GH
search keeps working. This is a guardrail: the operational workflows are still
`workflow_dispatch`-only, so it takes effect when they are enabled.
The search-backstop half (skip GH search when local already searched that day)
follows as UNIFY-PR-05.
Tests: the CI-overlay config test now asserts the flip (local True -> CI False);
scrape_candidates no-ops with a fetch-raising source wired in (proving nothing is
fetched and the candidate is left untouched); the ingest job skips
fetch_all_sources (wired to raise) and records the warning. ruff/mypy clean;
1371 unit tests pass.
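The "wired to raise" test pattern mentioned above might look roughly like this. It is a sketch under stated assumptions: `run_ingest` is a hypothetical stand-in for the ingest job's fetch step, and its result shape is invented for illustration. Only the idea of a raising fetch and the `scraping_disabled=true` warning come from the text.

```python
from unittest.mock import Mock


def run_ingest(scraping_enabled: bool, fetch_all_sources) -> dict:
    """Hypothetical stand-in for the ingest job's fetch step."""
    result = {"articles": [], "warnings": []}
    if not scraping_enabled:
        # Skip the fetch entirely and leave a trace in the run result.
        result["warnings"].append("scraping_disabled=true")
        return result
    result["articles"] = fetch_all_sources()
    return result


def test_ingest_skips_fetch_when_scraping_disabled() -> None:
    # Wire the fetch to raise so any accidental call fails loudly.
    fetch = Mock(side_effect=AssertionError("fetch_all_sources was called"))
    result = run_ingest(scraping_enabled=False, fetch_all_sources=fetch)
    fetch.assert_not_called()
    assert result["warnings"] == ["scraping_disabled=true"]
    assert result["articles"] == []
```

A raising double gives a stronger guarantee than checking the output alone: the test fails if the fetch is reached at all, even if its result would have been discarded.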
Co-Authored-By: Claude Opus 4.8 <[email protected]>
Pull request overview
This PR introduces a repo-wide guardrail to ensure GitHub Actions never scrapes source sites/article bodies (due to bot-blocked datacenter IPs), while still allowing CI to run search/classification and deterministic pipeline phases.
Changes:
- Add `Config.scraping_enabled` (default `True`) and set it to `false` in `agents/news/ci.overlay.yaml` to enforce “GH never scrapes”.
- Gate the two main body-fetch choke points: `fetch_all_sources` during ingest and `scrape_candidates()` during candidate materialization.
- Add unit tests + docs updates to assert and explain the CI/local behavioral partition.
Reviewed changes
Copilot reviewed 9 out of 9 changed files in this pull request and generated 1 comment.
| File | Description |
|---|---|
| `src/denbust/config.py` | Adds `scraping_enabled` master switch to top-level config. |
| `agents/news/ci.overlay.yaml` | Flips `scraping_enabled: false` for CI overlay. |
| `src/denbust/pipeline.py` | Skips `fetch_all_sources` when scraping is disabled and records a warning. |
| `src/denbust/discovery/scrape_queue.py` | No-ops `scrape_candidates()` when scraping is disabled. |
| `tests/unit/test_pipeline_core.py` | Verifies ingest path skips source fetch and records warning under `scraping_enabled=false`. |
| `tests/unit/test_discovery_scrape_queue.py` | Verifies scrape queue no-ops without touching sources and leaves candidate status unchanged. |
| `tests/unit/test_config.py` | Asserts local defaults to scraping enabled and CI overlay disables it. |
| `docs/operational_reference.md` | Documents the CI overlay’s “GH-never-scrapes” guardrail behavior. |
| `.agent-plan.md` | Updates mainline status/planning notes to reflect UNIFY-PR-04 completion and UNIFY-PR-05 next. |
Comment on lines +478 to +485

    if not config.scraping_enabled:
        # Body-fetching is disabled (e.g. on GitHub Actions, whose datacenter IPs
        # are bot-blocked); leave the candidates for a local run to materialize.
        logger.info(
            "Scraping disabled (scraping_enabled=false); skipping %s candidate(s).",
            len(candidates),
        )
        return CandidateScrapeBatch([], [], [], [], [], [])

Codecov Report
❌ Your patch check has failed because the patch coverage (87.50%) is below the target coverage (98.00%). You can increase the patch coverage or adjust the target coverage.
Additional details and impacted files
@@ Coverage Diff @@
## main #212 +/- ##
==========================================
- Coverage 92.81% 92.81% -0.01%
==========================================
Files 83 83
Lines 12268 12283 +15
==========================================
+ Hits 11387 11400 +13
- Misses 881 883 +2
Review of #212 found the guardrail incomplete: the discover and
backfill_discover jobs (which run on GitHub) fetch from source sites via
source-native discovery (`_run_source_native_discovery` ->
SourceDiscoveryAdapter.discover_candidates -> source.fetch), a path the first
cut did not cover. So `scraping_enabled=false` would still let GH's discover
workflow hammer the bot-blocked source sites, the opposite of the intent, and
the PR's "discover only searches, unaffected" claim was wrong.
Fix: gate inside `_run_source_native_discovery` and
`_run_source_native_backfill_discovery` (a single choke covering all callers,
addressing the root cause that a caller was missed). When scraping is disabled
the enabled-source list is emptied, so no source.fetch runs and an empty result
is returned; the Brave/Exa search engines are untouched, so GH still searches.
Other source.fetch() sites audited:
- diagnostics/source_health.py: the GH daily-review workflow already runs
`diagnose-sources --artifacts-only` (include_live=false), so it does no live
fetch on GH; that path also does not load the CI overlay, so a config gate
there would be ineffective anyway. Documented rather than gated.
- live_checks/runner.py, validation/collect.py: standalone CLI commands, not
scheduled GH jobs. Out of scope.
New test
test_run_source_native_discovery_skips_source_fetch_when_scraping_disabled
wires in a source whose fetch raises and asserts it is never called.
Docs and .agent-plan.md corrected to enumerate every gated fetch path (and the
artifacts-only note for diagnose-sources). Also fixed a stale
`scripts/lib/...` path reference left over from #211.
ruff/mypy clean; 1372 unit tests pass.
Co-Authored-By: Claude Opus 4.8 <[email protected]>
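The fix described in this commit, emptying the enabled-source list inside the discovery function itself, could be sketched as below. This is an illustrative reduction, not the actual `_run_source_native_discovery`: the `Source` protocol, the simplified `Config`, and the URL-list return type are all assumptions.

```python
from dataclasses import dataclass
from typing import Protocol


class Source(Protocol):
    name: str

    def fetch(self) -> list[str]: ...


@dataclass
class Config:
    scraping_enabled: bool = True


def run_source_native_discovery(config: Config, sources: list[Source]) -> list[str]:
    """Collect candidate URLs from each enabled source (simplified)."""
    # Gate at the single choke point: with scraping disabled the enabled-source
    # list is emptied, so the loop below never calls source.fetch().
    enabled = list(sources) if config.scraping_enabled else []
    urls: list[str] = []
    for source in enabled:
        urls.extend(source.fetch())
    return urls
```

Gating inside the function rather than at each call site is what addresses the missed-caller problem: a new caller cannot bypass the check by forgetting it.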
pr-agent-context report: This run includes an unresolved review comment and a patch coverage gap on PR #212.
For each unresolved review comment, recommend one of: resolve as irrelevant, accept and implement
the recommended solution, open a separate issue and resolve as out-of-scope for this PR, accept and
implement a different solution, or resolve as already treated by the code.
After I reply with my decision per item, implement the accepted actions, resolve the corresponding
PR comments, address the patch coverage gaps below, and push all of these changes in a single
commit.
# Copilot Comments
## COPILOT-1
Location: src/denbust/discovery/scrape_queue.py:485
URL: https://github.com/DataHackIL/tfht_enforce_idx/pull/212#discussion_r3406133207
Root author: copilot-pull-request-reviewer
Comment:
When `scraping_enabled` is false, this returns an empty `selected_candidates` list. Downstream, the pipeline treats `not scrape_batch.selected_candidates` as “no eligible candidates” and ends early (e.g. `run_news_scrape_candidates_job` returns "no queued candidates eligible for scrape"), which is misleading when candidates were actually present but intentionally skipped by the guardrail. Returning the input candidates as `selected_candidates` preserves the real selection count while still no-oping the fetch/update phases.
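To make the reviewer's point concrete, here is a small sketch of the suggested alternative: keep the input candidates in `selected_candidates` so downstream reporting can tell "nothing queued" apart from "skipped by the guardrail". The `Batch` type is a simplified stand-in (the real batch has six list fields whose names are not shown here), and every summary message except the quoted "no queued candidates eligible for scrape" is hypothetical.

```python
from dataclasses import dataclass, field


@dataclass
class Batch:
    # Only `selected_candidates` is named in the comment above; the other
    # field is a hypothetical placeholder for the remaining lists.
    selected_candidates: list[str] = field(default_factory=list)
    scraped_candidates: list[str] = field(default_factory=list)


def disabled_batch(candidates: list[str]) -> Batch:
    """Guardrail result that preserves the selection but fetches nothing."""
    return Batch(selected_candidates=list(candidates))


def summarize(batch: Batch, scraping_enabled: bool) -> str:
    """Produce a job summary that distinguishes 'skipped' from 'empty'."""
    if not batch.selected_candidates:
        return "no queued candidates eligible for scrape"
    if not scraping_enabled:
        count = len(batch.selected_candidates)
        return f"scraping disabled; skipped {count} candidate(s)"
    done = len(batch.scraped_candidates)
    return f"scraped {done} of {len(batch.selected_candidates)}"
```

Whether this is preferable to the empty batch depends on how other consumers interpret a non-empty selection with no scrape results, which the comment does not settle.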
# Patch coverage
Patch test coverage is 88.24%; please raise it to 100%. These are the uncovered code lines:
- src/denbust/pipeline.py: 876, 881
Run metadata:
## Summary
`UNIFY-PR-04`: the first (and lower-risk) half of the GH/local partition. GitHub Actions' datacenter IPs are bot-blocked by Israeli news sites, so CI must never fetch article bodies or source pages. The canonical state is scraped by local runs; GH only searches (Brave/Exa), classifies, and runs the deterministic phases (prefilter, gates, balanced selection, budget math).

Per the repo's "small, composable" preference I split the partition: this PR is the GH-never-scrapes guardrail; the search-backstop half ("skip GH search if local searched today") is `UNIFY-PR-05`.

## Change
A top-level `Config.scraping_enabled` master switch (default `True`), set `false` in `agents/news/ci.overlay.yaml`. Both body-fetch choke points respect it:

| Choke point | Behavior when `scraping_enabled=false` |
|---|---|
| `scrape_candidates()` (`discovery/scrape_queue.py`) | No-ops: returns an empty batch and leaves candidates for a local run to materialize. Covers ingest source-native, `scrape_candidates`, and `backfill_scrape` (all route through it). |
| The ingest job's `fetch_all_sources` call (`pipeline.py`) | Skipped; records a `scraping_disabled=true` warning. |

The `discover` / `backfill_discover` jobs only search, so they're unaffected; GH search keeps working.

## Why a guardrail (and why it's safe to merge)
The operational workflows are still `workflow_dispatch`-only (disabled), so this takes effect only when they're enabled. It's a safety net so that even if a scraping job is ever dispatched on GH, it won't hammer (and get blocked by) the source sites. Inert until enablement.

## Tests
- The CI-overlay config test asserts the flip: `local.scraping_enabled is True`, `ci.scraping_enabled is False`.
- `test_scrape_candidates_no_op_when_scraping_disabled`: a source whose `fetch` raises is wired in; the gate means it's never touched, the batch is empty, and the candidate is left `NEW`.
- `test_run_pipeline_async_skips_source_fetch_when_scraping_disabled`: `fetch_all_sources` wired to raise; the disabled ingest never calls it and records the warning.

ruff / mypy clean; 1371 unit tests pass.

## Notes
`agents/news/ci.overlay.yaml` is still the single, auditable place CI differs; this adds one key to it.

🤖 Generated with Claude Code