feat(pipeline): GH-never-scrapes guardrail (scraping_enabled switch) by shaypal5 · Pull Request #212 · DataHackIL/tfht_enforce_idx

shaypal5 · 2026-06-12T20:38:29Z

Summary

UNIFY-PR-04 — the first (and lower-risk) half of the GH/local partition. GitHub Actions' datacenter IPs are bot-blocked by Israeli news sites, so CI must never fetch article bodies or source pages. The canonical state is scraped by local runs; GH only searches (Brave/Exa), classifies, and runs the deterministic phases (prefilter, gates, balanced selection, budget math).

Per the repo's "small, composable" preference I split the partition: this PR is the GH-never-scrapes guardrail; the search-backstop half ("skip GH search if local searched today") is UNIFY-PR-05.

Change

A top-level Config.scraping_enabled master switch (default True), set false in agents/news/ci.overlay.yaml. Both body-fetch choke points respect it:

| Choke point | Behavior when `scraping_enabled=false` |
| --- | --- |
| `scrape_candidates()` (discovery/scrape_queue.py) | No-ops: returns an empty batch, candidates left for a local run. Covers ingest source-native, `scrape_candidates`, and `backfill_scrape` (all route through it). |
| ingest `fetch_all_sources` call (pipeline.py) | Skipped; records a `scraping_disabled=true` warning. |
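A minimal sketch of how a default-on master switch and a one-key CI overlay could fit together. The `Config` class below is a stand-in (the repo's real class is not shown here), and `apply_overlay` is a hypothetical helper invented for illustration; only the `scraping_enabled` field name and its default come from the description above.

```python
from dataclasses import dataclass, replace
from typing import Any


@dataclass(frozen=True)
class Config:
    """Hypothetical stand-in for the repo's top-level config."""

    scraping_enabled: bool = True  # default True, as the PR describes


def apply_overlay(base: Config, overlay: dict[str, Any]) -> Config:
    """Hypothetical helper: return a copy of `base` with the overlay keys applied."""
    return replace(base, **overlay)


# Local runs keep the default; the CI overlay flips the single key.
local = Config()
ci = apply_overlay(local, {"scraping_enabled": False})
```

Keeping the switch default-on means only the overlay has to change for CI, which matches the note that the overlay stays the single place where CI differs.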

The discover / backfill_discover jobs only search, so they're unaffected — GH search keeps working.

Why a guardrail (and why it's safe to merge)

The operational workflows are still workflow_dispatch-only (disabled), so this takes effect only when they're enabled. It's a safety net so that even if a scraping job is ever dispatched on GH, it won't hammer (and get blocked by) the source sites. Inert until enablement.

Tests

- CI-overlay config test now asserts the flip: local.scraping_enabled is True, ci.scraping_enabled is False.
- test_scrape_candidates_no_op_when_scraping_disabled: a source whose fetch raises is wired in; the gate means it's never touched, the batch is empty, and the candidate is left NEW.
- test_run_pipeline_async_skips_source_fetch_when_scraping_disabled: fetch_all_sources wired to raise; the disabled ingest never calls it and records the warning.
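The "fetch raises" pattern these tests rely on can be sketched as follows. Everything here is illustrative: `Candidate`, `RaisingSource`, and `scrape_candidates_sketch` are hypothetical names, not the repo's types, and the status strings other than `NEW` are assumptions.

```python
from dataclasses import dataclass


@dataclass
class Candidate:
    """Hypothetical candidate record; only the NEW status comes from the PR text."""

    url: str
    status: str = "NEW"


class RaisingSource:
    """Test double: any fetch counts as a test failure."""

    def fetch(self, url: str) -> str:
        raise AssertionError(f"unexpected fetch of {url}")


def scrape_candidates_sketch(candidates, source, *, scraping_enabled: bool):
    """Simplified stand-in for a gated scrape step."""
    if not scraping_enabled:
        # Gate first, so the source is never touched and candidates stay as they were.
        return []
    scraped = []
    for candidate in candidates:
        source.fetch(candidate.url)
        candidate.status = "SCRAPED"  # assumed status name
        scraped.append(candidate)
    return scraped


def test_no_op_when_scraping_disabled() -> None:
    candidate = Candidate(url="https://example.invalid/article")
    batch = scrape_candidates_sketch([candidate], RaisingSource(), scraping_enabled=False)
    assert batch == []
    assert candidate.status == "NEW"


test_no_op_when_scraping_disabled()
```

A raising double is stricter than a mock with a call counter: if the gate regresses, the test fails at the exact call site instead of at a later assertion.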

ruff / mypy clean; 1371 unit tests pass.

Notes

- The classification-queue decouple (let GH classify already-scraped candidates) remains parked; it is out of scope here.
- agents/news/ci.overlay.yaml is still the single, auditable place CI differs; this adds one key to it.

🤖 Generated with Claude Code

…witch)

UNIFY-PR-04: first half of the GH/local partition. GitHub Actions' datacenter IPs are bot-blocked by Israeli news sites, so CI must never fetch article bodies or source pages; the canonical state is scraped by local runs, and GH only searches, classifies, and runs the deterministic phases.

Adds a top-level `Config.scraping_enabled` master switch (default True), set `false` in agents/news/ci.overlay.yaml. Both body-fetch choke points respect it:
- `scrape_candidates()` no-ops (returns an empty batch, leaving candidates for a local run to materialize); this covers the ingest source-native, scrape_candidates, and backfill_scrape paths, which all route through it;
- the ingest job's `fetch_all_sources` call is skipped (records a `scraping_disabled=true` warning).

The discover / backfill_discover jobs only search, so they are unaffected; GH search keeps working.

This is a guardrail: the operational workflows are still `workflow_dispatch`-only, so it takes effect when they are enabled. The search-backstop half (skip GH search when local already searched that day) follows as UNIFY-PR-05.

Tests: the CI-overlay config test now asserts the flip (local True -> CI False); scrape_candidates no-ops with a fetch-raising source wired in (proving nothing is fetched and the candidate is left untouched); the ingest job skips fetch_all_sources (wired to raise) and records the warning.

ruff/mypy clean; 1371 unit tests pass.

Co-Authored-By: Claude Opus 4.8 <[email protected]>

Copilot

Pull request overview

This PR introduces a repo-wide guardrail to ensure GitHub Actions never scrapes source sites/article bodies (due to bot-blocked datacenter IPs), while still allowing CI to run search/classification and deterministic pipeline phases.

Changes:

- Add Config.scraping_enabled (default True) and set it to false in agents/news/ci.overlay.yaml to enforce “GH never scrapes”.
- Gate the two main body-fetch choke points: fetch_all_sources during ingest and scrape_candidates() during candidate materialization.
- Add unit tests + docs updates to assert and explain the CI/local behavioral partition.

Reviewed changes

Copilot reviewed 9 out of 9 changed files in this pull request and generated 1 comment.

Show a summary per file

| File | Description |
| --- | --- |
| `src/denbust/config.py` | Adds `scraping_enabled` master switch to top-level config. |
| `agents/news/ci.overlay.yaml` | Flips `scraping_enabled: false` for CI overlay. |
| `src/denbust/pipeline.py` | Skips `fetch_all_sources` when scraping is disabled and records a warning. |
| `src/denbust/discovery/scrape_queue.py` | No-ops `scrape_candidates()` when scraping is disabled. |
| `tests/unit/test_pipeline_core.py` | Verifies ingest path skips source fetch and records warning under `scraping_enabled=false`. |
| `tests/unit/test_discovery_scrape_queue.py` | Verifies scrape queue no-ops without touching sources and leaves candidate status unchanged. |
| `tests/unit/test_config.py` | Asserts local defaults to scraping enabled and CI overlay disables it. |
| `docs/operational_reference.md` | Documents the CI overlay’s “GH-never-scrapes” guardrail behavior. |
| `.agent-plan.md` | Updates mainline status/planning notes to reflect `UNIFY-PR-04` completion and `UNIFY-PR-05` next. |

+    if not config.scraping_enabled:
+        # Body-fetching is disabled (e.g. on GitHub Actions, whose datacenter IPs
+        # are bot-blocked); leave the candidates for a local run to materialize.
+        logger.info(
+            "Scraping disabled (scraping_enabled=false); skipping %s candidate(s).",
+            len(candidates),
+        )
+        return CandidateScrapeBatch([], [], [], [], [], [])


codecov · 2026-06-12T20:42:23Z

Codecov Report

❌ Patch coverage is 87.50000% with 2 lines in your changes missing coverage. Please review.
✅ Project coverage is 92.81%. Comparing base (830fdb1) to head (81eea2a).

| Files with missing lines | Patch % | Lines |
| --- | --- | --- |
| src/denbust/pipeline.py | 81.81% | 2 Missing ⚠️ |

❌ Your patch check has failed because the patch coverage (87.50%) is below the target coverage (98.00%). You can increase the patch coverage or adjust the target coverage.

Additional details and impacted files

@@            Coverage Diff             @@
##             main     #212      +/-   ##
==========================================
- Coverage   92.81%   92.81%   -0.01%     
==========================================
  Files          83       83              
  Lines       12268    12283      +15     
==========================================
+ Hits        11387    11400      +13     
- Misses        881      883       +2

| Files with missing lines | Coverage Δ | |
| --- | --- | --- |
| src/denbust/discovery/scrape_queue.py | `94.63% <100.00%> (+0.09%)` | ⬆️ |
| src/denbust/pipeline.py | `95.38% <81.81%> (-0.11%)` | ⬇️ |
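The Coverage row in the diff shows the same percentage for base and head with a small negative delta. A rough arithmetic check, using only the Hits and Lines figures from the diff, shows why; how Codecov rounds its displayed percentages is not stated here, so the sketch does not try to reproduce them digit for digit.

```python
def coverage_pct(hits: int, lines: int) -> float:
    """Project coverage as a percentage of covered lines."""
    return 100.0 * hits / lines


# Hits and Lines taken from the Coverage Diff above (base vs. head).
base = coverage_pct(11387, 12268)
head = coverage_pct(11400, 12283)
delta = head - base

# Both values sit near 92.8%; the drop is under a hundredth of a point.
print(f"base={base:.4f}% head={head:.4f}% delta={delta:+.4f}")
```

Adding 15 lines with 13 of them covered nudges the project ratio down slightly, which is consistent with the reported `-0.01%`.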


Review of #212 found the guardrail incomplete: the discover and backfill_discover jobs (which run on GitHub) fetch from source sites via source-native discovery (`_run_source_native_discovery` -> SourceDiscoveryAdapter.discover_candidates -> source.fetch), a path the first cut did not cover. So `scraping_enabled=false` would still let GH's discover workflow hammer the bot-blocked source sites, the opposite of the intent, and the PR's "discover only searches, unaffected" claim was wrong.

Fix: gate inside `_run_source_native_discovery` and `_run_source_native_backfill_discovery` (a single choke covering all callers, addressing the root cause that a caller was missed). When scraping is disabled the enabled-source list is emptied, so no source.fetch runs and an empty result is returned; the Brave/Exa search engines are untouched, so GH still searches.

Other source.fetch() sites audited:
- diagnostics/source_health.py: the GH daily-review workflow already runs `diagnose-sources --artifacts-only` (include_live=false), so it does no live fetch on GH; that path also does not load the CI overlay, so a config gate there would be ineffective anyway. Documented rather than gated.
- live_checks/runner.py, validation/collect.py: standalone CLI commands, not scheduled GH jobs. Out of scope.

New test test_run_source_native_discovery_skips_source_fetch_when_scraping_disabled wires in a source whose fetch raises and asserts it is never called. Docs and .agent-plan.md corrected to enumerate every gated fetch path (and the artifacts-only note for diagnose-sources). Also fixed a stale `scripts/lib/...` path reference left over from #211.

ruff/mypy clean; 1372 unit tests pass.

Co-Authored-By: Claude Opus 4.8 <[email protected]>
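The "empty the enabled-source list" fix described in this commit can be sketched roughly as below. This is an assumption-laden illustration, not the repo's code: `FakeSource` and `run_source_native_discovery_sketch` are hypothetical, and the real functions take a config object and an adapter layer that are not shown on this page.

```python
from dataclasses import dataclass, field


@dataclass
class FakeSource:
    """Hypothetical source double that counts fetch calls."""

    name: str
    urls: list[str] = field(default_factory=list)
    fetch_calls: int = 0

    def fetch(self) -> list[str]:
        self.fetch_calls += 1
        return list(self.urls)


def run_source_native_discovery_sketch(sources, *, scraping_enabled: bool) -> list[str]:
    """Gate inside the discovery function so every caller is covered."""
    # With scraping disabled the enabled list is empty, so the loop never fetches.
    enabled = list(sources) if scraping_enabled else []
    discovered: list[str] = []
    for source in enabled:
        discovered.extend(source.fetch())
    return discovered


source = FakeSource("demo", ["https://example.invalid/1"])
assert run_source_native_discovery_sketch([source], scraping_enabled=False) == []
assert source.fetch_calls == 0
```

Placing the gate inside the callee rather than at each call site is what keeps a newly added caller from bypassing it.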

github-actions · 2026-06-13T06:54:14Z

pr-agent-context report:

This run includes an unresolved review comment and a patch coverage gap on PR #212.

For each unresolved review comment, recommend one of: resolve as irrelevant, accept and implement
the recommended solution, open a separate issue and resolve as out-of-scope for this PR, accept and
implement a different solution, or resolve as already treated by the code.

After I reply with my decision per item, implement the accepted actions, resolve the corresponding
PR comments, address the patch coverage gaps below, and push all of these changes in a single
commit.

# Copilot Comments

## COPILOT-1
Location: src/denbust/discovery/scrape_queue.py:485
URL: https://github.com/DataHackIL/tfht_enforce_idx/pull/212#discussion_r3406133207
Root author: copilot-pull-request-reviewer

Comment:
    When `scraping_enabled` is false, this returns an empty `selected_candidates` list. Downstream, the pipeline treats `not scrape_batch.selected_candidates` as “no eligible candidates” and ends early (e.g. `run_news_scrape_candidates_job` returns "no queued candidates eligible for scrape"), which is misleading when candidates were actually present but intentionally skipped by the guardrail. Returning the input candidates as `selected_candidates` preserves the real selection count while still no-oping the fetch/update phases.
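The difference between the two gate behaviors in this comment can be sketched as follows. Only the `selected_candidates` field and the "no queued candidates eligible for scrape" message come from the comment; `Batch`, the `scraped` field, both gate functions, and `summarize` are hypothetical, and nothing on this page says which option was ultimately adopted.

```python
from dataclasses import dataclass, field


@dataclass
class Batch:
    """Hypothetical, reduced stand-in for the scrape batch."""

    selected_candidates: list[str] = field(default_factory=list)
    scraped: list[str] = field(default_factory=list)  # assumed field name


def gate_returning_empty(candidates: list[str]) -> Batch:
    """Disabled-scraping gate as in the reviewed diff: everything empty."""
    return Batch()


def gate_preserving_selection(candidates: list[str]) -> Batch:
    """Reviewer's suggestion: keep the selection, skip fetch and update."""
    return Batch(selected_candidates=list(candidates))


def summarize(batch: Batch) -> str:
    """Simplified downstream check modeled on the behavior the comment describes."""
    if not batch.selected_candidates:
        return "no queued candidates eligible for scrape"
    return f"{len(batch.selected_candidates)} selected, {len(batch.scraped)} scraped"


queued = ["https://example.invalid/a", "https://example.invalid/b"]
print(summarize(gate_returning_empty(queued)))
print(summarize(gate_preserving_selection(queued)))
```

With the empty batch the job reports that nothing was eligible, while the preserving variant reports two selected and none scraped, which is closer to what actually happened.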

# Patch coverage

Patch test coverage is 88.24%; please raise it to 100%. These are the uncovered code lines:
- src/denbust/pipeline.py: 876, 881

Run metadata:

Tool ref: v4.0.19
Tool version: 4.0.19
Trigger: commit pushed
Workflow run: 27459546332 attempt 1
Comment timestamp: 2026-06-13T06:53:25.825140+00:00
PR head commit: 81eea2a8a073d1a04a5e0944f0dca02814e12750

Copilot AI review requested due to automatic review settings June 12, 2026 20:38

shaypal5 added this to the Local↔CI Unification milestone Jun 12, 2026

shaypal5 added the configuration label Jun 12, 2026

Copilot started reviewing on behalf of shaypal5 June 12, 2026 20:39

Copilot AI reviewed Jun 12, 2026


shaypal5 mentioned this pull request Jun 13, 2026

Decouple scrape→classify so GitHub can drain the classification queue #213

Open

shaypal5 merged commit 305936c into main Jun 13, 2026
11 checks passed

shaypal5 deleted the codex/gh-never-scrapes branch June 13, 2026 07:08

shaypal5 merged 2 commits into main from codex/gh-never-scrapes
