feat(pipeline): GH-never-scrapes guardrail (scraping_enabled switch)#212

Merged
shaypal5 merged 2 commits into main from codex/gh-never-scrapes on Jun 13, 2026

@shaypal5

Member

Summary

UNIFY-PR-04 — the first (and lower-risk) half of the GH/local partition. GitHub Actions' datacenter IPs are bot-blocked by Israeli news sites, so CI must never fetch article bodies or source pages. The canonical state is scraped by local runs; GH only searches (Brave/Exa), classifies, and runs the deterministic phases (prefilter, gates, balanced selection, budget math).

Per the repo's "small, composable" preference I split the partition: this PR is the GH-never-scrapes guardrail; the search-backstop half ("skip GH search if local searched today") is UNIFY-PR-05.

Change

Adds a top-level Config.scraping_enabled master switch (default True), set to false in agents/news/ci.overlay.yaml. Both body-fetch choke points respect it:

| Choke point | Behavior when scraping_enabled=false |
| --- | --- |
| scrape_candidates() (discovery/scrape_queue.py) | No-ops: returns an empty batch, candidates left for a local run. Covers ingest source-native, scrape_candidates, and backfill_scrape (all route through it). |
| ingest fetch_all_sources call (pipeline.py) | Skipped; records a scraping_disabled=true warning. |
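
The gating pattern in the table can be sketched roughly as follows. This is a minimal illustration, not the repository's code: `ScrapeBatch`, its fields, and the `fetch` callable are hypothetical stand-ins (the real `CandidateScrapeBatch` has more fields, and the real `Config` has many more settings).

```python
from dataclasses import dataclass, field


@dataclass
class Config:
    # Mirrors the PR's master switch; everything else on the real Config is omitted.
    scraping_enabled: bool = True


@dataclass
class ScrapeBatch:
    # Hypothetical, simplified stand-in for CandidateScrapeBatch.
    selected: list = field(default_factory=list)
    scraped: list = field(default_factory=list)


def scrape_candidates(config, candidates, fetch):
    """Fetch a body for each candidate unless the master switch is off."""
    if not config.scraping_enabled:
        # Guardrail: return an empty batch without ever calling fetch.
        return ScrapeBatch()
    return ScrapeBatch(
        selected=list(candidates),
        scraped=[fetch(candidate) for candidate in candidates],
    )
```

With `Config()` the function fetches as usual; with `Config(scraping_enabled=False)` it returns an empty batch and the fetch callable is never invoked, which is the property the guardrail relies on.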

The discover / backfill_discover jobs only search, so they're unaffected — GH search keeps working.

Why a guardrail (and why it's safe to merge)

The operational workflows are still workflow_dispatch-only (disabled), so this takes effect only when they're enabled. It's a safety net so that even if a scraping job is ever dispatched on GH, it won't hammer (and get blocked by) the source sites. Inert until enablement.

Tests

  • CI-overlay config test now asserts the flip: local.scraping_enabled is True, ci.scraping_enabled is False.
  • test_scrape_candidates_no_op_when_scraping_disabled: a source whose fetch raises is wired in; the gate means it's never touched, the batch is empty, and the candidate is left NEW.
  • test_run_pipeline_async_skips_source_fetch_when_scraping_disabled: fetch_all_sources wired to raise; the disabled ingest never calls it and records the warning.
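
The "fetch wired to raise" technique used by these tests can be sketched as below. All names here (`ExplodingSource`, `scrape`, `Status`, the dict-shaped candidate) are hypothetical simplifications for illustration; the actual tests exercise the project's own types.

```python
from enum import Enum


class Status(Enum):
    NEW = "new"
    SCRAPED = "scraped"


class ExplodingSource:
    """A source whose fetch fails loudly, so any call is detected."""

    def fetch(self, url):
        raise AssertionError("fetch must not be called when scraping is disabled")


def scrape(candidates, source, scraping_enabled):
    # Hypothetical gate: return early before touching the source.
    if not scraping_enabled:
        return []
    scraped = []
    for candidate in candidates:
        candidate["body"] = source.fetch(candidate["url"])
        candidate["status"] = Status.SCRAPED
        scraped.append(candidate)
    return scraped


def test_no_op_when_scraping_disabled():
    candidate = {"url": "https://example.com/a", "status": Status.NEW}
    batch = scrape([candidate], ExplodingSource(), scraping_enabled=False)
    assert batch == []                        # nothing was scraped
    assert candidate["status"] is Status.NEW  # candidate left for a local run
```

Because the source raises on any call, the test passing is itself evidence that the gate short-circuits before the fetch, without needing a mock library to count calls.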

ruff / mypy clean; 1371 unit tests pass.

Notes

  • The classification-queue decouple (let GH classify already-scraped candidates) remains parked — out of scope here.
  • agents/news/ci.overlay.yaml is still the single, auditable place CI differs; this adds one key to it.
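
As a rough illustration of why a single overlay file keeps the CI difference auditable, the sketch below merges an overlay onto a base config using plain dicts. The merge helper and the base keys other than `scraping_enabled` are assumptions made for the example; how the project actually loads and applies `ci.overlay.yaml` is not shown in this PR.

```python
def apply_overlay(base: dict, overlay: dict) -> dict:
    """Return a copy of base with overlay values applied, merging nested dicts."""
    merged = dict(base)
    for key, value in overlay.items():
        if isinstance(value, dict) and isinstance(merged.get(key), dict):
            merged[key] = apply_overlay(merged[key], value)
        else:
            merged[key] = value
    return merged


# Hypothetical base config; only scraping_enabled comes from the PR.
base = {"scraping_enabled": True, "search": {"engines": ["brave", "exa"]}}

# The CI overlay differs from the base by this one key.
ci_overlay = {"scraping_enabled": False}

local_config = apply_overlay(base, {})
ci_config = apply_overlay(base, ci_overlay)
```

Under these assumptions the local config keeps scraping on, the CI config turns it off, and the search settings are identical in both, matching the intent that GH still searches.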

🤖 Generated with Claude Code

…witch)

UNIFY-PR-04 — first half of the GH/local partition. GitHub Actions' datacenter
IPs are bot-blocked by Israeli news sites, so CI must never fetch article bodies
or source pages; the canonical state is scraped by local runs, and GH only
searches, classifies, and runs the deterministic phases.

Adds a top-level `Config.scraping_enabled` master switch (default True), set
`false` in agents/news/ci.overlay.yaml. Both body-fetch choke points respect it:
  - `scrape_candidates()` no-ops (returns an empty batch, leaving candidates for
    a local run to materialize) — covers the ingest source-native, scrape_candidates,
    and backfill_scrape paths, which all route through it;
  - the ingest job's `fetch_all_sources` call is skipped (records a
    `scraping_disabled=true` warning).

The discover / backfill_discover jobs only search, so they are unaffected — GH
search keeps working. This is a guardrail: the operational workflows are still
`workflow_dispatch`-only, so it takes effect when they are enabled.

The search-backstop half (skip GH search when local already searched that day)
follows as UNIFY-PR-05.

Tests: the CI-overlay config test now asserts the flip (local True -> CI False);
scrape_candidates no-ops with a fetch-raising source wired in (proving nothing is
fetched and the candidate is left untouched); the ingest job skips
fetch_all_sources (wired to raise) and records the warning. ruff/mypy clean;
1371 unit tests pass.

Co-Authored-By: Claude Opus 4.8 <[email protected]>
Copilot AI review requested due to automatic review settings June 12, 2026 20:38
@shaypal5 shaypal5 added this to the Local↔CI Unification milestone Jun 12, 2026

Copilot AI left a comment

Contributor

Pull request overview

This PR introduces a repo-wide guardrail to ensure GitHub Actions never scrapes source sites/article bodies (due to bot-blocked datacenter IPs), while still allowing CI to run search/classification and deterministic pipeline phases.

Changes:

  • Add Config.scraping_enabled (default True) and set it to false in agents/news/ci.overlay.yaml to enforce “GH never scrapes”.
  • Gate the two main body-fetch choke points: fetch_all_sources during ingest and scrape_candidates() during candidate materialization.
  • Add unit tests + docs updates to assert and explain the CI/local behavioral partition.

Reviewed changes

Copilot reviewed 9 out of 9 changed files in this pull request and generated 1 comment.

Show a summary per file
| File | Description |
| --- | --- |
| src/denbust/config.py | Adds scraping_enabled master switch to top-level config. |
| agents/news/ci.overlay.yaml | Flips scraping_enabled: false for CI overlay. |
| src/denbust/pipeline.py | Skips fetch_all_sources when scraping is disabled and records a warning. |
| src/denbust/discovery/scrape_queue.py | No-ops scrape_candidates() when scraping is disabled. |
| tests/unit/test_pipeline_core.py | Verifies ingest path skips source fetch and records warning under scraping_enabled=false. |
| tests/unit/test_discovery_scrape_queue.py | Verifies scrape queue no-ops without touching sources and leaves candidate status unchanged. |
| tests/unit/test_config.py | Asserts local defaults to scraping enabled and CI overlay disables it. |
| docs/operational_reference.md | Documents the CI overlay’s “GH-never-scrapes” guardrail behavior. |
| .agent-plan.md | Updates mainline status/planning notes to reflect UNIFY-PR-04 completion and UNIFY-PR-05 next. |

Comment on lines +478 to +485
    if not config.scraping_enabled:
        # Body-fetching is disabled (e.g. on GitHub Actions, whose datacenter IPs
        # are bot-blocked); leave the candidates for a local run to materialize.
        logger.info(
            "Scraping disabled (scraping_enabled=false); skipping %s candidate(s).",
            len(candidates),
        )
        return CandidateScrapeBatch([], [], [], [], [], [])

@codecov

codecov Bot commented Jun 12, 2026


Codecov Report

❌ Patch coverage is 87.50000% with 2 lines in your changes missing coverage. Please review.
✅ Project coverage is 92.81%. Comparing base (830fdb1) to head (81eea2a).

| Files with missing lines | Patch % | Lines |
| --- | --- | --- |
| src/denbust/pipeline.py | 81.81% | 2 Missing ⚠️ |

❌ Your patch check has failed because the patch coverage (87.50%) is below the target coverage (98.00%). You can increase the patch coverage or adjust the target coverage.

Additional details and impacted files
@@            Coverage Diff             @@
##             main     #212      +/-   ##
==========================================
- Coverage   92.81%   92.81%   -0.01%     
==========================================
  Files          83       83              
  Lines       12268    12283      +15     
==========================================
+ Hits        11387    11400      +13     
- Misses        881      883       +2     

| Files with missing lines | Coverage Δ |
| --- | --- |
| src/denbust/discovery/scrape_queue.py | 94.63% <100.00%> (+0.09%) ⬆️ |
| src/denbust/pipeline.py | 95.38% <81.81%> (-0.11%) ⬇️ |

Review of #212 found the guardrail incomplete: the discover and
backfill_discover jobs (which run on GitHub) fetch from source sites via
source-native discovery (`_run_source_native_discovery` ->
SourceDiscoveryAdapter.discover_candidates -> source.fetch), a path the first
cut did not cover. So `scraping_enabled=false` would still let GH's discover
workflow hammer the bot-blocked source sites — the opposite of the intent, and
the PR's "discover only searches, unaffected" claim was wrong.

Fix: gate inside `_run_source_native_discovery` and
`_run_source_native_backfill_discovery` (a single choke covering all callers,
addressing the root cause that a caller was missed). When scraping is disabled
the enabled-source list is emptied, so no source.fetch runs and an empty result
is returned; the Brave/Exa search engines are untouched, so GH still searches.
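
The fix described above, emptying the enabled-source list when scraping is disabled, might look roughly like this. The `Source` dataclass and the synchronous function are hypothetical simplifications; the real `_run_source_native_discovery` goes through `SourceDiscoveryAdapter.discover_candidates` and its signature is not shown here.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass
class Source:
    # Hypothetical stand-in for a configured news source.
    name: str
    fetch: Callable[[], list]
    enabled: bool = True


def run_source_native_discovery(sources, scraping_enabled):
    """Collect candidates from enabled sources, or none if scraping is off."""
    enabled = [source for source in sources if source.enabled]
    if not scraping_enabled:
        # Single choke point: with no sources left, no fetch can run,
        # whichever job (discover, backfill_discover, ...) called us.
        enabled = []
    found = []
    for source in enabled:
        found.extend(source.fetch())
    return found
```

Gating inside the shared function rather than at each call site is the design choice the commit message argues for: a caller that was overlooked still hits the gate.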

Other source.fetch() sites audited:
  - diagnostics/source_health.py: the GH daily-review workflow already runs
    `diagnose-sources --artifacts-only` (include_live=false), so it does no live
    fetch on GH; that path also does not load the CI overlay, so a config gate
    there would be ineffective anyway. Documented rather than gated.
  - live_checks/runner.py, validation/collect.py: standalone CLI commands, not
    scheduled GH jobs. Out of scope.

New test test_run_source_native_discovery_skips_source_fetch_when_scraping_disabled
wires in a source whose fetch raises and asserts it is never called. Docs and
.agent-plan.md corrected to enumerate every gated fetch path (and the
artifacts-only note for diagnose-sources). Also fixed a stale
`scripts/lib/...` path reference left over from #211. ruff/mypy clean; 1372 unit
tests pass.

Co-Authored-By: Claude Opus 4.8 <[email protected]>
@github-actions


pr-agent-context report:

This run includes an unresolved review comment and a patch coverage gap on PR #212.

For each unresolved review comment, recommend one of: resolve as irrelevant, accept and implement
the recommended solution, open a separate issue and resolve as out-of-scope for this PR, accept and
implement a different solution, or resolve as already treated by the code.

After I reply with my decision per item, implement the accepted actions, resolve the corresponding
PR comments, address the patch coverage gaps below, and push all of these changes in a single
commit.

# Copilot Comments

## COPILOT-1
Location: src/denbust/discovery/scrape_queue.py:485
URL: https://github.com/DataHackIL/tfht_enforce_idx/pull/212#discussion_r3406133207
Root author: copilot-pull-request-reviewer

Comment:
    When `scraping_enabled` is false, this returns an empty `selected_candidates` list. Downstream, the pipeline treats `not scrape_batch.selected_candidates` as “no eligible candidates” and ends early (e.g. `run_news_scrape_candidates_job` returns "no queued candidates eligible for scrape"), which is misleading when candidates were actually present but intentionally skipped by the guardrail. Returning the input candidates as `selected_candidates` preserves the real selection count while still no-oping the fetch/update phases.
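
    The difference the reviewer describes can be illustrated with a small sketch. `Batch`, `gate_empty`, `gate_preserving`, and `job_summary` are hypothetical names for this example only; just the `selected_candidates` field and the quoted "no queued candidates eligible for scrape" message come from the comment.

```python
from dataclasses import dataclass, field


@dataclass
class Batch:
    # Hypothetical two-field simplification of CandidateScrapeBatch.
    selected_candidates: list = field(default_factory=list)
    scraped: list = field(default_factory=list)


def gate_empty(candidates):
    """First-cut behavior: return a fully empty batch."""
    return Batch()


def gate_preserving(candidates):
    """Reviewer's suggestion: keep the selection, skip the fetch."""
    return Batch(selected_candidates=list(candidates))


def job_summary(batch):
    # Downstream check that treats an empty selection as "nothing eligible".
    if not batch.selected_candidates:
        return "no queued candidates eligible for scrape"
    return (
        f"selected {len(batch.selected_candidates)}, "
        f"scraped {len(batch.scraped)}"
    )
```

    With two queued candidates, the empty-batch gate reports that nothing was eligible, while the selection-preserving gate reports that two were selected and none scraped, which is closer to what actually happened.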

# Patch coverage

Patch test coverage is 88.24%; please raise it to 100%. These are the uncovered code lines:
- src/denbust/pipeline.py: 876, 881

Run metadata:

Tool ref: v4.0.19
Tool version: 4.0.19
Trigger: commit pushed
Workflow run: 27459546332 attempt 1
Comment timestamp: 2026-06-13T06:53:25.825140+00:00
PR head commit: 81eea2a8a073d1a04a5e0944f0dca02814e12750

@shaypal5 shaypal5 merged commit 305936c into main Jun 13, 2026
11 checks passed
@shaypal5 shaypal5 deleted the codex/gh-never-scrapes branch June 13, 2026 07:08