feat(discovery): GitHub search backstop (only search if local was idle) by shaypal5 · Pull Request #214 · DataHackIL/tfht_enforce_idx

shaypal5 · 2026-06-13T07:16:25Z

Summary

UNIFY-PR-05 — the second half of the GH/local partition, completing its code side. Brave/Exa are paid (~1,000 free queries/month each), so the day's search budget should be spent once. GitHub should search only as a backstop: when a local run did not already search that calendar day.

Change

- SearchBudgetLedger.searched_on(day=..., engine=...): True if a search (queries > 0) was recorded on the given UTC calendar day. A budget-skipped run that logged 0 queries doesn't count.
- DiscoveryConfig.search_backstop_only (default False), set true in agents/news/ci.overlay.yaml. When True, run_news_discover_job checks the ledger before issuing engine queries and skips Brave/Exa/Google CSE for the day if a search was already recorded (by any run, local or CI), recording a search_backstop_skipped=true warning. Source-native discovery and the deterministic phases are unaffected.

Local runs search on their own cadence and take priority; GH searches only on a day local was idle, then records its spend so a later GH run the same day also skips.

A fix the backstop surfaced

The post-run check failed_runs == len(persisted_runs) raised a fatal error on 0 == 0 when the backstop skipped all engines (a zero-run day). It is now guarded with persisted_runs and …, so "nothing ran" is non-fatal while "all attempted runs failed" is still fatal.
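The guard can be sketched in isolation as follows. The helper name `all_attempted_runs_failed` and the list-of-strings stand-in for `persisted_runs` are hypothetical; the real check lives inline in the discover job and may be shaped differently.

```python
def all_attempted_runs_failed(persisted_runs: list[str], failed_runs: int) -> bool:
    """True only when at least one run was attempted and every one failed."""
    # Without the emptiness check, a zero-run day evaluates 0 == 0 and
    # would be reported as "all runs failed".
    return bool(persisted_runs) and failed_runs == len(persisted_runs)


zero_run_day = all_attempted_runs_failed([], 0)
every_run_failed = all_attempted_runs_failed(["brave", "exa"], 2)
partial_failure = all_attempted_runs_failed(["brave", "exa"], 1)
```

Here `zero_run_day` and `partial_failure` are False, and only `every_run_failed` is True.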

Tests

- searched_on: same-day detection, zero-query exclusion, UTC bucketing (a 23:30-05:00 record counts as the next UTC day).
- CI-overlay config test now asserts the flip (local False → ci True).
- Discover-job behavioral tests: skips search when a search is already recorded today (engine never awaited, warning set, non-fatal); runs search when the ledger is empty.
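The UTC-bucketing case in the first test can be worked through with the standard library alone. The specific date below is made up for illustration; only the 23:30 at UTC-05:00 offset comes from the test description.

```python
from datetime import datetime, timedelta, timezone

# 23:30 on 12 June at UTC-05:00 is 04:30 on 13 June in UTC.
minus_five = timezone(timedelta(hours=-5))
recorded_at = datetime(2026, 6, 12, 23, 30, tzinfo=minus_five)

local_day = recorded_at.date()
utc_day = recorded_at.astimezone(timezone.utc).date()

print(local_day)  # 2026-06-12
print(utc_day)    # 2026-06-13
```

Bucketing by `utc_day` rather than `local_day` is what makes the record count toward the following calendar day.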

ruff/mypy clean; 1376 unit tests pass.

Scope / sequencing

- This completes the code side of the partition (UNIFY-PR-04 = no-scrape guardrail; this = search-backstop).
- Re-enabling the workflows + seeding the state repo is the operational follow-up (UNIFY-PR-06); confirm STATE_REPO_PAT push/force-push rights first.
- The parked scrape→classify decouple is tracked as #213.
- backfill_discover (a manual operator recovery job) is intentionally not backstop-gated.
- Guardrail is inert until the workflows are enabled.

🤖 Generated with Claude Code

UNIFY-PR-05: the second half of the GH/local partition. Brave/Exa are paid (~1,000 free queries/month each), so the day's search budget should be spent once. GitHub should search only as a backstop: when a local run did not already search that calendar day.

- `SearchBudgetLedger.searched_on(day=..., engine=...)`: True if a search (queries > 0) was recorded on the given UTC calendar day. Bucketed in UTC; a budget-skipped run that logged 0 queries does not count.
- `DiscoveryConfig.search_backstop_only` (default False), set `true` in `agents/news/ci.overlay.yaml`. When True, `run_news_discover_job` checks the ledger before issuing engine queries and skips Brave/Exa/Google CSE for the day if a search was already recorded (by any run, local or CI), recording a `search_backstop_skipped=true` warning. Source-native discovery and the deterministic phases are unaffected; local runs search on their own cadence and take priority.

Also fixes the post-run check `failed_runs == len(persisted_runs)` to require `persisted_runs` first, so a zero-run day (the backstop skipping all engines) finishes non-fatal instead of tripping the all-failed branch on `0 == 0`.

This completes the code side of the GH/local partition. Re-enabling the workflows + seeding the state repo is the operational follow-up (UNIFY-PR-06); the parked scrape->classify decouple is tracked as issue #213.

Tests: searched_on (same-day detection, zero-query exclusion, UTC bucketing); the CI-overlay config flip; and the discover job skipping search when already searched today vs running it when the ledger is empty.

ruff/mypy clean; 1376 unit tests pass.

Co-Authored-By: Claude Opus 4.8 <[email protected]>

codecov · 2026-06-13T07:19:46Z

Codecov Report

✅ All modified and coverable lines are covered by tests.
✅ Project coverage is 92.82%. Comparing base (305936c) to head (db72a9d).

Additional details and impacted files

@@            Coverage Diff             @@
##             main     #214      +/-   ##
==========================================
+ Coverage   92.81%   92.82%   +0.01%     
==========================================
  Files          83       83              
  Lines       12283    12302      +19     
==========================================
+ Hits        11400    11419      +19     
  Misses        883      883

Files with missing lines	Coverage Δ
src/denbust/discovery/search_budget.py	`98.71% <100.00%> (+0.23%)`	⬆️
src/denbust/pipeline.py	`95.40% <100.00%> (+0.02%)`	⬆️

Copilot

Pull request overview

Adds a “search backstop” mode so GitHub Actions only spends paid open-web search budget (Brave/Exa/Google CSE) on days when no other run has already searched (per the search-budget ledger), preserving budget for local runs and preventing duplicate daily spend.

Changes:

- Added SearchBudgetLedger.searched_on(day=..., engine=...) to detect “real” searches (queries > 0) on a given UTC calendar day.
- Introduced DiscoveryConfig.search_backstop_only (default False) and enabled it in the CI overlay to skip engine discovery when the ledger shows an earlier search that day.
- Adjusted run_news_discover_job fatality logic so “zero attempted runs” (e.g., all engines backstop-skipped) is non-fatal, and added unit tests/docs updates for the behavior.

Reviewed changes

Copilot reviewed 9 out of 9 changed files in this pull request and generated 1 comment.

File	Description
`src/denbust/discovery/search_budget.py`	Adds `searched_on()` using UTC-day bucketing and queries>0 semantics.
`src/denbust/config.py`	Adds `discovery.search_backstop_only` configuration flag.
`src/denbust/pipeline.py`	Implements backstop gating and fixes the `0 == 0` fatal condition for zero attempted runs.
`agents/news/ci.overlay.yaml`	Enables the backstop for CI runs.
`docs/operational_reference.md`	Documents CI overlay behavior and the backstop warning flag.
`tests/unit/test_search_budget.py`	Adds unit coverage for same-day detection, zero-query exclusion, and UTC bucketing.
`tests/unit/test_pipeline_core.py`	Adds behavioral tests for backstop skip vs. run behavior.
`tests/unit/test_config.py`	Asserts local vs CI overlay differences include the backstop flag.
`.agent-plan.md`	Updates planning/mainline status to reflect UNIFY-PR-05 being merged and sets next steps.

+        from datetime import UTC, datetime
+
+        from denbust.discovery.search_budget import SearchBudgetLedger
+
+        config = Config(
+            store={"state_root": tmp_path},
+            source_discovery={"enabled": False},
+            discovery={
+                "enabled": True,
+                "engines": {"brave": {"enabled": True}},
+                "search_backstop_only": True,
+            },
+        )
+        # A local run already searched earlier today.
+        SearchBudgetLedger(config.discovery_state_paths.search_budget_path).record(
+            engine="brave", queries=20, run_id="local-run", now=datetime.now(UTC)
+        )


Addresses review of #214:

1. Rolling 24h window instead of a UTC calendar day. The calendar-day check only deferred to local when local happened to run first that day; if GH's schedule fired before local, GH searched every day and the backstop saved nothing (local has no backstop and searches anyway -> two searches/day). It also double-searched across the UTC-midnight boundary. The ledger method is now `searched_since(since=...)`; the discover job gates on `run_timestamp - 24h`. With a rolling window GH defers to a recent local search regardless of clock ordering: as long as local runs at least daily, GH always skips and only searches once local has been idle longer than a day. Docs + overlay comment corrected to this honest framing, with a note to schedule GH discover at least daily.
2. Discover test now uses the real CI triple (source_discovery on + scraping_enabled off + search_backstop_only on): the backstop skips the engines while source-native discovery runs empty-but-SUCCEEDED, so the zero-engine day stays non-fatal, locking in the `persisted_runs and failed_runs == len(...)` guard against future source-native run-status changes. (The previous test used source_discovery off, which the real CI config never does.)
3. `searched_since` treats a tz-naive `recorded_at` as UTC rather than silently converting from the local machine tz.

ruff/mypy clean; 1376 unit tests pass.

Co-Authored-By: Claude Opus 4.8 <[email protected]>
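A minimal sketch of the rolling-window check from points 1 and 3, under stated assumptions: records are modeled as `(recorded_at, queries)` tuples, `as_utc` is a hypothetical helper, and the inclusive `>=` boundary is a guess. Only the `since=` parameter, the `run_timestamp - 24h` gate, and the naive-as-UTC rule come from the commit message.

```python
from datetime import datetime, timedelta, timezone


def as_utc(moment: datetime) -> datetime:
    """Treat a tz-naive timestamp as UTC; convert an aware one to UTC."""
    if moment.tzinfo is None:
        # Calling astimezone() on a naive value would assume the machine's
        # local zone, which is the silent conversion being avoided.
        return moment.replace(tzinfo=timezone.utc)
    return moment.astimezone(timezone.utc)


def searched_since(records: list[tuple[datetime, int]], *, since: datetime) -> bool:
    """Return True if any record with queries > 0 is at or after `since`."""
    cutoff = as_utc(since)
    return any(
        queries > 0 and as_utc(recorded_at) >= cutoff
        for recorded_at, queries in records
    )


# A GH run shortly after UTC midnight, gated on the previous 24 hours.
run_timestamp = datetime(2026, 6, 14, 1, 0, tzinfo=timezone.utc)
window_start = run_timestamp - timedelta(hours=24)

recent_local = [(datetime(2026, 6, 13, 22, 0), 20)]  # naive, read as UTC
stale_local = [(datetime(2026, 6, 12, 20, 0), 20)]

skip_search = searched_since(recent_local, since=window_start)
run_search = not searched_since(stale_local, since=window_start)
```

In this example the local search three hours earlier falls on the previous UTC calendar day, so a calendar-day check would have missed it, while the rolling window sets `skip_search` to True. The stale record is older than 24 hours, so `run_search` is True.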

github-actions · 2026-06-13T10:10:33Z

pr-agent-context report:

This run includes an unresolved review comment on PR #214.

For each unresolved review comment, recommend one of: resolve as irrelevant, accept and implement
the recommended solution, open a separate issue and resolve as out-of-scope for this PR, accept and
implement a different solution, or resolve as already treated by the code.

After I reply with my decision per item, implement the accepted actions, resolve the corresponding
PR comments, and push all of these changes in a single commit.

# Copilot Comments

## COPILOT-1
Location: tests/unit/test_pipeline_core.py
URL: https://github.com/DataHackIL/tfht_enforce_idx/pull/214#discussion_r3407594076
Status: outdated
Root author: copilot-pull-request-reviewer

Comment:
    This test can be flaky around UTC midnight: the ledger record uses `datetime.now(UTC)` while `run_news_discover_job()` independently sets `result.run_timestamp` via `datetime.now(UTC)`. If the day flips between the two calls, `searched_on(day=run_day)` won't see the record and the test may fail intermittently. Freeze the pipeline clock (or patch `_build_run_snapshot`) so both use the same fixed timestamp.

Run metadata:

Tool ref: v4.0.19
Tool version: 4.0.19
Trigger: commit pushed
Workflow run: 27463715463 attempt 1
Comment timestamp: 2026-06-13T10:09:45.497314+00:00
PR head commit: db72a9df1b2a49727bc2ee6c2128e3cdf2606053

Copilot AI review requested due to automatic review settings June 13, 2026 07:16

shaypal5 added this to the Local↔CI Unification milestone Jun 13, 2026

shaypal5 added the configuration and discovery (Discovery-layer and candidate-retention work) labels Jun 13, 2026

Copilot started reviewing on behalf of shaypal5 June 13, 2026 07:16

Copilot AI reviewed Jun 13, 2026

shaypal5 merged commit 0f41a85 into main Jun 14, 2026
11 checks passed

shaypal5 deleted the codex/search-backstop branch June 14, 2026 08:22

shaypal5 merged 2 commits into main from codex/search-backstop
