fix(security): redact secrets from persisted discovery error strings#217
Merged
Conversation
Root-cause fix for the secret-leak incident. Google CSE takes its API key as a
`?key=` URL query parameter, so an httpx HTTPStatusError (e.g. a 403) echoes the
key in its text. That text was stored verbatim in a discovery run's errors[] and
written to run snapshots and backfill batches — and a seed of that state to a
public repo leaked a live key.
Adds `denbust.discovery.redaction.redact_secrets()` — conservatively scrubs
credential shapes (URL key/token/auth params, Google AIza keys, Bearer/Basic
tokens, sk-ant/proj/... keys) without touching ordinary text or URL slugs (so
candidate URLs like ".../norsk-forbud-..." are untouched).
Applied at the source — every discovery engine and source error-capture site in
pipeline.py wraps its `{exc}` string in redact_secrets(), so a credential never
enters errors[] (and therefore never reaches run snapshots, backfill batches, or
metrics). Plus a safety net in write_discovery_run_snapshot(), which redacts the
serialized snapshot before writing (run snapshots hold no candidate bodies, so
blanket redaction there is safe).
Tests: redaction of the exact CSE-403 error shape, Bearer/sk- keys, idempotence,
and that URL slugs / plain text are left intact; plus that the run-snapshot
writer never persists a key. ruff/mypy clean; 1381 unit tests pass.
Incident remediation status: the leaked Google key was rotated/deleted by the
operator; the public state repo history was purged (clean root force-pushed);
this prevents recurrence. The unreachable old commit may persist by-SHA on
GitHub until GC — harmless now that the key is dead.
Co-Authored-By: Claude Opus 4.8 <[email protected]>
Codecov Report✅ All modified and coverable lines are covered by tests. Additional details and impacted files@@ Coverage Diff @@
## main #217 +/- ##
==========================================
+ Coverage 92.82% 92.84% +0.02%
==========================================
Files 83 84 +1
Lines 12302 12337 +35
==========================================
+ Hits 11419 11454 +35
Misses 883 883
🚀 New features to boost your workflow:
|
Contributor
There was a problem hiding this comment.
Pull request overview
This PR hardens the discovery pipeline against credential leaks by introducing a secret-redaction utility and applying it to persisted discovery-run error strings, with a safety-net redaction pass when writing run snapshots.
Changes:
- Add
denbust.discovery.redaction.redact_secrets()to scrub common secret shapes (URLkey=/token=params,AIza...keys, Bearer/Basic tokens,sk-...keys). - Apply redaction when capturing discovery/source exceptions into
errors[]inpipeline.py. - Redact the serialized JSON in
write_discovery_run_snapshot()before writing to disk, plus add unit tests covering leak-shaped errors and snapshot persistence.
Reviewed changes
Copilot reviewed 4 out of 4 changed files in this pull request and generated 6 comments.
| File | Description |
|---|---|
src/denbust/discovery/redaction.py |
Introduces regex-based secret redaction utility used to scrub persisted error text. |
src/denbust/pipeline.py |
Wraps multiple discovery error-capture paths with redact_secrets() before persisting to errors[]. |
src/denbust/discovery/state_paths.py |
Adds a safety-net redaction pass over serialized run-snapshot JSON before writing to disk. |
tests/unit/test_redaction.py |
Adds unit tests for URL/query-param redaction, token redaction, idempotence, and snapshot writer behavior. |
Comment on lines
486
to
+488
| except Exception as exc: | ||
| logger.exception("Error fetching from %s: %s", source.name, exc) | ||
| errors.append(f"{source.name}: {exc}") | ||
| errors.append(redact_secrets(f"{source.name}: {exc}")) |
Comment on lines
915
to
920
| except Exception as exc: | ||
| logger.exception( | ||
| "Error discovering backfill candidates from %s: %s", source.name, exc | ||
| ) | ||
| errors.append(f"{source.name}: {exc}") | ||
| errors.append(redact_secrets(f"{source.name}: {exc}")) | ||
|
|
Comment on lines
974
to
977
| except Exception as exc: | ||
| logger.exception("Error discovering candidates from %s: %s", source.name, exc) | ||
| errors.append(f"{source.name}: {exc}") | ||
| errors.append(redact_secrets(f"{source.name}: {exc}")) | ||
|
|
Comment on lines
+29
to
+30
| # Google API keys (e.g. CSE): AIza + 35 chars. | ||
| _GOOGLE_API_KEY = re.compile(r"AIza[0-9A-Za-z_-]{30,}") |
Comment on lines
+9
to
+14
| ``redact_secrets`` scrubs known credential shapes so they never reach disk. | ||
| Apply it wherever an exception or external string is turned into stored state | ||
| (error lists, run snapshots). It is intentionally conservative — it targets | ||
| credential patterns (URL key/token params, Google ``AIza`` keys, Bearer/Basic | ||
| tokens, ``sk-…`` keys), not arbitrary text — so it will not mangle candidate | ||
| URLs or titles. |
|
|
||
| # A realistic Google-CSE 403 error that echoes the API key in the request URL — | ||
| # the exact shape that leaked. Built so no real key appears in this test file. | ||
| _FAKE_KEY = "AIza" + "Sy" + "C2xd" + "N" * 33 # AIza + 35 chars |
This comment has been minimized.
This comment has been minimized.
Reworks the redactor after review found the pattern-only version missed most of
this project's secrets — proven: a Supabase service-role JWT (in a JSON body),
Brave X-Subscription-Token, and Exa x-api-key all passed through unredacted; only
Google was caught. A security redactor that launders the one secret it has seen
leak while passing the Supabase admin key is false confidence.
Primary mechanism is now KNOWN-VALUE redaction: secret_values_from_env() collects
the literal values of credential-like env vars (names matching
KEY|TOKEN|SECRET|PASSWORD|CREDENTIAL, value >= 8 chars) — ANTHROPIC_API_KEY,
DENBUST_* engine/Supabase/object-store keys, KAGGLE_KEY, HF_TOKEN, etc. — and
strips their literal occurrences, so every secret is removed regardless of format
(JWT, UUID, opaque). Tests can inject known_values for determinism.
Shape patterns remain as a backstop for FOREIGN secrets not in our env, now
expanded to JWTs (eyJ…), header tokens (X-Subscription-Token / x-api-key), and
JSON secret fields ("…key…":"…"), alongside the URL-param/AIza/Bearer/sk- ones.
They target credential shapes only, so candidate URLs/slugs (e.g. "norsk-forbud-")
stay intact.
Coverage widened: the backfill-batch (write_json_snapshot) and metrics writers
now redact too, joining the run-snapshot safety net (all JSON metadata, no
candidate bodies); source-level error redaction already keeps the operational
(Supabase) write clean.
Tests are now threat-model based: each secret type this system holds is asserted
redacted (by known value and by shape backstop), plus the writers. The six
adversarial cases that previously leaked (4/6) are all redacted now. ruff/mypy
clean; 1382 unit tests pass.
Co-Authored-By: Claude Opus 4.8 <[email protected]>
|
pr-agent-context report: This run includes unresolved review comments on PR #217.
For each unresolved review comment, recommend one of: resolve as irrelevant, accept and implement
the recommended solution, open a separate issue and resolve as out-of-scope for this PR, accept and
implement a different solution, or resolve as already treated by the code.
After I reply with my decision per item, implement the accepted actions, resolve the corresponding
PR comments, and push all of these changes in a single commit.
# Copilot Comments
## COPILOT-1
Location: src/denbust/pipeline.py:488
URL: https://github.com/DataHackIL/tfht_enforce_idx/pull/217#discussion_r3409386505
Root author: copilot-pull-request-reviewer
Comment:
`logger.exception(..., exc)` will still emit the raw exception message (and traceback) to logs, which can include credential-bearing URLs (e.g., `?key=`). This undermines the goal of preventing secret exposure, even if the persisted `errors[]` entry is redacted.
Consider logging a redacted error string without `exc_info`, and reusing the redacted value for the persisted error entry.
## COPILOT-2
Location: src/denbust/pipeline.py:920
URL: https://github.com/DataHackIL/tfht_enforce_idx/pull/217#discussion_r3409386512
Status: outdated
Root author: copilot-pull-request-reviewer
Comment:
`logger.exception(..., exc)` will log the unredacted exception message/traceback, which may include credential-bearing URLs (e.g., `?key=`). Since this PR is meant to prevent secret exposure, it would be safer to log a redacted string without `exc_info`, and reuse the same redacted value for `errors[]`.
## COPILOT-3
Location: src/denbust/pipeline.py:977
URL: https://github.com/DataHackIL/tfht_enforce_idx/pull/217#discussion_r3409386522
Root author: copilot-pull-request-reviewer
Comment:
Same as above: `logger.exception` will print the original exception message/traceback (potentially including secrets in URLs). To avoid leaking credentials to logs, log a redacted string without `exc_info` and reuse it for the persisted error entry.
## COPILOT-4
Location: src/denbust/discovery/redaction.py
URL: https://github.com/DataHackIL/tfht_enforce_idx/pull/217#discussion_r3409386532
Status: outdated
Root author: copilot-pull-request-reviewer
Comment:
The comment says Google API keys are `AIza` + 35 chars, but the regex currently matches 30+ chars. Tightening this to the expected length (and adding a boundary) reduces the chance of over-redacting unrelated text while still covering real keys.
## COPILOT-5
Location: src/denbust/discovery/redaction.py
URL: https://github.com/DataHackIL/tfht_enforce_idx/pull/217#discussion_r3409386538
Status: outdated
Root author: copilot-pull-request-reviewer
Comment:
This PR redacts discovery-run `errors[]` and run snapshots, but other persisted discovery state still stores raw exception strings. For example, scrape attempts persist `ScrapeAttempt.error_message` to `scrape_attempts.jsonl` via `StateRepoDiscoveryPersistence._append_jsonl()` (no redaction), and `scrape_queue.py` builds `error_message` from `{exc}`. That means a credential echoed in an exception message could still reach durable state.
Consider applying `redact_secrets` when populating `ScrapeAttempt.error_message` (and/or just before `model_dump_json()` in `_append_jsonl`) for models that can carry external error strings.
## COPILOT-6
Location: tests/unit/test_redaction.py
URL: https://github.com/DataHackIL/tfht_enforce_idx/pull/217#discussion_r3409386539
Status: outdated
Root author: copilot-pull-request-reviewer
Comment:
This fake key is intended to be `AIza` + 35 chars (per the inline comment), but the current construction is longer. Keeping it at the expected length makes the fixture closer to the real CSE leak shape and aligns with a tighter Google-key regex.Run metadata: |
This was referenced Jun 14, 2026
shaypal5
added a commit
that referenced
this pull request
Jun 14, 2026
…edes #215) (#219) #215 recorded the state-repo seed as a clean success. It wasn't: that seed leaked a live Google CSE API key (captured into a discovery run's errors[] from a CSE-403 URL and pushed to the PUBLIC state repo). This corrects .agent-plan.md to mainline truth: - the incident and its remediation (key rotated; public-repo history purged with a clean root; redaction root-cause fix #217 merged; re-seeded state is key-scrubbed); - go-live (UNIFY-PR-06) is now gated on the state-push secret-scan guard (issue #218) in addition to the manual-dispatch verification; - last-merged status points at the #217 redaction fix. Doc-only; opened as a PR for review rather than merged directly. Co-authored-by: Claude Opus 4.8 <[email protected]>
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
Context
Root-cause fix for the secret-leak incident. A live Google CSE API key was exposed in the public state repo
tfht_enforce_idx_state.How it happened: Google CSE takes its API key as a
?key=URL query param. A 403 raised anhttpx.HTTPStatusErrorwhose text echoed the request URL including the key. That text was stored verbatim in a discovery run'serrors[], written todiscover/runs/*.jsonand backfill-batch records, and then seeded to the public repo.Already handled (operational): key rotated/deleted by the operator; public-repo history purged (clean root force-pushed; the leaked seed commit is off
main). This PR is the code fix so it can't recur.Change
denbust.discovery.redaction.redact_secrets()— conservatively scrubs credential shapes: URLkey=/token=/auth=params, GoogleAIza…keys,Bearer/Basictokens,sk-ant/proj/…keys. It deliberately does not touch ordinary text or URL slugs (verified:…/norsk-forbud-…is left intact, so candidate URLs aren't mangled).pipeline.pywraps its{exc}inredact_secrets(), so a credential never enterserrors[]— and therefore never reaches run snapshots, backfill batches, or metrics.write_discovery_run_snapshot()redacts the serialized snapshot before writing (run snapshots hold no candidate bodies, so blanket redaction there is safe).Tests
tests/unit/test_redaction.py: the exact CSE-403 error shape redacts tokey=REDACTED(key gone,403/cxcontext preserved); Bearer/sk-keys; idempotence; URL slugs + plain text untouched; and the run-snapshot writer never persists a key. ruff/mypy clean; 1381 unit tests pass.Follow-ups (not in this PR)
state-runwrapper before it commits/pushes to the state repo (defense-in-depth at the actual leak vector — a code-repo gitleaks hook would not have caught this, since the secret lived in untrackeddata/pushed to a separate repo).🤖 Generated with Claude Code