feat(security): three-layer gitleaks secret-scan guard (#218) by shaypal5 · Pull Request #220 · DataHackIL/tfht_enforce_idx

shaypal5 · 2026-06-14T13:22:11Z

Why

Closes #218. The structural follow-up to the seed-time incident where a live Google CSE API key — captured into a discovery run's errors[] from a CSE 403 error URL (?key=…) — was pushed to the public state repo. The key was rotated, the public history purged, and the root-cause redaction landed in #217 (redact_secrets() scrubs credential values from discovery state at write time).

This PR adds the outer defense: scan with a real industry secret scanner (gitleaks), not ad-hoc regex, at every point where content can leave the machine. It satisfies gate (a) of UNIFY-PR-06 (go-live).

What

One config, three independent guards.

Shared config — `.gitleaks.toml`

useDefault = true (the full industry ruleset), plus a strict, no-entropy AIza… Google API-key rule — gitleaks' default Google rule applies an entropy gate that missed the exact key that leaked.
An [allowlist] for bulk news-candidate data files (article URLs/titles/snippets — scanning them yields only third-party false positives) and captured-page test fixtures (which carry third-party ad/analytics tokens, not our secrets).

Guard 1 — git pre-push hook (the "general git" guard)

.pre-commit-config.yaml gains the upstream gitleaks hook pinned at v8.30.1, stages: [pre-push]. The repo already sets default_install_hook_types: [pre-commit, pre-push]. Gates what leaves the machine via git push. (Verified live on this very push — see the "Detect hardcoded secrets" pre-push step.)

Guard 2 — state-run push guard

scripts/state-run.sh gains scan_state_for_secrets(), called before each git commit/push to the public state repo. It scans only the committed subtrees (falling back to the whole worktree) with the repo-root config. It fails closed: if gitleaks isn't installed it refuses to push, with install guidance. STATE_RUN_SKIP_SECRET_SCAN=1 is the discouraged, explicit escape hatch.

Guard 3 — Claude Code pre-push hook (the "Claude ability")

A new checked-in .claude/settings.json registers a PreToolUse/Bash hook → scripts/hooks/gitleaks_prepush_guard.py (stdlib-only). It no-ops unless the Bash command is a git push, then runs gitleaks git . and blocks the tool call (exit 2) with redacted findings when secrets are present. Stops an agent from pushing a secret in the first place. .gitignore is adjusted (.claude/* + !.claude/settings.json) so this one shared file is tracked while settings.local.json and worktrees stay ignored.

Docs & plan

AGENTS.md gains a Secret Scanning (defense in depth) section: the gitleaks install requirement, the config, and all three guards (incl. state-run's fail-closed behavior and the override var).
.agent-plan.md: new [done] GUARD-PR-SECRET-SCAN ledger entry; UNIFY-PR-06 gate (a) marked satisfied.

Tests

tests/integration/test_state_run.py: 3 gitleaks-gated tests (skipif when the binary is absent) — a planted Google key is blocked before push, clean state passes, and bulk candidate data is not false-flagged. _run_wrapper gains a scan_secrets param defaulting to off (sets STATE_RUN_SKIP_SECRET_SCAN=1) so mechanics tests run anywhere.
tests/integration/test_state_squash.py: squash/coexistence tests opt out of the scan via the same env var.

Verification

ruff format --check . / ruff check . — clean
mypy src/ — clean (105 files)
pytest -q tests/unit tests/integration — 1565 passed, 2 skipped
All three guards exercised by hand: pre-push hook passed live on this push; the Claude hook allows non-push + clean push and blocks a committed key (exit 2); state-run blocks a planted key and passes clean state.

Note on activation

.claude/settings.json is newly created. Claude Code's settings watcher only tracks .claude/ dirs that had a settings file at session start, so the hook becomes active for collaborators on their next claude session (or after opening /hooks once). The git pre-push and state-run guards are active immediately.

Add defense-in-depth secret scanning with gitleaks (the industry tool, not ad-hoc regex), the structural follow-up to the seed-time Google CSE key leak. A repo-root .gitleaks.toml (default ruleset + a no-entropy AIza Google API-key rule + an allowlist for bulk news-candidate data and captured-page test fixtures) drives all three guards: 1. pre-commit gitleaks hook at stages: [pre-push] — the general git guard. 2. scripts/state-run.sh scan_state_for_secrets() — scans the state worktree before every commit/push and fails closed when gitleaks is absent (STATE_RUN_SKIP_SECRET_SCAN=1 is the discouraged escape hatch). 3. Claude Code PreToolUse/Bash hook (.claude/settings.json -> scripts/hooks/gitleaks_prepush_guard.py) blocking any agent-issued git push when gitleaks finds secrets in tracked content. Documented under AGENTS.md "Secret Scanning". gitleaks-gated integration tests skip when the binary is absent; squash/coexistence mechanics tests opt out via STATE_RUN_SKIP_SECRET_SCAN=1. .gitignore now lets the shared .claude/settings.json through while keeping settings.local.json and worktrees ignored. Satisfies gate (a) of UNIFY-PR-06. Co-Authored-By: Claude Opus 4.8 <[email protected]>

Co-Authored-By: Claude Opus 4.8 <[email protected]>

Copilot

Pull request overview

This PR adds defense-in-depth secret scanning at the main “egress” points (developer git pushes, state-repo publishing via state-run, and agent-initiated pushes) using a shared gitleaks configuration to prevent a repeat of the state-seeding secret leak incident.

Changes:

Add a shared gitleaks rules+allowlist config (.gitleaks.toml) and wire it into multiple guard layers.
Enforce secret scanning before state repo commit/push in scripts/state-run.sh, with integration tests covering blocked/allowed cases.
Add developer/agent push guards via pre-commit pre-push gitleaks hook and a Claude Code PreToolUse hook, plus documentation/plan updates.

Reviewed changes

Copilot reviewed 9 out of 10 changed files in this pull request and generated 2 comments.

Show a summary per file

File	Description
`scripts/state-run.sh`	Adds a fail-closed gitleaks scan step before committing/pushing state.
`scripts/hooks/gitleaks_prepush_guard.py`	Adds a Claude Code hook to block `git push` when gitleaks finds secrets.
`.pre-commit-config.yaml`	Adds gitleaks as a `pre-push` hook.
`.gitleaks.toml`	Introduces the shared gitleaks configuration (default rules + strict Google key rule + allowlist).
`.claude/settings.json`	Registers the Claude Code `PreToolUse`/`Bash` hook to run the gitleaks guard on push.
`.gitignore`	Ignores all `.claude/*` except the shared `.claude/settings.json`.
`tests/integration/test_state_run.py`	Adds gitleaks-gated integration tests for the state-run secret scan and an opt-out for non-scan tests.
`tests/integration/test_state_squash.py`	Opts squash/coexistence tests out of the secret scan so they run without gitleaks installed.
`AGENTS.md`	Documents the three-layer secret-scanning approach and the gitleaks install requirement.
`.agent-plan.md`	Records the secret-scan guard as done and updates UNIFY-PR-06 gating text accordingly.

+  if [[ ${#subtrees[@]} -gt 0 ]]; then
+    for target in "${subtrees[@]}"; do
+      [[ -e "$STATE_REPO_DIR/$target" ]] && targets+=("$STATE_REPO_DIR/$target")
+    done
+  else
+    targets+=("$STATE_REPO_DIR")
+  fi


+    scan = subprocess.run(
+        ["gitleaks", "git", ".", "--no-banner", "--redact"],
+        capture_output=True,
+        text=True,
+    )


codecov · 2026-06-14T13:28:17Z

Codecov Report

✅ All modified and coverable lines are covered by tests.
✅ Project coverage is 92.84%. Comparing base (4f011be) to head (38065e3).

Additional details and impacted files

@@           Coverage Diff           @@
##             main     #220   +/-   ##
=======================================
  Coverage   92.84%   92.84%           
=======================================
  Files          84       84           
  Lines       12337    12337           
=======================================
  Hits        11454    11454           
  Misses        883      883

🚀 New features to boost your workflow:

❄️ Test Analytics: Detect flaky tests, report on failures, and find test suite problems.
📦 JS Bundle Analysis: Save yourself from yourself by tracking and limiting bundle sizes in JS merges.

Address review findings that the guard had a hole exactly at the incident's location and was under-tested: - .gitleaks.toml: replace the blanket path allowlist (which skipped ALL rules on candidate data — blinding the scanner where the leaked key actually rode in) with narrow per-rule allowlists. Only generic-api-key is suppressed on candidate paths (and jwt on captured fixtures); the strict Google rule, jwt, and all provider rules stay active there, so a real key/JWT in the candidate stream is still caught. - state-run.sh: fail closed when .gitleaks.toml is missing (no silent fallback to gitleaks defaults, which miss the low-entropy Google key class); scan the staged diff via `gitleaks git --staged` (not the whole working tree, so it no longer blocks on pre-existing secrets in unchanged files); single run via a JSON report; distinguish leaks from operational errors; resolve the config via BASH_SOURCE. - gitleaks_prepush_guard.py: shlex-based push detection so it no longer fails open on `git --no-pager push`, env-prefixed, or reordered-flag forms; scan the `-C <dir>` target rather than a hardcoded `.`; pass an explicit script-relative --config; distinguish leaks from gitleaks errors. - ci-test.yml: install a pinned gitleaks in the integration-tests job so the guard's blocking tests run in CI instead of silently skipping. - tests: the old test asserted a Google key in candidate data passes the scan — i.e. it encoded the hole. Replace it with a generic-noise test, and add a regression test that a real Google key AND a JWT in candidate data ARE blocked. Co-Authored-By: Claude Opus 4.8 <[email protected]>

shaypal5 · 2026-06-15T11:24:16Z

Self-review hardening (commit `38065e3`)

A brutal pass over the first cut found that the guard had a hole exactly where the incident happened and was under-tested. Fixed:

Finding	Fix
Allowlist blinded the scanner on candidate data — the leaked key rode in inside `latest_candidates.jsonl`, yet those paths were blanket-allowlisted, so a Google key there scanned as "no leaks found"	Replaced the blanket path allowlist with per-rule `[[allowlists]]` (`targetRules`): only `generic-api-key` is suppressed on candidate paths (and `jwt` on captured fixtures); the strict Google rule, `jwt`, and all provider rules stay active there
Silent fallback to weak defaults when `.gitleaks.toml` is missing (defaults miss the low-entropy `AIza` class the strict rule exists for)	state-run fails closed if the config is absent; the Claude hook resolves the config script-relative and passes `--config` explicitly
Hook scanned `.` even for `git -C <dir> push`	Parses `-C <dir>` and scans that repo
Hook failed open on `git --no-pager push`, env-prefixed, or reordered-flag forms (over-specific regex)	shlex tokenization across `&&`/`;`/`\|`, skipping git global options
Guard tests skipped in CI (gitleaks not installed) → "verified" was hollow	CI integration job now installs a pinned gitleaks; the blocking tests run
state-run scanned the whole subtree (blocked on pre-existing secrets; slow); any non-zero exit printed "SECRET DETECTED"	Scans the staged diff via `gitleaks git --staged`; single JSON-report run; distinguishes leaks from operational errors
A test asserted a Google key in candidate data passes — the hole encoded as green	Replaced with a generic-noise test; added a regression test that a real Google key and a JWT in candidate data are blocked

Local: ruff/mypy clean · 1566 passed, 2 skipped · all three guards hand-exercised (incl. the now-closed candidate-data hole).

github-actions · 2026-06-15T11:28:35Z

pr-agent-context report:

This run includes unresolved review comments on PR #220.

For each unresolved review comment, recommend one of: resolve as irrelevant, accept and implement
the recommended solution, open a separate issue and resolve as out-of-scope for this PR, accept and
implement a different solution, or resolve as already treated by the code.

After I reply with my decision per item, implement the accepted actions, resolve the corresponding
PR comments, and push all of these changes in a single commit.

# Copilot Comments

## COPILOT-1
Location: scripts/state-run.sh
URL: https://github.com/DataHackIL/tfht_enforce_idx/pull/220#discussion_r3409602690
Status: outdated
Root author: copilot-pull-request-reviewer

Comment:
    If `--subtree` arguments are provided but none of them exist under `STATE_REPO_DIR` (or a typo slips in), `targets` stays empty and the function returns success without scanning anything. This creates a fail-open path for the secret scan.

## COPILOT-2
Location: scripts/hooks/gitleaks_prepush_guard.py
URL: https://github.com/DataHackIL/tfht_enforce_idx/pull/220#discussion_r3409602705
Status: outdated
Root author: copilot-pull-request-reviewer

Comment:
    The Claude pre-push guard claims to use the repo-root `.gitleaks.toml`, but the gitleaks invocation doesn’t pass `--config` and runs relative to the hook’s current working directory. If the hook executes from a non-repo cwd, the scan may run against the wrong directory and/or ignore this repo’s allowlist/strict rules.

Run metadata:

Tool ref: v4.0.19
Tool version: 4.0.19
Trigger: commit pushed
Workflow run: 27542897361 attempt 1
Comment timestamp: 2026-06-15T11:27:46.504593+00:00
PR head commit: 38065e3cda5d4b000a949a9850ebbd4f13b891e4

Copilot

Pull request overview

Copilot reviewed 10 out of 11 changed files in this pull request and generated 6 comments.

+_FAKE_GOOGLE_KEY = "AIza" + "B1cD3fGh4JkLmN0pQrStUvWxYz123456789"
+# A structurally-valid but meaningless JWT (the Supabase-JWT incident class).
+_FAKE_JWT = (
+    "eyJhbGciOiJIUzI1NiJ9.eyJzdWIiOiIxMjM0NTY3ODkwIn0.SflKxwRJSMeKKF2QT4fwpMeJf36POk6yJV_adQssw5c"
+)


+# A high-entropy assignment that trips gitleaks' catch-all `generic-api-key`
+# rule — the kind of false positive news text produces. Suppressed on candidate
+# paths, so it must NOT block a push.
+_GENERIC_NOISE = "api_key = 8f3Hq9ZxR2bN7vKpL4wYtCgD6sErA1mU0oP"


+    config_arg = ["--config", str(_CONFIG)] if _CONFIG.is_file() else []
+    if not config_arg:
+        print(
+            f"gitleaks-prepush: WARNING — {_CONFIG} not found; scanning with gitleaks "
+            "defaults, which miss the low-entropy Google key class.",


+        body = Path(report).read_text() if Path(report).exists() else ""
+        if '"RuleID"' in body:
+            return True, (scan.stderr or body).strip()[-2000:]


-  run/backfill-batch/metrics snapshot writers, so an API error that echoes a key never reaches
-  state. Threat-model tested across the project's secret types. This is the last step of the
-  search-backstop code (`UNIFY-PR-05`) plus the incident fix.
+- Last merged PR on main: `#220` (`GUARD-PR-SECRET-SCAN`, closes #218) — the three-layer


  defers to a recent local search regardless of clock ordering). A zero-run day now finishes
  non-fatal.
  Covered by ledger + config + discover-job tests.
+- [done] `GUARD-PR-SECRET-SCAN` (#220, closes #218): three-surface [gitleaks](https://github.com/gitleaks/gitleaks)


shaypal5 added this to the Local↔CI Unification milestone Jun 14, 2026

Copilot AI review requested due to automatic review settings June 14, 2026 13:22

shaypal5 added ci security Security hardening / incident follow-up labels Jun 14, 2026

Copilot started reviewing on behalf of shaypal5 June 14, 2026 13:22 View session

docs(plan): record PR #220 for the secret-scan guard

1605b48

Co-Authored-By: Claude Opus 4.8 <[email protected]>

This comment has been minimized.

Sign in to view

Copilot AI reviewed Jun 14, 2026

View reviewed changes

This comment has been minimized.

Sign in to view

Copilot AI review requested due to automatic review settings June 15, 2026 11:24

Copilot started reviewing on behalf of shaypal5 June 15, 2026 11:24 View session

Copilot AI reviewed Jun 15, 2026

View reviewed changes

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

feat(security): three-layer gitleaks secret-scan guard (#218)#220

feat(security): three-layer gitleaks secret-scan guard (#218)#220
shaypal5 wants to merge 3 commits into
mainfrom
codex/secret-scan-prepush-guard

shaypal5 commented Jun 14, 2026

Uh oh!

This comment has been minimized.

Copilot AI left a comment

Uh oh!

codecov Bot commented Jun 14, 2026 •

edited

Loading

Uh oh!

This comment has been minimized.

shaypal5 commented Jun 15, 2026

Uh oh!

github-actions Bot commented Jun 15, 2026

Uh oh!

Copilot AI left a comment

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

Conversation

shaypal5 commented Jun 14, 2026

Why

What

Shared config — .gitleaks.toml

Guard 1 — git pre-push hook (the "general git" guard)

Guard 2 — state-run push guard

Guard 3 — Claude Code pre-push hook (the "Claude ability")

Docs & plan

Tests

Verification

Note on activation

Uh oh!

This comment has been minimized.

Copilot AI left a comment

Choose a reason for hiding this comment

Pull request overview

Reviewed changes

Uh oh!

codecov Bot commented Jun 14, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Codecov Report

Uh oh!

This comment has been minimized.

shaypal5 commented Jun 15, 2026

Self-review hardening (commit 38065e3)

Uh oh!

github-actions Bot commented Jun 15, 2026

Uh oh!

Copilot AI left a comment

Choose a reason for hiding this comment

Pull request overview

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

Shared config — `.gitleaks.toml`

codecov Bot commented Jun 14, 2026 •

edited

Loading

Self-review hardening (commit `38065e3`)