feat(security): three-layer gitleaks secret-scan guard (#218)#220
feat(security): three-layer gitleaks secret-scan guard (#218)#220shaypal5 wants to merge 3 commits into
Conversation
Add defense-in-depth secret scanning with gitleaks (the industry tool, not ad-hoc regex), the structural follow-up to the seed-time Google CSE key leak. A repo-root .gitleaks.toml (default ruleset + a no-entropy AIza Google API-key rule + an allowlist for bulk news-candidate data and captured-page test fixtures) drives all three guards: 1. pre-commit gitleaks hook at stages: [pre-push] — the general git guard. 2. scripts/state-run.sh scan_state_for_secrets() — scans the state worktree before every commit/push and fails closed when gitleaks is absent (STATE_RUN_SKIP_SECRET_SCAN=1 is the discouraged escape hatch). 3. Claude Code PreToolUse/Bash hook (.claude/settings.json -> scripts/hooks/gitleaks_prepush_guard.py) blocking any agent-issued git push when gitleaks finds secrets in tracked content. Documented under AGENTS.md "Secret Scanning". gitleaks-gated integration tests skip when the binary is absent; squash/coexistence mechanics tests opt out via STATE_RUN_SKIP_SECRET_SCAN=1. .gitignore now lets the shared .claude/settings.json through while keeping settings.local.json and worktrees ignored. Satisfies gate (a) of UNIFY-PR-06. Co-Authored-By: Claude Opus 4.8 <[email protected]>
Co-Authored-By: Claude Opus 4.8 <[email protected]>
This comment has been minimized.
This comment has been minimized.
There was a problem hiding this comment.
Pull request overview
This PR adds defense-in-depth secret scanning at the main “egress” points (developer git pushes, state-repo publishing via state-run, and agent-initiated pushes) using a shared gitleaks configuration to prevent a repeat of the state-seeding secret leak incident.
Changes:
- Add a shared gitleaks rules+allowlist config (
.gitleaks.toml) and wire it into multiple guard layers. - Enforce secret scanning before state repo commit/push in
scripts/state-run.sh, with integration tests covering blocked/allowed cases. - Add developer/agent push guards via pre-commit pre-push gitleaks hook and a Claude Code
PreToolUsehook, plus documentation/plan updates.
Reviewed changes
Copilot reviewed 9 out of 10 changed files in this pull request and generated 2 comments.
Show a summary per file
| File | Description |
|---|---|
scripts/state-run.sh |
Adds a fail-closed gitleaks scan step before committing/pushing state. |
scripts/hooks/gitleaks_prepush_guard.py |
Adds a Claude Code hook to block git push when gitleaks finds secrets. |
.pre-commit-config.yaml |
Adds gitleaks as a pre-push hook. |
.gitleaks.toml |
Introduces the shared gitleaks configuration (default rules + strict Google key rule + allowlist). |
.claude/settings.json |
Registers the Claude Code PreToolUse/Bash hook to run the gitleaks guard on push. |
.gitignore |
Ignores all .claude/* except the shared .claude/settings.json. |
tests/integration/test_state_run.py |
Adds gitleaks-gated integration tests for the state-run secret scan and an opt-out for non-scan tests. |
tests/integration/test_state_squash.py |
Opts squash/coexistence tests out of the secret scan so they run without gitleaks installed. |
AGENTS.md |
Documents the three-layer secret-scanning approach and the gitleaks install requirement. |
.agent-plan.md |
Records the secret-scan guard as done and updates UNIFY-PR-06 gating text accordingly. |
| if [[ ${#subtrees[@]} -gt 0 ]]; then | ||
| for target in "${subtrees[@]}"; do | ||
| [[ -e "$STATE_REPO_DIR/$target" ]] && targets+=("$STATE_REPO_DIR/$target") | ||
| done | ||
| else | ||
| targets+=("$STATE_REPO_DIR") | ||
| fi |
| scan = subprocess.run( | ||
| ["gitleaks", "git", ".", "--no-banner", "--redact"], | ||
| capture_output=True, | ||
| text=True, | ||
| ) |
Codecov Report✅ All modified and coverable lines are covered by tests. Additional details and impacted files@@ Coverage Diff @@
## main #220 +/- ##
=======================================
Coverage 92.84% 92.84%
=======================================
Files 84 84
Lines 12337 12337
=======================================
Hits 11454 11454
Misses 883 883 🚀 New features to boost your workflow:
|
This comment has been minimized.
This comment has been minimized.
Address review findings that the guard had a hole exactly at the incident's location and was under-tested: - .gitleaks.toml: replace the blanket path allowlist (which skipped ALL rules on candidate data — blinding the scanner where the leaked key actually rode in) with narrow per-rule allowlists. Only generic-api-key is suppressed on candidate paths (and jwt on captured fixtures); the strict Google rule, jwt, and all provider rules stay active there, so a real key/JWT in the candidate stream is still caught. - state-run.sh: fail closed when .gitleaks.toml is missing (no silent fallback to gitleaks defaults, which miss the low-entropy Google key class); scan the staged diff via `gitleaks git --staged` (not the whole working tree, so it no longer blocks on pre-existing secrets in unchanged files); single run via a JSON report; distinguish leaks from operational errors; resolve the config via BASH_SOURCE. - gitleaks_prepush_guard.py: shlex-based push detection so it no longer fails open on `git --no-pager push`, env-prefixed, or reordered-flag forms; scan the `-C <dir>` target rather than a hardcoded `.`; pass an explicit script-relative --config; distinguish leaks from gitleaks errors. - ci-test.yml: install a pinned gitleaks in the integration-tests job so the guard's blocking tests run in CI instead of silently skipping. - tests: the old test asserted a Google key in candidate data passes the scan — i.e. it encoded the hole. Replace it with a generic-noise test, and add a regression test that a real Google key AND a JWT in candidate data ARE blocked. Co-Authored-By: Claude Opus 4.8 <[email protected]>
Self-review hardening (commit
|
| Finding | Fix |
|---|---|
Allowlist blinded the scanner on candidate data — the leaked key rode in inside latest_candidates.jsonl, yet those paths were blanket-allowlisted, so a Google key there scanned as "no leaks found" |
Replaced the blanket path allowlist with per-rule [[allowlists]] (targetRules): only generic-api-key is suppressed on candidate paths (and jwt on captured fixtures); the strict Google rule, jwt, and all provider rules stay active there |
Silent fallback to weak defaults when .gitleaks.toml is missing (defaults miss the low-entropy AIza class the strict rule exists for) |
state-run fails closed if the config is absent; the Claude hook resolves the config script-relative and passes --config explicitly |
Hook scanned . even for git -C <dir> push |
Parses -C <dir> and scans that repo |
Hook failed open on git --no-pager push, env-prefixed, or reordered-flag forms (over-specific regex) |
shlex tokenization across &&/;/|, skipping git global options |
| Guard tests skipped in CI (gitleaks not installed) → "verified" was hollow | CI integration job now installs a pinned gitleaks; the blocking tests run |
| state-run scanned the whole subtree (blocked on pre-existing secrets; slow); any non-zero exit printed "SECRET DETECTED" | Scans the staged diff via gitleaks git --staged; single JSON-report run; distinguishes leaks from operational errors |
| A test asserted a Google key in candidate data passes — the hole encoded as green | Replaced with a generic-noise test; added a regression test that a real Google key and a JWT in candidate data are blocked |
Local: ruff/mypy clean · 1566 passed, 2 skipped · all three guards hand-exercised (incl. the now-closed candidate-data hole).
|
pr-agent-context report: This run includes unresolved review comments on PR #220.
For each unresolved review comment, recommend one of: resolve as irrelevant, accept and implement
the recommended solution, open a separate issue and resolve as out-of-scope for this PR, accept and
implement a different solution, or resolve as already treated by the code.
After I reply with my decision per item, implement the accepted actions, resolve the corresponding
PR comments, and push all of these changes in a single commit.
# Copilot Comments
## COPILOT-1
Location: scripts/state-run.sh
URL: https://github.com/DataHackIL/tfht_enforce_idx/pull/220#discussion_r3409602690
Status: outdated
Root author: copilot-pull-request-reviewer
Comment:
If `--subtree` arguments are provided but none of them exist under `STATE_REPO_DIR` (or a typo slips in), `targets` stays empty and the function returns success without scanning anything. This creates a fail-open path for the secret scan.
## COPILOT-2
Location: scripts/hooks/gitleaks_prepush_guard.py
URL: https://github.com/DataHackIL/tfht_enforce_idx/pull/220#discussion_r3409602705
Status: outdated
Root author: copilot-pull-request-reviewer
Comment:
The Claude pre-push guard claims to use the repo-root `.gitleaks.toml`, but the gitleaks invocation doesn’t pass `--config` and runs relative to the hook’s current working directory. If the hook executes from a non-repo cwd, the scan may run against the wrong directory and/or ignore this repo’s allowlist/strict rules.Run metadata: |
| _FAKE_GOOGLE_KEY = "AIza" + "B1cD3fGh4JkLmN0pQrStUvWxYz123456789" | ||
| # A structurally-valid but meaningless JWT (the Supabase-JWT incident class). | ||
| _FAKE_JWT = ( | ||
| "eyJhbGciOiJIUzI1NiJ9.eyJzdWIiOiIxMjM0NTY3ODkwIn0.SflKxwRJSMeKKF2QT4fwpMeJf36POk6yJV_adQssw5c" | ||
| ) |
| # A high-entropy assignment that trips gitleaks' catch-all `generic-api-key` | ||
| # rule — the kind of false positive news text produces. Suppressed on candidate | ||
| # paths, so it must NOT block a push. | ||
| _GENERIC_NOISE = "api_key = 8f3Hq9ZxR2bN7vKpL4wYtCgD6sErA1mU0oP" |
| config_arg = ["--config", str(_CONFIG)] if _CONFIG.is_file() else [] | ||
| if not config_arg: | ||
| print( | ||
| f"gitleaks-prepush: WARNING — {_CONFIG} not found; scanning with gitleaks " | ||
| "defaults, which miss the low-entropy Google key class.", |
| body = Path(report).read_text() if Path(report).exists() else "" | ||
| if '"RuleID"' in body: | ||
| return True, (scan.stderr or body).strip()[-2000:] |
| run/backfill-batch/metrics snapshot writers, so an API error that echoes a key never reaches | ||
| state. Threat-model tested across the project's secret types. This is the last step of the | ||
| search-backstop code (`UNIFY-PR-05`) plus the incident fix. | ||
| - Last merged PR on main: `#220` (`GUARD-PR-SECRET-SCAN`, closes #218) — the three-layer |
| defers to a recent local search regardless of clock ordering). A zero-run day now finishes | ||
| non-fatal. | ||
| Covered by ledger + config + discover-job tests. | ||
| - [done] `GUARD-PR-SECRET-SCAN` (#220, closes #218): three-surface [gitleaks](https://github.com/gitleaks/gitleaks) |
Why
Closes #218. The structural follow-up to the seed-time incident where a live Google CSE API key — captured into a discovery run's
errors[]from a CSE403error URL (?key=…) — was pushed to the public state repo. The key was rotated, the public history purged, and the root-cause redaction landed in #217 (redact_secrets()scrubs credential values from discovery state at write time).This PR adds the outer defense: scan with a real industry secret scanner (gitleaks), not ad-hoc regex, at every point where content can leave the machine. It satisfies gate (a) of
UNIFY-PR-06(go-live).What
One config, three independent guards.
Shared config —
.gitleaks.tomluseDefault = true(the full industry ruleset), plus a strict, no-entropyAIza…Google API-key rule — gitleaks' default Google rule applies an entropy gate that missed the exact key that leaked.[allowlist]for bulk news-candidate data files (article URLs/titles/snippets — scanning them yields only third-party false positives) and captured-page test fixtures (which carry third-party ad/analytics tokens, not our secrets).Guard 1 — git pre-push hook (the "general git" guard)
.pre-commit-config.yamlgains the upstreamgitleakshook pinned atv8.30.1,stages: [pre-push]. The repo already setsdefault_install_hook_types: [pre-commit, pre-push]. Gates what leaves the machine viagit push. (Verified live on this very push — see the "Detect hardcoded secrets" pre-push step.)Guard 2 — state-run push guard
scripts/state-run.shgainsscan_state_for_secrets(), called before eachgit commit/push to the public state repo. It scans only the committed subtrees (falling back to the whole worktree) with the repo-root config. It fails closed: if gitleaks isn't installed it refuses to push, with install guidance.STATE_RUN_SKIP_SECRET_SCAN=1is the discouraged, explicit escape hatch.Guard 3 — Claude Code pre-push hook (the "Claude ability")
A new checked-in
.claude/settings.jsonregisters aPreToolUse/Bashhook →scripts/hooks/gitleaks_prepush_guard.py(stdlib-only). It no-ops unless the Bash command is agit push, then runsgitleaks git .and blocks the tool call (exit 2) with redacted findings when secrets are present. Stops an agent from pushing a secret in the first place..gitignoreis adjusted (.claude/*+!.claude/settings.json) so this one shared file is tracked whilesettings.local.jsonand worktrees stay ignored.Docs & plan
AGENTS.mdgains a Secret Scanning (defense in depth) section: the gitleaks install requirement, the config, and all three guards (incl. state-run's fail-closed behavior and the override var)..agent-plan.md: new[done] GUARD-PR-SECRET-SCANledger entry;UNIFY-PR-06gate (a) marked satisfied.Tests
tests/integration/test_state_run.py: 3 gitleaks-gated tests (skipifwhen the binary is absent) — a planted Google key is blocked before push, clean state passes, and bulk candidate data is not false-flagged._run_wrappergains ascan_secretsparam defaulting to off (setsSTATE_RUN_SKIP_SECRET_SCAN=1) so mechanics tests run anywhere.tests/integration/test_state_squash.py: squash/coexistence tests opt out of the scan via the same env var.Verification
ruff format --check ./ruff check .— cleanmypy src/— clean (105 files)pytest -q tests/unit tests/integration— 1565 passed, 2 skippedNote on activation
.claude/settings.jsonis newly created. Claude Code's settings watcher only tracks.claude/dirs that had a settings file at session start, so the hook becomes active for collaborators on their nextclaudesession (or after opening/hooksonce). The git pre-push and state-run guards are active immediately.