feat(security): three-layer gitleaks secret-scan guard (#218) #220

Open
shaypal5 wants to merge 3 commits into main from codex/secret-scan-prepush-guard
@shaypal5

Member

Why

Closes #218. The structural follow-up to the seed-time incident where a live Google CSE API key — captured into a discovery run's errors[] from a CSE 403 error URL (?key=…) — was pushed to the public state repo. The key was rotated, the public history purged, and the root-cause redaction landed in #217 (redact_secrets() scrubs credential values from discovery state at write time).

This PR adds the outer defense: scan with a real industry secret scanner (gitleaks), not ad-hoc regex, at every point where content can leave the machine. It satisfies gate (a) of UNIFY-PR-06 (go-live).

What

One config, three independent guards.

Shared config — .gitleaks.toml

  • useDefault = true (the full industry ruleset), plus a strict, no-entropy AIza… Google API-key rule — gitleaks' default Google rule applies an entropy gate that missed the exact key that leaked.
  • An [allowlist] for bulk news-candidate data files (article URLs/titles/snippets — scanning them yields only third-party false positives) and captured-page test fixtures (which carry third-party ad/analytics tokens, not our secrets).
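
To illustrate why a shape-only rule catches what an entropy gate can miss, here is a minimal Python sketch. The pattern (`AIza` followed by 35 URL-safe characters) is an assumption about the key format, not the rule text from `.gitleaks.toml`, and `shannon_entropy` and `find_google_keys` are hypothetical helper names.

```python
import math
import re
from collections import Counter

# Assumed key shape: "AIza" plus 35 URL-safe characters. The actual rule in
# .gitleaks.toml is not shown in this PR, so treat this as an illustration.
STRICT_GOOGLE_KEY = re.compile(r"AIza[0-9A-Za-z_\-]{35}")


def shannon_entropy(text: str) -> float:
    """Bits of entropy per character, the kind of score an entropy gate thresholds."""
    counts = Counter(text)
    total = len(text)
    return -sum((n / total) * math.log2(n / total) for n in counts.values())


def find_google_keys(text: str) -> list[str]:
    """Return every AIza-shaped token, however random or repetitive it looks."""
    return STRICT_GOOGLE_KEY.findall(text)


# A deliberately repetitive, obviously fake key: low entropy, correct shape.
fake_key = "AIza" + "A" * 35
error_url = "https://example.invalid/cse?key=" + fake_key + "&cx=1"
```

A scanner that requires a minimum entropy would skip `fake_key`, while the shape-only match still reports it inside `error_url`.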

Guard 1 — git pre-push hook (the "general git" guard)

.pre-commit-config.yaml gains the upstream gitleaks hook pinned at v8.30.1, stages: [pre-push]. The repo already sets default_install_hook_types: [pre-commit, pre-push]. Gates what leaves the machine via git push. (Verified live on this very push — see the "Detect hardcoded secrets" pre-push step.)

Guard 2 — state-run push guard

scripts/state-run.sh gains scan_state_for_secrets(), called before each git commit/push to the public state repo. It scans only the committed subtrees (falling back to the whole worktree) with the repo-root config. It fails closed: if gitleaks isn't installed it refuses to push, with install guidance. STATE_RUN_SKIP_SECRET_SCAN=1 is the discouraged, explicit escape hatch.
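
The fail-closed decision can be sketched in a few lines of Python. The real guard is a shell function, so this only mirrors the logic described above; `scan_decision` and its return values are hypothetical names.

```python
import shutil
from typing import Callable, Mapping, Optional


def scan_decision(
    env: Mapping[str, str],
    which: Callable[[str], Optional[str]] = shutil.which,
) -> str:
    """Decide what the push path should do before touching the state repo.

    Returns "skip" when the explicit escape hatch is set, "refuse" when the
    scanner is missing (fail closed), and "scan" otherwise.
    """
    if env.get("STATE_RUN_SKIP_SECRET_SCAN") == "1":
        return "skip"
    if which("gitleaks") is None:
        # No scanner means no evidence the content is clean, so do not push.
        return "refuse"
    return "scan"
```

The key design point is that a missing binary maps to "refuse" rather than "skip": the absence of a scanner is treated as a failure, not as a pass.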

Guard 3 — Claude Code pre-push hook (the "Claude ability")

A new checked-in .claude/settings.json registers a PreToolUse/Bash hook → scripts/hooks/gitleaks_prepush_guard.py (stdlib-only). It no-ops unless the Bash command is a git push, then runs gitleaks git . and blocks the tool call (exit 2) with redacted findings when secrets are present. Stops an agent from pushing a secret in the first place. .gitignore is adjusted (.claude/* + !.claude/settings.json) so this one shared file is tracked while settings.local.json and worktrees stay ignored.
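
A minimal sketch of the hook's contract follows. It assumes the PreToolUse payload arrives as JSON with `tool_name` and `tool_input.command` fields and that exit code 2 blocks the tool call; `guard` and `run_gitleaks` are hypothetical names, and the substring check is deliberately naive rather than the detection the real script uses.

```python
import subprocess
import sys
from typing import Callable


def run_gitleaks() -> subprocess.CompletedProcess:
    # Same invocation as the first cut of the guard quoted in the review.
    return subprocess.run(
        ["gitleaks", "git", ".", "--no-banner", "--redact"],
        capture_output=True,
        text=True,
    )


def guard(
    payload: dict,
    scan: Callable[[], subprocess.CompletedProcess] = run_gitleaks,
) -> int:
    """Return the hook's exit code: 0 allows the tool call, 2 blocks it."""
    if payload.get("tool_name") != "Bash":
        return 0
    command = payload.get("tool_input", {}).get("command", "")
    # Naive check, kept short on purpose; it misses forms like
    # `git --no-pager push`.
    if "git push" not in command:
        return 0
    result = scan()
    if result.returncode != 0:
        # --redact keeps the secret values out of the reported findings.
        print(result.stderr or result.stdout, file=sys.stderr)
        return 2
    return 0
```

Wired up as a script, the entry point would be roughly `sys.exit(guard(json.load(sys.stdin)))`. Passing `scan` as a parameter keeps the decision logic testable without the gitleaks binary.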

Docs & plan

  • AGENTS.md gains a Secret Scanning (defense in depth) section: the gitleaks install requirement, the config, and all three guards (incl. state-run's fail-closed behavior and the override var).
  • .agent-plan.md: new [done] GUARD-PR-SECRET-SCAN ledger entry; UNIFY-PR-06 gate (a) marked satisfied.

Tests

  • tests/integration/test_state_run.py: 3 gitleaks-gated tests (skipif when the binary is absent) — a planted Google key is blocked before push, clean state passes, and bulk candidate data is not false-flagged. _run_wrapper gains a scan_secrets param defaulting to off (sets STATE_RUN_SKIP_SECRET_SCAN=1) so mechanics tests run anywhere.
  • tests/integration/test_state_squash.py: squash/coexistence tests opt out of the scan via the same env var.

Verification

  • ruff format --check . / ruff check . — clean
  • mypy src/ — clean (105 files)
  • pytest -q tests/unit tests/integration: 1565 passed, 2 skipped
  • All three guards exercised by hand: pre-push hook passed live on this push; the Claude hook allows non-push + clean push and blocks a committed key (exit 2); state-run blocks a planted key and passes clean state.

Note on activation

.claude/settings.json is newly created. Claude Code's settings watcher only tracks .claude/ dirs that had a settings file at session start, so the hook becomes active for collaborators on their next claude session (or after opening /hooks once). The git pre-push and state-run guards are active immediately.

Add defense-in-depth secret scanning with gitleaks (the industry tool,
not ad-hoc regex), the structural follow-up to the seed-time Google CSE
key leak. A repo-root .gitleaks.toml (default ruleset + a no-entropy
AIza Google API-key rule + an allowlist for bulk news-candidate data and
captured-page test fixtures) drives all three guards:

1. pre-commit gitleaks hook at stages: [pre-push] — the general git guard.
2. scripts/state-run.sh scan_state_for_secrets() — scans the state
   worktree before every commit/push and fails closed when gitleaks is
   absent (STATE_RUN_SKIP_SECRET_SCAN=1 is the discouraged escape hatch).
3. Claude Code PreToolUse/Bash hook (.claude/settings.json ->
   scripts/hooks/gitleaks_prepush_guard.py) blocking any agent-issued
   git push when gitleaks finds secrets in tracked content.

Documented under AGENTS.md "Secret Scanning". gitleaks-gated integration
tests skip when the binary is absent; squash/coexistence mechanics tests
opt out via STATE_RUN_SKIP_SECRET_SCAN=1. .gitignore now lets the shared
.claude/settings.json through while keeping settings.local.json and
worktrees ignored. Satisfies gate (a) of UNIFY-PR-06.

Co-Authored-By: Claude Opus 4.8 <[email protected]>
@shaypal5 shaypal5 added this to the Local↔CI Unification milestone Jun 14, 2026
Copilot AI review requested due to automatic review settings June 14, 2026 13:22
@shaypal5 shaypal5 added ci security Security hardening / incident follow-up labels Jun 14, 2026
Copilot AI left a comment

Contributor

Pull request overview

This PR adds defense-in-depth secret scanning at the main “egress” points (developer git pushes, state-repo publishing via state-run, and agent-initiated pushes) using a shared gitleaks configuration to prevent a repeat of the state-seeding secret leak incident.

Changes:

  • Add a shared gitleaks rules+allowlist config (.gitleaks.toml) and wire it into multiple guard layers.
  • Enforce secret scanning before state repo commit/push in scripts/state-run.sh, with integration tests covering blocked/allowed cases.
  • Add developer/agent push guards via pre-commit pre-push gitleaks hook and a Claude Code PreToolUse hook, plus documentation/plan updates.

Reviewed changes

Copilot reviewed 9 out of 10 changed files in this pull request and generated 2 comments.

| File | Description |
| --- | --- |
| scripts/state-run.sh | Adds a fail-closed gitleaks scan step before committing/pushing state. |
| scripts/hooks/gitleaks_prepush_guard.py | Adds a Claude Code hook to block git push when gitleaks finds secrets. |
| .pre-commit-config.yaml | Adds gitleaks as a pre-push hook. |
| .gitleaks.toml | Introduces the shared gitleaks configuration (default rules + strict Google key rule + allowlist). |
| .claude/settings.json | Registers the Claude Code PreToolUse/Bash hook to run the gitleaks guard on push. |
| .gitignore | Ignores all .claude/* except the shared .claude/settings.json. |
| tests/integration/test_state_run.py | Adds gitleaks-gated integration tests for the state-run secret scan and an opt-out for non-scan tests. |
| tests/integration/test_state_squash.py | Opts squash/coexistence tests out of the secret scan so they run without gitleaks installed. |
| AGENTS.md | Documents the three-layer secret-scanning approach and the gitleaks install requirement. |
| .agent-plan.md | Records the secret-scan guard as done and updates UNIFY-PR-06 gating text accordingly. |

Comment thread scripts/state-run.sh Outdated
Comment on lines +144 to +150
if [[ ${#subtrees[@]} -gt 0 ]]; then
  for target in "${subtrees[@]}"; do
    [[ -e "$STATE_REPO_DIR/$target" ]] && targets+=("$STATE_REPO_DIR/$target")
  done
else
  targets+=("$STATE_REPO_DIR")
fi
Comment thread scripts/hooks/gitleaks_prepush_guard.py Outdated
Comment on lines +47 to +51
scan = subprocess.run(
    ["gitleaks", "git", ".", "--no-banner", "--redact"],
    capture_output=True,
    text=True,
)
@codecov

codecov Bot commented Jun 14, 2026


Codecov Report

✅ All modified and coverable lines are covered by tests.
✅ Project coverage is 92.84%. Comparing base (4f011be) to head (38065e3).

Additional details and impacted files
@@           Coverage Diff           @@
##             main     #220   +/-   ##
=======================================
  Coverage   92.84%   92.84%           
=======================================
  Files          84       84           
  Lines       12337    12337           
=======================================
  Hits        11454    11454           
  Misses        883      883           

Address review findings that the guard had a hole exactly at the incident's
location and was under-tested:

- .gitleaks.toml: replace the blanket path allowlist (which skipped ALL rules on
  candidate data — blinding the scanner where the leaked key actually rode in)
  with narrow per-rule allowlists. Only generic-api-key is suppressed on
  candidate paths (and jwt on captured fixtures); the strict Google rule, jwt,
  and all provider rules stay active there, so a real key/JWT in the candidate
  stream is still caught.
- state-run.sh: fail closed when .gitleaks.toml is missing (no silent fallback
  to gitleaks defaults, which miss the low-entropy Google key class); scan the
  staged diff via `gitleaks git --staged` (not the whole working tree, so it no
  longer blocks on pre-existing secrets in unchanged files); single run via a
  JSON report; distinguish leaks from operational errors; resolve the config via
  BASH_SOURCE.
- gitleaks_prepush_guard.py: shlex-based push detection so it no longer fails
  open on `git --no-pager push`, env-prefixed, or reordered-flag forms; scan the
  `-C <dir>` target rather than a hardcoded `.`; pass an explicit script-relative
  --config; distinguish leaks from gitleaks errors.
- ci-test.yml: install a pinned gitleaks in the integration-tests job so the
  guard's blocking tests run in CI instead of silently skipping.
- tests: the old test asserted a Google key in candidate data passes the scan —
  i.e. it encoded the hole. Replace it with a generic-noise test, and add a
  regression test that a real Google key AND a JWT in candidate data ARE blocked.

Co-Authored-By: Claude Opus 4.8 <[email protected]>
Copilot AI review requested due to automatic review settings June 15, 2026 11:24
@shaypal5

Member Author

Self-review hardening (commit 38065e3)

A brutal pass over the first cut found that the guard had a hole exactly where the incident happened and was under-tested. Fixed:

| Finding | Fix |
| --- | --- |
| Allowlist blinded the scanner on candidate data: the leaked key rode in inside latest_candidates.jsonl, yet those paths were blanket-allowlisted, so a Google key there scanned as "no leaks found" | Replaced the blanket path allowlist with per-rule [[allowlists]] (targetRules): only generic-api-key is suppressed on candidate paths (and jwt on captured fixtures); the strict Google rule, jwt, and all provider rules stay active there |
| Silent fallback to weak defaults when .gitleaks.toml is missing (defaults miss the low-entropy AIza class the strict rule exists for) | state-run fails closed if the config is absent; the Claude hook resolves the config script-relative and passes --config explicitly |
| Hook scanned . even for git -C <dir> push | Parses -C <dir> and scans that repo |
| Hook failed open on git --no-pager push, env-prefixed, or reordered-flag forms (over-specific regex) | shlex tokenization across &&/;/\| separators, skipping git global options |
| Guard tests skipped in CI (gitleaks not installed) → "verified" was hollow | CI integration job now installs a pinned gitleaks; the blocking tests run |
| state-run scanned the whole subtree (blocked on pre-existing secrets; slow); any non-zero exit printed "SECRET DETECTED" | Scans the staged diff via gitleaks git --staged; single JSON-report run; distinguishes leaks from operational errors |
| A test asserted a Google key in candidate data passes, so the hole was encoded as green | Replaced with a generic-noise test; added a regression test that a real Google key and a JWT in candidate data are blocked |

Local: ruff/mypy clean · 1566 passed, 2 skipped · all three guards hand-exercised (incl. the now-closed candidate-data hole).
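
The shlex-based push detection described in the table can be sketched as below. This is an illustration of the approach under stated assumptions, not the code in gitleaks_prepush_guard.py: `push_targets`, the separator set, and the list of value-taking global options are all hypothetical and intentionally partial.

```python
import re
import shlex

# Shell operators that end one simple command and start another.
_SEPARATORS = {"&&", "||", ";", "|", "&", "(", ")"}
_ASSIGNMENT = re.compile(r"[A-Za-z_][A-Za-z0-9_]*=")
# Global git options that consume the next token (a partial, assumed list).
_VALUE_OPTS = {"-C", "-c", "--git-dir", "--work-tree", "--namespace"}


def _segments(command: str) -> list[list[str]]:
    """Split a shell command line into token lists, one per simple command."""
    lexer = shlex.shlex(command, posix=True, punctuation_chars=True)
    lexer.whitespace_split = True
    segments: list[list[str]] = [[]]
    for token in lexer:
        if token in _SEPARATORS:
            segments.append([])
        else:
            segments[-1].append(token)
    return segments


def push_targets(command: str) -> list[str]:
    """Return the directory each `git push` in the command would run in."""
    try:
        segments = _segments(command)
    except ValueError:
        # Unbalanced quotes: be conservative and treat it as a push from ".".
        return ["."]
    targets = []
    for tokens in segments:
        i = 0
        # Skip `env` and leading VAR=value assignments.
        while i < len(tokens) and (tokens[i] == "env" or _ASSIGNMENT.match(tokens[i])):
            i += 1
        if i >= len(tokens) or tokens[i] != "git":
            continue
        i += 1
        repo = "."
        # Walk git's global options until the subcommand is reached.
        while i < len(tokens) and tokens[i].startswith("-"):
            if tokens[i] == "-C" and i + 1 < len(tokens):
                repo = tokens[i + 1]
            i += 2 if tokens[i] in _VALUE_OPTS else 1
        if i < len(tokens) and tokens[i] == "push":
            targets.append(repo)
    return targets
```

Because the subcommand is located by walking tokens rather than by matching a fixed string, `git --no-pager push` and `GIT_SSH_COMMAND=ssh git -C /tmp/repo push` are both recognized, while a quoted `echo "git push"` is not.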

@github-actions


pr-agent-context report:

This run includes unresolved review comments on PR #220.

For each unresolved review comment, recommend one of: resolve as irrelevant, accept and implement
the recommended solution, open a separate issue and resolve as out-of-scope for this PR, accept and
implement a different solution, or resolve as already treated by the code.

After I reply with my decision per item, implement the accepted actions, resolve the corresponding
PR comments, and push all of these changes in a single commit.

# Copilot Comments

## COPILOT-1
Location: scripts/state-run.sh
URL: https://github.com/DataHackIL/tfht_enforce_idx/pull/220#discussion_r3409602690
Status: outdated
Root author: copilot-pull-request-reviewer

Comment:
    If `--subtree` arguments are provided but none of them exist under `STATE_REPO_DIR` (or a typo slips in), `targets` stays empty and the function returns success without scanning anything. This creates a fail-open path for the secret scan.
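
One way to close the fail-open path this comment describes is to treat "subtrees requested but none resolved" as an error. The sketch below restates that idea in Python; the script itself is shell, the later commit took a different route (scanning the staged diff), and `resolve_scan_targets` and `NothingToScan` are hypothetical names.

```python
from pathlib import Path


class NothingToScan(RuntimeError):
    """Raised when subtrees were requested but none resolved to a real path."""


def resolve_scan_targets(state_dir: Path, subtrees: list[str]) -> list[Path]:
    """Pick the paths to scan, refusing to return an empty list."""
    if not subtrees:
        # No subtrees given: fall back to the whole worktree.
        return [state_dir]
    targets = [state_dir / name for name in subtrees if (state_dir / name).exists()]
    if not targets:
        # A typo or a missing directory must not turn into a silent pass.
        raise NothingToScan(f"none of {subtrees} exist under {state_dir}")
    return targets
```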

## COPILOT-2
Location: scripts/hooks/gitleaks_prepush_guard.py
URL: https://github.com/DataHackIL/tfht_enforce_idx/pull/220#discussion_r3409602705
Status: outdated
Root author: copilot-pull-request-reviewer

Comment:
    The Claude pre-push guard claims to use the repo-root `.gitleaks.toml`, but the gitleaks invocation doesn’t pass `--config` and runs relative to the hook’s current working directory. If the hook executes from a non-repo cwd, the scan may run against the wrong directory and/or ignore this repo’s allowlist/strict rules.

Run metadata:

Tool ref: v4.0.19
Tool version: 4.0.19
Trigger: commit pushed
Workflow run: 27542897361 attempt 1
Comment timestamp: 2026-06-15T11:27:46.504593+00:00
PR head commit: 38065e3cda5d4b000a949a9850ebbd4f13b891e4

Copilot AI left a comment

Contributor

Pull request overview

Copilot reviewed 10 out of 11 changed files in this pull request and generated 6 comments.

Comment on lines +266 to +270
_FAKE_GOOGLE_KEY = "AIza" + "B1cD3fGh4JkLmN0pQrStUvWxYz123456789"
# A structurally-valid but meaningless JWT (the Supabase-JWT incident class).
_FAKE_JWT = (
    "eyJhbGciOiJIUzI1NiJ9.eyJzdWIiOiIxMjM0NTY3ODkwIn0.SflKxwRJSMeKKF2QT4fwpMeJf36POk6yJV_adQssw5c"
)
Comment on lines +271 to +274
# A high-entropy assignment that trips gitleaks' catch-all `generic-api-key`
# rule — the kind of false positive news text produces. Suppressed on candidate
# paths, so it must NOT block a push.
_GENERIC_NOISE = "api_key = 8f3Hq9ZxR2bN7vKpL4wYtCgD6sErA1mU0oP"
Comment on lines +100 to +104
config_arg = ["--config", str(_CONFIG)] if _CONFIG.is_file() else []
if not config_arg:
    print(
        f"gitleaks-prepush: WARNING — {_CONFIG} not found; scanning with gitleaks "
        "defaults, which miss the low-entropy Google key class.",
Comment on lines +128 to +130
body = Path(report).read_text() if Path(report).exists() else ""
if '"RuleID"' in body:
    return True, (scan.stderr or body).strip()[-2000:]
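
A stricter alternative to the substring test above is to parse the report. The sketch assumes the gitleaks JSON report is an array of finding objects that each carry a `RuleID` field, and that a non-zero exit with no parsable findings is an operational error; `classify_scan` is a hypothetical helper, not part of the hook.

```python
import json


def classify_scan(returncode: int, report_text: str) -> str:
    """Return "clean", "leaks", or "error" for one gitleaks run."""
    try:
        findings = json.loads(report_text) if report_text.strip() else []
    except json.JSONDecodeError:
        # A truncated or missing report is not evidence of a clean scan.
        return "error"
    if isinstance(findings, list) and any(
        isinstance(item, dict) and "RuleID" in item for item in findings
    ):
        return "leaks"
    if returncode == 0:
        return "clean"
    # Non-zero exit without findings: the scanner itself failed.
    return "error"
```

Parsing avoids a false "leaks" verdict if the literal text `"RuleID"` ever appears inside a scanned value, and it keeps scanner failures distinct from real findings.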
Comment thread .agent-plan.md
run/backfill-batch/metrics snapshot writers, so an API error that echoes a key never reaches
state. Threat-model tested across the project's secret types. This is the last step of the
search-backstop code (`UNIFY-PR-05`) plus the incident fix.
- Last merged PR on main: `#220` (`GUARD-PR-SECRET-SCAN`, closes #218) — the three-layer
Comment thread .agent-plan.md
defers to a recent local search regardless of clock ordering). A zero-run day now finishes
non-fatal.
Covered by ledger + config + discover-job tests.
- [done] `GUARD-PR-SECRET-SCAN` (#220, closes #218): three-surface [gitleaks](https://github.com/gitleaks/gitleaks)
Labels

ci security Security hardening / incident follow-up

Development

Successfully merging this pull request may close these issues.

Add a secret-scan guard to state-run before committing/pushing state

2 participants