Fix purge-old-images workflow logging thousands of 404 errors #12301
brooke-hamilton wants to merge 2 commits into
…script
GHCR's list-package-versions API is eventually consistent and keeps returning already-deleted pr-* versions for days, so the previous action re-DELETEd them and logged thousands of harmless 404 errors every run. This replaces the action with an owned, idempotent purge script that treats 404/410 as the expected already-gone state (silent) and persists observed already-gone IDs in a self-healing skip cache that trims to only what GHCR still lists.
Co-authored-by: Copilot App <[email protected]>
Signed-off-by: Brooke Hamilton <[email protected]>
Dependency Review
✅ No vulnerabilities or license issues or OpenSSF Scorecard issues found.
Pull request overview
This PR replaces the snok/container-retention-policy GitHub Action in the scheduled purge-old-images workflow with a repo-owned Bash script that purges old dev/* GHCR package versions whose tags are exclusively pr-*, while treating 404/410 deletes as an expected “already gone” state and using a cached “ghost ID” set to avoid repeatedly re-attempting deletes that GHCR still lists due to eventual consistency.
Changes:
- Replace `snok/container-retention-policy` with `.github/scripts/purge-ghcr-dev-images.sh`, using explicit HTTP status handling and retries.
- Add a “ghost version” cache via `actions/cache` to suppress repeated already-gone delete attempts across runs.
- Add `workflow_dispatch` input `dry_run` (default `true`) to validate selection without deletion.
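The status handling this PR describes (204 deleted; 404/410 already gone and silent; 403/429 back off and retry; anything else a real failure) can be sketched roughly as follows. This is a minimal illustration, not the script's actual code: the function names `classify_delete_status` and `delete_with_retry`, the `GHCR_TOKEN` variable, the three-attempt limit, and the sleep intervals are all assumptions made for the example.

```bash
#!/usr/bin/env bash
# Hypothetical helper: map the HTTP status of a DELETE call to an outcome.
classify_delete_status() {
  case "$1" in
    204)     echo "deleted" ;;       # version removed by this call
    404|410) echo "already-gone" ;;  # expected for ghost versions; stay silent
    403|429) echo "retry" ;;         # rate limiting: back off, then try again
    *)       echo "failure" ;;       # anything else is a genuine error
  esac
}

# Hypothetical wrapper: DELETE one version URL, retrying on 403/429.
# Returns 0 for deleted/already-gone and 1 for a real failure.
# GHCR_TOKEN, the attempt count, and the sleep times are assumed values.
delete_with_retry() {
  local url="$1" attempt status outcome
  for attempt in 1 2 3; do
    status="$(curl -sS -o /dev/null -w '%{http_code}' -X DELETE \
      -H "Authorization: Bearer ${GHCR_TOKEN:?}" "$url")" || status="000"
    outcome="$(classify_delete_status "$status")"
    case "$outcome" in
      deleted|already-gone) return 0 ;;
      retry) sleep "$((attempt * 5))" ;;
      *) echo "delete failed: HTTP $status" >&2; return 1 ;;
    esac
  done
  echo "delete failed after retries: HTTP $status" >&2
  return 1
}
```

Keeping the classification separate from the network call makes the branching easy to check without touching GHCR.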
Reviewed changes
Copilot reviewed 2 out of 2 changed files in this pull request and generated 3 comments.
| File | Description |
|---|---|
| `.github/workflows/purge-old-images.yaml` | Switch workflow from third-party action to owned purge script; add ghost cache restore/save and `dry_run` input. |
| `.github/scripts/purge-ghcr-dev-images.sh` | New idempotent GHCR purge script with tag/age filtering, delete status branching, and self-healing ghost-ID cache logic. |
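The tag/age filter can be illustrated with a small predicate: a version qualifies only if every one of its tags starts with `pr-` and its `updated_at` is older than the cutoff. The sketch below is an assumption-laden illustration, not the script's code: the name `is_purge_candidate` is hypothetical, timestamps are assumed to be UTC ISO 8601 strings in one consistent format, and skipping untagged versions is a choice made for this example only.

```bash
#!/usr/bin/env bash
# Hypothetical predicate: succeeds only when a version is a purge candidate.
# Args: updated_at cutoff tag...
# Both timestamps are assumed to use the same UTC ISO 8601 form,
# e.g. 2024-01-08T00:00:00Z, so string order equals time order.
is_purge_candidate() {
  local updated_at="$1" cutoff="$2" tag
  shift 2
  # Assumption for this sketch: untagged versions are left alone.
  (( $# > 0 )) || return 1
  # Every tag must match ^pr-; one shared tag (e.g. latest) disqualifies.
  for tag in "$@"; do
    [[ "$tag" == pr-* ]] || return 1
  done
  # Older than the cutoff means lexicographically smaller.
  [[ "$updated_at" < "$cutoff" ]]
}

# Example call with made-up values:
if is_purge_candidate "2024-01-01T00:00:00Z" "2024-01-08T00:00:00Z" pr-123 pr-456; then
  echo "candidate"
fi
```

Requiring all tags to match, rather than any, is what keeps a version that also carries a shared tag out of the candidate set.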
Codecov Report
✅ All modified and coverable lines are covered by tests.
@@ Coverage Diff @@
## main #12301 +/- ##
==========================================
- Coverage 52.97% 52.96% -0.02%
==========================================
Files 754 754
Lines 48686 48686
==========================================
- Hits 25791 25785 -6
- Misses 20469 20472 +3
- Partials 2426 2429 +3
- validate_requirements now checks for awk (used for delete pacing)
- header Deps comment now lists curl and awk
- workflow creates .ghcr-ghost-cache before restore/save so dry-run / cache-miss runs are idempotent
Co-authored-by: Copilot App <[email protected]>
Signed-off-by: Brooke Hamilton <[email protected]>
Problem
The `purge-old-images` workflow succeeds (exit 0), but every run's log contains thousands of `404 Not Found` "Failed to delete package version" errors (~4,795 of ~5,632 selected versions in a recent run; ~80% of the same version IDs recur across runs 12h apart).
Root cause
GHCR's list-package-versions API is eventually consistent and keeps returning versions that were already deleted (with their old `pr-*` tags) for days. `snok/container-retention-policy` re-selects and re-DELETEs those ghosts every run; each 404 lands in the action's catch-all `error!` arm (logged, marked failed, but does not fail the job). There is no upstream toggle to silence it, and no GHCR API to force the listing to refresh.
Fix
Replace the action with an owned, idempotent purge script plus a self-healing ghost-ID skip cache:
- `.github/scripts/purge-ghcr-dev-images.sh` lists `dev/*` container packages and selects versions where every tag matches `^pr-` (so `latest`/`longr*` shared images are never touched) and `updated_at` is older than the cutoff (7 days).
- Deletes use `curl` with status branching: `204` = deleted, `404`/`410` = already-gone (silent), `403`/`429` = back off + retry, anything else = real failure -> job exits non-zero. So the job now fails only on genuine errors.
- Ghost-ID skip cache (`actions/cache`, rolling `run_id` key + `restore-keys` prefix). The cache is self-healing: each run persists only the already-gone IDs observed that run (cached IDs GHCR still lists + this run's fresh 404s + this run's successful deletes), so an ID GHCR finally stops listing drops out next run. `GHOST_CACHE_MAX` (`sort -n | tail`) is a hard backstop.
- Adds a `workflow_dispatch` `dry_run` input (default `true`) for safe validation, scoped `permissions: contents: read`, and keeps the classic-PAT auth and the failure-issue job.
Validation
`bash -n` clean; workflow YAML parses.
How to validate in CI
- `workflow_dispatch` with `dry_run=true`: confirm candidate counts look right, zero deletes, zero 404 ERROR lines.
- `workflow_dispatch` with `dry_run=false` once: confirm deletes succeed, ghosts silent, cache saved.
Follow-ups (not in this PR)
- `latest`/`longr*` shared images are still skipped by design; truly cleaning them needs a separate tag-level policy.
- `create_issue_on_failure` has no dedup/auto-close.
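The self-healing cache rule described under Fix (keep cached IDs GHCR still lists, add this run's fresh 404s and successful deletes, cap the result with `sort -n | tail`) could be sketched as below. This is only an illustration under stated assumptions: the helper name `next_ghost_cache`, the one-ID-per-line file layout, and the fallback cap of 10000 are hypothetical and not taken from the script.

```bash
#!/usr/bin/env bash
# Hypothetical helper: print the ghost IDs to persist for the next run.
# Args: cached_file listed_file fresh_gone_file deleted_file [max]
# Each file is assumed to hold one numeric version ID per line.
next_ghost_cache() {
  local cached="$1" listed="$2" fresh_gone="$3" deleted="$4"
  # The 10000 fallback is an arbitrary placeholder for GHOST_CACHE_MAX.
  local max="${5:-${GHOST_CACHE_MAX:-10000}}"
  {
    # Keep a cached ID only while GHCR still lists it (whole-line match);
    # grep exits 1 when nothing matches, which is not an error here.
    grep -Fxf "$listed" "$cached" || true
    # Add IDs observed as already gone or deleted during this run.
    cat "$fresh_gone" "$deleted"
  } | sort -n -u | tail -n "$max"   # hard backstop: keep the largest IDs
}
```

With this rule, an ID that GHCR finally stops listing is not carried forward, so the cache shrinks without any explicit expiry logic.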