
Fix purge-old-images workflow logging thousands of 404 errors #12301

Draft
brooke-hamilton wants to merge 2 commits into main from brooke-hamilton-fix-purge-old-images-ghost-404s


@brooke-hamilton

Member

Problem

The purge-old-images workflow succeeds (exit 0) but every run's log contains thousands of 404 Not Found "Failed to delete package version" errors (~4,795 of ~5,632 selected versions in a recent run; ~80% of the same version IDs recur across runs 12h apart).

Root cause

GHCR's list-package-versions API is eventually consistent and keeps returning versions that were already deleted (with their old pr-* tags) for days. snok/container-retention-policy re-selects and re-DELETEs those ghosts every run; each 404 lands in the action's catch-all error! arm (logged, marked failed, but does not fail the job). There is no upstream toggle to silence it, and no GHCR API to force the listing to refresh.

Fix

Replace the action with an owned, idempotent purge script plus a self-healing ghost-ID skip cache:

  • .github/scripts/purge-ghcr-dev-images.sh lists dev/* container packages and selects versions where every tag matches ^pr- (so latest/longr* shared images are never touched) and updated_at is older than the cutoff (7 days).
  • Deletes via curl with status branching: 204 = deleted, 404/410 = already-gone (silent), 403/429 = back off + retry, anything else = real failure -> job exits non-zero. So the job now fails only on genuine errors.
  • Already-gone IDs are persisted in a ghost cache (via actions/cache, rolling run_id key + restore-keys prefix). The cache is self-healing: each run persists only the already-gone IDs observed that run (cached IDs GHCR still lists + this run's fresh 404s + this run's successful deletes), so an ID GHCR finally stops listing drops out next run. GHOST_CACHE_MAX (sort -n | tail) is a hard backstop.
  • Workflow gains a workflow_dispatch dry_run input (default true) for safe validation, scoped permissions: contents: read, and keeps the classic-PAT auth and the failure-issue job.
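
The delete status branching described above can be sketched as a small Bash function. This is a minimal illustration, not the code in purge-ghcr-dev-images.sh: the function name `classify_delete_status` and the outcome labels are hypothetical, and the real script also performs the curl call, the back-off, and the retry.

```shell
#!/usr/bin/env bash
# Sketch only: map the HTTP status of a package-version DELETE to an outcome.
# The function name and the outcome labels are assumptions for illustration.
classify_delete_status() {
  case "$1" in
    204) echo "deleted" ;;            # this request removed the version
    404 | 410) echo "already-gone" ;; # stale listing entry; handled silently
    403 | 429) echo "retry" ;;        # throttled or forbidden; back off and retry
    *) echo "failure" ;;              # anything else is a genuine error
  esac
}

# Show how a handful of statuses would be classified.
for status in 204 404 410 429 500; do
  printf '%s -> %s\n' "$status" "$(classify_delete_status "$status")"
done
```

Only the `failure` outcome should make the job exit non-zero; `already-gone` is the expected state for a version that GHCR still lists after it was deleted.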

Validation

  • shfmt / shellcheck / bash -n clean; workflow YAML parses.
  • Mocked end-to-end tests cover selection (skips mixed-tag / recent / untagged), all delete branches, dry-run, real-failure exit code, and self-healing eviction (run 1 caches deleted+404 IDs; run 2 evicts the ID GHCR stopped listing; run 3 steady state = zero 404 noise).
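
The three-run eviction scenario above can be reproduced with a small sketch of the cache update rule. The function `next_ghost_cache`, its argument order, and the version IDs are hypothetical; the sketch only assumes what the description states: the next cache is the cached IDs GHCR still lists, plus this run's 404s, plus this run's successful deletes, trimmed with `sort -n | tail`.

```shell
#!/usr/bin/env bash
# Sketch only: compute the ghost cache to persist at the end of a run.
# Usage: next_ghost_cache CACHED LISTED GONE_404 DELETED [MAX]
# Each of the first four arguments is a space-separated list of numeric IDs.
next_ghost_cache() {
  local max="${5:-1000}" id
  local -a cached listed gone deleted keep=()
  read -ra cached <<<"$1"
  read -ra listed <<<"$2"
  read -ra gone <<<"$3"
  read -ra deleted <<<"$4"
  # Keep a cached ID only while the listing still returns it (self-healing).
  for id in "${cached[@]}"; do
    if [[ " ${listed[*]} " == *" $id "* ]]; then
      keep+=("$id")
    fi
  done
  # Add the IDs that became already-gone during this run.
  keep+=("${gone[@]}" "${deleted[@]}")
  if ((${#keep[@]} > 0)); then
    # Hard backstop: keep at most MAX IDs, preferring the numerically largest.
    printf '%s\n' "${keep[@]}" | sort -n -u | tail -n "$max" | paste -sd' ' -
  fi
}

# Run 1: empty cache; 101 returns 404, 102 and 103 are deleted.
run1="$(next_ghost_cache "" "101 102 103" "101" "102 103")"
# Run 2: the listing has dropped 103, so it is evicted from the cache.
run2="$(next_ghost_cache "$run1" "101 102" "" "")"
# Run 3: steady state; nothing is attempted and the cache is unchanged.
run3="$(next_ghost_cache "$run2" "101 102" "" "")"
printf 'run1=%s\nrun2=%s\nrun3=%s\n' "$run1" "$run2" "$run3"
```

Because every cached ID must be re-confirmed by the listing on each run, the cache cannot grow without bound from IDs that GHCR has stopped returning; the MAX trim only matters if the listing itself stays stale for a very large number of versions.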

How to validate in CI

  1. workflow_dispatch with dry_run=true — confirm candidate counts look right, zero deletes, zero 404 ERROR lines.
  2. workflow_dispatch with dry_run=false once — confirm deletes succeed, ghosts silent, cache saved.
  3. Next scheduled run should show a smaller candidate set (cache warmed) and a clean log.

Follow-ups (not in this PR)

  • Shared recipe images tagged latest/longr* are still skipped by design; truly cleaning them needs a separate tag-level policy.
  • create_issue_on_failure has no dedup/auto-close.
  • Report the GHCR stale-listing behavior to GitHub Support.

…script

GHCR's list-package-versions API is eventually consistent and keeps returning already-deleted pr-* versions for days, so the previous action re-DELETEd them and logged thousands of harmless 404 errors every run. This replaces the action with an owned, idempotent purge script that treats 404/410 as the expected already-gone state (silent) and persists observed already-gone IDs in a self-healing skip cache that trims to only what GHCR still lists.

Co-authored-by: Copilot App <[email protected]>
Signed-off-by: Brooke Hamilton <[email protected]>
Copilot AI review requested due to automatic review settings July 1, 2026 22:02
@brooke-hamilton brooke-hamilton requested review from a team as code owners July 1, 2026 22:02
@github-actions

github-actions Bot commented Jul 1, 2026


Dependency Review

✅ No vulnerabilities or license issues or OpenSSF Scorecard issues found.

OpenSSF Scorecard

| Package | Version | Score |
| --- | --- | --- |
| actions/actions/cache/restore | 27d5ce7f107fe9357f9df03efb73ab90386fccae | 🟢 6.4 |

Details

| Check | Score | Reason |
| --- | --- | --- |
| Dangerous-Workflow | 🟢 10 | no dangerous workflow patterns detected |
| Maintained | 🟢 10 | 16 commit(s) and 0 issue activity found in the last 90 days -- score normalized to 10 |
| Code-Review | 🟢 10 | all changesets reviewed |
| Binary-Artifacts | 🟢 10 | no binaries found in the repo |
| Packaging | ⚠️ -1 | packaging workflow not detected |
| Token-Permissions | ⚠️ 0 | detected GitHub workflow tokens with excessive permissions |
| CII-Best-Practices | ⚠️ 0 | no effort to earn an OpenSSF best practices badge detected |
| Pinned-Dependencies | ⚠️ 3 | dependency not pinned by hash detected -- score normalized to 3 |
| License | 🟢 10 | license file detected |
| Fuzzing | ⚠️ 0 | project is not fuzzed |
| Signed-Releases | ⚠️ -1 | no releases found |
| Security-Policy | 🟢 9 | security policy file detected |
| SAST | 🟢 10 | SAST tool is run on all commits |
| Branch-Protection | ⚠️ 1 | branch protection is not maximal on development and all release branches |

| Package | Version | Score |
| --- | --- | --- |
| actions/actions/cache/save | 27d5ce7f107fe9357f9df03efb73ab90386fccae | 🟢 6.4 |

Details

| Check | Score | Reason |
| --- | --- | --- |
| Dangerous-Workflow | 🟢 10 | no dangerous workflow patterns detected |
| Maintained | 🟢 10 | 16 commit(s) and 0 issue activity found in the last 90 days -- score normalized to 10 |
| Code-Review | 🟢 10 | all changesets reviewed |
| Binary-Artifacts | 🟢 10 | no binaries found in the repo |
| Packaging | ⚠️ -1 | packaging workflow not detected |
| Token-Permissions | ⚠️ 0 | detected GitHub workflow tokens with excessive permissions |
| CII-Best-Practices | ⚠️ 0 | no effort to earn an OpenSSF best practices badge detected |
| Pinned-Dependencies | ⚠️ 3 | dependency not pinned by hash detected -- score normalized to 3 |
| License | 🟢 10 | license file detected |
| Fuzzing | ⚠️ 0 | project is not fuzzed |
| Signed-Releases | ⚠️ -1 | no releases found |
| Security-Policy | 🟢 9 | security policy file detected |
| SAST | 🟢 10 | SAST tool is run on all commits |
| Branch-Protection | ⚠️ 1 | branch protection is not maximal on development and all release branches |

| Package | Version | Score |
| --- | --- | --- |
| actions/actions/checkout | 9c091bb21b7c1c1d1991bb908d89e4e9dddfe3e0 | 🟢 6.9 |

Details

| Check | Score | Reason |
| --- | --- | --- |
| Code-Review | 🟢 10 | all changesets reviewed |
| Maintained | 🟢 10 | 16 commit(s) and 0 issue activity found in the last 90 days -- score normalized to 10 |
| Binary-Artifacts | 🟢 10 | no binaries found in the repo |
| Dangerous-Workflow | 🟢 10 | no dangerous workflow patterns detected |
| CII-Best-Practices | ⚠️ 0 | no effort to earn an OpenSSF best practices badge detected |
| Token-Permissions | ⚠️ 0 | detected GitHub workflow tokens with excessive permissions |
| Fuzzing | ⚠️ 0 | project is not fuzzed |
| License | 🟢 10 | license file detected |
| Packaging | ⚠️ -1 | packaging workflow not detected |
| Pinned-Dependencies | ⚠️ 3 | dependency not pinned by hash detected -- score normalized to 3 |
| Signed-Releases | ⚠️ -1 | no releases found |
| Security-Policy | 🟢 9 | security policy file detected |
| Branch-Protection | 🟢 5 | branch protection is not maximal on development and all release branches |
| SAST | 🟢 10 | SAST tool is run on all commits |

Scanned Files

  • .github/workflows/purge-old-images.yaml

@github-actions

github-actions Bot commented Jul 1, 2026


Unit Tests

    2 files  ±0    452 suites  ±0   7m 36s ⏱️ +7s
5 656 tests ±0  5 654 ✅ ±0  2 💤 ±0  0 ❌ ±0 
6 853 runs  ±0  6 851 ✅ ±0  2 💤 ±0  0 ❌ ±0 

Results for commit 5f57dec. ± Comparison against base commit bf1015c.

♻️ This comment has been updated with latest results.

@brooke-hamilton brooke-hamilton changed the title from "Replace snok/container-retention-policy with self-healing GHCR purge script" to "Fix purge-old-images workflow logging thousands of 404 errors" Jul 1, 2026

Copilot AI left a comment

Contributor


Pull request overview

This PR replaces the snok/container-retention-policy GitHub Action in the scheduled purge-old-images workflow with a repo-owned Bash script that purges old dev/* GHCR package versions whose tags are exclusively pr-*, while treating 404/410 deletes as an expected “already gone” state and using a cached “ghost ID” set to avoid repeatedly re-attempting deletes that GHCR still lists due to eventual consistency.

Changes:

  • Replace snok/container-retention-policy with .github/scripts/purge-ghcr-dev-images.sh, using explicit HTTP status handling and retries.
  • Add a “ghost version” cache via actions/cache to suppress repeated already-gone delete attempts across runs.
  • Add workflow_dispatch input dry_run (default true) to validate selection without deletion.
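
The selection rule summarized in this overview (tags exclusively pr-*, older than the cutoff) can be sketched as a pure-Bash predicate. The function name `is_purge_candidate` and its argument order are hypothetical, and the sketch assumes both timestamps are UTC ISO 8601 strings in the same format so that string comparison orders them correctly; the actual script may extract and compare these fields differently.

```shell
#!/usr/bin/env bash
# Sketch only: decide whether one package version is eligible for purging.
# Usage: is_purge_candidate UPDATED_AT CUTOFF [TAG...]
# Succeeds only if there is at least one tag, every tag matches ^pr-,
# and UPDATED_AT is strictly older than CUTOFF.
is_purge_candidate() {
  local updated_at="$1" cutoff="$2" tag
  shift 2
  (($# > 0)) || return 1 # untagged versions are skipped
  for tag in "$@"; do
    [[ "$tag" =~ ^pr- ]] || return 1 # any non-pr tag protects the version
  done
  # Same-format UTC ISO 8601 timestamps sort correctly as plain strings.
  [[ "$updated_at" < "$cutoff" ]]
}

# Example with an assumed cutoff of 2026-06-24T00:00:00Z (7 days before Jul 1).
cutoff="2026-06-24T00:00:00Z"
if is_purge_candidate "2026-06-01T00:00:00Z" "$cutoff" pr-123 pr-124; then
  echo "candidate"
fi
if ! is_purge_candidate "2026-06-01T00:00:00Z" "$cutoff" pr-123 latest; then
  echo "skipped: mixed tags"
fi
```

Requiring every tag to match, rather than any tag, is what keeps shared images that also carry latest or longr* tags out of the candidate set.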

Reviewed changes

Copilot reviewed 2 out of 2 changed files in this pull request and generated 3 comments.

File Description
.github/workflows/purge-old-images.yaml Switch workflow from third-party action to owned purge script; add ghost cache restore/save and dry_run input.
.github/scripts/purge-ghcr-dev-images.sh New idempotent GHCR purge script with tag/age filtering, delete status branching, and self-healing ghost-ID cache logic.

Comment thread .github/workflows/purge-old-images.yaml
Comment thread .github/scripts/purge-ghcr-dev-images.sh Outdated
Comment thread .github/scripts/purge-ghcr-dev-images.sh
@codecov

codecov Bot commented Jul 1, 2026


Codecov Report

✅ All modified and coverable lines are covered by tests.
✅ Project coverage is 52.96%. Comparing base (bf1015c) to head (5f57dec).

Additional details and impacted files
@@            Coverage Diff             @@
##             main   #12301      +/-   ##
==========================================
- Coverage   52.97%   52.96%   -0.02%     
==========================================
  Files         754      754              
  Lines       48686    48686              
==========================================
- Hits        25791    25785       -6     
- Misses      20469    20472       +3     
- Partials     2426     2429       +3     


- validate_requirements now checks for awk (used for delete pacing)
- header Deps comment now lists curl and awk
- workflow creates .ghcr-ghost-cache before restore/save so dry-run / cache-miss runs are idempotent

Co-authored-by: Copilot App <[email protected]>
Signed-off-by: Brooke Hamilton <[email protected]>
@radius-functional-tests

radius-functional-tests Bot commented Jul 1, 2026


Radius functional test overview

Name Value
Repository radius-project/radius
Commit ref 5f57dec
Unique ID func5281175526
Image tag pr-func5281175526
  • Dapr: 1.14.4
  • Azure KeyVault CSI driver: 1.4.2
  • Azure Workload identity webhook: 1.3.0
  • Bicep recipe location ghcr.io/radius-project/dev/test/testrecipes/test-bicep-recipes/<name>:pr-func5281175526
  • Terraform recipe location http://tf-module-server.radius-test-tf-module-server.svc.cluster.local/<name>.zip (in cluster)
  • applications-rp test image location: ghcr.io/radius-project/dev/applications-rp:pr-func5281175526
  • dynamic-rp test image location: ghcr.io/radius-project/dev/dynamic-rp:pr-func5281175526
  • controller test image location: ghcr.io/radius-project/dev/controller:pr-func5281175526
  • ucp test image location: ghcr.io/radius-project/dev/ucpd:pr-func5281175526
  • deployment-engine test image location: ghcr.io/radius-project/deployment-engine:latest

Test Status

⌛ Building Radius and pushing container images for functional tests...
✅ Container images build succeeded
⌛ Publishing Bicep Recipes for functional tests...
✅ Recipe publishing succeeded
⌛ Starting corerp-cloud functional tests...
⌛ Starting ucp-cloud functional tests...
✅ ucp-cloud functional tests succeeded
✅ corerp-cloud functional tests succeeded

@brooke-hamilton brooke-hamilton marked this pull request as draft July 1, 2026 23:11