Skip to content

OCPBUGS-86310: Handle CA bundle aggregation delay by requeuing revocation#8563

Merged
openshift-merge-bot[bot] merged 1 commit into
openshift:mainfrom
YamunadeviShanmugam:fix_race_condition_ca_bundle
May 21, 2026
Merged

OCPBUGS-86310: Handle CA bundle aggregation delay by requeuing revocation#8563
openshift-merge-bot[bot] merged 1 commit into
openshift:mainfrom
YamunadeviShanmugam:fix_race_condition_ca_bundle

Conversation

@YamunadeviShanmugam

@YamunadeviShanmugam YamunadeviShanmugam commented May 21, 2026

Copy link
Copy Markdown
Contributor

** CA bundle aggregation delay**
During End-to-End (e2e) testing, CertificateRevocationRequest (CRR) successfully processes initial steps such as leaf and root certificate regeneration, but permanently stuck on Condition: PreviousCertificatesRevoked=False Reason: WaitingForAvailable(Previous signer certificate not yet revoked.)

The failure is caused by race condition between CertificateRevocationController and TargetConfigController

Fix:
Modified certificaterevocationcontroller.go to explicitly request a requeue when the old signer certificate has not yet removed from the total trust bundle by TargetConfigController.

Summary by CodeRabbit

  • Bug Fixes
    • PKI operator now automatically retries when newly issued signer certificates haven't yet propagated to client trust bundles, reducing transient failures during rollout.
    • PKI operator now automatically retries when revoked or old signer certificates remain present in client trust bundles, improving safety and consistency during certificate rotation.

@openshift-merge-bot

Copy link
Copy Markdown
Contributor

Pipeline controller notification
This repo is configured to use the pipeline controller. Second-stage tests will be triggered either automatically or after lgtm label is added, depending on the repository configuration. The pipeline controller will automatically detect which contexts are required and will utilize /test Prow commands to trigger the second stage.

For optional jobs, comment /test ? to see a list of all defined jobs. To trigger manually all jobs from second stage use /pipeline required command.

This repository is configured in: LGTM mode

@openshift-ci openshift-ci Bot added the do-not-merge/work-in-progress Indicates that a PR should not merge because it is a work in progress. label May 21, 2026
@openshift-ci

openshift-ci Bot commented May 21, 2026

Copy link
Copy Markdown
Contributor

Skipping CI for Draft Pull Request.
If you want CI signal for your change, please convert it to an actual PR.
You can still manually trigger a test run with /test all

@YamunadeviShanmugam YamunadeviShanmugam changed the title fix the race condition in CA bundle Fix the race condition in CA bundle May 21, 2026
@coderabbitai

coderabbitai Bot commented May 21, 2026

Copy link
Copy Markdown
Contributor

Note

Reviews paused

It looks like this branch is under active development. To avoid overwhelming you with review comments due to an influx of new commits, CodeRabbit has automatically paused this review. You can configure this behavior by changing the reviews.auto_review.auto_pause_after_reviewed_commits setting.

Use the following commands to manage reviews:

  • @coderabbitai resume to resume automatic reviews.
  • @coderabbitai review to trigger a single review.

Use the checkboxes below for quick actions:

  • ▶️ Resume reviews
  • 🔍 Trigger review
📝 Walkthrough

Walkthrough

This pull request adjusts the certificate revocation controller's trust-bundle propagation logic to gate operations on synchronization state. The controller now issues synthetic requeues in two scenarios: when the new signer certificate has not yet propagated to the trust bundle, and when the old signer certificate has not yet been revoked from the trust bundle. Tests were updated to expect these requeue outcomes.

Sequence Diagram(s)

sequenceDiagram
  participant CertificateRevocationController
  participant TotalClientTrustBundle
  CertificateRevocationController->>TotalClientTrustBundle: trustedCertificates(newSigner)
  TotalClientTrustBundle-->>CertificateRevocationController: count = 0
  CertificateRevocationController->>CertificateRevocationController: synthetic requeue (new signer not propagated)
  CertificateRevocationController->>TotalClientTrustBundle: trustedCertificates(oldSigner)
  TotalClientTrustBundle-->>CertificateRevocationController: count > 0
  CertificateRevocationController->>CertificateRevocationController: synthetic requeue (old signer still trusted)
Loading

Possibly related PRs

  • openshift/hypershift#8263: Modifies the same certificaterevocationcontroller propagation/revocation gating to return requeue=true until trust/revocation conditions are satisfied.

Suggested reviewers

  • csrwng
🚥 Pre-merge checks | ✅ 11 | ❌ 1

❌ Failed checks (1 warning)

Check name Status Explanation Resolution
Docstring Coverage ⚠️ Warning Docstring coverage is 0.00% which is insufficient. The required threshold is 80.00%. Write docstrings for the functions missing them to satisfy the coverage threshold.
✅ Passed checks (11 passed)
Check name Status Explanation
Linked Issues check ✅ Passed Check skipped because no linked issues were found for this pull request.
Out of Scope Changes check ✅ Passed Check skipped because no linked issues were found for this pull request.
Stable And Deterministic Test Names ✅ Passed Tests use Go's standard testing (t.Run), not Ginkgo. Custom check applies only to Ginkgo test names (It/Describe/Context/When), which are absent.
Test Structure And Quality ✅ Passed The custom check is for Ginkgo test code, but the PR modifies standard Go tests using *testing.T with t.Run() subtests, not Ginkgo's It/Describe blocks.
Microshift Test Compatibility ✅ Passed No new Ginkgo e2e tests are added. Only existing standard Go unit tests are modified, updating requeue expectations in two test cases.
Single Node Openshift (Sno) Test Compatibility ✅ Passed PR does not add new Ginkgo e2e tests. Changes are to unit tests (standard Go testing.T) and controller logic only, so SNO compatibility check does not apply.
Topology-Aware Scheduling Compatibility ✅ Passed PR changes only controller logic (requeue behavior). No deployment manifests, scheduling constraints, affinity rules, or topology assumptions added.
Ote Binary Stdout Contract ✅ Passed No stdout-writing code added at process-level. Changes are only to controller business logic and test assertions—no fmt.Print*, log.Print*, klog config, os.Stdout, or println patterns detected.
Ipv6 And Disconnected Network Test Compatibility ✅ Passed PR adds no new Ginkgo e2e tests. Changes are limited to controller code and unit tests using standard Go testing framework. Check not applicable.
Description Check ✅ Passed Check skipped - CodeRabbit’s high-level summary is enabled.
Title check ✅ Passed The title clearly and specifically describes the main change: handling CA bundle aggregation delay through requeuing revocation logic in the certificate revocation controller.

✏️ Tip: You can configure your own custom pre-merge checks in the settings.

✨ Finishing Touches
🧪 Generate unit tests (beta)
  • Create PR with unit tests

Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out.

❤️ Share

Comment @coderabbitai help to get the list of available commands and usage tips.

@openshift-ci openshift-ci Bot added area/control-plane-pki-operator Indicates the PR includes changes for the control plane PKI operator - in an OCP release and removed do-not-merge/needs-area labels May 21, 2026

@coderabbitai coderabbitai Bot left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🧹 Nitpick comments (2)
control-plane-pki-operator/certificaterevocationcontroller/certificaterevocationcontroller.go (2)

1035-1037: ⚡ Quick win

Consider adding inline comment and debug logging for observability.

Following the pattern at lines 1061 and 1073, consider adding an inline comment explaining why synthetic requeue is needed here (waiting for TargetConfigController to remove the old signer from the trust bundle), plus debug logging to help diagnose similar stuck conditions in the future.

📝 Suggested improvement for observability
 if len(trustedCertificates(totalClientTrustBundle, []*certificateSecret{{cert: oldCerts[0]}}, now)) != 0 {
+	klog.V(4).Infof("Old signer certificate still present in total trust bundle for %s/%s, requeueing", namespace, name)
-	return true, nil, true, nil
+	return true, nil, true, nil // synthetic re-queue needed for trust bundle revocation by TargetConfigController
 }
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In
`@control-plane-pki-operator/certificaterevocationcontroller/certificaterevocationcontroller.go`
around lines 1035 - 1037, Add an inline comment above the conditional using
trustedCertificates(totalClientTrustBundle, []*certificateSecret{{cert:
oldCerts[0]}}, now) explaining that we return a synthetic requeue here to wait
for the TargetConfigController to remove the old signer from the trust bundle;
also add a debug log (using the controller's logger) just before returning that
records the computed trust length (len(...)) and identifying info about
oldCerts[0] (e.g., subject/serial or a stable identifier) so operators can
diagnose stuck conditions when this branch fires.

735-737: ⚡ Quick win

Consider adding inline comment and debug logging for observability.

Following the pattern at lines 761 and 780, consider adding an inline comment explaining why synthetic requeue is needed here (waiting for TargetConfigController to propagate the new signer to the trust bundle), plus debug logging similar to other requeue conditions in this controller.

📝 Suggested improvement for observability
 if len(trustedCertificates(totalClientTrustBundle, []*certificateSecret{{cert: signers[0]}}, now)) == 0 {
+	klog.V(4).Infof("New signer certificate %s/%s not yet propagated to total trust bundle for %s/%s, requeueing", signer.Namespace, signer.Name, namespace, name)
-	return true, nil, true, nil
+	return true, nil, true, nil // synthetic re-queue needed for trust bundle propagation by TargetConfigController
 }
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In
`@control-plane-pki-operator/certificaterevocationcontroller/certificaterevocationcontroller.go`
around lines 735 - 737, Add an inline comment above the if that uses
trustedCertificates(totalClientTrustBundle, []*certificateSecret{{cert:
signers[0]}}, now) == 0 explaining this is a synthetic requeue to wait for the
TargetConfigController to propagate the new signer into the trust bundle; also
add a debug log (using the controller's logger, e.g. r.log or processLogger) in
the same branch that logs the reason and relevant context (e.g. signer
fingerprint/identifier from signers[0], and that trustedCertificates returned 0)
so the requeue is observable and consistent with other requeue debug logs in
this controller.
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Nitpick comments:
In
`@control-plane-pki-operator/certificaterevocationcontroller/certificaterevocationcontroller.go`:
- Around line 1035-1037: Add an inline comment above the conditional using
trustedCertificates(totalClientTrustBundle, []*certificateSecret{{cert:
oldCerts[0]}}, now) explaining that we return a synthetic requeue here to wait
for the TargetConfigController to remove the old signer from the trust bundle;
also add a debug log (using the controller's logger) just before returning that
records the computed trust length (len(...)) and identifying info about
oldCerts[0] (e.g., subject/serial or a stable identifier) so operators can
diagnose stuck conditions when this branch fires.
- Around line 735-737: Add an inline comment above the if that uses
trustedCertificates(totalClientTrustBundle, []*certificateSecret{{cert:
signers[0]}}, now) == 0 explaining this is a synthetic requeue to wait for the
TargetConfigController to propagate the new signer into the trust bundle; also
add a debug log (using the controller's logger, e.g. r.log or processLogger) in
the same branch that logs the reason and relevant context (e.g. signer
fingerprint/identifier from signers[0], and that trustedCertificates returned 0)
so the requeue is observable and consistent with other requeue debug logs in
this controller.

ℹ️ Review info
⚙️ Run configuration

Configuration used: Repository YAML (base), Central YAML (inherited)

Review profile: CHILL

Plan: Enterprise

Run ID: f40fd551-96db-4148-97eb-11c0ed360cc5

📥 Commits

Reviewing files that changed from the base of the PR and between 1d34880 and 7146e9a.

📒 Files selected for processing (1)
  • control-plane-pki-operator/certificaterevocationcontroller/certificaterevocationcontroller.go

@codecov

codecov Bot commented May 21, 2026

Copy link
Copy Markdown

Codecov Report

✅ All modified and coverable lines are covered by tests.
✅ Project coverage is 40.38%. Comparing base (1d34880) to head (16b595b).
⚠️ Report is 8 commits behind head on main.

Additional details and impacted files
@@           Coverage Diff           @@
##             main    #8563   +/-   ##
=======================================
  Coverage   40.37%   40.38%           
=======================================
  Files         755      755           
  Lines       93227    93227           
=======================================
+ Hits        37644    37648    +4     
+ Misses      52879    52875    -4     
  Partials     2704     2704           
Files with missing lines Coverage Δ
...ationcontroller/certificaterevocationcontroller.go 56.00% <100.00%> (+0.44%) ⬆️
Flag Coverage Δ
cmd-support 34.34% <ø> (ø)
cpo-hostedcontrolplane 41.76% <ø> (ø)
cpo-other 40.31% <ø> (ø)
hypershift-operator 50.72% <ø> (ø)
other 31.58% <100.00%> (+0.04%) ⬆️

Flags with carried forward coverage won't be shown. Click here to find out more.

🚀 New features to boost your workflow:
  • ❄️ Test Analytics: Detect flaky tests, report on failures, and find test suite problems.

@coderabbitai coderabbitai Bot left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🧹 Nitpick comments (2)
control-plane-pki-operator/certificaterevocationcontroller/certificaterevocationcontroller_test.go (2)

466-466: ⚡ Quick win

Clarify test name to reflect requeue behavior.

The test name "not yet propagated, nothing to do" is misleading since the controller now requeues (line 470). Consider renaming to follow the "When ... it should ..." pattern and accurately describe the behavior.

📝 Suggested test name
-			name:            "not yet propagated, nothing to do",
+			name:            "When new signer not yet propagated it should requeue",

As per coding guidelines, use "When ... it should ..." format for describing test cases when creating unit tests.

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In
`@control-plane-pki-operator/certificaterevocationcontroller/certificaterevocationcontroller_test.go`
at line 466, Update the test case name string "not yet propagated, nothing to
do" to accurately reflect that the controller requeues; follow the "When ... it
should ..." pattern (e.g., "When CR not yet propagated it should requeue for
later processing") so readers understand expected requeue behavior — locate the
table-driven test entry that sets name: "not yet propagated, nothing to do" in
certificaterevocationcontroller_test.go and replace it with a descriptive "When
... it should ..." phrasing.

851-851: ⚡ Quick win

Align test name with coding guidelines.

Consider renaming this test to follow the "When ... it should ..." pattern for consistency with the coding guidelines and to more clearly indicate the expected behavior.

📝 Suggested test name
-			name:            "validating, previous still valid",
+			name:            "When old signer still in total bundle it should requeue",

As per coding guidelines, use "When ... it should ..." format for describing test cases when creating unit tests.

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In
`@control-plane-pki-operator/certificaterevocationcontroller/certificaterevocationcontroller_test.go`
at line 851, Rename the test case name string in the table of test cases inside
certificaterevocationcontroller_test.go (the entry currently named "validating,
previous still valid") to follow the "When ... it should ..." pattern; for
example change it to something like "When validating and previous certificate is
still valid it should not revoke" so the test case name used by the test runner
(the table entry `name` field) follows the coding guideline and clearly
describes expected behavior.
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Nitpick comments:
In
`@control-plane-pki-operator/certificaterevocationcontroller/certificaterevocationcontroller_test.go`:
- Line 466: Update the test case name string "not yet propagated, nothing to do"
to accurately reflect that the controller requeues; follow the "When ... it
should ..." pattern (e.g., "When CR not yet propagated it should requeue for
later processing") so readers understand expected requeue behavior — locate the
table-driven test entry that sets name: "not yet propagated, nothing to do" in
certificaterevocationcontroller_test.go and replace it with a descriptive "When
... it should ..." phrasing.
- Line 851: Rename the test case name string in the table of test cases inside
certificaterevocationcontroller_test.go (the entry currently named "validating,
previous still valid") to follow the "When ... it should ..." pattern; for
example change it to something like "When validating and previous certificate is
still valid it should not revoke" so the test case name used by the test runner
(the table entry `name` field) follows the coding guideline and clearly
describes expected behavior.

ℹ️ Review info
⚙️ Run configuration

Configuration used: Repository YAML (base), Central YAML (inherited)

Review profile: CHILL

Plan: Enterprise

Run ID: b312a748-4eb5-4172-8613-b06d9e0f64eb

📥 Commits

Reviewing files that changed from the base of the PR and between 7146e9a and 66b66af.

📒 Files selected for processing (1)
  • control-plane-pki-operator/certificaterevocationcontroller/certificaterevocationcontroller_test.go

@YamunadeviShanmugam YamunadeviShanmugam force-pushed the fix_race_condition_ca_bundle branch from 66b66af to 6a58287 Compare May 21, 2026 03:49

@coderabbitai coderabbitai Bot left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 1

🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Inline comments:
In
`@control-plane-pki-operator/certificaterevocationcontroller/certificaterevocationcontroller_test.go`:
- Line 466: Rename the test case title string in the table-driven tests in
certificaterevocationcontroller_test.go: change the 'name' field value "not yet
propagated, nothing to do" to the Gherkin-style "When not yet propagated it
should do nothing"; apply the same "When ... it should ..." renaming pattern to
the other affected test case mentioned (around the other occurrence referenced)
so all '*_test.go' cases use the required Gherkin format.
🪄 Autofix (Beta)

Fix all unresolved CodeRabbit comments on this PR:

  • Push a commit to this branch (recommended)
  • Create a new PR with the fixes

ℹ️ Review info
⚙️ Run configuration

Configuration used: Repository YAML (base), Central YAML (inherited)

Review profile: CHILL

Plan: Enterprise

Run ID: 5ab2a0a5-2e7f-4df7-a412-adaea7fccadc

📥 Commits

Reviewing files that changed from the base of the PR and between 66b66af and 6a58287.

📒 Files selected for processing (2)
  • control-plane-pki-operator/certificaterevocationcontroller/certificaterevocationcontroller.go
  • control-plane-pki-operator/certificaterevocationcontroller/certificaterevocationcontroller_test.go

now: postRevocationClock.Now,
crrNamespace: "crr-ns",
crrName: "crr-name",
name: "not yet propagated, nothing to do",

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

⚠️ Potential issue | 🟡 Minor | ⚡ Quick win

Rename edited test cases to the required Gherkin format.

The updated test titles should follow When ... it should ... to match repo test naming requirements.

Suggested rename
- name:            "not yet propagated, nothing to do",
+ name:            "When new signer cert is not yet propagated it should requeue",
...
- name:            "validating, previous still valid",
+ name:            "When previous signer cert is still trusted it should requeue",

As per coding guidelines, "**/*_test.{go,ts,tsx,js,jsx}: Always use "When ... it should ..." format for describing test cases when creating unit tests" and "**/*_test.go: Use Gherkin Syntax to define unit test cases with 'When... it should...' pattern".

Also applies to: 851-851

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In
`@control-plane-pki-operator/certificaterevocationcontroller/certificaterevocationcontroller_test.go`
at line 466, Rename the test case title string in the table-driven tests in
certificaterevocationcontroller_test.go: change the 'name' field value "not yet
propagated, nothing to do" to the Gherkin-style "When not yet propagated it
should do nothing"; apply the same "When ... it should ..." renaming pattern to
the other affected test case mentioned (around the other occurrence referenced)
so all '*_test.go' cases use the required Gherkin format.

@coderabbitai coderabbitai Bot left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

♻️ Duplicate comments (1)
control-plane-pki-operator/certificaterevocationcontroller/certificaterevocationcontroller_test.go (1)

466-470: ⚠️ Potential issue | 🟡 Minor | ⚡ Quick win

Rename updated table-case titles to the required Gherkin format.

At Line 466 and Line 851, the case names still don’t follow When ... it should .... Please rename both while touching these scenarios.

As per coding guidelines, "**/*_test.{go,ts,tsx,js,jsx}: Always use "When ... it should ..." format for describing test cases when creating unit tests" and "**/*_test.go: Use Gherkin Syntax to define unit test cases with 'When... it should...' pattern".

Also applies to: 851-855

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In
`@control-plane-pki-operator/certificaterevocationcontroller/certificaterevocationcontroller_test.go`
around lines 466 - 470, Rename the table-driven test case titles (the "name"
fields) in certificaterevocationcontroller_test.go from plain phrases to
Gherkin-style "When ... it should ..." sentences; specifically update the case
with name "not yet propagated, nothing to do" (the entry where now is
postRevocationClock.Now and expectedRequeue is true) to something like "When new
cert not yet in total bundle it should requeue", and likewise change the other
case around the second occurrence (the one mentioned at lines 851-855) to follow
the same "When ... it should ..." pattern so all test case names conform to the
Gherkin guideline.
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Duplicate comments:
In
`@control-plane-pki-operator/certificaterevocationcontroller/certificaterevocationcontroller_test.go`:
- Around line 466-470: Rename the table-driven test case titles (the "name"
fields) in certificaterevocationcontroller_test.go from plain phrases to
Gherkin-style "When ... it should ..." sentences; specifically update the case
with name "not yet propagated, nothing to do" (the entry where now is
postRevocationClock.Now and expectedRequeue is true) to something like "When new
cert not yet in total bundle it should requeue", and likewise change the other
case around the second occurrence (the one mentioned at lines 851-855) to follow
the same "When ... it should ..." pattern so all test case names conform to the
Gherkin guideline.

ℹ️ Review info
⚙️ Run configuration

Configuration used: Repository YAML (base), Central YAML (inherited)

Review profile: CHILL

Plan: Enterprise

Run ID: 132a6468-5157-46cd-a08c-352458ba0e01

📥 Commits

Reviewing files that changed from the base of the PR and between 6a58287 and 510bbda.

📒 Files selected for processing (1)
  • control-plane-pki-operator/certificaterevocationcontroller/certificaterevocationcontroller_test.go

@YamunadeviShanmugam YamunadeviShanmugam force-pushed the fix_race_condition_ca_bundle branch from 510bbda to 71a71c2 Compare May 21, 2026 03:59

@coderabbitai coderabbitai Bot left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

♻️ Duplicate comments (2)
control-plane-pki-operator/certificaterevocationcontroller/certificaterevocationcontroller_test.go (2)

466-466: ⚠️ Potential issue | 🟡 Minor | ⚡ Quick win

Test case name still doesn't follow required Gherkin format.

This test name was already flagged in a previous review but hasn't been updated. As per coding guidelines, test cases must use "When ... it should ..." format.

📝 Suggested rename
-			name:            "not yet propagated, nothing to do",
+			name:            "When new signer cert not yet in total bundle it should requeue",

As per coding guidelines, "**/*_test.{go,ts,tsx,js,jsx}: Always use 'When ... it should ...' format for describing test cases when creating unit tests".

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In
`@control-plane-pki-operator/certificaterevocationcontroller/certificaterevocationcontroller_test.go`
at line 466, Rename the failing test case name string "not yet propagated,
nothing to do" to follow the required Gherkin format "When ... it should ...";
locate the test table entry in certificaterevocationcontroller_test.go where
name: "not yet propagated, nothing to do" appears and replace it with a
descriptive Gherkin-style name such as "When revocation is not yet propagated it
should do nothing" (or similar) so the test description follows the "When ... it
should ..." guideline.

851-851: ⚠️ Potential issue | 🟡 Minor | ⚡ Quick win

Test case name still doesn't follow required Gherkin format.

This test name was already flagged in a previous review but hasn't been updated. As per coding guidelines, test cases must use "When ... it should ..." format.

📝 Suggested rename
-			name:            "validating, previous still valid",
+			name:            "When old signer cert still in total bundle it should requeue",

As per coding guidelines, "**/*_test.{go,ts,tsx,js,jsx}: Always use 'When ... it should ...' format for describing test cases when creating unit tests".

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In
`@control-plane-pki-operator/certificaterevocationcontroller/certificaterevocationcontroller_test.go`
at line 851, Rename the test case name value "validating, previous still valid"
to follow the required Gherkin-style "When ... it should ..." format (e.g.,
"When validating and previous certificate is still valid it should ...") by
updating the test case's name field in the table-driven test (the entry with
name: "validating, previous still valid") so it starts with "When" and includes
"it should" describing the expected outcome.
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Duplicate comments:
In
`@control-plane-pki-operator/certificaterevocationcontroller/certificaterevocationcontroller_test.go`:
- Line 466: Rename the failing test case name string "not yet propagated,
nothing to do" to follow the required Gherkin format "When ... it should ...";
locate the test table entry in certificaterevocationcontroller_test.go where
name: "not yet propagated, nothing to do" appears and replace it with a
descriptive Gherkin-style name such as "When revocation is not yet propagated it
should do nothing" (or similar) so the test description follows the "When ... it
should ..." guideline.
- Line 851: Rename the test case name value "validating, previous still valid"
to follow the required Gherkin-style "When ... it should ..." format (e.g.,
"When validating and previous certificate is still valid it should ...") by
updating the test case's name field in the table-driven test (the entry with
name: "validating, previous still valid") so it starts with "When" and includes
"it should" describing the expected outcome.

ℹ️ Review info
⚙️ Run configuration

Configuration used: Repository YAML (base), Central YAML (inherited)

Review profile: CHILL

Plan: Enterprise

Run ID: 0dbdd987-da82-446c-bc9c-5ab1fc34113f

📥 Commits

Reviewing files that changed from the base of the PR and between 510bbda and 71a71c2.

📒 Files selected for processing (2)
  • control-plane-pki-operator/certificaterevocationcontroller/certificaterevocationcontroller.go
  • control-plane-pki-operator/certificaterevocationcontroller/certificaterevocationcontroller_test.go

@YamunadeviShanmugam YamunadeviShanmugam force-pushed the fix_race_condition_ca_bundle branch from 71a71c2 to a3eaabb Compare May 21, 2026 04:03
@sdminonne

Copy link
Copy Markdown
Contributor

/test e2e-aks

@sdminonne

Copy link
Copy Markdown
Contributor

I may be wrong but I the title is somehow misleading

The controller already has three mechanisms to re-reconcile when the trust bundle changes:

  1. ConfigMap informer watch (certificaterevocationcontroller.go): The enqueueConfigMap handler explicitly watches TotalKASClientCABundle and re-enqueues ALL CRRs when it changes.
  2. ResyncEvery(time.Minute) (certificaterevocationcontroller.go): Periodic re-reconciliation every 60 seconds regardless of events.
  3. Secret and Pod informers (certificaterevocationcontroller.go): Related resource changes also trigger reconciliation.

So I'm not positive we've a RACE here but definitively it's speeding up the revocation mechanism so this PR is safe and it's adding a real value. Thanks!

@YamunadeviShanmugam

Copy link
Copy Markdown
Contributor Author

I may be wrong but I the title is somehow misleading

The controller already has three mechanisms to re-reconcile when the trust bundle changes:

  1. ConfigMap informer watch (certificaterevocationcontroller.go): The enqueueConfigMap handler explicitly watches TotalKASClientCABundle and re-enqueues ALL CRRs when it changes.
  2. ResyncEvery(time.Minute) (certificaterevocationcontroller.go): Periodic re-reconciliation every 60 seconds regardless of events.
  3. Secret and Pod informers (certificaterevocationcontroller.go): Related resource changes also trigger reconciliation.

So I'm not positive we've a RACE here but definitively it's speeding up the revocation mechanism so this PR is safe and it's adding a real value. Thanks!

@sdminonne ,
"Race condition" was probably the wrong phrase. The system isn't deadlocking permanently, it's just getting stuck waiting for a long fallback resync. Since we were returning requeue as false , the revocation controller was falling asleep and doesnt get an update that the TargetConfigController finished its job 1 minute later. It ended up sleeping until a fallback resync happens. Because the e2e test framework has a hard 10-minute timeout, it failed and being flaky. The PR is essentially a speed-up/optimization. By explicitly returning requeue as true we drop that wait time.

I’ll update the PR description to remove the "race condition" phrasing and focus accurately on the unhandled resync delay causing the timeout. Thanks for your inputs!

@YamunadeviShanmugam YamunadeviShanmugam changed the title Fix the race condition in CA bundle Handle CA bundle aggregation delay by requeuing revocation May 21, 2026
@sdminonne

Copy link
Copy Markdown
Contributor

/cc @sdminonne

@openshift-ci openshift-ci Bot requested a review from sdminonne May 21, 2026 06:24
@sdminonne

Copy link
Copy Markdown
Contributor

/approve

@hypershift-jira-solve-ci

Copy link
Copy Markdown

AI Test Failure Analysis

Job: pull-ci-openshift-hypershift-main-e2e-aks | Build: 2057332427440787456 | Cost: $1.94119975 | Failed step: hypershift-azure-run-e2e

View full analysis report


Generated by hypershift-analyze-e2e-failure post-step using Claude claude-opus-4-6

Update logic of CertificateRevocationController to request a requeue when encountering stale total-client-ca bundle cache.

@jparrill jparrill left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

/approve

@YamunadeviShanmugam YamunadeviShanmugam force-pushed the fix_race_condition_ca_bundle branch from 63d7067 to 16b595b Compare May 21, 2026 07:12
@jparrill

Copy link
Copy Markdown
Contributor

/test e2e-conformance

@jparrill

Copy link
Copy Markdown
Contributor

/retitle OCPBUGS-86310: Handle CA bundle aggregation delay by requeuing revocation

@openshift-ci openshift-ci Bot added the lgtm Indicates that a PR is ready to be merged. label May 21, 2026
@openshift-merge-bot

Copy link
Copy Markdown
Contributor

Scheduling tests matching the pipeline_run_if_changed or not excluded by pipeline_skip_if_only_changed parameters:
/test e2e-aks
/test e2e-aws
/test e2e-aws-upgrade-hypershift-operator
/test e2e-azure-self-managed
/test e2e-kubevirt-aws-ovn-reduced
/test e2e-v2-aws
/test e2e-v2-gke

@jparrill

Copy link
Copy Markdown
Contributor

/test e2e-aws

@cwbotbot

cwbotbot commented May 21, 2026

Copy link
Copy Markdown

Test Results

e2e-aws

e2e-aks

@hypershift-jira-solve-ci

Copy link
Copy Markdown

AI Test Failure Analysis

Job: pull-ci-openshift-hypershift-main-e2e-azure-self-managed | Build: 2057376114996678656 | Cost: $2.9737307499999996 | Failed step: hypershift-azure-run-e2e-self-managed

View full analysis report


Generated by hypershift-analyze-e2e-failure post-step using Claude claude-opus-4-6

@YamunadeviShanmugam

Copy link
Copy Markdown
Contributor Author

/test e2e-azure-self-managed

@hypershift-jira-solve-ci

Copy link
Copy Markdown

AI Test Failure Analysis

Job: pull-ci-openshift-hypershift-main-e2e-azure-self-managed | Build: 2057402147137392640 | Cost: $2.93901125 | Failed step: hypershift-azure-run-e2e-self-managed

View full analysis report


Generated by hypershift-analyze-e2e-failure post-step using Claude claude-opus-4-6

@YamunadeviShanmugam

Copy link
Copy Markdown
Contributor Author

/test e2e-azure-self-managed

@hypershift-jira-solve-ci

Copy link
Copy Markdown

Test Failure Analysis Complete

Job Information

Test Failure Analysis

Error

Failed to destroy cluster: failed to clean up secrets in clusters namespace:
Delete "https://a2fb55450a1f4470aa86c1dc6cd7a133-179590118.us-east-1.elb.amazonaws.com:7443/api/v1/namespaces/clusters/secrets?labelSelector=hypershift.openshift.io%2Fauto-created-for-infra%3D3c56b339dc-mgmt":
stream error: stream ID 3; INTERNAL_ERROR; received from peer

Summary

All 334 e2e tests passed (0 failures, 29 skipped). The pre phase (cluster installation) and test phase (e2e execution) both succeeded. The job was marked as failed solely because the post-phase destroy-management-cluster cleanup step encountered a transient HTTP/2 stream error when attempting to delete leftover secrets from the CI cluster's kube-apiserver. This is a CI infrastructure flake completely unrelated to PR #8563.

Root Cause

The failure is a transient CI infrastructure issue, not a product or test bug. Here is the precise chain of events:

  1. The destroy-management-cluster post-phase step ran after all tests passed.
  2. It successfully deleted the hosted cluster, cleaned up all 19 Azure role assignments across 7 components (cpo, capz, cloud-provider, azure-file, azure-disk, ingress, cncc, ciro, ciroWI, azure-diskWI, azure-fileWI), and deleted all 3 Azure resource groups (3c56b339dc-mgmt-3c56b339dc-mgmt, 3c56b339dc-mgmt-vnet-3c56b339dc-mgmt, 3c56b339dc-mgmt-nsg-3c56b339dc-mgmt).
  3. At the final step — deleting auto-created secrets with label hypershift.openshift.io/auto-created-for-infra=3c56b339dc-mgmt from the clusters namespace — the HTTP DELETE request to the CI cluster's kube-apiserver (hosted behind AWS ELB a2fb55450a1f4470aa86c1dc6cd7a133-179590118.us-east-1.elb.amazonaws.com:7443) failed with an HTTP/2 INTERNAL_ERROR after a 60-second timeout (request sent at 13:58:37, error at 13:59:37).
  4. This is a well-known transient failure mode: the AWS ELB or the underlying kube-apiserver dropped the HTTP/2 stream mid-request, likely due to connection reset, load balancer idle timeout, or brief API server unavailability on the CI infrastructure cluster.

This failure has zero relation to PR #8563 (which adds CA bundle aggregation delay handling). The PR's code changes were never exercised during cluster teardown, and all functional tests covering the PR's behavior passed.

Recommendations
  1. Retest / Ignore: This job should be retried with /retest. The failure is a CI infrastructure flake in the post-phase cleanup step and does not indicate any issue with the PR.
  2. No code changes needed: All 334 e2e tests passed, confirming the PR's functionality works correctly on Azure self-managed HyperShift clusters.
  3. Infrastructure improvement (optional): The destroy-management-cluster step could benefit from retry logic around the final secret cleanup API call to be resilient against transient HTTP/2 stream errors from the CI cluster's kube-apiserver. This is a pre-existing issue unrelated to this PR.
Evidence
Evidence Detail
E2E test result All 334 tests PASSED, 29 skipped, 0 failures (junit.xml confirms failures="0")
Test step status hypershift-azure-run-e2e-self-managed finished with "passed": true, "result": "SUCCESS"
Failed step destroy-management-cluster (post-phase cleanup only) finished with "passed": false
Pre phase All 4 steps succeeded (ipi-install-rbac, create-management-cluster, hypershift-azure-setup-private-link, hypershift-install) in 17m6s
Test phase Single test step succeeded in 59m0s
Error type HTTP/2 stream error: INTERNAL_ERROR on DELETE request to CI cluster kube-apiserver
Error endpoint a2fb55450a1f4470aa86c1dc6cd7a133-179590118.us-east-1.elb.amazonaws.com:7443 (AWS CI cluster, NOT the Azure test cluster)
Azure cleanup Fully completed: all role assignments deleted, all 3 resource groups deleted
Failure timing 13:58:37 → 13:59:37 (60s timeout on secret deletion API call)
Analyze step hypershift-analyze-e2e-failure confirmed: "Test step passed — skipping analysis"
Stack trace origin cmd/cluster/azure/destroy.go:63 — the final cleanup function in the destroy command

@sdminonne

Copy link
Copy Markdown
Contributor

/test e2e-azure-self-managed

@YamunadeviShanmugam

Copy link
Copy Markdown
Contributor Author

/verified by CI

@openshift-ci-robot openshift-ci-robot added the verified Signifies that the PR passed pre-merge verification criteria label May 21, 2026
@openshift-ci-robot

Copy link
Copy Markdown

@YamunadeviShanmugam: This PR has been marked as verified by CI.

Details

In response to this:

/verified by CI

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the openshift-eng/jira-lifecycle-plugin repository.

@openshift-ci

openshift-ci Bot commented May 21, 2026

Copy link
Copy Markdown
Contributor

@YamunadeviShanmugam: all tests passed!

Full PR test history. Your PR dashboard.

Details

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository. I understand the commands that are listed here.

@openshift-merge-bot openshift-merge-bot Bot merged commit b39a15b into openshift:main May 21, 2026
40 checks passed
@openshift-ci-robot

Copy link
Copy Markdown

@YamunadeviShanmugam: Jira Issue Verification Checks: Jira Issue OCPBUGS-86310
✔️ This pull request was pre-merge verified.
✔️ All associated pull requests have merged.
✔️ All associated, merged pull requests were pre-merge verified.

Jira Issue OCPBUGS-86310 has been moved to the MODIFIED state and will move to the VERIFIED state when the change is available in an accepted nightly payload. 🕓

Details

In response to this:

** CA bundle aggregation delay**
During End-to-End (e2e) testing, CertificateRevocationRequest (CRR) successfully processes initial steps such as leaf and root certificate regeneration, but permanently stuck on Condition: PreviousCertificatesRevoked=False Reason: WaitingForAvailable(Previous signer certificate not yet revoked.)

The failure is caused by race condition between CertificateRevocationController and TargetConfigController

Fix:
Modified certificaterevocationcontroller.go to explicitly request a requeue when the old signer certificate has not yet removed from the total trust bundle by TargetConfigController.

Summary by CodeRabbit

  • Bug Fixes
  • PKI operator now automatically retries when newly issued signer certificates haven't yet propagated to client trust bundles, reducing transient failures during rollout.
  • PKI operator now automatically retries when revoked or old signer certificates remain present in client trust bundles, improving safety and consistency during certificate rotation.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the openshift-eng/jira-lifecycle-plugin repository.

@openshift-merge-robot

Copy link
Copy Markdown
Contributor

Fix included in release 5.0.0-0.nightly-2026-05-22-041009

@bryan-cox

Copy link
Copy Markdown
Member

/jira backport release-4.22

@openshift-ci-robot

Copy link
Copy Markdown

@bryan-cox: The following backport issues have been created:

Queuing cherrypicks to the requested branches to be created after this PR merges:
/cherrypick release-4.22

Details

In response to this:

/jira backport release-4.22

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the openshift-eng/jira-lifecycle-plugin repository.

@openshift-cherrypick-robot

Copy link
Copy Markdown

@openshift-ci-robot: new pull request created: #8746

Details

In response to this:

@bryan-cox: The following backport issues have been created:

Queuing cherrypicks to the requested branches to be created after this PR merges:
/cherrypick release-4.22

In response to this:

/jira backport release-4.22

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the openshift-eng/jira-lifecycle-plugin repository.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

approved Indicates a PR has been approved by an approver from all required OWNERS files. area/control-plane-pki-operator Indicates the PR includes changes for the control plane PKI operator - in an OCP release jira/valid-bug Indicates that a referenced Jira bug is valid for the branch this PR is targeting. jira/valid-reference Indicates that this PR references a valid Jira ticket of any type. lgtm Indicates that a PR is ready to be merged. verified Signifies that the PR passed pre-merge verification criteria

Projects

None yet

Development

Successfully merging this pull request may close these issues.

8 participants