Skip to content

CNTRLPLANE-3515: test(e2e): add day-2 label/taint no-rollout verification#8595

Merged
openshift-merge-bot[bot] merged 1 commit into
openshift:mainfrom
mgencur:node_labels_and_taints
Jun 16, 2026
Merged

CNTRLPLANE-3515: test(e2e): add day-2 label/taint no-rollout verification#8595
openshift-merge-bot[bot] merged 1 commit into
openshift:mainfrom
mgencur:node_labels_and_taints

Conversation

@mgencur

@mgencur mgencur commented May 27, 2026

Copy link
Copy Markdown
Contributor

What this PR does / why we need it:

Extend NodePoolDay2TagsTest to patch nodeLabels and taints day-2 and verify no rolling upgrade is triggered via NodePool conditions and MachineDeployment generation.

Which issue(s) this PR fixes:

Fixes https://redhat.atlassian.net/browse/CNTRLPLANE-3515

Special notes for your reviewer:

The original test also scaled up the NodePool and verified that the new nodes actually have the labels/taints. But I think it's not necessary to test it. The existing NodePool upgrade tests already set labels/taints at the beginning and verify these are properly applied to nodes after initial start. These tests together with the new one from this PR cover the whole original test

Checklist:

  • Subject and description added to both, commit and PR.
  • Relevant issues have been referenced.
  • This change includes docs.
  • This change includes unit tests.

Summary by CodeRabbit

  • Tests
    • Enhanced end-to-end testing for NodePool day‑2 scenarios.
    • Validate applying EC2 resource tags, node labels, and taints without triggering a rolling update.
    • Monitor update-related NodePool conditions and assert rolling-update is not triggered.
    • Re-check MachineDeployment to ensure its generation remains unchanged after the updates.

@openshift-merge-bot

Copy link
Copy Markdown
Contributor

Pipeline controller notification
This repo is configured to use the pipeline controller. Second-stage tests will be triggered either automatically or after lgtm label is added, depending on the repository configuration. The pipeline controller will automatically detect which contexts are required and will utilize /test Prow commands to trigger the second stage.

For optional jobs, comment /test ? to see a list of all defined jobs. To trigger manually all jobs from second stage use /pipeline required command.

This repository is configured in: LGTM mode

@openshift-ci-robot openshift-ci-robot added the jira/valid-reference Indicates that this PR references a valid Jira ticket of any type. label May 27, 2026
@openshift-ci-robot

openshift-ci-robot commented May 27, 2026

Copy link
Copy Markdown

@mgencur: This pull request references CNTRLPLANE-3515 which is a valid jira issue.

Warning: The referenced jira issue has an invalid target version for the target branch this PR targets: expected the task to target the "5.0.0" version, but no target version was set.

Details

In response to this:

What this PR does / why we need it:

Extend NodePoolDay2TagsTest to patch nodeLabels and taints day-2 and verify no rolling upgrade is triggered via NodePool conditions and MachineDeployment generation.

Which issue(s) this PR fixes:

Fixes https://redhat.atlassian.net/browse/CNTRLPLANE-3515

Special notes for your reviewer:

The original test also scaled up the NodePool and verified that the new nodes actually have the labels/taints. But I think it's not necessary to test it. The existing NodePool upgrade tests already set labels/taints at the beginning and verify these are properly applied to nodes after initial start. These tests together with the new one from this PR cover the whole original test

Checklist:

  • Subject and description added to both, commit and PR.
  • Relevant issues have been referenced.
  • This change includes docs.
  • This change includes unit tests.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the openshift-eng/jira-lifecycle-plugin repository.

@openshift-ci openshift-ci Bot added do-not-merge/needs-area area/testing Indicates the PR includes changes for e2e testing labels May 27, 2026
@openshift-ci openshift-ci Bot requested review from devguyio and enxebre May 27, 2026 10:06
@coderabbitai

coderabbitai Bot commented May 27, 2026

Copy link
Copy Markdown
Contributor

No actionable comments were generated in the recent review. 🎉

ℹ️ Recent review info
⚙️ Run configuration

Configuration used: Repository YAML (base), Central YAML (inherited)

Review profile: CHILL

Plan: Enterprise

Run ID: 8e7e552d-a4b2-4ebe-8341-38bc0cd39fc7

📥 Commits

Reviewing files that changed from the base of the PR and between 4f8b5f2 and b5d3554.

📒 Files selected for processing (1)
  • test/e2e/nodepool_day2_tags_test.go
🚧 Files skipped from review as they are similar to previous changes (1)
  • test/e2e/nodepool_day2_tags_test.go

📝 Walkthrough

Walkthrough

This PR updates the nodepool_day2_tags e2e test: it adds the k8s sets import, captures the MachineDeployment.Generation before applying changes, applies EC2 resource tag updates plus node label and taint changes, polls NodePool status while tracking updating-related condition types (using a set and a rollingUpdateTriggered flag), asserts rollingUpdateTriggered is false, and re-fetches the MachineDeployment to verify its Generation is unchanged.


Important

Pre-merge checks failed

Please resolve all errors before merging. Addressing warnings is optional.

❌ Failed checks (1 error, 1 warning)

Check name Status Explanation Resolution
Container-Privileges ❌ Error PR adds container manifests with privileged:true, hostPID, hostNetwork, and allowPrivilegeEscalation settings in new files without documented justification. Document justification for privilege requirements in kubelet-config-daemonset.yaml and KubeVirt CSI daemonset manifests.
Test Structure And Quality ⚠️ Warning Two assertions lack meaningful failure messages: lines 128 (HaveKeyWithValue) and 138-140 (ContainElement) have no messages to diagnose failures. Add failure messages to lines 128 and 138-140 assertions following the pattern of other assertions in the test (e.g., "expected day-2 tag to be applied to AWSMachine" and "expected EC2 instance tags to contain day-2 tag").
✅ Passed checks (13 passed)
Check name Status Explanation
Description Check ✅ Passed Check skipped - CodeRabbit’s high-level summary is enabled.
Title check ✅ Passed The title clearly describes the main change: adding day-2 label/taint no-rollout verification to an E2E test, which aligns with the core purpose of the changeset.
Docstring Coverage ✅ Passed No functions found in the changed files to evaluate docstring coverage. Skipping docstring coverage check.
Linked Issues check ✅ Passed Check skipped because no linked issues were found for this pull request.
Out of Scope Changes check ✅ Passed Check skipped because no linked issues were found for this pull request.
Stable And Deterministic Test Names ✅ Passed Test uses static name "TestNodePoolDay2Tags" with no dynamic values. All dynamic resource names correctly placed in test body, not test titles.
Microshift Test Compatibility ✅ Passed HyperShift (hosted control planes) and MicroShift (edge distribution) are separate projects. This check is not applicable to HyperShift repository tests.
Single Node Openshift (Sno) Test Compatibility ✅ Passed Test creates single-replica NodePool and doesn't assume multiple nodes, multi-node communication, node scaling, or scheduling across different nodes—all operations work on SNO.
Topology-Aware Scheduling Compatibility ✅ Passed PR modifies only a test file (nodepool_day2_tags_test.go), not deployment manifests, operators, or controllers. No scheduling constraints are introduced in the code changes.
Ote Binary Stdout Contract ✅ Passed File contains no process-level stdout writes. All code runs within test methods; new sets import has no stdout side effects; no init(), TestMain(), or top-level initializers present.
Ipv6 And Disconnected Network Test Compatibility ✅ Passed Test contains no hardcoded IPv4 addresses, IPv4-only IP parsing, IPv4 CIDRs, or external public internet connectivity; AWS API calls are account-internal, and test properly skips on non-AWS platforms.
No-Weak-Crypto ✅ Passed The PR modifies a Kubernetes e2e test file with no cryptographic operations, weak crypto algorithms, custom crypto implementations, or non-constant-time secret comparisons.
No-Sensitive-Data-In-Logs ✅ Passed Test file contains one logging call with a static message and no sensitive data. No credentials, tokens, API keys, PII, or other sensitive data are logged anywhere in the code.
✨ Finishing Touches
🧪 Generate unit tests (beta)
  • Create PR with unit tests

Comment @coderabbitai help to get the list of available commands and usage tips.

@codecov

codecov Bot commented May 27, 2026

Copy link
Copy Markdown

Codecov Report

✅ All modified and coverable lines are covered by tests.
✅ Project coverage is 41.49%. Comparing base (8ea786c) to head (26c6325).
⚠️ Report is 54 commits behind head on main.

Additional details and impacted files
@@           Coverage Diff           @@
##             main    #8595   +/-   ##
=======================================
  Coverage   41.49%   41.49%           
=======================================
  Files         756      756           
  Lines       93648    93648           
=======================================
  Hits        38855    38855           
  Misses      52057    52057           
  Partials     2736     2736           
Flag Coverage Δ
cmd-support 34.87% <ø> (ø)
cpo-hostedcontrolplane 43.50% <ø> (ø)
cpo-other 43.17% <ø> (ø)
hypershift-operator 51.57% <ø> (ø)
other 31.64% <ø> (ø)

Flags with carried forward coverage won't be shown. Click here to find out more.

🚀 New features to boost your workflow:
  • ❄️ Test Analytics: Detect flaky tests, report on failures, and find test suite problems.

@coderabbitai coderabbitai Bot left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 2

🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Inline comments:
In `@test/e2e/nodepool_day2_tags_test.go`:
- Around line 79-88: The test currently replaces NodePool.Spec.NodeLabels and
NodePool.Spec.Taints outright; instead modify the test to merge the day-2 values
into existing fields: ensure NodePool.Spec.NodeLabels is non-nil (create if nil)
and set the "e2e.day2.validation" key to "true" without dropping other keys, and
for NodePool.Spec.Taints append the new hyperv1.Taint (Key:"e2e-day2",
Value:"test", Effect:corev1.TaintEffectPreferNoSchedule) only if an equivalent
taint (same Key, Value, Effect) does not already exist, so existing taints are
preserved rather than overwritten.
- Around line 153-154: The MachineDeployment generation assertion is hardcoded
to 1; instead capture the pre-update generation (e.g., store md.Generation in a
variable like initialGeneration or preUpdateGen before performing the day-2
tag/label/taint updates) and then assert that md.Generation is still equal to
that stored value after the update; update the test in
test/e2e/nodepool_day2_tags_test.go to use the stored pre-update generation for
the final assertion against md.Generation.
🪄 Autofix (Beta)

Fix all unresolved CodeRabbit comments on this PR:

  • Push a commit to this branch (recommended)
  • Create a new PR with the fixes

ℹ️ Review info
⚙️ Run configuration

Configuration used: Repository YAML (base), Central YAML (inherited)

Review profile: CHILL

Plan: Enterprise

Run ID: 9ed09c00-4205-40e5-b9d2-1c5734b44f66

📥 Commits

Reviewing files that changed from the base of the PR and between 89e19f8 and 9ce5563.

📒 Files selected for processing (1)
  • test/e2e/nodepool_day2_tags_test.go

Comment thread test/e2e/nodepool_day2_tags_test.go Outdated
Comment thread test/e2e/nodepool_day2_tags_test.go Outdated
@mgencur mgencur force-pushed the node_labels_and_taints branch from 9ce5563 to 4f8b5f2 Compare May 27, 2026 10:44
Comment thread test/e2e/nodepool_day2_tags_test.go Outdated
@mgencur mgencur force-pushed the node_labels_and_taints branch from 4f8b5f2 to b5d3554 Compare May 29, 2026 04:37
@muraee

muraee commented May 29, 2026

Copy link
Copy Markdown
Contributor

/approve

@openshift-ci

openshift-ci Bot commented May 29, 2026

Copy link
Copy Markdown
Contributor

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: mgencur, muraee

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Details Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@openshift-ci openshift-ci Bot added the approved Indicates a PR has been approved by an approver from all required OWNERS files. label May 29, 2026
@jiezhao16

Copy link
Copy Markdown

/lgtm

@openshift-ci openshift-ci Bot added the lgtm Indicates that a PR is ready to be merged. label May 29, 2026
@openshift-merge-bot

Copy link
Copy Markdown
Contributor

Scheduling tests matching the pipeline_run_if_changed or not excluded by pipeline_skip_if_only_changed parameters:
/test e2e-aks-4-22
/test e2e-aws-4-22
/test e2e-aks
/test e2e-aws
/test e2e-aws-upgrade-hypershift-operator
/test e2e-azure-self-managed
/test e2e-kubevirt-aws-ovn-reduced
/test e2e-v2-aws
/test e2e-v2-gke

@cwbotbot

cwbotbot commented May 29, 2026

Copy link
Copy Markdown

Test Results

e2e-aws

e2e-aks

@mgencur

mgencur commented Jun 2, 2026

Copy link
Copy Markdown
Contributor Author

/retest

5 similar comments
@mgencur

mgencur commented Jun 2, 2026

Copy link
Copy Markdown
Contributor Author

/retest

@mgencur

mgencur commented Jun 2, 2026

Copy link
Copy Markdown
Contributor Author

/retest

@mgencur

mgencur commented Jun 3, 2026

Copy link
Copy Markdown
Contributor Author

/retest

@mgencur

mgencur commented Jun 4, 2026

Copy link
Copy Markdown
Contributor Author

/retest

@mgencur

mgencur commented Jun 8, 2026

Copy link
Copy Markdown
Contributor Author

/retest

Extend NodePoolDay2TagsTest to patch nodeLabels and taints day-2
and verify no rolling upgrade is triggered via NodePool conditions
and MachineDeployment generation.

Ref: CNTRLPLANE-3515

Co-Authored-By: Claude Opus 4.6 <[email protected]>
@mgencur mgencur force-pushed the node_labels_and_taints branch from b5d3554 to 26c6325 Compare June 9, 2026 08:20
@openshift-ci openshift-ci Bot removed the lgtm Indicates that a PR is ready to be merged. label Jun 9, 2026
@mgencur

mgencur commented Jun 9, 2026

Copy link
Copy Markdown
Contributor Author

Rebasing because Konflux enterprise contract checks were constantly failing.

@jiezhao16

Copy link
Copy Markdown

/lgtm

@openshift-ci openshift-ci Bot added the lgtm Indicates that a PR is ready to be merged. label Jun 9, 2026
@openshift-merge-bot

Copy link
Copy Markdown
Contributor

Scheduling tests matching the pipeline_run_if_changed or not excluded by pipeline_skip_if_only_changed parameters:
/test e2e-aks-4-22
/test e2e-aws-4-22
/test e2e-aks
/test e2e-aws
/test e2e-aws-upgrade-hypershift-operator
/test e2e-azure-self-managed
/test e2e-azure-v2-self-managed
/test e2e-kubevirt-aws-ovn-reduced
/test e2e-v2-aws
/test e2e-v2-gke

@mgencur

mgencur commented Jun 10, 2026

Copy link
Copy Markdown
Contributor Author

/retest

4 similar comments
@mgencur

mgencur commented Jun 10, 2026

Copy link
Copy Markdown
Contributor Author

/retest

@mgencur

mgencur commented Jun 12, 2026

Copy link
Copy Markdown
Contributor Author

/retest

@mgencur

mgencur commented Jun 16, 2026

Copy link
Copy Markdown
Contributor Author

/retest

@mgencur

mgencur commented Jun 16, 2026

Copy link
Copy Markdown
Contributor Author

/retest

@hypershift-jira-solve-ci

hypershift-jira-solve-ci Bot commented Jun 16, 2026

Copy link
Copy Markdown

I now have complete evidence. All 4 distinct failures occurred within a narrow time window (~10:01–10:04 UTC) when the management cluster's API server became unstable. The PR only modifies nodepool_day2_tags_test.go, and that test passed. These are infrastructure-level failures unrelated to the PR.

Test Failure Analysis Complete

Job Information

Test Failure Analysis

Error

4 independent test failures during a management cluster API server disruption (~10:01–10:04 UTC):

1. TestAutoscaling/Main/TestAutoscalerRespectsNodePoolPause:
   "Failed to wait for MachineDeployment to be paused in 2m0s: context deadline exceeded"

2. TestKarpenter/Main/Billing_vCPUs,_consolidation,_and_cluster_deletion_with_blocking_PDB:
   "dial tcp 54.145.232.161:6443: connect: connection refused"

3. TestCreateCluster/Teardown:
   "namespace still exists after deletion timeout: context deadline exceeded"

4. TestUpgradeControlPlane/Main/EnsureNoCrashingPods:
   "Container control-plane-operator has a restartCount > 0 (6)"

Summary

All 4 test failures are caused by a transient management cluster API server outage that occurred around 10:01–10:04 UTC, impacting every test that was communicating with the management cluster at that moment. The management cluster kube-apiserver (ELB aebbb1060beb44ad9bf1aeab80173a8a, IP 54.145.232.161:6443) started returning connection refused at the TCP level, then progressed to context deadline exceeded, stream error: INTERNAL_ERROR, and the server is currently unable to handle the request. This caused the Karpenter metrics validation to fail, the autoscaling MachineDeployment pause annotation to never propagate, the CreateCluster teardown to time out deleting a namespace, and control-plane pods to crash-loop (restartCount 3–6). None of these failures are related to the PR, which only modifies test/e2e/nodepool_day2_tags_test.go — and that test passed (355.17s).

Root Cause

Transient management cluster API server outage (infrastructure flake, unrelated to PR #8595).

The management cluster's kube-apiserver became unreachable starting around 10:01 UTC. The evidence shows a progression of connectivity failures:

  1. 10:01:58 UTC — Repeated cluster.x-k8s.io/v1beta1 MachineDeployment is deprecated warnings appear every 3 seconds (the autoscaling test polling for the pause annotation while the API is degraded).

  2. 10:02 UTCTestCreateCluster/Teardown gets context deadline exceeded trying to GET the namespace (e2e-clusters-hl882) via the ELB endpoint on port 6443.

  3. 10:03:46 UTCTestKarpenter gets hard connection refused (dial tcp 54.145.232.161:6443: connect: connection refused) attempting to exec into the hypershift-operator pod for metrics collection. This repeats 5+ times.

  4. 10:03:43 UTCTestAutoscaling times out after 2 minutes waiting for the cluster.x-k8s.io/paused annotation to appear on the MachineDeployment. The pause annotation propagation requires the management API to be healthy.

  5. 10:04+ UTCstream error: INTERNAL_ERROR errors appear for both Karpenter and UpgradeControlPlane HostedCluster GET requests, and the server is currently unable to handle the request during cluster dumps.

The TestUpgradeControlPlane/Main/EnsureNoCrashingPods failure (capi-provider restart=4, control-plane-operator restart=6, hosted-cluster-config-operator restart=3) is a direct consequence — these control-plane components crash-looped during the API server disruption.

The PR's test (TestNodePoolDay2Tags) passed successfully in 355.17s. The PR only adds day-2 label/taint no-rollout verification to nodepool_day2_tags_test.go and does not touch any of the failing test files (autoscaling_test.go, karpenter_test.go, hypershift_framework.go, util.go).

Recommendations
  1. Retest the PR — Run /test e2e-aws-4-22 to get a clean run without the management cluster instability. All failures are infrastructure flakes.

  2. No code changes needed — The PR's own test (TestNodePoolDay2Tags) passed, and the day-2 label/taint no-rollout verification logic works correctly.

  3. Known flaky testsTestAutoscalerRespectsNodePoolPause is particularly sensitive to API server latency since it has a tight 2-minute timeout waiting for annotation propagation through the control loop. TestKarpenter's billing vCPUs validation requires exec'ing into pods for metrics, which is fragile during API disruptions.

Evidence
Evidence Detail
PR changed file test/e2e/nodepool_day2_tags_test.go only (+50/-8 lines)
PR's test result TestNodePoolDay2Tags: PASSED (355.17s)
Failure window All 4 failures occurred within ~10:01–10:04 UTC
Management API ELB aebbb1060beb44ad9bf1aeab80173a8a-0b578b4960e41714.elb.us-east-1.amazonaws.com:6443
Connection refused IP 54.145.232.161:6443 — TCP-level rejection
TestAutoscaling error MachineDeployment not yet paused after 2m timeout, context deadline exceeded
TestKarpenter error dial tcp 54.145.232.161:6443: connect: connection refused (5+ occurrences)
TestCreateCluster/Teardown context deadline exceeded getting namespace, the server is currently unable to handle the request
TestUpgradeControlPlane capi-provider restart=4, control-plane-operator restart=6, hosted-cluster-config-operator restart=3
API server HTTP/2 errors stream error: stream ID 12095; INTERNAL_ERROR; received from peer
Test stats 527 passed / 12 failed / 32 skipped out of 571 total

@mgencur

mgencur commented Jun 16, 2026

Copy link
Copy Markdown
Contributor Author

/verified by E2E

This PR improves an E2E test and it passes

@openshift-ci-robot openshift-ci-robot added the verified Signifies that the PR passed pre-merge verification criteria label Jun 16, 2026
@openshift-ci-robot

Copy link
Copy Markdown

@mgencur: This PR has been marked as verified by E2E.

Details

In response to this:

/verified by E2E

This PR improves an E2E test and it passes

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the openshift-eng/jira-lifecycle-plugin repository.

@openshift-merge-bot openshift-merge-bot Bot merged commit cfa8fc6 into openshift:main Jun 16, 2026
41 of 42 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

approved Indicates a PR has been approved by an approver from all required OWNERS files. area/testing Indicates the PR includes changes for e2e testing jira/valid-reference Indicates that this PR references a valid Jira ticket of any type. lgtm Indicates that a PR is ready to be merged. verified Signifies that the PR passed pre-merge verification criteria

Projects

None yet

Development

Successfully merging this pull request may close these issues.

5 participants