Skip to content

AROSLSRE-830: Add etcd EndpointSlice self-registration to fix DNS resolution delays#8613

Open
cssjr wants to merge 7 commits into
openshift:mainfrom
cssjr:etcd-endpointslice-self-register
Open

AROSLSRE-830: Add etcd EndpointSlice self-registration to fix DNS resolution delays#8613
cssjr wants to merge 7 commits into
openshift:mainfrom
cssjr:etcd-endpointslice-self-register

Conversation

@cssjr

@cssjr cssjr commented May 27, 2026

Copy link
Copy Markdown
Contributor

Summary

  • Adds EndpointSlice self-registration to the etcd ensure-dns init container, bypassing stale kube-controller-manager informer caches that delay DNS resolution under high HCP density
  • Creates a dedicated etcd service account with minimal RBAC for EndpointSlice and pod get permissions
  • Replaces the conditional etcd-defrag-controller SA with the always-present etcd SA, binding both self-register and defrag (HA-only) roles to it

Problem

Under high cluster density (~2,500 HCP namespaces), the management cluster's kube-controller-manager EndpointSlice informer cache goes stale, causing etcd-discovery headless Service DNS records to be delayed by minutes to hours. The ensure-dns init container blocks waiting for DNS, preventing etcd from starting, which cascades into cluster creation timeouts.

Evidence from Kusto (HostedControlPlaneLogs, hcp-dev-us-2.eastus2)

Scale: Hundreds to thousands of clusters affected daily across CI:

Date         Clusters   Failures   Prolonged(>5m)   Brief(<=5m)
2026-05-13        675     15,947           37            638
2026-05-14      1,210     24,869          142          1,063
2026-05-19      1,542     29,622          248          1,297
2026-05-20      1,228     22,563          180          1,039
2026-05-21      1,311     23,130          174          1,135
2026-05-26        980     17,884          138            842

Breakdown (last 3 days): Of 2,299 etcd pods that experienced DNS failures, 89.5% resolved within 30 seconds, but 8.1% (187 pods) took >5 minutes — these are the ones that cause E2E test timeouts.

Root Cause: EndpointSlice informer cache staleness

Confirmed via aksEvents table (category kube-controller-manager) on May 14:

"Error syncing endpoint slices for service, retrying"
  key="ocm-arohcpprow-<cluster-id>/etcd-discovery"
  err="EndpointSlice informer cache is out of date"

This appeared for dozens of HCP namespaces — individual namespaces saw 136–227 sync errors. A secondary pattern ("skipping Pod for Service, Node not found") shows the same informer staleness affecting node visibility, with top nodes seeing 330–557 skip events.

What's ruled out:

Hypothesis Evidence Verdict
CoreDNS unhealthy No CoreDNS errors in logs Ruled out
Node-specific issue Failures evenly distributed across all nodes Ruled out
CNI/IP assignment delay CNS assigned IPs within seconds Ruled out
Service creation race Framework guarantees Service before StatefulSet Ruled out

Specific case timeline (etcd-0 in a CI namespace, May 26):

Time Event
17:57:52 CPO reconciles etcd-discovery Service
17:57:53 CPO reconciles etcd StatefulSet
17:58:22 ensure-dns starts failing — "no such host"
17:58:26 CNS assigns pod IP 10.128.64.186
18:18:54 DNS finally resolves — 20 minutes after IP assignment

The pod had its IP within 4 seconds, the Service existed 30 seconds before the pod, CoreDNS was healthy — the EndpointSlice controller simply hadn't updated the headless Service's endpoints.

Impact on E2E stability

All three stage E2E jobs show poor stability:

Job Success Rate (last 20 runs)
periodic-stage-e2e-parallel 40%
periodic-stage-e2e-parallel-ocp-nightly 20%
branch-...-stage-e2e-parallel 60%

The Jira's specific failure was in branch-ci-Azure-ARO-HCP-main-e2e-stage-e2e-parallel build 2053695569343287296 (May 11 04:36 UTC), where the patch-name-cluster hit the 45-minute E2E timeout due to this etcd DNS issue.

Solution: Etcd Pod Self-Registration

Each etcd pod creates its own EndpointSlice during the ensure-dns init container phase:

etcd-discovery Service (headless, selector: app=etcd)
    |
    +-- etcd-discovery-xxxxx  (managed-by: endpointslice-controller.k8s.io)
    |   [Standard controller — may be slow/stale under high density]
    |
    +-- etcd-discovery-self-etcd-0  (managed-by: control-plane-operator)
    |   [Created by etcd-0 init container — immediate]
    |
    +-- etcd-discovery-self-etcd-1  (managed-by: control-plane-operator)
    +-- etcd-discovery-self-etcd-2  (managed-by: control-plane-operator)

Why this works:

  • The standard EndpointSlice controller only manages slices labeled managed-by: endpointslice-controller.k8s.io — it completely ignores slices with a different managed-by label
  • CoreDNS merges endpoints from ALL EndpointSlices for a given Service regardless of who created them
  • Once the self-registered slice exists, CoreDNS picks it up via its watch within ~100ms
  • When the standard controller catches up, duplicate IPs are harmless for DNS
  • OwnerReference to the Pod ensures cleanup on pod deletion

Precedent: The HCCO already uses this pattern for the kubernetes EndpointSlice in the guest cluster (ReconcileKASEndpointSlice in hostedclusterconfigoperator/controllers/resources/kas/reconcile.go).

Changed files

File Change
dnsresolver/cmd.go Added --self-register flag and selfRegisterEndpointSlice() — creates EndpointSlice via K8s API before polling DNS, falls back to DNS-only on failure
dnsresolver/cmd_test.go Unit tests for DNS name parsing
v2/assets/etcd/statefulset.yaml Added POD_IP env var and --self-register flag to ensure-dns init container
v2/assets/etcd/etcd-serviceaccount.yaml New etcd service account (always created)
v2/assets/etcd/etcd-self-register-role.yaml Role granting EndpointSlice get/create/update and pod get
v2/assets/etcd/etcd-self-register-rolebinding.yaml Binds role to etcd SA
v2/assets/etcd/defrag-rolebinding.yaml Updated subject from etcd-defrag-controller to etcd SA
v2/etcd/component.go Registered new RBAC assets, removed old defrag SA
v2/etcd/statefulset.go Always set ServiceAccountName = "etcd" (not conditionally for HA)
manifests/etcd.go Added EtcdServiceAccount, EtcdSelfRegisterRole, EtcdSelfRegisterRoleBinding helpers

Risks and mitigations

Risk Mitigation
Duplicate endpoints when controller catches up CoreDNS deduplicates; harmless for DNS resolution
Self-registered EndpointSlice outlives pod OwnerReference to Pod triggers GC on pod deletion
RBAC escalation Scoped to get/create/update EndpointSlices + get pods in own namespace only
Init container fails to create EndpointSlice Falls back to DNS-only behavior with a warning log

Test plan

  • Unit tests for DNS name parsing (dnsresolver/cmd_test.go)
  • Existing etcd component tests pass (14/14)
  • make build compiles cleanly
  • make lint passes with 0 issues
  • E2E validation in staging environment
  • Verify EndpointSlice creation in a live HCP namespace
  • Verify DNS resolution timing improvement

🤖 Generated with Claude Code

Summary by CodeRabbit

  • New Features

    • Etcd pods can self-register EndpointSlices via an optional DNS-resolver self-register mode (warns and falls back to DNS-only on failure).
    • RBAC and ServiceAccount updates to support etcd self-registration and unify controller bindings; defrag controller now uses the standardized etcd ServiceAccount.
    • Init behavior updated to expose pod IP for self-registration.
  • Tests

    • Added unit tests for DNS service-name parsing and EndpointSlice reconciliation.

@openshift-merge-bot

Copy link
Copy Markdown
Contributor

Pipeline controller notification
This repo is configured to use the pipeline controller. Second-stage tests will be triggered either automatically or after lgtm label is added, depending on the repository configuration. The pipeline controller will automatically detect which contexts are required and will utilize /test Prow commands to trigger the second stage.

For optional jobs, comment /test ? to see a list of all defined jobs. To trigger manually all jobs from second stage use /pipeline required command.

This repository is configured in: LGTM mode

@openshift-ci openshift-ci Bot added the do-not-merge/work-in-progress Indicates that a PR should not merge because it is a work in progress. label May 27, 2026
@openshift-ci-robot openshift-ci-robot added the jira/valid-reference Indicates that this PR references a valid Jira ticket of any type. label May 27, 2026
@openshift-ci

openshift-ci Bot commented May 27, 2026

Copy link
Copy Markdown
Contributor

Skipping CI for Draft Pull Request.
If you want CI signal for your change, please convert it to an actual PR.
You can still manually trigger a test run with /test all

@openshift-ci-robot

openshift-ci-robot commented May 27, 2026

Copy link
Copy Markdown

@cssjr: This pull request references AROSLSRE-830 which is a valid jira issue.

Warning: The referenced jira issue has an invalid target version for the target branch this PR targets: expected the task to target the "5.0.0" version, but no target version was set.

Details

In response to this:

Summary

  • Adds EndpointSlice self-registration to the etcd ensure-dns init container, bypassing stale kube-controller-manager informer caches that delay DNS resolution under high HCP density
  • Creates a dedicated etcd service account with minimal RBAC for EndpointSlice and pod get permissions
  • Replaces the conditional etcd-defrag-controller SA with the always-present etcd SA, binding both self-register and defrag (HA-only) roles to it

Problem

Under high cluster density (~2,500 HCP namespaces), the management cluster's kube-controller-manager EndpointSlice informer cache goes stale, causing etcd-discovery headless Service DNS records to be delayed by minutes to hours. The ensure-dns init container blocks waiting for DNS, preventing etcd from starting, which cascades into cluster creation timeouts.

Evidence from Kusto (HostedControlPlaneLogs, hcp-dev-us-2.eastus2)

Scale: Hundreds to thousands of clusters affected daily across CI:

Date         Clusters   Failures   Prolonged(>5m)   Brief(<=5m)
2026-05-13        675     15,947           37            638
2026-05-14      1,210     24,869          142          1,063
2026-05-19      1,542     29,622          248          1,297
2026-05-20      1,228     22,563          180          1,039
2026-05-21      1,311     23,130          174          1,135
2026-05-26        980     17,884          138            842

Breakdown (last 3 days): Of 2,299 etcd pods that experienced DNS failures, 89.5% resolved within 30 seconds, but 8.1% (187 pods) took >5 minutes — these are the ones that cause E2E test timeouts.

Root Cause: EndpointSlice informer cache staleness

Confirmed via aksEvents table (category kube-controller-manager) on May 14:

"Error syncing endpoint slices for service, retrying"
 key="ocm-arohcpprow-<cluster-id>/etcd-discovery"
 err="EndpointSlice informer cache is out of date"

This appeared for dozens of HCP namespaces — individual namespaces saw 136–227 sync errors. A secondary pattern ("skipping Pod for Service, Node not found") shows the same informer staleness affecting node visibility, with top nodes seeing 330–557 skip events.

What's ruled out:

Hypothesis Evidence Verdict
CoreDNS unhealthy No CoreDNS errors in logs Ruled out
Node-specific issue Failures evenly distributed across all nodes Ruled out
CNI/IP assignment delay CNS assigned IPs within seconds Ruled out
Service creation race Framework guarantees Service before StatefulSet Ruled out

Specific case timeline (etcd-0 in a CI namespace, May 26):

Time Event
17:57:52 CPO reconciles etcd-discovery Service
17:57:53 CPO reconciles etcd StatefulSet
17:58:22 ensure-dns starts failing — "no such host"
17:58:26 CNS assigns pod IP 10.128.64.186
18:18:54 DNS finally resolves — 20 minutes after IP assignment

The pod had its IP within 4 seconds, the Service existed 30 seconds before the pod, CoreDNS was healthy — the EndpointSlice controller simply hadn't updated the headless Service's endpoints.

Impact on E2E stability

All three stage E2E jobs show poor stability:

Job Success Rate (last 20 runs)
periodic-stage-e2e-parallel 40%
periodic-stage-e2e-parallel-ocp-nightly 20%
branch-...-stage-e2e-parallel 60%

The Jira's specific failure was in branch-ci-Azure-ARO-HCP-main-e2e-stage-e2e-parallel build 2053695569343287296 (May 11 04:36 UTC), where the patch-name-cluster hit the 45-minute E2E timeout due to this etcd DNS issue.

Solution: Etcd Pod Self-Registration

Each etcd pod creates its own EndpointSlice during the ensure-dns init container phase:

etcd-discovery Service (headless, selector: app=etcd)
   |
   +-- etcd-discovery-xxxxx  (managed-by: endpointslice-controller.k8s.io)
   |   [Standard controller — may be slow/stale under high density]
   |
   +-- etcd-discovery-self-etcd-0  (managed-by: control-plane-operator)
   |   [Created by etcd-0 init container — immediate]
   |
   +-- etcd-discovery-self-etcd-1  (managed-by: control-plane-operator)
   +-- etcd-discovery-self-etcd-2  (managed-by: control-plane-operator)

Why this works:

  • The standard EndpointSlice controller only manages slices labeled managed-by: endpointslice-controller.k8s.io — it completely ignores slices with a different managed-by label
  • CoreDNS merges endpoints from ALL EndpointSlices for a given Service regardless of who created them
  • Once the self-registered slice exists, CoreDNS picks it up via its watch within ~100ms
  • When the standard controller catches up, duplicate IPs are harmless for DNS
  • OwnerReference to the Pod ensures cleanup on pod deletion

Precedent: The HCCO already uses this pattern for the kubernetes EndpointSlice in the guest cluster (ReconcileKASEndpointSlice in hostedclusterconfigoperator/controllers/resources/kas/reconcile.go).

Changed files

File Change
dnsresolver/cmd.go Added --self-register flag and selfRegisterEndpointSlice() — creates EndpointSlice via K8s API before polling DNS, falls back to DNS-only on failure
dnsresolver/cmd_test.go Unit tests for DNS name parsing
v2/assets/etcd/statefulset.yaml Added POD_IP env var and --self-register flag to ensure-dns init container
v2/assets/etcd/etcd-serviceaccount.yaml New etcd service account (always created)
v2/assets/etcd/etcd-self-register-role.yaml Role granting EndpointSlice get/create/update and pod get
v2/assets/etcd/etcd-self-register-rolebinding.yaml Binds role to etcd SA
v2/assets/etcd/defrag-rolebinding.yaml Updated subject from etcd-defrag-controller to etcd SA
v2/etcd/component.go Registered new RBAC assets, removed old defrag SA
v2/etcd/statefulset.go Always set ServiceAccountName = "etcd" (not conditionally for HA)
manifests/etcd.go Added EtcdServiceAccount, EtcdSelfRegisterRole, EtcdSelfRegisterRoleBinding helpers

Risks and mitigations

Risk Mitigation
Duplicate endpoints when controller catches up CoreDNS deduplicates; harmless for DNS resolution
Self-registered EndpointSlice outlives pod OwnerReference to Pod triggers GC on pod deletion
RBAC escalation Scoped to get/create/update EndpointSlices + get pods in own namespace only
Init container fails to create EndpointSlice Falls back to DNS-only behavior with a warning log

Test plan

  • Unit tests for DNS name parsing (dnsresolver/cmd_test.go)
  • Existing etcd component tests pass (14/14)
  • make build compiles cleanly
  • make lint passes with 0 issues
  • E2E validation in staging environment
  • Verify EndpointSlice creation in a live HCP namespace
  • Verify DNS resolution timing improvement

🤖 Generated with Claude Code

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the openshift-eng/jira-lifecycle-plugin repository.

@coderabbitai

coderabbitai Bot commented May 27, 2026

Copy link
Copy Markdown
Contributor

Note

Reviews paused

It looks like this branch is under active development. To avoid overwhelming you with review comments due to an influx of new commits, CodeRabbit has automatically paused this review. You can configure this behavior by changing the reviews.auto_review.auto_pause_after_reviewed_commits setting.

Use the following commands to manage reviews:

  • @coderabbitai resume to resume automatic reviews.
  • @coderabbitai review to trigger a single review.

Use the checkboxes below for quick actions:

  • ▶️ Resume reviews
  • 🔍 Trigger review
📝 Walkthrough

Walkthrough

This PR implements etcd self-registration: adds manifest constructors and YAML assets (ServiceAccount, etcd-self-register Role/RoleBinding; updates defrag RoleBinding subject), updates the StatefulSet initContainer to pass --self-register and POD_IP and uses the unified etcd ServiceAccount, updates component wiring, and extends dnsresolver with a --self-register flag plus selfRegisterEndpointSlice/ensureEndpointSlice/parseServiceName and unit tests.

Sequence Diagram(s)

sequenceDiagram
  participant dnsresolver
  participant KubeAPI
  participant EndpointSlice
  participant DNS

  dnsresolver->>KubeAPI: ensureEndpointSlice(create/update EndpointSlice for pod IP and targetRef)
  KubeAPI->>EndpointSlice: persist EndpointSlice
  dnsresolver->>DNS: resolveDNS(dnsName)
  DNS-->>dnsresolver: resolution results
Loading

Suggested reviewers

  • patjlm
🚥 Pre-merge checks | ✅ 10 | ❌ 1

❌ Failed checks (1 warning)

Check name Status Explanation Resolution
Test Structure And Quality ⚠️ Warning Tests lack meaningful assertion failure messages (criterion #4). 26+ assertions use bare Expect() calls without messages; custom check requires context messages per documented examples. Add failure context messages to all assertions: e.g., Expect(err).NotTo(HaveOccurred(), "failed to ensure EndpointSlice") and Expect(slice.Labels[...]).To(Equal(...), "service name label mismatch").
✅ Passed checks (10 passed)
Check name Status Explanation
Description Check ✅ Passed Check skipped - CodeRabbit’s high-level summary is enabled.
Title check ✅ Passed The title directly describes the main change: adding etcd EndpointSlice self-registration to resolve DNS delays, which aligns with the core purpose of the PR.
Linked Issues check ✅ Passed Check skipped because no linked issues were found for this pull request.
Out of Scope Changes check ✅ Passed Check skipped because no linked issues were found for this pull request.
Stable And Deterministic Test Names ✅ Passed All test names added in PR are stable, deterministic. No dynamic values in titles—pod names, UUIDs, timestamps, node names, IPs appear only in test bodies, not names.
Topology-Aware Scheduling Compatibility ✅ Passed No topology-unaware scheduling constraints found in statefulset, RBAC, PDB, or code. Defrag controller is topology-aware and only runs on HighlyAvailable clusters.
Ipv6 And Disconnected Network Test Compatibility ✅ Passed PR adds standard Go unit tests (testing package), not Ginkgo e2e tests. Custom check applies only to Ginkgo e2e tests, which are absent.
No-Weak-Crypto ✅ Passed No weak cryptography found. PR adds etcd EndpointSlice self-registration using only standard Kubernetes APIs with no cryptographic operations, weak algorithms, or insecure comparisons.
Container-Privileges ✅ Passed No privileged containers, host access, elevated capabilities, or inappropriate privilege escalation settings found. All RBAC is properly scoped with minimal necessary permissions.
No-Sensitive-Data-In-Logs ✅ Passed No passwords, tokens, API keys, PII, session IDs, or customer data found in logging statements. Infrastructure metadata (pod IPs, DNS names) are logged but not classified as sensitive.

✏️ Tip: You can configure your own custom pre-merge checks in the settings.

✨ Finishing Touches
🧪 Generate unit tests (beta)
  • Create PR with unit tests

Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out.

❤️ Share

Comment @coderabbitai help to get the list of available commands and usage tips.

@openshift-ci openshift-ci Bot added area/control-plane-operator Indicates the PR includes changes for the control plane operator - in an OCP release and removed do-not-merge/needs-area labels May 27, 2026

@coderabbitai coderabbitai Bot left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 1

🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Inline comments:
In `@dnsresolver/cmd.go`:
- Around line 150-164: The Get call in selfRegisterEndpointSlice currently
treats any error as "not found" and proceeds to Create; change the error
handling to return immediately for non-NotFound errors from
client.DiscoveryV1().EndpointSlices(...).Get(ctx, sliceName, ...). Specifically,
import and use apierrors.IsNotFound(err) (k8s.io/apimachinery/pkg/api/errors)
and: if err == nil keep the Update path, else if apierrors.IsNotFound(err)
perform the Create path, otherwise return the original Get error (wrap with
fmt.Errorf). This prevents masking RBAC/timeout errors while preserving the
existing Update/Create logic and messages.
🪄 Autofix (Beta)

Fix all unresolved CodeRabbit comments on this PR:

  • Push a commit to this branch (recommended)
  • Create a new PR with the fixes

ℹ️ Review info
⚙️ Run configuration

Configuration used: Repository YAML (base), Central YAML (inherited)

Review profile: CHILL

Plan: Enterprise

Run ID: 64afd625-994c-4a36-ad76-4ecbbb5a19ca

📥 Commits

Reviewing files that changed from the base of the PR and between 0aa20fc and 87cf121.

📒 Files selected for processing (10)
  • control-plane-operator/controllers/hostedcontrolplane/manifests/etcd.go
  • control-plane-operator/controllers/hostedcontrolplane/v2/assets/etcd/defrag-rolebinding.yaml
  • control-plane-operator/controllers/hostedcontrolplane/v2/assets/etcd/etcd-self-register-role.yaml
  • control-plane-operator/controllers/hostedcontrolplane/v2/assets/etcd/etcd-self-register-rolebinding.yaml
  • control-plane-operator/controllers/hostedcontrolplane/v2/assets/etcd/etcd-serviceaccount.yaml
  • control-plane-operator/controllers/hostedcontrolplane/v2/assets/etcd/statefulset.yaml
  • control-plane-operator/controllers/hostedcontrolplane/v2/etcd/component.go
  • control-plane-operator/controllers/hostedcontrolplane/v2/etcd/statefulset.go
  • dnsresolver/cmd.go
  • dnsresolver/cmd_test.go

Comment thread dnsresolver/cmd.go
@codecov

codecov Bot commented May 27, 2026

Copy link
Copy Markdown

Codecov Report

❌ Patch coverage is 49.67320% with 77 lines in your changes missing coverage. Please review.
✅ Project coverage is 41.76%. Comparing base (392fd5a) to head (d1662cf).
⚠️ Report is 6 commits behind head on main.

Files with missing lines Patch % Lines
dnsresolver/cmd.go 65.51% 38 Missing and 2 partials ⚠️
...r/controllers/hostedcontrolplane/manifests/etcd.go 0.00% 21 Missing ⚠️
...ontrollers/hostedcontrolplane/v2/etcd/component.go 0.00% 13 Missing ⚠️
...trollers/hostedcontrolplane/v2/etcd/statefulset.go 0.00% 3 Missing ⚠️
Additional details and impacted files
@@            Coverage Diff             @@
##             main    #8613      +/-   ##
==========================================
+ Coverage   41.75%   41.76%   +0.01%     
==========================================
  Files         758      758              
  Lines       93981    94132     +151     
==========================================
+ Hits        39240    39316      +76     
- Misses      51988    52061      +73     
- Partials     2753     2755       +2     
Files with missing lines Coverage Δ
...trollers/hostedcontrolplane/v2/etcd/statefulset.go 40.16% <0.00%> (-0.67%) ⬇️
...ontrollers/hostedcontrolplane/v2/etcd/component.go 10.20% <0.00%> (-3.69%) ⬇️
...r/controllers/hostedcontrolplane/manifests/etcd.go 45.83% <0.00%> (-7.83%) ⬇️
dnsresolver/cmd.go 53.52% <65.51%> (+53.52%) ⬆️
Flag Coverage Δ
cmd-support 35.02% <ø> (ø)
cpo-hostedcontrolplane 44.01% <0.00%> (-0.10%) ⬇️
cpo-other 43.45% <ø> (ø)
hypershift-operator 51.82% <ø> (ø)
other 31.96% <65.51%> (+0.39%) ⬆️

Flags with carried forward coverage won't be shown. Click here to find out more.

🚀 New features to boost your workflow:
  • ❄️ Test Analytics: Detect flaky tests, report on failures, and find test suite problems.

@cssjr cssjr force-pushed the etcd-endpointslice-self-register branch from 87cf121 to 39ffc19 Compare May 27, 2026 21:13
@cssjr

cssjr commented May 27, 2026

Copy link
Copy Markdown
Contributor Author

/cc patjlm

@openshift-ci openshift-ci Bot requested a review from patjlm May 27, 2026 21:22

@coderabbitai coderabbitai Bot left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (1)
dnsresolver/cmd.go (1)

177-182: ⚠️ Potential issue | 🟡 Minor | ⚡ Quick win

Reject malformed non-service DNS names before self-registering.

parseServiceName only checks the label count and then trusts parts[1]. With --self-register, an input like foo.bar.baz would still create or update an EndpointSlice for service bar in the current namespace instead of failing fast. Please validate the documented <pod>.<service>.<namespace>.svc[.cluster.local] shape here before returning a service name.

Proposed fix
 func parseServiceName(dnsName string) (string, error) {
 	parts := strings.Split(dnsName, ".")
-	if len(parts) < 3 {
-		return "", fmt.Errorf("expected at least 3 dot-separated components, got %d", len(parts))
+	if len(parts) < 4 || parts[0] == "" || parts[1] == "" || parts[2] == "" || parts[3] != "svc" {
+		return "", fmt.Errorf("expected <pod>.<service>.<namespace>.svc[.cluster.local], got %q", dnsName)
 	}
 	return parts[1], nil
 }
As per coding guidelines, "Validate at trust boundaries with allow-lists, not deny-lists".
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@dnsresolver/cmd.go` around lines 177 - 182, parseServiceName currently only
checks label count and trusts parts[1]; instead validate the full DNS shape
pod.service.namespace.svc[.cluster.local] before returning the service. In
parseServiceName, split on '.', ensure no empty labels, require parts[3] ==
"svc", allow either len(parts)==4 or len(parts)==6 with parts[4]=="cluster" &&
parts[5]=="local", and only then return parts[1]; otherwise return a descriptive
error rejecting the malformed name.
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Outside diff comments:
In `@dnsresolver/cmd.go`:
- Around line 177-182: parseServiceName currently only checks label count and
trusts parts[1]; instead validate the full DNS shape
pod.service.namespace.svc[.cluster.local] before returning the service. In
parseServiceName, split on '.', ensure no empty labels, require parts[3] ==
"svc", allow either len(parts)==4 or len(parts)==6 with parts[4]=="cluster" &&
parts[5]=="local", and only then return parts[1]; otherwise return a descriptive
error rejecting the malformed name.

ℹ️ Review info
⚙️ Run configuration

Configuration used: Repository YAML (base), Central YAML (inherited)

Review profile: CHILL

Plan: Enterprise

Run ID: 11fd82fa-916a-486b-a98c-7ddd6ed5ad67

📥 Commits

Reviewing files that changed from the base of the PR and between b877681 and 30ffd78.

📒 Files selected for processing (2)
  • dnsresolver/cmd.go
  • dnsresolver/cmd_test.go

@cssjr

cssjr commented May 27, 2026

Copy link
Copy Markdown
Contributor Author

/cc @garrickdabbs

@openshift-ci openshift-ci Bot requested a review from garrickdabbs May 27, 2026 21:30
@cssjr

cssjr commented May 27, 2026

Copy link
Copy Markdown
Contributor Author

/test all

@cssjr

cssjr commented May 27, 2026

Copy link
Copy Markdown
Contributor Author

/pipeline required

@openshift-merge-bot

Copy link
Copy Markdown
Contributor

Scheduling tests matching the pipeline_run_if_changed or not excluded by pipeline_skip_if_only_changed parameters:
/test e2e-aks
/test e2e-aws
/test e2e-aws-upgrade-hypershift-operator
/test e2e-azure-self-managed
/test e2e-kubevirt-aws-ovn-reduced
/test e2e-v2-aws
/test e2e-v2-gke

@cwbotbot

cwbotbot commented May 28, 2026

Copy link
Copy Markdown

Test Results

e2e-aws

e2e-aks

@cssjr cssjr marked this pull request as ready for review May 28, 2026 14:29
@openshift-ci openshift-ci Bot removed the do-not-merge/work-in-progress Indicates that a PR should not merge because it is a work in progress. label May 28, 2026
@openshift-ci openshift-ci Bot requested review from devguyio and sjenning May 28, 2026 14:30
@cssjr

cssjr commented May 28, 2026

Copy link
Copy Markdown
Contributor Author

/retest

@cssjr

cssjr commented May 29, 2026

Copy link
Copy Markdown
Contributor Author

/test images

@cssjr

cssjr commented May 29, 2026

Copy link
Copy Markdown
Contributor Author

/cc @bryan-cox @Patryk-Stefanski

@openshift-ci openshift-ci Bot requested a review from bryan-cox May 29, 2026 15:12
@cssjr

cssjr commented Jun 3, 2026

Copy link
Copy Markdown
Contributor Author

✅ Addressed feedback from @enxebre in commit c8b0fa1

All 4 comments have been addressed:

1. Added comment in statefulset.yaml (line 192) ✅

Added a 3-line comment explaining why --self-register is needed:

# Self-register this pod's IP into an EndpointSlice before waiting for DNS resolution.
# This bypasses stale kube-controller-manager EndpointSlice cache issues under high HCP density
# (2,500+ namespaces), where the standard controller can delay DNS updates by minutes to hours.

2. Added comment in cmd.go (line 46) ✅

Added a 3-line comment documenting the --self-register flag:

// --self-register creates an EndpointSlice for this pod before waiting for DNS resolution.
// This bypasses stale kube-controller-manager EndpointSlice cache issues under high HCP density,
// where the standard controller can delay DNS updates by minutes to hours.

3. Fixed IP parsing to handle nil and avoid silent failure (line 112) ✅

Replaced manual net.ParseIP().To4() check with netutil.IsIPv4Address() which:

  • Returns an error if ParseIP returns nil (no more silent failures)
  • Properly handles the To4() check
  • Follows existing codebase patterns

Before:

addressType := discoveryv1.AddressTypeIPv4
if net.ParseIP(podIP) != nil && net.ParseIP(podIP).To4() == nil {
    addressType = discoveryv1.AddressTypeIPv6
}

After:

isIPv4, err := netutil.IsIPv4Address(podIP)
if err != nil {
    return fmt.Errorf("failed to parse pod IP %s: %w", podIP, err)
}
addressType := discoveryv1.AddressTypeIPv6
if isIPv4 {
    addressType = discoveryv1.AddressTypeIPv4
}

4. Used support/netutil.IsIPv4Address()

Implemented as part of #3 above.

Verification

  • ✅ All tests pass
  • make build succeeds
  • make lint-fix applied
  • make pre-commit passes

@cssjr cssjr requested a review from enxebre June 3, 2026 15:33
@cssjr

cssjr commented Jun 3, 2026

Copy link
Copy Markdown
Contributor Author

/test e2e-aks
/test e2e-aws
/test e2e-aws-upgrade-hypershift-operator
/test e2e-azure-self-managed
/test e2e-kubevirt-aws-ovn-reduced
/test e2e-v2-aws
/test e2e-v2-gke

@hypershift-jira-solve-ci

Copy link
Copy Markdown

AI Test Failure Analysis

Job: pull-ci-openshift-hypershift-main-e2e-aks | Build: 2062196402125017088 | Cost: $4.831960000000001 | Failed step: hypershift-azure-run-e2e

View full analysis report


Generated by hypershift-analyze-e2e-failure post-step using Claude claude-opus-4-6

@cssjr

cssjr commented Jun 3, 2026

Copy link
Copy Markdown
Contributor Author

/retest-required

@csrwng csrwng left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM — clean implementation of the EndpointSlice self-registration pattern.

RBAC is well-scoped (get/create/update EndpointSlices + get Pods), the custom managed-by label avoids conflict with the standard controller, and the OwnerReference to the Pod ensures proper GC. The backwards-compatible SA split (keeping etcd-defrag-controller for HA, adding etcd for all topologies) addresses the upgrade concern. Tests cover create, update, IPv6, error cases, and multi-pod scenarios.

@openshift-ci openshift-ci Bot added the lgtm Indicates that a PR is ready to be merged. label Jun 3, 2026
@openshift-merge-bot

Copy link
Copy Markdown
Contributor

Tests from second stage were triggered manually. Pipeline can be controlled only manually, until HEAD changes. Use command to trigger second stage.

@openshift-ci

openshift-ci Bot commented Jun 3, 2026

Copy link
Copy Markdown
Contributor

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: bryan-cox, csrwng, cssjr

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Details Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@cssjr

cssjr commented Jun 3, 2026

Copy link
Copy Markdown
Contributor Author

/retest-required

@cssjr

cssjr commented Jun 4, 2026

Copy link
Copy Markdown
Contributor Author

/test e2e-azure-v2-self-managed

@cssjr

cssjr commented Jun 12, 2026

Copy link
Copy Markdown
Contributor Author

Running validation on Azure/ARO-HCP#5635 with a custom image containing these changes (after rebasing).

cssjr and others added 7 commits June 16, 2026 16:46
…ntainer

Under high HCP density (~2,500 namespaces), the management cluster's
kube-controller-manager EndpointSlice informer cache goes stale, delaying
etcd-discovery headless Service DNS records by minutes to hours. This
causes etcd pods to block in the ensure-dns init container, preventing
cluster creation and triggering E2E test timeouts.

Each etcd pod now self-registers its IP into a custom EndpointSlice
(managed-by: control-plane-operator) before polling DNS. The standard
EndpointSlice controller ignores slices with a different managed-by
label, so there is no conflict. CoreDNS picks up the custom slice via
its watch within ~100ms, making DNS resolution near-instant regardless
of controller informer health.

A new dedicated "etcd" service account replaces the conditional
"etcd-defrag-controller" SA, with RBAC granting EndpointSlice and pod
get permissions. The defrag Role/RoleBinding remain HA-only.

Co-Authored-By: Claude Opus 4.6 (1M context) <[email protected]>
Use apierrors.IsNotFound to only fall through to Create when the
EndpointSlice genuinely doesn't exist. RBAC, timeout, and other errors
are now returned immediately instead of being masked by a subsequent
Create attempt.

Co-Authored-By: Claude Opus 4.6 (1M context) <[email protected]>
Refactor selfRegisterEndpointSlice to extract testable
ensureEndpointSlice function that accepts a kubernetes.Interface,
enabling tests with the fake client. Tests cover: create, update,
IPv6, missing pod, invalid DNS name, multiple pods, and field
correctness.

Coverage: ensureEndpointSlice 88.9%, parseServiceName 100%.

Co-Authored-By: Claude Opus 4.6 (1M context) <[email protected]>
This addresses review feedback on PR openshift#8613 to maintain backwards
compatibility during ServiceAccount changes.

Blocking issues fixed:
- Keep etcd-defrag-controller SA (HA-only) separate from etcd SA
  (always present) to avoid orphaned SAs on upgrade
- Use fully qualified managed-by label
  "etcd-self-register.hypershift.openshift.io" to avoid future
  cross-contamination with LabelManagedBy filters

Suggestions implemented:
- Add Controller and BlockOwnerDeletion to OwnerReference for
  explicit GC intent
- Replace context.Background() with t.Context() in tests to prevent
  goroutine leaks
- Use gomega MatchError for cleaner error assertions
- Simplify IP generation in tests with fmt.Sprintf

The SA split works as follows:
- HA mode: use etcd-defrag-controller SA with both defrag and
  self-register roles bound to it
- Non-HA mode: use etcd SA with only self-register role

This keeps the PR scoped to self-registration without upgrade risk.

Signed-off-by: Cliff Schomburg <[email protected]>
Commit-Message-Assisted-by: Claude (via Claude Code)
Address feedback from @enxebre to document why --self-register is
needed and improve IP address parsing.

Changes:
- Added comment in statefulset.yaml explaining --self-register flag
  purpose (bypasses stale EndpointSlice cache under high HCP density)
- Added comment in dnsresolver/cmd.go documenting the flag
- Replaced manual net.ParseIP().To4() check with
  netutil.IsIPv4Address() which properly handles ParseIP returning
  nil and returns an error instead of silently failing

Signed-off-by: Cliff Schomburg <[email protected]>
Commit-Message-Assisted-by: Claude (via Claude Code)
Add test to verify that ensureEndpointSlice properly returns an error
when given a malformed IP address (not IPv4 or IPv6).

This ensures the netutil.IsIPv4Address() error handling works correctly
and doesn't silently fail on invalid input.

Signed-off-by: Cliff Schomburg <[email protected]>
Commit-Message-Assisted-by: Claude (via Claude Code)
@cssjr cssjr force-pushed the etcd-endpointslice-self-register branch from 824480d to d1662cf Compare June 16, 2026 23:47
@openshift-ci openshift-ci Bot removed the lgtm Indicates that a PR is ready to be merged. label Jun 16, 2026
@cssjr

cssjr commented Jun 17, 2026

Copy link
Copy Markdown
Contributor Author

Rebased - no code changes

@bryan-cox

Copy link
Copy Markdown
Member

/lgtm

putting this back since it was only rebase

@openshift-ci openshift-ci Bot added the lgtm Indicates that a PR is ready to be merged. label Jun 17, 2026
@openshift-merge-bot

Copy link
Copy Markdown
Contributor

Scheduling tests matching the pipeline_run_if_changed or not excluded by pipeline_skip_if_only_changed parameters:
/test e2e-aks
/test e2e-aws
/test e2e-aws-upgrade-hypershift-operator
/test e2e-azure-v2-self-managed
/test e2e-kubevirt-aws-ovn-reduced
/test e2e-v2-aws
/test e2e-v2-gke

@bryan-cox

Copy link
Copy Markdown
Member

/retest

@openshift-ci

openshift-ci Bot commented Jun 17, 2026

Copy link
Copy Markdown
Contributor

@cssjr: The following test failed, say /retest to rerun all failed tests or /retest-required to rerun all mandatory failed tests:

Test name Commit Details Required Rerun command
ci/prow/e2e-kubevirt-aws-ovn-reduced d1662cf link true /test e2e-kubevirt-aws-ovn-reduced

Full PR test history. Your PR dashboard.

Details

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository. I understand the commands that are listed here.

@hypershift-jira-solve-ci

hypershift-jira-solve-ci Bot commented Jun 17, 2026

Copy link
Copy Markdown

Confirmed — the PR is to the openshift/hypershift repo. The management cluster IPI install uses a standard OpenShift release image and the openshift-install binary. The HyperShift PR code (hypershift-operator, hypershift-tests) is not used until after the management cluster is installed. This failure is in the IPI install (pre-phase), making it unrelated to the PR changes.

Now I have all the evidence needed. Let me compile the report.

Test Failure Analysis Complete

Job Information

Test Failure Analysis

Error

level=error msg=Bootstrap failed to complete: timed out waiting for the condition
level=error msg=Failed to wait for bootstrapping to complete. This error usually happens when there is a problem with control plane hosts that prevents the control plane operators from creating the control plane.

Summary

The management cluster IPI installation (pre-phase step ipi-install-install) failed with a bootstrap timeout after 45 minutes. The openshift-apiserver operator never reached Available status (APIServices_PreconditionNotReady), which blocked the routes.route.openshift.io API from registering. This caused a cascade: the monitoring operator couldn't create Route objects, the ingress operator's router deployment never achieved minimum replicas (0/1 available), and the authentication operator couldn't reach the oauth-server endpoints. This is a management cluster infrastructure failure that occurred before any HyperShift code from PR #8613 was involved — the PR changes to etcd EndpointSlice self-registration are not exercised until the HyperShift operator is deployed on top of an already-installed management cluster.

Root Cause

This is an infrastructure flake in the management cluster IPI bootstrap, unrelated to the PR under test.

Failure chain (traced to root):

  1. openshift-apiserver never became Available — reported APIServices_PreconditionNotReady for the entire 45-minute bootstrap window (14:54–15:39 UTC). The openshift-apiserver registers custom API services (including routes.route.openshift.io). When its APIServices preconditions aren't met, these API resources remain unregistered.

  2. monitoring operator failed because it tried to create Route objects (post routes.route.openshift.io) but the API resource was never registered (depends on openshift-apiserver). Similarly, Prometheus CRDs (prometheuses.monitoring.coreos.com) were not available.

  3. ingress operator reported the default IngressController's router deployment had 0/1 replicas available (MinimumReplicasUnavailable). The router pod could not be scheduled. The cluster used c5n.metal worker nodes with replicas: 1, but worker MachineSet creation happens post-bootstrap, so no worker node was present during bootstrap. The router must schedule on a master node during bootstrap, but the openshift-apiserver being down likely prevented proper admission/scheduling chain completion.

  4. authentication operator failed because it depends on both the openshift-apiserver (for APIServices) and ingress (for the oauth-server endpoint routing). With both down, oauth endpoints were unreachable.

  5. DNS resolution issues were observed: read udp 10.128.0.8:→172.30.0.10:53: read: connection refused — the cluster's CoreDNS service was intermittently refusing connections, which also prevented the insights operator from reaching api.openshift.com.

Why this is unrelated to PR #8613:

  • The failure is in the IPI management cluster install step (ipi-install-install), which uses the standard openshift-install binary and OCP release image
  • PR AROSLSRE-830: Add etcd EndpointSlice self-registration to fix DNS resolution delays #8613 modifies HyperShift's etcd EndpointSlice logic, which is not invoked until the hypershift-operator is deployed on the management cluster — a step that never occurred because installation failed first
  • The openshift-apiserver PreconditionNotReady failure is a known IPI bootstrap flake pattern
Recommendations
  1. Retrigger the job — This is an infrastructure flake in the management cluster IPI bootstrap, unrelated to the PR changes. A retest (/retest) should resolve the issue.

  2. No PR code changes needed — The HyperShift etcd EndpointSlice changes from PR AROSLSRE-830: Add etcd EndpointSlice self-registration to fix DNS resolution delays #8613 were never exercised in this run. The failure occurred entirely within the standard OpenShift IPI installer before any HyperShift components were deployed.

  3. If the failure recurs, investigate whether the openshift-apiserver PreconditionNotReady pattern is occurring at elevated rates in the CI fleet, which would indicate a broader platform regression rather than an isolated flake.

Evidence
Evidence Detail
Failed Step ipi-install-install (pre-phase) — management cluster IPI installation
Exit Code Installer exit code 5 (bootstrap timeout)
Bootstrap Wait 45 minutes (14:54:51–15:39:51 UTC), timed out
openshift-apiserver Available=False with APIServices_PreconditionNotReady for entire bootstrap window
ingress Available=False — router deployment 0/1 replicas (MinimumReplicasUnavailable)
monitoring Available=Falseroutes.route.openshift.io API resource not found (depends on openshift-apiserver)
authentication Available=False — oauth-server endpoints not ready, connection refused to 172.30.206.61:443
DNS errors read udp 10.128.0.8:→172.30.0.10:53: read: connection refused (CoreDNS intermittently unavailable)
Control Plane Nodes All 3 masters provisioned and joined (master-2 at 14:57:48, master-0 at 14:59:08, master-1 at 14:59:48)
Worker Nodes c5n.metal configured but workers are post-bootstrap; none present during bootstrap
PR Relevance PR #8613 modifies openshift/hypershift etcd EndpointSlice code — never invoked during IPI install
k8sgpt step Also failed with the server doesn't have a resource type "hostedcluster" — expected cascading failure since cluster install never completed

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

approved Indicates a PR has been approved by an approver from all required OWNERS files. area/control-plane-operator Indicates the PR includes changes for the control plane operator - in an OCP release jira/valid-reference Indicates that this PR references a valid Jira ticket of any type. lgtm Indicates that a PR is ready to be merged.

Projects

None yet

Development

Successfully merging this pull request may close these issues.

7 participants