AROSLSRE-830: Add etcd EndpointSlice self-registration to fix DNS resolution delays by cssjr · Pull Request #8613 · openshift/hypershift

cssjr · 2026-05-27T20:54:33Z

Summary

Adds EndpointSlice self-registration to the etcd ensure-dns init container, bypassing stale kube-controller-manager informer caches that delay DNS resolution under high HCP density
Creates a dedicated etcd service account with minimal RBAC for EndpointSlice and pod get permissions
Replaces the conditional etcd-defrag-controller SA with the always-present etcd SA, binding both self-register and defrag (HA-only) roles to it

Problem

Under high cluster density (~2,500 HCP namespaces), the management cluster's kube-controller-manager EndpointSlice informer cache goes stale, causing etcd-discovery headless Service DNS records to be delayed by minutes to hours. The ensure-dns init container blocks waiting for DNS, preventing etcd from starting, which cascades into cluster creation timeouts.

Evidence from Kusto (HostedControlPlaneLogs, hcp-dev-us-2.eastus2)

Scale: Hundreds to thousands of clusters affected daily across CI:

Date         Clusters   Failures   Prolonged(>5m)   Brief(<=5m)
2026-05-13        675     15,947           37            638
2026-05-14      1,210     24,869          142          1,063
2026-05-19      1,542     29,622          248          1,297
2026-05-20      1,228     22,563          180          1,039
2026-05-21      1,311     23,130          174          1,135
2026-05-26        980     17,884          138            842

Breakdown (last 3 days): Of 2,299 etcd pods that experienced DNS failures, 89.5% resolved within 30 seconds, but 8.1% (187 pods) took >5 minutes — these are the ones that cause E2E test timeouts.

Root Cause: EndpointSlice informer cache staleness

Confirmed via aksEvents table (category kube-controller-manager) on May 14:

"Error syncing endpoint slices for service, retrying"
  key="ocm-arohcpprow-<cluster-id>/etcd-discovery"
  err="EndpointSlice informer cache is out of date"

This appeared for dozens of HCP namespaces — individual namespaces saw 136–227 sync errors. A secondary pattern ("skipping Pod for Service, Node not found") shows the same informer staleness affecting node visibility, with top nodes seeing 330–557 skip events.

What's ruled out:

Hypothesis	Evidence	Verdict
CoreDNS unhealthy	No CoreDNS errors in logs	Ruled out
Node-specific issue	Failures evenly distributed across all nodes	Ruled out
CNI/IP assignment delay	CNS assigned IPs within seconds	Ruled out
Service creation race	Framework guarantees Service before StatefulSet	Ruled out

Specific case timeline (etcd-0 in a CI namespace, May 26):

Time	Event
17:57:52	CPO reconciles `etcd-discovery` Service
17:57:53	CPO reconciles etcd StatefulSet
17:58:22	`ensure-dns` starts failing — `"no such host"`
17:58:26	CNS assigns pod IP `10.128.64.186`
18:18:54	DNS finally resolves — 20 minutes after IP assignment

The pod had its IP within 4 seconds, the Service existed 30 seconds before the pod, CoreDNS was healthy — the EndpointSlice controller simply hadn't updated the headless Service's endpoints.

Impact on E2E stability

All three stage E2E jobs show poor stability:

Job	Success Rate (last 20 runs)
`periodic-stage-e2e-parallel`	40%
`periodic-stage-e2e-parallel-ocp-nightly`	20%
`branch-...-stage-e2e-parallel`	60%

The Jira's specific failure was in branch-ci-Azure-ARO-HCP-main-e2e-stage-e2e-parallel build 2053695569343287296 (May 11 04:36 UTC), where the patch-name-cluster hit the 45-minute E2E timeout due to this etcd DNS issue.

Solution: Etcd Pod Self-Registration

Each etcd pod creates its own EndpointSlice during the ensure-dns init container phase:

etcd-discovery Service (headless, selector: app=etcd)
    |
    +-- etcd-discovery-xxxxx  (managed-by: endpointslice-controller.k8s.io)
    |   [Standard controller — may be slow/stale under high density]
    |
    +-- etcd-discovery-self-etcd-0  (managed-by: control-plane-operator)
    |   [Created by etcd-0 init container — immediate]
    |
    +-- etcd-discovery-self-etcd-1  (managed-by: control-plane-operator)
    +-- etcd-discovery-self-etcd-2  (managed-by: control-plane-operator)

Why this works:

The standard EndpointSlice controller only manages slices labeled managed-by: endpointslice-controller.k8s.io — it completely ignores slices with a different managed-by label
CoreDNS merges endpoints from ALL EndpointSlices for a given Service regardless of who created them
Once the self-registered slice exists, CoreDNS picks it up via its watch within ~100ms
When the standard controller catches up, duplicate IPs are harmless for DNS
OwnerReference to the Pod ensures cleanup on pod deletion

Precedent: The HCCO already uses this pattern for the kubernetes EndpointSlice in the guest cluster (ReconcileKASEndpointSlice in hostedclusterconfigoperator/controllers/resources/kas/reconcile.go).

Changed files

File	Change
`dnsresolver/cmd.go`	Added `--self-register` flag and `selfRegisterEndpointSlice()` — creates EndpointSlice via K8s API before polling DNS, falls back to DNS-only on failure
`dnsresolver/cmd_test.go`	Unit tests for DNS name parsing
`v2/assets/etcd/statefulset.yaml`	Added `POD_IP` env var and `--self-register` flag to `ensure-dns` init container
`v2/assets/etcd/etcd-serviceaccount.yaml`	New `etcd` service account (always created)
`v2/assets/etcd/etcd-self-register-role.yaml`	Role granting EndpointSlice get/create/update and pod get
`v2/assets/etcd/etcd-self-register-rolebinding.yaml`	Binds role to `etcd` SA
`v2/assets/etcd/defrag-rolebinding.yaml`	Updated subject from `etcd-defrag-controller` to `etcd` SA
`v2/etcd/component.go`	Registered new RBAC assets, removed old defrag SA
`v2/etcd/statefulset.go`	Always set `ServiceAccountName = "etcd"` (not conditionally for HA)
`manifests/etcd.go`	Added `EtcdServiceAccount`, `EtcdSelfRegisterRole`, `EtcdSelfRegisterRoleBinding` helpers

Risks and mitigations

Risk	Mitigation
Duplicate endpoints when controller catches up	CoreDNS deduplicates; harmless for DNS resolution
Self-registered EndpointSlice outlives pod	OwnerReference to Pod triggers GC on pod deletion
RBAC escalation	Scoped to get/create/update EndpointSlices + get pods in own namespace only
Init container fails to create EndpointSlice	Falls back to DNS-only behavior with a warning log

Test plan

Unit tests for DNS name parsing (dnsresolver/cmd_test.go)
Existing etcd component tests pass (14/14)
make build compiles cleanly
make lint passes with 0 issues
E2E validation in staging environment
Verify EndpointSlice creation in a live HCP namespace
Verify DNS resolution timing improvement

🤖 Generated with Claude Code

Summary by CodeRabbit

New Features
- Etcd pods can self-register EndpointSlices via an optional DNS-resolver self-register mode (warns and falls back to DNS-only on failure).
- RBAC and ServiceAccount updates to support etcd self-registration and unify controller bindings; defrag controller now uses the standardized etcd ServiceAccount.
- Init behavior updated to expose pod IP for self-registration.
Tests
- Added unit tests for DNS service-name parsing and EndpointSlice reconciliation.

openshift-merge-bot · 2026-05-27T20:54:35Z

Pipeline controller notification
This repo is configured to use the pipeline controller. Second-stage tests will be triggered either automatically or after lgtm label is added, depending on the repository configuration. The pipeline controller will automatically detect which contexts are required and will utilize /test Prow commands to trigger the second stage.

For optional jobs, comment /test ? to see a list of all defined jobs. To trigger manually all jobs from second stage use /pipeline required command.

This repository is configured in: LGTM mode

openshift-ci · 2026-05-27T20:54:37Z

Skipping CI for Draft Pull Request.
If you want CI signal for your change, please convert it to an actual PR.
You can still manually trigger a test run with /test all

openshift-ci-robot · 2026-05-27T20:54:37Z

@cssjr: This pull request references AROSLSRE-830 which is a valid jira issue.

Warning: The referenced jira issue has an invalid target version for the target branch this PR targets: expected the task to target the "5.0.0" version, but no target version was set.

Details

In response to this:

Summary

Adds EndpointSlice self-registration to the etcd ensure-dns init container, bypassing stale kube-controller-manager informer caches that delay DNS resolution under high HCP density

Creates a dedicated etcd service account with minimal RBAC for EndpointSlice and pod get permissions

Replaces the conditional etcd-defrag-controller SA with the always-present etcd SA, binding both self-register and defrag (HA-only) roles to it

Problem

Under high cluster density (~2,500 HCP namespaces), the management cluster's kube-controller-manager EndpointSlice informer cache goes stale, causing etcd-discovery headless Service DNS records to be delayed by minutes to hours. The ensure-dns init container blocks waiting for DNS, preventing etcd from starting, which cascades into cluster creation timeouts.

Evidence from Kusto (HostedControlPlaneLogs, hcp-dev-us-2.eastus2)

Scale: Hundreds to thousands of clusters affected daily across CI:
Date         Clusters   Failures   Prolonged(>5m)   Brief(<=5m)
2026-05-13        675     15,947           37            638
2026-05-14      1,210     24,869          142          1,063
2026-05-19      1,542     29,622          248          1,297
2026-05-20      1,228     22,563          180          1,039
2026-05-21      1,311     23,130          174          1,135
2026-05-26        980     17,884          138            842
Breakdown (last 3 days): Of 2,299 etcd pods that experienced DNS failures, 89.5% resolved within 30 seconds, but 8.1% (187 pods) took >5 minutes — these are the ones that cause E2E test timeouts.

Root Cause: EndpointSlice informer cache staleness

Confirmed via aksEvents table (category kube-controller-manager) on May 14:
"Error syncing endpoint slices for service, retrying"
 key="ocm-arohcpprow-<cluster-id>/etcd-discovery"
 err="EndpointSlice informer cache is out of date"
This appeared for dozens of HCP namespaces — individual namespaces saw 136–227 sync errors. A secondary pattern ("skipping Pod for Service, Node not found") shows the same informer staleness affecting node visibility, with top nodes seeing 330–557 skip events.

What's ruled out:

Hypothesis Evidence Verdict

CoreDNS unhealthy No CoreDNS errors in logs Ruled out

Node-specific issue Failures evenly distributed across all nodes Ruled out

CNI/IP assignment delay CNS assigned IPs within seconds Ruled out

Service creation race Framework guarantees Service before StatefulSet Ruled out

Specific case timeline (etcd-0 in a CI namespace, May 26):

Time Event

17:57:52 CPO reconciles etcd-discovery Service

17:57:53 CPO reconciles etcd StatefulSet

17:58:22 ensure-dns starts failing — "no such host"

17:58:26 CNS assigns pod IP 10.128.64.186

18:18:54 DNS finally resolves — 20 minutes after IP assignment

The pod had its IP within 4 seconds, the Service existed 30 seconds before the pod, CoreDNS was healthy — the EndpointSlice controller simply hadn't updated the headless Service's endpoints.

Impact on E2E stability

All three stage E2E jobs show poor stability:

Job Success Rate (last 20 runs)

periodic-stage-e2e-parallel 40%

periodic-stage-e2e-parallel-ocp-nightly 20%

branch-...-stage-e2e-parallel 60%

The Jira's specific failure was in branch-ci-Azure-ARO-HCP-main-e2e-stage-e2e-parallel build 2053695569343287296 (May 11 04:36 UTC), where the patch-name-cluster hit the 45-minute E2E timeout due to this etcd DNS issue.

Solution: Etcd Pod Self-Registration

Each etcd pod creates its own EndpointSlice during the ensure-dns init container phase:
etcd-discovery Service (headless, selector: app=etcd)
   |
   +-- etcd-discovery-xxxxx  (managed-by: endpointslice-controller.k8s.io)
   |   [Standard controller — may be slow/stale under high density]
   |
   +-- etcd-discovery-self-etcd-0  (managed-by: control-plane-operator)
   |   [Created by etcd-0 init container — immediate]
   |
   +-- etcd-discovery-self-etcd-1  (managed-by: control-plane-operator)
   +-- etcd-discovery-self-etcd-2  (managed-by: control-plane-operator)
Why this works:

The standard EndpointSlice controller only manages slices labeled managed-by: endpointslice-controller.k8s.io — it completely ignores slices with a different managed-by label

CoreDNS merges endpoints from ALL EndpointSlices for a given Service regardless of who created them

Once the self-registered slice exists, CoreDNS picks it up via its watch within ~100ms

When the standard controller catches up, duplicate IPs are harmless for DNS

OwnerReference to the Pod ensures cleanup on pod deletion

Precedent: The HCCO already uses this pattern for the kubernetes EndpointSlice in the guest cluster (ReconcileKASEndpointSlice in hostedclusterconfigoperator/controllers/resources/kas/reconcile.go).

Changed files

File Change

dnsresolver/cmd.go Added --self-register flag and selfRegisterEndpointSlice() — creates EndpointSlice via K8s API before polling DNS, falls back to DNS-only on failure

dnsresolver/cmd_test.go Unit tests for DNS name parsing

v2/assets/etcd/statefulset.yaml Added POD_IP env var and --self-register flag to ensure-dns init container

v2/assets/etcd/etcd-serviceaccount.yaml New etcd service account (always created)

v2/assets/etcd/etcd-self-register-role.yaml Role granting EndpointSlice get/create/update and pod get

v2/assets/etcd/etcd-self-register-rolebinding.yaml Binds role to etcd SA

v2/assets/etcd/defrag-rolebinding.yaml Updated subject from etcd-defrag-controller to etcd SA

v2/etcd/component.go Registered new RBAC assets, removed old defrag SA

v2/etcd/statefulset.go Always set ServiceAccountName = "etcd" (not conditionally for HA)

manifests/etcd.go Added EtcdServiceAccount, EtcdSelfRegisterRole, EtcdSelfRegisterRoleBinding helpers

Risks and mitigations

Risk Mitigation

Duplicate endpoints when controller catches up CoreDNS deduplicates; harmless for DNS resolution

Self-registered EndpointSlice outlives pod OwnerReference to Pod triggers GC on pod deletion

RBAC escalation Scoped to get/create/update EndpointSlices + get pods in own namespace only

Init container fails to create EndpointSlice Falls back to DNS-only behavior with a warning log

Test plan

Unit tests for DNS name parsing (dnsresolver/cmd_test.go)

Existing etcd component tests pass (14/14)

make build compiles cleanly

make lint passes with 0 issues

E2E validation in staging environment

Verify EndpointSlice creation in a live HCP namespace

Verify DNS resolution timing improvement

🤖 Generated with Claude Code

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the openshift-eng/jira-lifecycle-plugin repository.

coderabbitai · 2026-05-27T20:54:49Z

Note

Reviews paused

It looks like this branch is under active development. To avoid overwhelming you with review comments due to an influx of new commits, CodeRabbit has automatically paused this review. You can configure this behavior by changing the reviews.auto_review.auto_pause_after_reviewed_commits setting.

Use the following commands to manage reviews:

@coderabbitai resume to resume automatic reviews.
@coderabbitai review to trigger a single review.

Use the checkboxes below for quick actions:

▶️ Resume reviews
🔍 Trigger review

📝 Walkthrough

Walkthrough

This PR implements etcd self-registration: adds manifest constructors and YAML assets (ServiceAccount, etcd-self-register Role/RoleBinding; updates defrag RoleBinding subject), updates the StatefulSet initContainer to pass --self-register and POD_IP and uses the unified etcd ServiceAccount, updates component wiring, and extends dnsresolver with a --self-register flag plus selfRegisterEndpointSlice/ensureEndpointSlice/parseServiceName and unit tests.

Sequence Diagram(s)

sequenceDiagram
  participant dnsresolver
  participant KubeAPI
  participant EndpointSlice
  participant DNS

  dnsresolver->>KubeAPI: ensureEndpointSlice(create/update EndpointSlice for pod IP and targetRef)
  KubeAPI->>EndpointSlice: persist EndpointSlice
  dnsresolver->>DNS: resolveDNS(dnsName)
  DNS-->>dnsresolver: resolution results

Suggested reviewers

patjlm

🚥 Pre-merge checks | ✅ 10 | ❌ 1

❌ Failed checks (1 warning)

Check name	Status	Explanation	Resolution
Test Structure And Quality	⚠️ Warning	Tests lack meaningful assertion failure messages (criterion `#4`). 26+ assertions use bare Expect() calls without messages; custom check requires context messages per documented examples.	Add failure context messages to all assertions: e.g., Expect(err).NotTo(HaveOccurred(), "failed to ensure EndpointSlice") and Expect(slice.Labels[...]).To(Equal(...), "service name label mismatch").

✅ Passed checks (10 passed)

Check name	Status	Explanation
Description Check	✅ Passed	Check skipped - CodeRabbit’s high-level summary is enabled.
Title check	✅ Passed	The title directly describes the main change: adding etcd EndpointSlice self-registration to resolve DNS delays, which aligns with the core purpose of the PR.
Linked Issues check	✅ Passed	Check skipped because no linked issues were found for this pull request.
Out of Scope Changes check	✅ Passed	Check skipped because no linked issues were found for this pull request.
Stable And Deterministic Test Names	✅ Passed	All test names added in PR are stable, deterministic. No dynamic values in titles—pod names, UUIDs, timestamps, node names, IPs appear only in test bodies, not names.
Topology-Aware Scheduling Compatibility	✅ Passed	No topology-unaware scheduling constraints found in statefulset, RBAC, PDB, or code. Defrag controller is topology-aware and only runs on HighlyAvailable clusters.
Ipv6 And Disconnected Network Test Compatibility	✅ Passed	PR adds standard Go unit tests (testing package), not Ginkgo e2e tests. Custom check applies only to Ginkgo e2e tests, which are absent.
No-Weak-Crypto	✅ Passed	No weak cryptography found. PR adds etcd EndpointSlice self-registration using only standard Kubernetes APIs with no cryptographic operations, weak algorithms, or insecure comparisons.
Container-Privileges	✅ Passed	No privileged containers, host access, elevated capabilities, or inappropriate privilege escalation settings found. All RBAC is properly scoped with minimal necessary permissions.
No-Sensitive-Data-In-Logs	✅ Passed	No passwords, tokens, API keys, PII, session IDs, or customer data found in logging statements. Infrastructure metadata (pod IPs, DNS names) are logged but not classified as sensitive.

_{✏️ Tip: You can configure your own custom pre-merge checks in the settings.}

✨ Finishing Touches

🧪 Generate unit tests (beta)

Create PR with unit tests

Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out.

❤️ Share

_{Comment @coderabbitai help to get the list of available commands and usage tips.}

coderabbitai

Actionable comments posted: 1

🤖 Prompt for all review comments with AI agents

Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Inline comments:
In `@dnsresolver/cmd.go`:
- Around line 150-164: The Get call in selfRegisterEndpointSlice currently
treats any error as "not found" and proceeds to Create; change the error
handling to return immediately for non-NotFound errors from
client.DiscoveryV1().EndpointSlices(...).Get(ctx, sliceName, ...). Specifically,
import and use apierrors.IsNotFound(err) (k8s.io/apimachinery/pkg/api/errors)
and: if err == nil keep the Update path, else if apierrors.IsNotFound(err)
perform the Create path, otherwise return the original Get error (wrap with
fmt.Errorf). This prevents masking RBAC/timeout errors while preserving the
existing Update/Create logic and messages.

🪄 Autofix (Beta)

Fix all unresolved CodeRabbit comments on this PR:

Push a commit to this branch (recommended)
Create a new PR with the fixes

ℹ️ Review info

⚙️ Run configuration

Configuration used: Repository YAML (base), Central YAML (inherited)

Review profile: CHILL

Plan: Enterprise

Run ID: 64afd625-994c-4a36-ad76-4ecbbb5a19ca

📥 Commits

Reviewing files that changed from the base of the PR and between 0aa20fc and 87cf121.

📒 Files selected for processing (10)

control-plane-operator/controllers/hostedcontrolplane/manifests/etcd.go
control-plane-operator/controllers/hostedcontrolplane/v2/assets/etcd/defrag-rolebinding.yaml
control-plane-operator/controllers/hostedcontrolplane/v2/assets/etcd/etcd-self-register-role.yaml
control-plane-operator/controllers/hostedcontrolplane/v2/assets/etcd/etcd-self-register-rolebinding.yaml
control-plane-operator/controllers/hostedcontrolplane/v2/assets/etcd/etcd-serviceaccount.yaml
control-plane-operator/controllers/hostedcontrolplane/v2/assets/etcd/statefulset.yaml
control-plane-operator/controllers/hostedcontrolplane/v2/etcd/component.go
control-plane-operator/controllers/hostedcontrolplane/v2/etcd/statefulset.go
dnsresolver/cmd.go
dnsresolver/cmd_test.go

codecov · 2026-05-27T21:03:46Z

Codecov Report

❌ Patch coverage is 49.67320% with 77 lines in your changes missing coverage. Please review.
✅ Project coverage is 41.76%. Comparing base (392fd5a) to head (d1662cf).
⚠️ Report is 6 commits behind head on main.

Files with missing lines	Patch %	Lines
dnsresolver/cmd.go	65.51%	38 Missing and 2 partials ⚠️
...r/controllers/hostedcontrolplane/manifests/etcd.go	0.00%	21 Missing ⚠️
...ontrollers/hostedcontrolplane/v2/etcd/component.go	0.00%	13 Missing ⚠️
...trollers/hostedcontrolplane/v2/etcd/statefulset.go	0.00%	3 Missing ⚠️

Additional details and impacted files

@@            Coverage Diff             @@
##             main    #8613      +/-   ##
==========================================
+ Coverage   41.75%   41.76%   +0.01%     
==========================================
  Files         758      758              
  Lines       93981    94132     +151     
==========================================
+ Hits        39240    39316      +76     
- Misses      51988    52061      +73     
- Partials     2753     2755       +2

Files with missing lines	Coverage Δ
...trollers/hostedcontrolplane/v2/etcd/statefulset.go	`40.16% <0.00%> (-0.67%)`	⬇️
...ontrollers/hostedcontrolplane/v2/etcd/component.go	`10.20% <0.00%> (-3.69%)`	⬇️
...r/controllers/hostedcontrolplane/manifests/etcd.go	`45.83% <0.00%> (-7.83%)`	⬇️
dnsresolver/cmd.go	`53.52% <65.51%> (+53.52%)`	⬆️

Flag	Coverage Δ
cmd-support	`35.02% <ø> (ø)`
cpo-hostedcontrolplane	`44.01% <0.00%> (-0.10%)`	⬇️
cpo-other	`43.45% <ø> (ø)`
hypershift-operator	`51.82% <ø> (ø)`
other	`31.96% <65.51%> (+0.39%)`	⬆️

Flags with carried forward coverage won't be shown. Click here to find out more.

🚀 New features to boost your workflow:

❄️ Test Analytics: Detect flaky tests, report on failures, and find test suite problems.

cssjr · 2026-05-27T21:22:57Z

/cc patjlm

coderabbitai

Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (1)

dnsresolver/cmd.go (1)
177-182: ⚠️ Potential issue | 🟡 Minor | ⚡ Quick win

Reject malformed non-service DNS names before self-registering.

parseServiceName only checks the label count and then trusts parts[1]. With --self-register, an input like foo.bar.baz would still create or update an EndpointSlice for service bar in the current namespace instead of failing fast. Please validate the documented <pod>.<service>.<namespace>.svc[.cluster.local] shape here before returning a service name.
Proposed fix
 func parseServiceName(dnsName string) (string, error) {
 	parts := strings.Split(dnsName, ".")
-	if len(parts) < 3 {
-		return "", fmt.Errorf("expected at least 3 dot-separated components, got %d", len(parts))
+	if len(parts) < 4 || parts[0] == "" || parts[1] == "" || parts[2] == "" || parts[3] != "svc" {
+		return "", fmt.Errorf("expected <pod>.<service>.<namespace>.svc[.cluster.local], got %q", dnsName)
 	}
 	return parts[1], nil
 }
As per coding guidelines, "Validate at trust boundaries with allow-lists, not deny-lists".
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@dnsresolver/cmd.go` around lines 177 - 182, parseServiceName currently only
checks label count and trusts parts[1]; instead validate the full DNS shape
pod.service.namespace.svc[.cluster.local] before returning the service. In
parseServiceName, split on '.', ensure no empty labels, require parts[3] ==
"svc", allow either len(parts)==4 or len(parts)==6 with parts[4]=="cluster" &&
parts[5]=="local", and only then return parts[1]; otherwise return a descriptive
error rejecting the malformed name.

🤖 Prompt for all review comments with AI agents

Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Outside diff comments:
In `@dnsresolver/cmd.go`:
- Around line 177-182: parseServiceName currently only checks label count and
trusts parts[1]; instead validate the full DNS shape
pod.service.namespace.svc[.cluster.local] before returning the service. In
parseServiceName, split on '.', ensure no empty labels, require parts[3] ==
"svc", allow either len(parts)==4 or len(parts)==6 with parts[4]=="cluster" &&
parts[5]=="local", and only then return parts[1]; otherwise return a descriptive
error rejecting the malformed name.

ℹ️ Review info

⚙️ Run configuration

Configuration used: Repository YAML (base), Central YAML (inherited)

Review profile: CHILL

Plan: Enterprise

Run ID: 11fd82fa-916a-486b-a98c-7ddd6ed5ad67

📥 Commits

Reviewing files that changed from the base of the PR and between b877681 and 30ffd78.

📒 Files selected for processing (2)

dnsresolver/cmd.go
dnsresolver/cmd_test.go

cssjr · 2026-05-27T21:30:18Z

/cc @garrickdabbs

cssjr · 2026-05-27T21:55:48Z

/test all

cssjr · 2026-05-27T21:57:07Z

/pipeline required

openshift-merge-bot · 2026-05-27T21:57:10Z

Scheduling tests matching the pipeline_run_if_changed or not excluded by pipeline_skip_if_only_changed parameters:
/test e2e-aks
/test e2e-aws
/test e2e-aws-upgrade-hypershift-operator
/test e2e-azure-self-managed
/test e2e-kubevirt-aws-ovn-reduced
/test e2e-v2-aws
/test e2e-v2-gke

cwbotbot · 2026-05-28T00:05:03Z

Test Results

e2e-aws

Status: ✅ PASS
Started: 2026-06-03T22:13:25Z
View Job
View Job History

e2e-aks

Status: ✅ PASS
Started: 2026-06-03T19:10:46Z
View Job
View Job History

cssjr · 2026-05-28T21:03:19Z

/retest

cssjr · 2026-05-29T15:11:15Z

/test images

cssjr · 2026-05-29T15:12:15Z

/cc @bryan-cox @Patryk-Stefanski

cssjr · 2026-06-03T14:13:35Z

✅ Addressed feedback from @enxebre in commit `c8b0fa1`

All 4 comments have been addressed:

1. Added comment in statefulset.yaml (line 192) ✅

Added a 3-line comment explaining why --self-register is needed:

# Self-register this pod's IP into an EndpointSlice before waiting for DNS resolution.
# This bypasses stale kube-controller-manager EndpointSlice cache issues under high HCP density
# (2,500+ namespaces), where the standard controller can delay DNS updates by minutes to hours.

2. Added comment in cmd.go (line 46) ✅

Added a 3-line comment documenting the --self-register flag:

// --self-register creates an EndpointSlice for this pod before waiting for DNS resolution.
// This bypasses stale kube-controller-manager EndpointSlice cache issues under high HCP density,
// where the standard controller can delay DNS updates by minutes to hours.

3. Fixed IP parsing to handle nil and avoid silent failure (line 112) ✅

Replaced manual net.ParseIP().To4() check with netutil.IsIPv4Address() which:

Returns an error if ParseIP returns nil (no more silent failures)
Properly handles the To4() check
Follows existing codebase patterns

Before:

addressType := discoveryv1.AddressTypeIPv4
if net.ParseIP(podIP) != nil && net.ParseIP(podIP).To4() == nil {
    addressType = discoveryv1.AddressTypeIPv6
}

After:

isIPv4, err := netutil.IsIPv4Address(podIP)
if err != nil {
    return fmt.Errorf("failed to parse pod IP %s: %w", podIP, err)
}
addressType := discoveryv1.AddressTypeIPv6
if isIPv4 {
    addressType = discoveryv1.AddressTypeIPv4
}

4. Used `support/netutil.IsIPv4Address()` ✅

Implemented as part of #3 above.

Verification

✅ All tests pass
✅ make build succeeds
✅ make lint-fix applied
✅ make pre-commit passes

cssjr · 2026-06-03T15:34:29Z

/test e2e-aks
/test e2e-aws
/test e2e-aws-upgrade-hypershift-operator
/test e2e-azure-self-managed
/test e2e-kubevirt-aws-ovn-reduced
/test e2e-v2-aws
/test e2e-v2-gke

hypershift-jira-solve-ci · 2026-06-03T17:27:47Z

AI Test Failure Analysis

Job: pull-ci-openshift-hypershift-main-e2e-aks | Build: 2062196402125017088 | Cost: $4.831960000000001 | Failed step: hypershift-azure-run-e2e

View full analysis report

_{Generated by hypershift-analyze-e2e-failure post-step using Claude claude-opus-4-6}

cssjr · 2026-06-03T19:10:29Z

/retest-required

csrwng

LGTM — clean implementation of the EndpointSlice self-registration pattern.

RBAC is well-scoped (get/create/update EndpointSlices + get Pods), the custom managed-by label avoids conflict with the standard controller, and the OwnerReference to the Pod ensures proper GC. The backwards-compatible SA split (keeping etcd-defrag-controller for HA, adding etcd for all topologies) addresses the upgrade concern. Tests cover create, update, IPv6, error cases, and multi-pod scenarios.

openshift-merge-bot · 2026-06-03T21:54:07Z

Tests from second stage were triggered manually. Pipeline can be controlled only manually, until HEAD changes. Use command to trigger second stage.

openshift-ci · 2026-06-03T21:54:32Z

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: bryan-cox, csrwng, cssjr

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Details

Needs approval from an approver in each of these files:

~~OWNERS~~ [bryan-cox,csrwng]

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

cssjr · 2026-06-03T22:13:11Z

/retest-required

cssjr · 2026-06-04T14:14:15Z

/test e2e-azure-v2-self-managed

cssjr · 2026-06-12T17:41:50Z

Running validation on Azure/ARO-HCP#5635 with a custom image containing these changes (after rebasing).

…ntainer Under high HCP density (~2,500 namespaces), the management cluster's kube-controller-manager EndpointSlice informer cache goes stale, delaying etcd-discovery headless Service DNS records by minutes to hours. This causes etcd pods to block in the ensure-dns init container, preventing cluster creation and triggering E2E test timeouts. Each etcd pod now self-registers its IP into a custom EndpointSlice (managed-by: control-plane-operator) before polling DNS. The standard EndpointSlice controller ignores slices with a different managed-by label, so there is no conflict. CoreDNS picks up the custom slice via its watch within ~100ms, making DNS resolution near-instant regardless of controller informer health. A new dedicated "etcd" service account replaces the conditional "etcd-defrag-controller" SA, with RBAC granting EndpointSlice and pod get permissions. The defrag Role/RoleBinding remain HA-only. Co-Authored-By: Claude Opus 4.6 (1M context) <[email protected]>

Use apierrors.IsNotFound to only fall through to Create when the EndpointSlice genuinely doesn't exist. RBAC, timeout, and other errors are now returned immediately instead of being masked by a subsequent Create attempt. Co-Authored-By: Claude Opus 4.6 (1M context) <[email protected]>

Refactor selfRegisterEndpointSlice to extract testable ensureEndpointSlice function that accepts a kubernetes.Interface, enabling tests with the fake client. Tests cover: create, update, IPv6, missing pod, invalid DNS name, multiple pods, and field correctness. Coverage: ensureEndpointSlice 88.9%, parseServiceName 100%. Co-Authored-By: Claude Opus 4.6 (1M context) <[email protected]>

Co-Authored-By: Claude Opus 4.6 (1M context) <[email protected]>

This addresses review feedback on PR openshift#8613 to maintain backwards compatibility during ServiceAccount changes. Blocking issues fixed: - Keep etcd-defrag-controller SA (HA-only) separate from etcd SA (always present) to avoid orphaned SAs on upgrade - Use fully qualified managed-by label "etcd-self-register.hypershift.openshift.io" to avoid future cross-contamination with LabelManagedBy filters Suggestions implemented: - Add Controller and BlockOwnerDeletion to OwnerReference for explicit GC intent - Replace context.Background() with t.Context() in tests to prevent goroutine leaks - Use gomega MatchError for cleaner error assertions - Simplify IP generation in tests with fmt.Sprintf The SA split works as follows: - HA mode: use etcd-defrag-controller SA with both defrag and self-register roles bound to it - Non-HA mode: use etcd SA with only self-register role This keeps the PR scoped to self-registration without upgrade risk. Signed-off-by: Cliff Schomburg <[email protected]> Commit-Message-Assisted-by: Claude (via Claude Code)

@enxebre

Address feedback from @enxebre to document why --self-register is needed and improve IP address parsing. Changes: - Added comment in statefulset.yaml explaining --self-register flag purpose (bypasses stale EndpointSlice cache under high HCP density) - Added comment in dnsresolver/cmd.go documenting the flag - Replaced manual net.ParseIP().To4() check with netutil.IsIPv4Address() which properly handles ParseIP returning nil and returns an error instead of silently failing Signed-off-by: Cliff Schomburg <[email protected]> Commit-Message-Assisted-by: Claude (via Claude Code)

Add test to verify that ensureEndpointSlice properly returns an error when given a malformed IP address (not IPv4 or IPv6). This ensures the netutil.IsIPv4Address() error handling works correctly and doesn't silently fail on invalid input. Signed-off-by: Cliff Schomburg <[email protected]> Commit-Message-Assisted-by: Claude (via Claude Code)

cssjr · 2026-06-17T00:00:08Z

Rebased - no code changes

bryan-cox · 2026-06-17T11:09:40Z

/lgtm

putting this back since it was only rebase

openshift-merge-bot · 2026-06-17T11:09:56Z

Scheduling tests matching the pipeline_run_if_changed or not excluded by pipeline_skip_if_only_changed parameters:
/test e2e-aks
/test e2e-aws
/test e2e-aws-upgrade-hypershift-operator
/test e2e-azure-v2-self-managed
/test e2e-kubevirt-aws-ovn-reduced
/test e2e-v2-aws
/test e2e-v2-gke

bryan-cox · 2026-06-17T14:22:37Z

/retest

openshift-ci · 2026-06-17T16:16:13Z

@cssjr: The following test failed, say /retest to rerun all failed tests or /retest-required to rerun all mandatory failed tests:

Test name	Commit	Details	Required	Rerun command
ci/prow/e2e-kubevirt-aws-ovn-reduced	`d1662cf`	link	true	`/test e2e-kubevirt-aws-ovn-reduced`

Full PR test history. Your PR dashboard.

Details

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository. I understand the commands that are listed here.

hypershift-jira-solve-ci · 2026-06-17T16:27:11Z

Confirmed — the PR is to the openshift/hypershift repo. The management cluster IPI install uses a standard OpenShift release image and the openshift-install binary. The HyperShift PR code (hypershift-operator, hypershift-tests) is not used until after the management cluster is installed. This failure is in the IPI install (pre-phase), making it unrelated to the PR changes.

Now I have all the evidence needed. Let me compile the report.

Test Failure Analysis Complete

Job Information

Prow Job: pull-ci-openshift-hypershift-main-e2e-kubevirt-aws-ovn-reduced
Build ID: 2067251706453823488
Target: e2e-kubevirt-aws-ovn-reduced
PR: AROSLSRE-830: Add etcd EndpointSlice self-registration to fix DNS resolution delays #8613 (AROSLSRE-830: Add etcd EndpointSlice self-registration to fix DNS resolution delays)

Test Failure Analysis

Error

level=error msg=Bootstrap failed to complete: timed out waiting for the condition
level=error msg=Failed to wait for bootstrapping to complete. This error usually happens when there is a problem with control plane hosts that prevents the control plane operators from creating the control plane.

Summary

The management cluster IPI installation (pre-phase step ipi-install-install) failed with a bootstrap timeout after 45 minutes. The openshift-apiserver operator never reached Available status (APIServices_PreconditionNotReady), which blocked the routes.route.openshift.io API from registering. This caused a cascade: the monitoring operator couldn't create Route objects, the ingress operator's router deployment never achieved minimum replicas (0/1 available), and the authentication operator couldn't reach the oauth-server endpoints. This is a management cluster infrastructure failure that occurred before any HyperShift code from PR #8613 was involved — the PR changes to etcd EndpointSlice self-registration are not exercised until the HyperShift operator is deployed on top of an already-installed management cluster.

Root Cause

This is an infrastructure flake in the management cluster IPI bootstrap, unrelated to the PR under test.

Failure chain (traced to root):

openshift-apiserver never became Available — reported APIServices_PreconditionNotReady for the entire 45-minute bootstrap window (14:54–15:39 UTC). The openshift-apiserver registers custom API services (including routes.route.openshift.io). When its APIServices preconditions aren't met, these API resources remain unregistered.
monitoring operator failed because it tried to create Route objects (post routes.route.openshift.io) but the API resource was never registered (depends on openshift-apiserver). Similarly, Prometheus CRDs (prometheuses.monitoring.coreos.com) were not available.
ingress operator reported the default IngressController's router deployment had 0/1 replicas available (MinimumReplicasUnavailable). The router pod could not be scheduled. The cluster used c5n.metal worker nodes with replicas: 1, but worker MachineSet creation happens post-bootstrap, so no worker node was present during bootstrap. The router must schedule on a master node during bootstrap, but the openshift-apiserver being down likely prevented proper admission/scheduling chain completion.
authentication operator failed because it depends on both the openshift-apiserver (for APIServices) and ingress (for the oauth-server endpoint routing). With both down, oauth endpoints were unreachable.
DNS resolution issues were observed: read udp 10.128.0.8:→172.30.0.10:53: read: connection refused — the cluster's CoreDNS service was intermittently refusing connections, which also prevented the insights operator from reaching api.openshift.com.

Why this is unrelated to PR #8613:

The failure is in the IPI management cluster install step (ipi-install-install), which uses the standard openshift-install binary and OCP release image
PR AROSLSRE-830: Add etcd EndpointSlice self-registration to fix DNS resolution delays #8613 modifies HyperShift's etcd EndpointSlice logic, which is not invoked until the hypershift-operator is deployed on the management cluster — a step that never occurred because installation failed first
The openshift-apiserver PreconditionNotReady failure is a known IPI bootstrap flake pattern

Recommendations

Retrigger the job — This is an infrastructure flake in the management cluster IPI bootstrap, unrelated to the PR changes. A retest (/retest) should resolve the issue.
No PR code changes needed — The HyperShift etcd EndpointSlice changes from PR AROSLSRE-830: Add etcd EndpointSlice self-registration to fix DNS resolution delays #8613 were never exercised in this run. The failure occurred entirely within the standard OpenShift IPI installer before any HyperShift components were deployed.
If the failure recurs, investigate whether the openshift-apiserver PreconditionNotReady pattern is occurring at elevated rates in the CI fleet, which would indicate a broader platform regression rather than an isolated flake.

Evidence

Evidence	Detail
Failed Step	`ipi-install-install` (pre-phase) — management cluster IPI installation
Exit Code	Installer exit code 5 (bootstrap timeout)
Bootstrap Wait	45 minutes (14:54:51–15:39:51 UTC), timed out
openshift-apiserver	`Available=False` with `APIServices_PreconditionNotReady` for entire bootstrap window
ingress	`Available=False` — router deployment 0/1 replicas (`MinimumReplicasUnavailable`)
monitoring	`Available=False` — `routes.route.openshift.io` API resource not found (depends on openshift-apiserver)
authentication	`Available=False` — oauth-server endpoints not ready, connection refused to `172.30.206.61:443`
DNS errors	`read udp 10.128.0.8:→172.30.0.10:53: read: connection refused` (CoreDNS intermittently unavailable)
Control Plane Nodes	All 3 masters provisioned and joined (master-2 at 14:57:48, master-0 at 14:59:08, master-1 at 14:59:48)
Worker Nodes	1× `c5n.metal` configured but workers are post-bootstrap; none present during bootstrap
PR Relevance	PR #8613 modifies `openshift/hypershift` etcd EndpointSlice code — never invoked during IPI install
k8sgpt step	Also failed with `the server doesn't have a resource type "hostedcluster"` — expected cascading failure since cluster install never completed

openshift-ci Bot added the do-not-merge/work-in-progress Indicates that a PR should not merge because it is a work in progress. label May 27, 2026

openshift-ci-robot added the jira/valid-reference Indicates that this PR references a valid Jira ticket of any type. label May 27, 2026

openshift-ci Bot added the do-not-merge/needs-area label May 27, 2026

openshift-ci Bot added area/control-plane-operator Indicates the PR includes changes for the control plane operator - in an OCP release and removed do-not-merge/needs-area labels May 27, 2026

coderabbitai Bot reviewed May 27, 2026

View reviewed changes

Comment thread dnsresolver/cmd.go

cssjr force-pushed the etcd-endpointslice-self-register branch from 87cf121 to 39ffc19 Compare May 27, 2026 21:13

openshift-ci Bot requested a review from patjlm May 27, 2026 21:22

coderabbitai Bot reviewed May 27, 2026

View reviewed changes

openshift-ci Bot requested a review from garrickdabbs May 27, 2026 21:30

hypershift-jira-solve-ci Bot mentioned this pull request May 28, 2026

OCPBUGS-86661: Konnectivity retry proxy connection on timeout #8579

Open

cssjr marked this pull request as ready for review May 28, 2026 14:29

openshift-ci Bot removed the do-not-merge/work-in-progress Indicates that a PR should not merge because it is a work in progress. label May 28, 2026

openshift-ci Bot requested review from devguyio and sjenning May 28, 2026 14:30

openshift-ci Bot requested a review from bryan-cox May 29, 2026 15:12

cssjr requested a review from enxebre June 3, 2026 15:33

csrwng approved these changes Jun 3, 2026

View reviewed changes

openshift-ci Bot assigned csrwng Jun 3, 2026

openshift-ci Bot added the lgtm Indicates that a PR is ready to be merged. label Jun 3, 2026

cssjr and others added 7 commits June 16, 2026 16:46

fix(etcd): replace deprecated fake.NewSimpleClientset with NewClientset

6471b95

Co-Authored-By: Claude Opus 4.6 (1M context) <[email protected]>

cssjr force-pushed the etcd-endpointslice-self-register branch from 824480d to d1662cf Compare June 16, 2026 23:47

openshift-ci Bot removed the lgtm Indicates that a PR is ready to be merged. label Jun 16, 2026

openshift-ci Bot assigned bryan-cox Jun 17, 2026

openshift-ci Bot added the lgtm Indicates that a PR is ready to be merged. label Jun 17, 2026

Conversation

cssjr commented May 27, 2026 • edited by coderabbitai Bot Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Summary

Problem

Evidence from Kusto (HostedControlPlaneLogs, hcp-dev-us-2.eastus2)

Root Cause: EndpointSlice informer cache staleness

Impact on E2E stability

Solution: Etcd Pod Self-Registration

Changed files

Risks and mitigations

Test plan

Summary by CodeRabbit

Uh oh!

openshift-merge-bot Bot commented May 27, 2026

Uh oh!

openshift-ci Bot commented May 27, 2026

Uh oh!

openshift-ci-robot commented May 27, 2026 • edited by openshift-ci Bot Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Summary

Problem

Evidence from Kusto (HostedControlPlaneLogs, hcp-dev-us-2.eastus2)

Root Cause: EndpointSlice informer cache staleness

Impact on E2E stability

Solution: Etcd Pod Self-Registration

Changed files

Risks and mitigations

Test plan

Uh oh!

coderabbitai Bot commented May 27, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Reviews paused

Walkthrough

Sequence Diagram(s)

Suggested reviewers

❌ Failed checks (1 warning)

Uh oh!

coderabbitai Bot left a comment

Choose a reason for hiding this comment

Uh oh!

Uh oh!

codecov Bot commented May 27, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Codecov Report

Uh oh!

cssjr commented May 27, 2026

Uh oh!

coderabbitai Bot left a comment

Choose a reason for hiding this comment

Uh oh!

cssjr commented May 27, 2026

Uh oh!

cssjr commented May 27, 2026

Uh oh!

cssjr commented May 27, 2026

Uh oh!

openshift-merge-bot Bot commented May 27, 2026

Uh oh!

cwbotbot commented May 28, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Test Results

e2e-aws

e2e-aks

Uh oh!

cssjr commented May 28, 2026

Uh oh!

cssjr commented May 29, 2026

Uh oh!

cssjr commented May 29, 2026

Uh oh!

cssjr commented Jun 3, 2026

✅ Addressed feedback from @enxebre in commit c8b0fa1

1. Added comment in statefulset.yaml (line 192) ✅

2. Added comment in cmd.go (line 46) ✅

3. Fixed IP parsing to handle nil and avoid silent failure (line 112) ✅

4. Used support/netutil.IsIPv4Address() ✅

Verification

Uh oh!

cssjr commented Jun 3, 2026

cssjr commented May 27, 2026 •

edited by coderabbitai Bot

Loading

openshift-ci-robot commented May 27, 2026 •

edited by openshift-ci Bot

Loading

coderabbitai Bot commented May 27, 2026 •

edited

Loading

codecov Bot commented May 27, 2026 •

edited

Loading

cwbotbot commented May 28, 2026 •

edited

Loading

✅ Addressed feedback from @enxebre in commit `c8b0fa1`

4. Used `support/netutil.IsIPv4Address()` ✅

cssjr commented Jun 12, 2026 •

edited

Loading

hypershift-jira-solve-ci Bot commented Jun 17, 2026 •

edited by openshift-ci Bot

Loading