NO-JIRA: deploymentcontroller: add topology-aware scheduling hooks#2276
NO-JIRA: deploymentcontroller: add topology-aware scheduling hooks#2276sanchezl wants to merge 1 commit into
Conversation
|
@sanchezl: This pull request explicitly references no jira issue. DetailsIn response to this:
Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the openshift-eng/jira-lifecycle-plugin repository. |
|
No actionable comments were generated in the recent review. 🎉 ℹ️ Recent review info⚙️ Run configurationConfiguration used: Repository: openshift/coderabbit/.coderabbit.yaml Review profile: CHILL Plan: Enterprise Run ID: 📒 Files selected for processing (2)
WalkthroughAdds topology-aware deployment hooks: one computes replicas from Infrastructure topology or control-plane nodes, one applies or removes the control-plane nodeSelector based on topology, and one sets rolling-update limits and required hostname pod anti-affinity. Includes extensive unit tests and test helpers. ChangesTopology-aware deployment hooks
Sequence DiagramsequenceDiagram
participant Controller as DeploymentController
participant Infra as InfrastructureLister
participant Nodes as NodeLister
participant Deployment as Deployment
Controller->>Infra: Get cluster Infrastructure
Infra-->>Controller: ControlPlaneTopology
Controller->>Nodes: List nodes (control-plane selector)
Nodes-->>Controller: node list/count
Controller->>Deployment: set Spec.Replicas / update Strategy.RollingUpdate / adjust Affinity / modify NodeSelector
Estimated code review effort🎯 3 (Moderate) | ⏱️ ~20 minutes 🚥 Pre-merge checks | ✅ 14 | ❌ 1❌ Failed checks (1 warning)
✅ Passed checks (14 passed)
✏️ Tip: You can configure your own custom pre-merge checks in the settings. ✨ Finishing Touches🧪 Generate unit tests (beta)
Comment |
|
[APPROVALNOTIFIER] This PR is NOT APPROVED This pull-request has been approved by: sanchezl The full list of commands accepted by this bot can be found here. DetailsNeeds approval from an approver in each of these files:Approvers can indicate their approval by writing |
There was a problem hiding this comment.
Actionable comments posted: 5
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
Inline comments:
In `@pkg/operator/deploymentcontroller/topology_test.go`:
- Around line 20-22: Tests currently ignore errors returned by indexer.Add when
adding entries from nodes; update the test code in topology_test.go to check
those errors and fail the test if any occur (e.g., replace `_ = indexer.Add(n)`
with error capture and call t.Fatalf or use require.NoError/assert.NoError),
doing the same for the other occurrences around lines where nodes are added so
broken fixtures surface instead of being masked.
In `@pkg/operator/deploymentcontroller/topology.go`:
- Around line 93-96: Replace the hard required hostname anti-affinity with a
preferred (soft) rule: remove the RequiredDuringSchedulingIgnoredDuringExecution
entry that contains the corev1.PodAffinityTerm with TopologyKey
"kubernetes.io/hostname" and instead add a
PreferredDuringSchedulingIgnoredDuringExecution slice containing a
metav1.WeightedPodAffinityTerm (with an appropriate weight, e.g. 50) whose
PodAffinityTerm uses the same LabelSelector and TopologyKey; update the
surrounding struct where pod affinity is built so it uses
PreferredDuringSchedulingIgnoredDuringExecution and WeightedPodAffinityTerm
instead of RequiredDuringSchedulingIgnoredDuringExecution and
corev1.PodAffinityTerm.
- Around line 117-133: The hook WithControlPlaneNodeSelectorHook currently
always injects a control-plane nodeSelector
(deployment.Spec.Template.Spec.NodeSelector["node-role.kubernetes.io/control-plane"]=""),
which hard-pins workloads; change it to be opt-in: do not modify NodeSelector by
default—only add the control-plane nodeSelector when an explicit opt-in signal
is present (e.g., a deployment annotation like
"operator.openshift.io/enforce-control-plane-node-selector" == "true" or a
boolean flag in the passed OperatorSpec). Update
WithControlPlaneNodeSelectorHook to check that opt-in signal (and still skip for
ExternalTopologyMode) before creating or mutating
deployment.Spec.Template.Spec.NodeSelector, and document/flag this behavior so
callers must explicitly enable it.
- Around line 126-133: When infra.Status.ControlPlaneTopology equals
configv1.ExternalTopologyMode we currently return early but do not remove any
existing "node-role.kubernetes.io/control-plane" entry, which can leave pods
stuck Pending; update the conditional path in the same function to check
deployment.Spec.Template.Spec.NodeSelector and if non-nil (or initialize it)
delete the "node-role.kubernetes.io/control-plane" key (i.e., remove
deployment.Spec.Template.Spec.NodeSelector["node-role.kubernetes.io/control-plane"])
before returning so any pre-existing control-plane selector is cleared for
External topology.
- Around line 90-111: The code currently overwrites
deployment.Spec.Template.Spec.Affinity (and sets it to nil) which destroys any
pre-existing NodeAffinity/PodAffinity; instead, update only the
PodAntiAffinity.RequiredDuringSchedulingIgnoredDuringExecution field while
preserving other affinity subfields: ensure
deployment.Spec.Template.Spec.Affinity is initialized if nil, keep existing
Affinity.NodeAffinity and Affinity.PodAffinity intact, and set or clear
Affinity.PodAntiAffinity.RequiredDuringSchedulingIgnoredDuringExecution based on
the replicas check; reference the Affinity object on
deployment.Spec.Template.Spec.Affinity and its
PodAntiAffinity.RequiredDuringSchedulingIgnoredDuringExecution, PodAffinity, and
NodeAffinity when implementing the change.
🪄 Autofix (Beta)
Fix all unresolved CodeRabbit comments on this PR:
- Push a commit to this branch (recommended)
- Create a new PR with the fixes
ℹ️ Review info
⚙️ Run configuration
Configuration used: Repository: openshift/coderabbit/.coderabbit.yaml
Review profile: CHILL
Plan: Enterprise
Run ID: 67005b98-452d-4979-9a0c-7438c7888843
📒 Files selected for processing (2)
pkg/operator/deploymentcontroller/topology.gopkg/operator/deploymentcontroller/topology_test.go
| RequiredDuringSchedulingIgnoredDuringExecution: []corev1.PodAffinityTerm{ | ||
| { | ||
| TopologyKey: "kubernetes.io/hostname", | ||
| LabelSelector: &metav1.LabelSelector{ |
There was a problem hiding this comment.
Avoid required hostname pod anti-affinity in controller defaults.
Line 93-96 configures required anti-affinity with kubernetes.io/hostname, which can create strict scheduling failures in constrained/non-HA topologies. Prefer a soft rule (PreferredDuringSchedulingIgnoredDuringExecution) if this behavior is still needed.
As per coding guidelines: deployment/controller code must “flag required pod anti-affinity with hostname topology keys”.
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
In `@pkg/operator/deploymentcontroller/topology.go` around lines 93 - 96, Replace
the hard required hostname anti-affinity with a preferred (soft) rule: remove
the RequiredDuringSchedulingIgnoredDuringExecution entry that contains the
corev1.PodAffinityTerm with TopologyKey "kubernetes.io/hostname" and instead add
a PreferredDuringSchedulingIgnoredDuringExecution slice containing a
metav1.WeightedPodAffinityTerm (with an appropriate weight, e.g. 50) whose
PodAffinityTerm uses the same LabelSelector and TopologyKey; update the
surrounding struct where pod affinity is built so it uses
PreferredDuringSchedulingIgnoredDuringExecution and WeightedPodAffinityTerm
instead of RequiredDuringSchedulingIgnoredDuringExecution and
corev1.PodAffinityTerm.
There was a problem hiding this comment.
Required anti-affinity is intentional here — it is the core purpose of this PR. The topology-aware replicas hook (which runs first in the chain) already adapts to constrained topologies: SNO gets 1 replica (so anti-affinity is removed entirely), DualReplica/TNA gets 2 replicas on 2 nodes, etc. The deadlock that required anti-affinity normally causes on constrained topologies is avoided because the companion scheduling hook sets maxUnavailable = max(replicas-1, 1), allowing rolling updates to proceed by deleting an old pod to free a node. This matches the pattern used by every other HA operator in OpenShift (oauth-openshift, openshift-apiserver, openshift-controller-manager, image-registry).
There was a problem hiding this comment.
🧩 Analysis chain
🏁 Script executed:
#!/bin/bash
# Resolve the coderabbit review comment using the GitHub CLI
# First, list comments on the PR to find the comment ID
gh api repos/openshift/library-go/pulls/2276/comments --jq '.[] | select(.user.login == "coderabbitai[bot]") | {id: .id, body: .body[0:80]}'Length of output: 978
🏁 Script executed:
#!/bin/bash
# Resolve the review comment via GraphQL (minimize it)
# First get the node ID for comment 3357674013
gh api graphql -f query='
{
repository(owner: "openshift", name: "library-go") {
pullRequest(number: 2276) {
reviewThreads(first: 50) {
nodes {
id
isResolved
comments(first: 1) {
nodes {
databaseId
body
}
}
}
}
}
}
}' --jq '.data.repository.pullRequest.reviewThreads.nodes[] | select(.comments.nodes[0].databaseId == 3357674013) | {threadId: .id, resolved: .isResolved}'Length of output: 716
🏁 Script executed:
#!/bin/bash
# Resolve the review thread
gh api graphql -f query='
mutation {
resolveReviewThread(input: {threadId: "PRRT_kwDOB-wqks6HJQQ6"}) {
thread {
id
isResolved
}
}
}'Length of output: 705
@sanchezl, understood — the design is intentional and the concern is fully mitigated. The two hooks work as a system:
- Replicas hook (runs first): adapts replica count to topology — SNO gets 1 replica, so
WithTopologyAwareSchedulingHookremoves anti-affinity entirely for that case. - Scheduling hook (runs second): sets
maxUnavailable = max(replicas-1, 1), which prevents the classic scheduling deadlock by allowing Kubernetes to evict an old pod to free a node before placing the replacement.
This is the established pattern across OpenShift HA operators (oauth-openshift, openshift-apiserver, openshift-controller-manager, image-registry), and required anti-affinity is the correct choice here. I'll mark the comment as resolved.
✏️ Learnings added
Learnt from: sanchezl
Repo: openshift/library-go PR: 2276
File: pkg/operator/deploymentcontroller/topology.go:0-0
Timestamp: 2026-06-04T17:50:01.796Z
Learning: In `pkg/operator/deploymentcontroller/topology.go` (openshift/library-go PR `#2276`), `RequiredDuringSchedulingIgnoredDuringExecution` pod anti-affinity on `kubernetes.io/hostname` is intentional and safe. The design relies on two hooks in tandem: (1) `WithTopologyAwareReplicasHook` sets replicas=1 for SNO (so `WithTopologyAwareSchedulingHook` removes anti-affinity entirely for SNO), and replicas=2 for DualReplica/TNA/External; (2) `WithTopologyAwareSchedulingHook` sets `maxUnavailable = max(replicas-1, 1)`, avoiding the typical scheduling deadlock by allowing Kubernetes to evict an old pod before scheduling the replacement. This matches the pattern used by oauth-openshift, openshift-apiserver, openshift-controller-manager, and image-registry. Do not flag required hostname anti-affinity as a violation in this file.
Note: Learnings are effective only in the context of similar code segments. To apply general coding guidelines, please configure review instructions. You can manage existing learnings in the app.
🧠 Learnings used
Learnt from: CR
Repo: openshift/library-go PR: 0
File: coderabbit-custom-pre-merge-checks-unique-id-file-non-traceable-F7F2B60C-1728-4C9A-8889-4F2235E186CA.txt:0-0
Timestamp: 2026-05-27T15:26:58.336Z
Learning: Applies to **/*.{yaml,yml,go} : Deployment manifests and controller code must avoid scheduling constraints that assume standard HA topology. Flag required pod anti-affinity with hostname topology keys, pod topology spread constraints with DoNotSchedule, replica counts derived from node count without topology awareness, nodeSelector or affinity targeting control-plane nodes, broad tolerations that schedule to arbiter nodes, code targeting only worker nodes, and PodDisruptionBudgets designed for 3+ nodes
Learnt from: CR
Repo: openshift/library-go PR: 0
File: coderabbit-custom-pre-merge-checks-unique-id-file-non-traceable-F7F2B60C-1728-4C9A-8889-4F2235E186CA.txt:0-0
Timestamp: 2026-05-27T15:26:58.336Z
Learning: Applies to **/*_test.go : When new Ginkgo e2e tests are added, check whether they make multi-node or HA assumptions. Flag tests expecting multiple control-plane/master nodes, multiple worker nodes, pod anti-affinity, node-to-node communication, leader election failover, pod rescheduling, node scaling, separate infra/worker/master roles, rolling updates requiring other nodes, or multi-endpoint ingress/load balancing
| // WithControlPlaneNodeSelectorHook adds a control-plane node selector to the | ||
| // deployment, except on External topology (HyperShift) where no control-plane | ||
| // nodes exist in-cluster and the selector would leave pods Pending. | ||
| func WithControlPlaneNodeSelectorHook(infrastructureLister configv1listers.InfrastructureLister) DeploymentHookFunc { | ||
| return func(_ *opv1.OperatorSpec, deployment *appsv1.Deployment) error { | ||
| infra, err := infrastructureLister.Get("cluster") | ||
| if err != nil { | ||
| return fmt.Errorf("failed to get infrastructure resource: %w", err) | ||
| } | ||
| if infra.Status.ControlPlaneTopology == configv1.ExternalTopologyMode { | ||
| return nil | ||
| } | ||
| if deployment.Spec.Template.Spec.NodeSelector == nil { | ||
| deployment.Spec.Template.Spec.NodeSelector = make(map[string]string) | ||
| } | ||
| deployment.Spec.Template.Spec.NodeSelector["node-role.kubernetes.io/control-plane"] = "" | ||
| return nil |
There was a problem hiding this comment.
Avoid enforcing control-plane nodeSelector in shared hook defaults.
Line 132 hard-pins workloads to control-plane nodes. This violates the topology-safety policy and risks portability issues across non-standard topologies. Consider making this opt-in at the caller layer instead of default behavior in a shared hook.
As per coding guidelines: deployment/controller code must “flag nodeSelector or affinity targeting control-plane nodes”.
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
In `@pkg/operator/deploymentcontroller/topology.go` around lines 117 - 133, The
hook WithControlPlaneNodeSelectorHook currently always injects a control-plane
nodeSelector
(deployment.Spec.Template.Spec.NodeSelector["node-role.kubernetes.io/control-plane"]=""),
which hard-pins workloads; change it to be opt-in: do not modify NodeSelector by
default—only add the control-plane nodeSelector when an explicit opt-in signal
is present (e.g., a deployment annotation like
"operator.openshift.io/enforce-control-plane-node-selector" == "true" or a
boolean flag in the passed OperatorSpec). Update
WithControlPlaneNodeSelectorHook to check that opt-in signal (and still skip for
ExternalTopologyMode) before creating or mutating
deployment.Spec.Template.Spec.NodeSelector, and document/flag this behavior so
callers must explicitly enable it.
There was a problem hiding this comment.
This hook is opt-in — callers explicitly register it in their controller's hook chain. It is not applied by default to any deployment. The caller decides whether their workload should run on control-plane nodes. Additionally, the External topology case (where no CP nodes exist) is handled by clearing any pre-existing CP selector, as addressed in the latest push.
There was a problem hiding this comment.
@sanchezl, you're right — that's a fair clarification. Since callers must explicitly register WithControlPlaneNodeSelectorHook in their hook chain, the opt-in contract is clear at the usage site. The hook name also makes the intent self-documenting. I'll withdraw this concern.
🧠 Learnings used
Learnt from: CR
Repo: openshift/library-go PR: 0
File: coderabbit-custom-pre-merge-checks-unique-id-file-non-traceable-F7F2B60C-1728-4C9A-8889-4F2235E186CA.txt:0-0
Timestamp: 2026-05-27T15:26:58.336Z
Learning: Applies to **/*.{yaml,yml,go} : Deployment manifests and controller code must avoid scheduling constraints that assume standard HA topology. Flag required pod anti-affinity with hostname topology keys, pod topology spread constraints with DoNotSchedule, replica counts derived from node count without topology awareness, nodeSelector or affinity targeting control-plane nodes, broad tolerations that schedule to arbiter nodes, code targeting only worker nodes, and PodDisruptionBudgets designed for 3+ nodes
Learnt from: CR
Repo: openshift/library-go PR: 0
File: coderabbit-custom-pre-merge-checks-unique-id-file-non-traceable-F7F2B60C-1728-4C9A-8889-4F2235E186CA.txt:0-0
Timestamp: 2026-05-27T15:26:58.336Z
Learning: Applies to **/*_test.go : When new Ginkgo e2e tests are added, check whether they make multi-node or HA assumptions. Flag tests expecting multiple control-plane/master nodes, multiple worker nodes, pod anti-affinity, node-to-node communication, leader election failover, pod rescheduling, node scaling, separate infra/worker/master roles, rolling updates requiring other nodes, or multi-endpoint ingress/load balancing
78e4348 to
c6e8028
Compare
Add three DeploymentHookFunc helpers for operators that manage Deployments via the library-go DeploymentController: - WithTopologyAwareReplicasHook: sets replicas based on ControlPlaneTopology (SingleReplica, DualReplica, HighlyAvailableArbiter, External, HighlyAvailable) with a caller-specified cap, falling back to control-plane node count for unknown topologies. - WithTopologyAwareSchedulingHook: sets required pod anti-affinity and rolling update strategy (maxUnavailable, maxSurge) based on the deployment's replica count. When replicas > 1, pods are guaranteed to spread across nodes via required anti-affinity and maxUnavailable allows rolling updates to proceed without deadlock. When replicas == 1 (SNO), anti-affinity is removed. - WithControlPlaneNodeSelectorHook: adds a control-plane node selector, skipping External topology where no control-plane nodes exist in-cluster.
c6e8028 to
43b4ab6
Compare
|
@sanchezl: all tests passed! Full PR test history. Your PR dashboard. DetailsInstructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository. I understand the commands that are listed here. |
Summary
Add topology-aware scheduling helpers for operators that manage Deployments. These provide a consistent approach to replica scaling, pod anti-affinity, rolling update strategy, and control-plane node selection across all supported OpenShift topologies (SNO, DualReplica, TNA, HA, HyperShift).
Motivation: A survey of OpenShift operators found that each one reinvents topology-aware scheduling logic with inconsistent results. For example, the kube-storage-version-migrator-operator is the only HA operator using preferred (instead of required) anti-affinity, which led to a 74-second
Available=Falseoutage on a two-node fencing test (OCPBUGS-65984) when both pods landed on the same node during a bootstrap timing race. Every other HA operator (oauth-openshift, metrics-server, image-registry) uses required anti-affinity and survived the same fencing event.New APIs
Pure functions (for any operator, no DeploymentController required)
TopologyAwareReplicas(topology, controlPlaneNodeCount, maxReplicas)— Returns the desired replica count for a given topology:SetTopologyAwareScheduling(deployment, topology, controlPlaneNodeCount, appLabelValue, maxSurge, maxReplicas)— Applies all three topology adjustments (replicas, node selector, scheduling strategy) to a*appsv1.Deploymentin one call. Designed for operators that manage Deployments outsideDeploymentController, such asworkload.Delegateimplementations.DeploymentHookFunc hooks (for DeploymentController users)
WithTopologyAwareReplicasHook(infrastructureLister, nodeLister, maxReplicas)— Sets replicas based onControlPlaneTopology(delegates toTopologyAwareReplicas).WithTopologyAwareSchedulingHook(appLabelValue, maxSurge)— Sets rolling update strategy and pod anti-affinity based on replica count (must run after the replicas hook):kubernetes.io/hostname,maxUnavailable = max(replicas-1, 1)maxUnavailable = 1WithControlPlaneNodeSelectorHook(infrastructureLister)— Adds a control-plane node selector, except on External topology (HyperShift) where no control-plane nodes exist.Convenience constructor
WithTopologyAwareSchedulingHooks(infrastructureLister, nodeLister, appLabelValue, maxSurge, maxReplicas)— Returns all three hooks in the correct registration order as[]DeploymentHookFunc, eliminating the ordering requirement between replicas and scheduling hooks.Usage examples
DeploymentController (individual hooks):
DeploymentController (convenience):
workload.Delegate or custom controller:
Cross-operator survey
These hooks codify the pattern already used by most HA operators:
The hooks are designed for operators using the
DeploymentControllerpattern. The pure functions (TopologyAwareReplicas,SetTopologyAwareScheduling) extend this to any deployment management pattern. Operators with genuinely custom logic (ingress: hostNetwork; image-registry: platform-driven) are unaffected.Summary by CodeRabbit
New Features
Tests