Skip to content

OCPBUGS-86661: Konnectivity retry proxy connection on timeout#8579

Open
YamunadeviShanmugam wants to merge 2 commits into
openshift:mainfrom
YamunadeviShanmugam:fix-add-startup-probe-konnectivity-sock5-sidecar
Open

OCPBUGS-86661: Konnectivity retry proxy connection on timeout#8579
YamunadeviShanmugam wants to merge 2 commits into
openshift:mainfrom
YamunadeviShanmugam:fix-add-startup-probe-konnectivity-sock5-sidecar

Conversation

@YamunadeviShanmugam

@YamunadeviShanmugam YamunadeviShanmugam commented May 22, 2026

Copy link
Copy Markdown
Contributor

This PR fixes the intermittent E2E test failures where the konnectivity-proxy-socks5 sidecar container crashes and restarts during cluster bootstrap. Fix includes retry loop, native Kubernetes startup probes, and explicit component dependency ordering.
Proxy Internal Retry Logic

  • Added a new retry infrastructure with exponential backoff to handle transient connection errors safely. The container process now stays alive while waiting for the network tunnel to become available.

Native Startup Probes
konnectivity-container.go: Added a native TCP startup probe to all Konnectivity sidecar containers. This gives the sidecars a safe 60-second window to initialize during cluster bootstrap before any readiness or liveness checks trigger.

Component Dependency Ordering
Deployment graph is updated with dependency check. The openshift-oauth-apiserver pods will now explicitly wait for the konnectivity-agent to be completely Available and RolloutComplete before they begin scheduling.

Test Fixtures are updated accordingly

Summary by CodeRabbit

  • New Features

    • Konnectivity SOCKS5 proxy guard with automatic retries/backoff, graceful shutdown, and guarded startup.
    • Default serving port constant and TCP startup probe for konnectivity proxy; mode-specific serving-port selection.
    • Multiple hosted control-plane components now declare the Konnectivity agent as an explicit dependency.
  • Bug Fixes

    • Startup errors now propagate and exit cleanly instead of panicking.
  • Tests

    • Extensive unit tests for guard/bootstrap, transient-error detection, dialing, CLI flags, container build/probes, and graceful shutdown.

@openshift-merge-bot

Copy link
Copy Markdown
Contributor

Pipeline controller notification
This repo is configured to use the pipeline controller. Second-stage tests will be triggered either automatically or after lgtm label is added, depending on the repository configuration. The pipeline controller will automatically detect which contexts are required and will utilize /test Prow commands to trigger the second stage.

For optional jobs, comment /test ? to see a list of all defined jobs. To trigger manually all jobs from second stage use /pipeline required command.

This repository is configured in: LGTM mode

@openshift-ci openshift-ci Bot added the do-not-merge/work-in-progress Indicates that a PR should not merge because it is a work in progress. label May 22, 2026
@openshift-ci

openshift-ci Bot commented May 22, 2026

Copy link
Copy Markdown
Contributor

Skipping CI for Draft Pull Request.
If you want CI signal for your change, please convert it to an actual PR.
You can still manually trigger a test run with /test all

@coderabbitai

coderabbitai Bot commented May 22, 2026

Copy link
Copy Markdown
Contributor

Note

Reviews paused

It looks like this branch is under active development. To avoid overwhelming you with review comments due to an influx of new commits, CodeRabbit has automatically paused this review. You can configure this behavior by changing the reviews.auto_review.auto_pause_after_reviewed_commits setting.

Use the following commands to manage reviews:

  • @coderabbitai resume to resume automatic reviews.
  • @coderabbitai review to trigger a single review.

Use the checkboxes below for quick actions:

  • ▶️ Resume reviews
  • 🔍 Trigger review
📝 Walkthrough

Walkthrough

This PR adds a guarded Konnectivity-backed SOCKS5 proxy (bootstrap, TCP verify, transient retry/backoff, graceful shutdown), moves proxy startup to a signal-aware guard in main, adds a TCP StartupProbe and DefaultServingPort (8090) for the konnectivity proxy container, and updates multiple hosted-control-plane components to declare the konnectivity agent as a dependency. Comprehensive unit tests for the guard, dial checks, main command, and container probe behavior are included.

Sequence Diagram(s)

sequenceDiagram
  participant SignalHandler
  participant KonnectivityGuard
  participant KubernetesAPI
  participant KonnectivityDialer
  participant SOCKS5Server

  SignalHandler->>KonnectivityGuard: start(ctx)
  KonnectivityGuard->>KubernetesAPI: create controller-runtime client
  KonnectivityGuard->>KonnectivityDialer: tryBootstrapKonnectivity()
  KonnectivityDialer->>KubernetesAPI: use injected client
  KonnectivityDialer->>KonnectivityServer: TCP dial host:port (verify)
  KonnectivityGuard->>SOCKS5Server: serveWithGracefulShutdown(server, KonnectivityDialer)
  SOCKS5Server-->>KonnectivityGuard: exit error
  KonnectivityGuard->>KonnectivityGuard: isTransientKonnectivityError? -> backoff & retry
Loading
🚥 Pre-merge checks | ✅ 7 | ❌ 1

❌ Failed checks (1 inconclusive)

Check name Status Explanation Resolution
Test Structure And Quality ❓ Inconclusive Tests use standard Go testing, not Ginkgo; check explicitly requires "Ginkgo test code" with Ginkgo patterns (It blocks, BeforeEach/AfterEach) absent here. Clarify: does check apply to standard Go tests? If yes, assertion messages need improvement (some assertions lack meaningful messages for failure diagnosis).
✅ Passed checks (7 passed)
Check name Status Explanation
Description Check ✅ Passed Check skipped - CodeRabbit’s high-level summary is enabled.
Linked Issues check ✅ Passed Check skipped because no linked issues were found for this pull request.
Out of Scope Changes check ✅ Passed Check skipped because no linked issues were found for this pull request.
Stable And Deterministic Test Names ✅ Passed The PR adds no Ginkgo tests—all new test files (guard_test.go, main_test.go, konnectivity-container_test.go) use standard Go testing with t.Run, not Ginkgo It()/Describe(). Check does not apply.
Topology-Aware Scheduling Compatibility ✅ Passed No topology-breaking constraints: startup probes are simple TCP checks, no affinity/topology spread rules, no node selectors, no replica calculations based on node count.
Ipv6 And Disconnected Network Test Compatibility ✅ Passed PR adds only unit tests (guard_test.go, main_test.go, konnectivity-container_test.go), not Ginkgo e2e tests. Custom check applies only to new Ginkgo e2e tests; check is not applicable here.
Title check ✅ Passed The title clearly describes the main change: adding retry logic for konnectivity proxy connections on timeout, which is the core feature introduced across guard.go, main.go, and related component dependencies.

✏️ Tip: You can configure your own custom pre-merge checks in the settings.

✨ Finishing Touches
🧪 Generate unit tests (beta)
  • Create PR with unit tests

Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out.

❤️ Share

Comment @coderabbitai help to get the list of available commands and usage tips.

@openshift-ci openshift-ci Bot added area/control-plane-operator Indicates the PR includes changes for the control plane operator - in an OCP release area/hypershift-operator Indicates the PR includes changes for the hypershift operator and API - outside an OCP release and removed do-not-merge/needs-area labels May 22, 2026
@YamunadeviShanmugam YamunadeviShanmugam force-pushed the fix-add-startup-probe-konnectivity-sock5-sidecar branch from e81da62 to c1f6355 Compare May 22, 2026 17:40
@YamunadeviShanmugam

Copy link
Copy Markdown
Contributor Author

/test all

@codecov

codecov Bot commented May 22, 2026

Copy link
Copy Markdown

Codecov Report

❌ Patch coverage is 75.25773% with 48 lines in your changes missing coverage. Please review.
✅ Project coverage is 42.16%. Comparing base (ee9099c) to head (07dfe64).
⚠️ Report is 6 commits behind head on main.

Files with missing lines Patch % Lines
konnectivity-socks5-proxy/guard.go 77.01% 32 Missing and 5 partials ⚠️
konnectivity-socks5-proxy/main.go 0.00% 11 Missing ⚠️
Additional details and impacted files
@@            Coverage Diff             @@
##             main    #8579      +/-   ##
==========================================
+ Coverage   41.75%   42.16%   +0.40%     
==========================================
  Files         758      759       +1     
  Lines       93981    94156     +175     
==========================================
+ Hits        39240    39697     +457     
+ Misses      51988    51703     -285     
- Partials     2753     2756       +3     
Files with missing lines Coverage Δ
...hostedcontrolplane/v2/oauth_apiserver/component.go 100.00% <100.00%> (+72.41%) ⬆️
...t/controlplane-component/konnectivity-container.go 71.00% <100.00%> (+36.92%) ⬆️
konnectivity-socks5-proxy/main.go 66.00% <0.00%> (+66.00%) ⬆️
konnectivity-socks5-proxy/guard.go 77.01% <77.01%> (ø)

... and 7 files with indirect coverage changes

Flag Coverage Δ
cmd-support 35.26% <100.00%> (+0.23%) ⬆️
cpo-hostedcontrolplane 45.38% <100.00%> (+1.28%) ⬆️
cpo-other 43.45% <ø> (ø)
hypershift-operator 51.82% <ø> (ø)
other 32.73% <72.09%> (+1.16%) ⬆️

Flags with carried forward coverage won't be shown. Click here to find out more.

🚀 New features to boost your workflow:
  • ❄️ Test Analytics: Detect flaky tests, report on failures, and find test suite problems.

@coderabbitai coderabbitai Bot left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 3

🧹 Nitpick comments (3)
konnectivity-socks5-proxy/guard_test.go (2)

28-58: ⚡ Quick win

Consider refactoring to a table-driven test.

The test cases all follow the same structure (create error, call isTransientKonnectivityError, assert result). Table-driven tests improve maintainability and make it easier to add new cases. As per coding guidelines, prefer table-driven tests where possible.

♻️ Proposed table-driven refactor
 func TestIsTransientKonnectivityError(t *testing.T) {
-	t.Run("When error is nil, it should not be transient", func(t *testing.T) {
-		g := NewGomegaWithT(t)
-		g.Expect(isTransientKonnectivityError(nil)).To(BeFalse())
-	})
-
-	t.Run("When error is context deadline exceeded, it should be transient", func(t *testing.T) {
-		g := NewGomegaWithT(t)
-		g.Expect(isTransientKonnectivityError(context.DeadlineExceeded)).To(BeTrue())
-	})
-
-	t.Run("When error is connection refused, it should be transient", func(t *testing.T) {
-		g := NewGomegaWithT(t)
-		g.Expect(isTransientKonnectivityError(fmt.Errorf("dial tcp: connection refused"))).To(BeTrue())
-	})
-
-	t.Run("When error is syscall ECONNREFUSED, it should be transient", func(t *testing.T) {
-		g := NewGomegaWithT(t)
-		g.Expect(isTransientKonnectivityError(syscall.ECONNREFUSED)).To(BeTrue())
-	})
-
-	t.Run("When error is validation failure, it should not be transient", func(t *testing.T) {
-		g := NewGomegaWithT(t)
-		g.Expect(isTransientKonnectivityError(errors.New("failed validation: KonnectivityHost is required"))).To(BeFalse())
-	})
-
-	t.Run("When error is file not found, it should not be transient", func(t *testing.T) {
-		g := NewGomegaWithT(t)
-		g.Expect(isTransientKonnectivityError(os.ErrNotExist)).To(BeFalse())
-	})
+	testCases := []struct {
+		name       string
+		err        error
+		isTransient bool
+	}{
+		{
+			name:       "When error is nil, it should not be transient",
+			err:        nil,
+			isTransient: false,
+		},
+		{
+			name:       "When error is context deadline exceeded, it should be transient",
+			err:        context.DeadlineExceeded,
+			isTransient: true,
+		},
+		{
+			name:       "When error is connection refused, it should be transient",
+			err:        fmt.Errorf("dial tcp: connection refused"),
+			isTransient: true,
+		},
+		{
+			name:       "When error is syscall ECONNREFUSED, it should be transient",
+			err:        syscall.ECONNREFUSED,
+			isTransient: true,
+		},
+		{
+			name:       "When error is validation failure, it should not be transient",
+			err:        errors.New("failed validation: KonnectivityHost is required"),
+			isTransient: false,
+		},
+		{
+			name:       "When error is file not found, it should not be transient",
+			err:        os.ErrNotExist,
+			isTransient: false,
+		},
+	}
+
+	for _, tc := range testCases {
+		t.Run(tc.name, func(t *testing.T) {
+			g := NewGomegaWithT(t)
+			if tc.isTransient {
+				g.Expect(isTransientKonnectivityError(tc.err)).To(BeTrue())
+			} else {
+				g.Expect(isTransientKonnectivityError(tc.err)).To(BeFalse())
+			}
+		})
+	}
 }
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@konnectivity-socks5-proxy/guard_test.go` around lines 28 - 58, Refactor
TestIsTransientKonnectivityError into a table-driven test: create a slice of
cases with fields name, err and want (bool) containing the current scenarios
(nil, context.DeadlineExceeded, fmt.Errorf("dial tcp: connection refused"),
syscall.ECONNREFUSED, errors.New("failed validation..."), os.ErrNotExist), then
loop over cases calling t.Run(case.name, func(t *testing.T){ g :=
NewGomegaWithT(t);
g.Expect(isTransientKonnectivityError(case.err)).To(Equal(case.want)) }); keep
the test function name TestIsTransientKonnectivityError and reuse
isTransientKonnectivityError and NewGomegaWithT to preserve existing assertions.

95-103: ⚡ Quick win

Avoid mutating package-level state in tests.

Modifying the package-level dialKonnectivityServer variable violates the guideline to avoid global state in tests. While the defer restores the original value, this pattern is unsafe for parallel test execution and can cause race conditions. As per coding guidelines, tests should run with parallel execution enabled.

Consider refactoring the code under test to accept the dialer as a parameter (dependency injection) rather than relying on a package-level variable. This would make tests safer, support parallel execution, and improve testability.

Example approach:

// In guard.go, make dialKonnectivityServer a parameter or field
func retryDialKonnectivityServer(ctx context.Context, log logr.Logger, host string, port uint32, dialFunc func(string, uint32) error) error {
    // ... use dialFunc instead of global dialKonnectivityServer
}

// In test
func TestDialKonnectivityServerRetries(t *testing.T) {
    attempts := 0
    mockDial := func(host string, port uint32) error {
        attempts++
        if attempts < 2 {
            return fmt.Errorf("dial tcp: connection refused")
        }
        return dialKonnectivityServerTCP(host, port)
    }
    
    err := retryDialKonnectivityServer(ctx, log, "127.0.0.1", port, mockDial)
    // ... assertions
}

As per coding guidelines, avoid global state in tests and run tests with parallel execution.

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@konnectivity-socks5-proxy/guard_test.go` around lines 95 - 103, The test
mutates the package-level dialKonnectivityServer which is unsafe for parallel
tests; refactor retryDialKonnectivityServer (in guard.go) to accept a dial
function parameter (e.g., dialFunc func(string,uint32) error or as a field on a
struct) and update the test to call retryDialKonnectivityServer with a local
mockDial closure that increments attempts and returns the simulated error on the
first call and delegates to the real dial (e.g., dialKonnectivityServerTCP) on
success; remove any package-level reassignment in tests and use dependency
injection to supply the mock instead.
support/controlplane-component/konnectivity-container.go (1)

184-198: ⚡ Quick win

Refactor to eliminate duplication and add defensive default case.

The HTTPS and Socks5 branches contain identical logic, violating DRY. Additionally, the switch lacks a default case—if opts.Mode is neither HTTPS nor Socks5, servingPort remains zero, creating an invalid probe configuration.

♻️ Proposed refactoring
-	var servingPort int32
-	switch opts.Mode {
-	case HTTPS:
-		if opts.HTTPSOptions.ServingPort != 0 {
-			servingPort = int32(opts.HTTPSOptions.ServingPort)
-		} else {
-			servingPort = 8090
-		}
-	case Socks5:
-		if opts.Socks5Options.ServingPort != 0 {
-			servingPort = int32(opts.Socks5Options.ServingPort)
-		} else {
-			servingPort = 8090
-		}
-	}
+	servingPort := int32(8090)
+	switch opts.Mode {
+	case HTTPS:
+		if opts.HTTPSOptions.ServingPort != 0 {
+			servingPort = int32(opts.HTTPSOptions.ServingPort)
+		}
+	case Socks5:
+		if opts.Socks5Options.ServingPort != 0 {
+			servingPort = int32(opts.Socks5Options.ServingPort)
+		}
+	default:
+		// Should never occur given validation at line 81-84, but defensive.
+		panic(fmt.Sprintf("unexpected konnectivity proxy mode: %s", opts.Mode))
+	}

As per coding guidelines, follow the 100-go-mistakes rules: prefer eliminating code duplication and using defensive programming patterns.

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@support/controlplane-component/konnectivity-container.go` around lines 184 -
198, The servingPort selection duplicates identical logic for HTTPS and Socks5
and lacks a default branch, so update the code that sets servingPort (variable
servingPort using opts.Mode, HTTPS, Socks5, opts.HTTPSOptions.ServingPort,
opts.Socks5Options.ServingPort) to remove duplication by first selecting the
candidate port (e.g., from opts.HTTPSOptions.ServingPort or
opts.Socks5Options.ServingPort based on opts.Mode) and then applying a single
fallback to 8090 if the candidate is zero; also add a defensive default for the
switch (or handle unknown modes) to ensure servingPort is always set to a valid
default.
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Inline comments:
In `@konnectivity-socks5-proxy/guard.go`:
- Around line 59-60: The retry backoff currently uses time.Sleep(delay()) which
ignores ctx cancellation; update both retry sites in guard.go (where
time.Sleep(delay()) is called) to use a cancellable wait: replace
time.Sleep(delay()) with a select that waits on time.After(delay()) and
ctx.Done(), returning or breaking when ctx is cancelled. Ensure the surrounding
function (the loop that calls delay()) respects ctx cancellation (e.g., check
ctx.Err() or return on ctx.Done()) so retries exit immediately when the provided
ctx is cancelled.
- Around line 30-35: coreGuardBackoff currently has Steps==0 so DelayFunc
returns a constant delay and exponential growth never happens; update
coreGuardBackoff to set a positive Steps (e.g., 5 or 10) so Factor and Cap take
effect, and replace every time.Sleep(delay()) call (where delay is produced by
coreGuardBackoff.DelayFunc or similar) with a ctx-aware wait: compute delay :=
delay() then use select { case <-ctx.Done(): return ctx.Err() (or break out
appropriately); case <-time.After(delay): /* continue retry */ } so retries exit
promptly on cancellation; reference coreGuardBackoff, DelayFunc and the sites
using time.Sleep(delay()) when making these edits.

In `@konnectivity-socks5-proxy/main.go`:
- Around line 55-71: signals.SetupSignalHandler() cancels ctx but the current
runWithCoreGuard call blocks on server.ListenAndServe and never observes
ctx.Done(), so SIGTERM doesn't stop the proxy and context cancellation becomes a
fatal error; change the socks5 server start to create a net.Listener (net.Listen
with fmt.Sprintf(":%d", servingPort)), run server.Serve(listener) in a
goroutine, and in the main goroutine select on ctx.Done() vs serve result: on
ctx.Done() call listener.Close() and wait for Serve to return and then return
nil (treat context cancellation as clean shutdown), and if Serve returns a
non-context error propagate it as before; update the block that constructs
socks5.Config and calls server.ListenAndServe to use this listener/Serve + ctx
cancellation handling so signals.SetupSignalHandler(), runWithCoreGuard,
server.Serve, and listener.Close() coordinate graceful shutdown.

---

Nitpick comments:
In `@konnectivity-socks5-proxy/guard_test.go`:
- Around line 28-58: Refactor TestIsTransientKonnectivityError into a
table-driven test: create a slice of cases with fields name, err and want (bool)
containing the current scenarios (nil, context.DeadlineExceeded,
fmt.Errorf("dial tcp: connection refused"), syscall.ECONNREFUSED,
errors.New("failed validation..."), os.ErrNotExist), then loop over cases
calling t.Run(case.name, func(t *testing.T){ g := NewGomegaWithT(t);
g.Expect(isTransientKonnectivityError(case.err)).To(Equal(case.want)) }); keep
the test function name TestIsTransientKonnectivityError and reuse
isTransientKonnectivityError and NewGomegaWithT to preserve existing assertions.
- Around line 95-103: The test mutates the package-level dialKonnectivityServer
which is unsafe for parallel tests; refactor retryDialKonnectivityServer (in
guard.go) to accept a dial function parameter (e.g., dialFunc
func(string,uint32) error or as a field on a struct) and update the test to call
retryDialKonnectivityServer with a local mockDial closure that increments
attempts and returns the simulated error on the first call and delegates to the
real dial (e.g., dialKonnectivityServerTCP) on success; remove any package-level
reassignment in tests and use dependency injection to supply the mock instead.

In `@support/controlplane-component/konnectivity-container.go`:
- Around line 184-198: The servingPort selection duplicates identical logic for
HTTPS and Socks5 and lacks a default branch, so update the code that sets
servingPort (variable servingPort using opts.Mode, HTTPS, Socks5,
opts.HTTPSOptions.ServingPort, opts.Socks5Options.ServingPort) to remove
duplication by first selecting the candidate port (e.g., from
opts.HTTPSOptions.ServingPort or opts.Socks5Options.ServingPort based on
opts.Mode) and then applying a single fallback to 8090 if the candidate is zero;
also add a defensive default for the switch (or handle unknown modes) to ensure
servingPort is always set to a valid default.
🪄 Autofix (Beta)

Fix all unresolved CodeRabbit comments on this PR:

  • Push a commit to this branch (recommended)
  • Create a new PR with the fixes

ℹ️ Review info
⚙️ Run configuration

Configuration used: Repository YAML (base), Central YAML (inherited)

Review profile: CHILL

Plan: Enterprise

Run ID: 56b1509a-1ce4-4bfd-9308-e453f94028b4

📥 Commits

Reviewing files that changed from the base of the PR and between d24af10 and e81da62.

⛔ Files ignored due to path filters (40)
  • control-plane-operator/controllers/hostedcontrolplane/testdata/catalog-operator/AROSwift/zz_fixture_TestControlPlaneComponents_catalog_operator_deployment.yaml is excluded by !**/testdata/**
  • control-plane-operator/controllers/hostedcontrolplane/testdata/catalog-operator/GCP/zz_fixture_TestControlPlaneComponents_catalog_operator_deployment.yaml is excluded by !**/testdata/**
  • control-plane-operator/controllers/hostedcontrolplane/testdata/catalog-operator/IBMCloud/zz_fixture_TestControlPlaneComponents_catalog_operator_deployment.yaml is excluded by !**/testdata/**
  • control-plane-operator/controllers/hostedcontrolplane/testdata/catalog-operator/TechPreviewNoUpgrade/zz_fixture_TestControlPlaneComponents_catalog_operator_deployment.yaml is excluded by !**/testdata/**
  • control-plane-operator/controllers/hostedcontrolplane/testdata/catalog-operator/zz_fixture_TestControlPlaneComponents_catalog_operator_deployment.yaml is excluded by !**/testdata/**
  • control-plane-operator/controllers/hostedcontrolplane/testdata/cluster-network-operator/AROSwift/zz_fixture_TestControlPlaneComponents_cluster_network_operator_deployment.yaml is excluded by !**/testdata/**
  • control-plane-operator/controllers/hostedcontrolplane/testdata/cluster-network-operator/GCP/zz_fixture_TestControlPlaneComponents_cluster_network_operator_deployment.yaml is excluded by !**/testdata/**
  • control-plane-operator/controllers/hostedcontrolplane/testdata/cluster-network-operator/IBMCloud/zz_fixture_TestControlPlaneComponents_cluster_network_operator_deployment.yaml is excluded by !**/testdata/**
  • control-plane-operator/controllers/hostedcontrolplane/testdata/cluster-network-operator/TechPreviewNoUpgrade/zz_fixture_TestControlPlaneComponents_cluster_network_operator_deployment.yaml is excluded by !**/testdata/**
  • control-plane-operator/controllers/hostedcontrolplane/testdata/cluster-network-operator/zz_fixture_TestControlPlaneComponents_cluster_network_operator_deployment.yaml is excluded by !**/testdata/**
  • control-plane-operator/controllers/hostedcontrolplane/testdata/ingress-operator/AROSwift/zz_fixture_TestControlPlaneComponents_ingress_operator_deployment.yaml is excluded by !**/testdata/**
  • control-plane-operator/controllers/hostedcontrolplane/testdata/ingress-operator/GCP/zz_fixture_TestControlPlaneComponents_ingress_operator_deployment.yaml is excluded by !**/testdata/**
  • control-plane-operator/controllers/hostedcontrolplane/testdata/ingress-operator/IBMCloud/zz_fixture_TestControlPlaneComponents_ingress_operator_deployment.yaml is excluded by !**/testdata/**
  • control-plane-operator/controllers/hostedcontrolplane/testdata/ingress-operator/TechPreviewNoUpgrade/zz_fixture_TestControlPlaneComponents_ingress_operator_deployment.yaml is excluded by !**/testdata/**
  • control-plane-operator/controllers/hostedcontrolplane/testdata/ingress-operator/zz_fixture_TestControlPlaneComponents_ingress_operator_deployment.yaml is excluded by !**/testdata/**
  • control-plane-operator/controllers/hostedcontrolplane/testdata/oauth-openshift/AROSwift/zz_fixture_TestControlPlaneComponents_oauth_openshift_deployment.yaml is excluded by !**/testdata/**
  • control-plane-operator/controllers/hostedcontrolplane/testdata/oauth-openshift/GCP/zz_fixture_TestControlPlaneComponents_oauth_openshift_deployment.yaml is excluded by !**/testdata/**
  • control-plane-operator/controllers/hostedcontrolplane/testdata/oauth-openshift/IBMCloud/zz_fixture_TestControlPlaneComponents_oauth_openshift_deployment.yaml is excluded by !**/testdata/**
  • control-plane-operator/controllers/hostedcontrolplane/testdata/oauth-openshift/TechPreviewNoUpgrade/zz_fixture_TestControlPlaneComponents_oauth_openshift_deployment.yaml is excluded by !**/testdata/**
  • control-plane-operator/controllers/hostedcontrolplane/testdata/oauth-openshift/zz_fixture_TestControlPlaneComponents_oauth_openshift_deployment.yaml is excluded by !**/testdata/**
  • control-plane-operator/controllers/hostedcontrolplane/testdata/olm-operator/AROSwift/zz_fixture_TestControlPlaneComponents_olm_operator_deployment.yaml is excluded by !**/testdata/**
  • control-plane-operator/controllers/hostedcontrolplane/testdata/olm-operator/GCP/zz_fixture_TestControlPlaneComponents_olm_operator_deployment.yaml is excluded by !**/testdata/**
  • control-plane-operator/controllers/hostedcontrolplane/testdata/olm-operator/IBMCloud/zz_fixture_TestControlPlaneComponents_olm_operator_deployment.yaml is excluded by !**/testdata/**
  • control-plane-operator/controllers/hostedcontrolplane/testdata/olm-operator/TechPreviewNoUpgrade/zz_fixture_TestControlPlaneComponents_olm_operator_deployment.yaml is excluded by !**/testdata/**
  • control-plane-operator/controllers/hostedcontrolplane/testdata/olm-operator/zz_fixture_TestControlPlaneComponents_olm_operator_deployment.yaml is excluded by !**/testdata/**
  • control-plane-operator/controllers/hostedcontrolplane/testdata/openshift-apiserver/AROSwift/zz_fixture_TestControlPlaneComponents_openshift_apiserver_deployment.yaml is excluded by !**/testdata/**
  • control-plane-operator/controllers/hostedcontrolplane/testdata/openshift-apiserver/GCP/zz_fixture_TestControlPlaneComponents_openshift_apiserver_deployment.yaml is excluded by !**/testdata/**
  • control-plane-operator/controllers/hostedcontrolplane/testdata/openshift-apiserver/IBMCloud/zz_fixture_TestControlPlaneComponents_openshift_apiserver_deployment.yaml is excluded by !**/testdata/**
  • control-plane-operator/controllers/hostedcontrolplane/testdata/openshift-apiserver/TechPreviewNoUpgrade/zz_fixture_TestControlPlaneComponents_openshift_apiserver_deployment.yaml is excluded by !**/testdata/**
  • control-plane-operator/controllers/hostedcontrolplane/testdata/openshift-apiserver/zz_fixture_TestControlPlaneComponents_openshift_apiserver_deployment.yaml is excluded by !**/testdata/**
  • control-plane-operator/controllers/hostedcontrolplane/testdata/openshift-oauth-apiserver/AROSwift/zz_fixture_TestControlPlaneComponents_openshift_oauth_apiserver_controlplanecomponent.yaml is excluded by !**/testdata/**
  • control-plane-operator/controllers/hostedcontrolplane/testdata/openshift-oauth-apiserver/GCP/zz_fixture_TestControlPlaneComponents_openshift_oauth_apiserver_controlplanecomponent.yaml is excluded by !**/testdata/**
  • control-plane-operator/controllers/hostedcontrolplane/testdata/openshift-oauth-apiserver/IBMCloud/zz_fixture_TestControlPlaneComponents_openshift_oauth_apiserver_controlplanecomponent.yaml is excluded by !**/testdata/**
  • control-plane-operator/controllers/hostedcontrolplane/testdata/openshift-oauth-apiserver/TechPreviewNoUpgrade/zz_fixture_TestControlPlaneComponents_openshift_oauth_apiserver_controlplanecomponent.yaml is excluded by !**/testdata/**
  • control-plane-operator/controllers/hostedcontrolplane/testdata/openshift-oauth-apiserver/zz_fixture_TestControlPlaneComponents_openshift_oauth_apiserver_controlplanecomponent.yaml is excluded by !**/testdata/**
  • control-plane-operator/controllers/hostedcontrolplane/testdata/packageserver/AROSwift/zz_fixture_TestControlPlaneComponents_packageserver_deployment.yaml is excluded by !**/testdata/**
  • control-plane-operator/controllers/hostedcontrolplane/testdata/packageserver/GCP/zz_fixture_TestControlPlaneComponents_packageserver_deployment.yaml is excluded by !**/testdata/**
  • control-plane-operator/controllers/hostedcontrolplane/testdata/packageserver/IBMCloud/zz_fixture_TestControlPlaneComponents_packageserver_deployment.yaml is excluded by !**/testdata/**
  • control-plane-operator/controllers/hostedcontrolplane/testdata/packageserver/TechPreviewNoUpgrade/zz_fixture_TestControlPlaneComponents_packageserver_deployment.yaml is excluded by !**/testdata/**
  • control-plane-operator/controllers/hostedcontrolplane/testdata/packageserver/zz_fixture_TestControlPlaneComponents_packageserver_deployment.yaml is excluded by !**/testdata/**
📒 Files selected for processing (5)
  • control-plane-operator/controllers/hostedcontrolplane/v2/oauth_apiserver/component.go
  • konnectivity-socks5-proxy/guard.go
  • konnectivity-socks5-proxy/guard_test.go
  • konnectivity-socks5-proxy/main.go
  • support/controlplane-component/konnectivity-container.go

Comment thread konnectivity-socks5-proxy/guard.go Outdated
Comment thread konnectivity-socks5-proxy/guard.go Outdated
Comment thread konnectivity-socks5-proxy/main.go

@coderabbitai coderabbitai Bot left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 1

🧹 Nitpick comments (1)
konnectivity-socks5-proxy/guard_test.go (1)

79-110: 🏗️ Heavy lift

Test the production retry path instead of a local clone.

retryDialKonnectivityServer reimplements its own loop, so this test can stay green while bootstrapKonnectivity or runWithCoreGuard regress. Prefer injecting the dial/bootstrap dependency into the real code path and asserting retries there.

As per coding guidelines, "Always include unit tests when creating new functions or modifying existing ones" and "Unit test any code changes and additions."

Also applies to: 113-131

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@konnectivity-socks5-proxy/guard_test.go` around lines 79 - 110, The test
currently calls retryDialKonnectivityServer directly and stubs
dialKonnectivityServer, which doesn't exercise the real production path;
refactor the code to inject the dialing/bootstrap dependency (e.g., accept a
DialFunc or a KonnectivityBootstrap interface) into
bootstrapKonnectivity/runWithCoreGuard and update the test to call the real
entrypoint (bootstrapKonnectivity or runWithCoreGuard) with a test dial
implementation that fails once then succeeds, assert the entrypoint eventually
succeeds and that the injected dial was invoked multiple times; keep references
to dialKonnectivityServer, retryDialKonnectivityServer, bootstrapKonnectivity,
and runWithCoreGuard to locate and change signatures, and ensure tests restore
any global state and use context timeouts as before.
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Inline comments:
In `@konnectivity-socks5-proxy/guard.go`:
- Around line 36-37: Thread a context through bootstrapKonnectivity →
tryBootstrapKonnectivity → dialKonnectivityServerTCP so the TCP probe is
cancellable: add a ctx parameter to bootstrapKonnectivity and propagate it into
tryBootstrapKonnectivity and dialKonnectivityServerTCP, then replace the
net.DialTimeout(...) call inside dialKonnectivityServerTCP with
(&net.Dialer{Timeout: konnectivityDialTimeout}).DialContext(ctx, network, addr)
so the in-progress probe is aborted when ctx is cancelled; update all callers to
pass the ctx accordingly.

---

Nitpick comments:
In `@konnectivity-socks5-proxy/guard_test.go`:
- Around line 79-110: The test currently calls retryDialKonnectivityServer
directly and stubs dialKonnectivityServer, which doesn't exercise the real
production path; refactor the code to inject the dialing/bootstrap dependency
(e.g., accept a DialFunc or a KonnectivityBootstrap interface) into
bootstrapKonnectivity/runWithCoreGuard and update the test to call the real
entrypoint (bootstrapKonnectivity or runWithCoreGuard) with a test dial
implementation that fails once then succeeds, assert the entrypoint eventually
succeeds and that the injected dial was invoked multiple times; keep references
to dialKonnectivityServer, retryDialKonnectivityServer, bootstrapKonnectivity,
and runWithCoreGuard to locate and change signatures, and ensure tests restore
any global state and use context timeouts as before.
🪄 Autofix (Beta)

Fix all unresolved CodeRabbit comments on this PR:

  • Push a commit to this branch (recommended)
  • Create a new PR with the fixes

ℹ️ Review info
⚙️ Run configuration

Configuration used: Repository YAML (base), Central YAML (inherited)

Review profile: CHILL

Plan: Enterprise

Run ID: a56c9242-6df5-4a69-85b4-450d2a131e1a

📥 Commits

Reviewing files that changed from the base of the PR and between e81da62 and c1f6355.

⛔ Files ignored due to path filters (40)
  • control-plane-operator/controllers/hostedcontrolplane/testdata/catalog-operator/AROSwift/zz_fixture_TestControlPlaneComponents_catalog_operator_deployment.yaml is excluded by !**/testdata/**
  • control-plane-operator/controllers/hostedcontrolplane/testdata/catalog-operator/GCP/zz_fixture_TestControlPlaneComponents_catalog_operator_deployment.yaml is excluded by !**/testdata/**
  • control-plane-operator/controllers/hostedcontrolplane/testdata/catalog-operator/IBMCloud/zz_fixture_TestControlPlaneComponents_catalog_operator_deployment.yaml is excluded by !**/testdata/**
  • control-plane-operator/controllers/hostedcontrolplane/testdata/catalog-operator/TechPreviewNoUpgrade/zz_fixture_TestControlPlaneComponents_catalog_operator_deployment.yaml is excluded by !**/testdata/**
  • control-plane-operator/controllers/hostedcontrolplane/testdata/catalog-operator/zz_fixture_TestControlPlaneComponents_catalog_operator_deployment.yaml is excluded by !**/testdata/**
  • control-plane-operator/controllers/hostedcontrolplane/testdata/cluster-network-operator/AROSwift/zz_fixture_TestControlPlaneComponents_cluster_network_operator_deployment.yaml is excluded by !**/testdata/**
  • control-plane-operator/controllers/hostedcontrolplane/testdata/cluster-network-operator/GCP/zz_fixture_TestControlPlaneComponents_cluster_network_operator_deployment.yaml is excluded by !**/testdata/**
  • control-plane-operator/controllers/hostedcontrolplane/testdata/cluster-network-operator/IBMCloud/zz_fixture_TestControlPlaneComponents_cluster_network_operator_deployment.yaml is excluded by !**/testdata/**
  • control-plane-operator/controllers/hostedcontrolplane/testdata/cluster-network-operator/TechPreviewNoUpgrade/zz_fixture_TestControlPlaneComponents_cluster_network_operator_deployment.yaml is excluded by !**/testdata/**
  • control-plane-operator/controllers/hostedcontrolplane/testdata/cluster-network-operator/zz_fixture_TestControlPlaneComponents_cluster_network_operator_deployment.yaml is excluded by !**/testdata/**
  • control-plane-operator/controllers/hostedcontrolplane/testdata/ingress-operator/AROSwift/zz_fixture_TestControlPlaneComponents_ingress_operator_deployment.yaml is excluded by !**/testdata/**
  • control-plane-operator/controllers/hostedcontrolplane/testdata/ingress-operator/GCP/zz_fixture_TestControlPlaneComponents_ingress_operator_deployment.yaml is excluded by !**/testdata/**
  • control-plane-operator/controllers/hostedcontrolplane/testdata/ingress-operator/IBMCloud/zz_fixture_TestControlPlaneComponents_ingress_operator_deployment.yaml is excluded by !**/testdata/**
  • control-plane-operator/controllers/hostedcontrolplane/testdata/ingress-operator/TechPreviewNoUpgrade/zz_fixture_TestControlPlaneComponents_ingress_operator_deployment.yaml is excluded by !**/testdata/**
  • control-plane-operator/controllers/hostedcontrolplane/testdata/ingress-operator/zz_fixture_TestControlPlaneComponents_ingress_operator_deployment.yaml is excluded by !**/testdata/**
  • control-plane-operator/controllers/hostedcontrolplane/testdata/oauth-openshift/AROSwift/zz_fixture_TestControlPlaneComponents_oauth_openshift_deployment.yaml is excluded by !**/testdata/**
  • control-plane-operator/controllers/hostedcontrolplane/testdata/oauth-openshift/GCP/zz_fixture_TestControlPlaneComponents_oauth_openshift_deployment.yaml is excluded by !**/testdata/**
  • control-plane-operator/controllers/hostedcontrolplane/testdata/oauth-openshift/IBMCloud/zz_fixture_TestControlPlaneComponents_oauth_openshift_deployment.yaml is excluded by !**/testdata/**
  • control-plane-operator/controllers/hostedcontrolplane/testdata/oauth-openshift/TechPreviewNoUpgrade/zz_fixture_TestControlPlaneComponents_oauth_openshift_deployment.yaml is excluded by !**/testdata/**
  • control-plane-operator/controllers/hostedcontrolplane/testdata/oauth-openshift/zz_fixture_TestControlPlaneComponents_oauth_openshift_deployment.yaml is excluded by !**/testdata/**
  • control-plane-operator/controllers/hostedcontrolplane/testdata/olm-operator/AROSwift/zz_fixture_TestControlPlaneComponents_olm_operator_deployment.yaml is excluded by !**/testdata/**
  • control-plane-operator/controllers/hostedcontrolplane/testdata/olm-operator/GCP/zz_fixture_TestControlPlaneComponents_olm_operator_deployment.yaml is excluded by !**/testdata/**
  • control-plane-operator/controllers/hostedcontrolplane/testdata/olm-operator/IBMCloud/zz_fixture_TestControlPlaneComponents_olm_operator_deployment.yaml is excluded by !**/testdata/**
  • control-plane-operator/controllers/hostedcontrolplane/testdata/olm-operator/TechPreviewNoUpgrade/zz_fixture_TestControlPlaneComponents_olm_operator_deployment.yaml is excluded by !**/testdata/**
  • control-plane-operator/controllers/hostedcontrolplane/testdata/olm-operator/zz_fixture_TestControlPlaneComponents_olm_operator_deployment.yaml is excluded by !**/testdata/**
  • control-plane-operator/controllers/hostedcontrolplane/testdata/openshift-apiserver/AROSwift/zz_fixture_TestControlPlaneComponents_openshift_apiserver_deployment.yaml is excluded by !**/testdata/**
  • control-plane-operator/controllers/hostedcontrolplane/testdata/openshift-apiserver/GCP/zz_fixture_TestControlPlaneComponents_openshift_apiserver_deployment.yaml is excluded by !**/testdata/**
  • control-plane-operator/controllers/hostedcontrolplane/testdata/openshift-apiserver/IBMCloud/zz_fixture_TestControlPlaneComponents_openshift_apiserver_deployment.yaml is excluded by !**/testdata/**
  • control-plane-operator/controllers/hostedcontrolplane/testdata/openshift-apiserver/TechPreviewNoUpgrade/zz_fixture_TestControlPlaneComponents_openshift_apiserver_deployment.yaml is excluded by !**/testdata/**
  • control-plane-operator/controllers/hostedcontrolplane/testdata/openshift-apiserver/zz_fixture_TestControlPlaneComponents_openshift_apiserver_deployment.yaml is excluded by !**/testdata/**
  • control-plane-operator/controllers/hostedcontrolplane/testdata/openshift-oauth-apiserver/AROSwift/zz_fixture_TestControlPlaneComponents_openshift_oauth_apiserver_controlplanecomponent.yaml is excluded by !**/testdata/**
  • control-plane-operator/controllers/hostedcontrolplane/testdata/openshift-oauth-apiserver/GCP/zz_fixture_TestControlPlaneComponents_openshift_oauth_apiserver_controlplanecomponent.yaml is excluded by !**/testdata/**
  • control-plane-operator/controllers/hostedcontrolplane/testdata/openshift-oauth-apiserver/IBMCloud/zz_fixture_TestControlPlaneComponents_openshift_oauth_apiserver_controlplanecomponent.yaml is excluded by !**/testdata/**
  • control-plane-operator/controllers/hostedcontrolplane/testdata/openshift-oauth-apiserver/TechPreviewNoUpgrade/zz_fixture_TestControlPlaneComponents_openshift_oauth_apiserver_controlplanecomponent.yaml is excluded by !**/testdata/**
  • control-plane-operator/controllers/hostedcontrolplane/testdata/openshift-oauth-apiserver/zz_fixture_TestControlPlaneComponents_openshift_oauth_apiserver_controlplanecomponent.yaml is excluded by !**/testdata/**
  • control-plane-operator/controllers/hostedcontrolplane/testdata/packageserver/AROSwift/zz_fixture_TestControlPlaneComponents_packageserver_deployment.yaml is excluded by !**/testdata/**
  • control-plane-operator/controllers/hostedcontrolplane/testdata/packageserver/GCP/zz_fixture_TestControlPlaneComponents_packageserver_deployment.yaml is excluded by !**/testdata/**
  • control-plane-operator/controllers/hostedcontrolplane/testdata/packageserver/IBMCloud/zz_fixture_TestControlPlaneComponents_packageserver_deployment.yaml is excluded by !**/testdata/**
  • control-plane-operator/controllers/hostedcontrolplane/testdata/packageserver/TechPreviewNoUpgrade/zz_fixture_TestControlPlaneComponents_packageserver_deployment.yaml is excluded by !**/testdata/**
  • control-plane-operator/controllers/hostedcontrolplane/testdata/packageserver/zz_fixture_TestControlPlaneComponents_packageserver_deployment.yaml is excluded by !**/testdata/**
📒 Files selected for processing (5)
  • control-plane-operator/controllers/hostedcontrolplane/v2/oauth_apiserver/component.go
  • konnectivity-socks5-proxy/guard.go
  • konnectivity-socks5-proxy/guard_test.go
  • konnectivity-socks5-proxy/main.go
  • support/controlplane-component/konnectivity-container.go

Comment thread konnectivity-socks5-proxy/guard.go Outdated

@coderabbitai coderabbitai Bot left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 1

♻️ Duplicate comments (1)
konnectivity-socks5-proxy/main.go (1)

94-96: ⚠️ Potential issue | 🟠 Major | ⚡ Quick win

Treat bootstrap-time context cancellation as a clean exit.

If runWithCoreGuard returns context.Canceled while still bootstrapping, Lines 94-96 still print an error and os.Exit(1). That makes an intentional SIGTERM during startup look like a crash.

Suggested change
 import (
+	"context"
+	"errors"
 	"fmt"
 	"net"
 	"os"
@@
 		})
 		if err != nil {
+			if errors.Is(err, context.Canceled) {
+				return
+			}
 			fmt.Fprintf(os.Stderr, "Error: %v\n", err)
 			os.Exit(1)
 		}
 	}
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@konnectivity-socks5-proxy/main.go` around lines 94 - 96, When handling the
error returned from runWithCoreGuard, treat bootstrap-time cancellation as a
clean exit by checking if errors.Is(err, context.Canceled) (or err ==
context.Canceled) and, if so, do not print an error and exit 0; otherwise keep
the existing fmt.Fprintf(os.Stderr, ...) and os.Exit(1). Update the error
handling around runWithCoreGuard to perform this conditional check (and add an
import for the errors package if needed) so a SIGTERM during startup is treated
as a normal shutdown.
🧹 Nitpick comments (1)
konnectivity-socks5-proxy/main_test.go (1)

100-134: ⚡ Quick win

Exercise the production shutdown path instead of a copied closure.

This subtest reimplements the listener/Serve/ctx.Done() flow locally, so it can stay green even if NewStartCommand drifts. Extract that block into a small package-private helper and call the helper from both main.go and this test.

As per coding guidelines, "Unit test any code changes and additions".

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@konnectivity-socks5-proxy/main_test.go` around lines 100 - 134, The test
duplicates the listener/Serve/ctx.Done() shutdown logic (serveFunc) instead of
exercising the real production path; refactor that logic into a package-private
helper (e.g., create a function serveWithGracefulShutdown(ctx context.Context,
servingPort uint32, server *socks5.Server) error) and replace the test-local
closure and the implementation in main.go to call this helper; update
NewStartCommand (or the function in main.go that currently inlines the logic) to
call serveWithGracefulShutdown so the test can simply invoke the real code path
and verify graceful shutdown, and adjust tests to pass a server instance and
listener port to the new helper.
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Inline comments:
In `@konnectivity-socks5-proxy/guard_test.go`:
- Around line 213-219: The test currently dials a hardcoded port 54321 which can
be occupied on CI; instead create a short-lived listener on "127.0.0.1:0" to get
an ephemeral port, read its actual port from listener.Addr(), close the listener
to free the port, then call dialKonnectivityServerTCP(ctx, "127.0.0.1", port)
and assert error with isTransientKonnectivityError; update the anonymous t.Run
block to use this ephemeral-port pattern and keep the same assertions using
dialKonnectivityServerTCP and isTransientKonnectivityError.

---

Duplicate comments:
In `@konnectivity-socks5-proxy/main.go`:
- Around line 94-96: When handling the error returned from runWithCoreGuard,
treat bootstrap-time cancellation as a clean exit by checking if errors.Is(err,
context.Canceled) (or err == context.Canceled) and, if so, do not print an error
and exit 0; otherwise keep the existing fmt.Fprintf(os.Stderr, ...) and
os.Exit(1). Update the error handling around runWithCoreGuard to perform this
conditional check (and add an import for the errors package if needed) so a
SIGTERM during startup is treated as a normal shutdown.

---

Nitpick comments:
In `@konnectivity-socks5-proxy/main_test.go`:
- Around line 100-134: The test duplicates the listener/Serve/ctx.Done()
shutdown logic (serveFunc) instead of exercising the real production path;
refactor that logic into a package-private helper (e.g., create a function
serveWithGracefulShutdown(ctx context.Context, servingPort uint32, server
*socks5.Server) error) and replace the test-local closure and the implementation
in main.go to call this helper; update NewStartCommand (or the function in
main.go that currently inlines the logic) to call serveWithGracefulShutdown so
the test can simply invoke the real code path and verify graceful shutdown, and
adjust tests to pass a server instance and listener port to the new helper.
🪄 Autofix (Beta)

Fix all unresolved CodeRabbit comments on this PR:

  • Push a commit to this branch (recommended)
  • Create a new PR with the fixes

ℹ️ Review info
⚙️ Run configuration

Configuration used: Repository YAML (base), Central YAML (inherited)

Review profile: CHILL

Plan: Enterprise

Run ID: 099bd414-6684-4f69-8d67-a453ed5f21b3

📥 Commits

Reviewing files that changed from the base of the PR and between c1f6355 and 922b57f.

📒 Files selected for processing (5)
  • konnectivity-socks5-proxy/guard.go
  • konnectivity-socks5-proxy/guard_test.go
  • konnectivity-socks5-proxy/main.go
  • konnectivity-socks5-proxy/main_test.go
  • support/controlplane-component/konnectivity-container_test.go

Comment thread konnectivity-socks5-proxy/guard_test.go
@YamunadeviShanmugam

Copy link
Copy Markdown
Contributor Author

/test all

@YamunadeviShanmugam

Copy link
Copy Markdown
Contributor Author

/pipeline required

@openshift-merge-bot

Copy link
Copy Markdown
Contributor

Scheduling tests matching the pipeline_run_if_changed or not excluded by pipeline_skip_if_only_changed parameters:
/test e2e-aks
/test e2e-aws
/test e2e-aws-upgrade-hypershift-operator
/test e2e-azure-self-managed
/test e2e-kubevirt-aws-ovn-reduced
/test e2e-v2-aws
/test e2e-v2-gke

@YamunadeviShanmugam

Copy link
Copy Markdown
Contributor Author

/test e2e-v2-gke

@hypershift-jira-solve-ci

Copy link
Copy Markdown

AI Test Failure Analysis

Job: pull-ci-openshift-hypershift-main-e2e-aws | Build: 2058851957484818432 | Cost: $2.74653275 | Failed step: hypershift-aws-run-e2e-nested

View full analysis report


Generated by hypershift-analyze-e2e-failure post-step using Claude claude-opus-4-6

@cwbotbot

cwbotbot commented May 25, 2026

Copy link
Copy Markdown

Test Results

e2e-aws

e2e-aks

@YamunadeviShanmugam

Copy link
Copy Markdown
Contributor Author

/test e2e-aws

@jparrill

Copy link
Copy Markdown
Contributor

/ok-to-test

@openshift-ci openshift-ci Bot added the ok-to-test Indicates a non-member PR verified by an org member that is safe to test. label May 26, 2026
@jparrill

Copy link
Copy Markdown
Contributor

/area dependencies

@openshift-ci

openshift-ci Bot commented May 26, 2026

Copy link
Copy Markdown
Contributor

@jparrill: The label(s) area/dependencies cannot be applied, because the repository doesn't have them.

Details

In response to this:

/area dependencies

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository.

@jparrill jparrill left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Dropped some comments. Thanks

Apart from those comments inline, you will need to:

  • Create a bug in OCPBUGS project
  • /retitle OCPBUGS-XXXXX: Konnectivity retry proxy connection on timeout
  • /jira refresh

Comment thread konnectivity-socks5-proxy/guard.go Outdated
Comment thread konnectivity-socks5-proxy/guard.go Outdated
Comment thread konnectivity-socks5-proxy/guard.go Outdated
Comment thread konnectivity-socks5-proxy/guard.go
Comment thread konnectivity-socks5-proxy/main.go Outdated
Comment thread konnectivity-socks5-proxy/guard_test.go
Comment thread konnectivity-socks5-proxy/guard_test.go Outdated
Comment thread konnectivity-socks5-proxy/guard_test.go Outdated
Comment thread konnectivity-socks5-proxy/main_test.go Outdated
Comment thread support/controlplane-component/konnectivity-container_test.go Outdated
Comment thread konnectivity-socks5-proxy/guard.go
@YamunadeviShanmugam

Copy link
Copy Markdown
Contributor Author

/pipeline required

@openshift-merge-bot

Copy link
Copy Markdown
Contributor

Scheduling tests matching the pipeline_run_if_changed or not excluded by pipeline_skip_if_only_changed parameters:
/test e2e-aks
/test e2e-aws
/test e2e-aws-upgrade-hypershift-operator
/test e2e-azure-self-managed
/test e2e-kubevirt-aws-ovn-reduced
/test e2e-v2-aws
/test e2e-v2-gke

@openshift-ci openshift-ci Bot added area/platform/gcp PR/issue for GCP (GCPPlatform) platform area/platform/ibmcloud PR/issue for IBMCloud (IBMCloudPlatform) platform area/platform/kubevirt PR/issue for KubeVirt (KubevirtPlatform) platform area/platform/openstack PR/issue for OpenStack (OpenStackPlatform) platform area/platform/powervs PR/issue for PowerVS (PowerVSPlatform) platform area/testing Indicates the PR includes changes for e2e testing labels Jun 5, 2026
@YamunadeviShanmugam YamunadeviShanmugam force-pushed the fix-add-startup-probe-konnectivity-sock5-sidecar branch 2 times, most recently from 582c8ea to 95f2d94 Compare June 5, 2026 12:29
The proxy was exiting hard on startup timeouts. This change adds
exponential backoff retry logic with startup probes.
@YamunadeviShanmugam YamunadeviShanmugam force-pushed the fix-add-startup-probe-konnectivity-sock5-sidecar branch from 95f2d94 to fa71f72 Compare June 5, 2026 12:57
@YamunadeviShanmugam

Copy link
Copy Markdown
Contributor Author

/test e2e-aks

@jparrill jparrill left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

/lgtm

@openshift-ci openshift-ci Bot added the lgtm Indicates that a PR is ready to be merged. label Jun 16, 2026
@openshift-merge-bot

Copy link
Copy Markdown
Contributor

Scheduling tests matching the pipeline_run_if_changed or not excluded by pipeline_skip_if_only_changed parameters:
/test e2e-aks
/test e2e-aws
/test e2e-aws-upgrade-hypershift-operator
/test e2e-azure-v2-self-managed
/test e2e-kubevirt-aws-ovn-reduced
/test e2e-v2-aws
/test e2e-v2-gke

@openshift-ci openshift-ci Bot added the needs-rebase Indicates a PR cannot be merged because it has merge conflicts with HEAD. label Jun 16, 2026
@openshift-ci openshift-ci Bot removed the lgtm Indicates that a PR is ready to be merged. label Jun 16, 2026
@openshift-ci

openshift-ci Bot commented Jun 16, 2026

Copy link
Copy Markdown
Contributor

New changes are detected. LGTM label has been removed.

@openshift-ci openshift-ci Bot removed the needs-rebase Indicates a PR cannot be merged because it has merge conflicts with HEAD. label Jun 16, 2026
@openshift-ci

openshift-ci Bot commented Jun 16, 2026

Copy link
Copy Markdown
Contributor

[APPROVALNOTIFIER] This PR is NOT APPROVED

This pull-request has been approved by: YamunadeviShanmugam
Once this PR has been reviewed and has the lgtm label, please ask for approval from jparrill. For more information see the Code Review Process.

The full list of commands accepted by this bot can be found here.

Details Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@YamunadeviShanmugam

Copy link
Copy Markdown
Contributor Author

/pipeline required

@openshift-merge-bot

Copy link
Copy Markdown
Contributor

Scheduling tests matching the pipeline_run_if_changed or not excluded by pipeline_skip_if_only_changed parameters:
/test e2e-aks
/test e2e-aws
/test e2e-aws-upgrade-hypershift-operator
/test e2e-azure-v2-self-managed
/test e2e-kubevirt-aws-ovn-reduced
/test e2e-v2-aws
/test e2e-v2-gke

@YamunadeviShanmugam

Copy link
Copy Markdown
Contributor Author

/retest

2 similar comments
@YamunadeviShanmugam

Copy link
Copy Markdown
Contributor Author

/retest

@YamunadeviShanmugam

Copy link
Copy Markdown
Contributor Author

/retest

@openshift-ci

openshift-ci Bot commented Jun 17, 2026

Copy link
Copy Markdown
Contributor

@YamunadeviShanmugam: The following test failed, say /retest to rerun all failed tests or /retest-required to rerun all mandatory failed tests:

Test name Commit Details Required Rerun command
ci/prow/e2e-kubevirt-aws-ovn-reduced 07dfe64 link true /test e2e-kubevirt-aws-ovn-reduced

Full PR test history. Your PR dashboard.

Details

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository. I understand the commands that are listed here.

@hypershift-jira-solve-ci

Copy link
Copy Markdown

No symptom labels directory. I have all the evidence I need. Let me compile the final report:

Test Failure Analysis Complete

Job Information

Test Failure Analysis

Error

Bootstrap failed to complete: timed out waiting for the condition
Failed to wait for bootstrapping to complete. This error usually happens when there is a problem with control plane hosts that prevents the control plane operators from creating the control plane.
Installer exit with code 5

Summary

The management cluster installation (IPI on AWS) failed during the bootstrap phase due to the ingress router pod continuously crash-looping with startup probe failures. The router pod's backend-proxy-http and has-synced health checks returned HTTP 500 for over 20 minutes, causing 8 container restarts. This prevented the ingress operator from becoming Available, which cascaded to block monitoring (unable to create Routes), authentication (no ingress endpoints), and openshift-apiserver (PreconditionNotReady). The bootstrap timed out after ~55 minutes waiting for these operators. This is a management cluster infrastructure failure unrelated to the PR changes, which modify Konnectivity retry proxy connection logic in the HyperShift control plane — code that is never exercised during management cluster IPI installation.

Root Cause

The root cause is a management cluster ingress router startup failure that created a cascading operator unavailability chain:

  1. Router pod scheduling delay: The router pod was created at 15:29:59Z but couldn't be scheduled until 15:47:05Z (~17 minutes) because only 3 control-plane nodes existed with taints that the router couldn't tolerate. It had to wait for the single c5n.metal worker node (ip-10-0-48-9) to join the cluster at 15:46:15Z.

  2. Router startup probe crash-loop: Once scheduled and started at 15:47:13Z, the router container repeatedly failed its startup probe with HTTP 500:

    [-]backend-proxy-http failed: reason withheld
    [-]has-synced failed: reason withheld
    [+]process-running ok
    healthz check failed
    

    This probe failure occurred 872 times across 8 container restarts. The backend-proxy-http check indicates the router's backend proxy connection (to the API server or backend services) was not establishing correctly. The has-synced check indicates the router's configuration sync from the API had not completed.

  3. Cascading operator failures:

    • Ingress operator: Available=False (router deployment MinimumReplicasUnavailable, ProgressDeadlineExceeded)
    • Monitoring operator: Available=False — could not create Routes (routes.route.openshift.io API not ready because openshift-apiserver was unavailable)
    • Authentication operator: Available=False — oauth-server endpoints not ready, connection refused to oauth service
    • OpenShift API Server: Available=FalseAPIServices_PreconditionNotReady
    • Console operator: Available=Unknown (no data)
  4. Bootstrap timeout: After ~55 minutes, the installer gave up waiting for these operators to become Available and exited with code 5.

This failure is not caused by PR #8579. The PR modifies Konnectivity retry proxy connection timeout handling in the HyperShift hosted control plane. The management cluster IPI installation that failed here does not use HyperShift or Konnectivity — it is a standard OpenShift cluster being provisioned as the management cluster. The router crash-loop is an infrastructure-level issue, likely related to timing or transient network conditions during the bare-metal worker node (c5n.metal) provisioning.

Recommendations
  1. Retry the job — This is an infrastructure-level installation failure unrelated to the PR code changes. The router crash-loop during bootstrap is a known class of transient failure in CI environments, especially with c5n.metal bare-metal worker instances that take longer to provision.

  2. No code changes needed — PR OCPBUGS-86661: Konnectivity retry proxy connection on timeout #8579 modifies HyperShift Konnectivity proxy retry logic, which is only used in hosted cluster control planes. The failure occurred during management cluster IPI installation, before any HyperShift code is exercised.

  3. If the failure recurs, investigate:

    • Whether c5n.metal instance provisioning time is consistently causing bootstrap timeouts (worker joined 17 min after router pod creation, then router took another 20+ min to crash-loop through startup probes)
    • Whether the router backend-proxy-http startup probe failures correlate with specific OCP payload versions (the release used was 4.22.0-0.ci-2026-06-17-021229)
Evidence
Evidence Detail
Failed Step ipi-install-install (pre phase), exit code 5 after 55m34s
Bootstrap Error Bootstrap failed to complete: timed out waiting for the condition at 16:08:52Z
Router Pod router-default-5f4bc8c585-74tvs — 8 restarts, never became ready
Probe Failure Startup probe HTTP 500: backend-proxy-http failed, has-synced failed (872 occurrences)
Scheduling Delay Router pod created 15:29:59Z, scheduled 15:47:05Z (waited ~17min for worker node)
Worker Node ip-10-0-48-9.ec2.internal (c5n.metal) joined at 15:46:15Z
Ingress Operator Available=False, Degraded=True — MinimumReplicasUnavailable, ProgressDeadlineExceeded
Monitoring Operator Available=False — cannot create Routes (routes.route.openshift.io API unavailable)
Authentication Operator Available=False — oauth service endpoints not ready, connection refused
OpenShift API Server Available=False — APIServices_PreconditionNotReady
Deployment Status router-default — 0 ready replicas, ProgressDeadlineExceeded
PR Relevance PR #8579 modifies Konnectivity retry proxy logic (hosted control plane only), not used during IPI install

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

area/ai Indicates the PR includes changes related to AI - Claude agents, Cursor rules, etc. area/api Indicates the PR includes changes for the API area/cli Indicates the PR includes changes for CLI area/control-plane-operator Indicates the PR includes changes for the control plane operator - in an OCP release area/control-plane-pki-operator Indicates the PR includes changes for the control plane PKI operator - in an OCP release area/documentation Indicates the PR includes changes for documentation area/hypershift-operator Indicates the PR includes changes for the hypershift operator and API - outside an OCP release area/karpenter-operator Indicates the PR includes changes related to the Karpenter operator area/platform/aws PR/issue for AWS (AWSPlatform) platform area/platform/azure PR/issue for Azure (AzurePlatform) platform area/platform/gcp PR/issue for GCP (GCPPlatform) platform area/platform/ibmcloud PR/issue for IBMCloud (IBMCloudPlatform) platform area/platform/kubevirt PR/issue for KubeVirt (KubevirtPlatform) platform area/platform/openstack PR/issue for OpenStack (OpenStackPlatform) platform area/platform/powervs PR/issue for PowerVS (PowerVSPlatform) platform area/testing Indicates the PR includes changes for e2e testing jira/severity-low Referenced Jira bug's severity is low for the branch this PR is targeting. jira/valid-bug Indicates that a referenced Jira bug is valid for the branch this PR is targeting. jira/valid-reference Indicates that this PR references a valid Jira ticket of any type. ok-to-test Indicates a non-member PR verified by an org member that is safe to test.

Projects

None yet

Development

Successfully merging this pull request may close these issues.

7 participants