OCPBUGS-77557: propagate additionalTrustBundle to AWS control plane components#7907

Open
sdminonne wants to merge 2 commits into openshift:main from sdminonne:OCPBUGS-77557

Conversation

sdminonne commented Mar 10, 2026

Contributor

Summary

  • Add DeploymentAddAWSCABundleVolume helper that creates a combined CA bundle (system + user CAs) via an init container and sets AWS_CA_BUNDLE on the main container
  • Wire trust bundle propagation into all AWS control plane components when AdditionalTrustBundle is set on the HostedControlPlane spec:
    • aws-cloud-controller-manager
    • capi-provider
    • ingress-operator
    • karpenter
    • karpenter-operator
    • aws-node-termination-handler
    • kube-apiserver AWS KMS sidecars (aws-kms-active, aws-kms-backup) when SecretEncryption.KMS.Provider is AWS
  • Add unit tests for all components and a v2 e2e test for aws-cloud-controller-manager

Problem

In isolated AWS environments (e.g., US-ISO regions), custom CA bundles specified via HostedCluster.Spec.AdditionalTrustBundle are not propagated to AWS control plane components. This causes TLS verification failures when these components call AWS API endpoints:

Post https://sts.us-iso-east-1.c2s.ic.gov: tls: failed to verify certificate:
x509: certificate signed by unknown authority

Why not reuse DeploymentAddTrustBundleVolume?

The existing helper mounts a ConfigMap as a directory at /etc/pki/tls/certs, which replaces the entire system CA directory. This works for in-house components (CPO, ignition-server, OAPI) whose TLS needs are tightly controlled. However, the affected components are binaries that make HTTPS calls to standard AWS service endpoints (EC2, ELB, STS, SQS, KMS). The AWS SDK's default HTTP client loads the system CA store from /etc/pki/tls/certs to verify TLS certificates. Replacing that directory with a ConfigMap containing only the custom CA would cause the binary to lose the public root CAs (e.g., Amazon Trust Services), breaking connectivity to standard AWS API endpoints.

Why AWS_CA_BUNDLE with a combined bundle?

The AWS SDK (both v1 and v2) reads AWS_CA_BUNDLE and uses it instead of the system CA bundle — it creates a new empty x509.CertPool and loads only the specified file. To avoid losing trust in standard AWS endpoints, an init container concatenates the system CAs (/etc/pki/tls/certs/ca-bundle.crt) with the user-provided CAs from additionalTrustBundle into a single combined PEM file. AWS_CA_BUNDLE points to this combined file, ensuring the AWS SDK trusts both system and custom CAs.
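
The replacement behaviour described above can be illustrated with a small standard-library sketch. It does not use the AWS SDK and is not the PR's code: `newCAPEM`, `combineBundles`, and `trusts` are hypothetical helpers, and the two throwaway self-signed certificates merely stand in for a public root and a user-provided CA. The assumption is that a client which builds its root pool only from the file named by AWS_CA_BUNDLE behaves like `trusts` does here.

```go
package main

import (
	"bytes"
	"crypto/ecdsa"
	"crypto/elliptic"
	"crypto/rand"
	"crypto/x509"
	"crypto/x509/pkix"
	"encoding/pem"
	"fmt"
	"math/big"
	"time"
)

// newCAPEM returns a throwaway self-signed CA certificate in PEM form.
// It exists only so the example has real certificates to work with.
func newCAPEM(commonName string) ([]byte, error) {
	key, err := ecdsa.GenerateKey(elliptic.P256(), rand.Reader)
	if err != nil {
		return nil, err
	}
	tmpl := &x509.Certificate{
		SerialNumber:          big.NewInt(1),
		Subject:               pkix.Name{CommonName: commonName},
		NotBefore:             time.Now().Add(-time.Hour),
		NotAfter:              time.Now().Add(time.Hour),
		IsCA:                  true,
		BasicConstraintsValid: true,
		KeyUsage:              x509.KeyUsageCertSign,
	}
	der, err := x509.CreateCertificate(rand.Reader, tmpl, tmpl, &key.PublicKey, key)
	if err != nil {
		return nil, err
	}
	return pem.EncodeToMemory(&pem.Block{Type: "CERTIFICATE", Bytes: der}), nil
}

// combineBundles mimics the init container's concatenation step. It inserts
// a newline between the inputs when the first one lacks a trailing newline,
// so two PEM blocks never end up sharing a line.
func combineBundles(system, user []byte) []byte {
	var out bytes.Buffer
	out.Write(system)
	if len(system) > 0 && !bytes.HasSuffix(system, []byte("\n")) {
		out.WriteByte('\n')
	}
	out.Write(user)
	return out.Bytes()
}

// trusts reports whether a pool built ONLY from bundlePEM accepts certPEM.
func trusts(bundlePEM, certPEM []byte) bool {
	pool := x509.NewCertPool() // starts empty: no system roots
	if !pool.AppendCertsFromPEM(bundlePEM) {
		return false
	}
	block, _ := pem.Decode(certPEM)
	if block == nil {
		return false
	}
	cert, err := x509.ParseCertificate(block.Bytes)
	if err != nil {
		return false
	}
	_, err = cert.Verify(x509.VerifyOptions{Roots: pool})
	return err == nil
}

func main() {
	systemCA, err := newCAPEM("stand-in system root")
	if err != nil {
		panic(err)
	}
	userCA, err := newCAPEM("stand-in user root")
	if err != nil {
		panic(err)
	}
	combined := combineBundles(systemCA, userCA)
	fmt.Println("user-only bundle trusts system CA:", trusts(userCA, systemCA))
	fmt.Println("combined bundle trusts system CA:", trusts(combined, systemCA))
	fmt.Println("combined bundle trusts user CA:", trusts(combined, userCA))
}
```

With a user-only bundle the stand-in system root is rejected, while the combined bundle accepts both, which is the property the init container is meant to provide.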

KAS KMS sidecars

When secret encryption uses AWS KMS (SecretEncryption.KMS.Provider == AWS), the aws-kms-active and aws-kms-backup sidecar containers in the kube-apiserver deployment also need access to the combined CA bundle. These sidecars call AWS KMS endpoints to encrypt/decrypt data encryption keys. The aws-kms-token-minter sidecar is intentionally excluded as it does not make AWS API calls.

Test plan

  • Unit tests verify volume, init container, mount, and env var presence when AdditionalTrustBundle is set
  • Unit tests verify no volume/env var when AdditionalTrustBundle is nil
  • Unit tests verify non-AWS platforms are unaffected (capi-provider, ingress-operator, karpenter-operator)
  • Unit tests verify aws-kms-active and aws-kms-backup get volume mount and AWS_CA_BUNDLE env var
  • Unit tests verify aws-kms-token-minter is not wired
  • Unit tests verify no KMS wiring when KMS containers are absent
  • v2 e2e test verifies AWS_CA_BUNDLE wiring on aws-cloud-controller-manager (add and remove)
  • make test passes
  • make verify passes

Fixes: https://issues.redhat.com/browse/OCPBUGS-77557

🤖 Generated with Claude Code

@openshift-ci-robot

Pipeline controller notification
This repo is configured to use the pipeline controller. Second-stage tests will be triggered either automatically or after lgtm label is added, depending on the repository configuration. The pipeline controller will automatically detect which contexts are required and will utilize /test Prow commands to trigger the second stage.

For optional jobs, comment /test ? to see a list of all defined jobs. To manually trigger all second-stage jobs, use the /pipeline required command.

This repository is configured in: LGTM mode

@coderabbitai

coderabbitai Bot commented Mar 10, 2026

Contributor

Walkthrough

Adds support for wiring an AWS CA bundle into deployments when a HostedControlPlane has Spec.AdditionalTrustBundle set and the platform is AWS. A new utility, DeploymentAddAWSCABundleVolume, constructs user/system CA volumes, an init container to produce a combined bundle, mounts it into main containers, and sets AWS_CA_BUNDLE. Multiple hosted control plane components now call this utility during their deployment adaptation; tests and e2e checks were added to validate presence/absence of volumes, mounts, init containers, and the AWS_CA_BUNDLE env var.

Sequence Diagram(s)

mermaid
sequenceDiagram
participant HCP as HostedControlPlane
participant Controller as Control Plane Operator
participant Util as support/util
participant Deployment as Kubernetes Deployment
participant KubeAPI as Kubernetes API
HCP->>Controller: Reconcile / adaptDeployment invoked
Controller->>Controller: check Platform == AWS && AdditionalTrustBundle != nil
Controller->>Util: DeploymentAddAWSCABundleVolume(trustBundleConfigMap, deployment, initImage)
Util->>Deployment: add user-ca ConfigMap volume
Util->>Deployment: add aws-ca-bundle EmptyDir + init container (concat CAs)
Util->>Deployment: add volumeMount to main container + set AWS_CA_BUNDLE env
Controller->>KubeAPI: apply updated Deployment
KubeAPI-->>Deployment: Deployment updated/applied

Changes

| Cohort / File(s) | Summary |
|---|---|
| Utility: support/util/volumes.go | Adds DeploymentAddAWSCABundleVolume(...) to add user CA ConfigMap volume, aws-ca-bundle EmptyDir, a setup init-container to combine CAs, mount the combined bundle into containers, and set AWS_CA_BUNDLE. |
| CAPI Provider: control-plane-operator/controllers/hostedcontrolplane/v2/capi_provider/deployment.go and test | Calls DeploymentAddAWSCABundleVolume when platform is AWS and AdditionalTrustBundle is present; adds tests validating volumes, mounts, init container, and AWS_CA_BUNDLE. |
| AWS Cloud Controller Manager: control-plane-operator/controllers/hostedcontrolplane/v2/cloud_controller_manager/aws/component.go, deployment.go, tests | Registers and implements adaptDeployment to invoke DeploymentAddAWSCABundleVolume for AWS+AdditionalTrustBundle; adds tests asserting expected wiring. |
| Ingress Operator: control-plane-operator/controllers/hostedcontrolplane/v2/ingressoperator/deployment.go and test | Adds conditional call to DeploymentAddAWSCABundleVolume in adaptDeployment for AWS with AdditionalTrustBundle; tests added. |
| AWS Node Termination Handler: control-plane-operator/controllers/hostedcontrolplane/v2/awsnodeterminationhandler/deployment.go and test | Injects AWS CA bundle wiring when AdditionalTrustBundle exists; tests added to verify volumes, init container, mounts, and env var. |
| Karpenter & Karpenter Operator: control-plane-operator/controllers/hostedcontrolplane/v2/karpenter/deployment.go, karpenteroperator/deployment.go and tests | Adds conditional AWS CA bundle wiring in adaptDeployment for Karpenter and its operator; tests validate presence/absence of CA resources and AWS_CA_BUNDLE. |
| E2E test: test/e2e/nodepool_additionalTrustBundlePropagation_test.go | Adds runtime checks (AWS-only) verifying aws-cloud-controller-manager deployment has aws-ca-bundle EmptyDir, setup-aws-ca-bundle init container, and AWS_CA_BUNDLE env var present/absent across bundle add/remove scenarios. |
🚥 Pre-merge checks | ✅ 2
✅ Passed checks (2 passed)
Check name Status Explanation
Title check ✅ Passed The title clearly and concisely describes the main change: propagating additionalTrustBundle to AWS control plane components, which is the central theme of all modifications across multiple deployment files.
Description Check ✅ Passed Check skipped - CodeRabbit’s high-level summary is enabled.

@openshift-ci openshift-ci Bot requested review from bryan-cox and enxebre March 10, 2026 15:48
@openshift-ci openshift-ci Bot added area/control-plane-operator Indicates the PR includes changes for the control plane operator - in an OCP release area/hypershift-operator Indicates the PR includes changes for the hypershift operator and API - outside an OCP release area/platform/aws PR/issue for AWS (AWSPlatform) platform and removed do-not-merge/needs-area labels Mar 10, 2026
@sdminonne

Contributor Author

/assign @enxebre

enxebre commented Mar 10, 2026

Member

how about karpenter-aws and aws-node-termination-handler?

enxebre commented Mar 10, 2026

Member

"The AWS SDK (both v1 and v2) reads AWS_CA_BUNDLE and appends those CAs to the system cert pool."

That statement seems to contradict the docs at https://docs.aws.amazon.com/sdk-for-go/api/aws/session/ ("Custom CA Bundle" section):

"Path to a custom Credentials Authority (CA) bundle PEM file that the SDK will use instead of the default system's root CA bundle. Use this only if you want to replace the CA bundle the SDK uses for TLS requests."

@sdminonne sdminonne changed the title fix(cpo): propagate additionalTrustBundle to AWS control plane components fix(OCPBUGS-77557): propagate additionalTrustBundle to AWS control plane components Mar 11, 2026
@sdminonne

Contributor Author

/jira refresh

@openshift-ci-robot

@sdminonne: No Jira issue is referenced in the title of this pull request.
To reference a jira issue, add 'XYZ-NNN:' to the title of this pull request and request another refresh with /jira refresh.

Details

In response to this:

/jira refresh

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the openshift-eng/jira-lifecycle-plugin repository.

@sdminonne sdminonne changed the title fix(OCPBUGS-77557): propagate additionalTrustBundle to AWS control plane components OCPBUGS-77557: propagate additionalTrustBundle to AWS control plane components Mar 11, 2026
@openshift-ci-robot openshift-ci-robot added jira/valid-reference Indicates that this PR references a valid Jira ticket of any type. jira/invalid-bug Indicates that a referenced Jira bug is invalid for the branch this PR is targeting. labels Mar 11, 2026
@openshift-ci-robot

@sdminonne: This pull request references Jira Issue OCPBUGS-77557, which is invalid:

  • expected the bug to target either version "4.22." or "openshift-4.22.", but it targets "4.21.z" instead

Comment /jira refresh to re-evaluate validity if changes to the Jira bug are made, or edit the title of this pull request to link to a different bug.

The bug has been updated to refer to the pull request using the external bug tracker.

Details

In response to this:

Summary

  • Add DeploymentAddAWSCABundleVolume helper that mounts the user-ca-bundle ConfigMap at a non-conflicting path (/etc/pki/ca-trust/extracted/hypershift/) and sets the AWS_CA_BUNDLE environment variable
  • Wire trust bundle propagation into aws-cloud-controller-manager, capi-provider, and ingress-operator deployments when AdditionalTrustBundle is set on the HostedControlPlane spec
  • Add unit tests for all three components

Problem

In isolated AWS environments (e.g., US-ISO regions), custom CA bundles specified via HostedCluster.Spec.AdditionalTrustBundle are not propagated to three control plane components: aws-cloud-controller-manager, ingress-operator, and capi-provider. This causes TLS verification failures when these components call AWS STS endpoints:

Post https://sts.us-iso-east-1.c2s.ic.gov: tls: failed to verify certificate:
x509: certificate signed by unknown authority

Why not reuse DeploymentAddTrustBundleVolume?

The existing helper mounts a ConfigMap as a directory at /etc/pki/tls/certs, which replaces the entire system CA directory. This works for in-house components (CPO, ignition-server, OAPI) whose TLS needs are tightly controlled. However, the three affected components are third-party binaries that make HTTPS calls to standard AWS service endpoints (EC2, ELB, STS). The AWS SDK's default HTTP client loads the system CA store from /etc/pki/tls/certs to verify TLS certificates on those connections. Replacing that directory with a ConfigMap containing only the custom CA would cause the binary to lose the public root CAs (e.g., Amazon Trust Services), breaking connectivity to standard AWS API endpoints.

Why AWS_CA_BUNDLE?

The AWS SDK (both v1 and v2) reads AWS_CA_BUNDLE and appends those CAs to the system cert pool. This means standard AWS endpoints continue to work via system CAs while also trusting custom CAs needed in isolated regions.

Test plan

  • Unit tests verify volume, mount, and env var presence when AdditionalTrustBundle is set
  • Unit tests verify no volume/env var when AdditionalTrustBundle is nil
  • Unit tests verify non-AWS platforms are unaffected (capi-provider, ingress-operator)
  • make test passes
  • make verify passes

Fixes: https://issues.redhat.com/browse/OCPBUGS-77557

🤖 Generated with Claude Code

Summary by CodeRabbit

  • New Features

  • Added AWS CA bundle support across control plane deployments. When an additional trust bundle is configured on AWS platforms, it is now properly mounted and integrated into deployments, enabling components to use custom CA certificates.

  • Tests

  • Added test coverage for AWS CA bundle deployment configuration across multiple components.

@sdminonne

Contributor Author

/jira refresh

@openshift-ci-robot

@sdminonne: This pull request references Jira Issue OCPBUGS-77557, which is invalid:

  • expected the bug to target either version "4.22." or "openshift-4.22.", but it targets "4.21.z" instead

Comment /jira refresh to re-evaluate validity if changes to the Jira bug are made, or edit the title of this pull request to link to a different bug.

@sdminonne

Contributor Author

/jira refresh

@openshift-ci-robot openshift-ci-robot added jira/valid-bug Indicates that a referenced Jira bug is valid for the branch this PR is targeting. and removed jira/invalid-bug Indicates that a referenced Jira bug is invalid for the branch this PR is targeting. labels Mar 11, 2026
@openshift-ci-robot

@sdminonne: This pull request references Jira Issue OCPBUGS-77557, which is valid. The bug has been moved to the POST state.

3 validation(s) were run on this bug
  • bug is open, matching expected state (open)
  • bug target version (4.22.0) matches configured target version for branch (4.22.0)
  • bug is in the state ASSIGNED, which is one of the valid states (NEW, ASSIGNED, POST)

No GitHub users were found matching the public email listed for the QA contact in Jira ([email protected]), skipping review request.

@openshift-ci openshift-ci Bot added the area/testing Indicates the PR includes changes for e2e testing label Mar 16, 2026
openshift-ci Bot commented Mar 16, 2026

Contributor

[APPROVALNOTIFIER] This PR is NOT APPROVED

This pull-request has been approved by: sdminonne
Once this PR has been reviewed and has the lgtm label, please ask for approval from enxebre. For more information see the Code Review Process.

The full list of commands accepted by this bot can be found here.

Details Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@sdminonne

Contributor Author

/test e2e-aws

@openshift-ci openshift-ci Bot added area/karpenter-operator Indicates the PR includes changes related to the Karpenter operator area/platform/azure PR/issue for Azure (AzurePlatform) platform area/platform/gcp PR/issue for GCP (GCPPlatform) platform area/platform/kubevirt PR/issue for KubeVirt (KubevirtPlatform) platform area/platform/powervs PR/issue for PowerVS (PowerVSPlatform) platform labels May 5, 2026
@openshift-ci openshift-ci Bot removed the needs-rebase Indicates a PR cannot be merged because it has merge conflicts with HEAD. label May 5, 2026
@bryan-cox

Member

@sdminonne are you still looking to take this PR forward?

@openshift-ci-robot openshift-ci-robot removed the jira/valid-bug Indicates that a referenced Jira bug is valid for the branch this PR is targeting. label Jun 11, 2026
@openshift-ci-robot

@sdminonne: This pull request references Jira Issue OCPBUGS-77557, which is invalid:

  • expected the bug to target either version "5.0." or "openshift-5.0.", but it targets "4.22" instead

Comment /jira refresh to re-evaluate validity if changes to the Jira bug are made, or edit the title of this pull request to link to a different bug.

@openshift-ci-robot openshift-ci-robot added the jira/invalid-bug Indicates that a referenced Jira bug is invalid for the branch this PR is targeting. label Jun 11, 2026
@sdminonne

Contributor Author

/jira refresh

@openshift-ci-robot openshift-ci-robot added jira/valid-bug Indicates that a referenced Jira bug is valid for the branch this PR is targeting. and removed jira/invalid-bug Indicates that a referenced Jira bug is invalid for the branch this PR is targeting. labels Jun 11, 2026
@openshift-ci-robot

@sdminonne: This pull request references Jira Issue OCPBUGS-77557, which is valid.

3 validation(s) were run on this bug
  • bug is open, matching expected state (open)
  • bug target version (5.0.0) matches configured target version for branch (5.0.0)
  • bug is in the state POST, which is one of the valid states (NEW, ASSIGNED, POST)

No GitHub users were found matching the public email listed for the QA contact in Jira ([email protected]), skipping review request.

},
}

cpContext := controlplanecomponent.WorkloadContext{

Member

Could you pull this block outside the loop instead of making it over and over?

Contributor Author

Ack — will hoist the HCP/cpContext setup outside the loop.

Member

This still looks like it's inside the loop. You can build the base HCP once outside and just set hcp.Spec.AdditionalTrustBundle = tc.additionalTrust inside each iteration.
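
A minimal sketch of the hoisting the reviewer is asking for, written as a plain program rather than a real test so it stays self-contained. The types below are simplified stand-ins for hyperv1.HostedControlPlane and corev1.LocalObjectReference, and `wantsAWSCABundle` is a hypothetical helper that only mirrors the AWS-plus-trust-bundle guard discussed in this PR; none of it is the PR's actual test code.

```go
package main

import "fmt"

// Simplified stand-ins for the real API types; only the fields needed here.
type LocalObjectReference struct{ Name string }

type HostedControlPlaneSpec struct {
	Platform              string
	AdditionalTrustBundle *LocalObjectReference
}

type HostedControlPlane struct{ Spec HostedControlPlaneSpec }

// wantsAWSCABundle is a hypothetical guard: wire the bundle only on AWS and
// only when an additional trust bundle is configured.
func wantsAWSCABundle(hcp *HostedControlPlane) bool {
	return hcp.Spec.Platform == "AWS" && hcp.Spec.AdditionalTrustBundle != nil
}

func main() {
	testCases := []struct {
		name            string
		additionalTrust *LocalObjectReference
		expectWired     bool
	}{
		{name: "trust bundle set", additionalTrust: &LocalObjectReference{Name: "user-ca-bundle"}, expectWired: true},
		{name: "trust bundle nil", additionalTrust: nil, expectWired: false},
	}

	// The base object is built once, outside the loop.
	hcp := &HostedControlPlane{Spec: HostedControlPlaneSpec{Platform: "AWS"}}

	for _, tc := range testCases {
		// Only the field that differs between cases is set per iteration.
		hcp.Spec.AdditionalTrustBundle = tc.additionalTrust
		got := wantsAWSCABundle(hcp)
		fmt.Printf("%s: got=%v want=%v\n", tc.name, got, tc.expectWired)
	}
}
```

Sharing one object across cases is only safe while the cases run sequentially and the code under test does not mutate the object; with t.Parallel() each subtest would need its own copy.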

})

if hcp.Spec.AdditionalTrustBundle != nil {
podspec.DeploymentAddAWSCABundleVolume(hcp.Spec.AdditionalTrustBundle, deployment, cpContext.ReleaseImageProvider.GetImage(podspec.CPOImageName))

Member

I thought the cpContext had a copy of the hcp you could pass in. Do you know if that is true?

Member

Yeah you are actually using it here https://github.com/openshift/hypershift/pull/7907/changes#diff-db92db45ba677a590c2e7e4cd186a414d3496b6f62954e4f98218c44518e24cbR25. So maybe the HCP should be referenced from that.

Contributor Author

Good point — updated to use cpContext.HCP consistently.

Comment thread support/podspec/volumes.go Outdated
//
// The initContainerImage should be a RHEL-based image that has /bin/sh and cat available
// (e.g. the control-plane-operator image).
func DeploymentAddAWSCABundleVolume(trustBundleConfigMap *corev1.LocalObjectReference, deployment *appsv1.Deployment, initContainerImage string) {

Member

Could you just pass the cpContext in and drop the number of parameters down to two?

Contributor Author

support/podspec is a low-level utility package. WorkloadContext lives in support/controlplane-component, which imports support/podspec. Passing cpContext here would create a circular dependency (or at minimum a bad layering inversion). The current signature (trustBundleConfigMap, deployment, initContainerImage) keeps the package self-contained — all parameters are plain k8s types.

t.Run(tc.name, func(t *testing.T) {
g := NewGomegaWithT(t)

hcp := &hyperv1.HostedControlPlane{

Member

could you move this block outside the for loop and then just change the AdditionalTrustBundle each time?

Contributor Author

Ack — will hoist outside the loop.

Member

Same as above — still inside the loop.

},
}

cpContext := controlplanecomponent.WorkloadContext{

Member

Could this be moved outside the for loop instead of being recreated each time?

Contributor Author

Ack.

Member

Same as above — still inside the loop.

},
}

cpContext := component.WorkloadContext{

Member

Same comment as on the other tests.

Contributor Author

Ack.

Member

Same as above — still inside the loop.

if err := applyKMSConfig(&deployment.Spec.Template.Spec, secretEncryption, newKMSImages(hcp), hcp); err != nil {
return err
}
if secretEncryption.KMS != nil && secretEncryption.KMS.Provider == hyperv1.AWS && hcp.Spec.AdditionalTrustBundle != nil {

Member

Why would this not be in the block on L113?

Contributor Author

The platform switch at L113 handles pod identity webhooks — that's its concern. The KMS/secret-encryption block at L126+ is a separate concern with its own guard (secretEncryption.KMS.Provider == hyperv1.AWS). Nesting KMS handling inside the platform switch would conflate two independent concerns. The current placement correctly treats secret encryption as orthogonal to platform identity webhooks.

}

podspec.UpdateContainer("aws-kms-active", podSpec.Containers, wireCABundle)
podspec.UpdateContainer("aws-kms-backup", podSpec.Containers, wireCABundle)

Member

Do we always deploy the backup container, even when no backup key is set? 🤔

If not, I think this would trip things up when the container is absent.

Contributor Author

podspec.UpdateContainer is a no-op when the named container doesn't exist — it just iterates and matches by name (containers.go:55-60). So calling it on "aws-kms-backup" when there's no backup container is safe; it silently skips.
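
The no-op behaviour described in this reply can be sketched as follows. This is a hypothetical re-implementation, not the code in containers.go: `Container` and `EnvVar` are simplified stand-ins for the corev1 types, `updateContainer` only borrows the argument order seen in the call sites above, and the bundle path is a placeholder.

```go
package main

import "fmt"

// Simplified stand-ins for the corev1 types; only the fields used here.
type EnvVar struct{ Name, Value string }

type Container struct {
	Name string
	Env  []EnvVar
}

// updateContainer applies update to every container whose name matches.
// When nothing matches, the loop body never runs and the slice is left
// untouched, so calling it for an absent container is harmless.
func updateContainer(name string, containers []Container, update func(c *Container)) {
	for i := range containers {
		if containers[i].Name == name {
			update(&containers[i])
		}
	}
}

func main() {
	containers := []Container{{Name: "kube-apiserver"}, {Name: "aws-kms-active"}}

	setBundle := func(c *Container) {
		// Placeholder path, for illustration only.
		c.Env = append(c.Env, EnvVar{Name: "AWS_CA_BUNDLE", Value: "/path/to/combined-ca-bundle.pem"})
	}

	updateContainer("aws-kms-active", containers, setBundle)
	updateContainer("aws-kms-backup", containers, setBundle) // no such container: nothing happens

	for _, c := range containers {
		fmt.Printf("%s env=%v\n", c.Name, c.Env)
	}
}
```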

Comment thread support/podspec/volumes.go Outdated
})
}

// DeploymentAddAWSCABundleVolume creates a combined CA bundle containing both the system CAs from

Member

Can we move this out to its own AWS platform file? This file to date has been platform agnostic.

Contributor Author

The existing DeploymentAddTrustBundleVolume in this file is also trust-bundle-specific (not truly platform-agnostic). The two functions share the same concern: trust bundle volume wiring. The file is ~140 lines — splitting one function into a separate file would scatter related code. Happy to revisit if the package grows with more platform-specific helpers, but for now I'd prefer keeping them together.

Member

DeploymentAddTrustBundleVolume is platform-agnostic — it mounts a ConfigMap as a trusted-ca volume for any platform. The new functions are explicitly AWS-specific (AWS_CA_BUNDLE, AWS SDK concatenation strategy). The file went from ~35 lines of generic helpers to ~140 with ~90 lines of AWS-only code. I'd still prefer a volumes_aws.go to keep the separation clean.

Member

Is this something that could be moved over to v2 e2e rather than adding new tests to v1 e2e?

Contributor Author

Will look into whether the v2 e2e framework supports what this test needs. If feasible, will migrate; otherwise will track as a follow-up.

Contributor Author

Done — removed the v1 test file and added the AWS_CA_BUNDLE wiring checks (add + remove) to the existing NodePoolTrustBundleTest in test/e2e/v2/tests/nodepool_lifecycle_test.go. Also switched from Containers[0] indexing to podspec.FindContainer and used the exported podspec.AWSCABundleVolumeName/AWSCABundleMountPath/AWSCABundleFileName constants.
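
A rough sketch of the kind of check this describes, looking the container up by name rather than by position. The types are simplified stand-ins for the corev1 ones, `findContainer` only imitates what podspec.FindContainer is described as doing, and the container names are hypothetical; the mount path and file name are the values quoted elsewhere in this thread.

```go
package main

import (
	"fmt"
	"path"
)

// Values quoted in the review thread for the exported constants.
const (
	awsCABundleMountPath = "/etc/pki/ca-trust/extracted/hypershift"
	awsCABundleFileName  = "combined-ca-bundle.pem"
)

// Simplified stand-ins for the corev1 types (assumption).
type EnvVar struct{ Name, Value string }

type Container struct {
	Name string
	Env  []EnvVar
}

// findContainer returns the container called name, or nil if there is none.
func findContainer(name string, containers []Container) *Container {
	for i := range containers {
		if containers[i].Name == name {
			return &containers[i]
		}
	}
	return nil
}

// hasAWSCABundle reports whether the named container points AWS_CA_BUNDLE
// at the combined bundle file.
func hasAWSCABundle(name string, containers []Container) bool {
	c := findContainer(name, containers)
	if c == nil {
		return false
	}
	want := path.Join(awsCABundleMountPath, awsCABundleFileName)
	for _, e := range c.Env {
		if e.Name == "AWS_CA_BUNDLE" {
			return e.Value == want
		}
	}
	return false
}

func main() {
	// The target container is deliberately not at index 0.
	containers := []Container{
		{Name: "some-sidecar"},
		{Name: "cloud-controller-manager", Env: []EnvVar{
			{Name: "AWS_CA_BUNDLE", Value: awsCABundleMountPath + "/" + awsCABundleFileName},
		}},
	}
	fmt.Println(hasAWSCABundle("cloud-controller-manager", containers))
	fmt.Println(hasAWSCABundle("some-sidecar", containers))
}
```

A name-based lookup keeps the assertion valid if a sidecar is ever added ahead of the main container, which positional indexing would not.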

@bryan-cox bryan-cox left a comment

Member

Thanks for the PR — the init container approach for combining system + user CAs is the right design given the AWS SDK's AWS_CA_BUNDLE replacement behavior. A few items to address:

Comment thread support/podspec/volumes.go Outdated
func DeploymentAddAWSCABundleVolume(trustBundleConfigMap *corev1.LocalObjectReference, deployment *appsv1.Deployment, initContainerImage string) {
	const (
		userCAVolumeName     = "user-ca-bundle"
		combinedCAVolumeName = "aws-ca-bundle"

Member

The volume name, mount path, and filename constants are duplicated in kas/deployment.go:applyAWSCABundleToKMSContainers. If either copy drifts, KMS sidecars silently break in isolated environments. Could you export the shared constants?

const (
    AWSCABundleVolumeName = "aws-ca-bundle"
    AWSCABundleMountPath  = "/etc/pki/ca-trust/extracted/hypershift"
    AWSCABundleFileName   = "combined-ca-bundle.pem"
)

Contributor Author

Done — exported AWSCABundleVolumeName, AWSCABundleMountPath, and AWSCABundleFileName as shared constants. Both DeploymentAddAWSCABundleVolume and applyAWSCABundleToKMSContainers now use them.

	return err
}
if secretEncryption.KMS != nil && secretEncryption.KMS.Provider == hyperv1.AWS && hcp.Spec.AdditionalTrustBundle != nil {
	podspec.DeploymentAddAWSCABundleVolume(hcp.Spec.AdditionalTrustBundle, deployment, cpContext.ReleaseImageProvider.GetImage(podspec.CPOImageName))

Member

DeploymentAddAWSCABundleVolume sets AWS_CA_BUNDLE on Containers[0], which is kube-apiserver here. KAS doesn't use the AWS SDK — only the KMS sidecars do, and those are correctly wired by applyAWSCABundleToKMSContainers below. Could we split the helper so the volume/init-container setup is separate from the Containers[0] env var wiring? That way KAS only gets the volumes + init container, and the KMS sidecars get the env var via applyAWSCABundleToKMSContainers.

Contributor Author

Done — split the helper into DeploymentAddAWSCABundleSetup (volumes + init container only) and ContainerAddAWSCABundle (volume mount + env var). KAS now calls DeploymentAddAWSCABundleSetup and applyAWSCABundleToKMSContainers uses podspec.ContainerAddAWSCABundle to wire only the KMS sidecars. DeploymentAddAWSCABundleVolume is kept as a convenience wrapper for non-KAS callers (calls both functions for Containers[0]).
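
The shape of that split might look roughly like the sketch below. It is an illustration under stated assumptions, not the PR's code: the types are cut-down stand-ins for the corev1 ones, the function names are lower-cased imitations of the two helpers, and the init container name and image are hypothetical. The volume names, mount paths, and file name are the values quoted in this thread.

```go
package main

import (
	"fmt"
	"path"
)

// Values quoted in the review thread.
const (
	awsCABundleVolumeName = "aws-ca-bundle"
	awsCABundleMountPath  = "/etc/pki/ca-trust/extracted/hypershift"
	awsCABundleFileName   = "combined-ca-bundle.pem"
)

// Simplified stand-ins for the corev1 types (assumption, not the real API).
type EnvVar struct{ Name, Value string }

type VolumeMount struct{ Name, MountPath string }

type Container struct {
	Name         string
	Image        string
	Env          []EnvVar
	VolumeMounts []VolumeMount
}

type PodSpec struct {
	Volumes        []string // volume names only, for brevity
	InitContainers []Container
	Containers     []Container
}

// addAWSCABundleSetup adds the pod-level pieces only: the user CA volume, the
// shared output volume, and the init container that writes the combined bundle.
func addAWSCABundleSetup(spec *PodSpec, initImage string) {
	spec.Volumes = append(spec.Volumes, "user-ca-bundle", awsCABundleVolumeName)
	spec.InitContainers = append(spec.InitContainers, Container{
		Name:  "setup-aws-ca-bundle", // hypothetical name
		Image: initImage,
		VolumeMounts: []VolumeMount{
			{Name: "user-ca-bundle", MountPath: "/user-ca"},
			{Name: awsCABundleVolumeName, MountPath: awsCABundleMountPath},
		},
	})
}

// containerAddAWSCABundle wires a single container: the volume mount plus
// the AWS_CA_BUNDLE environment variable.
func containerAddAWSCABundle(c *Container) {
	c.VolumeMounts = append(c.VolumeMounts, VolumeMount{
		Name:      awsCABundleVolumeName,
		MountPath: awsCABundleMountPath,
	})
	c.Env = append(c.Env, EnvVar{
		Name:  "AWS_CA_BUNDLE",
		Value: path.Join(awsCABundleMountPath, awsCABundleFileName),
	})
}

func main() {
	spec := PodSpec{Containers: []Container{{Name: "kube-apiserver"}, {Name: "aws-kms-active"}}}
	addAWSCABundleSetup(&spec, "example.invalid/cpo:latest") // hypothetical image

	// Only the KMS sidecar gets the env var; kube-apiserver is left alone.
	for i := range spec.Containers {
		if spec.Containers[i].Name == "aws-kms-active" {
			containerAddAWSCABundle(&spec.Containers[i])
		}
	}
	for _, c := range spec.Containers {
		fmt.Printf("%s env=%v\n", c.Name, c.Env)
	}
}
```

Separating pod-level setup from per-container wiring lets each caller decide which containers actually use the AWS SDK.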

metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
)

type fakeReleaseProvider struct{}

Member

This struct is duplicated identically in 7 test files in this PR. There's already a shared support/releaseinfo/fake.FakeReleaseProvider used across 40+ tests in the codebase — could you use that instead?

Contributor Author

Done — replaced all 6 copies with testutil.FakeImageProvider() from support/testutil.
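
As a general illustration of why one shared fake is preferable to per-file copies, a single map-backed fake can serve every test that needs an image lookup. Everything below is hypothetical: the `ImageProvider` interface is trimmed down to the one `GetImage` call seen in this PR, and the constructor, image key, and image reference are invented for the sketch rather than taken from support/testutil.

```go
package main

import "fmt"

// ImageProvider is a hypothetical, trimmed-down interface with the single
// lookup method the deployment code in this thread calls.
type ImageProvider interface {
	GetImage(name string) string
}

// fakeImageProvider is a map-backed test double.
type fakeImageProvider struct {
	images map[string]string
}

// GetImage returns the configured image, or "" for an unknown key.
func (f *fakeImageProvider) GetImage(name string) string {
	return f.images[name]
}

// newFakeImageProvider is a hypothetical shared constructor that tests
// import instead of redefining the struct in every file.
func newFakeImageProvider(images map[string]string) ImageProvider {
	return &fakeImageProvider{images: images}
}

// initContainerImage is a toy consumer standing in for the code under test.
func initContainerImage(p ImageProvider) string {
	return p.GetImage("controlplane-operator") // hypothetical key
}

func main() {
	p := newFakeImageProvider(map[string]string{
		"controlplane-operator": "example.invalid/cpo:test",
	})
	fmt.Println(initContainerImage(p))
}
```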


proxy.SetEnvVars(&deployment.Spec.Template.Spec.Containers[0].Env)

if cpContext.HCP.Spec.Platform.Type == hyperv1.AWSPlatform && cpContext.HCP.Spec.AdditionalTrustBundle != nil {

Member

nit: The other files in this PR (awsnodeterminationhandler, karpenter, kas) create a local hcp := cpContext.HCP and use hcp.Spec.*. For consistency, consider doing the same here and in ingressoperator/deployment.go.

Contributor Author

Ack — will add hcp := cpContext.HCP for consistency.

Member

capi_provider and ingressoperator still reference cpContext.HCP.Spec.* directly — the hcp := cpContext.HCP local isn't added yet.

Comment thread support/podspec/volumes.go Outdated
deployment.Spec.Template.Spec.InitContainers = append(deployment.Spec.Template.Spec.InitContainers, corev1.Container{
	Name:    initContainerName,
	Image:   initContainerImage,
	Command: []string{"/bin/sh", "-c",

Member

nit: If the system CA file is missing for some reason, this crashes the init container. A defensive fallback is cheap:

cat /etc/pki/tls/certs/ca-bundle.crt /user-ca/user-ca-bundle.pem > /etc/pki/ca-trust/extracted/hypershift/combined-ca-bundle.pem 2>/dev/null || cp /user-ca/user-ca-bundle.pem /etc/pki/ca-trust/extracted/hypershift/combined-ca-bundle.pem

Contributor Author

Ack — will add the defensive fallback.

Member

The defensive fallback still isn't here. If the system CA file is ever missing, this init container crashes and blocks pod startup in an isolated environment — exactly the environment this PR is designed to support. Cheap insurance.

@openshift-ci openshift-ci Bot added the needs-rebase label (Indicates a PR cannot be merged because it has merge conflicts with HEAD.) Jun 16, 2026
@openshift-ci

openshift-ci Bot commented Jun 16, 2026

Contributor

@sdminonne: The following tests failed, say /retest to rerun all failed tests or /retest-required to rerun all mandatory failed tests:

| Test name | Commit | Details | Required | Rerun command |
| --- | --- | --- | --- | --- |
| ci/prow/e2e-aws | 3aaf940 | link | true | /test e2e-aws |
| ci/prow/e2e-v2-gke | 13667a3 | link | false | /test e2e-v2-gke |
| ci/prow/e2e-gke | 13667a3 | link | false | /test e2e-gke |


…ents

Add DeploymentAddAWSCABundleVolume helper that creates a combined CA
bundle (system + user CAs) via an init container and sets AWS_CA_BUNDLE
on the main container. Wire trust bundle propagation into all AWS control
plane components when AdditionalTrustBundle is set on the
HostedControlPlane spec: aws-cloud-controller-manager, capi-provider,
ingress-operator, karpenter, karpenter-operator,
aws-node-termination-handler, and kube-apiserver AWS KMS sidecars
(aws-kms-active, aws-kms-backup).

Split the helper into DeploymentAddAWSCABundleSetup (volumes + init
container) and ContainerAddAWSCABundle (per-container wiring) so KAS
only gets the volumes while KMS sidecars get the env var. Export shared
constants (AWSCABundleVolumeName, AWSCABundleMountPath,
AWSCABundleFileName) to avoid duplication. Move AWS-specific code to
volumes_aws.go to keep the base file platform-agnostic.

Add unit tests for all components and a v2 e2e test verifying
AWS_CA_BUNDLE wiring on aws-cloud-controller-manager (add and remove).

Fixes: https://issues.redhat.com/browse/OCPBUGS-77557

Co-Authored-By: Claude Opus 4.6 <[email protected]>
@openshift-ci openshift-ci Bot removed the needs-rebase label (Indicates a PR cannot be merged because it has merge conflicts with HEAD.) Jun 16, 2026
@sdminonne

Contributor Author

/retest-required

@bryan-cox

Member

Two minor nits from the latest push:

  1. karpenter and karpenteroperator tests: The HCP object is still constructed inside the for loop in karpenter/deployment_test.go and karpenteroperator/deployment_test.go. The other tests (awsnodeterminationhandler, cloud_controller_manager/aws, ingressoperator) were updated to hoist it outside the loop — would be good to make these consistent.

  2. cloud_controller_manager/aws/deployment.go: This file still uses cpContext.HCP.Spec.* directly instead of declaring a local hcp := cpContext.HCP like capi_provider/deployment.go and ingressoperator/deployment.go do. Minor, but would keep the pattern consistent.

Hoist HCP object construction outside the for loop in karpenter and
karpenteroperator tests for consistency with the other component tests.
Use local hcp variable in cloud_controller_manager/aws/deployment.go
instead of referencing cpContext.HCP.Spec directly.

Co-Authored-By: Claude Opus 4.6 <[email protected]>
@sdminonne

Contributor Author

/retest

@hypershift-jira-solve-ci


Test Failure Analysis Complete

Job Information

Test Failure Analysis

Error

go: github.com/openshift/hypershift/cmd/cluster/core imports
    sigs.k8s.io/cluster-api-provider-azure/api/v1beta1: sigs.k8s.io/cluster-api-provider-azure@v1.22.0: read "https://proxy.golang.org/sigs.k8s.io/cluster-api-provider-azure/@v/v1.22.0.zip": stream error: stream ID 263; INTERNAL_ERROR; received from peer

Summary

The verify-deps job failed due to a transient network error: the Go module proxy (proxy.golang.org) returned an HTTP/2 INTERNAL_ERROR while serving sigs.k8s.io/cluster-api-provider-azure@v1.22.0. This is an infrastructure flake; the proxy hit a stream-level error (stream ID 263) while the dependency zip archive was downloading. The error occurred during go mod tidy, the first command run by the go-verify-deps CI step. The failure is unrelated to the PR's code changes (propagating additionalTrustBundle to AWS control plane components) and is not caused by a missing or invalid dependency.

Root Cause

The root cause is a transient HTTP/2 stream error from the Go module proxy (proxy.golang.org). Specifically:

  1. The CI step go-verify-deps runs go mod tidy to verify that vendored dependencies are correct.
  2. During this process, Go downloads all module dependencies from proxy.golang.org.
  3. While downloading sigs.k8s.io/cluster-api-provider-azure@v1.22.0 (a large dependency zip), the proxy returned an HTTP/2 INTERNAL_ERROR on stream ID 263.
  4. The INTERNAL_ERROR stream code indicates the proxy server (or an intermediary) terminated the HTTP/2 stream unexpectedly — this is a server-side infrastructure issue, not a client-side or dependency resolution problem.
  5. The go mod tidy command failed with exit code 1 due to this network error, causing the entire verify-deps step to fail.

This is a well-known class of transient CI failures. The dependency sigs.k8s.io/cluster-api-provider-azure@v1.22.0 is a valid, existing module that the hypershift project already depends on; the failure is purely in the transport layer during download.

Recommendations
  1. Retest the PR: this is a transient infrastructure flake. Re-trigger the job with /retest or /test verify-deps on the PR. The job is expected to pass on retry.
  2. No code changes needed — The PR's changes (propagating additionalTrustBundle to AWS control plane components) are unrelated to the Go module proxy network error.
  3. If the failure persists on retry — Check the Go module proxy status and consider whether proxy.golang.org is experiencing an outage. Persistent failures on the same module could indicate a temporary issue with that specific module's availability on the proxy.

Evidence

| Evidence | Detail |
| --- | --- |
| Failed step | verify-deps-go-verify-deps (test phase) |
| Exit code | 1 (container test) |
| Error type | HTTP/2 stream error: INTERNAL_ERROR (stream ID 263) |
| Failed operation | go mod tidy downloading sigs.k8s.io/cluster-api-provider-azure@v1.22.0 |
| Proxy URL | https://proxy.golang.org/sigs.k8s.io/cluster-api-provider-azure/@v/v1.22.0.zip |
| Import chain | github.com/openshift/hypershift/cmd/cluster/core → sigs.k8s.io/cluster-api-provider-azure/api/v1beta1 |
| Step duration | ~40 seconds before failure |
| PR relevance | None: the PR changes AWS additionalTrustBundle propagation, which is unrelated to the Azure CAPI dependency download |
| Failure classification | Infrastructure flake: transient Go module proxy network error |

@bryan-cox

Member

Could you squash the commits? Otherwise lgtm


Labels

area/control-plane-operator: Indicates the PR includes changes for the control plane operator - in an OCP release
area/hypershift-operator: Indicates the PR includes changes for the hypershift operator and API - outside an OCP release
area/karpenter-operator: Indicates the PR includes changes related to the Karpenter operator
area/platform/aws: PR/issue for AWS (AWSPlatform) platform
area/platform/azure: PR/issue for Azure (AzurePlatform) platform
area/platform/gcp: PR/issue for GCP (GCPPlatform) platform
area/platform/kubevirt: PR/issue for KubeVirt (KubevirtPlatform) platform
area/platform/powervs: PR/issue for PowerVS (PowerVSPlatform) platform
area/testing: Indicates the PR includes changes for e2e testing
jira/valid-bug: Indicates that a referenced Jira bug is valid for the branch this PR is targeting.
jira/valid-reference: Indicates that this PR references a valid Jira ticket of any type.


5 participants