OCPBUGS-77557: propagate additionalTrustBundle to AWS control plane components#7907
OCPBUGS-77557: propagate additionalTrustBundle to AWS control plane components#7907sdminonne wants to merge 2 commits into
Conversation
|
Pipeline controller notification For optional jobs, comment This repository is configured in: LGTM mode |
📝 WalkthroughWalkthroughAdds support for wiring an AWS CA bundle into deployments when a HostedControlPlane has Spec.AdditionalTrustBundle set and the platform is AWS. A new utility, DeploymentAddAWSCABundleVolume, constructs user/system CA volumes, an init container to produce a combined bundle, mounts it into main containers, and sets AWS_CA_BUNDLE. Multiple hosted control plane components now call this utility during their deployment adaptation; tests and e2e checks were added to validate presence/absence of volumes, mounts, init containers, and the AWS_CA_BUNDLE env var. Sequence Diagram(s)mermaid Changes
🚥 Pre-merge checks | ✅ 2✅ Passed checks (2 passed)
✏️ Tip: You can configure your own custom pre-merge checks in the settings. ✨ Finishing Touches🧪 Generate unit tests (beta)
Comment |
|
/assign @enxebre |
|
how about karpenter-aws and aws-node-termination-handler? |
That statement seems to contradict the docs https://docs.aws.amazon.com/sdk-for-go/api/aws/session/ "Path to a custom Credentials Authority (CA) bundle PEM file that the SDK will use instead of the default system's root CA bundle. Use this only if you want to replace the CA bundle the SDK uses for TLS requests." |
|
/jira refresh |
|
@sdminonne: No Jira issue is referenced in the title of this pull request. DetailsIn response to this:
Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the openshift-eng/jira-lifecycle-plugin repository. |
|
@sdminonne: This pull request references Jira Issue OCPBUGS-77557, which is invalid:
Comment The bug has been updated to refer to the pull request using the external bug tracker. DetailsIn response to this:
Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the openshift-eng/jira-lifecycle-plugin repository. |
|
/jira refresh |
|
@sdminonne: This pull request references Jira Issue OCPBUGS-77557, which is invalid:
Comment DetailsIn response to this:
Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the openshift-eng/jira-lifecycle-plugin repository. |
|
/jira refresh |
|
@sdminonne: This pull request references Jira Issue OCPBUGS-77557, which is valid. The bug has been moved to the POST state. 3 validation(s) were run on this bug
No GitHub users were found matching the public email listed for the QA contact in Jira ([email protected]), skipping review request. DetailsIn response to this:
Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the openshift-eng/jira-lifecycle-plugin repository. |
c9c61d8 to
3aaf940
Compare
|
[APPROVALNOTIFIER] This PR is NOT APPROVED This pull-request has been approved by: sdminonne The full list of commands accepted by this bot can be found here. DetailsNeeds approval from an approver in each of these files:Approvers can indicate their approval by writing |
|
/test e2e-aws |
|
@sdminonne are you still looking to take this PR forward? |
|
@sdminonne: This pull request references Jira Issue OCPBUGS-77557, which is invalid:
Comment DetailsIn response to this:
Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the openshift-eng/jira-lifecycle-plugin repository. |
|
/jira refresh |
|
@sdminonne: This pull request references Jira Issue OCPBUGS-77557, which is valid. 3 validation(s) were run on this bug
No GitHub users were found matching the public email listed for the QA contact in Jira ([email protected]), skipping review request. DetailsIn response to this:
Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the openshift-eng/jira-lifecycle-plugin repository. |
| }, | ||
| } | ||
|
|
||
| cpContext := controlplanecomponent.WorkloadContext{ |
There was a problem hiding this comment.
Could you pull this block outside the loop instead of making it over and over?
There was a problem hiding this comment.
Ack — will hoist the HCP/cpContext setup outside the loop.
There was a problem hiding this comment.
This still looks like it's inside the loop. You can build the base HCP once outside and just set hcp.Spec.AdditionalTrustBundle = tc.additionalTrust inside each iteration.
| }) | ||
|
|
||
| if hcp.Spec.AdditionalTrustBundle != nil { | ||
| podspec.DeploymentAddAWSCABundleVolume(hcp.Spec.AdditionalTrustBundle, deployment, cpContext.ReleaseImageProvider.GetImage(podspec.CPOImageName)) |
There was a problem hiding this comment.
I thought the cpContext had a copy of the hcp you could pass in. Do you know if that is true?
There was a problem hiding this comment.
Yeah you are actually using it here https://github.com/openshift/hypershift/pull/7907/changes#diff-db92db45ba677a590c2e7e4cd186a414d3496b6f62954e4f98218c44518e24cbR25. So maybe the HCP should be referenced from that.
There was a problem hiding this comment.
Good point — updated to use cpContext.HCP consistently.
| // | ||
| // The initContainerImage should be a RHEL-based image that has /bin/sh and cat available | ||
| // (e.g. the control-plane-operator image). | ||
| func DeploymentAddAWSCABundleVolume(trustBundleConfigMap *corev1.LocalObjectReference, deployment *appsv1.Deployment, initContainerImage string) { |
There was a problem hiding this comment.
Could you just pass the cpContext in and drop the number of parameters down to two?
There was a problem hiding this comment.
support/podspec is a low-level utility package. WorkloadContext lives in support/controlplane-component, which imports support/podspec. Passing cpContext here would create a circular dependency (or at minimum a bad layering inversion). The current signature (trustBundleConfigMap, deployment, initContainerImage) keeps the package self-contained — all parameters are plain k8s types.
| t.Run(tc.name, func(t *testing.T) { | ||
| g := NewGomegaWithT(t) | ||
|
|
||
| hcp := &hyperv1.HostedControlPlane{ |
There was a problem hiding this comment.
could you move this block outside the for loop and then just change the AdditionalTrustBundle each time?
There was a problem hiding this comment.
Ack — will hoist outside the loop.
There was a problem hiding this comment.
Same as above — still inside the loop.
| }, | ||
| } | ||
|
|
||
| cpContext := controlplanecomponent.WorkloadContext{ |
There was a problem hiding this comment.
Could this be moved outside the for loop instead of being recreated each time?
There was a problem hiding this comment.
Same as above — still inside the loop.
| }, | ||
| } | ||
|
|
||
| cpContext := component.WorkloadContext{ |
There was a problem hiding this comment.
similar comment as the other tests
There was a problem hiding this comment.
Same as above — still inside the loop.
| if err := applyKMSConfig(&deployment.Spec.Template.Spec, secretEncryption, newKMSImages(hcp), hcp); err != nil { | ||
| return err | ||
| } | ||
| if secretEncryption.KMS != nil && secretEncryption.KMS.Provider == hyperv1.AWS && hcp.Spec.AdditionalTrustBundle != nil { |
There was a problem hiding this comment.
Why would this not be in the block on L113?
There was a problem hiding this comment.
The platform switch at L113 handles pod identity webhooks — that's its concern. The KMS/secret-encryption block at L126+ is a separate concern with its own guard (secretEncryption.KMS.Provider == hyperv1.AWS). Nesting KMS handling inside the platform switch would conflate two independent concerns. The current placement correctly treats secret encryption as orthogonal to platform identity webhooks.
| } | ||
|
|
||
| podspec.UpdateContainer("aws-kms-active", podSpec.Containers, wireCABundle) | ||
| podspec.UpdateContainer("aws-kms-backup", podSpec.Containers, wireCABundle) |
There was a problem hiding this comment.
Do we always deploy the backup container if there is not a key set? 🤔
If so, this would trip things up if there was no container I think.
There was a problem hiding this comment.
podspec.UpdateContainer is a no-op when the named container doesn't exist — it just iterates and matches by name (containers.go:55-60). So calling it on "aws-kms-backup" when there's no backup container is safe; it silently skips.
| }) | ||
| } | ||
|
|
||
| // DeploymentAddAWSCABundleVolume creates a combined CA bundle containing both the system CAs from |
There was a problem hiding this comment.
Can we move this out to its own AWS platform file? This file to date has been platform agnostic.
There was a problem hiding this comment.
The existing DeploymentAddTrustBundleVolume in this file is also trust-bundle-specific (not truly platform-agnostic). The two functions share the same concern: trust bundle volume wiring. The file is ~140 lines — splitting one function into a separate file would scatter related code. Happy to revisit if the package grows with more platform-specific helpers, but for now I'd prefer keeping them together.
There was a problem hiding this comment.
DeploymentAddTrustBundleVolume is platform-agnostic — it mounts a ConfigMap as a trusted-ca volume for any platform. The new functions are explicitly AWS-specific (AWS_CA_BUNDLE, AWS SDK concatenation strategy). The file went from ~35 lines of generic helpers to ~140 with ~90 lines of AWS-only code. I'd still prefer a volumes_aws.go to keep the separation clean.
There was a problem hiding this comment.
Is this something that could be moved over to v2 e2e rather than adding new tests to v1 e2e?
There was a problem hiding this comment.
Will look into whether the v2 e2e framework supports what this test needs. If feasible, will migrate; otherwise will track as a follow-up.
There was a problem hiding this comment.
Done — removed the v1 test file and added the AWS_CA_BUNDLE wiring checks (add + remove) to the existing NodePoolTrustBundleTest in test/e2e/v2/tests/nodepool_lifecycle_test.go. Also switched from Containers[0] indexing to podspec.FindContainer and used the exported podspec.AWSCABundleVolumeName/AWSCABundleMountPath/AWSCABundleFileName constants.
bryan-cox
left a comment
There was a problem hiding this comment.
Thanks for the PR — the init container approach for combining system + user CAs is the right design given the AWS SDK's AWS_CA_BUNDLE replacement behavior. A few items to address:
| func DeploymentAddAWSCABundleVolume(trustBundleConfigMap *corev1.LocalObjectReference, deployment *appsv1.Deployment, initContainerImage string) { | ||
| const ( | ||
| userCAVolumeName = "user-ca-bundle" | ||
| combinedCAVolumeName = "aws-ca-bundle" |
There was a problem hiding this comment.
The volume name, mount path, and filename constants are duplicated in kas/deployment.go:applyAWSCABundleToKMSContainers. If either copy drifts, KMS sidecars silently break in isolated environments. Could you export the shared constants?
const (
AWSCABundleVolumeName = "aws-ca-bundle"
AWSCABundleMountPath = "/etc/pki/ca-trust/extracted/hypershift"
AWSCABundleFileName = "combined-ca-bundle.pem"
)There was a problem hiding this comment.
Done — exported AWSCABundleVolumeName, AWSCABundleMountPath, and AWSCABundleFileName as shared constants. Both DeploymentAddAWSCABundleVolume and applyAWSCABundleToKMSContainers now use them.
| return err | ||
| } | ||
| if secretEncryption.KMS != nil && secretEncryption.KMS.Provider == hyperv1.AWS && hcp.Spec.AdditionalTrustBundle != nil { | ||
| podspec.DeploymentAddAWSCABundleVolume(hcp.Spec.AdditionalTrustBundle, deployment, cpContext.ReleaseImageProvider.GetImage(podspec.CPOImageName)) |
There was a problem hiding this comment.
DeploymentAddAWSCABundleVolume sets AWS_CA_BUNDLE on Containers[0], which is kube-apiserver here. KAS doesn't use the AWS SDK — only the KMS sidecars do, and those are correctly wired by applyAWSCABundleToKMSContainers below. Could we split the helper so the volume/init-container setup is separate from the Containers[0] env var wiring? That way KAS only gets the volumes + init container, and the KMS sidecars get the env var via applyAWSCABundleToKMSContainers.
There was a problem hiding this comment.
Done — split the helper into DeploymentAddAWSCABundleSetup (volumes + init container only) and ContainerAddAWSCABundle (volume mount + env var). KAS now calls DeploymentAddAWSCABundleSetup and applyAWSCABundleToKMSContainers uses podspec.ContainerAddAWSCABundle to wire only the KMS sidecars. DeploymentAddAWSCABundleVolume is kept as a convenience wrapper for non-KAS callers (calls both functions for Containers[0]).
| metav1 "k8s.io/apimachinery/pkg/apis/meta/v1" | ||
| ) | ||
|
|
||
| type fakeReleaseProvider struct{} |
There was a problem hiding this comment.
This struct is duplicated identically in 7 test files in this PR. There's already a shared support/releaseinfo/fake.FakeReleaseProvider used across 40+ tests in the codebase — could you use that instead?
There was a problem hiding this comment.
Done — replaced all 6 copies with testutil.FakeImageProvider() from support/testutil.
|
|
||
| proxy.SetEnvVars(&deployment.Spec.Template.Spec.Containers[0].Env) | ||
|
|
||
| if cpContext.HCP.Spec.Platform.Type == hyperv1.AWSPlatform && cpContext.HCP.Spec.AdditionalTrustBundle != nil { |
There was a problem hiding this comment.
nit: The other files in this PR (awsnodeterminationhandler, karpenter, kas) create a local hcp := cpContext.HCP and use hcp.Spec.*. For consistency, consider doing the same here and in ingressoperator/deployment.go.
There was a problem hiding this comment.
Ack — will add hcp := cpContext.HCP for consistency.
There was a problem hiding this comment.
capi_provider and ingressoperator still reference cpContext.HCP.Spec.* directly — the hcp := cpContext.HCP local isn't added yet.
| deployment.Spec.Template.Spec.InitContainers = append(deployment.Spec.Template.Spec.InitContainers, corev1.Container{ | ||
| Name: initContainerName, | ||
| Image: initContainerImage, | ||
| Command: []string{"/bin/sh", "-c", |
There was a problem hiding this comment.
nit: If the system CA file is missing for some reason, this crashes the init container. A defensive fallback is cheap:
cat /etc/pki/tls/certs/ca-bundle.crt /user-ca/user-ca-bundle.pem > /etc/pki/ca-trust/extracted/hypershift/combined-ca-bundle.pem 2>/dev/null || cp /user-ca/user-ca-bundle.pem /etc/pki/ca-trust/extracted/hypershift/combined-ca-bundle.pemThere was a problem hiding this comment.
Ack — will add the defensive fallback.
There was a problem hiding this comment.
The defensive fallback still isn't here. If the system CA file is ever missing, this init container crashes and blocks pod startup in an isolated environment — exactly the environment this PR is designed to support. Cheap insurance.
|
@sdminonne: The following tests failed, say
Full PR test history. Your PR dashboard. DetailsInstructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository. I understand the commands that are listed here. |
…ents Add DeploymentAddAWSCABundleVolume helper that creates a combined CA bundle (system + user CAs) via an init container and sets AWS_CA_BUNDLE on the main container. Wire trust bundle propagation into all AWS control plane components when AdditionalTrustBundle is set on the HostedControlPlane spec: aws-cloud-controller-manager, capi-provider, ingress-operator, karpenter, karpenter-operator, aws-node-termination-handler, and kube-apiserver AWS KMS sidecars (aws-kms-active, aws-kms-backup). Split the helper into DeploymentAddAWSCABundleSetup (volumes + init container) and ContainerAddAWSCABundle (per-container wiring) so KAS only gets the volumes while KMS sidecars get the env var. Export shared constants (AWSCABundleVolumeName, AWSCABundleMountPath, AWSCABundleFileName) to avoid duplication. Move AWS-specific code to volumes_aws.go to keep the base file platform-agnostic. Add unit tests for all components and a v2 e2e test verifying AWS_CA_BUNDLE wiring on aws-cloud-controller-manager (add and remove). Fixes: https://issues.redhat.com/browse/OCPBUGS-77557 Co-Authored-By: Claude Opus 4.6 <[email protected]>
|
/retest-required |
|
Two minor nits from the latest push:
|
Hoist HCP object construction outside the for loop in karpenter and karpenteroperator tests for consistency with the other component tests. Use local hcp variable in cloud_controller_manager/aws/deployment.go instead of referencing cpContext.HCP.Spec directly. Co-Authored-By: Claude Opus 4.6 <[email protected]>
|
/retest |
Test Failure Analysis CompleteJob Information
Test Failure AnalysisErrorSummaryThe Root CauseThe root cause is a transient HTTP/2 stream error from the Go module proxy (
This is a well-known class of transient CI failures. The dependency Recommendations
Evidence
|
|
Could you squash the commits? Otherwise lgtm |
Summary
DeploymentAddAWSCABundleVolumehelper that creates a combined CA bundle (system + user CAs) via an init container and setsAWS_CA_BUNDLEon the main containerAdditionalTrustBundleis set on the HostedControlPlane spec:aws-cloud-controller-managercapi-provideringress-operatorkarpenterkarpenter-operatoraws-node-termination-handlerkube-apiserverAWS KMS sidecars (aws-kms-active,aws-kms-backup) whenSecretEncryption.KMS.Provideris AWSaws-cloud-controller-managerProblem
In isolated AWS environments (e.g., US-ISO regions), custom CA bundles specified via
HostedCluster.Spec.AdditionalTrustBundleare not propagated to AWS control plane components. This causes TLS verification failures when these components call AWS API endpoints:Why not reuse
DeploymentAddTrustBundleVolume?The existing helper mounts a ConfigMap as a directory at
/etc/pki/tls/certs, which replaces the entire system CA directory. This works for in-house components (CPO, ignition-server, OAPI) whose TLS needs are tightly controlled. However, the affected components are binaries that make HTTPS calls to standard AWS service endpoints (EC2, ELB, STS, SQS, KMS). The AWS SDK's default HTTP client loads the system CA store from/etc/pki/tls/certsto verify TLS certificates. Replacing that directory with a ConfigMap containing only the custom CA would cause the binary to lose the public root CAs (e.g., Amazon Trust Services), breaking connectivity to standard AWS API endpoints.Why
AWS_CA_BUNDLEwith a combined bundle?The AWS SDK (both v1 and v2) reads
AWS_CA_BUNDLEand uses it instead of the system CA bundle — it creates a new emptyx509.CertPooland loads only the specified file. To avoid losing trust in standard AWS endpoints, an init container concatenates the system CAs (/etc/pki/tls/certs/ca-bundle.crt) with the user-provided CAs fromadditionalTrustBundleinto a single combined PEM file.AWS_CA_BUNDLEpoints to this combined file, ensuring the AWS SDK trusts both system and custom CAs.KAS KMS sidecars
When secret encryption uses AWS KMS (
SecretEncryption.KMS.Provider == AWS), theaws-kms-activeandaws-kms-backupsidecar containers in the kube-apiserver deployment also need access to the combined CA bundle. These sidecars call AWS KMS endpoints to encrypt/decrypt data encryption keys. Theaws-kms-token-mintersidecar is intentionally excluded as it does not make AWS API calls.Test plan
AdditionalTrustBundleis setAdditionalTrustBundleis nilaws-kms-activeandaws-kms-backupget volume mount andAWS_CA_BUNDLEenv varaws-kms-token-minteris not wiredAWS_CA_BUNDLEwiring onaws-cloud-controller-manager(add and remove)make testpassesmake verifypassesFixes: https://issues.redhat.com/browse/OCPBUGS-77557
🤖 Generated with Claude Code