[Improvement] Separate S3 storage configuration for MLRun and Kubeflow Pipeline #295

Open
GiladShapira94 wants to merge 31 commits into mlrun:development from GiladShapira94:separate-data-kfp

GiladShapira94 (Collaborator) commented May 12, 2026

📝 Description

This PR separates the storage credential configuration into two distinct paths: storage.local.* for the bundled in-cluster SeaweedFS (always used by SeaweedFS IAM, the bucket-init job, and KFP Pipelines), and storage.s3.* for external AWS S3 (used only by MLRun and Jupyter when storage.mode: s3).

The default storage.mode is changed from s3 to local, reflecting that the default CE installation uses the bundled SeaweedFS rather than external AWS S3.

New dedicated _helpers.tpl partials (mlrun-ce.seaweedfs.s3.* and mlrun-ce.pipelines.s3.*) ensure Pipelines and SeaweedFS always resolve credentials from storage.local.* regardless of the active storage.mode, eliminating the previous credential cross-contamination when switching modes.
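
To make the resolution order concrete, the three helper families could look roughly like the sketch below. The helper names are the ones listed in this description; the bodies are an assumption written for illustration (only accessKey is shown, secretKey and bucket would follow the same pattern) and may differ from the actual _helpers.tpl.

```yaml
{{/* Sketch only: helper names come from this PR, bodies are assumed. */}}

{{/* MLRun and Jupyter: follow the active storage.mode. */}}
{{- define "mlrun-ce.s3.accessKey" -}}
{{- if eq .Values.storage.mode "local" -}}
{{- .Values.storage.local.accessKey -}}
{{- else -}}
{{- .Values.storage.s3.accessKey -}}
{{- end -}}
{{- end -}}

{{/* SeaweedFS IAM and bucket-init: always the in-cluster credentials. */}}
{{- define "mlrun-ce.seaweedfs.s3.accessKey" -}}
{{- .Values.storage.local.accessKey -}}
{{- end -}}

{{/* KFP Pipelines: delegate to the SeaweedFS helper in every mode. */}}
{{- define "mlrun-ce.pipelines.s3.accessKey" -}}
{{- include "mlrun-ce.seaweedfs.s3.accessKey" . -}}
{{- end -}}
```

With this shape, switching storage.mode only changes what mlrun-ce.s3.* returns; the SeaweedFS and Pipelines helpers never read storage.s3.*.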


🛠️ Changes Made

  • charts/mlrun-ce/values.yaml:
    • Changed storage.mode default from s3 to local
    • Added new storage.local block (accessKey, secretKey, bucket) as the single source of truth for in-cluster SeaweedFS credentials
    • Cleared storage.s3.accessKey/secretKey/bucket defaults (now empty strings; only meaningful when mode: s3)
  • charts/mlrun-ce/templates/_helpers.tpl:
    • mlrun-ce.s3.accessKey/secretKey/bucket — now branches on storage.mode (local vs s3)
    • Added mlrun-ce.seaweedfs.s3.* helpers — always resolve from storage.local.*
    • Added mlrun-ce.pipelines.s3.* helpers — always delegate to mlrun-ce.seaweedfs.s3.*
    • mlrun-ce.artifactPath, mlrun-ce.featureStore.dataPrefix, mlrun-ce.model-endpoint.monitoring.* — replaced hardcoded global.infrastructure.aws.bucketName | default "mlrun" with mlrun-ce.s3.bucket
  • charts/mlrun-ce/templates/config/storage-secret.yaml: AWS_ENDPOINT_URL_S3 is now injected only when storage.mode: local; storage.s3 no longer sets a custom endpoint
  • charts/mlrun-ce/templates/config/storage-validation.yaml — added fail guard for storage.mode: local with missing storage.local.bucket
  • charts/mlrun-ce/templates/config/mlrun-env-configmap.yaml — updated comment describing per-mode env vars
  • charts/mlrun-ce/templates/pipelines/** — all pipeline templates now use mlrun-ce.pipelines.s3.* helpers
  • charts/mlrun-ce/templates/seaweedfs/** — bucket-init job and IAM config now use mlrun-ce.seaweedfs.s3.* helpers
  • charts/mlrun-ce/templates/NOTES.txt — S3 credentials display updated to reference storage.local.*
  • charts/mlrun-ce/Chart.yaml: version bumped from 0.11.0-rc.36 to 0.11.0-rc.37
  • charts/mlrun-ce/README.md — version matrix updated to 0.11.0-rc.37
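
As an illustration of the storage-secret.yaml change, the per-mode endpoint injection might be rendered along these lines. This is a sketch under assumptions: the Secret name and the AWS_ACCESS_KEY_ID / AWS_SECRET_ACCESS_KEY key names are hypothetical, and only the helper names and AWS_ENDPOINT_URL_S3 are taken from this PR.

```yaml
# Sketch only: metadata.name and the two credential keys are assumed.
apiVersion: v1
kind: Secret
metadata:
  name: storage-secret
type: Opaque
stringData:
  AWS_ACCESS_KEY_ID: {{ include "mlrun-ce.s3.accessKey" . | quote }}
  AWS_SECRET_ACCESS_KEY: {{ include "mlrun-ce.s3.secretKey" . | quote }}
  {{- /* Custom endpoint is only needed for the in-cluster SeaweedFS. */}}
  {{- if eq .Values.storage.mode "local" }}
  AWS_ENDPOINT_URL_S3: {{ include "mlrun-ce.s3.service.url" . | quote }}
  {{- end }}
```

In s3 mode the key is simply absent, so AWS SDK clients fall back to the regular AWS endpoints.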

✅ Checklist

  • I have tested the changes in this PR
  • I confirmed whether my changes require a change in documentation and if so, I created another PR in MLRun for the relevant documentation.
  • I confirmed whether my changes require changes in QA tests, for example: credentials changes, resources naming change and if so, I updated the relevant Jira ticket for QA.
  • I increased the Chart version in charts/mlrun-ce/Chart.yaml.
  • I confirmed that the installation works both on a local Docker Desktop environment and on a real cluster when using the required prerequisites.
  • If needed, update https://github.com/mlrun/ce/blob/development/charts/mlrun-ce/README.md with the relevant installation instructions and version Matrix.
  • If needed, update the following values files for multi namespace support:

🧪 Testing

  • helm lint charts/mlrun-ce — run locally to catch syntax errors in refactored helpers
  • helm template mlrun charts/mlrun-ce -f charts/mlrun-ce/values.yaml — render all templates to verify helper resolution
  • Verify storage.mode: s3 path: render with --set storage.mode=s3,storage.s3.accessKey=foo,storage.s3.secretKey=bar,storage.s3.bucket=mybucket and confirm storage-secret does not contain AWS_ENDPOINT_URL_S3
  • Verify storage.mode: local path: render with defaults and confirm storage-secret contains AWS_ENDPOINT_URL_S3 pointing at the SeaweedFS service
  • Confirm pipelines secret mlpipeline-seaweedfs-artifact always uses storage.local.* regardless of storage.mode
  • End-to-end cluster install (required before merge)
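
For the storage.mode: s3 render check, the same overrides can be kept in a small values file instead of a long --set string. The file name below is hypothetical; the values mirror the placeholders used in the bullet above.

```yaml
# s3-mode-values.yaml (hypothetical file name), equivalent to the --set flags above
storage:
  mode: s3
  s3:
    accessKey: foo
    secretKey: bar
    bucket: mybucket
```

Passing it as an extra -f argument to the helm template command above should produce the same rendered output as the --set form.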

🔗 References

  • Ticket link: CEML-707
  • External links:
  • Design docs links (Optional):

🚨 Breaking Changes?

  • Yes (explain below)
  • No

Consumers upgrading from a previous release must:

  • Rename storage.s3.accessKey/secretKey/bucket to storage.local.accessKey/secretKey/bucket if they were using the default SeaweedFS-backed installation (i.e., the old default mode: s3 pointed at SeaweedFS with seaweed/seaweed123/mlrun).

  • Set storage.mode: s3 explicitly if they were previously relying on the default mode: s3 to pass external AWS credentials — the new default is local.

  • Users who supply an external AWS S3 configuration no longer need to clear AWS_ENDPOINT_URL_S3 manually; the secret now omits it when mode: s3.
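
For consumers who keep their overrides in a values file, the rename could look like the sketch below. The credential values shown are the old defaults quoted above; anyone with custom credentials would substitute their own.

```yaml
# Before the upgrade (storage.s3.* held the in-cluster SeaweedFS credentials):
# storage:
#   mode: s3
#   s3:
#     accessKey: seaweed
#     secretKey: seaweed123
#     bucket: mlrun

# After the upgrade (same credentials, moved under storage.local.*):
storage:
  mode: local
  local:
    accessKey: seaweed
    secretKey: seaweed123
    bucket: mlrun
```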


🔍️ Additional Notes

  • The three install-mode values files (admin_installation_values.yaml, non_admin_installation_values.yaml, non_admin_cluster_ip_installation_values.yaml) contain no storage.* overrides, so they correctly inherit the new defaults from values.yaml without modification.
  • KFP Pipelines is intentionally hardwired to SeaweedFS (storage.local.*) in all modes — this is by design and is documented in the updated helper comments.

Warnings

  1. Breaking change — existing storage.s3.* users: Anyone who previously used the default install (which was mode: s3 pointing at SeaweedFS with seaweed/seaweed123) must migrate their overrides to storage.local.*.

    Their upgrade path:

    --set storage.local.accessKey=<old s3.accessKey> \
    --set storage.local.secretKey=<old s3.secretKey> \
    --set storage.local.bucket=<old s3.bucket>
    

Comment on lines 43 to +56
 storage:
-  mode: s3
-  s3:
+  mode: local
+  # Single source of truth for the in-cluster SeaweedFS.
+  # Always used by: SeaweedFS IAM config, bucket-init job, and KFP Pipelines.
+  # Also used by MLRun and Jupyter when mode is "local".
+  local:
     accessKey: "seaweed"
     secretKey: "seaweed123"
     bucket: "mlrun"
+  # External AWS S3 credentials — only applied to MLRun and Jupyter when mode is "s3".
+  s3:
+    accessKey: ""
+    secretKey: ""
+    bucket: ""
Member

  1. The PR description says this PR adds pipelines.storage.s3.{bucket,accessKey,secretKey} under pipelines: so users can route KFP to a separate bucket / IAM, and lists a "Use case B (MLRun → AWS, Pipelines → SeaweedFS)" and "Use case C (both → AWS)".
    Neither of those use cases is actually achievable from this values file: there is no pipelines.storage.s3.* block, and the new mlrun-ce.pipelines.s3.* helpers hardcode SeaweedFS.
    Either the description needs to be rewritten to match the actual implementation, or the implementation needs to add the missing block.
  2. Renaming storage.s3.* (which used to hold the SeaweedFS creds seaweed / seaweed123 / mlrun) to storage.local.* and giving storage.s3.* brand-new "external AWS only" semantics is a potentially silent breaking change. Any existing override file that sets storage.s3.accessKey / secretKey / bucket (previously the only path) will be ignored on upgrade because storage.mode now defaults to local. If there are existing usages, you can either keep storage.s3.* as the in-cluster SeaweedFS block (matching prior behavior) and introduce a new storage.aws.* (or storage.external.*) for the AWS-only block, or add a Helm fail / NOTES.txt warning that detects "user set storage.s3.* but storage.mode is local" and tells them to migrate the values.
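
A minimal sketch of the second option in point 2, assuming the guard lives in storage-validation.yaml and that the new storage.s3.* defaults stay empty strings (so any non-empty value must come from a user override). The message wording is illustrative only.

```yaml
{{- /* Hypothetical migration guard: storage.s3.* set while mode is "local". */}}
{{- $s3 := .Values.storage.s3 | default dict }}
{{- if and (eq .Values.storage.mode "local") (or $s3.accessKey $s3.secretKey $s3.bucket) }}
{{ fail "storage.s3.* is set but storage.mode is \"local\". Move the in-cluster SeaweedFS credentials to storage.local.*, or set storage.mode to \"s3\" for external AWS S3." }}
{{- end }}
```

A hard fail is the stricter choice; the same condition could instead gate a warning block in NOTES.txt if breaking the upgrade is considered too aggressive.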

{{- if eq .Values.storage.mode "local" -}}
{{- .Values.storage.local.bucket -}}
{{- else -}}
{{- coalesce .Values.global.infrastructure.aws.bucketName .Values.storage.s3.bucket "mlrun" -}}
Member

The PR description says global.infrastructure.aws.bucketName is kept for backwards compatibility, but storage.s3.bucket is the recommended single source of truth. The current coalesce gives bucketName priority - so if a new user follows the recommendation and sets storage.s3.bucket, an inherited bucketName from an umbrella chart or older values file silently wins. Was that the intended precedence, or should storage.s3.bucket win when explicitly set?
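
If storage.s3.bucket is meant to win when explicitly set, the change would only be a reordering of the coalesce arguments. The sketch below assumes the helper is named mlrun-ce.s3.bucket, as listed in the PR description, and otherwise keeps the lines quoted above.

```yaml
{{- /* Sketch of the alternative precedence: explicit storage.s3.bucket first,
       legacy global.infrastructure.aws.bucketName second, "mlrun" last. */}}
{{- define "mlrun-ce.s3.bucket" -}}
{{- if eq .Values.storage.mode "local" -}}
{{- .Values.storage.local.bucket -}}
{{- else -}}
{{- coalesce .Values.storage.s3.bucket .Values.global.infrastructure.aws.bucketName "mlrun" -}}
{{- end -}}
{{- end -}}
```

Since coalesce returns the first non-empty argument, an empty storage.s3.bucket would still fall through to the legacy bucketName, which keeps the backwards-compatible path intact.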

AWS_ENDPOINT_URL_S3: {{ include "mlrun-ce.s3.service.url" . }}
{{- end }}
{{- end }}
{{- end }} No newline at end of file
Member

Missing newline at end of file.

@@ -1,6 +1,9 @@
{{- if and (eq .Values.storage.mode "s3") (not .Values.storage.s3.bucket) }}
Member

storage.s3.{accessKey,secretKey} default to empty strings but only bucket is validated. Switching to s3 mode without creds will silently produce an unusable Secret. Please also fail-fast when accessKey/secretKey are empty (unless global.infrastructure.aws.s3NonAnonymous is true).
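
One possible shape for that guard, placed next to the existing bucket check. This is a sketch: it takes the global.infrastructure.aws.s3NonAnonymous exemption literally as worded above and assumes that values path is defined in values.yaml; the error message is illustrative.

```yaml
{{- /* Hypothetical fail-fast for empty credentials in s3 mode. */}}
{{- if eq .Values.storage.mode "s3" }}
{{- $exempt := .Values.global.infrastructure.aws.s3NonAnonymous }}
{{- $missing := or (not .Values.storage.s3.accessKey) (not .Values.storage.s3.secretKey) }}
{{- if and (not $exempt) $missing }}
{{ fail "storage.mode is set to \"s3\" but storage.s3.accessKey and/or storage.s3.secretKey is empty. Please set both." }}
{{- end }}
{{- end }}
```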

{{- end }}
{{- if and (eq .Values.storage.mode "azure-blob") (not .Values.storage.azure.containerName) }}
{{ fail "storage.mode is set to \"azure-blob\" but storage.azure.containerName is not provided. Please set storage.azure.containerName." }}
{{- end }} No newline at end of file
Member

Missing newline at end of file here as well.


{{- define "mlrun-ce.pipelines.s3.insecure" -}}
true
{{- end -}}
Member

Every mlrun-ce.pipelines.s3.* helper just delegates to seaweedfs.s3.* or hardcodes a literal. Either drop the family and call seaweedfs.s3.* directly, or actually wire it to pipelines.storage.s3.* per the PR description.
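
For the second option, a rough sketch of what wiring could look like. It assumes a pipelines.storage.s3 block is added to values.yaml with empty-string defaults (that block does not exist in this PR), and falls back to the SeaweedFS helper when the override is empty.

```yaml
{{- /* Hypothetical: requires pipelines.storage.s3.accessKey: "" in values.yaml. */}}
{{- define "mlrun-ce.pipelines.s3.accessKey" -}}
{{- .Values.pipelines.storage.s3.accessKey | default (include "mlrun-ce.seaweedfs.s3.accessKey" .) -}}
{{- end -}}
```

secretKey and bucket would follow the same pattern; the endpoint and insecure helpers would also need a non-SeaweedFS branch before KFP could actually target AWS.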
