Skip to content

feat(db): consolidate REST into shared nico-pg-cluster#3081

Merged
shayan1995 merged 3 commits into
NVIDIA:mainfrom
shayan1995:feat/consolidated-database
Jul 2, 2026
Merged

feat(db): consolidate REST into shared nico-pg-cluster#3081
shayan1995 merged 3 commits into
NVIDIA:mainfrom
shayan1995:feat/consolidated-database

Conversation

@shayan1995

@shayan1995 shayan1995 commented Jul 2, 2026

Copy link
Copy Markdown
Contributor

Move the NICo REST API database from the standalone postgres.postgres StatefulSet onto the Zalando-managed nico-pg-cluster, eliminating the separate REST database instance.

Changes

  • postgresql.yaml: add nico-rest.nico user and nico_rest database, gated on rest.enabled (default: true)
  • eso-external-secrets.yaml: add nico-rest-db-eso ClusterExternalSecret syncing Zalando credentials to the nico-rest namespace as nico-rest-pg-creds — named to avoid collision with the db-creds hook managed by nico-rest-common (hook-delete-policy: before-hook-creation causes a Helm/ESO race if ESO targets db-creds directly)
  • values.yaml: add rest.enabled / rest.namespace toggles with documentation
  • setup.sh:
    • Wait (up to 120 s) for nico-rest-pg-creds to be synced by ESO; extract Zalando credentials and write to a chmod 600 temp file; pass via -f to helm upgrade --install to populate nico-rest-common.secrets.dbCreds.* at install time (avoids --set credential exposure and Helm special-char escaping issues)
    • Install pg_trgm extension on nico_rest after the cluster reaches Running state, with a 120 s bounded retry loop
  • nico-rest-db/values.yaml: update db.host / db.name / db.user defaults to nico-pg-cluster.postgres.svc.cluster.local / nico_rest / nico-rest.nico
  • nico-rest-api/values.yaml: same db defaults plus secrets.dbCreds: db-creds to enable the secret-based password path for live rotation

Related Issues

N/A

Type of Change

Change - Changes in existing functionality

Breaking Changes

This PR contains breaking changes.

rest.enabled defaults to true. On upgrade, nico-prereqs will create a nico-rest.nico Postgres user, a nico_rest database on nico-pg-cluster, and a nico-rest-db-eso ClusterExternalSecret in every site that does not explicitly set rest.enabled: false. Sites that do not run nico-rest should add rest.enabled: false to their nico-prereqs site values before upgrading to prevent these resources from being created.

Additionally, nico-rest-api and nico-rest-db chart defaults now point at nico-pg-cluster (host, name, user). Any site running nico-rest against the old standalone instance must either migrate to nico-pg-cluster via setup.sh or override these values explicitly.

Testing

Manual end-to-end validation performed on dev6 (rg-forge-dev6, 3-node cluster).

Additional Notes

@shayan1995 shayan1995 requested a review from a team as a code owner July 2, 2026 03:49
@coderabbitai

coderabbitai Bot commented Jul 2, 2026

Copy link
Copy Markdown
Contributor

Review Change Stack

No actionable comments were generated in the recent review. 🎉

ℹ️ Recent review info
⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Enterprise

Run ID: 3c42f289-a07e-4531-a2e0-45d5963ff760

📥 Commits

Reviewing files that changed from the base of the PR and between e2acc67 and fa9cded.

📒 Files selected for processing (1)
  • helm-prereqs/values/nico-site-agent.yaml

Summary by CodeRabbit

  • New Features
    • Added configuration to deploy NICo REST with a dedicated PostgreSQL user/database and secret-based credentials syncing.
    • PostgreSQL setup now provisions the REST user/database and required pg_trgm extension.
  • Bug Fixes
    • Install now waits for required REST database credentials to be synced before deploying.
    • REST Helm charts now point to the correct PostgreSQL service, database, user, and credential wiring.
    • Temporary credentials files are cleaned up after the Helm deployment attempt.

Walkthrough

This PR provisions NICo REST database prerequisites on the shared PostgreSQL cluster, syncs credentials into the REST namespace, updates the install script to consume them, and aligns REST chart defaults with the new database and user.

Changes

NICo REST DB provisioning

Layer / File(s) Summary
Enable REST prerequisites and PostgreSQL user/database mapping
helm-prereqs/values.yaml, helm-prereqs/templates/postgresql.yaml
Adds a rest config block enabling the feature and namespace, and conditionally adds a nico-rest.nico user and nico_rest database mapping.
ESO ClusterExternalSecret for REST DB credentials
helm-prereqs/templates/eso-external-secrets.yaml
Adds nico-rest-db-eso, gated on postgresql.enabled and rest.enabled, syncing Zalando-generated credentials into nico-rest-pg-creds.
Setup script: pg_trgm extension and Helm credential wiring
helm-prereqs/setup.sh
Creates the pg_trgm extension on nico_rest after the cluster is ready, waits for the credentials Secret, decodes it, passes the values into Helm, and cleans up the temporary file.
REST chart default values point to new database
helm/rest/nico-rest/charts/nico-rest-api/values.yaml, helm/rest/nico-rest/charts/nico-rest-db/values.yaml, helm-prereqs/values/nico-site-agent.yaml
Updates db.host, db.name, db.user, secrets.dbCreds, and DB_ADDR/DB_DATABASE defaults to reference the shared cluster and new REST database.

Estimated code review effort: 3 (Moderate) | ~25 minutes

Sequence Diagram(s)

sequenceDiagram
  participant SetupScript
  participant PostgresCluster
  participant ESO
  participant K8sSecret
  participant HelmInstall

  SetupScript->>PostgresCluster: wait for nico-pg-cluster Running
  SetupScript->>PostgresCluster: exec CREATE EXTENSION pg_trgm on nico_rest
  ESO->>PostgresCluster: read Zalando-generated credentials
  ESO->>K8sSecret: write nico-rest-pg-creds
  SetupScript->>K8sSecret: wait for nico-rest-pg-creds
  SetupScript->>K8sSecret: base64-decode username/password
  SetupScript->>HelmInstall: pass decoded dbCreds values
Loading

Related Issues: None found.

Related PRs: None found.

Suggested labels: helm, database, enhancement

Suggested reviewers: None found.

🐰 A cluster hums, a secret wakes,
The REST app learns the path it takes,
pg_trgm is set, credentials flow,
And charts now point where they should go.

🚥 Pre-merge checks | ✅ 5
✅ Passed checks (5 passed)
Check name Status Explanation
Title check ✅ Passed The title accurately summarizes the main change: moving REST onto the shared nico-pg-cluster.
Description check ✅ Passed The description is detailed and directly describes the same database consolidation changes in the diff.
Docstring Coverage ✅ Passed No functions found in the changed files to evaluate docstring coverage. Skipping docstring coverage check.
Linked Issues check ✅ Passed Check skipped because no linked issues were found for this pull request.
Out of Scope Changes check ✅ Passed Check skipped because no linked issues were found for this pull request.
✨ Finishing Touches
🧪 Generate unit tests (beta)
  • Create PR with unit tests

Comment @coderabbitai help to get the list of available commands.

@shayan1995

shayan1995 commented Jul 2, 2026

Copy link
Copy Markdown
Contributor Author

@ajf given that this is a breaking change can we get this in 2.0 RC?

@github-actions

github-actions Bot commented Jul 2, 2026

Copy link
Copy Markdown

🔐 TruffleHog Secret Scan

No secrets or credentials found!

Your code has been scanned for 700+ types of secrets and credentials. All clear! 🎉

🔗 View scan details

🕐 Last updated: 2026-07-02 03:52:23 UTC | Commit: 3240863

@coderabbitai coderabbitai Bot left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 3

🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Inline comments:
In `@helm-prereqs/setup.sh`:
- Around line 384-397: The pg_trgm install in the setup script is race-prone
because it runs once and suppresses errors, so it can miss the window before the
nico_rest database is ready. Update the installation block around the
_PG_PRIMARY lookup and kubectl exec call to use a bounded retry/wait loop
similar to the later nico-rest-pg-creds wait logic, and only proceed when the
CREATE EXTENSION succeeds. Also tighten the fallback message so it distinguishes
“database not ready yet” from “rest.enabled=false” rather than using one
ambiguous message.
- Around line 730-749: The nico-rest setup flow currently waits forever for the
ESO-managed secret and injects DB credentials directly via Helm CLI flags.
Update the credential sync loop in setup.sh around the nico-rest-pg-creds check
to include a bounded timeout and fail fast with a clear error if the secret
never appears, and replace the --set nico-rest-common.secrets.dbCreds.* usage in
NICO_REST_CMD with a safer mechanism such as a temporary values file or direct
Secret reference so secret values are not exposed or broken by shell/Helm
parsing.

In `@helm-prereqs/values.yaml`:
- Around line 128-134: The doc comment in values.yaml names the wrong Secret for
the ESO sync. Update the comment to match the target used by the
ClusterExternalSecret in the ESO template, using the actual secret name
referenced by the corresponding ESO resource (the one intended to avoid the
db-creds collision). Keep the description aligned with the generated credentials
flow and the rest namespace consumer names, and remove the misleading mention of
db-creds.
🪄 Autofix (Beta)

Fix all unresolved CodeRabbit comments on this PR:

  • Push a commit to this branch (recommended)
  • Create a new PR with the fixes

ℹ️ Review info
⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Enterprise

Run ID: cd803d4b-d4b6-43d2-9cbf-6a2ff9e41203

📥 Commits

Reviewing files that changed from the base of the PR and between e9ebd53 and 3240863.

📒 Files selected for processing (6)
  • helm-prereqs/setup.sh
  • helm-prereqs/templates/eso-external-secrets.yaml
  • helm-prereqs/templates/postgresql.yaml
  • helm-prereqs/values.yaml
  • helm/rest/nico-rest/charts/nico-rest-api/values.yaml
  • helm/rest/nico-rest/charts/nico-rest-db/values.yaml

Comment thread helm-prereqs/setup.sh
Comment thread helm-prereqs/setup.sh Outdated
Comment thread helm-prereqs/values.yaml

@thossain-nv thossain-nv left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks good @shayan1995 Existing installations have Values.rest.enabled, correct?

Should we populate the Breaking Changes section in the PR description?

Comment thread helm-prereqs/setup.sh
@github-actions

github-actions Bot commented Jul 2, 2026

Copy link
Copy Markdown

🔍 Container Scan Summary

Service Total Critical High Medium Low Other
boot-artifacts-aarch64 3 0 0 3 0 0
boot-artifacts-x86_64 3 0 0 3 0 0
forge-admin-cli-x86_64 272 5 27 89 7 144
machine-validation-runner 769 25 209 278 36 221
machine_validation 769 25 209 278 36 221
machine_validation-aarch64 769 25 209 278 36 221
nvmetal-carbide 769 25 209 278 36 221
TOTAL 3354 105 863 1207 151 1028

Per-CVE detail lives in the per-service grype-* artifacts (JSON + SARIF). Severity counts only — no CVE IDs published here.

@coderabbitai coderabbitai Bot left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🧹 Nitpick comments (2)
helm-prereqs/setup.sh (2)

764-770: 🔒 Security & Privacy | 🔵 Trivial | ⚡ Quick win

Guard the temp credentials file with a trap instead of relying only on the deploy/skip branches.

_NICO_REST_CREDS_FILE (containing the decoded DB password in plaintext) is only removed at Lines 815 and 817, on the two known exit paths of the deploy decision. Any earlier abnormal exit between file creation (Line 764) and those branches (e.g. a failing command under a set -e shell, or a future refactor that adds a step in between) would leave the plaintext credentials file behind in /tmp. chmod 600 limits exposure to the local user but does not clean it up.

A trap ... EXIT registered right after mktemp removes this class of bug entirely and is idiomatic for temp-secret handling in shell scripts.

As per path instructions, review shell scripts for "secret-safe logging" and "idempotency"; guaranteed cleanup on any exit path is part of that guarantee.

🔒 Suggested fix: trap-based cleanup
 _NICO_REST_CREDS_FILE="$(mktemp)"
 chmod 600 "${_NICO_REST_CREDS_FILE}"
+trap 'rm -f "${_NICO_REST_CREDS_FILE}"' EXIT
 printf 'nico-rest-common:\n  secrets:\n    dbCreds:\n      username: "%s"\n      password: "%s"\n' \

And remove the now-redundant explicit rm -f calls at Lines 815/817 (or leave them as a harmless no-op after the trap already fires).

Also applies to: 815-817

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@helm-prereqs/setup.sh` around lines 764 - 770, The temp credentials file
created in the `_NICO_REST_CREDS_FILE` setup needs guaranteed cleanup on every
exit path, not just the deploy/skip branches. Add a `trap` immediately after
`mktemp` in `setup.sh` so the file is removed on EXIT even if a later command
fails or the script exits early, and then clean up the now-redundant explicit
`rm -f` handling in the deploy decision block.

Source: Path instructions


398-400: 🩺 Stability & Availability | 🔵 Trivial | 💤 Low value

stderr from kubectl exec is swallowed, blurring "db not ready" from other failures.

2>/dev/null on the CREATE EXTENSION attempt means any failure (auth error, transient network blip, syntax issue) is retried and reported identically to "database not yet created by operator." This is a minor diagnostic gap since the loop already retries regardless of cause, and the final summary message at Lines 409-411 gives operators enough to act on.

Also applies to: 405-405

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@helm-prereqs/setup.sh` around lines 398 - 400, The kubectl exec check in
setup.sh is swallowing stderr during the CREATE EXTENSION pg_trgm probe, which
makes all failures look like the database is merely not ready. Remove the stderr
redirection from the kubectl exec invocation in the retry block around the
CREATE EXTENSION attempt so failures are visible, and keep the existing retry
logic and final summary message unchanged; use the kubectl exec / psql probe as
the locating symbol.
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Nitpick comments:
In `@helm-prereqs/setup.sh`:
- Around line 764-770: The temp credentials file created in the
`_NICO_REST_CREDS_FILE` setup needs guaranteed cleanup on every exit path, not
just the deploy/skip branches. Add a `trap` immediately after `mktemp` in
`setup.sh` so the file is removed on EXIT even if a later command fails or the
script exits early, and then clean up the now-redundant explicit `rm -f`
handling in the deploy decision block.
- Around line 398-400: The kubectl exec check in setup.sh is swallowing stderr
during the CREATE EXTENSION pg_trgm probe, which makes all failures look like
the database is merely not ready. Remove the stderr redirection from the kubectl
exec invocation in the retry block around the CREATE EXTENSION attempt so
failures are visible, and keep the existing retry logic and final summary
message unchanged; use the kubectl exec / psql probe as the locating symbol.

ℹ️ Review info
⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Enterprise

Run ID: 9d8a1b82-305d-4b9e-8d1c-2593c90b4857

📥 Commits

Reviewing files that changed from the base of the PR and between 3240863 and d172dcf.

📒 Files selected for processing (2)
  • helm-prereqs/setup.sh
  • helm-prereqs/values.yaml
✅ Files skipped from review due to trivial changes (1)
  • helm-prereqs/values.yaml

@shayan1995

Copy link
Copy Markdown
Contributor Author

@thossain-nv rest.enabled is a new field so existing values files don't have it; The default we set is enabled: true

Move the NICo REST API database from the standalone postgres.postgres
StatefulSet onto the Zalando-managed nico-pg-cluster, eliminating the
separate REST database instance.

Changes:
- postgresql.yaml: add nico-rest.nico user and nico_rest database,
  gated on rest.enabled (default: true)
- eso-external-secrets.yaml: add nico-rest-db-eso ClusterExternalSecret
  syncing Zalando credentials to nico-rest namespace as nico-rest-pg-creds
  (named to avoid collision with the db-creds hook managed by nico-rest-common)
- values.yaml: add rest.enabled / rest.namespace toggles with documentation
- setup.sh: extract Zalando credentials from nico-rest-pg-creds and inject
  into nico-rest-common.secrets.dbCreds.* at install time; create pg_trgm
  extension on nico_rest after cluster reaches Running state (Zalando
  preparedDatabases conflicts with the databases section in operator v1.10)
- nico-rest-db/values.yaml: update db.host/name/user defaults to
  nico-pg-cluster.postgres.svc.cluster.local / nico_rest / nico-rest.nico
- nico-rest-api/values.yaml: same db defaults plus secrets.dbCreds: db-creds
  to enable the secret-based password path for live rotation
- setup.sh pg_trgm block: replace single-attempt if with bounded retry
  loop (24×5s = 120s); re-detect Patroni primary each iteration;
  distinguish 'nico_rest not yet created' (retrying) from a final
  timeout message that calls out rest.enabled=false vs true
- setup.sh nico-rest-pg-creds wait: add 120s bounded timeout with a
  clear fail-fast error pointing at the ClusterExternalSecret; write
  Zalando credentials to a chmod-600 mktemp file and pass -f to helm
  instead of --set so passwords are not visible in process arguments
  and are not subject to Helm --set special-char escaping; rm temp
  file on both deploy and skip paths
- values.yaml: correct block and inline comments from 'db-creds' to
  'nico-rest-pg-creds' to match the actual ESO target secret name
@shayan1995 shayan1995 force-pushed the feat/consolidated-database branch from d172dcf to e2acc67 Compare July 2, 2026 20:39
Update nico-site-agent.yaml DB_ADDR and DB_DATABASE to point at the
shared nico-pg-cluster instead of the defunct standalone postgres.postgres
instance. The site-agent uses the same nico-rest.nico credentials (via
db-creds) as nico-rest-api, so no credential changes are needed.
@parmani-nv parmani-nv self-requested a review July 2, 2026 21:04

@parmani-nv parmani-nv left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM

@shayan1995 shayan1995 merged commit 12ef832 into NVIDIA:main Jul 2, 2026
114 checks passed
@ajf

ajf commented Jul 2, 2026

Copy link
Copy Markdown
Collaborator

@shayan1995 said more simply, in the values file for helm, on already deployed sites needs to say rest.enabled: false ?

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

7 participants