feat(storage): add csi driver health tests#523

Open
abueno-nvidia wants to merge 1 commit into NVIDIA:main from abueno-nvidia:storage-acceptance-tests/csi-driver-health

@abueno-nvidia abueno-nvidia commented Jul 1, 2026

K8sCsiDriverHealthCheck

What it tests: Verifies that the CSI drivers backing the configured StorageClasses are installed and running in the cluster.

For each unique provisioner derived from the configured StorageClasses, it runs:

| Subtest | What it checks | Issue |
|---|---|---|
| `storageclass-found[<sc>]` | The configured StorageClass exists in the cluster (resolves its `spec.provisioner`) | #488 |
| `csidriver-registered[<provisioner>]` | `CSIDriver/<provisioner>` object exists in the cluster | #487 |
| `controller-deployment-healthy[<provisioner>]` | CSI controller Deployment has `spec.replicas >= min_controller_replicas` (default 1) and all replicas Ready | #486 |
| `node-daemonset-healthy[<provisioner>]` | CSI node DaemonSet has all nodes scheduled, available, and ready | #486 |

Success criteria: Every configured StorageClass resolves to a registered CSIDriver; the controller and node workload subtests pass when a workloads block is configured (otherwise they are reported as skipped, not failed).
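
For illustration, the `node-daemonset-healthy` criterion ("all nodes scheduled, available, and ready") could be expressed against the standard Kubernetes DaemonSet status counters. This is a minimal sketch, not the code in this PR: the function name `daemonset_healthy` is hypothetical, and treating a missing counter as 0 is an assumption.

```python
from __future__ import annotations

from typing import Any


def daemonset_healthy(status: dict[str, Any]) -> tuple[bool, str]:
    """Return (passed, message) for a DaemonSet `.status` dict.

    Hypothetical helper: compares the scheduled, available, and ready
    counters against the desired number of scheduled pods.
    """
    desired = int(status.get("desiredNumberScheduled", 0))
    current = int(status.get("currentNumberScheduled", 0))
    available = int(status.get("numberAvailable", 0))
    ready = int(status.get("numberReady", 0))

    # Each counter must match the desired count; report the first mismatch.
    for label, value in (("scheduled", current), ("available", available), ("ready", ready)):
        if value != desired:
            return False, f"{label}={value} != desired={desired}"
    return True, f"all {desired} pods scheduled, available, and ready"
```

Note that this sketch does not treat `desired == 0` specially; whether an empty DaemonSet should pass is a policy decision the description above does not settle.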

Configuration notes:

  • drivers is a list of specs, each anchored on one or more storage_classes. The whole check is skipped (and reported as passing) when no StorageClasses resolve.
  • The workloads block (namespace, controller.deployment, node.daemonset) enables the controller/node health subtests. If it is omitted, only StorageClass resolution and CSIDriver registration are checked.
  • min_controller_replicas (default 1) sets the minimum spec.replicas the controller Deployment must declare; all declared replicas must also be Ready.
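
As a rough illustration of the configuration shape described in these notes, the sketch below mirrors the keys as a Python dict and shows which subtests would run for each driver spec. Only the key names (`drivers`, `storage_classes`, `workloads`, `namespace`, `controller.deployment`, `node.daemonset`, `min_controller_replicas`) come from the description; every value, the placement of `min_controller_replicas` at the top level, and the helper `planned_subtests` are assumptions made for the example.

```python
from __future__ import annotations

from typing import Any

# Hypothetical values throughout; only the key names come from the PR description.
config: dict[str, Any] = {
    "min_controller_replicas": 1,  # assumed to sit at the top level of the check config
    "drivers": [
        {
            "storage_classes": ["example-block"],
            "workloads": {
                "namespace": "example-csi",
                "controller": {"deployment": "example-csi-controller"},
                "node": {"daemonset": "example-csi-node"},
            },
        },
        # No workloads block: the controller/node subtests are skipped, not failed.
        {"storage_classes": ["example-nfs"]},
    ],
}


def planned_subtests(spec: dict[str, Any]) -> dict[str, str]:
    """Map each subtest name to "run" or "skipped" for one driver spec."""
    workload_state = "run" if spec.get("workloads") else "skipped"
    return {
        "storageclass-found": "run",
        "csidriver-registered": "run",
        "controller-deployment-healthy": workload_state,
        "node-daemonset-healthy": workload_state,
    }


plans = [planned_subtests(spec) for spec in config["drivers"]]
```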

Summary by CodeRabbit

  • New Features

    • Added a new CSI storage health check for Kubernetes storage suites.
    • The check now verifies configured CSI drivers are installed and healthy, and can also validate related controller and node workloads when available.
  • Bug Fixes

    • Improved handling of missing or incomplete storage configuration so checks skip gracefully when no valid storage classes are found.
    • Added support for detecting conflicting workload settings across storage classes tied to the same CSI driver.

@abueno-nvidia abueno-nvidia requested a review from a team as a code owner July 1, 2026 18:27

copy-pr-bot Bot commented Jul 1, 2026

This pull request requires additional validation before any workflows can run on NVIDIA's runners.

Pull request vetters can view their responsibilities here.

Contributors can view more details about this message here.

coderabbitai Bot commented Jul 1, 2026

📝 Walkthrough

Adds a new K8sCsiDriverHealthCheck validation class to isvtest that resolves StorageClasses to CSI provisioners and verifies CSIDriver registration plus controller Deployment/node DaemonSet health. Wires the check into the k8s_storage suite config and adds unit tests.

Changes

CSI Driver Health Check

| Layer | File(s) | Summary |
|---|---|---|
| Validation logic | `isvtest/src/isvtest/validations/k8s_storage.py` | Adds `_as_dict()` helper and `K8sCsiDriverHealthCheck` class resolving StorageClasses to provisioners, checking CSIDriver registration, and validating controller Deployment/DaemonSet readiness with conflict detection for mismatched workloads. |
| Suite configuration | `isvctl/configs/suites/k8s.yaml` | Registers `K8sCsiDriverHealthCheck` under `k8s_storage` with block, shared FS, and NFS driver groups and `min_controller_replicas: 1`. |
| Unit tests | `isvtest/tests/test_k8s_storage.py` | Adds `TestK8sCsiDriverHealthCheck` with JSON-building helpers and command parsing, covering skip behavior, pass scenarios, multiple failure conditions, and provisioner grouping/conflict logic. |

Estimated code review effort: 3 (Moderate) | ~25 minutes

Sequence Diagram(s)

sequenceDiagram
  participant Suite as k8s_storage Suite
  participant Check as K8sCsiDriverHealthCheck
  participant K8s as Kubernetes API
  participant CSIDriver
  participant Workloads as Deployment/DaemonSet

  Suite->>Check: run() with configured drivers
  Check->>K8s: get storageclass -o json
  K8s-->>Check: StorageClass list with provisioners
  Check->>Check: group StorageClasses by provisioner
  loop for each provisioner
    Check->>CSIDriver: verify csidriver registered
    CSIDriver-->>Check: registration result
    alt workloads configured
      Check->>Workloads: get deployment (controller)
      Workloads-->>Check: replicas/readyReplicas
      Check->>Workloads: get daemonset (node)
      Workloads-->>Check: desired/available/ready
    else no workloads
      Check->>Check: mark controller/node subtests skipped
    end
  end
  Check-->>Suite: subtest results

Related PRs: None identified.

Suggested labels: validation, k8s-storage, tests

Suggested reviewers: None identified.

🚥 Pre-merge checks | ✅ 5
✅ Passed checks (5 passed)
| Check name | Status | Explanation |
|---|---|---|
| Description Check | ✅ Passed | Check skipped - CodeRabbit's high-level summary is enabled. |
| Title check | ✅ Passed | The title is clearly related to the PR and captures the new CSI driver health test coverage, though it omits the broader validation/config changes. |
| Docstring Coverage | ✅ Passed | No functions found in the changed files to evaluate docstring coverage. Skipping docstring coverage check. |
| Linked Issues check | ✅ Passed | Check skipped because no linked issues were found for this pull request. |
| Out of Scope Changes check | ✅ Passed | Check skipped because no linked issues were found for this pull request. |

@coderabbitai coderabbitai Bot left a comment

Actionable comments posted: 2

🧹 Nitpick comments (1)
isvtest/tests/test_k8s_storage.py (1)

2214-2215: 📐 Maintainability & Code Quality | 🔵 Trivial | ⚡ Quick win

Add docstrings to the non-test helpers.

_make() and _workloads() are helper methods, not test entrypoints, so they still need PEP 257 docstrings under the repo's Python rules. As per coding guidelines, "Every function and class must have docstrings following PEP 257".

Also applies to: 2254-2264

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@isvtest/tests/test_k8s_storage.py` around lines 2214 - 2215, The non-test
helper methods _make() and _workloads() in the K8s CSI driver test helpers are
missing required PEP 257 docstrings. Add concise docstrings directly on those
helper definitions (and any related helper in the referenced _workloads block)
so they comply with the repo rule that every function and class must have
docstrings.

Source: Coding guidelines

🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Inline comments:
In `@isvtest/src/isvtest/validations/k8s_storage.py`:
- Around line 2144-2149: The readiness check in the deployment validation still
enforces full rollout by comparing ready replicas to spec.replicas, which
conflicts with the intended min_controller_replicas threshold. Update the logic
in the Deployment validation path to use min_controller_replicas as the
readiness comparison after the existing desired < min_controller_replicas guard,
and adjust the failure message in that same block to reflect the threshold-based
check using the deployment validation symbols like desired, ready, and
min_replicas.
- Around line 2050-2053: The storage class aggregation in k8s_storage validation
is collapsing duplicate entries too early, causing later workloads for the same
class to be ignored before conflict detection. Update the logic around the
storage_classes loop in the validation path so duplicate StorageClass specs are
preserved until after provisioner grouping, and let the per-provisioner conflict
checks see all entries. Use the sc_to_workloads aggregation and the downstream
conflict-detection code to keep behavior consistent, and add a regression test
covering the same-StorageClass but different workloads/controller-node names
case.

---

Nitpick comments:
In `@isvtest/tests/test_k8s_storage.py`:
- Around line 2214-2215: The non-test helper methods _make() and _workloads() in
the K8s CSI driver test helpers are missing required PEP 257 docstrings. Add
concise docstrings directly on those helper definitions (and any related helper
in the referenced _workloads block) so they comply with the repo rule that every
function and class must have docstrings.

ℹ️ Review info
⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Enterprise

Run ID: 8630636e-4f84-4cee-aa80-1ead975010e6

📥 Commits

Reviewing files that changed from the base of the PR and between dd984a3 and de577c1.

📒 Files selected for processing (3)
  • isvctl/configs/suites/k8s.yaml
  • isvtest/src/isvtest/validations/k8s_storage.py
  • isvtest/tests/test_k8s_storage.py

Comment on lines +2050 to +2053
    for raw_sc in spec.get("storage_classes") or []:
        sc = str(raw_sc).strip()
        if sc and sc not in sc_to_workloads:
            sc_to_workloads[sc] = workloads

🎯 Functional Correctness | 🟠 Major | ⚡ Quick win

Don't collapse duplicate StorageClass entries before conflict detection.

Deduping storage_classes here drops later workloads blocks for the same StorageClass, so two specs that point at the same class but different controller/node names silently validate whichever entry appeared first instead of tripping the per-provisioner conflict path. Please preserve duplicate specs until after provisioner grouping, and add a regression test for the same-StorageClass/different-workloads case.

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@isvtest/src/isvtest/validations/k8s_storage.py` around lines 2050 - 2053, The
storage class aggregation in k8s_storage validation is collapsing duplicate
entries too early, causing later workloads for the same class to be ignored
before conflict detection. Update the logic around the storage_classes loop in
the validation path so duplicate StorageClass specs are preserved until after
provisioner grouping, and let the per-provisioner conflict checks see all
entries. Use the sc_to_workloads aggregation and the downstream
conflict-detection code to keep behavior consistent, and add a regression test
covering the same-StorageClass but different workloads/controller-node names
case.
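
One possible shape for the suggested fix, sketched under stated assumptions: every (StorageClass, workloads) mention is kept, entries are grouped by provisioner, and a conflict is reported when a provisioner ends up with more than one distinct workloads block. The function name `group_workloads_by_provisioner`, the `sc_to_provisioner` mapping argument, and the conflict message format are hypothetical; the real code in `k8s_storage.py` may structure this differently.

```python
from __future__ import annotations

from collections import defaultdict
from typing import Any


def group_workloads_by_provisioner(
    drivers: list[dict[str, Any]],
    sc_to_provisioner: dict[str, str],
) -> tuple[dict[str, dict[str, Any] | None], list[str]]:
    """Return (workloads per provisioner, conflict messages).

    Hypothetical helper: duplicates are preserved until after grouping so
    that two specs naming the same StorageClass can still conflict.
    """
    # One entry per StorageClass mention, not per unique StorageClass name.
    entries: list[tuple[str, dict[str, Any] | None]] = []
    for spec in drivers:
        workloads = spec.get("workloads") or None
        for raw_sc in spec.get("storage_classes") or []:
            sc = str(raw_sc).strip()
            if sc:
                entries.append((sc, workloads))

    # Collect the distinct workloads blocks seen for each provisioner.
    by_provisioner: dict[str, list[dict[str, Any]]] = defaultdict(list)
    for sc, workloads in entries:
        provisioner = sc_to_provisioner.get(sc)
        if provisioner is None:
            continue  # unresolved StorageClass; assumed to be reported elsewhere
        bucket = by_provisioner[provisioner]
        if workloads is not None and workloads not in bucket:
            bucket.append(workloads)

    resolved: dict[str, dict[str, Any] | None] = {}
    conflicts: list[str] = []
    for provisioner, distinct in by_provisioner.items():
        if len(distinct) > 1:
            conflicts.append(f"{provisioner}: {len(distinct)} conflicting workloads blocks")
            resolved[provisioner] = None
        else:
            resolved[provisioner] = distinct[0] if distinct else None
    return resolved, conflicts
```

In this sketch a spec without a workloads block never conflicts with one that has it; whether that mix should count as a conflict is left open here.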

Comment on lines +2144 to +2149
    if desired < min_replicas:
        return False, (
            f"Deployment {namespace}/{name} has spec.replicas={desired} < min_controller_replicas={min_replicas}"
        )
    if ready != desired:
        return False, (f"Deployment {namespace}/{name}: readyReplicas={ready} != spec.replicas={desired}")

🎯 Functional Correctness | 🟠 Major | ⚡ Quick win

Use min_controller_replicas as the readiness threshold.

After the desired < min_controller_replicas guard, the ready != desired check still fails any controller that is above the configured minimum but not fully rolled out. With min_controller_replicas: 1, a 3-replica Deployment with 2 Ready replicas fails even though it satisfies the threshold this option advertises. Compare ready against min_controller_replicas here, or rename the setting/docs if full readiness is the real requirement.

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@isvtest/src/isvtest/validations/k8s_storage.py` around lines 2144 - 2149, The
readiness check in the deployment validation still enforces full rollout by
comparing ready replicas to spec.replicas, which conflicts with the intended
min_controller_replicas threshold. Update the logic in the Deployment validation
path to use min_controller_replicas as the readiness comparison after the
existing desired < min_controller_replicas guard, and adjust the failure message
in that same block to reflect the threshold-based check using the deployment
validation symbols like desired, ready, and min_replicas.
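
A minimal sketch of the threshold-based variant this comment proposes, assuming the variable names from the quoted diff (`namespace`, `name`, `desired`, `ready`, `min_replicas`). The standalone function `check_controller_ready` and the success message are hypothetical, and this only applies if the threshold semantics are chosen over full-rollout readiness.

```python
from __future__ import annotations


def check_controller_ready(
    namespace: str,
    name: str,
    desired: int,
    ready: int,
    min_replicas: int = 1,
) -> tuple[bool, str]:
    """Return (passed, message) using min_replicas as the readiness threshold."""
    # Guard kept from the quoted diff: the Deployment must declare enough replicas.
    if desired < min_replicas:
        return False, (
            f"Deployment {namespace}/{name} has spec.replicas={desired} < min_controller_replicas={min_replicas}"
        )
    # Changed comparison: ready replicas are checked against the threshold,
    # not against spec.replicas, so a partial rollout above the minimum passes.
    if ready < min_replicas:
        return False, (
            f"Deployment {namespace}/{name}: readyReplicas={ready} < min_controller_replicas={min_replicas}"
        )
    return True, f"Deployment {namespace}/{name}: readyReplicas={ready}/{desired}"
```

With `min_replicas=1`, the 3-replica Deployment with 2 Ready replicas from the example above passes under this variant, whereas the `ready != desired` check fails it.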
