Skip to content

feat(isvtest): provider-agnostic workload and conformance improvements#522

Open
saiyam1814 wants to merge 2 commits into
NVIDIA:mainfrom
saiyam1814:framework-improvements
Open

feat(isvtest): provider-agnostic workload and conformance improvements#522
saiyam1814 wants to merge 2 commits into
NVIDIA:mainfrom
saiyam1814:framework-improvements

Conversation

@saiyam1814

@saiyam1814 saiyam1814 commented Jul 1, 2026

Copy link
Copy Markdown

Summary

Provider-agnostic framework and workload improvements, split out from #414 per @abegnoche's note that new provider directories (isvctl/configs/providers/**) will only be accepted at the 1.0.0 release this fall. These isvtest/ changes benefit every provider and stand alone, so this PR carries just those. #414 will be rebased down to only the vCluster provider config and held for 1.0.0.

All changes are additive and backward-compatible (new optional config params + resilience fallbacks); existing provider behavior is unchanged when the new options are unset.

Changes

  • workloads/k8s_nim.pyruntime_class_name, nim_memory_request, nim_memory_limit config params.
  • workloads/k8s_nim_helm.py--kubeconfig targeting (extracts kubeconfig from KUBECTL), memory + runtimeClassName params, and NIM_MAX_MODEL_LEN env via --set-string.
  • workloads/k8s_nccl_multinode.pyoverride_launcher_affinity (drop the launcher's node-role.kubernetes.io/control-plane affinity on clusters that don't expose control-plane nodes) and runtime_class_name.
  • workloads/k8s_stress.pyruntime_class_name config param.
  • validations/k8s_conformance.py — resilient JUnit retrieval for managed-K8s LoadBalancers that reset long-running kubectl exec streams: retry exec cat (3x), fall back to kubectl cp (tar framing), and support an opt-in pre-staged local file via ISVTEST_CONFORMANCE_JUNIT_LOCAL_PATH.
  • core/nvidia.py — broaden the GPU-count regex so non-standard nvidia-smi GPU names (e.g. Tesla T4) are counted.

Testing

  • make lint — ruff, yamlfmt, SPDX headers all pass.
  • make test — full unit suite passes (1057 isvtest tests, incl. updated test_k8s_conformance).

Rebased onto current main, so it composes cleanly with #449 (test_ids/labels in YAML) and #473 (CUDA UMD Version parsing).

Summary by CodeRabbit

  • New Features

    • Added configurable runtime class and resource controls for GPU/NIM workloads, including optional omission of runtimeClassName.
    • Enhanced Helm-based NIM deployment with additional override options and better cluster targeting.
  • Bug Fixes

    • Improved Kubernetes conformance JUnit collection reliability with retries and a fallback copy method when needed.
    • Fixed GPU counting to better identify actual GPUs without overcounting non-GPU table rows.
    • Updated workload manifest patching to respect launcher affinity overrides.

Split out from the vCluster provider PR (NVIDIA#414) so these can merge ahead
of the 1.0.0 provider-acceptance milestone. All changes are additive,
backward-compatible, and apply across every provider.

- k8s_nim: add runtime_class_name, nim_memory_request, nim_memory_limit
  config params
- k8s_nim_helm: --kubeconfig targeting, memory + runtimeClassName params,
  NIM_MAX_MODEL_LEN env via --set-string
- k8s_nccl_multinode: override_launcher_affinity and runtime_class_name for
  MPI launchers on clusters without control-plane nodes
- k8s_stress: runtime_class_name config param
- k8s_conformance: resilient JUnit retrieval (retry exec-cat, kubectl cp
  fallback, opt-in pre-staged local path via ISVTEST_CONFORMANCE_JUNIT_LOCAL_PATH)
- core/nvidia: broaden GPU-count regex for non-standard GPU names (e.g. Tesla T4)

make lint and make test pass (1057 isvtest unit tests).

Signed-off-by: Saiyam Pathak <[email protected]>
Copilot AI review requested due to automatic review settings July 1, 2026 10:06
@saiyam1814 saiyam1814 requested a review from a team as a code owner July 1, 2026 10:06
@copy-pr-bot

copy-pr-bot Bot commented Jul 1, 2026

Copy link
Copy Markdown

This pull request requires additional validation before any workflows can run on NVIDIA's runners.

Pull request vetters can view their responsibilities here.

Contributors can view more details about this message here.

@coderabbitai

coderabbitai Bot commented Jul 1, 2026

Copy link
Copy Markdown

Review Change Stack

No actionable comments were generated in the recent review. 🎉

ℹ️ Recent review info
⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Enterprise

Run ID: 30fc0c34-dcff-446b-bc6d-5444f331b4e1

📥 Commits

Reviewing files that changed from the base of the PR and between 55c0281 and 2b3b526.

📒 Files selected for processing (5)
  • isvtest/src/isvtest/core/nvidia.py
  • isvtest/src/isvtest/validations/k8s_conformance.py
  • isvtest/src/isvtest/workloads/k8s_nim.py
  • isvtest/src/isvtest/workloads/k8s_stress.py
  • isvtest/tests/test_nvidia.py
🚧 Files skipped from review as they are similar to previous changes (4)
  • isvtest/src/isvtest/core/nvidia.py
  • isvtest/src/isvtest/workloads/k8s_nim.py
  • isvtest/src/isvtest/workloads/k8s_stress.py
  • isvtest/src/isvtest/validations/k8s_conformance.py

📝 Walkthrough

Walkthrough

Adds staged JUnit retrieval with local pre-stage, retried kubectl exec, and kubectl cp fallback in conformance validation. Adds configurable runtimeClassName and sizing overrides across NCCL, NIM, NIM Helm, and stress workloads, plus updated GPU counting and tests.

Changes

JUnit Retrieval Resilience

Layer / File(s) Summary
Staged JUnit retrieval implementation and fallback helpers
isvtest/src/isvtest/validations/k8s_conformance.py
run() now tries a local pre-staged JUnit file, retries kubectl exec cat up to 3 times, and falls back to kubectl cp via new _try_local_junit() and _kubectl_cp() helpers.
Test for kubectl cp fallback
isvtest/tests/test_k8s_conformance.py
Adds Path import and a test verifying the kubectl cp fallback path executes and produces a passing result when cat fails.

Configurable Runtime Class and Memory Overrides

Layer / File(s) Summary
NCCL multinode manifest patching for runtime class and launcher affinity
isvtest/src/isvtest/workloads/k8s_nccl_multinode.py
_patch_manifest removes or replaces runtimeClassName and can strip launcher control-plane node affinity based on config.
NIM job manifest runtime class, memory, and max model length
isvtest/src/isvtest/workloads/k8s_nim.py
run() reads runtime class and sizing config, then updates the Job YAML to override runtime class, memory, and NIM_MAX_MODEL_LEN.
NIM Helm kubeconfig-aware invocation and value overrides
isvtest/src/isvtest/workloads/k8s_nim_helm.py
Adds kubeconfig parsing from KUBECTL, applies it to Helm commands, and adds Helm value overrides for memory, runtime class, and max model length.
GPU stress workload configurable runtime class
isvtest/src/isvtest/workloads/k8s_stress.py
run() passes runtime_class_name into pod YAML generation, which conditionally emits or omits runtimeClassName.
GPU counting regex update
isvtest/src/isvtest/core/nvidia.py
count_gpus_from_full_output now matches GPU rows by Bus-Id pattern instead of requiring NVIDIA text, with updated tests.

Estimated code review effort: 3 (Moderate) | ~30 minutes

Sequence Diagram(s)

sequenceDiagram
  participant run
  participant try_local_junit
  participant kubectl_exec_cat
  participant kubectl_cp

  run->>try_local_junit: read ISVTEST_CONFORMANCE_JUNIT_LOCAL_PATH
  try_local_junit-->>run: content or empty
  run->>kubectl_exec_cat: retry up to 3 times
  kubectl_exec_cat-->>run: content or failure
  run->>kubectl_cp: copy JUnit via kubectl cp
  kubectl_cp-->>run: content or failure
Loading

Suggested reviewers: mresvanis

🚥 Pre-merge checks | ✅ 4 | ❌ 1

❌ Failed checks (1 warning)

Check name Status Explanation Resolution
Docstring Coverage ⚠️ Warning Docstring coverage is 57.14% which is insufficient. The required threshold is 80.00%. Write docstrings for the functions missing them to satisfy the coverage threshold.
✅ Passed checks (4 passed)
Check name Status Explanation
Description Check ✅ Passed Check skipped - CodeRabbit’s high-level summary is enabled.
Title check ✅ Passed The title accurately summarizes the PR’s broad provider-agnostic workload and conformance improvements.
Linked Issues check ✅ Passed Check skipped because no linked issues were found for this pull request.
Out of Scope Changes check ✅ Passed Check skipped because no linked issues were found for this pull request.
✨ Finishing Touches
🧪 Generate unit tests (beta)
  • Create PR with unit tests

Comment @coderabbitai help to get the list of available commands.

Copilot AI left a comment

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

This PR extends isvtest’s provider-agnostic Kubernetes workloads and validations with new configuration knobs (runtimeClassName + memory tuning), more resilient CNCF conformance JUnit retrieval, and improved GPU counting from nvidia-smi output.

Changes:

  • Add workload configuration options for runtimeClassName, NIM memory sizing, Helm --kubeconfig targeting, and NCCL multinode launcher affinity override.
  • Make CNCF conformance JUnit retrieval more resilient via retries, kubectl cp fallback, and optional pre-staged local JUnit support.
  • Broaden GPU counting logic for non-NVIDIA* GPU name rows.

Reviewed changes

Copilot reviewed 7 out of 7 changed files in this pull request and generated 3 comments.

Show a summary per file
File Description
isvtest/tests/test_k8s_conformance.py Adds coverage for JUnit retrieval fallback to kubectl cp.
isvtest/src/isvtest/workloads/k8s_stress.py Introduces runtime_class_name plumbing into generated Pod YAML.
isvtest/src/isvtest/workloads/k8s_nim.py Adds runtimeClassName + memory tuning + optional NIM KV-cache env injection into the manifest.
isvtest/src/isvtest/workloads/k8s_nim_helm.py Extracts kubeconfig from KUBECTL and threads it into Helm operations; adds optional memory/runtimeClass/env settings.
isvtest/src/isvtest/workloads/k8s_nccl_multinode.py Adds optional runtimeClassName removal/override and launcher affinity override.
isvtest/src/isvtest/validations/k8s_conformance.py Implements multi-stage JUnit retrieval (local path → retries → kubectl cp).
isvtest/src/isvtest/core/nvidia.py Updates GPU counting regex for broader GPU name compatibility.

💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

Comment thread isvtest/src/isvtest/core/nvidia.py Outdated
Comment on lines 64 to 67
Example pattern matched: "| 0 NVIDIA A100-SXM4-80GB", "| 0 Tesla T4"
"""
gpu_lines = re.findall(r"\|\s*\d+\s+NVIDIA", output, re.MULTILINE)
gpu_lines = re.findall(r"\|\s{1,4}\d{1,3}\s{2,}", output, re.MULTILINE)
return len(gpu_lines)
Comment on lines +237 to 240
{runtime_class_line} tolerations:
- key: "nvidia.com/gpu"
operator: "Exists"
effect: "NoSchedule"
Comment on lines +220 to +223
self.log.warning(
f"JUnit retrieval attempt {attempt}/3 via 'kubectl exec cat' failed; retrying in 5s"
)
time.sleep(5)

@coderabbitai coderabbitai Bot left a comment

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 3

Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (1)
isvtest/src/isvtest/core/nvidia.py (1)

55-67: 🎯 Functional Correctness | 🟠 Major | ⚡ Quick win

Regex over-matches the "Processes:" table, inflating GPU counts.

Standard nvidia-smi output includes a second table listing per-process rows in the same | <index> <PID/N-A> ... | shape, e.g. | 0 N/A N/A 12345 C python ...|. The new pattern \|\s{1,4}\d{1,3}\s{2,} matches these process rows too, since it only checks for a pipe, spaces, a small number, and more spaces — it doesn't require the GPU-summary-specific columns (name, memory, etc.) or exclude the "Processes:" section. The old NVIDIA-anchored regex avoided this because process rows don't contain "NVIDIA" text.

This means any workload with multiple running processes on a GPU (or multiple GPUs with processes) will inflate the count returned by count_gpus_from_full_output, which feeds directly into K8sNvidiaSmiCheck's per-node and total GPU count comparisons — causing spurious validation failures/passes.

Consider restricting the match to lines before the "Processes:" marker, or anchoring on GPU-summary-specific tokens (e.g. requiring MiB on the following line, or splitting the output on the processes header before searching).

🐛 Suggested fix: scope the regex to the GPU summary table only
-    gpu_lines = re.findall(r"\|\s{1,4}\d{1,3}\s{2,}", output, re.MULTILINE)
-    return len(gpu_lines)
+    summary_section = output.split("Processes:")[0]
+    gpu_lines = re.findall(r"\|\s{1,4}\d{1,3}\s{2,}", summary_section, re.MULTILINE)
+    return len(gpu_lines)
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@isvtest/src/isvtest/core/nvidia.py` around lines 55 - 67, The GPU counting
logic in count_gpus_from_full_output is matching rows from the Processes section
as well as the GPU summary table, which inflates counts. Fix this by scoping the
search to the summary portion of the nvidia-smi output in
count_gpus_from_full_output (for example, stop parsing at the Processes: marker
or require summary-specific tokens) so K8sNvidiaSmiCheck gets an accurate GPU
count.
🧹 Nitpick comments (1)
isvtest/src/isvtest/workloads/k8s_nccl_multinode.py (1)

240-240: 📐 Maintainability & Code Quality | 🔵 Trivial | ⚡ Quick win

Move re back to module scope.

This local import adds no lazy-load benefit here and breaks the file-level import rule.

As per coding guidelines, "Place all imports at the top of the file; defer imports inside functions only with a one-line comment giving the reason."

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@isvtest/src/isvtest/workloads/k8s_nccl_multinode.py` at line 240, Move the
`re` import back to module scope in `k8s_nccl_multinode.py`; the local `import
re as _re` inside the function should be removed and replaced with a top-level
import alongside the other module imports. Use the existing `re`/`_re` usage in
the surrounding function to update references accordingly, and keep any deferred
imports inside functions only if they have a one-line reason comment.

Source: Coding guidelines

🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Inline comments:
In `@isvtest/src/isvtest/validations/k8s_conformance.py`:
- Around line 216-223: The retry loop in k8s_conformance.py’s JUnit retrieval
path logs “retrying in 5s” and sleeps after the final `attempt == 3`, which is
misleading and adds an unnecessary delay before the `kubectl cp` fallback.
Update the loop around `_exec_cat` so the warning and `time.sleep(5)` only
happen when another exec retry will actually follow (for example, gate them on
`attempt < 3`), while keeping the existing break behavior when `ok and
junit_xml` succeeds.

In `@isvtest/src/isvtest/workloads/k8s_nim.py`:
- Around line 103-105: The two global yaml_content.replace calls in the NIM
memory override logic can overlap and overwrite each other, causing the request
and limit fields to collapse to the same value. Update the replacement logic in
the k8s_nim workload code so the memory request and memory limit are targeted
separately, using distinct field-specific matching around the NIM server
container rather than replacing every occurrence of the raw values. Keep the
behavior scoped to the intended container/spec fields and avoid reusing a prior
replacement result when applying the second one.

In `@isvtest/src/isvtest/workloads/k8s_stress.py`:
- Around line 46-50: The run() method in k8s_stress.py is missing the required
PEP 257 docstring even though it is being modified. Add a concise docstring
directly on isvtest.workloads.k8s_stress.StressTest.run() that describes what
the method does and, if helpful, its key configuration behavior such as
runtime_class_name handling.

---

Outside diff comments:
In `@isvtest/src/isvtest/core/nvidia.py`:
- Around line 55-67: The GPU counting logic in count_gpus_from_full_output is
matching rows from the Processes section as well as the GPU summary table, which
inflates counts. Fix this by scoping the search to the summary portion of the
nvidia-smi output in count_gpus_from_full_output (for example, stop parsing at
the Processes: marker or require summary-specific tokens) so K8sNvidiaSmiCheck
gets an accurate GPU count.

---

Nitpick comments:
In `@isvtest/src/isvtest/workloads/k8s_nccl_multinode.py`:
- Line 240: Move the `re` import back to module scope in
`k8s_nccl_multinode.py`; the local `import re as _re` inside the function should
be removed and replaced with a top-level import alongside the other module
imports. Use the existing `re`/`_re` usage in the surrounding function to update
references accordingly, and keep any deferred imports inside functions only if
they have a one-line reason comment.
🪄 Autofix (Beta)

Fix all unresolved CodeRabbit comments on this PR:

  • Push a commit to this branch (recommended)
  • Create a new PR with the fixes

ℹ️ Review info
⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Enterprise

Run ID: 2415d488-f292-48bb-b014-82bb3a6c7720

📥 Commits

Reviewing files that changed from the base of the PR and between dd984a3 and 55c0281.

📒 Files selected for processing (7)
  • isvtest/src/isvtest/core/nvidia.py
  • isvtest/src/isvtest/validations/k8s_conformance.py
  • isvtest/src/isvtest/workloads/k8s_nccl_multinode.py
  • isvtest/src/isvtest/workloads/k8s_nim.py
  • isvtest/src/isvtest/workloads/k8s_nim_helm.py
  • isvtest/src/isvtest/workloads/k8s_stress.py
  • isvtest/tests/test_k8s_conformance.py

Comment thread isvtest/src/isvtest/validations/k8s_conformance.py Outdated
Comment thread isvtest/src/isvtest/workloads/k8s_nim.py Outdated
Comment thread isvtest/src/isvtest/workloads/k8s_stress.py
- core/nvidia: anchor the GPU-count regex on the PCI Bus-Id so process-table
  rows (which also start with a GPU index) are not over-counted; add unit
  tests covering a non-NVIDIA-named GPU and a multi-GPU node with processes.
- k8s_nim: scope the memory request/limit substitutions to the GPU container's
  own limits/requests blocks so they can't clobber each other when the request
  value equals the original limit literal.
- k8s_conformance: skip the "retrying in 5s" log and sleep on the final exec
  attempt before falling back to kubectl cp.
- k8s_stress: add the missing PEP 257 docstring to run().

make lint and make test pass (1060 isvtest unit tests).

Signed-off-by: Saiyam Pathak <[email protected]>
@saiyam1814

Copy link
Copy Markdown
Author

Thanks @coderabbitai and Copilot. Addressed in 2b3b526. Rundown:

Fixed

  • core/nvidia.py GPU-count regex over-broad (Copilot): Correct, the previous pattern also matched Processes: table rows (which start with a GPU index too). Re-anchored the regex on the PCI Bus-Id that only appears on physical GPU identity rows, so process rows and MIG-device rows are no longer counted. Added unit tests: a Tesla T4 (non-NVIDIA-prefixed name) counts as 1, and a 2-GPU node with 2 running processes counts as 2 (the old pattern returned 4).

  • k8s_nim.py memory request/limit can clobber each other (CodeRabbit): Correct. The two global string replacements collided when the request value equaled the original limit literal (request="32Gi" produced 32Gi/32Gi then 48Gi/48Gi). Scoped each substitution to the GPU container's own limits:/requests: block by anchoring on its nvidia.com/gpu line, so they're independent.

  • k8s_conformance.py sleeps 5s after the final exec attempt (Copilot + CodeRabbit): Fixed, the log.warning(... retrying) + time.sleep(5) now only run when attempt < 3; the 3rd failure falls straight through to the kubectl cp fallback.

  • k8s_stress.py missing run() docstring (CodeRabbit): Added.

Not a defect

  • k8s_stress.py runtimeClassName/tolerations indentation (Copilot, High): I checked this one by rendering the pod spec and parsing it. Both paths are valid YAML: with runtime_class_name="" the line collapses to tolerations: (2-space, spec-level), and with a value it renders runtimeClassName: <v>\n tolerations: (both 2-space). tolerations list items aligned to the parent key is valid YAML. The GPU stress workload also passed end-to-end on the real run. Left as-is.

make lint and make test pass (1060 isvtest unit tests).

@coderabbitai

coderabbitai Bot commented Jul 1, 2026

Copy link
Copy Markdown

Tip

For best results, initiate chat on the files or code changes.

🧩 Analysis chain

🏁 Script executed:

#!/bin/bash
set -euo pipefail
cd /tmp
rm -rf repo
git clone --quiet https://github.com/NVIDIA/ai-cloud-validation.git repo
cd repo
git fetch --quiet origin pull/522/head:pr522 || true
git checkout --quiet pr522 2>/dev/null || git checkout --quiet 2b3b526 2>/dev/null || true
git log --oneline -3
echo "---nvidia.py---"
sed -n '1,80p' isvtest/src/isvtest/core/nvidia.py
echo "---k8s_nim.py (memory section)---"
grep -n "memory\|GPU\|nvidia.com/gpu\|limits\|requests" isvtest/src/isvtest/workloads/k8s_nim.py | head -60
echo "---k8s_conformance.py retry/sleep section---"
grep -n "attempt\|sleep\|kubectl cp\|retry" isvtest/src/isvtest/validations/k8s_conformance.py | head -60
echo "---k8s_stress.py run docstring---"
sed -n '1,40p' isvtest/src/isvtest/workloads/k8s_stress.py | grep -n "def run\|\"\"\""

Length of output: 6140


@saiyam1814 Thanks for the detailed rundown — I verified each fix in 2b3b526 directly against the diff:

  • core/nvidia.py: Confirmed. count_gpus_from_full_output now anchors on the PCI Bus-Id pattern ([0-9A-Fa-f]{8}:[0-9A-Fa-f]{2}:[0-9A-Fa-f]{2}\.\d) rather than the GPU index alone, so Processes: rows and MIG-device rows (which lack a Bus-Id) are correctly excluded. The docstring update with the example row is a nice touch for future readers.
  • k8s_nim.py: Confirmed. The memory substitutions (Lines 108–114) are now anchored on the nvidia.com/gpu line within each limits:/requests: block, so the request and limit replacements are independent and can no longer clobber each other when values coincide.
  • k8s_conformance.py: Confirmed. The attempt < 3 guard (Line 223) correctly scopes the warning log and time.sleep(5) to only fire when another exec attempt will follow, letting the 3rd failure fall straight through to the kubectl cp fallback.
  • k8s_stress.py: Confirmed, run() now has a docstring (Line 38).

On the "not a defect" item regarding runtimeClassName/tolerations indentation — your YAML-rendering verification (both branches producing valid, consistently 2-space-indented output) is convincing, and I agree this is fine as-is.

Nice work tightening these up, and thanks for the clear write-up tying each fix back to its origin.

[resolve]

ヽ(・∀・)ノ

@saiyam1814

Copy link
Copy Markdown
Author

@abegnoche this is the split you suggested on #414 — just the provider-agnostic isvtest/ changes (no isvctl/configs/providers/**), so it can go in ahead of the 1.0.0 provider-acceptance milestone. CodeRabbit and Copilot findings are addressed in 2b3b526, make lint + make test are green, and DCO is passing. Whenever you get a chance for the code-owner review — thanks!

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants