feat(isvtest): provider-agnostic workload and conformance improvements#522
feat(isvtest): provider-agnostic workload and conformance improvements#522saiyam1814 wants to merge 2 commits into
Conversation
Split out from the vCluster provider PR (NVIDIA#414) so these can merge ahead of the 1.0.0 provider-acceptance milestone. All changes are additive, backward-compatible, and apply across every provider. - k8s_nim: add runtime_class_name, nim_memory_request, nim_memory_limit config params - k8s_nim_helm: --kubeconfig targeting, memory + runtimeClassName params, NIM_MAX_MODEL_LEN env via --set-string - k8s_nccl_multinode: override_launcher_affinity and runtime_class_name for MPI launchers on clusters without control-plane nodes - k8s_stress: runtime_class_name config param - k8s_conformance: resilient JUnit retrieval (retry exec-cat, kubectl cp fallback, opt-in pre-staged local path via ISVTEST_CONFORMANCE_JUNIT_LOCAL_PATH) - core/nvidia: broaden GPU-count regex for non-standard GPU names (e.g. Tesla T4) make lint and make test pass (1057 isvtest unit tests). Signed-off-by: Saiyam Pathak <[email protected]>
|
No actionable comments were generated in the recent review. 🎉 ℹ️ Recent review info⚙️ Run configurationConfiguration used: Path: .coderabbit.yaml Review profile: CHILL Plan: Enterprise Run ID: 📒 Files selected for processing (5)
🚧 Files skipped from review as they are similar to previous changes (4)
📝 WalkthroughWalkthroughAdds staged JUnit retrieval with local pre-stage, retried ChangesJUnit Retrieval Resilience
Configurable Runtime Class and Memory Overrides
Estimated code review effort: 3 (Moderate) | ~30 minutes Sequence Diagram(s)sequenceDiagram
participant run
participant try_local_junit
participant kubectl_exec_cat
participant kubectl_cp
run->>try_local_junit: read ISVTEST_CONFORMANCE_JUNIT_LOCAL_PATH
try_local_junit-->>run: content or empty
run->>kubectl_exec_cat: retry up to 3 times
kubectl_exec_cat-->>run: content or failure
run->>kubectl_cp: copy JUnit via kubectl cp
kubectl_cp-->>run: content or failure
Suggested reviewers: 🚥 Pre-merge checks | ✅ 4 | ❌ 1❌ Failed checks (1 warning)
✅ Passed checks (4 passed)
✨ Finishing Touches🧪 Generate unit tests (beta)
Comment |
There was a problem hiding this comment.
Pull request overview
This PR extends isvtest’s provider-agnostic Kubernetes workloads and validations with new configuration knobs (runtimeClassName + memory tuning), more resilient CNCF conformance JUnit retrieval, and improved GPU counting from nvidia-smi output.
Changes:
- Add workload configuration options for
runtimeClassName, NIM memory sizing, Helm--kubeconfigtargeting, and NCCL multinode launcher affinity override. - Make CNCF conformance JUnit retrieval more resilient via retries,
kubectl cpfallback, and optional pre-staged local JUnit support. - Broaden GPU counting logic for non-
NVIDIA*GPU name rows.
Reviewed changes
Copilot reviewed 7 out of 7 changed files in this pull request and generated 3 comments.
Show a summary per file
| File | Description |
|---|---|
| isvtest/tests/test_k8s_conformance.py | Adds coverage for JUnit retrieval fallback to kubectl cp. |
| isvtest/src/isvtest/workloads/k8s_stress.py | Introduces runtime_class_name plumbing into generated Pod YAML. |
| isvtest/src/isvtest/workloads/k8s_nim.py | Adds runtimeClassName + memory tuning + optional NIM KV-cache env injection into the manifest. |
| isvtest/src/isvtest/workloads/k8s_nim_helm.py | Extracts kubeconfig from KUBECTL and threads it into Helm operations; adds optional memory/runtimeClass/env settings. |
| isvtest/src/isvtest/workloads/k8s_nccl_multinode.py | Adds optional runtimeClassName removal/override and launcher affinity override. |
| isvtest/src/isvtest/validations/k8s_conformance.py | Implements multi-stage JUnit retrieval (local path → retries → kubectl cp). |
| isvtest/src/isvtest/core/nvidia.py | Updates GPU counting regex for broader GPU name compatibility. |
💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.
| Example pattern matched: "| 0 NVIDIA A100-SXM4-80GB", "| 0 Tesla T4" | ||
| """ | ||
| gpu_lines = re.findall(r"\|\s*\d+\s+NVIDIA", output, re.MULTILINE) | ||
| gpu_lines = re.findall(r"\|\s{1,4}\d{1,3}\s{2,}", output, re.MULTILINE) | ||
| return len(gpu_lines) |
| {runtime_class_line} tolerations: | ||
| - key: "nvidia.com/gpu" | ||
| operator: "Exists" | ||
| effect: "NoSchedule" |
| self.log.warning( | ||
| f"JUnit retrieval attempt {attempt}/3 via 'kubectl exec cat' failed; retrying in 5s" | ||
| ) | ||
| time.sleep(5) |
There was a problem hiding this comment.
Actionable comments posted: 3
Caution
Some comments are outside the diff and can’t be posted inline due to platform limitations.
⚠️ Outside diff range comments (1)
isvtest/src/isvtest/core/nvidia.py (1)
55-67: 🎯 Functional Correctness | 🟠 Major | ⚡ Quick winRegex over-matches the "Processes:" table, inflating GPU counts.
Standard
nvidia-smioutput includes a second table listing per-process rows in the same| <index> <PID/N-A> ... |shape, e.g.| 0 N/A N/A 12345 C python ...|. The new pattern\|\s{1,4}\d{1,3}\s{2,}matches these process rows too, since it only checks for a pipe, spaces, a small number, and more spaces — it doesn't require the GPU-summary-specific columns (name, memory, etc.) or exclude the "Processes:" section. The old NVIDIA-anchored regex avoided this because process rows don't contain "NVIDIA" text.This means any workload with multiple running processes on a GPU (or multiple GPUs with processes) will inflate the count returned by
count_gpus_from_full_output, which feeds directly intoK8sNvidiaSmiCheck's per-node and total GPU count comparisons — causing spurious validation failures/passes.Consider restricting the match to lines before the "Processes:" marker, or anchoring on GPU-summary-specific tokens (e.g. requiring
MiBon the following line, or splitting the output on the processes header before searching).🐛 Suggested fix: scope the regex to the GPU summary table only
- gpu_lines = re.findall(r"\|\s{1,4}\d{1,3}\s{2,}", output, re.MULTILINE) - return len(gpu_lines) + summary_section = output.split("Processes:")[0] + gpu_lines = re.findall(r"\|\s{1,4}\d{1,3}\s{2,}", summary_section, re.MULTILINE) + return len(gpu_lines)🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@isvtest/src/isvtest/core/nvidia.py` around lines 55 - 67, The GPU counting logic in count_gpus_from_full_output is matching rows from the Processes section as well as the GPU summary table, which inflates counts. Fix this by scoping the search to the summary portion of the nvidia-smi output in count_gpus_from_full_output (for example, stop parsing at the Processes: marker or require summary-specific tokens) so K8sNvidiaSmiCheck gets an accurate GPU count.
🧹 Nitpick comments (1)
isvtest/src/isvtest/workloads/k8s_nccl_multinode.py (1)
240-240: 📐 Maintainability & Code Quality | 🔵 Trivial | ⚡ Quick winMove
reback to module scope.This local import adds no lazy-load benefit here and breaks the file-level import rule.
As per coding guidelines, "Place all imports at the top of the file; defer imports inside functions only with a one-line comment giving the reason."
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@isvtest/src/isvtest/workloads/k8s_nccl_multinode.py` at line 240, Move the `re` import back to module scope in `k8s_nccl_multinode.py`; the local `import re as _re` inside the function should be removed and replaced with a top-level import alongside the other module imports. Use the existing `re`/`_re` usage in the surrounding function to update references accordingly, and keep any deferred imports inside functions only if they have a one-line reason comment.Source: Coding guidelines
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
Inline comments:
In `@isvtest/src/isvtest/validations/k8s_conformance.py`:
- Around line 216-223: The retry loop in k8s_conformance.py’s JUnit retrieval
path logs “retrying in 5s” and sleeps after the final `attempt == 3`, which is
misleading and adds an unnecessary delay before the `kubectl cp` fallback.
Update the loop around `_exec_cat` so the warning and `time.sleep(5)` only
happen when another exec retry will actually follow (for example, gate them on
`attempt < 3`), while keeping the existing break behavior when `ok and
junit_xml` succeeds.
In `@isvtest/src/isvtest/workloads/k8s_nim.py`:
- Around line 103-105: The two global yaml_content.replace calls in the NIM
memory override logic can overlap and overwrite each other, causing the request
and limit fields to collapse to the same value. Update the replacement logic in
the k8s_nim workload code so the memory request and memory limit are targeted
separately, using distinct field-specific matching around the NIM server
container rather than replacing every occurrence of the raw values. Keep the
behavior scoped to the intended container/spec fields and avoid reusing a prior
replacement result when applying the second one.
In `@isvtest/src/isvtest/workloads/k8s_stress.py`:
- Around line 46-50: The run() method in k8s_stress.py is missing the required
PEP 257 docstring even though it is being modified. Add a concise docstring
directly on isvtest.workloads.k8s_stress.StressTest.run() that describes what
the method does and, if helpful, its key configuration behavior such as
runtime_class_name handling.
---
Outside diff comments:
In `@isvtest/src/isvtest/core/nvidia.py`:
- Around line 55-67: The GPU counting logic in count_gpus_from_full_output is
matching rows from the Processes section as well as the GPU summary table, which
inflates counts. Fix this by scoping the search to the summary portion of the
nvidia-smi output in count_gpus_from_full_output (for example, stop parsing at
the Processes: marker or require summary-specific tokens) so K8sNvidiaSmiCheck
gets an accurate GPU count.
---
Nitpick comments:
In `@isvtest/src/isvtest/workloads/k8s_nccl_multinode.py`:
- Line 240: Move the `re` import back to module scope in
`k8s_nccl_multinode.py`; the local `import re as _re` inside the function should
be removed and replaced with a top-level import alongside the other module
imports. Use the existing `re`/`_re` usage in the surrounding function to update
references accordingly, and keep any deferred imports inside functions only if
they have a one-line reason comment.
🪄 Autofix (Beta)
Fix all unresolved CodeRabbit comments on this PR:
- Push a commit to this branch (recommended)
- Create a new PR with the fixes
ℹ️ Review info
⚙️ Run configuration
Configuration used: Path: .coderabbit.yaml
Review profile: CHILL
Plan: Enterprise
Run ID: 2415d488-f292-48bb-b014-82bb3a6c7720
📒 Files selected for processing (7)
isvtest/src/isvtest/core/nvidia.pyisvtest/src/isvtest/validations/k8s_conformance.pyisvtest/src/isvtest/workloads/k8s_nccl_multinode.pyisvtest/src/isvtest/workloads/k8s_nim.pyisvtest/src/isvtest/workloads/k8s_nim_helm.pyisvtest/src/isvtest/workloads/k8s_stress.pyisvtest/tests/test_k8s_conformance.py
- core/nvidia: anchor the GPU-count regex on the PCI Bus-Id so process-table rows (which also start with a GPU index) are not over-counted; add unit tests covering a non-NVIDIA-named GPU and a multi-GPU node with processes. - k8s_nim: scope the memory request/limit substitutions to the GPU container's own limits/requests blocks so they can't clobber each other when the request value equals the original limit literal. - k8s_conformance: skip the "retrying in 5s" log and sleep on the final exec attempt before falling back to kubectl cp. - k8s_stress: add the missing PEP 257 docstring to run(). make lint and make test pass (1060 isvtest unit tests). Signed-off-by: Saiyam Pathak <[email protected]>
|
Thanks @coderabbitai and Copilot. Addressed in 2b3b526. Rundown: Fixed
Not a defect
|
|
Tip For best results, initiate chat on the files or code changes. 🧩 Analysis chain🏁 Script executed: #!/bin/bash
set -euo pipefail
cd /tmp
rm -rf repo
git clone --quiet https://github.com/NVIDIA/ai-cloud-validation.git repo
cd repo
git fetch --quiet origin pull/522/head:pr522 || true
git checkout --quiet pr522 2>/dev/null || git checkout --quiet 2b3b526 2>/dev/null || true
git log --oneline -3
echo "---nvidia.py---"
sed -n '1,80p' isvtest/src/isvtest/core/nvidia.py
echo "---k8s_nim.py (memory section)---"
grep -n "memory\|GPU\|nvidia.com/gpu\|limits\|requests" isvtest/src/isvtest/workloads/k8s_nim.py | head -60
echo "---k8s_conformance.py retry/sleep section---"
grep -n "attempt\|sleep\|kubectl cp\|retry" isvtest/src/isvtest/validations/k8s_conformance.py | head -60
echo "---k8s_stress.py run docstring---"
sed -n '1,40p' isvtest/src/isvtest/workloads/k8s_stress.py | grep -n "def run\|\"\"\""Length of output: 6140
On the "not a defect" item regarding Nice work tightening these up, and thanks for the clear write-up tying each fix back to its origin. [resolve] ヽ(・∀・)ノ |
|
@abegnoche this is the split you suggested on #414 — just the provider-agnostic |
Summary
Provider-agnostic framework and workload improvements, split out from #414 per @abegnoche's note that new provider directories (
isvctl/configs/providers/**) will only be accepted at the 1.0.0 release this fall. Theseisvtest/changes benefit every provider and stand alone, so this PR carries just those. #414 will be rebased down to only the vCluster provider config and held for 1.0.0.All changes are additive and backward-compatible (new optional config params + resilience fallbacks); existing provider behavior is unchanged when the new options are unset.
Changes
workloads/k8s_nim.py—runtime_class_name,nim_memory_request,nim_memory_limitconfig params.workloads/k8s_nim_helm.py—--kubeconfigtargeting (extracts kubeconfig fromKUBECTL), memory +runtimeClassNameparams, andNIM_MAX_MODEL_LENenv via--set-string.workloads/k8s_nccl_multinode.py—override_launcher_affinity(drop the launcher'snode-role.kubernetes.io/control-planeaffinity on clusters that don't expose control-plane nodes) andruntime_class_name.workloads/k8s_stress.py—runtime_class_nameconfig param.validations/k8s_conformance.py— resilient JUnit retrieval for managed-K8s LoadBalancers that reset long-runningkubectl execstreams: retryexec cat(3x), fall back tokubectl cp(tar framing), and support an opt-in pre-staged local file viaISVTEST_CONFORMANCE_JUNIT_LOCAL_PATH.core/nvidia.py— broaden the GPU-count regex so non-standardnvidia-smiGPU names (e.g.Tesla T4) are counted.Testing
make lint— ruff, yamlfmt, SPDX headers all pass.make test— full unit suite passes (1057 isvtest tests, incl. updatedtest_k8s_conformance).Rebased onto current
main, so it composes cleanly with #449 (test_ids/labels in YAML) and #473 (CUDA UMD Version parsing).Summary by CodeRabbit
New Features
runtimeClassName.Bug Fixes