
Add SLES 16.0 support and extend SLES 15 SP7 driver versions #562

Open
Priyankasaggu11929 wants to merge 6 commits into ROCm:main from Priyankasaggu11929:sles-16-support

Conversation

@Priyankasaggu11929
Contributor

@Priyankasaggu11929 Priyankasaggu11929 commented May 29, 2026

Motivation

Follow-up to #365 that extends SLES support to SLES 16.0 and adds some controller validation improvements.

Technical Details

This PR makes the following changes:

  • extend SLES support to include SLES 16.0 (alongside the existing SLES 15 SP7)

    • new prebuilt driver versions for both codestreams:
      • SLES 16.0 - 31.10, 31.20, 31.30
      • SLES 15 SP7 - 30.20.1, 30.30.3, 31.10, 31.20, 31.30
    • refactor the driver image lookup into the shared internal/utils package
      (so that both the KMM module builder and the spec validator use the same metadata)
  • add reconcile-time validation that rejects unsupported driver versions on SLES nodes and reports a clear status condition on the DeviceConfig object

    > kubectl get deviceconfig mi355x-driver-test -n kube-amd-gpu -o jsonpath='{.status}'| jq .
    {
      "conditions": [
        {
          "lastTransitionTime": "2026-05-29T13:11:09Z",
          "message": "Validation failed: [driver version validation failed for node 144.xx.xx.xx.xx: driver version \"31.50\" is not supported for SLES 16.0; supported versions: 31.10, 31.20, 31.30]",
          "reason": "ValidationError",
          "status": "True",
          "type": "Error"
        },
        {
          "lastTransitionTime": "2026-05-29T13:11:09Z",
          "message": "",
          "reason": "OperatorReady",
          "status": "False",
          "type": "Ready"
        }
      ]
    }
    
  • also add a new NoMatchingNodes condition to the DeviceConfig status, to surface an error when no cluster nodes match the node selector defined in the DeviceConfig manifest.

    This condition clears once the node selector is satisfied.

    In my local testing, when the NFD labels (for allowing GPU device ID) are absent on the node, the DeviceConfig is admitted silently, with no error or feedback in the DeviceConfig status, but the operator loops indefinitely with reconciler errors in the logs (and these errors easily go unnoticed).

    > kubectl get deviceconfig mi355x-driver-test -n kube-amd-gpu -o jsonpath='{.status}'| jq .
    {
      "conditions": [
        {
          "lastTransitionTime": "2026-05-29T13:44:42Z",
          "message": "",
          "reason": "OperatorReady",
          "status": "False",
          "type": "Ready"
        },
        {
          "lastTransitionTime": "2026-05-29T13:44:42Z",
          "message": "no nodes found matching selector feature.node.kubernetes.io/amd-gpu=true; verify node labels or check that NFD has labeled the GPU nodes",
          "reason": "NoMatchingNodes",
          "status": "True",
          "type": "Error"
        }
      ],
      ...
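
For reviewers, the NoMatchingNodes behavior described above (set the condition when nothing matches the selector, clear it once a node matches) can be sketched roughly as follows. This is only an illustration under stated assumptions: the Condition struct, formatSelector, and syncNoMatchingNodes are hypothetical names that mirror the fields in the status output, not the operator's actual types or functions. Only the reason, type, status, and message text are taken from the output shown in this PR.

```go
package main

import (
	"fmt"
	"sort"
	"strings"
)

// Condition mirrors the fields visible in the status output above.
// Hypothetical stand-in for illustration, not the operator's real type.
type Condition struct {
	Type    string
	Status  string
	Reason  string
	Message string
}

// formatSelector renders a label selector as sorted, comma-separated
// "key=value" pairs so the message is stable across reconciles.
func formatSelector(selector map[string]string) string {
	pairs := make([]string, 0, len(selector))
	for k, v := range selector {
		pairs = append(pairs, k+"="+v)
	}
	sort.Strings(pairs)
	return strings.Join(pairs, ",")
}

// syncNoMatchingNodes drops any stale NoMatchingNodes condition, then adds a
// fresh one only when no node matched the selector.
func syncNoMatchingNodes(conds []Condition, selector map[string]string, matched int) []Condition {
	kept := make([]Condition, 0, len(conds)+1)
	for _, c := range conds {
		if c.Reason != "NoMatchingNodes" {
			kept = append(kept, c)
		}
	}
	if matched > 0 {
		return kept // selector satisfied: the condition is cleared
	}
	return append(kept, Condition{
		Type:   "Error",
		Status: "True",
		Reason: "NoMatchingNodes",
		Message: fmt.Sprintf(
			"no nodes found matching selector %s; verify node labels or check that NFD has labeled the GPU nodes",
			formatSelector(selector)),
	})
}

func main() {
	selector := map[string]string{"feature.node.kubernetes.io/amd-gpu": "true"}

	conds := syncNoMatchingNodes(nil, selector, 0)
	fmt.Println(len(conds), conds[0].Reason)

	// Once a node carries the label, the next reconcile clears the condition.
	conds = syncNoMatchingNodes(conds, selector, 1)
	fmt.Println(len(conds))
}
```

Rebuilding the condition on every reconcile, rather than only appending, is what lets it clear without a separate cleanup path.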
    

Test Plan

Please note that these changes were validated on an MI355X system running SLES 16.0 (provided by AMD) across all amdgpu driver versions included in the PR.

Test Result

Added unit tests for

  • SLESDefaultDriverVersionsMapper (covering SP7 and 16.0 OS image parsing)
  • ValidateSLESDriverVersion (to cover valid/invalid driver versions for both codestreams and non-SLES passthrough)
  • resolveDockerfile (covering SLES 16.0 prebuilt image tag resolution in both default and custom registry scenarios)
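
As a rough illustration of what the ValidateSLESDriverVersion cases exercise, the check can be sketched as below. The supported version lists and the error message format are taken from this PR description. The function signature, the map keyed by OS name, the "SLES" prefix test, and the handling of an unlisted SLES release are assumptions made for the sketch, not the code in this PR.

```go
package main

import (
	"fmt"
	"strings"
)

// supportedSLESDrivers lists the prebuilt driver versions per SLES release,
// as given in the PR description. The map name and key format are hypothetical.
var supportedSLESDrivers = map[string][]string{
	"SLES 15 SP7": {"30.20.1", "30.30.3", "31.10", "31.20", "31.30"},
	"SLES 16.0":   {"31.10", "31.20", "31.30"},
}

// validateSLESDriverVersion returns nil when the version is supported for the
// given OS, or when the OS is not a SLES release at all (passthrough).
func validateSLESDriverVersion(osName, version string) error {
	if !strings.HasPrefix(osName, "SLES") {
		return nil // non-SLES nodes are not restricted by this check
	}
	supported, ok := supportedSLESDrivers[osName]
	if !ok {
		// Assumption: a SLES release without prebuilt images is rejected.
		return fmt.Errorf("%s is not a supported SLES release", osName)
	}
	for _, v := range supported {
		if v == version {
			return nil
		}
	}
	return fmt.Errorf("driver version %q is not supported for %s; supported versions: %s",
		version, osName, strings.Join(supported, ", "))
}

func main() {
	fmt.Println(validateSLESDriverVersion("SLES 16.0", "31.50"))
	fmt.Println(validateSLESDriverVersion("SLES 15 SP7", "30.20.1"))
	fmt.Println(validateSLESDriverVersion("Ubuntu 22.04", "31.50"))
}
```

Listing the supported versions in the error keeps the resulting status condition actionable without a trip to the operator logs.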

Truncated output of make unit-test:

> make unit-test
go vet ./...
go test ./internal ./internal/controllers ./internal/kmmmodule -v -coverprofile cover.out
...
...
=== RUN   TestSLESDefaultDriverVersionsMapper
=== RUN   TestSLESDefaultDriverVersionsMapper/SLES_15_SP7
=== RUN   TestSLESDefaultDriverVersionsMapper/SLES_15_SP7_with_dash_format
=== RUN   TestSLESDefaultDriverVersionsMapper/SLES_16.0
=== RUN   TestSLESDefaultDriverVersionsMapper/SLES_15_SP6_unsupported
=== RUN   TestSLESDefaultDriverVersionsMapper/SLES_15_base_unsupported
--- PASS: TestSLESDefaultDriverVersionsMapper (0.00s)
    --- PASS: TestSLESDefaultDriverVersionsMapper/SLES_15_SP7 (0.00s)
    --- PASS: TestSLESDefaultDriverVersionsMapper/SLES_15_SP7_with_dash_format (0.00s)
    --- PASS: TestSLESDefaultDriverVersionsMapper/SLES_16.0 (0.00s)
    --- PASS: TestSLESDefaultDriverVersionsMapper/SLES_15_SP6_unsupported (0.00s)
    --- PASS: TestSLESDefaultDriverVersionsMapper/SLES_15_base_unsupported (0.00s)
=== RUN   TestValidateSLESDriverVersion
=== RUN   TestValidateSLESDriverVersion/non-SLES_OS_is_always_valid
=== RUN   TestValidateSLESDriverVersion/valid_version_for_SLES_15_SP7
=== RUN   TestValidateSLESDriverVersion/valid_version_for_SLES_16.0
=== RUN   TestValidateSLESDriverVersion/unsupported_version_on_SLES_16.0
=== RUN   TestValidateSLESDriverVersion/unsupported_version_on_SLES_15_SP7
--- PASS: TestValidateSLESDriverVersion (0.00s)
    --- PASS: TestValidateSLESDriverVersion/non-SLES_OS_is_always_valid (0.00s)
    --- PASS: TestValidateSLESDriverVersion/valid_version_for_SLES_15_SP7 (0.00s)
    --- PASS: TestValidateSLESDriverVersion/valid_version_for_SLES_16.0 (0.00s)
    --- PASS: TestValidateSLESDriverVersion/unsupported_version_on_SLES_16.0 (0.00s)
    --- PASS: TestValidateSLESDriverVersion/unsupported_version_on_SLES_15_SP7 (0.00s)
PASS
coverage: 53.2% of statements
ok  	github.com/ROCm/gpu-operator/internal	0.019s	coverage: 53.2% of statements
=== RUN   TestAPIs
Running Suite: Controller Suite - /home/amd/work-suse/may-29-2026/gpu-operator/internal/controllers
===================================================================================================
Random Seed: 1780062369

Will run 25 of 25 specs
•••••••••••••••••••••••••

Ran 25 of 25 Specs in 0.003 seconds
SUCCESS! -- 25 Passed | 0 Failed | 0 Pending | 0 Skipped
--- PASS: TestAPIs (0.00s)
PASS
coverage: 7.4% of statements
ok  	github.com/ROCm/gpu-operator/internal/controllers	0.023s	coverage: 7.4% of statements
=== RUN   TestAPIs
Running Suite: KMMModule Suite - /home/amd/work-suse/may-29-2026/gpu-operator/internal/kmmmodule
================================================================================================
Random Seed: 1780062369

Will run 9 of 9 specs
testing multiple valid homogeneous nodes
testing multiple valid heterogeneous nodes
testing multiple valid heterogeneous nodes + one unsupported node
testing multiple unsupported nodes
testing empty node list
•••••••<moduleName>
<amdgpu>
•<moduleName>
<amdgpu>
•

Ran 9 of 9 Specs in 0.004 seconds
SUCCESS! -- 9 Passed | 0 Failed | 0 Pending | 0 Skipped
--- PASS: TestAPIs (0.01s)
PASS
coverage: 46.4% of statements
ok  	github.com/ROCm/gpu-operator/internal/kmmmodule	0.020s	coverage: 46.4% of statements

Submission Checklist

@Priyankasaggu11929
Contributor Author

cc: @yansun1996 for your review.

Please note that these changes were validated on an MI355X system running SLES 16.0 (provided by AMD) across all amdgpu driver versions included in the PR.

@yansun1996
Member

Hi @Priyankasaggu11929 thanks for the PR, we're raising CI test against your change, will let you know if anything requires further change

@Priyankasaggu11929
Contributor Author

we're raising CI test against your change, will let you know if anything requires further change

Ack, thank you @yansun1996!

Comment thread internal/utils_test.go Outdated
Comment thread internal/controllers/device_config_reconciler.go Outdated
Comment thread internal/controllers/device_config_reconciler.go
Comment thread internal/utils.go
Comment thread internal/kmmmodule/kmmmodule.go Outdated
@yansun1996
Member

Hi @Priyankasaggu11929, the basic CI tests passed. It would be better to address the above comments before we merge this PR, thanks!
