
add hook for building GROMACS on NVIDIA Grace CPUs with hwloc support#225

Open
bedroge wants to merge 12 commits into
EESSI:main from
bedroge:gromacs_202602_grace_fix


@bedroge
Contributor

@bedroge bedroge commented May 8, 2026

Trying the suggestion from EESSI/software-layer#1497 (comment).

@bedroge
Contributor Author

bedroge commented May 8, 2026

bot: build repo:eessi.io-2025.06-software instance:eessi-bot-jsc for:arch=aarch64/nvidia/grace

@bedroge
Contributor Author

bedroge commented May 8, 2026

Ah, need to wait for a dirty frag mitigation to be deployed.

@bedroge
Contributor Author

bedroge commented May 8, 2026

bot: build repo:eessi.io-2025.06-software instance:eessi-bot-jsc for:arch=aarch64/nvidia/grace

@eessi-bot-jsc

eessi-bot-jsc Bot commented May 8, 2026

New job on instance eessi-bot-jsc for repository eessi.io-2025.06-software
Building on: nvidia-grace
Building for: aarch64/nvidia/grace
Job dir: /p/project1/ceasybuilders/eessibot/jobs/2026.05/pr_225/14735133

| date | job status | comment |
| --- | --- | --- |
| May 08 10:33:30 UTC 2026 | submitted | job id 14735133 awaits release by job manager |
| May 08 10:34:19 UTC 2026 | released | job awaits launch by Slurm scheduler |
| May 08 10:35:23 UTC 2026 | running | job 14735133 is running |
| May 08 11:17:50 UTC 2026 | finished | 😢 FAILURE (click triangle for details) |
Details
✅ job output file slurm-14735133.out
✅ no message matching FATAL:
❌ found message matching ERROR:
❌ found message matching FAILED:
❌ found message matching required modules missing:
❌ no message matching No missing installations
✅ found message matching .tar.* created!
Artefacts
eessi-2025.06-software-linux-aarch64-nvidia-grace-17782386700.tar.gz
size: 0 MiB (27304 bytes)
entries: 1
modules under 2025.06/software/linux/aarch64/nvidia/grace/modules/all
no module files in tarball
software under 2025.06/software/linux/aarch64/nvidia/grace/software
no software packages in tarball
reprod directories under 2025.06/software/linux/aarch64/nvidia/grace/reprod
no reprod directories in tarball
other under 2025.06/software/linux/aarch64/nvidia/grace
2025.06/init/easybuild/eb_hooks.py
May 08 11:17:50 UTC 2026 test result
😢 FAILURE (click triangle for details)
Reason
EESSI test suite produced failures.
ReFrame Summary
[ FAILED ] Ran 17/29 test case(s) from 29 check(s) (4 failure(s), 12 skipped, 0 aborted)
Details
✅ job output file slurm-14735133.out
❌ found message matching ERROR:
❌ found message matching [\s*FAILED\s*].*Ran .* test case
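The ReFrame summary line in the report above packs several counters into one string. As an illustration only, here is a Python sketch that pulls them apart. It is not the bot's own code: the function name and the regular expression are assumptions made for this note, and `PASSED` as the non-failing status is assumed, since only `FAILED` appears in this thread.

```python
import re

# Hypothetical pattern for the summary line shown in the bot report.
SUMMARY_RE = re.compile(
    r"\[\s*(?P<status>FAILED|PASSED)\s*\]\s*"
    r"Ran (?P<ran>\d+)/(?P<total>\d+) test case\(s\) "
    r"from (?P<checks>\d+) check\(s\) "
    r"\((?P<failures>\d+) failure\(s\), "
    r"(?P<skipped>\d+) skipped, (?P<aborted>\d+) aborted\)"
)


def parse_reframe_summary(line):
    """Return the counters of a ReFrame summary line as a dict, or None."""
    match = SUMMARY_RE.search(line)
    if match is None:
        return None
    # Every captured group except the status word is a number.
    result = {
        key: int(value)
        for key, value in match.groupdict().items()
        if key != "status"
    }
    result["status"] = match.group("status")
    return result


summary = parse_reframe_summary(
    "[ FAILED ] Ran 17/29 test case(s) from 29 check(s) "
    "(4 failure(s), 12 skipped, 0 aborted)"
)
print(summary)
```

In the line above, the 17 test cases that ran plus the 12 skipped ones account for all 29, so the 4 failures are a subset of the 17.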

@bedroge
Contributor Author

bedroge commented May 8, 2026

bot: build repo:eessi.io-2025.06-software instance:eessi-bot-jsc for:arch=aarch64/nvidia/grace

@eessi-bot-jsc

eessi-bot-jsc Bot commented May 8, 2026

New job on instance eessi-bot-jsc for repository eessi.io-2025.06-software
Building on: nvidia-grace
Building for: aarch64/nvidia/grace
Job dir: /p/project1/ceasybuilders/eessibot/jobs/2026.05/pr_225/14736649

| date | job status | comment |
| --- | --- | --- |
| May 08 18:22:28 UTC 2026 | submitted | job id 14736649 awaits release by job manager |
| May 08 18:22:54 UTC 2026 | released | job awaits launch by Slurm scheduler |
| May 08 18:23:57 UTC 2026 | running | job 14736649 is running |
| May 08 19:07:24 UTC 2026 | finished | 😢 FAILURE (click triangle for details) |
Details
✅ job output file slurm-14736649.out
✅ no message matching FATAL:
❌ found message matching ERROR:
❌ found message matching FAILED:
❌ found message matching required modules missing:
❌ no message matching No missing installations
✅ found message matching .tar.* created!
Artefacts
eessi-2025.06-software-linux-aarch64-nvidia-grace-17782668120.tar.gz
size: 0 MiB (27318 bytes)
entries: 1
modules under 2025.06/software/linux/aarch64/nvidia/grace/modules/all
no module files in tarball
software under 2025.06/software/linux/aarch64/nvidia/grace/software
no software packages in tarball
reprod directories under 2025.06/software/linux/aarch64/nvidia/grace/reprod
no reprod directories in tarball
other under 2025.06/software/linux/aarch64/nvidia/grace
2025.06/init/easybuild/eb_hooks.py
May 08 19:07:24 UTC 2026 test result
😢 FAILURE (click triangle for details)
Reason
EESSI test suite produced failures.
ReFrame Summary
[ FAILED ] Ran 17/29 test case(s) from 29 check(s) (4 failure(s), 12 skipped, 0 aborted)
Details
✅ job output file slurm-14736649.out
❌ found message matching ERROR:
❌ found message matching [\s*FAILED\s*].*Ran .* test case

@bedroge
Contributor Author

bedroge commented May 8, 2026

Hmm, it now runs it with `export HWLOC_KEEP_NVIDIA_GPU_NUMA_NODES=0 && make check -j 16`, but I'm still getting this:

[ RUN      ] HardwareTopologyTest.NumaCacheSelfconsistency
/tmp/eessibot/easybuild/build/GROMACS/2026.2/foss-2025b/gromacs-2026.2/src/gromacs/hardware/tests/hardwaretopology.cpp:190: Failure
Expected equality of these values:
  processorsinNumaNudes
    Which is: 576
  hwTop.machine().logicalProcessors.size()
    Which is: 72

/tmp/eessibot/easybuild/build/GROMACS/2026.2/foss-2025b/gromacs-2026.2/src/gromacs/hardware/tests/hardwaretopology.cpp:211: Failure
Expected: (n.memory) > (0), actual: 0 vs 0

/tmp/eessibot/easybuild/build/GROMACS/2026.2/foss-2025b/gromacs-2026.2/src/gromacs/hardware/tests/hardwaretopology.cpp:211: Failure
Expected: (n.memory) > (0), actual: 0 vs 0

/tmp/eessibot/easybuild/build/GROMACS/2026.2/foss-2025b/gromacs-2026.2/src/gromacs/hardware/tests/hardwaretopology.cpp:211: Failure
Expected: (n.memory) > (0), actual: 0 vs 0

/tmp/eessibot/easybuild/build/GROMACS/2026.2/foss-2025b/gromacs-2026.2/src/gromacs/hardware/tests/hardwaretopology.cpp:211: Failure
Expected: (n.memory) > (0), actual: 0 vs 0

/tmp/eessibot/easybuild/build/GROMACS/2026.2/foss-2025b/gromacs-2026.2/src/gromacs/hardware/tests/hardwaretopology.cpp:211: Failure
Expected: (n.memory) > (0), actual: 0 vs 0

/tmp/eessibot/easybuild/build/GROMACS/2026.2/foss-2025b/gromacs-2026.2/src/gromacs/hardware/tests/hardwaretopology.cpp:211: Failure
Expected: (n.memory) > (0), actual: 0 vs 0

/tmp/eessibot/easybuild/build/GROMACS/2026.2/foss-2025b/gromacs-2026.2/src/gromacs/hardware/tests/hardwaretopology.cpp:211: Failure
Expected: (n.memory) > (0), actual: 0 vs 0

[  FAILED  ] HardwareTopologyTest.NumaCacheSelfconsistency (14 ms)
[----------] 4 tests from HardwareTopologyTest (57 ms total)

cc @al42and

@al42and

al42and commented May 8, 2026

Huh, fun. The GPU NUMA node seems to have gone away (processorsinNumaNudes went down from 9*72 to 8*72; need to make the name less lewd), but there are still seven mystery NUMA nodes left.

Any chance you can get hwloc XML from the machine? `hwloc-ls aarch64-grace.xml`
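The 9*72 and 8*72 reading can be checked with a little arithmetic. The sketch below assumes, as that reading implies, that every NUMA node is credited with all 72 logical processors, so the failing sum is the node count times 72. The function name is made up for this note and is not part of GROMACS.

```python
# Value of hwTop.machine().logicalProcessors.size() in the failing test.
LOGICAL_PROCESSORS = 72


def implied_numa_node_count(processors_in_numa_nodes,
                            logical_processors=LOGICAL_PROCESSORS):
    """Return how many NUMA nodes a sum implies.

    Assumption: each NUMA node lists every logical processor once.
    """
    count, remainder = divmod(processors_in_numa_nodes, logical_processors)
    if remainder:
        raise ValueError("sum is not a whole multiple of the processor count")
    return count


print(implied_numa_node_count(576))     # 8: one CPU node plus seven mystery nodes
print(implied_numa_node_count(9 * 72))  # 9: the count before the GPU node went away
```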

@bedroge
Contributor Author

bedroge commented May 10, 2026

> Huh, fun. The GPU NUMA node seems to have gone away (processorsinNumaNudes went down from 9*72 to 8*72; need to make the name less lewd), but there are still seven mystery NUMA nodes left.
>
> Any chance you can get hwloc XML from the machine? `hwloc-ls aarch64-grace.xml`

Sure, here it is.
aarch64-grace.xml

I still have to confirm it, but it looked like it worked with `make check -j 1` instead of `make check -j 16` (EasyBuild automatically uses the number of requested/available cores for the build job). Could it be related to this in any way?

@bedroge
Contributor Author

bedroge commented May 10, 2026

Let me try it here with max-parallel: 1.

bot: build repo:eessi.io-2025.06-software instance:eessi-bot-jsc for:arch=aarch64/nvidia/grace

@eessi-bot-jsc

eessi-bot-jsc Bot commented May 10, 2026

New job on instance eessi-bot-jsc for repository eessi.io-2025.06-software
Building on: nvidia-grace
Building for: aarch64/nvidia/grace
Job dir: /p/project1/ceasybuilders/eessibot/jobs/2026.05/pr_225/14744590

| date | job status | comment |
| --- | --- | --- |
| May 10 17:35:31 UTC 2026 | submitted | job id 14744590 awaits release by job manager |
| May 10 17:35:58 UTC 2026 | released | job awaits launch by Slurm scheduler |
| May 10 17:37:02 UTC 2026 | running | job 14744590 is running |
| May 10 17:43:13 UTC 2026 | finished | 😢 FAILURE (click triangle for details) |
Details
✅ job output file slurm-14744590.out
✅ no message matching FATAL:
❌ found message matching ERROR:
✅ no message matching FAILED:
✅ no message matching required modules missing:
❌ no message matching No missing installations
✅ found message matching .tar.* created!
Artefacts
eessi-2025.06-software-linux-aarch64-nvidia-grace-17784346690.tar.gz
size: 0 MiB (27318 bytes)
entries: 1
modules under 2025.06/software/linux/aarch64/nvidia/grace/modules/all
no module files in tarball
software under 2025.06/software/linux/aarch64/nvidia/grace/software
no software packages in tarball
reprod directories under 2025.06/software/linux/aarch64/nvidia/grace/reprod
no reprod directories in tarball
other under 2025.06/software/linux/aarch64/nvidia/grace
2025.06/init/easybuild/eb_hooks.py
May 10 17:43:13 UTC 2026 test result
😢 FAILURE (click triangle for details)
Reason
EESSI test suite produced failures.
ReFrame Summary
[ FAILED ] Ran 17/29 test case(s) from 29 check(s) (4 failure(s), 12 skipped, 0 aborted)
Details
✅ job output file slurm-14744590.out
❌ found message matching ERROR:
❌ found message matching [\s*FAILED\s*].*Ran .* test case

@bedroge
Contributor Author

bedroge commented May 10, 2026

bot: build repo:eessi.io-2025.06-software instance:eessi-bot-jsc for:arch=aarch64/nvidia/grace

@eessi-bot-jsc

eessi-bot-jsc Bot commented May 10, 2026

New job on instance eessi-bot-jsc for repository eessi.io-2025.06-software
Building on: nvidia-grace
Building for: aarch64/nvidia/grace
Job dir: /p/project1/ceasybuilders/eessibot/jobs/2026.05/pr_225/14744699

| date | job status | comment |
| --- | --- | --- |
| May 10 18:41:00 UTC 2026 | submitted | job id 14744699 awaits release by job manager |
| May 10 18:41:26 UTC 2026 | released | job awaits launch by Slurm scheduler |
| May 10 18:42:30 UTC 2026 | running | job 14744699 is running |
| May 10 20:00:58 UTC 2026 | finished | 😢 FAILURE (click triangle for details) |
Details
✅ job output file slurm-14744699.out
✅ no message matching FATAL:
❌ found message matching ERROR:
❌ found message matching FAILED:
❌ found message matching required modules missing:
❌ no message matching No missing installations
✅ found message matching .tar.* created!
Artefacts
eessi-2025.06-software-linux-aarch64-nvidia-grace-17784428850.tar.gz
size: 0 MiB (27315 bytes)
entries: 1
modules under 2025.06/software/linux/aarch64/nvidia/grace/modules/all
no module files in tarball
software under 2025.06/software/linux/aarch64/nvidia/grace/software
no software packages in tarball
reprod directories under 2025.06/software/linux/aarch64/nvidia/grace/reprod
no reprod directories in tarball
other under 2025.06/software/linux/aarch64/nvidia/grace
2025.06/init/easybuild/eb_hooks.py
May 10 20:00:58 UTC 2026 test result
😢 FAILURE (click triangle for details)
Reason
EESSI test suite produced failures.
ReFrame Summary
[ FAILED ] Ran 17/29 test case(s) from 29 check(s) (4 failure(s), 12 skipped, 0 aborted)
Details
✅ job output file slurm-14744699.out
❌ found message matching ERROR:
❌ found message matching [\s*FAILED\s*].*Ran .* test case

@bedroge
Contributor Author

bedroge commented May 11, 2026

Hmm, same error now. The other main difference between my manual attempt to build it and the builds done by the bot is that the latter uses a container. I can give it a try in the same container.

@al42and

al42and commented May 11, 2026

> The other main difference between my manual attempt to build it and the builds done by the bot is that the latter uses a container. I can give it a try in the same container.

Have you generated the XML in the container or not? The XML does not have any of the "phantom" NUMA nodes GROMACS is complaining about. So if it was done bare-metal, then everything points to a side-effect of containerization.

@al42and

al42and commented May 11, 2026

@bedroge, can you simply add a call to hwloc-ls before make check? While hwloc's XML has more details and can be used to add new unit tests, the stdout of hwloc-ls should be sufficient to figure out where the mystery NUMA nodes are coming from and how to filter them out, all without the hassle of getting the files out of the container.
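One way the suggestion could be wired in is sketched below in Python. The actual change to eb_hooks.py is not shown in this thread, so this is only an assumption about its shape: the helper name `build_test_prefix` is invented here, and the idea that the resulting string would be handed to EasyBuild through the `pretestopts` easyconfig parameter (which is prepended to the test command) is likewise assumed. It also assumes `hwloc-ls` is on the `PATH` in the build environment.

```python
def build_test_prefix(existing="", env=None, commands=()):
    """Build a shell prefix to run in front of the test command.

    existing -- prefix that is already configured (kept at the end)
    env      -- mapping of environment variables to export first
    commands -- extra diagnostic commands to run before the tests
    """
    parts = [f"export {name}={value}" for name, value in (env or {}).items()]
    parts.extend(commands)
    # Each part is chained with '&&' so a failing step stops the test run.
    return "".join(f"{part} && " for part in parts) + existing


prefix = build_test_prefix(
    env={"HWLOC_KEEP_NVIDIA_GPU_NUMA_NODES": "0"},
    commands=["hwloc-ls"],
)
print(prefix + "make check -j 16")
```

Because the diagnostic is chained with `&&`, the topology is printed to the build log right before the tests start, which keeps it next to the failing test output.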

Comment thread eb_hooks.py
bedroge and others added 2 commits May 11, 2026 13:00
@bedroge
Contributor Author

bedroge commented May 11, 2026

bot: build repo:eessi.io-2025.06-software instance:eessi-bot-jsc for:arch=aarch64/nvidia/grace

@eessi-bot-jsc

eessi-bot-jsc Bot commented May 11, 2026

New job on instance eessi-bot-jsc for repository eessi.io-2025.06-software
Building on: nvidia-grace
Building for: aarch64/nvidia/grace
Job dir: /p/project1/ceasybuilders/eessibot/jobs/2026.05/pr_225/14747306

| date | job status | comment |
| --- | --- | --- |
| May 11 11:01:27 UTC 2026 | submitted | job id 14747306 awaits release by job manager |
| May 11 11:02:01 UTC 2026 | released | job awaits launch by Slurm scheduler |
| May 11 11:03:05 UTC 2026 | running | job 14747306 is running |
| May 11 11:48:40 UTC 2026 | finished | 😢 FAILURE (click triangle for details) |
Details
✅ job output file slurm-14747306.out
✅ no message matching FATAL:
❌ found message matching ERROR:
❌ found message matching FAILED:
❌ found message matching required modules missing:
❌ no message matching No missing installations
✅ found message matching .tar.* created!
Artefacts
eessi-2025.06-software-linux-aarch64-nvidia-grace-17784997500.tar.gz
size: 0 MiB (27323 bytes)
entries: 1
modules under 2025.06/software/linux/aarch64/nvidia/grace/modules/all
no module files in tarball
software under 2025.06/software/linux/aarch64/nvidia/grace/software
no software packages in tarball
reprod directories under 2025.06/software/linux/aarch64/nvidia/grace/reprod
no reprod directories in tarball
other under 2025.06/software/linux/aarch64/nvidia/grace
2025.06/init/easybuild/eb_hooks.py
May 11 11:48:40 UTC 2026 test result
😢 FAILURE (click triangle for details)
Reason
EESSI test suite produced failures.
ReFrame Summary
[ FAILED ] Ran 17/29 test case(s) from 29 check(s) (4 failure(s), 12 skipped, 0 aborted)
Details
✅ job output file slurm-14747306.out
❌ found message matching ERROR:
❌ found message matching [\s*FAILED\s*].*Ran .* test case

@bedroge
Contributor Author

bedroge commented May 11, 2026

> > The other main difference between my manual attempt to build it and the builds done by the bot is that the latter uses a container. I can give it a try in the same container.
>
> Have you generated the XML in the container or not? The XML does not have any of the "phantom" NUMA nodes GROMACS is complaining about. So if it was done bare-metal, then everything points to a side-effect of containerization.

That one was indeed generated without the container. The bot always uses the container for doing the software builds, so I'm now trying your approach.

@bedroge
Contributor Author

bedroge commented May 11, 2026

@al42and Here it is:

Machine (478GB total)
  Package L#0
    NUMANode L#0 (P#0 478GB)
    NUMANode L#1 (P#2)
    NUMANode L#2 (P#3)
    NUMANode L#3 (P#4)
    NUMANode L#4 (P#5)
    NUMANode L#5 (P#6)
    NUMANode L#6 (P#7)
    NUMANode L#7 (P#8)
    L3 L#0 (114MB)
      L2 L#0 (1024KB) + L1d L#0 (64KB) + L1i L#0 (64KB) + Core L#0 + PU L#0 (P#0)
      L2 L#1 (1024KB) + L1d L#1 (64KB) + L1i L#1 (64KB) + Core L#1 + PU L#1 (P#1)
      L2 L#2 (1024KB) + L1d L#2 (64KB) + L1i L#2 (64KB) + Core L#2 + PU L#2 (P#2)
      L2 L#3 (1024KB) + L1d L#3 (64KB) + L1i L#3 (64KB) + Core L#3 + PU L#3 (P#3)
      L2 L#4 (1024KB) + L1d L#4 (64KB) + L1i L#4 (64KB) + Core L#4 + PU L#4 (P#4)
      L2 L#5 (1024KB) + L1d L#5 (64KB) + L1i L#5 (64KB) + Core L#5 + PU L#5 (P#5)
      L2 L#6 (1024KB) + L1d L#6 (64KB) + L1i L#6 (64KB) + Core L#6 + PU L#6 (P#6)
      L2 L#7 (1024KB) + L1d L#7 (64KB) + L1i L#7 (64KB) + Core L#7 + PU L#7 (P#7)
      L2 L#8 (1024KB) + L1d L#8 (64KB) + L1i L#8 (64KB) + Core L#8 + PU L#8 (P#8)
      L2 L#9 (1024KB) + L1d L#9 (64KB) + L1i L#9 (64KB) + Core L#9 + PU L#9 (P#9)
      L2 L#10 (1024KB) + L1d L#10 (64KB) + L1i L#10 (64KB) + Core L#10 + PU L#10 (P#10)
      L2 L#11 (1024KB) + L1d L#11 (64KB) + L1i L#11 (64KB) + Core L#11 + PU L#11 (P#11)
      L2 L#12 (1024KB) + L1d L#12 (64KB) + L1i L#12 (64KB) + Core L#12 + PU L#12 (P#12)
      L2 L#13 (1024KB) + L1d L#13 (64KB) + L1i L#13 (64KB) + Core L#13 + PU L#13 (P#13)
      L2 L#14 (1024KB) + L1d L#14 (64KB) + L1i L#14 (64KB) + Core L#14 + PU L#14 (P#14)
      L2 L#15 (1024KB) + L1d L#15 (64KB) + L1i L#15 (64KB) + Core L#15 + PU L#15 (P#15)
      L2 L#16 (1024KB) + L1d L#16 (64KB) + L1i L#16 (64KB) + Core L#16 + PU L#16 (P#16)
      L2 L#17 (1024KB) + L1d L#17 (64KB) + L1i L#17 (64KB) + Core L#17 + PU L#17 (P#17)
      L2 L#18 (1024KB) + L1d L#18 (64KB) + L1i L#18 (64KB) + Core L#18 + PU L#18 (P#18)
      L2 L#19 (1024KB) + L1d L#19 (64KB) + L1i L#19 (64KB) + Core L#19 + PU L#19 (P#19)
      L2 L#20 (1024KB) + L1d L#20 (64KB) + L1i L#20 (64KB) + Core L#20 + PU L#20 (P#20)
      L2 L#21 (1024KB) + L1d L#21 (64KB) + L1i L#21 (64KB) + Core L#21 + PU L#21 (P#21)
      L2 L#22 (1024KB) + L1d L#22 (64KB) + L1i L#22 (64KB) + Core L#22 + PU L#22 (P#22)
      L2 L#23 (1024KB) + L1d L#23 (64KB) + L1i L#23 (64KB) + Core L#23 + PU L#23 (P#23)
      L2 L#24 (1024KB) + L1d L#24 (64KB) + L1i L#24 (64KB) + Core L#24 + PU L#24 (P#24)
      L2 L#25 (1024KB) + L1d L#25 (64KB) + L1i L#25 (64KB) + Core L#25 + PU L#25 (P#25)
      L2 L#26 (1024KB) + L1d L#26 (64KB) + L1i L#26 (64KB) + Core L#26 + PU L#26 (P#26)
      L2 L#27 (1024KB) + L1d L#27 (64KB) + L1i L#27 (64KB) + Core L#27 + PU L#27 (P#27)
      L2 L#28 (1024KB) + L1d L#28 (64KB) + L1i L#28 (64KB) + Core L#28 + PU L#28 (P#28)
      L2 L#29 (1024KB) + L1d L#29 (64KB) + L1i L#29 (64KB) + Core L#29 + PU L#29 (P#29)
      L2 L#30 (1024KB) + L1d L#30 (64KB) + L1i L#30 (64KB) + Core L#30 + PU L#30 (P#30)
      L2 L#31 (1024KB) + L1d L#31 (64KB) + L1i L#31 (64KB) + Core L#31 + PU L#31 (P#31)
      L2 L#32 (1024KB) + L1d L#32 (64KB) + L1i L#32 (64KB) + Core L#32 + PU L#32 (P#32)
      L2 L#33 (1024KB) + L1d L#33 (64KB) + L1i L#33 (64KB) + Core L#33 + PU L#33 (P#33)
      L2 L#34 (1024KB) + L1d L#34 (64KB) + L1i L#34 (64KB) + Core L#34 + PU L#34 (P#34)
      L2 L#35 (1024KB) + L1d L#35 (64KB) + L1i L#35 (64KB) + Core L#35 + PU L#35 (P#35)
      L2 L#36 (1024KB) + L1d L#36 (64KB) + L1i L#36 (64KB) + Core L#36 + PU L#36 (P#36)
      L2 L#37 (1024KB) + L1d L#37 (64KB) + L1i L#37 (64KB) + Core L#37 + PU L#37 (P#37)
      L2 L#38 (1024KB) + L1d L#38 (64KB) + L1i L#38 (64KB) + Core L#38 + PU L#38 (P#38)
      L2 L#39 (1024KB) + L1d L#39 (64KB) + L1i L#39 (64KB) + Core L#39 + PU L#39 (P#39)
      L2 L#40 (1024KB) + L1d L#40 (64KB) + L1i L#40 (64KB) + Core L#40 + PU L#40 (P#40)
      L2 L#41 (1024KB) + L1d L#41 (64KB) + L1i L#41 (64KB) + Core L#41 + PU L#41 (P#41)
      L2 L#42 (1024KB) + L1d L#42 (64KB) + L1i L#42 (64KB) + Core L#42 + PU L#42 (P#42)
      L2 L#43 (1024KB) + L1d L#43 (64KB) + L1i L#43 (64KB) + Core L#43 + PU L#43 (P#43)
      L2 L#44 (1024KB) + L1d L#44 (64KB) + L1i L#44 (64KB) + Core L#44 + PU L#44 (P#44)
      L2 L#45 (1024KB) + L1d L#45 (64KB) + L1i L#45 (64KB) + Core L#45 + PU L#45 (P#45)
      L2 L#46 (1024KB) + L1d L#46 (64KB) + L1i L#46 (64KB) + Core L#46 + PU L#46 (P#46)
      L2 L#47 (1024KB) + L1d L#47 (64KB) + L1i L#47 (64KB) + Core L#47 + PU L#47 (P#47)
      L2 L#48 (1024KB) + L1d L#48 (64KB) + L1i L#48 (64KB) + Core L#48 + PU L#48 (P#48)
      L2 L#49 (1024KB) + L1d L#49 (64KB) + L1i L#49 (64KB) + Core L#49 + PU L#49 (P#49)
      L2 L#50 (1024KB) + L1d L#50 (64KB) + L1i L#50 (64KB) + Core L#50 + PU L#50 (P#50)
      L2 L#51 (1024KB) + L1d L#51 (64KB) + L1i L#51 (64KB) + Core L#51 + PU L#51 (P#51)
      L2 L#52 (1024KB) + L1d L#52 (64KB) + L1i L#52 (64KB) + Core L#52 + PU L#52 (P#52)
      L2 L#53 (1024KB) + L1d L#53 (64KB) + L1i L#53 (64KB) + Core L#53 + PU L#53 (P#53)
      L2 L#54 (1024KB) + L1d L#54 (64KB) + L1i L#54 (64KB) + Core L#54 + PU L#54 (P#54)
      L2 L#55 (1024KB) + L1d L#55 (64KB) + L1i L#55 (64KB) + Core L#55 + PU L#55 (P#55)
      L2 L#56 (1024KB) + L1d L#56 (64KB) + L1i L#56 (64KB) + Core L#56 + PU L#56 (P#56)
      L2 L#57 (1024KB) + L1d L#57 (64KB) + L1i L#57 (64KB) + Core L#57 + PU L#57 (P#57)
      L2 L#58 (1024KB) + L1d L#58 (64KB) + L1i L#58 (64KB) + Core L#58 + PU L#58 (P#58)
      L2 L#59 (1024KB) + L1d L#59 (64KB) + L1i L#59 (64KB) + Core L#59 + PU L#59 (P#59)
      L2 L#60 (1024KB) + L1d L#60 (64KB) + L1i L#60 (64KB) + Core L#60 + PU L#60 (P#60)
      L2 L#61 (1024KB) + L1d L#61 (64KB) + L1i L#61 (64KB) + Core L#61 + PU L#61 (P#61)
      L2 L#62 (1024KB) + L1d L#62 (64KB) + L1i L#62 (64KB) + Core L#62 + PU L#62 (P#62)
      L2 L#63 (1024KB) + L1d L#63 (64KB) + L1i L#63 (64KB) + Core L#63 + PU L#63 (P#63)
      L2 L#64 (1024KB) + L1d L#64 (64KB) + L1i L#64 (64KB) + Core L#64 + PU L#64 (P#64)
      L2 L#65 (1024KB) + L1d L#65 (64KB) + L1i L#65 (64KB) + Core L#65 + PU L#65 (P#65)
      L2 L#66 (1024KB) + L1d L#66 (64KB) + L1i L#66 (64KB) + Core L#66 + PU L#66 (P#66)
      L2 L#67 (1024KB) + L1d L#67 (64KB) + L1i L#67 (64KB) + Core L#67 + PU L#67 (P#67)
      L2 L#68 (1024KB) + L1d L#68 (64KB) + L1i L#68 (64KB) + Core L#68 + PU L#68 (P#68)
      L2 L#69 (1024KB) + L1d L#69 (64KB) + L1i L#69 (64KB) + Core L#69 + PU L#69 (P#69)
      L2 L#70 (1024KB) + L1d L#70 (64KB) + L1i L#70 (64KB) + Core L#70 + PU L#70 (P#70)
      L2 L#71 (1024KB) + L1d L#71 (64KB) + L1i L#71 (64KB) + Core L#71 + PU L#71 (P#71)
  HostBridge
    PCIBridge
      PCI 0000:01:00.0 (InfiniBand)
        Net "ibp1s0f0"
        OpenFabrics "mlx5_0"
      PCI 0000:01:00.1 (InfiniBand)
        Net "ibp1s0f1"
        OpenFabrics "mlx5_1"
  HostBridge
    PCIBridge
      PCI 0002:01:00.0 (Ethernet)
        Net "enP2p1s0f0np0"
      PCI 0002:01:00.1 (Ethernet)
        Net "enP2p1s0f1np1"
  HostBridge
    PCIBridge
      PCI 0004:01:00.0 (NVMExp)
        Block(Disk) "nvme0n1"
  HostBridge
    PCIBridge
      PCI 0006:01:00.0 (SAS)
        Block "sda"
  HostBridge
    PCIBridge
      PCIBridge
        PCIBridge
          PCIBridge
            PCI 0008:04:00.0 (VGA)
  HostBridge
    PCIBridge
      PCI 0009:01:00.0 (3D)
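To make the pattern in this output easier to spot, the NUMANode lines can be picked out mechanically. The Python sketch below is a reading aid written for this note; the regular expression and function names are assumptions, not part of hwloc or GROMACS. It treats a node as "memoryless" when `hwloc-ls` prints no size after the `P#` index.

```python
import re

# Matches e.g. "NUMANode L#0 (P#0 478GB)", "NUMANode L#1 (P#2)" and
# "NUMANode(GPUMemory) L#1 (P#4 95GB)".
NUMA_RE = re.compile(
    r"NUMANode(?:\((?P<subtype>\w+)\))?\s+L#(?P<logical>\d+)\s+"
    r"\(P#(?P<physical>\d+)(?:\s+(?P<size>\d+[KMGT]B))?\)"
)


def numa_nodes(hwloc_ls_text):
    """Return one dict per NUMANode line found in hwloc-ls text output."""
    nodes = []
    for line in hwloc_ls_text.splitlines():
        match = NUMA_RE.search(line)
        if match:
            nodes.append({
                "logical": int(match["logical"]),
                "physical": int(match["physical"]),
                "subtype": match["subtype"],
                "size": match["size"],  # None when no memory is reported
            })
    return nodes


def memoryless(nodes):
    """Physical indices of nodes for which no memory size was printed."""
    return [node["physical"] for node in nodes if node["size"] is None]


# First lines of the output posted above.
SAMPLE = """\
Machine (478GB total)
  Package L#0
    NUMANode L#0 (P#0 478GB)
    NUMANode L#1 (P#2)
    NUMANode L#2 (P#3)
    NUMANode L#3 (P#4)
    NUMANode L#4 (P#5)
    NUMANode L#5 (P#6)
    NUMANode L#6 (P#7)
    NUMANode L#7 (P#8)
    L3 L#0 (114MB)
"""

nodes = numa_nodes(SAMPLE)
print(len(nodes), memoryless(nodes))
```

On this sample the sketch finds eight nodes, seven of which (P#2 to P#8) report no memory, matching the seven `n.memory > 0` failures in the GROMACS test log.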

@al42and

al42and commented May 11, 2026

I have a GH200 node, and there is a similar pattern if I look at /sys/devices/system/node/: after each GPU NUMA node there are 7 more NUMA nodes with no memory (possibly MIG-related? but we have MIG disabled).

However, bare-metal hwloc correctly filters them out, unless HWLOC_ALLOW=all is set, in which case I see a similar thing (there are 4 CPUs and 4 GPUs here, so NUMA P# numbers are different).

$ hwloc-ls
Machine (856GB total)
  Package L#0
    NUMANode L#0 (P#0 119GB)
    NUMANode(GPUMemory) L#1 (P#4 95GB)
    L3 L#0 (114MB)
...
$ HWLOC_ALLOW=all hwloc-ls
Machine (856GB total)
  Package L#0
    NUMANode L#0 (P#0 119GB)
    NUMANode(GPUMemory) L#1 (P#4 95GB)
    NUMANode L#2 (P#5)
    NUMANode L#3 (P#6)
    NUMANode L#4 (P#7)
    NUMANode L#5 (P#8)
    NUMANode L#6 (P#9)
    NUMANode L#7 (P#10)
    NUMANode L#8 (P#11)
    L3 L#0 (114MB)
...

There does not seem to be a clean way to filter them out except by the lack of memory. P#0 has MemoryTier 0; P#4 has MemoryTier 2, subtype="GPUMemory", and a valid PCIBusID; P#5 to P#11 have MemoryTier 1 (placeholder?).

I guess this means that either HWLOC_ALLOW is set somewhere, or the containerization messes up whatever mechanism hwloc is using to hide these nodes.

Skipping this test seems like the right solution now; in the next GROMACS version, we can filter invalid NUMA nodes ourselves.

To me, the hwloc behavior seems wrong here, but I don't know to what extent you can/want to debug the containerization. We don't have any containers on our nodes, so I don't think I could dig any deeper. But at least I can reproduce the thing on my end, so I won't pester you about testing the workaround :)
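The "filter by lack of memory" idea can be tried directly against `/sys/devices/system/node/`. The Python sketch below is only an illustration of that idea, not the planned GROMACS change and not what hwloc does internally; the function names are invented for this note. It relies on the per-node `meminfo` file, whose first line has the form `Node <n> MemTotal: <kB> kB`.

```python
import re
from pathlib import Path

MEMTOTAL_RE = re.compile(r"^Node\s+(\d+)\s+MemTotal:\s+(\d+)\s+kB", re.MULTILINE)


def node_memory_kb(sysfs_root="/sys/devices/system/node"):
    """Map NUMA node index to its MemTotal in kB, read from sysfs."""
    totals = {}
    for meminfo in sorted(Path(sysfs_root).glob("node[0-9]*/meminfo")):
        match = MEMTOTAL_RE.search(meminfo.read_text())
        if match:
            totals[int(match.group(1))] = int(match.group(2))
    return totals


def nodes_with_memory(totals):
    """Keep only the nodes that actually report memory."""
    return sorted(node for node, kb in totals.items() if kb > 0)


if __name__ == "__main__":
    totals = node_memory_kb()
    print("all nodes:        ", sorted(totals))
    print("nodes with memory:", nodes_with_memory(totals))
```

The `sysfs_root` parameter exists so the function can be pointed at a copy of the directory tree, for example one saved from inside the container.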

@bedroge
Contributor Author

bedroge commented May 18, 2026

I tried it on the host itself (i.e. without the container), and then I do get the same output as you. Inside the container (same hwloc version) I get the additional NUMA nodes, even though that environment variable is not set. So, looks like you're right and the containerization somehow messes things up...

I think I'll try disabling that test for now. Thanks again for your help!

@bedroge
Contributor Author

bedroge commented May 18, 2026

Looks like bind mounting /sys into the container also solves the issue.

edit: I figured it may be cgroup-related (since it's running as a Slurm job), and indeed, bind mounting /sys/fs/cgroup specifically also solves the issue.
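For anyone who wants to compare the host and the container, one quick diagnostic is the list of memory nodes the kernel allows for the current process, which is exposed as `Mems_allowed_list` in `/proc/self/status`. The Python sketch below only reads and parses that field; the function names are invented for this note. This thread does not establish which mechanism hwloc uses, so treat a difference between the two environments as a hint rather than an explanation.

```python
from pathlib import Path


def parse_id_list(text):
    """Parse a kernel list string such as '0-1,4' into a set of ints."""
    ids = set()
    for chunk in text.strip().split(","):
        if not chunk:
            continue
        first, _, last = chunk.partition("-")
        # A chunk without '-' is a single index, so last falls back to first.
        ids.update(range(int(first), int(last or first) + 1))
    return ids


def mems_allowed(status_text):
    """Extract the allowed memory nodes from /proc/<pid>/status content."""
    for line in status_text.splitlines():
        if line.startswith("Mems_allowed_list:"):
            return parse_id_list(line.split(":", 1)[1])
    return set()


if __name__ == "__main__":
    status = Path("/proc/self/status")
    if status.exists():
        print(sorted(mems_allowed(status.read_text())))
```

Running it once on the host and once inside the container (with and without the `/sys/fs/cgroup` bind mount) gives three lists that can be compared side by side.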

@bedroge
Contributor Author

bedroge commented May 18, 2026

bot: build repo:eessi.io-2025.06-software instance:eessi-bot-jsc for:arch=aarch64/nvidia/grace

@eessi-bot-jsc

eessi-bot-jsc Bot commented May 18, 2026

New job on instance eessi-bot-jsc for repository eessi.io-2025.06-software
Building on: nvidia-grace
Building for: aarch64/nvidia/grace
Job dir: /p/project1/ceasybuilders/eessibot/jobs/2026.05/pr_225/14769812

| date | job status | comment |
| --- | --- | --- |
| May 18 15:16:47 UTC 2026 | submitted | job id 14769812 awaits release by job manager |
| May 18 15:17:51 UTC 2026 | released | job awaits launch by Slurm scheduler |
| May 18 15:18:55 UTC 2026 | running | job 14769812 is running |
| May 18 17:03:38 UTC 2026 | finished | 😁 SUCCESS (click triangle for details) |
Details
✅ job output file slurm-14769812.out
✅ no message matching FATAL:
✅ no message matching ERROR:
✅ no message matching FAILED:
✅ no message matching required modules missing:
✅ found message(s) matching No missing installations
✅ found message matching .tar.* created!
Artefacts
eessi-2025.06-software-linux-aarch64-nvidia-grace-17791230840.tar.gz
size: 46 MiB (49228440 bytes)
entries: 831
modules under 2025.06/software/linux/aarch64/nvidia/grace/modules/all
GROMACS/2026.2-foss-2025b.lua
software under 2025.06/software/linux/aarch64/nvidia/grace/software
GROMACS/2026.2-foss-2025b
reprod directories under 2025.06/software/linux/aarch64/nvidia/grace/reprod
GROMACS/2026.2-foss-2025b/20260518_165002UTC
other under 2025.06/software/linux/aarch64/nvidia/grace
2025.06/init/easybuild/eb_hooks.py
May 18 17:03:38 UTC 2026 test result
😢 FAILURE (click triangle for details)
Reason
EESSI test suite produced failures.
ReFrame Summary
[ FAILED ] Ran 19/31 test case(s) from 31 check(s) (4 failure(s), 12 skipped, 0 aborted)
Details
✅ job output file slurm-14769812.out
❌ found message matching ERROR:
❌ found message matching [\s*FAILED\s*].*Ran .* test case

Comment thread eb_hooks.py Outdated
Comment thread eb_hooks.py Outdated
@bedroge
Contributor Author

bedroge commented May 18, 2026

bot: build repo:eessi.io-2025.06-software instance:eessi-bot-jsc for:arch=aarch64/nvidia/grace

@eessi-bot-jsc

eessi-bot-jsc Bot commented May 18, 2026

New job on instance eessi-bot-jsc for repository eessi.io-2025.06-software
Building on: nvidia-grace
Building for: aarch64/nvidia/grace
Job dir: /p/project1/ceasybuilders/eessibot/jobs/2026.05/pr_225/14770447

| date | job status | comment |
| --- | --- | --- |
| May 18 20:44:44 UTC 2026 | submitted | job id 14770447 awaits release by job manager |
| May 18 20:44:49 UTC 2026 | released | job awaits launch by Slurm scheduler |
| May 18 20:45:53 UTC 2026 | running | job 14770447 is running |
| May 18 22:27:19 UTC 2026 | finished | 😁 SUCCESS (click triangle for details) |
Details
✅ job output file slurm-14770447.out
✅ no message matching FATAL:
✅ no message matching ERROR:
✅ no message matching FAILED:
✅ no message matching required modules missing:
✅ found message(s) matching No missing installations
✅ found message matching .tar.* created!
Artefacts
eessi-2025.06-software-linux-aarch64-nvidia-grace-17791424780.tar.gz
size: 46 MiB (49212509 bytes)
entries: 831
modules under 2025.06/software/linux/aarch64/nvidia/grace/modules/all
GROMACS/2026.2-foss-2025b.lua
software under 2025.06/software/linux/aarch64/nvidia/grace/software
GROMACS/2026.2-foss-2025b
reprod directories under 2025.06/software/linux/aarch64/nvidia/grace/reprod
GROMACS/2026.2-foss-2025b/20260518_221323UTC
other under 2025.06/software/linux/aarch64/nvidia/grace
2025.06/init/easybuild/eb_hooks.py
May 18 22:27:19 UTC 2026 test result
😢 FAILURE (click triangle for details)
Reason
EESSI test suite produced failures.
ReFrame Summary
[ FAILED ] Ran 19/31 test case(s) from 31 check(s) (4 failure(s), 12 skipped, 0 aborted)
Details
✅ job output file slurm-14770447.out
❌ found message matching ERROR:
❌ found message matching [\s*FAILED\s*].*Ran .* test case

@bedroge
Contributor Author

bedroge commented May 19, 2026

bot: build repo:eessi.io-2023.06-software instance:eessi-bot-jsc for:arch=aarch64/nvidia/grace
bot: build repo:eessi.io-2025.06-software instance:eessi-bot-jsc for:arch=aarch64/nvidia/grace

@eessi-bot-jsc

eessi-bot-jsc Bot commented May 19, 2026

New job on instance eessi-bot-jsc for repository eessi.io-2023.06-software
Building on: nvidia-grace
Building for: aarch64/nvidia/grace
Job dir: /p/project1/ceasybuilders/eessibot/jobs/2026.05/pr_225/14771128

| date | job status | comment |
| --- | --- | --- |
| May 19 05:35:04 UTC 2026 | submitted | job id 14771128 awaits release by job manager |
| May 19 05:35:28 UTC 2026 | released | job awaits launch by Slurm scheduler |
| May 19 05:36:36 UTC 2026 | running | job 14771128 is running |
| May 19 06:53:09 UTC 2026 | finished | 😁 SUCCESS (click triangle for details) |
Details
✅ job output file slurm-14771128.out
✅ no message matching FATAL:
✅ no message matching ERROR:
✅ no message matching FAILED:
✅ no message matching required modules missing:
✅ found message(s) matching No missing installations
✅ found message matching .tar.* created!
Artefacts
eessi-2023.06-software-linux-aarch64-nvidia-grace-17791690460.tar.gz
size: 0 MiB (27302 bytes)
entries: 1
modules under 2023.06/software/linux/aarch64/nvidia/grace/modules/all
no module files in tarball
software under 2023.06/software/linux/aarch64/nvidia/grace/software
no software packages in tarball
reprod directories under 2023.06/software/linux/aarch64/nvidia/grace/reprod
no reprod directories in tarball
other under 2023.06/software/linux/aarch64/nvidia/grace
2023.06/init/easybuild/eb_hooks.py
May 19 06:53:10 UTC 2026 test result
😁 SUCCESS (click triangle for details)
ReFrame Summary
[ SKIP ] ( 1/28) Skipping GPU test : only 1 GPU available for this test case
[ SKIP ] ( 2/28) Skipping GPU test : only 1 GPU available for this test case
[ SKIP ] ( 3/28) Skipping GPU test : only 1 GPU available for this test case
[ SKIP ] ( 4/28) Skipping GPU test : only 1 GPU available for this test case
[ SKIP ] ( 5/28) Skipping GPU test : only 1 GPU available for this test case
[ SKIP ] ( 6/28) Skipping GPU test : only 1 GPU available for this test case
[ SKIP ] ( 7/28) Skipping GPU test : only 1 GPU available for this test case
[ SKIP ] ( 8/28) Skipping GPU test : only 1 GPU available for this test case
[ OK ] ( 9/28) EESSI_LAMMPS_lj %device_type=gpu %module_name=LAMMPS/2Aug2023_update2-foss-2023a-kokkos-CUDA-12.1.1 %scale=1_node /1f2ca7c1 @BotBuildTests:aarch64-nvidia-grace+default
P: perf: 7047.831 timesteps/s (r:0, l:None, u:None)
[ OK ] (10/28) EESSI_LAMMPS_lj %device_type=cpu %module_name=LAMMPS/29Aug2024-foss-2023b-kokkos %scale=1_node /aeb2d9df @BotBuildTests:aarch64-nvidia-grace+default
P: perf: 1531.066 timesteps/s (r:0, l:None, u:None)
[ OK ] (11/28) EESSI_LAMMPS_lj %device_type=cpu %module_name=LAMMPS/2Aug2023_update2-foss-2023a-kokkos %scale=1_node /04ff9ece @BotBuildTests:aarch64-nvidia-grace+default
P: perf: 1510.81 timesteps/s (r:0, l:None, u:None)
[ OK ] (12/28) EESSI_LAMMPS_lj %device_type=cpu %module_name=LAMMPS/2Aug2023_update2-foss-2023a-kokkos-CUDA-12.1.1 %scale=1_node /b22a48f5 @BotBuildTests:aarch64-nvidia-grace+default
P: perf: 1495.473 timesteps/s (r:0, l:None, u:None)
[ OK ] (13/28) EESSI_OSU_coll %benchmark_info=mpi.collective.osu_allreduce %module_name=OSU-Micro-Benchmarks/7.2-gompi-2023b %scale=1_node %device_type=cpu /775175bf @BotBuildTests:aarch64-nvidia-grace+default
P: latency: 4.83 us (r:0, l:None, u:None)
[ OK ] (14/28) EESSI_OSU_coll %benchmark_info=mpi.collective.osu_allreduce %module_name=OSU-Micro-Benchmarks/7.1-1-gompi-2023a %scale=1_node %device_type=cpu /52707c40 @BotBuildTests:aarch64-nvidia-grace+default
P: latency: 4.86 us (r:0, l:None, u:None)
[ OK ] (15/28) EESSI_OSU_coll %benchmark_info=mpi.collective.osu_allreduce %module_name=OSU-Micro-Benchmarks/7.5-gompi-2023b-CUDA-12.4.0 %scale=1_node %device_type=cpu /95ac9526 @BotBuildTests:aarch64-nvidia-grace+default
P: latency: 5.75 us (r:0, l:None, u:None)
[ OK ] (16/28) EESSI_OSU_coll %benchmark_info=mpi.collective.osu_allreduce %module_name=OSU-Micro-Benchmarks/7.2-gompi-2023a-CUDA-12.1.1 %scale=1_node %device_type=cpu /1cff5d41 @BotBuildTests:aarch64-nvidia-grace+default
P: latency: 5.78 us (r:0, l:None, u:None)
[ OK ] (17/28) EESSI_OSU_coll %benchmark_info=mpi.collective.osu_alltoall %module_name=OSU-Micro-Benchmarks/7.2-gompi-2023b %scale=1_node %device_type=cpu /b1aacda9 @BotBuildTests:aarch64-nvidia-grace+default
P: latency: 7.1 us (r:0, l:None, u:None)
[ OK ] (18/28) EESSI_OSU_coll %benchmark_info=mpi.collective.osu_alltoall %module_name=OSU-Micro-Benchmarks/7.1-1-gompi-2023a %scale=1_node %device_type=cpu /c6bad193 @BotBuildTests:aarch64-nvidia-grace+default
P: latency: 27.38 us (r:0, l:None, u:None)
[ OK ] (19/28) EESSI_OSU_coll %benchmark_info=mpi.collective.osu_alltoall %module_name=OSU-Micro-Benchmarks/7.5-gompi-2023b-CUDA-12.4.0 %scale=1_node %device_type=cpu /0edb8a95 @BotBuildTests:aarch64-nvidia-grace+default
P: latency: 10.78 us (r:0, l:None, u:None)
[ OK ] (20/28) EESSI_OSU_coll %benchmark_info=mpi.collective.osu_alltoall %module_name=OSU-Micro-Benchmarks/7.2-gompi-2023a-CUDA-12.1.1 %scale=1_node %device_type=cpu /b2ab2213 @BotBuildTests:aarch64-nvidia-grace+default
P: latency: 10.83 us (r:0, l:None, u:None)
[ OK ] (21/28) EESSI_OSU_pt2pt_CPU %benchmark_info=mpi.pt2pt.osu_latency %module_name=OSU-Micro-Benchmarks/7.2-gompi-2023b %scale=1_node /15cad6c4 @BotBuildTests:aarch64-nvidia-grace+default
P: latency: 0.48 us (r:0, l:None, u:None)
[ OK ] (22/28) EESSI_OSU_pt2pt_CPU %benchmark_info=mpi.pt2pt.osu_latency %module_name=OSU-Micro-Benchmarks/7.1-1-gompi-2023a %scale=1_node /6672deda @BotBuildTests:aarch64-nvidia-grace+default
P: latency: 0.45 us (r:0, l:None, u:None)
[ OK ] (23/28) EESSI_OSU_pt2pt_CPU %benchmark_info=mpi.pt2pt.osu_latency %module_name=OSU-Micro-Benchmarks/7.5-gompi-2023b-CUDA-12.4.0 %scale=1_node /8ec94746 @BotBuildTests:aarch64-nvidia-grace+default
P: latency: 0.51 us (r:0, l:None, u:None)
[ OK ] (24/28) EESSI_OSU_pt2pt_CPU %benchmark_info=mpi.pt2pt.osu_latency %module_name=OSU-Micro-Benchmarks/7.2-gompi-2023a-CUDA-12.1.1 %scale=1_node /1a3a497b @BotBuildTests:aarch64-nvidia-grace+default
P: latency: 0.55 us (r:0, l:None, u:None)
[ OK ] (25/28) EESSI_OSU_pt2pt_CPU %benchmark_info=mpi.pt2pt.osu_bw %module_name=OSU-Micro-Benchmarks/7.2-gompi-2023b %scale=1_node /2a9a47b1 @BotBuildTests:aarch64-nvidia-grace+default
P: bandwidth: 40751.22 MB/s (r:0, l:None, u:None)
[ OK ] (26/28) EESSI_OSU_pt2pt_CPU %benchmark_info=mpi.pt2pt.osu_bw %module_name=OSU-Micro-Benchmarks/7.1-1-gompi-2023a %scale=1_node /1b24ab8e @BotBuildTests:aarch64-nvidia-grace+default
P: bandwidth: 37545.41 MB/s (r:0, l:None, u:None)
[ OK ] (27/28) EESSI_OSU_pt2pt_CPU %benchmark_info=mpi.pt2pt.osu_bw %module_name=OSU-Micro-Benchmarks/7.5-gompi-2023b-CUDA-12.4.0 %scale=1_node /c9ca6dc1 @BotBuildTests:aarch64-nvidia-grace+default
P: bandwidth: 40674.61 MB/s (r:0, l:None, u:None)
[ OK ] (28/28) EESSI_OSU_pt2pt_CPU %benchmark_info=mpi.pt2pt.osu_bw %module_name=OSU-Micro-Benchmarks/7.2-gompi-2023a-CUDA-12.1.1 %scale=1_node /17289b2f @BotBuildTests:aarch64-nvidia-grace+default
P: bandwidth: 41126.25 MB/s (r:0, l:None, u:None)
[ PASSED ] Ran 20/28 test case(s) from 28 check(s) (0 failure(s), 8 skipped, 0 aborted)
Details
✅ job output file slurm-14771128.out
✅ no message matching ERROR:
✅ no message matching [\s*FAILED\s*].*Ran .* test case
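The two checks listed under Details amount to scanning the job output file for lines that match a pattern. A minimal sketch of that idea follows. It is not the bot's actual implementation: the function name `check_test_output` is hypothetical, and the brackets in the second pattern are assumed to be escaped (`\[` and `\]`), on the guess that the backslashes were lost when the comment was rendered.

```python
import re

# Patterns as shown in the Details list. Escaping the square brackets is an
# assumption: unescaped, they would form a character class rather than match
# a literal "[ FAILED ]" marker.
ERROR_PATTERN = re.compile(r"ERROR:")
FAILED_PATTERN = re.compile(r"\[\s*FAILED\s*\].*Ran .* test case")


def check_test_output(text):
    """Report which of the two patterns occur on any line of the given text.

    Hypothetical helper for illustration only.
    """
    lines = text.splitlines()
    return {
        "error": any(ERROR_PATTERN.search(line) for line in lines),
        "failed": any(FAILED_PATTERN.search(line) for line in lines),
    }


# Example input modelled on the summary lines quoted in this thread.
passing_output = (
    "[ PASSED ] Ran 20/28 test case(s) from 28 check(s) "
    "(0 failure(s), 8 skipped, 0 aborted)"
)
failing_output = (
    "ERROR: example error line\n"
    "[ FAILED ] Ran 17/29 test case(s) from 29 check(s) "
    "(4 failure(s), 12 skipped, 0 aborted)"
)

passing_result = check_test_output(passing_output)
failing_result = check_test_output(failing_output)
```

A test result would count as a success only when both values in the returned dictionary are false, which mirrors the two green check marks above.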

@eessi-bot-jsc

eessi-bot-jsc Bot commented May 19, 2026

New job on instance eessi-bot-jsc for repository eessi.io-2025.06-software
Building on: nvidia-grace
Building for: aarch64/nvidia/grace
Job dir: /p/project1/ceasybuilders/eessibot/jobs/2026.05/pr_225/14771129

date job status comment
May 19 05:35:09 UTC 2026 submitted job id 14771129 awaits release by job manager
May 19 05:35:24 UTC 2026 released job awaits launch by Slurm scheduler
May 19 05:36:40 UTC 2026 running job 14771129 is running
May 19 06:53:13 UTC 2026 finished
😁 SUCCESS (click triangle for details)
Details
✅ job output file slurm-14771129.out
✅ no message matching FATAL:
✅ no message matching ERROR:
✅ no message matching FAILED:
✅ no message matching required modules missing:
✅ found message(s) matching No missing installations
✅ found message matching .tar.* created!
Artefacts
eessi-2025.06-software-linux-aarch64-nvidia-grace-17791690180.tar.gz
size: 0 MiB (27304 bytes)
entries: 1
modules under 2025.06/software/linux/aarch64/nvidia/grace/modules/all
no module files in tarball
software under 2025.06/software/linux/aarch64/nvidia/grace/software
no software packages in tarball
reprod directories under 2025.06/software/linux/aarch64/nvidia/grace/reprod
no reprod directories in tarball
other under 2025.06/software/linux/aarch64/nvidia/grace
2025.06/init/easybuild/eb_hooks.py
May 19 06:53:13 UTC 2026 test result
😢 FAILURE (click triangle for details)
Reason
EESSI test suite produced failures.
ReFrame Summary
[ FAILED ] Ran 17/29 test case(s) from 29 check(s) (4 failure(s), 12 skipped, 0 aborted)
Details
✅ job output file slurm-14771129.out
❌ found message matching ERROR:
❌ found message matching [\s*FAILED\s*].*Ran .* test case
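To compare the two test runs reported in this thread, the ReFrame summary lines can be parsed into their counts. The sketch below assumes the line layout is exactly as quoted in the bot comments. `SUMMARY_RE` and `parse_summary` are hypothetical names introduced for illustration and are not part of ReFrame or the bot.

```python
import re

# Layout taken from the summary lines quoted in the bot comments, e.g.
# "[ FAILED ] Ran 17/29 test case(s) from 29 check(s) (4 failure(s), 12 skipped, 0 aborted)".
SUMMARY_RE = re.compile(
    r"\[\s*(?P<status>PASSED|FAILED)\s*\]\s*"
    r"Ran (?P<ran>\d+)/(?P<total>\d+) test case\(s\) "
    r"from (?P<checks>\d+) check\(s\) "
    r"\((?P<failures>\d+) failure\(s\), "
    r"(?P<skipped>\d+) skipped, "
    r"(?P<aborted>\d+) aborted\)"
)


def parse_summary(line):
    """Return the counts from a summary line as a dict, or None if it does not match.

    Hypothetical helper for illustration only.
    """
    match = SUMMARY_RE.search(line)
    if match is None:
        return None
    fields = match.groupdict()
    status = fields.pop("status")
    result = {name: int(value) for name, value in fields.items()}
    result["status"] = status
    return result


summary_2023 = parse_summary(
    "[ PASSED ] Ran 20/28 test case(s) from 28 check(s) "
    "(0 failure(s), 8 skipped, 0 aborted)"
)
summary_2025 = parse_summary(
    "[ FAILED ] Ran 17/29 test case(s) from 29 check(s) "
    "(4 failure(s), 12 skipped, 0 aborted)"
)
```

In both summaries the number of test cases run plus the number skipped equals the total, so the 4 failures in the 2025.06 run are among the 17 cases that actually ran.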

EESSI deleted a comment from eessi-bot-jsc (bot) on May 19, 2026
