I am posting this as a libnvme issue, but I am not really sure whether the problem is in libnvme, nvme-cli, the kernel, or somewhere else.
I have two computers, pc1 and pc2, connected via ConnectX3 InfiniBand adapters, with pc1 mounting three remote NVMe SSDs from pc2: a Samsung 990 Pro 2TB and two Corsair MP600 Pro 8TB.
I am having two problems:
- When running `nvme discover -t rdma -a [IP] -s 4420`, I get the errors `could not alloc memory for discovery log page` and `failed to get discovery log: Cannot allocate memory`.
- I can still connect via `nvme connect -n [SUBSYS] -t rdma -a [IP] -s 4420`, and the Samsung 2TB SSD then seems to work normally, but I cannot mount a filesystem from the Corsair SSDs (even though I see them normally in `/proc/partitions`).
I have already tried the following:
- When I switch everything to TCP via IPoIB, using transport type `tcp` instead of `rdma` in all nvme commands, everything works.
- I have another PC, pc3, with two Samsung NVMe drives and the same ConnectX3 InfiniBand card. Mounting these two SSDs from pc1 via RDMA works without a problem, and I also do not get the error when running `nvme discover`.
- I tried Linux kernels 6.6, 6.7, and 6.8, with nvme-cli and libnvme 1.7, 1.8, and the latest master version; all behave the same.
- For the `nvme discover` error: I debugged what happens in libnvme when I get `could not alloc memory for discovery log page`, and it seems the `numrec` value read here (libnvme/src/nvme/fabrics.c, line 1089 at 691f809):

  `numrec = le64_to_cpu(log->numrec);`

  is bogus, so libnvme tries to allocate a very large amount of memory and fails.
- For the mount failure: when trying to mount the EFI FAT partition on the disk, the kernel log shows `FAT-fs (nvme1n1p1): bogus number of reserved sectors`, and the mount command reports `wrong fs type, bad option, bad superblock on ..., missing codepage or helper program, or other error`. Mounting the very same filesystem when connecting via `tcp` instead of `rdma` works correctly. To check whether I get corrupted data from this disk via `rdma`, I dumped the whole filesystem to an external disk with `dd` and compared what I read when connecting with `rdma` and with `tcp`: the dumps are fully identical. So the data I am getting from the disk is correct, and mounting fails for a different reason. The sector size and disk size reported when opening the disk with `gdisk` are also identical.
In summary, everything works when connecting via `tcp`. When connecting via `rdma`, `nvme discover` fails because the `numrec` entry is bogus; I can mount filesystems from the 2TB Samsung SSD correctly, but filesystems from the 8TB Corsair SSDs fail to mount, even though the data I read from the disk with `dd` is fully correct.
I would be thankful for any advice, or a recommendation of where else to ask. If you'd like me to run any tests, please let me know.