
"could not alloc memory for discovery log page" and mount failure with Corsair 8TB drive over RDMA nvmet #807

@davidrohr

Description


I am posting this as a libnvme issue, but I am not really sure whether the problem is in libnvme, nvme-cli, the kernel, or somewhere else.

I have 2 computers, pc1 and pc2, connected via ConnectX3 InfiniBand adapters, with pc1 mounting 3 remote NVMe SSDs from pc2: a Samsung 990 Pro 2TB and two Corsair MP600 Pro 8TB drives.

I am having 2 problems:

  • When running nvme discover -t rdma -a [IP] -s 4420, I get the error:
could not alloc memory for discovery log page
failed to get discovery log: Cannot allocate memory
  • I can still connect via nvme connect -n [SUBSYS] -t rdma -a [IP] -s 4420, and then the Samsung 2TB SSD seems to work normally, but I cannot mount a file system from the Corsair SSDs (even though I see them normally in /proc/partitions).

I tried the following things already:

  • When I switch everything to TCP via IPoIB, i.e. I use the tcp transport type instead of rdma in all nvme commands, everything works.
  • I have another PC, pc3, with 2 Samsung NVMe drives and the same ConnectX3 InfiniBand card. Mounting these 2 SSDs from pc1 via RDMA works without a problem, and I also do not get the error when running nvme discover.
  • I tried Linux kernels 6.6, 6.7, and 6.8, and nvme-cli and libnvme 1.7, 1.8, and the latest master version - all behave the same.
  • For the nvme discovery error: I debugged what happens in libnvme when I get the "could not alloc memory for discovery log page" error, and it seems the numrec value here
    numrec = le64_to_cpu(log->numrec);
    is bogus, so libnvme tries to allocate a very large amount of memory and fails.
  • For the mount failure: In the kernel log, when trying to mount the EFI FAT partition on the disk, I get FAT-fs (nvme1n1p1): bogus number of reserved sectors, and from the mount command: wrong fs type, bad option, bad superblock on ..., missing codepage or helper program, or other error. Mounting the very same file system when connecting via tcp instead of rdma works correctly. To check whether I get corrupted data from this disk via RDMA, I dumped the whole filesystem to an external disk with dd and compared what I read when connecting with rdma and with tcp: the dumps are fully identical. So the data I am getting from the disk is correct, and mounting fails for a different reason. The sector size and disk size reported when opening the disk with gdisk are also identical.
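
For reference, the dump-and-compare verification described above can be sketched as follows. The device path and output locations are placeholders for whatever your setup uses:

```shell
# Placeholder device path -- substitute your actual namespace/partition.
DEV=/dev/nvme1n1p1

# 1. While connected via RDMA, dump the partition:
dd if="$DEV" of=/tmp/dump_rdma.img bs=1M

# 2. Disconnect, reconnect via TCP, and dump the same partition again:
dd if="$DEV" of=/tmp/dump_tcp.img bs=1M

# 3. Compare byte-for-byte; prints "identical" only if the reads match:
cmp /tmp/dump_rdma.img /tmp/dump_tcp.img && echo identical
```

A byte-identical result here rules out data corruption on the read path, which is what points the mount failure at something other than the data itself.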

So in summary: everything works when connecting via tcp. When connecting via rdma, nvme discovery fails because the numrec entry is bogus, and while I can mount filesystems from the 2TB Samsung SSD correctly, filesystems from the 8TB Corsair SSDs fail to mount, even though the data I read from the disk with dd is fully correct.

I would be thankful for any advice, or a recommendation of where to ask. If you need me to conduct any tests, please let me know.
