
nvme discover over RDMA parses Discovery Log Entry incorrectly on nvme-cli 2.13 / libnvme 1.13, while direct nvme connect works #1099

@iops-hunter

Description


Environment

Working node:

  • Proxmox VE 8.4
  • Debian 12 (bookworm)
  • nvme-cli 2.4+really2.3-3
  • libnvme 1.3-1+deb12u1

Broken node:

  • Proxmox VE 9.1
  • Debian 13 (trixie)
  • nvme-cli 2.13-2
  • libnvme 1.13-2

Target:

  • Huawei NVMe-oF target
  • transport: RDMA / RoCE v2

NIC:

  • Mellanox ConnectX-6 Lx

Summary

On the old node (nvme-cli 2.3 / libnvme 1.3), nvme discover returns a correct Discovery Log Entry.

On the new node (nvme-cli 2.13 / libnvme 1.13), the discovery controller connection succeeds, Get Log Page commands succeed, and direct nvme connect to the same subsystem works correctly, but nvme discover prints a bogus Discovery Log Entry:

trtype:  fc
adrfam:
subtype: unrecognized
treq:    not specified
portid:  0
trsvcid:
subnqn:
traddr:
eflags:  none

This looks like the discovery log is successfully retrieved but decoded incorrectly in userspace.
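For context, the Discovery Log Page wire format is simple and fixed: a 1024-byte header (GENCTR, NUMREC, RECFMT) followed by 1024-byte entries, with trtype in byte 0 of each entry (1 = RDMA, 2 = FC, 3 = TCP). So `trtype: fc` means the decoder saw a 2 in whatever byte it treated as trtype. A minimal Python sketch of the layout (offsets per the NVMe-oF spec, mirroring Linux's `struct nvmf_disc_rsp_page_entry`; this is illustrative, not libnvme's actual code):

```python
import struct

TRTYPE  = {1: "rdma", 2: "fc", 3: "tcp", 254: "loop"}
ADRFAM  = {1: "ipv4", 2: "ipv6", 3: "ib", 4: "fc"}
SUBTYPE = {1: "referral", 2: "nvme subsystem", 3: "current discovery subsystem"}

def _cstr(b: bytes) -> str:
    """ASCII field, NUL/space padded."""
    return b.split(b"\x00", 1)[0].decode("ascii", "replace").strip()

def parse_entry(buf: bytes) -> dict:
    """Decode one 1024-byte Discovery Log Page Entry."""
    portid, cntlid, asqsz, eflags = struct.unpack_from("<4H", buf, 4)
    return {
        "trtype":  TRTYPE.get(buf[0], f"unknown({buf[0]})"),
        "adrfam":  ADRFAM.get(buf[1], ""),
        "subtype": SUBTYPE.get(buf[2], "unrecognized"),
        "treq":    buf[3],
        "portid":  portid,
        "eflags":  eflags,
        "trsvcid": _cstr(buf[32:64]),    # bytes 32..63
        "subnqn":  _cstr(buf[256:512]),  # bytes 256..511
        "traddr":  _cstr(buf[512:768]),  # bytes 512..767
        # bytes 768..1023 are the transport-specific area (TSAS),
        # where rdma_prtype / rdma_qptype / rdma_cms / rdma_pkey live
    }

def parse_log(buf: bytes):
    """1024-byte header (GENCTR, NUMREC), then 1024-byte entries."""
    genctr, numrec = struct.unpack_from("<QQ", buf, 0)
    return genctr, [parse_entry(buf[1024 * (i + 1):1024 * (i + 2)])
                    for i in range(numrec)]
```

Given this layout, an all-zero or shifted read trivially produces empty trsvcid/subnqn/traddr and `subtype: unrecognized`, exactly as seen above.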

Expected behavior

nvme discover -t rdma -a <target> -s 4420 should return something like:

Discovery Log Number of Records 1, Generation counter 2
=====Discovery Log Entry 0======
trtype:  rdma
adrfam:  ipv4
subtype: nvme subsystem
treq:    not specified
portid:  0
trsvcid: 4420
subnqn:  nqn.2020-02.huawei.nvme:nvm-subsystem-sn-[XXX]
traddr:  192.168.10.10
eflags:  none
rdma_prtype: roce-v2
rdma_qptype: connected
rdma_cms:    rdma-cm
rdma_pkey: 0x0000

Actual behavior

Discovery Log Number of Records 1, Generation counter 2
=====Discovery Log Entry 0======
trtype:  fc
adrfam:
subtype: unrecognized
treq:    not specified
portid:  0
trsvcid:
subnqn:
traddr:
eflags:  none

Important observation

Direct connect to the same subsystem works correctly on the same host:

nvme connect -t rdma -n nqn.2020-02.huawei.nvme:nvm-subsystem-sn-[XXX] -a 192.168.10.10 -s 4420

and then nvme list-subsys shows:

nvme-subsys1 - NQN=nqn.2020-02.huawei.nvme:nvm-subsystem-sn-[XXX]
               hostnqn=nqn.2014-08.org.nvmexpress:uuid:[YYY]
\
 +- nvme1 rdma traddr=192.168.10.10,trsvcid=4420 live

So the RDMA transport and controller connection are functional.

Verbose discover output

nvme discover -t rdma -a 192.168.10.10 -s 4420 -vv

connect ctrl, 'nqn=nqn.2014-08.org.nvmexpress.discovery,transport=rdma,traddr=192.168.10.10,trsvcid=4420,
hostnqn=nqn.2014-08.org.nvmexpress:uuid:[YYY],hostid=[YYY],ctrl_loss_tmo=600'
connect ctrl, response 'instance=1,cntlid=0'
nvme1: nqn.2014-08.org.nvmexpress.discovery connected

opcode       : 06
...
data_len     : 00001000
cdw10        : 00000001
...
err          : 0

nvme1: get header (try 0/10)

opcode       : 02
...
data_len     : 00000014
cdw10        : 00040070
...
err          : 0

nvme1: get 1 records (genctr 2)

opcode       : 02
...
data_len     : 00000400
cdw10        : 00ff0070
cdw12        : 00000400
...
err          : 0

nvme1: get header again

opcode       : 02
...
data_len     : 00000014
cdw10        : 00040070
...
err          : 0

Discovery Log Number of Records 1, Generation counter 2
=====Discovery Log Entry 0======
trtype:  fc
adrfam:
subtype: unrecognized
treq:    not specified
portid:  0
trsvcid:
subnqn:
traddr:
eflags:  none

nvme1: nqn.2014-08.org.nvmexpress.discovery disconnected
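Decoding the CDW10 values from the trace shows the Get Log Page commands themselves are well-formed and their transfer sizes line up with the reported data_len, which supports the "fetched correctly, decoded incorrectly in userspace" reading. A quick sketch (NUMD actually spans CDW10's NUMDL and CDW11's NUMDU, but only NUMDL matters for these small reads):

```python
def decode_get_log_cdw10(cdw10: int):
    """Get Log Page CDW10: bits 7:0 = Log Page Identifier (LID),
    bits 31:16 = NUMDL, the low 16 bits of (dword count - 1)."""
    lid = cdw10 & 0xFF
    nbytes = (((cdw10 >> 16) & 0xFFFF) + 1) * 4
    return lid, nbytes

# The three Get Log Page commands in the trace above:
for cdw10, data_len in [(0x00040070, 0x14), (0x00FF0070, 0x400), (0x00040070, 0x14)]:
    lid, nbytes = decode_get_log_cdw10(cdw10)
    assert lid == 0x70          # 70h = Discovery Log Page
    assert nbytes == data_len   # matches the data_len the trace reports
```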

What was already ruled out

  • an earlier RDMA stack mismatch has been fixed: nvme_rdma, nvme_fabrics, rdma_cm, and ib_core are now all loaded from the same DKMS stack
  • direct nvme connect works
  • host NQN and host ID are consistent
  • the same target works fine on the older node with nvme-cli 2.3 / libnvme 1.3

Version info

Broken node:

nvme version
nvme version 2.13 (git 2.13)
libnvme version 1.13 (git 1.13)

dpkg -l | egrep 'nvme-cli|libnvme'
ii  libnvme1t64  1.13-2  amd64  NVMe management library (library)
ii  nvme-cli     2.13-2  amd64  NVMe management tool

Working node:

nvme version
nvme version 2.3 (git 2.3)
libnvme version 1.3 (git 1.3)

dpkg -l | egrep 'nvme-cli|libnvme'
ii  libnvme1   1.3-1+deb12u1     amd64
ii  nvme-cli   2.4+really2.3-3   amd64

Related issues (what I've found so far)

nvme-cli #1555
nvme-cli #2555
libnvme #807

Question

Given the related issues, could this be a regression in discovery-log parsing/decoding in newer nvme-cli / libnvme, possibly triggered by a vendor-specific quirk in the Huawei target's discovery log format?
Or is something else going on here? What do you think?
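One speculative data point (an illustration of a possible failure mode, not a claim about where the bug actually is): the bogus entry is byte-for-byte what you'd get if the decoder read the log header as if it were an entry. With genctr = 2, byte 0 of the buffer is 0x02, which as a trtype means "fc"; every other field in the first 1024 bytes is zero, giving empty adrfam/trsvcid/subnqn/traddr, `subtype: unrecognized`, portid 0, and `eflags: none` — exactly the output above. A sketch:

```python
import struct

TRTYPE = {1: "rdma", 2: "fc", 3: "tcp"}

# 4 KiB discovery log as the target returns it: header (genctr=2,
# numrec=1) at offset 0, the real RDMA entry at offset 1024.
buf = bytearray(4096)
struct.pack_into("<QQ", buf, 0, 2, 1)   # GENCTR=2, NUMREC=1
buf[1024] = 1                           # the real entry's trtype = rdma

# Misreading offset 0 (the header) as if it were an entry:
bogus = buf[0:1024]
print(TRTYPE.get(bogus[0], "?"))         # "fc": genctr's low byte is 2
print(bogus[32:64].split(b"\x00")[0])    # b'': trsvcid comes out empty
```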

Respectfully tagging some of the related people: @igaw @davidrohr @hreinecke

Thanks to everyone for your contributions.
