Environment
Working node:
- Proxmox VE 8.4
- Debian 12 (bookworm)
- nvme-cli 2.4+really2.3-3
- libnvme 1.3-1+deb12u1
Broken node:
- Proxmox VE 9.1
- Debian 13 (trixie)
- nvme-cli 2.13-2
- libnvme 1.13-2
Target:
- Huawei NVMe-oF target
- transport: RDMA / RoCE v2
NIC:
Summary
On the old node (nvme-cli 2.3 / libnvme 1.3), nvme discover returns a correct Discovery Log Entry.
On the new node (nvme-cli 2.13 / libnvme 1.13), the discovery controller connection succeeds, Get Log Page commands succeed, and direct nvme connect to the same subsystem works correctly, but nvme discover prints a bogus Discovery Log Entry:
trtype: fc
adrfam:
subtype: unrecognized
treq: not specified
portid: 0
trsvcid:
subnqn:
traddr:
eflags: none
This looks like the discovery log is successfully retrieved but decoded incorrectly in userspace.
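If it helps to triage, the on-wire layout of a Discovery Log Page Entry is fixed by the NVMe-oF specification (1024 bytes per entry; trtype at byte 0, trsvcid at offset 32, subnqn at 256, traddr at 512). A minimal decoder sketch along those lines — offsets taken from the spec, not from the libnvme source, so treat it as illustrative:

```python
import struct

# Field offsets per the NVMe-oF spec's Discovery Log Page Entry
# (1024 bytes each); this is a hand-written sketch, not libnvme code.
TRTYPE = {1: "rdma", 2: "fc", 3: "tcp", 254: "loop"}
ADRFAM = {1: "ipv4", 2: "ipv6", 3: "ib", 4: "fc"}
SUBTYPE = {1: "referral", 2: "nvme subsystem"}

def parse_discovery_entry(e: bytes) -> dict:
    assert len(e) >= 1024, "a discovery log entry is 1024 bytes"
    # Fields are NUL/space-padded ASCII; cut at the first NUL.
    cstr = lambda b: b.split(b"\x00", 1)[0].decode().strip()
    return {
        "trtype": TRTYPE.get(e[0], "unrecognized"),
        "adrfam": ADRFAM.get(e[1], ""),
        "subtype": SUBTYPE.get(e[2], "unrecognized"),
        "portid": struct.unpack_from("<H", e, 4)[0],
        "trsvcid": cstr(e[32:64]),
        "subnqn": cstr(e[256:512]),
        "traddr": cstr(e[512:768]),
    }
```

Feeding a raw dump of log page 0x70 through something like this would show whether the target's bytes are sane and the regression is purely in the newer userspace decoder.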
Expected behavior
nvme discover -t rdma -a <target> -s 4420 should return something like:
Discovery Log Number of Records 1, Generation counter 2
=====Discovery Log Entry 0======
trtype: rdma
adrfam: ipv4
subtype: nvme subsystem
treq: not specified
portid: 0
trsvcid: 4420
subnqn: nqn.2020-02.huawei.nvme:nvm-subsystem-sn-[XXX]
traddr: 192.168.10.10
eflags: none
rdma_prtype: roce-v2
rdma_qptype: connected
rdma_cms: rdma-cm
rdma_pkey: 0x0000
Actual behavior
Discovery Log Number of Records 1, Generation counter 2
=====Discovery Log Entry 0======
trtype: fc
adrfam:
subtype: unrecognized
treq: not specified
portid: 0
trsvcid:
subnqn:
traddr:
eflags: none
Important observation
Direct connect to the same subsystem works correctly on the same host:
nvme connect -t rdma -n nqn.2020-02.huawei.nvme:nvm-subsystem-sn-[XXX] -a 192.168.10.10 -s 4420
and then nvme list-subsys shows:
nvme-subsys1 - NQN=nqn.2020-02.huawei.nvme:nvm-subsystem-sn-[XXX]
hostnqn=nqn.2014-08.org.nvmexpress:uuid:[YYY]
\
+- nvme1 rdma traddr=192.168.10.10,trsvcid=4420 live
So the RDMA transport and controller connection are functional.
Verbose discover output
nvme discover -t rdma -a 192.168.10.10 -s 4420 -vv
connect ctrl, 'nqn=nqn.2014-08.org.nvmexpress.discovery,transport=rdma,traddr=192.168.10.10,trsvcid=4420,
hostnqn=nqn.2014-08.org.nvmexpress:uuid:[YYY],hostid=[YYY],ctrl_loss_tmo=600'
connect ctrl, response 'instance=1,cntlid=0'
nvme1: nqn.2014-08.org.nvmexpress.discovery connected
opcode : 06
...
data_len : 00001000
cdw10 : 00000001
...
err : 0
nvme1: get header (try 0/10)
opcode : 02
...
data_len : 00000014
cdw10 : 00040070
...
err : 0
nvme1: get 1 records (genctr 2)
opcode : 02
...
data_len : 00000400
cdw10 : 00ff0070
cdw12 : 00000400
...
err : 0
nvme1: get header again
opcode : 02
...
data_len : 00000014
cdw10 : 00040070
...
err : 0
Discovery Log Number of Records 1, Generation counter 2
=====Discovery Log Entry 0======
trtype: fc
adrfam:
subtype: unrecognized
treq: not specified
portid: 0
trsvcid:
subnqn:
traddr:
eflags: none
nvme1: nqn.2014-08.org.nvmexpress.discovery disconnected
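One thing the trace does show: the Get Log Page commands themselves look well-formed. Decoding CDW10 by hand (field layout from the NVMe base spec: bits 7:0 log ID, bits 31:16 NUMDL, 0-based dword count) gives log ID 0x70 with lengths matching the data_len values 0x14 and 0x400 above — a small sketch:

```python
def decode_get_log_cdw10(cdw10: int) -> dict:
    """Split Get Log Page CDW10 per the NVMe base spec field layout."""
    lid = cdw10 & 0xFF              # bits 7:0  - log page ID
    numdl = (cdw10 >> 16) & 0xFFFF  # bits 31:16 - dword count, 0-based (lower half)
    return {"lid": hex(lid), "bytes": (numdl + 1) * 4}

print(decode_get_log_cdw10(0x00040070))  # header read: {'lid': '0x70', 'bytes': 20}
print(decode_get_log_cdw10(0x00FF0070))  # full read:   {'lid': '0x70', 'bytes': 1024}
```

So the commands request the discovery log (0x70) with sane lengths, which again points at the entry decoding rather than the command path.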
What was already ruled out
- an earlier RDMA stack mismatch was fixed
- nvme_rdma, nvme_fabrics, rdma_cm, ib_core are all now loaded from the same DKMS stack
- direct nvme connect works
- host NQN and host ID are consistent
- the same target works fine on the older node with nvme-cli 2.3 / libnvme 1.3
Version info
Broken node:
nvme version
nvme version 2.13 (git 2.13)
libnvme version 1.13 (git 1.13)
dpkg -l | egrep 'nvme-cli|libnvme'
ii libnvme1t64 1.13-2 amd64 NVMe management library (library)
ii nvme-cli 2.13-2 amd64 NVMe management tool
Working node:
nvme version
nvme version 2.3 (git 2.3)
libnvme version 1.3 (git 1.3)
dpkg -l | egrep 'nvme-cli|libnvme'
ii libnvme1 1.3-1+deb12u1 amd64
ii nvme-cli 2.4+really2.3-3 amd64
Related issues (what I've found so far)
nvme-cli #1555
nvme-cli #2555
libnvme #807
Question
Given the related issues, could this be a regression in discovery log parsing/decoding in newer nvme-cli / libnvme, possibly triggered by a Huawei discovery log format or vendor-specific quirk?
Or something else? What do you think?
Respectfully tagging some of the people involved in the related issues: @igaw @davidrohr @hreinecke
Thanks to all for your contributions.