
nvme discover over RDMA parses Discovery Log Entry incorrectly on nvme-cli 2.13 / libnvme 1.13, while direct nvme connect works #1099

@iops-hunter

Description


Environment

Working node:

  • Proxmox VE 8.4
  • Debian 12 (bookworm)
  • nvme-cli 2.4+really2.3-3
  • libnvme 1.3-1+deb12u1

Broken node:

  • Proxmox VE 9.1
  • Debian 13 (trixie)
  • nvme-cli 2.13-2
  • libnvme 1.13-2

Target:

  • Huawei NVMe-oF target
  • transport: RDMA / RoCE v2

NIC:

  • Mellanox ConnectX-6 Lx

Summary

On the old node (nvme-cli 2.3 / libnvme 1.3), nvme discover returns a correct Discovery Log Entry.

On the new node (nvme-cli 2.13 / libnvme 1.13), the discovery controller connection succeeds, Get Log Page commands succeed, and direct nvme connect to the same subsystem works correctly, but nvme discover prints a bogus Discovery Log Entry:

trtype:  fc
adrfam:
subtype: unrecognized
treq:    not specified
portid:  0
trsvcid:
subnqn:
traddr:
eflags:  none

This looks like the discovery log is successfully retrieved but decoded incorrectly in userspace.
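For context, the Discovery Log Page wire format is simple and fixed: a 1024-byte header (GENCTR, NUMREC, RECFMT) followed by 1024-byte entries, with trtype in byte 0 of each entry (1 = RDMA, 2 = FC, 3 = TCP). So `trtype: fc` means the decoder saw a 2 in whatever byte it treated as trtype. A minimal Python sketch of the layout (offsets per the NVMe-oF spec, mirroring Linux's `struct nvmf_disc_rsp_page_entry`; this is illustrative, not libnvme's actual code):

```python
import struct

TRTYPE  = {1: "rdma", 2: "fc", 3: "tcp", 254: "loop"}
ADRFAM  = {1: "ipv4", 2: "ipv6", 3: "ib", 4: "fc"}
SUBTYPE = {1: "referral", 2: "nvme subsystem", 3: "current discovery subsystem"}

def _cstr(b: bytes) -> str:
    """ASCII field, NUL/space padded."""
    return b.split(b"\x00", 1)[0].decode("ascii", "replace").strip()

def parse_entry(buf: bytes) -> dict:
    """Decode one 1024-byte Discovery Log Page Entry."""
    portid, cntlid, asqsz, eflags = struct.unpack_from("<4H", buf, 4)
    return {
        "trtype":  TRTYPE.get(buf[0], f"unknown({buf[0]})"),
        "adrfam":  ADRFAM.get(buf[1], ""),
        "subtype": SUBTYPE.get(buf[2], "unrecognized"),
        "treq":    buf[3],
        "portid":  portid,
        "eflags":  eflags,
        "trsvcid": _cstr(buf[32:64]),    # bytes 32..63
        "subnqn":  _cstr(buf[256:512]),  # bytes 256..511
        "traddr":  _cstr(buf[512:768]),  # bytes 512..767
        # bytes 768..1023 are the transport-specific area (TSAS),
        # where rdma_prtype / rdma_qptype / rdma_cms / rdma_pkey live
    }

def parse_log(buf: bytes):
    """1024-byte header (GENCTR, NUMREC), then 1024-byte entries."""
    genctr, numrec = struct.unpack_from("<QQ", buf, 0)
    return genctr, [parse_entry(buf[1024 * (i + 1):1024 * (i + 2)])
                    for i in range(numrec)]
```

Given this layout, an all-zero or shifted read trivially produces empty trsvcid/subnqn/traddr and `subtype: unrecognized`, exactly as seen above.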

Expected behavior

nvme discover -t rdma -a <target> -s 4420 should return something like:

Discovery Log Number of Records 1, Generation counter 2
=====Discovery Log Entry 0======
trtype:  rdma
adrfam:  ipv4
subtype: nvme subsystem
treq:    not specified
portid:  0
trsvcid: 4420
subnqn:  nqn.2020-02.huawei.nvme:nvm-subsystem-sn-[XXX]
traddr:  192.168.10.10
eflags:  none
rdma_prtype: roce-v2
rdma_qptype: connected
rdma_cms:    rdma-cm
rdma_pkey: 0x0000

Actual behavior

Discovery Log Number of Records 1, Generation counter 2
=====Discovery Log Entry 0======
trtype:  fc
adrfam:
subtype: unrecognized
treq:    not specified
portid:  0
trsvcid:
subnqn:
traddr:
eflags:  none

Important observation

Direct connect to the same subsystem works correctly on the same host:

nvme connect -t rdma -n nqn.2020-02.huawei.nvme:nvm-subsystem-sn-[XXX] -a 192.168.10.10 -s 4420

and then nvme list-subsys shows:

nvme-subsys1 - NQN=nqn.2020-02.huawei.nvme:nvm-subsystem-sn-[XXX]
               hostnqn=nqn.2014-08.org.nvmexpress:uuid:[YYY]
\
 +- nvme1 rdma traddr=192.168.10.10,trsvcid=4420 live

So the RDMA transport and controller connection are functional.

Verbose discover output

nvme discover -t rdma -a 192.168.10.10 -s 4420 -vv

connect ctrl, 'nqn=nqn.2014-08.org.nvmexpress.discovery,transport=rdma,traddr=192.168.10.10,trsvcid=4420,
hostnqn=nqn.2014-08.org.nvmexpress:uuid:[YYY],hostid=[YYY],ctrl_loss_tmo=600'
connect ctrl, response 'instance=1,cntlid=0'
nvme1: nqn.2014-08.org.nvmexpress.discovery connected

opcode       : 06
...
data_len     : 00001000
cdw10        : 00000001
...
err          : 0

nvme1: get header (try 0/10)

opcode       : 02
...
data_len     : 00000014
cdw10        : 00040070
...
err          : 0

nvme1: get 1 records (genctr 2)

opcode       : 02
...
data_len     : 00000400
cdw10        : 00ff0070
cdw12        : 00000400
...
err          : 0

nvme1: get header again

opcode       : 02
...
data_len     : 00000014
cdw10        : 00040070
...
err          : 0

Discovery Log Number of Records 1, Generation counter 2
=====Discovery Log Entry 0======
trtype:  fc
adrfam:
subtype: unrecognized
treq:    not specified
portid:  0
trsvcid:
subnqn:
traddr:
eflags:  none

nvme1: nqn.2014-08.org.nvmexpress.discovery disconnected
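Decoding the CDW10 values from the trace shows the Get Log Page commands themselves are well-formed and their transfer sizes line up with the reported data_len, which supports the "fetched correctly, decoded incorrectly in userspace" reading. A quick sketch (NUMD actually spans CDW10's NUMDL and CDW11's NUMDU, but only NUMDL matters for these small reads):

```python
def decode_get_log_cdw10(cdw10: int):
    """Get Log Page CDW10: bits 7:0 = Log Page Identifier (LID),
    bits 31:16 = NUMDL, the low 16 bits of (dword count - 1)."""
    lid = cdw10 & 0xFF
    nbytes = (((cdw10 >> 16) & 0xFFFF) + 1) * 4
    return lid, nbytes

# The three Get Log Page commands in the trace above:
for cdw10, data_len in [(0x00040070, 0x14), (0x00FF0070, 0x400), (0x00040070, 0x14)]:
    lid, nbytes = decode_get_log_cdw10(cdw10)
    assert lid == 0x70          # 70h = Discovery Log Page
    assert nbytes == data_len   # matches the data_len the trace reports
```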

What was already ruled out

  • an earlier RDMA stack mismatch has been fixed: nvme_rdma, nvme_fabrics, rdma_cm, and ib_core are now all loaded from the same DKMS stack
  • direct nvme connect works
  • host NQN and host ID are consistent
  • the same target works fine on the older node with nvme-cli 2.3 / libnvme 1.3

Version info

Broken node:

nvme version
nvme version 2.13 (git 2.13)
libnvme version 1.13 (git 1.13)

dpkg -l | egrep 'nvme-cli|libnvme'
ii  libnvme1t64  1.13-2  amd64  NVMe management library (library)
ii  nvme-cli     2.13-2  amd64  NVMe management tool

Working node:

nvme version
nvme version 2.3 (git 2.3)
libnvme version 1.3 (git 1.3)

dpkg -l | egrep 'nvme-cli|libnvme'
ii  libnvme1   1.3-1+deb12u1     amd64
ii  nvme-cli   2.4+really2.3-3   amd64

Related issues (what I've found so far)

nvme-cli #1555
nvme-cli #2555
libnvme #807

Question

Given the related issues, could this be a regression in discovery-log parsing/decoding in newer nvme-cli / libnvme, possibly triggered by a vendor-specific quirk in the Huawei target's discovery log format?
Or is something else going on here? What do you think?
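One speculative data point (an illustration of a possible failure mode, not a claim about where the bug actually is): the bogus entry is byte-for-byte what you'd get if the decoder read the log header as if it were an entry. With genctr = 2, byte 0 of the buffer is 0x02, which as a trtype means "fc"; every other field in the first 1024 bytes is zero, giving empty adrfam/trsvcid/subnqn/traddr, `subtype: unrecognized`, portid 0, and `eflags: none` — exactly the output above. A sketch:

```python
import struct

TRTYPE = {1: "rdma", 2: "fc", 3: "tcp"}

# 4 KiB discovery log as the target returns it: header (genctr=2,
# numrec=1) at offset 0, the real RDMA entry at offset 1024.
buf = bytearray(4096)
struct.pack_into("<QQ", buf, 0, 2, 1)   # GENCTR=2, NUMREC=1
buf[1024] = 1                           # the real entry's trtype = rdma

# Misreading offset 0 (the header) as if it were an entry:
bogus = buf[0:1024]
print(TRTYPE.get(bogus[0], "?"))         # "fc": genctr's low byte is 2
print(bogus[32:64].split(b"\x00")[0])    # b'': trsvcid comes out empty
```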

Respectfully tagging some of the related people: @igaw @davidrohr @hreinecke

Thanks to everyone for your contributions.
