Error 881 in dmesg in RDMAoF

Hello!

We have three Gooxi SR201-G2 nodes, each with a G2DRLO motherboard and 2x AMD EPYC 9474F 48-core processors.
Each node has 2x Mellanox CX-5 CN boards connected to Huawei Dorado 5000 V6 NVMe storage. The storage is reachable over the network via four paths distributed across two VLANs.

```
root@node1:~# cat /etc/os-release
PRETTY_NAME="Debian GNU/Linux 13 (trixie)"
NAME="Debian GNU/Linux"
VERSION_ID="13"
VERSION="13 (trixie)"
VERSION_CODENAME=trixie
DEBIAN_VERSION_FULL=13.4
ID=debian
HOME_URL="https://www.debian.org/"
SUPPORT_URL="https://www.debian.org/support"
BUG_REPORT_URL="https://bugs.debian.org/"
root@node1:~#

```
Two nodes work fine, but the first one (node1) reports an error on every nvme discovery attempt:

```
root@node1:~# nvme discover -t rdma -a 172.16.0.35 -vv
kernel supports: instance cntlid transport traddr trsvcid nqn queue_size nr_io_queues reconnect_delay ctrl_loss_tmo keep_alive_tmo hostnqn host_traddr host_iface hostid duplicate_connect disable_sqflow hdr_digest data_digest nr_write_queues nr_poll_queues tos keyring tls_key fast_io_fail_tmo discovery dhchap_secret dhchap_ctrl_secret tls concat recovery_delay
connect ctrl, 'nqn=nqn.2014-08.org.nvmexpress.discovery,transport=rdma,traddr=172.16.0.35,trsvcid=4420,hostnqn=nqn.2014-08.org.nvmexpress:uuid:717a9176-ac73-4ea3-829e-e4ccf0b5735f,hostid=717a9176-ac73-4ea3-829e-e4ccf0b5735f,ctrl_loss_tmo=600'
Failed to write to /dev/nvme-fabrics: Input/output error
failed to add controller, error failed to write to nvme-fabrics device
```

In dmesg:
```
[356612.263017] nvme nvme0: I/O tag 0 (0000) opcode 0x7f (Fabrics Cmd) QID 0 timeout
[356612.263054] nvme nvme0: Connect command failed, error wo/DNR bit: 881
[356612.263463] nvme nvme0: failed to connect queue: 0 ret=881
```
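For reference, the value 881 in these messages is a decimal status, 0x371 in hex. Below is a minimal sketch of decoding it, assuming the Linux kernel's in-memory status layout (status code in bits 7:0, status code type in bits 10:8, DNR in bit 14). The function name `decode_nvme_status` is ours, not part of any tool:

```shell
#!/bin/sh
# Decode an NVMe status value as printed by the Linux kernel in dmesg (decimal).
# Assumed layout: SC = bits 7:0, SCT = bits 10:8, DNR = bit 14.
decode_nvme_status() {
    status=$1
    sc=$(( status & 0xff ))          # status code
    sct=$(( (status >> 8) & 0x7 ))   # status code type
    dnr=$(( (status >> 14) & 0x1 ))  # do-not-retry bit
    printf 'status=0x%x sct=0x%x sc=0x%x dnr=%d\n' "$status" "$sct" "$sc" "$dnr"
}

decode_nvme_status 881
```

As far as we can tell, SCT 0x3 is "Path Related Status" and SC 0x71 is "Command Aborted By Host" (`NVME_SC_HOST_ABORTED_CMD` in the kernel headers). If that reading is right, the host itself gave up on the Connect command after the timeout logged on the previous line, rather than the target returning an error.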

nvme version:
```
root@node1:~# nvme version
nvme version 2.13 (git 2.13)
libnvme version 1.13 (git 1.13)

```

Initially we used Alt Linux r11, but the problem also occurs on Ubuntu, Debian 12, and Debian 13.
At first we strongly suspected a hardware problem, since two nodes work fine, but we have swapped everything(!): risers, NVMe drives, NICs, and even power supplies.

The working nodes use nvme 2.3 and libnvme 1.3. We tested those versions on node1 as well, with the same result.

The supplier even replaced the motherboard, but no luck. We also removed one processor and enabled only one of the 48 cores.
We also compared the BIOS settings between the nodes, and everything looks identical.

The only known workaround is to clone the disks from a working server with Clonezilla, which hints at a possible software problem.
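One piece of software state that a cloned disk carries over is the NVMe host identity. The sketch below prints it so it can be compared between node1 and a working node. It assumes nvme-cli's default file locations (`/etc/nvme/hostnqn` and `/etc/nvme/hostid`). The function name `show_host_identity` is hypothetical, and we have not confirmed that this is related to the failure:

```shell
#!/bin/sh
# Print the NVMe host identity stored under a root directory (default: /).
# Assumption: nvme-cli keeps it in etc/nvme/hostnqn and etc/nvme/hostid.
show_host_identity() {
    root=${1:-/}
    for f in hostnqn hostid; do
        path="${root%/}/etc/nvme/$f"
        if [ -r "$path" ]; then
            printf '%s=%s\n' "$f" "$(cat "$path")"
        else
            printf '%s=<missing>\n' "$f"
        fi
    done
}

show_host_identity /
```

Running this on each node and comparing the output would show whether node1 presents a different host NQN than the one a cloned installation presents.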

We would appreciate any help investigating this problem.
 

