Error 881 in dmesg in RDMAoF #3268

@gridd20-pixel

Hello!

We have three Gooxi SR201-G2 nodes, each with a G2DRLO motherboard and 2x AMD EPYC 9474F 48-Core Processor.
Each node has 2x Mellanox CX-5 CN boards connected to Huawei Dorado 5000 V6 NVMe storage. The storage is reachable over the network via four paths distributed across two VLANs.

root@node1:~# cat /etc/os-release
PRETTY_NAME="Debian GNU/Linux 13 (trixie)"
NAME="Debian GNU/Linux"
VERSION_ID="13"
VERSION="13 (trixie)"
VERSION_CODENAME=trixie
DEBIAN_VERSION_FULL=13.4
ID=debian
HOME_URL="https://www.debian.org/"
SUPPORT_URL="https://www.debian.org/support"
BUG_REPORT_URL="https://bugs.debian.org/"
root@node1:~#

Two nodes work fine, but the first one (node1) reports an error on every attempt at NVMe discovery:

root@node1:~# nvme discover -t rdma -a 172.16.0.35 -vv
kernel supports: instance cntlid transport traddr trsvcid nqn queue_size nr_io_queues reconnect_delay ctrl_loss_tmo keep_alive_tmo hostnqn host_traddr host_iface hostid duplicate_connect disable_sqflow hdr_digest data_digest nr_write_queues nr_poll_queues tos keyring tls_key fast_io_fail_tmo discovery dhchap_secret dhchap_ctrl_secret tls concat recovery_delay
connect ctrl, 'nqn=nqn.2014-08.org.nvmexpress.discovery,transport=rdma,traddr=172.16.0.35,trsvcid=4420,hostnqn=nqn.2014-08.org.nvmexpress:uuid:717a9176-ac73-4ea3-829e-e4ccf0b5735f,hostid=717a9176-ac73-4ea3-829e-e4ccf0b5735f,ctrl_loss_tmo=600'
Failed to write to /dev/nvme-fabrics: Input/output error
failed to add controller, error failed to write to nvme-fabrics device

In dmesg:

[356612.263017] nvme nvme0: I/O tag 0 (0000) opcode 0x7f (Fabrics Cmd) QID 0 timeout
[356612.263054] nvme nvme0: Connect command failed, error wo/DNR bit: 881
[356612.263463] nvme nvme0: failed to connect queue: 0 ret=881
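
For reference, the value 881 from the lines above can be split into its status fields. This is a minimal sketch, assuming the status layout used in the Linux kernel's include/linux/nvme.h (status code in bits 0-7, status code type in bits 8-10, DNR at 0x4000). The function name `decode_status` and the lookup tables are illustrative only and are not part of nvme-cli or libnvme; the tables are deliberately incomplete.

```python
# Decode an NVMe status value as printed by the kernel in dmesg.
# Assumption: bit layout as in include/linux/nvme.h
# (SC = bits 0-7, SCT = bits 8-10, DNR = 0x4000).

SCT_NAMES = {
    0x0: "Generic Command Status",
    0x1: "Command Specific Status",
    0x2: "Media and Data Integrity Errors",
    0x3: "Path Related Status",
}

# Only the two host-side path codes relevant here; not a full table.
PATH_SC_NAMES = {
    0x70: "Host Pathing Error",
    0x71: "Command Aborted By Host",
}

DNR_BIT = 0x4000


def decode_status(status: int) -> dict:
    """Split a status value into SCT, SC and the DNR flag."""
    sct = (status >> 8) & 0x7
    sc = status & 0xFF
    sc_name = PATH_SC_NAMES.get(sc, "unknown") if sct == 0x3 else "not decoded"
    return {
        "hex": hex(status),
        "sct": sct,
        "sct_name": SCT_NAMES.get(sct, "unknown"),
        "sc": sc,
        "sc_name": sc_name,
        "dnr": bool(status & DNR_BIT),
    }


if __name__ == "__main__":
    for key, value in decode_status(881).items():
        print(f"{key}: {value}")
```

Under these assumptions 881 is 0x371, i.e. status code type 3 (path related) with status code 0x71 (command aborted by host). That would be consistent with the Connect command timing out on the host side rather than the target returning an error, but it does not by itself explain why the command times out.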

nvme version:

root@node1:~# nvme version
nvme version 2.13 (git 2.13)
libnvme version 1.13 (git 1.13)

We primarily use Alt Linux r11, but the problem also occurs on Ubuntu, Debian 12, and Debian 13.
At first we strongly suspected a hardware problem, since the other two nodes work fine, but we swapped everything(!): risers, NVMe drives, NICs, and even power supplies.

Yes, the working nodes use nvme 2.3 and libnvme 1.3; we tested those versions on node1 as well, with the same result.

The supplier even replaced the motherboard, but no luck. We also removed one processor and enabled only one of the 48 cores.
We compared the BIOS settings between the nodes as well, and everything looks identical.

The only known workaround is to clone the disks from a working server with Clonezilla, which hints at a possible software problem.

I would appreciate any help investigating this problem.
