Skip to content

fabrics: retry connect command on EINTR#987

Merged
igaw merged 1 commit intolinux-nvme:masterfrom
igaw:retry-eintr
Apr 9, 2025
Merged

fabrics: retry connect command on EINTR#987
igaw merged 1 commit intolinux-nvme:masterfrom
igaw:retry-eintr

Conversation

@igaw
Copy link
Copy Markdown
Collaborator

@igaw igaw commented Apr 8, 2025

When the system call fails with EINTR, retry the operation.

Fixes: #linux-nvme/nvme-cli#2760

Comment thread meson.build
long int __result = 0; \
do { \
__result = (long int)(exp); \
} while ((__result == -1) && (errno == EINTR)); \
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Could we be stuck here for a very long time?

EINTR happens when a signal is received. For example, if someone types CTRL-C (SIGINT) to stop a process that seems stuck. With this, CTRL-C will have no effect. Just wondering if that's what we want.

Copy link
Copy Markdown
Collaborator Author

@igaw igaw Apr 8, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, this is a problem. Though I am not sure what /dev/nvme-fabrics does on SIGINT.

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, this is a problem. Though I am not sure what /dev/nvme-fabrics does on SIGINT.

Right! I remember being stuck on /dev/nvme-fabrics before and there is no way to get out. The kernel just doesn't allow interrupting operations on that device. So, maybe my comment does not apply here.

Copy link
Copy Markdown
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

And when nvme request are cancelled the syscall will also return EINTR.

So there are at least three sources/reasons for an EINTR. Don't know if we can handle this correctly in userspace.

Copy link
Copy Markdown
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, this is a problem. Though I am not sure what /dev/nvme-fabrics does on SIGINT.

Right! I remember being stuck on /dev/nvme-fabrics before and there is no way to get out. The kernel just doesn't allow interrupting operations on that device. So, maybe my comment does not apply here.

Indeed, though I think it should be possible to handle SIGINTs, it's just not there yet.

Copy link
Copy Markdown
Collaborator Author

@igaw igaw Apr 8, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

After a bit of refreshing my memory on signals, I don't think there is a real problem. The ctrl-c is translated to SIGINT which is then delivered to the process. The default handler for SIGINT is to terminate the process. That means when the syscall returns the default handler gets executed (terminate process) and thus the loop will not be run anymore.

Copy link
Copy Markdown
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Removed the TFR macro from the open call, as this call shouldn't be a blocking call.

Recent changes in the nvme-tcp transport started to return EINTR due to
a signal while reading from a socket. The transport could retry the
operation but so far this hasn't been implemented. Thus we retry in
userpace the connect call when EINTR is returned.

Signed-off-by: Daniel Wagner <[email protected]>
@igaw igaw merged commit 2b30489 into linux-nvme:master Apr 9, 2025
12 checks passed
@igaw igaw deleted the retry-eintr branch April 9, 2025 08:48
@igaw
Copy link
Copy Markdown
Collaborator Author

igaw commented Apr 9, 2025

FTR, I've tested this by adding a EINTR return into to the admin queue connect code path which fired every other time. Userspace saw the EINTR and retried. So that part seems to work as expected.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants