Skip to content

logging: retry ioctl on EINTR return value#2775

Closed
igaw wants to merge 1 commit intolinux-nvme:masterfrom
igaw:fix-eintr
Closed

logging: retry ioctl on EINTR return value#2775
igaw wants to merge 1 commit intolinux-nvme:masterfrom
igaw:fix-eintr

Conversation

@igaw
Copy link
Copy Markdown
Collaborator

@igaw igaw commented Apr 16, 2025

Recent changes in the nvme-tcp transport started to return EINTR due to a signal while reading from a socket. The transport could retry the operation but so far this hasn't been implemented. Thus we retry in userpace the connect call when EINTR is returned.

Fixes: ##2760

Recent changes in the nvme-tcp transport started to return EINTR due to
a signal while reading from a socket. The transport could retry the
operation but so far this hasn't been implemented. Thus we retry in
userpace the connect call when EINTR is returned.

Signed-off-by: Daniel Wagner <[email protected]>
@ikegami-t
Copy link
Copy Markdown
Contributor

Just confirmed below the EINTR case but still the PR changes correct for the case alos?

  1. nvme_execute_rq() returns -EINTR when nvme_req(rq)->flags set NVME_REQ_CANCELLED.
    https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/tree/drivers/nvme/host/core.c#n1123
int nvme_execute_rq(struct request *rq, bool at_head)
{
	blk_status_t status;

	status = blk_execute_rq(rq, at_head);
	if (nvme_req(rq)->flags & NVME_REQ_CANCELLED)
		return -EINTR;
...
  1. The nvme_req(req)->flags set NVME_REQ_CANCELLED if the controller deleting for the disable.
    https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/tree/drivers/nvme/host/pci.c#n1476
static enum blk_eh_timer_return nvme_timeout(struct request *req)
{
...
	switch (nvme_ctrl_state(&dev->ctrl)) {
	case NVME_CTRL_CONNECTING:
		nvme_change_ctrl_state(&dev->ctrl, NVME_CTRL_DELETING);
		fallthrough;
	case NVME_CTRL_DELETING:
		dev_warn_ratelimited(dev->ctrl.device,
			 "I/O tag %d (%04x) QID %d timeout, disable controller\n",
			 req->tag, nvme_cid(req), nvmeq->qid);
		nvme_req(req)->flags |= NVME_REQ_CANCELLED;

Let me ask below questions since I am not sure the PR changes correct (but seems for some cases there are some risks and not good).

  1. Is basically the EINTR error reuqired to retry? (Seems it is only for the EGAIN error basically.)
  2. Is the nvme-tcp transport EINTR error return a special behavior? (Seems the nvme-tcp transport should return EGAIN error instead.)
  3. For other nvme driver cases to return EINTR is there no risk to be continued the retry handling? (Concerning the nvme-cli command hangup for the controller reset or disable cases, etc.)
  4. Is the EINTR only returned temporary case? (Since it means Interrupted function call so no concern for above questions? Concerning for some error cases if the EINTR is continued to return.)

@igaw
Copy link
Copy Markdown
Collaborator Author

igaw commented Apr 22, 2025

I share your concerns. And after reading a bit more on it, I think the simple approach with retrying on EINTR is not really covering all use cases, e.g. see https://250bpm.com/blog:12/

The kernel returns EINTR for things which are not caused by signals. Though the semantics of EINTR is fairly clear though:

https://pubs.opengroup.org/onlinepubs/009696799/functions/xsh_chap02_03.html

that means we should be able to retry. I don't want to start handling signal in the library if possible.

One thing we could do is to limit the retry loop to avoid live locks. Anyway, needs a bit more thought.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants