fallocate: introduce FALLOC_FL_WRITE_ZEROES flag#5
Closed
blktests-ci[bot] wants to merge 18 commits intofor-next_basefrom
Closed
fallocate: introduce FALLOC_FL_WRITE_ZEROES flag#5blktests-ci[bot] wants to merge 18 commits intofor-next_basefrom
blktests-ci[bot] wants to merge 18 commits intofor-next_basefrom
Conversation
* io_uring-6.16: MAINTAINERS: remove myself from io_uring io_uring/net: only consider msg_inq if larger than 1 io_uring/zcrx: fix area release on registration failure io_uring/zcrx: init id for xa_find
* block-6.16: selftests: ublk: cover PER_IO_DAEMON in more stress tests Documentation: ublk: document UBLK_F_PER_IO_DAEMON selftests: ublk: add stress test for per io daemons selftests: ublk: add functional test for per io daemons selftests: ublk: kublk: decouple ublk_queues from ublk server threads selftests: ublk: kublk: move per-thread data out of ublk_queue selftests: ublk: kublk: lift queue initialization out of thread selftests: ublk: kublk: tie sqe allocation to io instead of queue selftests: ublk: kublk: plumb q_id in io_uring user_data ublk: have a per-io daemon instead of a per-queue daemon md/md-bitmap: remove parameter slot from bitmap_create() md/md-bitmap: cleanup bitmap_ops->startwrite() md/dm-raid: remove max_write_behind setting limit md/md-bitmap: fix dm-raid max_write_behind setting md/raid1,raid10: don't handle IO error for REQ_RAHEAD and REQ_NOWAIT loop: add file_start_write() and file_end_write() bcache: reserve more RESERVE_BTREE buckets to prevent allocator hang bcache: remove unused constants bcache: fix NULL pointer in cache_set_flush()
* io_uring-6.16: io_uring/kbuf: limit legacy provided buffer lists to USHRT_MAX
* block-6.16: block: drop direction param from bio_integrity_copy_user()
* block-6.16: selftests: ublk: kublk: improve behavior on init failure block: flip iter directions in blk_rq_integrity_map_user()
* io_uring-6.16: io_uring/futex: mark wait requests as inflight io_uring/futex: get rid of struct io_futex addr union
* block-6.16: nvme: spelling fixes nvme-tcp: fix I/O stalls on congested sockets nvme-tcp: sanitize request list handling nvme-tcp: remove tag set when second admin queue config fails nvme: enable vectored registered bufs for passthrough cmds nvme: fix implicit bool to flags conversion nvme: fix command limits status code
Currently, disks primarily implement the write zeroes command (aka REQ_OP_WRITE_ZEROES) through two mechanisms: the first involves physically writing zeros to the disk media (e.g., HDDs), while the second performs an unmap operation on the logical blocks, effectively putting them into a deallocated state (e.g., SSDs). The first method is generally slow, while the second method is typically very fast. For example, on certain NVMe SSDs that support NVME_NS_DEAC, submitting REQ_OP_WRITE_ZEROES requests with the NVME_WZ_DEAC bit can accelerate the write zeros operation by placing disk blocks into a deallocated state, which opportunistically avoids writing zeroes to media while still guaranteeing that subsequent reads from the specified block range will return zeroed data. This is a best-effort optimization, not a mandatory requirement, some devices may partially fall back to writing physical zeroes due to factors such as misalignment or being asked to clear a block range smaller than the device's internal allocation unit. Therefore, the speed of this operation is not guaranteed. It is difficult to determine whether the storage device supports unmap write zeroes operation. We cannot determine this by only querying bdev_limits(bdev)->max_write_zeroes_sectors. First, add a new queue limit feature, BLK_FEAT_WRITE_ZEROES_UNMAP, to indicate whether a device supports this unmap write zeroes operation. Then, add a new counterpart flag, BLK_FLAG_WRITE_ZEROES_UNMAP_DISABLED and a sysfs entry, which allow users to disable this operation if the speed is very slow on some sepcial devices. Finally, for the stacked devices cases, the BLK_FEAT_WRITE_ZEROES_UNMAP should be supported both by the stacking driver and all underlying devices. Thanks to Martin K. Petersen for optimizing the documentation of the write_zeroes_unmap sysfs interface. Signed-off-by: Zhang Yi <[email protected]>
When the device supports the Write Zeroes command and the DEAC bit, it indicates that the deallocate bit in the Write Zeroes command is supported, and the bytes read from a deallocated logical block are zeroes. This means the device supports unmap Write Zeroes, so set the BLK_FEAT_WRITE_ZEROES_UNMAP feature to the device's queue limit. Signed-off-by: Zhang Yi <[email protected]> Reviewed-by: Christoph Hellwig <[email protected]>
Set the BLK_FEAT_WRITE_ZEROES_UNMAP feature while creating multipath stacking queue limits by default. This feature shall be disabled if any attached namespace does not support it. Signed-off-by: Zhang Yi <[email protected]> Reviewed-by: Christoph Hellwig <[email protected]>
Set WZDS and DRB bit to the namespace dlfeat if the underlying block device supports BLK_FEAT_WRITE_ZEROES_UNMAP, make the nvme target device supports unmaped write zeroes command. Signed-off-by: Zhang Yi <[email protected]> Reviewed-by: Christoph Hellwig <[email protected]>
…roing mode When the device supports the Write Zeroes command and the zeroing mode is set to SD_ZERO_WS16_UNMAP or SD_ZERO_WS10_UNMAP, this means that the device supports unmap Write Zeroes, so set the corresponding BLK_FEAT_WRITE_ZEROES_UNMAP feature to the device's queue limit. Signed-off-by: Zhang Yi <[email protected]> Reviewed-by: Christoph Hellwig <[email protected]>
Set the BLK_FEAT_WRITE_ZEROES_UNMAP feature on stacking queue limits by default. This feature shall be disabled if any underlying device does not support it. Signed-off-by: Zhang Yi <[email protected]> Reviewed-by: Benjamin Marzinski <[email protected]>
With the development of flash-based storage devices, we can quickly
write zeros to SSDs using the WRITE_ZERO command if the devices do not
actually write physical zeroes to the media. Therefore, we can use this
command to quickly preallocate a real all-zero file with written
extents. This approach should be beneficial for subsequent pure
overwriting within this file, as it can save on block allocation and,
consequently, significant metadata changes, which should greatly improve
overwrite performance on certain filesystems.
Therefore, introduce a new operation FALLOC_FL_WRITE_ZEROES to
fallocate. This flag is used to convert a specified range of a file to
zeros by issuing a zeroing operation. Blocks should be allocated for the
regions that span holes in the file, and the entire range is converted
to written extents. If the underlying device supports the actual offload
write zeroes command, the process of zeroing out operation can be
accelerated. If it does not, we currently don't prevent the file system
from writing actual zeros to the device. This provides users with a new
method to quickly generate a zeroed file, users no longer need to write
zero data to create a file with written extents.
Users can determine whether a disk supports the unmap write zeroes
operation through querying this sysfs interface:
/sys/block/<disk>/queue/write_zeroes_unmap
Finally, this flag cannot be specified in conjunction with the
FALLOC_FL_KEEP_SIZE since allocating written extents beyond file EOF is
not permitted. In addition, filesystems that always require out-of-place
writes should not support this flag since they still need to allocated
new blocks during subsequent overwrites.
Signed-off-by: Zhang Yi <[email protected]>
Reviewed-by: Christoph Hellwig <[email protected]>
Only the flags passed to blkdev_issue_zeroout() differ among the two zeroing branches in blkdev_fallocate(). Therefore, do cleanup by factoring them out. Signed-off-by: Zhang Yi <[email protected]> Reviewed-by: Christoph Hellwig <[email protected]>
Add support for FALLOC_FL_WRITE_ZEROES, if the block device enables the unmap write zeroes operation, it will issue a write zeroes command. Signed-off-by: Zhang Yi <[email protected]>
Add support for FALLOC_FL_WRITE_ZEROES if the underlying device enable the unmap write zeroes operation. This first allocates blocks as unwritten, then issues a zero command outside of the running journal handle, and finally converts them to a written state. Signed-off-by: Zhang Yi <[email protected]>
Author
|
Upstream branch: 38f4878 |
blktests-ci Bot
pushed a commit
that referenced
this pull request
Jul 10, 2025
When reconnecting a channel in smb2_reconnect_server(), a dummy tcon is passed down to smb2_reconnect() with ->query_interface uninitialized, so we can't call queue_delayed_work() on it. Fix the following warning by ensuring that we're queueing the delayed worker from correct tcon. WARNING: CPU: 4 PID: 1126 at kernel/workqueue.c:2498 __queue_delayed_work+0x1d2/0x200 Modules linked in: cifs cifs_arc4 nls_ucs2_utils cifs_md4 [last unloaded: cifs] CPU: 4 UID: 0 PID: 1126 Comm: kworker/4:0 Not tainted 6.16.0-rc3 #5 PREEMPT(voluntary) Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS 1.16.3-4.fc42 04/01/2014 Workqueue: cifsiod smb2_reconnect_server [cifs] RIP: 0010:__queue_delayed_work+0x1d2/0x200 Code: 41 5e 41 5f e9 7f ee ff ff 90 0f 0b 90 e9 5d ff ff ff bf 02 00 00 00 e8 6c f3 07 00 89 c3 eb bd 90 0f 0b 90 e9 57 f> 0b 90 e9 65 fe ff ff 90 0f 0b 90 e9 72 fe ff ff 90 0f 0b 90 e9 RSP: 0018:ffffc900014afad8 EFLAGS: 00010003 RAX: 0000000000000000 RBX: ffff888124d99988 RCX: ffffffff81399cc1 RDX: dffffc0000000000 RSI: ffff888114326e00 RDI: ffff888124d999f0 RBP: 000000000000ea60 R08: 0000000000000001 R09: ffffed10249b3331 R10: ffff888124d9998f R11: 0000000000000004 R12: 0000000000000040 R13: ffff888114326e00 R14: ffff888124d999d8 R15: ffff888114939020 FS: 0000000000000000(0000) GS:ffff88829f7fe000(0000) knlGS:0000000000000000 CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 CR2: 00007ffe7a2b4038 CR3: 0000000120a6f000 CR4: 0000000000750ef0 PKRU: 55555554 Call Trace: <TASK> queue_delayed_work_on+0xb4/0xc0 smb2_reconnect+0xb22/0xf50 [cifs] smb2_reconnect_server+0x413/0xd40 [cifs] ? __pfx_smb2_reconnect_server+0x10/0x10 [cifs] ? local_clock_noinstr+0xd/0xd0 ? local_clock+0x15/0x30 ? lock_release+0x29b/0x390 process_one_work+0x4c5/0xa10 ? __pfx_process_one_work+0x10/0x10 ? __list_add_valid_or_report+0x37/0x120 worker_thread+0x2f1/0x5a0 ? __kthread_parkme+0xde/0x100 ? __pfx_worker_thread+0x10/0x10 kthread+0x1fe/0x380 ? kthread+0x10f/0x380 ? __pfx_kthread+0x10/0x10 ? local_clock_noinstr+0xd/0xd0 ? ret_from_fork+0x1b/0x1f0 ? local_clock+0x15/0x30 ? lock_release+0x29b/0x390 ? rcu_is_watching+0x20/0x50 ? __pfx_kthread+0x10/0x10 ret_from_fork+0x15b/0x1f0 ? __pfx_kthread+0x10/0x10 ret_from_fork_asm+0x1a/0x30 </TASK> irq event stamp: 1116206 hardirqs last enabled at (1116205): [<ffffffff8143af42>] __up_console_sem+0x52/0x60 hardirqs last disabled at (1116206): [<ffffffff81399f0e>] queue_delayed_work_on+0x6e/0xc0 softirqs last enabled at (1116138): [<ffffffffc04562fd>] __smb_send_rqst+0x42d/0x950 [cifs] softirqs last disabled at (1116136): [<ffffffff823d35e1>] release_sock+0x21/0xf0 Cc: [email protected] Reported-by: David Howells <[email protected]> Fixes: 42ca547 ("cifs: do not disable interface polling on failure") Reviewed-by: David Howells <[email protected]> Tested-by: David Howells <[email protected]> Reviewed-by: Shyam Prasad N <[email protected]> Signed-off-by: Paulo Alcantara (Red Hat) <[email protected]> Signed-off-by: David Howells <[email protected]> Tested-by: Steve French <[email protected]> Signed-off-by: Steve French <[email protected]>
a73409f to
d74f66f
Compare
Author
|
Upstream branch: f4ca523 |
Author
|
Upstream branch: f4ca523 |
Author
|
Github failed to update this PR after force push. Close it. |
blktests-ci Bot
pushed a commit
that referenced
this pull request
Mar 3, 2026
When a bio goes through the rq_qos infrastructure on a path's request
queue, it gets BIO_QOS_THROTTLED or BIO_QOS_MERGED flags set. These
flags indicate that rq_qos_done_bio() should be called on completion
to update rq_qos accounting.
During path failover in nvme_failover_req(), the bio's bi_bdev is
redirected from the failed path's disk to the multipath head's disk
via bio_set_dev(). However, the BIO_QOS flags are not cleared.
When the bio eventually completes (either successfully via a new path
or with an error via bio_io_error()), rq_qos_done_bio() checks for
these flags and calls __rq_qos_done_bio(q->rq_qos, bio) where q is
obtained from the bio's current bi_bdev - which is now the multipath
head's queue, not the original path's queue.
The multipath head's queue does not have rq_qos enabled (q->rq_qos is
NULL), but the code assumes that if BIO_QOS_* flags are set, q->rq_qos
must be valid.
This breaks when a bio is moved between queues during NVMe multipath
failover, leading to a NULL pointer dereference.
Execution Context timeline :-
* =====> dd process context
[USER] dd process
[SYSCALL] write() - dd process context
submit_bio()
nvme_ns_head_submit_bio() - path selection
blk_mq_submit_bio() #### QOS FLAGS SET HERE
[USER] dd waits or returns
==== I/O in flight on NVMe hardware =====
===== End of submission path ====
------------------------------------------------------
* dd ====> Interrupt context;
[IRQ] NVMe completion interrupt
nvme_irq()
nvme_complete_rq()
nvme_failover_req() ### BIO MOVED TO HEAD
spin_lock_irqsave (atomic section)
bio_set_dev() changes bi_bdev
### BUG: QOS flags NOT cleared
kblockd_schedule_work()
* Interrupt context =====> kblockd workqueue
[WQ] kblockd workqueue - kworker process
nvme_requeue_work()
submit_bio_noacct()
nvme_ns_head_submit_bio()
nvme_find_path() returns NULL
bio_io_error()
bio_endio()
rq_qos_done_bio() ### CRASH ###
KERNEL PANIC / OOPS
Crash from blktests nvme/058 (rapid namespace remapping):
[ 1339.636033] BUG: kernel NULL pointer dereference, address: 0000000000000000
[ 1339.641025] nvme nvme4: rescanning namespaces.
[ 1339.642064] #PF: supervisor read access in kernel mode
[ 1339.642067] #PF: error_code(0x0000) - not-present page
[ 1339.642070] PGD 0 P4D 0
[ 1339.642073] Oops: Oops: 0000 [#1] SMP NOPTI
[ 1339.642078] CPU: 35 UID: 0 PID: 4579 Comm: kworker/35:2H
Tainted: G O N 6.17.0-rc3nvme+ #5 PREEMPT(voluntary)
[ 1339.642084] Tainted: [O]=OOT_MODULE, [N]=TEST
[ 1339.673446] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996),
BIOS rel-1.16.3-0-ga6ed6b701f0a-prebuilt.qemu.org 04/01/2014
[ 1339.682359] Workqueue: kblockd nvme_requeue_work [nvme_core]
[ 1339.686613] RIP: 0010:__rq_qos_done_bio+0xd/0x40
[ 1339.690161] Code: 75 dd 5b 5d 41 5c c3 cc cc cc cc 66 90 90 90 90 90 90 90
90 90 90 90 90 90 90 90 90 90 0f 1f 44 00 00 55 48 89 f5
53 48 89 fb <48> 8b 03 48 8b 40 30 48 85 c0 74 0b 48 89 ee
48 89 df ff d0 0f 1f
[ 1339.703691] RSP: 0018:ffffc900066f3c90 EFLAGS: 00010202
[ 1339.706844] RAX: ffff888148b9ef00 RBX: 0000000000000000 RCX: 0000000000000000
[ 1339.711136] RDX: 00000000000001c0 RSI: ffff8882aaab8a80 RDI: 0000000000000000
[ 1339.715691] RBP: ffff8882aaab8a80 R08: 0000000000000000 R09: 0000000000000000
[ 1339.720472] R10: 0000000000000000 R11: fefefefefefefeff R12: ffff8882aa3b6010
[ 1339.724650] R13: 0000000000000000 R14: ffff8882338bcef0 R15: ffff8882aa3b6020
[ 1339.729029] FS: 0000000000000000(0000) GS:ffff88985c0cf000(0000) knlGS:0000000000000000
[ 1339.734525] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[ 1339.738563] CR2: 0000000000000000 CR3: 0000000111045000 CR4: 0000000000350ef0
[ 1339.742750] DR0: ffffffff845ccbec DR1: ffffffff845ccbed DR2: ffffffff845ccbee
[ 1339.745630] DR3: ffffffff845ccbef DR6: 00000000ffff0ff0 DR7: 0000000000000600
[ 1339.748488] Call Trace:
[ 1339.749512] <TASK>
[ 1339.750449] bio_endio+0x71/0x2e0
[ 1339.751833] nvme_ns_head_submit_bio+0x290/0x320 [nvme_core]
[ 1339.754073] __submit_bio+0x222/0x5e0
[ 1339.755623] ? rcu_is_watching+0xd/0x40
[ 1339.757201] ? submit_bio_noacct_nocheck+0x131/0x370
[ 1339.759210] submit_bio_noacct_nocheck+0x131/0x370
[ 1339.761189] ? submit_bio_noacct+0x20/0x620
[ 1339.762849] nvme_requeue_work+0x4b/0x60 [nvme_core]
[ 1339.764828] process_one_work+0x20e/0x630
[ 1339.766528] worker_thread+0x184/0x330
[ 1339.768129] ? __pfx_worker_thread+0x10/0x10
[ 1339.769942] kthread+0x10a/0x250
[ 1339.771263] ? __pfx_kthread+0x10/0x10
[ 1339.772776] ? __pfx_kthread+0x10/0x10
[ 1339.774381] ret_from_fork+0x273/0x2e0
[ 1339.775948] ? __pfx_kthread+0x10/0x10
[ 1339.777504] ret_from_fork_asm+0x1a/0x30
[ 1339.779163] </TASK>
Fix this by clearing both BIO_QOS_THROTTLED and BIO_QOS_MERGED flags
when bios are redirected to the multipath head in nvme_failover_req().
This is consistent with the existing code that clears REQ_POLLED and
REQ_NOWAIT flags when the bio changes queues.
Signed-off-by: Chaitanya Kulkarni <[email protected]>
Reviewed-by: Christoph Hellwig <[email protected]>
blktests-ci Bot
pushed a commit
that referenced
this pull request
Mar 4, 2026
When a bio goes through the rq_qos infrastructure on a path's request
queue, it gets BIO_QOS_THROTTLED or BIO_QOS_MERGED flags set. These
flags indicate that rq_qos_done_bio() should be called on completion
to update rq_qos accounting.
During path failover in nvme_failover_req(), the bio's bi_bdev is
redirected from the failed path's disk to the multipath head's disk
via bio_set_dev(). However, the BIO_QOS flags are not cleared.
When the bio eventually completes (either successfully via a new path
or with an error via bio_io_error()), rq_qos_done_bio() checks for
these flags and calls __rq_qos_done_bio(q->rq_qos, bio) where q is
obtained from the bio's current bi_bdev - which is now the multipath
head's queue, not the original path's queue.
The multipath head's queue does not have rq_qos enabled (q->rq_qos is
NULL), but the code assumes that if BIO_QOS_* flags are set, q->rq_qos
must be valid.
This breaks when a bio is moved between queues during NVMe multipath
failover, leading to a NULL pointer dereference.
Execution Context timeline :-
* =====> dd process context
[USER] dd process
[SYSCALL] write() - dd process context
submit_bio()
nvme_ns_head_submit_bio() - path selection
blk_mq_submit_bio() #### QOS FLAGS SET HERE
[USER] dd waits or returns
==== I/O in flight on NVMe hardware =====
===== End of submission path ====
------------------------------------------------------
* dd ====> Interrupt context;
[IRQ] NVMe completion interrupt
nvme_irq()
nvme_complete_rq()
nvme_failover_req() ### BIO MOVED TO HEAD
spin_lock_irqsave (atomic section)
bio_set_dev() changes bi_bdev
### BUG: QOS flags NOT cleared
kblockd_schedule_work()
* Interrupt context =====> kblockd workqueue
[WQ] kblockd workqueue - kworker process
nvme_requeue_work()
submit_bio_noacct()
nvme_ns_head_submit_bio()
nvme_find_path() returns NULL
bio_io_error()
bio_endio()
rq_qos_done_bio() ### CRASH ###
KERNEL PANIC / OOPS
Crash from blktests nvme/058 (rapid namespace remapping):
[ 1339.636033] BUG: kernel NULL pointer dereference, address: 0000000000000000
[ 1339.641025] nvme nvme4: rescanning namespaces.
[ 1339.642064] #PF: supervisor read access in kernel mode
[ 1339.642067] #PF: error_code(0x0000) - not-present page
[ 1339.642070] PGD 0 P4D 0
[ 1339.642073] Oops: Oops: 0000 [#1] SMP NOPTI
[ 1339.642078] CPU: 35 UID: 0 PID: 4579 Comm: kworker/35:2H
Tainted: G O N 6.17.0-rc3nvme+ #5 PREEMPT(voluntary)
[ 1339.642084] Tainted: [O]=OOT_MODULE, [N]=TEST
[ 1339.673446] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996),
BIOS rel-1.16.3-0-ga6ed6b701f0a-prebuilt.qemu.org 04/01/2014
[ 1339.682359] Workqueue: kblockd nvme_requeue_work [nvme_core]
[ 1339.686613] RIP: 0010:__rq_qos_done_bio+0xd/0x40
[ 1339.690161] Code: 75 dd 5b 5d 41 5c c3 cc cc cc cc 66 90 90 90 90 90 90 90
90 90 90 90 90 90 90 90 90 90 0f 1f 44 00 00 55 48 89 f5
53 48 89 fb <48> 8b 03 48 8b 40 30 48 85 c0 74 0b 48 89 ee
48 89 df ff d0 0f 1f
[ 1339.703691] RSP: 0018:ffffc900066f3c90 EFLAGS: 00010202
[ 1339.706844] RAX: ffff888148b9ef00 RBX: 0000000000000000 RCX: 0000000000000000
[ 1339.711136] RDX: 00000000000001c0 RSI: ffff8882aaab8a80 RDI: 0000000000000000
[ 1339.715691] RBP: ffff8882aaab8a80 R08: 0000000000000000 R09: 0000000000000000
[ 1339.720472] R10: 0000000000000000 R11: fefefefefefefeff R12: ffff8882aa3b6010
[ 1339.724650] R13: 0000000000000000 R14: ffff8882338bcef0 R15: ffff8882aa3b6020
[ 1339.729029] FS: 0000000000000000(0000) GS:ffff88985c0cf000(0000) knlGS:0000000000000000
[ 1339.734525] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[ 1339.738563] CR2: 0000000000000000 CR3: 0000000111045000 CR4: 0000000000350ef0
[ 1339.742750] DR0: ffffffff845ccbec DR1: ffffffff845ccbed DR2: ffffffff845ccbee
[ 1339.745630] DR3: ffffffff845ccbef DR6: 00000000ffff0ff0 DR7: 0000000000000600
[ 1339.748488] Call Trace:
[ 1339.749512] <TASK>
[ 1339.750449] bio_endio+0x71/0x2e0
[ 1339.751833] nvme_ns_head_submit_bio+0x290/0x320 [nvme_core]
[ 1339.754073] __submit_bio+0x222/0x5e0
[ 1339.755623] ? rcu_is_watching+0xd/0x40
[ 1339.757201] ? submit_bio_noacct_nocheck+0x131/0x370
[ 1339.759210] submit_bio_noacct_nocheck+0x131/0x370
[ 1339.761189] ? submit_bio_noacct+0x20/0x620
[ 1339.762849] nvme_requeue_work+0x4b/0x60 [nvme_core]
[ 1339.764828] process_one_work+0x20e/0x630
[ 1339.766528] worker_thread+0x184/0x330
[ 1339.768129] ? __pfx_worker_thread+0x10/0x10
[ 1339.769942] kthread+0x10a/0x250
[ 1339.771263] ? __pfx_kthread+0x10/0x10
[ 1339.772776] ? __pfx_kthread+0x10/0x10
[ 1339.774381] ret_from_fork+0x273/0x2e0
[ 1339.775948] ? __pfx_kthread+0x10/0x10
[ 1339.777504] ret_from_fork_asm+0x1a/0x30
[ 1339.779163] </TASK>
Fix this by clearing both BIO_QOS_THROTTLED and BIO_QOS_MERGED flags
when bios are redirected to the multipath head in nvme_failover_req().
This is consistent with the existing code that clears REQ_POLLED and
REQ_NOWAIT flags when the bio changes queues.
Signed-off-by: Chaitanya Kulkarni <[email protected]>
Reviewed-by: Christoph Hellwig <[email protected]>
blktests-ci Bot
pushed a commit
that referenced
this pull request
Mar 4, 2026
When a bio goes through the rq_qos infrastructure on a path's request
queue, it gets BIO_QOS_THROTTLED or BIO_QOS_MERGED flags set. These
flags indicate that rq_qos_done_bio() should be called on completion
to update rq_qos accounting.
During path failover in nvme_failover_req(), the bio's bi_bdev is
redirected from the failed path's disk to the multipath head's disk
via bio_set_dev(). However, the BIO_QOS flags are not cleared.
When the bio eventually completes (either successfully via a new path
or with an error via bio_io_error()), rq_qos_done_bio() checks for
these flags and calls __rq_qos_done_bio(q->rq_qos, bio) where q is
obtained from the bio's current bi_bdev - which is now the multipath
head's queue, not the original path's queue.
The multipath head's queue does not have rq_qos enabled (q->rq_qos is
NULL), but the code assumes that if BIO_QOS_* flags are set, q->rq_qos
must be valid.
This breaks when a bio is moved between queues during NVMe multipath
failover, leading to a NULL pointer dereference.
Execution Context timeline :-
* =====> dd process context
[USER] dd process
[SYSCALL] write() - dd process context
submit_bio()
nvme_ns_head_submit_bio() - path selection
blk_mq_submit_bio() #### QOS FLAGS SET HERE
[USER] dd waits or returns
==== I/O in flight on NVMe hardware =====
===== End of submission path ====
------------------------------------------------------
* dd ====> Interrupt context;
[IRQ] NVMe completion interrupt
nvme_irq()
nvme_complete_rq()
nvme_failover_req() ### BIO MOVED TO HEAD
spin_lock_irqsave (atomic section)
bio_set_dev() changes bi_bdev
### BUG: QOS flags NOT cleared
kblockd_schedule_work()
* Interrupt context =====> kblockd workqueue
[WQ] kblockd workqueue - kworker process
nvme_requeue_work()
submit_bio_noacct()
nvme_ns_head_submit_bio()
nvme_find_path() returns NULL
bio_io_error()
bio_endio()
rq_qos_done_bio() ### CRASH ###
KERNEL PANIC / OOPS
Crash from blktests nvme/058 (rapid namespace remapping):
[ 1339.636033] BUG: kernel NULL pointer dereference, address: 0000000000000000
[ 1339.641025] nvme nvme4: rescanning namespaces.
[ 1339.642064] #PF: supervisor read access in kernel mode
[ 1339.642067] #PF: error_code(0x0000) - not-present page
[ 1339.642070] PGD 0 P4D 0
[ 1339.642073] Oops: Oops: 0000 [#1] SMP NOPTI
[ 1339.642078] CPU: 35 UID: 0 PID: 4579 Comm: kworker/35:2H
Tainted: G O N 6.17.0-rc3nvme+ #5 PREEMPT(voluntary)
[ 1339.642084] Tainted: [O]=OOT_MODULE, [N]=TEST
[ 1339.673446] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996),
BIOS rel-1.16.3-0-ga6ed6b701f0a-prebuilt.qemu.org 04/01/2014
[ 1339.682359] Workqueue: kblockd nvme_requeue_work [nvme_core]
[ 1339.686613] RIP: 0010:__rq_qos_done_bio+0xd/0x40
[ 1339.690161] Code: 75 dd 5b 5d 41 5c c3 cc cc cc cc 66 90 90 90 90 90 90 90
90 90 90 90 90 90 90 90 90 90 0f 1f 44 00 00 55 48 89 f5
53 48 89 fb <48> 8b 03 48 8b 40 30 48 85 c0 74 0b 48 89 ee
48 89 df ff d0 0f 1f
[ 1339.703691] RSP: 0018:ffffc900066f3c90 EFLAGS: 00010202
[ 1339.706844] RAX: ffff888148b9ef00 RBX: 0000000000000000 RCX: 0000000000000000
[ 1339.711136] RDX: 00000000000001c0 RSI: ffff8882aaab8a80 RDI: 0000000000000000
[ 1339.715691] RBP: ffff8882aaab8a80 R08: 0000000000000000 R09: 0000000000000000
[ 1339.720472] R10: 0000000000000000 R11: fefefefefefefeff R12: ffff8882aa3b6010
[ 1339.724650] R13: 0000000000000000 R14: ffff8882338bcef0 R15: ffff8882aa3b6020
[ 1339.729029] FS: 0000000000000000(0000) GS:ffff88985c0cf000(0000) knlGS:0000000000000000
[ 1339.734525] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[ 1339.738563] CR2: 0000000000000000 CR3: 0000000111045000 CR4: 0000000000350ef0
[ 1339.742750] DR0: ffffffff845ccbec DR1: ffffffff845ccbed DR2: ffffffff845ccbee
[ 1339.745630] DR3: ffffffff845ccbef DR6: 00000000ffff0ff0 DR7: 0000000000000600
[ 1339.748488] Call Trace:
[ 1339.749512] <TASK>
[ 1339.750449] bio_endio+0x71/0x2e0
[ 1339.751833] nvme_ns_head_submit_bio+0x290/0x320 [nvme_core]
[ 1339.754073] __submit_bio+0x222/0x5e0
[ 1339.755623] ? rcu_is_watching+0xd/0x40
[ 1339.757201] ? submit_bio_noacct_nocheck+0x131/0x370
[ 1339.759210] submit_bio_noacct_nocheck+0x131/0x370
[ 1339.761189] ? submit_bio_noacct+0x20/0x620
[ 1339.762849] nvme_requeue_work+0x4b/0x60 [nvme_core]
[ 1339.764828] process_one_work+0x20e/0x630
[ 1339.766528] worker_thread+0x184/0x330
[ 1339.768129] ? __pfx_worker_thread+0x10/0x10
[ 1339.769942] kthread+0x10a/0x250
[ 1339.771263] ? __pfx_kthread+0x10/0x10
[ 1339.772776] ? __pfx_kthread+0x10/0x10
[ 1339.774381] ret_from_fork+0x273/0x2e0
[ 1339.775948] ? __pfx_kthread+0x10/0x10
[ 1339.777504] ret_from_fork_asm+0x1a/0x30
[ 1339.779163] </TASK>
Fix this by clearing both BIO_QOS_THROTTLED and BIO_QOS_MERGED flags
when bios are redirected to the multipath head in nvme_failover_req().
This is consistent with the existing code that clears REQ_POLLED and
REQ_NOWAIT flags when the bio changes queues.
Signed-off-by: Chaitanya Kulkarni <[email protected]>
Reviewed-by: Christoph Hellwig <[email protected]>
blktests-ci Bot
pushed a commit
that referenced
this pull request
Mar 4, 2026
When a bio goes through the rq_qos infrastructure on a path's request
queue, it gets BIO_QOS_THROTTLED or BIO_QOS_MERGED flags set. These
flags indicate that rq_qos_done_bio() should be called on completion
to update rq_qos accounting.
During path failover in nvme_failover_req(), the bio's bi_bdev is
redirected from the failed path's disk to the multipath head's disk
via bio_set_dev(). However, the BIO_QOS flags are not cleared.
When the bio eventually completes (either successfully via a new path
or with an error via bio_io_error()), rq_qos_done_bio() checks for
these flags and calls __rq_qos_done_bio(q->rq_qos, bio) where q is
obtained from the bio's current bi_bdev - which is now the multipath
head's queue, not the original path's queue.
The multipath head's queue does not have rq_qos enabled (q->rq_qos is
NULL), but the code assumes that if BIO_QOS_* flags are set, q->rq_qos
must be valid.
This breaks when a bio is moved between queues during NVMe multipath
failover, leading to a NULL pointer dereference.
Execution Context timeline :-
* =====> dd process context
[USER] dd process
[SYSCALL] write() - dd process context
submit_bio()
nvme_ns_head_submit_bio() - path selection
blk_mq_submit_bio() #### QOS FLAGS SET HERE
[USER] dd waits or returns
==== I/O in flight on NVMe hardware =====
===== End of submission path ====
------------------------------------------------------
* dd ====> Interrupt context;
[IRQ] NVMe completion interrupt
nvme_irq()
nvme_complete_rq()
nvme_failover_req() ### BIO MOVED TO HEAD
spin_lock_irqsave (atomic section)
bio_set_dev() changes bi_bdev
### BUG: QOS flags NOT cleared
kblockd_schedule_work()
* Interrupt context =====> kblockd workqueue
[WQ] kblockd workqueue - kworker process
nvme_requeue_work()
submit_bio_noacct()
nvme_ns_head_submit_bio()
nvme_find_path() returns NULL
bio_io_error()
bio_endio()
rq_qos_done_bio() ### CRASH ###
KERNEL PANIC / OOPS
Crash from blktests nvme/058 (rapid namespace remapping):
[ 1339.636033] BUG: kernel NULL pointer dereference, address: 0000000000000000
[ 1339.641025] nvme nvme4: rescanning namespaces.
[ 1339.642064] #PF: supervisor read access in kernel mode
[ 1339.642067] #PF: error_code(0x0000) - not-present page
[ 1339.642070] PGD 0 P4D 0
[ 1339.642073] Oops: Oops: 0000 [#1] SMP NOPTI
[ 1339.642078] CPU: 35 UID: 0 PID: 4579 Comm: kworker/35:2H
Tainted: G O N 6.17.0-rc3nvme+ #5 PREEMPT(voluntary)
[ 1339.642084] Tainted: [O]=OOT_MODULE, [N]=TEST
[ 1339.673446] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996),
BIOS rel-1.16.3-0-ga6ed6b701f0a-prebuilt.qemu.org 04/01/2014
[ 1339.682359] Workqueue: kblockd nvme_requeue_work [nvme_core]
[ 1339.686613] RIP: 0010:__rq_qos_done_bio+0xd/0x40
[ 1339.690161] Code: 75 dd 5b 5d 41 5c c3 cc cc cc cc 66 90 90 90 90 90 90 90
90 90 90 90 90 90 90 90 90 90 0f 1f 44 00 00 55 48 89 f5
53 48 89 fb <48> 8b 03 48 8b 40 30 48 85 c0 74 0b 48 89 ee
48 89 df ff d0 0f 1f
[ 1339.703691] RSP: 0018:ffffc900066f3c90 EFLAGS: 00010202
[ 1339.706844] RAX: ffff888148b9ef00 RBX: 0000000000000000 RCX: 0000000000000000
[ 1339.711136] RDX: 00000000000001c0 RSI: ffff8882aaab8a80 RDI: 0000000000000000
[ 1339.715691] RBP: ffff8882aaab8a80 R08: 0000000000000000 R09: 0000000000000000
[ 1339.720472] R10: 0000000000000000 R11: fefefefefefefeff R12: ffff8882aa3b6010
[ 1339.724650] R13: 0000000000000000 R14: ffff8882338bcef0 R15: ffff8882aa3b6020
[ 1339.729029] FS: 0000000000000000(0000) GS:ffff88985c0cf000(0000) knlGS:0000000000000000
[ 1339.734525] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[ 1339.738563] CR2: 0000000000000000 CR3: 0000000111045000 CR4: 0000000000350ef0
[ 1339.742750] DR0: ffffffff845ccbec DR1: ffffffff845ccbed DR2: ffffffff845ccbee
[ 1339.745630] DR3: ffffffff845ccbef DR6: 00000000ffff0ff0 DR7: 0000000000000600
[ 1339.748488] Call Trace:
[ 1339.749512] <TASK>
[ 1339.750449] bio_endio+0x71/0x2e0
[ 1339.751833] nvme_ns_head_submit_bio+0x290/0x320 [nvme_core]
[ 1339.754073] __submit_bio+0x222/0x5e0
[ 1339.755623] ? rcu_is_watching+0xd/0x40
[ 1339.757201] ? submit_bio_noacct_nocheck+0x131/0x370
[ 1339.759210] submit_bio_noacct_nocheck+0x131/0x370
[ 1339.761189] ? submit_bio_noacct+0x20/0x620
[ 1339.762849] nvme_requeue_work+0x4b/0x60 [nvme_core]
[ 1339.764828] process_one_work+0x20e/0x630
[ 1339.766528] worker_thread+0x184/0x330
[ 1339.768129] ? __pfx_worker_thread+0x10/0x10
[ 1339.769942] kthread+0x10a/0x250
[ 1339.771263] ? __pfx_kthread+0x10/0x10
[ 1339.772776] ? __pfx_kthread+0x10/0x10
[ 1339.774381] ret_from_fork+0x273/0x2e0
[ 1339.775948] ? __pfx_kthread+0x10/0x10
[ 1339.777504] ret_from_fork_asm+0x1a/0x30
[ 1339.779163] </TASK>
Fix this by clearing both BIO_QOS_THROTTLED and BIO_QOS_MERGED flags
when bios are redirected to the multipath head in nvme_failover_req().
This is consistent with the existing code that clears REQ_POLLED and
REQ_NOWAIT flags when the bio changes queues.
Signed-off-by: Chaitanya Kulkarni <[email protected]>
Reviewed-by: Christoph Hellwig <[email protected]>
blktests-ci Bot
pushed a commit
that referenced
this pull request
Mar 5, 2026
When a bio goes through the rq_qos infrastructure on a path's request
queue, it gets BIO_QOS_THROTTLED or BIO_QOS_MERGED flags set. These
flags indicate that rq_qos_done_bio() should be called on completion
to update rq_qos accounting.
During path failover in nvme_failover_req(), the bio's bi_bdev is
redirected from the failed path's disk to the multipath head's disk
via bio_set_dev(). However, the BIO_QOS flags are not cleared.
When the bio eventually completes (either successfully via a new path
or with an error via bio_io_error()), rq_qos_done_bio() checks for
these flags and calls __rq_qos_done_bio(q->rq_qos, bio) where q is
obtained from the bio's current bi_bdev - which is now the multipath
head's queue, not the original path's queue.
The multipath head's queue does not have rq_qos enabled (q->rq_qos is
NULL), but the code assumes that if BIO_QOS_* flags are set, q->rq_qos
must be valid.
This breaks when a bio is moved between queues during NVMe multipath
failover, leading to a NULL pointer dereference.
Execution Context timeline :-
* =====> dd process context
[USER] dd process
[SYSCALL] write() - dd process context
submit_bio()
nvme_ns_head_submit_bio() - path selection
blk_mq_submit_bio() #### QOS FLAGS SET HERE
[USER] dd waits or returns
==== I/O in flight on NVMe hardware =====
===== End of submission path ====
------------------------------------------------------
* dd ====> Interrupt context;
[IRQ] NVMe completion interrupt
nvme_irq()
nvme_complete_rq()
nvme_failover_req() ### BIO MOVED TO HEAD
spin_lock_irqsave (atomic section)
bio_set_dev() changes bi_bdev
### BUG: QOS flags NOT cleared
kblockd_schedule_work()
* Interrupt context =====> kblockd workqueue
[WQ] kblockd workqueue - kworker process
nvme_requeue_work()
submit_bio_noacct()
nvme_ns_head_submit_bio()
nvme_find_path() returns NULL
bio_io_error()
bio_endio()
rq_qos_done_bio() ### CRASH ###
KERNEL PANIC / OOPS
Crash from blktests nvme/058 (rapid namespace remapping):
[ 1339.636033] BUG: kernel NULL pointer dereference, address: 0000000000000000
[ 1339.641025] nvme nvme4: rescanning namespaces.
[ 1339.642064] #PF: supervisor read access in kernel mode
[ 1339.642067] #PF: error_code(0x0000) - not-present page
[ 1339.642070] PGD 0 P4D 0
[ 1339.642073] Oops: Oops: 0000 [#1] SMP NOPTI
[ 1339.642078] CPU: 35 UID: 0 PID: 4579 Comm: kworker/35:2H
Tainted: G O N 6.17.0-rc3nvme+ #5 PREEMPT(voluntary)
[ 1339.642084] Tainted: [O]=OOT_MODULE, [N]=TEST
[ 1339.673446] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996),
BIOS rel-1.16.3-0-ga6ed6b701f0a-prebuilt.qemu.org 04/01/2014
[ 1339.682359] Workqueue: kblockd nvme_requeue_work [nvme_core]
[ 1339.686613] RIP: 0010:__rq_qos_done_bio+0xd/0x40
[ 1339.690161] Code: 75 dd 5b 5d 41 5c c3 cc cc cc cc 66 90 90 90 90 90 90 90
90 90 90 90 90 90 90 90 90 90 0f 1f 44 00 00 55 48 89 f5
53 48 89 fb <48> 8b 03 48 8b 40 30 48 85 c0 74 0b 48 89 ee
48 89 df ff d0 0f 1f
[ 1339.703691] RSP: 0018:ffffc900066f3c90 EFLAGS: 00010202
[ 1339.706844] RAX: ffff888148b9ef00 RBX: 0000000000000000 RCX: 0000000000000000
[ 1339.711136] RDX: 00000000000001c0 RSI: ffff8882aaab8a80 RDI: 0000000000000000
[ 1339.715691] RBP: ffff8882aaab8a80 R08: 0000000000000000 R09: 0000000000000000
[ 1339.720472] R10: 0000000000000000 R11: fefefefefefefeff R12: ffff8882aa3b6010
[ 1339.724650] R13: 0000000000000000 R14: ffff8882338bcef0 R15: ffff8882aa3b6020
[ 1339.729029] FS: 0000000000000000(0000) GS:ffff88985c0cf000(0000) knlGS:0000000000000000
[ 1339.734525] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[ 1339.738563] CR2: 0000000000000000 CR3: 0000000111045000 CR4: 0000000000350ef0
[ 1339.742750] DR0: ffffffff845ccbec DR1: ffffffff845ccbed DR2: ffffffff845ccbee
[ 1339.745630] DR3: ffffffff845ccbef DR6: 00000000ffff0ff0 DR7: 0000000000000600
[ 1339.748488] Call Trace:
[ 1339.749512] <TASK>
[ 1339.750449] bio_endio+0x71/0x2e0
[ 1339.751833] nvme_ns_head_submit_bio+0x290/0x320 [nvme_core]
[ 1339.754073] __submit_bio+0x222/0x5e0
[ 1339.755623] ? rcu_is_watching+0xd/0x40
[ 1339.757201] ? submit_bio_noacct_nocheck+0x131/0x370
[ 1339.759210] submit_bio_noacct_nocheck+0x131/0x370
[ 1339.761189] ? submit_bio_noacct+0x20/0x620
[ 1339.762849] nvme_requeue_work+0x4b/0x60 [nvme_core]
[ 1339.764828] process_one_work+0x20e/0x630
[ 1339.766528] worker_thread+0x184/0x330
[ 1339.768129] ? __pfx_worker_thread+0x10/0x10
[ 1339.769942] kthread+0x10a/0x250
[ 1339.771263] ? __pfx_kthread+0x10/0x10
[ 1339.772776] ? __pfx_kthread+0x10/0x10
[ 1339.774381] ret_from_fork+0x273/0x2e0
[ 1339.775948] ? __pfx_kthread+0x10/0x10
[ 1339.777504] ret_from_fork_asm+0x1a/0x30
[ 1339.779163] </TASK>
Fix this by clearing both BIO_QOS_THROTTLED and BIO_QOS_MERGED flags
when bios are redirected to the multipath head in nvme_failover_req().
This is consistent with the existing code that clears REQ_POLLED and
REQ_NOWAIT flags when the bio changes queues.
Signed-off-by: Chaitanya Kulkarni <[email protected]>
Reviewed-by: Christoph Hellwig <[email protected]>
blktests-ci Bot
pushed a commit
that referenced
this pull request
Mar 5, 2026
When a bio goes through the rq_qos infrastructure on a path's request
queue, it gets BIO_QOS_THROTTLED or BIO_QOS_MERGED flags set. These
flags indicate that rq_qos_done_bio() should be called on completion
to update rq_qos accounting.
During path failover in nvme_failover_req(), the bio's bi_bdev is
redirected from the failed path's disk to the multipath head's disk
via bio_set_dev(). However, the BIO_QOS flags are not cleared.
When the bio eventually completes (either successfully via a new path
or with an error via bio_io_error()), rq_qos_done_bio() checks for
these flags and calls __rq_qos_done_bio(q->rq_qos, bio) where q is
obtained from the bio's current bi_bdev - which is now the multipath
head's queue, not the original path's queue.
The multipath head's queue does not have rq_qos enabled (q->rq_qos is
NULL), but the code assumes that if BIO_QOS_* flags are set, q->rq_qos
must be valid.
This breaks when a bio is moved between queues during NVMe multipath
failover, leading to a NULL pointer dereference.
Execution Context timeline :-
* =====> dd process context
[USER] dd process
[SYSCALL] write() - dd process context
submit_bio()
nvme_ns_head_submit_bio() - path selection
blk_mq_submit_bio() #### QOS FLAGS SET HERE
[USER] dd waits or returns
==== I/O in flight on NVMe hardware =====
===== End of submission path ====
------------------------------------------------------
* dd ====> Interrupt context;
[IRQ] NVMe completion interrupt
nvme_irq()
nvme_complete_rq()
nvme_failover_req() ### BIO MOVED TO HEAD
spin_lock_irqsave (atomic section)
bio_set_dev() changes bi_bdev
### BUG: QOS flags NOT cleared
kblockd_schedule_work()
* Interrupt context =====> kblockd workqueue
[WQ] kblockd workqueue - kworker process
nvme_requeue_work()
submit_bio_noacct()
nvme_ns_head_submit_bio()
nvme_find_path() returns NULL
bio_io_error()
bio_endio()
rq_qos_done_bio() ### CRASH ###
KERNEL PANIC / OOPS
Crash from blktests nvme/058 (rapid namespace remapping):
[ 1339.636033] BUG: kernel NULL pointer dereference, address: 0000000000000000
[ 1339.641025] nvme nvme4: rescanning namespaces.
[ 1339.642064] #PF: supervisor read access in kernel mode
[ 1339.642067] #PF: error_code(0x0000) - not-present page
[ 1339.642070] PGD 0 P4D 0
[ 1339.642073] Oops: Oops: 0000 [#1] SMP NOPTI
[ 1339.642078] CPU: 35 UID: 0 PID: 4579 Comm: kworker/35:2H
Tainted: G O N 6.17.0-rc3nvme+ #5 PREEMPT(voluntary)
[ 1339.642084] Tainted: [O]=OOT_MODULE, [N]=TEST
[ 1339.673446] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996),
BIOS rel-1.16.3-0-ga6ed6b701f0a-prebuilt.qemu.org 04/01/2014
[ 1339.682359] Workqueue: kblockd nvme_requeue_work [nvme_core]
[ 1339.686613] RIP: 0010:__rq_qos_done_bio+0xd/0x40
[ 1339.690161] Code: 75 dd 5b 5d 41 5c c3 cc cc cc cc 66 90 90 90 90 90 90 90
90 90 90 90 90 90 90 90 90 90 0f 1f 44 00 00 55 48 89 f5
53 48 89 fb <48> 8b 03 48 8b 40 30 48 85 c0 74 0b 48 89 ee
48 89 df ff d0 0f 1f
[ 1339.703691] RSP: 0018:ffffc900066f3c90 EFLAGS: 00010202
[ 1339.706844] RAX: ffff888148b9ef00 RBX: 0000000000000000 RCX: 0000000000000000
[ 1339.711136] RDX: 00000000000001c0 RSI: ffff8882aaab8a80 RDI: 0000000000000000
[ 1339.715691] RBP: ffff8882aaab8a80 R08: 0000000000000000 R09: 0000000000000000
[ 1339.720472] R10: 0000000000000000 R11: fefefefefefefeff R12: ffff8882aa3b6010
[ 1339.724650] R13: 0000000000000000 R14: ffff8882338bcef0 R15: ffff8882aa3b6020
[ 1339.729029] FS: 0000000000000000(0000) GS:ffff88985c0cf000(0000) knlGS:0000000000000000
[ 1339.734525] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[ 1339.738563] CR2: 0000000000000000 CR3: 0000000111045000 CR4: 0000000000350ef0
[ 1339.742750] DR0: ffffffff845ccbec DR1: ffffffff845ccbed DR2: ffffffff845ccbee
[ 1339.745630] DR3: ffffffff845ccbef DR6: 00000000ffff0ff0 DR7: 0000000000000600
[ 1339.748488] Call Trace:
[ 1339.749512] <TASK>
[ 1339.750449] bio_endio+0x71/0x2e0
[ 1339.751833] nvme_ns_head_submit_bio+0x290/0x320 [nvme_core]
[ 1339.754073] __submit_bio+0x222/0x5e0
[ 1339.755623] ? rcu_is_watching+0xd/0x40
[ 1339.757201] ? submit_bio_noacct_nocheck+0x131/0x370
[ 1339.759210] submit_bio_noacct_nocheck+0x131/0x370
[ 1339.761189] ? submit_bio_noacct+0x20/0x620
[ 1339.762849] nvme_requeue_work+0x4b/0x60 [nvme_core]
[ 1339.764828] process_one_work+0x20e/0x630
[ 1339.766528] worker_thread+0x184/0x330
[ 1339.768129] ? __pfx_worker_thread+0x10/0x10
[ 1339.769942] kthread+0x10a/0x250
[ 1339.771263] ? __pfx_kthread+0x10/0x10
[ 1339.772776] ? __pfx_kthread+0x10/0x10
[ 1339.774381] ret_from_fork+0x273/0x2e0
[ 1339.775948] ? __pfx_kthread+0x10/0x10
[ 1339.777504] ret_from_fork_asm+0x1a/0x30
[ 1339.779163] </TASK>
Fix this by clearing both BIO_QOS_THROTTLED and BIO_QOS_MERGED flags
when bios are redirected to the multipath head in nvme_failover_req().
This is consistent with the existing code that clears REQ_POLLED and
REQ_NOWAIT flags when the bio changes queues.
Signed-off-by: Chaitanya Kulkarni <[email protected]>
Reviewed-by: Christoph Hellwig <[email protected]>
blktests-ci Bot
pushed a commit
that referenced
this pull request
Mar 10, 2026
This leak will cause a hang when tearing down the SCSI host. For example, iscsid hangs with the following call trace: [130120.652718] scsi_alloc_sdev: Allocation failure during SCSI scanning, some SCSI devices might not be configured PID: 2528 TASK: ffff9d0408974e00 CPU: 3 COMMAND: "iscsid" #0 [ffffb5b9c134b9e0] __schedule at ffffffff860657d4 #1 [ffffb5b9c134ba28] schedule at ffffffff86065c6f #2 [ffffb5b9c134ba40] schedule_timeout at ffffffff86069fb0 #3 [ffffb5b9c134bab0] __wait_for_common at ffffffff8606674f #4 [ffffb5b9c134bb10] scsi_remove_host at ffffffff85bfe84b #5 [ffffb5b9c134bb30] iscsi_sw_tcp_session_destroy at ffffffffc03031c4 [iscsi_tcp] #6 [ffffb5b9c134bb48] iscsi_if_recv_msg at ffffffffc0292692 [scsi_transport_iscsi] #7 [ffffb5b9c134bb98] iscsi_if_rx at ffffffffc02929c2 [scsi_transport_iscsi] #8 [ffffb5b9c134bbf0] netlink_unicast at ffffffff85e551d6 #9 [ffffb5b9c134bc38] netlink_sendmsg at ffffffff85e554ef Fixes: 8fe4ce5 ("scsi: core: Fix a use-after-free") Cc: [email protected] Signed-off-by: Junxiao Bi <[email protected]> Reviewed-by: Mike Christie <[email protected]> Reviewed-by: Bart Van Assche <[email protected]> Link: https://patch.msgid.link/[email protected] Signed-off-by: Martin K. Petersen <[email protected]>
blktests-ci Bot
pushed a commit
that referenced
this pull request
Mar 10, 2026
When a bio goes through the rq_qos infrastructure on a path's request
queue, it gets BIO_QOS_THROTTLED or BIO_QOS_MERGED flags set. These
flags indicate that rq_qos_done_bio() should be called on completion
to update rq_qos accounting.
During path failover in nvme_failover_req(), the bio's bi_bdev is
redirected from the failed path's disk to the multipath head's disk
via bio_set_dev(). However, the BIO_QOS flags are not cleared.
When the bio eventually completes (either successfully via a new path
or with an error via bio_io_error()), rq_qos_done_bio() checks for
these flags and calls __rq_qos_done_bio(q->rq_qos, bio) where q is
obtained from the bio's current bi_bdev - which is now the multipath
head's queue, not the original path's queue.
The multipath head's queue does not have rq_qos enabled (q->rq_qos is
NULL), but the code assumes that if BIO_QOS_* flags are set, q->rq_qos
must be valid.
This breaks when a bio is moved between queues during NVMe multipath
failover, leading to a NULL pointer dereference.
Execution Context timeline :-
* =====> dd process context
[USER] dd process
[SYSCALL] write() - dd process context
submit_bio()
nvme_ns_head_submit_bio() - path selection
blk_mq_submit_bio() #### QOS FLAGS SET HERE
[USER] dd waits or returns
==== I/O in flight on NVMe hardware =====
===== End of submission path ====
------------------------------------------------------
* dd ====> Interrupt context;
[IRQ] NVMe completion interrupt
nvme_irq()
nvme_complete_rq()
nvme_failover_req() ### BIO MOVED TO HEAD
spin_lock_irqsave (atomic section)
bio_set_dev() changes bi_bdev
### BUG: QOS flags NOT cleared
kblockd_schedule_work()
* Interrupt context =====> kblockd workqueue
[WQ] kblockd workqueue - kworker process
nvme_requeue_work()
submit_bio_noacct()
nvme_ns_head_submit_bio()
nvme_find_path() returns NULL
bio_io_error()
bio_endio()
rq_qos_done_bio() ### CRASH ###
KERNEL PANIC / OOPS
Crash from blktests nvme/058 (rapid namespace remapping):
[ 1339.636033] BUG: kernel NULL pointer dereference, address: 0000000000000000
[ 1339.641025] nvme nvme4: rescanning namespaces.
[ 1339.642064] #PF: supervisor read access in kernel mode
[ 1339.642067] #PF: error_code(0x0000) - not-present page
[ 1339.642070] PGD 0 P4D 0
[ 1339.642073] Oops: Oops: 0000 [#1] SMP NOPTI
[ 1339.642078] CPU: 35 UID: 0 PID: 4579 Comm: kworker/35:2H
Tainted: G O N 6.17.0-rc3nvme+ #5 PREEMPT(voluntary)
[ 1339.642084] Tainted: [O]=OOT_MODULE, [N]=TEST
[ 1339.673446] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996),
BIOS rel-1.16.3-0-ga6ed6b701f0a-prebuilt.qemu.org 04/01/2014
[ 1339.682359] Workqueue: kblockd nvme_requeue_work [nvme_core]
[ 1339.686613] RIP: 0010:__rq_qos_done_bio+0xd/0x40
[ 1339.690161] Code: 75 dd 5b 5d 41 5c c3 cc cc cc cc 66 90 90 90 90 90 90 90
90 90 90 90 90 90 90 90 90 90 0f 1f 44 00 00 55 48 89 f5
53 48 89 fb <48> 8b 03 48 8b 40 30 48 85 c0 74 0b 48 89 ee
48 89 df ff d0 0f 1f
[ 1339.703691] RSP: 0018:ffffc900066f3c90 EFLAGS: 00010202
[ 1339.706844] RAX: ffff888148b9ef00 RBX: 0000000000000000 RCX: 0000000000000000
[ 1339.711136] RDX: 00000000000001c0 RSI: ffff8882aaab8a80 RDI: 0000000000000000
[ 1339.715691] RBP: ffff8882aaab8a80 R08: 0000000000000000 R09: 0000000000000000
[ 1339.720472] R10: 0000000000000000 R11: fefefefefefefeff R12: ffff8882aa3b6010
[ 1339.724650] R13: 0000000000000000 R14: ffff8882338bcef0 R15: ffff8882aa3b6020
[ 1339.729029] FS: 0000000000000000(0000) GS:ffff88985c0cf000(0000) knlGS:0000000000000000
[ 1339.734525] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[ 1339.738563] CR2: 0000000000000000 CR3: 0000000111045000 CR4: 0000000000350ef0
[ 1339.742750] DR0: ffffffff845ccbec DR1: ffffffff845ccbed DR2: ffffffff845ccbee
[ 1339.745630] DR3: ffffffff845ccbef DR6: 00000000ffff0ff0 DR7: 0000000000000600
[ 1339.748488] Call Trace:
[ 1339.749512] <TASK>
[ 1339.750449] bio_endio+0x71/0x2e0
[ 1339.751833] nvme_ns_head_submit_bio+0x290/0x320 [nvme_core]
[ 1339.754073] __submit_bio+0x222/0x5e0
[ 1339.755623] ? rcu_is_watching+0xd/0x40
[ 1339.757201] ? submit_bio_noacct_nocheck+0x131/0x370
[ 1339.759210] submit_bio_noacct_nocheck+0x131/0x370
[ 1339.761189] ? submit_bio_noacct+0x20/0x620
[ 1339.762849] nvme_requeue_work+0x4b/0x60 [nvme_core]
[ 1339.764828] process_one_work+0x20e/0x630
[ 1339.766528] worker_thread+0x184/0x330
[ 1339.768129] ? __pfx_worker_thread+0x10/0x10
[ 1339.769942] kthread+0x10a/0x250
[ 1339.771263] ? __pfx_kthread+0x10/0x10
[ 1339.772776] ? __pfx_kthread+0x10/0x10
[ 1339.774381] ret_from_fork+0x273/0x2e0
[ 1339.775948] ? __pfx_kthread+0x10/0x10
[ 1339.777504] ret_from_fork_asm+0x1a/0x30
[ 1339.779163] </TASK>
Fix this by clearing both BIO_QOS_THROTTLED and BIO_QOS_MERGED flags
when bios are redirected to the multipath head in nvme_failover_req().
This is consistent with the existing code that clears REQ_POLLED and
REQ_NOWAIT flags when the bio changes queues.
Signed-off-by: Chaitanya Kulkarni <[email protected]>
Reviewed-by: Christoph Hellwig <[email protected]>
blktests-ci Bot
pushed a commit
that referenced
this pull request
Mar 10, 2026
When a bio goes through the rq_qos infrastructure on a path's request
queue, it gets BIO_QOS_THROTTLED or BIO_QOS_MERGED flags set. These
flags indicate that rq_qos_done_bio() should be called on completion
to update rq_qos accounting.
During path failover in nvme_failover_req(), the bio's bi_bdev is
redirected from the failed path's disk to the multipath head's disk
via bio_set_dev(). However, the BIO_QOS flags are not cleared.
When the bio eventually completes (either successfully via a new path
or with an error via bio_io_error()), rq_qos_done_bio() checks for
these flags and calls __rq_qos_done_bio(q->rq_qos, bio) where q is
obtained from the bio's current bi_bdev - which is now the multipath
head's queue, not the original path's queue.
The multipath head's queue does not have rq_qos enabled (q->rq_qos is
NULL), but the code assumes that if BIO_QOS_* flags are set, q->rq_qos
must be valid.
This breaks when a bio is moved between queues during NVMe multipath
failover, leading to a NULL pointer dereference.
Execution Context timeline :-
* =====> dd process context
[USER] dd process
[SYSCALL] write() - dd process context
submit_bio()
nvme_ns_head_submit_bio() - path selection
blk_mq_submit_bio() #### QOS FLAGS SET HERE
[USER] dd waits or returns
==== I/O in flight on NVMe hardware =====
===== End of submission path ====
------------------------------------------------------
* dd ====> Interrupt context;
[IRQ] NVMe completion interrupt
nvme_irq()
nvme_complete_rq()
nvme_failover_req() ### BIO MOVED TO HEAD
spin_lock_irqsave (atomic section)
bio_set_dev() changes bi_bdev
### BUG: QOS flags NOT cleared
kblockd_schedule_work()
* Interrupt context =====> kblockd workqueue
[WQ] kblockd workqueue - kworker process
nvme_requeue_work()
submit_bio_noacct()
nvme_ns_head_submit_bio()
nvme_find_path() returns NULL
bio_io_error()
bio_endio()
rq_qos_done_bio() ### CRASH ###
KERNEL PANIC / OOPS
Crash from blktests nvme/058 (rapid namespace remapping):
[ 1339.636033] BUG: kernel NULL pointer dereference, address: 0000000000000000
[ 1339.641025] nvme nvme4: rescanning namespaces.
[ 1339.642064] #PF: supervisor read access in kernel mode
[ 1339.642067] #PF: error_code(0x0000) - not-present page
[ 1339.642070] PGD 0 P4D 0
[ 1339.642073] Oops: Oops: 0000 [#1] SMP NOPTI
[ 1339.642078] CPU: 35 UID: 0 PID: 4579 Comm: kworker/35:2H
Tainted: G O N 6.17.0-rc3nvme+ #5 PREEMPT(voluntary)
[ 1339.642084] Tainted: [O]=OOT_MODULE, [N]=TEST
[ 1339.673446] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996),
BIOS rel-1.16.3-0-ga6ed6b701f0a-prebuilt.qemu.org 04/01/2014
[ 1339.682359] Workqueue: kblockd nvme_requeue_work [nvme_core]
[ 1339.686613] RIP: 0010:__rq_qos_done_bio+0xd/0x40
[ 1339.690161] Code: 75 dd 5b 5d 41 5c c3 cc cc cc cc 66 90 90 90 90 90 90 90
90 90 90 90 90 90 90 90 90 90 0f 1f 44 00 00 55 48 89 f5
53 48 89 fb <48> 8b 03 48 8b 40 30 48 85 c0 74 0b 48 89 ee
48 89 df ff d0 0f 1f
[ 1339.703691] RSP: 0018:ffffc900066f3c90 EFLAGS: 00010202
[ 1339.706844] RAX: ffff888148b9ef00 RBX: 0000000000000000 RCX: 0000000000000000
[ 1339.711136] RDX: 00000000000001c0 RSI: ffff8882aaab8a80 RDI: 0000000000000000
[ 1339.715691] RBP: ffff8882aaab8a80 R08: 0000000000000000 R09: 0000000000000000
[ 1339.720472] R10: 0000000000000000 R11: fefefefefefefeff R12: ffff8882aa3b6010
[ 1339.724650] R13: 0000000000000000 R14: ffff8882338bcef0 R15: ffff8882aa3b6020
[ 1339.729029] FS: 0000000000000000(0000) GS:ffff88985c0cf000(0000) knlGS:0000000000000000
[ 1339.734525] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[ 1339.738563] CR2: 0000000000000000 CR3: 0000000111045000 CR4: 0000000000350ef0
[ 1339.742750] DR0: ffffffff845ccbec DR1: ffffffff845ccbed DR2: ffffffff845ccbee
[ 1339.745630] DR3: ffffffff845ccbef DR6: 00000000ffff0ff0 DR7: 0000000000000600
[ 1339.748488] Call Trace:
[ 1339.749512] <TASK>
[ 1339.750449] bio_endio+0x71/0x2e0
[ 1339.751833] nvme_ns_head_submit_bio+0x290/0x320 [nvme_core]
[ 1339.754073] __submit_bio+0x222/0x5e0
[ 1339.755623] ? rcu_is_watching+0xd/0x40
[ 1339.757201] ? submit_bio_noacct_nocheck+0x131/0x370
[ 1339.759210] submit_bio_noacct_nocheck+0x131/0x370
[ 1339.761189] ? submit_bio_noacct+0x20/0x620
[ 1339.762849] nvme_requeue_work+0x4b/0x60 [nvme_core]
[ 1339.764828] process_one_work+0x20e/0x630
[ 1339.766528] worker_thread+0x184/0x330
[ 1339.768129] ? __pfx_worker_thread+0x10/0x10
[ 1339.769942] kthread+0x10a/0x250
[ 1339.771263] ? __pfx_kthread+0x10/0x10
[ 1339.772776] ? __pfx_kthread+0x10/0x10
[ 1339.774381] ret_from_fork+0x273/0x2e0
[ 1339.775948] ? __pfx_kthread+0x10/0x10
[ 1339.777504] ret_from_fork_asm+0x1a/0x30
[ 1339.779163] </TASK>
Fix this by clearing both BIO_QOS_THROTTLED and BIO_QOS_MERGED flags
when bios are redirected to the multipath head in nvme_failover_req().
This is consistent with the existing code that clears REQ_POLLED and
REQ_NOWAIT flags when the bio changes queues.
Signed-off-by: Chaitanya Kulkarni <[email protected]>
Reviewed-by: Christoph Hellwig <[email protected]>
Link: https://patch.msgid.link/[email protected]
Signed-off-by: Jens Axboe <[email protected]>
blktests-ci Bot
pushed a commit
that referenced
this pull request
Mar 11, 2026
When a bio goes through the rq_qos infrastructure on a path's request
queue, it gets BIO_QOS_THROTTLED or BIO_QOS_MERGED flags set. These
flags indicate that rq_qos_done_bio() should be called on completion
to update rq_qos accounting.
During path failover in nvme_failover_req(), the bio's bi_bdev is
redirected from the failed path's disk to the multipath head's disk
via bio_set_dev(). However, the BIO_QOS flags are not cleared.
When the bio eventually completes (either successfully via a new path
or with an error via bio_io_error()), rq_qos_done_bio() checks for
these flags and calls __rq_qos_done_bio(q->rq_qos, bio) where q is
obtained from the bio's current bi_bdev - which is now the multipath
head's queue, not the original path's queue.
The multipath head's queue does not have rq_qos enabled (q->rq_qos is
NULL), but the code assumes that if BIO_QOS_* flags are set, q->rq_qos
must be valid.
This breaks when a bio is moved between queues during NVMe multipath
failover, leading to a NULL pointer dereference.
Execution Context timeline :-
* =====> dd process context
[USER] dd process
[SYSCALL] write() - dd process context
submit_bio()
nvme_ns_head_submit_bio() - path selection
blk_mq_submit_bio() #### QOS FLAGS SET HERE
[USER] dd waits or returns
==== I/O in flight on NVMe hardware =====
===== End of submission path ====
------------------------------------------------------
* dd ====> Interrupt context;
[IRQ] NVMe completion interrupt
nvme_irq()
nvme_complete_rq()
nvme_failover_req() ### BIO MOVED TO HEAD
spin_lock_irqsave (atomic section)
bio_set_dev() changes bi_bdev
### BUG: QOS flags NOT cleared
kblockd_schedule_work()
* Interrupt context =====> kblockd workqueue
[WQ] kblockd workqueue - kworker process
nvme_requeue_work()
submit_bio_noacct()
nvme_ns_head_submit_bio()
nvme_find_path() returns NULL
bio_io_error()
bio_endio()
rq_qos_done_bio() ### CRASH ###
KERNEL PANIC / OOPS
Crash from blktests nvme/058 (rapid namespace remapping):
[ 1339.636033] BUG: kernel NULL pointer dereference, address: 0000000000000000
[ 1339.641025] nvme nvme4: rescanning namespaces.
[ 1339.642064] #PF: supervisor read access in kernel mode
[ 1339.642067] #PF: error_code(0x0000) - not-present page
[ 1339.642070] PGD 0 P4D 0
[ 1339.642073] Oops: Oops: 0000 [#1] SMP NOPTI
[ 1339.642078] CPU: 35 UID: 0 PID: 4579 Comm: kworker/35:2H
Tainted: G O N 6.17.0-rc3nvme+ #5 PREEMPT(voluntary)
[ 1339.642084] Tainted: [O]=OOT_MODULE, [N]=TEST
[ 1339.673446] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996),
BIOS rel-1.16.3-0-ga6ed6b701f0a-prebuilt.qemu.org 04/01/2014
[ 1339.682359] Workqueue: kblockd nvme_requeue_work [nvme_core]
[ 1339.686613] RIP: 0010:__rq_qos_done_bio+0xd/0x40
[ 1339.690161] Code: 75 dd 5b 5d 41 5c c3 cc cc cc cc 66 90 90 90 90 90 90 90
90 90 90 90 90 90 90 90 90 90 0f 1f 44 00 00 55 48 89 f5
53 48 89 fb <48> 8b 03 48 8b 40 30 48 85 c0 74 0b 48 89 ee
48 89 df ff d0 0f 1f
[ 1339.703691] RSP: 0018:ffffc900066f3c90 EFLAGS: 00010202
[ 1339.706844] RAX: ffff888148b9ef00 RBX: 0000000000000000 RCX: 0000000000000000
[ 1339.711136] RDX: 00000000000001c0 RSI: ffff8882aaab8a80 RDI: 0000000000000000
[ 1339.715691] RBP: ffff8882aaab8a80 R08: 0000000000000000 R09: 0000000000000000
[ 1339.720472] R10: 0000000000000000 R11: fefefefefefefeff R12: ffff8882aa3b6010
[ 1339.724650] R13: 0000000000000000 R14: ffff8882338bcef0 R15: ffff8882aa3b6020
[ 1339.729029] FS: 0000000000000000(0000) GS:ffff88985c0cf000(0000) knlGS:0000000000000000
[ 1339.734525] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[ 1339.738563] CR2: 0000000000000000 CR3: 0000000111045000 CR4: 0000000000350ef0
[ 1339.742750] DR0: ffffffff845ccbec DR1: ffffffff845ccbed DR2: ffffffff845ccbee
[ 1339.745630] DR3: ffffffff845ccbef DR6: 00000000ffff0ff0 DR7: 0000000000000600
[ 1339.748488] Call Trace:
[ 1339.749512] <TASK>
[ 1339.750449] bio_endio+0x71/0x2e0
[ 1339.751833] nvme_ns_head_submit_bio+0x290/0x320 [nvme_core]
[ 1339.754073] __submit_bio+0x222/0x5e0
[ 1339.755623] ? rcu_is_watching+0xd/0x40
[ 1339.757201] ? submit_bio_noacct_nocheck+0x131/0x370
[ 1339.759210] submit_bio_noacct_nocheck+0x131/0x370
[ 1339.761189] ? submit_bio_noacct+0x20/0x620
[ 1339.762849] nvme_requeue_work+0x4b/0x60 [nvme_core]
[ 1339.764828] process_one_work+0x20e/0x630
[ 1339.766528] worker_thread+0x184/0x330
[ 1339.768129] ? __pfx_worker_thread+0x10/0x10
[ 1339.769942] kthread+0x10a/0x250
[ 1339.771263] ? __pfx_kthread+0x10/0x10
[ 1339.772776] ? __pfx_kthread+0x10/0x10
[ 1339.774381] ret_from_fork+0x273/0x2e0
[ 1339.775948] ? __pfx_kthread+0x10/0x10
[ 1339.777504] ret_from_fork_asm+0x1a/0x30
[ 1339.779163] </TASK>
Fix this by clearing both BIO_QOS_THROTTLED and BIO_QOS_MERGED flags
when bios are redirected to the multipath head in nvme_failover_req().
This is consistent with the existing code that clears REQ_POLLED and
REQ_NOWAIT flags when the bio changes queues.
Signed-off-by: Chaitanya Kulkarni <[email protected]>
Reviewed-by: Christoph Hellwig <[email protected]>
blktests-ci Bot
pushed a commit
that referenced
this pull request
Mar 12, 2026
When a bio goes through the rq_qos infrastructure on a path's request
queue, it gets BIO_QOS_THROTTLED or BIO_QOS_MERGED flags set. These
flags indicate that rq_qos_done_bio() should be called on completion
to update rq_qos accounting.
During path failover in nvme_failover_req(), the bio's bi_bdev is
redirected from the failed path's disk to the multipath head's disk
via bio_set_dev(). However, the BIO_QOS flags are not cleared.
When the bio eventually completes (either successfully via a new path
or with an error via bio_io_error()), rq_qos_done_bio() checks for
these flags and calls __rq_qos_done_bio(q->rq_qos, bio) where q is
obtained from the bio's current bi_bdev - which is now the multipath
head's queue, not the original path's queue.
The multipath head's queue does not have rq_qos enabled (q->rq_qos is
NULL), but the code assumes that if BIO_QOS_* flags are set, q->rq_qos
must be valid.
This breaks when a bio is moved between queues during NVMe multipath
failover, leading to a NULL pointer dereference.
Execution Context timeline :-
* =====> dd process context
[USER] dd process
[SYSCALL] write() - dd process context
submit_bio()
nvme_ns_head_submit_bio() - path selection
blk_mq_submit_bio() #### QOS FLAGS SET HERE
[USER] dd waits or returns
==== I/O in flight on NVMe hardware =====
===== End of submission path ====
------------------------------------------------------
* dd ====> Interrupt context;
[IRQ] NVMe completion interrupt
nvme_irq()
nvme_complete_rq()
nvme_failover_req() ### BIO MOVED TO HEAD
spin_lock_irqsave (atomic section)
bio_set_dev() changes bi_bdev
### BUG: QOS flags NOT cleared
kblockd_schedule_work()
* Interrupt context =====> kblockd workqueue
[WQ] kblockd workqueue - kworker process
nvme_requeue_work()
submit_bio_noacct()
nvme_ns_head_submit_bio()
nvme_find_path() returns NULL
bio_io_error()
bio_endio()
rq_qos_done_bio() ### CRASH ###
KERNEL PANIC / OOPS
Crash from blktests nvme/058 (rapid namespace remapping):
[ 1339.636033] BUG: kernel NULL pointer dereference, address: 0000000000000000
[ 1339.641025] nvme nvme4: rescanning namespaces.
[ 1339.642064] #PF: supervisor read access in kernel mode
[ 1339.642067] #PF: error_code(0x0000) - not-present page
[ 1339.642070] PGD 0 P4D 0
[ 1339.642073] Oops: Oops: 0000 [#1] SMP NOPTI
[ 1339.642078] CPU: 35 UID: 0 PID: 4579 Comm: kworker/35:2H
Tainted: G O N 6.17.0-rc3nvme+ #5 PREEMPT(voluntary)
[ 1339.642084] Tainted: [O]=OOT_MODULE, [N]=TEST
[ 1339.673446] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996),
BIOS rel-1.16.3-0-ga6ed6b701f0a-prebuilt.qemu.org 04/01/2014
[ 1339.682359] Workqueue: kblockd nvme_requeue_work [nvme_core]
[ 1339.686613] RIP: 0010:__rq_qos_done_bio+0xd/0x40
[ 1339.690161] Code: 75 dd 5b 5d 41 5c c3 cc cc cc cc 66 90 90 90 90 90 90 90
90 90 90 90 90 90 90 90 90 90 0f 1f 44 00 00 55 48 89 f5
53 48 89 fb <48> 8b 03 48 8b 40 30 48 85 c0 74 0b 48 89 ee
48 89 df ff d0 0f 1f
[ 1339.703691] RSP: 0018:ffffc900066f3c90 EFLAGS: 00010202
[ 1339.706844] RAX: ffff888148b9ef00 RBX: 0000000000000000 RCX: 0000000000000000
[ 1339.711136] RDX: 00000000000001c0 RSI: ffff8882aaab8a80 RDI: 0000000000000000
[ 1339.715691] RBP: ffff8882aaab8a80 R08: 0000000000000000 R09: 0000000000000000
[ 1339.720472] R10: 0000000000000000 R11: fefefefefefefeff R12: ffff8882aa3b6010
[ 1339.724650] R13: 0000000000000000 R14: ffff8882338bcef0 R15: ffff8882aa3b6020
[ 1339.729029] FS: 0000000000000000(0000) GS:ffff88985c0cf000(0000) knlGS:0000000000000000
[ 1339.734525] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[ 1339.738563] CR2: 0000000000000000 CR3: 0000000111045000 CR4: 0000000000350ef0
[ 1339.742750] DR0: ffffffff845ccbec DR1: ffffffff845ccbed DR2: ffffffff845ccbee
[ 1339.745630] DR3: ffffffff845ccbef DR6: 00000000ffff0ff0 DR7: 0000000000000600
[ 1339.748488] Call Trace:
[ 1339.749512] <TASK>
[ 1339.750449] bio_endio+0x71/0x2e0
[ 1339.751833] nvme_ns_head_submit_bio+0x290/0x320 [nvme_core]
[ 1339.754073] __submit_bio+0x222/0x5e0
[ 1339.755623] ? rcu_is_watching+0xd/0x40
[ 1339.757201] ? submit_bio_noacct_nocheck+0x131/0x370
[ 1339.759210] submit_bio_noacct_nocheck+0x131/0x370
[ 1339.761189] ? submit_bio_noacct+0x20/0x620
[ 1339.762849] nvme_requeue_work+0x4b/0x60 [nvme_core]
[ 1339.764828] process_one_work+0x20e/0x630
[ 1339.766528] worker_thread+0x184/0x330
[ 1339.768129] ? __pfx_worker_thread+0x10/0x10
[ 1339.769942] kthread+0x10a/0x250
[ 1339.771263] ? __pfx_kthread+0x10/0x10
[ 1339.772776] ? __pfx_kthread+0x10/0x10
[ 1339.774381] ret_from_fork+0x273/0x2e0
[ 1339.775948] ? __pfx_kthread+0x10/0x10
[ 1339.777504] ret_from_fork_asm+0x1a/0x30
[ 1339.779163] </TASK>
Fix this by clearing both BIO_QOS_THROTTLED and BIO_QOS_MERGED flags
when bios are redirected to the multipath head in nvme_failover_req().
This is consistent with the existing code that clears REQ_POLLED and
REQ_NOWAIT flags when the bio changes queues.
Signed-off-by: Chaitanya Kulkarni <[email protected]>
Reviewed-by: Christoph Hellwig <[email protected]>
blktests-ci Bot
pushed a commit
that referenced
this pull request
Mar 13, 2026
When a bio goes through the rq_qos infrastructure on a path's request
queue, it gets BIO_QOS_THROTTLED or BIO_QOS_MERGED flags set. These
flags indicate that rq_qos_done_bio() should be called on completion
to update rq_qos accounting.
During path failover in nvme_failover_req(), the bio's bi_bdev is
redirected from the failed path's disk to the multipath head's disk
via bio_set_dev(). However, the BIO_QOS flags are not cleared.
When the bio eventually completes (either successfully via a new path
or with an error via bio_io_error()), rq_qos_done_bio() checks for
these flags and calls __rq_qos_done_bio(q->rq_qos, bio) where q is
obtained from the bio's current bi_bdev - which is now the multipath
head's queue, not the original path's queue.
The multipath head's queue does not have rq_qos enabled (q->rq_qos is
NULL), but the code assumes that if BIO_QOS_* flags are set, q->rq_qos
must be valid.
This breaks when a bio is moved between queues during NVMe multipath
failover, leading to a NULL pointer dereference.
Execution Context timeline :-
* =====> dd process context
[USER] dd process
[SYSCALL] write() - dd process context
submit_bio()
nvme_ns_head_submit_bio() - path selection
blk_mq_submit_bio() #### QOS FLAGS SET HERE
[USER] dd waits or returns
==== I/O in flight on NVMe hardware =====
===== End of submission path ====
------------------------------------------------------
* dd ====> Interrupt context;
[IRQ] NVMe completion interrupt
nvme_irq()
nvme_complete_rq()
nvme_failover_req() ### BIO MOVED TO HEAD
spin_lock_irqsave (atomic section)
bio_set_dev() changes bi_bdev
### BUG: QOS flags NOT cleared
kblockd_schedule_work()
* Interrupt context =====> kblockd workqueue
[WQ] kblockd workqueue - kworker process
nvme_requeue_work()
submit_bio_noacct()
nvme_ns_head_submit_bio()
nvme_find_path() returns NULL
bio_io_error()
bio_endio()
rq_qos_done_bio() ### CRASH ###
KERNEL PANIC / OOPS
Crash from blktests nvme/058 (rapid namespace remapping):
[ 1339.636033] BUG: kernel NULL pointer dereference, address: 0000000000000000
[ 1339.641025] nvme nvme4: rescanning namespaces.
[ 1339.642064] #PF: supervisor read access in kernel mode
[ 1339.642067] #PF: error_code(0x0000) - not-present page
[ 1339.642070] PGD 0 P4D 0
[ 1339.642073] Oops: Oops: 0000 [#1] SMP NOPTI
[ 1339.642078] CPU: 35 UID: 0 PID: 4579 Comm: kworker/35:2H
Tainted: G O N 6.17.0-rc3nvme+ #5 PREEMPT(voluntary)
[ 1339.642084] Tainted: [O]=OOT_MODULE, [N]=TEST
[ 1339.673446] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996),
BIOS rel-1.16.3-0-ga6ed6b701f0a-prebuilt.qemu.org 04/01/2014
[ 1339.682359] Workqueue: kblockd nvme_requeue_work [nvme_core]
[ 1339.686613] RIP: 0010:__rq_qos_done_bio+0xd/0x40
[ 1339.690161] Code: 75 dd 5b 5d 41 5c c3 cc cc cc cc 66 90 90 90 90 90 90 90
90 90 90 90 90 90 90 90 90 90 0f 1f 44 00 00 55 48 89 f5
53 48 89 fb <48> 8b 03 48 8b 40 30 48 85 c0 74 0b 48 89 ee
48 89 df ff d0 0f 1f
[ 1339.703691] RSP: 0018:ffffc900066f3c90 EFLAGS: 00010202
[ 1339.706844] RAX: ffff888148b9ef00 RBX: 0000000000000000 RCX: 0000000000000000
[ 1339.711136] RDX: 00000000000001c0 RSI: ffff8882aaab8a80 RDI: 0000000000000000
[ 1339.715691] RBP: ffff8882aaab8a80 R08: 0000000000000000 R09: 0000000000000000
[ 1339.720472] R10: 0000000000000000 R11: fefefefefefefeff R12: ffff8882aa3b6010
[ 1339.724650] R13: 0000000000000000 R14: ffff8882338bcef0 R15: ffff8882aa3b6020
[ 1339.729029] FS: 0000000000000000(0000) GS:ffff88985c0cf000(0000) knlGS:0000000000000000
[ 1339.734525] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[ 1339.738563] CR2: 0000000000000000 CR3: 0000000111045000 CR4: 0000000000350ef0
[ 1339.742750] DR0: ffffffff845ccbec DR1: ffffffff845ccbed DR2: ffffffff845ccbee
[ 1339.745630] DR3: ffffffff845ccbef DR6: 00000000ffff0ff0 DR7: 0000000000000600
[ 1339.748488] Call Trace:
[ 1339.749512] <TASK>
[ 1339.750449] bio_endio+0x71/0x2e0
[ 1339.751833] nvme_ns_head_submit_bio+0x290/0x320 [nvme_core]
[ 1339.754073] __submit_bio+0x222/0x5e0
[ 1339.755623] ? rcu_is_watching+0xd/0x40
[ 1339.757201] ? submit_bio_noacct_nocheck+0x131/0x370
[ 1339.759210] submit_bio_noacct_nocheck+0x131/0x370
[ 1339.761189] ? submit_bio_noacct+0x20/0x620
[ 1339.762849] nvme_requeue_work+0x4b/0x60 [nvme_core]
[ 1339.764828] process_one_work+0x20e/0x630
[ 1339.766528] worker_thread+0x184/0x330
[ 1339.768129] ? __pfx_worker_thread+0x10/0x10
[ 1339.769942] kthread+0x10a/0x250
[ 1339.771263] ? __pfx_kthread+0x10/0x10
[ 1339.772776] ? __pfx_kthread+0x10/0x10
[ 1339.774381] ret_from_fork+0x273/0x2e0
[ 1339.775948] ? __pfx_kthread+0x10/0x10
[ 1339.777504] ret_from_fork_asm+0x1a/0x30
[ 1339.779163] </TASK>
Fix this by clearing both BIO_QOS_THROTTLED and BIO_QOS_MERGED flags
when bios are redirected to the multipath head in nvme_failover_req().
This is consistent with the existing code that clears REQ_POLLED and
REQ_NOWAIT flags when the bio changes queues.
Signed-off-by: Chaitanya Kulkarni <[email protected]>
Reviewed-by: Christoph Hellwig <[email protected]>
blktests-ci Bot
pushed a commit
that referenced
this pull request
Mar 15, 2026
When a bio goes through the rq_qos infrastructure on a path's request
queue, it gets BIO_QOS_THROTTLED or BIO_QOS_MERGED flags set. These
flags indicate that rq_qos_done_bio() should be called on completion
to update rq_qos accounting.
During path failover in nvme_failover_req(), the bio's bi_bdev is
redirected from the failed path's disk to the multipath head's disk
via bio_set_dev(). However, the BIO_QOS flags are not cleared.
When the bio eventually completes (either successfully via a new path
or with an error via bio_io_error()), rq_qos_done_bio() checks for
these flags and calls __rq_qos_done_bio(q->rq_qos, bio) where q is
obtained from the bio's current bi_bdev - which is now the multipath
head's queue, not the original path's queue.
The multipath head's queue does not have rq_qos enabled (q->rq_qos is
NULL), but the code assumes that if BIO_QOS_* flags are set, q->rq_qos
must be valid.
This breaks when a bio is moved between queues during NVMe multipath
failover, leading to a NULL pointer dereference.
Execution Context timeline :-
* =====> dd process context
[USER] dd process
[SYSCALL] write() - dd process context
submit_bio()
nvme_ns_head_submit_bio() - path selection
blk_mq_submit_bio() #### QOS FLAGS SET HERE
[USER] dd waits or returns
==== I/O in flight on NVMe hardware =====
===== End of submission path ====
------------------------------------------------------
* dd ====> Interrupt context;
[IRQ] NVMe completion interrupt
nvme_irq()
nvme_complete_rq()
nvme_failover_req() ### BIO MOVED TO HEAD
spin_lock_irqsave (atomic section)
bio_set_dev() changes bi_bdev
### BUG: QOS flags NOT cleared
kblockd_schedule_work()
* Interrupt context =====> kblockd workqueue
[WQ] kblockd workqueue - kworker process
nvme_requeue_work()
submit_bio_noacct()
nvme_ns_head_submit_bio()
nvme_find_path() returns NULL
bio_io_error()
bio_endio()
rq_qos_done_bio() ### CRASH ###
KERNEL PANIC / OOPS
Crash from blktests nvme/058 (rapid namespace remapping):
[ 1339.636033] BUG: kernel NULL pointer dereference, address: 0000000000000000
[ 1339.641025] nvme nvme4: rescanning namespaces.
[ 1339.642064] #PF: supervisor read access in kernel mode
[ 1339.642067] #PF: error_code(0x0000) - not-present page
[ 1339.642070] PGD 0 P4D 0
[ 1339.642073] Oops: Oops: 0000 [#1] SMP NOPTI
[ 1339.642078] CPU: 35 UID: 0 PID: 4579 Comm: kworker/35:2H
Tainted: G O N 6.17.0-rc3nvme+ #5 PREEMPT(voluntary)
[ 1339.642084] Tainted: [O]=OOT_MODULE, [N]=TEST
[ 1339.673446] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996),
BIOS rel-1.16.3-0-ga6ed6b701f0a-prebuilt.qemu.org 04/01/2014
[ 1339.682359] Workqueue: kblockd nvme_requeue_work [nvme_core]
[ 1339.686613] RIP: 0010:__rq_qos_done_bio+0xd/0x40
[ 1339.690161] Code: 75 dd 5b 5d 41 5c c3 cc cc cc cc 66 90 90 90 90 90 90 90
90 90 90 90 90 90 90 90 90 90 0f 1f 44 00 00 55 48 89 f5
53 48 89 fb <48> 8b 03 48 8b 40 30 48 85 c0 74 0b 48 89 ee
48 89 df ff d0 0f 1f
[ 1339.703691] RSP: 0018:ffffc900066f3c90 EFLAGS: 00010202
[ 1339.706844] RAX: ffff888148b9ef00 RBX: 0000000000000000 RCX: 0000000000000000
[ 1339.711136] RDX: 00000000000001c0 RSI: ffff8882aaab8a80 RDI: 0000000000000000
[ 1339.715691] RBP: ffff8882aaab8a80 R08: 0000000000000000 R09: 0000000000000000
[ 1339.720472] R10: 0000000000000000 R11: fefefefefefefeff R12: ffff8882aa3b6010
[ 1339.724650] R13: 0000000000000000 R14: ffff8882338bcef0 R15: ffff8882aa3b6020
[ 1339.729029] FS: 0000000000000000(0000) GS:ffff88985c0cf000(0000) knlGS:0000000000000000
[ 1339.734525] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[ 1339.738563] CR2: 0000000000000000 CR3: 0000000111045000 CR4: 0000000000350ef0
[ 1339.742750] DR0: ffffffff845ccbec DR1: ffffffff845ccbed DR2: ffffffff845ccbee
[ 1339.745630] DR3: ffffffff845ccbef DR6: 00000000ffff0ff0 DR7: 0000000000000600
[ 1339.748488] Call Trace:
[ 1339.749512] <TASK>
[ 1339.750449] bio_endio+0x71/0x2e0
[ 1339.751833] nvme_ns_head_submit_bio+0x290/0x320 [nvme_core]
[ 1339.754073] __submit_bio+0x222/0x5e0
[ 1339.755623] ? rcu_is_watching+0xd/0x40
[ 1339.757201] ? submit_bio_noacct_nocheck+0x131/0x370
[ 1339.759210] submit_bio_noacct_nocheck+0x131/0x370
[ 1339.761189] ? submit_bio_noacct+0x20/0x620
[ 1339.762849] nvme_requeue_work+0x4b/0x60 [nvme_core]
[ 1339.764828] process_one_work+0x20e/0x630
[ 1339.766528] worker_thread+0x184/0x330
[ 1339.768129] ? __pfx_worker_thread+0x10/0x10
[ 1339.769942] kthread+0x10a/0x250
[ 1339.771263] ? __pfx_kthread+0x10/0x10
[ 1339.772776] ? __pfx_kthread+0x10/0x10
[ 1339.774381] ret_from_fork+0x273/0x2e0
[ 1339.775948] ? __pfx_kthread+0x10/0x10
[ 1339.777504] ret_from_fork_asm+0x1a/0x30
[ 1339.779163] </TASK>
Fix this by clearing both BIO_QOS_THROTTLED and BIO_QOS_MERGED flags
when bios are redirected to the multipath head in nvme_failover_req().
This is consistent with the existing code that clears REQ_POLLED and
REQ_NOWAIT flags when the bio changes queues.
Signed-off-by: Chaitanya Kulkarni <[email protected]>
Reviewed-by: Christoph Hellwig <[email protected]>
blktests-ci Bot
pushed a commit
that referenced
this pull request
Mar 18, 2026
This patch fixes an out-of-bounds access in ceph_handle_auth_reply() that can be triggered by a message of type CEPH_MSG_AUTH_REPLY. In ceph_handle_auth_reply(), the value of the payload_len field of such a message is stored in a variable of type int. A value greater than INT_MAX leads to an integer overflow and is interpreted as a negative value. This leads to decrementing the pointer address by this value and subsequently accessing it because ceph_decode_need() only checks that the memory access does not exceed the end address of the allocation. This patch fixes the issue by changing the data type of payload_len to u32. Additionally, the data type of result_msg_len is changed to u32, as it is also a variable holding a non-negative length. Also, an additional layer of sanity checks is introduced, ensuring that directly after reading it from the message, payload_len and result_msg_len are not greater than the overall segment length. BUG: KASAN: slab-out-of-bounds in ceph_handle_auth_reply+0x642/0x7a0 [libceph] Read of size 4 at addr ffff88811404df14 by task kworker/20:1/262 CPU: 20 UID: 0 PID: 262 Comm: kworker/20:1 Not tainted 6.19.2 #5 PREEMPT(voluntary) Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.16.3-debian-1.16.3-2 04/01/2014 Workqueue: ceph-msgr ceph_con_workfn [libceph] Call Trace: <TASK> dump_stack_lvl+0x76/0xa0 print_report+0xd1/0x620 ? __pfx__raw_spin_lock_irqsave+0x10/0x10 ? kasan_complete_mode_report_info+0x72/0x210 kasan_report+0xe7/0x130 ? ceph_handle_auth_reply+0x642/0x7a0 [libceph] ? ceph_handle_auth_reply+0x642/0x7a0 [libceph] __asan_report_load_n_noabort+0xf/0x20 ceph_handle_auth_reply+0x642/0x7a0 [libceph] mon_dispatch+0x973/0x23d0 [libceph] ? apparmor_socket_recvmsg+0x6b/0xa0 ? __pfx_mon_dispatch+0x10/0x10 [libceph] ? __kasan_check_write+0x14/0x30i ? mutex_unlock+0x7f/0xd0 ? __pfx_mutex_unlock+0x10/0x10 ? __pfx_do_recvmsg+0x10/0x10 [libceph] ceph_con_process_message+0x1f1/0x650 [libceph] process_message+0x1e/0x450 [libceph] ceph_con_v2_try_read+0x2e48/0x6c80 [libceph] ? __pfx_ceph_con_v2_try_read+0x10/0x10 [libceph] ? save_fpregs_to_fpstate+0xb0/0x230 ? raw_spin_rq_unlock+0x17/0xa0 ? finish_task_switch.isra.0+0x13b/0x760 ? __switch_to+0x385/0xda0 ? __kasan_check_write+0x14/0x30 ? mutex_lock+0x8d/0xe0 ? __pfx_mutex_lock+0x10/0x10 ceph_con_workfn+0x248/0x10c0 [libceph] process_one_work+0x629/0xf80 ? __kasan_check_write+0x14/0x30 worker_thread+0x87f/0x1570 ? __pfx__raw_spin_lock_irqsave+0x10/0x10 ? __pfx_try_to_wake_up+0x10/0x10 ? kasan_print_address_stack_frame+0x1f7/0x280 ? __pfx_worker_thread+0x10/0x10 kthread+0x396/0x830 ? __pfx__raw_spin_lock_irq+0x10/0x10 ? __pfx_kthread+0x10/0x10 ? __kasan_check_write+0x14/0x30 ? recalc_sigpending+0x180/0x210 ? __pfx_kthread+0x10/0x10 ret_from_fork+0x3f7/0x610 ? __pfx_ret_from_fork+0x10/0x10 ? __switch_to+0x385/0xda0 ? __pfx_kthread+0x10/0x10 ret_from_fork_asm+0x1a/0x30 </TASK> [ idryomov: replace if statements with ceph_decode_need() for payload_len and result_msg_len ] Cc: [email protected] Signed-off-by: Raphael Zimmer <[email protected]> Reviewed-by: Viacheslav Dubeyko <[email protected]> Reviewed-by: Ilya Dryomov <[email protected]> Signed-off-by: Ilya Dryomov <[email protected]>
blktests-ci Bot
pushed a commit
that referenced
this pull request
Mar 18, 2026
When a bio goes through the rq_qos infrastructure on a path's request
queue, it gets BIO_QOS_THROTTLED or BIO_QOS_MERGED flags set. These
flags indicate that rq_qos_done_bio() should be called on completion
to update rq_qos accounting.
During path failover in nvme_failover_req(), the bio's bi_bdev is
redirected from the failed path's disk to the multipath head's disk
via bio_set_dev(). However, the BIO_QOS flags are not cleared.
When the bio eventually completes (either successfully via a new path
or with an error via bio_io_error()), rq_qos_done_bio() checks for
these flags and calls __rq_qos_done_bio(q->rq_qos, bio) where q is
obtained from the bio's current bi_bdev - which is now the multipath
head's queue, not the original path's queue.
The multipath head's queue does not have rq_qos enabled (q->rq_qos is
NULL), but the code assumes that if BIO_QOS_* flags are set, q->rq_qos
must be valid.
This breaks when a bio is moved between queues during NVMe multipath
failover, leading to a NULL pointer dereference.
Execution Context timeline :-
* =====> dd process context
[USER] dd process
[SYSCALL] write() - dd process context
submit_bio()
nvme_ns_head_submit_bio() - path selection
blk_mq_submit_bio() #### QOS FLAGS SET HERE
[USER] dd waits or returns
==== I/O in flight on NVMe hardware =====
===== End of submission path ====
------------------------------------------------------
* dd ====> Interrupt context;
[IRQ] NVMe completion interrupt
nvme_irq()
nvme_complete_rq()
nvme_failover_req() ### BIO MOVED TO HEAD
spin_lock_irqsave (atomic section)
bio_set_dev() changes bi_bdev
### BUG: QOS flags NOT cleared
kblockd_schedule_work()
* Interrupt context =====> kblockd workqueue
[WQ] kblockd workqueue - kworker process
nvme_requeue_work()
submit_bio_noacct()
nvme_ns_head_submit_bio()
nvme_find_path() returns NULL
bio_io_error()
bio_endio()
rq_qos_done_bio() ### CRASH ###
KERNEL PANIC / OOPS
Crash from blktests nvme/058 (rapid namespace remapping):
[ 1339.636033] BUG: kernel NULL pointer dereference, address: 0000000000000000
[ 1339.641025] nvme nvme4: rescanning namespaces.
[ 1339.642064] #PF: supervisor read access in kernel mode
[ 1339.642067] #PF: error_code(0x0000) - not-present page
[ 1339.642070] PGD 0 P4D 0
[ 1339.642073] Oops: Oops: 0000 [#1] SMP NOPTI
[ 1339.642078] CPU: 35 UID: 0 PID: 4579 Comm: kworker/35:2H
Tainted: G O N 6.17.0-rc3nvme+ #5 PREEMPT(voluntary)
[ 1339.642084] Tainted: [O]=OOT_MODULE, [N]=TEST
[ 1339.673446] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996),
BIOS rel-1.16.3-0-ga6ed6b701f0a-prebuilt.qemu.org 04/01/2014
[ 1339.682359] Workqueue: kblockd nvme_requeue_work [nvme_core]
[ 1339.686613] RIP: 0010:__rq_qos_done_bio+0xd/0x40
[ 1339.690161] Code: 75 dd 5b 5d 41 5c c3 cc cc cc cc 66 90 90 90 90 90 90 90
90 90 90 90 90 90 90 90 90 90 0f 1f 44 00 00 55 48 89 f5
53 48 89 fb <48> 8b 03 48 8b 40 30 48 85 c0 74 0b 48 89 ee
48 89 df ff d0 0f 1f
[ 1339.703691] RSP: 0018:ffffc900066f3c90 EFLAGS: 00010202
[ 1339.706844] RAX: ffff888148b9ef00 RBX: 0000000000000000 RCX: 0000000000000000
[ 1339.711136] RDX: 00000000000001c0 RSI: ffff8882aaab8a80 RDI: 0000000000000000
[ 1339.715691] RBP: ffff8882aaab8a80 R08: 0000000000000000 R09: 0000000000000000
[ 1339.720472] R10: 0000000000000000 R11: fefefefefefefeff R12: ffff8882aa3b6010
[ 1339.724650] R13: 0000000000000000 R14: ffff8882338bcef0 R15: ffff8882aa3b6020
[ 1339.729029] FS: 0000000000000000(0000) GS:ffff88985c0cf000(0000) knlGS:0000000000000000
[ 1339.734525] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[ 1339.738563] CR2: 0000000000000000 CR3: 0000000111045000 CR4: 0000000000350ef0
[ 1339.742750] DR0: ffffffff845ccbec DR1: ffffffff845ccbed DR2: ffffffff845ccbee
[ 1339.745630] DR3: ffffffff845ccbef DR6: 00000000ffff0ff0 DR7: 0000000000000600
[ 1339.748488] Call Trace:
[ 1339.749512] <TASK>
[ 1339.750449] bio_endio+0x71/0x2e0
[ 1339.751833] nvme_ns_head_submit_bio+0x290/0x320 [nvme_core]
[ 1339.754073] __submit_bio+0x222/0x5e0
[ 1339.755623] ? rcu_is_watching+0xd/0x40
[ 1339.757201] ? submit_bio_noacct_nocheck+0x131/0x370
[ 1339.759210] submit_bio_noacct_nocheck+0x131/0x370
[ 1339.761189] ? submit_bio_noacct+0x20/0x620
[ 1339.762849] nvme_requeue_work+0x4b/0x60 [nvme_core]
[ 1339.764828] process_one_work+0x20e/0x630
[ 1339.766528] worker_thread+0x184/0x330
[ 1339.768129] ? __pfx_worker_thread+0x10/0x10
[ 1339.769942] kthread+0x10a/0x250
[ 1339.771263] ? __pfx_kthread+0x10/0x10
[ 1339.772776] ? __pfx_kthread+0x10/0x10
[ 1339.774381] ret_from_fork+0x273/0x2e0
[ 1339.775948] ? __pfx_kthread+0x10/0x10
[ 1339.777504] ret_from_fork_asm+0x1a/0x30
[ 1339.779163] </TASK>
Fix this by clearing both BIO_QOS_THROTTLED and BIO_QOS_MERGED flags
when bios are redirected to the multipath head in nvme_failover_req().
This is consistent with the existing code that clears REQ_POLLED and
REQ_NOWAIT flags when the bio changes queues.
Signed-off-by: Chaitanya Kulkarni <[email protected]>
Reviewed-by: Christoph Hellwig <[email protected]>
blktests-ci Bot
pushed a commit
that referenced
this pull request
Mar 18, 2026
When a bio goes through the rq_qos infrastructure on a path's request
queue, it gets BIO_QOS_THROTTLED or BIO_QOS_MERGED flags set. These
flags indicate that rq_qos_done_bio() should be called on completion
to update rq_qos accounting.
During path failover in nvme_failover_req(), the bio's bi_bdev is
redirected from the failed path's disk to the multipath head's disk
via bio_set_dev(). However, the BIO_QOS flags are not cleared.
When the bio eventually completes (either successfully via a new path
or with an error via bio_io_error()), rq_qos_done_bio() checks for
these flags and calls __rq_qos_done_bio(q->rq_qos, bio) where q is
obtained from the bio's current bi_bdev - which is now the multipath
head's queue, not the original path's queue.
The multipath head's queue does not have rq_qos enabled (q->rq_qos is
NULL), but the code assumes that if BIO_QOS_* flags are set, q->rq_qos
must be valid.
This breaks when a bio is moved between queues during NVMe multipath
failover, leading to a NULL pointer dereference.
Execution Context timeline :-
* =====> dd process context
[USER] dd process
[SYSCALL] write() - dd process context
submit_bio()
nvme_ns_head_submit_bio() - path selection
blk_mq_submit_bio() #### QOS FLAGS SET HERE
[USER] dd waits or returns
==== I/O in flight on NVMe hardware =====
===== End of submission path ====
------------------------------------------------------
* dd ====> Interrupt context;
[IRQ] NVMe completion interrupt
nvme_irq()
nvme_complete_rq()
nvme_failover_req() ### BIO MOVED TO HEAD
spin_lock_irqsave (atomic section)
bio_set_dev() changes bi_bdev
### BUG: QOS flags NOT cleared
kblockd_schedule_work()
* Interrupt context =====> kblockd workqueue
[WQ] kblockd workqueue - kworker process
nvme_requeue_work()
submit_bio_noacct()
nvme_ns_head_submit_bio()
nvme_find_path() returns NULL
bio_io_error()
bio_endio()
rq_qos_done_bio() ### CRASH ###
KERNEL PANIC / OOPS
Crash from blktests nvme/058 (rapid namespace remapping):
[ 1339.636033] BUG: kernel NULL pointer dereference, address: 0000000000000000
[ 1339.641025] nvme nvme4: rescanning namespaces.
[ 1339.642064] #PF: supervisor read access in kernel mode
[ 1339.642067] #PF: error_code(0x0000) - not-present page
[ 1339.642070] PGD 0 P4D 0
[ 1339.642073] Oops: Oops: 0000 [#1] SMP NOPTI
[ 1339.642078] CPU: 35 UID: 0 PID: 4579 Comm: kworker/35:2H
Tainted: G O N 6.17.0-rc3nvme+ #5 PREEMPT(voluntary)
[ 1339.642084] Tainted: [O]=OOT_MODULE, [N]=TEST
[ 1339.673446] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996),
BIOS rel-1.16.3-0-ga6ed6b701f0a-prebuilt.qemu.org 04/01/2014
[ 1339.682359] Workqueue: kblockd nvme_requeue_work [nvme_core]
[ 1339.686613] RIP: 0010:__rq_qos_done_bio+0xd/0x40
[ 1339.690161] Code: 75 dd 5b 5d 41 5c c3 cc cc cc cc 66 90 90 90 90 90 90 90
90 90 90 90 90 90 90 90 90 90 0f 1f 44 00 00 55 48 89 f5
53 48 89 fb <48> 8b 03 48 8b 40 30 48 85 c0 74 0b 48 89 ee
48 89 df ff d0 0f 1f
[ 1339.703691] RSP: 0018:ffffc900066f3c90 EFLAGS: 00010202
[ 1339.706844] RAX: ffff888148b9ef00 RBX: 0000000000000000 RCX: 0000000000000000
[ 1339.711136] RDX: 00000000000001c0 RSI: ffff8882aaab8a80 RDI: 0000000000000000
[ 1339.715691] RBP: ffff8882aaab8a80 R08: 0000000000000000 R09: 0000000000000000
[ 1339.720472] R10: 0000000000000000 R11: fefefefefefefeff R12: ffff8882aa3b6010
[ 1339.724650] R13: 0000000000000000 R14: ffff8882338bcef0 R15: ffff8882aa3b6020
[ 1339.729029] FS: 0000000000000000(0000) GS:ffff88985c0cf000(0000) knlGS:0000000000000000
[ 1339.734525] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[ 1339.738563] CR2: 0000000000000000 CR3: 0000000111045000 CR4: 0000000000350ef0
[ 1339.742750] DR0: ffffffff845ccbec DR1: ffffffff845ccbed DR2: ffffffff845ccbee
[ 1339.745630] DR3: ffffffff845ccbef DR6: 00000000ffff0ff0 DR7: 0000000000000600
[ 1339.748488] Call Trace:
[ 1339.749512] <TASK>
[ 1339.750449] bio_endio+0x71/0x2e0
[ 1339.751833] nvme_ns_head_submit_bio+0x290/0x320 [nvme_core]
[ 1339.754073] __submit_bio+0x222/0x5e0
[ 1339.755623] ? rcu_is_watching+0xd/0x40
[ 1339.757201] ? submit_bio_noacct_nocheck+0x131/0x370
[ 1339.759210] submit_bio_noacct_nocheck+0x131/0x370
[ 1339.761189] ? submit_bio_noacct+0x20/0x620
[ 1339.762849] nvme_requeue_work+0x4b/0x60 [nvme_core]
[ 1339.764828] process_one_work+0x20e/0x630
[ 1339.766528] worker_thread+0x184/0x330
[ 1339.768129] ? __pfx_worker_thread+0x10/0x10
[ 1339.769942] kthread+0x10a/0x250
[ 1339.771263] ? __pfx_kthread+0x10/0x10
[ 1339.772776] ? __pfx_kthread+0x10/0x10
[ 1339.774381] ret_from_fork+0x273/0x2e0
[ 1339.775948] ? __pfx_kthread+0x10/0x10
[ 1339.777504] ret_from_fork_asm+0x1a/0x30
[ 1339.779163] </TASK>
Fix this by clearing both BIO_QOS_THROTTLED and BIO_QOS_MERGED flags
when bios are redirected to the multipath head in nvme_failover_req().
This is consistent with the existing code that clears REQ_POLLED and
REQ_NOWAIT flags when the bio changes queues.
Signed-off-by: Chaitanya Kulkarni <[email protected]>
Reviewed-by: Christoph Hellwig <[email protected]>
blktests-ci Bot
pushed a commit
that referenced
this pull request
Mar 22, 2026
When a bio goes through the rq_qos infrastructure on a path's request
queue, it gets BIO_QOS_THROTTLED or BIO_QOS_MERGED flags set. These
flags indicate that rq_qos_done_bio() should be called on completion
to update rq_qos accounting.
During path failover in nvme_failover_req(), the bio's bi_bdev is
redirected from the failed path's disk to the multipath head's disk
via bio_set_dev(). However, the BIO_QOS flags are not cleared.
When the bio eventually completes (either successfully via a new path
or with an error via bio_io_error()), rq_qos_done_bio() checks for
these flags and calls __rq_qos_done_bio(q->rq_qos, bio) where q is
obtained from the bio's current bi_bdev - which is now the multipath
head's queue, not the original path's queue.
The multipath head's queue does not have rq_qos enabled (q->rq_qos is
NULL), but the code assumes that if BIO_QOS_* flags are set, q->rq_qos
must be valid.
This breaks when a bio is moved between queues during NVMe multipath
failover, leading to a NULL pointer dereference.
Execution Context timeline :-
* =====> dd process context
[USER] dd process
[SYSCALL] write() - dd process context
submit_bio()
nvme_ns_head_submit_bio() - path selection
blk_mq_submit_bio() #### QOS FLAGS SET HERE
[USER] dd waits or returns
==== I/O in flight on NVMe hardware =====
===== End of submission path ====
------------------------------------------------------
* dd ====> Interrupt context;
[IRQ] NVMe completion interrupt
nvme_irq()
nvme_complete_rq()
nvme_failover_req() ### BIO MOVED TO HEAD
spin_lock_irqsave (atomic section)
bio_set_dev() changes bi_bdev
### BUG: QOS flags NOT cleared
kblockd_schedule_work()
* Interrupt context =====> kblockd workqueue
[WQ] kblockd workqueue - kworker process
nvme_requeue_work()
submit_bio_noacct()
nvme_ns_head_submit_bio()
nvme_find_path() returns NULL
bio_io_error()
bio_endio()
rq_qos_done_bio() ### CRASH ###
KERNEL PANIC / OOPS
Crash from blktests nvme/058 (rapid namespace remapping):
[ 1339.636033] BUG: kernel NULL pointer dereference, address: 0000000000000000
[ 1339.641025] nvme nvme4: rescanning namespaces.
[ 1339.642064] #PF: supervisor read access in kernel mode
[ 1339.642067] #PF: error_code(0x0000) - not-present page
[ 1339.642070] PGD 0 P4D 0
[ 1339.642073] Oops: Oops: 0000 [#1] SMP NOPTI
[ 1339.642078] CPU: 35 UID: 0 PID: 4579 Comm: kworker/35:2H
Tainted: G O N 6.17.0-rc3nvme+ #5 PREEMPT(voluntary)
[ 1339.642084] Tainted: [O]=OOT_MODULE, [N]=TEST
[ 1339.673446] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996),
BIOS rel-1.16.3-0-ga6ed6b701f0a-prebuilt.qemu.org 04/01/2014
[ 1339.682359] Workqueue: kblockd nvme_requeue_work [nvme_core]
[ 1339.686613] RIP: 0010:__rq_qos_done_bio+0xd/0x40
[ 1339.690161] Code: 75 dd 5b 5d 41 5c c3 cc cc cc cc 66 90 90 90 90 90 90 90
90 90 90 90 90 90 90 90 90 90 0f 1f 44 00 00 55 48 89 f5
53 48 89 fb <48> 8b 03 48 8b 40 30 48 85 c0 74 0b 48 89 ee
48 89 df ff d0 0f 1f
[ 1339.703691] RSP: 0018:ffffc900066f3c90 EFLAGS: 00010202
[ 1339.706844] RAX: ffff888148b9ef00 RBX: 0000000000000000 RCX: 0000000000000000
[ 1339.711136] RDX: 00000000000001c0 RSI: ffff8882aaab8a80 RDI: 0000000000000000
[ 1339.715691] RBP: ffff8882aaab8a80 R08: 0000000000000000 R09: 0000000000000000
[ 1339.720472] R10: 0000000000000000 R11: fefefefefefefeff R12: ffff8882aa3b6010
[ 1339.724650] R13: 0000000000000000 R14: ffff8882338bcef0 R15: ffff8882aa3b6020
[ 1339.729029] FS: 0000000000000000(0000) GS:ffff88985c0cf000(0000) knlGS:0000000000000000
[ 1339.734525] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[ 1339.738563] CR2: 0000000000000000 CR3: 0000000111045000 CR4: 0000000000350ef0
[ 1339.742750] DR0: ffffffff845ccbec DR1: ffffffff845ccbed DR2: ffffffff845ccbee
[ 1339.745630] DR3: ffffffff845ccbef DR6: 00000000ffff0ff0 DR7: 0000000000000600
[ 1339.748488] Call Trace:
[ 1339.749512] <TASK>
[ 1339.750449] bio_endio+0x71/0x2e0
[ 1339.751833] nvme_ns_head_submit_bio+0x290/0x320 [nvme_core]
[ 1339.754073] __submit_bio+0x222/0x5e0
[ 1339.755623] ? rcu_is_watching+0xd/0x40
[ 1339.757201] ? submit_bio_noacct_nocheck+0x131/0x370
[ 1339.759210] submit_bio_noacct_nocheck+0x131/0x370
[ 1339.761189] ? submit_bio_noacct+0x20/0x620
[ 1339.762849] nvme_requeue_work+0x4b/0x60 [nvme_core]
[ 1339.764828] process_one_work+0x20e/0x630
[ 1339.766528] worker_thread+0x184/0x330
[ 1339.768129] ? __pfx_worker_thread+0x10/0x10
[ 1339.769942] kthread+0x10a/0x250
[ 1339.771263] ? __pfx_kthread+0x10/0x10
[ 1339.772776] ? __pfx_kthread+0x10/0x10
[ 1339.774381] ret_from_fork+0x273/0x2e0
[ 1339.775948] ? __pfx_kthread+0x10/0x10
[ 1339.777504] ret_from_fork_asm+0x1a/0x30
[ 1339.779163] </TASK>
Fix this by clearing both BIO_QOS_THROTTLED and BIO_QOS_MERGED flags
when bios are redirected to the multipath head in nvme_failover_req().
This is consistent with the existing code that clears REQ_POLLED and
REQ_NOWAIT flags when the bio changes queues.
Signed-off-by: Chaitanya Kulkarni <[email protected]>
Reviewed-by: Christoph Hellwig <[email protected]>
blktests-ci Bot
pushed a commit
that referenced
this pull request
Mar 23, 2026
When a bio goes through the rq_qos infrastructure on a path's request
queue, it gets BIO_QOS_THROTTLED or BIO_QOS_MERGED flags set. These
flags indicate that rq_qos_done_bio() should be called on completion
to update rq_qos accounting.
During path failover in nvme_failover_req(), the bio's bi_bdev is
redirected from the failed path's disk to the multipath head's disk
via bio_set_dev(). However, the BIO_QOS flags are not cleared.
When the bio eventually completes (either successfully via a new path
or with an error via bio_io_error()), rq_qos_done_bio() checks for
these flags and calls __rq_qos_done_bio(q->rq_qos, bio) where q is
obtained from the bio's current bi_bdev - which is now the multipath
head's queue, not the original path's queue.
The multipath head's queue does not have rq_qos enabled (q->rq_qos is
NULL), but the code assumes that if BIO_QOS_* flags are set, q->rq_qos
must be valid.
This breaks when a bio is moved between queues during NVMe multipath
failover, leading to a NULL pointer dereference.
Execution Context timeline :-
* =====> dd process context
[USER] dd process
[SYSCALL] write() - dd process context
submit_bio()
nvme_ns_head_submit_bio() - path selection
blk_mq_submit_bio() #### QOS FLAGS SET HERE
[USER] dd waits or returns
==== I/O in flight on NVMe hardware =====
===== End of submission path ====
------------------------------------------------------
* dd ====> Interrupt context;
[IRQ] NVMe completion interrupt
nvme_irq()
nvme_complete_rq()
nvme_failover_req() ### BIO MOVED TO HEAD
spin_lock_irqsave (atomic section)
bio_set_dev() changes bi_bdev
### BUG: QOS flags NOT cleared
kblockd_schedule_work()
* Interrupt context =====> kblockd workqueue
[WQ] kblockd workqueue - kworker process
nvme_requeue_work()
submit_bio_noacct()
nvme_ns_head_submit_bio()
nvme_find_path() returns NULL
bio_io_error()
bio_endio()
rq_qos_done_bio() ### CRASH ###
KERNEL PANIC / OOPS
Crash from blktests nvme/058 (rapid namespace remapping):
[ 1339.636033] BUG: kernel NULL pointer dereference, address: 0000000000000000
[ 1339.641025] nvme nvme4: rescanning namespaces.
[ 1339.642064] #PF: supervisor read access in kernel mode
[ 1339.642067] #PF: error_code(0x0000) - not-present page
[ 1339.642070] PGD 0 P4D 0
[ 1339.642073] Oops: Oops: 0000 [#1] SMP NOPTI
[ 1339.642078] CPU: 35 UID: 0 PID: 4579 Comm: kworker/35:2H
Tainted: G O N 6.17.0-rc3nvme+ #5 PREEMPT(voluntary)
[ 1339.642084] Tainted: [O]=OOT_MODULE, [N]=TEST
[ 1339.673446] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996),
BIOS rel-1.16.3-0-ga6ed6b701f0a-prebuilt.qemu.org 04/01/2014
[ 1339.682359] Workqueue: kblockd nvme_requeue_work [nvme_core]
[ 1339.686613] RIP: 0010:__rq_qos_done_bio+0xd/0x40
[ 1339.690161] Code: 75 dd 5b 5d 41 5c c3 cc cc cc cc 66 90 90 90 90 90 90 90
90 90 90 90 90 90 90 90 90 90 0f 1f 44 00 00 55 48 89 f5
53 48 89 fb <48> 8b 03 48 8b 40 30 48 85 c0 74 0b 48 89 ee
48 89 df ff d0 0f 1f
[ 1339.703691] RSP: 0018:ffffc900066f3c90 EFLAGS: 00010202
[ 1339.706844] RAX: ffff888148b9ef00 RBX: 0000000000000000 RCX: 0000000000000000
[ 1339.711136] RDX: 00000000000001c0 RSI: ffff8882aaab8a80 RDI: 0000000000000000
[ 1339.715691] RBP: ffff8882aaab8a80 R08: 0000000000000000 R09: 0000000000000000
[ 1339.720472] R10: 0000000000000000 R11: fefefefefefefeff R12: ffff8882aa3b6010
[ 1339.724650] R13: 0000000000000000 R14: ffff8882338bcef0 R15: ffff8882aa3b6020
[ 1339.729029] FS: 0000000000000000(0000) GS:ffff88985c0cf000(0000) knlGS:0000000000000000
[ 1339.734525] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[ 1339.738563] CR2: 0000000000000000 CR3: 0000000111045000 CR4: 0000000000350ef0
[ 1339.742750] DR0: ffffffff845ccbec DR1: ffffffff845ccbed DR2: ffffffff845ccbee
[ 1339.745630] DR3: ffffffff845ccbef DR6: 00000000ffff0ff0 DR7: 0000000000000600
[ 1339.748488] Call Trace:
[ 1339.749512] <TASK>
[ 1339.750449] bio_endio+0x71/0x2e0
[ 1339.751833] nvme_ns_head_submit_bio+0x290/0x320 [nvme_core]
[ 1339.754073] __submit_bio+0x222/0x5e0
[ 1339.755623] ? rcu_is_watching+0xd/0x40
[ 1339.757201] ? submit_bio_noacct_nocheck+0x131/0x370
[ 1339.759210] submit_bio_noacct_nocheck+0x131/0x370
[ 1339.761189] ? submit_bio_noacct+0x20/0x620
[ 1339.762849] nvme_requeue_work+0x4b/0x60 [nvme_core]
[ 1339.764828] process_one_work+0x20e/0x630
[ 1339.766528] worker_thread+0x184/0x330
[ 1339.768129] ? __pfx_worker_thread+0x10/0x10
[ 1339.769942] kthread+0x10a/0x250
[ 1339.771263] ? __pfx_kthread+0x10/0x10
[ 1339.772776] ? __pfx_kthread+0x10/0x10
[ 1339.774381] ret_from_fork+0x273/0x2e0
[ 1339.775948] ? __pfx_kthread+0x10/0x10
[ 1339.777504] ret_from_fork_asm+0x1a/0x30
[ 1339.779163] </TASK>
Fix this by clearing both BIO_QOS_THROTTLED and BIO_QOS_MERGED flags
when bios are redirected to the multipath head in nvme_failover_req().
This is consistent with the existing code that clears REQ_POLLED and
REQ_NOWAIT flags when the bio changes queues.
Signed-off-by: Chaitanya Kulkarni <[email protected]>
Reviewed-by: Christoph Hellwig <[email protected]>
blktests-ci Bot
pushed a commit
that referenced
this pull request
Mar 24, 2026
When a bio goes through the rq_qos infrastructure on a path's request
queue, it gets BIO_QOS_THROTTLED or BIO_QOS_MERGED flags set. These
flags indicate that rq_qos_done_bio() should be called on completion
to update rq_qos accounting.
During path failover in nvme_failover_req(), the bio's bi_bdev is
redirected from the failed path's disk to the multipath head's disk
via bio_set_dev(). However, the BIO_QOS flags are not cleared.
When the bio eventually completes (either successfully via a new path
or with an error via bio_io_error()), rq_qos_done_bio() checks for
these flags and calls __rq_qos_done_bio(q->rq_qos, bio) where q is
obtained from the bio's current bi_bdev - which is now the multipath
head's queue, not the original path's queue.
The multipath head's queue does not have rq_qos enabled (q->rq_qos is
NULL), but the code assumes that if BIO_QOS_* flags are set, q->rq_qos
must be valid.
This breaks when a bio is moved between queues during NVMe multipath
failover, leading to a NULL pointer dereference.
Execution Context timeline :-
* =====> dd process context
[USER] dd process
[SYSCALL] write() - dd process context
submit_bio()
nvme_ns_head_submit_bio() - path selection
blk_mq_submit_bio() #### QOS FLAGS SET HERE
[USER] dd waits or returns
==== I/O in flight on NVMe hardware =====
===== End of submission path ====
------------------------------------------------------
* dd ====> Interrupt context;
[IRQ] NVMe completion interrupt
nvme_irq()
nvme_complete_rq()
nvme_failover_req() ### BIO MOVED TO HEAD
spin_lock_irqsave (atomic section)
bio_set_dev() changes bi_bdev
### BUG: QOS flags NOT cleared
kblockd_schedule_work()
* Interrupt context =====> kblockd workqueue
[WQ] kblockd workqueue - kworker process
nvme_requeue_work()
submit_bio_noacct()
nvme_ns_head_submit_bio()
nvme_find_path() returns NULL
bio_io_error()
bio_endio()
rq_qos_done_bio() ### CRASH ###
KERNEL PANIC / OOPS
Crash from blktests nvme/058 (rapid namespace remapping):
[ 1339.636033] BUG: kernel NULL pointer dereference, address: 0000000000000000
[ 1339.641025] nvme nvme4: rescanning namespaces.
[ 1339.642064] #PF: supervisor read access in kernel mode
[ 1339.642067] #PF: error_code(0x0000) - not-present page
[ 1339.642070] PGD 0 P4D 0
[ 1339.642073] Oops: Oops: 0000 [#1] SMP NOPTI
[ 1339.642078] CPU: 35 UID: 0 PID: 4579 Comm: kworker/35:2H
Tainted: G O N 6.17.0-rc3nvme+ #5 PREEMPT(voluntary)
[ 1339.642084] Tainted: [O]=OOT_MODULE, [N]=TEST
[ 1339.673446] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996),
BIOS rel-1.16.3-0-ga6ed6b701f0a-prebuilt.qemu.org 04/01/2014
[ 1339.682359] Workqueue: kblockd nvme_requeue_work [nvme_core]
[ 1339.686613] RIP: 0010:__rq_qos_done_bio+0xd/0x40
[ 1339.690161] Code: 75 dd 5b 5d 41 5c c3 cc cc cc cc 66 90 90 90 90 90 90 90
90 90 90 90 90 90 90 90 90 90 0f 1f 44 00 00 55 48 89 f5
53 48 89 fb <48> 8b 03 48 8b 40 30 48 85 c0 74 0b 48 89 ee
48 89 df ff d0 0f 1f
[ 1339.703691] RSP: 0018:ffffc900066f3c90 EFLAGS: 00010202
[ 1339.706844] RAX: ffff888148b9ef00 RBX: 0000000000000000 RCX: 0000000000000000
[ 1339.711136] RDX: 00000000000001c0 RSI: ffff8882aaab8a80 RDI: 0000000000000000
[ 1339.715691] RBP: ffff8882aaab8a80 R08: 0000000000000000 R09: 0000000000000000
[ 1339.720472] R10: 0000000000000000 R11: fefefefefefefeff R12: ffff8882aa3b6010
[ 1339.724650] R13: 0000000000000000 R14: ffff8882338bcef0 R15: ffff8882aa3b6020
[ 1339.729029] FS: 0000000000000000(0000) GS:ffff88985c0cf000(0000) knlGS:0000000000000000
[ 1339.734525] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[ 1339.738563] CR2: 0000000000000000 CR3: 0000000111045000 CR4: 0000000000350ef0
[ 1339.742750] DR0: ffffffff845ccbec DR1: ffffffff845ccbed DR2: ffffffff845ccbee
[ 1339.745630] DR3: ffffffff845ccbef DR6: 00000000ffff0ff0 DR7: 0000000000000600
[ 1339.748488] Call Trace:
[ 1339.749512] <TASK>
[ 1339.750449] bio_endio+0x71/0x2e0
[ 1339.751833] nvme_ns_head_submit_bio+0x290/0x320 [nvme_core]
[ 1339.754073] __submit_bio+0x222/0x5e0
[ 1339.755623] ? rcu_is_watching+0xd/0x40
[ 1339.757201] ? submit_bio_noacct_nocheck+0x131/0x370
[ 1339.759210] submit_bio_noacct_nocheck+0x131/0x370
[ 1339.761189] ? submit_bio_noacct+0x20/0x620
[ 1339.762849] nvme_requeue_work+0x4b/0x60 [nvme_core]
[ 1339.764828] process_one_work+0x20e/0x630
[ 1339.766528] worker_thread+0x184/0x330
[ 1339.768129] ? __pfx_worker_thread+0x10/0x10
[ 1339.769942] kthread+0x10a/0x250
[ 1339.771263] ? __pfx_kthread+0x10/0x10
[ 1339.772776] ? __pfx_kthread+0x10/0x10
[ 1339.774381] ret_from_fork+0x273/0x2e0
[ 1339.775948] ? __pfx_kthread+0x10/0x10
[ 1339.777504] ret_from_fork_asm+0x1a/0x30
[ 1339.779163] </TASK>
Fix this by clearing both BIO_QOS_THROTTLED and BIO_QOS_MERGED flags
when bios are redirected to the multipath head in nvme_failover_req().
This is consistent with the existing code that clears REQ_POLLED and
REQ_NOWAIT flags when the bio changes queues.
Signed-off-by: Chaitanya Kulkarni <[email protected]>
Reviewed-by: Christoph Hellwig <[email protected]>
blktests-ci Bot
pushed a commit
that referenced
this pull request
Mar 25, 2026
As reported by syzbot [0], NBD can trigger a deadlock during
memory reclaim.
This occurs when a process holds lock_sock() on a backend TCP
socket and triggers a memory allocation that leads to fs reclaim.
If it eventually calls into NBD to send data or shut down the
socket, NBD will attempt to acquire the same lock_sock(),
resulting in the deadlock.
While NBD sets sk->sk_allocation to GFP_NOIO before calling
sendmsg(), this does not prevent the issue in some paths where
GFP_KERNEL is used directly under lock_sock().
To resolve this, let's use lock_sock_try() for TCP sendmsg() and
shutdown().
For sock_sendmsg(), if lock_sock_try() fails, -ERESTARTSYS is
returned, allowing the request to be retried later (e.g., via
was_interrupted() logic).
For sock_sendmsg() for NBD_CMD_DISC and kernel_sock_shutdown(),
the operation might be skipped if the lock cannot be acquired.
However, this is not expected to occur in practice because the
backend TCP socket should not be touched by userspace once it is
handed over to NBD.
Note that sock_recvmsg() does not require this special handling
because it is only called from the workqueue context.
Also note that AF_UNIX sockets continue to use sock_sendmsg()
and kernel_sock_shutdown() because unix_stream_sendmsg() and
unix_shutdown() do not acquire lock_sock().
[0]:
WARNING: possible circular locking dependency detected
syzkaller #0 Tainted: G L
syz.7.2282/12353 is trying to acquire lock:
ffffffff8e9aa700 (fs_reclaim){+.+.}-{0:0}, at: might_alloc include/linux/sched/mm.h:317 [inline]
ffffffff8e9aa700 (fs_reclaim){+.+.}-{0:0}, at: slab_pre_alloc_hook mm/slub.c:4489 [inline]
ffffffff8e9aa700 (fs_reclaim){+.+.}-{0:0}, at: slab_alloc_node mm/slub.c:4843 [inline]
ffffffff8e9aa700 (fs_reclaim){+.+.}-{0:0}, at: kmem_cache_alloc_node_noprof+0x53/0x6f0 mm/slub.c:4918
but task is already holding lock:
ffff88806f972a20 (sk_lock-AF_INET6){+.+.}-{0:0}, at: lock_sock include/net/sock.h:1709 [inline]
ffff88806f972a20 (sk_lock-AF_INET6){+.+.}-{0:0}, at: tcp_close+0x1d/0x110 net/ipv4/tcp.c:3349
which lock already depends on the new lock.
the existing dependency chain (in reverse order) is:
-> #6 (sk_lock-AF_INET6){+.+.}-{0:0}:
lock_sock_nested+0x41/0xf0 net/core/sock.c:3780
lock_sock include/net/sock.h:1709 [inline]
inet_shutdown+0x67/0x410 net/ipv4/af_inet.c:919
nbd_mark_nsock_dead+0xae/0x5c0 drivers/block/nbd.c:318
sock_shutdown+0x16b/0x200 drivers/block/nbd.c:411
nbd_clear_sock drivers/block/nbd.c:1427 [inline]
nbd_config_put+0x1eb/0x750 drivers/block/nbd.c:1451
nbd_genl_connect+0xaf8/0x1a40 drivers/block/nbd.c:2248
genl_family_rcv_msg_doit+0x214/0x300 net/netlink/genetlink.c:1114
genl_family_rcv_msg net/netlink/genetlink.c:1194 [inline]
genl_rcv_msg+0x560/0x800 net/netlink/genetlink.c:1209
netlink_rcv_skb+0x159/0x420 net/netlink/af_netlink.c:2550
genl_rcv+0x28/0x40 net/netlink/genetlink.c:1218
netlink_unicast_kernel net/netlink/af_netlink.c:1318 [inline]
netlink_unicast+0x5aa/0x870 net/netlink/af_netlink.c:1344
netlink_sendmsg+0x8b0/0xda0 net/netlink/af_netlink.c:1894
sock_sendmsg_nosec net/socket.c:727 [inline]
__sock_sendmsg net/socket.c:742 [inline]
____sys_sendmsg+0x9e1/0xb70 net/socket.c:2592
___sys_sendmsg+0x190/0x1e0 net/socket.c:2646
__sys_sendmsg+0x170/0x220 net/socket.c:2678
do_syscall_x64 arch/x86/entry/syscall_64.c:63 [inline]
do_syscall_64+0x106/0xf80 arch/x86/entry/syscall_64.c:94
entry_SYSCALL_64_after_hwframe+0x77/0x7f
-> #5 (&nsock->tx_lock){+.+.}-{4:4}:
__mutex_lock_common kernel/locking/mutex.c:614 [inline]
__mutex_lock+0x1a2/0x1b90 kernel/locking/mutex.c:776
nbd_handle_cmd drivers/block/nbd.c:1143 [inline]
nbd_queue_rq+0x428/0x1080 drivers/block/nbd.c:1207
blk_mq_dispatch_rq_list+0x422/0x1e70 block/blk-mq.c:2148
__blk_mq_do_dispatch_sched block/blk-mq-sched.c:168 [inline]
blk_mq_do_dispatch_sched block/blk-mq-sched.c:182 [inline]
__blk_mq_sched_dispatch_requests+0xcea/0x1620 block/blk-mq-sched.c:307
blk_mq_sched_dispatch_requests+0xd7/0x1c0 block/blk-mq-sched.c:329
blk_mq_run_hw_queue+0x23c/0x670 block/blk-mq.c:2386
blk_mq_dispatch_list+0x51d/0x1360 block/blk-mq.c:2949
blk_mq_flush_plug_list block/blk-mq.c:2997 [inline]
blk_mq_flush_plug_list+0x130/0x600 block/blk-mq.c:2969
__blk_flush_plug+0x2c4/0x4b0 block/blk-core.c:1230
blk_finish_plug block/blk-core.c:1257 [inline]
__submit_bio+0x584/0x6c0 block/blk-core.c:649
__submit_bio_noacct_mq block/blk-core.c:722 [inline]
submit_bio_noacct_nocheck+0x562/0xc10 block/blk-core.c:753
submit_bio_noacct+0xd17/0x2010 block/blk-core.c:884
blk_crypto_submit_bio include/linux/blk-crypto.h:203 [inline]
submit_bh_wbc+0x59c/0x770 fs/buffer.c:2821
submit_bh fs/buffer.c:2826 [inline]
block_read_full_folio+0x264/0x8e0 fs/buffer.c:2444
filemap_read_folio+0xfc/0x3b0 mm/filemap.c:2501
do_read_cache_folio+0x2d7/0x6b0 mm/filemap.c:4101
read_mapping_folio include/linux/pagemap.h:1028 [inline]
read_part_sector+0xd1/0x370 block/partitions/core.c:723
adfspart_check_ICS+0x93/0x910 block/partitions/acorn.c:360
check_partition block/partitions/core.c:142 [inline]
blk_add_partitions block/partitions/core.c:590 [inline]
bdev_disk_changed+0x7f8/0xc80 block/partitions/core.c:694
blkdev_get_whole+0x187/0x290 block/bdev.c:764
bdev_open+0x2c7/0xe40 block/bdev.c:973
blkdev_open+0x34e/0x4f0 block/fops.c:697
do_dentry_open+0x6d8/0x1660 fs/open.c:949
vfs_open+0x82/0x3f0 fs/open.c:1081
do_open fs/namei.c:4671 [inline]
path_openat+0x208c/0x31a0 fs/namei.c:4830
do_file_open+0x20e/0x430 fs/namei.c:4859
do_sys_openat2+0x10d/0x1e0 fs/open.c:1366
do_sys_open fs/open.c:1372 [inline]
__do_sys_openat fs/open.c:1388 [inline]
__se_sys_openat fs/open.c:1383 [inline]
__x64_sys_openat+0x12d/0x210 fs/open.c:1383
do_syscall_x64 arch/x86/entry/syscall_64.c:63 [inline]
do_syscall_64+0x106/0xf80 arch/x86/entry/syscall_64.c:94
entry_SYSCALL_64_after_hwframe+0x77/0x7f
-> #4 (&cmd->lock){+.+.}-{4:4}:
__mutex_lock_common kernel/locking/mutex.c:614 [inline]
__mutex_lock+0x1a2/0x1b90 kernel/locking/mutex.c:776
nbd_queue_rq+0xba/0x1080 drivers/block/nbd.c:1199
blk_mq_dispatch_rq_list+0x422/0x1e70 block/blk-mq.c:2148
__blk_mq_do_dispatch_sched block/blk-mq-sched.c:168 [inline]
blk_mq_do_dispatch_sched block/blk-mq-sched.c:182 [inline]
__blk_mq_sched_dispatch_requests+0xcea/0x1620 block/blk-mq-sched.c:307
blk_mq_sched_dispatch_requests+0xd7/0x1c0 block/blk-mq-sched.c:329
blk_mq_run_hw_queue+0x23c/0x670 block/blk-mq.c:2386
blk_mq_dispatch_list+0x51d/0x1360 block/blk-mq.c:2949
blk_mq_flush_plug_list block/blk-mq.c:2997 [inline]
blk_mq_flush_plug_list+0x130/0x600 block/blk-mq.c:2969
__blk_flush_plug+0x2c4/0x4b0 block/blk-core.c:1230
blk_finish_plug block/blk-core.c:1257 [inline]
__submit_bio+0x584/0x6c0 block/blk-core.c:649
__submit_bio_noacct_mq block/blk-core.c:722 [inline]
submit_bio_noacct_nocheck+0x562/0xc10 block/blk-core.c:753
submit_bio_noacct+0xd17/0x2010 block/blk-core.c:884
blk_crypto_submit_bio include/linux/blk-crypto.h:203 [inline]
submit_bh_wbc+0x59c/0x770 fs/buffer.c:2821
submit_bh fs/buffer.c:2826 [inline]
block_read_full_folio+0x264/0x8e0 fs/buffer.c:2444
filemap_read_folio+0xfc/0x3b0 mm/filemap.c:2501
do_read_cache_folio+0x2d7/0x6b0 mm/filemap.c:4101
read_mapping_folio include/linux/pagemap.h:1028 [inline]
read_part_sector+0xd1/0x370 block/partitions/core.c:723
adfspart_check_ICS+0x93/0x910 block/partitions/acorn.c:360
check_partition block/partitions/core.c:142 [inline]
blk_add_partitions block/partitions/core.c:590 [inline]
bdev_disk_changed+0x7f8/0xc80 block/partitions/core.c:694
blkdev_get_whole+0x187/0x290 block/bdev.c:764
bdev_open+0x2c7/0xe40 block/bdev.c:973
blkdev_open+0x34e/0x4f0 block/fops.c:697
do_dentry_open+0x6d8/0x1660 fs/open.c:949
vfs_open+0x82/0x3f0 fs/open.c:1081
do_open fs/namei.c:4671 [inline]
path_openat+0x208c/0x31a0 fs/namei.c:4830
do_file_open+0x20e/0x430 fs/namei.c:4859
do_sys_openat2+0x10d/0x1e0 fs/open.c:1366
do_sys_open fs/open.c:1372 [inline]
__do_sys_openat fs/open.c:1388 [inline]
__se_sys_openat fs/open.c:1383 [inline]
__x64_sys_openat+0x12d/0x210 fs/open.c:1383
do_syscall_x64 arch/x86/entry/syscall_64.c:63 [inline]
do_syscall_64+0x106/0xf80 arch/x86/entry/syscall_64.c:94
entry_SYSCALL_64_after_hwframe+0x77/0x7f
-> #3 (set->srcu){.+.+}-{0:0}:
srcu_lock_sync include/linux/srcu.h:199 [inline]
__synchronize_srcu+0xa1/0x2a0 kernel/rcu/srcutree.c:1505
blk_mq_wait_quiesce_done block/blk-mq.c:284 [inline]
blk_mq_wait_quiesce_done block/blk-mq.c:281 [inline]
blk_mq_quiesce_queue block/blk-mq.c:304 [inline]
blk_mq_quiesce_queue+0x149/0x1c0 block/blk-mq.c:299
elevator_switch+0x17b/0x7e0 block/elevator.c:576
elevator_change+0x352/0x530 block/elevator.c:681
elevator_set_default+0x29e/0x360 block/elevator.c:754
blk_register_queue+0x412/0x590 block/blk-sysfs.c:946
__add_disk+0x73f/0xe40 block/genhd.c:528
add_disk_fwnode+0x118/0x5c0 block/genhd.c:597
add_disk include/linux/blkdev.h:785 [inline]
nbd_dev_add+0x77a/0xb10 drivers/block/nbd.c:1984
nbd_init+0x291/0x2b0 drivers/block/nbd.c:2692
do_one_initcall+0x11d/0x760 init/main.c:1382
do_initcall_level init/main.c:1444 [inline]
do_initcalls init/main.c:1460 [inline]
do_basic_setup init/main.c:1479 [inline]
kernel_init_freeable+0x6e5/0x7a0 init/main.c:1692
kernel_init+0x1f/0x1e0 init/main.c:1582
ret_from_fork+0x754/0xd80 arch/x86/kernel/process.c:158
ret_from_fork_asm+0x1a/0x30 arch/x86/entry/entry_64.S:245
-> #2 (&q->elevator_lock){+.+.}-{4:4}:
__mutex_lock_common kernel/locking/mutex.c:614 [inline]
__mutex_lock+0x1a2/0x1b90 kernel/locking/mutex.c:776
elevator_change+0x1bc/0x530 block/elevator.c:679
elevator_set_none+0x92/0xf0 block/elevator.c:769
blk_mq_elv_switch_none block/blk-mq.c:5110 [inline]
__blk_mq_update_nr_hw_queues block/blk-mq.c:5155 [inline]
blk_mq_update_nr_hw_queues+0x4c1/0x15f0 block/blk-mq.c:5220
nbd_start_device+0x1a6/0xbd0 drivers/block/nbd.c:1489
nbd_genl_connect+0xff2/0x1a40 drivers/block/nbd.c:2239
genl_family_rcv_msg_doit+0x214/0x300 net/netlink/genetlink.c:1114
genl_family_rcv_msg net/netlink/genetlink.c:1194 [inline]
genl_rcv_msg+0x560/0x800 net/netlink/genetlink.c:1209
netlink_rcv_skb+0x159/0x420 net/netlink/af_netlink.c:2550
genl_rcv+0x28/0x40 net/netlink/genetlink.c:1218
netlink_unicast_kernel net/netlink/af_netlink.c:1318 [inline]
netlink_unicast+0x5aa/0x870 net/netlink/af_netlink.c:1344
netlink_sendmsg+0x8b0/0xda0 net/netlink/af_netlink.c:1894
sock_sendmsg_nosec net/socket.c:727 [inline]
__sock_sendmsg net/socket.c:742 [inline]
____sys_sendmsg+0x9e1/0xb70 net/socket.c:2592
___sys_sendmsg+0x190/0x1e0 net/socket.c:2646
__sys_sendmsg+0x170/0x220 net/socket.c:2678
do_syscall_x64 arch/x86/entry/syscall_64.c:63 [inline]
do_syscall_64+0x106/0xf80 arch/x86/entry/syscall_64.c:94
entry_SYSCALL_64_after_hwframe+0x77/0x7f
-> #1 (&q->q_usage_counter(io)#49){++++}-{0:0}:
blk_alloc_queue+0x610/0x790 block/blk-core.c:461
blk_mq_alloc_queue+0x174/0x290 block/blk-mq.c:4429
__blk_mq_alloc_disk+0x29/0x120 block/blk-mq.c:4476
nbd_dev_add+0x492/0xb10 drivers/block/nbd.c:1954
nbd_init+0x291/0x2b0 drivers/block/nbd.c:2692
do_one_initcall+0x11d/0x760 init/main.c:1382
do_initcall_level init/main.c:1444 [inline]
do_initcalls init/main.c:1460 [inline]
do_basic_setup init/main.c:1479 [inline]
kernel_init_freeable+0x6e5/0x7a0 init/main.c:1692
kernel_init+0x1f/0x1e0 init/main.c:1582
ret_from_fork+0x754/0xd80 arch/x86/kernel/process.c:158
ret_from_fork_asm+0x1a/0x30 arch/x86/entry/entry_64.S:245
-> #0 (fs_reclaim){+.+.}-{0:0}:
check_prev_add kernel/locking/lockdep.c:3165 [inline]
check_prevs_add kernel/locking/lockdep.c:3284 [inline]
validate_chain kernel/locking/lockdep.c:3908 [inline]
__lock_acquire+0x14b8/0x2630 kernel/locking/lockdep.c:5237
lock_acquire kernel/locking/lockdep.c:5868 [inline]
lock_acquire+0x1cf/0x380 kernel/locking/lockdep.c:5825
__fs_reclaim_acquire mm/page_alloc.c:4348 [inline]
fs_reclaim_acquire+0xc4/0x100 mm/page_alloc.c:4362
might_alloc include/linux/sched/mm.h:317 [inline]
slab_pre_alloc_hook mm/slub.c:4489 [inline]
slab_alloc_node mm/slub.c:4843 [inline]
kmem_cache_alloc_node_noprof+0x53/0x6f0 mm/slub.c:4918
__alloc_skb+0x140/0x710 net/core/skbuff.c:702
alloc_skb include/linux/skbuff.h:1383 [inline]
tcp_send_active_reset+0x8b/0xa60 net/ipv4/tcp_output.c:3862
__tcp_close+0x41e/0x1110 net/ipv4/tcp.c:3223
tcp_close+0x28/0x110 net/ipv4/tcp.c:3350
inet_release+0xed/0x200 net/ipv4/af_inet.c:443
inet6_release+0x4f/0x70 net/ipv6/af_inet6.c:479
__sock_release+0xb3/0x260 net/socket.c:662
sock_close+0x1c/0x30 net/socket.c:1455
__fput+0x3ff/0xb40 fs/file_table.c:469
task_work_run+0x150/0x240 kernel/task_work.c:233
resume_user_mode_work include/linux/resume_user_mode.h:50 [inline]
__exit_to_user_mode_loop kernel/entry/common.c:67 [inline]
exit_to_user_mode_loop+0x100/0x4a0 kernel/entry/common.c:98
__exit_to_user_mode_prepare include/linux/irq-entry-common.h:226 [inline]
syscall_exit_to_user_mode_prepare include/linux/irq-entry-common.h:256 [inline]
syscall_exit_to_user_mode include/linux/entry-common.h:325 [inline]
do_syscall_64+0x67c/0xf80 arch/x86/entry/syscall_64.c:100
entry_SYSCALL_64_after_hwframe+0x77/0x7f
other info that might help us debug this:
Chain exists of:
fs_reclaim --> &nsock->tx_lock --> sk_lock-AF_INET6
Possible unsafe locking scenario:
CPU0 CPU1
---- ----
lock(sk_lock-AF_INET6);
lock(&nsock->tx_lock);
lock(sk_lock-AF_INET6);
lock(fs_reclaim);
*** DEADLOCK ***
Fixes: fd8383f ("nbd: convert to blkmq")
Reported-by: [email protected]
Closes: https://lore.kernel.org/netdev/[email protected]/
Signed-off-by: Kuniyuki Iwashima <[email protected]>
blktests-ci Bot
pushed a commit
that referenced
this pull request
Mar 25, 2026
As reported by syzbot [0], NBD can trigger a deadlock during
memory reclaim.
This occurs when a process holds lock_sock() on a backend TCP
socket and triggers a memory allocation that leads to fs reclaim.
If it eventually calls into NBD to send data or shut down the
socket, NBD will attempt to acquire the same lock_sock(),
resulting in the deadlock.
While NBD sets sk->sk_allocation to GFP_NOIO before calling
sendmsg(), this does not prevent the issue in some paths where
GFP_KERNEL is used directly under lock_sock().
To resolve this, let's use lock_sock_try() for TCP sendmsg() and
shutdown().
For sock_sendmsg(), if lock_sock_try() fails, -ERESTARTSYS is
returned, allowing the request to be retried later (e.g., via
was_interrupted() logic).
For sock_sendmsg() for NBD_CMD_DISC and kernel_sock_shutdown(),
the operation might be skipped if the lock cannot be acquired.
However, this is not expected to occur in practice because the
backend TCP socket should not be touched by userspace once it is
handed over to NBD.
Note that sock_recvmsg() does not require this special handling
because it is only called from the workqueue context.
Also note that AF_UNIX sockets continue to use sock_sendmsg()
and kernel_sock_shutdown() because unix_stream_sendmsg() and
unix_shutdown() do not acquire lock_sock().
[0]:
WARNING: possible circular locking dependency detected
syzkaller #0 Tainted: G L
syz.7.2282/12353 is trying to acquire lock:
ffffffff8e9aa700 (fs_reclaim){+.+.}-{0:0}, at: might_alloc include/linux/sched/mm.h:317 [inline]
ffffffff8e9aa700 (fs_reclaim){+.+.}-{0:0}, at: slab_pre_alloc_hook mm/slub.c:4489 [inline]
ffffffff8e9aa700 (fs_reclaim){+.+.}-{0:0}, at: slab_alloc_node mm/slub.c:4843 [inline]
ffffffff8e9aa700 (fs_reclaim){+.+.}-{0:0}, at: kmem_cache_alloc_node_noprof+0x53/0x6f0 mm/slub.c:4918
but task is already holding lock:
ffff88806f972a20 (sk_lock-AF_INET6){+.+.}-{0:0}, at: lock_sock include/net/sock.h:1709 [inline]
ffff88806f972a20 (sk_lock-AF_INET6){+.+.}-{0:0}, at: tcp_close+0x1d/0x110 net/ipv4/tcp.c:3349
which lock already depends on the new lock.
the existing dependency chain (in reverse order) is:
-> #6 (sk_lock-AF_INET6){+.+.}-{0:0}:
lock_sock_nested+0x41/0xf0 net/core/sock.c:3780
lock_sock include/net/sock.h:1709 [inline]
inet_shutdown+0x67/0x410 net/ipv4/af_inet.c:919
nbd_mark_nsock_dead+0xae/0x5c0 drivers/block/nbd.c:318
sock_shutdown+0x16b/0x200 drivers/block/nbd.c:411
nbd_clear_sock drivers/block/nbd.c:1427 [inline]
nbd_config_put+0x1eb/0x750 drivers/block/nbd.c:1451
nbd_genl_connect+0xaf8/0x1a40 drivers/block/nbd.c:2248
genl_family_rcv_msg_doit+0x214/0x300 net/netlink/genetlink.c:1114
genl_family_rcv_msg net/netlink/genetlink.c:1194 [inline]
genl_rcv_msg+0x560/0x800 net/netlink/genetlink.c:1209
netlink_rcv_skb+0x159/0x420 net/netlink/af_netlink.c:2550
genl_rcv+0x28/0x40 net/netlink/genetlink.c:1218
netlink_unicast_kernel net/netlink/af_netlink.c:1318 [inline]
netlink_unicast+0x5aa/0x870 net/netlink/af_netlink.c:1344
netlink_sendmsg+0x8b0/0xda0 net/netlink/af_netlink.c:1894
sock_sendmsg_nosec net/socket.c:727 [inline]
__sock_sendmsg net/socket.c:742 [inline]
____sys_sendmsg+0x9e1/0xb70 net/socket.c:2592
___sys_sendmsg+0x190/0x1e0 net/socket.c:2646
__sys_sendmsg+0x170/0x220 net/socket.c:2678
do_syscall_x64 arch/x86/entry/syscall_64.c:63 [inline]
do_syscall_64+0x106/0xf80 arch/x86/entry/syscall_64.c:94
entry_SYSCALL_64_after_hwframe+0x77/0x7f
-> #5 (&nsock->tx_lock){+.+.}-{4:4}:
__mutex_lock_common kernel/locking/mutex.c:614 [inline]
__mutex_lock+0x1a2/0x1b90 kernel/locking/mutex.c:776
nbd_handle_cmd drivers/block/nbd.c:1143 [inline]
nbd_queue_rq+0x428/0x1080 drivers/block/nbd.c:1207
blk_mq_dispatch_rq_list+0x422/0x1e70 block/blk-mq.c:2148
__blk_mq_do_dispatch_sched block/blk-mq-sched.c:168 [inline]
blk_mq_do_dispatch_sched block/blk-mq-sched.c:182 [inline]
__blk_mq_sched_dispatch_requests+0xcea/0x1620 block/blk-mq-sched.c:307
blk_mq_sched_dispatch_requests+0xd7/0x1c0 block/blk-mq-sched.c:329
blk_mq_run_hw_queue+0x23c/0x670 block/blk-mq.c:2386
blk_mq_dispatch_list+0x51d/0x1360 block/blk-mq.c:2949
blk_mq_flush_plug_list block/blk-mq.c:2997 [inline]
blk_mq_flush_plug_list+0x130/0x600 block/blk-mq.c:2969
__blk_flush_plug+0x2c4/0x4b0 block/blk-core.c:1230
blk_finish_plug block/blk-core.c:1257 [inline]
__submit_bio+0x584/0x6c0 block/blk-core.c:649
__submit_bio_noacct_mq block/blk-core.c:722 [inline]
submit_bio_noacct_nocheck+0x562/0xc10 block/blk-core.c:753
submit_bio_noacct+0xd17/0x2010 block/blk-core.c:884
blk_crypto_submit_bio include/linux/blk-crypto.h:203 [inline]
submit_bh_wbc+0x59c/0x770 fs/buffer.c:2821
submit_bh fs/buffer.c:2826 [inline]
block_read_full_folio+0x264/0x8e0 fs/buffer.c:2444
filemap_read_folio+0xfc/0x3b0 mm/filemap.c:2501
do_read_cache_folio+0x2d7/0x6b0 mm/filemap.c:4101
read_mapping_folio include/linux/pagemap.h:1028 [inline]
read_part_sector+0xd1/0x370 block/partitions/core.c:723
adfspart_check_ICS+0x93/0x910 block/partitions/acorn.c:360
check_partition block/partitions/core.c:142 [inline]
blk_add_partitions block/partitions/core.c:590 [inline]
bdev_disk_changed+0x7f8/0xc80 block/partitions/core.c:694
blkdev_get_whole+0x187/0x290 block/bdev.c:764
bdev_open+0x2c7/0xe40 block/bdev.c:973
blkdev_open+0x34e/0x4f0 block/fops.c:697
do_dentry_open+0x6d8/0x1660 fs/open.c:949
vfs_open+0x82/0x3f0 fs/open.c:1081
do_open fs/namei.c:4671 [inline]
path_openat+0x208c/0x31a0 fs/namei.c:4830
do_file_open+0x20e/0x430 fs/namei.c:4859
do_sys_openat2+0x10d/0x1e0 fs/open.c:1366
do_sys_open fs/open.c:1372 [inline]
__do_sys_openat fs/open.c:1388 [inline]
__se_sys_openat fs/open.c:1383 [inline]
__x64_sys_openat+0x12d/0x210 fs/open.c:1383
do_syscall_x64 arch/x86/entry/syscall_64.c:63 [inline]
do_syscall_64+0x106/0xf80 arch/x86/entry/syscall_64.c:94
entry_SYSCALL_64_after_hwframe+0x77/0x7f
-> #4 (&cmd->lock){+.+.}-{4:4}:
__mutex_lock_common kernel/locking/mutex.c:614 [inline]
__mutex_lock+0x1a2/0x1b90 kernel/locking/mutex.c:776
nbd_queue_rq+0xba/0x1080 drivers/block/nbd.c:1199
blk_mq_dispatch_rq_list+0x422/0x1e70 block/blk-mq.c:2148
__blk_mq_do_dispatch_sched block/blk-mq-sched.c:168 [inline]
blk_mq_do_dispatch_sched block/blk-mq-sched.c:182 [inline]
__blk_mq_sched_dispatch_requests+0xcea/0x1620 block/blk-mq-sched.c:307
blk_mq_sched_dispatch_requests+0xd7/0x1c0 block/blk-mq-sched.c:329
blk_mq_run_hw_queue+0x23c/0x670 block/blk-mq.c:2386
blk_mq_dispatch_list+0x51d/0x1360 block/blk-mq.c:2949
blk_mq_flush_plug_list block/blk-mq.c:2997 [inline]
blk_mq_flush_plug_list+0x130/0x600 block/blk-mq.c:2969
__blk_flush_plug+0x2c4/0x4b0 block/blk-core.c:1230
blk_finish_plug block/blk-core.c:1257 [inline]
__submit_bio+0x584/0x6c0 block/blk-core.c:649
__submit_bio_noacct_mq block/blk-core.c:722 [inline]
submit_bio_noacct_nocheck+0x562/0xc10 block/blk-core.c:753
submit_bio_noacct+0xd17/0x2010 block/blk-core.c:884
blk_crypto_submit_bio include/linux/blk-crypto.h:203 [inline]
submit_bh_wbc+0x59c/0x770 fs/buffer.c:2821
submit_bh fs/buffer.c:2826 [inline]
block_read_full_folio+0x264/0x8e0 fs/buffer.c:2444
filemap_read_folio+0xfc/0x3b0 mm/filemap.c:2501
do_read_cache_folio+0x2d7/0x6b0 mm/filemap.c:4101
read_mapping_folio include/linux/pagemap.h:1028 [inline]
read_part_sector+0xd1/0x370 block/partitions/core.c:723
adfspart_check_ICS+0x93/0x910 block/partitions/acorn.c:360
check_partition block/partitions/core.c:142 [inline]
blk_add_partitions block/partitions/core.c:590 [inline]
bdev_disk_changed+0x7f8/0xc80 block/partitions/core.c:694
blkdev_get_whole+0x187/0x290 block/bdev.c:764
bdev_open+0x2c7/0xe40 block/bdev.c:973
blkdev_open+0x34e/0x4f0 block/fops.c:697
do_dentry_open+0x6d8/0x1660 fs/open.c:949
vfs_open+0x82/0x3f0 fs/open.c:1081
do_open fs/namei.c:4671 [inline]
path_openat+0x208c/0x31a0 fs/namei.c:4830
do_file_open+0x20e/0x430 fs/namei.c:4859
do_sys_openat2+0x10d/0x1e0 fs/open.c:1366
do_sys_open fs/open.c:1372 [inline]
__do_sys_openat fs/open.c:1388 [inline]
__se_sys_openat fs/open.c:1383 [inline]
__x64_sys_openat+0x12d/0x210 fs/open.c:1383
do_syscall_x64 arch/x86/entry/syscall_64.c:63 [inline]
do_syscall_64+0x106/0xf80 arch/x86/entry/syscall_64.c:94
entry_SYSCALL_64_after_hwframe+0x77/0x7f
-> #3 (set->srcu){.+.+}-{0:0}:
srcu_lock_sync include/linux/srcu.h:199 [inline]
__synchronize_srcu+0xa1/0x2a0 kernel/rcu/srcutree.c:1505
blk_mq_wait_quiesce_done block/blk-mq.c:284 [inline]
blk_mq_wait_quiesce_done block/blk-mq.c:281 [inline]
blk_mq_quiesce_queue block/blk-mq.c:304 [inline]
blk_mq_quiesce_queue+0x149/0x1c0 block/blk-mq.c:299
elevator_switch+0x17b/0x7e0 block/elevator.c:576
elevator_change+0x352/0x530 block/elevator.c:681
elevator_set_default+0x29e/0x360 block/elevator.c:754
blk_register_queue+0x412/0x590 block/blk-sysfs.c:946
__add_disk+0x73f/0xe40 block/genhd.c:528
add_disk_fwnode+0x118/0x5c0 block/genhd.c:597
add_disk include/linux/blkdev.h:785 [inline]
nbd_dev_add+0x77a/0xb10 drivers/block/nbd.c:1984
nbd_init+0x291/0x2b0 drivers/block/nbd.c:2692
do_one_initcall+0x11d/0x760 init/main.c:1382
do_initcall_level init/main.c:1444 [inline]
do_initcalls init/main.c:1460 [inline]
do_basic_setup init/main.c:1479 [inline]
kernel_init_freeable+0x6e5/0x7a0 init/main.c:1692
kernel_init+0x1f/0x1e0 init/main.c:1582
ret_from_fork+0x754/0xd80 arch/x86/kernel/process.c:158
ret_from_fork_asm+0x1a/0x30 arch/x86/entry/entry_64.S:245
-> #2 (&q->elevator_lock){+.+.}-{4:4}:
__mutex_lock_common kernel/locking/mutex.c:614 [inline]
__mutex_lock+0x1a2/0x1b90 kernel/locking/mutex.c:776
elevator_change+0x1bc/0x530 block/elevator.c:679
elevator_set_none+0x92/0xf0 block/elevator.c:769
blk_mq_elv_switch_none block/blk-mq.c:5110 [inline]
__blk_mq_update_nr_hw_queues block/blk-mq.c:5155 [inline]
blk_mq_update_nr_hw_queues+0x4c1/0x15f0 block/blk-mq.c:5220
nbd_start_device+0x1a6/0xbd0 drivers/block/nbd.c:1489
nbd_genl_connect+0xff2/0x1a40 drivers/block/nbd.c:2239
genl_family_rcv_msg_doit+0x214/0x300 net/netlink/genetlink.c:1114
genl_family_rcv_msg net/netlink/genetlink.c:1194 [inline]
genl_rcv_msg+0x560/0x800 net/netlink/genetlink.c:1209
netlink_rcv_skb+0x159/0x420 net/netlink/af_netlink.c:2550
genl_rcv+0x28/0x40 net/netlink/genetlink.c:1218
netlink_unicast_kernel net/netlink/af_netlink.c:1318 [inline]
netlink_unicast+0x5aa/0x870 net/netlink/af_netlink.c:1344
netlink_sendmsg+0x8b0/0xda0 net/netlink/af_netlink.c:1894
sock_sendmsg_nosec net/socket.c:727 [inline]
__sock_sendmsg net/socket.c:742 [inline]
____sys_sendmsg+0x9e1/0xb70 net/socket.c:2592
___sys_sendmsg+0x190/0x1e0 net/socket.c:2646
__sys_sendmsg+0x170/0x220 net/socket.c:2678
do_syscall_x64 arch/x86/entry/syscall_64.c:63 [inline]
do_syscall_64+0x106/0xf80 arch/x86/entry/syscall_64.c:94
entry_SYSCALL_64_after_hwframe+0x77/0x7f
-> #1 (&q->q_usage_counter(io)#49){++++}-{0:0}:
blk_alloc_queue+0x610/0x790 block/blk-core.c:461
blk_mq_alloc_queue+0x174/0x290 block/blk-mq.c:4429
__blk_mq_alloc_disk+0x29/0x120 block/blk-mq.c:4476
nbd_dev_add+0x492/0xb10 drivers/block/nbd.c:1954
nbd_init+0x291/0x2b0 drivers/block/nbd.c:2692
do_one_initcall+0x11d/0x760 init/main.c:1382
do_initcall_level init/main.c:1444 [inline]
do_initcalls init/main.c:1460 [inline]
do_basic_setup init/main.c:1479 [inline]
kernel_init_freeable+0x6e5/0x7a0 init/main.c:1692
kernel_init+0x1f/0x1e0 init/main.c:1582
ret_from_fork+0x754/0xd80 arch/x86/kernel/process.c:158
ret_from_fork_asm+0x1a/0x30 arch/x86/entry/entry_64.S:245
-> #0 (fs_reclaim){+.+.}-{0:0}:
check_prev_add kernel/locking/lockdep.c:3165 [inline]
check_prevs_add kernel/locking/lockdep.c:3284 [inline]
validate_chain kernel/locking/lockdep.c:3908 [inline]
__lock_acquire+0x14b8/0x2630 kernel/locking/lockdep.c:5237
lock_acquire kernel/locking/lockdep.c:5868 [inline]
lock_acquire+0x1cf/0x380 kernel/locking/lockdep.c:5825
__fs_reclaim_acquire mm/page_alloc.c:4348 [inline]
fs_reclaim_acquire+0xc4/0x100 mm/page_alloc.c:4362
might_alloc include/linux/sched/mm.h:317 [inline]
slab_pre_alloc_hook mm/slub.c:4489 [inline]
slab_alloc_node mm/slub.c:4843 [inline]
kmem_cache_alloc_node_noprof+0x53/0x6f0 mm/slub.c:4918
__alloc_skb+0x140/0x710 net/core/skbuff.c:702
alloc_skb include/linux/skbuff.h:1383 [inline]
tcp_send_active_reset+0x8b/0xa60 net/ipv4/tcp_output.c:3862
__tcp_close+0x41e/0x1110 net/ipv4/tcp.c:3223
tcp_close+0x28/0x110 net/ipv4/tcp.c:3350
inet_release+0xed/0x200 net/ipv4/af_inet.c:443
inet6_release+0x4f/0x70 net/ipv6/af_inet6.c:479
__sock_release+0xb3/0x260 net/socket.c:662
sock_close+0x1c/0x30 net/socket.c:1455
__fput+0x3ff/0xb40 fs/file_table.c:469
task_work_run+0x150/0x240 kernel/task_work.c:233
resume_user_mode_work include/linux/resume_user_mode.h:50 [inline]
__exit_to_user_mode_loop kernel/entry/common.c:67 [inline]
exit_to_user_mode_loop+0x100/0x4a0 kernel/entry/common.c:98
__exit_to_user_mode_prepare include/linux/irq-entry-common.h:226 [inline]
syscall_exit_to_user_mode_prepare include/linux/irq-entry-common.h:256 [inline]
syscall_exit_to_user_mode include/linux/entry-common.h:325 [inline]
do_syscall_64+0x67c/0xf80 arch/x86/entry/syscall_64.c:100
entry_SYSCALL_64_after_hwframe+0x77/0x7f
other info that might help us debug this:
Chain exists of:
fs_reclaim --> &nsock->tx_lock --> sk_lock-AF_INET6
Possible unsafe locking scenario:
CPU0 CPU1
---- ----
lock(sk_lock-AF_INET6);
lock(&nsock->tx_lock);
lock(sk_lock-AF_INET6);
lock(fs_reclaim);
*** DEADLOCK ***
Fixes: fd8383f ("nbd: convert to blkmq")
Reported-by: [email protected]
Closes: https://lore.kernel.org/netdev/[email protected]/
Signed-off-by: Kuniyuki Iwashima <[email protected]>
blktests-ci Bot
pushed a commit
that referenced
this pull request
Mar 25, 2026
When a bio goes through the rq_qos infrastructure on a path's request
queue, it gets BIO_QOS_THROTTLED or BIO_QOS_MERGED flags set. These
flags indicate that rq_qos_done_bio() should be called on completion
to update rq_qos accounting.
During path failover in nvme_failover_req(), the bio's bi_bdev is
redirected from the failed path's disk to the multipath head's disk
via bio_set_dev(). However, the BIO_QOS flags are not cleared.
When the bio eventually completes (either successfully via a new path
or with an error via bio_io_error()), rq_qos_done_bio() checks for
these flags and calls __rq_qos_done_bio(q->rq_qos, bio) where q is
obtained from the bio's current bi_bdev - which is now the multipath
head's queue, not the original path's queue.
The multipath head's queue does not have rq_qos enabled (q->rq_qos is
NULL), but the code assumes that if BIO_QOS_* flags are set, q->rq_qos
must be valid.
This breaks when a bio is moved between queues during NVMe multipath
failover, leading to a NULL pointer dereference.
Execution Context timeline :-
* =====> dd process context
[USER] dd process
[SYSCALL] write() - dd process context
submit_bio()
nvme_ns_head_submit_bio() - path selection
blk_mq_submit_bio() #### QOS FLAGS SET HERE
[USER] dd waits or returns
==== I/O in flight on NVMe hardware =====
===== End of submission path ====
------------------------------------------------------
* dd ====> Interrupt context;
[IRQ] NVMe completion interrupt
nvme_irq()
nvme_complete_rq()
nvme_failover_req() ### BIO MOVED TO HEAD
spin_lock_irqsave (atomic section)
bio_set_dev() changes bi_bdev
### BUG: QOS flags NOT cleared
kblockd_schedule_work()
* Interrupt context =====> kblockd workqueue
[WQ] kblockd workqueue - kworker process
nvme_requeue_work()
submit_bio_noacct()
nvme_ns_head_submit_bio()
nvme_find_path() returns NULL
bio_io_error()
bio_endio()
rq_qos_done_bio() ### CRASH ###
KERNEL PANIC / OOPS
Crash from blktests nvme/058 (rapid namespace remapping):
[ 1339.636033] BUG: kernel NULL pointer dereference, address: 0000000000000000
[ 1339.641025] nvme nvme4: rescanning namespaces.
[ 1339.642064] #PF: supervisor read access in kernel mode
[ 1339.642067] #PF: error_code(0x0000) - not-present page
[ 1339.642070] PGD 0 P4D 0
[ 1339.642073] Oops: Oops: 0000 [#1] SMP NOPTI
[ 1339.642078] CPU: 35 UID: 0 PID: 4579 Comm: kworker/35:2H
Tainted: G O N 6.17.0-rc3nvme+ #5 PREEMPT(voluntary)
[ 1339.642084] Tainted: [O]=OOT_MODULE, [N]=TEST
[ 1339.673446] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996),
BIOS rel-1.16.3-0-ga6ed6b701f0a-prebuilt.qemu.org 04/01/2014
[ 1339.682359] Workqueue: kblockd nvme_requeue_work [nvme_core]
[ 1339.686613] RIP: 0010:__rq_qos_done_bio+0xd/0x40
[ 1339.690161] Code: 75 dd 5b 5d 41 5c c3 cc cc cc cc 66 90 90 90 90 90 90 90
90 90 90 90 90 90 90 90 90 90 0f 1f 44 00 00 55 48 89 f5
53 48 89 fb <48> 8b 03 48 8b 40 30 48 85 c0 74 0b 48 89 ee
48 89 df ff d0 0f 1f
[ 1339.703691] RSP: 0018:ffffc900066f3c90 EFLAGS: 00010202
[ 1339.706844] RAX: ffff888148b9ef00 RBX: 0000000000000000 RCX: 0000000000000000
[ 1339.711136] RDX: 00000000000001c0 RSI: ffff8882aaab8a80 RDI: 0000000000000000
[ 1339.715691] RBP: ffff8882aaab8a80 R08: 0000000000000000 R09: 0000000000000000
[ 1339.720472] R10: 0000000000000000 R11: fefefefefefefeff R12: ffff8882aa3b6010
[ 1339.724650] R13: 0000000000000000 R14: ffff8882338bcef0 R15: ffff8882aa3b6020
[ 1339.729029] FS: 0000000000000000(0000) GS:ffff88985c0cf000(0000) knlGS:0000000000000000
[ 1339.734525] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[ 1339.738563] CR2: 0000000000000000 CR3: 0000000111045000 CR4: 0000000000350ef0
[ 1339.742750] DR0: ffffffff845ccbec DR1: ffffffff845ccbed DR2: ffffffff845ccbee
[ 1339.745630] DR3: ffffffff845ccbef DR6: 00000000ffff0ff0 DR7: 0000000000000600
[ 1339.748488] Call Trace:
[ 1339.749512] <TASK>
[ 1339.750449] bio_endio+0x71/0x2e0
[ 1339.751833] nvme_ns_head_submit_bio+0x290/0x320 [nvme_core]
[ 1339.754073] __submit_bio+0x222/0x5e0
[ 1339.755623] ? rcu_is_watching+0xd/0x40
[ 1339.757201] ? submit_bio_noacct_nocheck+0x131/0x370
[ 1339.759210] submit_bio_noacct_nocheck+0x131/0x370
[ 1339.761189] ? submit_bio_noacct+0x20/0x620
[ 1339.762849] nvme_requeue_work+0x4b/0x60 [nvme_core]
[ 1339.764828] process_one_work+0x20e/0x630
[ 1339.766528] worker_thread+0x184/0x330
[ 1339.768129] ? __pfx_worker_thread+0x10/0x10
[ 1339.769942] kthread+0x10a/0x250
[ 1339.771263] ? __pfx_kthread+0x10/0x10
[ 1339.772776] ? __pfx_kthread+0x10/0x10
[ 1339.774381] ret_from_fork+0x273/0x2e0
[ 1339.775948] ? __pfx_kthread+0x10/0x10
[ 1339.777504] ret_from_fork_asm+0x1a/0x30
[ 1339.779163] </TASK>
Fix this by clearing both BIO_QOS_THROTTLED and BIO_QOS_MERGED flags
when bios are redirected to the multipath head in nvme_failover_req().
This is consistent with the existing code that clears REQ_POLLED and
REQ_NOWAIT flags when the bio changes queues.
Signed-off-by: Chaitanya Kulkarni <[email protected]>
Reviewed-by: Christoph Hellwig <[email protected]>
blktests-ci Bot
pushed a commit
that referenced
this pull request
Mar 27, 2026
The devm_free_irq() and devm_request_irq() functions should not be
executed in an atomic context.
During device suspend, all userspace processes and most kernel threads
are frozen. Additionally, we flush all tx/rx status, disable all macb
interrupts, and halt rx operations. Therefore, it is safe to split the
region protected by bp->lock into two independent sections, allowing
devm_free_irq() and devm_request_irq() to run in a non-atomic context.
This modification resolves the following lockdep warning:
BUG: sleeping function called from invalid context at kernel/locking/mutex.c:591
in_atomic(): 1, irqs_disabled(): 1, non_block: 0, pid: 501, name: rtcwake
preempt_count: 1, expected: 0
RCU nest depth: 1, expected: 0
7 locks held by rtcwake/501:
#0: ffff0008038c3408 (sb_writers#5){.+.+}-{0:0}, at: vfs_write+0xf8/0x368
#1: ffff0008049a5e88 (&of->mutex#2){+.+.}-{4:4}, at: kernfs_fop_write_iter+0xbc/0x1c8
#2: ffff00080098d588 (kn->active#70){.+.+}-{0:0}, at: kernfs_fop_write_iter+0xcc/0x1c8
#3: ffff800081c84888 (system_transition_mutex){+.+.}-{4:4}, at: pm_suspend+0x1ec/0x290
#4: ffff0008009ba0f8 (&dev->mutex){....}-{4:4}, at: device_suspend+0x118/0x4f0
#5: ffff800081d00458 (rcu_read_lock){....}-{1:3}, at: rcu_lock_acquire+0x4/0x48
#6: ffff0008031fb9e0 (&bp->lock){-.-.}-{3:3}, at: macb_suspend+0x144/0x558
irq event stamp: 8682
hardirqs last enabled at (8681): [<ffff8000813c7d7c>] _raw_spin_unlock_irqrestore+0x44/0x88
hardirqs last disabled at (8682): [<ffff8000813c7b58>] _raw_spin_lock_irqsave+0x38/0x98
softirqs last enabled at (7322): [<ffff8000800f1b4c>] handle_softirqs+0x52c/0x588
softirqs last disabled at (7317): [<ffff800080010310>] __do_softirq+0x20/0x2c
CPU: 1 UID: 0 PID: 501 Comm: rtcwake Not tainted 7.0.0-rc3-next-20260310-yocto-standard+ #125 PREEMPT
Hardware name: ZynqMP ZCU102 Rev1.1 (DT)
Call trace:
show_stack+0x24/0x38 (C)
__dump_stack+0x28/0x38
dump_stack_lvl+0x64/0x88
dump_stack+0x18/0x24
__might_resched+0x200/0x218
__might_sleep+0x38/0x98
__mutex_lock_common+0x7c/0x1378
mutex_lock_nested+0x38/0x50
free_irq+0x68/0x2b0
devm_irq_release+0x24/0x38
devres_release+0x40/0x80
devm_free_irq+0x48/0x88
macb_suspend+0x298/0x558
device_suspend+0x218/0x4f0
dpm_suspend+0x244/0x3a0
dpm_suspend_start+0x50/0x78
suspend_devices_and_enter+0xec/0x560
pm_suspend+0x194/0x290
state_store+0x110/0x158
kobj_attr_store+0x1c/0x30
sysfs_kf_write+0xa8/0xd0
kernfs_fop_write_iter+0x11c/0x1c8
vfs_write+0x248/0x368
ksys_write+0x7c/0xf8
__arm64_sys_write+0x28/0x40
invoke_syscall+0x4c/0xe8
el0_svc_common+0x98/0xf0
do_el0_svc+0x28/0x40
el0_svc+0x54/0x1e0
el0t_64_sync_handler+0x84/0x130
el0t_64_sync+0x198/0x1a0
Fixes: 558e35c ("net: macb: WoL support for GEM type of Ethernet controller")
Cc: [email protected]
Reviewed-by: Théo Lebrun <[email protected]>
Signed-off-by: Kevin Hao <[email protected]>
Link: https://patch.msgid.link/[email protected]
Signed-off-by: Jakub Kicinski <[email protected]>
blktests-ci Bot
pushed a commit
that referenced
this pull request
Mar 27, 2026
As reported by syzbot [0], NBD can trigger a deadlock during
memory reclaim.
This occurs when a process holds lock_sock() on a backend TCP
socket and triggers a memory allocation that leads to fs reclaim.
If it eventually calls into NBD to send data or shut down the
socket, NBD will attempt to acquire the same lock_sock(),
resulting in the deadlock.
While NBD sets sk->sk_allocation to GFP_NOIO before calling
sendmsg(), this does not prevent the issue in some paths where
GFP_KERNEL is used directly under lock_sock().
To resolve this, let's use lock_sock_try() for TCP sendmsg() and
shutdown().
For sock_sendmsg(), if lock_sock_try() fails, -ERESTARTSYS is
returned, allowing the request to be retried later (e.g., via
was_interrupted() logic).
For sock_sendmsg() for NBD_CMD_DISC and kernel_sock_shutdown(),
the operation might be skipped if the lock cannot be acquired.
However, this is not expected to occur in practice because the
backend TCP socket should not be touched by userspace once it is
handed over to NBD.
Note that sock_recvmsg() does not require this special handling
because it is only called from the workqueue context.
Also note that AF_UNIX sockets continue to use sock_sendmsg()
and kernel_sock_shutdown() because unix_stream_sendmsg() and
unix_shutdown() do not acquire lock_sock().
[0]:
WARNING: possible circular locking dependency detected
syzkaller #0 Tainted: G L
syz.7.2282/12353 is trying to acquire lock:
ffffffff8e9aa700 (fs_reclaim){+.+.}-{0:0}, at: might_alloc include/linux/sched/mm.h:317 [inline]
ffffffff8e9aa700 (fs_reclaim){+.+.}-{0:0}, at: slab_pre_alloc_hook mm/slub.c:4489 [inline]
ffffffff8e9aa700 (fs_reclaim){+.+.}-{0:0}, at: slab_alloc_node mm/slub.c:4843 [inline]
ffffffff8e9aa700 (fs_reclaim){+.+.}-{0:0}, at: kmem_cache_alloc_node_noprof+0x53/0x6f0 mm/slub.c:4918
but task is already holding lock:
ffff88806f972a20 (sk_lock-AF_INET6){+.+.}-{0:0}, at: lock_sock include/net/sock.h:1709 [inline]
ffff88806f972a20 (sk_lock-AF_INET6){+.+.}-{0:0}, at: tcp_close+0x1d/0x110 net/ipv4/tcp.c:3349
which lock already depends on the new lock.
the existing dependency chain (in reverse order) is:
-> #6 (sk_lock-AF_INET6){+.+.}-{0:0}:
lock_sock_nested+0x41/0xf0 net/core/sock.c:3780
lock_sock include/net/sock.h:1709 [inline]
inet_shutdown+0x67/0x410 net/ipv4/af_inet.c:919
nbd_mark_nsock_dead+0xae/0x5c0 drivers/block/nbd.c:318
sock_shutdown+0x16b/0x200 drivers/block/nbd.c:411
nbd_clear_sock drivers/block/nbd.c:1427 [inline]
nbd_config_put+0x1eb/0x750 drivers/block/nbd.c:1451
nbd_genl_connect+0xaf8/0x1a40 drivers/block/nbd.c:2248
genl_family_rcv_msg_doit+0x214/0x300 net/netlink/genetlink.c:1114
genl_family_rcv_msg net/netlink/genetlink.c:1194 [inline]
genl_rcv_msg+0x560/0x800 net/netlink/genetlink.c:1209
netlink_rcv_skb+0x159/0x420 net/netlink/af_netlink.c:2550
genl_rcv+0x28/0x40 net/netlink/genetlink.c:1218
netlink_unicast_kernel net/netlink/af_netlink.c:1318 [inline]
netlink_unicast+0x5aa/0x870 net/netlink/af_netlink.c:1344
netlink_sendmsg+0x8b0/0xda0 net/netlink/af_netlink.c:1894
sock_sendmsg_nosec net/socket.c:727 [inline]
__sock_sendmsg net/socket.c:742 [inline]
____sys_sendmsg+0x9e1/0xb70 net/socket.c:2592
___sys_sendmsg+0x190/0x1e0 net/socket.c:2646
__sys_sendmsg+0x170/0x220 net/socket.c:2678
do_syscall_x64 arch/x86/entry/syscall_64.c:63 [inline]
do_syscall_64+0x106/0xf80 arch/x86/entry/syscall_64.c:94
entry_SYSCALL_64_after_hwframe+0x77/0x7f
-> #5 (&nsock->tx_lock){+.+.}-{4:4}:
__mutex_lock_common kernel/locking/mutex.c:614 [inline]
__mutex_lock+0x1a2/0x1b90 kernel/locking/mutex.c:776
nbd_handle_cmd drivers/block/nbd.c:1143 [inline]
nbd_queue_rq+0x428/0x1080 drivers/block/nbd.c:1207
blk_mq_dispatch_rq_list+0x422/0x1e70 block/blk-mq.c:2148
__blk_mq_do_dispatch_sched block/blk-mq-sched.c:168 [inline]
blk_mq_do_dispatch_sched block/blk-mq-sched.c:182 [inline]
__blk_mq_sched_dispatch_requests+0xcea/0x1620 block/blk-mq-sched.c:307
blk_mq_sched_dispatch_requests+0xd7/0x1c0 block/blk-mq-sched.c:329
blk_mq_run_hw_queue+0x23c/0x670 block/blk-mq.c:2386
blk_mq_dispatch_list+0x51d/0x1360 block/blk-mq.c:2949
blk_mq_flush_plug_list block/blk-mq.c:2997 [inline]
blk_mq_flush_plug_list+0x130/0x600 block/blk-mq.c:2969
__blk_flush_plug+0x2c4/0x4b0 block/blk-core.c:1230
blk_finish_plug block/blk-core.c:1257 [inline]
__submit_bio+0x584/0x6c0 block/blk-core.c:649
__submit_bio_noacct_mq block/blk-core.c:722 [inline]
submit_bio_noacct_nocheck+0x562/0xc10 block/blk-core.c:753
submit_bio_noacct+0xd17/0x2010 block/blk-core.c:884
blk_crypto_submit_bio include/linux/blk-crypto.h:203 [inline]
submit_bh_wbc+0x59c/0x770 fs/buffer.c:2821
submit_bh fs/buffer.c:2826 [inline]
block_read_full_folio+0x264/0x8e0 fs/buffer.c:2444
filemap_read_folio+0xfc/0x3b0 mm/filemap.c:2501
do_read_cache_folio+0x2d7/0x6b0 mm/filemap.c:4101
read_mapping_folio include/linux/pagemap.h:1028 [inline]
read_part_sector+0xd1/0x370 block/partitions/core.c:723
adfspart_check_ICS+0x93/0x910 block/partitions/acorn.c:360
check_partition block/partitions/core.c:142 [inline]
blk_add_partitions block/partitions/core.c:590 [inline]
bdev_disk_changed+0x7f8/0xc80 block/partitions/core.c:694
blkdev_get_whole+0x187/0x290 block/bdev.c:764
bdev_open+0x2c7/0xe40 block/bdev.c:973
blkdev_open+0x34e/0x4f0 block/fops.c:697
do_dentry_open+0x6d8/0x1660 fs/open.c:949
vfs_open+0x82/0x3f0 fs/open.c:1081
do_open fs/namei.c:4671 [inline]
path_openat+0x208c/0x31a0 fs/namei.c:4830
do_file_open+0x20e/0x430 fs/namei.c:4859
do_sys_openat2+0x10d/0x1e0 fs/open.c:1366
do_sys_open fs/open.c:1372 [inline]
__do_sys_openat fs/open.c:1388 [inline]
__se_sys_openat fs/open.c:1383 [inline]
__x64_sys_openat+0x12d/0x210 fs/open.c:1383
do_syscall_x64 arch/x86/entry/syscall_64.c:63 [inline]
do_syscall_64+0x106/0xf80 arch/x86/entry/syscall_64.c:94
entry_SYSCALL_64_after_hwframe+0x77/0x7f
-> #4 (&cmd->lock){+.+.}-{4:4}:
__mutex_lock_common kernel/locking/mutex.c:614 [inline]
__mutex_lock+0x1a2/0x1b90 kernel/locking/mutex.c:776
nbd_queue_rq+0xba/0x1080 drivers/block/nbd.c:1199
blk_mq_dispatch_rq_list+0x422/0x1e70 block/blk-mq.c:2148
__blk_mq_do_dispatch_sched block/blk-mq-sched.c:168 [inline]
blk_mq_do_dispatch_sched block/blk-mq-sched.c:182 [inline]
__blk_mq_sched_dispatch_requests+0xcea/0x1620 block/blk-mq-sched.c:307
blk_mq_sched_dispatch_requests+0xd7/0x1c0 block/blk-mq-sched.c:329
blk_mq_run_hw_queue+0x23c/0x670 block/blk-mq.c:2386
blk_mq_dispatch_list+0x51d/0x1360 block/blk-mq.c:2949
blk_mq_flush_plug_list block/blk-mq.c:2997 [inline]
blk_mq_flush_plug_list+0x130/0x600 block/blk-mq.c:2969
__blk_flush_plug+0x2c4/0x4b0 block/blk-core.c:1230
blk_finish_plug block/blk-core.c:1257 [inline]
__submit_bio+0x584/0x6c0 block/blk-core.c:649
__submit_bio_noacct_mq block/blk-core.c:722 [inline]
submit_bio_noacct_nocheck+0x562/0xc10 block/blk-core.c:753
submit_bio_noacct+0xd17/0x2010 block/blk-core.c:884
blk_crypto_submit_bio include/linux/blk-crypto.h:203 [inline]
submit_bh_wbc+0x59c/0x770 fs/buffer.c:2821
submit_bh fs/buffer.c:2826 [inline]
block_read_full_folio+0x264/0x8e0 fs/buffer.c:2444
filemap_read_folio+0xfc/0x3b0 mm/filemap.c:2501
do_read_cache_folio+0x2d7/0x6b0 mm/filemap.c:4101
read_mapping_folio include/linux/pagemap.h:1028 [inline]
read_part_sector+0xd1/0x370 block/partitions/core.c:723
adfspart_check_ICS+0x93/0x910 block/partitions/acorn.c:360
check_partition block/partitions/core.c:142 [inline]
blk_add_partitions block/partitions/core.c:590 [inline]
bdev_disk_changed+0x7f8/0xc80 block/partitions/core.c:694
blkdev_get_whole+0x187/0x290 block/bdev.c:764
bdev_open+0x2c7/0xe40 block/bdev.c:973
blkdev_open+0x34e/0x4f0 block/fops.c:697
do_dentry_open+0x6d8/0x1660 fs/open.c:949
vfs_open+0x82/0x3f0 fs/open.c:1081
do_open fs/namei.c:4671 [inline]
path_openat+0x208c/0x31a0 fs/namei.c:4830
do_file_open+0x20e/0x430 fs/namei.c:4859
do_sys_openat2+0x10d/0x1e0 fs/open.c:1366
do_sys_open fs/open.c:1372 [inline]
__do_sys_openat fs/open.c:1388 [inline]
__se_sys_openat fs/open.c:1383 [inline]
__x64_sys_openat+0x12d/0x210 fs/open.c:1383
do_syscall_x64 arch/x86/entry/syscall_64.c:63 [inline]
do_syscall_64+0x106/0xf80 arch/x86/entry/syscall_64.c:94
entry_SYSCALL_64_after_hwframe+0x77/0x7f
-> #3 (set->srcu){.+.+}-{0:0}:
srcu_lock_sync include/linux/srcu.h:199 [inline]
__synchronize_srcu+0xa1/0x2a0 kernel/rcu/srcutree.c:1505
blk_mq_wait_quiesce_done block/blk-mq.c:284 [inline]
blk_mq_wait_quiesce_done block/blk-mq.c:281 [inline]
blk_mq_quiesce_queue block/blk-mq.c:304 [inline]
blk_mq_quiesce_queue+0x149/0x1c0 block/blk-mq.c:299
elevator_switch+0x17b/0x7e0 block/elevator.c:576
elevator_change+0x352/0x530 block/elevator.c:681
elevator_set_default+0x29e/0x360 block/elevator.c:754
blk_register_queue+0x412/0x590 block/blk-sysfs.c:946
__add_disk+0x73f/0xe40 block/genhd.c:528
add_disk_fwnode+0x118/0x5c0 block/genhd.c:597
add_disk include/linux/blkdev.h:785 [inline]
nbd_dev_add+0x77a/0xb10 drivers/block/nbd.c:1984
nbd_init+0x291/0x2b0 drivers/block/nbd.c:2692
do_one_initcall+0x11d/0x760 init/main.c:1382
do_initcall_level init/main.c:1444 [inline]
do_initcalls init/main.c:1460 [inline]
do_basic_setup init/main.c:1479 [inline]
kernel_init_freeable+0x6e5/0x7a0 init/main.c:1692
kernel_init+0x1f/0x1e0 init/main.c:1582
ret_from_fork+0x754/0xd80 arch/x86/kernel/process.c:158
ret_from_fork_asm+0x1a/0x30 arch/x86/entry/entry_64.S:245
-> #2 (&q->elevator_lock){+.+.}-{4:4}:
__mutex_lock_common kernel/locking/mutex.c:614 [inline]
__mutex_lock+0x1a2/0x1b90 kernel/locking/mutex.c:776
elevator_change+0x1bc/0x530 block/elevator.c:679
elevator_set_none+0x92/0xf0 block/elevator.c:769
blk_mq_elv_switch_none block/blk-mq.c:5110 [inline]
__blk_mq_update_nr_hw_queues block/blk-mq.c:5155 [inline]
blk_mq_update_nr_hw_queues+0x4c1/0x15f0 block/blk-mq.c:5220
nbd_start_device+0x1a6/0xbd0 drivers/block/nbd.c:1489
nbd_genl_connect+0xff2/0x1a40 drivers/block/nbd.c:2239
genl_family_rcv_msg_doit+0x214/0x300 net/netlink/genetlink.c:1114
genl_family_rcv_msg net/netlink/genetlink.c:1194 [inline]
genl_rcv_msg+0x560/0x800 net/netlink/genetlink.c:1209
netlink_rcv_skb+0x159/0x420 net/netlink/af_netlink.c:2550
genl_rcv+0x28/0x40 net/netlink/genetlink.c:1218
netlink_unicast_kernel net/netlink/af_netlink.c:1318 [inline]
netlink_unicast+0x5aa/0x870 net/netlink/af_netlink.c:1344
netlink_sendmsg+0x8b0/0xda0 net/netlink/af_netlink.c:1894
sock_sendmsg_nosec net/socket.c:727 [inline]
__sock_sendmsg net/socket.c:742 [inline]
____sys_sendmsg+0x9e1/0xb70 net/socket.c:2592
___sys_sendmsg+0x190/0x1e0 net/socket.c:2646
__sys_sendmsg+0x170/0x220 net/socket.c:2678
do_syscall_x64 arch/x86/entry/syscall_64.c:63 [inline]
do_syscall_64+0x106/0xf80 arch/x86/entry/syscall_64.c:94
entry_SYSCALL_64_after_hwframe+0x77/0x7f
-> #1 (&q->q_usage_counter(io)#49){++++}-{0:0}:
blk_alloc_queue+0x610/0x790 block/blk-core.c:461
blk_mq_alloc_queue+0x174/0x290 block/blk-mq.c:4429
__blk_mq_alloc_disk+0x29/0x120 block/blk-mq.c:4476
nbd_dev_add+0x492/0xb10 drivers/block/nbd.c:1954
nbd_init+0x291/0x2b0 drivers/block/nbd.c:2692
do_one_initcall+0x11d/0x760 init/main.c:1382
do_initcall_level init/main.c:1444 [inline]
do_initcalls init/main.c:1460 [inline]
do_basic_setup init/main.c:1479 [inline]
kernel_init_freeable+0x6e5/0x7a0 init/main.c:1692
kernel_init+0x1f/0x1e0 init/main.c:1582
ret_from_fork+0x754/0xd80 arch/x86/kernel/process.c:158
ret_from_fork_asm+0x1a/0x30 arch/x86/entry/entry_64.S:245
-> #0 (fs_reclaim){+.+.}-{0:0}:
check_prev_add kernel/locking/lockdep.c:3165 [inline]
check_prevs_add kernel/locking/lockdep.c:3284 [inline]
validate_chain kernel/locking/lockdep.c:3908 [inline]
__lock_acquire+0x14b8/0x2630 kernel/locking/lockdep.c:5237
lock_acquire kernel/locking/lockdep.c:5868 [inline]
lock_acquire+0x1cf/0x380 kernel/locking/lockdep.c:5825
__fs_reclaim_acquire mm/page_alloc.c:4348 [inline]
fs_reclaim_acquire+0xc4/0x100 mm/page_alloc.c:4362
might_alloc include/linux/sched/mm.h:317 [inline]
slab_pre_alloc_hook mm/slub.c:4489 [inline]
slab_alloc_node mm/slub.c:4843 [inline]
kmem_cache_alloc_node_noprof+0x53/0x6f0 mm/slub.c:4918
__alloc_skb+0x140/0x710 net/core/skbuff.c:702
alloc_skb include/linux/skbuff.h:1383 [inline]
tcp_send_active_reset+0x8b/0xa60 net/ipv4/tcp_output.c:3862
__tcp_close+0x41e/0x1110 net/ipv4/tcp.c:3223
tcp_close+0x28/0x110 net/ipv4/tcp.c:3350
inet_release+0xed/0x200 net/ipv4/af_inet.c:443
inet6_release+0x4f/0x70 net/ipv6/af_inet6.c:479
__sock_release+0xb3/0x260 net/socket.c:662
sock_close+0x1c/0x30 net/socket.c:1455
__fput+0x3ff/0xb40 fs/file_table.c:469
task_work_run+0x150/0x240 kernel/task_work.c:233
resume_user_mode_work include/linux/resume_user_mode.h:50 [inline]
__exit_to_user_mode_loop kernel/entry/common.c:67 [inline]
exit_to_user_mode_loop+0x100/0x4a0 kernel/entry/common.c:98
__exit_to_user_mode_prepare include/linux/irq-entry-common.h:226 [inline]
syscall_exit_to_user_mode_prepare include/linux/irq-entry-common.h:256 [inline]
syscall_exit_to_user_mode include/linux/entry-common.h:325 [inline]
do_syscall_64+0x67c/0xf80 arch/x86/entry/syscall_64.c:100
entry_SYSCALL_64_after_hwframe+0x77/0x7f
other info that might help us debug this:
Chain exists of:
fs_reclaim --> &nsock->tx_lock --> sk_lock-AF_INET6
Possible unsafe locking scenario:
CPU0 CPU1
---- ----
lock(sk_lock-AF_INET6);
lock(&nsock->tx_lock);
lock(sk_lock-AF_INET6);
lock(fs_reclaim);
*** DEADLOCK ***
Fixes: fd8383f ("nbd: convert to blkmq")
Reported-by: [email protected]
Closes: https://lore.kernel.org/netdev/[email protected]/
Signed-off-by: Kuniyuki Iwashima <[email protected]>
blktests-ci Bot
pushed a commit
that referenced
this pull request
Mar 27, 2026
When a bio goes through the rq_qos infrastructure on a path's request
queue, it gets BIO_QOS_THROTTLED or BIO_QOS_MERGED flags set. These
flags indicate that rq_qos_done_bio() should be called on completion
to update rq_qos accounting.
During path failover in nvme_failover_req(), the bio's bi_bdev is
redirected from the failed path's disk to the multipath head's disk
via bio_set_dev(). However, the BIO_QOS flags are not cleared.
When the bio eventually completes (either successfully via a new path
or with an error via bio_io_error()), rq_qos_done_bio() checks for
these flags and calls __rq_qos_done_bio(q->rq_qos, bio) where q is
obtained from the bio's current bi_bdev - which is now the multipath
head's queue, not the original path's queue.
The multipath head's queue does not have rq_qos enabled (q->rq_qos is
NULL), but the code assumes that if BIO_QOS_* flags are set, q->rq_qos
must be valid.
This breaks when a bio is moved between queues during NVMe multipath
failover, leading to a NULL pointer dereference.
Execution Context timeline :-
* =====> dd process context
[USER] dd process
[SYSCALL] write() - dd process context
submit_bio()
nvme_ns_head_submit_bio() - path selection
blk_mq_submit_bio() #### QOS FLAGS SET HERE
[USER] dd waits or returns
==== I/O in flight on NVMe hardware =====
===== End of submission path ====
------------------------------------------------------
* dd ====> Interrupt context;
[IRQ] NVMe completion interrupt
nvme_irq()
nvme_complete_rq()
nvme_failover_req() ### BIO MOVED TO HEAD
spin_lock_irqsave (atomic section)
bio_set_dev() changes bi_bdev
### BUG: QOS flags NOT cleared
kblockd_schedule_work()
* Interrupt context =====> kblockd workqueue
[WQ] kblockd workqueue - kworker process
nvme_requeue_work()
submit_bio_noacct()
nvme_ns_head_submit_bio()
nvme_find_path() returns NULL
bio_io_error()
bio_endio()
rq_qos_done_bio() ### CRASH ###
KERNEL PANIC / OOPS
Crash from blktests nvme/058 (rapid namespace remapping):
[ 1339.636033] BUG: kernel NULL pointer dereference, address: 0000000000000000
[ 1339.641025] nvme nvme4: rescanning namespaces.
[ 1339.642064] #PF: supervisor read access in kernel mode
[ 1339.642067] #PF: error_code(0x0000) - not-present page
[ 1339.642070] PGD 0 P4D 0
[ 1339.642073] Oops: Oops: 0000 [#1] SMP NOPTI
[ 1339.642078] CPU: 35 UID: 0 PID: 4579 Comm: kworker/35:2H
Tainted: G O N 6.17.0-rc3nvme+ #5 PREEMPT(voluntary)
[ 1339.642084] Tainted: [O]=OOT_MODULE, [N]=TEST
[ 1339.673446] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996),
BIOS rel-1.16.3-0-ga6ed6b701f0a-prebuilt.qemu.org 04/01/2014
[ 1339.682359] Workqueue: kblockd nvme_requeue_work [nvme_core]
[ 1339.686613] RIP: 0010:__rq_qos_done_bio+0xd/0x40
[ 1339.690161] Code: 75 dd 5b 5d 41 5c c3 cc cc cc cc 66 90 90 90 90 90 90 90
90 90 90 90 90 90 90 90 90 90 0f 1f 44 00 00 55 48 89 f5
53 48 89 fb <48> 8b 03 48 8b 40 30 48 85 c0 74 0b 48 89 ee
48 89 df ff d0 0f 1f
[ 1339.703691] RSP: 0018:ffffc900066f3c90 EFLAGS: 00010202
[ 1339.706844] RAX: ffff888148b9ef00 RBX: 0000000000000000 RCX: 0000000000000000
[ 1339.711136] RDX: 00000000000001c0 RSI: ffff8882aaab8a80 RDI: 0000000000000000
[ 1339.715691] RBP: ffff8882aaab8a80 R08: 0000000000000000 R09: 0000000000000000
[ 1339.720472] R10: 0000000000000000 R11: fefefefefefefeff R12: ffff8882aa3b6010
[ 1339.724650] R13: 0000000000000000 R14: ffff8882338bcef0 R15: ffff8882aa3b6020
[ 1339.729029] FS: 0000000000000000(0000) GS:ffff88985c0cf000(0000) knlGS:0000000000000000
[ 1339.734525] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[ 1339.738563] CR2: 0000000000000000 CR3: 0000000111045000 CR4: 0000000000350ef0
[ 1339.742750] DR0: ffffffff845ccbec DR1: ffffffff845ccbed DR2: ffffffff845ccbee
[ 1339.745630] DR3: ffffffff845ccbef DR6: 00000000ffff0ff0 DR7: 0000000000000600
[ 1339.748488] Call Trace:
[ 1339.749512] <TASK>
[ 1339.750449] bio_endio+0x71/0x2e0
[ 1339.751833] nvme_ns_head_submit_bio+0x290/0x320 [nvme_core]
[ 1339.754073] __submit_bio+0x222/0x5e0
[ 1339.755623] ? rcu_is_watching+0xd/0x40
[ 1339.757201] ? submit_bio_noacct_nocheck+0x131/0x370
[ 1339.759210] submit_bio_noacct_nocheck+0x131/0x370
[ 1339.761189] ? submit_bio_noacct+0x20/0x620
[ 1339.762849] nvme_requeue_work+0x4b/0x60 [nvme_core]
[ 1339.764828] process_one_work+0x20e/0x630
[ 1339.766528] worker_thread+0x184/0x330
[ 1339.768129] ? __pfx_worker_thread+0x10/0x10
[ 1339.769942] kthread+0x10a/0x250
[ 1339.771263] ? __pfx_kthread+0x10/0x10
[ 1339.772776] ? __pfx_kthread+0x10/0x10
[ 1339.774381] ret_from_fork+0x273/0x2e0
[ 1339.775948] ? __pfx_kthread+0x10/0x10
[ 1339.777504] ret_from_fork_asm+0x1a/0x30
[ 1339.779163] </TASK>
Fix this by clearing both BIO_QOS_THROTTLED and BIO_QOS_MERGED flags
when bios are redirected to the multipath head in nvme_failover_req().
This is consistent with the existing code that clears REQ_POLLED and
REQ_NOWAIT flags when the bio changes queues.
Signed-off-by: Chaitanya Kulkarni <[email protected]>
Reviewed-by: Christoph Hellwig <[email protected]>
blktests-ci Bot
pushed a commit
that referenced
this pull request
Mar 27, 2026
As reported by syzbot [0], NBD can trigger a deadlock during
memory reclaim.
This occurs when a process holds lock_sock() on a backend TCP
socket and triggers a memory allocation that leads to fs reclaim.
If it eventually calls into NBD to send data or shut down the
socket, NBD will attempt to acquire the same lock_sock(),
resulting in the deadlock.
While NBD sets sk->sk_allocation to GFP_NOIO before calling
sendmsg(), this does not prevent the issue in some paths where
GFP_KERNEL is used directly under lock_sock().
To resolve this, let's use lock_sock_try() for TCP sendmsg() and
shutdown().
For sock_sendmsg(), if lock_sock_try() fails, -ERESTARTSYS is
returned, allowing the request to be retried later (e.g., via
was_interrupted() logic).
For sock_sendmsg() for NBD_CMD_DISC and kernel_sock_shutdown(),
the operation might be skipped if the lock cannot be acquired.
However, this is not expected to occur in practice because the
backend TCP socket should not be touched by userspace once it is
handed over to NBD.
Note that sock_recvmsg() does not require this special handling
because it is only called from the workqueue context.
Also note that AF_UNIX sockets continue to use sock_sendmsg()
and kernel_sock_shutdown() because unix_stream_sendmsg() and
unix_shutdown() do not acquire lock_sock().
[0]:
WARNING: possible circular locking dependency detected
syzkaller #0 Tainted: G L
syz.7.2282/12353 is trying to acquire lock:
ffffffff8e9aa700 (fs_reclaim){+.+.}-{0:0}, at: might_alloc include/linux/sched/mm.h:317 [inline]
ffffffff8e9aa700 (fs_reclaim){+.+.}-{0:0}, at: slab_pre_alloc_hook mm/slub.c:4489 [inline]
ffffffff8e9aa700 (fs_reclaim){+.+.}-{0:0}, at: slab_alloc_node mm/slub.c:4843 [inline]
ffffffff8e9aa700 (fs_reclaim){+.+.}-{0:0}, at: kmem_cache_alloc_node_noprof+0x53/0x6f0 mm/slub.c:4918
but task is already holding lock:
ffff88806f972a20 (sk_lock-AF_INET6){+.+.}-{0:0}, at: lock_sock include/net/sock.h:1709 [inline]
ffff88806f972a20 (sk_lock-AF_INET6){+.+.}-{0:0}, at: tcp_close+0x1d/0x110 net/ipv4/tcp.c:3349
which lock already depends on the new lock.
the existing dependency chain (in reverse order) is:
-> #6 (sk_lock-AF_INET6){+.+.}-{0:0}:
lock_sock_nested+0x41/0xf0 net/core/sock.c:3780
lock_sock include/net/sock.h:1709 [inline]
inet_shutdown+0x67/0x410 net/ipv4/af_inet.c:919
nbd_mark_nsock_dead+0xae/0x5c0 drivers/block/nbd.c:318
sock_shutdown+0x16b/0x200 drivers/block/nbd.c:411
nbd_clear_sock drivers/block/nbd.c:1427 [inline]
nbd_config_put+0x1eb/0x750 drivers/block/nbd.c:1451
nbd_genl_connect+0xaf8/0x1a40 drivers/block/nbd.c:2248
genl_family_rcv_msg_doit+0x214/0x300 net/netlink/genetlink.c:1114
genl_family_rcv_msg net/netlink/genetlink.c:1194 [inline]
genl_rcv_msg+0x560/0x800 net/netlink/genetlink.c:1209
netlink_rcv_skb+0x159/0x420 net/netlink/af_netlink.c:2550
genl_rcv+0x28/0x40 net/netlink/genetlink.c:1218
netlink_unicast_kernel net/netlink/af_netlink.c:1318 [inline]
netlink_unicast+0x5aa/0x870 net/netlink/af_netlink.c:1344
netlink_sendmsg+0x8b0/0xda0 net/netlink/af_netlink.c:1894
sock_sendmsg_nosec net/socket.c:727 [inline]
__sock_sendmsg net/socket.c:742 [inline]
____sys_sendmsg+0x9e1/0xb70 net/socket.c:2592
___sys_sendmsg+0x190/0x1e0 net/socket.c:2646
__sys_sendmsg+0x170/0x220 net/socket.c:2678
do_syscall_x64 arch/x86/entry/syscall_64.c:63 [inline]
do_syscall_64+0x106/0xf80 arch/x86/entry/syscall_64.c:94
entry_SYSCALL_64_after_hwframe+0x77/0x7f
-> #5 (&nsock->tx_lock){+.+.}-{4:4}:
__mutex_lock_common kernel/locking/mutex.c:614 [inline]
__mutex_lock+0x1a2/0x1b90 kernel/locking/mutex.c:776
nbd_handle_cmd drivers/block/nbd.c:1143 [inline]
nbd_queue_rq+0x428/0x1080 drivers/block/nbd.c:1207
blk_mq_dispatch_rq_list+0x422/0x1e70 block/blk-mq.c:2148
__blk_mq_do_dispatch_sched block/blk-mq-sched.c:168 [inline]
blk_mq_do_dispatch_sched block/blk-mq-sched.c:182 [inline]
__blk_mq_sched_dispatch_requests+0xcea/0x1620 block/blk-mq-sched.c:307
blk_mq_sched_dispatch_requests+0xd7/0x1c0 block/blk-mq-sched.c:329
blk_mq_run_hw_queue+0x23c/0x670 block/blk-mq.c:2386
blk_mq_dispatch_list+0x51d/0x1360 block/blk-mq.c:2949
blk_mq_flush_plug_list block/blk-mq.c:2997 [inline]
blk_mq_flush_plug_list+0x130/0x600 block/blk-mq.c:2969
__blk_flush_plug+0x2c4/0x4b0 block/blk-core.c:1230
blk_finish_plug block/blk-core.c:1257 [inline]
__submit_bio+0x584/0x6c0 block/blk-core.c:649
__submit_bio_noacct_mq block/blk-core.c:722 [inline]
submit_bio_noacct_nocheck+0x562/0xc10 block/blk-core.c:753
submit_bio_noacct+0xd17/0x2010 block/blk-core.c:884
blk_crypto_submit_bio include/linux/blk-crypto.h:203 [inline]
submit_bh_wbc+0x59c/0x770 fs/buffer.c:2821
submit_bh fs/buffer.c:2826 [inline]
block_read_full_folio+0x264/0x8e0 fs/buffer.c:2444
filemap_read_folio+0xfc/0x3b0 mm/filemap.c:2501
do_read_cache_folio+0x2d7/0x6b0 mm/filemap.c:4101
read_mapping_folio include/linux/pagemap.h:1028 [inline]
read_part_sector+0xd1/0x370 block/partitions/core.c:723
adfspart_check_ICS+0x93/0x910 block/partitions/acorn.c:360
check_partition block/partitions/core.c:142 [inline]
blk_add_partitions block/partitions/core.c:590 [inline]
bdev_disk_changed+0x7f8/0xc80 block/partitions/core.c:694
blkdev_get_whole+0x187/0x290 block/bdev.c:764
bdev_open+0x2c7/0xe40 block/bdev.c:973
blkdev_open+0x34e/0x4f0 block/fops.c:697
do_dentry_open+0x6d8/0x1660 fs/open.c:949
vfs_open+0x82/0x3f0 fs/open.c:1081
do_open fs/namei.c:4671 [inline]
path_openat+0x208c/0x31a0 fs/namei.c:4830
do_file_open+0x20e/0x430 fs/namei.c:4859
do_sys_openat2+0x10d/0x1e0 fs/open.c:1366
do_sys_open fs/open.c:1372 [inline]
__do_sys_openat fs/open.c:1388 [inline]
__se_sys_openat fs/open.c:1383 [inline]
__x64_sys_openat+0x12d/0x210 fs/open.c:1383
do_syscall_x64 arch/x86/entry/syscall_64.c:63 [inline]
do_syscall_64+0x106/0xf80 arch/x86/entry/syscall_64.c:94
entry_SYSCALL_64_after_hwframe+0x77/0x7f
-> #4 (&cmd->lock){+.+.}-{4:4}:
__mutex_lock_common kernel/locking/mutex.c:614 [inline]
__mutex_lock+0x1a2/0x1b90 kernel/locking/mutex.c:776
nbd_queue_rq+0xba/0x1080 drivers/block/nbd.c:1199
blk_mq_dispatch_rq_list+0x422/0x1e70 block/blk-mq.c:2148
__blk_mq_do_dispatch_sched block/blk-mq-sched.c:168 [inline]
blk_mq_do_dispatch_sched block/blk-mq-sched.c:182 [inline]
__blk_mq_sched_dispatch_requests+0xcea/0x1620 block/blk-mq-sched.c:307
blk_mq_sched_dispatch_requests+0xd7/0x1c0 block/blk-mq-sched.c:329
blk_mq_run_hw_queue+0x23c/0x670 block/blk-mq.c:2386
blk_mq_dispatch_list+0x51d/0x1360 block/blk-mq.c:2949
blk_mq_flush_plug_list block/blk-mq.c:2997 [inline]
blk_mq_flush_plug_list+0x130/0x600 block/blk-mq.c:2969
__blk_flush_plug+0x2c4/0x4b0 block/blk-core.c:1230
blk_finish_plug block/blk-core.c:1257 [inline]
__submit_bio+0x584/0x6c0 block/blk-core.c:649
__submit_bio_noacct_mq block/blk-core.c:722 [inline]
submit_bio_noacct_nocheck+0x562/0xc10 block/blk-core.c:753
submit_bio_noacct+0xd17/0x2010 block/blk-core.c:884
blk_crypto_submit_bio include/linux/blk-crypto.h:203 [inline]
submit_bh_wbc+0x59c/0x770 fs/buffer.c:2821
submit_bh fs/buffer.c:2826 [inline]
block_read_full_folio+0x264/0x8e0 fs/buffer.c:2444
filemap_read_folio+0xfc/0x3b0 mm/filemap.c:2501
do_read_cache_folio+0x2d7/0x6b0 mm/filemap.c:4101
read_mapping_folio include/linux/pagemap.h:1028 [inline]
read_part_sector+0xd1/0x370 block/partitions/core.c:723
adfspart_check_ICS+0x93/0x910 block/partitions/acorn.c:360
check_partition block/partitions/core.c:142 [inline]
blk_add_partitions block/partitions/core.c:590 [inline]
bdev_disk_changed+0x7f8/0xc80 block/partitions/core.c:694
blkdev_get_whole+0x187/0x290 block/bdev.c:764
bdev_open+0x2c7/0xe40 block/bdev.c:973
blkdev_open+0x34e/0x4f0 block/fops.c:697
do_dentry_open+0x6d8/0x1660 fs/open.c:949
vfs_open+0x82/0x3f0 fs/open.c:1081
do_open fs/namei.c:4671 [inline]
path_openat+0x208c/0x31a0 fs/namei.c:4830
do_file_open+0x20e/0x430 fs/namei.c:4859
do_sys_openat2+0x10d/0x1e0 fs/open.c:1366
do_sys_open fs/open.c:1372 [inline]
__do_sys_openat fs/open.c:1388 [inline]
__se_sys_openat fs/open.c:1383 [inline]
__x64_sys_openat+0x12d/0x210 fs/open.c:1383
do_syscall_x64 arch/x86/entry/syscall_64.c:63 [inline]
do_syscall_64+0x106/0xf80 arch/x86/entry/syscall_64.c:94
entry_SYSCALL_64_after_hwframe+0x77/0x7f
-> #3 (set->srcu){.+.+}-{0:0}:
srcu_lock_sync include/linux/srcu.h:199 [inline]
__synchronize_srcu+0xa1/0x2a0 kernel/rcu/srcutree.c:1505
blk_mq_wait_quiesce_done block/blk-mq.c:284 [inline]
blk_mq_wait_quiesce_done block/blk-mq.c:281 [inline]
blk_mq_quiesce_queue block/blk-mq.c:304 [inline]
blk_mq_quiesce_queue+0x149/0x1c0 block/blk-mq.c:299
elevator_switch+0x17b/0x7e0 block/elevator.c:576
elevator_change+0x352/0x530 block/elevator.c:681
elevator_set_default+0x29e/0x360 block/elevator.c:754
blk_register_queue+0x412/0x590 block/blk-sysfs.c:946
__add_disk+0x73f/0xe40 block/genhd.c:528
add_disk_fwnode+0x118/0x5c0 block/genhd.c:597
add_disk include/linux/blkdev.h:785 [inline]
nbd_dev_add+0x77a/0xb10 drivers/block/nbd.c:1984
nbd_init+0x291/0x2b0 drivers/block/nbd.c:2692
do_one_initcall+0x11d/0x760 init/main.c:1382
do_initcall_level init/main.c:1444 [inline]
do_initcalls init/main.c:1460 [inline]
do_basic_setup init/main.c:1479 [inline]
kernel_init_freeable+0x6e5/0x7a0 init/main.c:1692
kernel_init+0x1f/0x1e0 init/main.c:1582
ret_from_fork+0x754/0xd80 arch/x86/kernel/process.c:158
ret_from_fork_asm+0x1a/0x30 arch/x86/entry/entry_64.S:245
-> #2 (&q->elevator_lock){+.+.}-{4:4}:
__mutex_lock_common kernel/locking/mutex.c:614 [inline]
__mutex_lock+0x1a2/0x1b90 kernel/locking/mutex.c:776
elevator_change+0x1bc/0x530 block/elevator.c:679
elevator_set_none+0x92/0xf0 block/elevator.c:769
blk_mq_elv_switch_none block/blk-mq.c:5110 [inline]
__blk_mq_update_nr_hw_queues block/blk-mq.c:5155 [inline]
blk_mq_update_nr_hw_queues+0x4c1/0x15f0 block/blk-mq.c:5220
nbd_start_device+0x1a6/0xbd0 drivers/block/nbd.c:1489
nbd_genl_connect+0xff2/0x1a40 drivers/block/nbd.c:2239
genl_family_rcv_msg_doit+0x214/0x300 net/netlink/genetlink.c:1114
genl_family_rcv_msg net/netlink/genetlink.c:1194 [inline]
genl_rcv_msg+0x560/0x800 net/netlink/genetlink.c:1209
netlink_rcv_skb+0x159/0x420 net/netlink/af_netlink.c:2550
genl_rcv+0x28/0x40 net/netlink/genetlink.c:1218
netlink_unicast_kernel net/netlink/af_netlink.c:1318 [inline]
netlink_unicast+0x5aa/0x870 net/netlink/af_netlink.c:1344
netlink_sendmsg+0x8b0/0xda0 net/netlink/af_netlink.c:1894
sock_sendmsg_nosec net/socket.c:727 [inline]
__sock_sendmsg net/socket.c:742 [inline]
____sys_sendmsg+0x9e1/0xb70 net/socket.c:2592
___sys_sendmsg+0x190/0x1e0 net/socket.c:2646
__sys_sendmsg+0x170/0x220 net/socket.c:2678
do_syscall_x64 arch/x86/entry/syscall_64.c:63 [inline]
do_syscall_64+0x106/0xf80 arch/x86/entry/syscall_64.c:94
entry_SYSCALL_64_after_hwframe+0x77/0x7f
-> #1 (&q->q_usage_counter(io)#49){++++}-{0:0}:
blk_alloc_queue+0x610/0x790 block/blk-core.c:461
blk_mq_alloc_queue+0x174/0x290 block/blk-mq.c:4429
__blk_mq_alloc_disk+0x29/0x120 block/blk-mq.c:4476
nbd_dev_add+0x492/0xb10 drivers/block/nbd.c:1954
nbd_init+0x291/0x2b0 drivers/block/nbd.c:2692
do_one_initcall+0x11d/0x760 init/main.c:1382
do_initcall_level init/main.c:1444 [inline]
do_initcalls init/main.c:1460 [inline]
do_basic_setup init/main.c:1479 [inline]
kernel_init_freeable+0x6e5/0x7a0 init/main.c:1692
kernel_init+0x1f/0x1e0 init/main.c:1582
ret_from_fork+0x754/0xd80 arch/x86/kernel/process.c:158
ret_from_fork_asm+0x1a/0x30 arch/x86/entry/entry_64.S:245
-> #0 (fs_reclaim){+.+.}-{0:0}:
check_prev_add kernel/locking/lockdep.c:3165 [inline]
check_prevs_add kernel/locking/lockdep.c:3284 [inline]
validate_chain kernel/locking/lockdep.c:3908 [inline]
__lock_acquire+0x14b8/0x2630 kernel/locking/lockdep.c:5237
lock_acquire kernel/locking/lockdep.c:5868 [inline]
lock_acquire+0x1cf/0x380 kernel/locking/lockdep.c:5825
__fs_reclaim_acquire mm/page_alloc.c:4348 [inline]
fs_reclaim_acquire+0xc4/0x100 mm/page_alloc.c:4362
might_alloc include/linux/sched/mm.h:317 [inline]
slab_pre_alloc_hook mm/slub.c:4489 [inline]
slab_alloc_node mm/slub.c:4843 [inline]
kmem_cache_alloc_node_noprof+0x53/0x6f0 mm/slub.c:4918
__alloc_skb+0x140/0x710 net/core/skbuff.c:702
alloc_skb include/linux/skbuff.h:1383 [inline]
tcp_send_active_reset+0x8b/0xa60 net/ipv4/tcp_output.c:3862
__tcp_close+0x41e/0x1110 net/ipv4/tcp.c:3223
tcp_close+0x28/0x110 net/ipv4/tcp.c:3350
inet_release+0xed/0x200 net/ipv4/af_inet.c:443
inet6_release+0x4f/0x70 net/ipv6/af_inet6.c:479
__sock_release+0xb3/0x260 net/socket.c:662
sock_close+0x1c/0x30 net/socket.c:1455
__fput+0x3ff/0xb40 fs/file_table.c:469
task_work_run+0x150/0x240 kernel/task_work.c:233
resume_user_mode_work include/linux/resume_user_mode.h:50 [inline]
__exit_to_user_mode_loop kernel/entry/common.c:67 [inline]
exit_to_user_mode_loop+0x100/0x4a0 kernel/entry/common.c:98
__exit_to_user_mode_prepare include/linux/irq-entry-common.h:226 [inline]
syscall_exit_to_user_mode_prepare include/linux/irq-entry-common.h:256 [inline]
syscall_exit_to_user_mode include/linux/entry-common.h:325 [inline]
do_syscall_64+0x67c/0xf80 arch/x86/entry/syscall_64.c:100
entry_SYSCALL_64_after_hwframe+0x77/0x7f
other info that might help us debug this:
Chain exists of:
fs_reclaim --> &nsock->tx_lock --> sk_lock-AF_INET6
Possible unsafe locking scenario:
CPU0 CPU1
---- ----
lock(sk_lock-AF_INET6);
lock(&nsock->tx_lock);
lock(sk_lock-AF_INET6);
lock(fs_reclaim);
*** DEADLOCK ***
Fixes: fd8383f ("nbd: convert to blkmq")
Reported-by: [email protected]
Closes: https://lore.kernel.org/netdev/[email protected]/
Signed-off-by: Kuniyuki Iwashima <[email protected]>
blktests-ci Bot
pushed a commit
that referenced
this pull request
Mar 27, 2026
When a bio goes through the rq_qos infrastructure on a path's request
queue, it gets BIO_QOS_THROTTLED or BIO_QOS_MERGED flags set. These
flags indicate that rq_qos_done_bio() should be called on completion
to update rq_qos accounting.
During path failover in nvme_failover_req(), the bio's bi_bdev is
redirected from the failed path's disk to the multipath head's disk
via bio_set_dev(). However, the BIO_QOS flags are not cleared.
When the bio eventually completes (either successfully via a new path
or with an error via bio_io_error()), rq_qos_done_bio() checks for
these flags and calls __rq_qos_done_bio(q->rq_qos, bio) where q is
obtained from the bio's current bi_bdev - which is now the multipath
head's queue, not the original path's queue.
The multipath head's queue does not have rq_qos enabled (q->rq_qos is
NULL), but the code assumes that if BIO_QOS_* flags are set, q->rq_qos
must be valid.
This breaks when a bio is moved between queues during NVMe multipath
failover, leading to a NULL pointer dereference.
Execution Context timeline :-
* =====> dd process context
[USER] dd process
[SYSCALL] write() - dd process context
submit_bio()
nvme_ns_head_submit_bio() - path selection
blk_mq_submit_bio() #### QOS FLAGS SET HERE
[USER] dd waits or returns
==== I/O in flight on NVMe hardware =====
===== End of submission path ====
------------------------------------------------------
* dd ====> Interrupt context;
[IRQ] NVMe completion interrupt
nvme_irq()
nvme_complete_rq()
nvme_failover_req() ### BIO MOVED TO HEAD
spin_lock_irqsave (atomic section)
bio_set_dev() changes bi_bdev
### BUG: QOS flags NOT cleared
kblockd_schedule_work()
* Interrupt context =====> kblockd workqueue
[WQ] kblockd workqueue - kworker process
nvme_requeue_work()
submit_bio_noacct()
nvme_ns_head_submit_bio()
nvme_find_path() returns NULL
bio_io_error()
bio_endio()
rq_qos_done_bio() ### CRASH ###
KERNEL PANIC / OOPS
Crash from blktests nvme/058 (rapid namespace remapping):
[ 1339.636033] BUG: kernel NULL pointer dereference, address: 0000000000000000
[ 1339.641025] nvme nvme4: rescanning namespaces.
[ 1339.642064] #PF: supervisor read access in kernel mode
[ 1339.642067] #PF: error_code(0x0000) - not-present page
[ 1339.642070] PGD 0 P4D 0
[ 1339.642073] Oops: Oops: 0000 [#1] SMP NOPTI
[ 1339.642078] CPU: 35 UID: 0 PID: 4579 Comm: kworker/35:2H
Tainted: G O N 6.17.0-rc3nvme+ #5 PREEMPT(voluntary)
[ 1339.642084] Tainted: [O]=OOT_MODULE, [N]=TEST
[ 1339.673446] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996),
BIOS rel-1.16.3-0-ga6ed6b701f0a-prebuilt.qemu.org 04/01/2014
[ 1339.682359] Workqueue: kblockd nvme_requeue_work [nvme_core]
[ 1339.686613] RIP: 0010:__rq_qos_done_bio+0xd/0x40
[ 1339.690161] Code: 75 dd 5b 5d 41 5c c3 cc cc cc cc 66 90 90 90 90 90 90 90
90 90 90 90 90 90 90 90 90 90 0f 1f 44 00 00 55 48 89 f5
53 48 89 fb <48> 8b 03 48 8b 40 30 48 85 c0 74 0b 48 89 ee
48 89 df ff d0 0f 1f
[ 1339.703691] RSP: 0018:ffffc900066f3c90 EFLAGS: 00010202
[ 1339.706844] RAX: ffff888148b9ef00 RBX: 0000000000000000 RCX: 0000000000000000
[ 1339.711136] RDX: 00000000000001c0 RSI: ffff8882aaab8a80 RDI: 0000000000000000
[ 1339.715691] RBP: ffff8882aaab8a80 R08: 0000000000000000 R09: 0000000000000000
[ 1339.720472] R10: 0000000000000000 R11: fefefefefefefeff R12: ffff8882aa3b6010
[ 1339.724650] R13: 0000000000000000 R14: ffff8882338bcef0 R15: ffff8882aa3b6020
[ 1339.729029] FS: 0000000000000000(0000) GS:ffff88985c0cf000(0000) knlGS:0000000000000000
[ 1339.734525] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[ 1339.738563] CR2: 0000000000000000 CR3: 0000000111045000 CR4: 0000000000350ef0
[ 1339.742750] DR0: ffffffff845ccbec DR1: ffffffff845ccbed DR2: ffffffff845ccbee
[ 1339.745630] DR3: ffffffff845ccbef DR6: 00000000ffff0ff0 DR7: 0000000000000600
[ 1339.748488] Call Trace:
[ 1339.749512] <TASK>
[ 1339.750449] bio_endio+0x71/0x2e0
[ 1339.751833] nvme_ns_head_submit_bio+0x290/0x320 [nvme_core]
[ 1339.754073] __submit_bio+0x222/0x5e0
[ 1339.755623] ? rcu_is_watching+0xd/0x40
[ 1339.757201] ? submit_bio_noacct_nocheck+0x131/0x370
[ 1339.759210] submit_bio_noacct_nocheck+0x131/0x370
[ 1339.761189] ? submit_bio_noacct+0x20/0x620
[ 1339.762849] nvme_requeue_work+0x4b/0x60 [nvme_core]
[ 1339.764828] process_one_work+0x20e/0x630
[ 1339.766528] worker_thread+0x184/0x330
[ 1339.768129] ? __pfx_worker_thread+0x10/0x10
[ 1339.769942] kthread+0x10a/0x250
[ 1339.771263] ? __pfx_kthread+0x10/0x10
[ 1339.772776] ? __pfx_kthread+0x10/0x10
[ 1339.774381] ret_from_fork+0x273/0x2e0
[ 1339.775948] ? __pfx_kthread+0x10/0x10
[ 1339.777504] ret_from_fork_asm+0x1a/0x30
[ 1339.779163] </TASK>
Fix this by clearing both BIO_QOS_THROTTLED and BIO_QOS_MERGED flags
when bios are redirected to the multipath head in nvme_failover_req().
This is consistent with the existing code that clears REQ_POLLED and
REQ_NOWAIT flags when the bio changes queues.
Signed-off-by: Chaitanya Kulkarni <[email protected]>
Reviewed-by: Christoph Hellwig <[email protected]>
blktests-ci Bot
pushed a commit
that referenced
this pull request
Mar 28, 2026
As reported by syzbot [0], NBD can trigger a deadlock during
memory reclaim.
This occurs when a process holds lock_sock() on a backend TCP
socket and triggers a memory allocation that leads to fs reclaim.
If it eventually calls into NBD to send data or shut down the
socket, NBD will attempt to acquire the same lock_sock(),
resulting in the deadlock.
While NBD sets sk->sk_allocation to GFP_NOIO before calling
sendmsg(), this does not prevent the issue in some paths where
GFP_KERNEL is used directly under lock_sock().
To resolve this, let's use lock_sock_try() for TCP sendmsg() and
shutdown().
For sock_sendmsg(), if lock_sock_try() fails, -ERESTARTSYS is
returned, allowing the request to be retried later (e.g., via
was_interrupted() logic).
For sock_sendmsg() for NBD_CMD_DISC and kernel_sock_shutdown(),
the operation might be skipped if the lock cannot be acquired.
However, this is not expected to occur in practice because the
backend TCP socket should not be touched by userspace once it is
handed over to NBD.
Note that sock_recvmsg() does not require this special handling
because it is only called from the workqueue context.
Also note that AF_UNIX sockets continue to use sock_sendmsg()
and kernel_sock_shutdown() because unix_stream_sendmsg() and
unix_shutdown() do not acquire lock_sock().
[0]:
WARNING: possible circular locking dependency detected
syzkaller #0 Tainted: G L
syz.7.2282/12353 is trying to acquire lock:
ffffffff8e9aa700 (fs_reclaim){+.+.}-{0:0}, at: might_alloc include/linux/sched/mm.h:317 [inline]
ffffffff8e9aa700 (fs_reclaim){+.+.}-{0:0}, at: slab_pre_alloc_hook mm/slub.c:4489 [inline]
ffffffff8e9aa700 (fs_reclaim){+.+.}-{0:0}, at: slab_alloc_node mm/slub.c:4843 [inline]
ffffffff8e9aa700 (fs_reclaim){+.+.}-{0:0}, at: kmem_cache_alloc_node_noprof+0x53/0x6f0 mm/slub.c:4918
but task is already holding lock:
ffff88806f972a20 (sk_lock-AF_INET6){+.+.}-{0:0}, at: lock_sock include/net/sock.h:1709 [inline]
ffff88806f972a20 (sk_lock-AF_INET6){+.+.}-{0:0}, at: tcp_close+0x1d/0x110 net/ipv4/tcp.c:3349
which lock already depends on the new lock.
the existing dependency chain (in reverse order) is:
-> #6 (sk_lock-AF_INET6){+.+.}-{0:0}:
lock_sock_nested+0x41/0xf0 net/core/sock.c:3780
lock_sock include/net/sock.h:1709 [inline]
inet_shutdown+0x67/0x410 net/ipv4/af_inet.c:919
nbd_mark_nsock_dead+0xae/0x5c0 drivers/block/nbd.c:318
sock_shutdown+0x16b/0x200 drivers/block/nbd.c:411
nbd_clear_sock drivers/block/nbd.c:1427 [inline]
nbd_config_put+0x1eb/0x750 drivers/block/nbd.c:1451
nbd_genl_connect+0xaf8/0x1a40 drivers/block/nbd.c:2248
genl_family_rcv_msg_doit+0x214/0x300 net/netlink/genetlink.c:1114
genl_family_rcv_msg net/netlink/genetlink.c:1194 [inline]
genl_rcv_msg+0x560/0x800 net/netlink/genetlink.c:1209
netlink_rcv_skb+0x159/0x420 net/netlink/af_netlink.c:2550
genl_rcv+0x28/0x40 net/netlink/genetlink.c:1218
netlink_unicast_kernel net/netlink/af_netlink.c:1318 [inline]
netlink_unicast+0x5aa/0x870 net/netlink/af_netlink.c:1344
netlink_sendmsg+0x8b0/0xda0 net/netlink/af_netlink.c:1894
sock_sendmsg_nosec net/socket.c:727 [inline]
__sock_sendmsg net/socket.c:742 [inline]
____sys_sendmsg+0x9e1/0xb70 net/socket.c:2592
___sys_sendmsg+0x190/0x1e0 net/socket.c:2646
__sys_sendmsg+0x170/0x220 net/socket.c:2678
do_syscall_x64 arch/x86/entry/syscall_64.c:63 [inline]
do_syscall_64+0x106/0xf80 arch/x86/entry/syscall_64.c:94
entry_SYSCALL_64_after_hwframe+0x77/0x7f
-> #5 (&nsock->tx_lock){+.+.}-{4:4}:
__mutex_lock_common kernel/locking/mutex.c:614 [inline]
__mutex_lock+0x1a2/0x1b90 kernel/locking/mutex.c:776
nbd_handle_cmd drivers/block/nbd.c:1143 [inline]
nbd_queue_rq+0x428/0x1080 drivers/block/nbd.c:1207
blk_mq_dispatch_rq_list+0x422/0x1e70 block/blk-mq.c:2148
__blk_mq_do_dispatch_sched block/blk-mq-sched.c:168 [inline]
blk_mq_do_dispatch_sched block/blk-mq-sched.c:182 [inline]
__blk_mq_sched_dispatch_requests+0xcea/0x1620 block/blk-mq-sched.c:307
blk_mq_sched_dispatch_requests+0xd7/0x1c0 block/blk-mq-sched.c:329
blk_mq_run_hw_queue+0x23c/0x670 block/blk-mq.c:2386
blk_mq_dispatch_list+0x51d/0x1360 block/blk-mq.c:2949
blk_mq_flush_plug_list block/blk-mq.c:2997 [inline]
blk_mq_flush_plug_list+0x130/0x600 block/blk-mq.c:2969
__blk_flush_plug+0x2c4/0x4b0 block/blk-core.c:1230
blk_finish_plug block/blk-core.c:1257 [inline]
__submit_bio+0x584/0x6c0 block/blk-core.c:649
__submit_bio_noacct_mq block/blk-core.c:722 [inline]
submit_bio_noacct_nocheck+0x562/0xc10 block/blk-core.c:753
submit_bio_noacct+0xd17/0x2010 block/blk-core.c:884
blk_crypto_submit_bio include/linux/blk-crypto.h:203 [inline]
submit_bh_wbc+0x59c/0x770 fs/buffer.c:2821
submit_bh fs/buffer.c:2826 [inline]
block_read_full_folio+0x264/0x8e0 fs/buffer.c:2444
filemap_read_folio+0xfc/0x3b0 mm/filemap.c:2501
do_read_cache_folio+0x2d7/0x6b0 mm/filemap.c:4101
read_mapping_folio include/linux/pagemap.h:1028 [inline]
read_part_sector+0xd1/0x370 block/partitions/core.c:723
adfspart_check_ICS+0x93/0x910 block/partitions/acorn.c:360
check_partition block/partitions/core.c:142 [inline]
blk_add_partitions block/partitions/core.c:590 [inline]
bdev_disk_changed+0x7f8/0xc80 block/partitions/core.c:694
blkdev_get_whole+0x187/0x290 block/bdev.c:764
bdev_open+0x2c7/0xe40 block/bdev.c:973
blkdev_open+0x34e/0x4f0 block/fops.c:697
do_dentry_open+0x6d8/0x1660 fs/open.c:949
vfs_open+0x82/0x3f0 fs/open.c:1081
do_open fs/namei.c:4671 [inline]
path_openat+0x208c/0x31a0 fs/namei.c:4830
do_file_open+0x20e/0x430 fs/namei.c:4859
do_sys_openat2+0x10d/0x1e0 fs/open.c:1366
do_sys_open fs/open.c:1372 [inline]
__do_sys_openat fs/open.c:1388 [inline]
__se_sys_openat fs/open.c:1383 [inline]
__x64_sys_openat+0x12d/0x210 fs/open.c:1383
do_syscall_x64 arch/x86/entry/syscall_64.c:63 [inline]
do_syscall_64+0x106/0xf80 arch/x86/entry/syscall_64.c:94
entry_SYSCALL_64_after_hwframe+0x77/0x7f
-> #4 (&cmd->lock){+.+.}-{4:4}:
__mutex_lock_common kernel/locking/mutex.c:614 [inline]
__mutex_lock+0x1a2/0x1b90 kernel/locking/mutex.c:776
nbd_queue_rq+0xba/0x1080 drivers/block/nbd.c:1199
blk_mq_dispatch_rq_list+0x422/0x1e70 block/blk-mq.c:2148
__blk_mq_do_dispatch_sched block/blk-mq-sched.c:168 [inline]
blk_mq_do_dispatch_sched block/blk-mq-sched.c:182 [inline]
__blk_mq_sched_dispatch_requests+0xcea/0x1620 block/blk-mq-sched.c:307
blk_mq_sched_dispatch_requests+0xd7/0x1c0 block/blk-mq-sched.c:329
blk_mq_run_hw_queue+0x23c/0x670 block/blk-mq.c:2386
blk_mq_dispatch_list+0x51d/0x1360 block/blk-mq.c:2949
blk_mq_flush_plug_list block/blk-mq.c:2997 [inline]
blk_mq_flush_plug_list+0x130/0x600 block/blk-mq.c:2969
__blk_flush_plug+0x2c4/0x4b0 block/blk-core.c:1230
blk_finish_plug block/blk-core.c:1257 [inline]
__submit_bio+0x584/0x6c0 block/blk-core.c:649
__submit_bio_noacct_mq block/blk-core.c:722 [inline]
submit_bio_noacct_nocheck+0x562/0xc10 block/blk-core.c:753
submit_bio_noacct+0xd17/0x2010 block/blk-core.c:884
blk_crypto_submit_bio include/linux/blk-crypto.h:203 [inline]
submit_bh_wbc+0x59c/0x770 fs/buffer.c:2821
submit_bh fs/buffer.c:2826 [inline]
block_read_full_folio+0x264/0x8e0 fs/buffer.c:2444
filemap_read_folio+0xfc/0x3b0 mm/filemap.c:2501
do_read_cache_folio+0x2d7/0x6b0 mm/filemap.c:4101
read_mapping_folio include/linux/pagemap.h:1028 [inline]
read_part_sector+0xd1/0x370 block/partitions/core.c:723
adfspart_check_ICS+0x93/0x910 block/partitions/acorn.c:360
check_partition block/partitions/core.c:142 [inline]
blk_add_partitions block/partitions/core.c:590 [inline]
bdev_disk_changed+0x7f8/0xc80 block/partitions/core.c:694
blkdev_get_whole+0x187/0x290 block/bdev.c:764
bdev_open+0x2c7/0xe40 block/bdev.c:973
blkdev_open+0x34e/0x4f0 block/fops.c:697
do_dentry_open+0x6d8/0x1660 fs/open.c:949
vfs_open+0x82/0x3f0 fs/open.c:1081
do_open fs/namei.c:4671 [inline]
path_openat+0x208c/0x31a0 fs/namei.c:4830
do_file_open+0x20e/0x430 fs/namei.c:4859
do_sys_openat2+0x10d/0x1e0 fs/open.c:1366
do_sys_open fs/open.c:1372 [inline]
__do_sys_openat fs/open.c:1388 [inline]
__se_sys_openat fs/open.c:1383 [inline]
__x64_sys_openat+0x12d/0x210 fs/open.c:1383
do_syscall_x64 arch/x86/entry/syscall_64.c:63 [inline]
do_syscall_64+0x106/0xf80 arch/x86/entry/syscall_64.c:94
entry_SYSCALL_64_after_hwframe+0x77/0x7f
-> #3 (set->srcu){.+.+}-{0:0}:
srcu_lock_sync include/linux/srcu.h:199 [inline]
__synchronize_srcu+0xa1/0x2a0 kernel/rcu/srcutree.c:1505
blk_mq_wait_quiesce_done block/blk-mq.c:284 [inline]
blk_mq_wait_quiesce_done block/blk-mq.c:281 [inline]
blk_mq_quiesce_queue block/blk-mq.c:304 [inline]
blk_mq_quiesce_queue+0x149/0x1c0 block/blk-mq.c:299
elevator_switch+0x17b/0x7e0 block/elevator.c:576
elevator_change+0x352/0x530 block/elevator.c:681
elevator_set_default+0x29e/0x360 block/elevator.c:754
blk_register_queue+0x412/0x590 block/blk-sysfs.c:946
__add_disk+0x73f/0xe40 block/genhd.c:528
add_disk_fwnode+0x118/0x5c0 block/genhd.c:597
add_disk include/linux/blkdev.h:785 [inline]
nbd_dev_add+0x77a/0xb10 drivers/block/nbd.c:1984
nbd_init+0x291/0x2b0 drivers/block/nbd.c:2692
do_one_initcall+0x11d/0x760 init/main.c:1382
do_initcall_level init/main.c:1444 [inline]
do_initcalls init/main.c:1460 [inline]
do_basic_setup init/main.c:1479 [inline]
kernel_init_freeable+0x6e5/0x7a0 init/main.c:1692
kernel_init+0x1f/0x1e0 init/main.c:1582
ret_from_fork+0x754/0xd80 arch/x86/kernel/process.c:158
ret_from_fork_asm+0x1a/0x30 arch/x86/entry/entry_64.S:245
-> #2 (&q->elevator_lock){+.+.}-{4:4}:
__mutex_lock_common kernel/locking/mutex.c:614 [inline]
__mutex_lock+0x1a2/0x1b90 kernel/locking/mutex.c:776
elevator_change+0x1bc/0x530 block/elevator.c:679
elevator_set_none+0x92/0xf0 block/elevator.c:769
blk_mq_elv_switch_none block/blk-mq.c:5110 [inline]
__blk_mq_update_nr_hw_queues block/blk-mq.c:5155 [inline]
blk_mq_update_nr_hw_queues+0x4c1/0x15f0 block/blk-mq.c:5220
nbd_start_device+0x1a6/0xbd0 drivers/block/nbd.c:1489
nbd_genl_connect+0xff2/0x1a40 drivers/block/nbd.c:2239
genl_family_rcv_msg_doit+0x214/0x300 net/netlink/genetlink.c:1114
genl_family_rcv_msg net/netlink/genetlink.c:1194 [inline]
genl_rcv_msg+0x560/0x800 net/netlink/genetlink.c:1209
netlink_rcv_skb+0x159/0x420 net/netlink/af_netlink.c:2550
genl_rcv+0x28/0x40 net/netlink/genetlink.c:1218
netlink_unicast_kernel net/netlink/af_netlink.c:1318 [inline]
netlink_unicast+0x5aa/0x870 net/netlink/af_netlink.c:1344
netlink_sendmsg+0x8b0/0xda0 net/netlink/af_netlink.c:1894
sock_sendmsg_nosec net/socket.c:727 [inline]
__sock_sendmsg net/socket.c:742 [inline]
____sys_sendmsg+0x9e1/0xb70 net/socket.c:2592
___sys_sendmsg+0x190/0x1e0 net/socket.c:2646
__sys_sendmsg+0x170/0x220 net/socket.c:2678
do_syscall_x64 arch/x86/entry/syscall_64.c:63 [inline]
do_syscall_64+0x106/0xf80 arch/x86/entry/syscall_64.c:94
entry_SYSCALL_64_after_hwframe+0x77/0x7f
-> #1 (&q->q_usage_counter(io)#49){++++}-{0:0}:
blk_alloc_queue+0x610/0x790 block/blk-core.c:461
blk_mq_alloc_queue+0x174/0x290 block/blk-mq.c:4429
__blk_mq_alloc_disk+0x29/0x120 block/blk-mq.c:4476
nbd_dev_add+0x492/0xb10 drivers/block/nbd.c:1954
nbd_init+0x291/0x2b0 drivers/block/nbd.c:2692
do_one_initcall+0x11d/0x760 init/main.c:1382
do_initcall_level init/main.c:1444 [inline]
do_initcalls init/main.c:1460 [inline]
do_basic_setup init/main.c:1479 [inline]
kernel_init_freeable+0x6e5/0x7a0 init/main.c:1692
kernel_init+0x1f/0x1e0 init/main.c:1582
ret_from_fork+0x754/0xd80 arch/x86/kernel/process.c:158
ret_from_fork_asm+0x1a/0x30 arch/x86/entry/entry_64.S:245
-> #0 (fs_reclaim){+.+.}-{0:0}:
check_prev_add kernel/locking/lockdep.c:3165 [inline]
check_prevs_add kernel/locking/lockdep.c:3284 [inline]
validate_chain kernel/locking/lockdep.c:3908 [inline]
__lock_acquire+0x14b8/0x2630 kernel/locking/lockdep.c:5237
lock_acquire kernel/locking/lockdep.c:5868 [inline]
lock_acquire+0x1cf/0x380 kernel/locking/lockdep.c:5825
__fs_reclaim_acquire mm/page_alloc.c:4348 [inline]
fs_reclaim_acquire+0xc4/0x100 mm/page_alloc.c:4362
might_alloc include/linux/sched/mm.h:317 [inline]
slab_pre_alloc_hook mm/slub.c:4489 [inline]
slab_alloc_node mm/slub.c:4843 [inline]
kmem_cache_alloc_node_noprof+0x53/0x6f0 mm/slub.c:4918
__alloc_skb+0x140/0x710 net/core/skbuff.c:702
alloc_skb include/linux/skbuff.h:1383 [inline]
tcp_send_active_reset+0x8b/0xa60 net/ipv4/tcp_output.c:3862
__tcp_close+0x41e/0x1110 net/ipv4/tcp.c:3223
tcp_close+0x28/0x110 net/ipv4/tcp.c:3350
inet_release+0xed/0x200 net/ipv4/af_inet.c:443
inet6_release+0x4f/0x70 net/ipv6/af_inet6.c:479
__sock_release+0xb3/0x260 net/socket.c:662
sock_close+0x1c/0x30 net/socket.c:1455
__fput+0x3ff/0xb40 fs/file_table.c:469
task_work_run+0x150/0x240 kernel/task_work.c:233
resume_user_mode_work include/linux/resume_user_mode.h:50 [inline]
__exit_to_user_mode_loop kernel/entry/common.c:67 [inline]
exit_to_user_mode_loop+0x100/0x4a0 kernel/entry/common.c:98
__exit_to_user_mode_prepare include/linux/irq-entry-common.h:226 [inline]
syscall_exit_to_user_mode_prepare include/linux/irq-entry-common.h:256 [inline]
syscall_exit_to_user_mode include/linux/entry-common.h:325 [inline]
do_syscall_64+0x67c/0xf80 arch/x86/entry/syscall_64.c:100
entry_SYSCALL_64_after_hwframe+0x77/0x7f
other info that might help us debug this:
Chain exists of:
fs_reclaim --> &nsock->tx_lock --> sk_lock-AF_INET6
Possible unsafe locking scenario:
CPU0 CPU1
---- ----
lock(sk_lock-AF_INET6);
lock(&nsock->tx_lock);
lock(sk_lock-AF_INET6);
lock(fs_reclaim);
*** DEADLOCK ***
Fixes: fd8383f ("nbd: convert to blkmq")
Reported-by: [email protected]
Closes: https://lore.kernel.org/netdev/[email protected]/
Signed-off-by: Kuniyuki Iwashima <[email protected]>
blktests-ci Bot
pushed a commit
that referenced
this pull request
Mar 29, 2026
As reported by syzbot [0], NBD can trigger a deadlock during
memory reclaim.
This occurs when a process holds lock_sock() on a backend TCP
socket and triggers a memory allocation that leads to fs reclaim.
If it eventually calls into NBD to send data or shut down the
socket, NBD will attempt to acquire the same lock_sock(),
resulting in the deadlock.
While NBD sets sk->sk_allocation to GFP_NOIO before calling
sendmsg(), this does not prevent the issue in some paths where
GFP_KERNEL is used directly under lock_sock().
To resolve this, let's use lock_sock_try() for TCP sendmsg() and
shutdown().
For sock_sendmsg(), if lock_sock_try() fails, -ERESTARTSYS is
returned, allowing the request to be retried later (e.g., via
was_interrupted() logic).
For sock_sendmsg() for NBD_CMD_DISC and kernel_sock_shutdown(),
the operation might be skipped if the lock cannot be acquired.
However, this is not expected to occur in practice because the
backend TCP socket should not be touched by userspace once it is
handed over to NBD.
Note that sock_recvmsg() does not require this special handling
because it is only called from the workqueue context.
Also note that AF_UNIX sockets continue to use sock_sendmsg()
and kernel_sock_shutdown() because unix_stream_sendmsg() and
unix_shutdown() do not acquire lock_sock().
[0]:
WARNING: possible circular locking dependency detected
syzkaller #0 Tainted: G L
syz.7.2282/12353 is trying to acquire lock:
ffffffff8e9aa700 (fs_reclaim){+.+.}-{0:0}, at: might_alloc include/linux/sched/mm.h:317 [inline]
ffffffff8e9aa700 (fs_reclaim){+.+.}-{0:0}, at: slab_pre_alloc_hook mm/slub.c:4489 [inline]
ffffffff8e9aa700 (fs_reclaim){+.+.}-{0:0}, at: slab_alloc_node mm/slub.c:4843 [inline]
ffffffff8e9aa700 (fs_reclaim){+.+.}-{0:0}, at: kmem_cache_alloc_node_noprof+0x53/0x6f0 mm/slub.c:4918
but task is already holding lock:
ffff88806f972a20 (sk_lock-AF_INET6){+.+.}-{0:0}, at: lock_sock include/net/sock.h:1709 [inline]
ffff88806f972a20 (sk_lock-AF_INET6){+.+.}-{0:0}, at: tcp_close+0x1d/0x110 net/ipv4/tcp.c:3349
which lock already depends on the new lock.
the existing dependency chain (in reverse order) is:
-> #6 (sk_lock-AF_INET6){+.+.}-{0:0}:
lock_sock_nested+0x41/0xf0 net/core/sock.c:3780
lock_sock include/net/sock.h:1709 [inline]
inet_shutdown+0x67/0x410 net/ipv4/af_inet.c:919
nbd_mark_nsock_dead+0xae/0x5c0 drivers/block/nbd.c:318
sock_shutdown+0x16b/0x200 drivers/block/nbd.c:411
nbd_clear_sock drivers/block/nbd.c:1427 [inline]
nbd_config_put+0x1eb/0x750 drivers/block/nbd.c:1451
nbd_genl_connect+0xaf8/0x1a40 drivers/block/nbd.c:2248
genl_family_rcv_msg_doit+0x214/0x300 net/netlink/genetlink.c:1114
genl_family_rcv_msg net/netlink/genetlink.c:1194 [inline]
genl_rcv_msg+0x560/0x800 net/netlink/genetlink.c:1209
netlink_rcv_skb+0x159/0x420 net/netlink/af_netlink.c:2550
genl_rcv+0x28/0x40 net/netlink/genetlink.c:1218
netlink_unicast_kernel net/netlink/af_netlink.c:1318 [inline]
netlink_unicast+0x5aa/0x870 net/netlink/af_netlink.c:1344
netlink_sendmsg+0x8b0/0xda0 net/netlink/af_netlink.c:1894
sock_sendmsg_nosec net/socket.c:727 [inline]
__sock_sendmsg net/socket.c:742 [inline]
____sys_sendmsg+0x9e1/0xb70 net/socket.c:2592
___sys_sendmsg+0x190/0x1e0 net/socket.c:2646
__sys_sendmsg+0x170/0x220 net/socket.c:2678
do_syscall_x64 arch/x86/entry/syscall_64.c:63 [inline]
do_syscall_64+0x106/0xf80 arch/x86/entry/syscall_64.c:94
entry_SYSCALL_64_after_hwframe+0x77/0x7f
-> #5 (&nsock->tx_lock){+.+.}-{4:4}:
__mutex_lock_common kernel/locking/mutex.c:614 [inline]
__mutex_lock+0x1a2/0x1b90 kernel/locking/mutex.c:776
nbd_handle_cmd drivers/block/nbd.c:1143 [inline]
nbd_queue_rq+0x428/0x1080 drivers/block/nbd.c:1207
blk_mq_dispatch_rq_list+0x422/0x1e70 block/blk-mq.c:2148
__blk_mq_do_dispatch_sched block/blk-mq-sched.c:168 [inline]
blk_mq_do_dispatch_sched block/blk-mq-sched.c:182 [inline]
__blk_mq_sched_dispatch_requests+0xcea/0x1620 block/blk-mq-sched.c:307
blk_mq_sched_dispatch_requests+0xd7/0x1c0 block/blk-mq-sched.c:329
blk_mq_run_hw_queue+0x23c/0x670 block/blk-mq.c:2386
blk_mq_dispatch_list+0x51d/0x1360 block/blk-mq.c:2949
blk_mq_flush_plug_list block/blk-mq.c:2997 [inline]
blk_mq_flush_plug_list+0x130/0x600 block/blk-mq.c:2969
__blk_flush_plug+0x2c4/0x4b0 block/blk-core.c:1230
blk_finish_plug block/blk-core.c:1257 [inline]
__submit_bio+0x584/0x6c0 block/blk-core.c:649
__submit_bio_noacct_mq block/blk-core.c:722 [inline]
submit_bio_noacct_nocheck+0x562/0xc10 block/blk-core.c:753
submit_bio_noacct+0xd17/0x2010 block/blk-core.c:884
blk_crypto_submit_bio include/linux/blk-crypto.h:203 [inline]
submit_bh_wbc+0x59c/0x770 fs/buffer.c:2821
submit_bh fs/buffer.c:2826 [inline]
block_read_full_folio+0x264/0x8e0 fs/buffer.c:2444
filemap_read_folio+0xfc/0x3b0 mm/filemap.c:2501
do_read_cache_folio+0x2d7/0x6b0 mm/filemap.c:4101
read_mapping_folio include/linux/pagemap.h:1028 [inline]
read_part_sector+0xd1/0x370 block/partitions/core.c:723
adfspart_check_ICS+0x93/0x910 block/partitions/acorn.c:360
check_partition block/partitions/core.c:142 [inline]
blk_add_partitions block/partitions/core.c:590 [inline]
bdev_disk_changed+0x7f8/0xc80 block/partitions/core.c:694
blkdev_get_whole+0x187/0x290 block/bdev.c:764
bdev_open+0x2c7/0xe40 block/bdev.c:973
blkdev_open+0x34e/0x4f0 block/fops.c:697
do_dentry_open+0x6d8/0x1660 fs/open.c:949
vfs_open+0x82/0x3f0 fs/open.c:1081
do_open fs/namei.c:4671 [inline]
path_openat+0x208c/0x31a0 fs/namei.c:4830
do_file_open+0x20e/0x430 fs/namei.c:4859
do_sys_openat2+0x10d/0x1e0 fs/open.c:1366
do_sys_open fs/open.c:1372 [inline]
__do_sys_openat fs/open.c:1388 [inline]
__se_sys_openat fs/open.c:1383 [inline]
__x64_sys_openat+0x12d/0x210 fs/open.c:1383
do_syscall_x64 arch/x86/entry/syscall_64.c:63 [inline]
do_syscall_64+0x106/0xf80 arch/x86/entry/syscall_64.c:94
entry_SYSCALL_64_after_hwframe+0x77/0x7f
-> #4 (&cmd->lock){+.+.}-{4:4}:
__mutex_lock_common kernel/locking/mutex.c:614 [inline]
__mutex_lock+0x1a2/0x1b90 kernel/locking/mutex.c:776
nbd_queue_rq+0xba/0x1080 drivers/block/nbd.c:1199
blk_mq_dispatch_rq_list+0x422/0x1e70 block/blk-mq.c:2148
__blk_mq_do_dispatch_sched block/blk-mq-sched.c:168 [inline]
blk_mq_do_dispatch_sched block/blk-mq-sched.c:182 [inline]
__blk_mq_sched_dispatch_requests+0xcea/0x1620 block/blk-mq-sched.c:307
blk_mq_sched_dispatch_requests+0xd7/0x1c0 block/blk-mq-sched.c:329
blk_mq_run_hw_queue+0x23c/0x670 block/blk-mq.c:2386
blk_mq_dispatch_list+0x51d/0x1360 block/blk-mq.c:2949
blk_mq_flush_plug_list block/blk-mq.c:2997 [inline]
blk_mq_flush_plug_list+0x130/0x600 block/blk-mq.c:2969
__blk_flush_plug+0x2c4/0x4b0 block/blk-core.c:1230
blk_finish_plug block/blk-core.c:1257 [inline]
__submit_bio+0x584/0x6c0 block/blk-core.c:649
__submit_bio_noacct_mq block/blk-core.c:722 [inline]
submit_bio_noacct_nocheck+0x562/0xc10 block/blk-core.c:753
submit_bio_noacct+0xd17/0x2010 block/blk-core.c:884
blk_crypto_submit_bio include/linux/blk-crypto.h:203 [inline]
submit_bh_wbc+0x59c/0x770 fs/buffer.c:2821
submit_bh fs/buffer.c:2826 [inline]
block_read_full_folio+0x264/0x8e0 fs/buffer.c:2444
filemap_read_folio+0xfc/0x3b0 mm/filemap.c:2501
do_read_cache_folio+0x2d7/0x6b0 mm/filemap.c:4101
read_mapping_folio include/linux/pagemap.h:1028 [inline]
read_part_sector+0xd1/0x370 block/partitions/core.c:723
adfspart_check_ICS+0x93/0x910 block/partitions/acorn.c:360
check_partition block/partitions/core.c:142 [inline]
blk_add_partitions block/partitions/core.c:590 [inline]
bdev_disk_changed+0x7f8/0xc80 block/partitions/core.c:694
blkdev_get_whole+0x187/0x290 block/bdev.c:764
bdev_open+0x2c7/0xe40 block/bdev.c:973
blkdev_open+0x34e/0x4f0 block/fops.c:697
do_dentry_open+0x6d8/0x1660 fs/open.c:949
vfs_open+0x82/0x3f0 fs/open.c:1081
do_open fs/namei.c:4671 [inline]
path_openat+0x208c/0x31a0 fs/namei.c:4830
do_file_open+0x20e/0x430 fs/namei.c:4859
do_sys_openat2+0x10d/0x1e0 fs/open.c:1366
do_sys_open fs/open.c:1372 [inline]
__do_sys_openat fs/open.c:1388 [inline]
__se_sys_openat fs/open.c:1383 [inline]
__x64_sys_openat+0x12d/0x210 fs/open.c:1383
do_syscall_x64 arch/x86/entry/syscall_64.c:63 [inline]
do_syscall_64+0x106/0xf80 arch/x86/entry/syscall_64.c:94
entry_SYSCALL_64_after_hwframe+0x77/0x7f
-> #3 (set->srcu){.+.+}-{0:0}:
srcu_lock_sync include/linux/srcu.h:199 [inline]
__synchronize_srcu+0xa1/0x2a0 kernel/rcu/srcutree.c:1505
blk_mq_wait_quiesce_done block/blk-mq.c:284 [inline]
blk_mq_wait_quiesce_done block/blk-mq.c:281 [inline]
blk_mq_quiesce_queue block/blk-mq.c:304 [inline]
blk_mq_quiesce_queue+0x149/0x1c0 block/blk-mq.c:299
elevator_switch+0x17b/0x7e0 block/elevator.c:576
elevator_change+0x352/0x530 block/elevator.c:681
elevator_set_default+0x29e/0x360 block/elevator.c:754
blk_register_queue+0x412/0x590 block/blk-sysfs.c:946
__add_disk+0x73f/0xe40 block/genhd.c:528
add_disk_fwnode+0x118/0x5c0 block/genhd.c:597
add_disk include/linux/blkdev.h:785 [inline]
nbd_dev_add+0x77a/0xb10 drivers/block/nbd.c:1984
nbd_init+0x291/0x2b0 drivers/block/nbd.c:2692
do_one_initcall+0x11d/0x760 init/main.c:1382
do_initcall_level init/main.c:1444 [inline]
do_initcalls init/main.c:1460 [inline]
do_basic_setup init/main.c:1479 [inline]
kernel_init_freeable+0x6e5/0x7a0 init/main.c:1692
kernel_init+0x1f/0x1e0 init/main.c:1582
ret_from_fork+0x754/0xd80 arch/x86/kernel/process.c:158
ret_from_fork_asm+0x1a/0x30 arch/x86/entry/entry_64.S:245
-> #2 (&q->elevator_lock){+.+.}-{4:4}:
__mutex_lock_common kernel/locking/mutex.c:614 [inline]
__mutex_lock+0x1a2/0x1b90 kernel/locking/mutex.c:776
elevator_change+0x1bc/0x530 block/elevator.c:679
elevator_set_none+0x92/0xf0 block/elevator.c:769
blk_mq_elv_switch_none block/blk-mq.c:5110 [inline]
__blk_mq_update_nr_hw_queues block/blk-mq.c:5155 [inline]
blk_mq_update_nr_hw_queues+0x4c1/0x15f0 block/blk-mq.c:5220
nbd_start_device+0x1a6/0xbd0 drivers/block/nbd.c:1489
nbd_genl_connect+0xff2/0x1a40 drivers/block/nbd.c:2239
genl_family_rcv_msg_doit+0x214/0x300 net/netlink/genetlink.c:1114
genl_family_rcv_msg net/netlink/genetlink.c:1194 [inline]
genl_rcv_msg+0x560/0x800 net/netlink/genetlink.c:1209
netlink_rcv_skb+0x159/0x420 net/netlink/af_netlink.c:2550
genl_rcv+0x28/0x40 net/netlink/genetlink.c:1218
netlink_unicast_kernel net/netlink/af_netlink.c:1318 [inline]
netlink_unicast+0x5aa/0x870 net/netlink/af_netlink.c:1344
netlink_sendmsg+0x8b0/0xda0 net/netlink/af_netlink.c:1894
sock_sendmsg_nosec net/socket.c:727 [inline]
__sock_sendmsg net/socket.c:742 [inline]
____sys_sendmsg+0x9e1/0xb70 net/socket.c:2592
___sys_sendmsg+0x190/0x1e0 net/socket.c:2646
__sys_sendmsg+0x170/0x220 net/socket.c:2678
do_syscall_x64 arch/x86/entry/syscall_64.c:63 [inline]
do_syscall_64+0x106/0xf80 arch/x86/entry/syscall_64.c:94
entry_SYSCALL_64_after_hwframe+0x77/0x7f
-> #1 (&q->q_usage_counter(io)#49){++++}-{0:0}:
blk_alloc_queue+0x610/0x790 block/blk-core.c:461
blk_mq_alloc_queue+0x174/0x290 block/blk-mq.c:4429
__blk_mq_alloc_disk+0x29/0x120 block/blk-mq.c:4476
nbd_dev_add+0x492/0xb10 drivers/block/nbd.c:1954
nbd_init+0x291/0x2b0 drivers/block/nbd.c:2692
do_one_initcall+0x11d/0x760 init/main.c:1382
do_initcall_level init/main.c:1444 [inline]
do_initcalls init/main.c:1460 [inline]
do_basic_setup init/main.c:1479 [inline]
kernel_init_freeable+0x6e5/0x7a0 init/main.c:1692
kernel_init+0x1f/0x1e0 init/main.c:1582
ret_from_fork+0x754/0xd80 arch/x86/kernel/process.c:158
ret_from_fork_asm+0x1a/0x30 arch/x86/entry/entry_64.S:245
-> #0 (fs_reclaim){+.+.}-{0:0}:
check_prev_add kernel/locking/lockdep.c:3165 [inline]
check_prevs_add kernel/locking/lockdep.c:3284 [inline]
validate_chain kernel/locking/lockdep.c:3908 [inline]
__lock_acquire+0x14b8/0x2630 kernel/locking/lockdep.c:5237
lock_acquire kernel/locking/lockdep.c:5868 [inline]
lock_acquire+0x1cf/0x380 kernel/locking/lockdep.c:5825
__fs_reclaim_acquire mm/page_alloc.c:4348 [inline]
fs_reclaim_acquire+0xc4/0x100 mm/page_alloc.c:4362
might_alloc include/linux/sched/mm.h:317 [inline]
slab_pre_alloc_hook mm/slub.c:4489 [inline]
slab_alloc_node mm/slub.c:4843 [inline]
kmem_cache_alloc_node_noprof+0x53/0x6f0 mm/slub.c:4918
__alloc_skb+0x140/0x710 net/core/skbuff.c:702
alloc_skb include/linux/skbuff.h:1383 [inline]
tcp_send_active_reset+0x8b/0xa60 net/ipv4/tcp_output.c:3862
__tcp_close+0x41e/0x1110 net/ipv4/tcp.c:3223
tcp_close+0x28/0x110 net/ipv4/tcp.c:3350
inet_release+0xed/0x200 net/ipv4/af_inet.c:443
inet6_release+0x4f/0x70 net/ipv6/af_inet6.c:479
__sock_release+0xb3/0x260 net/socket.c:662
sock_close+0x1c/0x30 net/socket.c:1455
__fput+0x3ff/0xb40 fs/file_table.c:469
task_work_run+0x150/0x240 kernel/task_work.c:233
resume_user_mode_work include/linux/resume_user_mode.h:50 [inline]
__exit_to_user_mode_loop kernel/entry/common.c:67 [inline]
exit_to_user_mode_loop+0x100/0x4a0 kernel/entry/common.c:98
__exit_to_user_mode_prepare include/linux/irq-entry-common.h:226 [inline]
syscall_exit_to_user_mode_prepare include/linux/irq-entry-common.h:256 [inline]
syscall_exit_to_user_mode include/linux/entry-common.h:325 [inline]
do_syscall_64+0x67c/0xf80 arch/x86/entry/syscall_64.c:100
entry_SYSCALL_64_after_hwframe+0x77/0x7f
other info that might help us debug this:
Chain exists of:
fs_reclaim --> &nsock->tx_lock --> sk_lock-AF_INET6
Possible unsafe locking scenario:
CPU0 CPU1
---- ----
lock(sk_lock-AF_INET6);
lock(&nsock->tx_lock);
lock(sk_lock-AF_INET6);
lock(fs_reclaim);
*** DEADLOCK ***
Fixes: fd8383f ("nbd: convert to blkmq")
Reported-by: [email protected]
Closes: https://lore.kernel.org/netdev/[email protected]/
Signed-off-by: Kuniyuki Iwashima <[email protected]>
blktests-ci Bot
pushed a commit
that referenced
this pull request
Mar 30, 2026
As reported by syzbot [0], NBD can trigger a deadlock during
memory reclaim.
This occurs when a process holds lock_sock() on a backend TCP
socket and triggers a memory allocation that leads to fs reclaim.
If it eventually calls into NBD to send data or shut down the
socket, NBD will attempt to acquire the same lock_sock(),
resulting in the deadlock.
While NBD sets sk->sk_allocation to GFP_NOIO before calling
sendmsg(), this does not prevent the issue in some paths where
GFP_KERNEL is used directly under lock_sock().
To resolve this, let's use lock_sock_try() for TCP sendmsg() and
shutdown().
For sock_sendmsg(), if lock_sock_try() fails, -ERESTARTSYS is
returned, allowing the request to be retried later (e.g., via
was_interrupted() logic).
For sock_sendmsg() for NBD_CMD_DISC and kernel_sock_shutdown(),
the operation might be skipped if the lock cannot be acquired.
However, this is not expected to occur in practice because the
backend TCP socket should not be touched by userspace once it is
handed over to NBD.
Note that sock_recvmsg() does not require this special handling
because it is only called from the workqueue context.
Also note that AF_UNIX sockets continue to use sock_sendmsg()
and kernel_sock_shutdown() because unix_stream_sendmsg() and
unix_shutdown() do not acquire lock_sock().
[0]:
WARNING: possible circular locking dependency detected
syzkaller #0 Tainted: G L
syz.7.2282/12353 is trying to acquire lock:
ffffffff8e9aa700 (fs_reclaim){+.+.}-{0:0}, at: might_alloc include/linux/sched/mm.h:317 [inline]
ffffffff8e9aa700 (fs_reclaim){+.+.}-{0:0}, at: slab_pre_alloc_hook mm/slub.c:4489 [inline]
ffffffff8e9aa700 (fs_reclaim){+.+.}-{0:0}, at: slab_alloc_node mm/slub.c:4843 [inline]
ffffffff8e9aa700 (fs_reclaim){+.+.}-{0:0}, at: kmem_cache_alloc_node_noprof+0x53/0x6f0 mm/slub.c:4918
but task is already holding lock:
ffff88806f972a20 (sk_lock-AF_INET6){+.+.}-{0:0}, at: lock_sock include/net/sock.h:1709 [inline]
ffff88806f972a20 (sk_lock-AF_INET6){+.+.}-{0:0}, at: tcp_close+0x1d/0x110 net/ipv4/tcp.c:3349
which lock already depends on the new lock.
the existing dependency chain (in reverse order) is:
-> #6 (sk_lock-AF_INET6){+.+.}-{0:0}:
lock_sock_nested+0x41/0xf0 net/core/sock.c:3780
lock_sock include/net/sock.h:1709 [inline]
inet_shutdown+0x67/0x410 net/ipv4/af_inet.c:919
nbd_mark_nsock_dead+0xae/0x5c0 drivers/block/nbd.c:318
sock_shutdown+0x16b/0x200 drivers/block/nbd.c:411
nbd_clear_sock drivers/block/nbd.c:1427 [inline]
nbd_config_put+0x1eb/0x750 drivers/block/nbd.c:1451
nbd_genl_connect+0xaf8/0x1a40 drivers/block/nbd.c:2248
genl_family_rcv_msg_doit+0x214/0x300 net/netlink/genetlink.c:1114
genl_family_rcv_msg net/netlink/genetlink.c:1194 [inline]
genl_rcv_msg+0x560/0x800 net/netlink/genetlink.c:1209
netlink_rcv_skb+0x159/0x420 net/netlink/af_netlink.c:2550
genl_rcv+0x28/0x40 net/netlink/genetlink.c:1218
netlink_unicast_kernel net/netlink/af_netlink.c:1318 [inline]
netlink_unicast+0x5aa/0x870 net/netlink/af_netlink.c:1344
netlink_sendmsg+0x8b0/0xda0 net/netlink/af_netlink.c:1894
sock_sendmsg_nosec net/socket.c:727 [inline]
__sock_sendmsg net/socket.c:742 [inline]
____sys_sendmsg+0x9e1/0xb70 net/socket.c:2592
___sys_sendmsg+0x190/0x1e0 net/socket.c:2646
__sys_sendmsg+0x170/0x220 net/socket.c:2678
do_syscall_x64 arch/x86/entry/syscall_64.c:63 [inline]
do_syscall_64+0x106/0xf80 arch/x86/entry/syscall_64.c:94
entry_SYSCALL_64_after_hwframe+0x77/0x7f
-> #5 (&nsock->tx_lock){+.+.}-{4:4}:
__mutex_lock_common kernel/locking/mutex.c:614 [inline]
__mutex_lock+0x1a2/0x1b90 kernel/locking/mutex.c:776
nbd_handle_cmd drivers/block/nbd.c:1143 [inline]
nbd_queue_rq+0x428/0x1080 drivers/block/nbd.c:1207
blk_mq_dispatch_rq_list+0x422/0x1e70 block/blk-mq.c:2148
__blk_mq_do_dispatch_sched block/blk-mq-sched.c:168 [inline]
blk_mq_do_dispatch_sched block/blk-mq-sched.c:182 [inline]
__blk_mq_sched_dispatch_requests+0xcea/0x1620 block/blk-mq-sched.c:307
blk_mq_sched_dispatch_requests+0xd7/0x1c0 block/blk-mq-sched.c:329
blk_mq_run_hw_queue+0x23c/0x670 block/blk-mq.c:2386
blk_mq_dispatch_list+0x51d/0x1360 block/blk-mq.c:2949
blk_mq_flush_plug_list block/blk-mq.c:2997 [inline]
blk_mq_flush_plug_list+0x130/0x600 block/blk-mq.c:2969
__blk_flush_plug+0x2c4/0x4b0 block/blk-core.c:1230
blk_finish_plug block/blk-core.c:1257 [inline]
__submit_bio+0x584/0x6c0 block/blk-core.c:649
__submit_bio_noacct_mq block/blk-core.c:722 [inline]
submit_bio_noacct_nocheck+0x562/0xc10 block/blk-core.c:753
submit_bio_noacct+0xd17/0x2010 block/blk-core.c:884
blk_crypto_submit_bio include/linux/blk-crypto.h:203 [inline]
submit_bh_wbc+0x59c/0x770 fs/buffer.c:2821
submit_bh fs/buffer.c:2826 [inline]
block_read_full_folio+0x264/0x8e0 fs/buffer.c:2444
filemap_read_folio+0xfc/0x3b0 mm/filemap.c:2501
do_read_cache_folio+0x2d7/0x6b0 mm/filemap.c:4101
read_mapping_folio include/linux/pagemap.h:1028 [inline]
read_part_sector+0xd1/0x370 block/partitions/core.c:723
adfspart_check_ICS+0x93/0x910 block/partitions/acorn.c:360
check_partition block/partitions/core.c:142 [inline]
blk_add_partitions block/partitions/core.c:590 [inline]
bdev_disk_changed+0x7f8/0xc80 block/partitions/core.c:694
blkdev_get_whole+0x187/0x290 block/bdev.c:764
bdev_open+0x2c7/0xe40 block/bdev.c:973
blkdev_open+0x34e/0x4f0 block/fops.c:697
do_dentry_open+0x6d8/0x1660 fs/open.c:949
vfs_open+0x82/0x3f0 fs/open.c:1081
do_open fs/namei.c:4671 [inline]
path_openat+0x208c/0x31a0 fs/namei.c:4830
do_file_open+0x20e/0x430 fs/namei.c:4859
do_sys_openat2+0x10d/0x1e0 fs/open.c:1366
do_sys_open fs/open.c:1372 [inline]
__do_sys_openat fs/open.c:1388 [inline]
__se_sys_openat fs/open.c:1383 [inline]
__x64_sys_openat+0x12d/0x210 fs/open.c:1383
do_syscall_x64 arch/x86/entry/syscall_64.c:63 [inline]
do_syscall_64+0x106/0xf80 arch/x86/entry/syscall_64.c:94
entry_SYSCALL_64_after_hwframe+0x77/0x7f
-> #4 (&cmd->lock){+.+.}-{4:4}:
__mutex_lock_common kernel/locking/mutex.c:614 [inline]
__mutex_lock+0x1a2/0x1b90 kernel/locking/mutex.c:776
nbd_queue_rq+0xba/0x1080 drivers/block/nbd.c:1199
blk_mq_dispatch_rq_list+0x422/0x1e70 block/blk-mq.c:2148
__blk_mq_do_dispatch_sched block/blk-mq-sched.c:168 [inline]
blk_mq_do_dispatch_sched block/blk-mq-sched.c:182 [inline]
__blk_mq_sched_dispatch_requests+0xcea/0x1620 block/blk-mq-sched.c:307
blk_mq_sched_dispatch_requests+0xd7/0x1c0 block/blk-mq-sched.c:329
blk_mq_run_hw_queue+0x23c/0x670 block/blk-mq.c:2386
blk_mq_dispatch_list+0x51d/0x1360 block/blk-mq.c:2949
blk_mq_flush_plug_list block/blk-mq.c:2997 [inline]
blk_mq_flush_plug_list+0x130/0x600 block/blk-mq.c:2969
__blk_flush_plug+0x2c4/0x4b0 block/blk-core.c:1230
blk_finish_plug block/blk-core.c:1257 [inline]
__submit_bio+0x584/0x6c0 block/blk-core.c:649
__submit_bio_noacct_mq block/blk-core.c:722 [inline]
submit_bio_noacct_nocheck+0x562/0xc10 block/blk-core.c:753
submit_bio_noacct+0xd17/0x2010 block/blk-core.c:884
blk_crypto_submit_bio include/linux/blk-crypto.h:203 [inline]
submit_bh_wbc+0x59c/0x770 fs/buffer.c:2821
submit_bh fs/buffer.c:2826 [inline]
block_read_full_folio+0x264/0x8e0 fs/buffer.c:2444
filemap_read_folio+0xfc/0x3b0 mm/filemap.c:2501
do_read_cache_folio+0x2d7/0x6b0 mm/filemap.c:4101
read_mapping_folio include/linux/pagemap.h:1028 [inline]
read_part_sector+0xd1/0x370 block/partitions/core.c:723
adfspart_check_ICS+0x93/0x910 block/partitions/acorn.c:360
check_partition block/partitions/core.c:142 [inline]
blk_add_partitions block/partitions/core.c:590 [inline]
bdev_disk_changed+0x7f8/0xc80 block/partitions/core.c:694
blkdev_get_whole+0x187/0x290 block/bdev.c:764
bdev_open+0x2c7/0xe40 block/bdev.c:973
blkdev_open+0x34e/0x4f0 block/fops.c:697
do_dentry_open+0x6d8/0x1660 fs/open.c:949
vfs_open+0x82/0x3f0 fs/open.c:1081
do_open fs/namei.c:4671 [inline]
path_openat+0x208c/0x31a0 fs/namei.c:4830
do_file_open+0x20e/0x430 fs/namei.c:4859
do_sys_openat2+0x10d/0x1e0 fs/open.c:1366
do_sys_open fs/open.c:1372 [inline]
__do_sys_openat fs/open.c:1388 [inline]
__se_sys_openat fs/open.c:1383 [inline]
__x64_sys_openat+0x12d/0x210 fs/open.c:1383
do_syscall_x64 arch/x86/entry/syscall_64.c:63 [inline]
do_syscall_64+0x106/0xf80 arch/x86/entry/syscall_64.c:94
entry_SYSCALL_64_after_hwframe+0x77/0x7f
-> #3 (set->srcu){.+.+}-{0:0}:
srcu_lock_sync include/linux/srcu.h:199 [inline]
__synchronize_srcu+0xa1/0x2a0 kernel/rcu/srcutree.c:1505
blk_mq_wait_quiesce_done block/blk-mq.c:284 [inline]
blk_mq_wait_quiesce_done block/blk-mq.c:281 [inline]
blk_mq_quiesce_queue block/blk-mq.c:304 [inline]
blk_mq_quiesce_queue+0x149/0x1c0 block/blk-mq.c:299
elevator_switch+0x17b/0x7e0 block/elevator.c:576
elevator_change+0x352/0x530 block/elevator.c:681
elevator_set_default+0x29e/0x360 block/elevator.c:754
blk_register_queue+0x412/0x590 block/blk-sysfs.c:946
__add_disk+0x73f/0xe40 block/genhd.c:528
add_disk_fwnode+0x118/0x5c0 block/genhd.c:597
add_disk include/linux/blkdev.h:785 [inline]
nbd_dev_add+0x77a/0xb10 drivers/block/nbd.c:1984
nbd_init+0x291/0x2b0 drivers/block/nbd.c:2692
do_one_initcall+0x11d/0x760 init/main.c:1382
do_initcall_level init/main.c:1444 [inline]
do_initcalls init/main.c:1460 [inline]
do_basic_setup init/main.c:1479 [inline]
kernel_init_freeable+0x6e5/0x7a0 init/main.c:1692
kernel_init+0x1f/0x1e0 init/main.c:1582
ret_from_fork+0x754/0xd80 arch/x86/kernel/process.c:158
ret_from_fork_asm+0x1a/0x30 arch/x86/entry/entry_64.S:245
-> #2 (&q->elevator_lock){+.+.}-{4:4}:
__mutex_lock_common kernel/locking/mutex.c:614 [inline]
__mutex_lock+0x1a2/0x1b90 kernel/locking/mutex.c:776
elevator_change+0x1bc/0x530 block/elevator.c:679
elevator_set_none+0x92/0xf0 block/elevator.c:769
blk_mq_elv_switch_none block/blk-mq.c:5110 [inline]
__blk_mq_update_nr_hw_queues block/blk-mq.c:5155 [inline]
blk_mq_update_nr_hw_queues+0x4c1/0x15f0 block/blk-mq.c:5220
nbd_start_device+0x1a6/0xbd0 drivers/block/nbd.c:1489
nbd_genl_connect+0xff2/0x1a40 drivers/block/nbd.c:2239
genl_family_rcv_msg_doit+0x214/0x300 net/netlink/genetlink.c:1114
genl_family_rcv_msg net/netlink/genetlink.c:1194 [inline]
genl_rcv_msg+0x560/0x800 net/netlink/genetlink.c:1209
netlink_rcv_skb+0x159/0x420 net/netlink/af_netlink.c:2550
genl_rcv+0x28/0x40 net/netlink/genetlink.c:1218
netlink_unicast_kernel net/netlink/af_netlink.c:1318 [inline]
netlink_unicast+0x5aa/0x870 net/netlink/af_netlink.c:1344
netlink_sendmsg+0x8b0/0xda0 net/netlink/af_netlink.c:1894
sock_sendmsg_nosec net/socket.c:727 [inline]
__sock_sendmsg net/socket.c:742 [inline]
____sys_sendmsg+0x9e1/0xb70 net/socket.c:2592
___sys_sendmsg+0x190/0x1e0 net/socket.c:2646
__sys_sendmsg+0x170/0x220 net/socket.c:2678
do_syscall_x64 arch/x86/entry/syscall_64.c:63 [inline]
do_syscall_64+0x106/0xf80 arch/x86/entry/syscall_64.c:94
entry_SYSCALL_64_after_hwframe+0x77/0x7f
-> #1 (&q->q_usage_counter(io)#49){++++}-{0:0}:
blk_alloc_queue+0x610/0x790 block/blk-core.c:461
blk_mq_alloc_queue+0x174/0x290 block/blk-mq.c:4429
__blk_mq_alloc_disk+0x29/0x120 block/blk-mq.c:4476
nbd_dev_add+0x492/0xb10 drivers/block/nbd.c:1954
nbd_init+0x291/0x2b0 drivers/block/nbd.c:2692
do_one_initcall+0x11d/0x760 init/main.c:1382
do_initcall_level init/main.c:1444 [inline]
do_initcalls init/main.c:1460 [inline]
do_basic_setup init/main.c:1479 [inline]
kernel_init_freeable+0x6e5/0x7a0 init/main.c:1692
kernel_init+0x1f/0x1e0 init/main.c:1582
ret_from_fork+0x754/0xd80 arch/x86/kernel/process.c:158
ret_from_fork_asm+0x1a/0x30 arch/x86/entry/entry_64.S:245
-> #0 (fs_reclaim){+.+.}-{0:0}:
check_prev_add kernel/locking/lockdep.c:3165 [inline]
check_prevs_add kernel/locking/lockdep.c:3284 [inline]
validate_chain kernel/locking/lockdep.c:3908 [inline]
__lock_acquire+0x14b8/0x2630 kernel/locking/lockdep.c:5237
lock_acquire kernel/locking/lockdep.c:5868 [inline]
lock_acquire+0x1cf/0x380 kernel/locking/lockdep.c:5825
__fs_reclaim_acquire mm/page_alloc.c:4348 [inline]
fs_reclaim_acquire+0xc4/0x100 mm/page_alloc.c:4362
might_alloc include/linux/sched/mm.h:317 [inline]
slab_pre_alloc_hook mm/slub.c:4489 [inline]
slab_alloc_node mm/slub.c:4843 [inline]
kmem_cache_alloc_node_noprof+0x53/0x6f0 mm/slub.c:4918
__alloc_skb+0x140/0x710 net/core/skbuff.c:702
alloc_skb include/linux/skbuff.h:1383 [inline]
tcp_send_active_reset+0x8b/0xa60 net/ipv4/tcp_output.c:3862
__tcp_close+0x41e/0x1110 net/ipv4/tcp.c:3223
tcp_close+0x28/0x110 net/ipv4/tcp.c:3350
inet_release+0xed/0x200 net/ipv4/af_inet.c:443
inet6_release+0x4f/0x70 net/ipv6/af_inet6.c:479
__sock_release+0xb3/0x260 net/socket.c:662
sock_close+0x1c/0x30 net/socket.c:1455
__fput+0x3ff/0xb40 fs/file_table.c:469
task_work_run+0x150/0x240 kernel/task_work.c:233
resume_user_mode_work include/linux/resume_user_mode.h:50 [inline]
__exit_to_user_mode_loop kernel/entry/common.c:67 [inline]
exit_to_user_mode_loop+0x100/0x4a0 kernel/entry/common.c:98
__exit_to_user_mode_prepare include/linux/irq-entry-common.h:226 [inline]
syscall_exit_to_user_mode_prepare include/linux/irq-entry-common.h:256 [inline]
syscall_exit_to_user_mode include/linux/entry-common.h:325 [inline]
do_syscall_64+0x67c/0xf80 arch/x86/entry/syscall_64.c:100
entry_SYSCALL_64_after_hwframe+0x77/0x7f
other info that might help us debug this:
Chain exists of:
fs_reclaim --> &nsock->tx_lock --> sk_lock-AF_INET6
Possible unsafe locking scenario:
CPU0 CPU1
---- ----
lock(sk_lock-AF_INET6);
lock(&nsock->tx_lock);
lock(sk_lock-AF_INET6);
lock(fs_reclaim);
*** DEADLOCK ***
Fixes: fd8383f ("nbd: convert to blkmq")
Reported-by: [email protected]
Closes: https://lore.kernel.org/netdev/[email protected]/
Signed-off-by: Kuniyuki Iwashima <[email protected]>
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
Pull request for series with
subject: fallocate: introduce FALLOC_FL_WRITE_ZEROES flag
version: 1
url: https://patchwork.kernel.org/project/linux-block/list/?series=968463