fallocate: introduce FALLOC_FL_WRITE_ZEROES flag#5

Closed
blktests-ci[bot] wants to merge 18 commits into for-next_base from series/968463=>for-next

Conversation

blktests-ci Bot commented Jun 10, 2025

Pull request for series with
subject: fallocate: introduce FALLOC_FL_WRITE_ZEROES flag
version: 1
url: https://patchwork.kernel.org/project/linux-block/list/?series=968463

axboe and others added 18 commits June 2, 2025 12:00
* io_uring-6.16:
  MAINTAINERS: remove myself from io_uring
  io_uring/net: only consider msg_inq if larger than 1
  io_uring/zcrx: fix area release on registration failure
  io_uring/zcrx: init id for xa_find
* block-6.16:
  selftests: ublk: cover PER_IO_DAEMON in more stress tests
  Documentation: ublk: document UBLK_F_PER_IO_DAEMON
  selftests: ublk: add stress test for per io daemons
  selftests: ublk: add functional test for per io daemons
  selftests: ublk: kublk: decouple ublk_queues from ublk server threads
  selftests: ublk: kublk: move per-thread data out of ublk_queue
  selftests: ublk: kublk: lift queue initialization out of thread
  selftests: ublk: kublk: tie sqe allocation to io instead of queue
  selftests: ublk: kublk: plumb q_id in io_uring user_data
  ublk: have a per-io daemon instead of a per-queue daemon
  md/md-bitmap: remove parameter slot from bitmap_create()
  md/md-bitmap: cleanup bitmap_ops->startwrite()
  md/dm-raid: remove max_write_behind setting limit
  md/md-bitmap: fix dm-raid max_write_behind setting
  md/raid1,raid10: don't handle IO error for REQ_RAHEAD and REQ_NOWAIT
  loop: add file_start_write() and file_end_write()
  bcache: reserve more RESERVE_BTREE buckets to prevent allocator hang
  bcache: remove unused constants
  bcache: fix NULL pointer in cache_set_flush()
* io_uring-6.16:
  io_uring/kbuf: limit legacy provided buffer lists to USHRT_MAX
* block-6.16:
  block: drop direction param from bio_integrity_copy_user()
* block-6.16:
  selftests: ublk: kublk: improve behavior on init failure
  block: flip iter directions in blk_rq_integrity_map_user()
* io_uring-6.16:
  io_uring/futex: mark wait requests as inflight
  io_uring/futex: get rid of struct io_futex addr union
* block-6.16:
  nvme: spelling fixes
  nvme-tcp: fix I/O stalls on congested sockets
  nvme-tcp: sanitize request list handling
  nvme-tcp: remove tag set when second admin queue config fails
  nvme: enable vectored registered bufs for passthrough cmds
  nvme: fix implicit bool to flags conversion
  nvme: fix command limits status code
Currently, disks primarily implement the write zeroes command (aka
REQ_OP_WRITE_ZEROES) through two mechanisms: the first involves
physically writing zeros to the disk media (e.g., HDDs), while the
second performs an unmap operation on the logical blocks, effectively
putting them into a deallocated state (e.g., SSDs). The first method is
generally slow, while the second method is typically very fast.

For example, on certain NVMe SSDs that support NVME_NS_DEAC, submitting
REQ_OP_WRITE_ZEROES requests with the NVME_WZ_DEAC bit can accelerate
the write zeros operation by placing disk blocks into a deallocated
state, which opportunistically avoids writing zeroes to media while
still guaranteeing that subsequent reads from the specified block range
will return zeroed data. This is a best-effort optimization, not a
mandatory requirement; some devices may partially fall back to writing
physical zeroes due to factors such as misalignment or being asked to
clear a block range smaller than the device's internal allocation unit.
Therefore, the speed of this operation is not guaranteed.
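
For reference, a minimal sketch of how the mainline NVMe driver requests
deallocating behaviour when it builds a write zeroes command (simplified
from nvme_setup_write_zeroes(); not the exact in-tree hunk):

    cmnd->write_zeroes.opcode = nvme_cmd_write_zeroes;
    /* Ask the device to deallocate instead of writing zeroes to media,
     * only when the namespace advertises DEAC support. */
    if (ns->features & NVME_NS_DEAC)
            cmnd->write_zeroes.control |= cpu_to_le16(NVME_WZ_DEAC);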

It is difficult to determine whether a storage device supports the unmap
write zeroes operation; we cannot determine this by only querying
bdev_limits(bdev)->max_write_zeroes_sectors. First, add a new queue
limit feature, BLK_FEAT_WRITE_ZEROES_UNMAP, to indicate whether a device
supports this unmap write zeroes operation. Then, add a new counterpart
flag, BLK_FLAG_WRITE_ZEROES_UNMAP_DISABLED, and a sysfs entry, which
allow users to disable this operation if it is very slow on some
special devices.

Finally, for the stacked device case, BLK_FEAT_WRITE_ZEROES_UNMAP
should be supported by both the stacking driver and all underlying
devices.

Thanks to Martin K. Petersen for optimizing the documentation of the
write_zeroes_unmap sysfs interface.

Signed-off-by: Zhang Yi <[email protected]>
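
The stacking rule in the last paragraph is plain AND semantics over the
new feature bit: the stacked device keeps BLK_FEAT_WRITE_ZEROES_UNMAP
only while every underlying device has it. A minimal illustrative sketch
(the helper below is hypothetical, not the actual blk_stack_limits()
code):

    /* Hypothetical helper: drop the feature as soon as one bottom
     * device in the stack does not support unmap write zeroes. */
    static void stack_write_zeroes_unmap(struct queue_limits *t,
                                         const struct queue_limits *b)
    {
            if (!(b->features & BLK_FEAT_WRITE_ZEROES_UNMAP))
                    t->features &= ~BLK_FEAT_WRITE_ZEROES_UNMAP;
    }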
When the device supports the Write Zeroes command and the DEAC bit, it
indicates that the deallocate bit in the Write Zeroes command is
supported, and the bytes read from a deallocated logical block are
zeroes. This means the device supports unmap Write Zeroes, so set the
BLK_FEAT_WRITE_ZEROES_UNMAP feature to the device's queue limit.

Signed-off-by: Zhang Yi <[email protected]>
Reviewed-by: Christoph Hellwig <[email protected]>
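
A minimal sketch of the condition this patch describes, assuming it
lands where the driver builds the namespace queue_limits (identifiers
follow the mainline driver; not the exact patch hunk):

    /* Advertise unmap write zeroes only when the controller implements
     * Write Zeroes and the namespace supports the DEAC bit. */
    if ((ns->ctrl->oncs & NVME_CTRL_ONCS_WRITE_ZEROES) &&
        (ns->features & NVME_NS_DEAC))
            lim.features |= BLK_FEAT_WRITE_ZEROES_UNMAP;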
Set the BLK_FEAT_WRITE_ZEROES_UNMAP feature while creating multipath
stacking queue limits by default. This feature shall be disabled if any
attached namespace does not support it.

Signed-off-by: Zhang Yi <[email protected]>
Reviewed-by: Christoph Hellwig <[email protected]>
Set the WZDS and DRB bits in the namespace dlfeat field if the
underlying block device supports BLK_FEAT_WRITE_ZEROES_UNMAP, so that
the nvme target device supports the unmapped write zeroes command.

Signed-off-by: Zhang Yi <[email protected]>
Reviewed-by: Christoph Hellwig <[email protected]>
…roing mode

When the device supports the Write Zeroes command and the zeroing mode
is set to SD_ZERO_WS16_UNMAP or SD_ZERO_WS10_UNMAP, this means that the
device supports unmap Write Zeroes, so set the corresponding
BLK_FEAT_WRITE_ZEROES_UNMAP feature to the device's queue limit.

Signed-off-by: Zhang Yi <[email protected]>
Reviewed-by: Christoph Hellwig <[email protected]>
Set the BLK_FEAT_WRITE_ZEROES_UNMAP feature on stacking queue limits by
default. This feature shall be disabled if any underlying device does
not support it.

Signed-off-by: Zhang Yi <[email protected]>
Reviewed-by: Benjamin Marzinski <[email protected]>
With the development of flash-based storage devices, we can quickly
write zeroes to SSDs using the write zeroes command if the device does
not actually write physical zeroes to the media. Therefore, we can use
this command to quickly preallocate a real all-zero file with written
extents. This approach should be beneficial for subsequent pure
overwrites within this file, as it can save on block allocation and,
consequently, significant metadata changes, which should greatly improve
overwrite performance on certain filesystems.

Therefore, introduce a new fallocate operation, FALLOC_FL_WRITE_ZEROES.
This flag is used to convert a specified range of a file to zeros by
issuing a zeroing operation. Blocks should be allocated for the regions
that span holes in the file, and the entire range is converted to
written extents. If the underlying device supports an actual offloaded
write zeroes command, the zeroing operation can be accelerated. If it
does not, we currently don't prevent the file system from writing actual
zeros to the device. This provides users with a new method to quickly
generate a zeroed file; users no longer need to write zero data to
create a file with written extents.

Users can determine whether a disk supports the unmap write zeroes
operation by querying this sysfs interface:

    /sys/block/<disk>/queue/write_zeroes_unmap

Finally, this flag cannot be specified in conjunction with
FALLOC_FL_KEEP_SIZE, since allocating written extents beyond the file
EOF is not permitted. In addition, filesystems that always require
out-of-place writes should not support this flag, since they still need
to allocate new blocks during subsequent overwrites.

Signed-off-by: Zhang Yi <[email protected]>
Reviewed-by: Christoph Hellwig <[email protected]>
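
A hedged userspace sketch of the new interface: check the sysfs knob
quoted above if desired, then preallocate a written, zeroed range with
fallocate(). FALLOC_FL_WRITE_ZEROES is defined locally in case the
installed <linux/falloc.h> does not carry it yet; the value 0x80 is an
assumption based on this series.

    #define _GNU_SOURCE
    #include <fcntl.h>
    #include <stdio.h>
    #include <stdlib.h>
    #include <unistd.h>
    #include <linux/falloc.h>

    #ifndef FALLOC_FL_WRITE_ZEROES
    #define FALLOC_FL_WRITE_ZEROES 0x80    /* assumed value from this series */
    #endif

    int main(int argc, char **argv)
    {
            if (argc != 3) {
                    fprintf(stderr, "usage: %s <file> <length>\n", argv[0]);
                    return 1;
            }

            int fd = open(argv[1], O_CREAT | O_RDWR, 0644);
            if (fd < 0) {
                    perror("open");
                    return 1;
            }

            /* Must not be combined with FALLOC_FL_KEEP_SIZE (see above). */
            if (fallocate(fd, FALLOC_FL_WRITE_ZEROES, 0, atoll(argv[2])) < 0) {
                    perror("fallocate(FALLOC_FL_WRITE_ZEROES)");
                    close(fd);
                    return 1;
            }

            close(fd);
            return 0;
    }

On an unsupported kernel or filesystem the call simply fails with
EOPNOTSUPP, so callers can fall back to FALLOC_FL_ZERO_RANGE or to
writing zeroes by hand.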
Only the flags passed to blkdev_issue_zeroout() differ between the two
zeroing branches in blkdev_fallocate(). Therefore, clean this up by
factoring them out.

Signed-off-by: Zhang Yi <[email protected]>
Reviewed-by: Christoph Hellwig <[email protected]>
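
A hedged sketch of the shape of that cleanup; the helper name is
illustrative, not necessarily the one used in the patch:

    /* Both zeroing branches of blkdev_fallocate() reduce to one call
     * that only differs in the flags argument. */
    static int blkdev_zero_range(struct block_device *bdev, loff_t start,
                                 loff_t len, unsigned int flags)
    {
            return blkdev_issue_zeroout(bdev, start >> SECTOR_SHIFT,
                                        len >> SECTOR_SHIFT, GFP_KERNEL,
                                        flags);
    }

Callers then pass e.g. BLKDEV_ZERO_NOUNMAP or BLKDEV_ZERO_NOFALLBACK
depending on the fallocate mode.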
Add support for FALLOC_FL_WRITE_ZEROES: if the block device enables the
unmap write zeroes operation, it will issue a write zeroes command.

Signed-off-by: Zhang Yi <[email protected]>
Add support for FALLOC_FL_WRITE_ZEROES if the underlying device enables
the unmap write zeroes operation. This first allocates blocks as
unwritten, then issues a zero command outside of the running journal
handle, and finally converts them to a written state.

Signed-off-by: Zhang Yi <[email protected]>
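
A hedged outline of the sequence just described; the helper names below
are hypothetical and only stand in for the corresponding ext4 steps:

    static long ext4_write_zeroes_range(struct inode *inode, loff_t offset,
                                        loff_t len)
    {
            long err;

            /* 1) allocate the range as unwritten extents under a journal
             *    handle, so blocks are reserved without exposing stale data */
            err = ext4_alloc_unwritten_range(inode, offset, len);
            if (err)
                    return err;

            /* 2) issue the zeroing I/O outside the running handle */
            err = ext4_issue_zeroout_range(inode, offset, len);
            if (err)
                    return err;

            /* 3) convert the zeroed extents to written state */
            return ext4_convert_range_to_written(inode, offset, len);
    }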
blktests-ci Bot commented Jun 10, 2025

Upstream branch: 38f4878
series: https://patchwork.kernel.org/project/linux-block/list/?series=968463
version: 1

blktests-ci Bot pushed a commit that referenced this pull request Jul 10, 2025
When reconnecting a channel in smb2_reconnect_server(), a dummy tcon
is passed down to smb2_reconnect() with ->query_interface
uninitialized, so we can't call queue_delayed_work() on it.

Fix the following warning by ensuring that we're queueing the delayed
worker from the correct tcon.

WARNING: CPU: 4 PID: 1126 at kernel/workqueue.c:2498 __queue_delayed_work+0x1d2/0x200
Modules linked in: cifs cifs_arc4 nls_ucs2_utils cifs_md4 [last unloaded: cifs]
CPU: 4 UID: 0 PID: 1126 Comm: kworker/4:0 Not tainted 6.16.0-rc3 #5 PREEMPT(voluntary)
Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS 1.16.3-4.fc42 04/01/2014
Workqueue: cifsiod smb2_reconnect_server [cifs]
RIP: 0010:__queue_delayed_work+0x1d2/0x200
Code: 41 5e 41 5f e9 7f ee ff ff 90 0f 0b 90 e9 5d ff ff ff bf 02 00
00 00 e8 6c f3 07 00 89 c3 eb bd 90 0f 0b 90 e9 57 f> 0b 90 e9 65 fe
ff ff 90 0f 0b 90 e9 72 fe ff ff 90 0f 0b 90 e9
RSP: 0018:ffffc900014afad8 EFLAGS: 00010003
RAX: 0000000000000000 RBX: ffff888124d99988 RCX: ffffffff81399cc1
RDX: dffffc0000000000 RSI: ffff888114326e00 RDI: ffff888124d999f0
RBP: 000000000000ea60 R08: 0000000000000001 R09: ffffed10249b3331
R10: ffff888124d9998f R11: 0000000000000004 R12: 0000000000000040
R13: ffff888114326e00 R14: ffff888124d999d8 R15: ffff888114939020
FS:  0000000000000000(0000) GS:ffff88829f7fe000(0000) knlGS:0000000000000000
CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
CR2: 00007ffe7a2b4038 CR3: 0000000120a6f000 CR4: 0000000000750ef0
PKRU: 55555554
Call Trace:
 <TASK>
 queue_delayed_work_on+0xb4/0xc0
 smb2_reconnect+0xb22/0xf50 [cifs]
 smb2_reconnect_server+0x413/0xd40 [cifs]
 ? __pfx_smb2_reconnect_server+0x10/0x10 [cifs]
 ? local_clock_noinstr+0xd/0xd0
 ? local_clock+0x15/0x30
 ? lock_release+0x29b/0x390
 process_one_work+0x4c5/0xa10
 ? __pfx_process_one_work+0x10/0x10
 ? __list_add_valid_or_report+0x37/0x120
 worker_thread+0x2f1/0x5a0
 ? __kthread_parkme+0xde/0x100
 ? __pfx_worker_thread+0x10/0x10
 kthread+0x1fe/0x380
 ? kthread+0x10f/0x380
 ? __pfx_kthread+0x10/0x10
 ? local_clock_noinstr+0xd/0xd0
 ? ret_from_fork+0x1b/0x1f0
 ? local_clock+0x15/0x30
 ? lock_release+0x29b/0x390
 ? rcu_is_watching+0x20/0x50
 ? __pfx_kthread+0x10/0x10
 ret_from_fork+0x15b/0x1f0
 ? __pfx_kthread+0x10/0x10
 ret_from_fork_asm+0x1a/0x30
 </TASK>
irq event stamp: 1116206
hardirqs last  enabled at (1116205): [<ffffffff8143af42>] __up_console_sem+0x52/0x60
hardirqs last disabled at (1116206): [<ffffffff81399f0e>] queue_delayed_work_on+0x6e/0xc0
softirqs last  enabled at (1116138): [<ffffffffc04562fd>] __smb_send_rqst+0x42d/0x950 [cifs]
softirqs last disabled at (1116136): [<ffffffff823d35e1>] release_sock+0x21/0xf0

Cc: [email protected]
Reported-by: David Howells <[email protected]>
Fixes: 42ca547 ("cifs: do not disable interface polling on failure")
Reviewed-by: David Howells <[email protected]>
Tested-by: David Howells <[email protected]>
Reviewed-by: Shyam Prasad N <[email protected]>
Signed-off-by: Paulo Alcantara (Red Hat) <[email protected]>
Signed-off-by: David Howells <[email protected]>
Tested-by: Steve French <[email protected]>
Signed-off-by: Steve French <[email protected]>
blktests-ci Bot commented Jul 10, 2025

Upstream branch: f4ca523
series: https://patchwork.kernel.org/project/linux-block/list/?series=973801
version: 2

blktests-ci Bot added V2 and removed V1 labels Jul 10, 2025
blktests-ci Bot commented Jul 10, 2025

GitHub failed to update this PR after a force push. Closing it.

blktests-ci Bot closed this Jul 10, 2025
blktests-ci Bot pushed a commit that referenced this pull request Mar 3, 2026
When a bio goes through the rq_qos infrastructure on a path's request
queue, it gets BIO_QOS_THROTTLED or BIO_QOS_MERGED flags set. These
flags indicate that rq_qos_done_bio() should be called on completion
to update rq_qos accounting.

During path failover in nvme_failover_req(), the bio's bi_bdev is
redirected from the failed path's disk to the multipath head's disk
via bio_set_dev(). However, the BIO_QOS flags are not cleared.

When the bio eventually completes (either successfully via a new path
or with an error via bio_io_error()), rq_qos_done_bio() checks for
these flags and calls __rq_qos_done_bio(q->rq_qos, bio) where q is
obtained from the bio's current bi_bdev - which is now the multipath
head's queue, not the original path's queue.

The multipath head's queue does not have rq_qos enabled (q->rq_qos is
NULL), but the code assumes that if BIO_QOS_* flags are set, q->rq_qos
must be valid.

This breaks when a bio is moved between queues during NVMe multipath
failover, leading to a NULL pointer dereference.

Execution Context timeline :-

   * =====> dd process context
   [USER] dd process
     [SYSCALL] write() - dd process context
       submit_bio()
       nvme_ns_head_submit_bio() - path selection
       blk_mq_submit_bio()  #### QOS FLAGS SET HERE

        [USER] dd waits or returns

          ==== I/O in flight on NVMe hardware =====

   ===== End of submission path ====
   ------------------------------------------------------

   * dd ====> Interrupt context;
   [IRQ] NVMe completion interrupt
       nvme_irq()
        nvme_complete_rq()
         nvme_failover_req() ### BIO MOVED TO HEAD
            spin_lock_irqsave (atomic section)
            bio_set_dev() changes bi_bdev
            ### BUG: QOS flags NOT cleared
            kblockd_schedule_work()

   * Interrupt context =====> kblockd workqueue
   [WQ] kblockd workqueue - kworker process
       nvme_requeue_work()
        submit_bio_noacct()
         nvme_ns_head_submit_bio()
          nvme_find_path() returns NULL
           bio_io_error()
            bio_endio()
             rq_qos_done_bio()  ### CRASH ###

   KERNEL PANIC / OOPS

Crash from blktests nvme/058 (rapid namespace remapping):

[ 1339.636033] BUG: kernel NULL pointer dereference, address: 0000000000000000
[ 1339.641025] nvme nvme4: rescanning namespaces.
[ 1339.642064] #PF: supervisor read access in kernel mode
[ 1339.642067] #PF: error_code(0x0000) - not-present page
[ 1339.642070] PGD 0 P4D 0
[ 1339.642073] Oops: Oops: 0000 [#1] SMP NOPTI
[ 1339.642078] CPU: 35 UID: 0 PID: 4579 Comm: kworker/35:2H
               Tainted: G   O     N  6.17.0-rc3nvme+ #5 PREEMPT(voluntary)
[ 1339.642084] Tainted: [O]=OOT_MODULE, [N]=TEST
[ 1339.673446] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996),
           BIOS rel-1.16.3-0-ga6ed6b701f0a-prebuilt.qemu.org 04/01/2014
[ 1339.682359] Workqueue: kblockd nvme_requeue_work [nvme_core]
[ 1339.686613] RIP: 0010:__rq_qos_done_bio+0xd/0x40
[ 1339.690161] Code: 75 dd 5b 5d 41 5c c3 cc cc cc cc 66 90 90 90 90 90 90 90
                     90 90 90 90 90 90 90 90 90 90 0f 1f 44 00 00 55 48 89 f5
             53 48 89 fb <48> 8b 03 48 8b 40 30 48 85 c0 74 0b 48 89 ee
             48 89 df ff d0 0f 1f
[ 1339.703691] RSP: 0018:ffffc900066f3c90 EFLAGS: 00010202
[ 1339.706844] RAX: ffff888148b9ef00 RBX: 0000000000000000 RCX: 0000000000000000
[ 1339.711136] RDX: 00000000000001c0 RSI: ffff8882aaab8a80 RDI: 0000000000000000
[ 1339.715691] RBP: ffff8882aaab8a80 R08: 0000000000000000 R09: 0000000000000000
[ 1339.720472] R10: 0000000000000000 R11: fefefefefefefeff R12: ffff8882aa3b6010
[ 1339.724650] R13: 0000000000000000 R14: ffff8882338bcef0 R15: ffff8882aa3b6020
[ 1339.729029] FS:  0000000000000000(0000) GS:ffff88985c0cf000(0000) knlGS:0000000000000000
[ 1339.734525] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[ 1339.738563] CR2: 0000000000000000 CR3: 0000000111045000 CR4: 0000000000350ef0
[ 1339.742750] DR0: ffffffff845ccbec DR1: ffffffff845ccbed DR2: ffffffff845ccbee
[ 1339.745630] DR3: ffffffff845ccbef DR6: 00000000ffff0ff0 DR7: 0000000000000600
[ 1339.748488] Call Trace:
[ 1339.749512]  <TASK>
[ 1339.750449]  bio_endio+0x71/0x2e0
[ 1339.751833]  nvme_ns_head_submit_bio+0x290/0x320 [nvme_core]
[ 1339.754073]  __submit_bio+0x222/0x5e0
[ 1339.755623]  ? rcu_is_watching+0xd/0x40
[ 1339.757201]  ? submit_bio_noacct_nocheck+0x131/0x370
[ 1339.759210]  submit_bio_noacct_nocheck+0x131/0x370
[ 1339.761189]  ? submit_bio_noacct+0x20/0x620
[ 1339.762849]  nvme_requeue_work+0x4b/0x60 [nvme_core]
[ 1339.764828]  process_one_work+0x20e/0x630
[ 1339.766528]  worker_thread+0x184/0x330
[ 1339.768129]  ? __pfx_worker_thread+0x10/0x10
[ 1339.769942]  kthread+0x10a/0x250
[ 1339.771263]  ? __pfx_kthread+0x10/0x10
[ 1339.772776]  ? __pfx_kthread+0x10/0x10
[ 1339.774381]  ret_from_fork+0x273/0x2e0
[ 1339.775948]  ? __pfx_kthread+0x10/0x10
[ 1339.777504]  ret_from_fork_asm+0x1a/0x30
[ 1339.779163]  </TASK>

Fix this by clearing both BIO_QOS_THROTTLED and BIO_QOS_MERGED flags
when bios are redirected to the multipath head in nvme_failover_req().
This is consistent with the existing code that clears REQ_POLLED and
REQ_NOWAIT flags when the bio changes queues.

Signed-off-by: Chaitanya Kulkarni <[email protected]>
Reviewed-by: Christoph Hellwig <[email protected]>
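
A hedged sketch of the shape of this fix in nvme_failover_req();
abbreviated, mirroring the description above rather than quoting the
exact hunk:

    for (bio = req->bio; bio; bio = bio->bi_next) {
            bio_set_dev(bio, ns->head->disk->part0);
            /* The head's queue has no rq_qos attached, so path-level
             * QoS state must not travel with the bio. */
            bio_clear_flag(bio, BIO_QOS_THROTTLED);
            bio_clear_flag(bio, BIO_QOS_MERGED);
            /* existing code already strips REQ_POLLED and REQ_NOWAIT here */
    }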
blktests-ci Bot pushed a commit that referenced this pull request Mar 10, 2026
This leak will cause a hang when tearing down the SCSI host. For example,
iscsid hangs with the following call trace:

[130120.652718] scsi_alloc_sdev: Allocation failure during SCSI scanning, some SCSI devices might not be configured

PID: 2528     TASK: ffff9d0408974e00  CPU: 3    COMMAND: "iscsid"
 #0 [ffffb5b9c134b9e0] __schedule at ffffffff860657d4
 #1 [ffffb5b9c134ba28] schedule at ffffffff86065c6f
 #2 [ffffb5b9c134ba40] schedule_timeout at ffffffff86069fb0
 #3 [ffffb5b9c134bab0] __wait_for_common at ffffffff8606674f
 #4 [ffffb5b9c134bb10] scsi_remove_host at ffffffff85bfe84b
 #5 [ffffb5b9c134bb30] iscsi_sw_tcp_session_destroy at ffffffffc03031c4 [iscsi_tcp]
 #6 [ffffb5b9c134bb48] iscsi_if_recv_msg at ffffffffc0292692 [scsi_transport_iscsi]
 #7 [ffffb5b9c134bb98] iscsi_if_rx at ffffffffc02929c2 [scsi_transport_iscsi]
 #8 [ffffb5b9c134bbf0] netlink_unicast at ffffffff85e551d6
 #9 [ffffb5b9c134bc38] netlink_sendmsg at ffffffff85e554ef

Fixes: 8fe4ce5 ("scsi: core: Fix a use-after-free")
Cc: [email protected]
Signed-off-by: Junxiao Bi <[email protected]>
Reviewed-by: Mike Christie <[email protected]>
Reviewed-by: Bart Van Assche <[email protected]>
Link: https://patch.msgid.link/[email protected]
Signed-off-by: Martin K. Petersen <[email protected]>
blktests-ci Bot pushed a commit that referenced this pull request Mar 10, 2026
When a bio goes through the rq_qos infrastructure on a path's request
queue, it gets BIO_QOS_THROTTLED or BIO_QOS_MERGED flags set. These
flags indicate that rq_qos_done_bio() should be called on completion
to update rq_qos accounting.

During path failover in nvme_failover_req(), the bio's bi_bdev is
redirected from the failed path's disk to the multipath head's disk
via bio_set_dev(). However, the BIO_QOS flags are not cleared.

When the bio eventually completes (either successfully via a new path
or with an error via bio_io_error()), rq_qos_done_bio() checks for
these flags and calls __rq_qos_done_bio(q->rq_qos, bio) where q is
obtained from the bio's current bi_bdev - which is now the multipath
head's queue, not the original path's queue.

The multipath head's queue does not have rq_qos enabled (q->rq_qos is
NULL), but the code assumes that if BIO_QOS_* flags are set, q->rq_qos
must be valid.

This breaks when a bio is moved between queues during NVMe multipath
failover, leading to a NULL pointer dereference.

Execution Context timeline :-

   * =====> dd process context
   [USER] dd process
     [SYSCALL] write() - dd process context
       submit_bio()
       nvme_ns_head_submit_bio() - path selection
       blk_mq_submit_bio()  #### QOS FLAGS SET HERE

        [USER] dd waits or returns

          ==== I/O in flight on NVMe hardware =====

   ===== End of submission path ====
   ------------------------------------------------------

   * dd ====> Interrupt context;
   [IRQ] NVMe completion interrupt
       nvme_irq()
        nvme_complete_rq()
         nvme_failover_req() ### BIO MOVED TO HEAD
            spin_lock_irqsave (atomic section)
            bio_set_dev() changes bi_bdev
            ### BUG: QOS flags NOT cleared
            kblockd_schedule_work()

   * Interrupt context =====> kblockd workqueue
   [WQ] kblockd workqueue - kworker process
       nvme_requeue_work()
        submit_bio_noacct()
         nvme_ns_head_submit_bio()
          nvme_find_path() returns NULL
           bio_io_error()
            bio_endio()
             rq_qos_done_bio()  ### CRASH ###

   KERNEL PANIC / OOPS

Crash from blktests nvme/058 (rapid namespace remapping):

[ 1339.636033] BUG: kernel NULL pointer dereference, address: 0000000000000000
[ 1339.641025] nvme nvme4: rescanning namespaces.
[ 1339.642064] #PF: supervisor read access in kernel mode
[ 1339.642067] #PF: error_code(0x0000) - not-present page
[ 1339.642070] PGD 0 P4D 0
[ 1339.642073] Oops: Oops: 0000 [#1] SMP NOPTI
[ 1339.642078] CPU: 35 UID: 0 PID: 4579 Comm: kworker/35:2H
               Tainted: G   O     N  6.17.0-rc3nvme+ #5 PREEMPT(voluntary)
[ 1339.642084] Tainted: [O]=OOT_MODULE, [N]=TEST
[ 1339.673446] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996),
           BIOS rel-1.16.3-0-ga6ed6b701f0a-prebuilt.qemu.org 04/01/2014
[ 1339.682359] Workqueue: kblockd nvme_requeue_work [nvme_core]
[ 1339.686613] RIP: 0010:__rq_qos_done_bio+0xd/0x40
[ 1339.690161] Code: 75 dd 5b 5d 41 5c c3 cc cc cc cc 66 90 90 90 90 90 90 90
                     90 90 90 90 90 90 90 90 90 90 0f 1f 44 00 00 55 48 89 f5
             53 48 89 fb <48> 8b 03 48 8b 40 30 48 85 c0 74 0b 48 89 ee
             48 89 df ff d0 0f 1f
[ 1339.703691] RSP: 0018:ffffc900066f3c90 EFLAGS: 00010202
[ 1339.706844] RAX: ffff888148b9ef00 RBX: 0000000000000000 RCX: 0000000000000000
[ 1339.711136] RDX: 00000000000001c0 RSI: ffff8882aaab8a80 RDI: 0000000000000000
[ 1339.715691] RBP: ffff8882aaab8a80 R08: 0000000000000000 R09: 0000000000000000
[ 1339.720472] R10: 0000000000000000 R11: fefefefefefefeff R12: ffff8882aa3b6010
[ 1339.724650] R13: 0000000000000000 R14: ffff8882338bcef0 R15: ffff8882aa3b6020
[ 1339.729029] FS:  0000000000000000(0000) GS:ffff88985c0cf000(0000) knlGS:0000000000000000
[ 1339.734525] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[ 1339.738563] CR2: 0000000000000000 CR3: 0000000111045000 CR4: 0000000000350ef0
[ 1339.742750] DR0: ffffffff845ccbec DR1: ffffffff845ccbed DR2: ffffffff845ccbee
[ 1339.745630] DR3: ffffffff845ccbef DR6: 00000000ffff0ff0 DR7: 0000000000000600
[ 1339.748488] Call Trace:
[ 1339.749512]  <TASK>
[ 1339.750449]  bio_endio+0x71/0x2e0
[ 1339.751833]  nvme_ns_head_submit_bio+0x290/0x320 [nvme_core]
[ 1339.754073]  __submit_bio+0x222/0x5e0
[ 1339.755623]  ? rcu_is_watching+0xd/0x40
[ 1339.757201]  ? submit_bio_noacct_nocheck+0x131/0x370
[ 1339.759210]  submit_bio_noacct_nocheck+0x131/0x370
[ 1339.761189]  ? submit_bio_noacct+0x20/0x620
[ 1339.762849]  nvme_requeue_work+0x4b/0x60 [nvme_core]
[ 1339.764828]  process_one_work+0x20e/0x630
[ 1339.766528]  worker_thread+0x184/0x330
[ 1339.768129]  ? __pfx_worker_thread+0x10/0x10
[ 1339.769942]  kthread+0x10a/0x250
[ 1339.771263]  ? __pfx_kthread+0x10/0x10
[ 1339.772776]  ? __pfx_kthread+0x10/0x10
[ 1339.774381]  ret_from_fork+0x273/0x2e0
[ 1339.775948]  ? __pfx_kthread+0x10/0x10
[ 1339.777504]  ret_from_fork_asm+0x1a/0x30
[ 1339.779163]  </TASK>

Fix this by clearing both BIO_QOS_THROTTLED and BIO_QOS_MERGED flags
when bios are redirected to the multipath head in nvme_failover_req().
This is consistent with the existing code that clears REQ_POLLED and
REQ_NOWAIT flags when the bio changes queues.

Signed-off-by: Chaitanya Kulkarni <[email protected]>
Reviewed-by: Christoph Hellwig <[email protected]>
blktests-ci Bot pushed a commit that referenced this pull request Mar 10, 2026
When a bio goes through the rq_qos infrastructure on a path's request
queue, it gets BIO_QOS_THROTTLED or BIO_QOS_MERGED flags set. These
flags indicate that rq_qos_done_bio() should be called on completion
to update rq_qos accounting.

During path failover in nvme_failover_req(), the bio's bi_bdev is
redirected from the failed path's disk to the multipath head's disk
via bio_set_dev(). However, the BIO_QOS flags are not cleared.

When the bio eventually completes (either successfully via a new path
or with an error via bio_io_error()), rq_qos_done_bio() checks for
these flags and calls __rq_qos_done_bio(q->rq_qos, bio) where q is
obtained from the bio's current bi_bdev - which is now the multipath
head's queue, not the original path's queue.

The multipath head's queue does not have rq_qos enabled (q->rq_qos is
NULL), but the code assumes that if BIO_QOS_* flags are set, q->rq_qos
must be valid.

This breaks when a bio is moved between queues during NVMe multipath
failover, leading to a NULL pointer dereference.

Execution Context timeline :-

   * =====> dd process context
   [USER] dd process
     [SYSCALL] write() - dd process context
       submit_bio()
       nvme_ns_head_submit_bio() - path selection
       blk_mq_submit_bio()  #### QOS FLAGS SET HERE

        [USER] dd waits or returns

          ==== I/O in flight on NVMe hardware =====

   ===== End of submission path ====
   ------------------------------------------------------

   * dd ====> Interrupt context;
   [IRQ] NVMe completion interrupt
       nvme_irq()
        nvme_complete_rq()
         nvme_failover_req() ### BIO MOVED TO HEAD
            spin_lock_irqsave (atomic section)
            bio_set_dev() changes bi_bdev
            ### BUG: QOS flags NOT cleared
            kblockd_schedule_work()

   * Interrupt context =====> kblockd workqueue
   [WQ] kblockd workqueue - kworker process
       nvme_requeue_work()
        submit_bio_noacct()
         nvme_ns_head_submit_bio()
          nvme_find_path() returns NULL
           bio_io_error()
            bio_endio()
             rq_qos_done_bio()  ### CRASH ###

   KERNEL PANIC / OOPS

Crash from blktests nvme/058 (rapid namespace remapping):

[ 1339.636033] BUG: kernel NULL pointer dereference, address: 0000000000000000
[ 1339.641025] nvme nvme4: rescanning namespaces.
[ 1339.642064] #PF: supervisor read access in kernel mode
[ 1339.642067] #PF: error_code(0x0000) - not-present page
[ 1339.642070] PGD 0 P4D 0
[ 1339.642073] Oops: Oops: 0000 [#1] SMP NOPTI
[ 1339.642078] CPU: 35 UID: 0 PID: 4579 Comm: kworker/35:2H
               Tainted: G   O     N  6.17.0-rc3nvme+ #5 PREEMPT(voluntary)
[ 1339.642084] Tainted: [O]=OOT_MODULE, [N]=TEST
[ 1339.673446] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996),
           BIOS rel-1.16.3-0-ga6ed6b701f0a-prebuilt.qemu.org 04/01/2014
[ 1339.682359] Workqueue: kblockd nvme_requeue_work [nvme_core]
[ 1339.686613] RIP: 0010:__rq_qos_done_bio+0xd/0x40
[ 1339.690161] Code: 75 dd 5b 5d 41 5c c3 cc cc cc cc 66 90 90 90 90 90 90 90
                     90 90 90 90 90 90 90 90 90 90 0f 1f 44 00 00 55 48 89 f5
             53 48 89 fb <48> 8b 03 48 8b 40 30 48 85 c0 74 0b 48 89 ee
             48 89 df ff d0 0f 1f
[ 1339.703691] RSP: 0018:ffffc900066f3c90 EFLAGS: 00010202
[ 1339.706844] RAX: ffff888148b9ef00 RBX: 0000000000000000 RCX: 0000000000000000
[ 1339.711136] RDX: 00000000000001c0 RSI: ffff8882aaab8a80 RDI: 0000000000000000
[ 1339.715691] RBP: ffff8882aaab8a80 R08: 0000000000000000 R09: 0000000000000000
[ 1339.720472] R10: 0000000000000000 R11: fefefefefefefeff R12: ffff8882aa3b6010
[ 1339.724650] R13: 0000000000000000 R14: ffff8882338bcef0 R15: ffff8882aa3b6020
[ 1339.729029] FS:  0000000000000000(0000) GS:ffff88985c0cf000(0000) knlGS:0000000000000000
[ 1339.734525] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[ 1339.738563] CR2: 0000000000000000 CR3: 0000000111045000 CR4: 0000000000350ef0
[ 1339.742750] DR0: ffffffff845ccbec DR1: ffffffff845ccbed DR2: ffffffff845ccbee
[ 1339.745630] DR3: ffffffff845ccbef DR6: 00000000ffff0ff0 DR7: 0000000000000600
[ 1339.748488] Call Trace:
[ 1339.749512]  <TASK>
[ 1339.750449]  bio_endio+0x71/0x2e0
[ 1339.751833]  nvme_ns_head_submit_bio+0x290/0x320 [nvme_core]
[ 1339.754073]  __submit_bio+0x222/0x5e0
[ 1339.755623]  ? rcu_is_watching+0xd/0x40
[ 1339.757201]  ? submit_bio_noacct_nocheck+0x131/0x370
[ 1339.759210]  submit_bio_noacct_nocheck+0x131/0x370
[ 1339.761189]  ? submit_bio_noacct+0x20/0x620
[ 1339.762849]  nvme_requeue_work+0x4b/0x60 [nvme_core]
[ 1339.764828]  process_one_work+0x20e/0x630
[ 1339.766528]  worker_thread+0x184/0x330
[ 1339.768129]  ? __pfx_worker_thread+0x10/0x10
[ 1339.769942]  kthread+0x10a/0x250
[ 1339.771263]  ? __pfx_kthread+0x10/0x10
[ 1339.772776]  ? __pfx_kthread+0x10/0x10
[ 1339.774381]  ret_from_fork+0x273/0x2e0
[ 1339.775948]  ? __pfx_kthread+0x10/0x10
[ 1339.777504]  ret_from_fork_asm+0x1a/0x30
[ 1339.779163]  </TASK>

Fix this by clearing both BIO_QOS_THROTTLED and BIO_QOS_MERGED flags
when bios are redirected to the multipath head in nvme_failover_req().
This is consistent with the existing code that clears REQ_POLLED and
REQ_NOWAIT flags when the bio changes queues.
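
A minimal sketch of the shape of that change, assuming the flag clearing
sits in the bio requeue loop of nvme_failover_req() next to the existing
bio_set_dev()/REQ_POLLED/REQ_NOWAIT handling (surrounding code omitted):

    /* nvme_failover_req(): relevant loop only, not a full listing */
    for (bio = req->bio; bio; bio = bio->bi_next) {
            bio_set_dev(bio, ns->head->disk->part0);
            /*
             * The bio now points at the multipath head, whose queue has
             * no rq_qos state.  Drop the per-path QOS flags so a later
             * rq_qos_done_bio() does not dereference q->rq_qos == NULL.
             */
            bio_clear_flag(bio, BIO_QOS_THROTTLED);
            bio_clear_flag(bio, BIO_QOS_MERGED);
            if (bio->bi_opf & REQ_POLLED) {
                    bio->bi_opf &= ~REQ_POLLED;
                    bio->bi_cookie = BLK_QC_T_NONE;
            }
            bio->bi_opf &= ~REQ_NOWAIT;
    }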

Signed-off-by: Chaitanya Kulkarni <[email protected]>
Reviewed-by: Christoph Hellwig <[email protected]>
Link: https://patch.msgid.link/[email protected]
Signed-off-by: Jens Axboe <[email protected]>
blktests-ci Bot pushed a commit that referenced this pull request Mar 11, 2026
blktests-ci Bot pushed a commit that referenced this pull request Mar 12, 2026
blktests-ci Bot pushed a commit that referenced this pull request Mar 13, 2026
blktests-ci Bot pushed a commit that referenced this pull request Mar 15, 2026
blktests-ci Bot pushed a commit that referenced this pull request Mar 18, 2026
This patch fixes an out-of-bounds access in ceph_handle_auth_reply()
that can be triggered by a message of type CEPH_MSG_AUTH_REPLY. In
ceph_handle_auth_reply(), the value of the payload_len field of such a
message is stored in a variable of type int. A value greater than
INT_MAX leads to an integer overflow and is interpreted as a negative
value. Adding this negative value to the decode pointer moves it
backwards, before the start of the buffer, and the bogus address is then
accessed, because ceph_decode_need() only checks that an access does not
run past the end address of the allocation, not past its start.

This patch fixes the issue by changing the data type of payload_len to
u32. Additionally, the data type of result_msg_len is changed to u32,
as it is also a variable holding a non-negative length.

An additional layer of sanity checks is also introduced: immediately
after payload_len and result_msg_len are read from the message, each is
checked against the overall segment length and rejected if larger.
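
For illustration, a simplified sketch of the vulnerable decode pattern
described above, keeping the length in a signed int; names follow this
description and the include/linux/ceph/decode.h helpers, not the exact
libceph source:

    /* Sketch only: shows why a signed length defeats the bounds check. */
    static int auth_reply_decode_sketch(void *buf, size_t len)
    {
            void *p = buf, *end = buf + len;
            int payload_len, result_msg_len;    /* BAD: signed 32-bit */

            payload_len = ceph_decode_32(&p);   /* > INT_MAX wraps negative */
            p += payload_len;                   /* pointer moves backwards, before buf */
            ceph_decode_need(&p, end, sizeof(u32), bad); /* only the end bound is checked */
            result_msg_len = ceph_decode_32(&p);         /* out-of-bounds read */
            return result_msg_len;
    bad:
            return -ERANGE;
    }

With payload_len and result_msg_len declared as u32, the addition can no
longer move the pointer backwards, and the added check against the
segment length rejects oversized values before the pointer is advanced.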

BUG: KASAN: slab-out-of-bounds in ceph_handle_auth_reply+0x642/0x7a0 [libceph]
Read of size 4 at addr ffff88811404df14 by task kworker/20:1/262

CPU: 20 UID: 0 PID: 262 Comm: kworker/20:1 Not tainted 6.19.2 #5 PREEMPT(voluntary)
Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.16.3-debian-1.16.3-2 04/01/2014
Workqueue: ceph-msgr ceph_con_workfn [libceph]
Call Trace:
 <TASK>
 dump_stack_lvl+0x76/0xa0
 print_report+0xd1/0x620
 ? __pfx__raw_spin_lock_irqsave+0x10/0x10
 ? kasan_complete_mode_report_info+0x72/0x210
 kasan_report+0xe7/0x130
 ? ceph_handle_auth_reply+0x642/0x7a0 [libceph]
 ? ceph_handle_auth_reply+0x642/0x7a0 [libceph]
 __asan_report_load_n_noabort+0xf/0x20
 ceph_handle_auth_reply+0x642/0x7a0 [libceph]
 mon_dispatch+0x973/0x23d0 [libceph]
 ? apparmor_socket_recvmsg+0x6b/0xa0
 ? __pfx_mon_dispatch+0x10/0x10 [libceph]
 ? __kasan_check_write+0x14/0x30
 ? mutex_unlock+0x7f/0xd0
 ? __pfx_mutex_unlock+0x10/0x10
 ? __pfx_do_recvmsg+0x10/0x10 [libceph]
 ceph_con_process_message+0x1f1/0x650 [libceph]
 process_message+0x1e/0x450 [libceph]
 ceph_con_v2_try_read+0x2e48/0x6c80 [libceph]
 ? __pfx_ceph_con_v2_try_read+0x10/0x10 [libceph]
 ? save_fpregs_to_fpstate+0xb0/0x230
 ? raw_spin_rq_unlock+0x17/0xa0
 ? finish_task_switch.isra.0+0x13b/0x760
 ? __switch_to+0x385/0xda0
 ? __kasan_check_write+0x14/0x30
 ? mutex_lock+0x8d/0xe0
 ? __pfx_mutex_lock+0x10/0x10
 ceph_con_workfn+0x248/0x10c0 [libceph]
 process_one_work+0x629/0xf80
 ? __kasan_check_write+0x14/0x30
 worker_thread+0x87f/0x1570
 ? __pfx__raw_spin_lock_irqsave+0x10/0x10
 ? __pfx_try_to_wake_up+0x10/0x10
 ? kasan_print_address_stack_frame+0x1f7/0x280
 ? __pfx_worker_thread+0x10/0x10
 kthread+0x396/0x830
 ? __pfx__raw_spin_lock_irq+0x10/0x10
 ? __pfx_kthread+0x10/0x10
 ? __kasan_check_write+0x14/0x30
 ? recalc_sigpending+0x180/0x210
 ? __pfx_kthread+0x10/0x10
 ret_from_fork+0x3f7/0x610
 ? __pfx_ret_from_fork+0x10/0x10
 ? __switch_to+0x385/0xda0
 ? __pfx_kthread+0x10/0x10
 ret_from_fork_asm+0x1a/0x30
 </TASK>

[ idryomov: replace if statements with ceph_decode_need() for
  payload_len and result_msg_len ]

Cc: [email protected]
Signed-off-by: Raphael Zimmer <[email protected]>
Reviewed-by: Viacheslav Dubeyko <[email protected]>
Reviewed-by: Ilya Dryomov <[email protected]>
Signed-off-by: Ilya Dryomov <[email protected]>
blktests-ci Bot pushed a commit that referenced this pull request Mar 18, 2026
blktests-ci Bot pushed a commit that referenced this pull request Mar 18, 2026
blktests-ci Bot pushed a commit that referenced this pull request Mar 22, 2026
blktests-ci Bot pushed a commit that referenced this pull request Mar 23, 2026
blktests-ci Bot pushed a commit that referenced this pull request Mar 24, 2026
blktests-ci Bot pushed a commit that referenced this pull request Mar 25, 2026
As reported by syzbot [0], NBD can trigger a deadlock during
memory reclaim.

This occurs when a process holds lock_sock() on a backend TCP
socket and triggers a memory allocation that leads to fs reclaim.
If that reclaim eventually calls into NBD to send data or shut down the
socket, NBD will attempt to acquire the same lock_sock(),
resulting in the deadlock.

While NBD sets sk->sk_allocation to GFP_NOIO before calling
sendmsg(), this does not prevent the issue in some paths where
GFP_KERNEL is used directly under lock_sock().

To resolve this, let's use lock_sock_try() for TCP sendmsg() and
shutdown().

For sock_sendmsg(), if lock_sock_try() fails, -ERESTARTSYS is
returned, allowing the request to be retried later (e.g., via
was_interrupted() logic).
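
For reference, a short sketch of the classification helper already in
drivers/block/nbd.c that this retry relies on; in the send path (assumed
here, partial-send accounting omitted) such a result is turned into
BLK_STS_RESOURCE so that blk-mq requeues the request:

    /* Existing helper in drivers/block/nbd.c: these errnos mean
     * "try again later", not a hard I/O failure. */
    static inline int was_interrupted(int result)
    {
            return result == -ERESTARTSYS || result == -EINTR;
    }

When lock_sock_try() fails and the send path returns -ERESTARTSYS, a
contended socket lock therefore surfaces as a requeue rather than as a
blocking wait on lock_sock() under reclaim.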

For the sock_sendmsg() call used for NBD_CMD_DISC and for kernel_sock_shutdown(),
the operation might be skipped if the lock cannot be acquired.
However, this is not expected to occur in practice because the
backend TCP socket should not be touched by userspace once it is
handed over to NBD.

Note that sock_recvmsg() does not require this special handling
because it is only called from the workqueue context.

Also note that AF_UNIX sockets continue to use sock_sendmsg()
and kernel_sock_shutdown() because unix_stream_sendmsg() and
unix_shutdown() do not acquire lock_sock().

[0]:
WARNING: possible circular locking dependency detected
syzkaller #0 Tainted: G             L
syz.7.2282/12353 is trying to acquire lock:
ffffffff8e9aa700 (fs_reclaim){+.+.}-{0:0}, at: might_alloc include/linux/sched/mm.h:317 [inline]
ffffffff8e9aa700 (fs_reclaim){+.+.}-{0:0}, at: slab_pre_alloc_hook mm/slub.c:4489 [inline]
ffffffff8e9aa700 (fs_reclaim){+.+.}-{0:0}, at: slab_alloc_node mm/slub.c:4843 [inline]
ffffffff8e9aa700 (fs_reclaim){+.+.}-{0:0}, at: kmem_cache_alloc_node_noprof+0x53/0x6f0 mm/slub.c:4918

but task is already holding lock:
ffff88806f972a20 (sk_lock-AF_INET6){+.+.}-{0:0}, at: lock_sock include/net/sock.h:1709 [inline]
ffff88806f972a20 (sk_lock-AF_INET6){+.+.}-{0:0}, at: tcp_close+0x1d/0x110 net/ipv4/tcp.c:3349

which lock already depends on the new lock.

the existing dependency chain (in reverse order) is:

-> #6 (sk_lock-AF_INET6){+.+.}-{0:0}:
       lock_sock_nested+0x41/0xf0 net/core/sock.c:3780
       lock_sock include/net/sock.h:1709 [inline]
       inet_shutdown+0x67/0x410 net/ipv4/af_inet.c:919
       nbd_mark_nsock_dead+0xae/0x5c0 drivers/block/nbd.c:318
       sock_shutdown+0x16b/0x200 drivers/block/nbd.c:411
       nbd_clear_sock drivers/block/nbd.c:1427 [inline]
       nbd_config_put+0x1eb/0x750 drivers/block/nbd.c:1451
       nbd_genl_connect+0xaf8/0x1a40 drivers/block/nbd.c:2248
       genl_family_rcv_msg_doit+0x214/0x300 net/netlink/genetlink.c:1114
       genl_family_rcv_msg net/netlink/genetlink.c:1194 [inline]
       genl_rcv_msg+0x560/0x800 net/netlink/genetlink.c:1209
       netlink_rcv_skb+0x159/0x420 net/netlink/af_netlink.c:2550
       genl_rcv+0x28/0x40 net/netlink/genetlink.c:1218
       netlink_unicast_kernel net/netlink/af_netlink.c:1318 [inline]
       netlink_unicast+0x5aa/0x870 net/netlink/af_netlink.c:1344
       netlink_sendmsg+0x8b0/0xda0 net/netlink/af_netlink.c:1894
       sock_sendmsg_nosec net/socket.c:727 [inline]
       __sock_sendmsg net/socket.c:742 [inline]
       ____sys_sendmsg+0x9e1/0xb70 net/socket.c:2592
       ___sys_sendmsg+0x190/0x1e0 net/socket.c:2646
       __sys_sendmsg+0x170/0x220 net/socket.c:2678
       do_syscall_x64 arch/x86/entry/syscall_64.c:63 [inline]
       do_syscall_64+0x106/0xf80 arch/x86/entry/syscall_64.c:94
       entry_SYSCALL_64_after_hwframe+0x77/0x7f

-> #5 (&nsock->tx_lock){+.+.}-{4:4}:
       __mutex_lock_common kernel/locking/mutex.c:614 [inline]
       __mutex_lock+0x1a2/0x1b90 kernel/locking/mutex.c:776
       nbd_handle_cmd drivers/block/nbd.c:1143 [inline]
       nbd_queue_rq+0x428/0x1080 drivers/block/nbd.c:1207
       blk_mq_dispatch_rq_list+0x422/0x1e70 block/blk-mq.c:2148
       __blk_mq_do_dispatch_sched block/blk-mq-sched.c:168 [inline]
       blk_mq_do_dispatch_sched block/blk-mq-sched.c:182 [inline]
       __blk_mq_sched_dispatch_requests+0xcea/0x1620 block/blk-mq-sched.c:307
       blk_mq_sched_dispatch_requests+0xd7/0x1c0 block/blk-mq-sched.c:329
       blk_mq_run_hw_queue+0x23c/0x670 block/blk-mq.c:2386
       blk_mq_dispatch_list+0x51d/0x1360 block/blk-mq.c:2949
       blk_mq_flush_plug_list block/blk-mq.c:2997 [inline]
       blk_mq_flush_plug_list+0x130/0x600 block/blk-mq.c:2969
       __blk_flush_plug+0x2c4/0x4b0 block/blk-core.c:1230
       blk_finish_plug block/blk-core.c:1257 [inline]
       __submit_bio+0x584/0x6c0 block/blk-core.c:649
       __submit_bio_noacct_mq block/blk-core.c:722 [inline]
       submit_bio_noacct_nocheck+0x562/0xc10 block/blk-core.c:753
       submit_bio_noacct+0xd17/0x2010 block/blk-core.c:884
       blk_crypto_submit_bio include/linux/blk-crypto.h:203 [inline]
       submit_bh_wbc+0x59c/0x770 fs/buffer.c:2821
       submit_bh fs/buffer.c:2826 [inline]
       block_read_full_folio+0x264/0x8e0 fs/buffer.c:2444
       filemap_read_folio+0xfc/0x3b0 mm/filemap.c:2501
       do_read_cache_folio+0x2d7/0x6b0 mm/filemap.c:4101
       read_mapping_folio include/linux/pagemap.h:1028 [inline]
       read_part_sector+0xd1/0x370 block/partitions/core.c:723
       adfspart_check_ICS+0x93/0x910 block/partitions/acorn.c:360
       check_partition block/partitions/core.c:142 [inline]
       blk_add_partitions block/partitions/core.c:590 [inline]
       bdev_disk_changed+0x7f8/0xc80 block/partitions/core.c:694
       blkdev_get_whole+0x187/0x290 block/bdev.c:764
       bdev_open+0x2c7/0xe40 block/bdev.c:973
       blkdev_open+0x34e/0x4f0 block/fops.c:697
       do_dentry_open+0x6d8/0x1660 fs/open.c:949
       vfs_open+0x82/0x3f0 fs/open.c:1081
       do_open fs/namei.c:4671 [inline]
       path_openat+0x208c/0x31a0 fs/namei.c:4830
       do_file_open+0x20e/0x430 fs/namei.c:4859
       do_sys_openat2+0x10d/0x1e0 fs/open.c:1366
       do_sys_open fs/open.c:1372 [inline]
       __do_sys_openat fs/open.c:1388 [inline]
       __se_sys_openat fs/open.c:1383 [inline]
       __x64_sys_openat+0x12d/0x210 fs/open.c:1383
       do_syscall_x64 arch/x86/entry/syscall_64.c:63 [inline]
       do_syscall_64+0x106/0xf80 arch/x86/entry/syscall_64.c:94
       entry_SYSCALL_64_after_hwframe+0x77/0x7f

-> #4 (&cmd->lock){+.+.}-{4:4}:
       __mutex_lock_common kernel/locking/mutex.c:614 [inline]
       __mutex_lock+0x1a2/0x1b90 kernel/locking/mutex.c:776
       nbd_queue_rq+0xba/0x1080 drivers/block/nbd.c:1199
       blk_mq_dispatch_rq_list+0x422/0x1e70 block/blk-mq.c:2148
       __blk_mq_do_dispatch_sched block/blk-mq-sched.c:168 [inline]
       blk_mq_do_dispatch_sched block/blk-mq-sched.c:182 [inline]
       __blk_mq_sched_dispatch_requests+0xcea/0x1620 block/blk-mq-sched.c:307
       blk_mq_sched_dispatch_requests+0xd7/0x1c0 block/blk-mq-sched.c:329
       blk_mq_run_hw_queue+0x23c/0x670 block/blk-mq.c:2386
       blk_mq_dispatch_list+0x51d/0x1360 block/blk-mq.c:2949
       blk_mq_flush_plug_list block/blk-mq.c:2997 [inline]
       blk_mq_flush_plug_list+0x130/0x600 block/blk-mq.c:2969
       __blk_flush_plug+0x2c4/0x4b0 block/blk-core.c:1230
       blk_finish_plug block/blk-core.c:1257 [inline]
       __submit_bio+0x584/0x6c0 block/blk-core.c:649
       __submit_bio_noacct_mq block/blk-core.c:722 [inline]
       submit_bio_noacct_nocheck+0x562/0xc10 block/blk-core.c:753
       submit_bio_noacct+0xd17/0x2010 block/blk-core.c:884
       blk_crypto_submit_bio include/linux/blk-crypto.h:203 [inline]
       submit_bh_wbc+0x59c/0x770 fs/buffer.c:2821
       submit_bh fs/buffer.c:2826 [inline]
       block_read_full_folio+0x264/0x8e0 fs/buffer.c:2444
       filemap_read_folio+0xfc/0x3b0 mm/filemap.c:2501
       do_read_cache_folio+0x2d7/0x6b0 mm/filemap.c:4101
       read_mapping_folio include/linux/pagemap.h:1028 [inline]
       read_part_sector+0xd1/0x370 block/partitions/core.c:723
       adfspart_check_ICS+0x93/0x910 block/partitions/acorn.c:360
       check_partition block/partitions/core.c:142 [inline]
       blk_add_partitions block/partitions/core.c:590 [inline]
       bdev_disk_changed+0x7f8/0xc80 block/partitions/core.c:694
       blkdev_get_whole+0x187/0x290 block/bdev.c:764
       bdev_open+0x2c7/0xe40 block/bdev.c:973
       blkdev_open+0x34e/0x4f0 block/fops.c:697
       do_dentry_open+0x6d8/0x1660 fs/open.c:949
       vfs_open+0x82/0x3f0 fs/open.c:1081
       do_open fs/namei.c:4671 [inline]
       path_openat+0x208c/0x31a0 fs/namei.c:4830
       do_file_open+0x20e/0x430 fs/namei.c:4859
       do_sys_openat2+0x10d/0x1e0 fs/open.c:1366
       do_sys_open fs/open.c:1372 [inline]
       __do_sys_openat fs/open.c:1388 [inline]
       __se_sys_openat fs/open.c:1383 [inline]
       __x64_sys_openat+0x12d/0x210 fs/open.c:1383
       do_syscall_x64 arch/x86/entry/syscall_64.c:63 [inline]
       do_syscall_64+0x106/0xf80 arch/x86/entry/syscall_64.c:94
       entry_SYSCALL_64_after_hwframe+0x77/0x7f

-> #3 (set->srcu){.+.+}-{0:0}:
       srcu_lock_sync include/linux/srcu.h:199 [inline]
       __synchronize_srcu+0xa1/0x2a0 kernel/rcu/srcutree.c:1505
       blk_mq_wait_quiesce_done block/blk-mq.c:284 [inline]
       blk_mq_wait_quiesce_done block/blk-mq.c:281 [inline]
       blk_mq_quiesce_queue block/blk-mq.c:304 [inline]
       blk_mq_quiesce_queue+0x149/0x1c0 block/blk-mq.c:299
       elevator_switch+0x17b/0x7e0 block/elevator.c:576
       elevator_change+0x352/0x530 block/elevator.c:681
       elevator_set_default+0x29e/0x360 block/elevator.c:754
       blk_register_queue+0x412/0x590 block/blk-sysfs.c:946
       __add_disk+0x73f/0xe40 block/genhd.c:528
       add_disk_fwnode+0x118/0x5c0 block/genhd.c:597
       add_disk include/linux/blkdev.h:785 [inline]
       nbd_dev_add+0x77a/0xb10 drivers/block/nbd.c:1984
       nbd_init+0x291/0x2b0 drivers/block/nbd.c:2692
       do_one_initcall+0x11d/0x760 init/main.c:1382
       do_initcall_level init/main.c:1444 [inline]
       do_initcalls init/main.c:1460 [inline]
       do_basic_setup init/main.c:1479 [inline]
       kernel_init_freeable+0x6e5/0x7a0 init/main.c:1692
       kernel_init+0x1f/0x1e0 init/main.c:1582
       ret_from_fork+0x754/0xd80 arch/x86/kernel/process.c:158
       ret_from_fork_asm+0x1a/0x30 arch/x86/entry/entry_64.S:245

-> #2 (&q->elevator_lock){+.+.}-{4:4}:
       __mutex_lock_common kernel/locking/mutex.c:614 [inline]
       __mutex_lock+0x1a2/0x1b90 kernel/locking/mutex.c:776
       elevator_change+0x1bc/0x530 block/elevator.c:679
       elevator_set_none+0x92/0xf0 block/elevator.c:769
       blk_mq_elv_switch_none block/blk-mq.c:5110 [inline]
       __blk_mq_update_nr_hw_queues block/blk-mq.c:5155 [inline]
       blk_mq_update_nr_hw_queues+0x4c1/0x15f0 block/blk-mq.c:5220
       nbd_start_device+0x1a6/0xbd0 drivers/block/nbd.c:1489
       nbd_genl_connect+0xff2/0x1a40 drivers/block/nbd.c:2239
       genl_family_rcv_msg_doit+0x214/0x300 net/netlink/genetlink.c:1114
       genl_family_rcv_msg net/netlink/genetlink.c:1194 [inline]
       genl_rcv_msg+0x560/0x800 net/netlink/genetlink.c:1209
       netlink_rcv_skb+0x159/0x420 net/netlink/af_netlink.c:2550
       genl_rcv+0x28/0x40 net/netlink/genetlink.c:1218
       netlink_unicast_kernel net/netlink/af_netlink.c:1318 [inline]
       netlink_unicast+0x5aa/0x870 net/netlink/af_netlink.c:1344
       netlink_sendmsg+0x8b0/0xda0 net/netlink/af_netlink.c:1894
       sock_sendmsg_nosec net/socket.c:727 [inline]
       __sock_sendmsg net/socket.c:742 [inline]
       ____sys_sendmsg+0x9e1/0xb70 net/socket.c:2592
       ___sys_sendmsg+0x190/0x1e0 net/socket.c:2646
       __sys_sendmsg+0x170/0x220 net/socket.c:2678
       do_syscall_x64 arch/x86/entry/syscall_64.c:63 [inline]
       do_syscall_64+0x106/0xf80 arch/x86/entry/syscall_64.c:94
       entry_SYSCALL_64_after_hwframe+0x77/0x7f

-> #1 (&q->q_usage_counter(io)#49){++++}-{0:0}:
       blk_alloc_queue+0x610/0x790 block/blk-core.c:461
       blk_mq_alloc_queue+0x174/0x290 block/blk-mq.c:4429
       __blk_mq_alloc_disk+0x29/0x120 block/blk-mq.c:4476
       nbd_dev_add+0x492/0xb10 drivers/block/nbd.c:1954
       nbd_init+0x291/0x2b0 drivers/block/nbd.c:2692
       do_one_initcall+0x11d/0x760 init/main.c:1382
       do_initcall_level init/main.c:1444 [inline]
       do_initcalls init/main.c:1460 [inline]
       do_basic_setup init/main.c:1479 [inline]
       kernel_init_freeable+0x6e5/0x7a0 init/main.c:1692
       kernel_init+0x1f/0x1e0 init/main.c:1582
       ret_from_fork+0x754/0xd80 arch/x86/kernel/process.c:158
       ret_from_fork_asm+0x1a/0x30 arch/x86/entry/entry_64.S:245

-> #0 (fs_reclaim){+.+.}-{0:0}:
       check_prev_add kernel/locking/lockdep.c:3165 [inline]
       check_prevs_add kernel/locking/lockdep.c:3284 [inline]
       validate_chain kernel/locking/lockdep.c:3908 [inline]
       __lock_acquire+0x14b8/0x2630 kernel/locking/lockdep.c:5237
       lock_acquire kernel/locking/lockdep.c:5868 [inline]
       lock_acquire+0x1cf/0x380 kernel/locking/lockdep.c:5825
       __fs_reclaim_acquire mm/page_alloc.c:4348 [inline]
       fs_reclaim_acquire+0xc4/0x100 mm/page_alloc.c:4362
       might_alloc include/linux/sched/mm.h:317 [inline]
       slab_pre_alloc_hook mm/slub.c:4489 [inline]
       slab_alloc_node mm/slub.c:4843 [inline]
       kmem_cache_alloc_node_noprof+0x53/0x6f0 mm/slub.c:4918
       __alloc_skb+0x140/0x710 net/core/skbuff.c:702
       alloc_skb include/linux/skbuff.h:1383 [inline]
       tcp_send_active_reset+0x8b/0xa60 net/ipv4/tcp_output.c:3862
       __tcp_close+0x41e/0x1110 net/ipv4/tcp.c:3223
       tcp_close+0x28/0x110 net/ipv4/tcp.c:3350
       inet_release+0xed/0x200 net/ipv4/af_inet.c:443
       inet6_release+0x4f/0x70 net/ipv6/af_inet6.c:479
       __sock_release+0xb3/0x260 net/socket.c:662
       sock_close+0x1c/0x30 net/socket.c:1455
       __fput+0x3ff/0xb40 fs/file_table.c:469
       task_work_run+0x150/0x240 kernel/task_work.c:233
       resume_user_mode_work include/linux/resume_user_mode.h:50 [inline]
       __exit_to_user_mode_loop kernel/entry/common.c:67 [inline]
       exit_to_user_mode_loop+0x100/0x4a0 kernel/entry/common.c:98
       __exit_to_user_mode_prepare include/linux/irq-entry-common.h:226 [inline]
       syscall_exit_to_user_mode_prepare include/linux/irq-entry-common.h:256 [inline]
       syscall_exit_to_user_mode include/linux/entry-common.h:325 [inline]
       do_syscall_64+0x67c/0xf80 arch/x86/entry/syscall_64.c:100
       entry_SYSCALL_64_after_hwframe+0x77/0x7f

other info that might help us debug this:

Chain exists of:
  fs_reclaim --> &nsock->tx_lock --> sk_lock-AF_INET6

 Possible unsafe locking scenario:

       CPU0                    CPU1
       ----                    ----
  lock(sk_lock-AF_INET6);
                               lock(&nsock->tx_lock);
                               lock(sk_lock-AF_INET6);
  lock(fs_reclaim);

 *** DEADLOCK ***

Fixes: fd8383f ("nbd: convert to blkmq")
Reported-by: [email protected]
Closes: https://lore.kernel.org/netdev/[email protected]/
Signed-off-by: Kuniyuki Iwashima <[email protected]>
blktests-ci Bot pushed a commit that referenced this pull request Mar 25, 2026

blktests-ci Bot pushed a commit that referenced this pull request Mar 25, 2026
When a bio goes through the rq_qos infrastructure on a path's request
queue, it gets BIO_QOS_THROTTLED or BIO_QOS_MERGED flags set. These
flags indicate that rq_qos_done_bio() should be called on completion
to update rq_qos accounting.

During path failover in nvme_failover_req(), the bio's bi_bdev is
redirected from the failed path's disk to the multipath head's disk
via bio_set_dev(). However, the BIO_QOS flags are not cleared.

When the bio eventually completes (either successfully via a new path
or with an error via bio_io_error()), rq_qos_done_bio() checks for
these flags and calls __rq_qos_done_bio(q->rq_qos, bio) where q is
obtained from the bio's current bi_bdev - which is now the multipath
head's queue, not the original path's queue.

The multipath head's queue does not have rq_qos enabled (q->rq_qos is
NULL), but the code assumes that if BIO_QOS_* flags are set, q->rq_qos
must be valid.

This breaks when a bio is moved between queues during NVMe multipath
failover, leading to a NULL pointer dereference.
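
For reference, a simplified sketch of the completion-side check being
described, adapted from block/blk-rq-qos.h with details trimmed; it relies on
q->rq_qos being valid whenever the BIO_QOS_* flags are set, which no longer
holds once the bio points at the multipath head.

```c
static inline void rq_qos_done_bio(struct bio *bio)
{
	if (bio->bi_bdev && (bio_flagged(bio, BIO_QOS_THROTTLED) ||
			     bio_flagged(bio, BIO_QOS_MERGED))) {
		struct request_queue *q = bdev_get_queue(bio->bi_bdev);

		/* q is now the multipath head's queue: q->rq_qos == NULL */
		__rq_qos_done_bio(q->rq_qos, bio);
	}
}
```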

Execution context timeline:

   * =====> dd process context
   [USER] dd process
     [SYSCALL] write() - dd process context
       submit_bio()
       nvme_ns_head_submit_bio() - path selection
       blk_mq_submit_bio()  #### QOS FLAGS SET HERE

        [USER] dd waits or returns

          ==== I/O in flight on NVMe hardware =====

   ===== End of submission path ====
   ------------------------------------------------------

   * dd ====> Interrupt context;
   [IRQ] NVMe completion interrupt
       nvme_irq()
        nvme_complete_rq()
         nvme_failover_req() ### BIO MOVED TO HEAD
            spin_lock_irqsave (atomic section)
            bio_set_dev() changes bi_bdev
            ### BUG: QOS flags NOT cleared
            kblockd_schedule_work()

   * Interrupt context =====> kblockd workqueue
   [WQ] kblockd workqueue - kworker process
       nvme_requeue_work()
        submit_bio_noacct()
         nvme_ns_head_submit_bio()
          nvme_find_path() returns NULL
           bio_io_error()
            bio_endio()
             rq_qos_done_bio()  ### CRASH ###

   KERNEL PANIC / OOPS

Crash from blktests nvme/058 (rapid namespace remapping):

[ 1339.636033] BUG: kernel NULL pointer dereference, address: 0000000000000000
[ 1339.641025] nvme nvme4: rescanning namespaces.
[ 1339.642064] #PF: supervisor read access in kernel mode
[ 1339.642067] #PF: error_code(0x0000) - not-present page
[ 1339.642070] PGD 0 P4D 0
[ 1339.642073] Oops: Oops: 0000 [#1] SMP NOPTI
[ 1339.642078] CPU: 35 UID: 0 PID: 4579 Comm: kworker/35:2H
               Tainted: G   O     N  6.17.0-rc3nvme+ #5 PREEMPT(voluntary)
[ 1339.642084] Tainted: [O]=OOT_MODULE, [N]=TEST
[ 1339.673446] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996),
           BIOS rel-1.16.3-0-ga6ed6b701f0a-prebuilt.qemu.org 04/01/2014
[ 1339.682359] Workqueue: kblockd nvme_requeue_work [nvme_core]
[ 1339.686613] RIP: 0010:__rq_qos_done_bio+0xd/0x40
[ 1339.690161] Code: 75 dd 5b 5d 41 5c c3 cc cc cc cc 66 90 90 90 90 90 90 90
                     90 90 90 90 90 90 90 90 90 90 0f 1f 44 00 00 55 48 89 f5
             53 48 89 fb <48> 8b 03 48 8b 40 30 48 85 c0 74 0b 48 89 ee
             48 89 df ff d0 0f 1f
[ 1339.703691] RSP: 0018:ffffc900066f3c90 EFLAGS: 00010202
[ 1339.706844] RAX: ffff888148b9ef00 RBX: 0000000000000000 RCX: 0000000000000000
[ 1339.711136] RDX: 00000000000001c0 RSI: ffff8882aaab8a80 RDI: 0000000000000000
[ 1339.715691] RBP: ffff8882aaab8a80 R08: 0000000000000000 R09: 0000000000000000
[ 1339.720472] R10: 0000000000000000 R11: fefefefefefefeff R12: ffff8882aa3b6010
[ 1339.724650] R13: 0000000000000000 R14: ffff8882338bcef0 R15: ffff8882aa3b6020
[ 1339.729029] FS:  0000000000000000(0000) GS:ffff88985c0cf000(0000) knlGS:0000000000000000
[ 1339.734525] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[ 1339.738563] CR2: 0000000000000000 CR3: 0000000111045000 CR4: 0000000000350ef0
[ 1339.742750] DR0: ffffffff845ccbec DR1: ffffffff845ccbed DR2: ffffffff845ccbee
[ 1339.745630] DR3: ffffffff845ccbef DR6: 00000000ffff0ff0 DR7: 0000000000000600
[ 1339.748488] Call Trace:
[ 1339.749512]  <TASK>
[ 1339.750449]  bio_endio+0x71/0x2e0
[ 1339.751833]  nvme_ns_head_submit_bio+0x290/0x320 [nvme_core]
[ 1339.754073]  __submit_bio+0x222/0x5e0
[ 1339.755623]  ? rcu_is_watching+0xd/0x40
[ 1339.757201]  ? submit_bio_noacct_nocheck+0x131/0x370
[ 1339.759210]  submit_bio_noacct_nocheck+0x131/0x370
[ 1339.761189]  ? submit_bio_noacct+0x20/0x620
[ 1339.762849]  nvme_requeue_work+0x4b/0x60 [nvme_core]
[ 1339.764828]  process_one_work+0x20e/0x630
[ 1339.766528]  worker_thread+0x184/0x330
[ 1339.768129]  ? __pfx_worker_thread+0x10/0x10
[ 1339.769942]  kthread+0x10a/0x250
[ 1339.771263]  ? __pfx_kthread+0x10/0x10
[ 1339.772776]  ? __pfx_kthread+0x10/0x10
[ 1339.774381]  ret_from_fork+0x273/0x2e0
[ 1339.775948]  ? __pfx_kthread+0x10/0x10
[ 1339.777504]  ret_from_fork_asm+0x1a/0x30
[ 1339.779163]  </TASK>

Fix this by clearing both BIO_QOS_THROTTLED and BIO_QOS_MERGED flags
when bios are redirected to the multipath head in nvme_failover_req().
This is consistent with the existing code that clears REQ_POLLED and
REQ_NOWAIT flags when the bio changes queues.
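
A hedged sketch of that change, not the exact applied hunk; the per-bio loop
and field names follow the mainline failover path, and the placement of the
new flag clearing is an assumption.

```c
/* Illustrative fragment for the per-bio loop in nvme_failover_req(). */
for (bio = req->bio; bio; bio = bio->bi_next) {
	bio_set_dev(bio, ns->head->disk->part0);

	/* Existing behaviour: polling/nowait semantics do not carry over. */
	bio->bi_opf &= ~(REQ_POLLED | REQ_NOWAIT);

	/* New: the head's queue has no rq_qos, so drop the QOS bookkeeping
	 * flags and keep rq_qos_done_bio() from dereferencing q->rq_qos.
	 */
	bio_clear_flag(bio, BIO_QOS_THROTTLED);
	bio_clear_flag(bio, BIO_QOS_MERGED);
}
```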

Signed-off-by: Chaitanya Kulkarni <[email protected]>
Reviewed-by: Christoph Hellwig <[email protected]>
blktests-ci Bot pushed a commit that referenced this pull request Mar 27, 2026
The devm_free_irq() and devm_request_irq() functions should not be
executed in an atomic context.

During device suspend, all userspace processes and most kernel threads
are frozen. Additionally, we flush all tx/rx status, disable all macb
interrupts, and halt rx operations. Therefore, it is safe to split the
region protected by bp->lock into two independent sections, allowing
devm_free_irq() and devm_request_irq() to run in a non-atomic context.
This modification resolves the following lockdep warning:
  BUG: sleeping function called from invalid context at kernel/locking/mutex.c:591
  in_atomic(): 1, irqs_disabled(): 1, non_block: 0, pid: 501, name: rtcwake
  preempt_count: 1, expected: 0
  RCU nest depth: 1, expected: 0
  7 locks held by rtcwake/501:
   #0: ffff0008038c3408 (sb_writers#5){.+.+}-{0:0}, at: vfs_write+0xf8/0x368
   #1: ffff0008049a5e88 (&of->mutex#2){+.+.}-{4:4}, at: kernfs_fop_write_iter+0xbc/0x1c8
   #2: ffff00080098d588 (kn->active#70){.+.+}-{0:0}, at: kernfs_fop_write_iter+0xcc/0x1c8
   #3: ffff800081c84888 (system_transition_mutex){+.+.}-{4:4}, at: pm_suspend+0x1ec/0x290
   #4: ffff0008009ba0f8 (&dev->mutex){....}-{4:4}, at: device_suspend+0x118/0x4f0
   #5: ffff800081d00458 (rcu_read_lock){....}-{1:3}, at: rcu_lock_acquire+0x4/0x48
   #6: ffff0008031fb9e0 (&bp->lock){-.-.}-{3:3}, at: macb_suspend+0x144/0x558
  irq event stamp: 8682
  hardirqs last  enabled at (8681): [<ffff8000813c7d7c>] _raw_spin_unlock_irqrestore+0x44/0x88
  hardirqs last disabled at (8682): [<ffff8000813c7b58>] _raw_spin_lock_irqsave+0x38/0x98
  softirqs last  enabled at (7322): [<ffff8000800f1b4c>] handle_softirqs+0x52c/0x588
  softirqs last disabled at (7317): [<ffff800080010310>] __do_softirq+0x20/0x2c
  CPU: 1 UID: 0 PID: 501 Comm: rtcwake Not tainted 7.0.0-rc3-next-20260310-yocto-standard+ #125 PREEMPT
  Hardware name: ZynqMP ZCU102 Rev1.1 (DT)
  Call trace:
   show_stack+0x24/0x38 (C)
   __dump_stack+0x28/0x38
   dump_stack_lvl+0x64/0x88
   dump_stack+0x18/0x24
   __might_resched+0x200/0x218
   __might_sleep+0x38/0x98
   __mutex_lock_common+0x7c/0x1378
   mutex_lock_nested+0x38/0x50
   free_irq+0x68/0x2b0
   devm_irq_release+0x24/0x38
   devres_release+0x40/0x80
   devm_free_irq+0x48/0x88
   macb_suspend+0x298/0x558
   device_suspend+0x218/0x4f0
   dpm_suspend+0x244/0x3a0
   dpm_suspend_start+0x50/0x78
   suspend_devices_and_enter+0xec/0x560
   pm_suspend+0x194/0x290
   state_store+0x110/0x158
   kobj_attr_store+0x1c/0x30
   sysfs_kf_write+0xa8/0xd0
   kernfs_fop_write_iter+0x11c/0x1c8
   vfs_write+0x248/0x368
   ksys_write+0x7c/0xf8
   __arm64_sys_write+0x28/0x40
   invoke_syscall+0x4c/0xe8
   el0_svc_common+0x98/0xf0
   do_el0_svc+0x28/0x40
   el0_svc+0x54/0x1e0
   el0t_64_sync_handler+0x84/0x130
   el0t_64_sync+0x198/0x1a0
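
A minimal sketch of the bp->lock split described above, not the exact hunk;
dev, netdev, bp, flags, and macb_wol_interrupt() follow the driver's existing
WoL suspend path, but the surrounding call sites are illustrative assumptions.

```c
/* Quiesce the controller with interrupts disabled, as before. */
spin_lock_irqsave(&bp->lock, flags);
/* ... flush tx/rx state, mask MACB interrupts, halt rx ... */
spin_unlock_irqrestore(&bp->lock, flags);

/* Sleepable context: devm_free_irq()/devm_request_irq() may take mutexes. */
devm_free_irq(dev, bp->queues[0].irq, bp->queues);
ret = devm_request_irq(dev, bp->queues[0].irq, macb_wol_interrupt,
		       IRQF_SHARED, netdev->name, bp->queues);

/* Re-enter the atomic section to arm wake-on-LAN and finish suspend. */
spin_lock_irqsave(&bp->lock, flags);
/* ... enable WoL in the MAC, update driver state ... */
spin_unlock_irqrestore(&bp->lock, flags);
```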

Fixes: 558e35c ("net: macb: WoL support for GEM type of Ethernet controller")
Cc: [email protected]
Reviewed-by: Théo Lebrun <[email protected]>
Signed-off-by: Kevin Hao <[email protected]>
Link: https://patch.msgid.link/[email protected]
Signed-off-by: Jakub Kicinski <[email protected]>
blktests-ci Bot pushed a commit that referenced this pull request Mar 27, 2026
blktests-ci Bot pushed a commit that referenced this pull request Mar 27, 2026
blktests-ci Bot pushed a commit that referenced this pull request Mar 27, 2026
As reported by syzbot [0], NBD can trigger a deadlock during
memory reclaim.

This occurs when a process holds lock_sock() on a backend TCP
socket and triggers a memory allocation that leads to fs reclaim.
If it eventually calls into NBD to send data or shut down the
socket, NBD will attempt to acquire the same lock_sock(),
resulting in the deadlock.

While NBD sets sk->sk_allocation to GFP_NOIO before calling
sendmsg(), this does not prevent the issue in some paths where
GFP_KERNEL is used directly under lock_sock().

To resolve this, let's use lock_sock_try() for TCP sendmsg() and
shutdown().

For sock_sendmsg(), if lock_sock_try() fails, -ERESTARTSYS is
returned, allowing the request to be retried later (e.g., via
was_interrupted() logic).

For sock_sendmsg() for NBD_CMD_DISC and kernel_sock_shutdown(),
the operation might be skipped if the lock cannot be acquired.
However, this is not expected to occur in practice because the
backend TCP socket should not be touched by userspace once it is
handed over to NBD.

Note that sock_recvmsg() does not require this special handling
because it is only called from the workqueue context.

Also note that AF_UNIX sockets continue to use sock_sendmsg()
and kernel_sock_shutdown() because unix_stream_sendmsg() and
unix_shutdown() do not acquire lock_sock().

[0]:
WARNING: possible circular locking dependency detected
syzkaller #0 Tainted: G             L
syz.7.2282/12353 is trying to acquire lock:
ffffffff8e9aa700 (fs_reclaim){+.+.}-{0:0}, at: might_alloc include/linux/sched/mm.h:317 [inline]
ffffffff8e9aa700 (fs_reclaim){+.+.}-{0:0}, at: slab_pre_alloc_hook mm/slub.c:4489 [inline]
ffffffff8e9aa700 (fs_reclaim){+.+.}-{0:0}, at: slab_alloc_node mm/slub.c:4843 [inline]
ffffffff8e9aa700 (fs_reclaim){+.+.}-{0:0}, at: kmem_cache_alloc_node_noprof+0x53/0x6f0 mm/slub.c:4918

but task is already holding lock:
ffff88806f972a20 (sk_lock-AF_INET6){+.+.}-{0:0}, at: lock_sock include/net/sock.h:1709 [inline]
ffff88806f972a20 (sk_lock-AF_INET6){+.+.}-{0:0}, at: tcp_close+0x1d/0x110 net/ipv4/tcp.c:3349

which lock already depends on the new lock.

the existing dependency chain (in reverse order) is:

-> #6 (sk_lock-AF_INET6){+.+.}-{0:0}:
       lock_sock_nested+0x41/0xf0 net/core/sock.c:3780
       lock_sock include/net/sock.h:1709 [inline]
       inet_shutdown+0x67/0x410 net/ipv4/af_inet.c:919
       nbd_mark_nsock_dead+0xae/0x5c0 drivers/block/nbd.c:318
       sock_shutdown+0x16b/0x200 drivers/block/nbd.c:411
       nbd_clear_sock drivers/block/nbd.c:1427 [inline]
       nbd_config_put+0x1eb/0x750 drivers/block/nbd.c:1451
       nbd_genl_connect+0xaf8/0x1a40 drivers/block/nbd.c:2248
       genl_family_rcv_msg_doit+0x214/0x300 net/netlink/genetlink.c:1114
       genl_family_rcv_msg net/netlink/genetlink.c:1194 [inline]
       genl_rcv_msg+0x560/0x800 net/netlink/genetlink.c:1209
       netlink_rcv_skb+0x159/0x420 net/netlink/af_netlink.c:2550
       genl_rcv+0x28/0x40 net/netlink/genetlink.c:1218
       netlink_unicast_kernel net/netlink/af_netlink.c:1318 [inline]
       netlink_unicast+0x5aa/0x870 net/netlink/af_netlink.c:1344
       netlink_sendmsg+0x8b0/0xda0 net/netlink/af_netlink.c:1894
       sock_sendmsg_nosec net/socket.c:727 [inline]
       __sock_sendmsg net/socket.c:742 [inline]
       ____sys_sendmsg+0x9e1/0xb70 net/socket.c:2592
       ___sys_sendmsg+0x190/0x1e0 net/socket.c:2646
       __sys_sendmsg+0x170/0x220 net/socket.c:2678
       do_syscall_x64 arch/x86/entry/syscall_64.c:63 [inline]
       do_syscall_64+0x106/0xf80 arch/x86/entry/syscall_64.c:94
       entry_SYSCALL_64_after_hwframe+0x77/0x7f

-> #5 (&nsock->tx_lock){+.+.}-{4:4}:
       __mutex_lock_common kernel/locking/mutex.c:614 [inline]
       __mutex_lock+0x1a2/0x1b90 kernel/locking/mutex.c:776
       nbd_handle_cmd drivers/block/nbd.c:1143 [inline]
       nbd_queue_rq+0x428/0x1080 drivers/block/nbd.c:1207
       blk_mq_dispatch_rq_list+0x422/0x1e70 block/blk-mq.c:2148
       __blk_mq_do_dispatch_sched block/blk-mq-sched.c:168 [inline]
       blk_mq_do_dispatch_sched block/blk-mq-sched.c:182 [inline]
       __blk_mq_sched_dispatch_requests+0xcea/0x1620 block/blk-mq-sched.c:307
       blk_mq_sched_dispatch_requests+0xd7/0x1c0 block/blk-mq-sched.c:329
       blk_mq_run_hw_queue+0x23c/0x670 block/blk-mq.c:2386
       blk_mq_dispatch_list+0x51d/0x1360 block/blk-mq.c:2949
       blk_mq_flush_plug_list block/blk-mq.c:2997 [inline]
       blk_mq_flush_plug_list+0x130/0x600 block/blk-mq.c:2969
       __blk_flush_plug+0x2c4/0x4b0 block/blk-core.c:1230
       blk_finish_plug block/blk-core.c:1257 [inline]
       __submit_bio+0x584/0x6c0 block/blk-core.c:649
       __submit_bio_noacct_mq block/blk-core.c:722 [inline]
       submit_bio_noacct_nocheck+0x562/0xc10 block/blk-core.c:753
       submit_bio_noacct+0xd17/0x2010 block/blk-core.c:884
       blk_crypto_submit_bio include/linux/blk-crypto.h:203 [inline]
       submit_bh_wbc+0x59c/0x770 fs/buffer.c:2821
       submit_bh fs/buffer.c:2826 [inline]
       block_read_full_folio+0x264/0x8e0 fs/buffer.c:2444
       filemap_read_folio+0xfc/0x3b0 mm/filemap.c:2501
       do_read_cache_folio+0x2d7/0x6b0 mm/filemap.c:4101
       read_mapping_folio include/linux/pagemap.h:1028 [inline]
       read_part_sector+0xd1/0x370 block/partitions/core.c:723
       adfspart_check_ICS+0x93/0x910 block/partitions/acorn.c:360
       check_partition block/partitions/core.c:142 [inline]
       blk_add_partitions block/partitions/core.c:590 [inline]
       bdev_disk_changed+0x7f8/0xc80 block/partitions/core.c:694
       blkdev_get_whole+0x187/0x290 block/bdev.c:764
       bdev_open+0x2c7/0xe40 block/bdev.c:973
       blkdev_open+0x34e/0x4f0 block/fops.c:697
       do_dentry_open+0x6d8/0x1660 fs/open.c:949
       vfs_open+0x82/0x3f0 fs/open.c:1081
       do_open fs/namei.c:4671 [inline]
       path_openat+0x208c/0x31a0 fs/namei.c:4830
       do_file_open+0x20e/0x430 fs/namei.c:4859
       do_sys_openat2+0x10d/0x1e0 fs/open.c:1366
       do_sys_open fs/open.c:1372 [inline]
       __do_sys_openat fs/open.c:1388 [inline]
       __se_sys_openat fs/open.c:1383 [inline]
       __x64_sys_openat+0x12d/0x210 fs/open.c:1383
       do_syscall_x64 arch/x86/entry/syscall_64.c:63 [inline]
       do_syscall_64+0x106/0xf80 arch/x86/entry/syscall_64.c:94
       entry_SYSCALL_64_after_hwframe+0x77/0x7f

-> #4 (&cmd->lock){+.+.}-{4:4}:
       __mutex_lock_common kernel/locking/mutex.c:614 [inline]
       __mutex_lock+0x1a2/0x1b90 kernel/locking/mutex.c:776
       nbd_queue_rq+0xba/0x1080 drivers/block/nbd.c:1199
       blk_mq_dispatch_rq_list+0x422/0x1e70 block/blk-mq.c:2148
       __blk_mq_do_dispatch_sched block/blk-mq-sched.c:168 [inline]
       blk_mq_do_dispatch_sched block/blk-mq-sched.c:182 [inline]
       __blk_mq_sched_dispatch_requests+0xcea/0x1620 block/blk-mq-sched.c:307
       blk_mq_sched_dispatch_requests+0xd7/0x1c0 block/blk-mq-sched.c:329
       blk_mq_run_hw_queue+0x23c/0x670 block/blk-mq.c:2386
       blk_mq_dispatch_list+0x51d/0x1360 block/blk-mq.c:2949
       blk_mq_flush_plug_list block/blk-mq.c:2997 [inline]
       blk_mq_flush_plug_list+0x130/0x600 block/blk-mq.c:2969
       __blk_flush_plug+0x2c4/0x4b0 block/blk-core.c:1230
       blk_finish_plug block/blk-core.c:1257 [inline]
       __submit_bio+0x584/0x6c0 block/blk-core.c:649
       __submit_bio_noacct_mq block/blk-core.c:722 [inline]
       submit_bio_noacct_nocheck+0x562/0xc10 block/blk-core.c:753
       submit_bio_noacct+0xd17/0x2010 block/blk-core.c:884
       blk_crypto_submit_bio include/linux/blk-crypto.h:203 [inline]
       submit_bh_wbc+0x59c/0x770 fs/buffer.c:2821
       submit_bh fs/buffer.c:2826 [inline]
       block_read_full_folio+0x264/0x8e0 fs/buffer.c:2444
       filemap_read_folio+0xfc/0x3b0 mm/filemap.c:2501
       do_read_cache_folio+0x2d7/0x6b0 mm/filemap.c:4101
       read_mapping_folio include/linux/pagemap.h:1028 [inline]
       read_part_sector+0xd1/0x370 block/partitions/core.c:723
       adfspart_check_ICS+0x93/0x910 block/partitions/acorn.c:360
       check_partition block/partitions/core.c:142 [inline]
       blk_add_partitions block/partitions/core.c:590 [inline]
       bdev_disk_changed+0x7f8/0xc80 block/partitions/core.c:694
       blkdev_get_whole+0x187/0x290 block/bdev.c:764
       bdev_open+0x2c7/0xe40 block/bdev.c:973
       blkdev_open+0x34e/0x4f0 block/fops.c:697
       do_dentry_open+0x6d8/0x1660 fs/open.c:949
       vfs_open+0x82/0x3f0 fs/open.c:1081
       do_open fs/namei.c:4671 [inline]
       path_openat+0x208c/0x31a0 fs/namei.c:4830
       do_file_open+0x20e/0x430 fs/namei.c:4859
       do_sys_openat2+0x10d/0x1e0 fs/open.c:1366
       do_sys_open fs/open.c:1372 [inline]
       __do_sys_openat fs/open.c:1388 [inline]
       __se_sys_openat fs/open.c:1383 [inline]
       __x64_sys_openat+0x12d/0x210 fs/open.c:1383
       do_syscall_x64 arch/x86/entry/syscall_64.c:63 [inline]
       do_syscall_64+0x106/0xf80 arch/x86/entry/syscall_64.c:94
       entry_SYSCALL_64_after_hwframe+0x77/0x7f

-> #3 (set->srcu){.+.+}-{0:0}:
       srcu_lock_sync include/linux/srcu.h:199 [inline]
       __synchronize_srcu+0xa1/0x2a0 kernel/rcu/srcutree.c:1505
       blk_mq_wait_quiesce_done block/blk-mq.c:284 [inline]
       blk_mq_wait_quiesce_done block/blk-mq.c:281 [inline]
       blk_mq_quiesce_queue block/blk-mq.c:304 [inline]
       blk_mq_quiesce_queue+0x149/0x1c0 block/blk-mq.c:299
       elevator_switch+0x17b/0x7e0 block/elevator.c:576
       elevator_change+0x352/0x530 block/elevator.c:681
       elevator_set_default+0x29e/0x360 block/elevator.c:754
       blk_register_queue+0x412/0x590 block/blk-sysfs.c:946
       __add_disk+0x73f/0xe40 block/genhd.c:528
       add_disk_fwnode+0x118/0x5c0 block/genhd.c:597
       add_disk include/linux/blkdev.h:785 [inline]
       nbd_dev_add+0x77a/0xb10 drivers/block/nbd.c:1984
       nbd_init+0x291/0x2b0 drivers/block/nbd.c:2692
       do_one_initcall+0x11d/0x760 init/main.c:1382
       do_initcall_level init/main.c:1444 [inline]
       do_initcalls init/main.c:1460 [inline]
       do_basic_setup init/main.c:1479 [inline]
       kernel_init_freeable+0x6e5/0x7a0 init/main.c:1692
       kernel_init+0x1f/0x1e0 init/main.c:1582
       ret_from_fork+0x754/0xd80 arch/x86/kernel/process.c:158
       ret_from_fork_asm+0x1a/0x30 arch/x86/entry/entry_64.S:245

-> #2 (&q->elevator_lock){+.+.}-{4:4}:
       __mutex_lock_common kernel/locking/mutex.c:614 [inline]
       __mutex_lock+0x1a2/0x1b90 kernel/locking/mutex.c:776
       elevator_change+0x1bc/0x530 block/elevator.c:679
       elevator_set_none+0x92/0xf0 block/elevator.c:769
       blk_mq_elv_switch_none block/blk-mq.c:5110 [inline]
       __blk_mq_update_nr_hw_queues block/blk-mq.c:5155 [inline]
       blk_mq_update_nr_hw_queues+0x4c1/0x15f0 block/blk-mq.c:5220
       nbd_start_device+0x1a6/0xbd0 drivers/block/nbd.c:1489
       nbd_genl_connect+0xff2/0x1a40 drivers/block/nbd.c:2239
       genl_family_rcv_msg_doit+0x214/0x300 net/netlink/genetlink.c:1114
       genl_family_rcv_msg net/netlink/genetlink.c:1194 [inline]
       genl_rcv_msg+0x560/0x800 net/netlink/genetlink.c:1209
       netlink_rcv_skb+0x159/0x420 net/netlink/af_netlink.c:2550
       genl_rcv+0x28/0x40 net/netlink/genetlink.c:1218
       netlink_unicast_kernel net/netlink/af_netlink.c:1318 [inline]
       netlink_unicast+0x5aa/0x870 net/netlink/af_netlink.c:1344
       netlink_sendmsg+0x8b0/0xda0 net/netlink/af_netlink.c:1894
       sock_sendmsg_nosec net/socket.c:727 [inline]
       __sock_sendmsg net/socket.c:742 [inline]
       ____sys_sendmsg+0x9e1/0xb70 net/socket.c:2592
       ___sys_sendmsg+0x190/0x1e0 net/socket.c:2646
       __sys_sendmsg+0x170/0x220 net/socket.c:2678
       do_syscall_x64 arch/x86/entry/syscall_64.c:63 [inline]
       do_syscall_64+0x106/0xf80 arch/x86/entry/syscall_64.c:94
       entry_SYSCALL_64_after_hwframe+0x77/0x7f

-> #1 (&q->q_usage_counter(io)#49){++++}-{0:0}:
       blk_alloc_queue+0x610/0x790 block/blk-core.c:461
       blk_mq_alloc_queue+0x174/0x290 block/blk-mq.c:4429
       __blk_mq_alloc_disk+0x29/0x120 block/blk-mq.c:4476
       nbd_dev_add+0x492/0xb10 drivers/block/nbd.c:1954
       nbd_init+0x291/0x2b0 drivers/block/nbd.c:2692
       do_one_initcall+0x11d/0x760 init/main.c:1382
       do_initcall_level init/main.c:1444 [inline]
       do_initcalls init/main.c:1460 [inline]
       do_basic_setup init/main.c:1479 [inline]
       kernel_init_freeable+0x6e5/0x7a0 init/main.c:1692
       kernel_init+0x1f/0x1e0 init/main.c:1582
       ret_from_fork+0x754/0xd80 arch/x86/kernel/process.c:158
       ret_from_fork_asm+0x1a/0x30 arch/x86/entry/entry_64.S:245

-> #0 (fs_reclaim){+.+.}-{0:0}:
       check_prev_add kernel/locking/lockdep.c:3165 [inline]
       check_prevs_add kernel/locking/lockdep.c:3284 [inline]
       validate_chain kernel/locking/lockdep.c:3908 [inline]
       __lock_acquire+0x14b8/0x2630 kernel/locking/lockdep.c:5237
       lock_acquire kernel/locking/lockdep.c:5868 [inline]
       lock_acquire+0x1cf/0x380 kernel/locking/lockdep.c:5825
       __fs_reclaim_acquire mm/page_alloc.c:4348 [inline]
       fs_reclaim_acquire+0xc4/0x100 mm/page_alloc.c:4362
       might_alloc include/linux/sched/mm.h:317 [inline]
       slab_pre_alloc_hook mm/slub.c:4489 [inline]
       slab_alloc_node mm/slub.c:4843 [inline]
       kmem_cache_alloc_node_noprof+0x53/0x6f0 mm/slub.c:4918
       __alloc_skb+0x140/0x710 net/core/skbuff.c:702
       alloc_skb include/linux/skbuff.h:1383 [inline]
       tcp_send_active_reset+0x8b/0xa60 net/ipv4/tcp_output.c:3862
       __tcp_close+0x41e/0x1110 net/ipv4/tcp.c:3223
       tcp_close+0x28/0x110 net/ipv4/tcp.c:3350
       inet_release+0xed/0x200 net/ipv4/af_inet.c:443
       inet6_release+0x4f/0x70 net/ipv6/af_inet6.c:479
       __sock_release+0xb3/0x260 net/socket.c:662
       sock_close+0x1c/0x30 net/socket.c:1455
       __fput+0x3ff/0xb40 fs/file_table.c:469
       task_work_run+0x150/0x240 kernel/task_work.c:233
       resume_user_mode_work include/linux/resume_user_mode.h:50 [inline]
       __exit_to_user_mode_loop kernel/entry/common.c:67 [inline]
       exit_to_user_mode_loop+0x100/0x4a0 kernel/entry/common.c:98
       __exit_to_user_mode_prepare include/linux/irq-entry-common.h:226 [inline]
       syscall_exit_to_user_mode_prepare include/linux/irq-entry-common.h:256 [inline]
       syscall_exit_to_user_mode include/linux/entry-common.h:325 [inline]
       do_syscall_64+0x67c/0xf80 arch/x86/entry/syscall_64.c:100
       entry_SYSCALL_64_after_hwframe+0x77/0x7f

other info that might help us debug this:

Chain exists of:
  fs_reclaim --> &nsock->tx_lock --> sk_lock-AF_INET6

 Possible unsafe locking scenario:

       CPU0                    CPU1
       ----                    ----
  lock(sk_lock-AF_INET6);
                               lock(&nsock->tx_lock);
                               lock(sk_lock-AF_INET6);
  lock(fs_reclaim);

 *** DEADLOCK ***

Fixes: fd8383f ("nbd: convert to blkmq")
Reported-by: [email protected]
Closes: https://lore.kernel.org/netdev/[email protected]/
Signed-off-by: Kuniyuki Iwashima <[email protected]>
blktests-ci Bot pushed a commit that referenced this pull request Mar 27, 2026
blktests-ci Bot pushed a commit that referenced this pull request Mar 28, 2026
blktests-ci Bot pushed a commit that referenced this pull request Mar 29, 2026
blktests-ci Bot pushed a commit that referenced this pull request Mar 30, 2026
As reported by syzbot [0], NBD can trigger a deadlock during
memory reclaim.

This occurs when a process holds lock_sock() on a backend TCP
socket and triggers a memory allocation that enters fs reclaim.
If that reclaim eventually calls back into NBD to send data or
shut down the socket, NBD will attempt to acquire the same
lock_sock(), resulting in a deadlock.

While NBD sets sk->sk_allocation to GFP_NOIO before calling
sendmsg(), this does not prevent the issue in some paths where
GFP_KERNEL is used directly under lock_sock().

To resolve this, let's use lock_sock_try() for TCP sendmsg() and
shutdown().

For sock_sendmsg(), if lock_sock_try() fails, -ERESTARTSYS is
returned, allowing the request to be retried later (e.g., via
was_interrupted() logic).

For the sock_sendmsg() call used for NBD_CMD_DISC and for
kernel_sock_shutdown(), the operation might be skipped if the
lock cannot be acquired.
However, this is not expected to occur in practice because the
backend TCP socket should not be touched by userspace once it is
handed over to NBD.

Note that sock_recvmsg() does not require this special handling
because it is only called from the workqueue context.

Also note that AF_UNIX sockets continue to use sock_sendmsg()
and kernel_sock_shutdown() because unix_stream_sendmsg() and
unix_shutdown() do not acquire lock_sock().

[0]:
WARNING: possible circular locking dependency detected
syzkaller #0 Tainted: G             L
syz.7.2282/12353 is trying to acquire lock:
ffffffff8e9aa700 (fs_reclaim){+.+.}-{0:0}, at: might_alloc include/linux/sched/mm.h:317 [inline]
ffffffff8e9aa700 (fs_reclaim){+.+.}-{0:0}, at: slab_pre_alloc_hook mm/slub.c:4489 [inline]
ffffffff8e9aa700 (fs_reclaim){+.+.}-{0:0}, at: slab_alloc_node mm/slub.c:4843 [inline]
ffffffff8e9aa700 (fs_reclaim){+.+.}-{0:0}, at: kmem_cache_alloc_node_noprof+0x53/0x6f0 mm/slub.c:4918

but task is already holding lock:
ffff88806f972a20 (sk_lock-AF_INET6){+.+.}-{0:0}, at: lock_sock include/net/sock.h:1709 [inline]
ffff88806f972a20 (sk_lock-AF_INET6){+.+.}-{0:0}, at: tcp_close+0x1d/0x110 net/ipv4/tcp.c:3349

which lock already depends on the new lock.

the existing dependency chain (in reverse order) is:

-> #6 (sk_lock-AF_INET6){+.+.}-{0:0}:
       lock_sock_nested+0x41/0xf0 net/core/sock.c:3780
       lock_sock include/net/sock.h:1709 [inline]
       inet_shutdown+0x67/0x410 net/ipv4/af_inet.c:919
       nbd_mark_nsock_dead+0xae/0x5c0 drivers/block/nbd.c:318
       sock_shutdown+0x16b/0x200 drivers/block/nbd.c:411
       nbd_clear_sock drivers/block/nbd.c:1427 [inline]
       nbd_config_put+0x1eb/0x750 drivers/block/nbd.c:1451
       nbd_genl_connect+0xaf8/0x1a40 drivers/block/nbd.c:2248
       genl_family_rcv_msg_doit+0x214/0x300 net/netlink/genetlink.c:1114
       genl_family_rcv_msg net/netlink/genetlink.c:1194 [inline]
       genl_rcv_msg+0x560/0x800 net/netlink/genetlink.c:1209
       netlink_rcv_skb+0x159/0x420 net/netlink/af_netlink.c:2550
       genl_rcv+0x28/0x40 net/netlink/genetlink.c:1218
       netlink_unicast_kernel net/netlink/af_netlink.c:1318 [inline]
       netlink_unicast+0x5aa/0x870 net/netlink/af_netlink.c:1344
       netlink_sendmsg+0x8b0/0xda0 net/netlink/af_netlink.c:1894
       sock_sendmsg_nosec net/socket.c:727 [inline]
       __sock_sendmsg net/socket.c:742 [inline]
       ____sys_sendmsg+0x9e1/0xb70 net/socket.c:2592
       ___sys_sendmsg+0x190/0x1e0 net/socket.c:2646
       __sys_sendmsg+0x170/0x220 net/socket.c:2678
       do_syscall_x64 arch/x86/entry/syscall_64.c:63 [inline]
       do_syscall_64+0x106/0xf80 arch/x86/entry/syscall_64.c:94
       entry_SYSCALL_64_after_hwframe+0x77/0x7f

-> #5 (&nsock->tx_lock){+.+.}-{4:4}:
       __mutex_lock_common kernel/locking/mutex.c:614 [inline]
       __mutex_lock+0x1a2/0x1b90 kernel/locking/mutex.c:776
       nbd_handle_cmd drivers/block/nbd.c:1143 [inline]
       nbd_queue_rq+0x428/0x1080 drivers/block/nbd.c:1207
       blk_mq_dispatch_rq_list+0x422/0x1e70 block/blk-mq.c:2148
       __blk_mq_do_dispatch_sched block/blk-mq-sched.c:168 [inline]
       blk_mq_do_dispatch_sched block/blk-mq-sched.c:182 [inline]
       __blk_mq_sched_dispatch_requests+0xcea/0x1620 block/blk-mq-sched.c:307
       blk_mq_sched_dispatch_requests+0xd7/0x1c0 block/blk-mq-sched.c:329
       blk_mq_run_hw_queue+0x23c/0x670 block/blk-mq.c:2386
       blk_mq_dispatch_list+0x51d/0x1360 block/blk-mq.c:2949
       blk_mq_flush_plug_list block/blk-mq.c:2997 [inline]
       blk_mq_flush_plug_list+0x130/0x600 block/blk-mq.c:2969
       __blk_flush_plug+0x2c4/0x4b0 block/blk-core.c:1230
       blk_finish_plug block/blk-core.c:1257 [inline]
       __submit_bio+0x584/0x6c0 block/blk-core.c:649
       __submit_bio_noacct_mq block/blk-core.c:722 [inline]
       submit_bio_noacct_nocheck+0x562/0xc10 block/blk-core.c:753
       submit_bio_noacct+0xd17/0x2010 block/blk-core.c:884
       blk_crypto_submit_bio include/linux/blk-crypto.h:203 [inline]
       submit_bh_wbc+0x59c/0x770 fs/buffer.c:2821
       submit_bh fs/buffer.c:2826 [inline]
       block_read_full_folio+0x264/0x8e0 fs/buffer.c:2444
       filemap_read_folio+0xfc/0x3b0 mm/filemap.c:2501
       do_read_cache_folio+0x2d7/0x6b0 mm/filemap.c:4101
       read_mapping_folio include/linux/pagemap.h:1028 [inline]
       read_part_sector+0xd1/0x370 block/partitions/core.c:723
       adfspart_check_ICS+0x93/0x910 block/partitions/acorn.c:360
       check_partition block/partitions/core.c:142 [inline]
       blk_add_partitions block/partitions/core.c:590 [inline]
       bdev_disk_changed+0x7f8/0xc80 block/partitions/core.c:694
       blkdev_get_whole+0x187/0x290 block/bdev.c:764
       bdev_open+0x2c7/0xe40 block/bdev.c:973
       blkdev_open+0x34e/0x4f0 block/fops.c:697
       do_dentry_open+0x6d8/0x1660 fs/open.c:949
       vfs_open+0x82/0x3f0 fs/open.c:1081
       do_open fs/namei.c:4671 [inline]
       path_openat+0x208c/0x31a0 fs/namei.c:4830
       do_file_open+0x20e/0x430 fs/namei.c:4859
       do_sys_openat2+0x10d/0x1e0 fs/open.c:1366
       do_sys_open fs/open.c:1372 [inline]
       __do_sys_openat fs/open.c:1388 [inline]
       __se_sys_openat fs/open.c:1383 [inline]
       __x64_sys_openat+0x12d/0x210 fs/open.c:1383
       do_syscall_x64 arch/x86/entry/syscall_64.c:63 [inline]
       do_syscall_64+0x106/0xf80 arch/x86/entry/syscall_64.c:94
       entry_SYSCALL_64_after_hwframe+0x77/0x7f

-> #4 (&cmd->lock){+.+.}-{4:4}:
       __mutex_lock_common kernel/locking/mutex.c:614 [inline]
       __mutex_lock+0x1a2/0x1b90 kernel/locking/mutex.c:776
       nbd_queue_rq+0xba/0x1080 drivers/block/nbd.c:1199
       blk_mq_dispatch_rq_list+0x422/0x1e70 block/blk-mq.c:2148
       __blk_mq_do_dispatch_sched block/blk-mq-sched.c:168 [inline]
       blk_mq_do_dispatch_sched block/blk-mq-sched.c:182 [inline]
       __blk_mq_sched_dispatch_requests+0xcea/0x1620 block/blk-mq-sched.c:307
       blk_mq_sched_dispatch_requests+0xd7/0x1c0 block/blk-mq-sched.c:329
       blk_mq_run_hw_queue+0x23c/0x670 block/blk-mq.c:2386
       blk_mq_dispatch_list+0x51d/0x1360 block/blk-mq.c:2949
       blk_mq_flush_plug_list block/blk-mq.c:2997 [inline]
       blk_mq_flush_plug_list+0x130/0x600 block/blk-mq.c:2969
       __blk_flush_plug+0x2c4/0x4b0 block/blk-core.c:1230
       blk_finish_plug block/blk-core.c:1257 [inline]
       __submit_bio+0x584/0x6c0 block/blk-core.c:649
       __submit_bio_noacct_mq block/blk-core.c:722 [inline]
       submit_bio_noacct_nocheck+0x562/0xc10 block/blk-core.c:753
       submit_bio_noacct+0xd17/0x2010 block/blk-core.c:884
       blk_crypto_submit_bio include/linux/blk-crypto.h:203 [inline]
       submit_bh_wbc+0x59c/0x770 fs/buffer.c:2821
       submit_bh fs/buffer.c:2826 [inline]
       block_read_full_folio+0x264/0x8e0 fs/buffer.c:2444
       filemap_read_folio+0xfc/0x3b0 mm/filemap.c:2501
       do_read_cache_folio+0x2d7/0x6b0 mm/filemap.c:4101
       read_mapping_folio include/linux/pagemap.h:1028 [inline]
       read_part_sector+0xd1/0x370 block/partitions/core.c:723
       adfspart_check_ICS+0x93/0x910 block/partitions/acorn.c:360
       check_partition block/partitions/core.c:142 [inline]
       blk_add_partitions block/partitions/core.c:590 [inline]
       bdev_disk_changed+0x7f8/0xc80 block/partitions/core.c:694
       blkdev_get_whole+0x187/0x290 block/bdev.c:764
       bdev_open+0x2c7/0xe40 block/bdev.c:973
       blkdev_open+0x34e/0x4f0 block/fops.c:697
       do_dentry_open+0x6d8/0x1660 fs/open.c:949
       vfs_open+0x82/0x3f0 fs/open.c:1081
       do_open fs/namei.c:4671 [inline]
       path_openat+0x208c/0x31a0 fs/namei.c:4830
       do_file_open+0x20e/0x430 fs/namei.c:4859
       do_sys_openat2+0x10d/0x1e0 fs/open.c:1366
       do_sys_open fs/open.c:1372 [inline]
       __do_sys_openat fs/open.c:1388 [inline]
       __se_sys_openat fs/open.c:1383 [inline]
       __x64_sys_openat+0x12d/0x210 fs/open.c:1383
       do_syscall_x64 arch/x86/entry/syscall_64.c:63 [inline]
       do_syscall_64+0x106/0xf80 arch/x86/entry/syscall_64.c:94
       entry_SYSCALL_64_after_hwframe+0x77/0x7f

-> #3 (set->srcu){.+.+}-{0:0}:
       srcu_lock_sync include/linux/srcu.h:199 [inline]
       __synchronize_srcu+0xa1/0x2a0 kernel/rcu/srcutree.c:1505
       blk_mq_wait_quiesce_done block/blk-mq.c:284 [inline]
       blk_mq_wait_quiesce_done block/blk-mq.c:281 [inline]
       blk_mq_quiesce_queue block/blk-mq.c:304 [inline]
       blk_mq_quiesce_queue+0x149/0x1c0 block/blk-mq.c:299
       elevator_switch+0x17b/0x7e0 block/elevator.c:576
       elevator_change+0x352/0x530 block/elevator.c:681
       elevator_set_default+0x29e/0x360 block/elevator.c:754
       blk_register_queue+0x412/0x590 block/blk-sysfs.c:946
       __add_disk+0x73f/0xe40 block/genhd.c:528
       add_disk_fwnode+0x118/0x5c0 block/genhd.c:597
       add_disk include/linux/blkdev.h:785 [inline]
       nbd_dev_add+0x77a/0xb10 drivers/block/nbd.c:1984
       nbd_init+0x291/0x2b0 drivers/block/nbd.c:2692
       do_one_initcall+0x11d/0x760 init/main.c:1382
       do_initcall_level init/main.c:1444 [inline]
       do_initcalls init/main.c:1460 [inline]
       do_basic_setup init/main.c:1479 [inline]
       kernel_init_freeable+0x6e5/0x7a0 init/main.c:1692
       kernel_init+0x1f/0x1e0 init/main.c:1582
       ret_from_fork+0x754/0xd80 arch/x86/kernel/process.c:158
       ret_from_fork_asm+0x1a/0x30 arch/x86/entry/entry_64.S:245

-> #2 (&q->elevator_lock){+.+.}-{4:4}:
       __mutex_lock_common kernel/locking/mutex.c:614 [inline]
       __mutex_lock+0x1a2/0x1b90 kernel/locking/mutex.c:776
       elevator_change+0x1bc/0x530 block/elevator.c:679
       elevator_set_none+0x92/0xf0 block/elevator.c:769
       blk_mq_elv_switch_none block/blk-mq.c:5110 [inline]
       __blk_mq_update_nr_hw_queues block/blk-mq.c:5155 [inline]
       blk_mq_update_nr_hw_queues+0x4c1/0x15f0 block/blk-mq.c:5220
       nbd_start_device+0x1a6/0xbd0 drivers/block/nbd.c:1489
       nbd_genl_connect+0xff2/0x1a40 drivers/block/nbd.c:2239
       genl_family_rcv_msg_doit+0x214/0x300 net/netlink/genetlink.c:1114
       genl_family_rcv_msg net/netlink/genetlink.c:1194 [inline]
       genl_rcv_msg+0x560/0x800 net/netlink/genetlink.c:1209
       netlink_rcv_skb+0x159/0x420 net/netlink/af_netlink.c:2550
       genl_rcv+0x28/0x40 net/netlink/genetlink.c:1218
       netlink_unicast_kernel net/netlink/af_netlink.c:1318 [inline]
       netlink_unicast+0x5aa/0x870 net/netlink/af_netlink.c:1344
       netlink_sendmsg+0x8b0/0xda0 net/netlink/af_netlink.c:1894
       sock_sendmsg_nosec net/socket.c:727 [inline]
       __sock_sendmsg net/socket.c:742 [inline]
       ____sys_sendmsg+0x9e1/0xb70 net/socket.c:2592
       ___sys_sendmsg+0x190/0x1e0 net/socket.c:2646
       __sys_sendmsg+0x170/0x220 net/socket.c:2678
       do_syscall_x64 arch/x86/entry/syscall_64.c:63 [inline]
       do_syscall_64+0x106/0xf80 arch/x86/entry/syscall_64.c:94
       entry_SYSCALL_64_after_hwframe+0x77/0x7f

-> #1 (&q->q_usage_counter(io)#49){++++}-{0:0}:
       blk_alloc_queue+0x610/0x790 block/blk-core.c:461
       blk_mq_alloc_queue+0x174/0x290 block/blk-mq.c:4429
       __blk_mq_alloc_disk+0x29/0x120 block/blk-mq.c:4476
       nbd_dev_add+0x492/0xb10 drivers/block/nbd.c:1954
       nbd_init+0x291/0x2b0 drivers/block/nbd.c:2692
       do_one_initcall+0x11d/0x760 init/main.c:1382
       do_initcall_level init/main.c:1444 [inline]
       do_initcalls init/main.c:1460 [inline]
       do_basic_setup init/main.c:1479 [inline]
       kernel_init_freeable+0x6e5/0x7a0 init/main.c:1692
       kernel_init+0x1f/0x1e0 init/main.c:1582
       ret_from_fork+0x754/0xd80 arch/x86/kernel/process.c:158
       ret_from_fork_asm+0x1a/0x30 arch/x86/entry/entry_64.S:245

-> #0 (fs_reclaim){+.+.}-{0:0}:
       check_prev_add kernel/locking/lockdep.c:3165 [inline]
       check_prevs_add kernel/locking/lockdep.c:3284 [inline]
       validate_chain kernel/locking/lockdep.c:3908 [inline]
       __lock_acquire+0x14b8/0x2630 kernel/locking/lockdep.c:5237
       lock_acquire kernel/locking/lockdep.c:5868 [inline]
       lock_acquire+0x1cf/0x380 kernel/locking/lockdep.c:5825
       __fs_reclaim_acquire mm/page_alloc.c:4348 [inline]
       fs_reclaim_acquire+0xc4/0x100 mm/page_alloc.c:4362
       might_alloc include/linux/sched/mm.h:317 [inline]
       slab_pre_alloc_hook mm/slub.c:4489 [inline]
       slab_alloc_node mm/slub.c:4843 [inline]
       kmem_cache_alloc_node_noprof+0x53/0x6f0 mm/slub.c:4918
       __alloc_skb+0x140/0x710 net/core/skbuff.c:702
       alloc_skb include/linux/skbuff.h:1383 [inline]
       tcp_send_active_reset+0x8b/0xa60 net/ipv4/tcp_output.c:3862
       __tcp_close+0x41e/0x1110 net/ipv4/tcp.c:3223
       tcp_close+0x28/0x110 net/ipv4/tcp.c:3350
       inet_release+0xed/0x200 net/ipv4/af_inet.c:443
       inet6_release+0x4f/0x70 net/ipv6/af_inet6.c:479
       __sock_release+0xb3/0x260 net/socket.c:662
       sock_close+0x1c/0x30 net/socket.c:1455
       __fput+0x3ff/0xb40 fs/file_table.c:469
       task_work_run+0x150/0x240 kernel/task_work.c:233
       resume_user_mode_work include/linux/resume_user_mode.h:50 [inline]
       __exit_to_user_mode_loop kernel/entry/common.c:67 [inline]
       exit_to_user_mode_loop+0x100/0x4a0 kernel/entry/common.c:98
       __exit_to_user_mode_prepare include/linux/irq-entry-common.h:226 [inline]
       syscall_exit_to_user_mode_prepare include/linux/irq-entry-common.h:256 [inline]
       syscall_exit_to_user_mode include/linux/entry-common.h:325 [inline]
       do_syscall_64+0x67c/0xf80 arch/x86/entry/syscall_64.c:100
       entry_SYSCALL_64_after_hwframe+0x77/0x7f

other info that might help us debug this:

Chain exists of:
  fs_reclaim --> &nsock->tx_lock --> sk_lock-AF_INET6

 Possible unsafe locking scenario:

       CPU0                    CPU1
       ----                    ----
  lock(sk_lock-AF_INET6);
                               lock(&nsock->tx_lock);
                               lock(sk_lock-AF_INET6);
  lock(fs_reclaim);

 *** DEADLOCK ***

Fixes: fd8383f ("nbd: convert to blkmq")
Reported-by: [email protected]
Closes: https://lore.kernel.org/netdev/[email protected]/
Signed-off-by: Kuniyuki Iwashima <[email protected]>
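
The lockdep chain quoted above shows fs_reclaim being entered while the nbd command and socket locks are already held, and reclaim can recurse back into block I/O on the same device. As a generic illustration only (not the fix carried by this patch), one common kernel pattern for breaking that kind of inversion is to run allocation-prone teardown inside a memalloc_noio scope, so nested allocations implicitly lose __GFP_IO and cannot re-enter the block layer. The struct and locking layout below are hypothetical placeholders; only memalloc_noio_save()/memalloc_noio_restore() and kernel_sock_shutdown() are real kernel APIs.

```c
#include <linux/mutex.h>
#include <linux/net.h>
#include <linux/sched/mm.h>

/* Hypothetical device structure, standing in for the per-command state above. */
struct my_dev {
	struct mutex cmd_lock;	/* analogous to &cmd->lock in the splat */
	struct socket *sock;
};

static void my_dev_close_sock(struct my_dev *dev)
{
	unsigned int noio_flags;

	mutex_lock(&dev->cmd_lock);
	/* Any allocation in this scope is implicitly GFP_NOIO, so reclaim
	 * triggered here cannot issue new block I/O against the device. */
	noio_flags = memalloc_noio_save();
	kernel_sock_shutdown(dev->sock, SHUT_RDWR);	/* may allocate, e.g. a reset skb */
	memalloc_noio_restore(noio_flags);
	mutex_unlock(&dev->cmd_lock);
}
```

This is a sketch of the general mitigation pattern for allocating while block-layer locks are held; the actual patch in this series may resolve the dependency differently (for example by adjusting the socket's allocation flags).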