From 42081e3a96b7adbcfde85125ba6f67c2dcb72870 Mon Sep 17 00:00:00 2001
From: Chao Shi
Date: Sun, 26 Apr 2026 20:34:56 -0400
Subject: [PATCH 1/2] nvme: downgrade WARN in nvme_setup_rw to pr_debug

When an NVMe namespace is configured with embedded metadata (flbas bit 4
set, NVME_NS_FLBAS_META_EXT) but no Protection Information (dps=0) and
no NVME_NS_METADATA_SUPPORTED, nvme_setup_rw() fires WARN_ON_ONCE on any
request that reaches it with REQ_INTEGRITY unset.

The WARN was observed repeatedly during NVMe fuzz testing with a
FEMU-based fuzzer that performs semantic mutation of Identify Namespace
responses. The trigger requires three conditions to align:

(a) a namespace transitions through the EXT_LBAS non-PI state
    (head->ms != 0, features & NVME_NS_EXT_LBAS,
    !(features & NVME_NS_METADATA_SUPPORTED));

(b) nvme_init_integrity() returns false through the early-exit branch at
    core.c:1834 without populating bi->metadata_size, leaving the disk
    without an integrity profile (blk_get_integrity() returns NULL); and

(c) a request that was admitted to the block layer before the namespace
    update reaches nvme_setup_rw() after it.

The admission gap arises in two places. First, the plug-list flush path:
a process with dirty pages queued in a plug before the namespace update
flushes them on file close (blk_finish_plug -> blk_mq_dispatch ->
nvme_setup_rw), bypassing any capacity-zero gate. Second, the cached-rq
path: blk_mq_submit_bio() at blk-mq.c:3155 may find a cached request; if
so, the bio_queue_enter() freeze-serialization guard at
blk-mq.c:3174-3176 is skipped and the bio is dispatched immediately.

In both cases the bio was submitted without REQ_INTEGRITY (because
blk_get_integrity() returned NULL at dispatch time, so
bio_integrity_action() returned 0 and bio_integrity_prep() was not
called), and it reaches nvme_setup_rw() for a namespace where
head->ms != 0. The existing BLK_STS_NOTSUPP return handles this
dispatch correctly; the WARN_ON_ONCE is a false positive.
The WARN was reproduced six times over four days of fuzzing (April
2026). A representative crash shows the plug-flush path:

  nvme0n1: detected capacity change from 2097152 to 0
  WARNING: drivers/nvme/host/core.c:1042 at nvme_setup_rw+0x768/0xfd0
  PID: 785 (systemd-udevd)
  Call Trace:
   nvme_setup_cmd
   nvme_queue_rq
   blk_mq_dispatch_rq_list
   blk_mq_flush_plug_list
   blk_finish_plug
   blkdev_writepages
   sync_blockdev
   bdev_release
   __fput
   sys_close

Replace WARN_ON_ONCE with pr_debug_ratelimited so the condition is
logged at debug level without a splat. The BLK_STS_NOTSUPP return is
preserved; I/O to the transitioning namespace is still rejected.

An alternative approach that addresses the root cause at the
integrity-profile level is proposed in patch 2/2: populate
bi->metadata_size for EXT_LBAS non-PI namespaces in
nvme_init_integrity() so that bio_integrity_action() returns non-zero,
bio_integrity_prep() sets REQ_INTEGRITY, and nvme_setup_rw() never
reaches this branch. Both patches are sent as RFC for maintainer
guidance on the preferred direction.

Tested: Compiled on linux-kcov-debug (6.19.0+, KASAN/DEBUG_LIST).
Boot-tested under FEMU with NVME_MALICIOUS_RESPONDER=1
NVME_SEMANTIC_DATA_MUTATOR=1; ran 4 concurrent dd processes plus 500
rescan_controller cycles. No WARN, BUG, or Oops observed.

Found by FuzzNvme (Syzkaller with the FEMU fuzzing framework).

Acked-by: Sungwoo Kim
Acked-by: Dave Tian
Acked-by: Weidong Zhu
Signed-off-by: Chao Shi
---
 drivers/nvme/host/core.c | 6 +++++-
 1 file changed, 5 insertions(+), 1 deletion(-)

diff --git a/drivers/nvme/host/core.c b/drivers/nvme/host/core.c
index 1e33af94c24b9..6947342178295 100644
--- a/drivers/nvme/host/core.c
+++ b/drivers/nvme/host/core.c
@@ -1039,8 +1039,12 @@ static inline blk_status_t nvme_setup_rw(struct nvme_ns *ns,
 	 * namespace capacity to zero to prevent any I/O.
 	 */
 	if (!blk_integrity_rq(req)) {
-		if (WARN_ON_ONCE(!nvme_ns_has_pi(ns->head)))
+		if (!nvme_ns_has_pi(ns->head)) {
+			pr_debug_ratelimited("nvme: %s: metadata (ms=%u) without PI or integrity request, returning NOTSUPP\n",
+					     ns->disk->disk_name,
+					     ns->head->ms);
 			return BLK_STS_NOTSUPP;
+		}
 		control |= NVME_RW_PRINFO_PRACT;
 		nvme_set_ref_tag(ns, cmnd, req);
 	}

From 1f3fcb5ad75ec3b3f02898ea6a7d806fe45f5133 Mon Sep 17 00:00:00 2001
From: Chao Shi
Date: Sun, 26 Apr 2026 20:34:57 -0400
Subject: [PATCH 2/2] nvme: set integrity metadata size for EXT_LBAS non-PI
 namespace

This patch is an alternative to patch 1/2: instead of downgrading the
assertion in nvme_setup_rw(), it addresses the root cause at the
integrity-profile level so that the assertion is never reached.

For PCIe namespaces with extended LBAs (NVME_NS_EXT_LBAS set, flbas bit
4) but without PI and without NVME_NS_METADATA_SUPPORTED, the early-exit
branch of nvme_init_integrity() at core.c:1834 returns false without
populating bi->metadata_size. As a result blk_get_integrity() returns
NULL (it checks q->limits.integrity.metadata_size via
blk_integrity_queue_supports_integrity()), bio_integrity_action()
returns 0, bio_integrity_prep() is never called, and REQ_INTEGRITY is
never set on bios dispatched to the namespace. Any such bio that reaches
nvme_setup_rw() triggers WARN_ON_ONCE because head->ms != 0 but
blk_integrity_rq() returns false.

Populate bi->metadata_size = head->ms in the early-exit path for the
EXT_LBAS non-PI case. This is sufficient to make blk_get_integrity()
return non-NULL, which causes bio_integrity_action() to return non-zero,
which causes bio_integrity_prep() to run and set REQ_INTEGRITY on any
bio submitted to the namespace. Requests that reach nvme_setup_rw() then
satisfy blk_integrity_rq() and the assertion is not reached.
blk_validate_integrity_limits() accepts this configuration: with
csum_type=BLK_INTEGRITY_CSUM_NONE, pi_tuple_size=0, and pi_offset=0, all
checks pass (pi_offset + pi_tuple_size <= metadata_size; pi_tuple_size
must be 0 for CSUM_NONE), and interval_exp is auto-filled to
ilog2(logical_block_size). No generate/verify callbacks are configured,
so no actual integrity computation occurs; only the blk_integrity_rq()
predicate is satisfied. Capacity is still forced to 0 by
set_capacity_and_notify(), so new bios are rejected by bio_check_eod()
before queue entry.

Tested: Compiled on linux-kcov-debug (6.19.0+, KASAN/DEBUG_LIST).
Boot-tested under FEMU with NVME_SEMANTIC_DATA_MUTATOR=1; ran 4
concurrent dd processes plus 500 rescan_controller cycles with no WARN,
BUG, or Oops. The EXT_LBAS + ms != 0 + !PI combination was not triggered
during testing (FEMU's mutator varies flbas and lbaf[0].ms
independently; flbas=0x10 with lbaf_idx=0 was not produced in this run),
so the bi->metadata_size assignment path was not exercised; correctness
of blk_validate_integrity_limits() for this configuration was verified
by code inspection.

Provided as RFC.

Found by FuzzNvme (Syzkaller with the FEMU fuzzing framework).

Acked-by: Sungwoo Kim
Acked-by: Dave Tian
Acked-by: Weidong Zhu
Signed-off-by: Chao Shi
---
 drivers/nvme/host/core.c | 25 +++++++++++++++++++++++--
 1 file changed, 23 insertions(+), 2 deletions(-)

diff --git a/drivers/nvme/host/core.c b/drivers/nvme/host/core.c
index 6947342178295..176b6268639d7 100644
--- a/drivers/nvme/host/core.c
+++ b/drivers/nvme/host/core.c
@@ -1836,8 +1836,29 @@ static bool nvme_init_integrity(struct nvme_ns_head *head,
 	 * insert/strip it, which is not possible for other kinds of metadata.
 	 */
 	if (!IS_ENABLED(CONFIG_BLK_DEV_INTEGRITY) ||
-	    !(head->features & NVME_NS_METADATA_SUPPORTED))
-		return nvme_ns_has_pi(head);
+	    !(head->features & NVME_NS_METADATA_SUPPORTED)) {
+		bool has_pi = nvme_ns_has_pi(head);
+
+		/*
+		 * For PCIe EXT_LBAS non-PI namespaces the block layer sets
+		 * capacity to 0 (we return false) to prevent block I/O, but
+		 * a cached-rq bio may bypass bio_queue_enter freeze
+		 * serialisation and reach nvme_setup_rw() with
+		 * head->ms != 0 and no REQ_INTEGRITY set. Populate
+		 * bi->metadata_size so that bio_integrity_action() returns
+		 * non-zero and bio_integrity_prep() sets REQ_INTEGRITY on
+		 * any such bio, preventing the WARN_ON_ONCE at
+		 * nvme_setup_rw() (addressed by patch 1/2).
+		 *
+		 * NOTE: only metadata_size is populated; no csum or PI
+		 * profile is configured. Actual data integrity for EXT_LBAS
+		 * non-PI workloads is untested; this patch is RFC for
+		 * direction discussion.
+		 */
+		if (IS_ENABLED(CONFIG_BLK_DEV_INTEGRITY) &&
+		    (head->features & NVME_NS_EXT_LBAS) && head->ms && !has_pi)
+			bi->metadata_size = head->ms;
+		return has_pi;
+	}
 
 	switch (head->pi_type) {
 	case NVME_NS_DPS_PI_TYPE3: