loop: add regression test for partscan double-scan race by daandemeyer · Pull Request #240 · linux-blktests/blktests

daandemeyer · 2026-03-31T10:45:02Z

Add a stress test that detects spurious partition removal events when setting up a loop device with partscan enabled.

The kernel bug was that disk_force_media_change() set GD_NEED_PART_SCAN, causing udev's device open to trigger a partition scan racing with the explicit scan from loop_reread_partitions(). The second scan would drop and re-add all partitions, making partition devices briefly disappear.

The test monitors kernel uevents while repeatedly setting up and tearing down a loop device with partscan. Each cycle should produce exactly one add and one remove uevent for the partition device. Extra events indicate the double-scan race was triggered.
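The counting step described above can be sketched as follows. This is only an illustration, not the code of the test in this PR: the helper name `count_uevents` and the sample log are made up, and the log lines assume the usual `udevadm monitor --kernel` layout of source tag, action, device path and subsystem.

```shell
# Count kernel uevents of one action for one partition device.
# $1: action ("add" or "remove"), $2: partition name, e.g. "loop0p1".
# Reads `udevadm monitor --kernel` style lines from stdin.
# (Hypothetical helper, named here for illustration only.)
count_uevents() {
	local action=$1 part=$2
	awk -v action="$action" -v part="$part" '
		$1 ~ /^KERNEL\[/ && $2 == action && $3 ~ ("/" part "$") { n++ }
		END { print n + 0 }
	'
}

# Made-up sample log covering two setup/teardown cycles. The second cycle
# contains the extra remove/add pair that a double scan would produce.
sample_log='KERNEL[100.000000] change   /devices/virtual/block/loop0 (block)
KERNEL[100.000001] add      /devices/virtual/block/loop0/loop0p1 (block)
KERNEL[100.000002] remove   /devices/virtual/block/loop0/loop0p1 (block)
KERNEL[100.000003] add      /devices/virtual/block/loop0/loop0p1 (block)
KERNEL[100.000004] remove   /devices/virtual/block/loop0/loop0p1 (block)
KERNEL[100.000005] add      /devices/virtual/block/loop0/loop0p1 (block)
KERNEL[100.000006] remove   /devices/virtual/block/loop0/loop0p1 (block)'

adds=$(printf '%s\n' "$sample_log" | count_uevents add loop0p1)
removes=$(printf '%s\n' "$sample_log" | count_uevents remove loop0p1)
echo "cycles=2 add=$adds remove=$removes"
```

With two cycles in the sample, any count above two for either action points at a spurious drop and re-add of the partition.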

Link: https://lore.kernel.org/linux-block/[email protected]/T/#u

When LOOP_CONFIGURE is called with LO_FLAGS_PARTSCAN, the following sequence occurs:

1. disk_force_media_change() sets GD_NEED_PART_SCAN
2. Uevent suppression is lifted and a KOBJ_CHANGE uevent is sent
3. loop_global_unlock() releases the lock
4. loop_reread_partitions() calls bdev_disk_changed() to scan

There is a race between steps 2 and 4: when udev receives the uevent and opens the device before loop_reread_partitions() runs, blkdev_get_whole() in bdev.c sees GD_NEED_PART_SCAN set and calls bdev_disk_changed() for a first scan. Then loop_reread_partitions() does a second scan. The open_mutex serializes these two scans, but does not prevent both from running.

The second scan in bdev_disk_changed() drops all partition devices from the first scan (via blk_drop_partitions()) before re-adding them, causing partition block devices to briefly disappear. This breaks any systemd unit with BindsTo= on the partition device: systemd observes the device going dead, fails the dependent units, and does not retry them when the device reappears.

Fix this by removing the GD_NEED_PART_SCAN set from disk_force_media_change() entirely. None of the current callers need the lazy on-open partition scan triggered by this flag:

- floppy: sets GENHD_FL_NO_PART, so disk_has_partscan() is always false and GD_NEED_PART_SCAN has no effect.
- loop (loop_configure, loop_change_fd): when LO_FLAGS_PARTSCAN is set, loop_reread_partitions() performs an explicit scan. When not set, GD_SUPPRESS_PART_SCAN prevents the lazy scan path.
- loop (__loop_clr_fd): calls bdev_disk_changed() explicitly if LO_FLAGS_PARTSCAN is set.
- nbd (nbd_clear_sock_ioctl): capacity is set to zero immediately after; nbd manages GD_NEED_PART_SCAN explicitly elsewhere.

With GD_NEED_PART_SCAN no longer set by disk_force_media_change(), udev opening the loop device after the uevent no longer triggers a redundant scan in blkdev_get_whole(), and only the single explicit scan from loop_reread_partitions() runs.

A regression test for this bug has been submitted to blktests: linux-blktests/blktests#240.

Fixes: 9f65c48 ("loop: raise media_change event")
Signed-off-by: Daan De Meyer <[email protected]>
Acked-by: Christian Brauner <[email protected]>
Link: https://patch.msgid.link/[email protected]
Signed-off-by: Jens Axboe <[email protected]>
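To make the ordering problem in the commit message easier to follow, here is a toy model of it. It is not kernel code: the shell variables and functions are stand-ins that only mimic the control flow described above, and it assumes the scan itself clears the pending-scan flag.

```shell
# Toy model of the double-scan race. The names below are illustrative
# stand-ins for the kernel state and functions named in the commit message.

need_part_scan=0   # stands in for the GD_NEED_PART_SCAN bit
partitions=0       # 1 while the partition devices exist
removals=0         # how often existing partitions were dropped

# Models bdev_disk_changed(): drop whatever exists, then re-add.
scan_partitions() {
	need_part_scan=0   # assumption: the scan clears the pending flag
	if [ "$partitions" -eq 1 ]; then
		removals=$((removals + 1))   # models blk_drop_partitions()
	fi
	partitions=1
}

# Models udev opening the disk: the open path scans only if the flag is set.
udev_open() {
	if [ "$need_part_scan" -eq 1 ]; then
		scan_partitions
	fi
}

# Models LOOP_CONFIGURE with LO_FLAGS_PARTSCAN, with udev winning the race.
# $1: 1 if disk_force_media_change() sets the flag (old behavior), 0 if not.
loop_configure() {
	partitions=0
	removals=0
	need_part_scan=$1   # step 1
	udev_open           # udev reacts to the uevent from step 2 first
	scan_partitions     # step 4: the explicit rescan
}

loop_configure 1
removals_before_fix=$removals
loop_configure 0
removals_after_fix=$removals
echo "spurious removals: before fix=$removals_before_fix, after fix=$removals_after_fix"
```

In the model, the only difference between the two runs is whether the flag is set in step 1, which mirrors the one-line nature of the kernel fix.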

kawasaki · 2026-04-08T04:47:44Z

Hello @daandemeyer

Today, I ran the added test case in my test environment and found that it fails even with the fix patch applied to the kernel.

$ sudo ./check loop/012
loop/012 (check for spurious partition removal when partscan is enabled) [failed]
    runtime  5.682s  ...  6.243s
    --- tests/loop/012.out      2026-04-08 10:11:48.266000000 +0900
    +++ /home/shin/Blktests/blktests/results/nodev/loop/012.out.bad     2026-04-08 13:45:01.585000000 +0900
    @@ -1,2 +1,4 @@
     Running loop/012
    +Fail: 111 iterations but 149 add events (expected 111)
    +Fail: 111 iterations but 149 remove events (expected 111)
     Test complete
[shin@testnode2 blktests ]$

I observed the failure with both v2 and v3 kernel patches. I used v7.0-rc7 as the baseline kernel. Is this failure expected?
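For readers of the log above, the comparison behind the two "Fail:" lines can be sketched like this. The function name `check_counts` is hypothetical and the real test may be structured differently; only the message format and the numbers 111 and 149 come from the output above.

```shell
# Compare the number of setup/teardown iterations with the observed uevents.
# $1: iterations, $2: add events, $3: remove events.
# Prints one "Fail:" line per mismatch and returns non-zero on any mismatch.
# (Hypothetical helper; the message format follows the log above.)
check_counts() {
	local iterations=$1 adds=$2 removes=$3 status=0
	if [ "$adds" -ne "$iterations" ]; then
		echo "Fail: $iterations iterations but $adds add events (expected $iterations)"
		status=1
	fi
	if [ "$removes" -ne "$iterations" ]; then
		echo "Fail: $iterations iterations but $removes remove events (expected $iterations)"
		status=1
	fi
	return "$status"
}

# Numbers taken from the failing run quoted above.
report=$(check_counts 111 149 149) || true
printf '%s\n' "$report"
extra=$((149 - 111))
echo "extra add/remove pairs: $extra"
```

Read this way, the failing run saw 38 more add/remove pairs than iterations.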

kawasaki · 2026-04-08T04:57:35Z

One more nit. Some commands in the new test case use short options, such as "truncate -s" and "losetup -f". For readability and stability, long options are recommended, like "truncate --size=" and "losetup --find". I'm fine with the current short options, but if you get a chance to respin, I suggest using the long options. I attach a diff file to show what the new test case would look like with the long options.

review.txt
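As a rough illustration of the suggestion (not the contents of review.txt), the long-option style could look like the sketch below. The image size, the helper names and the `--show`/`--partscan` options are assumptions for the example; the loop helpers need root and are only defined here, not called.

```shell
# Create a sparse backing file using long options.
img=$(mktemp)
truncate --size=16M "$img"          # long form of: truncate -s 16M
size=$(stat --format=%s "$img")     # long form of: stat -c %s
echo "backing file size: $size bytes"

# Attach the file to the first free loop device with partition scanning.
# Requires root, so it is defined but not invoked in this sketch.
setup_loop() {
	losetup --find --show --partscan "$1"   # long form of: losetup -f --show -P
}

# Detach a loop device again. Also requires root.
teardown_loop() {
	losetup --detach "$1"                   # long form of: losetup -d
}

rm --force "$img"
```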

daandemeyer · 2026-04-13T07:17:05Z

@kawasaki I ran it a bunch of times and it doesn't fail on my test setup :/

kawasaki · 2026-04-13T08:20:57Z

@daandemeyer Thank you for trying it out. FYI, I attach the kernel config of the kernel that I used.
_config_v7.0-rc7_pr240_failure.gz

I used Fedora 43 in a QEMU VM as the userland of my test environment. Which distro do you use? Do you run your tests in a VM or on a bare-metal machine?

daandemeyer · 2026-04-13T08:23:19Z

My test environment is https://github.com/DaanDeMeyer/mkosi-kernel, an Arch Linux image running in qemu (booted with systemd).

(I verified the test failed without my changes as well)

kawasaki · 2026-04-14T10:00:09Z

Thanks. I started trying mkosi and mkosi-kernel, but I am still failing to build the Arch image. I will try to allocate more time for it.

kawasaki · 2026-04-20T00:47:42Z

I managed to run mkosi with mkosi-kernel to generate Fedora and Arch images. (I fell into a pitfall: the Fedora 43 kernel failed to boot with the Fedora rawhide image generated by mkosi. When I specified the kernel source to build, the failure disappeared.)

With the Fedora and Arch images, I confirmed that the new test case works as expected. The test case fails without the kernel side fix, and the test case passes with the kernel side fix.

Then, the next step is to investigate why the test case fails on my Fedora QEMU environment even with the kernel side fix. Now I'm chasing this.

kawasaki · 2026-04-20T05:06:21Z

I set "Distribution=fedora" in mkosi.local.conf and tried both "Release=rawhide" and "Release=43". The test case passes with "Release=rawhide" but fails with "Release=43" (same kernel v7.0 + kernel fix by @daandemeyer). I guess the systemd version difference between the Fedora versions might be the cause of the behavior difference.

Fedora rawhide: systemd 260.1
Fedora 43: systemd 258.7

If this hypothesis is correct, the new test case has a systemd version dependency. I would like to confirm it, but I do not know how to replace systemd. @daandemeyer If you have any good ideas for confirming it, please share them. Can mkosi replace systemd version?

kawasaki · 2026-04-22T12:47:51Z

It was stupid of me to ask "Can mkosi replace systemd version?". mkosi is a side project that originated from systemd and builds OS images to help systemd development, so of course it can. And @daandemeyer has already documented how to do it in the README of mkosi-kernel. It was a bit tough for a newbie, but I have managed to control both the systemd and kernel versions when building a Fedora rawhide image.

With that environment, I confirmed that the new test case depends on both kernel and systemd:

- The kernel fix is required to make the test case pass.
- Even with the kernel fix, systemd version 259 or later is required to make the test case pass.

I bisected further and found that the systemd commit aa47d8ade18c is required to make the test case pass. The test case fails with the systemd v258 tag and with any revision before that commit.

Based on this observation, I suggest checking the systemd version in the requires() function and skipping the test case if it is older than v259. I attach a patch file, patch.txt, for reference. @daandemeyer Please let me know your thoughts.
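A version gate of the kind suggested above might be sketched as follows. This is not the contents of patch.txt: the helper names are hypothetical, and the sketch assumes that the first line of `systemctl --version` starts with "systemd" followed by the major version number.

```shell
# Extract the major version from a `systemctl --version` banner.
# $1: banner text whose first line looks like "systemd 258 (258.7)".
# (Hypothetical helper, for illustration only.)
systemd_major_version() {
	printf '%s\n' "$1" | awk '
		NR == 1 && $1 == "systemd" { sub(/[^0-9].*$/, "", $2); print $2 }
	'
}

# Succeeds if the banner reports at least the given major version.
# $1: banner text, $2: minimum major version.
systemd_at_least() {
	local major
	major=$(systemd_major_version "$1")
	[ -n "$major" ] && [ "$major" -ge "$2" ]
}

# Sample banners built from the two versions mentioned in this thread.
for banner in "systemd 258 (258.7)" "systemd 260 (260.1)"; do
	if systemd_at_least "$banner" 259; then
		echo "$banner: run the test"
	else
		echo "$banner: skip the test"
	fi
done
```

On a live system the banner would come from the command itself, for example `systemd_at_least "$(systemctl --version)" 259`.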

daandemeyer · 2026-04-22T13:06:18Z

@kawasaki Thank you for the investigation! The attached patch looks good to me. I will apply it and update the PR.

Add a stress test that detects spurious partition removal events when setting up a loop device with partscan enabled.

The kernel bug was that disk_force_media_change() set GD_NEED_PART_SCAN, causing udev's device open to trigger a partition scan racing with the explicit scan from loop_reread_partitions(). The second scan would drop and re-add all partitions, making partition devices briefly disappear.

The test monitors kernel uevents while repeatedly setting up and tearing down a loop device with partscan. Each cycle should produce exactly one add and one remove uevent for the partition device. Extra events indicate the double-scan race was triggered.

Link: https://lore.kernel.org/linux-block/[email protected]/T/#u
Signed-off-by: Daan De Meyer <[email protected]>
Co-Authored-By: Shinichiro Kawasaki <[email protected]>

daandemeyer force-pushed the loop-race branch from b56c91b to 23fbe5a · April 22, 2026 13:07

daandemeyer wants to merge 1 commit into linux-blktests:master from daandemeyer:loop-race
