block tests: nvme metadata passthrough#4
blktests-ci[bot] wants to merge 8 commits into for-next_base from
Conversation
* io_uring-6.16:
  MAINTAINERS: remove myself from io_uring
  io_uring/net: only consider msg_inq if larger than 1
  io_uring/zcrx: fix area release on registration failure
  io_uring/zcrx: init id for xa_find
* block-6.16:
  selftests: ublk: cover PER_IO_DAEMON in more stress tests
  Documentation: ublk: document UBLK_F_PER_IO_DAEMON
  selftests: ublk: add stress test for per io daemons
  selftests: ublk: add functional test for per io daemons
  selftests: ublk: kublk: decouple ublk_queues from ublk server threads
  selftests: ublk: kublk: move per-thread data out of ublk_queue
  selftests: ublk: kublk: lift queue initialization out of thread
  selftests: ublk: kublk: tie sqe allocation to io instead of queue
  selftests: ublk: kublk: plumb q_id in io_uring user_data
  ublk: have a per-io daemon instead of a per-queue daemon
  md/md-bitmap: remove parameter slot from bitmap_create()
  md/md-bitmap: cleanup bitmap_ops->startwrite()
  md/dm-raid: remove max_write_behind setting limit
  md/md-bitmap: fix dm-raid max_write_behind setting
  md/raid1,raid10: don't handle IO error for REQ_RAHEAD and REQ_NOWAIT
  loop: add file_start_write() and file_end_write()
  bcache: reserve more RESERVE_BTREE buckets to prevent allocator hang
  bcache: remove unused constants
  bcache: fix NULL pointer in cache_set_flush()
* io_uring-6.16:
  io_uring/kbuf: limit legacy provided buffer lists to USHRT_MAX
* block-6.16:
  block: drop direction param from bio_integrity_copy_user()
* block-6.16:
  selftests: ublk: kublk: improve behavior on init failure
  block: flip iter directions in blk_rq_integrity_map_user()
* io_uring-6.16:
  io_uring/futex: mark wait requests as inflight
  io_uring/futex: get rid of struct io_futex addr union
* block-6.16:
  nvme: spelling fixes
  nvme-tcp: fix I/O stalls on congested sockets
  nvme-tcp: sanitize request list handling
  nvme-tcp: remove tag set when second admin queue config fails
  nvme: enable vectored registered bufs for passthrough cmds
  nvme: fix implicit bool to flags conversion
  nvme: fix command limits status code
Upstream branch: 38f4878
Pull request is NOT updated. Failed to apply https://patchwork.kernel.org/project/linux-block/list/?series=969899
conflict:
… context
The current use of a mutex to protect the notifier hashtable accesses can lead to issues in atomic context. It results in the below kernel warnings:
  BUG: sleeping function called from invalid context at kernel/locking/mutex.c:258
  in_atomic(): 1, irqs_disabled(): 1, non_block: 0, pid: 9, name: kworker/0:0
  preempt_count: 1, expected: 0
  RCU nest depth: 0, expected: 0
  CPU: 0 UID: 0 PID: 9 Comm: kworker/0:0 Not tainted 6.14.0 #4
  Workqueue: ffa_pcpu_irq_notification notif_pcpu_irq_work_fn
  Call trace:
   show_stack+0x18/0x24 (C)
   dump_stack_lvl+0x78/0x90
   dump_stack+0x18/0x24
   __might_resched+0x114/0x170
   __might_sleep+0x48/0x98
   mutex_lock+0x24/0x80
   handle_notif_callbacks+0x54/0xe0
   notif_get_and_handle+0x40/0x88
   generic_exec_single+0x80/0xc0
   smp_call_function_single+0xfc/0x1a0
   notif_pcpu_irq_work_fn+0x2c/0x38
   process_one_work+0x14c/0x2b4
   worker_thread+0x2e4/0x3e0
   kthread+0x13c/0x210
   ret_from_fork+0x10/0x20
To address this, replace the mutex with an rwlock to protect the notifier hashtable accesses. This ensures that read-side locking does not sleep and that multiple readers can acquire the lock concurrently, avoiding unnecessary contention and potential deadlocks. Writer access remains exclusive, preserving correctness. This change resolves lockdep warnings about a potential sleep in atomic context.
Cc: Jens Wiklander <[email protected]>
Reported-by: Jérôme Forissier <[email protected]>
Closes: OP-TEE/optee_os#7394
Fixes: e057344 ("firmware: arm_ffa: Add interfaces to request notification callbacks")
Message-Id: <[email protected]>
Reviewed-by: Jens Wiklander <[email protected]>
Tested-by: Jens Wiklander <[email protected]>
Signed-off-by: Sudeep Holla <[email protected]>
Before the commit under the Fixes tag below, bnxt_ulp_stop() and bnxt_ulp_start() were always invoked in pairs. After that commit, the new bnxt_ulp_restart() can be invoked after bnxt_ulp_stop() has been called. This may result in the RoCE driver's aux driver .suspend() method being invoked twice. The 2nd bnxt_re_suspend() call will crash when it dereferences a NULL pointer: (NULL ib_device): Handle device suspend call BUG: kernel NULL pointer dereference, address: 0000000000000b78 PGD 0 P4D 0 Oops: Oops: 0000 [#1] SMP PTI CPU: 20 UID: 0 PID: 181 Comm: kworker/u96:5 Tainted: G S 6.15.0-rc1 #4 PREEMPT(voluntary) Tainted: [S]=CPU_OUT_OF_SPEC Hardware name: Dell Inc. PowerEdge R730/072T6D, BIOS 2.4.3 01/17/2017 Workqueue: bnxt_pf_wq bnxt_sp_task [bnxt_en] RIP: 0010:bnxt_re_suspend+0x45/0x1f0 [bnxt_re] Code: 8b 05 a7 3c 5b f5 48 89 44 24 18 31 c0 49 8b 5c 24 08 4d 8b 2c 24 e8 ea 06 0a f4 48 c7 c6 04 60 52 c0 48 89 df e8 1b ce f9 ff <48> 8b 83 78 0b 00 00 48 8b 80 38 03 00 00 a8 40 0f 85 b5 00 00 00 RSP: 0018:ffffa2e84084fd88 EFLAGS: 00010246 RAX: 0000000000000000 RBX: 0000000000000000 RCX: 0000000000000001 RDX: 0000000000000000 RSI: ffffffffb4b6b934 RDI: 00000000ffffffff RBP: ffffa1760954c9c0 R08: 0000000000000000 R09: c0000000ffffdfff R10: 0000000000000001 R11: ffffa2e84084fb50 R12: ffffa176031ef070 R13: ffffa17609775000 R14: ffffa17603adc180 R15: 0000000000000000 FS: 0000000000000000(0000) GS:ffffa17daa397000(0000) knlGS:0000000000000000 CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 CR2: 0000000000000b78 CR3: 00000004aaa30003 CR4: 00000000003706f0 DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000 DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400 Call Trace: <TASK> bnxt_ulp_stop+0x69/0x90 [bnxt_en] bnxt_sp_task+0x678/0x920 [bnxt_en] ? __schedule+0x514/0xf50 process_scheduled_works+0x9d/0x400 worker_thread+0x11c/0x260 ? __pfx_worker_thread+0x10/0x10 kthread+0xfe/0x1e0 ? __pfx_kthread+0x10/0x10 ret_from_fork+0x2b/0x40 ? 
__pfx_kthread+0x10/0x10 ret_from_fork_asm+0x1a/0x30 Check the BNXT_EN_FLAG_ULP_STOPPED flag and do not proceed if the flag is already set. This will preserve the original symmetrical bnxt_ulp_stop() and bnxt_ulp_start(). Also, inside bnxt_ulp_start(), clear the BNXT_EN_FLAG_ULP_STOPPED flag after taking the mutex to avoid any race condition. And for symmetry, only proceed in bnxt_ulp_start() if the BNXT_EN_FLAG_ULP_STOPPED is set. Fixes: 3c163f3 ("bnxt_en: Optimize recovery path ULP locking in the driver") Signed-off-by: Kalesh AP <[email protected]> Co-developed-by: Michael Chan <[email protected]> Signed-off-by: Michael Chan <[email protected]> Reviewed-by: Simon Horman <[email protected]> Link: https://patch.msgid.link/[email protected] Signed-off-by: Jakub Kicinski <[email protected]>
At least one diff in series https://patchwork.kernel.org/project/linux-block/list/?series=969091 irrelevant now for [{'archived': False, 'project': 241}] search patterns
…ux/kernel/git/kvmarm/kvmarm into HEAD KVM/arm64 fixes for 6.16, take #4 - Gracefully fail initialising pKVM if the interrupt controller isn't GICv3 - Also gracefully fail initialising pKVM if the carveout allocation fails - Fix the computing of the minimum MMIO range required for the host on stage-2 fault - Fix the generation of the GICv3 Maintenance Interrupt in nested mode
perf script tests fail with a segmentation fault as below:
92: perf script tests:
--- start ---
test child forked, pid 103769
DB test
[ perf record: Woken up 1 times to write data ]
[ perf record: Captured and wrote 0.012 MB /tmp/perf-test-script.7rbftEpOzX/perf.data (9 samples) ]
/usr/libexec/perf-core/tests/shell/script.sh: line 35:
103780 Segmentation fault (core dumped)
perf script -i "${perfdatafile}" -s "${db_test}"
--- Cleaning up ---
---- end(-1) ----
92: perf script tests : FAILED!
Backtrace pointed to :
#0 0x0000000010247dd0 in maps.machine ()
#1 0x00000000101d178c in db_export.sample ()
#2 0x00000000103412c8 in python_process_event ()
#3 0x000000001004eb28 in process_sample_event ()
#4 0x000000001024fcd0 in machines.deliver_event ()
#5 0x000000001025005c in perf_session.deliver_event ()
#6 0x00000000102568b0 in __ordered_events__flush.part.0 ()
#7 0x0000000010251618 in perf_session.process_events ()
#8 0x0000000010053620 in cmd_script ()
#9 0x00000000100b5a28 in run_builtin ()
#10 0x00000000100b5f94 in handle_internal_command ()
#11 0x0000000010011114 in main ()
Further investigation reveals that this occurs in the `perf script tests`
because they use the `db_test.py` script, which sets `perf_db_export_mode = True`.
With `perf_db_export_mode` enabled, if a sample originates from a hypervisor,
perf does not set up maps for the "[H]" sample. Consequently, `al->maps` is NULL
when `maps__machine(al->maps)` is called from `db_export__sample`.
Since al->maps can be NULL for hypervisor samples, use thread->maps instead,
because even for a hypervisor sample the machine should exist.
If there is no machine for some reason, return -1 to avoid the segmentation fault.
Reported-by: Disha Goel <[email protected]>
Signed-off-by: Aditya Bodkhe <[email protected]>
Reviewed-by: Adrian Hunter <[email protected]>
Tested-by: Disha Goel <[email protected]>
Link: https://lore.kernel.org/r/[email protected]
Suggested-by: Adrian Hunter <[email protected]>
Signed-off-by: Namhyung Kim <[email protected]>
Without the change, `perf` hangs up on character devices. On my system
it's enough to run the system-wide sampler for a few seconds to get the
hangup:
$ perf record -a -g --call-graph=dwarf
$ perf report
# hung
`strace` shows that the hangup happens on a read from the character device
`/dev/dri/renderD128`:
$ strace -y -f -p 2780484
strace: Process 2780484 attached
pread64(101</dev/dri/renderD128>, strace: Process 2780484 detached
Its call trace descends into `elfutils`:
$ gdb -p 2780484
(gdb) bt
#0 0x00007f5e508f04b7 in __libc_pread64 (fd=101, buf=0x7fff9df7edb0, count=0, offset=0)
at ../sysdeps/unix/sysv/linux/pread64.c:25
#1 0x00007f5e52b79515 in read_file () from /<<NIX>>/elfutils-0.192/lib/libelf.so.1
#2 0x00007f5e52b25666 in libdw_open_elf () from /<<NIX>>/elfutils-0.192/lib/libdw.so.1
#3 0x00007f5e52b25907 in __libdw_open_file () from /<<NIX>>/elfutils-0.192/lib/libdw.so.1
#4 0x00007f5e52b120a9 in dwfl_report_elf@@ELFUTILS_0.156 ()
from /<<NIX>>/elfutils-0.192/lib/libdw.so.1
#5 0x000000000068bf20 in __report_module (al=al@entry=0x7fff9df80010, ip=ip@entry=139803237033216, ui=ui@entry=0x5369b5e0)
at util/dso.h:537
#6 0x000000000068c3d1 in report_module (ip=139803237033216, ui=0x5369b5e0) at util/unwind-libdw.c:114
#7 frame_callback (state=0x535aef10, arg=0x5369b5e0) at util/unwind-libdw.c:242
#8 0x00007f5e52b261d3 in dwfl_thread_getframes () from /<<NIX>>/elfutils-0.192/lib/libdw.so.1
#9 0x00007f5e52b25bdb in get_one_thread_cb () from /<<NIX>>/elfutils-0.192/lib/libdw.so.1
#10 0x00007f5e52b25faa in dwfl_getthreads () from /<<NIX>>/elfutils-0.192/lib/libdw.so.1
#11 0x00007f5e52b26514 in dwfl_getthread_frames () from /<<NIX>>/elfutils-0.192/lib/libdw.so.1
#12 0x000000000068c6ce in unwind__get_entries (cb=cb@entry=0x5d4620 <unwind_entry>, arg=arg@entry=0x10cd5fa0,
thread=thread@entry=0x1076a290, data=data@entry=0x7fff9df80540, max_stack=max_stack@entry=127,
best_effort=best_effort@entry=false) at util/thread.h:152
#13 0x00000000005dae95 in thread__resolve_callchain_unwind (evsel=0x106006d0, thread=0x1076a290, cursor=0x10cd5fa0,
sample=0x7fff9df80540, max_stack=127, symbols=true) at util/machine.c:2939
#14 thread__resolve_callchain_unwind (thread=0x1076a290, cursor=0x10cd5fa0, evsel=0x106006d0, sample=0x7fff9df80540,
max_stack=127, symbols=true) at util/machine.c:2920
#15 __thread__resolve_callchain (thread=0x1076a290, cursor=0x10cd5fa0, evsel=0x106006d0, evsel@entry=0x7fff9df80440,
sample=0x7fff9df80540, parent=parent@entry=0x7fff9df804a0, root_al=root_al@entry=0x7fff9df80440, max_stack=127, symbols=true)
at util/machine.c:2970
#16 0x00000000005d0cb2 in thread__resolve_callchain (thread=<optimized out>, cursor=<optimized out>, evsel=0x7fff9df80440,
sample=<optimized out>, parent=0x7fff9df804a0, root_al=0x7fff9df80440, max_stack=127) at util/machine.h:198
#17 sample__resolve_callchain (sample=<optimized out>, cursor=<optimized out>, parent=parent@entry=0x7fff9df804a0,
evsel=evsel@entry=0x106006d0, al=al@entry=0x7fff9df80440, max_stack=max_stack@entry=127) at util/callchain.c:1127
#18 0x0000000000617e08 in hist_entry_iter__add (iter=iter@entry=0x7fff9df80480, al=al@entry=0x7fff9df80440, max_stack_depth=127,
arg=arg@entry=0x7fff9df81ae0) at util/hist.c:1255
#19 0x000000000045d2d0 in process_sample_event (tool=0x7fff9df81ae0, event=<optimized out>, sample=0x7fff9df80540,
evsel=0x106006d0, machine=<optimized out>) at builtin-report.c:334
#20 0x00000000005e3bb1 in perf_session__deliver_event (session=0x105ff2c0, event=0x7f5c7d735ca0, tool=0x7fff9df81ae0,
file_offset=2914716832, file_path=0x105ffbf0 "perf.data") at util/session.c:1367
#21 0x00000000005e8d93 in do_flush (oe=0x105ffa50, show_progress=false) at util/ordered-events.c:245
#22 __ordered_events__flush (oe=0x105ffa50, how=OE_FLUSH__ROUND, timestamp=<optimized out>) at util/ordered-events.c:324
#23 0x00000000005e1f64 in perf_session__process_user_event (session=0x105ff2c0, event=0x7f5c7d752b18, file_offset=2914835224,
file_path=0x105ffbf0 "perf.data") at util/session.c:1419
#24 0x00000000005e47c7 in reader__read_event (rd=rd@entry=0x7fff9df81260, session=session@entry=0x105ff2c0,
prog=prog@entry=0x7fff9df81220) at util/session.c:2132
#25 0x00000000005e4b37 in reader__process_events (rd=0x7fff9df81260, session=0x105ff2c0, prog=0x7fff9df81220)
at util/session.c:2181
#26 __perf_session__process_events (session=0x105ff2c0) at util/session.c:2226
#27 perf_session__process_events (session=session@entry=0x105ff2c0) at util/session.c:2390
#28 0x0000000000460add in __cmd_report (rep=0x7fff9df81ae0) at builtin-report.c:1076
#29 cmd_report (argc=<optimized out>, argv=<optimized out>) at builtin-report.c:1827
#30 0x00000000004c5a40 in run_builtin (p=p@entry=0xd8f7f8 <commands+312>, argc=argc@entry=1, argv=argv@entry=0x7fff9df844b0)
at perf.c:351
#31 0x00000000004c5d63 in handle_internal_command (argc=argc@entry=1, argv=argv@entry=0x7fff9df844b0) at perf.c:404
#32 0x0000000000442de3 in run_argv (argcp=<synthetic pointer>, argv=<synthetic pointer>) at perf.c:448
#33 main (argc=<optimized out>, argv=0x7fff9df844b0) at perf.c:556
The hangup happens because nothing in `perf` or `elfutils` checks whether a
mapped file is easily readable.
The change conservatively skips all non-regular files.
Signed-off-by: Sergei Trofimovich <[email protected]>
Acked-by: Namhyung Kim <[email protected]>
Link: https://lore.kernel.org/r/[email protected]
Signed-off-by: Namhyung Kim <[email protected]>
With PREEMPT_RT as potential configuration option, spinlock_t is now considered as a sleeping lock, and thus might cause issues when used in an atomic context. But even with PREEMPT_RT as potential configuration option, raw_spinlock_t remains as a true spinning lock/atomic context. This creates potential issues with the s390 debug/tracing feature. The functions to trace errors are called in various contexts, including under lock of raw_spinlock_t, and thus the used spinlock_t in each debug area is in violation of the locking semantics. Here are two examples involving failing PCI Read accesses that are traced while holding `pci_lock` in `drivers/pci/access.c`: ============================= [ BUG: Invalid wait context ] 6.19.0-devel #18 Not tainted ----------------------------- bash/3833 is trying to lock: 0000027790baee30 (&rc->lock){-.-.}-{3:3}, at: debug_event_common+0xfc/0x300 other info that might help us debug this: context-{5:5} 5 locks held by bash/3833: #0: 0000027efbb29450 (sb_writers#3){.+.+}-{0:0}, at: ksys_write+0x7c/0xf0 #1: 00000277f0504a90 (&of->mutex#2){+.+.}-{4:4}, at: kernfs_fop_write_iter+0x13e/0x260 #2: 00000277beed8c18 (kn->active#339){.+.+}-{0:0}, at: kernfs_fop_write_iter+0x164/0x260 #3: 00000277e9859190 (&dev->mutex){....}-{4:4}, at: pci_dev_lock+0x2e/0x40 #4: 00000383068a7708 (pci_lock){....}-{2:2}, at: pci_bus_read_config_dword+0x4a/0xb0 stack backtrace: CPU: 6 UID: 0 PID: 3833 Comm: bash Kdump: loaded Not tainted 6.19.0-devel #18 PREEMPTLAZY Hardware name: IBM 9175 ME1 701 (LPAR) Call Trace: [<00000383048afec2>] dump_stack_lvl+0xa2/0xe8 [<00000383049ba166>] __lock_acquire+0x816/0x1660 [<00000383049bb1fa>] lock_acquire+0x24a/0x370 [<00000383059e3860>] _raw_spin_lock_irqsave+0x70/0xc0 [<00000383048bbb6c>] debug_event_common+0xfc/0x300 [<0000038304900b0a>] __zpci_load+0x17a/0x1f0 [<00000383048fad88>] pci_read+0x88/0xd0 [<00000383054cbce0>] pci_bus_read_config_dword+0x70/0xb0 [<00000383054d55e4>] pci_dev_wait+0x174/0x290 [<00000383054d5a3e>] 
__pci_reset_function_locked+0xfe/0x170 [<00000383054d9b30>] pci_reset_function+0xd0/0x100 [<00000383054ee21a>] reset_store+0x5a/0x80 [<0000038304e98758>] kernfs_fop_write_iter+0x1e8/0x260 [<0000038304d995da>] new_sync_write+0x13a/0x180 [<0000038304d9c5d0>] vfs_write+0x200/0x330 [<0000038304d9c88c>] ksys_write+0x7c/0xf0 [<00000383059cfa80>] __do_syscall+0x210/0x500 [<00000383059e4c06>] system_call+0x6e/0x90 INFO: lockdep is turned off. ============================= [ BUG: Invalid wait context ] 6.19.0-devel #3 Not tainted ----------------------------- bash/6861 is trying to lock: 0000009da05c7430 (&rc->lock){-.-.}-{3:3}, at: debug_event_common+0xfc/0x300 other info that might help us debug this: context-{5:5} 5 locks held by bash/6861: #0: 000000acff404450 (sb_writers#3){.+.+}-{0:0}, at: ksys_write+0x7c/0xf0 #1: 000000acff41c490 (&of->mutex#2){+.+.}-{4:4}, at: kernfs_fop_write_iter+0x13e/0x260 #2: 0000009da36937d8 (kn->active#75){.+.+}-{0:0}, at: kernfs_fop_write_iter+0x164/0x260 #3: 0000009dd15250d0 (&zdev->state_lock){+.+.}-{4:4}, at: enable_slot+0x2e/0xc0 #4: 000001a19682f708 (pci_lock){....}-{2:2}, at: pci_bus_read_config_byte+0x42/0xa0 stack backtrace: CPU: 16 UID: 0 PID: 6861 Comm: bash Kdump: loaded Not tainted 6.19.0-devel #3 PREEMPTLAZY Hardware name: IBM 9175 ME1 701 (LPAR) Call Trace: [<000001a194837ec2>] dump_stack_lvl+0xa2/0xe8 [<000001a194942166>] __lock_acquire+0x816/0x1660 [<000001a1949431fa>] lock_acquire+0x24a/0x370 [<000001a19596b810>] _raw_spin_lock_irqsave+0x70/0xc0 [<000001a194843b6c>] debug_event_common+0xfc/0x300 [<000001a194888b0a>] __zpci_load+0x17a/0x1f0 [<000001a194882d88>] pci_read+0x88/0xd0 [<000001a195453b88>] pci_bus_read_config_byte+0x68/0xa0 [<000001a195457bc2>] pci_setup_device+0x62/0xad0 [<000001a195458e70>] pci_scan_single_device+0x90/0xe0 [<000001a19488a0f6>] zpci_bus_scan_device+0x46/0x80 [<000001a19547f958>] enable_slot+0x98/0xc0 [<000001a19547f134>] power_write_file+0xc4/0x110 [<000001a194e20758>] 
kernfs_fop_write_iter+0x1e8/0x260 [<000001a194d215da>] new_sync_write+0x13a/0x180 [<000001a194d245d0>] vfs_write+0x200/0x330 [<000001a194d2488c>] ksys_write+0x7c/0xf0 [<000001a195957a30>] __do_syscall+0x210/0x500 [<000001a19596cbb6>] system_call+0x6e/0x90 INFO: lockdep is turned off. Since it is desired to keep it possible to create trace records in most situations, including this particular case (failing PCI config space accesses are relevant), convert the spinlock_t used in `struct debug_info` to raw_spinlock_t. The impact is small, as the debug area lock only protects bounded memory accesses without external dependencies, apart from one function, debug_set_size(), where kfree() is implicitly called with the lock held. Move debug_info_free() out of this lock to remove this external dependency.
Acked-by: Heiko Carstens <[email protected]>
Signed-off-by: Benjamin Block <[email protected]>
Signed-off-by: Heiko Carstens <[email protected]>
test_progs run with ASAN reported [1]:
==126==ERROR: LeakSanitizer: detected memory leaks
Direct leak of 32 byte(s) in 1 object(s) allocated from:
#0 0x7f1ff3cfa340 in calloc ../../../../src/libsanitizer/asan/asan_malloc_linux.cpp:77
#1 0x5610c15bb520 in bpf_program_attach_fd /codebuild/output/src685977285/src/actions-runner/_work/vmtest/vmtest/src/tools/lib/bpf/libbpf.c:13164
#2 0x5610c15bb740 in bpf_program__attach_xdp /codebuild/output/src685977285/src/actions-runner/_work/vmtest/vmtest/src/tools/lib/bpf/libbpf.c:13204
#3 0x5610c14f91d3 in test_xdp_flowtable /codebuild/output/src685977285/src/actions-runner/_work/vmtest/vmtest/src/tools/testing/selftests/bpf/prog_tests/xdp_flowtable.c:138
#4 0x5610c1533566 in run_one_test /codebuild/output/src685977285/src/actions-runner/_work/vmtest/vmtest/src/tools/testing/selftests/bpf/test_progs.c:1406
#5 0x5610c1537fb0 in main /codebuild/output/src685977285/src/actions-runner/_work/vmtest/vmtest/src/tools/testing/selftests/bpf/test_progs.c:2097
#6 0x7f1ff25df1c9 (/lib/x86_64-linux-gnu/libc.so.6+0x2a1c9) (BuildId: 8e9fd827446c24067541ac5390e6f527fb5947bb)
#7 0x7f1ff25df28a in __libc_start_main (/lib/x86_64-linux-gnu/libc.so.6+0x2a28a) (BuildId: 8e9fd827446c24067541ac5390e6f527fb5947bb)
#8 0x5610c0bd3180 in _start (/tmp/work/vmtest/vmtest/selftests/bpf/test_progs+0x593180) (BuildId: cdf9f103f42307dc0a2cd6cfc8afcbc1366cf8bd)
Fix by properly destroying bpf_link on exit in xdp_flowtable test.
[1] https://github.com/kernel-patches/vmtest/actions/runs/22361085418/job/64716490680
Signed-off-by: Ihor Solodrai <[email protected]>
Reviewed-by: Subbaraya Sundeep <[email protected]>
Link: https://lore.kernel.org/r/[email protected]
Signed-off-by: Alexei Starovoitov <[email protected]>
The current cpuset partition code is able to dynamically update the sched domains of a running system and the corresponding HK_TYPE_DOMAIN housekeeping cpumask to perform what is essentially the "isolcpus=domain,..." boot command line feature at run time. The housekeeping cpumask update requires flushing a number of different workqueues which may not be safe with cpus_read_lock() held as the workqueue flushing code may acquire cpus_read_lock() or acquiring locks which have locking dependency with cpus_read_lock() down the chain. Below is an example of such circular locking problem. ====================================================== WARNING: possible circular locking dependency detected 6.18.0-test+ #2 Tainted: G S ------------------------------------------------------ test_cpuset_prs/10971 is trying to acquire lock: ffff888112ba4958 ((wq_completion)sync_wq){+.+.}-{0:0}, at: touch_wq_lockdep_map+0x7a/0x180 but task is already holding lock: ffffffffae47f450 (cpuset_mutex){+.+.}-{4:4}, at: cpuset_partition_write+0x85/0x130 which lock already depends on the new lock. 
the existing dependency chain (in reverse order) is: -> #4 (cpuset_mutex){+.+.}-{4:4}: -> #3 (cpu_hotplug_lock){++++}-{0:0}: -> #2 (rtnl_mutex){+.+.}-{4:4}: -> #1 ((work_completion)(&arg.work)){+.+.}-{0:0}: -> #0 ((wq_completion)sync_wq){+.+.}-{0:0}: Chain exists of: (wq_completion)sync_wq --> cpu_hotplug_lock --> cpuset_mutex 5 locks held by test_cpuset_prs/10971: #0: ffff88816810e440 (sb_writers#7){.+.+}-{0:0}, at: ksys_write+0xf9/0x1d0 #1: ffff8891ab620890 (&of->mutex#2){+.+.}-{4:4}, at: kernfs_fop_write_iter+0x260/0x5f0 #2: ffff8890a78b83e8 (kn->active#187){.+.+}-{0:0}, at: kernfs_fop_write_iter+0x2b6/0x5f0 #3: ffffffffadf32900 (cpu_hotplug_lock){++++}-{0:0}, at: cpuset_partition_write+0x77/0x130 #4: ffffffffae47f450 (cpuset_mutex){+.+.}-{4:4}, at: cpuset_partition_write+0x85/0x130 Call Trace: <TASK> : touch_wq_lockdep_map+0x93/0x180 __flush_workqueue+0x111/0x10b0 housekeeping_update+0x12d/0x2d0 update_parent_effective_cpumask+0x595/0x2440 update_prstate+0x89d/0xce0 cpuset_partition_write+0xc5/0x130 cgroup_file_write+0x1a5/0x680 kernfs_fop_write_iter+0x3df/0x5f0 vfs_write+0x525/0xfd0 ksys_write+0xf9/0x1d0 do_syscall_64+0x95/0x520 entry_SYSCALL_64_after_hwframe+0x76/0x7e To avoid such a circular locking dependency problem, we have to call housekeeping_update() without holding the cpus_read_lock() and cpuset_mutex. The current set of wq's flushed by housekeeping_update() may not have work functions that call cpus_read_lock() directly, but we are likely to extend the list of wq's that are flushed in the future. Moreover, the current set of work functions may hold locks that may have cpu_hotplug_lock down the dependency chain. So housekeeping_update() is now called after releasing cpus_read_lock and cpuset_mutex at the end of a cpuset operation. These two locks are then re-acquired later before calling rebuild_sched_domains_locked(). 
To enable mutual exclusion between the housekeeping_update() call and other cpuset control file write actions, a new top level cpuset_top_mutex is introduced. This new mutex will be acquired first to allow sharing variables used by both code paths. However, cpuset update from CPU hotplug can still happen in parallel with the housekeeping_update() call, though that should be rare in production environment. As cpus_read_lock() is now no longer held when tmigr_isolated_exclude_cpumask() is called, it needs to acquire it directly. The lockdep_is_cpuset_held() is also updated to return true if either cpuset_top_mutex or cpuset_mutex is held. Fixes: 03ff735 ("cpuset: Update HK_TYPE_DOMAIN cpumask from cpuset") Signed-off-by: Waiman Long <[email protected]> Signed-off-by: Tejun Heo <[email protected]>
This leak will cause a hang when tearing down the SCSI host. For example, iscsid hangs with the following call trace:
[130120.652718] scsi_alloc_sdev: Allocation failure during SCSI scanning, some SCSI devices might not be configured
PID: 2528 TASK: ffff9d0408974e00 CPU: 3 COMMAND: "iscsid"
 #0 [ffffb5b9c134b9e0] __schedule at ffffffff860657d4
 #1 [ffffb5b9c134ba28] schedule at ffffffff86065c6f
 #2 [ffffb5b9c134ba40] schedule_timeout at ffffffff86069fb0
 #3 [ffffb5b9c134bab0] __wait_for_common at ffffffff8606674f
 #4 [ffffb5b9c134bb10] scsi_remove_host at ffffffff85bfe84b
 #5 [ffffb5b9c134bb30] iscsi_sw_tcp_session_destroy at ffffffffc03031c4 [iscsi_tcp]
 #6 [ffffb5b9c134bb48] iscsi_if_recv_msg at ffffffffc0292692 [scsi_transport_iscsi]
 #7 [ffffb5b9c134bb98] iscsi_if_rx at ffffffffc02929c2 [scsi_transport_iscsi]
 #8 [ffffb5b9c134bbf0] netlink_unicast at ffffffff85e551d6
 #9 [ffffb5b9c134bc38] netlink_sendmsg at ffffffff85e554ef
Fixes: 8fe4ce5 ("scsi: core: Fix a use-after-free")
Cc: [email protected]
Signed-off-by: Junxiao Bi <[email protected]>
Reviewed-by: Mike Christie <[email protected]>
Reviewed-by: Bart Van Assche <[email protected]>
Link: https://patch.msgid.link/[email protected]
Signed-off-by: Martin K. Petersen <[email protected]>
Shin'ichiro reported sporadic hangs when running generic/013 in our CI system. When enabling lockdep, there is a lockdep splat when calling btrfs_get_dev_zone_info_all_devices() in the mount path that can be triggered by i.e. generic/013: ====================================================== WARNING: possible circular locking dependency detected 7.0.0-rc1+ #355 Not tainted ------------------------------------------------------ mount/1043 is trying to acquire lock: ffff8881020b5470 (&vblk->vdev_mutex){+.+.}-{4:4}, at: virtblk_report_zones+0xda/0x430 but task is already holding lock: ffff888102a738e0 (&fs_devs->device_list_mutex){+.+.}-{4:4}, at: btrfs_get_dev_zone_info_all_devices+0x45/0x90 which lock already depends on the new lock. the existing dependency chain (in reverse order) is: -> #4 (&fs_devs->device_list_mutex){+.+.}-{4:4}: __mutex_lock+0xa3/0x1360 btrfs_create_pending_block_groups+0x1f4/0x9d0 __btrfs_end_transaction+0x3e/0x2e0 btrfs_zoned_reserve_data_reloc_bg+0x2f8/0x390 open_ctree+0x1934/0x23db btrfs_get_tree.cold+0x105/0x26c vfs_get_tree+0x28/0xb0 __do_sys_fsconfig+0x324/0x680 do_syscall_64+0x92/0x4f0 entry_SYSCALL_64_after_hwframe+0x76/0x7e -> #3 (btrfs_trans_num_extwriters){++++}-{0:0}: join_transaction+0xc2/0x5c0 start_transaction+0x17c/0xbc0 btrfs_zoned_reserve_data_reloc_bg+0x2b4/0x390 open_ctree+0x1934/0x23db btrfs_get_tree.cold+0x105/0x26c vfs_get_tree+0x28/0xb0 __do_sys_fsconfig+0x324/0x680 do_syscall_64+0x92/0x4f0 entry_SYSCALL_64_after_hwframe+0x76/0x7e -> #2 (btrfs_trans_num_writers){++++}-{0:0}: lock_release+0x163/0x4b0 __btrfs_end_transaction+0x1c7/0x2e0 btrfs_dirty_inode+0x6f/0xd0 touch_atime+0xe5/0x2c0 btrfs_file_mmap_prepare+0x65/0x90 __mmap_region+0x4b9/0xf00 mmap_region+0xf7/0x120 do_mmap+0x43d/0x610 vm_mmap_pgoff+0xd6/0x190 ksys_mmap_pgoff+0x7e/0xc0 do_syscall_64+0x92/0x4f0 entry_SYSCALL_64_after_hwframe+0x76/0x7e -> #1 (&mm->mmap_lock){++++}-{4:4}: __might_fault+0x68/0xa0 _copy_to_user+0x22/0x70 blkdev_copy_zone_to_user+0x22/0x40 
virtblk_report_zones+0x282/0x430 blkdev_report_zones_ioctl+0xfd/0x130 blkdev_ioctl+0x20f/0x2c0 __x64_sys_ioctl+0x86/0xd0 do_syscall_64+0x92/0x4f0 entry_SYSCALL_64_after_hwframe+0x76/0x7e -> #0 (&vblk->vdev_mutex){+.+.}-{4:4}: __lock_acquire+0x1522/0x2680 lock_acquire+0xd5/0x2f0 __mutex_lock+0xa3/0x1360 virtblk_report_zones+0xda/0x430 blkdev_report_zones_cached+0x162/0x190 btrfs_get_dev_zones+0xdc/0x2e0 btrfs_get_dev_zone_info+0x219/0xe80 btrfs_get_dev_zone_info_all_devices+0x62/0x90 open_ctree+0x1200/0x23db btrfs_get_tree.cold+0x105/0x26c vfs_get_tree+0x28/0xb0 __do_sys_fsconfig+0x324/0x680 do_syscall_64+0x92/0x4f0 entry_SYSCALL_64_after_hwframe+0x76/0x7e other info that might help us debug this: Chain exists of: &vblk->vdev_mutex --> btrfs_trans_num_extwriters --> &fs_devs->device_list_mutex Possible unsafe locking scenario: CPU0 CPU1 ---- ---- lock(&fs_devs->device_list_mutex); lock(btrfs_trans_num_extwriters); lock(&fs_devs->device_list_mutex); lock(&vblk->vdev_mutex); *** DEADLOCK *** 3 locks held by mount/1043: #0: ffff88811063e878 (&fc->uapi_mutex){+.+.}-{4:4}, at: __do_sys_fsconfig+0x2ae/0x680 #1: ffff88810cb9f0e8 (&type->s_umount_key#31/1){+.+.}-{4:4}, at: alloc_super+0xc0/0x3e0 #2: ffff888102a738e0 (&fs_devs->device_list_mutex){+.+.}-{4:4}, at: btrfs_get_dev_zone_info_all_devices+0x45/0x90 stack backtrace: CPU: 2 UID: 0 PID: 1043 Comm: mount Not tainted 7.0.0-rc1+ #355 PREEMPT(full) Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.17.0-9.fc43 06/10/2025 Call Trace: <TASK> dump_stack_lvl+0x5b/0x80 print_circular_bug.cold+0x18d/0x1d8 check_noncircular+0x10d/0x130 __lock_acquire+0x1522/0x2680 ? vmap_small_pages_range_noflush+0x3ef/0x820 lock_acquire+0xd5/0x2f0 ? virtblk_report_zones+0xda/0x430 ? lock_is_held_type+0xcd/0x130 __mutex_lock+0xa3/0x1360 ? virtblk_report_zones+0xda/0x430 ? virtblk_report_zones+0xda/0x430 ? __pfx_copy_zone_info_cb+0x10/0x10 ? virtblk_report_zones+0xda/0x430 virtblk_report_zones+0xda/0x430 ? 
__pfx_copy_zone_info_cb+0x10/0x10 blkdev_report_zones_cached+0x162/0x190 ? __pfx_copy_zone_info_cb+0x10/0x10 btrfs_get_dev_zones+0xdc/0x2e0 btrfs_get_dev_zone_info+0x219/0xe80 btrfs_get_dev_zone_info_all_devices+0x62/0x90 open_ctree+0x1200/0x23db btrfs_get_tree.cold+0x105/0x26c ? rcu_is_watching+0x18/0x50 vfs_get_tree+0x28/0xb0 __do_sys_fsconfig+0x324/0x680 do_syscall_64+0x92/0x4f0 entry_SYSCALL_64_after_hwframe+0x76/0x7e RIP: 0033:0x7f615e27a40e RSP: 002b:00007fff11b18fb8 EFLAGS: 00000246 ORIG_RAX: 00000000000001af RAX: ffffffffffffffda RBX: 000055572e92ab10 RCX: 00007f615e27a40e RDX: 0000000000000000 RSI: 0000000000000006 RDI: 0000000000000003 RBP: 00007fff11b19100 R08: 0000000000000000 R09: 0000000000000000 R10: 0000000000000000 R11: 0000000000000246 R12: 0000000000000000 R13: 000055572e92bc40 R14: 00007f615e3faa60 R15: 000055572e92bd08 </TASK> Don't hold the device_list_mutex while calling into btrfs_get_dev_zone_info() in btrfs_get_dev_zone_info_all_devices() to mitigate the issue. This is safe, as no other thread can touch the device list at the moment of execution. Reported-by: Shin'ichiro Kawasaki <[email protected]> Reviewed-by: Damien Le Moal <[email protected]> Signed-off-by: Johannes Thumshirn <[email protected]> Reviewed-by: David Sterba <[email protected]> Signed-off-by: David Sterba <[email protected]>
As reported by syzbot [0], NBD can trigger a deadlock during
memory reclaim.
This occurs when a process holds lock_sock() on a backend TCP
socket and triggers a memory allocation that leads to fs reclaim.
If it eventually calls into NBD to send data or shut down the
socket, NBD will attempt to acquire the same lock_sock(),
resulting in the deadlock.
While NBD sets sk->sk_allocation to GFP_NOIO before calling
sendmsg(), this does not prevent the issue in some paths where
GFP_KERNEL is used directly under lock_sock().
To resolve this, let's use lock_sock_try() for TCP sendmsg() and
shutdown().
For sock_sendmsg(), if lock_sock_try() fails, -ERESTARTSYS is
returned, allowing the request to be retried later (e.g., via
was_interrupted() logic).
For sock_sendmsg() for NBD_CMD_DISC and kernel_sock_shutdown(),
the operation might be skipped if the lock cannot be acquired.
However, this is not expected to occur in practice because the
backend TCP socket should not be touched by userspace once it is
handed over to NBD.
Note that sock_recvmsg() does not require this special handling
because it is only called from the workqueue context.
Also note that AF_UNIX sockets continue to use sock_sendmsg()
and kernel_sock_shutdown() because unix_stream_sendmsg() and
unix_shutdown() do not acquire lock_sock().
[0]:
WARNING: possible circular locking dependency detected
syzkaller #0 Tainted: G L
syz.7.2282/12353 is trying to acquire lock:
ffffffff8e9aa700 (fs_reclaim){+.+.}-{0:0}, at: might_alloc include/linux/sched/mm.h:317 [inline]
ffffffff8e9aa700 (fs_reclaim){+.+.}-{0:0}, at: slab_pre_alloc_hook mm/slub.c:4489 [inline]
ffffffff8e9aa700 (fs_reclaim){+.+.}-{0:0}, at: slab_alloc_node mm/slub.c:4843 [inline]
ffffffff8e9aa700 (fs_reclaim){+.+.}-{0:0}, at: kmem_cache_alloc_node_noprof+0x53/0x6f0 mm/slub.c:4918
but task is already holding lock:
ffff88806f972a20 (sk_lock-AF_INET6){+.+.}-{0:0}, at: lock_sock include/net/sock.h:1709 [inline]
ffff88806f972a20 (sk_lock-AF_INET6){+.+.}-{0:0}, at: tcp_close+0x1d/0x110 net/ipv4/tcp.c:3349
which lock already depends on the new lock.
the existing dependency chain (in reverse order) is:
-> #6 (sk_lock-AF_INET6){+.+.}-{0:0}:
lock_sock_nested+0x41/0xf0 net/core/sock.c:3780
lock_sock include/net/sock.h:1709 [inline]
inet_shutdown+0x67/0x410 net/ipv4/af_inet.c:919
nbd_mark_nsock_dead+0xae/0x5c0 drivers/block/nbd.c:318
sock_shutdown+0x16b/0x200 drivers/block/nbd.c:411
nbd_clear_sock drivers/block/nbd.c:1427 [inline]
nbd_config_put+0x1eb/0x750 drivers/block/nbd.c:1451
nbd_genl_connect+0xaf8/0x1a40 drivers/block/nbd.c:2248
genl_family_rcv_msg_doit+0x214/0x300 net/netlink/genetlink.c:1114
genl_family_rcv_msg net/netlink/genetlink.c:1194 [inline]
genl_rcv_msg+0x560/0x800 net/netlink/genetlink.c:1209
netlink_rcv_skb+0x159/0x420 net/netlink/af_netlink.c:2550
genl_rcv+0x28/0x40 net/netlink/genetlink.c:1218
netlink_unicast_kernel net/netlink/af_netlink.c:1318 [inline]
netlink_unicast+0x5aa/0x870 net/netlink/af_netlink.c:1344
netlink_sendmsg+0x8b0/0xda0 net/netlink/af_netlink.c:1894
sock_sendmsg_nosec net/socket.c:727 [inline]
__sock_sendmsg net/socket.c:742 [inline]
____sys_sendmsg+0x9e1/0xb70 net/socket.c:2592
___sys_sendmsg+0x190/0x1e0 net/socket.c:2646
__sys_sendmsg+0x170/0x220 net/socket.c:2678
do_syscall_x64 arch/x86/entry/syscall_64.c:63 [inline]
do_syscall_64+0x106/0xf80 arch/x86/entry/syscall_64.c:94
entry_SYSCALL_64_after_hwframe+0x77/0x7f
-> #5 (&nsock->tx_lock){+.+.}-{4:4}:
__mutex_lock_common kernel/locking/mutex.c:614 [inline]
__mutex_lock+0x1a2/0x1b90 kernel/locking/mutex.c:776
nbd_handle_cmd drivers/block/nbd.c:1143 [inline]
nbd_queue_rq+0x428/0x1080 drivers/block/nbd.c:1207
blk_mq_dispatch_rq_list+0x422/0x1e70 block/blk-mq.c:2148
__blk_mq_do_dispatch_sched block/blk-mq-sched.c:168 [inline]
blk_mq_do_dispatch_sched block/blk-mq-sched.c:182 [inline]
__blk_mq_sched_dispatch_requests+0xcea/0x1620 block/blk-mq-sched.c:307
blk_mq_sched_dispatch_requests+0xd7/0x1c0 block/blk-mq-sched.c:329
blk_mq_run_hw_queue+0x23c/0x670 block/blk-mq.c:2386
blk_mq_dispatch_list+0x51d/0x1360 block/blk-mq.c:2949
blk_mq_flush_plug_list block/blk-mq.c:2997 [inline]
blk_mq_flush_plug_list+0x130/0x600 block/blk-mq.c:2969
__blk_flush_plug+0x2c4/0x4b0 block/blk-core.c:1230
blk_finish_plug block/blk-core.c:1257 [inline]
__submit_bio+0x584/0x6c0 block/blk-core.c:649
__submit_bio_noacct_mq block/blk-core.c:722 [inline]
submit_bio_noacct_nocheck+0x562/0xc10 block/blk-core.c:753
submit_bio_noacct+0xd17/0x2010 block/blk-core.c:884
blk_crypto_submit_bio include/linux/blk-crypto.h:203 [inline]
submit_bh_wbc+0x59c/0x770 fs/buffer.c:2821
submit_bh fs/buffer.c:2826 [inline]
block_read_full_folio+0x264/0x8e0 fs/buffer.c:2444
filemap_read_folio+0xfc/0x3b0 mm/filemap.c:2501
do_read_cache_folio+0x2d7/0x6b0 mm/filemap.c:4101
read_mapping_folio include/linux/pagemap.h:1028 [inline]
read_part_sector+0xd1/0x370 block/partitions/core.c:723
adfspart_check_ICS+0x93/0x910 block/partitions/acorn.c:360
check_partition block/partitions/core.c:142 [inline]
blk_add_partitions block/partitions/core.c:590 [inline]
bdev_disk_changed+0x7f8/0xc80 block/partitions/core.c:694
blkdev_get_whole+0x187/0x290 block/bdev.c:764
bdev_open+0x2c7/0xe40 block/bdev.c:973
blkdev_open+0x34e/0x4f0 block/fops.c:697
do_dentry_open+0x6d8/0x1660 fs/open.c:949
vfs_open+0x82/0x3f0 fs/open.c:1081
do_open fs/namei.c:4671 [inline]
path_openat+0x208c/0x31a0 fs/namei.c:4830
do_file_open+0x20e/0x430 fs/namei.c:4859
do_sys_openat2+0x10d/0x1e0 fs/open.c:1366
do_sys_open fs/open.c:1372 [inline]
__do_sys_openat fs/open.c:1388 [inline]
__se_sys_openat fs/open.c:1383 [inline]
__x64_sys_openat+0x12d/0x210 fs/open.c:1383
do_syscall_x64 arch/x86/entry/syscall_64.c:63 [inline]
do_syscall_64+0x106/0xf80 arch/x86/entry/syscall_64.c:94
entry_SYSCALL_64_after_hwframe+0x77/0x7f
-> #4 (&cmd->lock){+.+.}-{4:4}:
__mutex_lock_common kernel/locking/mutex.c:614 [inline]
__mutex_lock+0x1a2/0x1b90 kernel/locking/mutex.c:776
nbd_queue_rq+0xba/0x1080 drivers/block/nbd.c:1199
blk_mq_dispatch_rq_list+0x422/0x1e70 block/blk-mq.c:2148
__blk_mq_do_dispatch_sched block/blk-mq-sched.c:168 [inline]
blk_mq_do_dispatch_sched block/blk-mq-sched.c:182 [inline]
__blk_mq_sched_dispatch_requests+0xcea/0x1620 block/blk-mq-sched.c:307
blk_mq_sched_dispatch_requests+0xd7/0x1c0 block/blk-mq-sched.c:329
blk_mq_run_hw_queue+0x23c/0x670 block/blk-mq.c:2386
blk_mq_dispatch_list+0x51d/0x1360 block/blk-mq.c:2949
blk_mq_flush_plug_list block/blk-mq.c:2997 [inline]
blk_mq_flush_plug_list+0x130/0x600 block/blk-mq.c:2969
__blk_flush_plug+0x2c4/0x4b0 block/blk-core.c:1230
blk_finish_plug block/blk-core.c:1257 [inline]
__submit_bio+0x584/0x6c0 block/blk-core.c:649
__submit_bio_noacct_mq block/blk-core.c:722 [inline]
submit_bio_noacct_nocheck+0x562/0xc10 block/blk-core.c:753
submit_bio_noacct+0xd17/0x2010 block/blk-core.c:884
blk_crypto_submit_bio include/linux/blk-crypto.h:203 [inline]
submit_bh_wbc+0x59c/0x770 fs/buffer.c:2821
submit_bh fs/buffer.c:2826 [inline]
block_read_full_folio+0x264/0x8e0 fs/buffer.c:2444
filemap_read_folio+0xfc/0x3b0 mm/filemap.c:2501
do_read_cache_folio+0x2d7/0x6b0 mm/filemap.c:4101
read_mapping_folio include/linux/pagemap.h:1028 [inline]
read_part_sector+0xd1/0x370 block/partitions/core.c:723
adfspart_check_ICS+0x93/0x910 block/partitions/acorn.c:360
check_partition block/partitions/core.c:142 [inline]
blk_add_partitions block/partitions/core.c:590 [inline]
bdev_disk_changed+0x7f8/0xc80 block/partitions/core.c:694
blkdev_get_whole+0x187/0x290 block/bdev.c:764
bdev_open+0x2c7/0xe40 block/bdev.c:973
blkdev_open+0x34e/0x4f0 block/fops.c:697
do_dentry_open+0x6d8/0x1660 fs/open.c:949
vfs_open+0x82/0x3f0 fs/open.c:1081
do_open fs/namei.c:4671 [inline]
path_openat+0x208c/0x31a0 fs/namei.c:4830
do_file_open+0x20e/0x430 fs/namei.c:4859
do_sys_openat2+0x10d/0x1e0 fs/open.c:1366
do_sys_open fs/open.c:1372 [inline]
__do_sys_openat fs/open.c:1388 [inline]
__se_sys_openat fs/open.c:1383 [inline]
__x64_sys_openat+0x12d/0x210 fs/open.c:1383
do_syscall_x64 arch/x86/entry/syscall_64.c:63 [inline]
do_syscall_64+0x106/0xf80 arch/x86/entry/syscall_64.c:94
entry_SYSCALL_64_after_hwframe+0x77/0x7f
-> #3 (set->srcu){.+.+}-{0:0}:
srcu_lock_sync include/linux/srcu.h:199 [inline]
__synchronize_srcu+0xa1/0x2a0 kernel/rcu/srcutree.c:1505
blk_mq_wait_quiesce_done block/blk-mq.c:284 [inline]
blk_mq_wait_quiesce_done block/blk-mq.c:281 [inline]
blk_mq_quiesce_queue block/blk-mq.c:304 [inline]
blk_mq_quiesce_queue+0x149/0x1c0 block/blk-mq.c:299
elevator_switch+0x17b/0x7e0 block/elevator.c:576
elevator_change+0x352/0x530 block/elevator.c:681
elevator_set_default+0x29e/0x360 block/elevator.c:754
blk_register_queue+0x412/0x590 block/blk-sysfs.c:946
__add_disk+0x73f/0xe40 block/genhd.c:528
add_disk_fwnode+0x118/0x5c0 block/genhd.c:597
add_disk include/linux/blkdev.h:785 [inline]
nbd_dev_add+0x77a/0xb10 drivers/block/nbd.c:1984
nbd_init+0x291/0x2b0 drivers/block/nbd.c:2692
do_one_initcall+0x11d/0x760 init/main.c:1382
do_initcall_level init/main.c:1444 [inline]
do_initcalls init/main.c:1460 [inline]
do_basic_setup init/main.c:1479 [inline]
kernel_init_freeable+0x6e5/0x7a0 init/main.c:1692
kernel_init+0x1f/0x1e0 init/main.c:1582
ret_from_fork+0x754/0xd80 arch/x86/kernel/process.c:158
ret_from_fork_asm+0x1a/0x30 arch/x86/entry/entry_64.S:245
-> #2 (&q->elevator_lock){+.+.}-{4:4}:
__mutex_lock_common kernel/locking/mutex.c:614 [inline]
__mutex_lock+0x1a2/0x1b90 kernel/locking/mutex.c:776
elevator_change+0x1bc/0x530 block/elevator.c:679
elevator_set_none+0x92/0xf0 block/elevator.c:769
blk_mq_elv_switch_none block/blk-mq.c:5110 [inline]
__blk_mq_update_nr_hw_queues block/blk-mq.c:5155 [inline]
blk_mq_update_nr_hw_queues+0x4c1/0x15f0 block/blk-mq.c:5220
nbd_start_device+0x1a6/0xbd0 drivers/block/nbd.c:1489
nbd_genl_connect+0xff2/0x1a40 drivers/block/nbd.c:2239
genl_family_rcv_msg_doit+0x214/0x300 net/netlink/genetlink.c:1114
genl_family_rcv_msg net/netlink/genetlink.c:1194 [inline]
genl_rcv_msg+0x560/0x800 net/netlink/genetlink.c:1209
netlink_rcv_skb+0x159/0x420 net/netlink/af_netlink.c:2550
genl_rcv+0x28/0x40 net/netlink/genetlink.c:1218
netlink_unicast_kernel net/netlink/af_netlink.c:1318 [inline]
netlink_unicast+0x5aa/0x870 net/netlink/af_netlink.c:1344
netlink_sendmsg+0x8b0/0xda0 net/netlink/af_netlink.c:1894
sock_sendmsg_nosec net/socket.c:727 [inline]
__sock_sendmsg net/socket.c:742 [inline]
____sys_sendmsg+0x9e1/0xb70 net/socket.c:2592
___sys_sendmsg+0x190/0x1e0 net/socket.c:2646
__sys_sendmsg+0x170/0x220 net/socket.c:2678
do_syscall_x64 arch/x86/entry/syscall_64.c:63 [inline]
do_syscall_64+0x106/0xf80 arch/x86/entry/syscall_64.c:94
entry_SYSCALL_64_after_hwframe+0x77/0x7f
-> #1 (&q->q_usage_counter(io)#49){++++}-{0:0}:
blk_alloc_queue+0x610/0x790 block/blk-core.c:461
blk_mq_alloc_queue+0x174/0x290 block/blk-mq.c:4429
__blk_mq_alloc_disk+0x29/0x120 block/blk-mq.c:4476
nbd_dev_add+0x492/0xb10 drivers/block/nbd.c:1954
nbd_init+0x291/0x2b0 drivers/block/nbd.c:2692
do_one_initcall+0x11d/0x760 init/main.c:1382
do_initcall_level init/main.c:1444 [inline]
do_initcalls init/main.c:1460 [inline]
do_basic_setup init/main.c:1479 [inline]
kernel_init_freeable+0x6e5/0x7a0 init/main.c:1692
kernel_init+0x1f/0x1e0 init/main.c:1582
ret_from_fork+0x754/0xd80 arch/x86/kernel/process.c:158
ret_from_fork_asm+0x1a/0x30 arch/x86/entry/entry_64.S:245
-> #0 (fs_reclaim){+.+.}-{0:0}:
check_prev_add kernel/locking/lockdep.c:3165 [inline]
check_prevs_add kernel/locking/lockdep.c:3284 [inline]
validate_chain kernel/locking/lockdep.c:3908 [inline]
__lock_acquire+0x14b8/0x2630 kernel/locking/lockdep.c:5237
lock_acquire kernel/locking/lockdep.c:5868 [inline]
lock_acquire+0x1cf/0x380 kernel/locking/lockdep.c:5825
__fs_reclaim_acquire mm/page_alloc.c:4348 [inline]
fs_reclaim_acquire+0xc4/0x100 mm/page_alloc.c:4362
might_alloc include/linux/sched/mm.h:317 [inline]
slab_pre_alloc_hook mm/slub.c:4489 [inline]
slab_alloc_node mm/slub.c:4843 [inline]
kmem_cache_alloc_node_noprof+0x53/0x6f0 mm/slub.c:4918
__alloc_skb+0x140/0x710 net/core/skbuff.c:702
alloc_skb include/linux/skbuff.h:1383 [inline]
tcp_send_active_reset+0x8b/0xa60 net/ipv4/tcp_output.c:3862
__tcp_close+0x41e/0x1110 net/ipv4/tcp.c:3223
tcp_close+0x28/0x110 net/ipv4/tcp.c:3350
inet_release+0xed/0x200 net/ipv4/af_inet.c:443
inet6_release+0x4f/0x70 net/ipv6/af_inet6.c:479
__sock_release+0xb3/0x260 net/socket.c:662
sock_close+0x1c/0x30 net/socket.c:1455
__fput+0x3ff/0xb40 fs/file_table.c:469
task_work_run+0x150/0x240 kernel/task_work.c:233
resume_user_mode_work include/linux/resume_user_mode.h:50 [inline]
__exit_to_user_mode_loop kernel/entry/common.c:67 [inline]
exit_to_user_mode_loop+0x100/0x4a0 kernel/entry/common.c:98
__exit_to_user_mode_prepare include/linux/irq-entry-common.h:226 [inline]
syscall_exit_to_user_mode_prepare include/linux/irq-entry-common.h:256 [inline]
syscall_exit_to_user_mode include/linux/entry-common.h:325 [inline]
do_syscall_64+0x67c/0xf80 arch/x86/entry/syscall_64.c:100
entry_SYSCALL_64_after_hwframe+0x77/0x7f
other info that might help us debug this:
Chain exists of:
fs_reclaim --> &nsock->tx_lock --> sk_lock-AF_INET6
Possible unsafe locking scenario:
CPU0 CPU1
---- ----
lock(sk_lock-AF_INET6);
lock(&nsock->tx_lock);
lock(sk_lock-AF_INET6);
lock(fs_reclaim);
*** DEADLOCK ***
Fixes: fd8383f ("nbd: convert to blkmq")
Reported-by: [email protected]
Closes: https://lore.kernel.org/netdev/[email protected]/
Signed-off-by: Kuniyuki Iwashima <[email protected]>
…kernel/git/kvmarm/kvmarm into HEAD
KVM/arm64 fixes for 7.0, take #4
- Clear the pending exception state from a vcpu coming out of reset,
as it could otherwise affect the first instruction executed in the
guest.
- Fix the address translation emulation code to set the Hardware
Access bit on the correct PTE instead of some other location.
The devm_free_irq() and devm_request_irq() functions should not be
executed in an atomic context.
During device suspend, all userspace processes and most kernel threads
are frozen. Additionally, we flush all tx/rx status, disable all macb
interrupts, and halt rx operations. Therefore, it is safe to split the
region protected by bp->lock into two independent sections, allowing
devm_free_irq() and devm_request_irq() to run in a non-atomic context.
This modification resolves the following lockdep warning:
BUG: sleeping function called from invalid context at kernel/locking/mutex.c:591
in_atomic(): 1, irqs_disabled(): 1, non_block: 0, pid: 501, name: rtcwake
preempt_count: 1, expected: 0
RCU nest depth: 1, expected: 0
7 locks held by rtcwake/501:
#0: ffff0008038c3408 (sb_writers#5){.+.+}-{0:0}, at: vfs_write+0xf8/0x368
#1: ffff0008049a5e88 (&of->mutex#2){+.+.}-{4:4}, at: kernfs_fop_write_iter+0xbc/0x1c8
#2: ffff00080098d588 (kn->active#70){.+.+}-{0:0}, at: kernfs_fop_write_iter+0xcc/0x1c8
#3: ffff800081c84888 (system_transition_mutex){+.+.}-{4:4}, at: pm_suspend+0x1ec/0x290
#4: ffff0008009ba0f8 (&dev->mutex){....}-{4:4}, at: device_suspend+0x118/0x4f0
#5: ffff800081d00458 (rcu_read_lock){....}-{1:3}, at: rcu_lock_acquire+0x4/0x48
#6: ffff0008031fb9e0 (&bp->lock){-.-.}-{3:3}, at: macb_suspend+0x144/0x558
irq event stamp: 8682
hardirqs last enabled at (8681): [<ffff8000813c7d7c>] _raw_spin_unlock_irqrestore+0x44/0x88
hardirqs last disabled at (8682): [<ffff8000813c7b58>] _raw_spin_lock_irqsave+0x38/0x98
softirqs last enabled at (7322): [<ffff8000800f1b4c>] handle_softirqs+0x52c/0x588
softirqs last disabled at (7317): [<ffff800080010310>] __do_softirq+0x20/0x2c
CPU: 1 UID: 0 PID: 501 Comm: rtcwake Not tainted 7.0.0-rc3-next-20260310-yocto-standard+ #125 PREEMPT
Hardware name: ZynqMP ZCU102 Rev1.1 (DT)
Call trace:
show_stack+0x24/0x38 (C)
__dump_stack+0x28/0x38
dump_stack_lvl+0x64/0x88
dump_stack+0x18/0x24
__might_resched+0x200/0x218
__might_sleep+0x38/0x98
__mutex_lock_common+0x7c/0x1378
mutex_lock_nested+0x38/0x50
free_irq+0x68/0x2b0
devm_irq_release+0x24/0x38
devres_release+0x40/0x80
devm_free_irq+0x48/0x88
macb_suspend+0x298/0x558
device_suspend+0x218/0x4f0
dpm_suspend+0x244/0x3a0
dpm_suspend_start+0x50/0x78
suspend_devices_and_enter+0xec/0x560
pm_suspend+0x194/0x290
state_store+0x110/0x158
kobj_attr_store+0x1c/0x30
sysfs_kf_write+0xa8/0xd0
kernfs_fop_write_iter+0x11c/0x1c8
vfs_write+0x248/0x368
ksys_write+0x7c/0xf8
__arm64_sys_write+0x28/0x40
invoke_syscall+0x4c/0xe8
el0_svc_common+0x98/0xf0
do_el0_svc+0x28/0x40
el0_svc+0x54/0x1e0
el0t_64_sync_handler+0x84/0x130
el0t_64_sync+0x198/0x1a0
Fixes: 558e35c ("net: macb: WoL support for GEM type of Ethernet controller")
Cc: [email protected]
Reviewed-by: Théo Lebrun <[email protected]>
Signed-off-by: Kevin Hao <[email protected]>
Link: https://patch.msgid.link/[email protected]
Signed-off-by: Jakub Kicinski <[email protected]>
Access to net_device::ip_ptr and its associated members must be protected by an RCU lock. Since we are modifying this piece of code, let's also move it to execute only when WAKE_ARP is enabled. To minimize the duration of the RCU lock, a local variable is used to temporarily store the IP address. This change resolves the following RCU check warning: WARNING: suspicious RCU usage 7.0.0-rc3-next-20260310-yocto-standard+ #122 Not tainted ----------------------------- drivers/net/ethernet/cadence/macb_main.c:5944 suspicious rcu_dereference_check() usage! other info that might help us debug this: rcu_scheduler_active = 2, debug_locks = 1 5 locks held by rtcwake/518: #0: ffff000803ab1408 (sb_writers#5){.+.+}-{0:0}, at: vfs_write+0xf8/0x368 #1: ffff0008090bf088 (&of->mutex#2){+.+.}-{4:4}, at: kernfs_fop_write_iter+0xbc/0x1c8 #2: ffff00080098d588 (kn->active#70){.+.+}-{0:0}, at: kernfs_fop_write_iter+0xcc/0x1c8 #3: ffff800081c84888 (system_transition_mutex){+.+.}-{4:4}, at: pm_suspend+0x1ec/0x290 #4: ffff0008009ba0f8 (&dev->mutex){....}-{4:4}, at: device_suspend+0x118/0x4f0 stack backtrace: CPU: 3 UID: 0 PID: 518 Comm: rtcwake Not tainted 7.0.0-rc3-next-20260310-yocto-standard+ #122 PREEMPT Hardware name: ZynqMP ZCU102 Rev1.1 (DT) Call trace: show_stack+0x24/0x38 (C) __dump_stack+0x28/0x38 dump_stack_lvl+0x64/0x88 dump_stack+0x18/0x24 lockdep_rcu_suspicious+0x134/0x1d8 macb_suspend+0xd8/0x4c0 device_suspend+0x218/0x4f0 dpm_suspend+0x244/0x3a0 dpm_suspend_start+0x50/0x78 suspend_devices_and_enter+0xec/0x560 pm_suspend+0x194/0x290 state_store+0x110/0x158 kobj_attr_store+0x1c/0x30 sysfs_kf_write+0xa8/0xd0 kernfs_fop_write_iter+0x11c/0x1c8 vfs_write+0x248/0x368 ksys_write+0x7c/0xf8 __arm64_sys_write+0x28/0x40 invoke_syscall+0x4c/0xe8 el0_svc_common+0x98/0xf0 do_el0_svc+0x28/0x40 el0_svc+0x54/0x1e0 el0t_64_sync_handler+0x84/0x130 el0t_64_sync+0x198/0x1a0 Fixes: 0cb8de3 ("net: macb: Add ARP support to WOL") Signed-off-by: Kevin Hao <[email protected]> Cc: [email protected] 
Reviewed-by: Théo Lebrun <[email protected]>
Link: https://patch.msgid.link/[email protected]
Signed-off-by: Jakub Kicinski <[email protected]>
As reported by syzbot [0], NBD can trigger a deadlock during
memory reclaim.
This occurs when a process holds lock_sock() on a backend TCP
socket and triggers a memory allocation that leads to fs reclaim.
If it eventually calls into NBD to send data or shut down the
socket, NBD will attempt to acquire the same lock_sock(),
resulting in the deadlock.
While NBD sets sk->sk_allocation to GFP_NOIO before calling
sendmsg(), this does not prevent the issue in some paths where
GFP_KERNEL is used directly under lock_sock().
To resolve this, let's use lock_sock_try() for TCP sendmsg() and
shutdown().
For sock_sendmsg(), if lock_sock_try() fails, -ERESTARTSYS is
returned, allowing the request to be retried later (e.g., via
was_interrupted() logic).
For the sock_sendmsg() call for NBD_CMD_DISC and for
kernel_sock_shutdown(), the operation might be skipped if the lock
cannot be acquired.
However, this is not expected to occur in practice because the
backend TCP socket should not be touched by userspace once it is
handed over to NBD.
Note that sock_recvmsg() does not require this special handling
because it is only called from the workqueue context.
Also note that AF_UNIX sockets continue to use sock_sendmsg()
and kernel_sock_shutdown() because unix_stream_sendmsg() and
unix_shutdown() do not acquire lock_sock().
[0]:
WARNING: possible circular locking dependency detected
syzkaller #0 Tainted: G L
syz.7.2282/12353 is trying to acquire lock:
ffffffff8e9aa700 (fs_reclaim){+.+.}-{0:0}, at: might_alloc include/linux/sched/mm.h:317 [inline]
ffffffff8e9aa700 (fs_reclaim){+.+.}-{0:0}, at: slab_pre_alloc_hook mm/slub.c:4489 [inline]
ffffffff8e9aa700 (fs_reclaim){+.+.}-{0:0}, at: slab_alloc_node mm/slub.c:4843 [inline]
ffffffff8e9aa700 (fs_reclaim){+.+.}-{0:0}, at: kmem_cache_alloc_node_noprof+0x53/0x6f0 mm/slub.c:4918
but task is already holding lock:
ffff88806f972a20 (sk_lock-AF_INET6){+.+.}-{0:0}, at: lock_sock include/net/sock.h:1709 [inline]
ffff88806f972a20 (sk_lock-AF_INET6){+.+.}-{0:0}, at: tcp_close+0x1d/0x110 net/ipv4/tcp.c:3349
which lock already depends on the new lock.
the existing dependency chain (in reverse order) is:
-> #6 (sk_lock-AF_INET6){+.+.}-{0:0}:
lock_sock_nested+0x41/0xf0 net/core/sock.c:3780
lock_sock include/net/sock.h:1709 [inline]
inet_shutdown+0x67/0x410 net/ipv4/af_inet.c:919
nbd_mark_nsock_dead+0xae/0x5c0 drivers/block/nbd.c:318
sock_shutdown+0x16b/0x200 drivers/block/nbd.c:411
nbd_clear_sock drivers/block/nbd.c:1427 [inline]
nbd_config_put+0x1eb/0x750 drivers/block/nbd.c:1451
nbd_genl_connect+0xaf8/0x1a40 drivers/block/nbd.c:2248
genl_family_rcv_msg_doit+0x214/0x300 net/netlink/genetlink.c:1114
genl_family_rcv_msg net/netlink/genetlink.c:1194 [inline]
genl_rcv_msg+0x560/0x800 net/netlink/genetlink.c:1209
netlink_rcv_skb+0x159/0x420 net/netlink/af_netlink.c:2550
genl_rcv+0x28/0x40 net/netlink/genetlink.c:1218
netlink_unicast_kernel net/netlink/af_netlink.c:1318 [inline]
netlink_unicast+0x5aa/0x870 net/netlink/af_netlink.c:1344
netlink_sendmsg+0x8b0/0xda0 net/netlink/af_netlink.c:1894
sock_sendmsg_nosec net/socket.c:727 [inline]
__sock_sendmsg net/socket.c:742 [inline]
____sys_sendmsg+0x9e1/0xb70 net/socket.c:2592
___sys_sendmsg+0x190/0x1e0 net/socket.c:2646
__sys_sendmsg+0x170/0x220 net/socket.c:2678
do_syscall_x64 arch/x86/entry/syscall_64.c:63 [inline]
do_syscall_64+0x106/0xf80 arch/x86/entry/syscall_64.c:94
entry_SYSCALL_64_after_hwframe+0x77/0x7f
-> #5 (&nsock->tx_lock){+.+.}-{4:4}:
__mutex_lock_common kernel/locking/mutex.c:614 [inline]
__mutex_lock+0x1a2/0x1b90 kernel/locking/mutex.c:776
nbd_handle_cmd drivers/block/nbd.c:1143 [inline]
nbd_queue_rq+0x428/0x1080 drivers/block/nbd.c:1207
blk_mq_dispatch_rq_list+0x422/0x1e70 block/blk-mq.c:2148
__blk_mq_do_dispatch_sched block/blk-mq-sched.c:168 [inline]
blk_mq_do_dispatch_sched block/blk-mq-sched.c:182 [inline]
__blk_mq_sched_dispatch_requests+0xcea/0x1620 block/blk-mq-sched.c:307
blk_mq_sched_dispatch_requests+0xd7/0x1c0 block/blk-mq-sched.c:329
blk_mq_run_hw_queue+0x23c/0x670 block/blk-mq.c:2386
blk_mq_dispatch_list+0x51d/0x1360 block/blk-mq.c:2949
blk_mq_flush_plug_list block/blk-mq.c:2997 [inline]
blk_mq_flush_plug_list+0x130/0x600 block/blk-mq.c:2969
__blk_flush_plug+0x2c4/0x4b0 block/blk-core.c:1230
blk_finish_plug block/blk-core.c:1257 [inline]
__submit_bio+0x584/0x6c0 block/blk-core.c:649
__submit_bio_noacct_mq block/blk-core.c:722 [inline]
submit_bio_noacct_nocheck+0x562/0xc10 block/blk-core.c:753
submit_bio_noacct+0xd17/0x2010 block/blk-core.c:884
blk_crypto_submit_bio include/linux/blk-crypto.h:203 [inline]
submit_bh_wbc+0x59c/0x770 fs/buffer.c:2821
submit_bh fs/buffer.c:2826 [inline]
block_read_full_folio+0x264/0x8e0 fs/buffer.c:2444
filemap_read_folio+0xfc/0x3b0 mm/filemap.c:2501
do_read_cache_folio+0x2d7/0x6b0 mm/filemap.c:4101
read_mapping_folio include/linux/pagemap.h:1028 [inline]
read_part_sector+0xd1/0x370 block/partitions/core.c:723
adfspart_check_ICS+0x93/0x910 block/partitions/acorn.c:360
check_partition block/partitions/core.c:142 [inline]
blk_add_partitions block/partitions/core.c:590 [inline]
bdev_disk_changed+0x7f8/0xc80 block/partitions/core.c:694
blkdev_get_whole+0x187/0x290 block/bdev.c:764
bdev_open+0x2c7/0xe40 block/bdev.c:973
blkdev_open+0x34e/0x4f0 block/fops.c:697
do_dentry_open+0x6d8/0x1660 fs/open.c:949
vfs_open+0x82/0x3f0 fs/open.c:1081
do_open fs/namei.c:4671 [inline]
path_openat+0x208c/0x31a0 fs/namei.c:4830
do_file_open+0x20e/0x430 fs/namei.c:4859
do_sys_openat2+0x10d/0x1e0 fs/open.c:1366
do_sys_open fs/open.c:1372 [inline]
__do_sys_openat fs/open.c:1388 [inline]
__se_sys_openat fs/open.c:1383 [inline]
__x64_sys_openat+0x12d/0x210 fs/open.c:1383
do_syscall_x64 arch/x86/entry/syscall_64.c:63 [inline]
do_syscall_64+0x106/0xf80 arch/x86/entry/syscall_64.c:94
entry_SYSCALL_64_after_hwframe+0x77/0x7f
-> #4 (&cmd->lock){+.+.}-{4:4}:
__mutex_lock_common kernel/locking/mutex.c:614 [inline]
__mutex_lock+0x1a2/0x1b90 kernel/locking/mutex.c:776
nbd_queue_rq+0xba/0x1080 drivers/block/nbd.c:1199
blk_mq_dispatch_rq_list+0x422/0x1e70 block/blk-mq.c:2148
__blk_mq_do_dispatch_sched block/blk-mq-sched.c:168 [inline]
blk_mq_do_dispatch_sched block/blk-mq-sched.c:182 [inline]
__blk_mq_sched_dispatch_requests+0xcea/0x1620 block/blk-mq-sched.c:307
blk_mq_sched_dispatch_requests+0xd7/0x1c0 block/blk-mq-sched.c:329
blk_mq_run_hw_queue+0x23c/0x670 block/blk-mq.c:2386
blk_mq_dispatch_list+0x51d/0x1360 block/blk-mq.c:2949
blk_mq_flush_plug_list block/blk-mq.c:2997 [inline]
blk_mq_flush_plug_list+0x130/0x600 block/blk-mq.c:2969
__blk_flush_plug+0x2c4/0x4b0 block/blk-core.c:1230
blk_finish_plug block/blk-core.c:1257 [inline]
__submit_bio+0x584/0x6c0 block/blk-core.c:649
__submit_bio_noacct_mq block/blk-core.c:722 [inline]
submit_bio_noacct_nocheck+0x562/0xc10 block/blk-core.c:753
submit_bio_noacct+0xd17/0x2010 block/blk-core.c:884
blk_crypto_submit_bio include/linux/blk-crypto.h:203 [inline]
submit_bh_wbc+0x59c/0x770 fs/buffer.c:2821
submit_bh fs/buffer.c:2826 [inline]
block_read_full_folio+0x264/0x8e0 fs/buffer.c:2444
filemap_read_folio+0xfc/0x3b0 mm/filemap.c:2501
do_read_cache_folio+0x2d7/0x6b0 mm/filemap.c:4101
read_mapping_folio include/linux/pagemap.h:1028 [inline]
read_part_sector+0xd1/0x370 block/partitions/core.c:723
adfspart_check_ICS+0x93/0x910 block/partitions/acorn.c:360
check_partition block/partitions/core.c:142 [inline]
blk_add_partitions block/partitions/core.c:590 [inline]
bdev_disk_changed+0x7f8/0xc80 block/partitions/core.c:694
blkdev_get_whole+0x187/0x290 block/bdev.c:764
bdev_open+0x2c7/0xe40 block/bdev.c:973
blkdev_open+0x34e/0x4f0 block/fops.c:697
do_dentry_open+0x6d8/0x1660 fs/open.c:949
vfs_open+0x82/0x3f0 fs/open.c:1081
do_open fs/namei.c:4671 [inline]
path_openat+0x208c/0x31a0 fs/namei.c:4830
do_file_open+0x20e/0x430 fs/namei.c:4859
do_sys_openat2+0x10d/0x1e0 fs/open.c:1366
do_sys_open fs/open.c:1372 [inline]
__do_sys_openat fs/open.c:1388 [inline]
__se_sys_openat fs/open.c:1383 [inline]
__x64_sys_openat+0x12d/0x210 fs/open.c:1383
do_syscall_x64 arch/x86/entry/syscall_64.c:63 [inline]
do_syscall_64+0x106/0xf80 arch/x86/entry/syscall_64.c:94
entry_SYSCALL_64_after_hwframe+0x77/0x7f
-> #3 (set->srcu){.+.+}-{0:0}:
srcu_lock_sync include/linux/srcu.h:199 [inline]
__synchronize_srcu+0xa1/0x2a0 kernel/rcu/srcutree.c:1505
blk_mq_wait_quiesce_done block/blk-mq.c:284 [inline]
blk_mq_wait_quiesce_done block/blk-mq.c:281 [inline]
blk_mq_quiesce_queue block/blk-mq.c:304 [inline]
blk_mq_quiesce_queue+0x149/0x1c0 block/blk-mq.c:299
elevator_switch+0x17b/0x7e0 block/elevator.c:576
elevator_change+0x352/0x530 block/elevator.c:681
elevator_set_default+0x29e/0x360 block/elevator.c:754
blk_register_queue+0x412/0x590 block/blk-sysfs.c:946
__add_disk+0x73f/0xe40 block/genhd.c:528
add_disk_fwnode+0x118/0x5c0 block/genhd.c:597
add_disk include/linux/blkdev.h:785 [inline]
nbd_dev_add+0x77a/0xb10 drivers/block/nbd.c:1984
nbd_init+0x291/0x2b0 drivers/block/nbd.c:2692
do_one_initcall+0x11d/0x760 init/main.c:1382
do_initcall_level init/main.c:1444 [inline]
do_initcalls init/main.c:1460 [inline]
do_basic_setup init/main.c:1479 [inline]
kernel_init_freeable+0x6e5/0x7a0 init/main.c:1692
kernel_init+0x1f/0x1e0 init/main.c:1582
ret_from_fork+0x754/0xd80 arch/x86/kernel/process.c:158
ret_from_fork_asm+0x1a/0x30 arch/x86/entry/entry_64.S:245
-> #2 (&q->elevator_lock){+.+.}-{4:4}:
__mutex_lock_common kernel/locking/mutex.c:614 [inline]
__mutex_lock+0x1a2/0x1b90 kernel/locking/mutex.c:776
elevator_change+0x1bc/0x530 block/elevator.c:679
elevator_set_none+0x92/0xf0 block/elevator.c:769
blk_mq_elv_switch_none block/blk-mq.c:5110 [inline]
__blk_mq_update_nr_hw_queues block/blk-mq.c:5155 [inline]
blk_mq_update_nr_hw_queues+0x4c1/0x15f0 block/blk-mq.c:5220
nbd_start_device+0x1a6/0xbd0 drivers/block/nbd.c:1489
nbd_genl_connect+0xff2/0x1a40 drivers/block/nbd.c:2239
genl_family_rcv_msg_doit+0x214/0x300 net/netlink/genetlink.c:1114
genl_family_rcv_msg net/netlink/genetlink.c:1194 [inline]
genl_rcv_msg+0x560/0x800 net/netlink/genetlink.c:1209
netlink_rcv_skb+0x159/0x420 net/netlink/af_netlink.c:2550
genl_rcv+0x28/0x40 net/netlink/genetlink.c:1218
netlink_unicast_kernel net/netlink/af_netlink.c:1318 [inline]
netlink_unicast+0x5aa/0x870 net/netlink/af_netlink.c:1344
netlink_sendmsg+0x8b0/0xda0 net/netlink/af_netlink.c:1894
sock_sendmsg_nosec net/socket.c:727 [inline]
__sock_sendmsg net/socket.c:742 [inline]
____sys_sendmsg+0x9e1/0xb70 net/socket.c:2592
___sys_sendmsg+0x190/0x1e0 net/socket.c:2646
__sys_sendmsg+0x170/0x220 net/socket.c:2678
do_syscall_x64 arch/x86/entry/syscall_64.c:63 [inline]
do_syscall_64+0x106/0xf80 arch/x86/entry/syscall_64.c:94
entry_SYSCALL_64_after_hwframe+0x77/0x7f
-> #1 (&q->q_usage_counter(io)#49){++++}-{0:0}:
blk_alloc_queue+0x610/0x790 block/blk-core.c:461
blk_mq_alloc_queue+0x174/0x290 block/blk-mq.c:4429
__blk_mq_alloc_disk+0x29/0x120 block/blk-mq.c:4476
nbd_dev_add+0x492/0xb10 drivers/block/nbd.c:1954
nbd_init+0x291/0x2b0 drivers/block/nbd.c:2692
do_one_initcall+0x11d/0x760 init/main.c:1382
do_initcall_level init/main.c:1444 [inline]
do_initcalls init/main.c:1460 [inline]
do_basic_setup init/main.c:1479 [inline]
kernel_init_freeable+0x6e5/0x7a0 init/main.c:1692
kernel_init+0x1f/0x1e0 init/main.c:1582
ret_from_fork+0x754/0xd80 arch/x86/kernel/process.c:158
ret_from_fork_asm+0x1a/0x30 arch/x86/entry/entry_64.S:245
-> #0 (fs_reclaim){+.+.}-{0:0}:
check_prev_add kernel/locking/lockdep.c:3165 [inline]
check_prevs_add kernel/locking/lockdep.c:3284 [inline]
validate_chain kernel/locking/lockdep.c:3908 [inline]
__lock_acquire+0x14b8/0x2630 kernel/locking/lockdep.c:5237
lock_acquire kernel/locking/lockdep.c:5868 [inline]
lock_acquire+0x1cf/0x380 kernel/locking/lockdep.c:5825
__fs_reclaim_acquire mm/page_alloc.c:4348 [inline]
fs_reclaim_acquire+0xc4/0x100 mm/page_alloc.c:4362
might_alloc include/linux/sched/mm.h:317 [inline]
slab_pre_alloc_hook mm/slub.c:4489 [inline]
slab_alloc_node mm/slub.c:4843 [inline]
kmem_cache_alloc_node_noprof+0x53/0x6f0 mm/slub.c:4918
__alloc_skb+0x140/0x710 net/core/skbuff.c:702
alloc_skb include/linux/skbuff.h:1383 [inline]
tcp_send_active_reset+0x8b/0xa60 net/ipv4/tcp_output.c:3862
__tcp_close+0x41e/0x1110 net/ipv4/tcp.c:3223
tcp_close+0x28/0x110 net/ipv4/tcp.c:3350
inet_release+0xed/0x200 net/ipv4/af_inet.c:443
inet6_release+0x4f/0x70 net/ipv6/af_inet6.c:479
__sock_release+0xb3/0x260 net/socket.c:662
sock_close+0x1c/0x30 net/socket.c:1455
__fput+0x3ff/0xb40 fs/file_table.c:469
task_work_run+0x150/0x240 kernel/task_work.c:233
resume_user_mode_work include/linux/resume_user_mode.h:50 [inline]
__exit_to_user_mode_loop kernel/entry/common.c:67 [inline]
exit_to_user_mode_loop+0x100/0x4a0 kernel/entry/common.c:98
__exit_to_user_mode_prepare include/linux/irq-entry-common.h:226 [inline]
syscall_exit_to_user_mode_prepare include/linux/irq-entry-common.h:256 [inline]
syscall_exit_to_user_mode include/linux/entry-common.h:325 [inline]
do_syscall_64+0x67c/0xf80 arch/x86/entry/syscall_64.c:100
entry_SYSCALL_64_after_hwframe+0x77/0x7f
other info that might help us debug this:
Chain exists of:
fs_reclaim --> &nsock->tx_lock --> sk_lock-AF_INET6
Possible unsafe locking scenario:
CPU0 CPU1
---- ----
lock(sk_lock-AF_INET6);
lock(&nsock->tx_lock);
lock(sk_lock-AF_INET6);
lock(fs_reclaim);
*** DEADLOCK ***
Fixes: fd8383f ("nbd: convert to blkmq")
Reported-by: [email protected]
Closes: https://lore.kernel.org/netdev/[email protected]/
Signed-off-by: Kuniyuki Iwashima <[email protected]>
read_mapping_folio include/linux/pagemap.h:1028 [inline]
read_part_sector+0xd1/0x370 block/partitions/core.c:723
adfspart_check_ICS+0x93/0x910 block/partitions/acorn.c:360
check_partition block/partitions/core.c:142 [inline]
blk_add_partitions block/partitions/core.c:590 [inline]
bdev_disk_changed+0x7f8/0xc80 block/partitions/core.c:694
blkdev_get_whole+0x187/0x290 block/bdev.c:764
bdev_open+0x2c7/0xe40 block/bdev.c:973
blkdev_open+0x34e/0x4f0 block/fops.c:697
do_dentry_open+0x6d8/0x1660 fs/open.c:949
vfs_open+0x82/0x3f0 fs/open.c:1081
do_open fs/namei.c:4671 [inline]
path_openat+0x208c/0x31a0 fs/namei.c:4830
do_file_open+0x20e/0x430 fs/namei.c:4859
do_sys_openat2+0x10d/0x1e0 fs/open.c:1366
do_sys_open fs/open.c:1372 [inline]
__do_sys_openat fs/open.c:1388 [inline]
__se_sys_openat fs/open.c:1383 [inline]
__x64_sys_openat+0x12d/0x210 fs/open.c:1383
do_syscall_x64 arch/x86/entry/syscall_64.c:63 [inline]
do_syscall_64+0x106/0xf80 arch/x86/entry/syscall_64.c:94
entry_SYSCALL_64_after_hwframe+0x77/0x7f
-> #3 (set->srcu){.+.+}-{0:0}:
srcu_lock_sync include/linux/srcu.h:199 [inline]
__synchronize_srcu+0xa1/0x2a0 kernel/rcu/srcutree.c:1505
blk_mq_wait_quiesce_done block/blk-mq.c:284 [inline]
blk_mq_wait_quiesce_done block/blk-mq.c:281 [inline]
blk_mq_quiesce_queue block/blk-mq.c:304 [inline]
blk_mq_quiesce_queue+0x149/0x1c0 block/blk-mq.c:299
elevator_switch+0x17b/0x7e0 block/elevator.c:576
elevator_change+0x352/0x530 block/elevator.c:681
elevator_set_default+0x29e/0x360 block/elevator.c:754
blk_register_queue+0x412/0x590 block/blk-sysfs.c:946
__add_disk+0x73f/0xe40 block/genhd.c:528
add_disk_fwnode+0x118/0x5c0 block/genhd.c:597
add_disk include/linux/blkdev.h:785 [inline]
nbd_dev_add+0x77a/0xb10 drivers/block/nbd.c:1984
nbd_init+0x291/0x2b0 drivers/block/nbd.c:2692
do_one_initcall+0x11d/0x760 init/main.c:1382
do_initcall_level init/main.c:1444 [inline]
do_initcalls init/main.c:1460 [inline]
do_basic_setup init/main.c:1479 [inline]
kernel_init_freeable+0x6e5/0x7a0 init/main.c:1692
kernel_init+0x1f/0x1e0 init/main.c:1582
ret_from_fork+0x754/0xd80 arch/x86/kernel/process.c:158
ret_from_fork_asm+0x1a/0x30 arch/x86/entry/entry_64.S:245
-> #2 (&q->elevator_lock){+.+.}-{4:4}:
__mutex_lock_common kernel/locking/mutex.c:614 [inline]
__mutex_lock+0x1a2/0x1b90 kernel/locking/mutex.c:776
elevator_change+0x1bc/0x530 block/elevator.c:679
elevator_set_none+0x92/0xf0 block/elevator.c:769
blk_mq_elv_switch_none block/blk-mq.c:5110 [inline]
__blk_mq_update_nr_hw_queues block/blk-mq.c:5155 [inline]
blk_mq_update_nr_hw_queues+0x4c1/0x15f0 block/blk-mq.c:5220
nbd_start_device+0x1a6/0xbd0 drivers/block/nbd.c:1489
nbd_genl_connect+0xff2/0x1a40 drivers/block/nbd.c:2239
genl_family_rcv_msg_doit+0x214/0x300 net/netlink/genetlink.c:1114
genl_family_rcv_msg net/netlink/genetlink.c:1194 [inline]
genl_rcv_msg+0x560/0x800 net/netlink/genetlink.c:1209
netlink_rcv_skb+0x159/0x420 net/netlink/af_netlink.c:2550
genl_rcv+0x28/0x40 net/netlink/genetlink.c:1218
netlink_unicast_kernel net/netlink/af_netlink.c:1318 [inline]
netlink_unicast+0x5aa/0x870 net/netlink/af_netlink.c:1344
netlink_sendmsg+0x8b0/0xda0 net/netlink/af_netlink.c:1894
sock_sendmsg_nosec net/socket.c:727 [inline]
__sock_sendmsg net/socket.c:742 [inline]
____sys_sendmsg+0x9e1/0xb70 net/socket.c:2592
___sys_sendmsg+0x190/0x1e0 net/socket.c:2646
__sys_sendmsg+0x170/0x220 net/socket.c:2678
do_syscall_x64 arch/x86/entry/syscall_64.c:63 [inline]
do_syscall_64+0x106/0xf80 arch/x86/entry/syscall_64.c:94
entry_SYSCALL_64_after_hwframe+0x77/0x7f
-> #1 (&q->q_usage_counter(io)#49){++++}-{0:0}:
blk_alloc_queue+0x610/0x790 block/blk-core.c:461
blk_mq_alloc_queue+0x174/0x290 block/blk-mq.c:4429
__blk_mq_alloc_disk+0x29/0x120 block/blk-mq.c:4476
nbd_dev_add+0x492/0xb10 drivers/block/nbd.c:1954
nbd_init+0x291/0x2b0 drivers/block/nbd.c:2692
do_one_initcall+0x11d/0x760 init/main.c:1382
do_initcall_level init/main.c:1444 [inline]
do_initcalls init/main.c:1460 [inline]
do_basic_setup init/main.c:1479 [inline]
kernel_init_freeable+0x6e5/0x7a0 init/main.c:1692
kernel_init+0x1f/0x1e0 init/main.c:1582
ret_from_fork+0x754/0xd80 arch/x86/kernel/process.c:158
ret_from_fork_asm+0x1a/0x30 arch/x86/entry/entry_64.S:245
-> #0 (fs_reclaim){+.+.}-{0:0}:
check_prev_add kernel/locking/lockdep.c:3165 [inline]
check_prevs_add kernel/locking/lockdep.c:3284 [inline]
validate_chain kernel/locking/lockdep.c:3908 [inline]
__lock_acquire+0x14b8/0x2630 kernel/locking/lockdep.c:5237
lock_acquire kernel/locking/lockdep.c:5868 [inline]
lock_acquire+0x1cf/0x380 kernel/locking/lockdep.c:5825
__fs_reclaim_acquire mm/page_alloc.c:4348 [inline]
fs_reclaim_acquire+0xc4/0x100 mm/page_alloc.c:4362
might_alloc include/linux/sched/mm.h:317 [inline]
slab_pre_alloc_hook mm/slub.c:4489 [inline]
slab_alloc_node mm/slub.c:4843 [inline]
kmem_cache_alloc_node_noprof+0x53/0x6f0 mm/slub.c:4918
__alloc_skb+0x140/0x710 net/core/skbuff.c:702
alloc_skb include/linux/skbuff.h:1383 [inline]
tcp_send_active_reset+0x8b/0xa60 net/ipv4/tcp_output.c:3862
__tcp_close+0x41e/0x1110 net/ipv4/tcp.c:3223
tcp_close+0x28/0x110 net/ipv4/tcp.c:3350
inet_release+0xed/0x200 net/ipv4/af_inet.c:443
inet6_release+0x4f/0x70 net/ipv6/af_inet6.c:479
__sock_release+0xb3/0x260 net/socket.c:662
sock_close+0x1c/0x30 net/socket.c:1455
__fput+0x3ff/0xb40 fs/file_table.c:469
task_work_run+0x150/0x240 kernel/task_work.c:233
resume_user_mode_work include/linux/resume_user_mode.h:50 [inline]
__exit_to_user_mode_loop kernel/entry/common.c:67 [inline]
exit_to_user_mode_loop+0x100/0x4a0 kernel/entry/common.c:98
__exit_to_user_mode_prepare include/linux/irq-entry-common.h:226 [inline]
syscall_exit_to_user_mode_prepare include/linux/irq-entry-common.h:256 [inline]
syscall_exit_to_user_mode include/linux/entry-common.h:325 [inline]
do_syscall_64+0x67c/0xf80 arch/x86/entry/syscall_64.c:100
entry_SYSCALL_64_after_hwframe+0x77/0x7f
other info that might help us debug this:
Chain exists of:
fs_reclaim --> &nsock->tx_lock --> sk_lock-AF_INET6
Possible unsafe locking scenario:
CPU0 CPU1
---- ----
lock(sk_lock-AF_INET6);
lock(&nsock->tx_lock);
lock(sk_lock-AF_INET6);
lock(fs_reclaim);
*** DEADLOCK ***
Fixes: fd8383f ("nbd: convert to blkmq")
Reported-by: [email protected]
Closes: https://lore.kernel.org/netdev/[email protected]/
Signed-off-by: Kuniyuki Iwashima <[email protected]>
As reported by syzbot [0], NBD can trigger a deadlock during
memory reclaim.
This occurs when a process holds lock_sock() on a backend TCP
socket and triggers a memory allocation that enters fs reclaim.
If reclaim then calls back into NBD to send data or shut down the
socket, NBD attempts to acquire the same lock_sock(), resulting
in a deadlock.
While NBD sets sk->sk_allocation to GFP_NOIO before calling
sendmsg(), this does not prevent the issue in some paths where
GFP_KERNEL is used directly under lock_sock().
To resolve this, let's use lock_sock_try() for TCP sendmsg() and
shutdown().
For sock_sendmsg() on the I/O path, if lock_sock_try() fails,
-ERESTARTSYS is returned, allowing the request to be retried
later (via the was_interrupted() logic).
For the sock_sendmsg() issued for NBD_CMD_DISC and for
kernel_sock_shutdown(), the operation may be skipped if the lock
cannot be acquired.
However, this is not expected to occur in practice because the
backend TCP socket should not be touched by userspace once it is
handed over to NBD.
Note that sock_recvmsg() does not require this special handling
because it is only called from the workqueue context.
Also note that AF_UNIX sockets continue to use sock_sendmsg()
and kernel_sock_shutdown() because unix_stream_sendmsg() and
unix_shutdown() do not acquire lock_sock().
[0]:
WARNING: possible circular locking dependency detected
syzkaller #0 Tainted: G L
syz.7.2282/12353 is trying to acquire lock:
ffffffff8e9aa700 (fs_reclaim){+.+.}-{0:0}, at: might_alloc include/linux/sched/mm.h:317 [inline]
ffffffff8e9aa700 (fs_reclaim){+.+.}-{0:0}, at: slab_pre_alloc_hook mm/slub.c:4489 [inline]
ffffffff8e9aa700 (fs_reclaim){+.+.}-{0:0}, at: slab_alloc_node mm/slub.c:4843 [inline]
ffffffff8e9aa700 (fs_reclaim){+.+.}-{0:0}, at: kmem_cache_alloc_node_noprof+0x53/0x6f0 mm/slub.c:4918
but task is already holding lock:
ffff88806f972a20 (sk_lock-AF_INET6){+.+.}-{0:0}, at: lock_sock include/net/sock.h:1709 [inline]
ffff88806f972a20 (sk_lock-AF_INET6){+.+.}-{0:0}, at: tcp_close+0x1d/0x110 net/ipv4/tcp.c:3349
which lock already depends on the new lock.
the existing dependency chain (in reverse order) is:
-> #6 (sk_lock-AF_INET6){+.+.}-{0:0}:
lock_sock_nested+0x41/0xf0 net/core/sock.c:3780
lock_sock include/net/sock.h:1709 [inline]
inet_shutdown+0x67/0x410 net/ipv4/af_inet.c:919
nbd_mark_nsock_dead+0xae/0x5c0 drivers/block/nbd.c:318
sock_shutdown+0x16b/0x200 drivers/block/nbd.c:411
nbd_clear_sock drivers/block/nbd.c:1427 [inline]
nbd_config_put+0x1eb/0x750 drivers/block/nbd.c:1451
nbd_genl_connect+0xaf8/0x1a40 drivers/block/nbd.c:2248
genl_family_rcv_msg_doit+0x214/0x300 net/netlink/genetlink.c:1114
genl_family_rcv_msg net/netlink/genetlink.c:1194 [inline]
genl_rcv_msg+0x560/0x800 net/netlink/genetlink.c:1209
netlink_rcv_skb+0x159/0x420 net/netlink/af_netlink.c:2550
genl_rcv+0x28/0x40 net/netlink/genetlink.c:1218
netlink_unicast_kernel net/netlink/af_netlink.c:1318 [inline]
netlink_unicast+0x5aa/0x870 net/netlink/af_netlink.c:1344
netlink_sendmsg+0x8b0/0xda0 net/netlink/af_netlink.c:1894
sock_sendmsg_nosec net/socket.c:727 [inline]
__sock_sendmsg net/socket.c:742 [inline]
____sys_sendmsg+0x9e1/0xb70 net/socket.c:2592
___sys_sendmsg+0x190/0x1e0 net/socket.c:2646
__sys_sendmsg+0x170/0x220 net/socket.c:2678
do_syscall_x64 arch/x86/entry/syscall_64.c:63 [inline]
do_syscall_64+0x106/0xf80 arch/x86/entry/syscall_64.c:94
entry_SYSCALL_64_after_hwframe+0x77/0x7f
-> #5 (&nsock->tx_lock){+.+.}-{4:4}:
__mutex_lock_common kernel/locking/mutex.c:614 [inline]
__mutex_lock+0x1a2/0x1b90 kernel/locking/mutex.c:776
nbd_handle_cmd drivers/block/nbd.c:1143 [inline]
nbd_queue_rq+0x428/0x1080 drivers/block/nbd.c:1207
blk_mq_dispatch_rq_list+0x422/0x1e70 block/blk-mq.c:2148
__blk_mq_do_dispatch_sched block/blk-mq-sched.c:168 [inline]
blk_mq_do_dispatch_sched block/blk-mq-sched.c:182 [inline]
__blk_mq_sched_dispatch_requests+0xcea/0x1620 block/blk-mq-sched.c:307
blk_mq_sched_dispatch_requests+0xd7/0x1c0 block/blk-mq-sched.c:329
blk_mq_run_hw_queue+0x23c/0x670 block/blk-mq.c:2386
blk_mq_dispatch_list+0x51d/0x1360 block/blk-mq.c:2949
blk_mq_flush_plug_list block/blk-mq.c:2997 [inline]
blk_mq_flush_plug_list+0x130/0x600 block/blk-mq.c:2969
__blk_flush_plug+0x2c4/0x4b0 block/blk-core.c:1230
blk_finish_plug block/blk-core.c:1257 [inline]
__submit_bio+0x584/0x6c0 block/blk-core.c:649
__submit_bio_noacct_mq block/blk-core.c:722 [inline]
submit_bio_noacct_nocheck+0x562/0xc10 block/blk-core.c:753
submit_bio_noacct+0xd17/0x2010 block/blk-core.c:884
blk_crypto_submit_bio include/linux/blk-crypto.h:203 [inline]
submit_bh_wbc+0x59c/0x770 fs/buffer.c:2821
submit_bh fs/buffer.c:2826 [inline]
block_read_full_folio+0x264/0x8e0 fs/buffer.c:2444
filemap_read_folio+0xfc/0x3b0 mm/filemap.c:2501
do_read_cache_folio+0x2d7/0x6b0 mm/filemap.c:4101
read_mapping_folio include/linux/pagemap.h:1028 [inline]
read_part_sector+0xd1/0x370 block/partitions/core.c:723
adfspart_check_ICS+0x93/0x910 block/partitions/acorn.c:360
check_partition block/partitions/core.c:142 [inline]
blk_add_partitions block/partitions/core.c:590 [inline]
bdev_disk_changed+0x7f8/0xc80 block/partitions/core.c:694
blkdev_get_whole+0x187/0x290 block/bdev.c:764
bdev_open+0x2c7/0xe40 block/bdev.c:973
blkdev_open+0x34e/0x4f0 block/fops.c:697
do_dentry_open+0x6d8/0x1660 fs/open.c:949
vfs_open+0x82/0x3f0 fs/open.c:1081
do_open fs/namei.c:4671 [inline]
path_openat+0x208c/0x31a0 fs/namei.c:4830
do_file_open+0x20e/0x430 fs/namei.c:4859
do_sys_openat2+0x10d/0x1e0 fs/open.c:1366
do_sys_open fs/open.c:1372 [inline]
__do_sys_openat fs/open.c:1388 [inline]
__se_sys_openat fs/open.c:1383 [inline]
__x64_sys_openat+0x12d/0x210 fs/open.c:1383
do_syscall_x64 arch/x86/entry/syscall_64.c:63 [inline]
do_syscall_64+0x106/0xf80 arch/x86/entry/syscall_64.c:94
entry_SYSCALL_64_after_hwframe+0x77/0x7f
-> #4 (&cmd->lock){+.+.}-{4:4}:
__mutex_lock_common kernel/locking/mutex.c:614 [inline]
__mutex_lock+0x1a2/0x1b90 kernel/locking/mutex.c:776
nbd_queue_rq+0xba/0x1080 drivers/block/nbd.c:1199
blk_mq_dispatch_rq_list+0x422/0x1e70 block/blk-mq.c:2148
__blk_mq_do_dispatch_sched block/blk-mq-sched.c:168 [inline]
blk_mq_do_dispatch_sched block/blk-mq-sched.c:182 [inline]
__blk_mq_sched_dispatch_requests+0xcea/0x1620 block/blk-mq-sched.c:307
blk_mq_sched_dispatch_requests+0xd7/0x1c0 block/blk-mq-sched.c:329
blk_mq_run_hw_queue+0x23c/0x670 block/blk-mq.c:2386
blk_mq_dispatch_list+0x51d/0x1360 block/blk-mq.c:2949
blk_mq_flush_plug_list block/blk-mq.c:2997 [inline]
blk_mq_flush_plug_list+0x130/0x600 block/blk-mq.c:2969
__blk_flush_plug+0x2c4/0x4b0 block/blk-core.c:1230
blk_finish_plug block/blk-core.c:1257 [inline]
__submit_bio+0x584/0x6c0 block/blk-core.c:649
__submit_bio_noacct_mq block/blk-core.c:722 [inline]
submit_bio_noacct_nocheck+0x562/0xc10 block/blk-core.c:753
submit_bio_noacct+0xd17/0x2010 block/blk-core.c:884
blk_crypto_submit_bio include/linux/blk-crypto.h:203 [inline]
submit_bh_wbc+0x59c/0x770 fs/buffer.c:2821
submit_bh fs/buffer.c:2826 [inline]
block_read_full_folio+0x264/0x8e0 fs/buffer.c:2444
filemap_read_folio+0xfc/0x3b0 mm/filemap.c:2501
do_read_cache_folio+0x2d7/0x6b0 mm/filemap.c:4101
read_mapping_folio include/linux/pagemap.h:1028 [inline]
read_part_sector+0xd1/0x370 block/partitions/core.c:723
adfspart_check_ICS+0x93/0x910 block/partitions/acorn.c:360
check_partition block/partitions/core.c:142 [inline]
blk_add_partitions block/partitions/core.c:590 [inline]
bdev_disk_changed+0x7f8/0xc80 block/partitions/core.c:694
blkdev_get_whole+0x187/0x290 block/bdev.c:764
bdev_open+0x2c7/0xe40 block/bdev.c:973
blkdev_open+0x34e/0x4f0 block/fops.c:697
do_dentry_open+0x6d8/0x1660 fs/open.c:949
vfs_open+0x82/0x3f0 fs/open.c:1081
do_open fs/namei.c:4671 [inline]
path_openat+0x208c/0x31a0 fs/namei.c:4830
do_file_open+0x20e/0x430 fs/namei.c:4859
do_sys_openat2+0x10d/0x1e0 fs/open.c:1366
do_sys_open fs/open.c:1372 [inline]
__do_sys_openat fs/open.c:1388 [inline]
__se_sys_openat fs/open.c:1383 [inline]
__x64_sys_openat+0x12d/0x210 fs/open.c:1383
do_syscall_x64 arch/x86/entry/syscall_64.c:63 [inline]
do_syscall_64+0x106/0xf80 arch/x86/entry/syscall_64.c:94
entry_SYSCALL_64_after_hwframe+0x77/0x7f
-> #3 (set->srcu){.+.+}-{0:0}:
srcu_lock_sync include/linux/srcu.h:199 [inline]
__synchronize_srcu+0xa1/0x2a0 kernel/rcu/srcutree.c:1505
blk_mq_wait_quiesce_done block/blk-mq.c:284 [inline]
blk_mq_wait_quiesce_done block/blk-mq.c:281 [inline]
blk_mq_quiesce_queue block/blk-mq.c:304 [inline]
blk_mq_quiesce_queue+0x149/0x1c0 block/blk-mq.c:299
elevator_switch+0x17b/0x7e0 block/elevator.c:576
elevator_change+0x352/0x530 block/elevator.c:681
elevator_set_default+0x29e/0x360 block/elevator.c:754
blk_register_queue+0x412/0x590 block/blk-sysfs.c:946
__add_disk+0x73f/0xe40 block/genhd.c:528
add_disk_fwnode+0x118/0x5c0 block/genhd.c:597
add_disk include/linux/blkdev.h:785 [inline]
nbd_dev_add+0x77a/0xb10 drivers/block/nbd.c:1984
nbd_init+0x291/0x2b0 drivers/block/nbd.c:2692
do_one_initcall+0x11d/0x760 init/main.c:1382
do_initcall_level init/main.c:1444 [inline]
do_initcalls init/main.c:1460 [inline]
do_basic_setup init/main.c:1479 [inline]
kernel_init_freeable+0x6e5/0x7a0 init/main.c:1692
kernel_init+0x1f/0x1e0 init/main.c:1582
ret_from_fork+0x754/0xd80 arch/x86/kernel/process.c:158
ret_from_fork_asm+0x1a/0x30 arch/x86/entry/entry_64.S:245
-> #2 (&q->elevator_lock){+.+.}-{4:4}:
__mutex_lock_common kernel/locking/mutex.c:614 [inline]
__mutex_lock+0x1a2/0x1b90 kernel/locking/mutex.c:776
elevator_change+0x1bc/0x530 block/elevator.c:679
elevator_set_none+0x92/0xf0 block/elevator.c:769
blk_mq_elv_switch_none block/blk-mq.c:5110 [inline]
__blk_mq_update_nr_hw_queues block/blk-mq.c:5155 [inline]
blk_mq_update_nr_hw_queues+0x4c1/0x15f0 block/blk-mq.c:5220
nbd_start_device+0x1a6/0xbd0 drivers/block/nbd.c:1489
nbd_genl_connect+0xff2/0x1a40 drivers/block/nbd.c:2239
genl_family_rcv_msg_doit+0x214/0x300 net/netlink/genetlink.c:1114
genl_family_rcv_msg net/netlink/genetlink.c:1194 [inline]
genl_rcv_msg+0x560/0x800 net/netlink/genetlink.c:1209
netlink_rcv_skb+0x159/0x420 net/netlink/af_netlink.c:2550
genl_rcv+0x28/0x40 net/netlink/genetlink.c:1218
netlink_unicast_kernel net/netlink/af_netlink.c:1318 [inline]
netlink_unicast+0x5aa/0x870 net/netlink/af_netlink.c:1344
netlink_sendmsg+0x8b0/0xda0 net/netlink/af_netlink.c:1894
sock_sendmsg_nosec net/socket.c:727 [inline]
__sock_sendmsg net/socket.c:742 [inline]
____sys_sendmsg+0x9e1/0xb70 net/socket.c:2592
___sys_sendmsg+0x190/0x1e0 net/socket.c:2646
__sys_sendmsg+0x170/0x220 net/socket.c:2678
do_syscall_x64 arch/x86/entry/syscall_64.c:63 [inline]
do_syscall_64+0x106/0xf80 arch/x86/entry/syscall_64.c:94
entry_SYSCALL_64_after_hwframe+0x77/0x7f
-> #1 (&q->q_usage_counter(io)#49){++++}-{0:0}:
blk_alloc_queue+0x610/0x790 block/blk-core.c:461
blk_mq_alloc_queue+0x174/0x290 block/blk-mq.c:4429
__blk_mq_alloc_disk+0x29/0x120 block/blk-mq.c:4476
nbd_dev_add+0x492/0xb10 drivers/block/nbd.c:1954
nbd_init+0x291/0x2b0 drivers/block/nbd.c:2692
do_one_initcall+0x11d/0x760 init/main.c:1382
do_initcall_level init/main.c:1444 [inline]
do_initcalls init/main.c:1460 [inline]
do_basic_setup init/main.c:1479 [inline]
kernel_init_freeable+0x6e5/0x7a0 init/main.c:1692
kernel_init+0x1f/0x1e0 init/main.c:1582
ret_from_fork+0x754/0xd80 arch/x86/kernel/process.c:158
ret_from_fork_asm+0x1a/0x30 arch/x86/entry/entry_64.S:245
-> #0 (fs_reclaim){+.+.}-{0:0}:
check_prev_add kernel/locking/lockdep.c:3165 [inline]
check_prevs_add kernel/locking/lockdep.c:3284 [inline]
validate_chain kernel/locking/lockdep.c:3908 [inline]
__lock_acquire+0x14b8/0x2630 kernel/locking/lockdep.c:5237
lock_acquire kernel/locking/lockdep.c:5868 [inline]
lock_acquire+0x1cf/0x380 kernel/locking/lockdep.c:5825
__fs_reclaim_acquire mm/page_alloc.c:4348 [inline]
fs_reclaim_acquire+0xc4/0x100 mm/page_alloc.c:4362
might_alloc include/linux/sched/mm.h:317 [inline]
slab_pre_alloc_hook mm/slub.c:4489 [inline]
slab_alloc_node mm/slub.c:4843 [inline]
kmem_cache_alloc_node_noprof+0x53/0x6f0 mm/slub.c:4918
__alloc_skb+0x140/0x710 net/core/skbuff.c:702
alloc_skb include/linux/skbuff.h:1383 [inline]
tcp_send_active_reset+0x8b/0xa60 net/ipv4/tcp_output.c:3862
__tcp_close+0x41e/0x1110 net/ipv4/tcp.c:3223
tcp_close+0x28/0x110 net/ipv4/tcp.c:3350
inet_release+0xed/0x200 net/ipv4/af_inet.c:443
inet6_release+0x4f/0x70 net/ipv6/af_inet6.c:479
__sock_release+0xb3/0x260 net/socket.c:662
sock_close+0x1c/0x30 net/socket.c:1455
__fput+0x3ff/0xb40 fs/file_table.c:469
task_work_run+0x150/0x240 kernel/task_work.c:233
resume_user_mode_work include/linux/resume_user_mode.h:50 [inline]
__exit_to_user_mode_loop kernel/entry/common.c:67 [inline]
exit_to_user_mode_loop+0x100/0x4a0 kernel/entry/common.c:98
__exit_to_user_mode_prepare include/linux/irq-entry-common.h:226 [inline]
syscall_exit_to_user_mode_prepare include/linux/irq-entry-common.h:256 [inline]
syscall_exit_to_user_mode include/linux/entry-common.h:325 [inline]
do_syscall_64+0x67c/0xf80 arch/x86/entry/syscall_64.c:100
entry_SYSCALL_64_after_hwframe+0x77/0x7f
other info that might help us debug this:
Chain exists of:
fs_reclaim --> &nsock->tx_lock --> sk_lock-AF_INET6
Possible unsafe locking scenario:
CPU0 CPU1
---- ----
lock(sk_lock-AF_INET6);
lock(&nsock->tx_lock);
lock(sk_lock-AF_INET6);
lock(fs_reclaim);
*** DEADLOCK ***
Fixes: fd8383f ("nbd: convert to blkmq")
Reported-by: [email protected]
Closes: https://lore.kernel.org/netdev/[email protected]/
Signed-off-by: Kuniyuki Iwashima <[email protected]>
__sock_release+0xb3/0x260 net/socket.c:662
sock_close+0x1c/0x30 net/socket.c:1455
__fput+0x3ff/0xb40 fs/file_table.c:469
task_work_run+0x150/0x240 kernel/task_work.c:233
resume_user_mode_work include/linux/resume_user_mode.h:50 [inline]
__exit_to_user_mode_loop kernel/entry/common.c:67 [inline]
exit_to_user_mode_loop+0x100/0x4a0 kernel/entry/common.c:98
__exit_to_user_mode_prepare include/linux/irq-entry-common.h:226 [inline]
syscall_exit_to_user_mode_prepare include/linux/irq-entry-common.h:256 [inline]
syscall_exit_to_user_mode include/linux/entry-common.h:325 [inline]
do_syscall_64+0x67c/0xf80 arch/x86/entry/syscall_64.c:100
entry_SYSCALL_64_after_hwframe+0x77/0x7f
other info that might help us debug this:
Chain exists of:
fs_reclaim --> &nsock->tx_lock --> sk_lock-AF_INET6
Possible unsafe locking scenario:

       CPU0                    CPU1
       ----                    ----
  lock(sk_lock-AF_INET6);
                               lock(&nsock->tx_lock);
                               lock(sk_lock-AF_INET6);
  lock(fs_reclaim);

 *** DEADLOCK ***
Fixes: fd8383f ("nbd: convert to blkmq")
Reported-by: [email protected]
Closes: https://lore.kernel.org/netdev/[email protected]/
Signed-off-by: Kuniyuki Iwashima <[email protected]>
blk_mq_flush_plug_list block/blk-mq.c:2997 [inline]
blk_mq_flush_plug_list+0x130/0x600 block/blk-mq.c:2969
__blk_flush_plug+0x2c4/0x4b0 block/blk-core.c:1230
blk_finish_plug block/blk-core.c:1257 [inline]
__submit_bio+0x584/0x6c0 block/blk-core.c:649
__submit_bio_noacct_mq block/blk-core.c:722 [inline]
submit_bio_noacct_nocheck+0x562/0xc10 block/blk-core.c:753
submit_bio_noacct+0xd17/0x2010 block/blk-core.c:884
blk_crypto_submit_bio include/linux/blk-crypto.h:203 [inline]
submit_bh_wbc+0x59c/0x770 fs/buffer.c:2821
submit_bh fs/buffer.c:2826 [inline]
block_read_full_folio+0x264/0x8e0 fs/buffer.c:2444
filemap_read_folio+0xfc/0x3b0 mm/filemap.c:2501
do_read_cache_folio+0x2d7/0x6b0 mm/filemap.c:4101
read_mapping_folio include/linux/pagemap.h:1028 [inline]
read_part_sector+0xd1/0x370 block/partitions/core.c:723
adfspart_check_ICS+0x93/0x910 block/partitions/acorn.c:360
check_partition block/partitions/core.c:142 [inline]
blk_add_partitions block/partitions/core.c:590 [inline]
bdev_disk_changed+0x7f8/0xc80 block/partitions/core.c:694
blkdev_get_whole+0x187/0x290 block/bdev.c:764
bdev_open+0x2c7/0xe40 block/bdev.c:973
blkdev_open+0x34e/0x4f0 block/fops.c:697
do_dentry_open+0x6d8/0x1660 fs/open.c:949
vfs_open+0x82/0x3f0 fs/open.c:1081
do_open fs/namei.c:4671 [inline]
path_openat+0x208c/0x31a0 fs/namei.c:4830
do_file_open+0x20e/0x430 fs/namei.c:4859
do_sys_openat2+0x10d/0x1e0 fs/open.c:1366
do_sys_open fs/open.c:1372 [inline]
__do_sys_openat fs/open.c:1388 [inline]
__se_sys_openat fs/open.c:1383 [inline]
__x64_sys_openat+0x12d/0x210 fs/open.c:1383
do_syscall_x64 arch/x86/entry/syscall_64.c:63 [inline]
do_syscall_64+0x106/0xf80 arch/x86/entry/syscall_64.c:94
entry_SYSCALL_64_after_hwframe+0x77/0x7f
-> #4 (&cmd->lock){+.+.}-{4:4}:
__mutex_lock_common kernel/locking/mutex.c:614 [inline]
__mutex_lock+0x1a2/0x1b90 kernel/locking/mutex.c:776
nbd_queue_rq+0xba/0x1080 drivers/block/nbd.c:1199
blk_mq_dispatch_rq_list+0x422/0x1e70 block/blk-mq.c:2148
__blk_mq_do_dispatch_sched block/blk-mq-sched.c:168 [inline]
blk_mq_do_dispatch_sched block/blk-mq-sched.c:182 [inline]
__blk_mq_sched_dispatch_requests+0xcea/0x1620 block/blk-mq-sched.c:307
blk_mq_sched_dispatch_requests+0xd7/0x1c0 block/blk-mq-sched.c:329
blk_mq_run_hw_queue+0x23c/0x670 block/blk-mq.c:2386
blk_mq_dispatch_list+0x51d/0x1360 block/blk-mq.c:2949
blk_mq_flush_plug_list block/blk-mq.c:2997 [inline]
blk_mq_flush_plug_list+0x130/0x600 block/blk-mq.c:2969
__blk_flush_plug+0x2c4/0x4b0 block/blk-core.c:1230
blk_finish_plug block/blk-core.c:1257 [inline]
__submit_bio+0x584/0x6c0 block/blk-core.c:649
__submit_bio_noacct_mq block/blk-core.c:722 [inline]
submit_bio_noacct_nocheck+0x562/0xc10 block/blk-core.c:753
submit_bio_noacct+0xd17/0x2010 block/blk-core.c:884
blk_crypto_submit_bio include/linux/blk-crypto.h:203 [inline]
submit_bh_wbc+0x59c/0x770 fs/buffer.c:2821
submit_bh fs/buffer.c:2826 [inline]
block_read_full_folio+0x264/0x8e0 fs/buffer.c:2444
filemap_read_folio+0xfc/0x3b0 mm/filemap.c:2501
do_read_cache_folio+0x2d7/0x6b0 mm/filemap.c:4101
read_mapping_folio include/linux/pagemap.h:1028 [inline]
read_part_sector+0xd1/0x370 block/partitions/core.c:723
adfspart_check_ICS+0x93/0x910 block/partitions/acorn.c:360
check_partition block/partitions/core.c:142 [inline]
blk_add_partitions block/partitions/core.c:590 [inline]
bdev_disk_changed+0x7f8/0xc80 block/partitions/core.c:694
blkdev_get_whole+0x187/0x290 block/bdev.c:764
bdev_open+0x2c7/0xe40 block/bdev.c:973
blkdev_open+0x34e/0x4f0 block/fops.c:697
do_dentry_open+0x6d8/0x1660 fs/open.c:949
vfs_open+0x82/0x3f0 fs/open.c:1081
do_open fs/namei.c:4671 [inline]
path_openat+0x208c/0x31a0 fs/namei.c:4830
do_file_open+0x20e/0x430 fs/namei.c:4859
do_sys_openat2+0x10d/0x1e0 fs/open.c:1366
do_sys_open fs/open.c:1372 [inline]
__do_sys_openat fs/open.c:1388 [inline]
__se_sys_openat fs/open.c:1383 [inline]
__x64_sys_openat+0x12d/0x210 fs/open.c:1383
do_syscall_x64 arch/x86/entry/syscall_64.c:63 [inline]
do_syscall_64+0x106/0xf80 arch/x86/entry/syscall_64.c:94
entry_SYSCALL_64_after_hwframe+0x77/0x7f
-> #3 (set->srcu){.+.+}-{0:0}:
srcu_lock_sync include/linux/srcu.h:199 [inline]
__synchronize_srcu+0xa1/0x2a0 kernel/rcu/srcutree.c:1505
blk_mq_wait_quiesce_done block/blk-mq.c:284 [inline]
blk_mq_wait_quiesce_done block/blk-mq.c:281 [inline]
blk_mq_quiesce_queue block/blk-mq.c:304 [inline]
blk_mq_quiesce_queue+0x149/0x1c0 block/blk-mq.c:299
elevator_switch+0x17b/0x7e0 block/elevator.c:576
elevator_change+0x352/0x530 block/elevator.c:681
elevator_set_default+0x29e/0x360 block/elevator.c:754
blk_register_queue+0x412/0x590 block/blk-sysfs.c:946
__add_disk+0x73f/0xe40 block/genhd.c:528
add_disk_fwnode+0x118/0x5c0 block/genhd.c:597
add_disk include/linux/blkdev.h:785 [inline]
nbd_dev_add+0x77a/0xb10 drivers/block/nbd.c:1984
nbd_init+0x291/0x2b0 drivers/block/nbd.c:2692
do_one_initcall+0x11d/0x760 init/main.c:1382
do_initcall_level init/main.c:1444 [inline]
do_initcalls init/main.c:1460 [inline]
do_basic_setup init/main.c:1479 [inline]
kernel_init_freeable+0x6e5/0x7a0 init/main.c:1692
kernel_init+0x1f/0x1e0 init/main.c:1582
ret_from_fork+0x754/0xd80 arch/x86/kernel/process.c:158
ret_from_fork_asm+0x1a/0x30 arch/x86/entry/entry_64.S:245
-> #2 (&q->elevator_lock){+.+.}-{4:4}:
__mutex_lock_common kernel/locking/mutex.c:614 [inline]
__mutex_lock+0x1a2/0x1b90 kernel/locking/mutex.c:776
elevator_change+0x1bc/0x530 block/elevator.c:679
elevator_set_none+0x92/0xf0 block/elevator.c:769
blk_mq_elv_switch_none block/blk-mq.c:5110 [inline]
__blk_mq_update_nr_hw_queues block/blk-mq.c:5155 [inline]
blk_mq_update_nr_hw_queues+0x4c1/0x15f0 block/blk-mq.c:5220
nbd_start_device+0x1a6/0xbd0 drivers/block/nbd.c:1489
nbd_genl_connect+0xff2/0x1a40 drivers/block/nbd.c:2239
genl_family_rcv_msg_doit+0x214/0x300 net/netlink/genetlink.c:1114
genl_family_rcv_msg net/netlink/genetlink.c:1194 [inline]
genl_rcv_msg+0x560/0x800 net/netlink/genetlink.c:1209
netlink_rcv_skb+0x159/0x420 net/netlink/af_netlink.c:2550
genl_rcv+0x28/0x40 net/netlink/genetlink.c:1218
netlink_unicast_kernel net/netlink/af_netlink.c:1318 [inline]
netlink_unicast+0x5aa/0x870 net/netlink/af_netlink.c:1344
netlink_sendmsg+0x8b0/0xda0 net/netlink/af_netlink.c:1894
sock_sendmsg_nosec net/socket.c:727 [inline]
__sock_sendmsg net/socket.c:742 [inline]
____sys_sendmsg+0x9e1/0xb70 net/socket.c:2592
___sys_sendmsg+0x190/0x1e0 net/socket.c:2646
__sys_sendmsg+0x170/0x220 net/socket.c:2678
do_syscall_x64 arch/x86/entry/syscall_64.c:63 [inline]
do_syscall_64+0x106/0xf80 arch/x86/entry/syscall_64.c:94
entry_SYSCALL_64_after_hwframe+0x77/0x7f
-> #1 (&q->q_usage_counter(io)#49){++++}-{0:0}:
blk_alloc_queue+0x610/0x790 block/blk-core.c:461
blk_mq_alloc_queue+0x174/0x290 block/blk-mq.c:4429
__blk_mq_alloc_disk+0x29/0x120 block/blk-mq.c:4476
nbd_dev_add+0x492/0xb10 drivers/block/nbd.c:1954
nbd_init+0x291/0x2b0 drivers/block/nbd.c:2692
do_one_initcall+0x11d/0x760 init/main.c:1382
do_initcall_level init/main.c:1444 [inline]
do_initcalls init/main.c:1460 [inline]
do_basic_setup init/main.c:1479 [inline]
kernel_init_freeable+0x6e5/0x7a0 init/main.c:1692
kernel_init+0x1f/0x1e0 init/main.c:1582
ret_from_fork+0x754/0xd80 arch/x86/kernel/process.c:158
ret_from_fork_asm+0x1a/0x30 arch/x86/entry/entry_64.S:245
-> #0 (fs_reclaim){+.+.}-{0:0}:
check_prev_add kernel/locking/lockdep.c:3165 [inline]
check_prevs_add kernel/locking/lockdep.c:3284 [inline]
validate_chain kernel/locking/lockdep.c:3908 [inline]
__lock_acquire+0x14b8/0x2630 kernel/locking/lockdep.c:5237
lock_acquire kernel/locking/lockdep.c:5868 [inline]
lock_acquire+0x1cf/0x380 kernel/locking/lockdep.c:5825
__fs_reclaim_acquire mm/page_alloc.c:4348 [inline]
fs_reclaim_acquire+0xc4/0x100 mm/page_alloc.c:4362
might_alloc include/linux/sched/mm.h:317 [inline]
slab_pre_alloc_hook mm/slub.c:4489 [inline]
slab_alloc_node mm/slub.c:4843 [inline]
kmem_cache_alloc_node_noprof+0x53/0x6f0 mm/slub.c:4918
__alloc_skb+0x140/0x710 net/core/skbuff.c:702
alloc_skb include/linux/skbuff.h:1383 [inline]
tcp_send_active_reset+0x8b/0xa60 net/ipv4/tcp_output.c:3862
__tcp_close+0x41e/0x1110 net/ipv4/tcp.c:3223
tcp_close+0x28/0x110 net/ipv4/tcp.c:3350
inet_release+0xed/0x200 net/ipv4/af_inet.c:443
inet6_release+0x4f/0x70 net/ipv6/af_inet6.c:479
__sock_release+0xb3/0x260 net/socket.c:662
sock_close+0x1c/0x30 net/socket.c:1455
__fput+0x3ff/0xb40 fs/file_table.c:469
task_work_run+0x150/0x240 kernel/task_work.c:233
resume_user_mode_work include/linux/resume_user_mode.h:50 [inline]
__exit_to_user_mode_loop kernel/entry/common.c:67 [inline]
exit_to_user_mode_loop+0x100/0x4a0 kernel/entry/common.c:98
__exit_to_user_mode_prepare include/linux/irq-entry-common.h:226 [inline]
syscall_exit_to_user_mode_prepare include/linux/irq-entry-common.h:256 [inline]
syscall_exit_to_user_mode include/linux/entry-common.h:325 [inline]
do_syscall_64+0x67c/0xf80 arch/x86/entry/syscall_64.c:100
entry_SYSCALL_64_after_hwframe+0x77/0x7f
other info that might help us debug this:
Chain exists of:
fs_reclaim --> &nsock->tx_lock --> sk_lock-AF_INET6
Possible unsafe locking scenario:
CPU0 CPU1
---- ----
lock(sk_lock-AF_INET6);
lock(&nsock->tx_lock);
lock(sk_lock-AF_INET6);
lock(fs_reclaim);
*** DEADLOCK ***
Fixes: fd8383f ("nbd: convert to blkmq")
Reported-by: [email protected]
Closes: https://lore.kernel.org/netdev/[email protected]/
Signed-off-by: Kuniyuki Iwashima <[email protected]>
bond_xmit_broadcast() reuses the original skb for the last slave
(determined by bond_is_last_slave()) and clones it for others.
Concurrent slave enslave/release can mutate the slave list during
RCU-protected iteration, changing which slave is "last" mid-loop.
This causes the original skb to be double-consumed (double-freed).
Replace the racy bond_is_last_slave() check with a simple index
comparison (i + 1 == slaves_count) against the pre-snapshot slave
count taken via READ_ONCE() before the loop. This preserves the
zero-copy optimization for the last slave while making the "last"
determination stable against concurrent list mutations.
The UAF can trigger the following crash:
==================================================================
BUG: KASAN: slab-use-after-free in skb_clone
Read of size 8 at addr ffff888100ef8d40 by task exploit/147
CPU: 1 UID: 0 PID: 147 Comm: exploit Not tainted 7.0.0-rc3+ #4 PREEMPTLAZY
Call Trace:
<TASK>
dump_stack_lvl (lib/dump_stack.c:123)
print_report (mm/kasan/report.c:379 mm/kasan/report.c:482)
kasan_report (mm/kasan/report.c:597)
skb_clone (include/linux/skbuff.h:1724 include/linux/skbuff.h:1792 include/linux/skbuff.h:3396 net/core/skbuff.c:2108)
bond_xmit_broadcast (drivers/net/bonding/bond_main.c:5334)
bond_start_xmit (drivers/net/bonding/bond_main.c:5567 drivers/net/bonding/bond_main.c:5593)
dev_hard_start_xmit (include/linux/netdevice.h:5325 include/linux/netdevice.h:5334 net/core/dev.c:3871 net/core/dev.c:3887)
__dev_queue_xmit (include/linux/netdevice.h:3601 net/core/dev.c:4838)
ip6_finish_output2 (include/net/neighbour.h:540 include/net/neighbour.h:554 net/ipv6/ip6_output.c:136)
ip6_finish_output (net/ipv6/ip6_output.c:208 net/ipv6/ip6_output.c:219)
ip6_output (net/ipv6/ip6_output.c:250)
ip6_send_skb (net/ipv6/ip6_output.c:1985)
udp_v6_send_skb (net/ipv6/udp.c:1442)
udpv6_sendmsg (net/ipv6/udp.c:1733)
__sys_sendto (net/socket.c:730 net/socket.c:742 net/socket.c:2206)
__x64_sys_sendto (net/socket.c:2209)
do_syscall_64 (arch/x86/entry/syscall_64.c:63 arch/x86/entry/syscall_64.c:94)
entry_SYSCALL_64_after_hwframe (arch/x86/entry/entry_64.S:130)
</TASK>
Allocated by task 147:
Freed by task 147:
The buggy address belongs to the object at ffff888100ef8c80
which belongs to the cache skbuff_head_cache of size 224
The buggy address is located 192 bytes inside of
freed 224-byte region [ffff888100ef8c80, ffff888100ef8d60)
Memory state around the buggy address:
ffff888100ef8c00: fb fb fb fb fc fc fc fc fc fc fc fc fc fc fc fc
ffff888100ef8c80: fa fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb
>ffff888100ef8d00: fb fb fb fb fb fb fb fb fb fb fb fb fc fc fc fc
^
ffff888100ef8d80: fc fc fc fc fc fc fc fc fa fb fb fb fb fb fb fb
ffff888100ef8e00: fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb
==================================================================
Fixes: 4e5bd03 ("net: bonding: fix bond_xmit_broadcast return value error bug")
Reported-by: Weiming Shi <[email protected]>
Signed-off-by: Xiang Mei <[email protected]>
Link: https://patch.msgid.link/[email protected]
Signed-off-by: Paolo Abeni <[email protected]>
__blk_mq_sched_dispatch_requests+0xcea/0x1620 block/blk-mq-sched.c:307
blk_mq_sched_dispatch_requests+0xd7/0x1c0 block/blk-mq-sched.c:329
blk_mq_run_hw_queue+0x23c/0x670 block/blk-mq.c:2386
blk_mq_dispatch_list+0x51d/0x1360 block/blk-mq.c:2949
blk_mq_flush_plug_list block/blk-mq.c:2997 [inline]
blk_mq_flush_plug_list+0x130/0x600 block/blk-mq.c:2969
__blk_flush_plug+0x2c4/0x4b0 block/blk-core.c:1230
blk_finish_plug block/blk-core.c:1257 [inline]
__submit_bio+0x584/0x6c0 block/blk-core.c:649
__submit_bio_noacct_mq block/blk-core.c:722 [inline]
submit_bio_noacct_nocheck+0x562/0xc10 block/blk-core.c:753
submit_bio_noacct+0xd17/0x2010 block/blk-core.c:884
blk_crypto_submit_bio include/linux/blk-crypto.h:203 [inline]
submit_bh_wbc+0x59c/0x770 fs/buffer.c:2821
submit_bh fs/buffer.c:2826 [inline]
block_read_full_folio+0x264/0x8e0 fs/buffer.c:2444
filemap_read_folio+0xfc/0x3b0 mm/filemap.c:2501
do_read_cache_folio+0x2d7/0x6b0 mm/filemap.c:4101
read_mapping_folio include/linux/pagemap.h:1028 [inline]
read_part_sector+0xd1/0x370 block/partitions/core.c:723
adfspart_check_ICS+0x93/0x910 block/partitions/acorn.c:360
check_partition block/partitions/core.c:142 [inline]
blk_add_partitions block/partitions/core.c:590 [inline]
bdev_disk_changed+0x7f8/0xc80 block/partitions/core.c:694
blkdev_get_whole+0x187/0x290 block/bdev.c:764
bdev_open+0x2c7/0xe40 block/bdev.c:973
blkdev_open+0x34e/0x4f0 block/fops.c:697
do_dentry_open+0x6d8/0x1660 fs/open.c:949
vfs_open+0x82/0x3f0 fs/open.c:1081
do_open fs/namei.c:4671 [inline]
path_openat+0x208c/0x31a0 fs/namei.c:4830
do_file_open+0x20e/0x430 fs/namei.c:4859
do_sys_openat2+0x10d/0x1e0 fs/open.c:1366
do_sys_open fs/open.c:1372 [inline]
__do_sys_openat fs/open.c:1388 [inline]
__se_sys_openat fs/open.c:1383 [inline]
__x64_sys_openat+0x12d/0x210 fs/open.c:1383
do_syscall_x64 arch/x86/entry/syscall_64.c:63 [inline]
do_syscall_64+0x106/0xf80 arch/x86/entry/syscall_64.c:94
entry_SYSCALL_64_after_hwframe+0x77/0x7f
-> #3 (set->srcu){.+.+}-{0:0}:
srcu_lock_sync include/linux/srcu.h:199 [inline]
__synchronize_srcu+0xa1/0x2a0 kernel/rcu/srcutree.c:1505
blk_mq_wait_quiesce_done block/blk-mq.c:284 [inline]
blk_mq_wait_quiesce_done block/blk-mq.c:281 [inline]
blk_mq_quiesce_queue block/blk-mq.c:304 [inline]
blk_mq_quiesce_queue+0x149/0x1c0 block/blk-mq.c:299
elevator_switch+0x17b/0x7e0 block/elevator.c:576
elevator_change+0x352/0x530 block/elevator.c:681
elevator_set_default+0x29e/0x360 block/elevator.c:754
blk_register_queue+0x412/0x590 block/blk-sysfs.c:946
__add_disk+0x73f/0xe40 block/genhd.c:528
add_disk_fwnode+0x118/0x5c0 block/genhd.c:597
add_disk include/linux/blkdev.h:785 [inline]
nbd_dev_add+0x77a/0xb10 drivers/block/nbd.c:1984
nbd_init+0x291/0x2b0 drivers/block/nbd.c:2692
do_one_initcall+0x11d/0x760 init/main.c:1382
do_initcall_level init/main.c:1444 [inline]
do_initcalls init/main.c:1460 [inline]
do_basic_setup init/main.c:1479 [inline]
kernel_init_freeable+0x6e5/0x7a0 init/main.c:1692
kernel_init+0x1f/0x1e0 init/main.c:1582
ret_from_fork+0x754/0xd80 arch/x86/kernel/process.c:158
ret_from_fork_asm+0x1a/0x30 arch/x86/entry/entry_64.S:245
-> #2 (&q->elevator_lock){+.+.}-{4:4}:
__mutex_lock_common kernel/locking/mutex.c:614 [inline]
__mutex_lock+0x1a2/0x1b90 kernel/locking/mutex.c:776
elevator_change+0x1bc/0x530 block/elevator.c:679
elevator_set_none+0x92/0xf0 block/elevator.c:769
blk_mq_elv_switch_none block/blk-mq.c:5110 [inline]
__blk_mq_update_nr_hw_queues block/blk-mq.c:5155 [inline]
blk_mq_update_nr_hw_queues+0x4c1/0x15f0 block/blk-mq.c:5220
nbd_start_device+0x1a6/0xbd0 drivers/block/nbd.c:1489
nbd_genl_connect+0xff2/0x1a40 drivers/block/nbd.c:2239
genl_family_rcv_msg_doit+0x214/0x300 net/netlink/genetlink.c:1114
genl_family_rcv_msg net/netlink/genetlink.c:1194 [inline]
genl_rcv_msg+0x560/0x800 net/netlink/genetlink.c:1209
netlink_rcv_skb+0x159/0x420 net/netlink/af_netlink.c:2550
genl_rcv+0x28/0x40 net/netlink/genetlink.c:1218
netlink_unicast_kernel net/netlink/af_netlink.c:1318 [inline]
netlink_unicast+0x5aa/0x870 net/netlink/af_netlink.c:1344
netlink_sendmsg+0x8b0/0xda0 net/netlink/af_netlink.c:1894
sock_sendmsg_nosec net/socket.c:727 [inline]
__sock_sendmsg net/socket.c:742 [inline]
____sys_sendmsg+0x9e1/0xb70 net/socket.c:2592
___sys_sendmsg+0x190/0x1e0 net/socket.c:2646
__sys_sendmsg+0x170/0x220 net/socket.c:2678
do_syscall_x64 arch/x86/entry/syscall_64.c:63 [inline]
do_syscall_64+0x106/0xf80 arch/x86/entry/syscall_64.c:94
entry_SYSCALL_64_after_hwframe+0x77/0x7f
-> #1 (&q->q_usage_counter(io)#49){++++}-{0:0}:
blk_alloc_queue+0x610/0x790 block/blk-core.c:461
blk_mq_alloc_queue+0x174/0x290 block/blk-mq.c:4429
__blk_mq_alloc_disk+0x29/0x120 block/blk-mq.c:4476
nbd_dev_add+0x492/0xb10 drivers/block/nbd.c:1954
nbd_init+0x291/0x2b0 drivers/block/nbd.c:2692
do_one_initcall+0x11d/0x760 init/main.c:1382
do_initcall_level init/main.c:1444 [inline]
do_initcalls init/main.c:1460 [inline]
do_basic_setup init/main.c:1479 [inline]
kernel_init_freeable+0x6e5/0x7a0 init/main.c:1692
kernel_init+0x1f/0x1e0 init/main.c:1582
ret_from_fork+0x754/0xd80 arch/x86/kernel/process.c:158
ret_from_fork_asm+0x1a/0x30 arch/x86/entry/entry_64.S:245
-> #0 (fs_reclaim){+.+.}-{0:0}:
check_prev_add kernel/locking/lockdep.c:3165 [inline]
check_prevs_add kernel/locking/lockdep.c:3284 [inline]
validate_chain kernel/locking/lockdep.c:3908 [inline]
__lock_acquire+0x14b8/0x2630 kernel/locking/lockdep.c:5237
lock_acquire kernel/locking/lockdep.c:5868 [inline]
lock_acquire+0x1cf/0x380 kernel/locking/lockdep.c:5825
__fs_reclaim_acquire mm/page_alloc.c:4348 [inline]
fs_reclaim_acquire+0xc4/0x100 mm/page_alloc.c:4362
might_alloc include/linux/sched/mm.h:317 [inline]
slab_pre_alloc_hook mm/slub.c:4489 [inline]
slab_alloc_node mm/slub.c:4843 [inline]
kmem_cache_alloc_node_noprof+0x53/0x6f0 mm/slub.c:4918
__alloc_skb+0x140/0x710 net/core/skbuff.c:702
alloc_skb include/linux/skbuff.h:1383 [inline]
tcp_send_active_reset+0x8b/0xa60 net/ipv4/tcp_output.c:3862
__tcp_close+0x41e/0x1110 net/ipv4/tcp.c:3223
tcp_close+0x28/0x110 net/ipv4/tcp.c:3350
inet_release+0xed/0x200 net/ipv4/af_inet.c:443
inet6_release+0x4f/0x70 net/ipv6/af_inet6.c:479
__sock_release+0xb3/0x260 net/socket.c:662
sock_close+0x1c/0x30 net/socket.c:1455
__fput+0x3ff/0xb40 fs/file_table.c:469
task_work_run+0x150/0x240 kernel/task_work.c:233
resume_user_mode_work include/linux/resume_user_mode.h:50 [inline]
__exit_to_user_mode_loop kernel/entry/common.c:67 [inline]
exit_to_user_mode_loop+0x100/0x4a0 kernel/entry/common.c:98
__exit_to_user_mode_prepare include/linux/irq-entry-common.h:226 [inline]
syscall_exit_to_user_mode_prepare include/linux/irq-entry-common.h:256 [inline]
syscall_exit_to_user_mode include/linux/entry-common.h:325 [inline]
do_syscall_64+0x67c/0xf80 arch/x86/entry/syscall_64.c:100
entry_SYSCALL_64_after_hwframe+0x77/0x7f
other info that might help us debug this:
Chain exists of:
fs_reclaim --> &nsock->tx_lock --> sk_lock-AF_INET6
Possible unsafe locking scenario:
CPU0 CPU1
---- ----
lock(sk_lock-AF_INET6);
lock(&nsock->tx_lock);
lock(sk_lock-AF_INET6);
lock(fs_reclaim);
*** DEADLOCK ***
Fixes: fd8383f ("nbd: convert to blkmq")
Reported-by: [email protected]
Closes: https://lore.kernel.org/netdev/[email protected]/
Signed-off-by: Kuniyuki Iwashima <[email protected]>
As reported by syzbot [0], NBD can trigger a deadlock during
memory reclaim.
This occurs when a process holds lock_sock() on a backend TCP
socket and triggers a memory allocation that leads to fs reclaim.
If it eventually calls into NBD to send data or shut down the
socket, NBD will attempt to acquire the same lock_sock(),
resulting in the deadlock.
While NBD sets sk->sk_allocation to GFP_NOIO before calling
sendmsg(), this does not prevent the issue in some paths where
GFP_KERNEL is used directly under lock_sock().
To resolve this, let's use lock_sock_try() for TCP sendmsg() and
shutdown().
For sock_sendmsg(), if lock_sock_try() fails, -ERESTARTSYS is
returned, allowing the request to be retried later (e.g., via
was_interrupted() logic).
For the sock_sendmsg() call for NBD_CMD_DISC and for
kernel_sock_shutdown(), the operation may be skipped if the lock
cannot be acquired. However, this is not expected to occur in
practice because userspace should not touch the backend TCP socket
once it has been handed over to NBD.
Note that sock_recvmsg() does not require this special handling
because it is only called from the workqueue context.
Also note that AF_UNIX sockets continue to use sock_sendmsg()
and kernel_sock_shutdown() because unix_stream_sendmsg() and
unix_shutdown() do not acquire lock_sock().
[0]:
WARNING: possible circular locking dependency detected
syzkaller #0 Tainted: G L
syz.7.2282/12353 is trying to acquire lock:
ffffffff8e9aa700 (fs_reclaim){+.+.}-{0:0}, at: might_alloc include/linux/sched/mm.h:317 [inline]
ffffffff8e9aa700 (fs_reclaim){+.+.}-{0:0}, at: slab_pre_alloc_hook mm/slub.c:4489 [inline]
ffffffff8e9aa700 (fs_reclaim){+.+.}-{0:0}, at: slab_alloc_node mm/slub.c:4843 [inline]
ffffffff8e9aa700 (fs_reclaim){+.+.}-{0:0}, at: kmem_cache_alloc_node_noprof+0x53/0x6f0 mm/slub.c:4918
but task is already holding lock:
ffff88806f972a20 (sk_lock-AF_INET6){+.+.}-{0:0}, at: lock_sock include/net/sock.h:1709 [inline]
ffff88806f972a20 (sk_lock-AF_INET6){+.+.}-{0:0}, at: tcp_close+0x1d/0x110 net/ipv4/tcp.c:3349
which lock already depends on the new lock.
the existing dependency chain (in reverse order) is:
-> #6 (sk_lock-AF_INET6){+.+.}-{0:0}:
lock_sock_nested+0x41/0xf0 net/core/sock.c:3780
lock_sock include/net/sock.h:1709 [inline]
inet_shutdown+0x67/0x410 net/ipv4/af_inet.c:919
nbd_mark_nsock_dead+0xae/0x5c0 drivers/block/nbd.c:318
sock_shutdown+0x16b/0x200 drivers/block/nbd.c:411
nbd_clear_sock drivers/block/nbd.c:1427 [inline]
nbd_config_put+0x1eb/0x750 drivers/block/nbd.c:1451
nbd_genl_connect+0xaf8/0x1a40 drivers/block/nbd.c:2248
genl_family_rcv_msg_doit+0x214/0x300 net/netlink/genetlink.c:1114
genl_family_rcv_msg net/netlink/genetlink.c:1194 [inline]
genl_rcv_msg+0x560/0x800 net/netlink/genetlink.c:1209
netlink_rcv_skb+0x159/0x420 net/netlink/af_netlink.c:2550
genl_rcv+0x28/0x40 net/netlink/genetlink.c:1218
netlink_unicast_kernel net/netlink/af_netlink.c:1318 [inline]
netlink_unicast+0x5aa/0x870 net/netlink/af_netlink.c:1344
netlink_sendmsg+0x8b0/0xda0 net/netlink/af_netlink.c:1894
sock_sendmsg_nosec net/socket.c:727 [inline]
__sock_sendmsg net/socket.c:742 [inline]
____sys_sendmsg+0x9e1/0xb70 net/socket.c:2592
___sys_sendmsg+0x190/0x1e0 net/socket.c:2646
__sys_sendmsg+0x170/0x220 net/socket.c:2678
do_syscall_x64 arch/x86/entry/syscall_64.c:63 [inline]
do_syscall_64+0x106/0xf80 arch/x86/entry/syscall_64.c:94
entry_SYSCALL_64_after_hwframe+0x77/0x7f
-> #5 (&nsock->tx_lock){+.+.}-{4:4}:
__mutex_lock_common kernel/locking/mutex.c:614 [inline]
__mutex_lock+0x1a2/0x1b90 kernel/locking/mutex.c:776
nbd_handle_cmd drivers/block/nbd.c:1143 [inline]
nbd_queue_rq+0x428/0x1080 drivers/block/nbd.c:1207
blk_mq_dispatch_rq_list+0x422/0x1e70 block/blk-mq.c:2148
__blk_mq_do_dispatch_sched block/blk-mq-sched.c:168 [inline]
blk_mq_do_dispatch_sched block/blk-mq-sched.c:182 [inline]
__blk_mq_sched_dispatch_requests+0xcea/0x1620 block/blk-mq-sched.c:307
blk_mq_sched_dispatch_requests+0xd7/0x1c0 block/blk-mq-sched.c:329
blk_mq_run_hw_queue+0x23c/0x670 block/blk-mq.c:2386
blk_mq_dispatch_list+0x51d/0x1360 block/blk-mq.c:2949
blk_mq_flush_plug_list block/blk-mq.c:2997 [inline]
blk_mq_flush_plug_list+0x130/0x600 block/blk-mq.c:2969
__blk_flush_plug+0x2c4/0x4b0 block/blk-core.c:1230
blk_finish_plug block/blk-core.c:1257 [inline]
__submit_bio+0x584/0x6c0 block/blk-core.c:649
__submit_bio_noacct_mq block/blk-core.c:722 [inline]
submit_bio_noacct_nocheck+0x562/0xc10 block/blk-core.c:753
submit_bio_noacct+0xd17/0x2010 block/blk-core.c:884
blk_crypto_submit_bio include/linux/blk-crypto.h:203 [inline]
submit_bh_wbc+0x59c/0x770 fs/buffer.c:2821
submit_bh fs/buffer.c:2826 [inline]
block_read_full_folio+0x264/0x8e0 fs/buffer.c:2444
filemap_read_folio+0xfc/0x3b0 mm/filemap.c:2501
do_read_cache_folio+0x2d7/0x6b0 mm/filemap.c:4101
read_mapping_folio include/linux/pagemap.h:1028 [inline]
read_part_sector+0xd1/0x370 block/partitions/core.c:723
adfspart_check_ICS+0x93/0x910 block/partitions/acorn.c:360
check_partition block/partitions/core.c:142 [inline]
blk_add_partitions block/partitions/core.c:590 [inline]
bdev_disk_changed+0x7f8/0xc80 block/partitions/core.c:694
blkdev_get_whole+0x187/0x290 block/bdev.c:764
bdev_open+0x2c7/0xe40 block/bdev.c:973
blkdev_open+0x34e/0x4f0 block/fops.c:697
do_dentry_open+0x6d8/0x1660 fs/open.c:949
vfs_open+0x82/0x3f0 fs/open.c:1081
do_open fs/namei.c:4671 [inline]
path_openat+0x208c/0x31a0 fs/namei.c:4830
do_file_open+0x20e/0x430 fs/namei.c:4859
do_sys_openat2+0x10d/0x1e0 fs/open.c:1366
do_sys_open fs/open.c:1372 [inline]
__do_sys_openat fs/open.c:1388 [inline]
__se_sys_openat fs/open.c:1383 [inline]
__x64_sys_openat+0x12d/0x210 fs/open.c:1383
do_syscall_x64 arch/x86/entry/syscall_64.c:63 [inline]
do_syscall_64+0x106/0xf80 arch/x86/entry/syscall_64.c:94
entry_SYSCALL_64_after_hwframe+0x77/0x7f
-> #4 (&cmd->lock){+.+.}-{4:4}:
__mutex_lock_common kernel/locking/mutex.c:614 [inline]
__mutex_lock+0x1a2/0x1b90 kernel/locking/mutex.c:776
nbd_queue_rq+0xba/0x1080 drivers/block/nbd.c:1199
blk_mq_dispatch_rq_list+0x422/0x1e70 block/blk-mq.c:2148
__blk_mq_do_dispatch_sched block/blk-mq-sched.c:168 [inline]
blk_mq_do_dispatch_sched block/blk-mq-sched.c:182 [inline]
__blk_mq_sched_dispatch_requests+0xcea/0x1620 block/blk-mq-sched.c:307
blk_mq_sched_dispatch_requests+0xd7/0x1c0 block/blk-mq-sched.c:329
blk_mq_run_hw_queue+0x23c/0x670 block/blk-mq.c:2386
blk_mq_dispatch_list+0x51d/0x1360 block/blk-mq.c:2949
blk_mq_flush_plug_list block/blk-mq.c:2997 [inline]
blk_mq_flush_plug_list+0x130/0x600 block/blk-mq.c:2969
__blk_flush_plug+0x2c4/0x4b0 block/blk-core.c:1230
blk_finish_plug block/blk-core.c:1257 [inline]
__submit_bio+0x584/0x6c0 block/blk-core.c:649
__submit_bio_noacct_mq block/blk-core.c:722 [inline]
submit_bio_noacct_nocheck+0x562/0xc10 block/blk-core.c:753
submit_bio_noacct+0xd17/0x2010 block/blk-core.c:884
blk_crypto_submit_bio include/linux/blk-crypto.h:203 [inline]
submit_bh_wbc+0x59c/0x770 fs/buffer.c:2821
submit_bh fs/buffer.c:2826 [inline]
block_read_full_folio+0x264/0x8e0 fs/buffer.c:2444
filemap_read_folio+0xfc/0x3b0 mm/filemap.c:2501
do_read_cache_folio+0x2d7/0x6b0 mm/filemap.c:4101
read_mapping_folio include/linux/pagemap.h:1028 [inline]
read_part_sector+0xd1/0x370 block/partitions/core.c:723
adfspart_check_ICS+0x93/0x910 block/partitions/acorn.c:360
check_partition block/partitions/core.c:142 [inline]
blk_add_partitions block/partitions/core.c:590 [inline]
bdev_disk_changed+0x7f8/0xc80 block/partitions/core.c:694
blkdev_get_whole+0x187/0x290 block/bdev.c:764
bdev_open+0x2c7/0xe40 block/bdev.c:973
blkdev_open+0x34e/0x4f0 block/fops.c:697
do_dentry_open+0x6d8/0x1660 fs/open.c:949
vfs_open+0x82/0x3f0 fs/open.c:1081
do_open fs/namei.c:4671 [inline]
path_openat+0x208c/0x31a0 fs/namei.c:4830
do_file_open+0x20e/0x430 fs/namei.c:4859
do_sys_openat2+0x10d/0x1e0 fs/open.c:1366
do_sys_open fs/open.c:1372 [inline]
__do_sys_openat fs/open.c:1388 [inline]
__se_sys_openat fs/open.c:1383 [inline]
__x64_sys_openat+0x12d/0x210 fs/open.c:1383
do_syscall_x64 arch/x86/entry/syscall_64.c:63 [inline]
do_syscall_64+0x106/0xf80 arch/x86/entry/syscall_64.c:94
entry_SYSCALL_64_after_hwframe+0x77/0x7f
-> #3 (set->srcu){.+.+}-{0:0}:
srcu_lock_sync include/linux/srcu.h:199 [inline]
__synchronize_srcu+0xa1/0x2a0 kernel/rcu/srcutree.c:1505
blk_mq_wait_quiesce_done block/blk-mq.c:284 [inline]
blk_mq_wait_quiesce_done block/blk-mq.c:281 [inline]
blk_mq_quiesce_queue block/blk-mq.c:304 [inline]
blk_mq_quiesce_queue+0x149/0x1c0 block/blk-mq.c:299
elevator_switch+0x17b/0x7e0 block/elevator.c:576
elevator_change+0x352/0x530 block/elevator.c:681
elevator_set_default+0x29e/0x360 block/elevator.c:754
blk_register_queue+0x412/0x590 block/blk-sysfs.c:946
__add_disk+0x73f/0xe40 block/genhd.c:528
add_disk_fwnode+0x118/0x5c0 block/genhd.c:597
add_disk include/linux/blkdev.h:785 [inline]
nbd_dev_add+0x77a/0xb10 drivers/block/nbd.c:1984
nbd_init+0x291/0x2b0 drivers/block/nbd.c:2692
do_one_initcall+0x11d/0x760 init/main.c:1382
do_initcall_level init/main.c:1444 [inline]
do_initcalls init/main.c:1460 [inline]
do_basic_setup init/main.c:1479 [inline]
kernel_init_freeable+0x6e5/0x7a0 init/main.c:1692
kernel_init+0x1f/0x1e0 init/main.c:1582
ret_from_fork+0x754/0xd80 arch/x86/kernel/process.c:158
ret_from_fork_asm+0x1a/0x30 arch/x86/entry/entry_64.S:245
-> #2 (&q->elevator_lock){+.+.}-{4:4}:
__mutex_lock_common kernel/locking/mutex.c:614 [inline]
__mutex_lock+0x1a2/0x1b90 kernel/locking/mutex.c:776
elevator_change+0x1bc/0x530 block/elevator.c:679
elevator_set_none+0x92/0xf0 block/elevator.c:769
blk_mq_elv_switch_none block/blk-mq.c:5110 [inline]
__blk_mq_update_nr_hw_queues block/blk-mq.c:5155 [inline]
blk_mq_update_nr_hw_queues+0x4c1/0x15f0 block/blk-mq.c:5220
nbd_start_device+0x1a6/0xbd0 drivers/block/nbd.c:1489
nbd_genl_connect+0xff2/0x1a40 drivers/block/nbd.c:2239
genl_family_rcv_msg_doit+0x214/0x300 net/netlink/genetlink.c:1114
genl_family_rcv_msg net/netlink/genetlink.c:1194 [inline]
genl_rcv_msg+0x560/0x800 net/netlink/genetlink.c:1209
netlink_rcv_skb+0x159/0x420 net/netlink/af_netlink.c:2550
genl_rcv+0x28/0x40 net/netlink/genetlink.c:1218
netlink_unicast_kernel net/netlink/af_netlink.c:1318 [inline]
netlink_unicast+0x5aa/0x870 net/netlink/af_netlink.c:1344
netlink_sendmsg+0x8b0/0xda0 net/netlink/af_netlink.c:1894
sock_sendmsg_nosec net/socket.c:727 [inline]
__sock_sendmsg net/socket.c:742 [inline]
____sys_sendmsg+0x9e1/0xb70 net/socket.c:2592
___sys_sendmsg+0x190/0x1e0 net/socket.c:2646
__sys_sendmsg+0x170/0x220 net/socket.c:2678
do_syscall_x64 arch/x86/entry/syscall_64.c:63 [inline]
do_syscall_64+0x106/0xf80 arch/x86/entry/syscall_64.c:94
entry_SYSCALL_64_after_hwframe+0x77/0x7f
-> #1 (&q->q_usage_counter(io)#49){++++}-{0:0}:
blk_alloc_queue+0x610/0x790 block/blk-core.c:461
blk_mq_alloc_queue+0x174/0x290 block/blk-mq.c:4429
__blk_mq_alloc_disk+0x29/0x120 block/blk-mq.c:4476
nbd_dev_add+0x492/0xb10 drivers/block/nbd.c:1954
nbd_init+0x291/0x2b0 drivers/block/nbd.c:2692
do_one_initcall+0x11d/0x760 init/main.c:1382
do_initcall_level init/main.c:1444 [inline]
do_initcalls init/main.c:1460 [inline]
do_basic_setup init/main.c:1479 [inline]
kernel_init_freeable+0x6e5/0x7a0 init/main.c:1692
kernel_init+0x1f/0x1e0 init/main.c:1582
ret_from_fork+0x754/0xd80 arch/x86/kernel/process.c:158
ret_from_fork_asm+0x1a/0x30 arch/x86/entry/entry_64.S:245
-> #0 (fs_reclaim){+.+.}-{0:0}:
check_prev_add kernel/locking/lockdep.c:3165 [inline]
check_prevs_add kernel/locking/lockdep.c:3284 [inline]
validate_chain kernel/locking/lockdep.c:3908 [inline]
__lock_acquire+0x14b8/0x2630 kernel/locking/lockdep.c:5237
lock_acquire kernel/locking/lockdep.c:5868 [inline]
lock_acquire+0x1cf/0x380 kernel/locking/lockdep.c:5825
__fs_reclaim_acquire mm/page_alloc.c:4348 [inline]
fs_reclaim_acquire+0xc4/0x100 mm/page_alloc.c:4362
might_alloc include/linux/sched/mm.h:317 [inline]
slab_pre_alloc_hook mm/slub.c:4489 [inline]
slab_alloc_node mm/slub.c:4843 [inline]
kmem_cache_alloc_node_noprof+0x53/0x6f0 mm/slub.c:4918
__alloc_skb+0x140/0x710 net/core/skbuff.c:702
alloc_skb include/linux/skbuff.h:1383 [inline]
tcp_send_active_reset+0x8b/0xa60 net/ipv4/tcp_output.c:3862
__tcp_close+0x41e/0x1110 net/ipv4/tcp.c:3223
tcp_close+0x28/0x110 net/ipv4/tcp.c:3350
inet_release+0xed/0x200 net/ipv4/af_inet.c:443
inet6_release+0x4f/0x70 net/ipv6/af_inet6.c:479
__sock_release+0xb3/0x260 net/socket.c:662
sock_close+0x1c/0x30 net/socket.c:1455
__fput+0x3ff/0xb40 fs/file_table.c:469
task_work_run+0x150/0x240 kernel/task_work.c:233
resume_user_mode_work include/linux/resume_user_mode.h:50 [inline]
__exit_to_user_mode_loop kernel/entry/common.c:67 [inline]
exit_to_user_mode_loop+0x100/0x4a0 kernel/entry/common.c:98
__exit_to_user_mode_prepare include/linux/irq-entry-common.h:226 [inline]
syscall_exit_to_user_mode_prepare include/linux/irq-entry-common.h:256 [inline]
syscall_exit_to_user_mode include/linux/entry-common.h:325 [inline]
do_syscall_64+0x67c/0xf80 arch/x86/entry/syscall_64.c:100
entry_SYSCALL_64_after_hwframe+0x77/0x7f
other info that might help us debug this:
Chain exists of:
fs_reclaim --> &nsock->tx_lock --> sk_lock-AF_INET6
Possible unsafe locking scenario:
CPU0 CPU1
---- ----
lock(sk_lock-AF_INET6);
lock(&nsock->tx_lock);
lock(sk_lock-AF_INET6);
lock(fs_reclaim);
*** DEADLOCK ***
Fixes: fd8383f ("nbd: convert to blkmq")
Reported-by: [email protected]
Closes: https://lore.kernel.org/netdev/[email protected]/
Signed-off-by: Kuniyuki Iwashima <[email protected]>
check_prev_add kernel/locking/lockdep.c:3165 [inline]
check_prevs_add kernel/locking/lockdep.c:3284 [inline]
validate_chain kernel/locking/lockdep.c:3908 [inline]
__lock_acquire+0x14b8/0x2630 kernel/locking/lockdep.c:5237
lock_acquire kernel/locking/lockdep.c:5868 [inline]
lock_acquire+0x1cf/0x380 kernel/locking/lockdep.c:5825
__fs_reclaim_acquire mm/page_alloc.c:4348 [inline]
fs_reclaim_acquire+0xc4/0x100 mm/page_alloc.c:4362
might_alloc include/linux/sched/mm.h:317 [inline]
slab_pre_alloc_hook mm/slub.c:4489 [inline]
slab_alloc_node mm/slub.c:4843 [inline]
kmem_cache_alloc_node_noprof+0x53/0x6f0 mm/slub.c:4918
__alloc_skb+0x140/0x710 net/core/skbuff.c:702
alloc_skb include/linux/skbuff.h:1383 [inline]
tcp_send_active_reset+0x8b/0xa60 net/ipv4/tcp_output.c:3862
__tcp_close+0x41e/0x1110 net/ipv4/tcp.c:3223
tcp_close+0x28/0x110 net/ipv4/tcp.c:3350
inet_release+0xed/0x200 net/ipv4/af_inet.c:443
inet6_release+0x4f/0x70 net/ipv6/af_inet6.c:479
__sock_release+0xb3/0x260 net/socket.c:662
sock_close+0x1c/0x30 net/socket.c:1455
__fput+0x3ff/0xb40 fs/file_table.c:469
task_work_run+0x150/0x240 kernel/task_work.c:233
resume_user_mode_work include/linux/resume_user_mode.h:50 [inline]
__exit_to_user_mode_loop kernel/entry/common.c:67 [inline]
exit_to_user_mode_loop+0x100/0x4a0 kernel/entry/common.c:98
__exit_to_user_mode_prepare include/linux/irq-entry-common.h:226 [inline]
syscall_exit_to_user_mode_prepare include/linux/irq-entry-common.h:256 [inline]
syscall_exit_to_user_mode include/linux/entry-common.h:325 [inline]
do_syscall_64+0x67c/0xf80 arch/x86/entry/syscall_64.c:100
entry_SYSCALL_64_after_hwframe+0x77/0x7f
other info that might help us debug this:
Chain exists of:
fs_reclaim --> &nsock->tx_lock --> sk_lock-AF_INET6
Possible unsafe locking scenario:
CPU0 CPU1
---- ----
lock(sk_lock-AF_INET6);
lock(&nsock->tx_lock);
lock(sk_lock-AF_INET6);
lock(fs_reclaim);
*** DEADLOCK ***
Fixes: fd8383f ("nbd: convert to blkmq")
Reported-by: [email protected]
Closes: https://lore.kernel.org/netdev/[email protected]/
Signed-off-by: Kuniyuki Iwashima <[email protected]>
____sys_sendmsg+0x9e1/0xb70 net/socket.c:2592
___sys_sendmsg+0x190/0x1e0 net/socket.c:2646
__sys_sendmsg+0x170/0x220 net/socket.c:2678
do_syscall_x64 arch/x86/entry/syscall_64.c:63 [inline]
do_syscall_64+0x106/0xf80 arch/x86/entry/syscall_64.c:94
entry_SYSCALL_64_after_hwframe+0x77/0x7f
-> #5 (&nsock->tx_lock){+.+.}-{4:4}:
__mutex_lock_common kernel/locking/mutex.c:614 [inline]
__mutex_lock+0x1a2/0x1b90 kernel/locking/mutex.c:776
nbd_handle_cmd drivers/block/nbd.c:1143 [inline]
nbd_queue_rq+0x428/0x1080 drivers/block/nbd.c:1207
blk_mq_dispatch_rq_list+0x422/0x1e70 block/blk-mq.c:2148
__blk_mq_do_dispatch_sched block/blk-mq-sched.c:168 [inline]
blk_mq_do_dispatch_sched block/blk-mq-sched.c:182 [inline]
__blk_mq_sched_dispatch_requests+0xcea/0x1620 block/blk-mq-sched.c:307
blk_mq_sched_dispatch_requests+0xd7/0x1c0 block/blk-mq-sched.c:329
blk_mq_run_hw_queue+0x23c/0x670 block/blk-mq.c:2386
blk_mq_dispatch_list+0x51d/0x1360 block/blk-mq.c:2949
blk_mq_flush_plug_list block/blk-mq.c:2997 [inline]
blk_mq_flush_plug_list+0x130/0x600 block/blk-mq.c:2969
__blk_flush_plug+0x2c4/0x4b0 block/blk-core.c:1230
blk_finish_plug block/blk-core.c:1257 [inline]
__submit_bio+0x584/0x6c0 block/blk-core.c:649
__submit_bio_noacct_mq block/blk-core.c:722 [inline]
submit_bio_noacct_nocheck+0x562/0xc10 block/blk-core.c:753
submit_bio_noacct+0xd17/0x2010 block/blk-core.c:884
blk_crypto_submit_bio include/linux/blk-crypto.h:203 [inline]
submit_bh_wbc+0x59c/0x770 fs/buffer.c:2821
submit_bh fs/buffer.c:2826 [inline]
block_read_full_folio+0x264/0x8e0 fs/buffer.c:2444
filemap_read_folio+0xfc/0x3b0 mm/filemap.c:2501
do_read_cache_folio+0x2d7/0x6b0 mm/filemap.c:4101
read_mapping_folio include/linux/pagemap.h:1028 [inline]
read_part_sector+0xd1/0x370 block/partitions/core.c:723
adfspart_check_ICS+0x93/0x910 block/partitions/acorn.c:360
check_partition block/partitions/core.c:142 [inline]
blk_add_partitions block/partitions/core.c:590 [inline]
bdev_disk_changed+0x7f8/0xc80 block/partitions/core.c:694
blkdev_get_whole+0x187/0x290 block/bdev.c:764
bdev_open+0x2c7/0xe40 block/bdev.c:973
blkdev_open+0x34e/0x4f0 block/fops.c:697
do_dentry_open+0x6d8/0x1660 fs/open.c:949
vfs_open+0x82/0x3f0 fs/open.c:1081
do_open fs/namei.c:4671 [inline]
path_openat+0x208c/0x31a0 fs/namei.c:4830
do_file_open+0x20e/0x430 fs/namei.c:4859
do_sys_openat2+0x10d/0x1e0 fs/open.c:1366
do_sys_open fs/open.c:1372 [inline]
__do_sys_openat fs/open.c:1388 [inline]
__se_sys_openat fs/open.c:1383 [inline]
__x64_sys_openat+0x12d/0x210 fs/open.c:1383
do_syscall_x64 arch/x86/entry/syscall_64.c:63 [inline]
do_syscall_64+0x106/0xf80 arch/x86/entry/syscall_64.c:94
entry_SYSCALL_64_after_hwframe+0x77/0x7f
-> #4 (&cmd->lock){+.+.}-{4:4}:
__mutex_lock_common kernel/locking/mutex.c:614 [inline]
__mutex_lock+0x1a2/0x1b90 kernel/locking/mutex.c:776
nbd_queue_rq+0xba/0x1080 drivers/block/nbd.c:1199
blk_mq_dispatch_rq_list+0x422/0x1e70 block/blk-mq.c:2148
__blk_mq_do_dispatch_sched block/blk-mq-sched.c:168 [inline]
blk_mq_do_dispatch_sched block/blk-mq-sched.c:182 [inline]
__blk_mq_sched_dispatch_requests+0xcea/0x1620 block/blk-mq-sched.c:307
blk_mq_sched_dispatch_requests+0xd7/0x1c0 block/blk-mq-sched.c:329
blk_mq_run_hw_queue+0x23c/0x670 block/blk-mq.c:2386
blk_mq_dispatch_list+0x51d/0x1360 block/blk-mq.c:2949
blk_mq_flush_plug_list block/blk-mq.c:2997 [inline]
blk_mq_flush_plug_list+0x130/0x600 block/blk-mq.c:2969
__blk_flush_plug+0x2c4/0x4b0 block/blk-core.c:1230
blk_finish_plug block/blk-core.c:1257 [inline]
__submit_bio+0x584/0x6c0 block/blk-core.c:649
__submit_bio_noacct_mq block/blk-core.c:722 [inline]
submit_bio_noacct_nocheck+0x562/0xc10 block/blk-core.c:753
submit_bio_noacct+0xd17/0x2010 block/blk-core.c:884
blk_crypto_submit_bio include/linux/blk-crypto.h:203 [inline]
submit_bh_wbc+0x59c/0x770 fs/buffer.c:2821
submit_bh fs/buffer.c:2826 [inline]
block_read_full_folio+0x264/0x8e0 fs/buffer.c:2444
filemap_read_folio+0xfc/0x3b0 mm/filemap.c:2501
do_read_cache_folio+0x2d7/0x6b0 mm/filemap.c:4101
read_mapping_folio include/linux/pagemap.h:1028 [inline]
read_part_sector+0xd1/0x370 block/partitions/core.c:723
adfspart_check_ICS+0x93/0x910 block/partitions/acorn.c:360
check_partition block/partitions/core.c:142 [inline]
blk_add_partitions block/partitions/core.c:590 [inline]
bdev_disk_changed+0x7f8/0xc80 block/partitions/core.c:694
blkdev_get_whole+0x187/0x290 block/bdev.c:764
bdev_open+0x2c7/0xe40 block/bdev.c:973
blkdev_open+0x34e/0x4f0 block/fops.c:697
do_dentry_open+0x6d8/0x1660 fs/open.c:949
vfs_open+0x82/0x3f0 fs/open.c:1081
do_open fs/namei.c:4671 [inline]
path_openat+0x208c/0x31a0 fs/namei.c:4830
do_file_open+0x20e/0x430 fs/namei.c:4859
do_sys_openat2+0x10d/0x1e0 fs/open.c:1366
do_sys_open fs/open.c:1372 [inline]
__do_sys_openat fs/open.c:1388 [inline]
__se_sys_openat fs/open.c:1383 [inline]
__x64_sys_openat+0x12d/0x210 fs/open.c:1383
do_syscall_x64 arch/x86/entry/syscall_64.c:63 [inline]
do_syscall_64+0x106/0xf80 arch/x86/entry/syscall_64.c:94
entry_SYSCALL_64_after_hwframe+0x77/0x7f
-> #3 (set->srcu){.+.+}-{0:0}:
srcu_lock_sync include/linux/srcu.h:199 [inline]
__synchronize_srcu+0xa1/0x2a0 kernel/rcu/srcutree.c:1505
blk_mq_wait_quiesce_done block/blk-mq.c:284 [inline]
blk_mq_wait_quiesce_done block/blk-mq.c:281 [inline]
blk_mq_quiesce_queue block/blk-mq.c:304 [inline]
blk_mq_quiesce_queue+0x149/0x1c0 block/blk-mq.c:299
elevator_switch+0x17b/0x7e0 block/elevator.c:576
elevator_change+0x352/0x530 block/elevator.c:681
elevator_set_default+0x29e/0x360 block/elevator.c:754
blk_register_queue+0x412/0x590 block/blk-sysfs.c:946
__add_disk+0x73f/0xe40 block/genhd.c:528
add_disk_fwnode+0x118/0x5c0 block/genhd.c:597
add_disk include/linux/blkdev.h:785 [inline]
nbd_dev_add+0x77a/0xb10 drivers/block/nbd.c:1984
nbd_init+0x291/0x2b0 drivers/block/nbd.c:2692
do_one_initcall+0x11d/0x760 init/main.c:1382
do_initcall_level init/main.c:1444 [inline]
do_initcalls init/main.c:1460 [inline]
do_basic_setup init/main.c:1479 [inline]
kernel_init_freeable+0x6e5/0x7a0 init/main.c:1692
kernel_init+0x1f/0x1e0 init/main.c:1582
ret_from_fork+0x754/0xd80 arch/x86/kernel/process.c:158
ret_from_fork_asm+0x1a/0x30 arch/x86/entry/entry_64.S:245
-> #2 (&q->elevator_lock){+.+.}-{4:4}:
__mutex_lock_common kernel/locking/mutex.c:614 [inline]
__mutex_lock+0x1a2/0x1b90 kernel/locking/mutex.c:776
elevator_change+0x1bc/0x530 block/elevator.c:679
elevator_set_none+0x92/0xf0 block/elevator.c:769
blk_mq_elv_switch_none block/blk-mq.c:5110 [inline]
__blk_mq_update_nr_hw_queues block/blk-mq.c:5155 [inline]
blk_mq_update_nr_hw_queues+0x4c1/0x15f0 block/blk-mq.c:5220
nbd_start_device+0x1a6/0xbd0 drivers/block/nbd.c:1489
nbd_genl_connect+0xff2/0x1a40 drivers/block/nbd.c:2239
genl_family_rcv_msg_doit+0x214/0x300 net/netlink/genetlink.c:1114
genl_family_rcv_msg net/netlink/genetlink.c:1194 [inline]
genl_rcv_msg+0x560/0x800 net/netlink/genetlink.c:1209
netlink_rcv_skb+0x159/0x420 net/netlink/af_netlink.c:2550
genl_rcv+0x28/0x40 net/netlink/genetlink.c:1218
netlink_unicast_kernel net/netlink/af_netlink.c:1318 [inline]
netlink_unicast+0x5aa/0x870 net/netlink/af_netlink.c:1344
netlink_sendmsg+0x8b0/0xda0 net/netlink/af_netlink.c:1894
sock_sendmsg_nosec net/socket.c:727 [inline]
__sock_sendmsg net/socket.c:742 [inline]
____sys_sendmsg+0x9e1/0xb70 net/socket.c:2592
___sys_sendmsg+0x190/0x1e0 net/socket.c:2646
__sys_sendmsg+0x170/0x220 net/socket.c:2678
do_syscall_x64 arch/x86/entry/syscall_64.c:63 [inline]
do_syscall_64+0x106/0xf80 arch/x86/entry/syscall_64.c:94
entry_SYSCALL_64_after_hwframe+0x77/0x7f
-> #1 (&q->q_usage_counter(io)#49){++++}-{0:0}:
blk_alloc_queue+0x610/0x790 block/blk-core.c:461
blk_mq_alloc_queue+0x174/0x290 block/blk-mq.c:4429
__blk_mq_alloc_disk+0x29/0x120 block/blk-mq.c:4476
nbd_dev_add+0x492/0xb10 drivers/block/nbd.c:1954
nbd_init+0x291/0x2b0 drivers/block/nbd.c:2692
do_one_initcall+0x11d/0x760 init/main.c:1382
do_initcall_level init/main.c:1444 [inline]
do_initcalls init/main.c:1460 [inline]
do_basic_setup init/main.c:1479 [inline]
kernel_init_freeable+0x6e5/0x7a0 init/main.c:1692
kernel_init+0x1f/0x1e0 init/main.c:1582
ret_from_fork+0x754/0xd80 arch/x86/kernel/process.c:158
ret_from_fork_asm+0x1a/0x30 arch/x86/entry/entry_64.S:245
-> #0 (fs_reclaim){+.+.}-{0:0}:
check_prev_add kernel/locking/lockdep.c:3165 [inline]
check_prevs_add kernel/locking/lockdep.c:3284 [inline]
validate_chain kernel/locking/lockdep.c:3908 [inline]
__lock_acquire+0x14b8/0x2630 kernel/locking/lockdep.c:5237
lock_acquire kernel/locking/lockdep.c:5868 [inline]
lock_acquire+0x1cf/0x380 kernel/locking/lockdep.c:5825
__fs_reclaim_acquire mm/page_alloc.c:4348 [inline]
fs_reclaim_acquire+0xc4/0x100 mm/page_alloc.c:4362
might_alloc include/linux/sched/mm.h:317 [inline]
slab_pre_alloc_hook mm/slub.c:4489 [inline]
slab_alloc_node mm/slub.c:4843 [inline]
kmem_cache_alloc_node_noprof+0x53/0x6f0 mm/slub.c:4918
__alloc_skb+0x140/0x710 net/core/skbuff.c:702
alloc_skb include/linux/skbuff.h:1383 [inline]
tcp_send_active_reset+0x8b/0xa60 net/ipv4/tcp_output.c:3862
__tcp_close+0x41e/0x1110 net/ipv4/tcp.c:3223
tcp_close+0x28/0x110 net/ipv4/tcp.c:3350
inet_release+0xed/0x200 net/ipv4/af_inet.c:443
inet6_release+0x4f/0x70 net/ipv6/af_inet6.c:479
__sock_release+0xb3/0x260 net/socket.c:662
sock_close+0x1c/0x30 net/socket.c:1455
__fput+0x3ff/0xb40 fs/file_table.c:469
task_work_run+0x150/0x240 kernel/task_work.c:233
resume_user_mode_work include/linux/resume_user_mode.h:50 [inline]
__exit_to_user_mode_loop kernel/entry/common.c:67 [inline]
exit_to_user_mode_loop+0x100/0x4a0 kernel/entry/common.c:98
__exit_to_user_mode_prepare include/linux/irq-entry-common.h:226 [inline]
syscall_exit_to_user_mode_prepare include/linux/irq-entry-common.h:256 [inline]
syscall_exit_to_user_mode include/linux/entry-common.h:325 [inline]
do_syscall_64+0x67c/0xf80 arch/x86/entry/syscall_64.c:100
entry_SYSCALL_64_after_hwframe+0x77/0x7f
other info that might help us debug this:
Chain exists of:
fs_reclaim --> &nsock->tx_lock --> sk_lock-AF_INET6
Possible unsafe locking scenario:
       CPU0                    CPU1
       ----                    ----
  lock(sk_lock-AF_INET6);
                               lock(&nsock->tx_lock);
                               lock(sk_lock-AF_INET6);
  lock(fs_reclaim);
*** DEADLOCK ***
Fixes: fd8383f ("nbd: convert to blkmq")
Reported-by: [email protected]
Closes: https://lore.kernel.org/netdev/[email protected]/
Signed-off-by: Kuniyuki Iwashima <[email protected]>
hfsplus_fill_super() calls hfs_find_init() to initialize a search
structure, which acquires tree->tree_lock. If the subsequent call to
hfsplus_cat_build_key() fails, the function jumps to the out_put_root
error label without releasing the lock. The later cleanup path then
frees the tree data structure with the lock still held, triggering a
held lock freed warning.

Fix this by adding the missing hfs_find_exit(&fd) call before jumping
to the out_put_root error label. This ensures that tree->tree_lock is
properly released on the error path.

The bug was originally detected on v6.13-rc1 using an experimental
static analysis tool we are developing, and we have verified that the
issue persists in the latest mainline kernel. The tool is specifically
designed to detect memory management issues. It is currently under
active development and not yet publicly available.

We confirmed the bug by runtime testing under QEMU with x86_64
defconfig, lockdep enabled, and CONFIG_HFSPLUS_FS=y. To trigger the
error path, we used GDB to dynamically shrink the max_unistr_len
parameter to 1 before hfsplus_asc2uni() is called. This forces
hfsplus_asc2uni() to naturally return -ENAMETOOLONG, which propagates
to hfsplus_cat_build_key() and exercises the faulty error path.

The following warning was observed during mount:

=========================
WARNING: held lock freed!
7.0.0-rc3-00016-gb4f0dd314b39 #4 Not tainted
-------------------------
mount/174 is freeing memory ffff888103f92000-ffff888103f92fff, with a lock still held there!
ffff888103f920b0 (&tree->tree_lock){+.+.}-{4:4}, at: hfsplus_find_init+0x154/0x1e0
2 locks held by mount/174:
#0: ffff888103f960e0 (&type->s_umount_key#42/1){+.+.}-{4:4}, at: alloc_super.constprop.0+0x167/0xa40
#1: ffff888103f920b0 (&tree->tree_lock){+.+.}-{4:4}, at: hfsplus_find_init+0x154/0x1e0
stack backtrace:
CPU: 2 UID: 0 PID: 174 Comm: mount Not tainted 7.0.0-rc3-00016-gb4f0dd314b39 #4 PREEMPT(lazy)
Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS 1.15.0-1 04/01/2014
Call Trace:
<TASK>
dump_stack_lvl+0x82/0xd0
debug_check_no_locks_freed+0x13a/0x180
kfree+0x16b/0x510
? hfsplus_fill_super+0xcb4/0x18a0
hfsplus_fill_super+0xcb4/0x18a0
? __pfx_hfsplus_fill_super+0x10/0x10
? srso_return_thunk+0x5/0x5f
? bdev_open+0x65f/0xc30
? srso_return_thunk+0x5/0x5f
? pointer+0x4ce/0xbf0
? trace_contention_end+0x11c/0x150
? __pfx_pointer+0x10/0x10
? srso_return_thunk+0x5/0x5f
? bdev_open+0x79b/0xc30
? srso_return_thunk+0x5/0x5f
? srso_return_thunk+0x5/0x5f
? vsnprintf+0x6da/0x1270
? srso_return_thunk+0x5/0x5f
? __mutex_unlock_slowpath+0x157/0x740
? __pfx_vsnprintf+0x10/0x10
? srso_return_thunk+0x5/0x5f
? srso_return_thunk+0x5/0x5f
? mark_held_locks+0x49/0x80
? srso_return_thunk+0x5/0x5f
? srso_return_thunk+0x5/0x5f
? irqentry_exit+0x17b/0x5e0
? trace_irq_disable.constprop.0+0x116/0x150
? __pfx_hfsplus_fill_super+0x10/0x10
? __pfx_hfsplus_fill_super+0x10/0x10
get_tree_bdev_flags+0x302/0x580
? __pfx_get_tree_bdev_flags+0x10/0x10
? vfs_parse_fs_qstr+0x129/0x1a0
? __pfx_vfs_parse_fs_qstr+0x3/0x10
vfs_get_tree+0x89/0x320
fc_mount+0x10/0x1d0
path_mount+0x5c5/0x21c0
? __pfx_path_mount+0x10/0x10
? trace_irq_enable.constprop.0+0x116/0x150
? trace_irq_enable.constprop.0+0x116/0x150
? srso_return_thunk+0x5/0x5f
? srso_return_thunk+0x5/0x5f
? kmem_cache_free+0x307/0x540
? user_path_at+0x51/0x60
? __x64_sys_mount+0x212/0x280
? srso_return_thunk+0x5/0x5f
__x64_sys_mount+0x212/0x280
? __pfx___x64_sys_mount+0x10/0x10
? srso_return_thunk+0x5/0x5f
? trace_irq_enable.constprop.0+0x116/0x150
? srso_return_thunk+0x5/0x5f
do_syscall_64+0x111/0x680
entry_SYSCALL_64_after_hwframe+0x77/0x7f
RIP: 0033:0x7ffacad55eae
Code: 48 8b 0d 85 1f 0f 00 f7 d8 64 89 01 48 83 c8 ff c3 66 2e 0f 1f 84 00 00 00 00 00 90 f3 0f 1e fa 49 89 ca b8 a5 00 00 8
RSP: 002b:00007fff1ab55718 EFLAGS: 00000246 ORIG_RAX: 00000000000000a5
RAX: ffffffffffffffda RBX: 0000000000000000 RCX: 00007ffacad55eae
RDX: 000055740c64e5b0 RSI: 000055740c64e630 RDI: 000055740c651ab0
RBP: 000055740c64e380 R08: 0000000000000000 R09: 0000000000000001
R10: 0000000000000000 R11: 0000000000000246 R12: 0000000000000000
R13: 000055740c64e5b0 R14: 000055740c651ab0 R15: 000055740c64e380
</TASK>

After applying this patch, the warning no longer appears.

Fixes: 89ac9b4 ("hfsplus: fix longname handling")
CC: [email protected]
Signed-off-by: Zilin Guan <[email protected]>
Reviewed-by: Viacheslav Dubeyko <[email protected]>
Tested-by: Viacheslav Dubeyko <[email protected]>
Signed-off-by: Viacheslav Dubeyko <[email protected]>
…l flushing
Over the years we often get reports of some -ENOSPC failure while updating
metadata that leads to a transaction abort. I have seen this happen for
filesystems of all sizes and with workloads that are very user/customer
specific and that we were unable to reproduce, but Aleksandar recently
reported a simple way to reproduce this with a 1G filesystem and the
bonnie++ benchmark tool. The following test script reproduces the failure:
$ cat test.sh
#!/bin/bash
# Create and use a 1G null block device, memory backed, otherwise
# the test takes a very long time.
modprobe null_blk nr_devices="0"
null_dev="/sys/kernel/config/nullb/nullb0"
mkdir "$null_dev"
size=$((1 * 1024)) # in MB
echo 2 > "$null_dev/submit_queues"
echo "$size" > "$null_dev/size"
echo 1 > "$null_dev/memory_backed"
echo 1 > "$null_dev/discard"
echo 1 > "$null_dev/power"
DEV=/dev/nullb0
MNT=/mnt/nullb0
mkfs.btrfs -f $DEV
mount $DEV $MNT
mkdir $MNT/test/
bonnie++ -d $MNT/test/ -m BTRFS -u 0 -s 256M -r 128M -b
umount $MNT
echo 0 > "$null_dev/power"
rmdir "$null_dev"
When running this, bonnie++ fails in the phase where it deletes the test
directories and files:
$ ./test.sh
(...)
Using uid:0, gid:0.
Writing a byte at a time...done
Writing intelligently...done
Rewriting...done
Reading a byte at a time...done
Reading intelligently...done
start 'em...done...done...done...done...done...
Create files in sequential order...done.
Stat files in sequential order...done.
Delete files in sequential order...done.
Create files in random order...done.
Stat files in random order...done.
Delete files in random order...Can't sync directory, turning off dir-sync.
Can't delete file 9Bq7sr0000000338
Cleaning up test directory after error.
Bonnie: drastic I/O error (rmdir): Read-only file system
And in the syslog/dmesg we can see the following transaction abort trace:
[161915.501506] BTRFS warning (device nullb0): Skipping commit of aborted transaction.
[161915.502983] ------------[ cut here ]------------
[161915.503832] BTRFS: Transaction aborted (error -28)
[161915.504748] WARNING: fs/btrfs/transaction.c:2045 at btrfs_commit_transaction+0xa21/0xd30 [btrfs], CPU#11: bonnie++/3377975
[161915.506786] Modules linked in: btrfs dm_zero dm_snapshot (...)
[161915.518759] CPU: 11 UID: 0 PID: 3377975 Comm: bonnie++ Tainted: G W 6.19.0-rc7-btrfs-next-224+ #4 PREEMPT(full)
[161915.520857] Tainted: [W]=WARN
[161915.521405] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS rel-1.16.2-0-gea1b7a073390-prebuilt.qemu.org 04/01/2014
[161915.523414] RIP: 0010:btrfs_commit_transaction+0xa24/0xd30 [btrfs]
[161915.524630] Code: 48 8b 7c 24 (...)
[161915.526982] RSP: 0018:ffffd3fe8206fda8 EFLAGS: 00010292
[161915.527707] RAX: 0000000000000002 RBX: ffff8f4886d3c000 RCX: 0000000000000000
[161915.528723] RDX: 0000000002040001 RSI: 00000000ffffffe4 RDI: ffffffffc088f780
[161915.529691] RBP: ffff8f4f5adae7e0 R08: 0000000000000000 R09: ffffd3fe8206fb90
[161915.530842] R10: ffff8f4f9c1fffa8 R11: 0000000000000003 R12: 00000000ffffffe4
[161915.532027] R13: ffff8f4ef2cf2400 R14: ffff8f4f5adae708 R15: ffff8f4f62d18000
[161915.533229] FS: 00007ff93112a780(0000) GS:ffff8f4ff63ee000(0000) knlGS:0000000000000000
[161915.534611] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[161915.535575] CR2: 00005571b3072000 CR3: 0000000176080005 CR4: 0000000000370ef0
[161915.536758] Call Trace:
[161915.537185] <TASK>
[161915.537575] btrfs_sync_file+0x431/0x530 [btrfs]
[161915.538473] do_fsync+0x39/0x80
[161915.539042] __x64_sys_fsync+0xf/0x20
[161915.539750] do_syscall_64+0x50/0xf20
[161915.540396] entry_SYSCALL_64_after_hwframe+0x76/0x7e
[161915.541301] RIP: 0033:0x7ff930ca49ee
[161915.541904] Code: 08 0f 85 f5 (...)
[161915.544830] RSP: 002b:00007ffd94291f38 EFLAGS: 00000246 ORIG_RAX: 000000000000004a
[161915.546152] RAX: ffffffffffffffda RBX: 00007ff93112a780 RCX: 00007ff930ca49ee
[161915.547263] RDX: 0000000000000000 RSI: 0000000000000000 RDI: 0000000000000003
[161915.548383] RBP: 0000000000000dab R08: 0000000000000000 R09: 0000000000000000
[161915.549853] R10: 0000000000000000 R11: 0000000000000246 R12: 00007ffd94291fb0
[161915.551196] R13: 00007ffd94292350 R14: 0000000000000001 R15: 00007ffd94292340
[161915.552161] </TASK>
[161915.552457] ---[ end trace 0000000000000000 ]---
[161915.553232] BTRFS info (device nullb0 state A): dumping space info:
[161915.553236] BTRFS info (device nullb0 state A): space_info DATA (sub-group id 0) has 12582912 free, is not full
[161915.553239] BTRFS info (device nullb0 state A): space_info total=12582912, used=0, pinned=0, reserved=0, may_use=0, readonly=0 zone_unusable=0
[161915.553243] BTRFS info (device nullb0 state A): space_info METADATA (sub-group id 0) has -5767168 free, is full
[161915.553245] BTRFS info (device nullb0 state A): space_info total=53673984, used=6635520, pinned=46956544, reserved=16384, may_use=5767168, readonly=65536 zone_unusable=0
[161915.553251] BTRFS info (device nullb0 state A): space_info SYSTEM (sub-group id 0) has 8355840 free, is not full
[161915.553254] BTRFS info (device nullb0 state A): space_info total=8388608, used=16384, pinned=16384, reserved=0, may_use=0, readonly=0 zone_unusable=0
[161915.553257] BTRFS info (device nullb0 state A): global_block_rsv: size 5767168 reserved 5767168
[161915.553261] BTRFS info (device nullb0 state A): trans_block_rsv: size 0 reserved 0
[161915.553263] BTRFS info (device nullb0 state A): chunk_block_rsv: size 0 reserved 0
[161915.553265] BTRFS info (device nullb0 state A): remap_block_rsv: size 0 reserved 0
[161915.553268] BTRFS info (device nullb0 state A): delayed_block_rsv: size 0 reserved 0
[161915.553270] BTRFS info (device nullb0 state A): delayed_refs_rsv: size 0 reserved 0
[161915.553272] BTRFS: error (device nullb0 state A) in cleanup_transaction:2045: errno=-28 No space left
[161915.554463] BTRFS info (device nullb0 state EA): forced readonly
The problem is that we allow for a very aggressive metadata overcommit,
about 1/8th of the currently available space, even when the task
attempting the reservation allows for full flushing. Over time this allows
more and more tasks to overcommit without triggering a transaction commit
to release pinned extents; tasks keep joining the same transaction, which
eventually leads to a transaction abort when attempting some tree update,
as the extent allocator is not able to find any available metadata extent
and is not able to allocate a new metadata block group either (there is
not enough unallocated space for that).
Fix this by allowing the overcommit to be up to 1/64th of the available
(unallocated) space instead and for that limit to apply to both types of
full flushing, BTRFS_RESERVE_FLUSH_ALL and BTRFS_RESERVE_FLUSH_ALL_STEAL.
This way we get more frequent transaction commits to release pinned
extents in case our caller is in a context where full flushing is allowed.
Note that the space infos dump in the dmesg/syslog right after the
transaction abort gives the wrong idea that we have plenty of unallocated
space when the abort happened. During the bonnie++ workload we had a
metadata chunk allocation attempt and it failed with -ENOSPC because at
that time we had a bunch of data block groups allocated, which then became
empty and got deleted by the cleaner kthread after the metadata chunk
allocation failed with -ENOSPC and before the transaction abort happened
and dumped the space infos.
Custom tracing (some trace_printk() calls spread in strategic places)
was used to check that:
mount-1793735 [011] ...1. 28877.261096: btrfs_add_bg_to_space_info: added bg offset 13631488 length 8388608 flags 1 to space_info->flags 1 total_bytes 8388608 bytes_used 0 bytes_may_use 0
mount-1793735 [011] ...1. 28877.261098: btrfs_add_bg_to_space_info: added bg offset 22020096 length 8388608 flags 34 to space_info->flags 2 total_bytes 8388608 bytes_used 16384 bytes_may_use 0
mount-1793735 [011] ...1. 28877.261100: btrfs_add_bg_to_space_info: added bg offset 30408704 length 53673984 flags 36 to space_info->flags 4 total_bytes 53673984 bytes_used 131072 bytes_may_use 0
These are from loading the block groups created by mkfs during mount.
Then when bonnie++ starts doing its thing:
kworker/u48:5-1792004 [011] ..... 28886.122050: btrfs_create_chunk: gather_device_info 1 ctl->dev_extent_min = 65536 dev_extent_want 1073741824
kworker/u48:5-1792004 [011] ..... 28886.122053: btrfs_create_chunk: gather_device_info 2 ctl->dev_extent_min = 65536 dev_extent_want 1073741824 max_avail 927596544
kworker/u48:5-1792004 [011] ..... 28886.122055: btrfs_make_block_group: make bg offset 84082688 size 117440512 type 1
kworker/u48:5-1792004 [011] ...1. 28886.122064: btrfs_add_bg_to_space_info: added bg offset 84082688 length 117440512 flags 1 to space_info->flags 1 total_bytes 125829120 bytes_used 0 bytes_may_use 5251072
First allocation of a data block group of 112M.
kworker/u48:5-1792004 [011] ..... 28886.192408: btrfs_create_chunk: gather_device_info 1 ctl->dev_extent_min = 65536 dev_extent_want 1073741824
kworker/u48:5-1792004 [011] ..... 28886.192413: btrfs_create_chunk: gather_device_info 2 ctl->dev_extent_min = 65536 dev_extent_want 1073741824 max_avail 810156032
kworker/u48:5-1792004 [011] ..... 28886.192415: btrfs_make_block_group: make bg offset 201523200 size 117440512 type 1
kworker/u48:5-1792004 [011] ...1. 28886.192425: btrfs_add_bg_to_space_info: added bg offset 201523200 length 117440512 flags 1 to space_info->flags 1 total_bytes 243269632 bytes_used 0 bytes_may_use 122691584
Another 112M data block group allocated.
kworker/u48:5-1792004 [011] ..... 28886.260935: btrfs_create_chunk: gather_device_info 1 ctl->dev_extent_min = 65536 dev_extent_want 1073741824
kworker/u48:5-1792004 [011] ..... 28886.260941: btrfs_create_chunk: gather_device_info 2 ctl->dev_extent_min = 65536 dev_extent_want 1073741824 max_avail 692715520
kworker/u48:5-1792004 [011] ..... 28886.260943: btrfs_make_block_group: make bg offset 318963712 size 117440512 type 1
kworker/u48:5-1792004 [011] ...1. 28886.260954: btrfs_add_bg_to_space_info: added bg offset 318963712 length 117440512 flags 1 to space_info->flags 1 total_bytes 360710144 bytes_used 0 bytes_may_use 240132096
Yet another one.
bonnie++-1793755 [010] ..... 28886.280407: btrfs_create_chunk: gather_device_info 1 ctl->dev_extent_min = 65536 dev_extent_want 1073741824
bonnie++-1793755 [010] ..... 28886.280412: btrfs_create_chunk: gather_device_info 2 ctl->dev_extent_min = 65536 dev_extent_want 1073741824 max_avail 575275008
bonnie++-1793755 [010] ..... 28886.280414: btrfs_make_block_group: make bg offset 436404224 size 117440512 type 1
bonnie++-1793755 [010] ...1. 28886.280419: btrfs_add_bg_to_space_info: added bg offset 436404224 length 117440512 flags 1 to space_info->flags 1 total_bytes 478150656 bytes_used 0 bytes_may_use 268435456
One more.
kworker/u48:5-1792004 [011] ..... 28886.566233: btrfs_create_chunk: gather_device_info 1 ctl->dev_extent_min = 65536 dev_extent_want 1073741824
kworker/u48:5-1792004 [011] ..... 28886.566238: btrfs_create_chunk: gather_device_info 2 ctl->dev_extent_min = 65536 dev_extent_want 1073741824 max_avail 457834496
kworker/u48:5-1792004 [011] ..... 28886.566241: btrfs_make_block_group: make bg offset 553844736 size 117440512 type 1
kworker/u48:5-1792004 [011] ...1. 28886.566250: btrfs_add_bg_to_space_info: added bg offset 553844736 length 117440512 flags 1 to space_info->flags 1 total_bytes 595591168 bytes_used 268435456 bytes_may_use 209723392
Another one.
bonnie++-1793755 [009] ..... 28886.613446: btrfs_create_chunk: gather_device_info 1 ctl->dev_extent_min = 65536 dev_extent_want 1073741824
bonnie++-1793755 [009] ..... 28886.613451: btrfs_create_chunk: gather_device_info 2 ctl->dev_extent_min = 65536 dev_extent_want 1073741824 max_avail 340393984
bonnie++-1793755 [009] ..... 28886.613453: btrfs_make_block_group: make bg offset 671285248 size 117440512 type 1
bonnie++-1793755 [009] ...1. 28886.613458: btrfs_add_bg_to_space_info: added bg offset 671285248 length 117440512 flags 1 to space_info->flags 1 total_bytes 713031680 bytes_used 268435456 bytes_may_use 268435456
Another one.
bonnie++-1793755 [009] ..... 28886.674953: btrfs_create_chunk: gather_device_info 1 ctl->dev_extent_min = 65536 dev_extent_want 1073741824
bonnie++-1793755 [009] ..... 28886.674957: btrfs_create_chunk: gather_device_info 2 ctl->dev_extent_min = 65536 dev_extent_want 1073741824 max_avail 222953472
bonnie++-1793755 [009] ..... 28886.674959: btrfs_make_block_group: make bg offset 788725760 size 117440512 type 1
bonnie++-1793755 [009] ...1. 28886.674963: btrfs_add_bg_to_space_info: added bg offset 788725760 length 117440512 flags 1 to space_info->flags 1 total_bytes 830472192 bytes_used 268435456 bytes_may_use 134217728
Another one.
bonnie++-1793755 [009] ..... 28886.674981: btrfs_create_chunk: gather_device_info 1 ctl->dev_extent_min = 65536 dev_extent_want 1073741824
bonnie++-1793755 [009] ..... 28886.674982: btrfs_create_chunk: gather_device_info 2 ctl->dev_extent_min = 65536 dev_extent_want 1073741824 max_avail 105512960
bonnie++-1793755 [009] ..... 28886.674983: btrfs_make_block_group: make bg offset 906166272 size 105512960 type 1
bonnie++-1793755 [009] ...1. 28886.674984: btrfs_add_bg_to_space_info: added bg offset 906166272 length 105512960 flags 1 to space_info->flags 1 total_bytes 935985152 bytes_used 268435456 bytes_may_use 67108864
Another one, but a bit smaller (~100.6M) since we now have less space.
bonnie++-1793758 [009] ..... 28891.962096: btrfs_create_chunk: gather_device_info 1 ctl->dev_extent_min = 65536 dev_extent_want 1073741824
bonnie++-1793758 [009] ..... 28891.962103: btrfs_create_chunk: gather_device_info 2 ctl->dev_extent_min = 65536 dev_extent_want 1073741824 max_avail 12582912
bonnie++-1793758 [009] ..... 28891.962105: btrfs_make_block_group: make bg offset 1011679232 size 12582912 type 1
bonnie++-1793758 [009] ...1. 28891.962114: btrfs_add_bg_to_space_info: added bg offset 1011679232 length 12582912 flags 1 to space_info->flags 1 total_bytes 948568064 bytes_used 268435456 bytes_may_use 8192
Another one, this one even smaller (12M).
kworker/u48:5-1792004 [011] ..... 28892.112802: btrfs_chunk_alloc: enter first metadata chunk alloc attempt
kworker/u48:5-1792004 [011] ..... 28892.112805: btrfs_create_chunk: gather_device_info 1 ctl->dev_extent_min = 131072 dev_extent_want 536870912
kworker/u48:5-1792004 [011] ..... 28892.112806: btrfs_create_chunk: gather_device_info 2 ctl->dev_extent_min = 131072 dev_extent_want 536870912 max_avail 0
536870912 is 512M, the standard 256M metadata chunk size times 2 because
of the DUP profile for metadata.
'max_avail' is what find_free_dev_extent() returns to us in
gather_device_info().
As a result, gather_device_info() sets ctl->ndevs to 0, making
decide_stripe_size() fail with -ENOSPC, and therefore metadata chunk
allocation fails while we are attempting to run delayed items during
the transaction commit.
kworker/u48:5-1792004 [011] ..... 28892.112807: btrfs_create_chunk: decide_stripe_size fail -ENOSPC
In the syslog/dmesg pasted above, which happened after the transaction was
aborted, the space info dumps did not account for all these data block
groups that were allocated during bonnie++'s workload. And that is because
after the metadata chunk allocation failed with -ENOSPC and before the
transaction abort happened, most of the data block groups had become empty
and got deleted by the cleaner kthread - when the abort happened, we
had bonnie++ in the middle of deleting the files it created.
But if we dump the space infos right after the metadata chunk allocation
fails, by adding a call to btrfs_dump_space_info_for_trans_abort() in
decide_stripe_size() when it returns -ENOSPC, we get:
[29972.409295] BTRFS info (device nullb0): dumping space info:
[29972.409300] BTRFS info (device nullb0): space_info DATA (sub-group id 0) has 673341440 free, is not full
[29972.409303] BTRFS info (device nullb0): space_info total=948568064, used=0, pinned=275226624, reserved=0, may_use=0, readonly=0 zone_unusable=0
[29972.409305] BTRFS info (device nullb0): space_info METADATA (sub-group id 0) has 3915776 free, is not full
[29972.409306] BTRFS info (device nullb0): space_info total=53673984, used=163840, pinned=42827776, reserved=147456, may_use=6553600, readonly=65536 zone_unusable=0
[29972.409308] BTRFS info (device nullb0): space_info SYSTEM (sub-group id 0) has 7979008 free, is not full
[29972.409310] BTRFS info (device nullb0): space_info total=8388608, used=16384, pinned=0, reserved=0, may_use=393216, readonly=0 zone_unusable=0
[29972.409311] BTRFS info (device nullb0): global_block_rsv: size 5767168 reserved 5767168
[29972.409313] BTRFS info (device nullb0): trans_block_rsv: size 0 reserved 0
[29972.409314] BTRFS info (device nullb0): chunk_block_rsv: size 393216 reserved 393216
[29972.409315] BTRFS info (device nullb0): remap_block_rsv: size 0 reserved 0
[29972.409316] BTRFS info (device nullb0): delayed_block_rsv: size 0 reserved 0
So here we see there's ~904.6M of data space, ~51.2M of metadata space and
8M of system space, making a total of 963.8M.
Reported-by: Aleksandar Gerasimovski <[email protected]>
Link: https://lore.kernel.org/linux-btrfs/SA1PR18MB56922F690C5EC2D85371408B998FA@SA1PR18MB5692.namprd18.prod.outlook.com/
Link: https://lore.kernel.org/linux-btrfs/CAL3q7H61vZ3_+eqJ1A9po2WcgNJJjUu9MJQoYB2oDSAAecHaug@mail.gmail.com/
Reviewed-by: Qu Wenruo <[email protected]>
Signed-off-by: Filipe Manana <[email protected]>
Signed-off-by: David Sterba <[email protected]>
Limit the number of zones reclaimed in flush_space()'s RECLAIM_ZONES
state.
This prevents possibly long-running reclaim sweeps from blocking other
tasks while the system is already under pressure, which would cause
those tasks to hang.
An example of this can be seen here, triggered by fstests generic/551:
generic/551 [ 27.042349] run fstests generic/551 at 2026-02-27 11:05:30
BTRFS: device fsid 78c16e29-20d9-4c8e-bc04-7ba431be38ff devid 1 transid 8 /dev/vdb (254:16) scanned by mount (806)
BTRFS info (device vdb): first mount of filesystem 78c16e29-20d9-4c8e-bc04-7ba431be38ff
BTRFS info (device vdb): using crc32c checksum algorithm
BTRFS info (device vdb): host-managed zoned block device /dev/vdb, 64 zones of 268435456 bytes
BTRFS info (device vdb): zoned mode enabled with zone size 268435456
BTRFS info (device vdb): checking UUID tree
BTRFS info (device vdb): enabling free space tree
INFO: task kworker/u38:1:90 blocked for more than 120 seconds.
Not tainted 7.0.0-rc1+ #345
"echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
task:kworker/u38:1 state:D stack:0 pid:90 tgid:90 ppid:2 task_flags:0x4208060 flags:0x00080000
Workqueue: events_unbound btrfs_async_reclaim_data_space
Call Trace:
<TASK>
__schedule+0x34f/0xe70
schedule+0x41/0x140
schedule_timeout+0xa3/0x110
? mark_held_locks+0x40/0x70
? lockdep_hardirqs_on_prepare+0xd8/0x1c0
? trace_hardirqs_on+0x18/0x100
? lockdep_hardirqs_on+0x84/0x130
? _raw_spin_unlock_irq+0x33/0x50
wait_for_completion+0xa4/0x150
? __flush_work+0x24c/0x550
__flush_work+0x339/0x550
? __pfx_wq_barrier_func+0x10/0x10
? wait_for_completion+0x39/0x150
flush_space+0x243/0x660
? find_held_lock+0x2b/0x80
? kvm_sched_clock_read+0x11/0x20
? local_clock_noinstr+0x17/0x110
? local_clock+0x15/0x30
? lock_release+0x1b7/0x4b0
do_async_reclaim_data_space+0xe8/0x160
btrfs_async_reclaim_data_space+0x19/0x30
process_one_work+0x20a/0x5f0
? lock_is_held_type+0xcd/0x130
worker_thread+0x1e2/0x3c0
? __pfx_worker_thread+0x10/0x10
kthread+0x103/0x150
? __pfx_kthread+0x10/0x10
ret_from_fork+0x20d/0x320
? __pfx_kthread+0x10/0x10
ret_from_fork_asm+0x1a/0x30
</TASK>
Showing all locks held in the system:
1 lock held by khungtaskd/67:
#0: ffffffff824d58e0 (rcu_read_lock){....}-{1:3}, at: debug_show_all_locks+0x3d/0x194
2 locks held by kworker/u38:1/90:
#0: ffff8881000aa158 ((wq_completion)events_unbound){+.+.}-{0:0}, at: process_one_work+0x3c4/0x5f0
#1: ffffc90000c17e58 ((work_completion)(&fs_info->async_data_reclaim_work)){+.+.}-{0:0}, at: process_one_work+0x1c0/0x5f0
5 locks held by kworker/u39:1/191:
#0: ffff8881000aa158 ((wq_completion)events_unbound){+.+.}-{0:0}, at: process_one_work+0x3c4/0x5f0
#1: ffffc90000dfbe58 ((work_completion)(&fs_info->reclaim_bgs_work)){+.+.}-{0:0}, at: process_one_work+0x1c0/0x5f0
#2: ffff888101da0420 (sb_writers#9){.+.+}-{0:0}, at: process_one_work+0x20a/0x5f0
#3: ffff88811040a648 (&fs_info->reclaim_bgs_lock){+.+.}-{4:4}, at: btrfs_reclaim_bgs_work+0x1de/0x770
#4: ffff888110408a18 (&fs_info->cleaner_mutex){+.+.}-{4:4}, at: btrfs_relocate_block_group+0x95a/0x20f0
1 lock held by aio-dio-write-v/980:
#0: ffff888110093008 (&sb->s_type->i_mutex_key#15){++++}-{4:4}, at: btrfs_inode_lock+0x51/0xb0
=============================================
To prevent these long running reclaims from blocking the system, only
reclaim 5 block groups per RECLAIM_ZONES state of flush_space(). Also,
as these reclaims are now bounded, it opens up the possibility of
calling btrfs_reclaim_block_groups() synchronously, eliminating the need
to place the reclaim task on a workqueue and then flush the workqueue
again.
Reviewed-by: Boris Burkov <[email protected]>
Signed-off-by: Johannes Thumshirn <[email protected]>
Signed-off-by: David Sterba <[email protected]>
As reported by syzbot [0], NBD can trigger a deadlock during
memory reclaim.
This occurs when a process holds lock_sock() on a backend TCP
socket and triggers a memory allocation that leads to fs reclaim.
If it eventually calls into NBD to send data or shut down the
socket, NBD will attempt to acquire the same lock_sock(),
resulting in the deadlock.
While NBD sets sk->sk_allocation to GFP_NOIO before calling
sendmsg(), this does not prevent the issue in some paths where
GFP_KERNEL is used directly under lock_sock().
To resolve this, let's use lock_sock_try() for TCP sendmsg() and
shutdown().
For sock_sendmsg(), if lock_sock_try() fails, -ERESTARTSYS is
returned, allowing the request to be retried later (e.g., via
was_interrupted() logic).
For the sock_sendmsg() call used for NBD_CMD_DISC and for
kernel_sock_shutdown(), the operation might be skipped if the lock
cannot be acquired.
However, this is not expected to occur in practice because the
backend TCP socket should not be touched by userspace once it is
handed over to NBD.
Note that sock_recvmsg() does not require this special handling
because it is only called from the workqueue context.
Also note that AF_UNIX sockets continue to use sock_sendmsg()
and kernel_sock_shutdown() because unix_stream_sendmsg() and
unix_shutdown() do not acquire lock_sock().
[0]:
WARNING: possible circular locking dependency detected
syzkaller #0 Tainted: G L
syz.7.2282/12353 is trying to acquire lock:
ffffffff8e9aa700 (fs_reclaim){+.+.}-{0:0}, at: might_alloc include/linux/sched/mm.h:317 [inline]
ffffffff8e9aa700 (fs_reclaim){+.+.}-{0:0}, at: slab_pre_alloc_hook mm/slub.c:4489 [inline]
ffffffff8e9aa700 (fs_reclaim){+.+.}-{0:0}, at: slab_alloc_node mm/slub.c:4843 [inline]
ffffffff8e9aa700 (fs_reclaim){+.+.}-{0:0}, at: kmem_cache_alloc_node_noprof+0x53/0x6f0 mm/slub.c:4918
but task is already holding lock:
ffff88806f972a20 (sk_lock-AF_INET6){+.+.}-{0:0}, at: lock_sock include/net/sock.h:1709 [inline]
ffff88806f972a20 (sk_lock-AF_INET6){+.+.}-{0:0}, at: tcp_close+0x1d/0x110 net/ipv4/tcp.c:3349
which lock already depends on the new lock.
the existing dependency chain (in reverse order) is:
-> #6 (sk_lock-AF_INET6){+.+.}-{0:0}:
lock_sock_nested+0x41/0xf0 net/core/sock.c:3780
lock_sock include/net/sock.h:1709 [inline]
inet_shutdown+0x67/0x410 net/ipv4/af_inet.c:919
nbd_mark_nsock_dead+0xae/0x5c0 drivers/block/nbd.c:318
sock_shutdown+0x16b/0x200 drivers/block/nbd.c:411
nbd_clear_sock drivers/block/nbd.c:1427 [inline]
nbd_config_put+0x1eb/0x750 drivers/block/nbd.c:1451
nbd_genl_connect+0xaf8/0x1a40 drivers/block/nbd.c:2248
genl_family_rcv_msg_doit+0x214/0x300 net/netlink/genetlink.c:1114
genl_family_rcv_msg net/netlink/genetlink.c:1194 [inline]
genl_rcv_msg+0x560/0x800 net/netlink/genetlink.c:1209
netlink_rcv_skb+0x159/0x420 net/netlink/af_netlink.c:2550
genl_rcv+0x28/0x40 net/netlink/genetlink.c:1218
netlink_unicast_kernel net/netlink/af_netlink.c:1318 [inline]
netlink_unicast+0x5aa/0x870 net/netlink/af_netlink.c:1344
netlink_sendmsg+0x8b0/0xda0 net/netlink/af_netlink.c:1894
sock_sendmsg_nosec net/socket.c:727 [inline]
__sock_sendmsg net/socket.c:742 [inline]
____sys_sendmsg+0x9e1/0xb70 net/socket.c:2592
___sys_sendmsg+0x190/0x1e0 net/socket.c:2646
__sys_sendmsg+0x170/0x220 net/socket.c:2678
do_syscall_x64 arch/x86/entry/syscall_64.c:63 [inline]
do_syscall_64+0x106/0xf80 arch/x86/entry/syscall_64.c:94
entry_SYSCALL_64_after_hwframe+0x77/0x7f
-> #5 (&nsock->tx_lock){+.+.}-{4:4}:
__mutex_lock_common kernel/locking/mutex.c:614 [inline]
__mutex_lock+0x1a2/0x1b90 kernel/locking/mutex.c:776
nbd_handle_cmd drivers/block/nbd.c:1143 [inline]
nbd_queue_rq+0x428/0x1080 drivers/block/nbd.c:1207
blk_mq_dispatch_rq_list+0x422/0x1e70 block/blk-mq.c:2148
__blk_mq_do_dispatch_sched block/blk-mq-sched.c:168 [inline]
blk_mq_do_dispatch_sched block/blk-mq-sched.c:182 [inline]
__blk_mq_sched_dispatch_requests+0xcea/0x1620 block/blk-mq-sched.c:307
blk_mq_sched_dispatch_requests+0xd7/0x1c0 block/blk-mq-sched.c:329
blk_mq_run_hw_queue+0x23c/0x670 block/blk-mq.c:2386
blk_mq_dispatch_list+0x51d/0x1360 block/blk-mq.c:2949
blk_mq_flush_plug_list block/blk-mq.c:2997 [inline]
blk_mq_flush_plug_list+0x130/0x600 block/blk-mq.c:2969
__blk_flush_plug+0x2c4/0x4b0 block/blk-core.c:1230
blk_finish_plug block/blk-core.c:1257 [inline]
__submit_bio+0x584/0x6c0 block/blk-core.c:649
__submit_bio_noacct_mq block/blk-core.c:722 [inline]
submit_bio_noacct_nocheck+0x562/0xc10 block/blk-core.c:753
submit_bio_noacct+0xd17/0x2010 block/blk-core.c:884
blk_crypto_submit_bio include/linux/blk-crypto.h:203 [inline]
submit_bh_wbc+0x59c/0x770 fs/buffer.c:2821
submit_bh fs/buffer.c:2826 [inline]
block_read_full_folio+0x264/0x8e0 fs/buffer.c:2444
filemap_read_folio+0xfc/0x3b0 mm/filemap.c:2501
do_read_cache_folio+0x2d7/0x6b0 mm/filemap.c:4101
read_mapping_folio include/linux/pagemap.h:1028 [inline]
read_part_sector+0xd1/0x370 block/partitions/core.c:723
adfspart_check_ICS+0x93/0x910 block/partitions/acorn.c:360
check_partition block/partitions/core.c:142 [inline]
blk_add_partitions block/partitions/core.c:590 [inline]
bdev_disk_changed+0x7f8/0xc80 block/partitions/core.c:694
blkdev_get_whole+0x187/0x290 block/bdev.c:764
bdev_open+0x2c7/0xe40 block/bdev.c:973
blkdev_open+0x34e/0x4f0 block/fops.c:697
do_dentry_open+0x6d8/0x1660 fs/open.c:949
vfs_open+0x82/0x3f0 fs/open.c:1081
do_open fs/namei.c:4671 [inline]
path_openat+0x208c/0x31a0 fs/namei.c:4830
do_file_open+0x20e/0x430 fs/namei.c:4859
do_sys_openat2+0x10d/0x1e0 fs/open.c:1366
do_sys_open fs/open.c:1372 [inline]
__do_sys_openat fs/open.c:1388 [inline]
__se_sys_openat fs/open.c:1383 [inline]
__x64_sys_openat+0x12d/0x210 fs/open.c:1383
do_syscall_x64 arch/x86/entry/syscall_64.c:63 [inline]
do_syscall_64+0x106/0xf80 arch/x86/entry/syscall_64.c:94
entry_SYSCALL_64_after_hwframe+0x77/0x7f
-> #4 (&cmd->lock){+.+.}-{4:4}:
__mutex_lock_common kernel/locking/mutex.c:614 [inline]
__mutex_lock+0x1a2/0x1b90 kernel/locking/mutex.c:776
nbd_queue_rq+0xba/0x1080 drivers/block/nbd.c:1199
blk_mq_dispatch_rq_list+0x422/0x1e70 block/blk-mq.c:2148
__blk_mq_do_dispatch_sched block/blk-mq-sched.c:168 [inline]
blk_mq_do_dispatch_sched block/blk-mq-sched.c:182 [inline]
__blk_mq_sched_dispatch_requests+0xcea/0x1620 block/blk-mq-sched.c:307
blk_mq_sched_dispatch_requests+0xd7/0x1c0 block/blk-mq-sched.c:329
blk_mq_run_hw_queue+0x23c/0x670 block/blk-mq.c:2386
blk_mq_dispatch_list+0x51d/0x1360 block/blk-mq.c:2949
blk_mq_flush_plug_list block/blk-mq.c:2997 [inline]
blk_mq_flush_plug_list+0x130/0x600 block/blk-mq.c:2969
__blk_flush_plug+0x2c4/0x4b0 block/blk-core.c:1230
blk_finish_plug block/blk-core.c:1257 [inline]
__submit_bio+0x584/0x6c0 block/blk-core.c:649
__submit_bio_noacct_mq block/blk-core.c:722 [inline]
submit_bio_noacct_nocheck+0x562/0xc10 block/blk-core.c:753
submit_bio_noacct+0xd17/0x2010 block/blk-core.c:884
blk_crypto_submit_bio include/linux/blk-crypto.h:203 [inline]
submit_bh_wbc+0x59c/0x770 fs/buffer.c:2821
submit_bh fs/buffer.c:2826 [inline]
block_read_full_folio+0x264/0x8e0 fs/buffer.c:2444
filemap_read_folio+0xfc/0x3b0 mm/filemap.c:2501
do_read_cache_folio+0x2d7/0x6b0 mm/filemap.c:4101
read_mapping_folio include/linux/pagemap.h:1028 [inline]
read_part_sector+0xd1/0x370 block/partitions/core.c:723
adfspart_check_ICS+0x93/0x910 block/partitions/acorn.c:360
check_partition block/partitions/core.c:142 [inline]
blk_add_partitions block/partitions/core.c:590 [inline]
bdev_disk_changed+0x7f8/0xc80 block/partitions/core.c:694
blkdev_get_whole+0x187/0x290 block/bdev.c:764
bdev_open+0x2c7/0xe40 block/bdev.c:973
blkdev_open+0x34e/0x4f0 block/fops.c:697
do_dentry_open+0x6d8/0x1660 fs/open.c:949
vfs_open+0x82/0x3f0 fs/open.c:1081
do_open fs/namei.c:4671 [inline]
path_openat+0x208c/0x31a0 fs/namei.c:4830
do_file_open+0x20e/0x430 fs/namei.c:4859
do_sys_openat2+0x10d/0x1e0 fs/open.c:1366
do_sys_open fs/open.c:1372 [inline]
__do_sys_openat fs/open.c:1388 [inline]
__se_sys_openat fs/open.c:1383 [inline]
__x64_sys_openat+0x12d/0x210 fs/open.c:1383
do_syscall_x64 arch/x86/entry/syscall_64.c:63 [inline]
do_syscall_64+0x106/0xf80 arch/x86/entry/syscall_64.c:94
entry_SYSCALL_64_after_hwframe+0x77/0x7f
-> #3 (set->srcu){.+.+}-{0:0}:
srcu_lock_sync include/linux/srcu.h:199 [inline]
__synchronize_srcu+0xa1/0x2a0 kernel/rcu/srcutree.c:1505
blk_mq_wait_quiesce_done block/blk-mq.c:284 [inline]
blk_mq_wait_quiesce_done block/blk-mq.c:281 [inline]
blk_mq_quiesce_queue block/blk-mq.c:304 [inline]
blk_mq_quiesce_queue+0x149/0x1c0 block/blk-mq.c:299
elevator_switch+0x17b/0x7e0 block/elevator.c:576
elevator_change+0x352/0x530 block/elevator.c:681
elevator_set_default+0x29e/0x360 block/elevator.c:754
blk_register_queue+0x412/0x590 block/blk-sysfs.c:946
__add_disk+0x73f/0xe40 block/genhd.c:528
add_disk_fwnode+0x118/0x5c0 block/genhd.c:597
add_disk include/linux/blkdev.h:785 [inline]
nbd_dev_add+0x77a/0xb10 drivers/block/nbd.c:1984
nbd_init+0x291/0x2b0 drivers/block/nbd.c:2692
do_one_initcall+0x11d/0x760 init/main.c:1382
do_initcall_level init/main.c:1444 [inline]
do_initcalls init/main.c:1460 [inline]
do_basic_setup init/main.c:1479 [inline]
kernel_init_freeable+0x6e5/0x7a0 init/main.c:1692
kernel_init+0x1f/0x1e0 init/main.c:1582
ret_from_fork+0x754/0xd80 arch/x86/kernel/process.c:158
ret_from_fork_asm+0x1a/0x30 arch/x86/entry/entry_64.S:245
-> #2 (&q->elevator_lock){+.+.}-{4:4}:
__mutex_lock_common kernel/locking/mutex.c:614 [inline]
__mutex_lock+0x1a2/0x1b90 kernel/locking/mutex.c:776
elevator_change+0x1bc/0x530 block/elevator.c:679
elevator_set_none+0x92/0xf0 block/elevator.c:769
blk_mq_elv_switch_none block/blk-mq.c:5110 [inline]
__blk_mq_update_nr_hw_queues block/blk-mq.c:5155 [inline]
blk_mq_update_nr_hw_queues+0x4c1/0x15f0 block/blk-mq.c:5220
nbd_start_device+0x1a6/0xbd0 drivers/block/nbd.c:1489
nbd_genl_connect+0xff2/0x1a40 drivers/block/nbd.c:2239
genl_family_rcv_msg_doit+0x214/0x300 net/netlink/genetlink.c:1114
genl_family_rcv_msg net/netlink/genetlink.c:1194 [inline]
genl_rcv_msg+0x560/0x800 net/netlink/genetlink.c:1209
netlink_rcv_skb+0x159/0x420 net/netlink/af_netlink.c:2550
genl_rcv+0x28/0x40 net/netlink/genetlink.c:1218
netlink_unicast_kernel net/netlink/af_netlink.c:1318 [inline]
netlink_unicast+0x5aa/0x870 net/netlink/af_netlink.c:1344
netlink_sendmsg+0x8b0/0xda0 net/netlink/af_netlink.c:1894
sock_sendmsg_nosec net/socket.c:727 [inline]
__sock_sendmsg net/socket.c:742 [inline]
____sys_sendmsg+0x9e1/0xb70 net/socket.c:2592
___sys_sendmsg+0x190/0x1e0 net/socket.c:2646
__sys_sendmsg+0x170/0x220 net/socket.c:2678
do_syscall_x64 arch/x86/entry/syscall_64.c:63 [inline]
do_syscall_64+0x106/0xf80 arch/x86/entry/syscall_64.c:94
entry_SYSCALL_64_after_hwframe+0x77/0x7f
-> #1 (&q->q_usage_counter(io)#49){++++}-{0:0}:
blk_alloc_queue+0x610/0x790 block/blk-core.c:461
blk_mq_alloc_queue+0x174/0x290 block/blk-mq.c:4429
__blk_mq_alloc_disk+0x29/0x120 block/blk-mq.c:4476
nbd_dev_add+0x492/0xb10 drivers/block/nbd.c:1954
nbd_init+0x291/0x2b0 drivers/block/nbd.c:2692
do_one_initcall+0x11d/0x760 init/main.c:1382
do_initcall_level init/main.c:1444 [inline]
do_initcalls init/main.c:1460 [inline]
do_basic_setup init/main.c:1479 [inline]
kernel_init_freeable+0x6e5/0x7a0 init/main.c:1692
kernel_init+0x1f/0x1e0 init/main.c:1582
ret_from_fork+0x754/0xd80 arch/x86/kernel/process.c:158
ret_from_fork_asm+0x1a/0x30 arch/x86/entry/entry_64.S:245
-> #0 (fs_reclaim){+.+.}-{0:0}:
check_prev_add kernel/locking/lockdep.c:3165 [inline]
check_prevs_add kernel/locking/lockdep.c:3284 [inline]
validate_chain kernel/locking/lockdep.c:3908 [inline]
__lock_acquire+0x14b8/0x2630 kernel/locking/lockdep.c:5237
lock_acquire kernel/locking/lockdep.c:5868 [inline]
lock_acquire+0x1cf/0x380 kernel/locking/lockdep.c:5825
__fs_reclaim_acquire mm/page_alloc.c:4348 [inline]
fs_reclaim_acquire+0xc4/0x100 mm/page_alloc.c:4362
might_alloc include/linux/sched/mm.h:317 [inline]
slab_pre_alloc_hook mm/slub.c:4489 [inline]
slab_alloc_node mm/slub.c:4843 [inline]
kmem_cache_alloc_node_noprof+0x53/0x6f0 mm/slub.c:4918
__alloc_skb+0x140/0x710 net/core/skbuff.c:702
alloc_skb include/linux/skbuff.h:1383 [inline]
tcp_send_active_reset+0x8b/0xa60 net/ipv4/tcp_output.c:3862
__tcp_close+0x41e/0x1110 net/ipv4/tcp.c:3223
tcp_close+0x28/0x110 net/ipv4/tcp.c:3350
inet_release+0xed/0x200 net/ipv4/af_inet.c:443
inet6_release+0x4f/0x70 net/ipv6/af_inet6.c:479
__sock_release+0xb3/0x260 net/socket.c:662
sock_close+0x1c/0x30 net/socket.c:1455
__fput+0x3ff/0xb40 fs/file_table.c:469
task_work_run+0x150/0x240 kernel/task_work.c:233
resume_user_mode_work include/linux/resume_user_mode.h:50 [inline]
__exit_to_user_mode_loop kernel/entry/common.c:67 [inline]
exit_to_user_mode_loop+0x100/0x4a0 kernel/entry/common.c:98
__exit_to_user_mode_prepare include/linux/irq-entry-common.h:226 [inline]
syscall_exit_to_user_mode_prepare include/linux/irq-entry-common.h:256 [inline]
syscall_exit_to_user_mode include/linux/entry-common.h:325 [inline]
do_syscall_64+0x67c/0xf80 arch/x86/entry/syscall_64.c:100
entry_SYSCALL_64_after_hwframe+0x77/0x7f
other info that might help us debug this:
Chain exists of:
fs_reclaim --> &nsock->tx_lock --> sk_lock-AF_INET6
Possible unsafe locking scenario:
CPU0 CPU1
---- ----
lock(sk_lock-AF_INET6);
lock(&nsock->tx_lock);
lock(sk_lock-AF_INET6);
lock(fs_reclaim);
*** DEADLOCK ***
Fixes: fd8383f ("nbd: convert to blkmq")
Reported-by: [email protected]
Closes: https://lore.kernel.org/netdev/[email protected]/
Signed-off-by: Kuniyuki Iwashima <[email protected]>
Pull request for series with
subject: block tests: nvme metadata passthrough
version: 1
url: https://patchwork.kernel.org/project/linux-block/list/?series=969899