Skip to content

block: move sched_tags allocation/de-allocation outside of locking context#68

Closed
blktests-ci[bot] wants to merge 3 commits intolinus-master_basefrom
series/977615=>linus-master
Closed

block: move sched_tags allocation/de-allocation outside of locking context#68
blktests-ci[bot] wants to merge 3 commits intolinus-master_basefrom
series/977615=>linus-master

Conversation

@blktests-ci
Copy link
Copy Markdown

@blktests-ci blktests-ci Bot commented Jul 31, 2025

Pull request for series with
subject: block: move sched_tags allocation/de-allocation outside of locking context
version: 1
url: https://patchwork.kernel.org/project/linux-block/list/?series=986951

shroffni added 3 commits July 31, 2025 13:28
In preparation for allocating sched_tags before freezing the request
queue and acquiring ->elevator_lock, move the elevator queue allocation
logic from the elevator ops ->init_sched callback into blk_mq_init_sched.
As elevator_alloc is now only invoked from block layer core, we don't
need to export it, so unexport elevator_alloc function.

This refactoring provides a centralized location for elevator queue
initialization, which makes it easier to store pre-allocated sched_tags
in the struct elevator_queue during later changes.

Reviewed-by: Ming Lei <[email protected]>
Reviewed-by: Hannes Reinecke <[email protected]>
Reviewed-by: Christoph Hellwig <[email protected]>
Signed-off-by: Nilay Shroff <[email protected]>
…tore

Recent lockdep reports [1] have revealed a potential deadlock caused by a
lock dependency between the percpu allocator lock and the elevator lock.
This issue can be avoided by ensuring that the allocation and release of
scheduler tags (sched_tags) are performed outside the elevator lock.
Furthermore, the queue does not need to be remain frozen during these
operations.

To address this, move all sched_tags allocations and deallocations outside
of both the ->elevator_lock and the ->freeze_lock. Since the lifetime of
the elevator queue and its associated sched_tags is closely tied, the
allocated sched_tags are now stored in the elevator queue structure. Then,
during the actual elevator switch (which runs under ->freeze_lock and
->elevator_lock), the pre-allocated sched_tags are assigned to the
appropriate q->hctx. Once the elevator switch is complete and the locks
are released, the old elevator queue and its associated sched_tags are
freed.

This commit specifically addresses the allocation/deallocation of sched_
tags during elevator switching. Note that sched_tags may also be allocated
in other contexts, such as during nr_hw_queues updates. Supporting that
use case will require batch allocation/deallocation, which will be handled
in a follow-up patch.

This restructuring ensures that sched_tags memory management occurs
entirely outside of the ->elevator_lock and ->freeze_lock context,
eliminating the lock dependency problem seen during scheduler updates.

[1] https://lore.kernel.org/all/[email protected]/

Reported-by: Stefan Haberland <[email protected]>
Closes: https://lore.kernel.org/all/[email protected]/
Reviewed-by: Ming Lei <[email protected]>
Reviewed-by: Christoph Hellwig <[email protected]>
Reviewed-by: Hannes Reinecke <[email protected]>
Signed-off-by: Nilay Shroff <[email protected]>
Move scheduler tags (sched_tags) allocation and deallocation outside
both the ->elevator_lock and ->freeze_lock when updating nr_hw_queues.
This change breaks the dependency chain from the percpu allocator lock
to the elevator lock, helping to prevent potential deadlocks, as
observed in the reported lockdep splat[1].

This commit introduces batch allocation and deallocation helpers for
sched_tags, which are now used from within __blk_mq_update_nr_hw_queues
routine while iterating through the tagset.

With this change, all sched_tags memory management is handled entirely
outside the ->elevator_lock and the ->freeze_lock context, thereby
eliminating the lock dependency that could otherwise manifest during
nr_hw_queues updates.

[1] https://lore.kernel.org/all/[email protected]/

Reported-by: Stefan Haberland <[email protected]>
Closes: https://lore.kernel.org/all/[email protected]/
Reviewed-by: Ming Lei <[email protected]>
Reviewed-by: Christoph Hellwig <[email protected]>
Reviewed-by: Hannes Reinecke <[email protected]>
Signed-off-by: Nilay Shroff <[email protected]>
@blktests-ci
Copy link
Copy Markdown
Author

blktests-ci Bot commented Jul 31, 2025

Upstream branch: 260f6f4
series: https://patchwork.kernel.org/project/linux-block/list/?series=986951
version: 1

@blktests-ci
Copy link
Copy Markdown
Author

blktests-ci Bot commented Jul 31, 2025

Upstream branch: 260f6f4
series: https://patchwork.kernel.org/project/linux-block/list/?series=986951
version: 1

1 similar comment
@blktests-ci
Copy link
Copy Markdown
Author

blktests-ci Bot commented Jul 31, 2025

Upstream branch: 260f6f4
series: https://patchwork.kernel.org/project/linux-block/list/?series=986951
version: 1

@blktests-ci
Copy link
Copy Markdown
Author

blktests-ci Bot commented Jul 31, 2025

Github failed to update this PR after force push. Close it.

@blktests-ci blktests-ci Bot closed this Jul 31, 2025
@blktests-ci blktests-ci Bot deleted the series/977615=>linus-master branch August 2, 2025 00:21
blktests-ci Bot pushed a commit that referenced this pull request Aug 8, 2025
netlink_attachskb() checks for the socket's read memory allocation
constraints. Firstly, it has:

  rmem < READ_ONCE(sk->sk_rcvbuf)

to check if the just increased rmem value fits into the socket's receive
buffer. If not, it proceeds and tries to wait for the memory under:

  rmem + skb->truesize > READ_ONCE(sk->sk_rcvbuf)

The checks don't cover the case when skb->truesize + sk->sk_rmem_alloc is
equal to sk->sk_rcvbuf. Thus the function neither successfully accepts
these conditions, nor manages to reschedule the task - and is called in
retry loop for indefinite time which is caught as:

  rcu: INFO: rcu_sched self-detected stall on CPU
  rcu:     0-....: (25999 ticks this GP) idle=ef2/1/0x4000000000000000 softirq=262269/262269 fqs=6212
  (t=26000 jiffies g=230833 q=259957)
  NMI backtrace for cpu 0
  CPU: 0 PID: 22 Comm: kauditd Not tainted 5.10.240 #68
  Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.17.0-4.fc42 04/01/2014
  Call Trace:
  <IRQ>
  dump_stack lib/dump_stack.c:120
  nmi_cpu_backtrace.cold lib/nmi_backtrace.c:105
  nmi_trigger_cpumask_backtrace lib/nmi_backtrace.c:62
  rcu_dump_cpu_stacks kernel/rcu/tree_stall.h:335
  rcu_sched_clock_irq.cold kernel/rcu/tree.c:2590
  update_process_times kernel/time/timer.c:1953
  tick_sched_handle kernel/time/tick-sched.c:227
  tick_sched_timer kernel/time/tick-sched.c:1399
  __hrtimer_run_queues kernel/time/hrtimer.c:1652
  hrtimer_interrupt kernel/time/hrtimer.c:1717
  __sysvec_apic_timer_interrupt arch/x86/kernel/apic/apic.c:1113
  asm_call_irq_on_stack arch/x86/entry/entry_64.S:808
  </IRQ>

  netlink_attachskb net/netlink/af_netlink.c:1234
  netlink_unicast net/netlink/af_netlink.c:1349
  kauditd_send_queue kernel/audit.c:776
  kauditd_thread kernel/audit.c:897
  kthread kernel/kthread.c:328
  ret_from_fork arch/x86/entry/entry_64.S:304

Restore the original behavior of the check which commit in Fixes
accidentally missed when restructuring the code.

Found by Linux Verification Center (linuxtesting.org).

Fixes: ae8f160 ("netlink: Fix wraparounds of sk->sk_rmem_alloc.")
Cc: [email protected]
Signed-off-by: Fedor Pchelkin <[email protected]>
Reviewed-by: Kuniyuki Iwashima <[email protected]>
Link: https://patch.msgid.link/[email protected]
Signed-off-by: Jakub Kicinski <[email protected]>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant