Skip to content

Commit a84097e

Browse files
Waiman-Longhtejun
authored andcommitted
cgroup/cpuset: Call housekeeping_update() without holding cpus_read_lock
The current cpuset partition code is able to dynamically update the sched domains of a running system and the corresponding HK_TYPE_DOMAIN housekeeping cpumask to perform what is essentially the "isolcpus=domain,..." boot command line feature at run time. The housekeeping cpumask update requires flushing a number of different workqueues which may not be safe with cpus_read_lock() held as the workqueue flushing code may acquire cpus_read_lock() or acquiring locks which have locking dependency with cpus_read_lock() down the chain. Below is an example of such circular locking problem. ====================================================== WARNING: possible circular locking dependency detected 6.18.0-test+ #2 Tainted: G S ------------------------------------------------------ test_cpuset_prs/10971 is trying to acquire lock: ffff888112ba4958 ((wq_completion)sync_wq){+.+.}-{0:0}, at: touch_wq_lockdep_map+0x7a/0x180 but task is already holding lock: ffffffffae47f450 (cpuset_mutex){+.+.}-{4:4}, at: cpuset_partition_write+0x85/0x130 which lock already depends on the new lock. the existing dependency chain (in reverse order) is: -> #4 (cpuset_mutex){+.+.}-{4:4}: -> #3 (cpu_hotplug_lock){++++}-{0:0}: -> #2 (rtnl_mutex){+.+.}-{4:4}: -> #1 ((work_completion)(&arg.work)){+.+.}-{0:0}: -> #0 ((wq_completion)sync_wq){+.+.}-{0:0}: Chain exists of: (wq_completion)sync_wq --> cpu_hotplug_lock --> cpuset_mutex 5 locks held by test_cpuset_prs/10971: #0: ffff88816810e440 (sb_writers#7){.+.+}-{0:0}, at: ksys_write+0xf9/0x1d0 #1: ffff8891ab620890 (&of->mutex#2){+.+.}-{4:4}, at: kernfs_fop_write_iter+0x260/0x5f0 #2: ffff8890a78b83e8 (kn->active#187){.+.+}-{0:0}, at: kernfs_fop_write_iter+0x2b6/0x5f0 #3: ffffffffadf32900 (cpu_hotplug_lock){++++}-{0:0}, at: cpuset_partition_write+0x77/0x130 #4: ffffffffae47f450 (cpuset_mutex){+.+.}-{4:4}, at: cpuset_partition_write+0x85/0x130 Call Trace: <TASK> : touch_wq_lockdep_map+0x93/0x180 __flush_workqueue+0x111/0x10b0 housekeeping_update+0x12d/0x2d0 update_parent_effective_cpumask+0x595/0x2440 update_prstate+0x89d/0xce0 cpuset_partition_write+0xc5/0x130 cgroup_file_write+0x1a5/0x680 kernfs_fop_write_iter+0x3df/0x5f0 vfs_write+0x525/0xfd0 ksys_write+0xf9/0x1d0 do_syscall_64+0x95/0x520 entry_SYSCALL_64_after_hwframe+0x76/0x7e To avoid such a circular locking dependency problem, we have to call housekeeping_update() without holding the cpus_read_lock() and cpuset_mutex. The current set of wq's flushed by housekeeping_update() may not have work functions that call cpus_read_lock() directly, but we are likely to extend the list of wq's that are flushed in the future. Moreover, the current set of work functions may hold locks that may have cpu_hotplug_lock down the dependency chain. So housekeeping_update() is now called after releasing cpus_read_lock and cpuset_mutex at the end of a cpuset operation. These two locks are then re-acquired later before calling rebuild_sched_domains_locked(). To enable mutual exclusion between the housekeeping_update() call and other cpuset control file write actions, a new top level cpuset_top_mutex is introduced. This new mutex will be acquired first to allow sharing variables used by both code paths. However, cpuset update from CPU hotplug can still happen in parallel with the housekeeping_update() call, though that should be rare in production environment. As cpus_read_lock() is now no longer held when tmigr_isolated_exclude_cpumask() is called, it needs to acquire it directly. The lockdep_is_cpuset_held() is also updated to return true if either cpuset_top_mutex or cpuset_mutex is held. Fixes: 03ff735 ("cpuset: Update HK_TYPE_DOMAIN cpumask from cpuset") Signed-off-by: Waiman Long <[email protected]> Signed-off-by: Tejun Heo <[email protected]>
1 parent 6df415a commit a84097e

3 files changed

Lines changed: 44 additions & 11 deletions

File tree

kernel/cgroup/cpuset.c

Lines changed: 42 additions & 5 deletions
Original file line numberDiff line numberDiff line change
@@ -65,14 +65,28 @@ static const char * const perr_strings[] = {
6565
* CPUSET Locking Convention
6666
* -------------------------
6767
*
68-
* Below are the three global locks guarding cpuset structures in lock
68+
* Below are the four global/local locks guarding cpuset structures in lock
6969
* acquisition order:
70+
* - cpuset_top_mutex
7071
* - cpu_hotplug_lock (cpus_read_lock/cpus_write_lock)
7172
* - cpuset_mutex
7273
* - callback_lock (raw spinlock)
7374
*
74-
* A task must hold all the three locks to modify externally visible or
75-
* used fields of cpusets, though some of the internally used cpuset fields
75+
* As cpuset will now indirectly flush a number of different workqueues in
76+
* housekeeping_update() to update housekeeping cpumasks when the set of
77+
* isolated CPUs is going to be changed, it may be vulnerable to deadlock
78+
* if we hold cpus_read_lock while calling into housekeeping_update().
79+
*
80+
* The first cpuset_top_mutex will be held except when calling into
81+
* cpuset_handle_hotplug() from the CPU hotplug code where cpus_write_lock
82+
* and cpuset_mutex will be held instead. The main purpose of this mutex
83+
* is to prevent regular cpuset control file write actions from interfering
84+
* with the call to housekeeping_update(), though CPU hotplug operation can
85+
* still happen in parallel. This mutex also provides protection for some
86+
* internal variables.
87+
*
88+
* A task must hold all the remaining three locks to modify externally visible
89+
* or used fields of cpusets, though some of the internally used cpuset fields
7690
* and internal variables can be modified without holding callback_lock. If only
7791
* reliable read access of the externally used fields are needed, a task can
7892
* hold either cpuset_mutex or callback_lock which are exposed to other
@@ -100,6 +114,7 @@ static const char * const perr_strings[] = {
100114
* cpumasks and nodemasks.
101115
*/
102116

117+
static DEFINE_MUTEX(cpuset_top_mutex);
103118
static DEFINE_MUTEX(cpuset_mutex);
104119

105120
/*
@@ -111,6 +126,8 @@ static DEFINE_MUTEX(cpuset_mutex);
111126
*
112127
* CSCB: Readable by holding either cpuset_mutex or callback_lock. Writable
113128
* by holding both cpuset_mutex and callback_lock.
129+
*
130+
* T: Read/write-able by holding the cpuset_top_mutex.
114131
*/
115132

116133
/*
@@ -134,6 +151,11 @@ static cpumask_var_t isolated_cpus; /* CSCB */
134151
*/
135152
static bool update_housekeeping; /* RWCS */
136153

154+
/*
155+
* Copy of isolated_cpus to be passed to housekeeping_update()
156+
*/
157+
static cpumask_var_t isolated_hk_cpus; /* T */
158+
137159
/*
138160
* A flag to force sched domain rebuild at the end of an operation.
139161
* It can be set in
@@ -297,6 +319,7 @@ void lockdep_assert_cpuset_lock_held(void)
297319
*/
298320
void cpuset_full_lock(void)
299321
{
322+
mutex_lock(&cpuset_top_mutex);
300323
cpus_read_lock();
301324
mutex_lock(&cpuset_mutex);
302325
}
@@ -305,12 +328,14 @@ void cpuset_full_unlock(void)
305328
{
306329
mutex_unlock(&cpuset_mutex);
307330
cpus_read_unlock();
331+
mutex_unlock(&cpuset_top_mutex);
308332
}
309333

310334
#ifdef CONFIG_LOCKDEP
311335
bool lockdep_is_cpuset_held(void)
312336
{
313-
return lockdep_is_held(&cpuset_mutex);
337+
return lockdep_is_held(&cpuset_mutex) ||
338+
lockdep_is_held(&cpuset_top_mutex);
314339
}
315340
#endif
316341

@@ -1315,9 +1340,20 @@ static void update_hk_sched_domains(void)
13151340
{
13161341
if (update_housekeeping) {
13171342
/* Updating HK cpumasks implies rebuild sched domains */
1318-
WARN_ON_ONCE(housekeeping_update(isolated_cpus));
13191343
update_housekeeping = false;
13201344
force_sd_rebuild = true;
1345+
cpumask_copy(isolated_hk_cpus, isolated_cpus);
1346+
1347+
/*
1348+
* housekeeping_update() is now called without holding
1349+
* cpus_read_lock and cpuset_mutex. Only cpuset_top_mutex
1350+
* is still being held for mutual exclusion.
1351+
*/
1352+
mutex_unlock(&cpuset_mutex);
1353+
cpus_read_unlock();
1354+
WARN_ON_ONCE(housekeeping_update(isolated_hk_cpus));
1355+
cpus_read_lock();
1356+
mutex_lock(&cpuset_mutex);
13211357
}
13221358
/* force_sd_rebuild will be cleared in rebuild_sched_domains_locked() */
13231359
if (force_sd_rebuild)
@@ -3635,6 +3671,7 @@ int __init cpuset_init(void)
36353671
BUG_ON(!alloc_cpumask_var(&top_cpuset.exclusive_cpus, GFP_KERNEL));
36363672
BUG_ON(!zalloc_cpumask_var(&subpartitions_cpus, GFP_KERNEL));
36373673
BUG_ON(!zalloc_cpumask_var(&isolated_cpus, GFP_KERNEL));
3674+
BUG_ON(!zalloc_cpumask_var(&isolated_hk_cpus, GFP_KERNEL));
36383675

36393676
cpumask_setall(top_cpuset.cpus_allowed);
36403677
nodes_setall(top_cpuset.mems_allowed);

kernel/sched/isolation.c

Lines changed: 1 addition & 3 deletions
Original file line numberDiff line numberDiff line change
@@ -123,8 +123,6 @@ int housekeeping_update(struct cpumask *isol_mask)
123123
struct cpumask *trial, *old = NULL;
124124
int err;
125125

126-
lockdep_assert_cpus_held();
127-
128126
trial = kmalloc(cpumask_size(), GFP_KERNEL);
129127
if (!trial)
130128
return -ENOMEM;
@@ -136,7 +134,7 @@ int housekeeping_update(struct cpumask *isol_mask)
136134
}
137135

138136
if (!housekeeping.flags)
139-
static_branch_enable_cpuslocked(&housekeeping_overridden);
137+
static_branch_enable(&housekeeping_overridden);
140138

141139
if (housekeeping.flags & HK_FLAG_DOMAIN)
142140
old = housekeeping_cpumask_dereference(HK_TYPE_DOMAIN);

kernel/time/timer_migration.c

Lines changed: 1 addition & 3 deletions
Original file line numberDiff line numberDiff line change
@@ -1559,8 +1559,6 @@ int tmigr_isolated_exclude_cpumask(struct cpumask *exclude_cpumask)
15591559
cpumask_var_t cpumask __free(free_cpumask_var) = CPUMASK_VAR_NULL;
15601560
int cpu;
15611561

1562-
lockdep_assert_cpus_held();
1563-
15641562
if (!works)
15651563
return -ENOMEM;
15661564
if (!alloc_cpumask_var(&cpumask, GFP_KERNEL))
@@ -1570,6 +1568,7 @@ int tmigr_isolated_exclude_cpumask(struct cpumask *exclude_cpumask)
15701568
* First set previously isolated CPUs as available (unisolate).
15711569
* This cpumask contains only CPUs that switched to available now.
15721570
*/
1571+
guard(cpus_read_lock)();
15731572
cpumask_andnot(cpumask, cpu_online_mask, exclude_cpumask);
15741573
cpumask_andnot(cpumask, cpumask, tmigr_available_cpumask);
15751574

@@ -1626,7 +1625,6 @@ static int __init tmigr_init_isolation(void)
16261625
cpumask_andnot(cpumask, cpu_possible_mask, housekeeping_cpumask(HK_TYPE_DOMAIN));
16271626

16281627
/* Protect against RCU torture hotplug testing */
1629-
guard(cpus_read_lock)();
16301628
return tmigr_isolated_exclude_cpumask(cpumask);
16311629
}
16321630
late_initcall(tmigr_init_isolation);

0 commit comments

Comments
 (0)