Skip to content

Commit 4c56a8a

Browse files
committed
cgroup: Fix cgroup_drain_dying() testing the wrong condition
cgroup_drain_dying() was using cgroup_is_populated() to test whether there are dying tasks to wait for. cgroup_is_populated() tests nr_populated_csets, nr_populated_domain_children and nr_populated_threaded_children, but cgroup_drain_dying() only needs to care about this cgroup's own tasks - whether there are children is cgroup_destroy_locked()'s concern. This caused hangs during shutdown. When systemd tried to rmdir a cgroup that had no direct tasks but had a populated child, cgroup_drain_dying() would enter its wait loop because cgroup_is_populated() was true from nr_populated_domain_children. The task iterator found nothing to wait for, yet the populated state never cleared because it was driven by live tasks in the child cgroup. Fix it by using cgroup_has_tasks() which only tests nr_populated_csets. v3: Fix cgroup_is_populated() -> cgroup_has_tasks() (Sebastian). v2: https://lore.kernel.org/r/[email protected] Reported-by: Sebastian Andrzej Siewior <[email protected]> Fixes: 1b164b8 ("cgroup: Wait for dying tasks to leave on rmdir") Signed-off-by: Tejun Heo <[email protected]> Tested-by: Sebastian Andrzej Siewior <[email protected]>
1 parent 6680c16 commit 4c56a8a

1 file changed

Lines changed: 22 additions & 20 deletions

File tree

kernel/cgroup/cgroup.c

Lines changed: 22 additions & 20 deletions
Original file line numberDiff line numberDiff line change
@@ -6229,20 +6229,22 @@ static int cgroup_destroy_locked(struct cgroup *cgrp)
62296229
* cgroup_drain_dying - wait for dying tasks to leave before rmdir
62306230
* @cgrp: the cgroup being removed
62316231
*
6232-
* The PF_EXITING filter in css_task_iter_advance() hides exiting tasks from
6233-
* cgroup.procs so that userspace (e.g. systemd) doesn't see tasks that have
6234-
* already been reaped via waitpid(). However, the populated counter
6235-
* (nr_populated_csets) is only decremented when the task later passes through
6232+
* cgroup.procs and cgroup.threads use css_task_iter which filters out
6233+
* PF_EXITING tasks so that userspace doesn't see tasks that have already been
6234+
* reaped via waitpid(). However, cgroup_has_tasks() - which tests whether the
6235+
* cgroup has non-empty css_sets - is only updated when dying tasks pass through
62366236
* cgroup_task_dead() in finish_task_switch(). This creates a window where
6237-
* cgroup.procs appears empty but cgroup_is_populated() is still true, causing
6238-
* rmdir to fail with -EBUSY.
6237+
* cgroup.procs reads empty but cgroup_has_tasks() is still true, making rmdir
6238+
* fail with -EBUSY from cgroup_destroy_locked() even though userspace sees no
6239+
* tasks.
6240+
*
6241+
* This function aligns cgroup_has_tasks() with what userspace can observe. If
6242+
* cgroup_has_tasks() but the task iterator sees nothing (all remaining tasks are
6243+
* PF_EXITING), we wait for cgroup_task_dead() to finish processing them. As the
6244+
* window between PF_EXITING and cgroup_task_dead() is short, the wait is brief.
62396245
*
6240-
* This function bridges that gap. If the cgroup is populated but all remaining
6241-
* tasks have PF_EXITING set, we wait for cgroup_task_dead() to process them.
6242-
* Tasks are removed from the cgroup's css_set in cgroup_task_dead() called from
6243-
* finish_task_switch(). As the window between PF_EXITING and cgroup_task_dead()
6244-
* is short, the number of PF_EXITING tasks on the list is small and the wait
6245-
* is brief.
6246+
* This function only concerns itself with this cgroup's own dying tasks.
6247+
* Whether the cgroup has children is cgroup_destroy_locked()'s problem.
62466248
*
62476249
* Each cgroup_task_dead() kicks the waitqueue via cset->cgrp_links, and we
62486250
* retry the full check from scratch.
@@ -6258,7 +6260,7 @@ static int cgroup_drain_dying(struct cgroup *cgrp)
62586260

62596261
lockdep_assert_held(&cgroup_mutex);
62606262
retry:
6261-
if (!cgroup_is_populated(cgrp))
6263+
if (!cgroup_has_tasks(cgrp))
62626264
return 0;
62636265

62646266
/* Same iterator as cgroup.threads - if any task is visible, it's busy */
@@ -6273,15 +6275,15 @@ static int cgroup_drain_dying(struct cgroup *cgrp)
62736275
* All remaining tasks are PF_EXITING and will pass through
62746276
* cgroup_task_dead() shortly. Wait for a kick and retry.
62756277
*
6276-
* cgroup_is_populated() can't transition from false to true while
6277-
* we're holding cgroup_mutex, but the true to false transition
6278-
* happens under css_set_lock (via cgroup_task_dead()). We must
6279-
* retest and prepare_to_wait() under css_set_lock. Otherwise, the
6280-
* transition can happen between our first test and
6281-
* prepare_to_wait(), and we sleep with no one to wake us.
6278+
* cgroup_has_tasks() can't transition from false to true while we're
6279+
* holding cgroup_mutex, but the true to false transition happens
6280+
* under css_set_lock (via cgroup_task_dead()). We must retest and
6281+
* prepare_to_wait() under css_set_lock. Otherwise, the transition
6282+
* can happen between our first test and prepare_to_wait(), and we
6283+
* sleep with no one to wake us.
62826284
*/
62836285
spin_lock_irq(&css_set_lock);
6284-
if (!cgroup_is_populated(cgrp)) {
6286+
if (!cgroup_has_tasks(cgrp)) {
62856287
spin_unlock_irq(&css_set_lock);
62866288
return 0;
62876289
}

0 commit comments

Comments
 (0)