
Commit 5bdb407

Merge tag 'sched_ext-for-7.1' of git://git.kernel.org/pub/scm/linux/kernel/git/tj/sched_ext
Pull sched_ext updates from Tejun Heo:

 - cgroup sub-scheduler groundwork. Multiple BPF schedulers can be
   attached to cgroups and the dispatch path is made hierarchical. This
   involves substantial restructuring of the core dispatch, bypass,
   watchdog, and dump paths to be per-scheduler, along with new
   infrastructure for scheduler ownership enforcement, lifecycle
   management, and cgroup subtree iteration. The enqueue path is not yet
   updated and will follow in a later cycle.

 - scx_bpf_dsq_reenq() generalized to support any DSQ including remote
   local DSQs and user DSQs. Built on top of this, SCX_ENQ_IMMED
   guarantees that tasks dispatched to local DSQs either run immediately
   or get reenqueued back through ops.enqueue(), giving schedulers
   tighter control over queueing latency. Also useful for opportunistic
   CPU sharing across sub-schedulers.

 - ops.dequeue() was only invoked when the core knew a task was in BPF
   data structures, missing scheduling property change events and
   skipping callbacks for non-local DSQ dispatches from
   ops.select_cpu(). Fixed to guarantee exactly one ops.dequeue() call
   when a task leaves BPF scheduler custody.

 - Kfunc access validation moved from runtime to BPF verifier time,
   removing runtime mask enforcement.

 - Idle SMT sibling prioritization in the idle CPU selection path.

 - Documentation, selftest, and tooling updates.
 - Misc bug fixes and cleanups

* tag 'sched_ext-for-7.1' of git://git.kernel.org/pub/scm/linux/kernel/git/tj/sched_ext: (134 commits)
  tools/sched_ext: Add explicit cast from void* in RESIZE_ARRAY()
  sched_ext: Make string params of __ENUM_set() const
  tools/sched_ext: Kick home CPU for stranded tasks in scx_qmap
  sched_ext: Drop spurious warning on kick during scheduler disable
  sched_ext: Warn on task-based SCX op recursion
  sched_ext: Rename scx_kf_allowed_on_arg_tasks() to scx_kf_arg_task_ok()
  sched_ext: Remove runtime kfunc mask enforcement
  sched_ext: Add verifier-time kfunc context filter
  sched_ext: Drop redundant rq-locked check from scx_bpf_task_cgroup()
  sched_ext: Decouple kfunc unlocked-context check from kf_mask
  sched_ext: Fix ops.cgroup_move() invocation kf_mask and rq tracking
  sched_ext: Track @p's rq lock across set_cpus_allowed_scx -> ops.set_cpumask
  sched_ext: Add select_cpu kfuncs to scx_kfunc_ids_unlocked
  sched_ext: Drop TRACING access to select_cpu kfuncs
  selftests/sched_ext: Fix wrong DSQ ID in peek_dsq error message
  sched_ext: Documentation: improve accuracy of task lifecycle pseudo-code
  selftests/sched_ext: Improve runner error reporting for invalid arguments
  sched_ext: Documentation: Fix scx_bpf_move_to_local kfunc name
  sched_ext: Documentation: Add ops.dequeue() to task lifecycle
  tools/sched_ext: Fix off-by-one in scx_sdt payload zeroing
  ...
2 parents 7de6b4a + 7e311ba commit 5bdb407

47 files changed: 5356 additions & 1365 deletions


Documentation/scheduler/sched-ext.rst

Lines changed: 187 additions & 18 deletions
@@ -93,6 +93,55 @@ scheduler has been loaded):
     # cat /sys/kernel/sched_ext/enable_seq
     1
 
+Each running scheduler also exposes a per-scheduler ``events`` file under
+``/sys/kernel/sched_ext/<scheduler-name>/events`` that tracks diagnostic
+counters. Each counter occupies one ``name value`` line:
+
+.. code-block:: none
+
+    # cat /sys/kernel/sched_ext/simple/events
+    SCX_EV_SELECT_CPU_FALLBACK 0
+    SCX_EV_DISPATCH_LOCAL_DSQ_OFFLINE 0
+    SCX_EV_DISPATCH_KEEP_LAST 123
+    SCX_EV_ENQ_SKIP_EXITING 0
+    SCX_EV_ENQ_SKIP_MIGRATION_DISABLED 0
+    SCX_EV_REENQ_IMMED 0
+    SCX_EV_REENQ_LOCAL_REPEAT 0
+    SCX_EV_REFILL_SLICE_DFL 456789
+    SCX_EV_BYPASS_DURATION 0
+    SCX_EV_BYPASS_DISPATCH 0
+    SCX_EV_BYPASS_ACTIVATE 0
+    SCX_EV_INSERT_NOT_OWNED 0
+    SCX_EV_SUB_BYPASS_DISPATCH 0
+
+The counters are described in ``kernel/sched/ext_internal.h``; briefly:
+
+* ``SCX_EV_SELECT_CPU_FALLBACK``: ops.select_cpu() returned a CPU unusable by
+  the task and the core scheduler silently picked a fallback CPU.
+* ``SCX_EV_DISPATCH_LOCAL_DSQ_OFFLINE``: a local-DSQ dispatch was redirected
+  to the global DSQ because the target CPU went offline.
+* ``SCX_EV_DISPATCH_KEEP_LAST``: a task continued running because no other
+  task was available (only when ``SCX_OPS_ENQ_LAST`` is not set).
+* ``SCX_EV_ENQ_SKIP_EXITING``: an exiting task was dispatched to the local
+  DSQ directly, bypassing ops.enqueue() (only when ``SCX_OPS_ENQ_EXITING``
+  is not set).
+* ``SCX_EV_ENQ_SKIP_MIGRATION_DISABLED``: a migration-disabled task was
+  dispatched to its local DSQ directly (only when
+  ``SCX_OPS_ENQ_MIGRATION_DISABLED`` is not set).
+* ``SCX_EV_REENQ_IMMED``: a task dispatched with ``SCX_ENQ_IMMED`` was
+  re-enqueued because the target CPU was not available for immediate
+  execution.
+* ``SCX_EV_REENQ_LOCAL_REPEAT``: a reenqueue of the local DSQ triggered
+  another reenqueue; recurring counts indicate incorrect ``SCX_ENQ_REENQ``
+  handling in the BPF scheduler.
+* ``SCX_EV_REFILL_SLICE_DFL``: a task's time slice was refilled with the
+  default value (``SCX_SLICE_DFL``).
+* ``SCX_EV_BYPASS_DURATION``: total nanoseconds spent in bypass mode.
+* ``SCX_EV_BYPASS_DISPATCH``: number of tasks dispatched while in bypass mode.
+* ``SCX_EV_BYPASS_ACTIVATE``: number of times bypass mode was activated.
+* ``SCX_EV_INSERT_NOT_OWNED``: attempted to insert a task not owned by this
+  scheduler into a DSQ; such attempts are silently ignored.
+* ``SCX_EV_SUB_BYPASS_DISPATCH``: tasks dispatched from sub-scheduler bypass
+  DSQs (only relevant with ``CONFIG_EXT_SUB_SCHED``).
+
 ``tools/sched_ext/scx_show_state.py`` is a drgn script which shows more
 detailed information:
 
@@ -228,16 +277,23 @@ The following briefly shows how a waking task is scheduled and executed.
    scheduler can wake up any cpu using the ``scx_bpf_kick_cpu()`` helper,
    using ``ops.select_cpu()`` judiciously can be simpler and more efficient.
 
-   A task can be immediately inserted into a DSQ from ``ops.select_cpu()``
-   by calling ``scx_bpf_dsq_insert()``. If the task is inserted into
-   ``SCX_DSQ_LOCAL`` from ``ops.select_cpu()``, it will be inserted into the
-   local DSQ of whichever CPU is returned from ``ops.select_cpu()``.
-   Additionally, inserting directly from ``ops.select_cpu()`` will cause the
-   ``ops.enqueue()`` callback to be skipped.
-
    Note that the scheduler core will ignore an invalid CPU selection, for
    example, if it's outside the allowed cpumask of the task.
 
+   A task can be immediately inserted into a DSQ from ``ops.select_cpu()``
+   by calling ``scx_bpf_dsq_insert()`` or ``scx_bpf_dsq_insert_vtime()``.
+
+   If the task is inserted into ``SCX_DSQ_LOCAL`` from
+   ``ops.select_cpu()``, it will be added to the local DSQ of whichever CPU
+   is returned from ``ops.select_cpu()``. Additionally, inserting directly
+   from ``ops.select_cpu()`` will cause the ``ops.enqueue()`` callback to
+   be skipped.
+
+   Any other attempt to store a task in BPF-internal data structures from
+   ``ops.select_cpu()`` does not prevent ``ops.enqueue()`` from being
+   invoked. This is discouraged, as it can introduce racy behavior or
+   inconsistent state.
+
 2. Once the target CPU is selected, ``ops.enqueue()`` is invoked (unless the
    task was inserted directly from ``ops.select_cpu()``). ``ops.enqueue()``
    can make one of the following decisions:
@@ -251,6 +307,61 @@ The following briefly shows how a waking task is scheduled and executed.
 
    * Queue the task on the BPF side.
 
+   **Task State Tracking and ops.dequeue() Semantics**
+
+   A task is in the "BPF scheduler's custody" when the BPF scheduler is
+   responsible for managing its lifecycle. A task enters custody when it is
+   dispatched to a user DSQ or stored in the BPF scheduler's internal data
+   structures. Custody is entered only from ``ops.enqueue()`` for those
+   operations. The only exception is dispatching to a user DSQ from
+   ``ops.select_cpu()``: although the task is not yet technically in BPF
+   scheduler custody at that point, the dispatch has the same semantic
+   effect as dispatching from ``ops.enqueue()`` for custody-related
+   purposes.
+
+   Once ``ops.enqueue()`` is called, the task may or may not enter custody
+   depending on what the scheduler does:
+
+   * **Directly dispatched to terminal DSQs** (``SCX_DSQ_LOCAL``,
+     ``SCX_DSQ_LOCAL_ON | cpu``, or ``SCX_DSQ_GLOBAL``): the BPF scheduler
+     is done with the task - it either goes straight to a CPU's local run
+     queue or to the global DSQ as a fallback. The task never enters (or
+     exits) BPF custody, and ``ops.dequeue()`` will not be called.
+
+   * **Dispatch to user-created DSQs** (custom DSQs): the task enters the
+     BPF scheduler's custody. When the task later leaves BPF custody
+     (dispatched to a terminal DSQ, picked by core-sched, or dequeued for
+     sleep/property changes), ``ops.dequeue()`` will be called exactly
+     once.
+
+   * **Stored in BPF data structures** (e.g., internal BPF queues): the
+     task is in BPF custody. ``ops.dequeue()`` will be called when it
+     leaves (e.g., when ``ops.dispatch()`` moves it to a terminal DSQ, or
+     on property change / sleep).
+
+   When a task leaves BPF scheduler custody, ``ops.dequeue()`` is invoked.
+   The dequeue can happen for different reasons, distinguished by flags:
+
+   1. **Regular dispatch**: when a task in BPF custody is dispatched to a
+      terminal DSQ from ``ops.dispatch()`` (leaving BPF custody for
+      execution), ``ops.dequeue()`` is triggered without any special flags.
+
+   2. **Core scheduling pick**: when ``CONFIG_SCHED_CORE`` is enabled and
+      core scheduling picks a task for execution while it's still in BPF
+      custody, ``ops.dequeue()`` is called with the
+      ``SCX_DEQ_CORE_SCHED_EXEC`` flag.
+
+   3. **Scheduling property change**: when a task property changes (via
+      operations like ``sched_setaffinity()``, ``sched_setscheduler()``,
+      priority changes, CPU migrations, etc.) while the task is still in
+      BPF custody, ``ops.dequeue()`` is called with the
+      ``SCX_DEQ_SCHED_CHANGE`` flag set in ``deq_flags``.
+
+   **Important**: Once a task has left BPF custody (e.g., after being
+   dispatched to a terminal DSQ), property changes will not trigger
+   ``ops.dequeue()``, since the task is no longer managed by the BPF
+   scheduler.
+
 3. When a CPU is ready to schedule, it first looks at its local DSQ. If
    empty, it then looks at the global DSQ. If there still isn't a task to
    run, ``ops.dispatch()`` is invoked which can use the following two
@@ -264,9 +375,9 @@ The following briefly shows how a waking task is scheduled and executed.
      rather than performing them immediately. There can be up to
      ``ops.dispatch_max_batch`` pending tasks.
 
-   * ``scx_bpf_move_to_local()`` moves a task from the specified non-local
+   * ``scx_bpf_dsq_move_to_local()`` moves a task from the specified non-local
      DSQ to the dispatching DSQ. This function cannot be called with any BPF
-     locks held. ``scx_bpf_move_to_local()`` flushes the pending insertions
+     locks held. ``scx_bpf_dsq_move_to_local()`` flushes the pending insertions
      tasks before trying to move from the specified DSQ.
 
 4. After ``ops.dispatch()`` returns, if there are tasks in the local DSQ,
@@ -297,8 +408,8 @@ for more information.
 Task Lifecycle
 --------------
 
-The following pseudo-code summarizes the entire lifecycle of a task managed
-by a sched_ext scheduler:
+The following pseudo-code presents a rough overview of the entire lifecycle
+of a task managed by a sched_ext scheduler:
 
 .. code-block:: c
 
@@ -311,22 +422,37 @@ by a sched_ext scheduler:
 
     ops.runnable();             /* Task becomes ready to run */
 
-    while (task is runnable) {
-        if (task is not in a DSQ && task->scx.slice == 0) {
+    while (task_is_runnable(task)) {
+        if (task is not in a DSQ || task->scx.slice == 0) {
             ops.enqueue();      /* Task can be added to a DSQ */
 
+            /* Task property change (i.e., affinity, nice, etc.)? */
+            if (sched_change(task)) {
+                ops.dequeue();  /* Exiting BPF scheduler custody */
+                ops.quiescent();
+
+                /* Property change callback, e.g. ops.set_weight() */
+
+                ops.runnable();
+                continue;
+            }
+
             /* Any usable CPU becomes available */
 
-            ops.dispatch();     /* Task is moved to a local DSQ */
+            ops.dispatch();     /* Task is moved to a local DSQ */
+            ops.dequeue();      /* Exiting BPF scheduler custody */
         }
+
         ops.running();          /* Task starts running on its assigned CPU */
-        while (task->scx.slice > 0 && task is runnable)
+
+        while (task_is_runnable(task) && task->scx.slice > 0) {
             ops.tick();         /* Called every 1/HZ seconds */
-        ops.stopping();         /* Task stops running (time slice expires or wait) */
 
-        /* Task's CPU becomes available */
+            if (task->scx.slice == 0)
+                ops.dispatch(); /* task->scx.slice can be refilled */
+        }
 
-        ops.dispatch();         /* task->scx.slice can be refilled */
+        ops.stopping();         /* Task stops running (time slice expires or wait) */
     }
 
     ops.quiescent();            /* Task releases its assigned CPU (wait) */
@@ -335,6 +461,30 @@ by a sched_ext scheduler:
     ops.disable();              /* Disable BPF scheduling for the task */
     ops.exit_task();            /* Task is destroyed */
 
+Note that the above pseudo-code does not cover all possible state transitions
+and edge cases, to name a few examples:
+
+* ``ops.dispatch()`` may fail to move the task to a local DSQ due to a racing
+  property change on that task, in which case ``ops.dispatch()`` will be
+  retried.
+
+* The task may be direct-dispatched to a local DSQ from ``ops.enqueue()``,
+  in which case ``ops.dispatch()`` and ``ops.dequeue()`` are skipped and we go
+  straight to ``ops.running()``.
+
+* Property changes may occur at virtually any point during the task's lifecycle,
+  not just when the task is queued and waiting to be dispatched. For example,
+  changing a property of a running task will lead to the callback sequence
+  ``ops.stopping()`` -> ``ops.quiescent()`` -> (property change callback) ->
+  ``ops.runnable()`` -> ``ops.running()``.
+
+* A sched_ext task can be preempted by a task from a higher-priority scheduling
+  class, in which case it will exit the tick-dispatch loop even though it is
+  runnable and has a non-zero slice.
+
+See the "Scheduling Cycle" section for a more detailed description of how
+a freshly woken up task gets on a CPU.
+
 Where to Look
 =============
 
@@ -377,6 +527,25 @@ Where to Look
   scheduling. Tasks with CPU affinity are direct-dispatched in FIFO order;
   all others are scheduled in user space by a simple vruntime scheduler.
 
+Module Parameters
+=================
+
+sched_ext exposes two module parameters under the ``sched_ext.`` prefix that
+control bypass-mode behaviour. These knobs are primarily for debugging; there
+is usually no reason to change them during normal operation. They can be read
+and written at runtime (mode 0600) via
+``/sys/module/sched_ext/parameters/``.
+
+``sched_ext.slice_bypass_us`` (default: 5000 µs)
+    The time slice assigned to all tasks when the scheduler is in bypass mode,
+    i.e. during BPF scheduler load, unload, and error recovery. Valid range is
+    100 µs to 100 ms.
+
+``sched_ext.bypass_lb_intv_us`` (default: 500000 µs)
+    The interval at which the bypass-mode load balancer redistributes tasks
+    across CPUs. Set to 0 to disable load balancing during bypass mode. Valid
+    range is 0 to 10 s.
+
 ABI Instability
 ===============
 
include/linux/cgroup-defs.h

Lines changed: 4 additions & 0 deletions
@@ -17,6 +17,7 @@
 #include <linux/refcount.h>
 #include <linux/percpu-refcount.h>
 #include <linux/percpu-rwsem.h>
+#include <linux/sched.h>
 #include <linux/u64_stats_sync.h>
 #include <linux/workqueue.h>
 #include <linux/bpf-cgroup-defs.h>
@@ -628,6 +629,9 @@ struct cgroup {
 #ifdef CONFIG_BPF_SYSCALL
         struct bpf_local_storage __rcu *bpf_cgrp_storage;
 #endif
+#ifdef CONFIG_EXT_SUB_SCHED
+        struct scx_sched __rcu *scx_sched;
+#endif
 
         /* All ancestors including self */
         union {
