@@ -93,6 +93,55 @@ scheduler has been loaded):
9393 # cat /sys/kernel/sched_ext/enable_seq
9494 1
9595
96+ Each running scheduler also exposes a per-scheduler ``events`` file under
97+ ``/sys/kernel/sched_ext/<scheduler-name>/events`` that tracks diagnostic
98+ counters. Each counter occupies one ``name value`` line:
99+
100+ .. code-block:: none
101+
102+ # cat /sys/kernel/sched_ext/simple/events
103+ SCX_EV_SELECT_CPU_FALLBACK 0
104+ SCX_EV_DISPATCH_LOCAL_DSQ_OFFLINE 0
105+ SCX_EV_DISPATCH_KEEP_LAST 123
106+ SCX_EV_ENQ_SKIP_EXITING 0
107+ SCX_EV_ENQ_SKIP_MIGRATION_DISABLED 0
108+ SCX_EV_REENQ_IMMED 0
109+ SCX_EV_REENQ_LOCAL_REPEAT 0
110+ SCX_EV_REFILL_SLICE_DFL 456789
111+ SCX_EV_BYPASS_DURATION 0
112+ SCX_EV_BYPASS_DISPATCH 0
113+ SCX_EV_BYPASS_ACTIVATE 0
114+ SCX_EV_INSERT_NOT_OWNED 0
115+ SCX_EV_SUB_BYPASS_DISPATCH 0
116+
117+ The counters are described in ``kernel/sched/ext_internal.h``; briefly:
118+
119+ * ``SCX_EV_SELECT_CPU_FALLBACK``: ``ops.select_cpu()`` returned a CPU unusable by
120+ the task and the core scheduler silently picked a fallback CPU.
121+ * ``SCX_EV_DISPATCH_LOCAL_DSQ_OFFLINE``: a local-DSQ dispatch was redirected
122+ to the global DSQ because the target CPU went offline.
123+ * ``SCX_EV_DISPATCH_KEEP_LAST``: a task continued running because no other
124+ task was available (only when ``SCX_OPS_ENQ_LAST`` is not set).
125+ * ``SCX_EV_ENQ_SKIP_EXITING``: an exiting task was dispatched to the local DSQ
126+ directly, bypassing ``ops.enqueue()`` (only when ``SCX_OPS_ENQ_EXITING`` is not set).
127+ * ``SCX_EV_ENQ_SKIP_MIGRATION_DISABLED``: a migration-disabled task was
128+ dispatched to its local DSQ directly (only when
129+ ``SCX_OPS_ENQ_MIGRATION_DISABLED`` is not set).
130+ * ``SCX_EV_REENQ_IMMED``: a task dispatched with ``SCX_ENQ_IMMED`` was
131+ re-enqueued because the target CPU was not available for immediate execution.
132+ * ``SCX_EV_REENQ_LOCAL_REPEAT``: a reenqueue of the local DSQ triggered
133+ another reenqueue; recurring counts indicate incorrect ``SCX_ENQ_REENQ``
134+ handling in the BPF scheduler.
135+ * ``SCX_EV_REFILL_SLICE_DFL``: a task's time slice was refilled with the
136+ default value (``SCX_SLICE_DFL``).
137+ * ``SCX_EV_BYPASS_DURATION``: total nanoseconds spent in bypass mode.
138+ * ``SCX_EV_BYPASS_DISPATCH``: number of tasks dispatched while in bypass mode.
139+ * ``SCX_EV_BYPASS_ACTIVATE``: number of times bypass mode was activated.
140+ * ``SCX_EV_INSERT_NOT_OWNED``: attempted to insert a task not owned by this
141+ scheduler into a DSQ; such attempts are silently ignored.
142+ * ``SCX_EV_SUB_BYPASS_DISPATCH``: tasks dispatched from sub-scheduler bypass
143+ DSQs (only relevant with ``CONFIG_EXT_SUB_SCHED``).
144+
96145``tools/sched_ext/scx_show_state.py`` is a drgn script which shows more
97146detailed information:
98147
@@ -228,16 +277,23 @@ The following briefly shows how a waking task is scheduled and executed.
228277 scheduler can wake up any CPU using the ``scx_bpf_kick_cpu()`` helper,
229278 using ``ops.select_cpu()`` judiciously can be simpler and more efficient.
230279
231- A task can be immediately inserted into a DSQ from ``ops.select_cpu()``
232- by calling ``scx_bpf_dsq_insert()``. If the task is inserted into
233- ``SCX_DSQ_LOCAL`` from ``ops.select_cpu()``, it will be inserted into the
234- local DSQ of whichever CPU is returned from ``ops.select_cpu()``.
235- Additionally, inserting directly from ``ops.select_cpu()`` will cause the
236- ``ops.enqueue()`` callback to be skipped.
237-
238280 Note that the scheduler core will ignore an invalid CPU selection, for
239281 example, if it's outside the allowed cpumask of the task.
240282
283+ A task can be immediately inserted into a DSQ from ``ops.select_cpu()``
284+ by calling ``scx_bpf_dsq_insert()`` or ``scx_bpf_dsq_insert_vtime()``.
285+
286+ If the task is inserted into ``SCX_DSQ_LOCAL`` from
287+ ``ops.select_cpu()``, it will be added to the local DSQ of whichever CPU
288+ is returned from ``ops.select_cpu()``. Additionally, inserting directly
289+ from ``ops.select_cpu()`` will cause the ``ops.enqueue()`` callback to
290+ be skipped.
291+
292+ Any other attempt to store a task in BPF-internal data structures from
293+ ``ops.select_cpu()`` does not prevent ``ops.enqueue()`` from being
294+ invoked. This is discouraged, as it can introduce racy behavior or
295+ inconsistent state.
296+
2412972. Once the target CPU is selected, ``ops.enqueue()`` is invoked (unless the
242298 task was inserted directly from ``ops.select_cpu()``). ``ops.enqueue()``
243299 can make one of the following decisions:
@@ -251,6 +307,61 @@ The following briefly shows how a waking task is scheduled and executed.
251307
252308 * Queue the task on the BPF side.
253309
310+ **Task State Tracking and ops.dequeue() Semantics**
311+
312+ A task is in the "BPF scheduler's custody" when the BPF scheduler is
313+ responsible for managing its lifecycle. A task enters custody when it is
314+ dispatched to a user DSQ or stored in the BPF scheduler's internal data
315+ structures. Custody is entered only from ``ops.enqueue()`` for those
316+ operations. The only exception is dispatching to a user DSQ from
317+ ``ops.select_cpu()``: although the task is not yet technically in BPF
318+ scheduler custody at that point, the dispatch has the same semantic
319+ effect as dispatching from ``ops.enqueue()`` for custody-related
320+ purposes.
321+
322+ Once ``ops.enqueue()`` is called, the task may or may not enter custody
323+ depending on what the scheduler does:
324+
325+ * **Directly dispatched to terminal DSQs** (``SCX_DSQ_LOCAL``,
326+ ``SCX_DSQ_LOCAL_ON | cpu``, or ``SCX_DSQ_GLOBAL``): the BPF scheduler
327+ is done with the task, which either goes straight to a CPU's local run
328+ queue or to the global DSQ as a fallback. The task never enters (or
329+ exits) BPF custody, and ``ops.dequeue()`` will not be called.
330+
331+ * **Dispatch to user-created DSQs** (custom DSQs): the task enters the
332+ BPF scheduler's custody. When the task later leaves BPF custody
333+ (dispatched to a terminal DSQ, picked by core-sched, or dequeued for
334+ sleep/property changes), ``ops.dequeue()`` will be called exactly
335+ once.
336+
337+ * **Stored in BPF data structures** (e.g., internal BPF queues): the
338+ task is in BPF custody. ``ops.dequeue()`` will be called when it
339+ leaves (e.g., when ``ops.dispatch()`` moves it to a terminal DSQ, or
340+ on property change / sleep).
341+
342+ When a task leaves BPF scheduler custody, ``ops.dequeue()`` is invoked.
343+ The dequeue can happen for different reasons, distinguished by flags:
344+
345+ 1. **Regular dispatch**: when a task in BPF custody is dispatched to a
346+ terminal DSQ from ``ops.dispatch()`` (leaving BPF custody for
347+ execution), ``ops.dequeue()`` is triggered without any special flags.
348+
349+ 2. **Core scheduling pick**: when ``CONFIG_SCHED_CORE`` is enabled and
350+ core scheduling picks a task for execution while it's still in BPF
351+ custody, ``ops.dequeue()`` is called with the
352+ ``SCX_DEQ_CORE_SCHED_EXEC`` flag.
353+
354+ 3. **Scheduling property change**: when a task property changes (via
355+ operations like ``sched_setaffinity()``, ``sched_setscheduler()``,
356+ priority changes, CPU migrations, etc.) while the task is still in
357+ BPF custody, ``ops.dequeue()`` is called with the
358+ ``SCX_DEQ_SCHED_CHANGE`` flag set in ``deq_flags``.
359+
360+ **Important**: Once a task has left BPF custody (e.g., after being
361+ dispatched to a terminal DSQ), property changes will not trigger
362+ ``ops.dequeue()``, since the task is no longer managed by the BPF
363+ scheduler.
364+
2543653. When a CPU is ready to schedule, it first looks at its local DSQ. If
255366 empty, it then looks at the global DSQ. If there still isn't a task to
256367 run, ``ops.dispatch()`` is invoked which can use the following two
@@ -264,9 +375,9 @@ The following briefly shows how a waking task is scheduled and executed.
264375 rather than performing them immediately. There can be up to
265376 ``ops.dispatch_max_batch`` pending tasks.
266377
267- * ``scx_bpf_move_to_local()`` moves a task from the specified non-local
378+ * ``scx_bpf_dsq_move_to_local()`` moves a task from the specified non-local
268379 DSQ to the dispatching CPU's local DSQ. This function cannot be called with any BPF
269- locks held. ``scx_bpf_move_to_local()`` flushes the pending insertions
380+ locks held. ``scx_bpf_dsq_move_to_local()`` flushes the pending insertions
270381 before trying to move from the specified DSQ.
271382
2723834. After ``ops.dispatch()`` returns, if there are tasks in the local DSQ,
@@ -297,8 +408,8 @@ for more information.
297408Task Lifecycle
298409--------------
299410
300- The following pseudo-code summarizes the entire lifecycle of a task managed
301- by a sched_ext scheduler:
411+ The following pseudo-code presents a rough overview of the entire lifecycle
412+ of a task managed by a sched_ext scheduler:
302413
303414.. code-block:: c
304415
@@ -311,22 +422,37 @@ by a sched_ext scheduler:
311422
312423 ops.runnable(); /* Task becomes ready to run */
313424
314- while (task is runnable) {
315- if (task is not in a DSQ && task->scx.slice == 0) {
425+ while (task_is_runnable(task)) {
426+ if (task is not in a DSQ || task->scx.slice == 0) {
316427 ops.enqueue(); /* Task can be added to a DSQ */
317428
429+ /* Task property change (e.g., affinity, nice, etc.)? */
430+ if (sched_change(task)) {
431+ ops.dequeue(); /* Exiting BPF scheduler custody */
432+ ops.quiescent();
433+
434+ /* Property change callback, e.g. ops.set_weight() */
435+
436+ ops.runnable();
437+ continue;
438+ }
439+
318440 /* Any usable CPU becomes available */
319441
320- ops.dispatch(); /* Task is moved to a local DSQ */
442+ ops.dispatch(); /* Task is moved to a local DSQ */
443+ ops.dequeue(); /* Exiting BPF scheduler custody */
321444 }
445+
322446 ops.running(); /* Task starts running on its assigned CPU */
323- while (task->scx.slice > 0 && task is runnable)
447+
448+ while (task_is_runnable(task) && task->scx.slice > 0) {
324449 ops.tick(); /* Called every 1/HZ seconds */
325- ops.stopping(); /* Task stops running (time slice expires or wait) */
326450
327- /* Task's CPU becomes available */
451+ if (task->scx.slice == 0)
452+ ops.dispatch(); /* task->scx.slice can be refilled */
453+ }
328454
329- ops.dispatch(); /* task->scx.slice can be refilled */
455+ ops.stopping(); /* Task stops running (time slice expires or wait) */
330456 }
331457
332458 ops.quiescent(); /* Task releases its assigned CPU (wait) */
@@ -335,6 +461,30 @@ by a sched_ext scheduler:
335461 ops.disable(); /* Disable BPF scheduling for the task */
336462 ops.exit_task(); /* Task is destroyed */
337463
464+ Note that the above pseudo-code does not cover all possible state transitions
465+ and edge cases. To name a few examples:
466+
467+ * ``ops.dispatch()`` may fail to move the task to a local DSQ due to a racing
468+ property change on that task, in which case ``ops.dispatch()`` will be
469+ retried.
470+
471+ * The task may be direct-dispatched to a local DSQ from ``ops.enqueue()``,
472+ in which case ``ops.dispatch()`` and ``ops.dequeue()`` are skipped and we go
473+ straight to ``ops.running()``.
474+
475+ * Property changes may occur at virtually any point during the task's lifecycle,
476+ not just when the task is queued and waiting to be dispatched. For example,
477+ changing a property of a running task will lead to the callback sequence
478+ ``ops.stopping()`` -> ``ops.quiescent()`` -> (property change callback) ->
479+ ``ops.runnable()`` -> ``ops.running()``.
480+
481+ * A sched_ext task can be preempted by a task from a higher-priority scheduling
482+ class, in which case it will exit the tick-dispatch loop even though it is runnable
483+ and has a non-zero slice.
484+
485+ See the "Scheduling Cycle" section for a more detailed description of how
486+ a freshly woken up task gets on a CPU.
487+
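The custody rules from the "Scheduling Cycle" section, which decide whether the ``ops.dequeue()`` calls in the pseudo-code happen at all, can be summarized as a small decision table. The sketch below is a plain user-space model for illustration only, not BPF scheduler code. Every ``model_*`` and ``MODEL_*`` name is hypothetical, and the flag bit values are placeholders that do not correspond to the kernel's ``SCX_DEQ_*`` definitions.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Placeholder bit values for this model only; NOT the kernel's values. */
#define MODEL_DEQ_F_CORE_SCHED_EXEC (UINT64_C(1) << 0)
#define MODEL_DEQ_F_SCHED_CHANGE    (UINT64_C(1) << 1)

static_assert((MODEL_DEQ_F_CORE_SCHED_EXEC & MODEL_DEQ_F_SCHED_CHANGE) == 0,
              "model flags must not overlap");

/* Where ops.enqueue() put the task (hypothetical model type). */
enum model_target {
    MODEL_TGT_LOCAL,        /* SCX_DSQ_LOCAL: terminal */
    MODEL_TGT_LOCAL_ON,     /* SCX_DSQ_LOCAL_ON | cpu: terminal */
    MODEL_TGT_GLOBAL,       /* SCX_DSQ_GLOBAL: terminal */
    MODEL_TGT_USER_DSQ,     /* user-created DSQ */
    MODEL_TGT_BPF_STORAGE,  /* BPF-internal data structure */
};

/*
 * A task enters BPF custody only for user DSQs and BPF-side storage.
 * ops.dequeue() is expected later exactly when this returns true.
 */
static bool model_enters_custody(enum model_target tgt)
{
    switch (tgt) {
    case MODEL_TGT_USER_DSQ:
    case MODEL_TGT_BPF_STORAGE:
        return true;
    default:
        return false;
    }
}

enum model_deq_reason {
    MODEL_DEQ_DISPATCH,      /* no special flag: regular dispatch */
    MODEL_DEQ_CORE_SCHED,    /* picked by core scheduling */
    MODEL_DEQ_SCHED_CHANGE,  /* scheduling property change */
};

/* Map the flags passed to a dequeue callback to the documented reason. */
static enum model_deq_reason model_classify_dequeue(uint64_t deq_flags)
{
    if (deq_flags & MODEL_DEQ_F_CORE_SCHED_EXEC)
        return MODEL_DEQ_CORE_SCHED;
    if (deq_flags & MODEL_DEQ_F_SCHED_CHANGE)
        return MODEL_DEQ_SCHED_CHANGE;
    return MODEL_DEQ_DISPATCH;
}
```

A real BPF scheduler would perform the same kind of flag test on ``deq_flags`` inside its ``ops.dequeue()`` implementation, using the kernel's own ``SCX_DEQ_*`` constants instead of the placeholders above.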
338488Where to Look
339489=============
340490
@@ -377,6 +527,25 @@ Where to Look
377527 scheduling. Tasks with CPU affinity are direct-dispatched in FIFO order;
378528 all others are scheduled in user space by a simple vruntime scheduler.
379529
530+ Module Parameters
531+ =================
532+
533+ sched_ext exposes two module parameters under the ``sched_ext.`` prefix that
534+ control bypass-mode behaviour. These knobs are primarily for debugging; there
535+ is usually no reason to change them during normal operation. They can be read
536+ and written at runtime (mode 0600) via
537+ ``/sys/module/sched_ext/parameters/``.
538+
539+ ``sched_ext.slice_bypass_us`` (default: 5000 µs)
540+ The time slice assigned to all tasks when the scheduler is in bypass mode,
541+ i.e. during BPF scheduler load, unload, and error recovery. Valid range is
542+ 100 µs to 100 ms.
543+
544+ ``sched_ext.bypass_lb_intv_us`` (default: 500000 µs)
545+ The interval at which the bypass-mode load balancer redistributes tasks
546+ across CPUs. Set to 0 to disable load balancing during bypass mode. Valid
547+ range is 0 to 10 s.
548+
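A tool that writes these parameters may want to check a value against the documented range first. The following is a minimal user-space sketch of such a check. The function and macro names are hypothetical and not part of the kernel, and the limits are simply the ranges stated above converted to microseconds.

```c
#include <assert.h>
#include <stdbool.h>

/* Documented ranges, expressed in microseconds. */
#define SLICE_BYPASS_US_MIN    100ULL        /* 100 us */
#define SLICE_BYPASS_US_MAX    100000ULL     /* 100 ms */
#define BYPASS_LB_INTV_US_MAX  10000000ULL   /* 10 s */

static_assert(SLICE_BYPASS_US_MIN < SLICE_BYPASS_US_MAX,
              "slice range must be non-empty");

/* Hypothetical check for a sched_ext.slice_bypass_us candidate value. */
static bool slice_bypass_us_valid(unsigned long long us)
{
    return us >= SLICE_BYPASS_US_MIN && us <= SLICE_BYPASS_US_MAX;
}

/* Hypothetical check for a sched_ext.bypass_lb_intv_us candidate value. */
static bool bypass_lb_intv_us_valid(unsigned long long us)
{
    return us <= BYPASS_LB_INTV_US_MAX;
}

/* A value of 0 disables load balancing during bypass mode. */
static bool bypass_lb_enabled(unsigned long long us)
{
    return us != 0;
}
```

For instance, the defaults of 5000 and 500000 pass both checks, while a slice of 99 µs would be rejected by ``slice_bypass_us_valid()``.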
380549ABI Instability
381550===============
382551