
Commit 36ae1c4

Merge tag 'sched-core-2026-02-09' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip
Pull scheduler updates from Ingo Molnar:
 "Scheduler Kconfig space updates:

   - Further consolidate configurable preemption modes (Peter Zijlstra)

     Reduce the number of architectures that are allowed to offer
     PREEMPT_NONE and PREEMPT_VOLUNTARY, reducing the number of
     preemption models from four to just two: 'full' and 'lazy' on
     up-to-date architectures (arm64, loongarch, powerpc, riscv, s390,
     x86).

     None and voluntary are only available as legacy features on
     platforms that don't implement lazy preemption yet, or which
     don't even support preemption.

     The goal is to eventually remove cond_resched() and voluntary
     preemption altogether.

  RSEQ based 'scheduler time slice extension' support (Thomas Gleixner
  and Peter Zijlstra):

  This allows a thread to request a time slice extension when it enters
  a critical section to avoid contention on a resource when the thread
  is scheduled out inside of the critical section.

   - Add fields and constants for time slice extension
   - Provide static branch for time slice extensions
   - Add statistics for time slice extensions
   - Add prctl() to enable time slice extensions
   - Implement sys_rseq_slice_yield()
   - Implement syscall entry work for time slice extensions
   - Implement time slice extension enforcement timer
   - Reset slice extension when scheduled
   - Implement rseq_grant_slice_extension()
   - entry: Hook up rseq time slice extension
   - selftests: Implement time slice extension test
   - Allow registering RSEQ with slice extension
   - Move slice_ext_nsec to debugfs
   - Lower default slice extension
   - selftests/rseq: Add rseq slice histogram script

  Scheduler performance/scalability improvements:

   - Update rq->avg_idle when a task is moved to an idle CPU, which
     improves the scalability of various workloads (Shubhang Kaushik)

   - Reorder fields in 'struct rq' for better caching (Blake Jones)

   - Fair scheduler SMP NOHZ balancing code speedups (Shrikanth Hegde):
      - Move checking for nohz cpus after time check
      - Change likelyhood of nohz.nr_cpus
      - Remove nohz.nr_cpus and use weight of cpumask instead

   - Avoid false sharing for sched_clock_irqtime (Wangyang Guo)

   - Cleanups (Yury Norov):
      - Drop useless cpumask_empty() in find_energy_efficient_cpu()
      - Simplify task_numa_find_cpu()
      - Use cpumask_weight_and() in sched_balance_find_dst_group()

  DL scheduler updates:

   - Add a deadline server for sched_ext tasks (by Andrea Righi and
     Joel Fernandes, with fixes by Peter Zijlstra)

  RT scheduler updates:

   - Skip currently executing CPU in rto_next_cpu() (Chen Jinghuang)

  Entry code updates and performance improvements (Jinjie Ruan)

  This is part of the scheduler tree in this cycle due to
  inter-dependencies with the RSEQ based time slice extension work:

   - Remove unused syscall argument from syscall_trace_enter()
   - Rework syscall_exit_to_user_mode_work() for architecture reuse
   - Add arch_ptrace_report_syscall_entry/exit()
   - Inline syscall_exit_work() and syscall_trace_enter()

  Scheduler core updates (Peter Zijlstra):

   - Rework sched_class::wakeup_preempt() and rq_modified_*()
   - Avoid rq->lock bouncing in sched_balance_newidle()
   - Rename rcu_dereference_check_sched_domain() =>
     rcu_dereference_sched_domain()
   - <linux/compiler_types.h>: Add the __signed_scalar_typeof() helper

  Fair scheduler updates/refactoring (Peter Zijlstra and Ingo Molnar):

   - Fold the sched_avg update
   - Change rcu_dereference_check_sched_domain() to rcu-sched
   - Switch to rcu_dereference_all()
   - Remove superfluous rcu_read_lock()
   - Limit hrtick work
   - Join two #ifdef CONFIG_FAIR_GROUP_SCHED blocks
   - Clean up comments in 'struct cfs_rq'
   - Separate se->vlag from se->vprot
   - Rename cfs_rq::avg_load to cfs_rq::sum_weight
   - Rename cfs_rq::avg_vruntime to ::sum_w_vruntime & helper functions
   - Introduce and use the vruntime_cmp() and vruntime_op() wrappers
     for wrapped-signed aritmetics
   - Sort out 'blocked_load*' namespace noise

  Scheduler debugging code updates:

   - Export hidden tracepoints to modules (Gabriele Monaco)
   - Convert copy_from_user() + kstrtouint() to kstrtouint_from_user()
     (Fushuai Wang)
   - Add assertions to QUEUE_CLASS (Peter Zijlstra)
   - hrtimer: Fix tracing oddity (Thomas Gleixner)

  Misc fixes and cleanups:

   - Re-evaluate scheduling when migrating queued tasks out of
     throttled cgroups (Zicheng Qu)
   - Remove task_struct->faults_disabled_mapping (Christoph Hellwig)
   - Fix math notation errors in avg_vruntime comment (Zhan Xusheng)
   - sched/cpufreq: Use %pe format for PTR_ERR() printing
     (zenghongling)"

* tag 'sched-core-2026-02-09' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: (64 commits)
  sched: Re-evaluate scheduling when migrating queued tasks out of throttled cgroups
  sched/cpufreq: Use %pe format for PTR_ERR() printing
  sched/rt: Skip currently executing CPU in rto_next_cpu()
  sched/clock: Avoid false sharing for sched_clock_irqtime
  selftests/sched_ext: Add test for DL server total_bw consistency
  selftests/sched_ext: Add test for sched_ext dl_server
  sched/debug: Fix dl_server (re)start conditions
  sched/debug: Add support to change sched_ext server params
  sched_ext: Add a DL server for sched_ext tasks
  sched/debug: Stop and start server based on if it was active
  sched/debug: Fix updating of ppos on server write ops
  sched/deadline: Clear the defer params
  entry: Inline syscall_exit_work() and syscall_trace_enter()
  entry: Add arch_ptrace_report_syscall_entry/exit()
  entry: Rework syscall_exit_to_user_mode_work() for architecture reuse
  entry: Remove unused syscall argument from syscall_trace_enter()
  sched: remove task_struct->faults_disabled_mapping
  sched: Update rq->avg_idle when a task is moved to an idle CPU
  selftests/rseq: Add rseq slice histogram script
  hrtimer: Fix trace oddity
  ...
2 parents 0923fd0 + e34881c commit 36ae1c4

65 files changed

Lines changed: 2598 additions & 546 deletions

Documentation/admin-guide/kernel-parameters.txt

Lines changed: 5 additions & 0 deletions
@@ -6627,6 +6627,11 @@ Kernel parameters
 
     rootflags=      [KNL] Set root filesystem mount option string
 
+    rseq_slice_ext= [KNL] RSEQ based time slice extension
+                    Format: boolean
+                    Control enablement of RSEQ based time slice extension.
+                    Default is 'on'.
+
     initramfs_options= [KNL]
                     Specify mount options for for the initramfs mount.
 

Documentation/userspace-api/index.rst

Lines changed: 1 addition & 0 deletions
@@ -21,6 +21,7 @@ System calls
    ebpf/index
    ioctl/index
    mseal
+   rseq
 
 Security-related interfaces
 ===========================
Lines changed: 140 additions & 0 deletions
@@ -0,0 +1,140 @@
+=====================
+Restartable Sequences
+=====================
+
+Restartable Sequences allow registering a per-thread userspace memory area
+to be used as an ABI between kernel and userspace for three purposes:
+
+* userspace restartable sequences
+
+* quick access to read the current CPU number, node ID from userspace
+
+* scheduler time slice extensions
+
+Restartable sequences (per-cpu atomics)
+---------------------------------------
+
+Restartable sequences allow userspace to perform update operations on
+per-cpu data without requiring heavyweight atomic operations. The actual
+ABI is unfortunately only available in the code and selftests.
+
+Quick access to CPU number, node ID
+-----------------------------------
+
+Allows per-CPU data to be implemented efficiently. Documentation is in code
+and selftests. :(
+
+Scheduler time slice extensions
+-------------------------------
+
+This allows a thread to request a time slice extension when it enters a
+critical section to avoid contention on a resource when the thread is
+scheduled out inside of the critical section.
+
+The prerequisites for this functionality are:
+
+    * Enabled in Kconfig
+
+    * Enabled at boot time (default is enabled)
+
+    * A rseq userspace pointer has been registered for the thread
+
+The thread has to enable the functionality via prctl(2)::
+
+    prctl(PR_RSEQ_SLICE_EXTENSION, PR_RSEQ_SLICE_EXTENSION_SET,
+          PR_RSEQ_SLICE_EXT_ENABLE, 0, 0);
+
+prctl() returns 0 on success or otherwise with the following error codes:
+
+========= ==============================================================
+Errorcode Meaning
+========= ==============================================================
+EINVAL    Functionality not available or invalid function arguments.
+          Note: arg4 and arg5 must be zero
+ENOTSUPP  Functionality was disabled on the kernel command line
+ENXIO     Available, but no rseq user struct registered
+========= ==============================================================
+
+The state can also be queried via prctl(2)::
+
+    prctl(PR_RSEQ_SLICE_EXTENSION, PR_RSEQ_SLICE_EXTENSION_GET, 0, 0, 0);
+
+prctl() returns ``PR_RSEQ_SLICE_EXT_ENABLE`` when it is enabled or 0 if
+disabled. Otherwise it returns with the following error codes:
+
+========= ==============================================================
+Errorcode Meaning
+========= ==============================================================
+EINVAL    Functionality not available or invalid function arguments.
+          Note: arg3 and arg4 and arg5 must be zero
+========= ==============================================================
+
+The availability and status is also exposed via the rseq ABI struct flags
+field via the ``RSEQ_CS_FLAG_SLICE_EXT_AVAILABLE_BIT`` and the
+``RSEQ_CS_FLAG_SLICE_EXT_ENABLED_BIT``. These bits are read-only for user
+space and only for informational purposes.
+
+If the mechanism was enabled via prctl(), the thread can request a time
+slice extension by setting rseq::slice_ctrl::request to 1. If the thread is
+interrupted and the interrupt results in a reschedule request in the
+kernel, then the kernel can grant a time slice extension and return to
+userspace instead of scheduling out. The length of the extension is
+determined by debugfs:rseq/slice_ext_nsec. The default value is 5 usec, which
+is the minimum value. It can be increased to 50 usecs; however, doing so
+can/will affect the minimum scheduling latency.
+
+Any proposed changes to this default will have to come with a selftest and
+rseq-slice-hist.py output that shows the new value has merit.
+
+The kernel indicates the grant by clearing rseq::slice_ctrl::request and
+setting rseq::slice_ctrl::granted to 1. If there is a reschedule of the
+thread after granting the extension, the kernel clears the granted bit to
+indicate that to userspace.
+
+If the request bit is still set when leaving the critical section,
+userspace can clear it and continue.
+
+If the granted bit is set, then userspace invokes rseq_slice_yield(2) when
+leaving the critical section to relinquish the CPU. The kernel enforces
+this by arming a timer to prevent misbehaving userspace from abusing this
+mechanism.
+
+If both the request bit and the granted bit are false when leaving the
+critical section, then this indicates that a grant was revoked and no
+further action is required by userspace.
+
+The required code flow is as follows::
+
+    rseq->slice_ctrl.request = 1;
+    barrier();  // Prevent compiler reordering
+    critical_section();
+    barrier();  // Prevent compiler reordering
+    rseq->slice_ctrl.request = 0;
+    if (rseq->slice_ctrl.granted)
+        rseq_slice_yield();
+
+As all of this is strictly CPU local, there are no atomicity requirements.
+Checking the granted state is racy, but that cannot be avoided at all::
+
+    if (rseq->slice_ctrl.granted)
+        -> Interrupt results in schedule and grant revocation
+        rseq_slice_yield();
+
+So there is no point in pretending that this might be solved by an atomic
+operation.
+
+If the thread issues a syscall other than rseq_slice_yield(2) within the
+granted timeslice extension, the grant is also revoked and the CPU is
+relinquished immediately when entering the kernel. This is required as
+syscalls might consume arbitrary CPU time until they reach a scheduling
+point when the preemption model is either NONE or VOLUNTARY and therefore
+might exceed the grant by far.
+
+The preferred solution for user space is to use rseq_slice_yield(2) which
+is side effect free. The support for arbitrary syscalls is required to
+support applications with an onion-layer architecture, where the code
+handling the critical section and requesting the time slice extension has
+no control over the code within the critical section.
+
+The kernel enforces flag consistency and terminates the thread with SIGSEGV
+if it detects a violation.
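
Editorial sketch (not part of the commit): the documentation above describes enabling the feature with prctl(2) and then acting on the request and granted bits when leaving the critical section. The C fragment below is one possible way to express those two steps. Everything named `slice_ctrl_view`, `slice_exit_action`, `compiler_barrier()`, `slice_ext_enable()` and `slice_exit()` is hypothetical and introduced only for illustration; the real control bits live in the kernel's rseq UAPI structure, whose layout is not shown on this page. The `PR_RSEQ_SLICE_*` constants are taken from the documentation text and are only used when the installed headers define them.

```c
#include <errno.h>
#include <stdbool.h>
#include <sys/prctl.h>

/*
 * Hypothetical mirror of the two control bits described in the
 * documentation. The real layout is defined by the kernel's rseq UAPI
 * header and is not reproduced here.
 */
struct slice_ctrl_view {
	bool request;
	bool granted;
};

/* What the caller has to do after leaving the critical section. */
enum slice_exit_action {
	SLICE_EXIT_NONE,            /* neither bit set: nothing to do */
	SLICE_EXIT_REQUEST_CLEARED, /* request was still pending, now cleared */
	SLICE_EXIT_YIELD,           /* grant active: call rseq_slice_yield(2) */
};

/* Compiler-only barrier (GCC/Clang syntax), standing in for barrier(). */
static inline void compiler_barrier(void)
{
	__asm__ __volatile__("" ::: "memory");
}

/*
 * Enable slice extensions for the calling thread. Returns 0 on success,
 * -1 with errno set otherwise. Reports ENOSYS when the installed headers
 * predate this commit and lack the PR_RSEQ_SLICE_* constants.
 */
static int slice_ext_enable(void)
{
#ifdef PR_RSEQ_SLICE_EXTENSION
	return prctl(PR_RSEQ_SLICE_EXTENSION, PR_RSEQ_SLICE_EXTENSION_SET,
		     PR_RSEQ_SLICE_EXT_ENABLE, 0, 0);
#else
	errno = ENOSYS;
	return -1;
#endif
}

/*
 * Exit path of the documented code flow: clear a still-pending request,
 * then report whether a grant is active and a yield is due.
 */
static enum slice_exit_action slice_exit(struct slice_ctrl_view *ctrl)
{
	enum slice_exit_action action = SLICE_EXIT_NONE;

	compiler_barrier();
	if (ctrl->request) {
		ctrl->request = false;
		action = SLICE_EXIT_REQUEST_CLEARED;
	}
	if (ctrl->granted)
		action = SLICE_EXIT_YIELD;
	return action;
}
```

A caller would invoke `slice_ext_enable()` once per thread after rseq registration, and act on the value returned by `slice_exit()` at the end of each critical section. As the documentation notes, the check of the granted bit is inherently racy, so the returned action is a hint rather than a guarantee.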

arch/alpha/kernel/syscalls/syscall.tbl

Lines changed: 1 addition & 0 deletions
@@ -510,3 +510,4 @@
 578  common  file_getattr      sys_file_getattr
 579  common  file_setattr      sys_file_setattr
 580  common  listns            sys_listns
+581  common  rseq_slice_yield  sys_rseq_slice_yield

arch/arm/tools/syscall.tbl

Lines changed: 1 addition & 0 deletions
@@ -485,3 +485,4 @@
 468  common  file_getattr      sys_file_getattr
 469  common  file_setattr      sys_file_setattr
 470  common  listns            sys_listns
+471  common  rseq_slice_yield  sys_rseq_slice_yield

arch/arm64/tools/syscall_32.tbl

Lines changed: 1 addition & 0 deletions
@@ -482,3 +482,4 @@
 468  common  file_getattr      sys_file_getattr
 469  common  file_setattr      sys_file_setattr
 470  common  listns            sys_listns
+471  common  rseq_slice_yield  sys_rseq_slice_yield

arch/m68k/kernel/syscalls/syscall.tbl

Lines changed: 1 addition & 0 deletions
@@ -470,3 +470,4 @@
 468  common  file_getattr      sys_file_getattr
 469  common  file_setattr      sys_file_setattr
 470  common  listns            sys_listns
+471  common  rseq_slice_yield  sys_rseq_slice_yield

arch/microblaze/kernel/syscalls/syscall.tbl

Lines changed: 1 addition & 0 deletions
@@ -476,3 +476,4 @@
 468  common  file_getattr      sys_file_getattr
 469  common  file_setattr      sys_file_setattr
 470  common  listns            sys_listns
+471  common  rseq_slice_yield  sys_rseq_slice_yield

arch/mips/kernel/syscalls/syscall_n32.tbl

Lines changed: 1 addition & 0 deletions
@@ -409,3 +409,4 @@
 468  n32  file_getattr      sys_file_getattr
 469  n32  file_setattr      sys_file_setattr
 470  n32  listns            sys_listns
+471  n32  rseq_slice_yield  sys_rseq_slice_yield

arch/mips/kernel/syscalls/syscall_n64.tbl

Lines changed: 1 addition & 0 deletions
@@ -385,3 +385,4 @@
 468  n64  file_getattr      sys_file_getattr
 469  n64  file_setattr      sys_file_setattr
 470  n64  listns            sys_listns
+471  n64  rseq_slice_yield  sys_rseq_slice_yield
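
Editorial sketch (not part of the commit): the table hunks above wire up `rseq_slice_yield` under per-architecture numbers (581 on alpha, 471 in the other tables shown). A C library wrapper may not exist yet, so userspace would likely go through syscall(2). The wrapper name `rseq_slice_yield_sketch()` is hypothetical, and the assumption that the syscall takes no arguments is inferred from the `rseq_slice_yield();` call in the documentation. The number is taken from `__NR_rseq_slice_yield` when the installed headers provide it rather than being hard-coded, because it differs between architectures.

```c
#define _GNU_SOURCE
#include <errno.h>
#include <sys/syscall.h>
#include <unistd.h>

/*
 * Hypothetical wrapper: relinquish the CPU after a granted slice
 * extension. Returns the raw syscall result, or -1 with errno set to
 * ENOSYS when the installed headers do not know the syscall number.
 */
static long rseq_slice_yield_sketch(void)
{
#ifdef __NR_rseq_slice_yield
	/* Assumed to take no arguments, as in the documented code flow. */
	return syscall(__NR_rseq_slice_yield);
#else
	errno = ENOSYS;
	return -1;
#endif
}
```

On kernels without this commit the call is expected to fail with ENOSYS as well, so callers can treat that error as "feature not available" in both cases.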
