|
| 1 | +.. SPDX-License-Identifier: GPL-2.0 |
| 2 | + |
| 3 | +=============== |
| 4 | +Core Scheduling |
| 5 | +=============== |
| 6 | +Core scheduling support allows userspace to define groups of tasks that can |
| 7 | +share a core. These groups can be specified either for security usecases (one |
| 8 | +group of tasks doesn't trust another), or for performance usecases (some |
| 9 | +workloads may benefit from running on the same core as they don't need the same |
| 10 | +hardware resources of the shared core, or may prefer different cores if they |
| 11 | +do share hardware resource needs). This document only describes the security |
| 12 | +usecase. |
| 13 | + |
| 14 | +Security usecase |
| 15 | +---------------- |
| 16 | +A cross-HT attack involves the attacker and victim running on different Hyper |
| 17 | +Threads of the same core. MDS and L1TF are examples of such attacks. The only |
| 18 | +full mitigation of cross-HT attacks is to disable Hyper Threading (HT). Core |
| 19 | +scheduling is a scheduler feature that can mitigate some (not all) cross-HT |
| 20 | +attacks. It allows HT to be turned on safely by ensuring that only tasks in a |
| 21 | +user-designated trusted group can share a core. This increase in core sharing |
| 22 | +can also improve performance. An improvement is not guaranteed, although it |
| 23 | +has been observed with a number of real world workloads. In theory, core |
| 24 | +scheduling aims to perform at least as well as when Hyper Threading is |
| 25 | +disabled. In practice, this is mostly but not always the case, because |
| 26 | +synchronizing scheduling decisions across 2 or more CPUs in a core involves |
| 27 | +additional overhead, especially when the system is lightly loaded. When |
| 28 | +``total_threads <= N_CPUS/2`` (where N_CPUS is the total number of CPUs), the |
| 29 | +extra overhead may cause core scheduling to perform worse than with SMT |
| 30 | +disabled. Always measure the performance of your workloads. |
| 31 | + |
| 32 | +Usage |
| 33 | +----- |
| 34 | +Core scheduling support is enabled via the ``CONFIG_SCHED_CORE`` config option. |
| 35 | +Using this feature, userspace defines groups of tasks that can be co-scheduled |
| 36 | +on the same core. The core scheduler uses this information to make sure that |
| 37 | +tasks that are not in the same group never run simultaneously on a core, while |
| 38 | +doing its best to satisfy the system's scheduling requirements. |
| 39 | + |
| 40 | +Core scheduling can be enabled via the ``PR_SCHED_CORE`` prctl interface. |
| 41 | +This interface provides support for the creation of core scheduling groups, as |
| 42 | +well as admission and removal of tasks from created groups:: |
| 43 | + |
| 44 | + #include <sys/prctl.h> |
| 45 | + |
| 46 | + int prctl(int option, unsigned long arg2, unsigned long arg3, |
| 47 | + unsigned long arg4, unsigned long arg5); |
| 48 | + |
| 49 | +option: |
| 50 | + ``PR_SCHED_CORE`` |
| 51 | + |
| 52 | +arg2: |
| 53 | + Command for operation, must be one of: |
| 54 | + |
| 55 | + - ``PR_SCHED_CORE_GET`` -- get core_sched cookie of ``pid``. |
| 56 | + - ``PR_SCHED_CORE_CREATE`` -- create a new unique cookie for ``pid``. |
| 57 | + - ``PR_SCHED_CORE_SHARE_TO`` -- push core_sched cookie to ``pid``. |
| 58 | + - ``PR_SCHED_CORE_SHARE_FROM`` -- pull core_sched cookie from ``pid``. |
| 59 | + |
| 60 | +arg3: |
| 61 | + ``pid`` of the task for which the operation applies. |
| 62 | + |
| 63 | +arg4: |
| 64 | + ``pid_type`` for which the operation applies. It is of type ``enum pid_type``. |
| 65 | + For example, if arg4 is ``PIDTYPE_TGID``, then the operation of this command |
| 66 | + will be performed for all tasks in the thread group of ``pid``. |
| 67 | + |
| 68 | +arg5: |
| 69 | + userspace pointer to an unsigned long for storing the cookie returned by |
| 70 | + ``PR_SCHED_CORE_GET`` command. Should be 0 for all other commands. |
| 71 | + |
| 72 | +In order for a process to push a cookie to, or pull a cookie from, another |
| 73 | +process, it must have ptrace access mode ``PTRACE_MODE_READ_REALCREDS`` to that |
| 74 | +process. |
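
As an illustration of the argument layout described above, the following is a minimal sketch of two thin wrappers around ``prctl()``. The names ``cs_create``, ``cs_get`` and ``enum cs_pid_type`` are hypothetical and exist only for this example. The fallback constants are assumed to match ``include/uapi/linux/prctl.h``, and the cookie is read into a 64-bit integer as a conservative assumption about the width the kernel writes. Both wrappers return 0 on success or a negative errno value.

```c
#include <errno.h>
#include <stdint.h>
#include <sys/prctl.h>
#include <sys/types.h>

/* Fallbacks for older userspace headers (assumed to match the uapi header). */
#ifndef PR_SCHED_CORE
#define PR_SCHED_CORE            62
#define PR_SCHED_CORE_GET        0
#define PR_SCHED_CORE_CREATE     1
#define PR_SCHED_CORE_SHARE_TO   2
#define PR_SCHED_CORE_SHARE_FROM 3
#endif

/*
 * enum pid_type is kernel-internal, so this hypothetical enum mirrors the
 * values userspace has to pass in arg4.
 */
enum cs_pid_type {
	CS_PIDTYPE_PID  = 0,	/* a single task */
	CS_PIDTYPE_TGID = 1,	/* all threads of a process */
	CS_PIDTYPE_PGID = 2,	/* all tasks of a process group */
};

/* Create a new unique cookie for pid (0 means the calling task). */
int cs_create(pid_t pid, enum cs_pid_type type)
{
	if (prctl(PR_SCHED_CORE, PR_SCHED_CORE_CREATE,
		  (unsigned long)pid, (unsigned long)type, 0UL) == -1)
		return -errno;
	return 0;
}

/* Read the cookie of a single task into *cookie (arg5 is the user pointer). */
int cs_get(pid_t pid, uint64_t *cookie)
{
	if (prctl(PR_SCHED_CORE, PR_SCHED_CORE_GET,
		  (unsigned long)pid, (unsigned long)CS_PIDTYPE_PID,
		  (unsigned long)cookie) == -1)
		return -errno;
	return 0;
}
```

A caller would typically invoke ``cs_create(0, CS_PIDTYPE_TGID)`` early in a process to tag all of its threads. On a kernel built without ``CONFIG_SCHED_CORE`` both wrappers are expected to fail with a negative errno value.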
| 75 | + |
| 76 | +Building hierarchies of tasks |
| 77 | +~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ |
| 78 | +The simplest way to build hierarchies of threads/processes which share a |
| 79 | +cookie and thus a core is to rely on the fact that the core-sched cookie is |
| 80 | +inherited across forks/clones and execs, thus setting a cookie for the |
| 81 | +'initial' script/executable/daemon will place every spawned child in the |
| 82 | +same core-sched group. |
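
The inheritance rule can be checked from userspace. The sketch below is a hypothetical helper (``cs_child_inherits_cookie`` is not part of any kernel or libc interface) that tags the calling task, forks, and lets the child compare its own cookie with the parent's. The fallback constants are assumed to match ``include/uapi/linux/prctl.h``. Note that calling it changes the caller's cookie.

```c
#include <errno.h>
#include <stdint.h>
#include <sys/prctl.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

/* Fallbacks for older userspace headers (assumed to match the uapi header). */
#ifndef PR_SCHED_CORE
#define PR_SCHED_CORE        62
#define PR_SCHED_CORE_GET    0
#define PR_SCHED_CORE_CREATE 1
#endif

/*
 * Returns 1 if a forked child reports the same cookie as its parent,
 * 0 if the cookies differ, or a negative errno value on failure
 * (for example when core scheduling is not available).
 */
int cs_child_inherits_cookie(void)
{
	uint64_t parent = 0, child = 0;
	int status;
	pid_t pid;

	/* pid 0 is the calling task, pid_type 0 is a single task. */
	if (prctl(PR_SCHED_CORE, PR_SCHED_CORE_CREATE, 0UL, 0UL, 0UL) == -1)
		return -errno;
	if (prctl(PR_SCHED_CORE, PR_SCHED_CORE_GET, 0UL, 0UL,
		  (unsigned long)&parent) == -1)
		return -errno;

	pid = fork();
	if (pid == -1)
		return -errno;
	if (pid == 0) {
		/* Child: read our own cookie and compare with the parent's. */
		if (prctl(PR_SCHED_CORE, PR_SCHED_CORE_GET, 0UL, 0UL,
			  (unsigned long)&child) == -1)
			_exit(2);
		_exit(child == parent ? 0 : 1);
	}

	if (waitpid(pid, &status, 0) == -1)
		return -errno;
	if (!WIFEXITED(status) || WEXITSTATUS(status) == 2)
		return -EIO;
	return WEXITSTATUS(status) == 0;
}
```

A launcher built on the same idea would create the cookie once and then ``exec`` the workload, so that everything the workload spawns ends up in the same group.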
| 83 | + |
| 84 | +Cookie Transferral |
| 85 | +~~~~~~~~~~~~~~~~~~ |
| 86 | +Transferring a cookie between the current and other tasks is possible using |
| 87 | +PR_SCHED_CORE_SHARE_FROM and PR_SCHED_CORE_SHARE_TO to inherit a cookie from a |
| 88 | +specified task or to share a cookie with a task. In combination this allows a |
| 89 | +simple helper program to pull a cookie from a task in an existing core |
| 90 | +scheduling group and share it with already running tasks. |
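
A minimal sketch of such a helper follows. The function name ``cs_copy_cookie`` is hypothetical, and the fallback constants are assumed to match ``include/uapi/linux/prctl.h``. It first pulls the cookie of ``from`` into the calling task and then pushes the caller's cookie to ``to``. As a side effect, the caller itself ends up in the same group.

```c
#include <errno.h>
#include <sys/prctl.h>
#include <sys/types.h>

/* Fallbacks for older userspace headers (assumed to match the uapi header). */
#ifndef PR_SCHED_CORE
#define PR_SCHED_CORE            62
#define PR_SCHED_CORE_SHARE_TO   2
#define PR_SCHED_CORE_SHARE_FROM 3
#endif

/*
 * Copy the core scheduling cookie of task 'from' to task 'to' by routing
 * it through the calling task. Returns 0 or a negative errno value.
 */
int cs_copy_cookie(pid_t from, pid_t to)
{
	/* Pull: the calling task adopts the cookie of 'from'. */
	if (prctl(PR_SCHED_CORE, PR_SCHED_CORE_SHARE_FROM,
		  (unsigned long)from, 0UL /* single task */, 0UL) == -1)
		return -errno;

	/* Push: the caller's (now shared) cookie is applied to 'to'. */
	if (prctl(PR_SCHED_CORE, PR_SCHED_CORE_SHARE_TO,
		  (unsigned long)to, 0UL /* single task */, 0UL) == -1)
		return -errno;

	return 0;
}
```

Both steps are subject to the ptrace access check described earlier, so the helper needs sufficient privileges over both tasks.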
| 91 | + |
| 92 | +Design/Implementation |
| 93 | +--------------------- |
| 94 | +Each task that is tagged is assigned a cookie internally in the kernel. As |
| 95 | +mentioned in `Usage`_, tasks with the same cookie value are assumed to trust |
| 96 | +each other and share a core. |
| 97 | + |
| 98 | +The basic idea is that, every schedule event tries to select tasks for all the |
| 99 | +siblings of a core such that all the selected tasks running on a core are |
| 100 | +trusted (same cookie) at any point in time. Kernel threads are assumed trusted. |
| 101 | +The idle task is considered special, as it trusts everything and everything |
| 102 | +trusts it. |
| 103 | + |
| 104 | +During a schedule() event on any sibling of a core, the highest priority task on |
| 105 | +the sibling's core is picked and assigned to the sibling calling schedule(), if |
| 106 | +the sibling has the task enqueued. For the rest of the siblings in the core, the |
| 107 | +highest priority task with the same cookie is selected if there is one runnable |
| 108 | +in their individual run queues. If a task with the same cookie is not available, |
| 109 | +the idle task is selected. The idle task is globally trusted. |
| 110 | + |
| 111 | +Once a task has been selected for all the siblings in the core, an IPI is sent to |
| 112 | +siblings for whom a new task was selected. Siblings on receiving the IPI will |
| 113 | +switch to the new task immediately. If an idle task is selected for a sibling, |
| 114 | +then the sibling is considered to be in a `forced idle` state. I.e., it may |
| 115 | +have tasks on its own runqueue to run, however it will still have to run idle. |
| 116 | +More on this in the next section. |
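
The selection rule can be illustrated with a toy userspace model. This is not the kernel implementation: the types ``toy_task`` and ``toy_rq``, the function ``toy_core_pick`` and the convention that a larger ``prio`` means a more important task are all assumptions made for this sketch. It finds the core-wide highest priority task, then lets each sibling pick its own best task with a matching cookie, falling back to idle.

```c
#include <stddef.h>

#define IDLE (-1)	/* pick value meaning "run the idle task" */

struct toy_task {
	int prio;		/* larger value = higher priority (toy convention) */
	unsigned long cookie;	/* tasks with equal cookies trust each other */
};

struct toy_rq {
	const struct toy_task *tasks;	/* runnable tasks of one sibling */
	size_t nr;
};

/*
 * Fill pick[i] with the index of the task chosen for sibling i, or IDLE.
 * Returns how many siblings were forced idle, i.e. had runnable tasks but
 * none that is compatible with the core-wide highest priority task.
 */
int toy_core_pick(const struct toy_rq *rq, size_t nr_siblings, int *pick)
{
	const struct toy_task *max = NULL;
	int forced_idle = 0;
	size_t i, j;

	/* Step 1: find the highest priority runnable task on the whole core. */
	for (i = 0; i < nr_siblings; i++)
		for (j = 0; j < rq[i].nr; j++)
			if (!max || rq[i].tasks[j].prio > max->prio)
				max = &rq[i].tasks[j];

	/* Step 2: each sibling picks its best task sharing that cookie. */
	for (i = 0; i < nr_siblings; i++) {
		pick[i] = IDLE;
		for (j = 0; max && j < rq[i].nr; j++) {
			const struct toy_task *t = &rq[i].tasks[j];

			if (t->cookie != max->cookie)
				continue;
			if (pick[i] == IDLE || t->prio > rq[i].tasks[pick[i]].prio)
				pick[i] = (int)j;
		}
		if (pick[i] == IDLE && rq[i].nr > 0)
			forced_idle++;
	}
	return forced_idle;
}
```

In this model a sibling whose only runnable tasks carry a different cookie ends up with ``IDLE`` and is counted as forced idle, even though its run queue is not empty.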
| 117 | + |
| 118 | +Forced-idling of hyperthreads |
| 119 | +~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ |
| 120 | +The scheduler tries its best to find tasks that trust each other such that all |
| 121 | +tasks selected to be scheduled are of the highest priority in a core. However, |
| 122 | +it is possible that some runqueues had tasks that were incompatible with the |
| 123 | +highest priority ones in the core. Favoring security over fairness, one or more |
| 124 | +siblings could be forced to select a lower priority task if the highest |
| 125 | +priority task is not trusted with respect to the core wide highest priority |
| 126 | +task. If a sibling does not have a trusted task to run, it will be forced idle |
| 127 | +by the scheduler (idle thread is scheduled to run). |
| 128 | + |
| 129 | +When the highest priority task is selected to run, a reschedule-IPI is sent to |
| 130 | +the sibling to force it into idle. This results in 4 cases which need to be |
| 131 | +considered depending on whether a VM or a regular usermode process was running |
| 132 | +on either HT:: |
| 133 | + |
| 134 | + HT1 (attack) HT2 (victim) |
| 135 | + A idle -> user space user space -> idle |
| 136 | + B idle -> user space guest -> idle |
| 137 | + C idle -> guest user space -> idle |
| 138 | + D idle -> guest guest -> idle |
| 139 | + |
| 140 | +Note that for better performance, we do not wait for the destination CPU |
| 141 | +(victim) to enter idle mode. This is because the sending of the IPI would bring |
| 142 | +the destination CPU immediately into kernel mode from user space, or VMEXIT |
| 143 | +in the case of guests. At best, this would only leak some scheduler metadata |
| 144 | +which may not be worth protecting. It is also possible that the IPI is received |
| 145 | +too late on some architectures, but this has not been observed in the case of |
| 146 | +x86. |
| 147 | + |
| 148 | +Trust model |
| 149 | +~~~~~~~~~~~ |
| 150 | +Core scheduling maintains trust relationships amongst groups of tasks by |
| 151 | +assigning them a tag that is the same cookie value. |
| 152 | +When a system with core scheduling boots, all tasks are considered to trust |
| 153 | +each other. This is because the core scheduler does not have information about |
| 154 | +trust relationships until userspace uses the above mentioned interfaces to |
| 155 | +communicate them. In other words, all tasks have a default cookie value of 0 |
| 156 | +and are considered system-wide trusted. The forced-idling of siblings running |
| 157 | +cookie-0 tasks is also avoided. |
| 158 | + |
| 159 | +Once userspace uses the above mentioned interfaces to group sets of tasks, tasks |
| 160 | +within such groups are considered to trust each other, but do not trust those |
| 161 | +outside. Tasks outside the group also don't trust tasks within. |
| 162 | + |
| 163 | +Limitations of core-scheduling |
| 164 | +------------------------------ |
| 165 | +Core scheduling tries to guarantee that only trusted tasks run concurrently on a |
| 166 | +core. But there could be a small window of time during which untrusted tasks run |
| 167 | +concurrently, or the kernel could be running concurrently with a task not trusted |
| 168 | +by the kernel. |
| 169 | + |
| 170 | +IPI processing delays |
| 171 | +~~~~~~~~~~~~~~~~~~~~~ |
| 172 | +Core scheduling selects only trusted tasks to run together. An IPI is used to |
| 173 | +notify the siblings to switch to the new task. But there could be hardware delays |
| 174 | +in receiving the IPI on some architectures (on x86, this has not been observed). |
| 175 | +This may cause an attacker task to start running on a CPU before its siblings |
| 176 | +receive the IPI. Even though the cache is flushed on entry to user mode, victim |
| 177 | +tasks on siblings may populate data in the cache and micro architectural buffers |
| 178 | +after the attacker starts to run, which makes a data leak possible. |
| 179 | + |
| 180 | +Open cross-HT issues that core scheduling does not solve |
| 181 | +-------------------------------------------------------- |
| 182 | +1. For MDS |
| 183 | +~~~~~~~~~~ |
| 184 | +Core scheduling cannot protect against MDS attacks between an HT running in |
| 185 | +user mode and another running in kernel mode. Even though both HTs run tasks |
| 186 | +which trust each other, kernel memory is still considered untrusted. Such |
| 187 | +attacks are possible for any combination of sibling CPU modes (host or guest mode). |
| 188 | + |
| 189 | +2. For L1TF |
| 190 | +~~~~~~~~~~~ |
| 191 | +Core scheduling cannot protect against an L1TF guest attacker exploiting a |
| 192 | +guest or host victim. This is because the guest attacker can craft invalid |
| 193 | +PTEs which are not inverted due to a vulnerable guest kernel. The only |
| 194 | +solution is to disable EPT (Extended Page Tables). |
| 195 | + |
| 196 | +For both MDS and L1TF, if the guest vCPUs are configured to not trust each |
| 197 | +other (by tagging separately), then the guest to guest attacks would go away. |
| 198 | +Or it could be a system admin policy which considers guest to guest attacks as |
| 199 | +a guest problem. |
| 200 | + |
| 201 | +Another approach to resolve these would be to make every untrusted task on the |
| 202 | +system not trust every other untrusted task. While this could reduce |
| 203 | +parallelism of the untrusted tasks, it would still solve the above issues while |
| 204 | +allowing system processes (trusted tasks) to share a core. |
| 205 | + |
| 206 | +3. Protecting the kernel (IRQ, syscall, VMEXIT) |
| 207 | +~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ |
| 208 | +Unfortunately, core scheduling does not protect kernel contexts running on |
| 209 | +sibling hyperthreads from one another. Prototypes of mitigations have been posted |
| 210 | +to LKML to solve this, but it is debatable whether such windows are practically |
| 211 | +exploitable, and whether the performance overhead of the prototypes is worth |
| 212 | +it (not to mention the added code complexity). |
| 213 | + |
| 214 | +Other Use cases |
| 215 | +--------------- |
| 216 | +The main use case for Core scheduling is mitigating the cross-HT vulnerabilities |
| 217 | +with SMT enabled. There are other use cases where this feature could be used: |
| 218 | + |
| 219 | +- Isolating tasks that need a whole core: Examples include realtime tasks, tasks |
| 220 | + that use SIMD instructions etc. |
| 221 | +- Gang scheduling: Requirements for a group of tasks that need to be scheduled |
| 222 | + together could also be realized using core scheduling. One example is vCPUs of |
| 223 | + a VM. |