
Commit 01b0831

htejun authored and gregkh committed
sched_ext: Fix SCX_KICK_WAIT to work reliably
commit a379fa1 upstream.

SCX_KICK_WAIT is used to synchronously wait for the target CPU to complete
a reschedule and can be used to implement operations like core scheduling.

This used to be implemented by scx_next_task_picked() incrementing pnt_seq,
which was always called when a CPU picks the next task to run, allowing
SCX_KICK_WAIT to reliably wait for the target CPU to enter the scheduler and
pick the next task.

However, commit b999e36 ("sched_ext: Replace scx_next_task_picked() with
switch_class()") replaced scx_next_task_picked() with the switch_class()
callback, which is only called when switching between sched classes. This
broke SCX_KICK_WAIT because pnt_seq would no longer be reliably incremented
unless the previous task was SCX and the next task was not.

This fix leverages commit 4c95380 ("sched/ext: Fold balance_scx() into
pick_task_scx()") which refactored the pick path making put_prev_task_scx()
the natural place to track task switches for SCX_KICK_WAIT. The fix moves
pnt_seq increment to put_prev_task_scx() and also increments it in
pick_task_scx() to handle cases where the same task is re-selected, whether
by BPF scheduler decision or slice refill.

The semantics: If the current task on the target CPU is SCX, SCX_KICK_WAIT
waits until the CPU enters the scheduling path. This provides sufficient
guarantee for use cases like core scheduling while keeping the operation
self-contained within SCX.

v2: - Also increment pnt_seq in pick_task_scx() to handle same-task
      re-selection (Andrea Righi).
    - Use smp_cond_load_acquire() for the busy-wait loop for better
      architecture optimization (Peter Zijlstra).

Reported-by: Wen-Fang Liu <liuwenfang@honor.com>
Link: http://lkml.kernel.org/r/228ebd9e6ed3437996dffe15735a9caa@honor.com
Cc: Peter Zijlstra <peterz@infradead.org>
Reviewed-by: Andrea Righi <arighi@nvidia.com>
Signed-off-by: Tejun Heo <tj@kernel.org>
Signed-off-by: Christian Loehle <christian.loehle@arm.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
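The regression described in the commit message can be illustrated with a small single-threaded model. This is only a sketch under stated assumptions: `struct cpu_model`, `resched_old()`, `resched_new()` and `waiter_released()` are hypothetical names that do not exist in the kernel, and `CLASS_OTHER` stands in for a higher-priority sched class taking the CPU. The model only tracks when the counter is bumped, not the real locking or memory ordering.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical, simplified model; none of these names exist in the kernel. */
enum task_class { CLASS_SCX, CLASS_OTHER /* a higher-priority class */ };

struct cpu_model {
	unsigned long pnt_seq;	/* stands in for rq->scx.pnt_seq */
	enum task_class curr;	/* class of the task currently running */
};

/*
 * Before the fix: only switch_class() bumped the counter, i.e. only when
 * the previous task was SCX and the next task was not.
 */
static void resched_old(struct cpu_model *cpu, enum task_class next,
			bool same_task)
{
	(void)same_task;	/* re-picking the same task never bumped */
	if (cpu->curr == CLASS_SCX && next != CLASS_SCX)
		cpu->pnt_seq++;
	cpu->curr = next;
}

/*
 * After the fix: put_prev_task_scx() bumps when an SCX task is switched
 * out, and pick_task_scx() bumps whenever the SCX pick path runs, even if
 * it ends up re-selecting the same task.
 */
static void resched_new(struct cpu_model *cpu, enum task_class next,
			bool same_task)
{
	/* "same task" only makes sense if the class does not change */
	assert(!same_task || cpu->curr == next);

	if (cpu->curr == CLASS_SCX && !same_task)
		cpu->pnt_seq++;		/* models put_prev_task_scx() */
	if (next == CLASS_SCX)
		cpu->pnt_seq++;		/* models pick_task_scx() */
	cpu->curr = next;
}

/* The waiter is released once the counter differs from its snapshot. */
static bool waiter_released(const struct cpu_model *cpu, unsigned long snap)
{
	return cpu->pnt_seq != snap;
}
```

In this model, an SCX-to-SCX switch leaves the counter untouched under `resched_old()`, so a waiter that snapshotted it would spin indefinitely, while `resched_new()` releases the waiter in every case where the kicked CPU was running an SCX task.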
1 parent 664e78f commit 01b0831

2 files changed

Lines changed: 30 additions & 22 deletions


kernel/sched/ext.c

Lines changed: 26 additions & 20 deletions
@@ -2306,12 +2306,6 @@ static void switch_class(struct rq *rq, struct task_struct *next)
 	struct scx_sched *sch = scx_root;
 	const struct sched_class *next_class = next->sched_class;
 
-	/*
-	 * Pairs with the smp_load_acquire() issued by a CPU in
-	 * kick_cpus_irq_workfn() who is waiting for this CPU to perform a
-	 * resched.
-	 */
-	smp_store_release(&rq->scx.pnt_seq, rq->scx.pnt_seq + 1);
 	if (!(sch->ops.flags & SCX_OPS_HAS_CPU_PREEMPT))
 		return;
 
@@ -2351,6 +2345,10 @@ static void put_prev_task_scx(struct rq *rq, struct task_struct *p,
 			      struct task_struct *next)
 {
 	struct scx_sched *sch = scx_root;
+
+	/* see kick_cpus_irq_workfn() */
+	smp_store_release(&rq->scx.pnt_seq, rq->scx.pnt_seq + 1);
+
 	update_curr_scx(rq);
 
 	/* see dequeue_task_scx() on why we skip when !QUEUED */
@@ -2404,6 +2402,9 @@ static struct task_struct *pick_task_scx(struct rq *rq)
 	bool keep_prev = rq->scx.flags & SCX_RQ_BAL_KEEP;
 	bool kick_idle = false;
 
+	/* see kick_cpus_irq_workfn() */
+	smp_store_release(&rq->scx.pnt_seq, rq->scx.pnt_seq + 1);
+
 	/*
 	 * WORKAROUND:
 	 *
@@ -5186,8 +5187,12 @@ static bool kick_one_cpu(s32 cpu, struct rq *this_rq, unsigned long *pseqs)
 		}
 
 		if (cpumask_test_cpu(cpu, this_scx->cpus_to_wait)) {
-			pseqs[cpu] = rq->scx.pnt_seq;
-			should_wait = true;
+			if (cur_class == &ext_sched_class) {
+				pseqs[cpu] = rq->scx.pnt_seq;
+				should_wait = true;
+			} else {
+				cpumask_clear_cpu(cpu, this_scx->cpus_to_wait);
+			}
 		}
 
 		resched_curr(rq);
@@ -5248,18 +5253,19 @@ static void kick_cpus_irq_workfn(struct irq_work *irq_work)
 	for_each_cpu(cpu, this_scx->cpus_to_wait) {
 		unsigned long *wait_pnt_seq = &cpu_rq(cpu)->scx.pnt_seq;
 
-		if (cpu != cpu_of(this_rq)) {
-			/*
-			 * Pairs with smp_store_release() issued by this CPU in
-			 * switch_class() on the resched path.
-			 *
-			 * We busy-wait here to guarantee that no other task can
-			 * be scheduled on our core before the target CPU has
-			 * entered the resched path.
-			 */
-			while (smp_load_acquire(wait_pnt_seq) == pseqs[cpu])
-				cpu_relax();
-		}
+		/*
+		 * Busy-wait until the task running at the time of kicking is no
+		 * longer running. This can be used to implement e.g. core
+		 * scheduling.
+		 *
+		 * smp_cond_load_acquire() pairs with store_releases in
+		 * pick_task_scx() and put_prev_task_scx(). The former breaks
+		 * the wait if SCX's scheduling path is entered even if the same
+		 * task is picked subsequently. The latter is necessary to break
+		 * the wait when $cpu is taken by a higher sched class.
+		 */
+		if (cpu != cpu_of(this_rq))
+			smp_cond_load_acquire(wait_pnt_seq, VAL != pseqs[cpu]);
 
 		cpumask_clear_cpu(cpu, this_scx->cpus_to_wait);
 	}
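For readers unfamiliar with the `smp_cond_load_acquire(ptr, cond)` idiom used in the hunk above, the following is a rough user-space approximation built on C11 atomics. It is a sketch, not the kernel macro: `cond_load_acquire()`, `bump_seq()` and `wait_for_seq_change()` are hypothetical names, the macro relies on GNU statement expressions (GCC or Clang), and it omits the `cpu_relax()`-style pause or any architecture-specific waiting the kernel may use.

```c
#include <assert.h>
#include <stdatomic.h>

/*
 * User-space stand-in for smp_cond_load_acquire(): spin with relaxed loads
 * until cond_expr (written in terms of VAL) holds, then issue an acquire
 * fence so later accesses are ordered after the load that ended the loop.
 */
#define cond_load_acquire(ptr, cond_expr) ({				\
	unsigned long VAL;						\
	for (;;) {							\
		VAL = atomic_load_explicit((ptr), memory_order_relaxed); \
		if (cond_expr)						\
			break;						\
	}								\
	atomic_thread_fence(memory_order_acquire);			\
	VAL;								\
})

/* Target side: models the smp_store_release() increment of pnt_seq. */
static void bump_seq(atomic_ulong *seq)
{
	unsigned long cur = atomic_load_explicit(seq, memory_order_relaxed);

	atomic_store_explicit(seq, cur + 1, memory_order_release);
}

/* Waiter side: returns the first observed value that differs from snap. */
static unsigned long wait_for_seq_change(atomic_ulong *seq, unsigned long snap)
{
	unsigned long v = cond_load_acquire(seq, VAL != snap);

	assert(v != snap);	/* the loop only exits once the value changed */
	return v;
}
```

As in the kernel code, the waiter compares against a snapshot taken at kick time, so any later increment on the target side ends the wait, whichever path performed it.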

kernel/sched/ext_internal.h

Lines changed: 4 additions & 2 deletions
@@ -986,8 +986,10 @@ enum scx_kick_flags {
 	SCX_KICK_PREEMPT = 1LLU << 1,
 
 	/*
-	 * Wait for the CPU to be rescheduled. The scx_bpf_kick_cpu() call will
-	 * return after the target CPU finishes picking the next task.
+	 * The scx_bpf_kick_cpu() call will return after the current SCX task of
+	 * the target CPU switches out. This can be used to implement e.g. core
+	 * scheduling. This has no effect if the current task on the target CPU
+	 * is not on SCX.
 	 */
 	SCX_KICK_WAIT = 1LLU << 2,
 };
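The updated comment says SCX_KICK_WAIT has no effect when the target CPU's current task is not on SCX, which matches the new branch in kick_one_cpu(). A minimal sketch of that decision follows, assuming a simplified interface: `MODEL_KICK_WAIT`, `struct wait_decision` and `decide_wait()` are hypothetical, and the real code records the request in per-CPU cpumasks rather than passing flags around like this.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Same bit position as SCX_KICK_WAIT in the diff; the name is hypothetical. */
#define MODEL_KICK_WAIT	(1ULL << 2)

struct wait_decision {
	bool should_wait;	/* will the kicker busy-wait on this CPU? */
	unsigned long snap;	/* pnt_seq snapshot to compare against */
};

/*
 * Decide whether a kick should wait on the target CPU. Waiting only makes
 * sense when the target is currently running an SCX task; otherwise the
 * request is dropped, mirroring the cpumask_clear_cpu() branch.
 */
static struct wait_decision decide_wait(uint64_t flags, bool curr_is_scx,
					unsigned long pnt_seq)
{
	struct wait_decision d = { false, 0 };

	if ((flags & MODEL_KICK_WAIT) && curr_is_scx) {
		d.should_wait = true;
		d.snap = pnt_seq;
	}

	/* invariant from the new comment: never wait on a non-SCX task */
	assert(!d.should_wait || curr_is_scx);
	return d;
}
```

Dropping the wait for non-SCX tasks avoids spinning on a counter that only SCX's own put_prev and pick paths increment.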
