
Commit 4327fb1

Thomas Gleixner authored, Peter Zijlstra committed
sched/mmcid: Prevent live lock on task to CPU mode transition
Ihor reported a BPF CI failure which turned out to be a live lock in the
MM_CID management. The scenario is:

A test program creates the 5th thread, which means the MM_CID users
become more than the number of CPUs (four in this example), so it
switches to per CPU ownership mode. At this point each live task of the
program has a CID associated. Assume thread creation order assignment
for simplicity.

     T0 CID0   runs fork() and creates T4
     T1 CID1
     T2 CID2
     T3 CID3
     T4 ---    not visible yet

T0 sets mm_cid::percpu = true, transfers its own CID to CPU0 where it
runs on, and then starts the fixup which walks through the threads to
transfer the per task CIDs either to the CPU the task is running on or
drop them back into the pool if the task is not on a CPU.

During that, T1 - T3 are free to schedule in and out before the fixup
caught up with them. Going through all possible permutations with a
python script revealed a few problematic cases. The most trivial one is:

  T1 schedules in on CPU1 and observes percpu == true, so it transfers
  its CID to CPU1.

  T1 is migrated to CPU2 and on schedule in observes percpu == true, but
  CPU2 does not have a CID associated and T1 transferred its own to
  CPU1. So it has to allocate one with the CPU2 runqueue lock held, but
  the pool is empty, so it keeps looping in mm_get_cid().

  Now T0 reaches T1 in the thread walk and tries to lock the
  corresponding runqueue lock, which is held, causing a full live lock.

There is a similar scenario in the reverse direction of switching from
per CPU to task mode, which is way more obvious and was therefore
addressed by an intermediate mode. In this mode the CIDs are marked with
MM_CID_TRANSIT, which means that they are neither owned by the CPU nor
by the task. When a task schedules out with a transit CID, it drops the
CID back into the pool, making it available for others to use
temporarily.
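The changelog mentions exploring the schedule-in/migration permutations with a python script. As a hedged illustration (not that script, and not kernel code), the trivial case above can be replayed in a few lines; the pool/CID bookkeeping and all names are made up for this sketch:

```python
NR_CPUS = 4

def trivial_livelock_case():
    """Replay the simplest bad interleaving from the changelog."""
    pool = set()                                  # free CIDs: none left
    cpu_cid = {0: 0, 1: None, 2: None, 3: None}   # T0 moved CID0 to CPU0
    task_cid = {1: 1, 2: 2, 3: 3}                 # T1..T3 still own theirs

    # T1 schedules in on CPU1, sees percpu == true, hands its CID to CPU1
    cpu_cid[1] = task_cid.pop(1)

    # T1 migrates to CPU2 and schedules in: CPU2 has no CID, T1 has none
    # either, and the pool is empty -> mm_get_cid() would loop under the
    # runqueue lock, which the fixup walk then blocks on
    stuck = cpu_cid[2] is None and 1 not in task_cid and not pool
    return stuck

print(trivial_livelock_case())   # True: the allocation can never succeed
```

The point of the model is that a direct ownership transfer lets one task consume two CIDs transiently (one parked on CPU1, one being allocated for CPU2), which an exactly-sized pool cannot satisfy.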
Once the task which initiated the mode switch finished the fixup, it
clears the transit mode and the process goes back into per task
ownership mode.

Unfortunately this insight was not mapped back to the task to CPU mode
switch, as the above described scenario was not considered in the
analysis.

Apply the same transit mechanism to the task to CPU mode switch to
handle these problematic cases correctly. As with the CPU to task
transition this results in a potential temporary contention on the CID
bitmap, but that's only for the time it takes to complete the
transition. After that it stays in steady mode, which does not touch the
bitmap at all.

Fixes: fbd0e71 ("sched/mmcid: Provide CID ownership mode fixup functions")
Closes: https://lore.kernel.org/2b7463d7-0f58-4e34-9775-6e2115cfb971@linux.dev
Reported-by: Ihor Solodrai <ihor.solodrai@linux.dev>
Signed-off-by: Thomas Gleixner <tglx@kernel.org>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Reviewed-by: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
Link: https://patch.msgid.link/20260201192834.897115238@kernel.org
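The transit rule applied by the fix can be sketched minimally in Python rather than kernel C; the bit value and the task/pool representation are illustrative assumptions, not the kernel's:

```python
MM_CID_TRANSIT = 1 << 30          # illustrative bit value only

def schedule_out(task, pool):
    """Phase 1 rule: a transit-marked CID is dropped back into the pool,
    so a task allocating on another CPU can always make progress."""
    cid = task["cid"]
    if cid is not None and cid & MM_CID_TRANSIT:
        pool.add(cid & ~MM_CID_TRANSIT)   # return the bare CID number
        task["cid"] = None

pool = set()
t1 = {"cid": 1 | MM_CID_TRANSIT}  # T1 holds CID1 in transit mode
schedule_out(t1, pool)
print(pool, t1["cid"])            # {1} None
```

Because no CID stays pinned to a sleeping task or an idle CPU while transit mode is active, the exhaustion that fed the live lock cannot occur; the cost is pool (bitmap) traffic bounded by the transition window.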
1 parent 18f7fcd commit 4327fb1

2 files changed: 88 additions & 44 deletions

kernel/sched/core.c (84 additions & 44 deletions)
@@ -10269,7 +10269,8 @@ void call_trace_sched_update_nr_running(struct rq *rq, int count)
  * Serialization rules:
  *
  * mm::mm_cid::mutex:   Serializes fork() and exit() and therefore
- *                      protects mm::mm_cid::users.
+ *                      protects mm::mm_cid::users and mode switch
+ *                      transitions
  *
  * mm::mm_cid::lock:    Serializes mm_update_max_cids() and
  *                      mm_update_cpus_allowed(). Nests in mm_cid::mutex
@@ -10285,14 +10286,61 @@ void call_trace_sched_update_nr_running(struct rq *rq, int count)
  *
  * A CID is either owned by a task (stored in task_struct::mm_cid.cid) or
  * by a CPU (stored in mm::mm_cid.pcpu::cid). CIDs owned by CPUs have the
- * MM_CID_ONCPU bit set. During transition from CPU to task ownership mode,
- * MM_CID_TRANSIT is set on the per task CIDs. When this bit is set the
- * task needs to drop the CID into the pool when scheduling out. Both bits
- * (ONCPU and TRANSIT) are filtered out by task_cid() when the CID is
- * actually handed over to user space in the RSEQ memory.
+ * MM_CID_ONCPU bit set.
+ *
+ * During the transition of ownership mode, the MM_CID_TRANSIT bit is set
+ * on the CIDs. When this bit is set the tasks drop the CID back into the
+ * pool when scheduling out.
+ *
+ * Both bits (ONCPU and TRANSIT) are filtered out by task_cid() when the
+ * CID is actually handed over to user space in the RSEQ memory.
  *
  * Mode switching:
  *
+ * All transitions of ownership mode happen in two phases:
+ *
+ * 1) mm:mm_cid.transit contains MM_CID_TRANSIT. This is OR'ed on the CIDs
+ *    and denotes that the CID is only temporarily owned by a task. When
+ *    the task schedules out it drops the CID back into the pool if this
+ *    bit is set.
+ *
+ * 2) The initiating context walks the per CPU space or the tasks to fixup
+ *    or drop the CIDs and after completion it clears mm:mm_cid.transit.
+ *    After that point the CIDs are strictly task or CPU owned again.
+ *
+ * This two phase transition is required to prevent CID space exhaustion
+ * during the transition as a direct transfer of ownership would fail:
+ *
+ *  - On task to CPU mode switch if a task is scheduled in on one CPU and
+ *    then migrated to another CPU before the fixup freed enough per task
+ *    CIDs.
+ *
+ *  - On CPU to task mode switch if two tasks are scheduled in on the same
+ *    CPU before the fixup freed per CPU CIDs.
+ *
+ * Both scenarios can result in a live lock because sched_in() is invoked
+ * with runqueue lock held and loops in search of a CID and the fixup
+ * thread can't make progress freeing them up because it is stuck on the
+ * same runqueue lock.
+ *
+ * While MM_CID_TRANSIT is active during the transition phase the MM_CID
+ * bitmap can be contended, but that's a temporary contention bound to the
+ * transition period. After that everything goes back into steady state and
+ * nothing except fork() and exit() will touch the bitmap. This is an
+ * acceptable tradeoff as it completely avoids complex serialization,
+ * memory barriers and atomic operations for the common case.
+ *
+ * Aside of that this mechanism also ensures RT compatibility:
+ *
+ *  - The task which runs the fixup is fully preemptible except for the
+ *    short runqueue lock held sections.
+ *
+ *  - The transient impact of the bitmap contention is only problematic
+ *    when there is a thundering herd scenario of tasks scheduling in and
+ *    out concurrently. There is not much which can be done about that
+ *    except for avoiding mode switching by a proper overall system
+ *    configuration.
+ *
  * Switching to per CPU mode happens when the user count becomes greater
  * than the maximum number of CIDs, which is calculated by:
  *
@@ -10306,12 +10354,13 @@ void call_trace_sched_update_nr_running(struct rq *rq, int count)
  *
  * At the point of switching to per CPU mode the new user is not yet
  * visible in the system, so the task which initiated the fork() runs the
- * fixup function: mm_cid_fixup_tasks_to_cpu() walks the thread list and
- * either transfers each tasks owned CID to the CPU the task runs on or
- * drops it into the CID pool if a task is not on a CPU at that point in
- * time. Tasks which schedule in before the task walk reaches them do the
- * handover in mm_cid_schedin(). When mm_cid_fixup_tasks_to_cpus() completes
- * it's guaranteed that no task related to that MM owns a CID anymore.
+ * fixup function. mm_cid_fixup_tasks_to_cpu() walks the thread list and
+ * either marks each task owned CID with MM_CID_TRANSIT if the task is
+ * running on a CPU or drops it into the CID pool if a task is not on a
+ * CPU. Tasks which schedule in before the task walk reaches them do the
+ * handover in mm_cid_schedin(). When mm_cid_fixup_tasks_to_cpus()
+ * completes it is guaranteed that no task related to that MM owns a CID
+ * anymore.
  *
  * Switching back to task mode happens when the user count goes below the
  * threshold which was recorded on the per CPU mode switch:
@@ -10327,28 +10376,11 @@ void call_trace_sched_update_nr_running(struct rq *rq, int count)
  * run either in the deferred update function in context of a workqueue or
  * by a task which forks a new one or by a task which exits. Whatever
  * happens first. mm_cid_fixup_cpus_to_task() walks through the possible
- * CPUs and either transfers the CPU owned CIDs to a related task which
- * runs on the CPU or drops it into the pool. Tasks which schedule in on a
- * CPU which the walk did not cover yet do the handover themself.
- *
- * This transition from CPU to per task ownership happens in two phases:
- *
- * 1) mm:mm_cid.transit contains MM_CID_TRANSIT This is OR'ed on the task
- *    CID and denotes that the CID is only temporarily owned by the
- *    task. When it schedules out the task drops the CID back into the
- *    pool if this bit is set.
- *
- * 2) The initiating context walks the per CPU space and after completion
- *    clears mm:mm_cid.transit. So after that point the CIDs are strictly
- *    task owned again.
- *
- * This two phase transition is required to prevent CID space exhaustion
- * during the transition as a direct transfer of ownership would fail if
- * two tasks are scheduled in on the same CPU before the fixup freed per
- * CPU CIDs.
- *
- * When mm_cid_fixup_cpus_to_tasks() completes it's guaranteed that no CID
- * related to that MM is owned by a CPU anymore.
+ * CPUs and either marks the CPU owned CIDs with MM_CID_TRANSIT if a
+ * related task is running on the CPU or drops it into the pool. Tasks
+ * which are scheduled in before the fixup covered them do the handover
+ * themself. When mm_cid_fixup_cpus_to_tasks() completes it is guaranteed
+ * that no CID related to that MM is owned by a CPU anymore.
  */
 
 /*
@@ -10400,9 +10432,9 @@ static bool mm_update_max_cids(struct mm_struct *mm)
 	/* Mode change required? */
 	if (!!mc->percpu == !!mc->pcpu_thrs)
 		return false;
-	/* When switching back to per TASK mode, set the transition flag */
-	if (!mc->pcpu_thrs)
-		WRITE_ONCE(mc->transit, MM_CID_TRANSIT);
+
+	/* Set the transition flag to bridge the transfer */
+	WRITE_ONCE(mc->transit, MM_CID_TRANSIT);
 	WRITE_ONCE(mc->percpu, !!mc->pcpu_thrs);
 	return true;
 }
@@ -10493,10 +10525,10 @@ static void mm_cid_fixup_cpus_to_tasks(struct mm_struct *mm)
 	WRITE_ONCE(mm->mm_cid.transit, 0);
 }
 
-static inline void mm_cid_transfer_to_cpu(struct task_struct *t, struct mm_cid_pcpu *pcp)
+static inline void mm_cid_transit_to_cpu(struct task_struct *t, struct mm_cid_pcpu *pcp)
 {
 	if (cid_on_task(t->mm_cid.cid)) {
-		t->mm_cid.cid = cid_to_cpu_cid(t->mm_cid.cid);
+		t->mm_cid.cid = cid_to_transit_cid(t->mm_cid.cid);
 		pcp->cid = t->mm_cid.cid;
 	}
 }
@@ -10509,18 +10541,17 @@ static bool mm_cid_fixup_task_to_cpu(struct task_struct *t, struct mm_struct *mm
 	if (!t->mm_cid.active)
 		return false;
 	if (cid_on_task(t->mm_cid.cid)) {
-		/* If running on the CPU, transfer the CID, otherwise drop it */
+		/* If running on the CPU, put the CID in transit mode, otherwise drop it */
 		if (task_rq(t)->curr == t)
-			mm_cid_transfer_to_cpu(t, per_cpu_ptr(mm->mm_cid.pcpu, task_cpu(t)));
+			mm_cid_transit_to_cpu(t, per_cpu_ptr(mm->mm_cid.pcpu, task_cpu(t)));
 		else
 			mm_unset_cid_on_task(t);
 	}
 	return true;
 }
 
-static void mm_cid_fixup_tasks_to_cpus(void)
+static void mm_cid_do_fixup_tasks_to_cpus(struct mm_struct *mm)
 {
-	struct mm_struct *mm = current->mm;
 	struct task_struct *p, *t;
 	unsigned int users;
 
@@ -10558,6 +10589,15 @@ static void mm_cid_do_fixup_tasks_to_cpus(struct mm_struct *mm)
 	}
 }
 
+static void mm_cid_fixup_tasks_to_cpus(void)
+{
+	struct mm_struct *mm = current->mm;
+
+	mm_cid_do_fixup_tasks_to_cpus(mm);
+	/* Clear the transition bit */
+	WRITE_ONCE(mm->mm_cid.transit, 0);
+}
+
 static bool sched_mm_cid_add_user(struct task_struct *t, struct mm_struct *mm)
 {
 	t->mm_cid.active = 1;
@@ -10596,7 +10636,7 @@ void sched_mm_cid_fork(struct task_struct *t)
 		if (!percpu)
 			mm_cid_transit_to_task(current, pcp);
 		else
-			mm_cid_transfer_to_cpu(current, pcp);
+			mm_cid_transit_to_cpu(current, pcp);
 	}
 
 	if (percpu) {

kernel/sched/sched.h (4 additions & 0 deletions)
@@ -3841,6 +3841,10 @@ static __always_inline void mm_cid_from_cpu(struct task_struct *t, unsigned int
 		/* Still nothing, allocate a new one */
 		if (!cid_on_cpu(cpu_cid))
 			cpu_cid = cid_to_cpu_cid(mm_get_cid(mm));
+
+		/* Set the transition mode flag if required */
+		if (READ_ONCE(mm->mm_cid.transit))
+			cpu_cid = cpu_cid_to_cid(cpu_cid) | MM_CID_TRANSIT;
 	}
 	mm_cid_update_pcpu_cid(mm, cpu_cid);
 	mm_cid_update_task_cid(t, cpu_cid);
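As a hedged Python model of the sched.h hunk above (not kernel code): once the CPU-owned CID is resolved, an active mode switch re-tags it with the transit bit so it stays droppable until the fixup completes. Bit positions and helper names here are illustrative assumptions:

```python
MM_CID_ONCPU   = 1 << 31   # illustrative bit positions, not the kernel's
MM_CID_TRANSIT = 1 << 30
CID_MASK       = MM_CID_TRANSIT - 1

def resolve_cpu_cid(cpu_cid, transit, alloc):
    """Model of the CID resolution step in mm_cid_from_cpu()."""
    # Still nothing on the CPU: allocate a fresh CID and tag it ONCPU
    if cpu_cid is None:
        cpu_cid = alloc() | MM_CID_ONCPU
    # Mode switch in flight: strip ownership bits, mark as transit-only,
    # so schedule-out drops the CID back into the pool
    if transit:
        cpu_cid = (cpu_cid & CID_MASK) | MM_CID_TRANSIT
    return cpu_cid

cid = resolve_cpu_cid(None, True, lambda: 3)
print(cid == 3 | MM_CID_TRANSIT)   # True: CID3, transit-tagged
```

This mirrors why the fix works for the task to CPU direction too: even a freshly allocated CID cannot get pinned to a CPU while `mm_cid.transit` is set.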
