Commit 9359d97

Peter Zijlstra committed

sched/core: Add comment explaining force-idle vruntime snapshots
I always end up having to re-read these emails every time I look at
this code. And a future patch is going to change this story a little.
This means it is past time to stick them in a comment so it can be
modified and stay current.

Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Link: https://lkml.kernel.org/r/20200506143506.GH5298@hirez.programming.kicks-ass.net
Link: https://lkml.kernel.org/r/20200515103844.GG2978@hirez.programming.kicks-ass.net
Link: https://patch.msgid.link/20251106111603.GB4068168@noisy.programming.kicks-ass.net
1 parent 7f829bd commit 9359d97

1 file changed: kernel/sched/fair.c

Lines changed: 181 additions & 0 deletions
@@ -13014,6 +13014,187 @@ static inline void task_tick_core(struct rq *rq, struct task_struct *curr)
		resched_curr(rq);
}

/*
 * Consider any infeasible weight scenario. Take for instance two tasks,
 * each bound to their respective sibling, one with weight 1 and one with
 * weight 2. Then the lower weight task will run ahead of the higher weight
 * task without bound.
 *
 * This utterly destroys the concept of a shared time base.
 *
 * Remember; all this is about proportionally fair scheduling, where each
 * task receives:
 *
 *              w_i
 *   dt_i = ---------- dt                                              (1)
 *          \Sum_j w_j
 *
 * which we do by tracking a virtual time, s_i:
 *
 *         1
 *   s_i = --- d[t]_i                                                  (2)
 *         w_i
 *
 * Where d[t] is a delta of discrete time, while dt is an infinitesimal.
 * The immediate corollary is that the ideal schedule S, where (2) is
 * taken with an infinitesimal delta, is:
 *
 *           1
 *   S = ---------- dt                                                 (3)
 *       \Sum_i w_i
 *
 * From which we can define the lag, or deviation from the ideal, as:
 *
 *   lag(i) = S - s_i                                                  (4)
 *
 * And since the one and only purpose is to approximate S, we get that:
 *
 *   \Sum_i w_i lag(i) := 0                                            (5)
 *
 * If this were not so, we would no longer converge to S, and we could no
 * longer claim our scheduler has any of the properties we derive from S.
 * This is exactly what the infeasible weight scenario above does; it
 * breaks this invariant.
 *
 * Let's continue for a while though, to see if there is anything useful
 * to be learned. We can combine (1)-(3) or (4)-(5) and express S in terms
 * of s_i:
 *
 *       \Sum_i w_i s_i
 *   S = --------------                                                (6)
 *         \Sum_i w_i
 *
 * Which gives us a way to compute S, given our s_i. Now, if you've read
 * our code, you know that we do not in fact do this; the reason is
 * two-fold. Firstly, computing S that way requires a 64-bit division
 * every time we use it (see (12)), and secondly, this only describes the
 * steady state, it doesn't handle dynamics.
 *
 * Anyway, substitute s_i -> x + (s_i - x) in (6) to get:
 *
 *           \Sum_i w_i (s_i - x)
 *   S - x = --------------------                                      (7)
 *               \Sum_i w_i
 *
 * Which shows that S and s_i transform alike (which makes perfect sense
 * given that S is basically the (weighted) average of s_i).
 *
 * Then:
 *
 *   x -> s_min := min{s_i}                                            (8)
 *
 * to obtain:
 *
 *               \Sum_i w_i (s_i - s_min)
 *   S = s_min + ------------------------                              (9)
 *                     \Sum_i w_i
 *
 * Which already looks familiar, and is the basis for our current
 * approximation:
 *
 *   S ~= s_min                                                        (10)
 *
 * Now, obviously, (10) is absolute crap :-), but it sorta works.
 *
 * So the thing to remember is that the above is strictly UP. It is
 * possible to generalize to multiple runqueues -- however it gets really
 * yucky when you have to add affinity support, as illustrated by our very
 * first counter-example.
 *
 * Luckily I think we can avoid needing a full multi-queue variant for
 * core-scheduling (or load-balancing). The crucial observation is that we
 * only actually need this comparison in the presence of forced-idle; only
 * then do we need to tell whether the stalled rq has priority over the
 * other.
 *
 * [XXX assumes SMT2; better consider the more general case, I suspect
 * it'll work out because our comparison is always between 2 rqs and the
 * answer is only interesting if one of them is forced-idle]
 *
 * And (under the assumption of SMT2) when there is forced-idle, there is
 * only a single running queue, so everything works like normal.
 *
 * Let, for our runqueue 'k':
 *
 *   T_k = \Sum_i w_i s_i
 *   W_k = \Sum_i w_i ; for all i of k                                 (11)
 *
 * Then we can write (6) like:
 *
 *         T_k
 *   S_k = ---                                                         (12)
 *         W_k
 *
 * From which it immediately follows that:
 *
 *           T_k + T_l
 *   S_k+l = ---------                                                 (13)
 *           W_k + W_l
 *
 * With which we can define a combined lag:
 *
 *   lag_k+l(i) := S_k+l - s_i                                         (14)
 *
 * And that gives us the tools to compare tasks across a combined runqueue.
 *
 * Combined, this gives the following:
 *
 *  a) when a runqueue enters force-idle, sync it against its sibling
 *     rq(s) using (7); this only requires storing single 'time'-stamps.
 *
 *  b) when comparing tasks between 2 runqueues of which one is
 *     forced-idle, compare the combined lag, per (14).
 *
 * Now, of course, cgroups (I so hate them) make this more interesting in
 * that a) seems to suggest we need to iterate all cgroups on a CPU at
 * such boundaries, but I think we can avoid that. The force-idle is for
 * the whole CPU, all its rqs. So we can mark it in the root and lazily
 * propagate downward on demand.
 */
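
To make (11)-(14) concrete, here is a minimal userspace C sketch (an
editor's illustration, not part of the patch; the struct and function
names are made up, and it performs the per-use division that the kernel
deliberately avoids, per the discussion around (10) and (12)):

#include <stdio.h>
#include <stdint.h>

struct task {
        uint64_t weight;        /* w_i */
        uint64_t vruntime;      /* s_i */
};

struct rq_sums {
        uint64_t T;             /* T_k = \Sum_i w_i s_i, per (11) */
        uint64_t W;             /* W_k = \Sum_i w_i,     per (11) */
};

/* Account a task into its runqueue's sums; a real implementation
 * would have to worry about overflow of w_i * s_i. */
static void rq_add(struct rq_sums *rq, const struct task *p)
{
        rq->T += p->weight * p->vruntime;
        rq->W += p->weight;
}

/* (13): S_k+l = (T_k + T_l) / (W_k + W_l) */
static double combined_S(const struct rq_sums *k, const struct rq_sums *l)
{
        return (double)(k->T + l->T) / (double)(k->W + l->W);
}

/* (14): lag_k+l(i) = S_k+l - s_i; the larger the lag, the more starved. */
static double combined_lag(double S, const struct task *p)
{
        return S - (double)p->vruntime;
}

int main(void)
{
        struct task a = { .weight = 1024, .vruntime = 100 };    /* on rq k */
        struct task b = { .weight = 2048, .vruntime =  40 };    /* on rq l */
        struct rq_sums k = { 0, 0 }, l = { 0, 0 };

        rq_add(&k, &a);
        rq_add(&l, &b);

        double S = combined_S(&k, &l);
        printf("S = %.2f, lag(a) = %.2f, lag(b) = %.2f\n",
               S, combined_lag(S, &a), combined_lag(S, &b));
        return 0;
}

With these sample values it prints S = 60.00, lag(a) = -40.00,
lag(b) = 20.00: task b is behind the combined average and would win a
cross-queue comparison, which is exactly the property (14) is after.
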
/*
 * So this sync is basically a relative reset of S to 0.
 *
 * With 2 queues, when one goes idle, we drop them both to 0; the running
 * one then increases due to not being idle, and the idle one builds up
 * lag to get re-elected. So far so simple, right?
 *
 * When there are 3, we can have the situation where 2 run and one is
 * idle; we sync to 0 and let the idle one build up lag to get re-elected.
 * Now suppose another one also goes idle. At this point dropping all to 0
 * again would destroy the built-up lag from the queue that was already
 * idle; not good.
 *
 * So instead of syncing everything, we can:
 *
 *   less := !((s64)(s_a - s_b) <= 0)
 *
 *   (v_a - S_a) - (v_b - S_b) == v_a - v_b - S_a + S_b
 *                             == v_a - (v_b + S_a - S_b)
 *
 * IOW, we can recast the (lag) comparison as a one-sided difference.
 * So then, instead of syncing the whole queue, sync the idle queue
 * against the active queue with the offset S_a - S_b at the point where
 * we sync.
 *
 * (XXX consider the implication of living in a cyclic group: Z / 2^n Z)
 *
 * This gives us a means of syncing single queues against the active
 * queue, and for already idle queues to preserve their built-up lag.
 *
 * Of course, then we get the situation where there are 2 active queues
 * and one going idle: which do we pick to sync against? Theory would have
 * us sync against the combined S, but as we've already demonstrated,
 * there is no such thing in infeasible weight scenarios.
 *
 * One thing I've considered (and this is where that core_active rudiment
 * came from) is having active queues sync up among themselves after
 * every tick. This limits the observed divergence due to work
 * conservation.
 *
 * On top of that, we can improve upon things by moving away from our
 * horrible (10) hack to (9), and employing (13) here.
 */

/*
 * se_fi_update - Update the cfs_rq->min_vruntime_fi in a CFS hierarchy if needed.
 */
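
The one-sided recast and the cyclic-group caveat in the new comment can
also be illustrated with a small sketch (again an editor's illustration
with made-up names, not the kernel's code): the idle queue records an
offset S_a - S_b against the active queue, and every comparison goes
through a signed delta so that u64 wraparound behaves:

#include <stdio.h>
#include <stdint.h>

/* Wrap-safe "a is before b" for values living in the cyclic group
 * Z / 2^64 Z; comparing raw u64 values breaks once a value wraps. */
static int vruntime_before(uint64_t a, uint64_t b)
{
        return (int64_t)(a - b) < 0;
}

/*
 * When a queue goes (force-)idle, record a one-sided offset against the
 * active queue instead of resetting everything to 0:
 *
 *   (v_a - S_a) - (v_b - S_b) == v_a - (v_b + (S_a - S_b))
 *
 * so comparing v_a against v_b + offset is the lag comparison, and a
 * queue that was already idle keeps its built-up lag.
 */
static uint64_t sync_offset(uint64_t S_a, uint64_t S_b)
{
        return S_a - S_b;       /* u64 subtraction wraps correctly */
}

int main(void)
{
        uint64_t S_a = 1000, S_b = 400; /* per-queue baselines at sync time */
        uint64_t off = sync_offset(S_a, S_b);

        uint64_t v_a = 1100;    /* active-queue task: v_a - S_a = 100 */
        uint64_t v_b =  450;    /* idle-queue task:   v_b - S_b =  50 */

        if (vruntime_before(v_b + off, v_a))
                printf("pick b: it is further behind\n");
        else
                printf("pick a\n");
        return 0;
}

Here v_b + off = 1050 compares before v_a = 1100, matching the direct
lag comparison (50 < 100), and the same code keeps working after any of
the vruntimes wrap around 2^64.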
