@@ -13014,6 +13014,187 @@ static inline void task_tick_core(struct rq *rq, struct task_struct *curr)
13014 13014		resched_curr(rq);
13015 13015	}
13016 13016
13017+ /*
13018+ * Consider any infeasible weight scenario. Take for instance two tasks,
13019+ * each bound to their respective sibling, one with weight 1 and one with
13020+ * weight 2. Then the lower weight task will run ahead of the higher weight
13021+ * task without bound.
13022+ *
13023+ * This utterly destroys the concept of a shared time base.
13024+ *
13025+ * Remember; all this is about proportionally fair scheduling, where each
13026+ * task receives:
13027+ *
13028+ * w_i
13029+ * dt_i = ---------- dt (1)
13030+ * \Sum_j w_j
13031+ *
13032+ * which we do by tracking a virtual time, s_i:
13033+ *
13034+ * 1
13035+ * s_i = --- d[t]_i (2)
13036+ * w_i
13037+ *
13038+ * Where d[t] is a delta of discrete time, while dt is an infinitesimal.
13039+ * The immediate corollary is that the ideal schedule S, where (2) uses
13040+ * an infinitesimal delta, is:
13041+ *
13042+ * 1
13043+ * S = ---------- dt (3)
13044+ * \Sum_i w_i
13045+ *
13046+ * From which we can define the lag, or deviation from the ideal, as:
13047+ *
13048+ * lag(i) = S - s_i (4)
13049+ *
13050+ * And since the one and only purpose is to approximate S, we get that:
13051+ *
13052+ * \Sum_i w_i lag(i) := 0 (5)
13053+ *
13054+ * If this were not so, we no longer converge to S, and we can no longer
13055+ * claim our scheduler has any of the properties we derive from S. This
13056+ * is exactly what you did above; you broke it!
13057+ *
13058+ *
13059+ * Let's continue for a while though, to see if there is anything useful to
13060+ * be learned. We can combine (1)-(3) or (4)-(5) and express S in s_i:
13061+ *
13062+ * \Sum_i w_i s_i
13063+ * S = -------------- (6)
13064+ * \Sum_i w_i
13065+ *
13066+ * Which gives us a way to compute S, given our s_i. Now, if you've read
13067+ * our code, you know that we do not in fact do this; the reason is
13068+ * two-fold. Firstly, computing S that way requires a 64bit division
13069+ * every time we'd use it (see (12)), and secondly, this only describes
13070+ * the steady-state; it doesn't handle dynamics.
13071+ *
13072+ * Anyway, in (6): s_i -> x + (s_i - x), to get:
13073+ *
13074+ * \Sum_i w_i (s_i - x)
13075+ * S - x = -------------------- (7)
13076+ * \Sum_i w_i
13077+ *
13078+ * Which shows that S and s_i transform alike (which makes perfect sense
13079+ * given that S is basically the (weighted) average of s_i).
13080+ *
13081+ * Then:
13082+ *
13083+ * x -> s_min := min{s_i} (8)
13084+ *
13085+ * to obtain:
13086+ *
13087+ * \Sum_i w_i (s_i - s_min)
13088+ * S = s_min + ------------------------ (9)
13089+ * \Sum_i w_i
13090+ *
13091+ * Which already looks familiar, and is the basis for our current
13092+ * approximation:
13093+ *
13094+ * S ~= s_min (10)
13095+ *
13096+ * Now, obviously, (10) is absolute crap :-), but it sorta works.
13097+ *
13098+ * So the thing to remember is that the above is strictly UP. It is
13099+ * possible to generalize to multiple runqueues -- however it gets really
13100+ * yuck when you have to add affinity support, as illustrated by our very
13101+ * first counter-example.
13102+ *
13103+ * Luckily I think we can avoid needing a full multi-queue variant for
13104+ * core-scheduling (or load-balancing). The crucial observation is that we
13105+ * only actually need this comparison in the presence of forced-idle; only
13106+ * then do we need to tell if the stalled rq has higher priority than the
13107+ * other.
13108+ *
13109+ * [XXX assumes SMT2; better consider the more general case, I suspect
13110+ * it'll work out because our comparison is always between 2 rqs and the
13111+ * answer is only interesting if one of them is forced-idle]
13112+ *
13113+ * And (under assumption of SMT2) when there is forced-idle, there is only
13114+ * a single queue, so everything works like normal.
13115+ *
13116+ * Let, for our runqueue 'k':
13117+ *
13118+ * T_k = \Sum_i w_i s_i
13119+ * W_k = \Sum_i w_i ; for all i of k (11)
13120+ *
13121+ * Then we can write (6) like:
13122+ *
13123+ * T_k
13124+ * S_k = --- (12)
13125+ * W_k
13126+ *
13127+ * From which immediately follows that:
13128+ *
13129+ * T_k + T_l
13130+ * S_k+l = --------- (13)
13131+ * W_k + W_l
13132+ *
13133+ * On which we can define a combined lag:
13134+ *
13135+ * lag_k+l(i) := S_k+l - s_i (14)
13136+ *
13137+ * And that gives us the tools to compare tasks across a combined runqueue.
13138+ *
13139+ *
13140+ * Combined this gives the following:
13141+ *
13142+ * a) when a runqueue enters force-idle, sync it against its sibling rq(s)
13143+ *    using (7); this only requires storing single 'time'-stamps.
13144+ *
13145+ * b) when comparing tasks between 2 runqueues of which one is forced-idle,
13146+ * compare the combined lag, per (14).
13147+ *
13148+ * Now, of course cgroups (I so hate them) make this more interesting in
13149+ * that a) seems to suggest we need to iterate all cgroups on a CPU at
13150+ * such boundaries, but I think we can avoid that. The force-idle is for
13151+ * the whole CPU, all its rqs. So we can mark it in the root and lazily
13152+ * propagate downward on demand.
13153+ */
13154+
13155+ /*
13156+ * So this sync is basically a relative reset of S to 0.
13157+ *
13158+ * So with 2 queues, when one goes idle, we drop them both to 0 and one
13159+ * then increases due to not being idle, and the idle one builds up lag to
13160+ * get re-elected. So far so simple, right?
13161+ *
13162+ * When there's 3, we can have the situation where 2 run and one is idle,
13163+ * we sync to 0 and let the idle one build up lag to get re-election. Now
13164+ * suppose another one also drops idle. At this point dropping all to 0
13165+ * again would destroy the built-up lag from the queue that was already
13166+ * idle, not good.
13167+ *
13168+ * So instead of syncing everything, we can:
13169+ *
13170+ *   less := !((s64)(v_a - v_b) <= 0)
13171+ *
13172+ *   (v_a - S_a) - (v_b - S_b) == v_a - v_b - S_a + S_b
13173+ *                             == v_a - (v_b - S_a + S_b)
13174+ *
13175+ * IOW, we can recast the (lag) comparison to a one-sided difference.
13176+ * So then, instead of syncing the whole queue, sync the idle queue
13177+ * against the active queue with offset -S_a + S_b at the sync point.
13178+ *
13179+ * (XXX consider the implication of living in a cyclic group: N / 2^n N)
13180+ *
13181+ * This gives us means of syncing single queues against the active queue,
13182+ * and for already idle queues to preserve their build-up lag.
13183+ *
13184+ * Of course, then we get the situation where there's 2 active and one
13185+ * going idle, who do we pick to sync against? Theory would have us sync
13186+ * against the combined S, but as we've already demonstrated, there is no
13187+ * such thing in infeasible weight scenarios.
13188+ *
13189+ * One thing I've considered; and this is where that core_active rudiment
13190+ * came from, is having active queues sync up between themselves after
13191+ * every tick. This limits the observed divergence due to the work
13192+ * conservancy.
13193+ *
13194+ * On top of that, we can improve upon things by moving away from our
13195+ * horrible (10) hack and moving to (9) and employing (13) here.
13196+ */
13197+
13017 13198	/*
13018 13199	 * se_fi_update - Update the cfs_rq->min_vruntime_fi in a CFS hierarchy if needed.
13019 13200	 */