Commit 7a25759

gormanm authored and rafaeljw committed
cpuidle: Select polling interval based on a c-state with a longer target residency
It was noted that a few workloads that idle rapidly regressed when commit 36fcb42 ("cpuidle: use first valid target residency as poll time") was merged. The workloads in question were heavy communicators that idle rapidly and were impacted by the c-state exit latency as the active CPUs were not polling at the time of wakeup. As they were not particularly realistic workloads, it was not considered to be a major problem.

Unfortunately, a bug was reported for a real workload in a production environment that relied on large numbers of threads operating in a worker pool pattern. These threads would idle for periods of time longer than the C1 target residency and so incurred the c-state exit latency penalty. The application is very sensitive to wakeup latency and was indirectly relying on the behaviour prior to commit a37b969 ("cpuidle: poll_state: Add time limit to poll_idle()") to poll for long enough to avoid the exit latency cost.

The target residency of C1 is typically very short. On some x86 machines, it can be as low as 2 microseconds. In poll_idle(), the clock is checked every POLL_IDLE_RELAX_COUNT iterations of cpu_relax() and even one iteration of that loop can be over 1 microsecond, so the polling interval is very close to the granularity of what poll_idle() can detect. Furthermore, a basic ping-pong workload like perf bench pipe has a longer round-trip time than the 2 microseconds, meaning that the CPU will almost certainly not be polling when the ping-pong completes.

This patch selects a polling interval based on an enabled c-state that has a target residency longer than 10usec. If there is no such enabled c-state then polling will last up to TICK_NSEC/16, similar to what it was up until kernel 4.20. Polling for a full tick is unlikely (rescheduling event) and is much longer than the existing target residencies for a deep c-state.

As an example, consider a CPU with the following c-state information (in microseconds) from an Intel CPU:

	        residency  exit_latency
	C1              2             2
	C1E            20            10
	C3            100            33
	C6            400           133

The polling interval selected is 20usec. If booted with intel_idle.max_cstate=1 then the polling interval is 250usec as the deeper c-states were not available.

On an AMD EPYC machine, the c-state information is more limited and looks like:

	        residency  exit_latency
	C1              2             1
	C2            800           400

The polling interval selected is 250usec. While C2 was considered, the polling interval was clamped by CPUIDLE_POLL_MAX.

Note that it is not expected that polling will be a universal win. As well as potentially trading power for performance, the performance is not guaranteed if the extra polling prevented a turbo state being reached. Making it a tunable was considered but it's driver-specific, may be overridden by a governor and is not a guaranteed polling interval, making it difficult to describe without knowledge of the implementation.

tbench4
	                      vanilla                polling
	Hmean     1        497.89 (   0.00%)      543.15 *   9.09%*
	Hmean     2        975.88 (   0.00%)     1059.73 *   8.59%*
	Hmean     4       1953.97 (   0.00%)     2081.37 *   6.52%*
	Hmean     8       3645.76 (   0.00%)     4052.95 *  11.17%*
	Hmean     16      6882.21 (   0.00%)     6995.93 *   1.65%*
	Hmean     32     10752.20 (   0.00%)    10731.53 *  -0.19%*
	Hmean     64     12875.08 (   0.00%)    12478.13 *  -3.08%*
	Hmean     128    21500.54 (   0.00%)    21098.60 *  -1.87%*
	Hmean     256    21253.70 (   0.00%)    21027.18 *  -1.07%*
	Hmean     320    20813.50 (   0.00%)    20580.64 *  -1.12%*

Signed-off-by: Mel Gorman <mgorman@techsingularity.net>
Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
1 parent 0f6e2cb commit 7a25759

1 file changed

Lines changed: 23 additions & 2 deletions

drivers/cpuidle/cpuidle.c

@@ -368,6 +368,19 @@ void cpuidle_reflect(struct cpuidle_device *dev, int index)
 		cpuidle_curr_governor->reflect(dev, index);
 }
 
+/*
+ * Min polling interval of 10usec is a guess. It is assuming that
+ * for most users, the time for a single ping-pong workload like
+ * perf bench pipe would generally complete within 10usec but
+ * this is hardware dependant. Actual time can be estimated with
+ *
+ * perf bench sched pipe -l 10000
+ *
+ * Run multiple times to avoid cpufreq effects.
+ */
+#define CPUIDLE_POLL_MIN 10000
+#define CPUIDLE_POLL_MAX (TICK_NSEC / 16)
+
 /**
  * cpuidle_poll_time - return amount of time to poll for,
  * governors can override dev->poll_limit_ns if necessary
@@ -382,15 +395,23 @@ u64 cpuidle_poll_time(struct cpuidle_driver *drv,
 	int i;
 	u64 limit_ns;
 
+	BUILD_BUG_ON(CPUIDLE_POLL_MIN > CPUIDLE_POLL_MAX);
+
 	if (dev->poll_limit_ns)
 		return dev->poll_limit_ns;
 
-	limit_ns = TICK_NSEC;
+	limit_ns = CPUIDLE_POLL_MAX;
 	for (i = 1; i < drv->state_count; i++) {
+		u64 state_limit;
+
 		if (dev->states_usage[i].disable)
 			continue;
 
-		limit_ns = drv->states[i].target_residency_ns;
+		state_limit = drv->states[i].target_residency_ns;
+		if (state_limit < CPUIDLE_POLL_MIN)
+			continue;
+
+		limit_ns = min_t(u64, state_limit, CPUIDLE_POLL_MAX);
 		break;
 	}
 
