
Commit c89756b

Merge tag 'pm-6.16-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/rafael/linux-pm
Pull power management updates from Rafael Wysocki:
 "Once again, the changes are dominated by cpufreq updates, but this
  time the majority of them are cpufreq core changes, mostly related to
  the introduction of policy locking guards and __free() usage, and
  fixes related to boost handling.

  Still, there is also a significant update of the intel_pstate driver
  making it register an energy model when running on a hybrid platform
  which is used for enabling energy-aware scheduling (EAS) if the driver
  operates in the passive mode (and schedutil is used as the cpufreq
  governor for all CPUs which is the passive mode default).

  There are some amd-pstate driver updates too, for a good measure,
  including the "Requested CPU Min frequency" BIOS option support and
  new online/offline callbacks.

  In the cpuidle space, the most significant change is the addition of
  a C1 demotion on/off sysfs knob to intel_idle which should help some
  users to configure their systems more precisely. There is also the
  conversion of the PSCI cpuidle driver to a faux device one and there
  are two small updates of cpuidle governors.

  Device power management is also modified quite a bit, especially the
  handling of devices with asynchronous suspend and resume enabled
  during system transitions. They are now going to be handled more
  asynchronously during suspend transitions and somewhat less
  aggressively during resume transitions.

  Apart from the above, the operating performance points (OPP) library
  is now going to use mutex locking guards and scope-based cleanup
  helpers and there is the usual bunch of assorted fixes and code
  cleanups.

  Specifics:

   - Fix potential division-by-zero error in em_compute_costs()
     (Yaxiong Tian)

   - Fix typos in energy model documentation and example driver code
     (Moon Hee Lee, Atul Kumar Pant)

   - Rearrange the energy model management code and add a new function
     for adjusting a CPU energy model after adjusting the capacity of
     the given CPU to it (Rafael Wysocki)

   - Refactor cpufreq_online(), add and use cpufreq policy locking
     guards, use __free() in policy reference counting, and clean up
     core cpufreq code on top of that (Rafael Wysocki)

   - Fix boost handling on CPU suspend/resume and sysfs updates (Viresh
     Kumar)

   - Fix des_perf clamping with max_perf in amd_pstate_update()
     (Dhananjay Ugwekar)

   - Add offline, online and suspend callbacks to the amd-pstate driver,
     rename and use the existing amd_pstate_epp callbacks in it
     (Dhananjay Ugwekar)

   - Add support for the "Requested CPU Min frequency" BIOS option to
     the amd-pstate driver (Dhananjay Ugwekar)

   - Reset amd-pstate driver mode after running selftests (Swapnil
     Sapkal)

   - Avoid shadowing ret in amd_pstate_ut_check_driver() (Nathan
     Chancellor)

   - Add helper for governor checks to the schedutil cpufreq governor
     and move cpufreq-specific EAS checks to cpufreq (Rafael Wysocki)

   - Populate the cpu_capacity sysfs entries from the intel_pstate
     driver after registering asym capacity support (Ricardo Neri)

   - Add support for enabling Energy-aware scheduling (EAS) to the
     intel_pstate driver when operating in the passive mode on a hybrid
     platform (Rafael Wysocki)

   - Drop redundant cpus_read_lock() from store_local_boost() in the
     cpufreq core (Seyediman Seyedarab)

   - Replace sscanf() with kstrtouint() in the cpufreq code and use a
     symbol instead of a raw number in it (Bowen Yu)

   - Add support for autonomous CPU performance state selection to the
     CPPC cpufreq driver (Lifeng Zheng)

   - OPP: Add dev_pm_opp_set_level() (Praveen Talari)

   - Introduce scope-based cleanup headers and mutex locking guards in
     OPP core (Viresh Kumar)

   - Switch OPP to use kmemdup_array() (Zhang Enpei)

   - Optimize bucket assignment when next_timer_ns equals KTIME_MAX in
     the menu cpuidle governor (Zhongqiu Han)

   - Convert the cpuidle PSCI driver to a faux device one (Sudeep Holla)

   - Add C1 demotion on/off sysfs knob to the intel_idle driver (Artem
     Bityutskiy)

   - Fix typos in two comments in the teo cpuidle governor (Atul Kumar
     Pant)

   - Fix denying of auto suspend in pm_suspend_timer_fn() (Charan Teja
     Kalla)

   - Move debug runtime PM attributes to runtime_attrs[] (Rafael
     Wysocki)

   - Add new devm_ functions for enabling runtime PM and runtime PM
     reference counting (Bence Csókás)

   - Remove size arguments from strscpy() calls in the hibernation core
     code (Thorsten Blum)

   - Adjust the handling of devices with asynchronous suspend enabled
     during system suspend and resume to start resuming them immediately
     after resuming their parents and to start suspending such a device
     immediately after suspending its first child (Rafael Wysocki)

   - Adjust messages printed during tasks freezing to avoid using
     pr_cont() (Andrew Sayers, Paul Menzel)

   - Clean up unnecessary usage of !! in pm_print_times_init() (Zihuan
     Zhang)

   - Add missing wakeup source attribute relax_count to sysfs and remove
     the space character at the end of the string produced by
     pm_show_wakelocks() (Zijun Hu)

   - Add configurable pm_test delay for hibernation (Zihuan Zhang)

   - Disable asynchronous suspend in ucsi_ccg_probe() to prevent the
     cypd4226 device on Tegra boards from suspending prematurely (Jon
     Hunter)

   - Unbreak printing PM debug messages during hibernation and clean up
     some related code (Rafael Wysocki)

   - Add a systemd service to run cpupower and change cpupower binding's
     Makefile to use -lcpupower (John B. Wyatt IV, Francesco Poli)"

* tag 'pm-6.16-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/rafael/linux-pm: (72 commits)
  cpufreq: CPPC: Add support for autonomous selection
  cpufreq: Update sscanf() to kstrtouint()
  cpufreq: Replace magic number
  OPP: switch to use kmemdup_array()
  PM: freezer: Rewrite restarting tasks log to remove stray *done.*
  PM: runtime: fix denying of auto suspend in pm_suspend_timer_fn()
  cpufreq: drop redundant cpus_read_lock() from store_local_boost()
  cpupower: do not install files to /etc/default/
  cpupower: do not call systemctl at install time
  cpupower: do not write DESTDIR to cpupower.service
  PM: sleep: Introduce pm_sleep_transition_in_progress()
  cpufreq/amd-pstate: Avoid shadowing ret in amd_pstate_ut_check_driver()
  cpufreq: intel_pstate: Document hybrid processor support
  cpufreq: intel_pstate: EAS: Increase cost for CPUs using L3 cache
  cpufreq: intel_pstate: EAS support for hybrid platforms
  PM: EM: Introduce em_adjust_cpu_capacity()
  PM: EM: Move CPU capacity check to em_adjust_new_capacity()
  PM: EM: Documentation: Fix typos in example driver code
  cpufreq: Drop policy locking from cpufreq_policy_is_good_for_eas()
  PM: sleep: Introduce pm_suspend_in_progress()
  ...
2 parents 3702a51 + 3e0c509 commit c89756b
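
Reviewer note: the commit message refers to "policy locking guards" and "__free() usage" without showing what a guard looks like. The following is a minimal user-space sketch of the underlying idea, built on the GCC/Clang `cleanup` variable attribute. It is not the cpufreq code: `struct demo_policy`, `DEMO_GUARD()`, and every `demo_*` function are hypothetical names invented for illustration, and the "lock" is a plain counter.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical stand-in for an object protected by a lock. */
struct demo_policy {
	int lock_depth;		/* greater than zero while "locked" */
	unsigned int max_khz;
};

static void demo_policy_lock(struct demo_policy *p)   { p->lock_depth++; }
static void demo_policy_unlock(struct demo_policy *p) { p->lock_depth--; }

/* Cleanup handler: called with a pointer to the guard variable itself. */
static void demo_guard_release(struct demo_policy **pp)
{
	if (*pp)
		demo_policy_unlock(*pp);
}

/* Take the lock now and drop it when `name` goes out of scope. */
#define DEMO_GUARD(name, policy) \
	struct demo_policy *name __attribute__((cleanup(demo_guard_release))) = \
		(demo_policy_lock(policy), (policy))

/* Returns false on invalid input; every return path releases the lock. */
static bool demo_set_max(struct demo_policy *policy, unsigned int khz)
{
	DEMO_GUARD(guard, policy);

	assert(guard->lock_depth == 1);
	if (khz == 0)
		return false;	/* early return, no explicit unlock needed */

	guard->max_khz = khz;
	return true;
}
```

The point of the pattern is that error paths no longer need `goto unlock` labels, which is presumably why the refactoring and the guards land together in this pull.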

52 files changed

Lines changed: 1675 additions & 962 deletions


Documentation/ABI/testing/sysfs-devices-system-cpu

Lines changed: 60 additions & 1 deletion
@@ -111,6 +111,7 @@ What: /sys/devices/system/cpu/cpuidle/available_governors
 		/sys/devices/system/cpu/cpuidle/current_driver
 		/sys/devices/system/cpu/cpuidle/current_governor
 		/sys/devices/system/cpu/cpuidle/current_governer_ro
+		/sys/devices/system/cpu/cpuidle/intel_c1_demotion
 Date:		September 2007
 Contact:	Linux kernel mailing list <linux-kernel@vger.kernel.org>
 Description:	Discover cpuidle policy and mechanism
@@ -132,7 +133,11 @@ Description: Discover cpuidle policy and mechanism

 		current_governor_ro: (RO) displays current idle policy.

-		See Documentation/admin-guide/pm/cpuidle.rst and
+		intel_c1_demotion: (RW) enables/disables the C1 demotion
+		feature on Intel CPUs.
+
+		See Documentation/admin-guide/pm/cpuidle.rst,
+		Documentation/admin-guide/pm/intel_idle.rst, and
 		Documentation/driver-api/pm/cpuidle.rst for more information.


@@ -268,6 +273,60 @@ Description: Discover CPUs in the same CPU frequency coordination domain
 		This file is only present if the acpi-cpufreq or the cppc-cpufreq
 		drivers are in use.

+What:		/sys/devices/system/cpu/cpuX/cpufreq/auto_select
+Date:		May 2025
+Contact:	linux-pm@vger.kernel.org
+Description:	Autonomous selection enable
+
+		Read/write interface to control autonomous selection enable
+			Read returns autonomous selection status:
+				0: autonomous selection is disabled
+				1: autonomous selection is enabled
+
+			Write 'y' or '1' or 'on' to enable autonomous selection.
+			Write 'n' or '0' or 'off' to disable autonomous selection.
+
+		This file is only present if the cppc-cpufreq driver is in use.
+
+What:		/sys/devices/system/cpu/cpuX/cpufreq/auto_act_window
+Date:		May 2025
+Contact:	linux-pm@vger.kernel.org
+Description:	Autonomous activity window
+
+		This file indicates a moving utilization sensitivity window to
+		the platform's autonomous selection policy.
+
+		Read/write an integer represents autonomous activity window (in
+		microseconds) from/to this file. The max value to write is
+		1270000000 but the max significand is 127. This means that if 128
+		is written to this file, 127 will be stored. If the value is
+		greater than 130, only the first two digits will be saved as
+		significand.
+
+		Writing a zero value to this file enable the platform to
+		determine an appropriate Activity Window depending on the workload.
+
+		Writing to this file only has meaning when Autonomous Selection is
+		enabled.
+
+		This file is only present if the cppc-cpufreq driver is in use.
+
+What:		/sys/devices/system/cpu/cpuX/cpufreq/energy_performance_preference_val
+Date:		May 2025
+Contact:	linux-pm@vger.kernel.org
+Description:	Energy performance preference
+
+		Read/write an 8-bit integer from/to this file. This file
+		represents a range of values from 0 (performance preference) to
+		0xFF (energy efficiency preference) that influences the rate of
+		performance increase/decrease and the result of the hardware's
+		energy efficiency and performance optimization policies.
+
+		Writing to this file only has meaning when Autonomous Selection is
+		enabled.
+
+		This file is only present if the cppc-cpufreq driver is in use.
+

 What:		/sys/devices/system/cpu/cpu*/cache/index3/cache_disable_{0,1}
 Date:		August 2008
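
Reviewer note: the `auto_act_window` description packs a microsecond value into a significand of at most 127 plus a power-of-ten exponent, which is easy to misread. The sketch below is one reading of that text, not the driver's code. The ABI entry only spells out the 128 case and the general "greater than 130" case, so clamping 129 to 127, dropping a digit from 130 upward, and keeping three-digit significands that fit (for example 1270) are assumptions, as are all the `act_window_*` names.

```c
#include <assert.h>
#include <stdint.h>

#define ACT_WINDOW_MAX_SIG	127u		/* max significand, from the ABI text */
#define ACT_WINDOW_MAX_USEC	1270000000ull	/* max writable value, from the ABI text */

struct act_window {
	unsigned int sig;	/* significand, 0..127 */
	unsigned int exp;	/* power-of-ten exponent */
};

/* Encode microseconds; returns 0 on success, -1 if above the documented maximum. */
static int act_window_encode(uint64_t usec, struct act_window *out)
{
	unsigned int exp = 0;

	assert(out);
	if (usec > ACT_WINDOW_MAX_USEC)
		return -1;

	/* Assumed threshold: from 130 upward, drop the last digit and bump the exponent. */
	while (usec >= 130) {
		usec /= 10;
		exp++;
	}

	/* 128 and 129 do not fit in the significand, so clamp them to 127. */
	if (usec > ACT_WINDOW_MAX_SIG)
		usec = ACT_WINDOW_MAX_SIG;

	out->sig = (unsigned int)usec;
	out->exp = exp;
	return 0;
}

/* Reconstruct the value that would effectively be stored. */
static uint64_t act_window_decode(const struct act_window *w)
{
	uint64_t usec = w->sig;

	for (unsigned int i = 0; i < w->exp; i++)
		usec *= 10;
	return usec;
}
```

Under these assumptions, writing 131 would store 13 with exponent 1 and read back as 130, while the documented maximum maps to 127 with exponent 7.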

Documentation/admin-guide/kernel-parameters.txt

Lines changed: 7 additions & 0 deletions
@@ -1828,6 +1828,13 @@
 			lz4: Select LZ4 compression algorithm to
 			compress/decompress hibernation image.

+	hibernate.pm_test_delay=
+			[HIBERNATION]
+			Sets the number of seconds to remain in a hibernation test
+			mode before resuming the system (see
+			/sys/power/pm_test). Only available when CONFIG_PM_DEBUG
+			is set. Default value is 5.
+
 	highmem=nn[KMG]	[KNL,BOOT,EARLY] forces the highmem zone to have an exact
 			size of <nn>. This works even on boxes that have no
 			highmem otherwise. This also works to reduce highmem

Documentation/admin-guide/pm/intel_idle.rst

Lines changed: 21 additions & 0 deletions
@@ -38,6 +38,27 @@ instruction at all.
 only way to pass early-configuration-time parameters to it is via the kernel
 command line.

+Sysfs Interface
+===============
+
+The ``intel_idle`` driver exposes the following ``sysfs`` attributes in
+``/sys/devices/system/cpu/cpuidle/``:
+
+``intel_c1_demotion``
+	Enable or disable C1 demotion for all CPUs in the system. This file is
+	only exposed on platforms that support the C1 demotion feature and where
+	it was tested. Value 0 means that C1 demotion is disabled, value 1 means
+	that it is enabled. Write 0 or 1 to disable or enable C1 demotion for
+	all CPUs.
+
+	The C1 demotion feature involves the platform firmware demoting deep
+	C-state requests from the OS (e.g., C6 requests) to C1. The idea is that
+	firmware monitors CPU wake-up rate, and if it is higher than a
+	platform-specific threshold, the firmware demotes deep C-state requests
+	to C1. For example, Linux requests C6, but firmware noticed too many
+	wake-ups per second, and it keeps the CPU in C1. When the CPU stays in
+	C1 long enough, the platform promotes it back to C6. This may improve
+	some workloads' performance, but it may also increase power consumption.

 .. _intel-idle-enumeration-of-states:
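
Reviewer note: the new `intel_c1_demotion` attribute is documented as a 0/1 knob that exists only on supported platforms. A minimal user-space sketch of reading it could look like the following. The path comes from the hunk above; the function names and the choice to treat a missing file as "unsupported" (return -1) are assumptions made for illustration.

```c
#include <assert.h>
#include <stdio.h>

/* Path taken from the documentation hunk above. */
#define C1_DEMOTION_PATH "/sys/devices/system/cpu/cpuidle/intel_c1_demotion"

/* Parse the attribute contents: 1 = enabled, 0 = disabled, -1 = unexpected. */
static int parse_c1_demotion(const char *text)
{
	int value;

	assert(text);
	if (sscanf(text, "%d", &value) != 1)
		return -1;
	return (value == 0 || value == 1) ? value : -1;
}

/* Read the knob at `path`; -1 if the file is absent or unreadable. */
static int read_c1_demotion(const char *path)
{
	char buf[16];
	FILE *f = fopen(path, "r");
	int ret = -1;

	if (!f)
		return -1;	/* e.g. platform without C1 demotion support */
	if (fgets(buf, sizeof(buf), f))
		ret = parse_c1_demotion(buf);
	fclose(f);
	return ret;
}
```

A caller would use `read_c1_demotion(C1_DEMOTION_PATH)`. Writing the knob requires root and is left out of the sketch.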

Documentation/admin-guide/pm/intel_pstate.rst

Lines changed: 102 additions & 2 deletions
@@ -329,6 +329,106 @@ information listed above is the same for all of the processors supporting the
 HWP feature, which is why ``intel_pstate`` works with all of them.]


+Support for Hybrid Processors
+=============================
+
+Some processors supported by ``intel_pstate`` contain two or more types of CPU
+cores differing by the maximum turbo P-state, performance vs power characteristics,
+cache sizes, and possibly other properties. They are commonly referred to as
+hybrid processors. To support them, ``intel_pstate`` requires HWP to be enabled
+and it assumes the HWP performance units to be the same for all CPUs in the
+system, so a given HWP performance level always represents approximately the
+same physical performance regardless of the core (CPU) type.
+
+Hybrid Processors with SMT
+--------------------------
+
+On systems where SMT (Simultaneous Multithreading), also referred to as
+HyperThreading (HT) in the context of Intel processors, is enabled on at least
+one core, ``intel_pstate`` assigns performance-based priorities to CPUs. Namely,
+the priority of a given CPU reflects its highest HWP performance level which
+causes the CPU scheduler to generally prefer more performant CPUs, so the less
+performant CPUs are used when the other ones are fully loaded. However, SMT
+siblings (that is, logical CPUs sharing one physical core) are treated in a
+special way such that if one of them is in use, the effective priority of the
+other ones is lowered below the priorities of the CPUs located in the other
+physical cores.
+
+This approach maximizes performance in the majority of cases, but unfortunately
+it also leads to excessive energy usage in some important scenarios, like video
+playback, which is not generally desirable. While there is no other viable
+choice with SMT enabled because the effective capacity and utilization of SMT
+siblings are hard to determine, hybrid processors without SMT can be handled in
+more energy-efficient ways.
+
+.. _CAS:
+
+Capacity-Aware Scheduling Support
+---------------------------------
+
+The capacity-aware scheduling (CAS) support in the CPU scheduler is enabled by
+``intel_pstate`` by default on hybrid processors without SMT. CAS generally
+causes the scheduler to put tasks on a CPU so long as there is a sufficient
+amount of spare capacity on it, and if the utilization of a given task is too
+high for it, the task will need to go somewhere else.
+
+Since CAS takes CPU capacities into account, it does not require CPU
+prioritization and it allows tasks to be distributed more symmetrically among
+the more performant and less performant CPUs. Once placed on a CPU with enough
+capacity to accommodate it, a task may just continue to run there regardless of
+whether or not the other CPUs are fully loaded, so on average CAS reduces the
+utilization of the more performant CPUs which causes the energy usage to be more
+balanced because the more performant CPUs are generally less energy-efficient
+than the less performant ones.
+
+In order to use CAS, the scheduler needs to know the capacity of each CPU in
+the system and it needs to be able to compute scale-invariant utilization of
+CPUs, so ``intel_pstate`` provides it with the requisite information.
+
+First of all, the capacity of each CPU is represented by the ratio of its highest
+HWP performance level, multiplied by 1024, to the highest HWP performance level
+of the most performant CPU in the system, which works because the HWP performance
+units are the same for all CPUs. Second, the frequency-invariance computations,
+carried out by the scheduler to always express CPU utilization in the same units
+regardless of the frequency it is currently running at, are adjusted to take the
+CPU capacity into account. All of this happens when ``intel_pstate`` has
+registered itself with the ``CPUFreq`` core and it has figured out that it is
+running on a hybrid processor without SMT.
+
+Energy-Aware Scheduling Support
+-------------------------------
+
+If ``CONFIG_ENERGY_MODEL`` has been set during kernel configuration and
+``intel_pstate`` runs on a hybrid processor without SMT, in addition to enabling
+`CAS <CAS_>`_ it registers an Energy Model for the processor. This allows the
+Energy-Aware Scheduling (EAS) support to be enabled in the CPU scheduler if
+``schedutil`` is used as the ``CPUFreq`` governor which requires ``intel_pstate``
+to operate in the `passive mode <Passive Mode_>`_.
+
+The Energy Model registered by ``intel_pstate`` is artificial (that is, it is
+based on abstract cost values and it does not include any real power numbers)
+and it is relatively simple to avoid unnecessary computations in the scheduler.
+There is a performance domain in it for every CPU in the system and the cost
+values for these performance domains have been chosen so that running a task on
+a less performant (small) CPU appears to be always cheaper than running that
+task on a more performant (big) CPU. However, for two CPUs of the same type,
+the cost difference depends on their current utilization, and the CPU whose
+current utilization is higher generally appears to be a more expensive
+destination for a given task. This helps to balance the load among CPUs of the
+same type.
+
+Since EAS works on top of CAS, high-utilization tasks are always migrated to
+CPUs with enough capacity to accommodate them, but thanks to EAS, low-utilization
+tasks tend to be placed on the CPUs that look less expensive to the scheduler.
+Effectively, this causes the less performant and less loaded CPUs to be
+preferred as long as they have enough spare capacity to run the given task
+which generally leads to reduced energy usage.
+
+The Energy Model created by ``intel_pstate`` can be inspected by looking at
+the ``energy_model`` directory in ``debugfs`` (typlically mounted on
+``/sys/kernel/debug/``).
+
+
 User Space Interface in ``sysfs``
 =================================

@@ -697,8 +797,8 @@ of them have to be prepended with the ``intel_pstate=`` prefix.
 	Limits`_ for details).

 ``no_cas``
-	Do not enable capacity-aware scheduling (CAS) which is enabled by
-	default on hybrid systems.
+	Do not enable `capacity-aware scheduling <CAS_>`_ which is enabled by
+	default on hybrid systems without SMT.

 Diagnostics and Tuning
 ======================
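
Reviewer note: the CAS section added to intel_pstate.rst defines CPU capacity as the CPU's highest HWP performance level, multiplied by 1024, divided by the highest HWP performance level of the most performant CPU. A small worked sketch of that arithmetic follows. It is not the driver's code: `hybrid_capacities()` is a hypothetical helper, the performance levels in the usage note are made-up numbers, and rounding down via integer division is an assumption.

```c
#include <assert.h>
#include <stddef.h>

#define CAPACITY_SCALE 1024ul	/* the factor named in the documentation */

/* Fill cap[i] = perf[i] * 1024 / max(perf); returns -1 if no CPU has a nonzero level. */
static int hybrid_capacities(const unsigned int *perf, unsigned long *cap, size_t n)
{
	unsigned int max_perf = 0;

	assert(perf && cap);

	/* Find the highest HWP performance level across all CPUs. */
	for (size_t i = 0; i < n; i++)
		if (perf[i] > max_perf)
			max_perf = perf[i];

	if (max_perf == 0)
		return -1;

	/* Scale every CPU relative to the most performant one (rounds down). */
	for (size_t i = 0; i < n; i++)
		cap[i] = perf[i] * CAPACITY_SCALE / max_perf;

	return 0;
}
```

With illustrative levels of 64, 40, and 32, the capacities come out as 1024, 640, and 512, so the most performant CPU always gets the full scale value.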

Documentation/power/energy-model.rst

Lines changed: 4 additions & 4 deletions
@@ -230,7 +230,7 @@ Drivers must provide a pointer to the allocated and initialized new EM
 and will be visible to other sub-systems in the kernel (thermal, powercap).
 The main design goal for this API is to be fast and avoid extra calculations
 or memory allocations at runtime. When pre-computed EMs are available in the
-device driver, than it should be possible to simply re-use them with low
+device driver, then it should be possible to simply reuse them with low
 performance overhead.

 In order to free the EM, provided earlier by the driver (e.g. when the module
@@ -381,17 +381,17 @@ up periodically to check the temperature and modify the EM data::
   26		rcu_read_unlock();
   27
   28		/* Calculate 'cost' values for EAS */
-  29		ret = em_dev_compute_costs(dev, table, pd->nr_perf_states);
+  29		ret = em_dev_compute_costs(dev, new_table, pd->nr_perf_states);
   30		if (ret) {
   31			dev_warn(dev, "EM: compute costs failed %d\n", ret);
-  32			em_free_table(em_table);
+  32			em_table_free(em_table);
   33			return;
   34		}
   35
   36		ret = em_dev_update_perf_domain(dev, em_table);
   37		if (ret) {
   38			dev_warn(dev, "EM: update failed %d\n", ret);
-  39			em_free_table(em_table);
+  39			em_table_free(em_table);
   40			return;
   41		}
   42

arch/x86/pci/fixup.c

Lines changed: 2 additions & 2 deletions
@@ -970,13 +970,13 @@ static void amd_rp_pme_suspend(struct pci_dev *dev)
 	struct pci_dev *rp;

 	/*
-	 * PM_SUSPEND_ON means we're doing runtime suspend, which means
+	 * If system suspend is not in progress, we're doing runtime suspend, so
 	 * amd-pmc will not be involved so PMEs during D3 work as advertised.
 	 *
 	 * The PMEs *do* work if amd-pmc doesn't put the SoC in the hardware
 	 * sleep state, but we assume amd-pmc is always present.
 	 */
-	if (pm_suspend_target_state == PM_SUSPEND_ON)
+	if (!pm_suspend_in_progress())
 		return;

 	rp = pcie_find_root_port(dev);
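
Reviewer note: the fixup.c hunk swaps an open-coded comparison for `pm_suspend_in_progress()`, whose definition is not part of the visible diff. Since `!pm_suspend_in_progress()` replaces `pm_suspend_target_state == PM_SUSPEND_ON` one for one, the helper is presumably the negation of the old test. The user-space mock below sketches that assumed equivalence. All `demo_*` names are hypothetical, and the numeric state values are arbitrary.

```c
#include <assert.h>
#include <stdbool.h>

/* Mock suspend states; only the "on" state appears in the hunk above. */
enum demo_suspend_state {
	DEMO_SUSPEND_ON = 0,	/* no system-wide suspend under way */
	DEMO_SUSPEND_MEM = 3,	/* arbitrary value for "suspending" */
};

static enum demo_suspend_state demo_target_state = DEMO_SUSPEND_ON;

/* Assumed shape of the helper: the negation of the old open-coded test. */
static bool demo_suspend_in_progress(void)
{
	return demo_target_state != DEMO_SUSPEND_ON;
}

/* Mirrors the early return in the quirk: true if its body should run. */
static bool demo_quirk_should_run(void)
{
	if (!demo_suspend_in_progress())
		return false;	/* runtime suspend: nothing to do */
	return true;
}
```

Hiding the state variable behind a predicate keeps callers from depending on the enum's values, which is a plausible motivation for the change, though the diff itself does not state one.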

drivers/base/arch_topology.c

Lines changed: 0 additions & 52 deletions
@@ -154,14 +154,6 @@ void topology_set_freq_scale(const struct cpumask *cpus, unsigned long cur_freq,
 		per_cpu(arch_freq_scale, i) = scale;
 }

-DEFINE_PER_CPU(unsigned long, cpu_scale) = SCHED_CAPACITY_SCALE;
-EXPORT_PER_CPU_SYMBOL_GPL(cpu_scale);
-
-void topology_set_cpu_scale(unsigned int cpu, unsigned long capacity)
-{
-	per_cpu(cpu_scale, cpu) = capacity;
-}
-
 DEFINE_PER_CPU(unsigned long, hw_pressure);

 /**
@@ -207,53 +199,9 @@ void topology_update_hw_pressure(const struct cpumask *cpus,
 }
 EXPORT_SYMBOL_GPL(topology_update_hw_pressure);

-static ssize_t cpu_capacity_show(struct device *dev,
-				 struct device_attribute *attr,
-				 char *buf)
-{
-	struct cpu *cpu = container_of(dev, struct cpu, dev);
-
-	return sysfs_emit(buf, "%lu\n", topology_get_cpu_scale(cpu->dev.id));
-}
-
 static void update_topology_flags_workfn(struct work_struct *work);
 static DECLARE_WORK(update_topology_flags_work, update_topology_flags_workfn);

-static DEVICE_ATTR_RO(cpu_capacity);
-
-static int cpu_capacity_sysctl_add(unsigned int cpu)
-{
-	struct device *cpu_dev = get_cpu_device(cpu);
-
-	if (!cpu_dev)
-		return -ENOENT;
-
-	device_create_file(cpu_dev, &dev_attr_cpu_capacity);
-
-	return 0;
-}
-
-static int cpu_capacity_sysctl_remove(unsigned int cpu)
-{
-	struct device *cpu_dev = get_cpu_device(cpu);
-
-	if (!cpu_dev)
-		return -ENOENT;
-
-	device_remove_file(cpu_dev, &dev_attr_cpu_capacity);
-
-	return 0;
-}
-
-static int register_cpu_capacity_sysctl(void)
-{
-	cpuhp_setup_state(CPUHP_AP_ONLINE_DYN, "topology/cpu-capacity",
-			  cpu_capacity_sysctl_add, cpu_capacity_sysctl_remove);
-
-	return 0;
-}
-subsys_initcall(register_cpu_capacity_sysctl);
-
 static int update_topology;

 int topology_update_cpu_topology(void)
