
Commit 8fc63a9

Merge branch 'smp-topo' into next
Merge a branch containing SMP topology updates from Srikar, purely so we can include the cover letter, which has a lot of good detail:

PowerVM systems configured in shared processors mode have some unique challenges. Some device-tree properties will be missing on a shared processor. Hence some sched domains may not make sense for shared processor systems.

Most shared processor systems are over-provisioned. Underlying PowerVM Hypervisor would schedule at a Big Core (SMT8) granularity. The most recent power processors support two almost independent cores. In a lightly loaded condition, it helps the overall system performance if we pack onto a smaller number of Big Cores.

Since each thread-group is independent, running threads on both the thread-groups of a SMT8 core should have a minimal adverse impact in non-over-provisioned scenarios. The changes in this patchset have no effect in the over-provisioned scenario. If there are more threads than SMT domains, then asym_packing will not kick in.

System Configuration
type=Shared mode=Uncapped smt=8 lcpu=96 mem=1066409344 kB cpus=96 ent=64.00
So *64 Entitled cores/ 96 Virtual processor* Scenario

lscpu
Architecture:          ppc64le
Byte Order:            Little Endian
CPU(s):                768
On-line CPU(s) list:   0-767
Model name:            POWER10 (architected), altivec supported
Model:                 2.0 (pvr 0080 0200)
Thread(s) per core:    8
Core(s) per socket:    16
Socket(s):             6
Hypervisor vendor:     pHyp
Virtualization type:   para
L1d cache:             6 MiB (192 instances)
L1i cache:             9 MiB (192 instances)
NUMA node(s):          6
NUMA node0 CPU(s):     0-7,32-39,80-87,128-135,176-183,224-231,272-279,320-327,368-375,416-423,464-471,512-519,560-567,608-615,656-663,704-711,752-759
NUMA node1 CPU(s):     8-15,40-47,88-95,136-143,184-191,232-239,280-287,328-335,376-383,424-431,472-479,520-527,568-575,616-623,664-671,712-719,760-767
NUMA node4 CPU(s):     64-71,112-119,160-167,208-215,256-263,304-311,352-359,400-407,448-455,496-503,544-551,592-599,640-647,688-695,736-743
NUMA node5 CPU(s):     16-23,48-55,96-103,144-151,192-199,240-247,288-295,336-343,384-391,432-439,480-487,528-535,576-583,624-631,672-679,720-727
NUMA node6 CPU(s):     72-79,120-127,168-175,216-223,264-271,312-319,360-367,408-415,456-463,504-511,552-559,600-607,648-655,696-703,744-751
NUMA node7 CPU(s):     24-31,56-63,104-111,152-159,200-207,248-255,296-303,344-351,392-399,440-447,488-495,536-543,584-591,632-639,680-687,728-735

ebizzy -t 32 -S 200 (5 iterations) Records per second. (Higher is better)
Kernel     N  Min      Max      Median   Avg        Stddev     %Change
6.6.0-rc3  5  3840178  4059268  3978042  3973936.6  84264.456
+patch     5  3768393  3927901  3874994  3854046    71532.926  -3.01692

From lparstat (when the workload stabilized)
Kernel     %user  %sys  %wait  %idle  physc  %entc  lbusy  app    vcsw       phint
6.6.0-rc3  4.16   0.00  0.00   95.84  26.06  40.72  4.16   69.88  276906989  578
+patch     4.16   0.00  0.00   95.83  17.70  27.66  4.17   78.26  70436663   119

ebizzy -t 128 -S 200 (5 iterations) Records per second. (Higher is better)
Kernel     N  Min      Max      Median   Avg        Stddev     %Change
6.6.0-rc3  5  5520692  5981856  5717709  5727053.2  176093.2
+patch     5  5305888  6259610  5854590  5843311    375917.03  2.02998

From lparstat (when the workload stabilized)
Kernel     %user  %sys  %wait  %idle  physc  %entc  lbusy  app    vcsw       phint
6.6.0-rc3  16.66  0.00  0.00   83.33  45.49  71.08  16.67  50.50  288778533  581
+patch     16.65  0.00  0.00   83.35  30.15  47.11  16.65  65.76  85196150   133

ebizzy -t 512 -S 200 (5 iterations) Records per second. (Higher is better)
Kernel     N  Min       Max       Median    Avg       Stddev     %Change
6.6.0-rc3  5  19563921  20049955  19701510  19728733  198295.18
+patch     5  19455992  20176445  19718427  19832017  304094.05  0.523521

From lparstat (when the workload stabilized)
Kernel     %user  %sys  %wait  %idle  physc  %entc   lbusy  app   vcsw       phint
6.6.0-rc3  66.44  0.01  0.00   33.55  94.14  147.09  66.45  1.33  313345175  621
+patch     66.44  0.01  0.00   33.55  94.15  147.11  66.45  1.33  109193889  309

System Configuration
type=Shared mode=Uncapped smt=8 lcpu=40 mem=1067539392 kB cpus=96 ent=40.00
So *40 Entitled cores/ 40 Virtual processor* Scenario

lscpu
Architecture:          ppc64le
Byte Order:            Little Endian
CPU(s):                320
On-line CPU(s) list:   0-319
Model name:            POWER10 (architected), altivec supported
Model:                 2.0 (pvr 0080 0200)
Thread(s) per core:    8
Core(s) per socket:    10
Socket(s):             4
Hypervisor vendor:     pHyp
Virtualization type:   para
L1d cache:             2.5 MiB (80 instances)
L1i cache:             3.8 MiB (80 instances)
NUMA node(s):          4
NUMA node0 CPU(s):     0-7,32-39,64-71,96-103,128-135,160-167,192-199,224-231,256-263,288-295
NUMA node1 CPU(s):     8-15,40-47,72-79,104-111,136-143,168-175,200-207,232-239,264-271,296-303
NUMA node4 CPU(s):     16-23,48-55,80-87,112-119,144-151,176-183,208-215,240-247,272-279,304-311
NUMA node5 CPU(s):     24-31,56-63,88-95,120-127,152-159,184-191,216-223,248-255,280-287,312-319

ebizzy -t 32 -S 200 (5 iterations) Records per second. (Higher is better)
Kernel     N  Min      Max      Median   Avg        Stddev     %Change
6.6.0-rc3  5  3535518  3864532  3745967  3704233.2  130216.76
+patch     5  3608385  3708026  3649379  3651596.6  37862.163  -1.42099

Kernel     %user  %sys  %wait  %idle  physc  %entc  lbusy  app    vcsw     phint
6.6.0-rc3  10.00  0.01  0.00   89.99  22.98  57.45  10.01  41.01  1135139  262
+patch     10.00  0.00  0.00   90.00  16.95  42.37  10.00  47.05  925561   19

ebizzy -t 64 -S 200 (5 iterations) Records per second. (Higher is better)
Kernel     N  Min      Max      Median   Avg        Stddev     %Change
6.6.0-rc3  5  4434984  4957281  4548786  4591298.2  211770.2
+patch     5  4461115  4835167  4544716  4607795.8  151474.85  0.359323

Kernel     %user  %sys  %wait  %idle  physc  %entc  lbusy  app    vcsw     phint
6.6.0-rc3  20.01  0.00  0.00   79.99  38.22  95.55  20.01  25.77  1287553  265
+patch     19.99  0.00  0.00   80.01  25.55  63.88  19.99  38.44  1077341  20

ebizzy -t 256 -S 200 (5 iterations) Records per second. (Higher is better)
Kernel     N  Min      Max      Median   Avg        Stddev     %Change
6.6.0-rc3  5  8850648  8982659  8951911  8936869.2  52278.031
+patch     5  8751038  9060510  8981409  8942268.4  117070.6   0.0604149

Kernel     %user  %sys  %wait  %idle  physc  %entc   lbusy  app    vcsw     phint
6.6.0-rc3  80.02  0.01  0.01   19.96  40.00  100.00  80.03  24.00  1597665  276
+patch     80.02  0.01  0.01   19.96  40.00  100.00  80.03  23.99  1383921  63

Observation: ebizzy throughput improves even with lower core utilization (almost half the core utilization) in low utilization scenarios, while throughput is retained in mid and higher utilization scenarios.

Note: The numbers are with the Uncapped + no-noise case. In the Capped and/or noise case, due to contention on the cores, the numbers are expected to improve further.

Note: The numbers include the patch (sched/fair: Enable group_asym_packing in find_idlest_group) https://lore.kernel.org/all/20231018155036.2314342-1-srikar@linux.vnet.ibm.com/
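The derived columns in the tables above can be re-checked with a few lines of C. This is only a sketch: the helper names pct_change and entc_percent are hypothetical, and the %entc relation (physc divided by the entitled core count) is inferred from the reported numbers rather than taken from lparstat documentation.

```c
#include <assert.h>

/* Relative change of `patched` versus `baseline`, in percent.
 * Assumed to be how the %Change column was derived from the Avg column. */
double pct_change(double baseline, double patched)
{
	assert(baseline != 0.0);
	return (patched - baseline) / baseline * 100.0;
}

/* Entitlement consumed, in percent: physical cores used (physc) over
 * entitled cores (ent). This relation is inferred from the tables. */
double entc_percent(double physc, double entitled_cores)
{
	assert(entitled_cores > 0.0);
	return physc / entitled_cores * 100.0;
}
```

For the 64-entitled-core, ebizzy -t 32 run, pct_change(3973936.6, 3854046) gives about -3.017 and entc_percent(26.06, 64.0) gives about 40.72, in line with the reported -3.01692 and 40.72.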
2 parents 6f4b705 + c469757 commit 8fc63a9

1 file changed

Lines changed: 70 additions & 54 deletions


arch/powerpc/kernel/smp.c

@@ -77,10 +77,10 @@ static DEFINE_PER_CPU(int, cpu_state) = { 0 };
 #endif
 
 struct task_struct *secondary_current;
-bool has_big_cores;
-bool coregroup_enabled;
-bool thread_group_shares_l2;
-bool thread_group_shares_l3;
+bool has_big_cores __ro_after_init;
+bool coregroup_enabled __ro_after_init;
+bool thread_group_shares_l2 __ro_after_init;
+bool thread_group_shares_l3 __ro_after_init;
 
 DEFINE_PER_CPU(cpumask_var_t, cpu_sibling_map);
 DEFINE_PER_CPU(cpumask_var_t, cpu_smallcore_map);
@@ -93,15 +93,6 @@ EXPORT_PER_CPU_SYMBOL(cpu_l2_cache_map);
 EXPORT_PER_CPU_SYMBOL(cpu_core_map);
 EXPORT_SYMBOL_GPL(has_big_cores);
 
-enum {
-#ifdef CONFIG_SCHED_SMT
-	smt_idx,
-#endif
-	cache_idx,
-	mc_idx,
-	die_idx,
-};
-
 #define MAX_THREAD_LIST_SIZE	8
 #define THREAD_GROUP_SHARE_L1   1
 #define THREAD_GROUP_SHARE_L2_L3 2
@@ -987,7 +978,7 @@ static int __init init_thread_group_cache_map(int cpu, int cache_property)
 	return 0;
 }
 
-static bool shared_caches;
+static bool shared_caches __ro_after_init;
 
 #ifdef CONFIG_SCHED_SMT
 /* cpumask of CPUs with asymmetric SMT dependency */
@@ -1003,6 +994,13 @@ static int powerpc_smt_flags(void)
 }
 #endif
 
+/*
+ * On shared processor LPARs scheduled on a big core (which has two or more
+ * independent thread groups per core), prefer lower numbered CPUs, so
+ * that workload consolidates to lesser number of cores.
+ */
+static __ro_after_init DEFINE_STATIC_KEY_FALSE(splpar_asym_pack);
+
 /*
  * P9 has a slightly odd architecture where pairs of cores share an L2 cache.
  * This topology makes it *much* cheaper to migrate tasks between adjacent cores
@@ -1011,9 +1009,20 @@ static int powerpc_smt_flags(void)
  */
 static int powerpc_shared_cache_flags(void)
 {
+	if (static_branch_unlikely(&splpar_asym_pack))
+		return SD_SHARE_PKG_RESOURCES | SD_ASYM_PACKING;
+
 	return SD_SHARE_PKG_RESOURCES;
 }
 
+static int powerpc_shared_proc_flags(void)
+{
+	if (static_branch_unlikely(&splpar_asym_pack))
+		return SD_ASYM_PACKING;
+
+	return 0;
+}
+
 /*
  * We can't just pass cpu_l2_cache_mask() directly because
  * returns a non-const pointer and the compiler barfs on that.
@@ -1037,6 +1046,10 @@ static struct cpumask *cpu_coregroup_mask(int cpu)
 
 static bool has_coregroup_support(void)
 {
+	/* Coregroup identification not available on shared systems */
+	if (is_shared_processor())
+		return 0;
+
 	return coregroup_enabled;
 }
 
@@ -1045,16 +1058,6 @@ static const struct cpumask *cpu_mc_mask(int cpu)
 	return cpu_coregroup_mask(cpu);
 }
 
-static struct sched_domain_topology_level powerpc_topology[] = {
-#ifdef CONFIG_SCHED_SMT
-	{ cpu_smt_mask, powerpc_smt_flags, SD_INIT_NAME(SMT) },
-#endif
-	{ shared_cache_mask, powerpc_shared_cache_flags, SD_INIT_NAME(CACHE) },
-	{ cpu_mc_mask, SD_INIT_NAME(MC) },
-	{ cpu_cpu_mask, SD_INIT_NAME(PKG) },
-	{ NULL, },
-};
-
 static int __init init_big_cores(void)
 {
 	int cpu;
@@ -1682,43 +1685,45 @@ void start_secondary(void *unused)
 	BUG();
 }
 
-static void __init fixup_topology(void)
+static struct sched_domain_topology_level powerpc_topology[6];
+
+static void __init build_sched_topology(void)
 {
-	int i;
+	int i = 0;
+
+	if (is_shared_processor() && has_big_cores)
+		static_branch_enable(&splpar_asym_pack);
 
 #ifdef CONFIG_SCHED_SMT
 	if (has_big_cores) {
 		pr_info("Big cores detected but using small core scheduling\n");
-		powerpc_topology[smt_idx].mask = smallcore_smt_mask;
+		powerpc_topology[i++] = (struct sched_domain_topology_level){
+			smallcore_smt_mask, powerpc_smt_flags, SD_INIT_NAME(SMT)
+		};
+	} else {
+		powerpc_topology[i++] = (struct sched_domain_topology_level){
+			cpu_smt_mask, powerpc_smt_flags, SD_INIT_NAME(SMT)
+		};
 	}
 #endif
+	if (shared_caches) {
+		powerpc_topology[i++] = (struct sched_domain_topology_level){
+			shared_cache_mask, powerpc_shared_cache_flags, SD_INIT_NAME(CACHE)
+		};
+	}
+	if (has_coregroup_support()) {
+		powerpc_topology[i++] = (struct sched_domain_topology_level){
+			cpu_mc_mask, powerpc_shared_proc_flags, SD_INIT_NAME(MC)
+		};
+	}
+	powerpc_topology[i++] = (struct sched_domain_topology_level){
+		cpu_cpu_mask, powerpc_shared_proc_flags, SD_INIT_NAME(PKG)
+	};
 
-	if (!has_coregroup_support())
-		powerpc_topology[mc_idx].mask = powerpc_topology[cache_idx].mask;
-
-	/*
-	 * Try to consolidate topology levels here instead of
-	 * allowing scheduler to degenerate.
-	 * - Dont consolidate if masks are different.
-	 * - Dont consolidate if sd_flags exists and are different.
-	 */
-	for (i = 1; i <= die_idx; i++) {
-		if (powerpc_topology[i].mask != powerpc_topology[i - 1].mask)
-			continue;
-
-		if (powerpc_topology[i].sd_flags && powerpc_topology[i - 1].sd_flags &&
-				powerpc_topology[i].sd_flags != powerpc_topology[i - 1].sd_flags)
-			continue;
-
-		if (!powerpc_topology[i - 1].sd_flags)
-			powerpc_topology[i - 1].sd_flags = powerpc_topology[i].sd_flags;
+	/* There must be one trailing NULL entry left. */
+	BUG_ON(i >= ARRAY_SIZE(powerpc_topology) - 1);
 
-		powerpc_topology[i].mask = powerpc_topology[i + 1].mask;
-		powerpc_topology[i].sd_flags = powerpc_topology[i + 1].sd_flags;
-#ifdef CONFIG_SCHED_DEBUG
-		powerpc_topology[i].name = powerpc_topology[i + 1].name;
-#endif
-	}
+	set_sched_topology(powerpc_topology);
 }
 
 void __init smp_cpus_done(unsigned int max_cpus)
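To make the effect of build_sched_topology() easier to see, here is a minimal userspace sketch of which levels end up in the table for a given configuration. It is not kernel code: struct topo_inputs and build_levels are hypothetical names, the booleans stand in for CONFIG_SCHED_SMT, shared_caches, coregroup_enabled and is_shared_processor(), and the sketch tracks only level names, not masks or flags.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

#define MAX_LEVELS 6	/* same size as powerpc_topology[] in the hunk above */

/* Hypothetical stand-ins for the kernel's boot-time state. */
struct topo_inputs {
	bool smt_configured;	/* models CONFIG_SCHED_SMT */
	bool shared_caches;
	bool coregroup_enabled;
	bool shared_processor;	/* models is_shared_processor() */
};

/* Fill names[] with the selected level names, NULL-terminated.
 * Returns the number of levels, not counting the terminator. */
size_t build_levels(const struct topo_inputs *in, const char *names[MAX_LEVELS])
{
	size_t i = 0;

	if (in->smt_configured)
		names[i++] = "SMT";
	if (in->shared_caches)
		names[i++] = "CACHE";
	/* has_coregroup_support() returns 0 on shared processor systems */
	if (in->coregroup_enabled && !in->shared_processor)
		names[i++] = "MC";
	names[i++] = "PKG";

	/* Mirrors the BUG_ON: one trailing NULL entry must remain. */
	assert(i < MAX_LEVELS - 1);
	names[i] = NULL;
	return i;
}
```

Under these assumptions, a shared processor LPAR with shared caches gets SMT, CACHE and PKG, while a dedicated LPAR with coregroup support also gets MC between CACHE and PKG. Levels that do not apply are never added, which is what lets the patch drop the old consolidation loop in fixup_topology().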
@@ -1733,9 +1738,20 @@ void __init smp_cpus_done(unsigned int max_cpus)
 		smp_ops->bringup_done();
 
 	dump_numa_cpu_topology();
+	build_sched_topology();
+}
 
-	fixup_topology();
-	set_sched_topology(powerpc_topology);
+/*
+ * For asym packing, by default lower numbered CPU has higher priority.
+ * On shared processors, pack to lower numbered core. However avoid moving
+ * between thread_groups within the same core.
+ */
+int arch_asym_cpu_priority(int cpu)
+{
+	if (static_branch_unlikely(&splpar_asym_pack))
+		return -cpu / threads_per_core;
+
+	return -cpu;
 }
 
 #ifdef CONFIG_HOTPLUG_CPU
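
The priority mapping in arch_asym_cpu_priority() can be tried out in userspace. The sketch below is an assumption-laden model, not the kernel function: asym_cpu_priority is a hypothetical name, and the static key and the kernel's threads_per_core global are replaced by plain parameters.

```c
#include <assert.h>
#include <stdbool.h>

/* Userspace model of the priority rule in the hunk above.
 * `splpar_asym_pack` stands in for the static key of the same name. */
int asym_cpu_priority(int cpu, int threads_per_core, bool splpar_asym_pack)
{
	assert(threads_per_core > 0);

	/* Unary minus applies first and C integer division truncates
	 * toward zero, so every CPU of one core gets the same value. */
	if (splpar_asym_pack)
		return -cpu / threads_per_core;

	return -cpu;
}
```

With threads_per_core set to 8 and packing enabled, CPUs 0-7 all map to priority 0 and CPUs 8-15 to -1. Lower-numbered cores are therefore preferred, while CPUs within one core tie, so there is no priority reason to move a task between thread groups of the same core. With packing disabled, CPU 7 maps to -7, the per-CPU default.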
