Commit 3b66e6b

Merge tag 'cgroup-for-6.16' of git://git.kernel.org/pub/scm/linux/kernel/git/tj/cgroup
Pull cgroup updates from Tejun Heo:

 - cgroup rstat shared the tracking tree across all controllers with the
   rationale being that a cgroup which is using one resource is likely to
   be using other resources at the same time (ie. if something is
   allocating memory, it's probably consuming CPU cycles). However, this
   turned out to not scale very well especially with memcg using rstat
   for internal operations which made memcg stat read and flush patterns
   substantially different from other controllers. JP Kobryn split the
   rstat tree per controller.

 - cgroup BPF support was hooking into cgroup init/exit paths directly.
   Convert them to use a notifier chain instead so that other usages can
   be added easily. The two of the patches which implement this are
   mislabeled as belonging to sched_ext instead of cgroup. Sorry.

 - Relatively minor cpuset updates

 - Documentation updates

* tag 'cgroup-for-6.16' of git://git.kernel.org/pub/scm/linux/kernel/git/tj/cgroup: (23 commits)
  sched_ext: Convert cgroup BPF support to use cgroup_lifetime_notifier
  sched_ext: Introduce cgroup_lifetime_notifier
  cgroup: Minor reorganization of cgroup_create()
  cgroup, docs: cpu controller's interaction with various scheduling policies
  cgroup, docs: convert space indentation to tab indentation
  cgroup: avoid per-cpu allocation of size zero rstat cpu locks
  cgroup, docs: be specific about bandwidth control of rt processes
  cgroup: document the rstat per-cpu initialization
  cgroup: helper for checking rstat participation of css
  cgroup: use subsystem-specific rstat locks to avoid contention
  cgroup: use separate rstat trees for each subsystem
  cgroup: compare css to cgroup::self in helper for distingushing css
  cgroup: warn on rstat usage by early init subsystems
  cgroup/cpuset: drop useless cpumask_empty() in compute_effective_exclusive_cpumask()
  cgroup/rstat: Improve cgroup_rstat_push_children() documentation
  cgroup: fix goto ordering in cgroup_init()
  cgroup: fix pointer check in css_rstat_init()
  cgroup/cpuset: Add warnings to catch inconsistency in exclusive CPUs
  cgroup/cpuset: Fix obsolete comment in cpuset_css_offline()
  cgroup/cpuset: Always use cpu_active_mask
  ...
2 parents 91ad250 + 82648b8 commit 3b66e6b
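The second item above replaces direct calls into cgroup BPF init/exit paths with a notifier chain. The pattern can be modeled in a few lines of userspace Python; the event names and the chain/registration API below are illustrative stand-ins, not the kernel's exact identifiers:

```python
# Toy model of the notifier-chain pattern: instead of cgroup core calling
# a subsystem's inherit/offline hooks directly, interested parties register
# callbacks that are invoked on cgroup lifetime events.
CGRP_CREATED, CGRP_RELEASED = 0, 1  # illustrative event values

class NotifierChain:
    def __init__(self):
        self._callbacks = []

    def register(self, callback):
        # New subscribers can be added without touching cgroup core.
        self._callbacks.append(callback)

    def call_chain(self, action, data):
        # Invoke every registered callback with the event and payload.
        for callback in self._callbacks:
            callback(action, data)

cgroup_lifetime_chain = NotifierChain()
events = []

def bpf_lifetime_notifier(action, cgrp_name):
    # Stands in for the BPF subsystem's inherit/offline handling.
    events.append(("inherit" if action == CGRP_CREATED else "offline",
                   cgrp_name))

cgroup_lifetime_chain.register(bpf_lifetime_notifier)
cgroup_lifetime_chain.call_chain(CGRP_CREATED, "A")
cgroup_lifetime_chain.call_chain(CGRP_RELEASED, "A")
print(events)  # [('inherit', 'A'), ('offline', 'A')]
```

The point of the conversion is visible in `register()`: additional lifetime subscribers attach themselves, rather than cgroup core growing one hardcoded call per subsystem.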

14 files changed

Lines changed: 665 additions & 342 deletions


Documentation/admin-guide/cgroup-v2.rst

Lines changed: 57 additions & 22 deletions
```diff
@@ -1076,7 +1076,7 @@ cpufreq governor about the minimum desired frequency which should always be
 provided by a CPU, as well as the maximum desired frequency, which should not
 be exceeded by a CPU.
 
-WARNING: cgroup2 cpu controller doesn't yet fully support the control of
+WARNING: cgroup2 cpu controller doesn't yet support the (bandwidth) control of
 realtime processes. For a kernel built with the CONFIG_RT_GROUP_SCHED option
 enabled for group scheduling of realtime processes, the cpu controller can only
 be enabled when all RT processes are in the root cgroup. Be aware that system
@@ -1095,19 +1095,34 @@ realtime processes irrespective of CONFIG_RT_GROUP_SCHED.
 CPU Interface Files
 ~~~~~~~~~~~~~~~~~~~
 
-All time durations are in microseconds.
+The interaction of a process with the cpu controller depends on its scheduling
+policy and the underlying scheduler. From the point of view of the cpu controller,
+processes can be categorized as follows:
+
+* Processes under the fair-class scheduler
+* Processes under a BPF scheduler with the ``cgroup_set_weight`` callback
+* Everything else: ``SCHED_{FIFO,RR,DEADLINE}`` and processes under a BPF scheduler
+  without the ``cgroup_set_weight`` callback
+
+For details on when a process is under the fair-class scheduler or a BPF scheduler,
+check out :ref:`Documentation/scheduler/sched-ext.rst <sched-ext>`.
+
+For each of the following interface files, the above categories
+will be referred to. All time durations are in microseconds.
 
 cpu.stat
 	A read-only flat-keyed file.
 	This file exists whether the controller is enabled or not.
 
-	It always reports the following three stats:
+	It always reports the following three stats, which account for all the
+	processes in the cgroup:
 
 	- usage_usec
 	- user_usec
 	- system_usec
 
-	and the following five when the controller is enabled:
+	and the following five when the controller is enabled, which account for
+	only the processes under the fair-class scheduler:
 
 	- nr_periods
 	- nr_throttled
@@ -1125,6 +1140,10 @@ All time durations are in microseconds.
 	If the cgroup has been configured to be SCHED_IDLE (cpu.idle = 1),
 	then the weight will show as a 0.
 
+	This file affects only processes under the fair-class scheduler and a BPF
+	scheduler with the ``cgroup_set_weight`` callback depending on what the
+	callback actually does.
+
 cpu.weight.nice
 	A read-write single value file which exists on non-root
 	cgroups. The default is "0".
@@ -1137,6 +1156,10 @@ All time durations are in microseconds.
 	granularity is coarser for the nice values, the read value is
 	the closest approximation of the current weight.
 
+	This file affects only processes under the fair-class scheduler and a BPF
+	scheduler with the ``cgroup_set_weight`` callback depending on what the
+	callback actually does.
+
 cpu.max
 	A read-write two value file which exists on non-root cgroups.
 	The default is "max 100000".
@@ -1149,43 +1172,55 @@ All time durations are in microseconds.
 	$PERIOD duration. "max" for $MAX indicates no limit. If only
 	one number is written, $MAX is updated.
 
+	This file affects only processes under the fair-class scheduler.
+
 cpu.max.burst
 	A read-write single value file which exists on non-root
 	cgroups. The default is "0".
 
 	The burst in the range [0, $MAX].
 
+	This file affects only processes under the fair-class scheduler.
+
 cpu.pressure
 	A read-write nested-keyed file.
 
 	Shows pressure stall information for CPU. See
 	:ref:`Documentation/accounting/psi.rst <psi>` for details.
 
+	This file accounts for all the processes in the cgroup.
+
 cpu.uclamp.min
-	A read-write single value file which exists on non-root cgroups.
-	The default is "0", i.e. no utilization boosting.
+	A read-write single value file which exists on non-root cgroups.
+	The default is "0", i.e. no utilization boosting.
+
+	The requested minimum utilization (protection) as a percentage
+	rational number, e.g. 12.34 for 12.34%.
 
-	The requested minimum utilization (protection) as a percentage
-	rational number, e.g. 12.34 for 12.34%.
+	This interface allows reading and setting minimum utilization clamp
+	values similar to the sched_setattr(2). This minimum utilization
+	value is used to clamp the task specific minimum utilization clamp,
+	including those of realtime processes.
 
-	This interface allows reading and setting minimum utilization clamp
-	values similar to the sched_setattr(2). This minimum utilization
-	value is used to clamp the task specific minimum utilization clamp.
+	The requested minimum utilization (protection) is always capped by
+	the current value for the maximum utilization (limit), i.e.
+	`cpu.uclamp.max`.
 
-	The requested minimum utilization (protection) is always capped by
-	the current value for the maximum utilization (limit), i.e.
-	`cpu.uclamp.max`.
+	This file affects all the processes in the cgroup.
 
 cpu.uclamp.max
-	A read-write single value file which exists on non-root cgroups.
-	The default is "max". i.e. no utilization capping
+	A read-write single value file which exists on non-root cgroups.
+	The default is "max". i.e. no utilization capping
+
+	The requested maximum utilization (limit) as a percentage rational
+	number, e.g. 98.76 for 98.76%.
 
-	The requested maximum utilization (limit) as a percentage rational
-	number, e.g. 98.76 for 98.76%.
+	This interface allows reading and setting maximum utilization clamp
+	values similar to the sched_setattr(2). This maximum utilization
+	value is used to clamp the task specific maximum utilization clamp,
+	including those of realtime processes.
 
-	This interface allows reading and setting maximum utilization clamp
-	values similar to the sched_setattr(2). This maximum utilization
-	value is used to clamp the task specific maximum utilization clamp.
+	This file affects all the processes in the cgroup.
 
 cpu.idle
 	A read-write single value file which exists on non-root cgroups.
@@ -1197,7 +1232,7 @@ All time durations are in microseconds.
 	own relative priorities, but the cgroup itself will be treated as
 	very low priority relative to its peers.
 
-
+	This file affects only processes under the fair-class scheduler.
 
 Memory
 ------
```
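The cpu.max documentation touched above keeps the "$MAX $PERIOD" format with a default of "max 100000". As a quick illustration of those semantics (the helper names below are mine, not part of the kernel or its docs), a sketch that parses a cpu.max value and computes the allowed CPU fraction:

```python
def parse_cpu_max(value: str, default_period: int = 100000):
    """Parse a cgroup2 cpu.max value: "$MAX $PERIOD" or just "$MAX".

    Returns (quota, period) in microseconds; quota is None for "max",
    meaning no bandwidth limit. If only one field is present, the period
    keeps its default, mirroring "If only one number is written".
    """
    parts = value.split()
    max_part = parts[0]
    period = int(parts[1]) if len(parts) > 1 else default_period
    quota = None if max_part == "max" else int(max_part)
    return quota, period

def cpu_fraction(value: str):
    """CPUs' worth of runtime allowed per period, or None if unlimited."""
    quota, period = parse_cpu_max(value)
    return None if quota is None else quota / period

print(parse_cpu_max("max 100000"))      # (None, 100000) -- the default
print(cpu_fraction("50000 100000"))     # 0.5 (half a CPU)
print(cpu_fraction("200000 100000"))    # 2.0 (two CPUs' worth)
```

Note that per the documentation change above, this limit applies only to processes under the fair-class scheduler.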

block/blk-cgroup.c

Lines changed: 5 additions & 5 deletions
```diff
@@ -1074,8 +1074,8 @@ static void __blkcg_rstat_flush(struct blkcg *blkcg, int cpu)
 	/*
 	 * For covering concurrent parent blkg update from blkg_release().
 	 *
-	 * When flushing from cgroup, cgroup_rstat_lock is always held, so
-	 * this lock won't cause contention most of time.
+	 * When flushing from cgroup, the subsystem rstat lock is always held,
+	 * so this lock won't cause contention most of time.
 	 */
 	raw_spin_lock_irqsave(&blkg_stat_lock, flags);
 
@@ -1144,7 +1144,7 @@ static void blkcg_rstat_flush(struct cgroup_subsys_state *css, int cpu)
 	/*
 	 * We source root cgroup stats from the system-wide stats to avoid
 	 * tracking the same information twice and incurring overhead when no
-	 * cgroups are defined. For that reason, cgroup_rstat_flush in
+	 * cgroups are defined. For that reason, css_rstat_flush in
 	 * blkcg_print_stat does not actually fill out the iostat in the root
 	 * cgroup's blkcg_gq.
 	 *
@@ -1253,7 +1253,7 @@ static int blkcg_print_stat(struct seq_file *sf, void *v)
 	if (!seq_css(sf)->parent)
 		blkcg_fill_root_iostats();
 	else
-		cgroup_rstat_flush(blkcg->css.cgroup);
+		css_rstat_flush(&blkcg->css);
 
 	rcu_read_lock();
 	hlist_for_each_entry_rcu(blkg, &blkcg->blkg_list, blkcg_node) {
@@ -2243,7 +2243,7 @@ void blk_cgroup_bio_start(struct bio *bio)
 	}
 
 	u64_stats_update_end_irqrestore(&bis->sync, flags);
-	cgroup_rstat_updated(blkcg->css.cgroup, cpu);
+	css_rstat_updated(&blkcg->css, cpu);
 	put_cpu();
 }
```

include/linux/bpf-cgroup.h

Lines changed: 5 additions & 4 deletions
```diff
@@ -114,8 +114,7 @@ struct bpf_prog_list {
 	u32 flags;
 };
 
-int cgroup_bpf_inherit(struct cgroup *cgrp);
-void cgroup_bpf_offline(struct cgroup *cgrp);
+void __init cgroup_bpf_lifetime_notifier_init(void);
 
 int __cgroup_bpf_run_filter_skb(struct sock *sk,
 				struct sk_buff *skb,
@@ -431,8 +430,10 @@ const struct bpf_func_proto *
 cgroup_current_func_proto(enum bpf_func_id func_id, const struct bpf_prog *prog);
 #else
 
-static inline int cgroup_bpf_inherit(struct cgroup *cgrp) { return 0; }
-static inline void cgroup_bpf_offline(struct cgroup *cgrp) {}
+static inline void cgroup_bpf_lifetime_notifier_init(void)
+{
+	return;
+}
 
 static inline int cgroup_bpf_prog_attach(const union bpf_attr *attr,
 					 enum bpf_prog_type ptype,
```

include/linux/cgroup-defs.h

Lines changed: 66 additions & 34 deletions
```diff
@@ -169,6 +169,23 @@ struct cgroup_subsys_state {
 	/* reference count - access via css_[try]get() and css_put() */
 	struct percpu_ref refcnt;
 
+	/*
+	 * Depending on the context, this field is initialized
+	 * via css_rstat_init() at different places:
+	 *
+	 * when css is associated with cgroup::self
+	 *   when css->cgroup is the root cgroup
+	 *     performed in cgroup_init()
+	 *   when css->cgroup is not the root cgroup
+	 *     performed in cgroup_create()
+	 * when css is associated with a subsystem
+	 *   when css->cgroup is the root cgroup
+	 *     performed in cgroup_init_subsys() in the non-early path
+	 *   when css->cgroup is not the root cgroup
+	 *     performed in css_create()
+	 */
+	struct css_rstat_cpu __percpu *rstat_cpu;
+
 	/*
 	 * siblings list anchored at the parent's ->children
 	 *
@@ -177,9 +194,6 @@ struct cgroup_subsys_state {
 	struct list_head sibling;
 	struct list_head children;
 
-	/* flush target list anchored at cgrp->rstat_css_list */
-	struct list_head rstat_css_node;
-
 	/*
 	 * PI: Subsys-unique ID. 0 is unused and root is always 1. The
 	 * matching css can be looked up using css_from_id().
@@ -219,6 +233,16 @@ struct cgroup_subsys_state {
 	 * Protected by cgroup_mutex.
 	 */
 	int nr_descendants;
+
+	/*
+	 * A singly-linked list of css structures to be rstat flushed.
+	 * This is a scratch field to be used exclusively by
+	 * css_rstat_flush().
+	 *
+	 * Protected by rstat_base_lock when css is cgroup::self.
+	 * Protected by css->ss->rstat_ss_lock otherwise.
+	 */
+	struct cgroup_subsys_state *rstat_flush_next;
 };
 
 /*
@@ -329,10 +353,10 @@ struct cgroup_base_stat {
 
 /*
  * rstat - cgroup scalable recursive statistics. Accounting is done
- * per-cpu in cgroup_rstat_cpu which is then lazily propagated up the
+ * per-cpu in css_rstat_cpu which is then lazily propagated up the
  * hierarchy on reads.
 *
- * When a stat gets updated, the cgroup_rstat_cpu and its ancestors are
+ * When a stat gets updated, the css_rstat_cpu and its ancestors are
 * linked into the updated tree. On the following read, propagation only
 * considers and consumes the updated tree. This makes reading O(the
 * number of descendants which have been active since last read) instead of
@@ -344,10 +368,29 @@ struct cgroup_base_stat {
 * frequency decreases the cost of each read.
 *
 * This struct hosts both the fields which implement the above -
- * updated_children and updated_next - and the fields which track basic
- * resource statistics on top of it - bsync, bstat and last_bstat.
+ * updated_children and updated_next.
 */
-struct cgroup_rstat_cpu {
+struct css_rstat_cpu {
+	/*
+	 * Child cgroups with stat updates on this cpu since the last read
+	 * are linked on the parent's ->updated_children through
+	 * ->updated_next. updated_children is terminated by its container css.
+	 *
+	 * In addition to being more compact, singly-linked list pointing to
+	 * the css makes it unnecessary for each per-cpu struct to point back
+	 * to the associated css.
+	 *
+	 * Protected by per-cpu css->ss->rstat_ss_cpu_lock.
+	 */
+	struct cgroup_subsys_state *updated_children;
+	struct cgroup_subsys_state *updated_next; /* NULL if not on the list */
+};
+
+/*
+ * This struct hosts the fields which track basic resource statistics on
+ * top of it - bsync, bstat and last_bstat.
+ */
+struct cgroup_rstat_base_cpu {
 	/*
 	 * ->bsync protects ->bstat. These are the only fields which get
 	 * updated in the hot path.
@@ -374,20 +417,6 @@
 	 * deltas to propagate to the per-cpu subtree_bstat.
 	 */
 	struct cgroup_base_stat last_subtree_bstat;
-
-	/*
-	 * Child cgroups with stat updates on this cpu since the last read
-	 * are linked on the parent's ->updated_children through
-	 * ->updated_next.
-	 *
-	 * In addition to being more compact, singly-linked list pointing
-	 * to the cgroup makes it unnecessary for each per-cpu struct to
-	 * point back to the associated cgroup.
-	 *
-	 * Protected by per-cpu cgroup_rstat_cpu_lock.
-	 */
-	struct cgroup *updated_children; /* terminated by self cgroup */
-	struct cgroup *updated_next; /* NULL iff not on the list */
 };
 
 struct cgroup_freezer_state {
@@ -516,23 +545,23 @@ struct cgroup {
 	struct cgroup *dom_cgrp;
 	struct cgroup *old_dom_cgrp; /* used while enabling threaded */
 
-	/* per-cpu recursive resource statistics */
-	struct cgroup_rstat_cpu __percpu *rstat_cpu;
-	struct list_head rstat_css_list;
-
 	/*
-	 * Add padding to separate the read mostly rstat_cpu and
-	 * rstat_css_list into a different cacheline from the following
-	 * rstat_flush_next and *bstat fields which can have frequent updates.
+	 * Depending on the context, this field is initialized via
+	 * css_rstat_init() at different places:
+	 *
+	 * when cgroup is the root cgroup
+	 *   performed in cgroup_setup_root()
+	 * otherwise
+	 *   performed in cgroup_create()
 	 */
-	CACHELINE_PADDING(_pad_);
+	struct cgroup_rstat_base_cpu __percpu *rstat_base_cpu;
 
 	/*
-	 * A singly-linked list of cgroup structures to be rstat flushed.
-	 * This is a scratch field to be used exclusively by
-	 * cgroup_rstat_flush_locked() and protected by cgroup_rstat_lock.
+	 * Add padding to keep the read mostly rstat per-cpu pointer on a
+	 * different cacheline than the following *bstat fields which can have
+	 * frequent updates.
 	 */
-	struct cgroup *rstat_flush_next;
+	CACHELINE_PADDING(_pad_);
 
 	/* cgroup basic resource statistics */
 	struct cgroup_base_stat last_bstat;
@@ -790,6 +819,9 @@ struct cgroup_subsys {
 	 * specifies the mask of subsystems that this one depends on.
 	 */
 	unsigned int depends_on;
+
+	spinlock_t rstat_ss_lock;
+	raw_spinlock_t __percpu *rstat_ss_cpu_lock;
 };
 
 extern struct percpu_rw_semaphore cgroup_threadgroup_rwsem;
```
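The updated_children/updated_next comments above describe a per-cpu "updated tree": a node's list of children with pending stat updates is terminated by the node itself, and updated_next is NULL exactly when the css is not on any list. A toy single-CPU Python model of that linking scheme (a sketch of the data structure only; the kernel version is per-cpu, per-subsystem, and lock-protected):

```python
class CSS:
    """Toy stand-in for a cgroup_subsys_state plus its css_rstat_cpu slot."""
    def __init__(self, name, parent=None):
        self.name = name
        self.parent = parent
        self.updated_children = self  # empty list is terminated by self
        self.updated_next = None      # None iff not on a parent's list

def css_rstat_updated(css):
    # Link css and each ancestor onto its parent's updated list, stopping
    # as soon as a node is already linked; repeated updates are cheap and
    # reads only walk descendants that were actually active.
    while css.parent is not None:
        if css.updated_next is not None:
            return  # already on the list, so all ancestors are too
        css.updated_next = css.parent.updated_children
        css.parent.updated_children = css
        css = css.parent

def updated_list(parent):
    # Walk a node's updated children; the chain terminates at the node itself.
    out, node = [], parent.updated_children
    while node is not parent:
        out.append(node.name)
        node = node.updated_next
    return out

root = CSS("root")
a, b = CSS("a", root), CSS("b", root)
c = CSS("c", a)

css_rstat_updated(c)   # links c under a, then a under root
css_rstat_updated(b)   # links b under root
css_rstat_updated(c)   # no-op: c is already on a's list

print(updated_list(root))  # ['b', 'a']
print(updated_list(a))     # ['c']
```

Terminating with the container node (rather than a sentinel) is what lets "updated_next is None" double as the not-on-list test: even the last element's updated_next is non-None because it points at the parent.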
