Commit d16738a
Merge tag 'kthread-for-7.0' of git://git.kernel.org/pub/scm/linux/kernel/git/frederic/linux-dynticks
Pull kthread updates from Frederic Weisbecker:
 "The kthread code provides an infrastructure which manages the
  preferred affinity of unbound kthreads (node or custom cpumask)
  against housekeeping (CPU isolation) constraints and CPU hotplug
  events.

  One crucial missing piece is the handling of cpuset: when an isolated
  partition is created, deleted, or its CPUs updated, all the unbound
  kthreads in the top cpuset become indifferently affine to _all_ the
  non-isolated CPUs, possibly breaking their preferred affinity along
  the way.

  Solve this by moving the kthread affinity update out of cpuset and
  into the consolidated relevant kthread code instead, so that
  preferred affinities are honoured and applied against the updated
  cpuset isolated partitions.

  The dispatch of the new isolated cpumasks to timers, workqueues and
  kthreads is performed by housekeeping, as per Tejun's nice
  suggestion.

  As a welcome side effect, HK_TYPE_DOMAIN then integrates both the set
  from boot-defined domain isolation (through isolcpus=) and cpuset
  isolated partitions. Housekeeping cpumasks are now modifiable with a
  specific RCU based synchronization.

  A big step toward making nohz_full= also mutable through cpuset in
  the future"

* tag 'kthread-for-7.0' of git://git.kernel.org/pub/scm/linux/kernel/git/frederic/linux-dynticks: (33 commits)
  doc: Add housekeeping documentation
  kthread: Document kthread_affine_preferred()
  kthread: Comment on the purpose and placement of kthread_affine_node() call
  kthread: Honour kthreads preferred affinity after cpuset changes
  sched/arm64: Move fallback task cpumask to HK_TYPE_DOMAIN
  sched: Switch the fallback task allowed cpumask to HK_TYPE_DOMAIN
  kthread: Rely on HK_TYPE_DOMAIN for preferred affinity management
  kthread: Include kthreadd to the managed affinity list
  kthread: Include unbound kthreads in the managed affinity list
  kthread: Refine naming of affinity related fields
  PCI: Remove superfluous HK_TYPE_WQ check
  sched/isolation: Remove HK_TYPE_TICK test from cpu_is_isolated()
  cpuset: Remove cpuset_cpu_is_isolated()
  timers/migration: Remove superfluous cpuset isolation test
  cpuset: Propagate cpuset isolation update to timers through housekeeping
  cpuset: Propagate cpuset isolation update to workqueue through housekeeping
  PCI: Flush PCI probe workqueue on cpuset isolated partition change
  sched/isolation: Flush vmstat workqueues on cpuset isolated partition change
  sched/isolation: Flush memcg workqueues on cpuset isolated partition change
  cpuset: Update HK_TYPE_DOMAIN cpumask from cpuset
  ...
2 parents 0506158 + fa39ec4 commit d16738a

28 files changed: 538 additions & 206 deletions


Documentation/arch/arm64/asymmetric-32bit.rst

Lines changed: 8 additions & 4 deletions
@@ -154,10 +154,14 @@ mode will return to host userspace with an ``exit_reason`` of
 ``KVM_EXIT_FAIL_ENTRY`` and will remain non-runnable until successfully
 re-initialised by a subsequent ``KVM_ARM_VCPU_INIT`` operation.
 
-NOHZ FULL
----------
+SCHEDULER DOMAIN ISOLATION
+--------------------------
 
-To avoid perturbing an adaptive-ticks CPU (specified using
-``nohz_full=``) when a 32-bit task is forcefully migrated, these CPUs
+To avoid perturbing a boot-defined domain isolated CPU (specified using
+``isolcpus=[domain]``) when a 32-bit task is forcefully migrated, these CPUs
 are treated as 64-bit-only when support for asymmetric 32-bit systems
 is enabled.
+
+However as opposed to boot-defined domain isolation, runtime-defined domain
+isolation using cpuset isolated partition is not advised on asymmetric
+32-bit systems and will result in undefined behaviour.
Documentation/core-api/housekeeping.rst

Lines changed: 111 additions & 0 deletions
@@ -0,0 +1,111 @@
+======================================
+Housekeeping
+======================================
+
+
+CPU Isolation moves away kernel work that may otherwise run on any CPU.
+The purpose of its related features is to reduce the OS jitter that some
+extreme workloads can't stand, such as in some DPDK usecases.
+
+The kernel work moved away by CPU isolation is commonly described as
+"housekeeping" because it includes ground work that performs cleanups,
+statistics maintainance and actions relying on them, memory release,
+various deferrals etc...
+
+Sometimes housekeeping is just some unbound work (unbound workqueues,
+unbound timers, ...) that gets easily assigned to non-isolated CPUs.
+But sometimes housekeeping is tied to a specific CPU and requires
+elaborated tricks to be offloaded to non-isolated CPUs (RCU_NOCB, remote
+scheduler tick, etc...).
+
+Thus, a housekeeping CPU can be considered as the reverse of an isolated
+CPU. It is simply a CPU that can execute housekeeping work. There must
+always be at least one online housekeeping CPU at any time. The CPUs that
+are not isolated are automatically assigned as housekeeping.
+
+Housekeeping is currently divided in four features described
+by the ``enum hk_type type``:
+
+1. HK_TYPE_DOMAIN matches the work moved away by scheduler domain
+   isolation performed through ``isolcpus=domain`` boot parameter or
+   isolated cpuset partitions in cgroup v2. This includes scheduler
+   load balancing, unbound workqueues and timers.
+
+2. HK_TYPE_KERNEL_NOISE matches the work moved away by tick isolation
+   performed through ``nohz_full=`` or ``isolcpus=nohz`` boot
+   parameters. This includes remote scheduler tick, vmstat and lockup
+   watchdog.
+
+3. HK_TYPE_MANAGED_IRQ matches the IRQ handlers moved away by managed
+   IRQ isolation performed through ``isolcpus=managed_irq``.
+
+4. HK_TYPE_DOMAIN_BOOT matches the work moved away by scheduler domain
+   isolation performed through ``isolcpus=domain`` only. It is similar
+   to HK_TYPE_DOMAIN except it ignores the isolation performed by
+   cpusets.
+
+
+Housekeeping cpumasks
+=================================
+
+Housekeeping cpumasks include the CPUs that can execute the work moved
+away by the matching isolation feature. These cpumasks are returned by
+the following function::
+
+  const struct cpumask *housekeeping_cpumask(enum hk_type type)
+
+By default, if neither ``nohz_full=``, nor ``isolcpus``, nor cpuset's
+isolated partitions are used, which covers most usecases, this function
+returns the cpu_possible_mask.
+
+Otherwise the function returns the cpumask complement of the isolation
+feature. For example:
+
+With isolcpus=domain,7 the following will return a mask with all possible
+CPUs except 7::
+
+  housekeeping_cpumask(HK_TYPE_DOMAIN)
+
+Similarly with nohz_full=5,6 the following will return a mask with all
+possible CPUs except 5,6::
+
+  housekeeping_cpumask(HK_TYPE_KERNEL_NOISE)
+
+
+Synchronization against cpusets
+=================================
+
+Cpuset can modify the HK_TYPE_DOMAIN housekeeping cpumask while creating,
+modifying or deleting an isolated partition.
+
+The users of HK_TYPE_DOMAIN cpumask must then make sure to synchronize
+properly against cpuset in order to make sure that:
+
+1. The cpumask snapshot stays coherent.
+
+2. No housekeeping work is queued on a newly made isolated CPU.
+
+3. Pending housekeeping work that was queued to a non isolated
+   CPU which just turned isolated through cpuset must be flushed
+   before the related created/modified isolated partition is made
+   available to userspace.
+
+This synchronization is maintained by an RCU based scheme. The cpuset update
+side waits for an RCU grace period after updating the HK_TYPE_DOMAIN
+cpumask and before flushing pending works. On the read side, care must be
+taken to gather the housekeeping target election and the work enqueue within
+the same RCU read side critical section.
+
+A typical layout example would look like this on the update side
+(``housekeeping_update()``)::
+
+  rcu_assign_pointer(housekeeping_cpumasks[type], trial);
+  synchronize_rcu();
+  flush_workqueue(example_workqueue);
+
+And then on the read side::
+
+  rcu_read_lock();
+  cpu = housekeeping_any_cpu(HK_TYPE_DOMAIN);
+  queue_work_on(cpu, example_workqueue, work);
+  rcu_read_unlock();

Documentation/core-api/index.rst

Lines changed: 1 addition & 0 deletions
@@ -25,6 +25,7 @@ it.
    symbol-namespaces
    asm-annotations
    real-time/index
+   housekeeping.rst
 
 Data structures and low-level utilities
 =======================================

arch/arm64/kernel/cpufeature.c

Lines changed: 3 additions & 3 deletions
@@ -1669,7 +1669,7 @@ const struct cpumask *system_32bit_el0_cpumask(void)
 
 const struct cpumask *task_cpu_fallback_mask(struct task_struct *p)
 {
-	return __task_cpu_possible_mask(p, housekeeping_cpumask(HK_TYPE_TICK));
+	return __task_cpu_possible_mask(p, housekeeping_cpumask(HK_TYPE_DOMAIN));
 }
 
 static int __init parse_32bit_el0_param(char *str)
@@ -3987,8 +3987,8 @@ static int enable_mismatched_32bit_el0(unsigned int cpu)
 	bool cpu_32bit = false;
 
 	if (id_aa64pfr0_32bit_el0(info->reg_id_aa64pfr0)) {
-		if (!housekeeping_cpu(cpu, HK_TYPE_TICK))
-			pr_info("Treating adaptive-ticks CPU %u as 64-bit only\n", cpu);
+		if (!housekeeping_cpu(cpu, HK_TYPE_DOMAIN))
+			pr_info("Treating domain isolated CPU %u as 64-bit only\n", cpu);
 		else
 			cpu_32bit = true;
 	}

block/blk-mq.c

Lines changed: 5 additions & 1 deletion
@@ -4270,12 +4270,16 @@ static void blk_mq_map_swqueue(struct request_queue *q)
 
 	/*
 	 * Rule out isolated CPUs from hctx->cpumask to avoid
-	 * running block kworker on isolated CPUs
+	 * running block kworker on isolated CPUs.
+	 * FIXME: cpuset should propagate further changes to isolated CPUs
+	 * here.
 	 */
+	rcu_read_lock();
 	for_each_cpu(cpu, hctx->cpumask) {
 		if (cpu_is_isolated(cpu))
 			cpumask_clear_cpu(cpu, hctx->cpumask);
 	}
+	rcu_read_unlock();
 
 	/*
 	 * Initialize batch roundrobin counts

drivers/base/cpu.c

Lines changed: 1 addition & 1 deletion
@@ -291,7 +291,7 @@ static ssize_t print_cpus_isolated(struct device *dev,
 		return -ENOMEM;
 
 	cpumask_andnot(isolated, cpu_possible_mask,
-		       housekeeping_cpumask(HK_TYPE_DOMAIN));
+		       housekeeping_cpumask(HK_TYPE_DOMAIN_BOOT));
 	len = sysfs_emit(buf, "%*pbl\n", cpumask_pr_args(isolated));
 
 	free_cpumask_var(isolated);

drivers/pci/pci-driver.c

Lines changed: 52 additions & 19 deletions
@@ -302,9 +302,8 @@ struct drv_dev_and_id {
 	const struct pci_device_id *id;
 };
 
-static long local_pci_probe(void *_ddi)
+static int local_pci_probe(struct drv_dev_and_id *ddi)
 {
-	struct drv_dev_and_id *ddi = _ddi;
 	struct pci_dev *pci_dev = ddi->dev;
 	struct pci_driver *pci_drv = ddi->drv;
 	struct device *dev = &pci_dev->dev;
@@ -338,6 +337,21 @@ static long local_pci_probe(void *_ddi)
 	return 0;
 }
 
+static struct workqueue_struct *pci_probe_wq;
+
+struct pci_probe_arg {
+	struct drv_dev_and_id *ddi;
+	struct work_struct work;
+	int ret;
+};
+
+static void local_pci_probe_callback(struct work_struct *work)
+{
+	struct pci_probe_arg *arg = container_of(work, struct pci_probe_arg, work);
+
+	arg->ret = local_pci_probe(arg->ddi);
+}
+
 static bool pci_physfn_is_probed(struct pci_dev *dev)
 {
 #ifdef CONFIG_PCI_IOV
@@ -362,40 +376,55 @@ static int pci_call_probe(struct pci_driver *drv, struct pci_dev *dev,
 	dev->is_probed = 1;
 
 	cpu_hotplug_disable();
-
 	/*
 	 * Prevent nesting work_on_cpu() for the case where a Virtual Function
 	 * device is probed from work_on_cpu() of the Physical device.
 	 */
 	if (node < 0 || node >= MAX_NUMNODES || !node_online(node) ||
 	    pci_physfn_is_probed(dev)) {
-		cpu = nr_cpu_ids;
+		error = local_pci_probe(&ddi);
 	} else {
-		cpumask_var_t wq_domain_mask;
+		struct pci_probe_arg arg = { .ddi = &ddi };
 
-		if (!zalloc_cpumask_var(&wq_domain_mask, GFP_KERNEL)) {
-			error = -ENOMEM;
-			goto out;
+		INIT_WORK_ONSTACK(&arg.work, local_pci_probe_callback);
+		/*
+		 * The target election and the enqueue of the work must be within
+		 * the same RCU read side section so that when the workqueue pool
+		 * is flushed after a housekeeping cpumask update, further readers
+		 * are guaranteed to queue the probing work to the appropriate
+		 * targets.
+		 */
+		rcu_read_lock();
+		cpu = cpumask_any_and(cpumask_of_node(node),
+				      housekeeping_cpumask(HK_TYPE_DOMAIN));
+
+		if (cpu < nr_cpu_ids) {
+			struct workqueue_struct *wq = pci_probe_wq;
+
+			if (WARN_ON_ONCE(!wq))
+				wq = system_percpu_wq;
+			queue_work_on(cpu, wq, &arg.work);
+			rcu_read_unlock();
+			flush_work(&arg.work);
+			error = arg.ret;
+		} else {
+			rcu_read_unlock();
+			error = local_pci_probe(&ddi);
 		}
-		cpumask_and(wq_domain_mask,
-			    housekeeping_cpumask(HK_TYPE_WQ),
-			    housekeeping_cpumask(HK_TYPE_DOMAIN));
 
-		cpu = cpumask_any_and(cpumask_of_node(node),
-				      wq_domain_mask);
-		free_cpumask_var(wq_domain_mask);
+		destroy_work_on_stack(&arg.work);
 	}
 
-	if (cpu < nr_cpu_ids)
-		error = work_on_cpu(cpu, local_pci_probe, &ddi);
-	else
-		error = local_pci_probe(&ddi);
-out:
 	dev->is_probed = 0;
 	cpu_hotplug_enable();
 	return error;
 }
 
+void pci_probe_flush_workqueue(void)
+{
+	flush_workqueue(pci_probe_wq);
+}
+
 /**
  * __pci_device_probe - check if a driver wants to claim a specific PCI device
  * @drv: driver to call to check if it wants the PCI device
@@ -1733,6 +1762,10 @@ static int __init pci_driver_init(void)
 {
 	int ret;
 
+	pci_probe_wq = alloc_workqueue("sync_wq", WQ_PERCPU, 0);
+	if (!pci_probe_wq)
+		return -ENOMEM;
+
 	ret = bus_register(&pci_bus_type);
 	if (ret)
 		return ret;

include/linux/cpuhplock.h

Lines changed: 1 addition & 0 deletions
@@ -13,6 +13,7 @@
 struct device;
 
 extern int lockdep_is_cpus_held(void);
+extern int lockdep_is_cpus_write_held(void);
 
 #ifdef CONFIG_HOTPLUG_CPU
 void cpus_write_lock(void);

include/linux/cpuset.h

Lines changed: 2 additions & 6 deletions
@@ -18,6 +18,8 @@
 #include <linux/mmu_context.h>
 #include <linux/jump_label.h>
 
+extern bool lockdep_is_cpuset_held(void);
+
 #ifdef CONFIG_CPUSETS
 
 /*
@@ -77,7 +79,6 @@ extern void cpuset_unlock(void);
 extern void cpuset_cpus_allowed_locked(struct task_struct *p, struct cpumask *mask);
 extern void cpuset_cpus_allowed(struct task_struct *p, struct cpumask *mask);
 extern bool cpuset_cpus_allowed_fallback(struct task_struct *p);
-extern bool cpuset_cpu_is_isolated(int cpu);
 extern nodemask_t cpuset_mems_allowed(struct task_struct *p);
 #define cpuset_current_mems_allowed (current->mems_allowed)
 void cpuset_init_current_mems_allowed(void);
@@ -213,11 +214,6 @@ static inline bool cpuset_cpus_allowed_fallback(struct task_struct *p)
 	return false;
 }
 
-static inline bool cpuset_cpu_is_isolated(int cpu)
-{
-	return false;
-}
-
 static inline nodemask_t cpuset_mems_allowed(struct task_struct *p)
 {
 	return node_possible_map;

include/linux/kthread.h

Lines changed: 1 addition & 0 deletions
@@ -100,6 +100,7 @@ void kthread_unpark(struct task_struct *k);
 void kthread_parkme(void);
 void kthread_exit(long result) __noreturn;
 void kthread_complete_and_exit(struct completion *, long) __noreturn;
+int kthreads_update_housekeeping(void);
 
 int kthreadd(void *unused);
 extern struct task_struct *kthreadd_task;
