=========
Workqueue
=========

:Date: September, 2010
:Author: Tejun Heo <tj@kernel.org>
there is no work item left on the workqueue the worker becomes idle.
When a new work item gets queued, the worker begins executing again.


Why Concurrency Managed Workqueue?
==================================

In the original wq implementation, a multi-threaded (MT) wq had one
worker thread per CPU and a single-threaded (ST) wq had one worker

behavior of older kernels.


Affinity Scopes and Performance
===============================

It would be ideal if an unbound workqueue's behavior were optimal for the
vast majority of use cases without further tuning. Unfortunately, in the
current kernel, there exists a pronounced trade-off between locality and
utilization necessitating explicit configuration when workqueues are
heavily used.

Higher locality leads to higher efficiency, where more work is performed
for the same number of consumed CPU cycles. However, higher locality may
also cause lower overall system utilization if the work items are not
spread widely enough across the affinity scopes by the issuers. The
following performance testing with dm-crypt clearly illustrates this
trade-off.

The tests are run on a CPU with 12 cores / 24 threads split across four L3
caches (AMD Ryzen 9 3900X). CPU clock boost is turned off for consistency.
``/dev/dm-0`` is a dm-crypt device created on an NVMe SSD (Samsung 990 PRO)
and opened with ``cryptsetup`` with default settings.


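Such a test device can be prepared along the following lines. This is only
a sketch: the NVMe namespace path and the mapping name are assumptions for
illustration, not the exact commands used for the benchmark.

```shell
# Sketch of the benchmark device setup (requires root). The namespace
# path /dev/nvme0n1 and the name "test-crypt" are assumed examples.

# Format the SSD namespace as a LUKS volume with cryptsetup defaults.
cryptsetup luksFormat /dev/nvme0n1

# Open it; dm-crypt creates the mapping (here appearing as /dev/dm-0)
# and its encryption work is handled by the kcryptd workqueues.
cryptsetup open /dev/nvme0n1 test-crypt
```

IO issued against the resulting ``/dev/dm-0`` then exercises ``kcryptd``.

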
Scenario 1: Enough issuers and work spread across the machine
-------------------------------------------------------------

The command used: ::

  $ fio --filename=/dev/dm-0 --direct=1 --rw=randrw --bs=32k --ioengine=libaio \
    --iodepth=64 --runtime=60 --numjobs=24 --time_based --group_reporting \
    --name=iops-test-job --verify=sha512

There are 24 issuers, each issuing 64 IOs concurrently. ``--verify=sha512``
makes ``fio`` generate and read back the content each time, which makes
execution locality matter between the issuer and ``kcryptd``. The following
are the read bandwidths and CPU utilizations for different affinity scope
settings on ``kcryptd``, measured over five runs. Bandwidths are in MiBps,
and CPU util in percent.

.. list-table::
   :widths: 16 20 20
   :header-rows: 1

   * - Affinity
     - Bandwidth (MiBps)
     - CPU util (%)

   * - system
     - 1159.40 ±1.34
     - 99.31 ±0.02

   * - cache
     - 1166.40 ±0.89
     - 99.34 ±0.01

   * - cache (strict)
     - 1166.00 ±0.71
     - 99.35 ±0.01

With enough issuers spread across the system, there is no downside to
"cache", strict or otherwise. All three configurations saturate the whole
machine, but the cache-affine ones outperform by 0.6% thanks to improved
locality.
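
Each table entry is a mean with standard deviation over the five runs. As a
side note, such aggregates can be computed with a small ``awk`` sketch like
the one below; the five sample values are made up for illustration, not
actual benchmark output.

```shell
# Aggregate per-run bandwidth samples (MiBps) into "mean ±stddev".
# The sample values are illustrative, not actual fio results.
samples="1158.2 1160.1 1159.8 1158.9 1160.0"

echo "$samples" | awk '{
    n = NF
    for (i = 1; i <= n; i++) { sum += $i; sumsq += $i * $i }
    mean = sum / n
    # Population standard deviation; use (n - 1) for the sample version.
    sd = sqrt(sumsq / n - mean * mean)
    printf "%.2f ±%.2f\n", mean, sd
}'
# Prints: 1159.40 ±0.73
```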


Scenario 2: Fewer issuers, enough work for saturation
-----------------------------------------------------

The command used: ::

  $ fio --filename=/dev/dm-0 --direct=1 --rw=randrw --bs=32k \
    --ioengine=libaio --iodepth=64 --runtime=60 --numjobs=8 \
    --time_based --group_reporting --name=iops-test-job --verify=sha512

The only difference from the previous scenario is ``--numjobs=8``. There
are a third as many issuers, but there is still enough total work to
saturate the system.

.. list-table::
   :widths: 16 20 20
   :header-rows: 1

   * - Affinity
     - Bandwidth (MiBps)
     - CPU util (%)

   * - system
     - 1155.40 ±0.89
     - 97.41 ±0.05

   * - cache
     - 1154.40 ±1.14
     - 96.15 ±0.09

   * - cache (strict)
     - 1112.00 ±4.64
     - 93.26 ±0.35

This is more than enough work to saturate the system. Both "system" and
"cache" nearly saturate the machine, but not fully. "cache" uses less CPU,
but its better efficiency puts it at the same bandwidth as "system".

Eight issuers moving around across four L3 cache scopes still allow "cache
(strict)" to mostly saturate the machine, but the loss of work conservation
is now starting to hurt, with a 3.7% bandwidth loss.


Scenario 3: Even fewer issuers, not enough work to saturate
-----------------------------------------------------------

The command used: ::

  $ fio --filename=/dev/dm-0 --direct=1 --rw=randrw --bs=32k \
    --ioengine=libaio --iodepth=64 --runtime=60 --numjobs=4 \
    --time_based --group_reporting --name=iops-test-job --verify=sha512

Again, the only difference is ``--numjobs=4``. With the number of issuers
reduced to four, there now isn't enough work to saturate the whole system
and the bandwidth becomes dependent on completion latencies.

.. list-table::
   :widths: 16 20 20
   :header-rows: 1

   * - Affinity
     - Bandwidth (MiBps)
     - CPU util (%)

   * - system
     - 993.60 ±1.82
     - 75.49 ±0.06

   * - cache
     - 973.40 ±1.52
     - 74.90 ±0.07

   * - cache (strict)
     - 828.20 ±4.49
     - 66.84 ±0.29

Now, the trade-off between locality and utilization is clearer. "cache"
shows a 2% bandwidth loss compared to "system", and "cache (strict)" a
whopping 17%.


Conclusion and Recommendations
------------------------------

In the above experiments, the efficiency advantage of the "cache" affinity
scope over "system" is, while consistent and noticeable, small. However,
the impact is dependent on the distances between the scopes and may be more
pronounced in processors with more complex topologies.

While the loss of work-conservation in certain scenarios hurts "cache", it
is still a lot better than "cache (strict)", and maximizing workqueue
utilization is unlikely to be the common case anyway. As such, "cache" is
the default affinity scope for unbound pools.

* As there is no one option that is great for most cases, workqueue usages
  that may consume a significant amount of CPU are recommended to configure
  the workqueues using ``apply_workqueue_attrs()`` and/or enable
  ``WQ_SYSFS``.

* An unbound workqueue with strict "cpu" affinity scope behaves the same as
  a ``WQ_CPU_INTENSIVE`` per-cpu workqueue. There is no real advantage to
  the latter, and an unbound workqueue provides a lot more flexibility.

* Affinity scopes were introduced in Linux v6.5. To emulate the previous
  behavior, use the strict "numa" affinity scope.

* The loss of work-conservation in non-strict affinity scopes likely
  originates from the scheduler. There is no theoretical reason why the
  kernel wouldn't be able to do the right thing and maintain
  work-conservation in most cases. As such, it is possible that future
  scheduler improvements may make most of these tunables unnecessary.
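
For a workqueue already exposed via ``WQ_SYSFS``, the affinity scope can be
adjusted at runtime along the following lines. This is a sketch that needs
root; ``writeback`` is used merely as an example of a workqueue that
appears under sysfs, and is not a recommendation to retune it.

```shell
# Sketch: adjusting the affinity scope of a WQ_SYSFS unbound workqueue.
# "writeback" is just an example of an already-exposed workqueue.
wq=/sys/devices/virtual/workqueue/writeback

# Show the current affinity scope.
cat "$wq/affinity_scope"

# Switch to the "cache" scope, then make it strict.
echo cache > "$wq/affinity_scope"
echo 1 > "$wq/affinity_strict"

# Revert to following the system-wide default scope.
echo default > "$wq/affinity_scope"
echo 0 > "$wq/affinity_strict"
```

The system-wide default for unbound workqueues can also be set at boot with
the ``workqueue.default_affinity_scope`` kernel parameter.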


Examining Configuration
=======================
