Commit f169c62

Mel Gorman authored and Ingo Molnar committed
sched/numa: Complete scanning of inactive VMAs when there is no alternative
VMAs are skipped if there is no recent fault activity, but this represents a chicken-and-egg problem: there may be no fault activity if the PTEs are never updated to trap NUMA hints. There is an indirect reliance on scanning being forced early in the lifetime of a task, but this may fail to detect changes in phase behaviour. Force inactive VMAs to be scanned when all other eligible VMAs have been updated within the same scan sequence.

Test results in general look good, with some changes in performance, both negative and positive, depending on whether the additional scanning and faulting was beneficial to the workload. The autonuma benchmark workload NUMA01_THREADLOCAL was picked for closer examination. The workload creates two processes with numerous threads and thread-local storage that is zero-filled in a loop. It exercises the corner case where unrelated threads may skip VMAs that are thread-local to another thread, and it still has some VMAs that are inactive while the workload executes.

The VMA skipping activity frequency with and without the patch:

	6.6.0-rc2-sched-numabtrace-v1
	=============================
	     649 reason=scan_delay
	   9,094 reason=unsuitable
	  48,915 reason=shared_ro
	 143,919 reason=inaccessible
	 193,050 reason=pid_inactive

	6.6.0-rc2-sched-numabselective-v1
	=================================
	     146 reason=seq_completed
	     622 reason=ignore_pid_inactive
	     624 reason=scan_delay
	   6,570 reason=unsuitable
	  16,101 reason=shared_ro
	  27,608 reason=inaccessible
	  41,939 reason=pid_inactive

Note that with the patch applied, the PID activity is ignored (ignore_pid_inactive) to ensure a VMA with some activity is completely scanned. In addition, a small number of VMAs are scanned when no other eligible VMA is available during a single scan window (seq_completed). The number of times a VMA is skipped due to no PID activity from the scanning task (pid_inactive) drops dramatically.
It is expected that this will increase the number of PTEs updated for NUMA hinting faults, as well as the hinting faults themselves, but these represent PTEs that would otherwise have been missed. The tradeoff is scan+fault overhead versus improved locality due to migration. On a 2-socket Cascade Lake test machine, the time to complete the workload is as follows:

	                                        6.6.0-rc2               6.6.0-rc2
	                              sched-numabtrace-v1 sched-numabselective-v1
	Min      elsp-NUMA01_THREADLOCAL   174.22 (   0.00%)   117.64 (  32.48%)
	Amean    elsp-NUMA01_THREADLOCAL   175.68 (   0.00%)   123.34 *  29.79%*
	Stddev   elsp-NUMA01_THREADLOCAL     1.20 (   0.00%)     4.06 (-238.20%)
	CoeffVar elsp-NUMA01_THREADLOCAL     0.68 (   0.00%)     3.29 (-381.70%)
	Max      elsp-NUMA01_THREADLOCAL   177.18 (   0.00%)   128.03 (  27.74%)

The time to complete the workload is reduced by almost 30%:

	                            6.6.0-rc2               6.6.0-rc2
	                  sched-numabtrace-v1 sched-numabselective-v1
	Duration User                91201.80                63506.64
	Duration System               2015.53                 1819.78
	Duration Elapsed              1234.77                  868.37

In this specific case, system CPU time was not increased, but that is not universally true. From vmstat, the NUMA scanning and fault activity is as follows:

	                                            6.6.0-rc2               6.6.0-rc2
	                                  sched-numabtrace-v1 sched-numabselective-v1
	Ops NUMA base-page range updates             64272.00             26374386.00
	Ops NUMA PTE updates                         36624.00                55538.00
	Ops NUMA PMD updates                            54.00                51404.00
	Ops NUMA hint faults                         15504.00                75786.00
	Ops NUMA hint local faults %                 14860.00                56763.00
	Ops NUMA hint local percent                     95.85                   74.90
	Ops NUMA pages migrated                       1629.00              6469222.00

Both the number of PTE updates and the number of hint faults are dramatically increased. While this is superficially unfortunate, it represents ranges that were simply skipped without the patch. As a result of the scanning and hinting faults, many more pages were also migrated, but as the time to completion is reduced, the overhead is offset by the gain.
Signed-off-by: Mel Gorman <mgorman@techsingularity.net>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Tested-by: Raghavendra K T <raghavendra.kt@amd.com>
Link: https://lore.kernel.org/r/20231010083143.19593-7-mgorman@techsingularity.net
1 parent b7a5b53 commit f169c62

4 files changed: 61 additions & 4 deletions

include/linux/mm_types.h

Lines changed: 6 additions & 0 deletions
@@ -575,6 +575,12 @@ struct vma_numab_state {
 	 * every VMA_PID_RESET_PERIOD jiffies:
 	 */
 	unsigned long pids_active[2];
+
+	/*
+	 * MM scan sequence ID when the VMA was last completely scanned.
+	 * A VMA is not eligible for scanning if prev_scan_seq == numa_scan_seq
+	 */
+	int prev_scan_seq;
 };
 
 /*
include/linux/sched/numa_balancing.h

Lines changed: 1 addition & 0 deletions
@@ -22,6 +22,7 @@ enum numa_vmaskip_reason {
 	NUMAB_SKIP_SCAN_DELAY,
 	NUMAB_SKIP_PID_INACTIVE,
 	NUMAB_SKIP_IGNORE_PID,
+	NUMAB_SKIP_SEQ_COMPLETED,
 };
 
 #ifdef CONFIG_NUMA_BALANCING

include/trace/events/sched.h

Lines changed: 2 additions & 1 deletion
@@ -671,7 +671,8 @@ DEFINE_EVENT(sched_numa_pair_template, sched_swap_numa,
 	EM( NUMAB_SKIP_INACCESSIBLE,		"inaccessible" )	\
 	EM( NUMAB_SKIP_SCAN_DELAY,		"scan_delay" )		\
 	EM( NUMAB_SKIP_PID_INACTIVE,		"pid_inactive" )	\
-	EMe(NUMAB_SKIP_IGNORE_PID,		"ignore_pid_inactive" )
+	EM( NUMAB_SKIP_IGNORE_PID,		"ignore_pid_inactive" )	\
+	EMe(NUMAB_SKIP_SEQ_COMPLETED,		"seq_completed" )
 
 /* Redefine for export. */
 #undef EM

kernel/sched/fair.c

Lines changed: 52 additions & 3 deletions
@@ -3158,6 +3158,8 @@ static void task_numa_work(struct callback_head *work)
 	unsigned long nr_pte_updates = 0;
 	long pages, virtpages;
 	struct vma_iterator vmi;
+	bool vma_pids_skipped;
+	bool vma_pids_forced = false;
 
 	SCHED_WARN_ON(p != container_of(work, struct task_struct, numa_work));
 
@@ -3200,7 +3202,6 @@ static void task_numa_work(struct callback_head *work)
 	 */
 	p->node_stamp += 2 * TICK_NSEC;
 
-	start = mm->numa_scan_offset;
 	pages = sysctl_numa_balancing_scan_size;
 	pages <<= 20 - PAGE_SHIFT; /* MB in pages */
 	virtpages = pages * 8;	   /* Scan up to this much virtual space */
@@ -3210,6 +3211,16 @@ static void task_numa_work(struct callback_head *work)
 
 	if (!mmap_read_trylock(mm))
 		return;
+
+	/*
+	 * VMAs are skipped if the current PID has not trapped a fault within
+	 * the VMA recently. Allow scanning to be forced if there is no
+	 * suitable VMA remaining.
+	 */
+	vma_pids_skipped = false;
+
+retry_pids:
+	start = mm->numa_scan_offset;
 	vma_iter_init(&vmi, mm, start);
 	vma = vma_next(&vmi);
 	if (!vma) {
@@ -3260,6 +3271,13 @@ static void task_numa_work(struct callback_head *work)
 			/* Reset happens after 4 times scan delay of scan start */
 			vma->numab_state->pids_active_reset = vma->numab_state->next_scan +
 				msecs_to_jiffies(VMA_PID_RESET_PERIOD);
+
+			/*
+			 * Ensure prev_scan_seq does not match numa_scan_seq,
+			 * to prevent VMAs being skipped prematurely on the
+			 * first scan:
+			 */
+			vma->numab_state->prev_scan_seq = mm->numa_scan_seq - 1;
 		}
 
 		/*
@@ -3281,8 +3299,19 @@ static void task_numa_work(struct callback_head *work)
 			vma->numab_state->pids_active[1] = 0;
 		}
 
-		/* Do not scan the VMA if task has not accessed */
-		if (!vma_is_accessed(mm, vma)) {
+		/* Do not rescan VMAs twice within the same sequence. */
+		if (vma->numab_state->prev_scan_seq == mm->numa_scan_seq) {
+			mm->numa_scan_offset = vma->vm_end;
+			trace_sched_skip_vma_numa(mm, vma, NUMAB_SKIP_SEQ_COMPLETED);
+			continue;
+		}
+
+		/*
+		 * Do not scan the VMA if task has not accessed it, unless no other
+		 * VMA candidate exists.
+		 */
+		if (!vma_pids_forced && !vma_is_accessed(mm, vma)) {
+			vma_pids_skipped = true;
 			trace_sched_skip_vma_numa(mm, vma, NUMAB_SKIP_PID_INACTIVE);
 			continue;
 		}
@@ -3311,8 +3340,28 @@ static void task_numa_work(struct callback_head *work)
 
 			cond_resched();
 		} while (end != vma->vm_end);
+
+		/* VMA scan is complete, do not scan until next sequence. */
+		vma->numab_state->prev_scan_seq = mm->numa_scan_seq;
+
+		/*
+		 * Only force scan within one VMA at a time, to limit the
+		 * cost of scanning a potentially uninteresting VMA.
+		 */
+		if (vma_pids_forced)
+			break;
 	} for_each_vma(vmi, vma);
 
+	/*
+	 * If no VMAs are remaining and VMAs were skipped due to the PID
+	 * not accessing the VMA previously, then force a scan to ensure
+	 * forward progress:
+	 */
+	if (!vma && !vma_pids_forced && vma_pids_skipped) {
+		vma_pids_forced = true;
+		goto retry_pids;
+	}
+
 out:
 	/*
 	 * It is possible to reach the end of the VMA list but the last few
