
Commit d7597f5

Stefan Roesch authored and akpm00 committed
mm: add new api to enable ksm per process
Patch series "mm: process/cgroup ksm support", v9.

So far KSM can only be enabled by calling madvise for memory regions. To be able to use KSM for more workloads, KSM needs to have the ability to be enabled / disabled at the process / cgroup level.

Use case 1:
The madvise call is not available in the programming language. Examples of this are programs with forked workloads using a garbage-collected language without pointers. In such a language madvise cannot be made available. In addition, the addresses of objects get moved around as they are garbage collected. KSM sharing needs to be enabled "from the outside" for these types of workloads.

Use case 2:
The same interpreter can also be used for workloads where KSM brings no benefit or even has overhead. We'd like to be able to enable KSM on a workload-by-workload basis.

Use case 3:
With the madvise call, sharing opportunities are only enabled for the current process: it is a workload-local decision. A considerable number of sharing opportunities may exist across multiple workloads or jobs (if they are part of the same security domain). Only a higher-level entity like a job scheduler or container can know for certain if it's running one or more instances of a job. That job scheduler, however, doesn't have the necessary internal workload knowledge to make targeted madvise calls.

Security concerns:
In previous discussions security concerns have been brought up. The problem is that an individual workload does not have the knowledge about what else is running on a machine. Therefore it has to be very conservative in what memory areas can be shared or not. However, if the system is dedicated to running multiple jobs within the same security domain, it's the job scheduler that has the knowledge that sharing can be safely enabled and is even desirable.

Performance:
Experiments using UKSM have shown a capacity increase of around 20%.

Here are the metrics from an Instagram workload (taken from a machine with 64GB main memory):

full_scans: 445
general_profit: 20158298048
max_page_sharing: 256
merge_across_nodes: 1
pages_shared: 129547
pages_sharing: 5119146
pages_to_scan: 4000
pages_unshared: 1760924
pages_volatile: 10761341
run: 1
sleep_millisecs: 20
stable_node_chains: 167
stable_node_chains_prune_millisecs: 2000
stable_node_dups: 2751
use_zero_pages: 0
zero_pages_sharing: 0

After the service has been running for 30 minutes to an hour, 4 to 5 million shared pages are common for this workload when using KSM.

Detailed changes:

1. New options for prctl system command
This patch series adds two new options to the prctl system call. The first one enables KSM at the process level and the second one queries the setting. The setting will be inherited by child processes. With the above setting, KSM can be enabled for the seed process of a cgroup and all processes in the cgroup will inherit the setting.

2. Changes to KSM processing
When KSM is enabled at the process level, the KSM code will iterate over all the VMAs and enable KSM for the eligible VMAs. When forking a process that has KSM enabled, the setting will be inherited by the new child process.

3. Add general_profit metric
The general_profit metric of KSM is specified in the documentation, but not calculated. This adds the general profit metric to /sys/kernel/debug/mm/ksm.

4. Add more metrics to ksm_stat
This adds the process profit metric to /proc/<pid>/ksm_stat.

5. Add more tests to ksm_tests and ksm_functional_tests
This adds an option to specify the merge type to the ksm_tests. This allows testing of both madvise and prctl KSM. It also adds two new tests to ksm_functional_tests: one to test the new prctl options and the other a fork test to verify that the KSM process setting is inherited by child processes.

This patch (of 3):

So far KSM can only be enabled by calling madvise for memory regions. To be able to use KSM for more workloads, KSM needs to have the ability to be enabled / disabled at the process / cgroup level.

1. New options for prctl system command
This patch series adds two new options to the prctl system call. The first one enables KSM at the process level and the second one queries the setting. The setting will be inherited by child processes. With the above setting, KSM can be enabled for the seed process of a cgroup and all processes in the cgroup will inherit the setting.

2. Changes to KSM processing
When KSM is enabled at the process level, the KSM code will iterate over all the VMAs and enable KSM for the eligible VMAs. When forking a process that has KSM enabled, the setting will be inherited by the new child process.

1) Introduce new MMF_VM_MERGE_ANY flag
This introduces the new MMF_VM_MERGE_ANY flag. When this flag is set, kernel samepage merging (ksm) gets enabled for all VMAs of a process.

2) Setting VM_MERGEABLE on VMA creation
When a VMA is created, if the MMF_VM_MERGE_ANY flag is set, the VM_MERGEABLE flag will be set for this VMA.

3) Support disabling of ksm for a process
This adds the ability to disable ksm for a process if ksm has been enabled for the process with prctl.

4) Add new prctl option to get and set ksm for a process
This adds two new options to the prctl system call:
- enable ksm for all VMAs of a process (if the VMAs support it).
- query if ksm has been enabled for a process.

3. Disabling MMF_VM_MERGE_ANY for storage keys in s390
In the s390 architecture, when storage keys are used, MMF_VM_MERGE_ANY will be disabled.

Link: https://lkml.kernel.org/r/20230418051342.1919757-1-shr@devkernel.io
Link: https://lkml.kernel.org/r/20230418051342.1919757-2-shr@devkernel.io
Signed-off-by: Stefan Roesch <shr@devkernel.io>
Acked-by: David Hildenbrand <david@redhat.com>
Cc: David Hildenbrand <david@redhat.com>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Michal Hocko <mhocko@suse.com>
Cc: Rik van Riel <riel@surriel.com>
Cc: Bagas Sanjaya <bagasdotme@gmail.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
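
For illustration, the two prctl options described in the commit message could be exercised from userspace roughly as follows. This is a minimal sketch, not part of the patch: the values 67 and 68 are taken from the include/uapi/linux/prctl.h change in this commit, while the fallback #defines and the helper names ksm_set_merge_any(), ksm_get_merge_any() and ksm_report() are assumptions made for the example. On a kernel without this change (or built without CONFIG_KSM) the calls are expected to fail, and the helpers then return -1 with errno set.

```c
#include <errno.h>
#include <stdio.h>
#include <string.h>
#include <sys/prctl.h>

/* Values from the prctl.h hunk of this commit; older headers lack them. */
#ifndef PR_SET_MEMORY_MERGE
#define PR_SET_MEMORY_MERGE 67
#endif
#ifndef PR_GET_MEMORY_MERGE
#define PR_GET_MEMORY_MERGE 68
#endif

/*
 * Hypothetical helper: enable (non-zero) or disable (0) process-wide KSM.
 * Returns 0 on success, -1 with errno set on failure. The unused prctl
 * arguments must be zero, otherwise the kernel rejects the call.
 */
static int ksm_set_merge_any(int enable)
{
	return prctl(PR_SET_MEMORY_MERGE, enable ? 1UL : 0UL, 0UL, 0UL, 0UL);
}

/*
 * Hypothetical helper: query the setting. The state is delivered as the
 * return value of prctl itself: 1 if enabled, 0 if not, -1 on error.
 */
static int ksm_get_merge_any(void)
{
	return prctl(PR_GET_MEMORY_MERGE, 0UL, 0UL, 0UL, 0UL);
}

/* Hypothetical helper: print the current state and return the raw result. */
static int ksm_report(void)
{
	int state = ksm_get_merge_any();

	if (state < 0)
		printf("PR_GET_MEMORY_MERGE failed: %s\n", strerror(errno));
	else
		printf("process-wide KSM: %s\n", state ? "enabled" : "disabled");
	return state;
}
```

A job launcher could call ksm_set_merge_any(1) in the seed process before forking workers; per the commit message, child processes inherit the setting.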
1 parent 2124f79 commit d7597f5

7 files changed

Lines changed: 146 additions & 19 deletions


arch/s390/mm/gmap.c

Lines changed: 7 additions & 0 deletions
@@ -2591,6 +2591,13 @@ int gmap_mark_unmergeable(void)
 	int ret;
 	VMA_ITERATOR(vmi, mm, 0);
 
+	/*
+	 * Make sure to disable KSM (if enabled for the whole process or
+	 * individual VMAs). Note that nothing currently hinders user space
+	 * from re-enabling it.
+	 */
+	clear_bit(MMF_VM_MERGE_ANY, &mm->flags);
+
 	for_each_vma(vmi, vma) {
 		/* Copy vm_flags to avoid partial modifications in ksm_madvise */
 		vm_flags = vma->vm_flags;

include/linux/ksm.h

Lines changed: 19 additions & 2 deletions
@@ -18,13 +18,26 @@
 #ifdef CONFIG_KSM
 int ksm_madvise(struct vm_area_struct *vma, unsigned long start,
 		unsigned long end, int advice, unsigned long *vm_flags);
+
+void ksm_add_vma(struct vm_area_struct *vma);
+int ksm_enable_merge_any(struct mm_struct *mm);
+
 int __ksm_enter(struct mm_struct *mm);
 void __ksm_exit(struct mm_struct *mm);
 
 static inline int ksm_fork(struct mm_struct *mm, struct mm_struct *oldmm)
 {
-	if (test_bit(MMF_VM_MERGEABLE, &oldmm->flags))
-		return __ksm_enter(mm);
+	int ret;
+
+	if (test_bit(MMF_VM_MERGEABLE, &oldmm->flags)) {
+		ret = __ksm_enter(mm);
+		if (ret)
+			return ret;
+	}
+
+	if (test_bit(MMF_VM_MERGE_ANY, &oldmm->flags))
+		set_bit(MMF_VM_MERGE_ANY, &mm->flags);
+
 	return 0;
 }
 
@@ -57,6 +70,10 @@ void collect_procs_ksm(struct page *page, struct list_head *to_kill,
 #endif
 #else  /* !CONFIG_KSM */
 
+static inline void ksm_add_vma(struct vm_area_struct *vma)
+{
+}
+
 static inline int ksm_fork(struct mm_struct *mm, struct mm_struct *oldmm)
 {
 	return 0;
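
The ksm_fork() change above is what carries MMF_VM_MERGE_ANY from parent to child. A userspace check of that inheritance might look like the sketch below. It is only an illustration under assumptions: the helper name child_inherits_merge_any() is hypothetical, the fallback #defines mirror the values added to prctl.h in this commit, and the helper reports -1 whenever the kernel does not support the option.

```c
#include <sys/prctl.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

/* Values from the prctl.h hunk of this commit; older headers lack them. */
#ifndef PR_SET_MEMORY_MERGE
#define PR_SET_MEMORY_MERGE 67
#endif
#ifndef PR_GET_MEMORY_MERGE
#define PR_GET_MEMORY_MERGE 68
#endif

/*
 * Hypothetical helper: returns 1 if a forked child observes the
 * process-wide KSM setting, 0 if it does not, and -1 if the option is
 * unsupported or fork/wait fails.
 */
static int child_inherits_merge_any(void)
{
	pid_t pid;
	int status;

	/* Enable KSM for the calling process first. */
	if (prctl(PR_SET_MEMORY_MERGE, 1UL, 0UL, 0UL, 0UL) != 0)
		return -1;

	pid = fork();
	if (pid < 0)
		return -1;

	if (pid == 0) {
		/* Child: report the inherited setting through the exit code. */
		int state = prctl(PR_GET_MEMORY_MERGE, 0UL, 0UL, 0UL, 0UL);

		_exit(state == 1 ? 0 : 1);
	}

	/* Parent: collect the child's verdict. */
	if (waitpid(pid, &status, 0) != pid || !WIFEXITED(status))
		return -1;

	return WEXITSTATUS(status) == 0 ? 1 : 0;
}
```

This follows the same idea as the fork test that the series adds to ksm_functional_tests, but it is not that test.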

include/linux/sched/coredump.h

Lines changed: 1 addition & 0 deletions
@@ -90,4 +90,5 @@ static inline int get_dumpable(struct mm_struct *mm)
 #define MMF_INIT_MASK		(MMF_DUMPABLE_MASK | MMF_DUMP_FILTER_MASK |\
 				 MMF_DISABLE_THP_MASK | MMF_HAS_MDWE_MASK)
 
+#define MMF_VM_MERGE_ANY	29
 #endif /* _LINUX_SCHED_COREDUMP_H */

include/uapi/linux/prctl.h

Lines changed: 2 additions & 0 deletions
@@ -292,4 +292,6 @@ struct prctl_mm_map {
 
 #define PR_GET_AUXV			0x41555856
 
+#define PR_SET_MEMORY_MERGE		67
+#define PR_GET_MEMORY_MERGE		68
 #endif /* _LINUX_PRCTL_H */

kernel/sys.c

Lines changed: 27 additions & 0 deletions
@@ -15,6 +15,7 @@
 #include <linux/highuid.h>
 #include <linux/fs.h>
 #include <linux/kmod.h>
+#include <linux/ksm.h>
 #include <linux/perf_event.h>
 #include <linux/resource.h>
 #include <linux/kernel.h>
@@ -2687,6 +2688,32 @@ SYSCALL_DEFINE5(prctl, int, option, unsigned long, arg2, unsigned long, arg3,
 	case PR_SET_VMA:
 		error = prctl_set_vma(arg2, arg3, arg4, arg5);
 		break;
+#ifdef CONFIG_KSM
+	case PR_SET_MEMORY_MERGE:
+		if (arg3 || arg4 || arg5)
+			return -EINVAL;
+		if (mmap_write_lock_killable(me->mm))
+			return -EINTR;
+
+		if (arg2) {
+			error = ksm_enable_merge_any(me->mm);
+		} else {
+			/*
+			 * TODO: we might want disable KSM on all VMAs and
+			 * trigger unsharing to completely disable KSM.
+			 */
+			clear_bit(MMF_VM_MERGE_ANY, &me->mm->flags);
+			error = 0;
+		}
+		mmap_write_unlock(me->mm);
+		break;
+	case PR_GET_MEMORY_MERGE:
+		if (arg2 || arg3 || arg4 || arg5)
+			return -EINVAL;
+
+		error = !!test_bit(MMF_VM_MERGE_ANY, &me->mm->flags);
+		break;
+#endif
 	default:
 		error = -EINVAL;
 		break;

mm/ksm.c

Lines changed: 87 additions & 17 deletions
@@ -515,6 +515,28 @@ static int break_ksm(struct vm_area_struct *vma, unsigned long addr)
 	return (ret & VM_FAULT_OOM) ? -ENOMEM : 0;
 }
 
+static bool vma_ksm_compatible(struct vm_area_struct *vma)
+{
+	if (vma->vm_flags & (VM_SHARED  | VM_MAYSHARE   | VM_PFNMAP  |
+			     VM_IO      | VM_DONTEXPAND | VM_HUGETLB |
+			     VM_MIXEDMAP))
+		return false;		/* just ignore the advice */
+
+	if (vma_is_dax(vma))
+		return false;
+
+#ifdef VM_SAO
+	if (vma->vm_flags & VM_SAO)
+		return false;
+#endif
+#ifdef VM_SPARC_ADI
+	if (vma->vm_flags & VM_SPARC_ADI)
+		return false;
+#endif
+
+	return true;
+}
+
 static struct vm_area_struct *find_mergeable_vma(struct mm_struct *mm,
 		unsigned long addr)
 {
@@ -1026,6 +1048,7 @@ static int unmerge_and_remove_all_rmap_items(void)
 
 			mm_slot_free(mm_slot_cache, mm_slot);
 			clear_bit(MMF_VM_MERGEABLE, &mm->flags);
+			clear_bit(MMF_VM_MERGE_ANY, &mm->flags);
 			mmdrop(mm);
 		} else
 			spin_unlock(&ksm_mmlist_lock);
@@ -2408,6 +2431,7 @@ static struct ksm_rmap_item *scan_get_next_rmap_item(struct page **page)
 
 		mm_slot_free(mm_slot_cache, mm_slot);
 		clear_bit(MMF_VM_MERGEABLE, &mm->flags);
+		clear_bit(MMF_VM_MERGE_ANY, &mm->flags);
 		mmap_read_unlock(mm);
 		mmdrop(mm);
 	} else {
@@ -2485,6 +2509,66 @@ static int ksm_scan_thread(void *nothing)
 	return 0;
 }
 
+static void __ksm_add_vma(struct vm_area_struct *vma)
+{
+	unsigned long vm_flags = vma->vm_flags;
+
+	if (vm_flags & VM_MERGEABLE)
+		return;
+
+	if (vma_ksm_compatible(vma))
+		vm_flags_set(vma, VM_MERGEABLE);
+}
+
+/**
+ * ksm_add_vma - Mark vma as mergeable if compatible
+ *
+ * @vma:  Pointer to vma
+ */
+void ksm_add_vma(struct vm_area_struct *vma)
+{
+	struct mm_struct *mm = vma->vm_mm;
+
+	if (test_bit(MMF_VM_MERGE_ANY, &mm->flags))
+		__ksm_add_vma(vma);
+}
+
+static void ksm_add_vmas(struct mm_struct *mm)
+{
+	struct vm_area_struct *vma;
+
+	VMA_ITERATOR(vmi, mm, 0);
+	for_each_vma(vmi, vma)
+		__ksm_add_vma(vma);
+}
+
+/**
+ * ksm_enable_merge_any - Add mm to mm ksm list and enable merging on all
+ *                        compatible VMA's
+ *
+ * @mm:  Pointer to mm
+ *
+ * Returns 0 on success, otherwise error code
+ */
+int ksm_enable_merge_any(struct mm_struct *mm)
+{
+	int err;
+
+	if (test_bit(MMF_VM_MERGE_ANY, &mm->flags))
+		return 0;
+
+	if (!test_bit(MMF_VM_MERGEABLE, &mm->flags)) {
+		err = __ksm_enter(mm);
+		if (err)
+			return err;
+	}
+
+	set_bit(MMF_VM_MERGE_ANY, &mm->flags);
+	ksm_add_vmas(mm);
+
+	return 0;
+}
+
 int ksm_madvise(struct vm_area_struct *vma, unsigned long start,
 		unsigned long end, int advice, unsigned long *vm_flags)
 {
@@ -2493,25 +2577,10 @@ int ksm_madvise(struct vm_area_struct *vma, unsigned long start,
 
 	switch (advice) {
 	case MADV_MERGEABLE:
-		/*
-		 * Be somewhat over-protective for now!
-		 */
-		if (*vm_flags & (VM_MERGEABLE | VM_SHARED  | VM_MAYSHARE   |
-				 VM_PFNMAP    | VM_IO      | VM_DONTEXPAND |
-				 VM_HUGETLB | VM_MIXEDMAP))
-			return 0;		/* just ignore the advice */
-
-		if (vma_is_dax(vma))
+		if (vma->vm_flags & VM_MERGEABLE)
 			return 0;
-
-#ifdef VM_SAO
-		if (*vm_flags & VM_SAO)
+		if (!vma_ksm_compatible(vma))
 			return 0;
-#endif
-#ifdef VM_SPARC_ADI
-		if (*vm_flags & VM_SPARC_ADI)
-			return 0;
-#endif
 
 		if (!test_bit(MMF_VM_MERGEABLE, &mm->flags)) {
 			err = __ksm_enter(mm);
@@ -2615,6 +2684,7 @@ void __ksm_exit(struct mm_struct *mm)
 
 	if (easy_to_free) {
 		mm_slot_free(mm_slot_cache, mm_slot);
+		clear_bit(MMF_VM_MERGE_ANY, &mm->flags);
 		clear_bit(MMF_VM_MERGEABLE, &mm->flags);
 		mmdrop(mm);
 	} else if (mm_slot) {
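
To make the control flow of vma_ksm_compatible(), __ksm_add_vma() and ksm_add_vmas() easier to follow outside the kernel, here is a small userspace model. It is a sketch under loud assumptions: the M_* bit values are illustrative stand-ins and not the kernel's VM_* values, struct vma_model and the model_* function names are hypothetical, and the DAX and architecture-specific checks (VM_SAO, VM_SPARC_ADI) are left out.

```c
#include <stdbool.h>
#include <stddef.h>

/* Illustrative stand-in bits; these are NOT the kernel's VM_* values. */
#define M_SHARED     (1u << 0)
#define M_MAYSHARE   (1u << 1)
#define M_PFNMAP     (1u << 2)
#define M_IO         (1u << 3)
#define M_DONTEXPAND (1u << 4)
#define M_HUGETLB    (1u << 5)
#define M_MIXEDMAP   (1u << 6)
#define M_MERGEABLE  (1u << 7)

/* Hypothetical, heavily reduced stand-in for struct vm_area_struct. */
struct vma_model {
	unsigned int flags;
};

/* Models vma_ksm_compatible(): any excluded flag makes the VMA ineligible. */
static bool model_ksm_compatible(const struct vma_model *vma)
{
	const unsigned int excluded = M_SHARED | M_MAYSHARE | M_PFNMAP |
				      M_IO | M_DONTEXPAND | M_HUGETLB |
				      M_MIXEDMAP;

	return (vma->flags & excluded) == 0;
}

/* Models __ksm_add_vma(): skip already-mergeable VMAs, mark eligible ones. */
static void model_add_vma(struct vma_model *vma)
{
	if (vma->flags & M_MERGEABLE)
		return;

	if (model_ksm_compatible(vma))
		vma->flags |= M_MERGEABLE;
}

/*
 * Models ksm_add_vmas(): walk every VMA of the process. As a convenience
 * for the example, it returns how many VMAs are mergeable afterwards.
 */
static size_t model_add_vmas(struct vma_model *vmas, size_t n)
{
	size_t i, mergeable = 0;

	for (i = 0; i < n; i++) {
		model_add_vma(&vmas[i]);
		if (vmas[i].flags & M_MERGEABLE)
			mergeable++;
	}
	return mergeable;
}
```

The point of the refactoring in this file is visible in the model: the eligibility test lives in one function, so the madvise path and the new process-wide path apply the same rules.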

mm/mmap.c

Lines changed: 3 additions & 0 deletions
@@ -46,6 +46,7 @@
 #include <linux/pkeys.h>
 #include <linux/oom.h>
 #include <linux/sched/mm.h>
+#include <linux/ksm.h>
 
 #include <linux/uaccess.h>
 #include <asm/cacheflush.h>
@@ -2729,6 +2730,7 @@ unsigned long mmap_region(struct file *file, unsigned long addr,
 	if (file && vm_flags & VM_SHARED)
 		mapping_unmap_writable(file->f_mapping);
 	file = vma->vm_file;
+	ksm_add_vma(vma);
 expanded:
 	perf_event_mmap(vma);
 
@@ -3001,6 +3003,7 @@ static int do_brk_flags(struct vma_iterator *vmi, struct vm_area_struct *vma,
 		goto mas_store_fail;
 
 	mm->map_count++;
+	ksm_add_vma(vma);
 out:
 	perf_event_mmap(vma);
 	mm->total_vm += len >> PAGE_SHIFT;
