
Commit 7fa8a8e

Merge tag 'mm-stable-2023-04-27-15-30' of git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm
Pull MM updates from Andrew Morton:

 - Nick Piggin's "shoot lazy tlbs" series, to improve the performance of switching from a user process to a kernel thread.
 - More folio conversions from Kefeng Wang, Zhang Peng and Pankaj Raghav.
 - zsmalloc performance improvements from Sergey Senozhatsky.
 - Yue Zhao has found and fixed some data race issues around the alteration of memcg userspace tunables.
 - VFS rationalizations from Christoph Hellwig:
     - removal of most of the callers of write_one_page()
     - make __filemap_get_folio()'s return value more useful
 - Luis Chamberlain has changed tmpfs so it no longer requires swap backing. Use 'mount -o noswap'.
 - Qi Zheng has made the slab shrinkers operate locklessly, providing some scalability benefits.
 - Keith Busch has improved dmapool's performance, making part of its operations O(1) rather than O(n).
 - Peter Xu adds the UFFD_FEATURE_WP_UNPOPULATED feature to userfaultfd, permitting userspace to wr-protect anon memory unpopulated ptes.
 - Kirill Shutemov has changed MAX_ORDER's meaning to be inclusive rather than exclusive, and has fixed a bunch of errors which were caused by its unintuitive meaning.
 - Axel Rasmussen gives userfaultfd the UFFDIO_CONTINUE_MODE_WP feature, which causes minor faults to install a write-protected pte.
 - Vlastimil Babka has done some maintenance work on vma_merge(): cleanups to the kernel code and improvements to our userspace test harness.
 - Cleanups to do_fault_around() by Lorenzo Stoakes.
 - Mike Rapoport has moved a lot of initialization code out of various mm/ files and into mm/mm_init.c.
 - Lorenzo Stoakes removed vmf_insert_mixed_prot(), which was added for DRM, but DRM doesn't use it any more.
 - Lorenzo has also converted read_kcore() and vread() to use iterators and has thereby removed the use of bounce buffers in some cases.
 - Lorenzo has also contributed further cleanups of vma_merge().
 - Chaitanya Prakash provides some fixes to the mmap selftesting code.
 - Matthew Wilcox changes xfs and afs so they no longer take sleeping locks in ->map_pages(), a step towards RCUification of pagefaults.
 - Suren Baghdasaryan has improved mmap_lock scalability by switching to per-VMA locking.
 - Frederic Weisbecker has reworked the percpu cache draining so that it no longer causes latency glitches on cpu isolated workloads.
 - Mike Rapoport cleans up and corrects the ARCH_FORCE_MAX_ORDER Kconfig logic.
 - Liu Shixin has changed zswap's initialization so we no longer waste a chunk of memory if zswap is not being used.
 - Yosry Ahmed has improved the performance of memcg statistics flushing.
 - David Stevens has fixed several issues involving khugepaged, userfaultfd and shmem.
 - Christoph Hellwig has provided some cleanup work to zram's IO-related code paths.
 - David Hildenbrand has fixed up some issues in the selftest code's testing of our pte state changing.
 - Pankaj Raghav has made page_endio() unneeded and has removed it.
 - Peter Xu contributed some rationalizations of the userfaultfd selftests.
 - Yosry Ahmed has fixed an issue around memcg's page reclaim accounting.
 - Chaitanya Prakash has fixed some arm-related issues in the selftests/mm code.
 - Longlong Xia has improved the way in which KSM handles hwpoisoned pages.
 - Peter Xu fixes a few issues with uffd-wp at fork() time.
 - Stefan Roesch has changed KSM so that it may now be used on a per-process and per-cgroup basis.
* tag 'mm-stable-2023-04-27-15-30' of git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm: (369 commits)
  mm,unmap: avoid flushing TLB in batch if PTE is inaccessible
  shmem: restrict noswap option to initial user namespace
  mm/khugepaged: fix conflicting mods to collapse_file()
  sparse: remove unnecessary 0 values from rc
  mm: move 'mmap_min_addr' logic from callers into vm_unmapped_area()
  hugetlb: pte_alloc_huge() to replace huge pte_alloc_map()
  maple_tree: fix allocation in mas_sparse_area()
  mm: do not increment pgfault stats when page fault handler retries
  zsmalloc: allow only one active pool compaction context
  selftests/mm: add new selftests for KSM
  mm: add new KSM process and sysfs knobs
  mm: add new api to enable ksm per process
  mm: shrinkers: fix debugfs file permissions
  mm: don't check VMA write permissions if the PTE/PMD indicates write permissions
  migrate_pages_batch: fix statistics for longterm pin retry
  userfaultfd: use helper function range_in_vma()
  lib/show_mem.c: use for_each_populated_zone() simplify code
  mm: correct arg in reclaim_pages()/reclaim_clean_pages_from_list()
  fs/buffer: convert create_page_buffers to folio_create_buffers
  fs/buffer: add folio_create_empty_buffers helper
  ...
2 parents 91ec4b0 + 4d4b6d6 commit 7fa8a8e

306 files changed

Lines changed: 11767 additions & 8185 deletions


Documentation/ABI/testing/sysfs-kernel-mm-ksm

Lines changed: 8 additions & 0 deletions
@@ -51,3 +51,11 @@ Description: Control merging pages across different NUMA nodes.
 
 When it is set to 0 only pages from the same node are merged,
 otherwise pages from all nodes can be merged together (default).
+
+What: /sys/kernel/mm/ksm/general_profit
+Date: April 2023
+KernelVersion: 6.4
+Contact: Linux memory management mailing list <linux-mm@kvack.org>
+Description: Measure how effective KSM is.
+general_profit: how effective is KSM. The formula for the
+calculation is in Documentation/admin-guide/mm/ksm.rst.

Documentation/admin-guide/kdump/vmcoreinfo.rst

Lines changed: 3 additions & 3 deletions
@@ -172,7 +172,7 @@ variables.
 Offset of the free_list's member. This value is used to compute the number
 of free pages.
 
-Each zone has a free_area structure array called free_area[MAX_ORDER].
+Each zone has a free_area structure array called free_area[MAX_ORDER + 1].
 The free_list represents a linked list of free page blocks.
 
 (list_head, next|prev)
@@ -189,8 +189,8 @@ Offsets of the vmap_area's members. They carry vmalloc-specific
 information. Makedumpfile gets the start address of the vmalloc region
 from this.
 
-(zone.free_area, MAX_ORDER)
----------------------------
+(zone.free_area, MAX_ORDER + 1)
+-------------------------------
 
 Free areas descriptor. User-space tools use this value to iterate the
 free_area ranges. MAX_ORDER is used by the zone buddy allocator.
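
Illustrative note (not part of the commit): with MAX_ORDER now inclusive, a tool that walks the per-order free counts has to visit orders 0 through MAX_ORDER, which is MAX_ORDER + 1 entries. The Python sketch below shows that loop bound. It is not taken from makedumpfile or the kernel; `MAX_ORDER = 10` and the helper name `count_free_pages()` are assumptions made for the example.

```python
# Assumed value for illustration; the real MAX_ORDER depends on the
# architecture and kernel configuration.
MAX_ORDER = 10


def count_free_pages(nr_free_per_order):
    """Return the total number of free base pages.

    nr_free_per_order[i] is the number of free blocks of order i, mirroring
    free_area[i].nr_free. Because MAX_ORDER is inclusive, the list must hold
    MAX_ORDER + 1 entries (orders 0..MAX_ORDER).
    """
    if len(nr_free_per_order) != MAX_ORDER + 1:
        raise ValueError(
            "expected %d entries, got %d"
            % (MAX_ORDER + 1, len(nr_free_per_order))
        )
    # A free block of order i covers 2**i base pages.
    return sum(nr_free << order for order, nr_free in enumerate(nr_free_per_order))
```

A tool still written against the old exclusive meaning would pass only MAX_ORDER entries and silently skip the largest order; the length check above turns that mistake into an error.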

Documentation/admin-guide/kernel-parameters.txt

Lines changed: 1 addition & 1 deletion
@@ -4012,7 +4012,7 @@
 [KNL] Minimal page reporting order
 Format: <integer>
 Adjust the minimal page reporting order. The page
-reporting is disabled when it exceeds (MAX_ORDER-1).
+reporting is disabled when it exceeds MAX_ORDER.
 
 panic= [KNL] Kernel behaviour on panic: delay <timeout>
 timeout > 0: seconds before rebooting

Documentation/admin-guide/mm/ksm.rst

Lines changed: 4 additions & 1 deletion
@@ -157,6 +157,8 @@ stable_node_chains_prune_millisecs
 
 The effectiveness of KSM and MADV_MERGEABLE is shown in ``/sys/kernel/mm/ksm/``:
 
+general_profit
+how effective is KSM. The calculation is explained below.
 pages_shared
 how many shared pages are being used
 pages_sharing
@@ -207,7 +209,8 @@ several times, which are unprofitable memory consumed.
 ksm_rmap_items * sizeof(rmap_item).
 
 where ksm_merging_pages is shown under the directory ``/proc/<pid>/``,
-and ksm_rmap_items is shown in ``/proc/<pid>/ksm_stat``.
+and ksm_rmap_items is shown in ``/proc/<pid>/ksm_stat``. The process profit
+is also shown in ``/proc/<pid>/ksm_stat`` as ksm_process_profit.
 
 From the perspective of application, a high ratio of ``ksm_rmap_items`` to
 ``ksm_merging_pages`` means a bad madvise-applied policy, so developers or
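
Illustrative note (not part of the commit): the per-process profit referred to in this hunk can be approximated in user space from the two counters it names. The Python sketch below assumes the formula has the form `ksm_merging_pages * sizeof(page) - ksm_rmap_items * sizeof(rmap_item)`, that `/proc/<pid>/ksm_stat` holds one `name value` pair per line, and that a page is 4096 bytes and an rmap_item is 64 bytes. The helper names and both sizes are assumptions for the example, not values confirmed by the diff.

```python
def parse_ksm_stat(text):
    """Parse 'name value' lines (an optional ':' after the name is tolerated)."""
    stats = {}
    for line in text.splitlines():
        parts = line.replace(":", " ").split()
        if len(parts) == 2:
            stats[parts[0]] = int(parts[1])
    return stats


def process_profit(stats, page_size=4096, rmap_item_size=64):
    """Approximate bytes saved by KSM for one process.

    page_size and rmap_item_size are assumed defaults; both vary with the
    architecture and kernel build.
    """
    saved = stats["ksm_merging_pages"] * page_size
    overhead = stats["ksm_rmap_items"] * rmap_item_size
    return saved - overhead
```

On a kernel that already exports ksm_process_profit, comparing it with this estimate is one way to check the assumed sizes.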

Documentation/admin-guide/mm/userfaultfd.rst

Lines changed: 25 additions & 0 deletions
@@ -219,6 +219,31 @@ former will have ``UFFD_PAGEFAULT_FLAG_WP`` set, the latter
 you still need to supply a page when ``UFFDIO_REGISTER_MODE_MISSING`` was
 used.
 
+Userfaultfd write-protect mode currently behave differently on none ptes
+(when e.g. page is missing) over different types of memories.
+
+For anonymous memory, ``ioctl(UFFDIO_WRITEPROTECT)`` will ignore none ptes
+(e.g. when pages are missing and not populated). For file-backed memories
+like shmem and hugetlbfs, none ptes will be write protected just like a
+present pte. In other words, there will be a userfaultfd write fault
+message generated when writing to a missing page on file typed memories,
+as long as the page range was write-protected before. Such a message will
+not be generated on anonymous memories by default.
+
+If the application wants to be able to write protect none ptes on anonymous
+memory, one can pre-populate the memory with e.g. MADV_POPULATE_READ. On
+newer kernels, one can also detect the feature UFFD_FEATURE_WP_UNPOPULATED
+and set the feature bit in advance to make sure none ptes will also be
+write protected even upon anonymous memory.
+
+When using ``UFFDIO_REGISTER_MODE_WP`` in combination with either
+``UFFDIO_REGISTER_MODE_MISSING`` or ``UFFDIO_REGISTER_MODE_MINOR``, when
+resolving missing / minor faults with ``UFFDIO_COPY`` or ``UFFDIO_CONTINUE``
+respectively, it may be desirable for the new page / mapping to be
+write-protected (so future writes will also result in a WP fault). These ioctls
+support a mode flag (``UFFDIO_COPY_MODE_WP`` or ``UFFDIO_CONTINUE_MODE_WP``
+respectively) to configure the mapping this way.
+
 QEMU/KVM
 ========
 

Documentation/core-api/printk-formats.rst

Lines changed: 11 additions & 5 deletions
@@ -575,20 +575,26 @@ The field width is passed by value, the bitmap is passed by reference.
 Helper macros cpumask_pr_args() and nodemask_pr_args() are available to ease
 printing cpumask and nodemask.
 
-Flags bitfields such as page flags, gfp_flags
----------------------------------------------
+Flags bitfields such as page flags, page_type, gfp_flags
+--------------------------------------------------------
 
 ::
 
 %pGp 0x17ffffc0002036(referenced|uptodate|lru|active|private|node=0|zone=2|lastcpupid=0x1fffff)
+%pGt 0xffffff7f(buddy)
 %pGg GFP_USER|GFP_DMA32|GFP_NOWARN
 %pGv read|exec|mayread|maywrite|mayexec|denywrite
 
 For printing flags bitfields as a collection of symbolic constants that
 would construct the value. The type of flags is given by the third
-character. Currently supported are [p]age flags, [v]ma_flags (both
-expect ``unsigned long *``) and [g]fp_flags (expects ``gfp_t *``). The flag
-names and print order depends on the particular type.
+character. Currently supported are:
+
+- p - [p]age flags, expects value of type (``unsigned long *``)
+- t - page [t]ype, expects value of type (``unsigned int *``)
+- v - [v]ma_flags, expects value of type (``unsigned long *``)
+- g - [g]fp_flags, expects value of type (``gfp_t *``)
+
+The flag names and print order depends on the particular type.
 
 Note that this format should not be used directly in the
 :c:func:`TP_printk()` part of a tracepoint. Instead, use the show_*_flags()

Documentation/filesystems/locking.rst

Lines changed: 2 additions & 2 deletions
@@ -645,7 +645,7 @@ ops mmap_lock PageLocked(page)
 open: yes
 close: yes
 fault: yes can return with page locked
-map_pages: yes
+map_pages: read
 page_mkwrite: yes can return with page locked
 pfn_mkwrite: yes
 access: yes
@@ -661,7 +661,7 @@ locked. The VM will unlock the page.
 
 ->map_pages() is called when VM asks to map easy accessible pages.
 Filesystem should find and map pages associated with offsets from "start_pgoff"
-till "end_pgoff". ->map_pages() is called with page table locked and must
+till "end_pgoff". ->map_pages() is called with the RCU lock held and must
 not block. If it's not possible to reach a page without blocking,
 filesystem should skip it. Filesystem should use do_set_pte() to setup
 page table entry. Pointer to entry associated with the page is passed in

Documentation/filesystems/proc.rst

Lines changed: 8 additions & 0 deletions
@@ -996,6 +996,7 @@ Example output. You may not have all of these fields.
 VmallocUsed: 40444 kB
 VmallocChunk: 0 kB
 Percpu: 29312 kB
+EarlyMemtestBad: 0 kB
 HardwareCorrupted: 0 kB
 AnonHugePages: 4149248 kB
 ShmemHugePages: 0 kB
@@ -1146,6 +1147,13 @@ VmallocChunk
 Percpu
 Memory allocated to the percpu allocator used to back percpu
 allocations. This stat excludes the cost of metadata.
+EarlyMemtestBad
+The amount of RAM/memory in kB, that was identified as corrupted
+by early memtest. If memtest was not run, this field will not
+be displayed at all. Size is never rounded down to 0 kB.
+That means if 0 kB is reported, you can safely assume
+there was at least one pass of memtest and none of the passes
+found a single faulty byte of RAM.
 HardwareCorrupted
 The amount of RAM/memory in KB, the kernel identifies as
 corrupted.
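
Illustrative note (not part of the commit): because the new field is absent when memtest did not run, a monitoring script has to tell "missing" apart from "0 kB". The Python sketch below does this by returning `None` for a missing field. The function name `early_memtest_bad_kb()` is hypothetical, and the sketch assumes the usual `Name: value kB` layout of /proc/meminfo shown in the example output above.

```python
def early_memtest_bad_kb(meminfo_text):
    """Return the EarlyMemtestBad value in kB, or None if the field is absent.

    None means early memtest was not run; 0 means it ran and found no
    faulty memory.
    """
    for line in meminfo_text.splitlines():
        name, sep, rest = line.partition(":")
        if sep and name.strip() == "EarlyMemtestBad":
            fields = rest.split()
            if fields:
                return int(fields[0])
    return None
```

On a live system the argument would be the contents of /proc/meminfo, for example `early_memtest_bad_kb(open("/proc/meminfo").read())`.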

Documentation/filesystems/tmpfs.rst

Lines changed: 55 additions & 11 deletions
@@ -13,17 +13,29 @@ everything stored therein is lost.
 
 tmpfs puts everything into the kernel internal caches and grows and
 shrinks to accommodate the files it contains and is able to swap
-unneeded pages out to swap space. It has maximum size limits which can
-be adjusted on the fly via 'mount -o remount ...'
-
-If you compare it to ramfs (which was the template to create tmpfs)
-you gain swapping and limit checking. Another similar thing is the RAM
-disk (/dev/ram*), which simulates a fixed size hard disk in physical
-RAM, where you have to create an ordinary filesystem on top. Ramdisks
-cannot swap and you do not have the possibility to resize them.
-
-Since tmpfs lives completely in the page cache and on swap, all tmpfs
-pages will be shown as "Shmem" in /proc/meminfo and "Shared" in
+unneeded pages out to swap space, if swap was enabled for the tmpfs
+mount. tmpfs also supports THP.
+
+tmpfs extends ramfs with a few userspace configurable options listed and
+explained further below, some of which can be reconfigured dynamically on the
+fly using a remount ('mount -o remount ...') of the filesystem. A tmpfs
+filesystem can be resized but it cannot be resized to a size below its current
+usage. tmpfs also supports POSIX ACLs, and extended attributes for the
+trusted.* and security.* namespaces. ramfs does not use swap and you cannot
+modify any parameter for a ramfs filesystem. The size limit of a ramfs
+filesystem is how much memory you have available, and so care must be taken if
+used so to not run out of memory.
+
+An alternative to tmpfs and ramfs is to use brd to create RAM disks
+(/dev/ram*), which allows you to simulate a block device disk in physical RAM.
+To write data you would just then need to create an regular filesystem on top
+this ramdisk. As with ramfs, brd ramdisks cannot swap. brd ramdisks are also
+configured in size at initialization and you cannot dynamically resize them.
+Contrary to brd ramdisks, tmpfs has its own filesystem, it does not rely on the
+block layer at all.
+
+Since tmpfs lives completely in the page cache and optionally on swap,
+all tmpfs pages will be shown as "Shmem" in /proc/meminfo and "Shared" in
 free(1). Notice that these counters also include shared memory
 (shmem, see ipcs(1)). The most reliable way to get the count is
 using df(1) and du(1).
@@ -72,6 +84,8 @@ nr_inodes The maximum number of inodes for this instance. The default
 is half of the number of your physical RAM pages, or (on a
 machine with highmem) the number of lowmem RAM pages,
 whichever is the lower.
+noswap Disables swap. Remounts must respect the original settings.
+By default swap is enabled.
 ========= ============================================================
 
 These parameters accept a suffix k, m or g for kilo, mega and giga and
@@ -85,6 +99,36 @@ mount with such options, since it allows any user with write access to
 use up all the memory on the machine; but enhances the scalability of
 that instance in a system with many CPUs making intensive use of it.
 
+tmpfs also supports Transparent Huge Pages which requires a kernel
+configured with CONFIG_TRANSPARENT_HUGEPAGE and with huge supported for
+your system (has_transparent_hugepage(), which is architecture specific).
+The mount options for this are:
+
+====== ============================================================
+huge=0 never: disables huge pages for the mount
+huge=1 always: enables huge pages for the mount
+huge=2 within_size: only allocate huge pages if the page will be
+fully within i_size, also respect fadvise()/madvise() hints.
+huge=3 advise: only allocate huge pages if requested with
+fadvise()/madvise()
+====== ============================================================
+
+There is a sysfs file which you can also use to control system wide THP
+configuration for all tmpfs mounts, the file is:
+
+/sys/kernel/mm/transparent_hugepage/shmem_enabled
+
+This sysfs file is placed on top of THP sysfs directory and so is registered
+by THP code. It is however only used to control all tmpfs mounts with one
+single knob. Since it controls all tmpfs mounts it should only be used either
+for emergency or testing purposes. The values you can set for shmem_enabled are:
+
+== ============================================================
+-1 deny: disables huge on shm_mnt and all mounts, for
+emergency use
+-2 force: enables huge on shm_mnt and all mounts, w/o needing
+option, for testing
+== ============================================================
 
 tmpfs has a mount option to set the NUMA memory allocation policy for
 all files in that instance (if CONFIG_NUMA is enabled) - which can be

Documentation/mm/active_mm.rst

Lines changed: 6 additions & 0 deletions
@@ -2,6 +2,12 @@
 Active MM
 =========
 
+Note, the mm_count refcount may no longer include the "lazy" users
+(running tasks with ->active_mm == mm && ->mm == NULL) on kernels
+with CONFIG_MMU_LAZY_TLB_REFCOUNT=n. Taking and releasing these lazy
+references must be done with mmgrab_lazy_tlb() and mmdrop_lazy_tlb()
+helpers, which abstract this config option.
+
 ::
 
 List: linux-kernel
