Commit e72ef2d

Linus Walleij authored and Jonathan Corbet committed
Documentation/mm: Initial page table documentation
This is based on an earlier blog post at people.kernel.org, it describes the
concepts about page tables that were hardest for me to grasp when dealing
with them for the first time, such as the prevalent three-letter acronyms
pfn, pgd, p4d, pud, pmd and pte.

I don't know if this is what people want, but it's what I would have wanted.
The wording, introduction, choice of initial subjects and choice of style is
mine.

I discussed at one point with Mike Rapoport to bring this into the kernel
documentation, so here is a small proposal.

The current form is augmented in response to feedback from Mike Rapoport,
Matthew Wilcox, Jonathan Cameron, Kuan-Ying Lee, Randy Dunlap and
Bagas Sanjaya.

Cc: Matthew Wilcox <willy@infradead.org>
Reviewed-by: Mike Rapoport <rppt@kernel.org>
Reviewed-by: Jonathan Cameron <Jonathan.Cameron@huawei.com>
Link: https://people.kernel.org/linusw/arm32-page-tables
Signed-off-by: Linus Walleij <linus.walleij@linaro.org>
Signed-off-by: Jonathan Corbet <corbet@lwn.net>
Link: https://lore.kernel.org/r/20230614072548.996940-1-linus.walleij@linaro.org
1 parent d27e40b commit e72ef2d

1 file changed

Lines changed: 149 additions & 0 deletions

Documentation/mm/page_tables.rst

@@ -3,3 +3,152 @@
===========
Page Tables
===========

Paged virtual memory was invented along with virtual memory as a concept in
1962 on the Ferranti Atlas Computer, which was the first computer with paged
virtual memory. The feature migrated to newer computers and became a de facto
feature of all Unix-like systems as time went by. In 1985 the feature was
included in the Intel 80386, which was the CPU Linux 1.0 was developed on.

Page tables map virtual addresses as seen by the CPU into physical addresses
as seen on the external memory bus.

Linux defines page tables as a hierarchy which is currently five levels in
height. The architecture code for each supported architecture will then
map this to the restrictions of the hardware.

The physical address corresponding to the virtual address is often referenced
by the underlying physical page frame. The **page frame number** or **pfn**
is the physical address of the page (as seen on the external memory bus)
divided by `PAGE_SIZE`.

Physical memory address 0 will be *pfn 0* and the highest pfn will be
the last page of physical memory the external address bus of the CPU can
address.

With a page granularity of 4KB and an address range of 32 bits, pfn 0 is at
address 0x00000000, pfn 1 is at address 0x00001000, pfn 2 is at 0x00002000
and so on until we reach pfn 0xfffff at 0xfffff000. With 16KB pages pfns are
at 0x00004000, 0x00008000 ... 0xffffc000 and pfn goes from 0 to 0x3ffff.

As you can see, with 4KB pages the page base address uses bits 12-31 of the
address, and this is why `PAGE_SHIFT` in this case is defined as 12 and
`PAGE_SIZE` is usually defined in terms of the page shift as `(1 << PAGE_SHIFT)`.
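
The relationship between a physical address, `PAGE_SHIFT` and the pfn can be
sketched in a few lines of user-space C. This is only an illustration: the
value 12 assumes 4KB pages, and `phys_to_pfn()` and `pfn_to_phys()` are
hypothetical helper names, not kernel functions.

```c
#include <stdint.h>

/* Assumed for illustration: 4KB pages, as in the 32-bit example above. */
#define PAGE_SHIFT 12
#define PAGE_SIZE  (1UL << PAGE_SHIFT)

/* Hypothetical helper: dividing by PAGE_SIZE is a right shift by PAGE_SHIFT. */
static inline uint64_t phys_to_pfn(uint64_t phys)
{
	return phys >> PAGE_SHIFT;
}

/* Hypothetical helper: the base address of a page frame. */
static inline uint64_t pfn_to_phys(uint64_t pfn)
{
	return pfn << PAGE_SHIFT;
}
```

With these definitions, address 0x00001000 yields pfn 1 and address
0xfffff000 yields pfn 0xfffff, matching the figures above.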

Over time a deeper hierarchy has been developed in response to increasing memory
sizes. When Linux was created, 4KB pages and a single page table called
`swapper_pg_dir` with 1024 entries were used, covering 4MB, which coincided with
the fact that Torvalds's first computer had 4MB of physical memory. Entries in
this single table were referred to as *PTE*:s - page table entries.

The software page table hierarchy reflects the fact that page table hardware has
become hierarchical and that in turn is done to save page table memory and
speed up mapping.

One could of course imagine a single, linear page table with enormous amounts
of entries, breaking down the whole memory into single pages. Such a page table
would be very sparse, because large portions of the virtual memory usually
remain unused. By using hierarchical page tables large holes in the virtual
address space do not waste valuable page table memory, because it will suffice
to mark large areas as unmapped at a higher level in the page table hierarchy.
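
To get a feeling for how large a single linear page table would be, the
arithmetic can be sketched as below. The function name and its parameters are
made up for this illustration; the caller chooses the address width, page size
and entry size.

```c
#include <stdint.h>

/*
 * Bytes needed for a single-level (linear) page table that covers a
 * virtual address space of va_bits bits with pages of (1 << page_shift)
 * bytes and entries of entry_bytes bytes each. Hypothetical helper.
 */
static uint64_t flat_table_bytes(unsigned int va_bits, unsigned int page_shift,
				 unsigned int entry_bytes)
{
	uint64_t nr_entries = (uint64_t)1 << (va_bits - page_shift);

	return nr_entries * entry_bytes;
}
```

Assuming 32-bit addresses, 4KB pages and 4-byte entries this gives 4MB per
address space. Assuming 48-bit addresses, 4KB pages and 8-byte entries it
gives 512GB, which is why a hierarchy that can leave holes unpopulated is
needed.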

Additionally, on modern CPUs, a higher level page table entry can point directly
to a physical memory range, which allows mapping a contiguous range of several
megabytes or even gigabytes in a single high-level page table entry, taking
shortcuts in mapping virtual memory to physical memory: there is no need to
traverse deeper in the hierarchy when you find a large mapped range like this.
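
How much memory one entry covers depends on the level it sits at. The sketch
below assumes 4KB pages and 9 address bits (512 entries) per level, which is
one common layout and not something every architecture uses. The `EX_` names
and the level numbering are hypothetical.

```c
#include <stdint.h>

/* Assumed layout for illustration: 4KB pages, 512 entries per table. */
#define EX_PAGE_SHIFT 12
#define EX_LEVEL_BITS 9

/*
 * Bytes of virtual memory covered by one entry at a given level,
 * counting from the bottom: 0 = lowest level, 1 = one level up, 2 = two up.
 */
static uint64_t ex_entry_span(unsigned int level)
{
	return (uint64_t)1 << (EX_PAGE_SHIFT + level * EX_LEVEL_BITS);
}
```

Under these assumptions an entry at the lowest level covers 4KB, an entry one
level up covers 2MB, and an entry two levels up covers 1GB, which is where the
"several megabytes or even gigabytes" in a single entry come from.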

The page table hierarchy has now developed into this::

  +-----+
  | PGD |
  +-----+
     |
     |   +-----+
     +-->| P4D |
         +-----+
            |
            |   +-----+
            +-->| PUD |
                +-----+
                   |
                   |   +-----+
                   +-->| PMD |
                       +-----+
                          |
                          |   +-----+
                          +-->| PTE |
                              +-----+
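
One way to read the diagram is that a virtual address is cut into one index
per level plus an offset into the page. The sketch below assumes 4KB pages and
9 index bits at each of the five levels. Real architectures differ in both
respects, and the `ex_` names are hypothetical.

```c
#include <stdint.h>

/* Assumed layout for illustration: 4KB pages, 9 index bits per level. */
#define EX_PAGE_SHIFT 12
#define EX_LEVEL_BITS 9
#define EX_LEVEL_MASK ((1UL << EX_LEVEL_BITS) - 1)

/* Levels counted from the bottom of the hierarchy. */
enum ex_level { EX_PTE, EX_PMD, EX_PUD, EX_P4D, EX_PGD };

/* Index into the table at the given level for a virtual address. */
static unsigned int ex_index(uint64_t vaddr, enum ex_level level)
{
	return (vaddr >> (EX_PAGE_SHIFT + level * EX_LEVEL_BITS)) & EX_LEVEL_MASK;
}
```

For example, under these assumptions the address 0x201000 selects index 1 at
the pmd level and index 1 at the pte level, and index 0 at every level above.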


Symbols on the different levels of the page table hierarchy have the following
meaning beginning from the bottom:

- **pte**, `pte_t`, `pteval_t` = **Page Table Entry** - mentioned earlier.
  The *pte* is an array of `PTRS_PER_PTE` elements of the `pteval_t` type, each
  mapping a single page of virtual memory to a single page of physical memory.
  The architecture defines the size and contents of `pteval_t`.

  A typical example is that the `pteval_t` is a 32- or 64-bit value with the
  upper bits being a **pfn** (page frame number), and the lower bits being some
  architecture-specific bits such as memory protection.

  The **entry** part of the name is a bit confusing because while in Linux 1.0
  this did refer to a single page table entry in the single top level page
  table, it was retrofitted to be an array of mapping elements when two-level
  page tables were first introduced, so the *pte* is the lowermost page
  *table*, not a page table *entry*.

- **pmd**, `pmd_t`, `pmdval_t` = **Page Middle Directory**, the hierarchy right
  above the *pte*, with `PTRS_PER_PMD` references to the *pte*:s.

- **pud**, `pud_t`, `pudval_t` = **Page Upper Directory** was introduced after
  the other levels to handle 4-level page tables. It is potentially unused,
  or *folded* as we will discuss later.

- **p4d**, `p4d_t`, `p4dval_t` = **Page Level 4 Directory** was introduced to
  handle 5-level page tables after the *pud* was introduced. Now it was clear
  that we needed to replace *pgd*, *pmd*, *pud* etc. with a figure indicating the
  directory level and that we cannot go on with ad hoc names any more. This
  is only used on systems which actually have 5 levels of page tables, otherwise
  it is folded.

- **pgd**, `pgd_t`, `pgdval_t` = **Page Global Directory** - the Linux kernel
  main page table handling the PGD for the kernel memory is still found in
  `swapper_pg_dir`, but each userspace process in the system also has its own
  memory context and thus its own *pgd*, found in `struct mm_struct` which
  in turn is referenced from each `struct task_struct`. So tasks have memory
  context in the form of a `struct mm_struct` and this in turn has a
  `pgd_t *pgd` pointer to the corresponding page global directory.
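
The "typical example" of a `pteval_t` given in the list above, with the pfn in
the upper bits and protection bits in the lower bits, can be sketched as
follows. The bit positions, the flag names and the `ex_` helpers are all
invented for this illustration; every architecture defines its own layout.

```c
#include <stdint.h>

typedef uint64_t ex_pteval_t;	/* hypothetical stand-in for pteval_t */

/* Assumed layout: low 12 bits are flags, the bits above hold the pfn. */
#define EX_PAGE_SHIFT     12
#define EX_PTE_FLAGS_MASK ((1ULL << EX_PAGE_SHIFT) - 1)
#define EX_PTE_PRESENT    (1ULL << 0)	/* invented flag position */
#define EX_PTE_WRITE      (1ULL << 1)	/* invented flag position */

/* Combine a pfn and flag bits into one entry value. */
static ex_pteval_t ex_mk_pte(uint64_t pfn, uint64_t flags)
{
	return (pfn << EX_PAGE_SHIFT) | (flags & EX_PTE_FLAGS_MASK);
}

/* Extract the pfn from the upper bits. */
static uint64_t ex_pte_pfn(ex_pteval_t val)
{
	return val >> EX_PAGE_SHIFT;
}

/* Test one of the low flag bits. */
static int ex_pte_present(ex_pteval_t val)
{
	return (val & EX_PTE_PRESENT) != 0;
}
```

In this toy layout, pfn 0x1234 with both flags set is stored as 0x1234003.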

To repeat: each level in the page table hierarchy is an *array of pointers*, so
the **pgd** contains `PTRS_PER_PGD` pointers to the next level below, **p4d**
contains `PTRS_PER_P4D` pointers to **pud** items and so on. The number of
pointers on each level is architecture-defined::

          PMD
  --> +-----+           PTE
      | ptr |-------> +-----+
      | ptr |-        | ptr |-------> PAGE
      | ptr | \       | ptr |
      | ptr |  \        ...
      | ... |   \
      | ptr |    \         PTE
      +-----+     +----> +-----+
                         | ptr |-------> PAGE
                         | ptr |
                           ...
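
The two lowest levels of the picture can be modelled in plain C as arrays of
pointers. This is a toy user-space model and not kernel code: the structure
names, the tiny table sizes and `ex_walk()` are all made up for the
illustration.

```c
#include <stddef.h>
#include <stdint.h>

/* Tiny, made-up table sizes so the model is easy to follow. */
#define EX_PAGE_SHIFT   12
#define EX_PTRS_PER_PTE 4
#define EX_PTRS_PER_PMD 4

/* Lowest level: each slot points to a page, or is NULL if unmapped. */
struct ex_pte_table {
	void *page[EX_PTRS_PER_PTE];
};

/* Level above: each slot points to a pte table, or is NULL for a hole. */
struct ex_pmd_table {
	struct ex_pte_table *pte[EX_PTRS_PER_PMD];
};

/* Walk both levels; returns the page, or NULL if nothing is mapped. */
static void *ex_walk(const struct ex_pmd_table *pmd, uint32_t vaddr)
{
	unsigned int page_nr = vaddr >> EX_PAGE_SHIFT;
	unsigned int pte_idx = page_nr % EX_PTRS_PER_PTE;
	unsigned int pmd_idx = (page_nr / EX_PTRS_PER_PTE) % EX_PTRS_PER_PMD;
	const struct ex_pte_table *pte = pmd->pte[pmd_idx];

	if (!pte)
		return NULL;	/* hole: the whole range is unmapped one level up */

	return pte->page[pte_idx];
}
```

A NULL slot in the upper table stands for a whole unmapped range, so no lower
table has to exist for it.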


Page Table Folding
==================

If the architecture does not use all the page table levels, they can be *folded*,
which means skipped, and all operations performed on page tables will be
adjusted at compile time to just skip a level when accessing the next lower
level.

Page table handling code that wishes to be architecture-neutral, such as the
virtual memory manager, will need to be written so that it traverses all of the
currently five levels. This style should also be preferred for
architecture-specific code, so as to be robust to future changes.
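
A minimal sketch of what folding a level can look like is given below. It is
loosely modelled on the pattern used by the generic "no p4d" header in
`include/asm-generic/`, but the `ex_` types and functions here are simplified
stand-ins written for this illustration and should not be read as the kernel's
actual definitions.

```c
#include <stdint.h>

typedef struct { uint64_t val; } ex_pgd_t;

/* Folded level: a "p4d entry" is nothing more than the pgd entry itself. */
typedef struct { ex_pgd_t pgd; } ex_p4d_t;

/* A folded level behaves like a table with a single entry. */
#define EX_PTRS_PER_P4D 1

/*
 * Generic code still calls the p4d step of the walk. When the level is
 * folded there is no table to index, so the same entry is handed back
 * and the compiler can optimize the step away.
 */
static inline ex_p4d_t *ex_p4d_offset(ex_pgd_t *pgd, uint64_t vaddr)
{
	(void)vaddr;	/* the address does not select anything here */
	return (ex_p4d_t *)pgd;
}
```

Code written against all five levels therefore keeps working unchanged when a
level is folded, because the folded step simply passes the upper entry through.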
