3 | 3 | =========== |
4 | 4 | Page Tables |
5 | 5 | =========== |
| 6 | + |
| 7 | +Paged virtual memory was invented along with virtual memory as a concept in |
| 8 | +1962 on the Ferranti Atlas Computer, which was the first computer with paged |
| 9 | +virtual memory. The feature migrated to newer computers and became a de facto |
| 10 | +feature of all Unix-like systems as time went by. In 1985 the feature was |
| 11 | +included in the Intel 80386, which was the CPU Linux 1.0 was developed on. |
| 12 | + |
| 13 | +Page tables map virtual addresses as seen by the CPU into physical addresses |
| 14 | +as seen on the external memory bus. |
| 15 | + |
| 16 | +Linux defines page tables as a hierarchy which is currently five levels in |
| 17 | +height. The architecture code for each supported architecture will then |
| 18 | +map this to the restrictions of the hardware. |
| 19 | + |
| 20 | +The physical address corresponding to the virtual address is often referenced |
| 21 | +by the underlying physical page frame. The **page frame number** or **pfn** |
| 22 | +is the physical address of the page (as seen on the external memory bus) |
| 23 | +divided by `PAGE_SIZE`. |
| 24 | + |
| 25 | +Physical memory address 0 will be *pfn 0* and the highest pfn will be |
| 26 | +the last page of physical memory the external address bus of the CPU can |
| 27 | +address. |
| 28 | + |
| 29 | +With a page granularity of 4KB and an address range of 32 bits, pfn 0 is at |
| 30 | +address 0x00000000, pfn 1 is at address 0x00001000, pfn 2 is at 0x00002000 |
| 31 | +and so on until we reach pfn 0xfffff at 0xfffff000. With 16KB pages pfns are |
| 32 | +at 0x00004000, 0x00008000 ... 0xffffc000 and pfn goes from 0 to 0x3ffff. |
| 33 | + |
| 34 | +As you can see, with 4KB pages the page base address uses bits 12-31 of the |
| 35 | +address, and this is why `PAGE_SHIFT` in this case is defined as 12 and |
| 36 | +`PAGE_SIZE` is usually defined in terms of the page shift as `(1 << PAGE_SHIFT)`. |
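
The pfn arithmetic above can be sketched in a few lines of user-space C. This is only an illustration under the 4KB-page assumption used in the example; the `toy_` names are hypothetical and are not kernel interfaces.

```c
#include <assert.h>
#include <stdint.h>

/* Assumed for illustration: 4KB pages, as in the example above. */
#define TOY_PAGE_SHIFT 12
#define TOY_PAGE_SIZE  (UINT64_C(1) << TOY_PAGE_SHIFT)

static_assert(TOY_PAGE_SIZE == 4096, "this sketch assumes 4KB pages");

/* Physical address -> page frame number (address divided by page size). */
static inline uint64_t toy_phys_to_pfn(uint64_t phys)
{
	return phys >> TOY_PAGE_SHIFT;
}

/* Page frame number -> base physical address of that page. */
static inline uint64_t toy_pfn_to_phys(uint64_t pfn)
{
	return pfn << TOY_PAGE_SHIFT;
}

/* Offset of an address inside its page (the low TOY_PAGE_SHIFT bits). */
static inline uint64_t toy_page_offset(uint64_t phys)
{
	return phys & (TOY_PAGE_SIZE - 1);
}
```

Because the page size is a power of two, the division by the page size reduces to a right shift, which is why the shift is the primary definition and the size is derived from it.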
| 37 | + |
| 38 | +Over time a deeper hierarchy has been developed in response to increasing memory |
| 39 | +sizes. When Linux was created, 4KB pages and a single page table called |
| 40 | +`swapper_pg_dir` with 1024 entries were used, covering 4MB which coincided with |
| 41 | +the fact that Torvalds's first computer had 4MB of physical memory. Entries in |
| 42 | +this single table were referred to as *PTE*:s - page table entries. |
| 43 | + |
| 44 | +The software page table hierarchy reflects the fact that page table hardware has |
| 45 | +become hierarchical and that in turn is done to save page table memory and |
| 46 | +speed up mapping. |
| 47 | + |
| 48 | +One could of course imagine a single, linear page table with enormous amounts |
| 49 | +of entries, breaking down the whole memory into single pages. Such a page table |
| 50 | +would be very sparse, because large portions of the virtual memory usually |
| 51 | +remain unused. By using hierarchical page tables, large holes in the virtual |
| 52 | +address space do not waste valuable page table memory, because it will suffice |
| 53 | +to mark large areas as unmapped at a higher level in the page table hierarchy. |
| 54 | + |
| 55 | +Additionally, on modern CPUs, a higher level page table entry can point directly |
| 56 | +to a physical memory range, which allows mapping a contiguous range of several |
| 57 | +megabytes or even gigabytes in a single high-level page table entry, taking |
| 58 | +shortcuts in mapping virtual memory to physical memory: there is no need to |
| 59 | +traverse deeper in the hierarchy when you find a large mapped range like this. |
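
To make "several megabytes or even gigabytes" concrete, here is a small sketch of how much address space a single entry covers at each level. It assumes 4KB pages and 512 entries per table, which is one common layout; other architectures and configurations differ. The `toy_` names are hypothetical.

```c
#include <assert.h>
#include <stdint.h>

#define TOY_PAGE_SHIFT     12  /* assumed: 4KB pages */
#define TOY_BITS_PER_LEVEL 9   /* assumed: 512 entries per table */

static_assert((1 << TOY_BITS_PER_LEVEL) == 512, "512 entries per table");

/*
 * Bytes of virtual address space covered by one entry at a given level,
 * counting from the bottom: level 0 is the lowest table, whose entries
 * map single pages; each level above covers 512 times more per entry.
 */
static inline uint64_t toy_entry_coverage(unsigned int level)
{
	return UINT64_C(1) << (TOY_PAGE_SHIFT + TOY_BITS_PER_LEVEL * level);
}
```

Under these assumptions an entry one level above the lowest table covers 2MB and an entry two levels above covers 1GB, which are the sizes of the large mappings described above.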
| 60 | + |
| 61 | +The page table hierarchy has now developed into this:: |
| 62 | + |
| 63 | +  +-----+ |
| 64 | +  | PGD | |
| 65 | +  +-----+ |
| 66 | +     | |
| 67 | +     |   +-----+ |
| 68 | +     +-->| P4D | |
| 69 | +         +-----+ |
| 70 | +            | |
| 71 | +            |   +-----+ |
| 72 | +            +-->| PUD | |
| 73 | +                +-----+ |
| 74 | +                   | |
| 75 | +                   |   +-----+ |
| 76 | +                   +-->| PMD | |
| 77 | +                       +-----+ |
| 78 | +                          | |
| 79 | +                          |   +-----+ |
| 80 | +                          +-->| PTE | |
| 81 | +                              +-----+ |
| 82 | + |
| 83 | + |
| 84 | +Symbols on the different levels of the page table hierarchy have the following |
| 85 | +meaning beginning from the bottom: |
| 86 | + |
| 87 | +- **pte**, `pte_t`, `pteval_t` = **Page Table Entry** - mentioned earlier. |
| 88 | + The *pte* is an array of `PTRS_PER_PTE` elements of the `pteval_t` type, each |
| 89 | + mapping a single page of virtual memory to a single page of physical memory. |
| 90 | + The architecture defines the size and contents of `pteval_t`. |
| 91 | + |
| 92 | + A typical example is that the `pteval_t` is a 32- or 64-bit value with the |
| 93 | + upper bits being a **pfn** (page frame number), and the lower bits being some |
| 94 | + architecture-specific bits such as memory protection. |
| 95 | + |
| 96 | + The **entry** part of the name is a bit confusing because while in Linux 1.0 |
| 97 | + this did refer to a single page table entry in the single top level page |
| 98 | + table, it was retrofitted to be an array of mapping elements when two-level |
| 99 | + page tables were first introduced, so the *pte* is the lowermost page |
| 100 | + *table*, not a page table *entry*. |
| 101 | + |
| 102 | +- **pmd**, `pmd_t`, `pmdval_t` = **Page Middle Directory**, the hierarchy right |
| 103 | + above the *pte*, with `PTRS_PER_PMD` references to the *pte*:s. |
| 104 | + |
| 105 | +- **pud**, `pud_t`, `pudval_t` = **Page Upper Directory** was introduced after |
| 106 | + the other levels to handle 4-level page tables. It is potentially unused, |
| 107 | + or *folded* as we will discuss later. |
| 108 | + |
| 109 | +- **p4d**, `p4d_t`, `p4dval_t` = **Page Level 4 Directory** was introduced to |
| 110 | + handle 5-level page tables after the *pud* was introduced. Now it was clear |
| 111 | + that we needed to replace *pgd*, *pmd*, *pud* etc. with a figure indicating the |
| 112 | + directory level and that we cannot go on with ad hoc names any more. This |
| 113 | + is only used on systems which actually have 5 levels of page tables, otherwise |
| 114 | + it is folded. |
| 115 | + |
| 116 | +- **pgd**, `pgd_t`, `pgdval_t` = **Page Global Directory** - the Linux kernel |
| 117 | + main page table handling the PGD for the kernel memory is still found in |
| 118 | + `swapper_pg_dir`, but each userspace process in the system also has its own |
| 119 | + memory context and thus its own *pgd*, found in `struct mm_struct` which |
| 120 | + in turn is referenced in each `struct task_struct`. So tasks have memory |
| 121 | + context in the form of a `struct mm_struct` and this in turn has a |
| 122 | + `pgd_t *pgd` pointer to the corresponding page global directory. |
| 123 | + |
| 124 | +To repeat: each level in the page table hierarchy is an *array of pointers*, so |
| 125 | +the **pgd** contains `PTRS_PER_PGD` pointers to the next level below, **p4d** |
| 126 | +contains `PTRS_PER_P4D` pointers to **pud** items and so on. The number of |
| 127 | +pointers on each level is architecture-defined:: |
| 128 | + |
| 129 | +  PMD |
| 130 | +  --> +-----+           PTE |
| 131 | +      | ptr |-------> +-----+ |
| 132 | +      | ptr |-        | ptr |-------> PAGE |
| 133 | +      | ptr | \       | ptr | |
| 134 | +      | ptr |  \        ... |
| 135 | +      | ... |   \ |
| 136 | +      | ptr |    \         PTE |
| 137 | +      +-----+     +----> +-----+ |
| 138 | +                         | ptr |-------> PAGE |
| 139 | +                         | ptr | |
| 140 | +                           ... |
| 141 | + |
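
Since every level is an array, a virtual address is effectively a bundle of array indices, one per level, plus an offset into the final page. The following sketch shows the split under an assumed layout of 4KB pages and 512 pointers on each of the five levels; real values of `PTRS_PER_PGD` and friends are architecture-defined, and the `toy_` names are hypothetical.

```c
#include <assert.h>
#include <stdint.h>

#define TOY_PAGE_SHIFT   12   /* assumed: 4KB pages */
#define TOY_LEVEL_BITS   9    /* assumed: 9 address bits per level */
#define TOY_PTRS_PER_TBL 512  /* assumed: same count on every level */

static_assert(TOY_PTRS_PER_TBL == (1 << TOY_LEVEL_BITS),
	      "table size must match the bits per level");

/* Levels counted from the bottom of the hierarchy. */
enum toy_level { TOY_PTE = 0, TOY_PMD, TOY_PUD, TOY_P4D, TOY_PGD };

/* Index into the table at the given level for a virtual address. */
static inline unsigned int toy_index(uint64_t vaddr, enum toy_level level)
{
	unsigned int shift = TOY_PAGE_SHIFT + TOY_LEVEL_BITS * (unsigned int)level;

	return (unsigned int)((vaddr >> shift) & (TOY_PTRS_PER_TBL - 1));
}
```

With this layout the *pte* index comes from bits 12-20 of the address, the *pmd* index from bits 21-29, and so on up to the *pgd* index in bits 48-56.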
| 142 | + |
| 143 | +Page Table Folding |
| 144 | +================== |
| 145 | + |
| 146 | +If the architecture does not use all the page table levels, they can be *folded*, |
| 147 | +which means skipped, and all operations performed on page tables will be |
| 148 | +compile-time augmented to just skip a level when accessing the next lower |
| 149 | +level. |
| 150 | + |
| 151 | +Page table handling code that wishes to be architecture-neutral, such as the |
| 152 | +virtual memory manager, will need to be written so that it traverses all of the |
| 153 | +currently five levels. This style should also be preferred for |
| 154 | +architecture-specific code, so as to be robust to future changes. |
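
The idea of folding can be illustrated with a user-space toy model. It assumes a machine with only two real levels (a *pgd* of 1024 entries and *pte* tables of 1024 entries, 4KB pages) and folds the three levels in between: each folded type merely wraps the level above it, and its offset helper returns the pointer it was given. All `toy_` names are hypothetical; this is a sketch of the concept, not the kernel's implementation.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define TOY_PAGE_SHIFT   12    /* assumed: 4KB pages */
#define TOY_PGDIR_SHIFT  22    /* assumed: each pgd entry covers 4MB */
#define TOY_PTRS_PER_PTE 1024
#define TOY_PTRS_PER_PGD 1024

typedef struct { uint32_t pte; } toy_pte_t;     /* (pfn << 12) | flags */
typedef struct { toy_pte_t *table; } toy_pgd_t; /* pte table or NULL */

/* Folded levels: each one is just a wrapper around the level above. */
typedef struct { toy_pgd_t pgd; } toy_p4d_t;
typedef struct { toy_p4d_t p4d; } toy_pud_t;
typedef struct { toy_pud_t pud; } toy_pmd_t;

static_assert(sizeof(toy_pmd_t) == sizeof(toy_pgd_t),
	      "folded levels must not add storage");

static inline toy_pgd_t *toy_pgd_offset(toy_pgd_t *base, uint32_t va)
{
	return base + (va >> TOY_PGDIR_SHIFT);
}

/* Folded helpers: no table to index, hand back the same entry. */
static inline toy_p4d_t *toy_p4d_offset(toy_pgd_t *pgd, uint32_t va)
{
	(void)va;
	return (toy_p4d_t *)pgd;
}

static inline toy_pud_t *toy_pud_offset(toy_p4d_t *p4d, uint32_t va)
{
	(void)va;
	return (toy_pud_t *)p4d;
}

static inline toy_pmd_t *toy_pmd_offset(toy_pud_t *pud, uint32_t va)
{
	(void)va;
	return (toy_pmd_t *)pud;
}

static inline toy_pte_t *toy_pte_offset(toy_pmd_t *pmd, uint32_t va)
{
	toy_pte_t *table = pmd->pud.p4d.pgd.table;

	if (!table)
		return NULL; /* nothing mapped in this 4MB region */
	return table + ((va >> TOY_PAGE_SHIFT) & (TOY_PTRS_PER_PTE - 1));
}

/* Architecture-neutral style: always step through all five levels. */
static inline toy_pte_t *toy_walk(toy_pgd_t *base, uint32_t va)
{
	toy_pgd_t *pgd = toy_pgd_offset(base, va);
	toy_p4d_t *p4d = toy_p4d_offset(pgd, va);
	toy_pud_t *pud = toy_pud_offset(p4d, va);
	toy_pmd_t *pmd = toy_pmd_offset(pud, va);

	return toy_pte_offset(pmd, va);
}
```

The walker is written once against five levels, and a compiler can reduce the folded helpers to nothing. The same walker would keep working unchanged if the folded types were later replaced by real tables.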