|
1 | 1 | // SPDX-License-Identifier: GPL-2.0 |
2 | 2 | /* |
3 | | - * SLUB: A slab allocator that limits cache line use instead of queuing |
4 | | - * objects in per cpu and per node lists. |
| 3 | + * SLUB: A slab allocator with low overhead percpu array caches and mostly |
| 4 | + * lockless freeing of objects to slabs in the slowpath. |
5 | 5 | * |
6 | | - * The allocator synchronizes using per slab locks or atomic operations |
7 | | - * and only uses a centralized lock to manage a pool of partial slabs. |
| 6 | + * The allocator synchronizes using spin_trylock for percpu arrays in the |
| 7 | + * fastpath, and cmpxchg_double (or bit spinlock) for slowpath freeing. |
| 8 | + * Uses a centralized lock to manage a pool of partial slabs. |
8 | 9 | * |
9 | 10 | * (C) 2007 SGI, Christoph Lameter |
10 | 11 | * (C) 2011 Linux Foundation, Christoph Lameter |
| 12 | + * (C) 2025 SUSE, Vlastimil Babka |
11 | 13 | */ |
12 | 14 |
|
13 | 15 | #include <linux/mm.h> |
|
53 | 55 |
|
54 | 56 | /* |
55 | 57 | * Lock order: |
56 | | - * 1. slab_mutex (Global Mutex) |
57 | | - * 2. node->list_lock (Spinlock) |
58 | | - * 3. kmem_cache->cpu_slab->lock (Local lock) |
59 | | - * 4. slab_lock(slab) (Only on some arches) |
60 | | - * 5. object_map_lock (Only for debugging) |
| 58 | + * 0. cpu_hotplug_lock |
| 59 | + * 1. slab_mutex (Global Mutex) |
| 60 | + * 2a. kmem_cache->cpu_sheaves->lock (Local trylock) |
| 61 | + * 2b. node->barn->lock (Spinlock) |
| 62 | + * 2c. node->list_lock (Spinlock) |
| 63 | + * 3. slab_lock(slab) (Only on some arches) |
| 64 | + * 4. object_map_lock (Only for debugging) |
61 | 65 | * |
62 | 66 | * slab_mutex |
63 | 67 | * |
|
78 | 82 | * C. slab->objects -> Number of objects in slab |
79 | 83 | * D. slab->frozen -> frozen state |
80 | 84 | * |
81 | | - * Frozen slabs |
| 85 | + * SL_partial slabs |
| 86 | + * |
| 87 | + * Slabs on node partial list have at least one free object. A limited number |
| 88 | + * of slabs on the list can be fully free (slab->inuse == 0), until we start |
| 89 | + * discarding them. These slabs are marked with SL_partial, and the flag is |
| 90 | + * cleared while removing them, usually to grab their freelist afterwards. |
| 91 | + * This clearing also exempts them from list management. Please see |
| 92 | + * __slab_free() for more details. |
82 | 93 | * |
83 | | - * If a slab is frozen then it is exempt from list management. It is |
84 | | - * the cpu slab which is actively allocated from by the processor that |
85 | | - * froze it and it is not on any list. The processor that froze the |
86 | | - * slab is the one who can perform list operations on the slab. Other |
87 | | - * processors may put objects onto the freelist but the processor that |
88 | | - * froze the slab is the only one that can retrieve the objects from the |
89 | | - * slab's freelist. |
| 94 | + * Full slabs |
90 | 95 | * |
91 | | - * CPU partial slabs |
| 96 | + * For caches without debugging enabled, full slabs (slab->inuse == |
| 97 | + * slab->objects and slab->freelist == NULL) are not placed on any list. |
| 98 | + * The __slab_free() freeing the first object from such a slab will place |
| 99 | + * it on the partial list. Caches with debugging enabled place such slab |
| 100 | + * on the full list and use different allocation and freeing paths. |
| 101 | + * |
| 102 | + * Frozen slabs |
92 | 103 | * |
93 | | - * The partially empty slabs cached on the CPU partial list are used |
94 | | - * for performance reasons, which speeds up the allocation process. |
95 | | - * These slabs are not frozen, but are also exempt from list management, |
96 | | - * by clearing the SL_partial flag when moving out of the node |
97 | | - * partial list. Please see __slab_free() for more details. |
| 104 | + * If a slab is frozen then it is exempt from list management. It is used to |
| 105 | + * indicate a slab that has failed consistency checks and thus cannot be |
| 106 | + * allocated from anymore - it is also marked as full. Any previously |
| 107 | + * allocated objects will be simply leaked upon freeing instead of attempting |
| 108 | + * to modify the potentially corrupted freelist and metadata. |
98 | 109 | * |
99 | 110 | * To sum up, the current scheme is: |
100 | | - * - node partial slab: SL_partial && !frozen |
101 | | - * - cpu partial slab: !SL_partial && !frozen |
102 | | - * - cpu slab: !SL_partial && frozen |
103 | | - * - full slab: !SL_partial && !frozen |
| 111 | + * - node partial slab: SL_partial && !full && !frozen |
| 112 | + * - taken off partial list: !SL_partial && !full && !frozen |
| 113 | + * - full slab, not on any list: !SL_partial && full && !frozen |
| 114 | + * - frozen due to inconsistency: !SL_partial && full && frozen |
104 | 115 | * |
105 | | - * list_lock |
| 116 | + * node->list_lock (spinlock) |
106 | 117 | * |
107 | 118 | * The list_lock protects the partial and full list on each node and |
108 | 119 | * the partial slab counter. If taken then no new slabs may be added or |
|
112 | 123 | * |
113 | 124 | * The list_lock is a centralized lock and thus we avoid taking it as |
114 | 125 | * much as possible. As long as SLUB does not have to handle partial |
115 | | - * slabs, operations can continue without any centralized lock. F.e. |
116 | | - * allocating a long series of objects that fill up slabs does not require |
117 | | - * the list lock. |
| 126 | + * slabs, operations can continue without any centralized lock. |
118 | 127 | * |
119 | 128 | * For debug caches, all allocations are forced to go through a list_lock |
120 | 129 | * protected region to serialize against concurrent validation. |
121 | 130 | * |
122 | | - * cpu_slab->lock local lock |
| 131 | + * cpu_sheaves->lock (local_trylock) |
123 | 132 | * |
124 | | - * This locks protect slowpath manipulation of all kmem_cache_cpu fields |
125 | | - * except the stat counters. This is a percpu structure manipulated only by |
126 | | - * the local cpu, so the lock protects against being preempted or interrupted |
127 | | - * by an irq. Fast path operations rely on lockless operations instead. |
| 133 | + * This lock protects fastpath operations on the percpu sheaves. On !RT it |
| 134 | + * only disables preemption and does no atomic operations. As long as the main |
| 135 | + * or spare sheaf can handle the allocation or free, there is no other |
| 136 | + * overhead. |
128 | 137 | * |
129 | | - * On PREEMPT_RT, the local lock neither disables interrupts nor preemption |
130 | | - * which means the lockless fastpath cannot be used as it might interfere with |
131 | | - * an in-progress slow path operations. In this case the local lock is always |
132 | | - * taken but it still utilizes the freelist for the common operations. |
| 138 | + * node->barn->lock (spinlock) |
133 | 139 | * |
134 | | - * lockless fastpaths |
| 140 | + * This lock protects the operations on per-NUMA-node barn. It can quickly |
| 141 | + * serve an empty or full sheaf if available, and avoid more expensive refill |
| 142 | + * or flush operation. |
135 | 143 | * |
136 | | - * The fast path allocation (slab_alloc_node()) and freeing (do_slab_free()) |
137 | | - * are fully lockless when satisfied from the percpu slab (and when |
138 | | - * cmpxchg_double is possible to use, otherwise slab_lock is taken). |
139 | | - * They also don't disable preemption or migration or irqs. They rely on |
140 | | - * the transaction id (tid) field to detect being preempted or moved to |
141 | | - * another cpu. |
| 144 | + * Lockless freeing |
| 145 | + * |
| 146 | + * Objects may have to be freed to their slabs when they are from a remote |
| 147 | + * node (where we want to avoid filling local sheaves with remote objects) |
| 148 | + * or when there are too many full sheaves. On architectures supporting |
| 149 | + * cmpxchg_double this is done by a lockless update of slab's freelist and |
| 150 | + * counters, otherwise slab_lock is taken. This only needs to take the |
| 151 | + * list_lock if it's a first free to a full slab, or when a slab becomes empty |
| 152 | + * after the free. |
142 | 153 | * |
143 | 154 | * irq, preemption, migration considerations |
144 | 155 | * |
145 | | - * Interrupts are disabled as part of list_lock or local_lock operations, or |
| 156 | + * Interrupts are disabled as part of list_lock or barn lock operations, or |
146 | 157 | * around the slab_lock operation, in order to make the slab allocator safe |
147 | 158 | * to use in the context of an irq. |
| 159 | + * Preemption is disabled as part of local_trylock operations. |
| 160 | + * kmalloc_nolock() and kfree_nolock() are safe in NMI context but see |
| 161 | + * their limitations. |
148 | 162 | * |
149 | | - * In addition, preemption (or migration on PREEMPT_RT) is disabled in the |
150 | | - * allocation slowpath, bulk allocation, and put_cpu_partial(), so that the |
151 | | - * local cpu doesn't change in the process and e.g. the kmem_cache_cpu pointer |
152 | | - * doesn't have to be revalidated in each section protected by the local lock. |
153 | | - * |
154 | | - * SLUB assigns one slab for allocation to each processor. |
155 | | - * Allocations only occur from these slabs called cpu slabs. |
| 163 | + * SLUB assigns two object arrays called sheaves for caching allocations and |
| 164 | + * frees on each cpu, with a NUMA node shared barn for balancing between cpus. |
| 165 | + * Allocations and frees are primarily served from these sheaves. |
156 | 166 | * |
157 | 167 | * Slabs with free elements are kept on a partial list and during regular |
158 | 168 | * operations no list for full slabs is used. If an object in a full slab is |
159 | 169 | * freed then the slab will show up again on the partial lists. |
160 | 170 | * We track full slabs for debugging purposes though because otherwise we |
161 | 171 | * cannot scan all objects. |
162 | 172 | * |
163 | | - * Slabs are freed when they become empty. Teardown and setup is |
164 | | - * minimal so we rely on the page allocators per cpu caches for |
165 | | - * fast frees and allocs. |
166 | | - * |
167 | | - * slab->frozen The slab is frozen and exempt from list processing. |
168 | | - * This means that the slab is dedicated to a purpose |
169 | | - * such as satisfying allocations for a specific |
170 | | - * processor. Objects may be freed in the slab while |
171 | | - * it is frozen but slab_free will then skip the usual |
172 | | - * list operations. It is up to the processor holding |
173 | | - * the slab to integrate the slab into the slab lists |
174 | | - * when the slab is no longer needed. |
175 | | - * |
176 | | - * One use of this flag is to mark slabs that are |
177 | | - * used for allocations. Then such a slab becomes a cpu |
178 | | - * slab. The cpu slab may be equipped with an additional |
179 | | - * freelist that allows lockless access to |
180 | | - * free objects in addition to the regular freelist |
181 | | - * that requires the slab lock. |
| 173 | + * Slabs are freed when they become empty. Teardown and setup is minimal so we |
| 174 | + * rely on the page allocators per cpu caches for fast frees and allocs. |
182 | 175 | * |
183 | 176 | * SLAB_DEBUG_FLAGS Slab requires special handling due to debug |
184 | 177 | * options set. This moves slab handling out of |
|
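Review note: the "To sum up" table in the new comment defines four slab states from three bits (SL_partial, full, frozen). The sketch below is a small userspace model of that table, not kernel code. The names `struct slab_model`, `enum slab_state`, `classify_slab()`, `slab_state_name()` and `check_summary_table()` are hypothetical and exist only for illustration; in the kernel these bits are kept in the slab's flags and counters, not in a struct like this one.

```c
#include <assert.h>
#include <stdbool.h>
#include <string.h>

/* Hypothetical model of the three bits named in the summary table. */
struct slab_model {
	bool sl_partial;	/* models the SL_partial flag */
	bool full;		/* models inuse == objects && freelist == NULL */
	bool frozen;		/* models slab->frozen */
};

/* One value per row of the table, plus a catch-all for other combinations. */
enum slab_state {
	SLAB_NODE_PARTIAL,
	SLAB_OFF_PARTIAL_LIST,
	SLAB_FULL_UNLISTED,
	SLAB_FROZEN_INCONSISTENT,
	SLAB_NOT_IN_TABLE
};

/* Map the three bits onto the rows of the summary table. */
static enum slab_state classify_slab(const struct slab_model *s)
{
	if (s->sl_partial && !s->full && !s->frozen)
		return SLAB_NODE_PARTIAL;
	if (!s->sl_partial && !s->full && !s->frozen)
		return SLAB_OFF_PARTIAL_LIST;
	if (!s->sl_partial && s->full && !s->frozen)
		return SLAB_FULL_UNLISTED;
	if (!s->sl_partial && s->full && s->frozen)
		return SLAB_FROZEN_INCONSISTENT;
	/* Any other combination is not listed in the comment. */
	return SLAB_NOT_IN_TABLE;
}

/* Human-readable labels, copied from the wording of the table. */
static const char *slab_state_name(enum slab_state st)
{
	switch (st) {
	case SLAB_NODE_PARTIAL:
		return "node partial slab";
	case SLAB_OFF_PARTIAL_LIST:
		return "taken off partial list";
	case SLAB_FULL_UNLISTED:
		return "full slab, not on any list";
	case SLAB_FROZEN_INCONSISTENT:
		return "frozen due to inconsistency";
	default:
		return "not in summary table";
	}
}

/* Assert that each row of the table maps to its own state. */
static void check_summary_table(void)
{
	struct slab_model partial = { true, false, false };
	struct slab_model taken = { false, false, false };
	struct slab_model full = { false, true, false };
	struct slab_model frozen = { false, true, true };

	assert(classify_slab(&partial) == SLAB_NODE_PARTIAL);
	assert(classify_slab(&taken) == SLAB_OFF_PARTIAL_LIST);
	assert(classify_slab(&full) == SLAB_FULL_UNLISTED);
	assert(classify_slab(&frozen) == SLAB_FROZEN_INCONSISTENT);
}
```

In this model, SL_partial is set only in the first row, and frozen appears only together with full, which matches the comment's statement that a frozen slab is also marked as full. Combinations outside the four rows are reported as not in the table; the model does not claim they are impossible in the kernel.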