@@ -5,37 +5,12 @@ RCU and Unloadable Modules

[Originally published in LWN Jan. 14, 2007: http://lwn.net/Articles/217484/]

-RCU (read-copy update) is a synchronization mechanism that can be thought
-of as a replacement for reader-writer locking (among other things), but with
-very low-overhead readers that are immune to deadlock, priority inversion,
-and unbounded latency. RCU read-side critical sections are delimited
-by rcu_read_lock() and rcu_read_unlock(), which, in non-CONFIG_PREEMPTION
-kernels, generate no code whatsoever.
-
-This means that RCU writers are unaware of the presence of concurrent
-readers, so that RCU updates to shared data must be undertaken quite
-carefully, leaving an old version of the data structure in place until all
-pre-existing readers have finished. These old versions are needed because
-such readers might hold a reference to them. RCU updates can therefore be
-rather expensive, and RCU is thus best suited for read-mostly situations.
-
-How can an RCU writer possibly determine when all readers are finished,
-given that readers might well leave absolutely no trace of their
-presence? There is a synchronize_rcu() primitive that blocks until all
-pre-existing readers have completed. An updater wishing to delete an
-element p from a linked list might do the following, while holding an
-appropriate lock, of course::
-
-        list_del_rcu(p);
-        synchronize_rcu();
-        kfree(p);
-
-But the above code cannot be used in IRQ context -- the call_rcu()
-primitive must be used instead. This primitive takes a pointer to an
-rcu_head struct placed within the RCU-protected data structure and
-another pointer to a function that may be invoked later to free that
-structure. Code to delete an element p from the linked list from IRQ
-context might then be as follows::
+RCU updaters sometimes use call_rcu() to initiate an asynchronous wait for
+a grace period to elapse. This primitive takes a pointer to an rcu_head
+struct placed within the RCU-protected data structure and another pointer
+to a function that may be invoked later to free that structure. Code to
+delete an element p from the linked list from IRQ context might then be
+as follows::

        list_del_rcu(p);
        call_rcu(&p->rcu, p_callback);
@@ -54,7 +29,7 @@ IRQ context. The function p_callback() might be defined as follows::
Unloading Modules That Use call_rcu()
-------------------------------------

-But what if p_callback is defined in an unloadable module?
+But what if the p_callback() function is defined in an unloadable module?

If we unload the module while some RCU callbacks are pending,
the CPUs executing these callbacks are going to be severely
@@ -67,20 +42,21 @@ grace period to elapse, it does not wait for the callbacks to complete.

One might be tempted to try several back-to-back synchronize_rcu()
calls, but this is still not guaranteed to work. If there is a very
-heavy RCU-callback load, then some of the callbacks might be deferred
-in order to allow other processing to proceed. Such deferral is required
-in realtime kernels in order to avoid excessive scheduling latencies.
+heavy RCU-callback load, then some of the callbacks might be deferred in
+order to allow other processing to proceed. For but one example, such
+deferral is required in realtime kernels in order to avoid excessive
+scheduling latencies.


rcu_barrier()
-------------

-We instead need the rcu_barrier() primitive. Rather than waiting for
-a grace period to elapse, rcu_barrier() waits for all outstanding RCU
-callbacks to complete. Please note that rcu_barrier() does **not** imply
-synchronize_rcu(); in particular, if there are no RCU callbacks queued
-anywhere, rcu_barrier() is within its rights to return immediately,
-without waiting for a grace period to elapse.
+This situation can be handled by the rcu_barrier() primitive. Rather
+than waiting for a grace period to elapse, rcu_barrier() waits for all
+outstanding RCU callbacks to complete. Please note that rcu_barrier()
+does **not** imply synchronize_rcu(); in particular, if there are no RCU
+callbacks queued anywhere, rcu_barrier() is within its rights to return
+immediately, without waiting for anything, let alone a grace period.

Pseudo-code using rcu_barrier() is as follows:

@@ -89,19 +65,22 @@ Pseudo-code using rcu_barrier() is as follows:
3. Allow the module to be unloaded.

There is also an srcu_barrier() function for SRCU, and you of course
-must match the flavor of rcu_barrier() with that of call_rcu(). If your
-module uses multiple flavors of call_rcu(), then it must also use multiple
-flavors of rcu_barrier() when unloading that module. For example, if
-it uses call_rcu(), call_srcu() on srcu_struct_1, and call_srcu() on
-srcu_struct_2, then the following three lines of code will be required
-when unloading::
+must match the flavor of srcu_barrier() with that of call_srcu().
+If your module uses multiple srcu_struct structures, then it must also
+use multiple invocations of srcu_barrier() when unloading that module.
+For example, if it uses call_rcu(), call_srcu() on srcu_struct_1, and
+call_srcu() on srcu_struct_2, then the following three lines of code
+will be required when unloading::

 1 rcu_barrier();
 2 srcu_barrier(&srcu_struct_1);
 3 srcu_barrier(&srcu_struct_2);

-The rcutorture module makes use of rcu_barrier() in its exit function
-as follows::
+If latency is of the essence, workqueues could be used to run these
+three functions concurrently.
+
+An ancient version of the rcutorture module makes use of rcu_barrier()
+in its exit function as follows::

 1 static void
 2 rcu_torture_cleanup(void)
@@ -190,16 +169,17 @@ Quick Quiz #1:
:ref:`Answer to Quick Quiz #1 <answer_rcubarrier_quiz_1>`

Your module might have additional complications. For example, if your
-module invokes call_rcu() from timers, you will need to first cancel all
-the timers, and only then invoke rcu_barrier() to wait for any remaining
+module invokes call_rcu() from timers, you will need to first refrain
+from posting new timers, cancel (or wait for) all the already-posted
+timers, and only then invoke rcu_barrier() to wait for any remaining
RCU callbacks to complete.

-Of course, if you module uses call_rcu(), you will need to invoke
+Of course, if your module uses call_rcu(), you will need to invoke
rcu_barrier() before unloading. Similarly, if your module uses
call_srcu(), you will need to invoke srcu_barrier() before unloading,
and on the same srcu_struct structure. If your module uses call_rcu()
-**and** call_srcu(), then you will need to invoke rcu_barrier() **and**
-srcu_barrier().
+**and** call_srcu(), then (as noted above) you will need to invoke
+rcu_barrier() **and** srcu_barrier().


Implementing rcu_barrier()
@@ -211,27 +191,40 @@ queues. His implementation queues an RCU callback on each of the per-CPU
callback queues, and then waits until they have all started executing, at
which point, all earlier RCU callbacks are guaranteed to have completed.

-The original code for rcu_barrier() was as follows::
-
- 1 void rcu_barrier(void)
- 2 {
- 3   BUG_ON(in_interrupt());
- 4   /* Take cpucontrol mutex to protect against CPU hotplug */
- 5   mutex_lock(&rcu_barrier_mutex);
- 6   init_completion(&rcu_barrier_completion);
- 7   atomic_set(&rcu_barrier_cpu_count, 0);
- 8   on_each_cpu(rcu_barrier_func, NULL, 0, 1);
- 9   wait_for_completion(&rcu_barrier_completion);
-10   mutex_unlock(&rcu_barrier_mutex);
-11 }
-
-Line 3 verifies that the caller is in process context, and lines 5 and 10
+The original code for rcu_barrier() was roughly as follows::
+
+ 1 void rcu_barrier(void)
+ 2 {
+ 3   BUG_ON(in_interrupt());
+ 4   /* Take cpucontrol mutex to protect against CPU hotplug */
+ 5   mutex_lock(&rcu_barrier_mutex);
+ 6   init_completion(&rcu_barrier_completion);
+ 7   atomic_set(&rcu_barrier_cpu_count, 1);
+ 8   on_each_cpu(rcu_barrier_func, NULL, 0, 1);
+ 9   if (atomic_dec_and_test(&rcu_barrier_cpu_count))
+10     complete(&rcu_barrier_completion);
+11   wait_for_completion(&rcu_barrier_completion);
+12   mutex_unlock(&rcu_barrier_mutex);
+13 }
+
+Line 3 verifies that the caller is in process context, and lines 5 and 12
use rcu_barrier_mutex to ensure that only one rcu_barrier() is using the
global completion and counters at a time, which are initialized on lines
6 and 7. Line 8 causes each CPU to invoke rcu_barrier_func(), which is
shown below. Note that the final "1" in on_each_cpu()'s argument list
ensures that all the calls to rcu_barrier_func() will have completed
-before on_each_cpu() returns. Line 9 then waits for the completion.
+before on_each_cpu() returns. Line 9 removes the initial count from
+rcu_barrier_cpu_count, and if this count is now zero, line 10 finalizes
+the completion, which prevents line 11 from blocking. Either way,
+line 11 then waits (if needed) for the completion.
+
+.. _rcubarrier_quiz_2:
+
+Quick Quiz #2:
+        Why doesn't line 7 initialize rcu_barrier_cpu_count to zero,
+        thereby avoiding the need for lines 9 and 10?
+
+:ref:`Answer to Quick Quiz #2 <answer_rcubarrier_quiz_2>`

This code was rewritten in 2008 and several times thereafter, but this
still gives the general idea.
@@ -253,7 +246,7 @@ to post an RCU callback, as follows::
Lines 3 and 4 locate RCU's internal per-CPU rcu_data structure,
which contains the struct rcu_head that is needed for the later call to
call_rcu(). Line 7 picks up a pointer to this struct rcu_head, and line
-8 increments a global counter. This counter will later be decremented
+8 increments the global counter. This counter will later be decremented
by the callback. Line 9 then registers the rcu_barrier_callback() on
the current CPU's queue.

@@ -267,27 +260,28 @@ reaches zero, as follows::
 4   complete(&rcu_barrier_completion);
 5 }

-.. _rcubarrier_quiz_2:
+.. _rcubarrier_quiz_3:

-Quick Quiz #2:
+Quick Quiz #3:
        What happens if CPU 0's rcu_barrier_func() executes
        immediately (thus incrementing rcu_barrier_cpu_count to the
        value one), but the other CPUs' rcu_barrier_func() invocations
        are delayed for a full grace period? Couldn't this result in
        rcu_barrier() returning prematurely?

-:ref:`Answer to Quick Quiz #2 <answer_rcubarrier_quiz_2>`
+:ref:`Answer to Quick Quiz #3 <answer_rcubarrier_quiz_3>`

The current rcu_barrier() implementation is more complex, due to the need
to avoid disturbing idle CPUs (especially on battery-powered systems)
and the need to minimally disturb non-idle CPUs in real-time systems.
-However, the code above illustrates the concepts.
+In addition, a great many optimizations have been applied. However,
+the code above illustrates the concepts.


rcu_barrier() Summary
---------------------

-The rcu_barrier() primitive has seen relatively little use, since most
+The rcu_barrier() primitive is used relatively infrequently, since most
code using RCU is in the core kernel rather than in modules. However, if
you are using RCU from an unloadable module, you need to use rcu_barrier()
so that your module may be safely unloaded.
@@ -318,6 +312,39 @@ Answer: Interestingly enough, rcu_barrier() was not originally
.. _answer_rcubarrier_quiz_2:

Quick Quiz #2:
+        Why doesn't line 7 initialize rcu_barrier_cpu_count to zero,
+        thereby avoiding the need for lines 9 and 10?
+
+Answer: Suppose that the on_each_cpu() function shown on line 8 was
+        delayed, so that CPU 0's rcu_barrier_func() executed and
+        the corresponding grace period elapsed, all before CPU 1's
+        rcu_barrier_func() started executing. This would result in
+        rcu_barrier_cpu_count being decremented to zero, so that line
+        11's wait_for_completion() would return immediately, failing to
+        wait for CPU 1's callbacks to be invoked.
+
+        Note that this was not a problem when the rcu_barrier() code
+        was first added back in 2005. This is because on_each_cpu()
+        disables preemption, which acted as an RCU read-side critical
+        section, thus preventing CPU 0's grace period from completing
+        until on_each_cpu() had dealt with all of the CPUs. However,
+        with the advent of preemptible RCU, rcu_barrier() no longer
+        waited on nonpreemptible regions of code in preemptible kernels,
+        that being the job of the new rcu_barrier_sched() function.
+
+        However, with the RCU flavor consolidation around v4.20, this
+        possibility was once again ruled out, because the consolidated
+        RCU once again waits on nonpreemptible regions of code.
+
+        Nevertheless, that extra count might still be a good idea.
+        Relying on these sorts of accidents of implementation can result
+        in later surprise bugs when the implementation changes.
+
+:ref:`Back to Quick Quiz #2 <rcubarrier_quiz_2>`
+
+.. _answer_rcubarrier_quiz_3:
+
+Quick Quiz #3:
        What happens if CPU 0's rcu_barrier_func() executes
        immediately (thus incrementing rcu_barrier_cpu_count to the
        value one), but the other CPUs' rcu_barrier_func() invocations
@@ -336,18 +363,15 @@ Answer: This cannot happen. The reason is that on_each_cpu() has its last

        Therefore, on_each_cpu() disables preemption across its call
        to smp_call_function() and also across the local call to
-       rcu_barrier_func(). This prevents the local CPU from context
-       switching, again preventing grace periods from completing. This
+       rcu_barrier_func(). Because recent RCU implementations treat
+       preemption-disabled regions of code as RCU read-side critical
+       sections, this prevents grace periods from completing. This
        means that all CPUs have executed rcu_barrier_func() before
        the first rcu_barrier_callback() can possibly execute, in turn
        preventing rcu_barrier_cpu_count from prematurely reaching zero.

-       Currently, -rt implementations of RCU keep but a single global
-       queue for RCU callbacks, and thus do not suffer from this
-       problem. However, when the -rt RCU eventually does have per-CPU
-       callback queues, things will have to change. One simple change
-       is to add an rcu_read_lock() before line 8 of rcu_barrier()
-       and an rcu_read_unlock() after line 8 of this same function. If
-       you can think of a better change, please let me know!
+       But if on_each_cpu() ever decides to forgo disabling preemption,
+       as might well happen due to real-time latency considerations,
+       initializing rcu_barrier_cpu_count to one will save the day.

-:ref:`Back to Quick Quiz #2 <rcubarrier_quiz_2>`
+:ref:`Back to Quick Quiz #3 <rcubarrier_quiz_3>`