@@ -390,3 +390,91 @@ for example, "P3421".
390390
391391It is entirely possible to see stall warnings from normal and from
392392expedited grace periods at about the same time during the same run.
393+
394+ RCU_CPU_STALL_CPUTIME
395+ =====================
396+
397+ In kernels built with CONFIG_RCU_CPU_STALL_CPUTIME=y or booted with
398+ rcupdate.rcu_cpu_stall_cputime=1, the following additional information
399+ is supplied with each RCU CPU stall warning::
400+
401+ rcu: hardirqs softirqs csw/system
402+ rcu: number: 624 45 0
403+ rcu: cputime: 69 1 2425 ==> 2500(ms)
404+
405+ These statistics are collected during the sampling period. The values
406+ in row "number:" are the number of hard interrupts, number of soft
407+ interrupts, and number of context switches on the stalled CPU. The
408+ first three values in row "cputime:" indicate the CPU time in
409+ milliseconds consumed by hard interrupts, soft interrupts, and tasks
410+ on the stalled CPU. The last number is the measurement interval, again
411+ in milliseconds. Because user-mode tasks normally do not cause RCU CPU
412+ stalls, these tasks are typically kernel tasks, which is why only the
413+ system CPU time are considered.
414+
415+ The sampling period is shown as follows:
416+ :<------------first timeout---------->:<-----second timeout----->:
417+ :<--half timeout-->:<--half timeout-->: :
418+ : :<--first period-->: :
419+ : :<-----------second sampling period---------->:
420+ : : : :
421+ : snapshot time point 1st-stall 2nd-stall
422+
423+
424+ The following describes four typical scenarios:
425+
426+ 1. A CPU looping with interrupts disabled.::
427+
428+ rcu: hardirqs softirqs csw/system
429+ rcu: number: 0 0 0
430+ rcu: cputime: 0 0 0 ==> 2500(ms)
431+
432+ Because interrupts have been disabled throughout the measurement
433+ interval, there are no interrupts and no context switches.
434+ Furthermore, because CPU time consumption was measured using interrupt
435+ handlers, the system CPU consumption is misleadingly measured as zero.
436+ This scenario will normally also have "(0 ticks this GP)" printed on
437+ this CPU's summary line.
438+
439+ 2. A CPU looping with bottom halves disabled.
440+
441+ This is similar to the previous example, but with non-zero number of
442+ and CPU time consumed by hard interrupts, along with non-zero CPU
443+ time consumed by in-kernel execution.::
444+
445+ rcu: hardirqs softirqs csw/system
446+ rcu: number: 624 0 0
447+ rcu: cputime: 49 0 2446 ==> 2500(ms)
448+
449+ The fact that there are zero softirqs gives a hint that these were
450+ disabled, perhaps via local_bh_disable(). It is of course possible
451+ that there were no softirqs, perhaps because all events that would
452+ result in softirq execution are confined to other CPUs. In this case,
453+ the diagnosis should continue as shown in the next example.
454+
455+ 3. A CPU looping with preemption disabled.
456+
457+ Here, only the number of context switches is zero.::
458+
459+ rcu: hardirqs softirqs csw/system
460+ rcu: number: 624 45 0
461+ rcu: cputime: 69 1 2425 ==> 2500(ms)
462+
463+ This situation hints that the stalled CPU was looping with preemption
464+ disabled.
465+
466+ 4. No looping, but massive hard and soft interrupts.::
467+
468+ rcu: hardirqs softirqs csw/system
469+ rcu: number: xx xx 0
470+ rcu: cputime: xx xx 0 ==> 2500(ms)
471+
472+ Here, the number and CPU time of hard interrupts are all non-zero,
473+ but the number of context switches and the in-kernel CPU time consumed
474+ are zero. The number and cputime of soft interrupts will usually be
475+ non-zero, but could be zero, for example, if the CPU was spinning
476+ within a single hard interrupt handler.
477+
478+ If this type of RCU CPU stall warning can be reproduced, you can
479+ narrow it down by looking at /proc/interrupts or by writing code to
480+ trace each interrupt, for example, by referring to show_interrupts().
0 commit comments