Skip to content

Commit 6dcde5d

Browse files
diandersakpm00
authored andcommitted
watchdog/hardlockup: adopt softlockup logic avoiding double-dumps
Patch series "watchdog: Better handling of concurrent lockups". When we get multiple lockups at roughly the same time, the output in the kernel logs can be very confusing since the reports about the lockups end up interleaved in the logs. There is some code in the kernel to try to handle this but it wasn't that complete. Li Zhe recently made this a bit better for softlockups (specifically for the case where `kernel.softlockup_all_cpu_backtrace` is not set) in commit 9d02330 ("softlockup: serialized softlockup's log"), but that only handled softlockup reports. Hardlockup reports still had similar issues. This series also has a small fix to avoid dumping all stacks a second time in the case of a panic. This is a bit unrelated to the interleaving fixes but it does also improve the clarity of lockup reports. This patch (of 4): The hardlockup detector and softlockup detector both have the ability to dump the stack of all CPUs (`kernel.hardlockup_all_cpu_backtrace` and `kernel.softlockup_all_cpu_backtrace`). Both detectors also have some logic to attempt to avoid interleaving printouts if two CPUs were trying to do dumps of all CPUs at the same time. However: - The hardlockup detector's logic still allowed interleaving some information. Specifically another CPU could print modules and dump the stack of the locked CPU at the same time we were dumping all CPUs. - In the case where `kernel.hardlockup_panic` was set in addition to `kernel.hardlockup_all_cpu_backtrace`, when two CPUs both detected hardlockups at the same time the second CPU could call panic() while the first was still dumping stacks. This was especially bad if the locked up CPU wasn't responding to the request for a backtrace since the function nmi_trigger_cpumask_backtrace() can wait up to 10 seconds. Let's resolve this by adopting the softlockup logic in the hardlockup handler. NOTES: - As part of this, one might think that we should make a helper function that both the hard and softlockup detectors call. This turns out not to be super trivial since it would have to be parameterized quite a bit since there are separate global variables controlling each lockup detector and they print log messages that are just different enough that it would be a pain. We probably don't want to change the messages that are printed without good reason to avoid throwing log parsers for a loop. - One might also think that it would be a good idea to have the hardlockup and softlockup detector use the same global variable to prevent interleaving. This would make sure that softlockups and hardlockups can't interleave each other. That _almost_ works but has a dangerous flaw if `kernel.hardlockup_panic` is not the same as `kernel.softlockup_panic` because we might skip a call to panic() if one type of lockup was detected at the same time as another. Link: https://lkml.kernel.org/r/20231220211640.2023645-1-dianders@chromium.org Link: https://lkml.kernel.org/r/20231220131534.1.I4f35a69fbb124b5f0c71f75c631e11fabbe188ff@changeid Signed-off-by: Douglas Anderson <dianders@chromium.org> Cc: John Ogness <john.ogness@linutronix.de> Cc: Lecopzer Chen <lecopzer.chen@mediatek.com> Cc: Li Zhe <lizhe.67@bytedance.com> Cc: Petr Mladek <pmladek@suse.com> Cc: Pingfan Liu <kernelfans@gmail.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
1 parent 2861b37 commit 6dcde5d

1 file changed

Lines changed: 13 additions & 7 deletions

File tree

kernel/watchdog.c

Lines changed: 13 additions & 7 deletions
Original file line numberDiff line numberDiff line change
@@ -91,7 +91,7 @@ static DEFINE_PER_CPU(atomic_t, hrtimer_interrupts);
9191
static DEFINE_PER_CPU(int, hrtimer_interrupts_saved);
9292
static DEFINE_PER_CPU(bool, watchdog_hardlockup_warned);
9393
static DEFINE_PER_CPU(bool, watchdog_hardlockup_touched);
94-
static unsigned long watchdog_hardlockup_all_cpu_dumped;
94+
static unsigned long hard_lockup_nmi_warn;
9595

9696
notrace void arch_touch_nmi_watchdog(void)
9797
{
@@ -156,6 +156,15 @@ void watchdog_hardlockup_check(unsigned int cpu, struct pt_regs *regs)
156156
if (per_cpu(watchdog_hardlockup_warned, cpu))
157157
return;
158158

159+
/*
160+
* Prevent multiple hard-lockup reports if one cpu is already
161+
* engaged in dumping all cpu back traces.
162+
*/
163+
if (sysctl_hardlockup_all_cpu_backtrace) {
164+
if (test_and_set_bit_lock(0, &hard_lockup_nmi_warn))
165+
return;
166+
}
167+
159168
pr_emerg("Watchdog detected hard LOCKUP on cpu %d\n", cpu);
160169
print_modules();
161170
print_irqtrace_events(current);
@@ -168,13 +177,10 @@ void watchdog_hardlockup_check(unsigned int cpu, struct pt_regs *regs)
168177
trigger_single_cpu_backtrace(cpu);
169178
}
170179

171-
/*
172-
* Perform multi-CPU dump only once to avoid multiple
173-
* hardlockups generating interleaving traces
174-
*/
175-
if (sysctl_hardlockup_all_cpu_backtrace &&
176-
!test_and_set_bit(0, &watchdog_hardlockup_all_cpu_dumped))
180+
if (sysctl_hardlockup_all_cpu_backtrace) {
177181
trigger_allbutcpu_cpu_backtrace(cpu);
182+
clear_bit_unlock(0, &hard_lockup_nmi_warn);
183+
}
178184

179185
if (hardlockup_panic)
180186
nmi_panic(regs, "Hard LOCKUP");

0 commit comments

Comments
 (0)