Skip to content

Commit 0f90b88

Browse files
pmladektorvalds
authored andcommitted
watchdog: reliable handling of timestamps
Commit 9bf3bc9 ("watchdog: cleanup handling of false positives") tried to handle a virtual host stopped by the host a more straightforward and cleaner way. But it introduced a risk of false softlockup reports. The virtual host might be stopped at any time, for example between kvm_check_and_clear_guest_paused() and is_softlockup(). As a result, is_softlockup() might read the updated jiffies and detects a softlockup. A solution might be to put back kvm_check_and_clear_guest_paused() after is_softlockup() and detect it. But it would put back the cycle that complicates the logic. In fact, the handling of all the timestamps is not reliable. The code does not guarantee when and how many times the timestamps are read. For example, "period_ts" might be touched anytime also from NMI and re-read in is_softlockup(). It works just by chance. Fix all the problems by making the code even more explicit. 1. Make sure that "now" and "period_ts" timestamps are read only once. They might be changed at anytime by NMI or when the virtual guest is stopped by the host. Note that "now" timestamp does this implicitly because "jiffies" is marked volatile. 2. "now" time must be read first. The state of "period_ts" will decide whether it will be used or the period will get restarted. 3. kvm_check_and_clear_guest_paused() must be called before reading "period_ts". It touches the variable when the guest was stopped. As a result, "now" timestamp is used only when the watchdog was not touched and the guest not stopped in the meantime. "period_ts" is restarted in all other situations. Link: https://lkml.kernel.org/r/YKT55gw+RZfyoFf7@alley Fixes: 9bf3bc9 ("watchdog: cleanup handling of false positives") Signed-off-by: Petr Mladek <pmladek@suse.com> Reported-by: Sergey Senozhatsky <senozhatsky@chromium.org> Reviewed-by: Sergey Senozhatsky <senozhatsky@chromium.org> Cc: Peter Zijlstra <peterz@infradead.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
1 parent f70b004 commit 0f90b88

1 file changed

Lines changed: 20 additions & 14 deletions

File tree

kernel/watchdog.c

Lines changed: 20 additions & 14 deletions
Original file line numberDiff line numberDiff line change
@@ -302,10 +302,10 @@ void touch_softlockup_watchdog_sync(void)
302302
__this_cpu_write(watchdog_report_ts, SOFTLOCKUP_DELAY_REPORT);
303303
}
304304

305-
static int is_softlockup(unsigned long touch_ts, unsigned long period_ts)
305+
static int is_softlockup(unsigned long touch_ts,
306+
unsigned long period_ts,
307+
unsigned long now)
306308
{
307-
unsigned long now = get_timestamp();
308-
309309
if ((watchdog_enabled & SOFT_WATCHDOG_ENABLED) && watchdog_thresh){
310310
/* Warn about unreasonable delays. */
311311
if (time_after(now, period_ts + get_softlockup_thresh()))
@@ -353,8 +353,7 @@ static int softlockup_fn(void *data)
353353
/* watchdog kicker functions */
354354
static enum hrtimer_restart watchdog_timer_fn(struct hrtimer *hrtimer)
355355
{
356-
unsigned long touch_ts = __this_cpu_read(watchdog_touch_ts);
357-
unsigned long period_ts = __this_cpu_read(watchdog_report_ts);
356+
unsigned long touch_ts, period_ts, now;
358357
struct pt_regs *regs = get_irq_regs();
359358
int duration;
360359
int softlockup_all_cpu_backtrace = sysctl_softlockup_all_cpu_backtrace;
@@ -376,12 +375,23 @@ static enum hrtimer_restart watchdog_timer_fn(struct hrtimer *hrtimer)
376375
/* .. and repeat */
377376
hrtimer_forward_now(hrtimer, ns_to_ktime(sample_period));
378377

378+
/*
379+
* Read the current timestamp first. It might become invalid anytime
380+
* when a virtual machine is stopped by the host or when the watchog
381+
* is touched from NMI.
382+
*/
383+
now = get_timestamp();
379384
/*
380385
* If a virtual machine is stopped by the host it can look to
381-
* the watchdog like a soft lockup. Check to see if the host
382-
* stopped the vm before we process the timestamps.
386+
* the watchdog like a soft lockup. This function touches the watchdog.
383387
*/
384388
kvm_check_and_clear_guest_paused();
389+
/*
390+
* The stored timestamp is comparable with @now only when not touched.
391+
* It might get touched anytime from NMI. Make sure that is_softlockup()
392+
* uses the same (valid) value.
393+
*/
394+
period_ts = READ_ONCE(*this_cpu_ptr(&watchdog_report_ts));
385395

386396
/* Reset the interval when touched by known problematic code. */
387397
if (period_ts == SOFTLOCKUP_DELAY_REPORT) {
@@ -398,13 +408,9 @@ static enum hrtimer_restart watchdog_timer_fn(struct hrtimer *hrtimer)
398408
return HRTIMER_RESTART;
399409
}
400410

401-
/* check for a softlockup
402-
* This is done by making sure a high priority task is
403-
* being scheduled. The task touches the watchdog to
404-
* indicate it is getting cpu time. If it hasn't then
405-
* this is a good indication some task is hogging the cpu
406-
*/
407-
duration = is_softlockup(touch_ts, period_ts);
411+
/* Check for a softlockup. */
412+
touch_ts = __this_cpu_read(watchdog_touch_ts);
413+
duration = is_softlockup(touch_ts, period_ts, now);
408414
if (unlikely(duration)) {
409415
/*
410416
* Prevent multiple soft-lockup reports if one cpu is already

0 commit comments

Comments
 (0)