Commit 9f3b130

zhiquan1-li authored and bp3tk0v committed
x86/mce: Mark fatal MCE's page as poison to avoid panic in the kdump kernel
Memory errors don't happen very often, especially fatal ones. However,
in large-scale scenarios such as data centers, that probability
increases with the amount of machines present.

When a fatal machine check happens, mce_panic() is called based on the
severity grading of that error. The page containing the error is not
marked as poison.

However, when kexec is enabled, tools like makedumpfile understand when
pages are marked as poison and do not touch them so as not to cause
a fatal machine check exception again while dumping the previous
kernel's memory.

Therefore, mark the page containing the error as poisoned so that the
kexec'ed kernel can avoid accessing the page.

  [ bp: Rewrite commit message and comment. ]

Co-developed-by: Youquan Song <youquan.song@intel.com>
Signed-off-by: Youquan Song <youquan.song@intel.com>
Signed-off-by: Zhiquan Li <zhiquan1.li@intel.com>
Signed-off-by: Borislav Petkov (AMD) <bp@alien8.de>
Reviewed-by: Naoya Horiguchi <naoya.horiguchi@nec.com>
Link: https://lore.kernel.org/r/20231014051754.3759099-1-zhiquan1.li@intel.com
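
To illustrate why the poison mark matters on the dumping side, here is a minimal userspace sketch of a dump loop that skips pages flagged as poisoned. It is not makedumpfile's actual code: `struct mock_page`, `MOCK_PG_HWPOISON`, `MOCK_PAGE_SIZE` and `dump_pages()` are hypothetical names invented for this illustration, and the flag bit position is an assumption.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#define MOCK_PAGE_SIZE   4096u
#define MOCK_PG_HWPOISON (1u << 0)	/* hypothetical flag bit, not the kernel's */

/* Hypothetical stand-in for a page of the crashed kernel's memory. */
struct mock_page {
	uint32_t flags;
	uint8_t data[MOCK_PAGE_SIZE];
};

/*
 * Copy every page that is not marked poisoned into 'out' and return
 * the number of pages copied. Poisoned pages are never read, which is
 * the behavior the commit message relies on.
 */
static size_t dump_pages(const struct mock_page *pages, size_t npages,
			 uint8_t *out)
{
	size_t copied = 0;

	assert(pages != NULL && out != NULL);

	for (size_t i = 0; i < npages; i++) {
		if (pages[i].flags & MOCK_PG_HWPOISON)
			continue;	/* do not touch the bad page */

		memcpy(out + copied * MOCK_PAGE_SIZE, pages[i].data,
		       MOCK_PAGE_SIZE);
		copied++;
	}

	return copied;
}
```

Without the flag set by the panicking kernel, a loop like this has no way to know which page to avoid, so it would read the faulty memory and could trigger a second machine check.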
1 parent b85ea95 commit 9f3b130

1 file changed

Lines changed: 16 additions & 0 deletions

arch/x86/kernel/cpu/mce/core.c

@@ -44,6 +44,7 @@
 #include <linux/sync_core.h>
 #include <linux/task_work.h>
 #include <linux/hardirq.h>
+#include <linux/kexec.h>
 
 #include <asm/intel-family.h>
 #include <asm/processor.h>
@@ -233,6 +234,7 @@ static noinstr void mce_panic(const char *msg, struct mce *final, char *exp)
 	struct llist_node *pending;
 	struct mce_evt_llist *l;
 	int apei_err = 0;
+	struct page *p;
 
 	/*
 	 * Allow instrumentation around external facilities usage. Not that it
@@ -286,6 +288,20 @@ static noinstr void mce_panic(const char *msg, struct mce *final, char *exp)
 	if (!fake_panic) {
 		if (panic_timeout == 0)
 			panic_timeout = mca_cfg.panic_timeout;
+
+		/*
+		 * Kdump skips the poisoned page in order to avoid
+		 * touching the error bits again. Poison the page even
+		 * if the error is fatal and the machine is about to
+		 * panic.
+		 */
+		if (kexec_crash_loaded()) {
+			if (final && (final->status & MCI_STATUS_ADDRV)) {
+				p = pfn_to_online_page(final->addr >> PAGE_SHIFT);
+				if (p)
+					SetPageHWPoison(p);
+			}
+		}
 		panic(msg);
 	} else
 		pr_emerg(HW_ERR "Fake kernel panic: %s\n", msg);
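
The guard in the last hunk can be sketched in userspace to show how the error address becomes a page frame number. This is only an illustration under stated assumptions: `struct mock_mce`, `mce_error_pfn()` and the `MOCK_` constants are hypothetical names. The constants assume 4 KiB pages (a shift of 12) and the address-valid flag at bit 58 of the status register. The kernel's `pfn_to_online_page()` lookup and `SetPageHWPoison()` call are not modeled.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

#define MOCK_PAGE_SHIFT       12		/* assumes 4 KiB pages */
#define MOCK_MCI_STATUS_ADDRV (1ULL << 58)	/* assumed "address valid" bit */

/* Hypothetical, reduced stand-in for the kernel's struct mce. */
struct mock_mce {
	uint64_t status;
	uint64_t addr;
};

/*
 * Store the page frame number of the failing address in *pfn and
 * return true, but only when a record exists and it reports a valid
 * address. Otherwise return false and leave *pfn untouched.
 */
static bool mce_error_pfn(const struct mock_mce *m, uint64_t *pfn)
{
	assert(pfn != NULL);

	if (!m || !(m->status & MOCK_MCI_STATUS_ADDRV))
		return false;

	*pfn = m->addr >> MOCK_PAGE_SHIFT;
	return true;
}
```

Both checks matter: `final` may be NULL when no single record was singled out, and without the address-valid bit the address field does not identify a page to poison.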
