Commit 88afbb2
Merge tag 'x86-core-2023-06-26' of ssh://gitolite.kernel.org/pub/scm/linux/kernel/git/tip/tip
Pull x86 core updates from Thomas Gleixner:
 "A set of fixes for kexec(), reboot and shutdown issues:

   - Ensure that the WBINVD in stop_this_cpu() has been completed
     before the control CPU proceeds.

     stop_this_cpu() is used for kexec(), reboot and shutdown to park
     the APs in a HLT loop.

     The control CPU sends an IPI to the APs and waits for their CPU
     online bits to be cleared. Once they all are marked "offline" it
     proceeds.

     But stop_this_cpu() clears the CPU online bit before issuing
     WBINVD, which means there is no guarantee that the AP has reached
     the HLT loop. This was reported to cause intermittent
     reboot/shutdown failures due to some dubious interaction with the
     firmware.

     This is not only a problem of WBINVD. The code which actually
     "stops" the CPU, running between clearing the online bit and
     reaching the HLT loop, can cause large enough delays on its own
     (think virtualization). That's especially dangerous for kexec(),
     as kexec() expects that all APs are in a safe state and not
     executing code while the boot CPU jumps to the new kernel. There
     are more issues vs. kexec() which are addressed separately.

     Cure this by implementing an explicit synchronization point right
     before the AP reaches HLT. This guarantees that the AP has
     completed the full stop procedure.

   - Fix the condition for WBINVD in stop_this_cpu().

     The WBINVD in stop_this_cpu() is required to ensure that when
     switching to or from memory encryption no dirty data is left in
     the cache lines, which could cause a write back in the wrong mode
     later. It checks CPUID directly because the feature bit might have
     been cleared due to a command line option.

     But that CPUID check accesses leaf 0x8000001f::EAX
     unconditionally. Intel CPUs return the content of the highest
     supported leaf when a non-existing leaf is read, while AMD CPUs
     return all zeros for unsupported leaves.

     So the result of the test on Intel CPUs is a lottery, and on AMD
     it's just correct by chance.

     While harmless, it's incorrect and causes the conditional wbinvd()
     to be issued where not required, which is what unearthed the issue
     above.

   - Make kexec() robust against AP code execution.

     Ashok observed triple faults when doing kexec() on a system which
     had been booted with "nosmt".

     It turned out that the SMT siblings which had been brought up
     partially are parked in mwait_play_dead() to enable power savings.

     mwait_play_dead() monitors the thread flags of the AP's idle task,
     which was chosen because it's unlikely to be written to.

     But kexec() can overwrite the previous kernel's text and data,
     including page tables etc. When it overwrites the cache lines
     monitored by an AP, that AP resumes execution after the MWAIT on
     the (by then overwritten) text, stack and page tables, which can
     easily end up in a triple fault.

     Make this more robust in several steps:

      1) Use an explicit per-CPU cache line for monitoring.

      2) Write a command to these cache lines to kick APs out of MWAIT
         before proceeding with kexec(), shutdown or reboot. The APs
         confirm the wakeup by writing status back and then enter a
         HLT loop.

      3) If the system uses INIT/INIT/STARTUP for AP bringup, park the
         APs in INIT state.

         HLT is not a guarantee that an AP won't wake up and resume
         execution. HLT is woken up by NMI and SMI. SMI puts the CPU
         back into HLT (+/- firmware bugs), but an NMI is delivered to
         the CPU, which then executes the NMI handler - the same issue
         as the MWAIT scenario described above.

         Sending an INIT/INIT sequence to the APs puts them into "wait
         for STARTUP" state, which is safe against NMI.

     There is still one remaining issue which can't be fixed: #MCE

     If the AP sits in HLT and receives a broadcast #MCE, it will try
     to handle it, with the obvious consequences. INIT/INIT clears
     CR4.MCE in the AP, which will cause a broadcast #MCE to shut down
     the machine.

     So there is a choice between fire (HLT) and frying pan (INIT).
     Frying pan has been chosen, as it at least prevents the NMI issue.

     On systems which are not using INIT/INIT/STARTUP there is not much
     which can be done right now, but at least the obvious and easy to
     trigger MWAIT issue has been addressed"

* tag 'x86-core-2023-06-26' of ssh://gitolite.kernel.org/pub/scm/linux/kernel/git/tip/tip:
  x86/smp: Put CPUs into INIT on shutdown if possible
  x86/smp: Split sending INIT IPI out into a helper function
  x86/smp: Cure kexec() vs. mwait_play_dead() breakage
  x86/smp: Use dedicated cache-line for mwait_play_dead()
  x86/smp: Remove pointless wmb()s from native_stop_other_cpus()
  x86/smp: Dont access non-existing CPUID leaf
  x86/smp: Make stop_other_cpus() more robust
2 parents cd336f6 + 45e34c8 commit 88afbb2

5 files changed

Lines changed: 217 additions & 72 deletions

File tree

arch/x86/include/asm/cpu.h

Lines changed: 2 additions & 0 deletions
@@ -95,4 +95,6 @@ extern u64 x86_read_arch_cap_msr(void);
 int intel_find_matching_signature(void *mc, unsigned int csig, int cpf);
 int intel_microcode_sanity_check(void *mc, bool print_err, int hdr_type);
 
+extern struct cpumask cpus_stop_mask;
+
 #endif /* _ASM_X86_CPU_H */

arch/x86/include/asm/smp.h

Lines changed: 4 additions & 0 deletions
@@ -127,11 +127,15 @@ void play_dead_common(void);
 void wbinvd_on_cpu(int cpu);
 int wbinvd_on_all_cpus(void);
 
+void smp_kick_mwait_play_dead(void);
+
 void native_smp_send_reschedule(int cpu);
 void native_send_call_func_ipi(const struct cpumask *mask);
 void native_send_call_func_single_ipi(int cpu);
 void x86_idle_thread_init(unsigned int cpu, struct task_struct *idle);
 
+bool smp_park_other_cpus_in_init(void);
+
 void smp_store_boot_cpu_info(void);
 void smp_store_cpu_info(int id);

arch/x86/kernel/process.c

Lines changed: 24 additions & 4 deletions
@@ -759,15 +759,26 @@ bool xen_set_default_idle(void)
 }
 #endif
 
+struct cpumask cpus_stop_mask;
+
 void __noreturn stop_this_cpu(void *dummy)
 {
+	struct cpuinfo_x86 *c = this_cpu_ptr(&cpu_info);
+	unsigned int cpu = smp_processor_id();
+
 	local_irq_disable();
+
 	/*
-	 * Remove this CPU:
+	 * Remove this CPU from the online mask and disable it
+	 * unconditionally. This might be redundant in case that the reboot
+	 * vector was handled late and stop_other_cpus() sent an NMI.
+	 *
+	 * According to SDM and APM NMIs can be accepted even after soft
+	 * disabling the local APIC.
 	 */
-	set_cpu_online(smp_processor_id(), false);
+	set_cpu_online(cpu, false);
 	disable_local_APIC();
-	mcheck_cpu_clear(this_cpu_ptr(&cpu_info));
+	mcheck_cpu_clear(c);
 
 	/*
 	 * Use wbinvd on processors that support SME. This provides support
@@ -781,8 +792,17 @@ void __noreturn stop_this_cpu(void *dummy)
 	 * Test the CPUID bit directly because the machine might've cleared
 	 * X86_FEATURE_SME due to cmdline options.
 	 */
-	if (cpuid_eax(0x8000001f) & BIT(0))
+	if (c->extended_cpuid_level >= 0x8000001f && (cpuid_eax(0x8000001f) & BIT(0)))
 		native_wbinvd();
+
+	/*
+	 * This brings a cache line back and dirties it, but
+	 * native_stop_other_cpus() will overwrite cpus_stop_mask after it
+	 * observed that all CPUs reported stop. This write will invalidate
+	 * the related cache line on this CPU.
+	 */
+	cpumask_clear_cpu(cpu, &cpus_stop_mask);
+
 	for (;;) {
 		/*
 		 * Use native_halt() so that memory contents don't change

arch/x86/kernel/smp.c

Lines changed: 74 additions & 30 deletions
@@ -21,12 +21,14 @@
 #include <linux/interrupt.h>
 #include <linux/cpu.h>
 #include <linux/gfp.h>
+#include <linux/kexec.h>
 
 #include <asm/mtrr.h>
 #include <asm/tlbflush.h>
 #include <asm/mmu_context.h>
 #include <asm/proto.h>
 #include <asm/apic.h>
+#include <asm/cpu.h>
 #include <asm/idtentry.h>
 #include <asm/nmi.h>
 #include <asm/mce.h>
@@ -129,7 +131,7 @@ static int smp_stop_nmi_callback(unsigned int val, struct pt_regs *regs)
 }
 
 /*
- * this function calls the 'stop' function on all other CPUs in the system.
+ * Disable virtualization, APIC etc. and park the CPU in a HLT loop
  */
 DEFINE_IDTENTRY_SYSVEC(sysvec_reboot)
 {
@@ -146,76 +148,118 @@ static int register_stop_handler(void)
 
 static void native_stop_other_cpus(int wait)
 {
-	unsigned long flags;
-	unsigned long timeout;
+	unsigned int cpu = smp_processor_id();
+	unsigned long flags, timeout;
 
 	if (reboot_force)
 		return;
 
-	/*
-	 * Use an own vector here because smp_call_function
-	 * does lots of things not suitable in a panic situation.
-	 */
+	/* Only proceed if this is the first CPU to reach this code */
+	if (atomic_cmpxchg(&stopping_cpu, -1, cpu) != -1)
+		return;
+
+	/* For kexec, ensure that offline CPUs are out of MWAIT and in HLT */
+	if (kexec_in_progress)
+		smp_kick_mwait_play_dead();
 
 	/*
-	 * We start by using the REBOOT_VECTOR irq.
-	 * The irq is treated as a sync point to allow critical
-	 * regions of code on other cpus to release their spin locks
-	 * and re-enable irqs. Jumping straight to an NMI might
-	 * accidentally cause deadlocks with further shutdown/panic
-	 * code. By syncing, we give the cpus up to one second to
-	 * finish their work before we force them off with the NMI.
+	 * 1) Send an IPI on the reboot vector to all other CPUs.
+	 *
+	 *    The other CPUs should react on it after leaving critical
+	 *    sections and re-enabling interrupts. They might still hold
+	 *    locks, but there is nothing which can be done about that.
+	 *
+	 * 2) Wait for all other CPUs to report that they reached the
+	 *    HLT loop in stop_this_cpu()
+	 *
+	 * 3) If the system uses INIT/STARTUP for CPU bringup, then
+	 *    send all present CPUs an INIT vector, which brings them
+	 *    completely out of the way.
+	 *
+	 * 4) If #3 is not possible and #2 timed out send an NMI to the
+	 *    CPUs which did not yet report
+	 *
+	 * 5) Wait for all other CPUs to report that they reached the
+	 *    HLT loop in stop_this_cpu()
+	 *
+	 * #4 can obviously race against a CPU reaching the HLT loop late.
+	 * That CPU will have reported already and the "have all CPUs
+	 * reached HLT" condition will be true despite the fact that the
+	 * other CPU is still handling the NMI. Again, there is no
+	 * protection against that as "disabled" APICs still respond to
+	 * NMIs.
 	 */
-	if (num_online_cpus() > 1) {
-		/* did someone beat us here? */
-		if (atomic_cmpxchg(&stopping_cpu, -1, safe_smp_processor_id()) != -1)
-			return;
-
-		/* sync above data before sending IRQ */
-		wmb();
+	cpumask_copy(&cpus_stop_mask, cpu_online_mask);
+	cpumask_clear_cpu(cpu, &cpus_stop_mask);
 
+	if (!cpumask_empty(&cpus_stop_mask)) {
 		apic_send_IPI_allbutself(REBOOT_VECTOR);
 
 		/*
 		 * Don't wait longer than a second for IPI completion. The
 		 * wait request is not checked here because that would
-		 * prevent an NMI shutdown attempt in case that not all
+		 * prevent an NMI/INIT shutdown in case that not all
 		 * CPUs reach shutdown state.
 		 */
 		timeout = USEC_PER_SEC;
-		while (num_online_cpus() > 1 && timeout--)
+		while (!cpumask_empty(&cpus_stop_mask) && timeout--)
 			udelay(1);
 	}
 
-	/* if the REBOOT_VECTOR didn't work, try with the NMI */
-	if (num_online_cpus() > 1) {
+	/*
+	 * Park all other CPUs in INIT including "offline" CPUs, if
+	 * possible. That's a safe place where they can't resume execution
+	 * of HLT and then execute the HLT loop from overwritten text or
+	 * page tables.
+	 *
+	 * The only downside is a broadcast MCE, but up to the point where
+	 * the kexec() kernel brought all APs online again an MCE will just
+	 * make HLT resume and handle the MCE. The machine crashes and burns
+	 * due to overwritten text, page tables and data. So there is a
+	 * choice between fire and frying pan. The result is pretty much
+	 * the same. Chose frying pan until x86 provides a sane mechanism
+	 * to park a CPU.
+	 */
+	if (smp_park_other_cpus_in_init())
+		goto done;
+
+	/*
+	 * If park with INIT was not possible and the REBOOT_VECTOR didn't
+	 * take all secondary CPUs offline, try with the NMI.
+	 */
+	if (!cpumask_empty(&cpus_stop_mask)) {
 		/*
 		 * If NMI IPI is enabled, try to register the stop handler
 		 * and send the IPI. In any case try to wait for the other
 		 * CPUs to stop.
 		 */
 		if (!smp_no_nmi_ipi && !register_stop_handler()) {
-			/* Sync above data before sending IRQ */
-			wmb();
-
 			pr_emerg("Shutting down cpus with NMI\n");
 
-			apic_send_IPI_allbutself(NMI_VECTOR);
+			for_each_cpu(cpu, &cpus_stop_mask)
+				apic->send_IPI(cpu, NMI_VECTOR);
 		}
 		/*
 		 * Don't wait longer than 10 ms if the caller didn't
 		 * request it. If wait is true, the machine hangs here if
 		 * one or more CPUs do not reach shutdown state.
 		 */
		timeout = USEC_PER_MSEC * 10;
-		while (num_online_cpus() > 1 && (wait || timeout--))
+		while (!cpumask_empty(&cpus_stop_mask) && (wait || timeout--))
 			udelay(1);
 	}
 
+done:
 	local_irq_save(flags);
 	disable_local_APIC();
 	mcheck_cpu_clear(this_cpu_ptr(&cpu_info));
 	local_irq_restore(flags);
+
+	/*
+	 * Ensure that the cpus_stop_mask cache lines are invalidated on
+	 * the other CPUs. See comment vs. SME in stop_this_cpu().
+	 */
+	cpumask_clear(&cpus_stop_mask);
 }
 
 /*
