[Date Prev][Date Next][Thread Prev][Thread Next][Date Index][Thread Index]

[xen stable-4.18] x86/shutdown: offline APs with interrupts disabled on all CPUs



commit 79116ff8bbd4e93cee8a8a6515061e3c803b397b
Author:     Roger Pau Monné <roger.pau@xxxxxxxxxx>
AuthorDate: Mon Feb 17 13:32:17 2025 +0100
Commit:     Jan Beulich <jbeulich@xxxxxxxx>
CommitDate: Mon Feb 17 13:32:17 2025 +0100

    x86/shutdown: offline APs with interrupts disabled on all CPUs
    
    The current shutdown logic in smp_send_stop() will disable the APs while
    having interrupts enabled on the BSP or possibly other APs. On AMD systems
    this can lead to local APIC errors:
    
    APIC error on CPU0: 00(08), Receive accept error
    
    Such error message can be printed in a loop, thus blocking the system from
    rebooting.  I assume this loop is created by the error being triggered by
    the console interrupt, which is further stirred by the ESR handler
    printing to the console.
    
    Intel SDM states:
    
    "Receive Accept Error.
    
    Set when the local APIC detects that the message it received was not
    accepted by any APIC on the APIC bus, including itself. Used only on P6
    family and Pentium processors."
    
    So the error shouldn't trigger on any Intel CPU supported by Xen.
    
    However AMD doesn't make such claims, and indeed the error is broadcast to
    all local APICs when an interrupt targets a CPU that's already offline.
    
    To prevent the error from stalling the shutdown process perform the
    disabling of APs and the BSP local APIC with interrupts disabled on all
    CPUs in the system, so that by the time interrupts are unmasked on the BSP
    the local APIC is already disabled.  This can still lead to a spurious:
    
    APIC error on CPU0: 00(00)
    
    As a result of an LVT Error getting injected while interrupts are masked on
    the CPU, and the vector only handled after the local APIC is already
    disabled.  ESR reports 0 because as part of disable_local_APIC() the ESR
    register is cleared.
    
    Note the NMI crash path doesn't have such issue, because disabling of APs
    and the caller local APIC is already done in the same contiguous region
    with interrupts disabled.  There's a possible window on the NMI crash path
    (nmi_shootdown_cpus()) where some APs might be disabled (and thus
    interrupts targeting them raising "Receive accept error") before others APs
    have interrupts disabled.  However the shutdown NMI will be handled,
    regardless of whether the AP is processing a local APIC error, and hence
    such interrupts will not cause the shutdown process to get stuck.
    
    Remove the call to fixup_irqs() in smp_send_stop(): it doesn't achieve the
    intended goal of moving all interrupts to the BSP anyway.  The logic in
    fixup_irqs() will move interrupts whose affinity doesn't overlap with the
    passed mask, but the movement of interrupts is done to any CPU set in
    cpu_online_map.  As in the shutdown path fixup_irqs() is called before APs
    are cleared from cpu_online_map this leads to interrupts being shuffled
    around, but not assigned to the BSP exclusively.
    
    The Fixes tag is more of a guess than a certainty; it's possible the
    previous sleep window in fixup_irqs() allowed any in-flight interrupt to be
    delivered before APs went offline.  However fixup_irqs() was still
    incorrectly used, as it didn't (and still doesn't) move all interrupts to
    target the provided cpu mask.
    
    Fixes: e2bb28d62158 ('x86/irq: forward pending interrupts to new 
destination in fixup_irqs()')
    Signed-off-by: Roger Pau Monné <roger.pau@xxxxxxxxxx>
    Reviewed-by: Jan Beulich <jbeulich@xxxxxxxx>
    master commit: 1191ce954f64244a3c5f553116184928bcc677e8
    master date: 2025-02-12 15:56:07 +0100
---
 xen/arch/x86/smp.c | 27 ++++++++++++++++++++-------
 1 file changed, 20 insertions(+), 7 deletions(-)

diff --git a/xen/arch/x86/smp.c b/xen/arch/x86/smp.c
index 340fcafb46..ed67eff8db 100644
--- a/xen/arch/x86/smp.c
+++ b/xen/arch/x86/smp.c
@@ -343,6 +343,11 @@ void __stop_this_cpu(void)
 
 static void cf_check stop_this_cpu(void *dummy)
 {
+    const bool *stop_aps = dummy;
+
+    while ( !*stop_aps )
+        cpu_relax();
+
     __stop_this_cpu();
     for ( ; ; )
         halt();
@@ -355,16 +360,25 @@ static void cf_check stop_this_cpu(void *dummy)
 void smp_send_stop(void)
 {
     unsigned int cpu = smp_processor_id();
+    bool stop_aps = false;
+
+    /*
+     * Perform AP offlining and disabling of interrupt controllers with all
+     * CPUs on the system having interrupts disabled to prevent interrupt
+     * delivery errors.  On AMD systems "Receive accept error" will be
+     * broadcast to local APICs if interrupts target CPUs that are offline.
+     */
+    if ( num_online_cpus() > 1 )
+        smp_call_function(stop_this_cpu, &stop_aps, 0);
+
+    local_irq_disable();
 
     if ( num_online_cpus() > 1 )
     {
         int timeout = 10;
 
-        local_irq_disable();
-        fixup_irqs(cpumask_of(cpu), 0);
-        local_irq_enable();
-
-        smp_call_function(stop_this_cpu, NULL, 0);
+        /* Signal APs to stop. */
+        stop_aps = true;
 
         /* Wait 10ms for all other CPUs to go offline. */
         while ( (num_online_cpus() > 1) && (timeout-- > 0) )
@@ -373,13 +387,12 @@ void smp_send_stop(void)
 
     if ( cpu_online(cpu) )
     {
-        local_irq_disable();
         disable_IO_APIC();
         hpet_disable();
         __stop_this_cpu();
         x2apic_enabled = (current_local_apic_mode() == APIC_MODE_X2APIC);
-        local_irq_enable();
     }
+    local_irq_enable();
 }
 
 void smp_send_nmi_allbutself(void)
--
generated by git-patchbot for /home/xen/git/xen.git#stable-4.18



 


Rackspace

Lists.xenproject.org is hosted with RackSpace, monitoring our
servers 24x7x365 and backed by RackSpace's Fanatical Support®.