[Date Prev][Date Next][Thread Prev][Thread Next][Date Index][Thread Index] [PATCH v2 1/5] x86/shutdown: offline APs with interrupts disabled on all CPUs
The current shutdown logic in smp_send_stop() will disable the APs while having interrupts enabled on the BSP or possibly other APs. On AMD systems this can lead to local APIC errors: APIC error on CPU0: 00(08), Receive accept error Such error message can be printed in a loop, thus blocking the system from rebooting. I assume this loop is created by the error being triggered by the console interrupt, which is further stirred by the ESR handler printing to the console. Intel SDM states: "Receive Accept Error. Set when the local APIC detects that the message it received was not accepted by any APIC on the APIC bus, including itself. Used only on P6 family and Pentium processors." So the error shouldn't trigger on any Intel CPU supported by Xen. However AMD doesn't make such claims, and indeed the error is broadcasted to all local APICs when an interrupt targets a CPU that's already offline. To prevent the error from stalling the shutdown process perform the disabling of APs and the BSP local APIC with interrupts disabled on all CPUs in the system, so that by the time interrupts are unmasked on the BSP the local APIC is already disabled. This can still lead to a spurious: APIC error on CPU0: 00(00) As a result of an LVT Error getting injected while interrupts are masked on the CPU, and the vector only handled after the local APIC is already disabled. Note the NMI crash path doesn't have such issue, because disabling of APs and the caller local APIC is already done in the same contiguous region with interrupts disabled. There's a possible window on the NMI crash path (nmi_shootdown_cpus()) where some APs might be disabled (and thus interrupts targeting them raising "Receive accept error") before others APs have interrupts disabled. However the shutdown NMI will be handled, regardless of whether the AP is processing a local APIC error, and hence such interrupts will not cause the shutdown process to get stuck. Remove the call to fixup_irqs() in smp_send_stop(), as it doesn't achieve the intended goal of moving all interrupts to the BSP anyway, because when called the APs are still set as online in cpu_online_map. Signed-off-by: Roger Pau Monné <roger.pau@xxxxxxxxxx> --- Changes since v1: - Change the approach and instead rely on having interrupts uniformly disabled when offlining APs. --- xen/arch/x86/smp.c | 27 ++++++++++++++++++++------- 1 file changed, 20 insertions(+), 7 deletions(-) diff --git a/xen/arch/x86/smp.c b/xen/arch/x86/smp.c index 02a6ed7593f3..074baae2cc3b 100644 --- a/xen/arch/x86/smp.c +++ b/xen/arch/x86/smp.c @@ -345,6 +345,11 @@ void __stop_this_cpu(void) static void cf_check stop_this_cpu(void *dummy) { + const bool *stop_aps = dummy; + + while ( !*stop_aps ) + cpu_relax(); + __stop_this_cpu(); for ( ; ; ) halt(); @@ -357,16 +362,25 @@ static void cf_check stop_this_cpu(void *dummy) void smp_send_stop(void) { unsigned int cpu = smp_processor_id(); + bool stop_aps = false; + + /* + * Perform AP offlining and disabling of interrupt controllers with all + * CPUs on the system having interrupts disabled to prevent interrupt + * delivery errors. On AMD systems "Receive accept error" will be + * broadcasted to local APICs if interrupts target CPUs that are offline. + */ + if ( num_online_cpus() > 1 ) + smp_call_function(stop_this_cpu, &stop_aps, 0); + + local_irq_disable(); if ( num_online_cpus() > 1 ) { int timeout = 10; - local_irq_disable(); - fixup_irqs(cpumask_of(cpu), 0); - local_irq_enable(); - - smp_call_function(stop_this_cpu, NULL, 0); + /* Signal APs to stop. */ + stop_aps = true; /* Wait 10ms for all other CPUs to go offline. */ while ( (num_online_cpus() > 1) && (timeout-- > 0) ) @@ -375,13 +389,12 @@ void smp_send_stop(void) if ( cpu_online(cpu) ) { - local_irq_disable(); disable_IO_APIC(); hpet_disable(); __stop_this_cpu(); x2apic_enabled = (current_local_apic_mode() == APIC_MODE_X2APIC); - local_irq_enable(); } + local_irq_enable(); } void smp_send_nmi_allbutself(void) -- 2.46.0
|
![]() |
Lists.xenproject.org is hosted with RackSpace, monitoring our |