
Re: [Xen-devel] [PATCH 1/2] x86/crash: Indicate how well nmi_shootdown_cpus() managed to do.



>>> On 24.09.13 at 21:56, Andrew Cooper <andrew.cooper3@xxxxxxxxxx> wrote:
> Having nmi_shootdown_cpus() report which pcpus failed to be shot down is a
> useful debugging hint as to what possibly went wrong (especially when the
> crash logs seem to indicate that an NMI timeout occurred while waiting for one
> of the problematic pcpus to perform an action).
> 
> This is achieved by swapping an atomic_t count of unreported pcpus with a
> cpumask.  In the case that the 1 second timeout occurs, use the cpumask to
> identify the problematic pcpus.
> 
> Signed-off-by: Andrew Cooper <andrew.cooper3@xxxxxxxxxx>
> CC: Keir Fraser <keir@xxxxxxx>
> CC: Jan Beulich <JBeulich@xxxxxxxx>
> CC: Tim Deegan <tim@xxxxxxx>
> 
> ---
> 
> We in XenServer have seen a few crashes like this recently, and having an
> extra bit of debugging on the serial console or in the conring is
> substantially more helpful than trying to piece the crash together
> after the fact based on what information is missing.
> ---
>  xen/arch/x86/crash.c |   20 ++++++++++++++++----
>  1 file changed, 16 insertions(+), 4 deletions(-)
> 
> diff --git a/xen/arch/x86/crash.c b/xen/arch/x86/crash.c
> index 0a807d1..5f0f07c 100644
> --- a/xen/arch/x86/crash.c
> +++ b/xen/arch/x86/crash.c
> @@ -22,6 +22,7 @@
>  #include <xen/perfc.h>
>  #include <xen/kexec.h>
>  #include <xen/sched.h>
> +#include <xen/keyhandler.h>
>  #include <public/xen.h>
>  #include <asm/shared.h>
>  #include <asm/hvm/support.h>
> @@ -30,7 +31,7 @@
>  #include <xen/iommu.h>
>  #include <asm/hpet.h>
>  
> -static atomic_t waiting_for_crash_ipi;
> +static cpumask_t waiting_to_crash;
>  static unsigned int crashing_cpu;
>  static DEFINE_PER_CPU_READ_MOSTLY(bool_t, crash_save_done);
>  
> @@ -65,7 +66,7 @@ void __attribute__((noreturn)) do_nmi_crash(struct cpu_user_regs *regs)
>          __stop_this_cpu();
>  
>          this_cpu(crash_save_done) = 1;
> -        atomic_dec(&waiting_for_crash_ipi);
> +        cpumask_clear_cpu(cpu, &waiting_to_crash);
>      }
>  
>      /* Poor mans self_nmi().  __stop_this_cpu() has reverted the LAPIC
> @@ -122,7 +123,8 @@ static void nmi_shootdown_cpus(void)
>      crashing_cpu = cpu;
>      local_irq_count(crashing_cpu) = 0;
>  
> -    atomic_set(&waiting_for_crash_ipi, num_online_cpus() - 1);
> +    cpumask_copy(&waiting_to_crash, &cpu_online_map);
> +    cpumask_clear_cpu(cpu, &waiting_to_crash);

cpumask_andnot(&waiting_to_crash, &cpu_online_map, cpumask_of(cpu));

Jan
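
For readers following the thread, here is a minimal standalone sketch of why the single call suggested above gives the same result as the copy-then-clear pair in the patch. It does not use the Xen headers: `mask_t`, `copy_then_clear()`, `andnot_single()` and `forms_agree()` are hypothetical names invented for this illustration, a 64-bit integer stands in for `cpumask_t`, and atomicity is ignored.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical stand-in for cpumask_t: one bit per CPU, at most 64 CPUs. */
typedef uint64_t mask_t;

/* Two-step form, as in the patch: copy the online map, then clear one CPU. */
mask_t copy_then_clear(mask_t online, unsigned int cpu)
{
    assert(cpu < 64);
    mask_t waiting = online;            /* analogue of cpumask_copy() */
    waiting &= ~((mask_t)1 << cpu);     /* analogue of cpumask_clear_cpu() */
    return waiting;
}

/* One-step form, as in the reply: online AND NOT the single-CPU mask. */
mask_t andnot_single(mask_t online, unsigned int cpu)
{
    assert(cpu < 64);
    mask_t self = (mask_t)1 << cpu;     /* analogue of cpumask_of() */
    return online & ~self;              /* analogue of cpumask_andnot() */
}

/* Returns 1 if both forms agree for every possible crashing CPU. */
int forms_agree(mask_t online)
{
    for (unsigned int cpu = 0; cpu < 64; cpu++)
        if (copy_then_clear(online, cpu) != andnot_single(online, cpu))
            return 0;
    return 1;
}
```

In this model the one-step form never leaves the crashing CPU's own bit set in the waiting mask, even briefly, whereas the two-step form does between its two statements.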

>  
>      /* Change NMI trap handlers.  Non-crashing pcpus get nmi_crash which
>       * invokes do_nmi_crash (above), which cause them to write state and
> @@ -162,12 +164,22 @@ static void nmi_shootdown_cpus(void)
>      smp_send_nmi_allbutself();
>  
>      msecs = 1000; /* Wait at most a second for the other cpus to stop */
> -    while ( (atomic_read(&waiting_for_crash_ipi) > 0) && msecs )
> +    while ( (cpumask_weight(&waiting_to_crash) > 0) && msecs )
>      {
>          mdelay(1);
>          msecs--;
>      }
>  
> +    /* Leave a hint of how well we did trying to shoot down the other cpus */
> +    if ( msecs )
> +        printk("Shot down all cpus\n");
> +    else
> +    {
> +        cpulist_scnprintf(keyhandler_scratch, sizeof keyhandler_scratch,
> +                          &waiting_to_crash);
> +        printk("Failed to shoot down cpus {%s}\n", keyhandler_scratch);
> +    }
> +
>      /* Crash shutdown any IOMMU functionality as the crashdump kernel is not
>       * happy when booting if interrupt/dma remapping is still enabled */
>      iommu_crash_shutdown();
> -- 
> 1.7.10.4
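
The reporting step in the final hunk can be modelled outside Xen with a short sketch. Everything below is an assumption made for illustration: `mask_t`, `format_cpu_list()` and `report_shootdown()` are hypothetical names, a 64-bit integer stands in for `cpumask_t`, and the list is printed as plain comma-separated CPU numbers rather than the range format a real cpulist helper may produce. The sketch also decides success by testing whether the mask is empty, which is a choice made here and differs from the patch's test on the remaining `msecs`.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <stdio.h>

/* Hypothetical stand-in for cpumask_t: one bit per CPU, at most 64 CPUs. */
typedef uint64_t mask_t;

/*
 * Write the CPUs set in 'mask' into 'buf' as a comma-separated list,
 * e.g. "1,5". Entries that do not fit are dropped whole. Returns the
 * number of characters written, excluding the terminating NUL.
 */
size_t format_cpu_list(char *buf, size_t len, mask_t mask)
{
    size_t used = 0;

    assert(buf != NULL && len > 0);
    buf[0] = '\0';

    for (unsigned int cpu = 0; cpu < 64; cpu++) {
        if (!(mask & ((mask_t)1 << cpu)))
            continue;

        int n = snprintf(buf + used, len - used, "%s%u",
                         used ? "," : "", cpu);
        if (n < 0 || (size_t)n >= len - used) {
            buf[used] = '\0';   /* discard the partially written entry */
            break;
        }
        used += (size_t)n;
    }
    return used;
}

/*
 * Build the hint message for the CPUs still marked as waiting.
 * Returns 1 if every CPU checked in, 0 otherwise.
 */
int report_shootdown(char *out, size_t len, mask_t waiting)
{
    char list[256];

    if (waiting == 0) {
        snprintf(out, len, "Shot down all cpus\n");
        return 1;
    }

    format_cpu_list(list, sizeof(list), waiting);
    snprintf(out, len, "Failed to shoot down cpus {%s}\n", list);
    return 0;
}
```

A caller would pass the mask as it stands once the wait loop has finished, then print the resulting string.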





