[Xen-devel] NMI with SMP domain causing machine to reboot
I have spent most of the last few weeks trying to nail down a nasty bug that is preventing me from releasing xenoprof for SMP domains. The bug is non-deterministic, and when it happens the machine just reboots with no message or warning on the serial console. This has made the debugging process painful and slow.

I started removing specific components of the xenoprof code to find which one is causing the problem. After removing almost all of the code, it seems the bug is associated with NMIs. Right now the only code left is the code that programs a hardware performance counter to count "non-halted" clock cycles (hard-coded) and handles the resulting NMIs. All other logic has been removed, and I am still seeing the machine reboot at some non-deterministic time.

I am starting to suspect this might be a Xen bug, and I will probably need some help from the Xen core team to nail it down. I have attached a patch that enables Xen to program the performance counter and handle the NMIs it generates. I have also attached a patch for a new user-level test tool that starts the performance counter. I hope these patches enable others to reproduce the behavior I am observing.

I only see this bug when running SMP domains (either dom0 or domU) with NMIs being generated. My machine has two CPUs with hyperthreading disabled. When I boot an SMP domain0 (with 2 VCPUs), I only see the bug when NMIs are generated for CPU 1. Surprisingly, I have never seen the reboot when NMIs are generated on CPU 0 only. Since the bug is non-deterministic, it is possible that it is still there but for some reason is not triggered by NMIs on CPU 0.

Here is the sequence of steps that I use to trigger the bug (on an SMP dom0 with 2 VCPUs):

1) initialize the performance counter
   > xenpmc -i
2) start the counter
   > xenpmc -g
3) verify that NMIs are being generated
   > xenpmc -s
   This causes a count of NMIs for [CPU0,CPU1] to be printed.
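The three steps above can be wrapped in a small helper so that the setup is repeated identically on every run. This is only a sketch: the function name setup_nmi_test and the XENPMC override variable are made up for illustration and are not part of the attached patches. It assumes the xenpmc tool from nmitest_tools.patch is on the PATH and returns a non-zero exit status on failure, which I have not verified.

```shell
#!/bin/bash
# Sketch only: setup_nmi_test and XENPMC are hypothetical names.
# XENPMC lets the tool be swapped for a stub when dry-running the script
# on a machine that does not have the test tool installed.
XENPMC="${XENPMC:-xenpmc}"

# Run the three setup steps in order, stopping at the first failure.
setup_nmi_test() {
    "$XENPMC" -i || return 1   # 1) initialize the performance counter
    "$XENPMC" -g || return 1   # 2) start the counter
    "$XENPMC" -s || return 1   # 3) print the NMI counts for [CPU0,CPU1]
}
```

Calling setup_nmi_test once after boot leaves the counter running and NMIs being generated.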
The "xenpmc -s" command was originally intended to stop the counters (and NMI generation), but it was modified to just return without stopping them. As a side effect, the number of NMIs is printed on the Xen console and can be used to verify that NMIs are being generated.

In order to trigger the bug I execute the command "xm dmesg" in a loop, and eventually the machine reboots (usually after a few minutes). I use the following shell script to execute "xm dmesg" in a loop:

#!/bin/bash
while true; do
    xm dmesg
    sleep 1
done

Does anybody have an idea of what could be causing this behavior and how we could nail it down?

Thanks
Renato

Attachment:
nmitest_xen.patch Attachment:
nmitest_tools.patch
_______________________________________________
Xen-devel mailing list
Xen-devel@xxxxxxxxxxxxxxxxxxx
http://lists.xensource.com/xen-devel