RE: [Xen-devel] RE: How to generate a HW NMI
Here is some additional info from my experiments over the weekend.

I took the Lenovo T500 and removed its internal WiFi miniPCIe card. In its place, I put in a miniPCIe-to-PCIe converter card with a PCIe socket. Into that socket, I placed a PCIe dump card. This card has a switch that, when pressed, raises an SERR error. Using the utility provided by the vendor, I enabled all the bridges between the card and the CPU to carry the SERR signal, so that the CPU sees it as an NMI.

I tested the set-up several times. Every single time I pressed the switch, I got an NMI, followed by a kdump core. So I was sure the HW setup was working correctly.

I left two Lenovo T500s running over the weekend, and when I returned this morning, both had hung. Completely frozen. I pressed the NMI switch on both systems and got nothing. No crashes, no coredumps. It looks as if the SERR/NMI is being ignored or blocked, or the CPU is completely shut down (STPCLK).

This experiment helps me prove that the software watchdog code in Xen was not the problem and that the NMIs are indeed getting blocked somehow. This is what I now need to investigate. The areas I would like to learn more about are the SMI handler and the external chip's use of the STPCLK signal to the CPU.

As an additional bit of info, the only response we get when the systems are hung is a beep when the power cord is unplugged from or plugged into the laptop. I don't know if the beep is generated by a HW module or whether ACPI/BIOS is involved.

Still looking for additional ideas.

Regards,
Roger R. Cruz

-----Original Message-----
From: xen-devel-bounces@xxxxxxxxxxxxxxxxxxx [mailto:xen-devel-bounces@xxxxxxxxxxxxxxxxxxx] On Behalf Of Roger Cruz
Sent: Monday, October 04, 2010 3:03 PM
To: Jan Kiszka
Cc: xen-devel@xxxxxxxxxxxxxxxxxxx; Konrad Rzeszutek Wilk
Subject: [Xen-devel] RE: How to generate a HW NMI

> BTW, "rmmod processor thermal" (should be equivalent to your Xen

I am not familiar with the thermal module, but my guess is that it does not deal with the same thing as the C3 states, which can be entered when the kernel becomes idle. I believe the thermal module works with another type of state (P?), where it alters the voltage and frequency of the CPU so that the CPU keeps running, but at a particular % of the top speed. The C3 state causes the CPU clocks to shut down entirely, and the CPU is then awakened by an external event.

R.

-----Original Message-----
From: Jan Kiszka [mailto:jan.kiszka@xxxxxxxxxxx]
Sent: Monday, October 04, 2010 11:23 AM
To: Roger Cruz
Cc: Konrad Rzeszutek Wilk; xen-devel@xxxxxxxxxxxxxxxxxxx
Subject: Re: How to generate a HW NMI

On 04.10.2010 16:19, Roger Cruz wrote:
> Until Friday, all hard hangs that we and our customers had experienced
> were on Lenovo T500 and X200, even with their latest BIOSes.

Yeah, the T500 was reported as problematic here as well. My Fujitsu Celsius H700 also crashes. In contrast, we have positive results from a Dell server with an Asus P6T Deluxe V2 board and a Core i7 920.

> The Lenovo
> T400 has never hung for me and I don't have any reports on them from the
> field. On Friday, I had an HP i5 hard hang with similar footprint as

i5? Mmh, we only have reports from i7 so far. Which BIOS vendor?

> the Lenovos. When this hard hang happens, the Xen watchdog (which is
> driven by the NMI handler) will not do its job and cause a crash/stack
> trace.
> This is why we have started to suspect something with the BIOS
> and SMIs as they are the only thing that can block an NMI.
> I am pretty
> certain that this is somehow related to entering C3 power states and
> possibly at the same time an SMI comes in.

I tried various things under Linux as well: nmi_watchdog=1, tracing to the VGA buffer right before/after the guest-host switch (it always hangs after entry here), verifying guest interruptibility before entry (though hypervisors usually do not play with the critical bits), and reading out host RAM (including the kernel log buffer) via Firewire. It all points to a crash outside the scope of the host OS.

> The time it takes to hang
> varies from 30mins to 24 hrs.

We are a bit luckier, maybe due to our special guest (an old RTOS in 16-bit mode): I can reproduce the hang after a few minutes.

BTW, "rmmod processor thermal" (should be equivalent to your Xen parameter) did not make a difference here.

Jan

--
Siemens AG, Corporate Technology, CT T DE IT 1
Corporate Competence Center Embedded Linux

No virus found in this incoming message.
Checked by AVG - www.avg.com
Version: 9.0.856 / Virus Database: 271.1.1/3168 - Release Date: 10/04/10 02:35:00

_______________________________________________
Xen-devel mailing list
Xen-devel@xxxxxxxxxxxxxxxxxxx
http://lists.xensource.com/xen-devel
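
For anyone repeating the nmi_watchdog=1 experiment on a plain Linux host, one quick sanity check before waiting for a hang is to confirm that the per-CPU NMI counters are actually advancing. The following is only a minimal sketch, not something used in this thread: it assumes the usual x86 Linux /proc/interrupts layout with an "NMI:" row, and the function names (parse_nmi_counts, nmi_deltas, sample_nmi_activity) are hypothetical. It will not show NMIs handled by the Xen hypervisor itself.

```python
import time


def parse_nmi_counts(proc_interrupts_text):
    """Return the per-CPU NMI counters found in /proc/interrupts-style text.

    Assumes a row of the form: "NMI:  <cpu0>  <cpu1> ...  Non-maskable interrupts".
    Returns an empty list if no such row exists.
    """
    for line in proc_interrupts_text.splitlines():
        fields = line.split()
        if fields and fields[0] == "NMI:":
            counts = []
            for field in fields[1:]:
                if not field.isdigit():
                    break  # reached the trailing description text
                counts.append(int(field))
            return counts
    return []


def nmi_deltas(before, after):
    """Per-CPU increase in the NMI counter between two samples."""
    return [b - a for a, b in zip(before, after)]


def sample_nmi_activity(path="/proc/interrupts", interval=5.0):
    """Read the counters twice, `interval` seconds apart, and return the deltas.

    The path and interval are illustrative defaults, not values from the thread.
    """
    with open(path) as f:
        before = parse_nmi_counts(f.read())
    time.sleep(interval)
    with open(path) as f:
        after = parse_nmi_counts(f.read())
    return nmi_deltas(before, after)
```

Calling sample_nmi_activity() on a healthy system with the watchdog enabled should return a non-zero value for each CPU that is receiving watchdog NMIs. All zeros would suggest the watchdog is not armed in the first place, which is a different problem from NMIs being blocked at hang time.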