Re: [Xen-users] MCE logs and CPU issue
Hi both,

you might wanna throw 30 minutes into setting up an OMD Nagios instance (www.omdistro.org), adding the affected servers to the check_mk config and grabbing my Linux ECC error check plugin from the community exchange (http://exchange.check-mk.org).

I *really really hope* I got everything right and that it will be able to detect 1-bit/2-bit ECC errors once the CPUs report them.

The error

>> Feb 2 17:45:11 maradona kernel: [172988.068056] [CPU8, BANK12, addr
>> cf839a00, state cc0035400001009f]

is as descriptive as anything that isn't a real big-iron Unix box can get. (Of course, then you'd have better ECC and a page deallocation table anyway, and all this would not be causing problems.)

My assumption is that Xen properly forwards MCEs. There was a presentation by Intel on the topic at one of the last XenSummits. I wasn't there but read through it some time ago. I guess you'll be able to find it.

If needed I can do a short walkthrough of the setup. I just wanna avoid this looking like an advertisement. It's not my fault there's no other good ECC check plugin for Nagios :)

2012/2/3 Luke S. Crawford <lsc@xxxxxxxxx>:
> On Fri, Feb 03, 2012 at 11:37:27AM +0800, Sylvain Chevalier wrote:
>> Hi,
>>
>> On one of our servers running xen, we see many instances like this in
>> /var/log/messages on dom0:
>>
>> Feb 2 17:45:11 maradona kernel: [172988.068048] MCE_DOM0_LOG: enter
>> dom0 mce vIRQ handler
>> Feb 2 17:45:11 maradona kernel: [172988.068050] MCE_DOM0_LOG: No more
>> urgent data
>> Feb 2 17:45:11 maradona kernel: [172988.068056] [CPU8, BANK12, addr
>> cf839a00, state cc0035400001009f]
>> Feb 2 17:45:11 maradona kernel: [172988.068059] MCE_DOM0_LOG: No more
>> nonurgent data
>>
>> it is always CPU8, BANK12. And the server will sometimes just abruptly
>> reboot after logging this.
>
>> Does it mean that MCE messages are logged by xen in /var/log/messages
>> and that there is a problem with this cpu? Do you know how I can dig
>> further and find what the problem is?
>
> Betcha it is the ram in that bank.
>
> I'm getting similar errors in a server that I just swapped out, only my
> MCE errors say:
>
> (XEN) MCE: The hardware reports a non fatal, correctable incident occured on
> CPU 0.
> (XEN) Bank 4: dc0c4000fe080813[c008000401000000] at 363fe9000
>
> (this is on my serial console, not /var/log/messages)
>
> 'non-fatal, correctable incident on cpu0, Bank 4' sure sounds a lot
> like it's a correctable ECC error. The crash would then be explained
> by an uncorrectable ecc error (commonly in failing ram, you get correctable
> errors, then an uncorrectable error.)

bingo :>

> Now, this was on an ancient garbage nvidia mcp55 motherboard and nothing
> like the kernel EDAC/bluesmoke module works with it, xen or no.
>
> The counter evidence to that theory is that the motherboard system event
> log (accessed through the bios setup screen) doesn't show any errors.

MCEs are often seen while nothing shows up in iLO or similar management logs. I guess this is because Intel / AMD decide when the CPU sends out an MCE/EDAC event, whereas the HW vendors might even be slightly inclined not to immediately replace stuff because of a single PCI CRC error (which aren't even checked in Linux by default... lol).

Flo

--
the purpose of libvirt is to provide an abstraction layer hiding
all xen features added since 2006 until they were finally understood
and copied by the kvm devs.

_______________________________________________
Xen-users mailing list
Xen-users@xxxxxxxxxxxxxxxxxxx
http://lists.xensource.com/xen-users
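
As a rough aid for "digging further" into lines like the one quoted in this thread, here is a minimal Python sketch that pulls the CPU, bank, address and state out of the dom0 log line and splits the state value into the usual MCA status flags. It assumes the state field is a raw MCi_STATUS value with the architectural bit layout described in the Intel SDM (VAL = bit 63 down to PCC = bit 57); the thread does not confirm the CPU vendor or model, and the function name decode_mce_line is made up for this example.

```python
import re

# Flag bits of an MCA status register. ASSUMPTION: architectural layout as
# documented in the Intel SDM; the thread does not say which CPU this is.
STATUS_FLAGS = {
    63: "VAL",    # register contains valid data
    62: "OVER",   # an earlier error was overwritten
    61: "UC",     # error was not corrected by hardware
    60: "EN",     # error reporting was enabled
    59: "MISCV",  # MISC register is valid
    58: "ADDRV",  # ADDR register is valid
    57: "PCC",    # processor context may be corrupt
}

# Matches the dom0 log format quoted in the thread, e.g.
# "[CPU8, BANK12, addr cf839a00, state cc0035400001009f]"
LINE_RE = re.compile(
    r"\[CPU(?P<cpu>\d+),\s*BANK(?P<bank>\d+),\s*"
    r"addr\s+(?P<addr>[0-9a-fA-F]+),\s*state\s+(?P<state>[0-9a-fA-F]+)\]"
)


def decode_mce_line(line):
    """Return a dict describing the MCE record in `line`, or None if absent."""
    m = LINE_RE.search(line)
    if m is None:
        return None
    state = int(m.group("state"), 16)
    flags = [name for bit, name in STATUS_FLAGS.items() if (state >> bit) & 1]
    return {
        "cpu": int(m.group("cpu")),
        "bank": int(m.group("bank")),
        "addr": int(m.group("addr"), 16),
        "flags": flags,
        "mca_error_code": state & 0xFFFF,              # bits 15:0
        "model_specific_code": (state >> 16) & 0xFFFF,  # bits 31:16
    }


if __name__ == "__main__":
    sample = ("Feb 2 17:45:11 maradona kernel: [172988.068056] "
              "[CPU8, BANK12, addr cf839a00, state cc0035400001009f]")
    print(decode_mce_line(sample))
```

Under that assumed layout, the state value from the thread has VAL, OVER, MISCV and ADDRV set and UC clear, which would fit the corrected-error theory above; the meaning of the error code fields still has to be looked up in the vendor's manual for the actual CPU.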