Xen project Mailing List

RE: [Xen-devel] [PATCH] re-work MCA telemetry internals; use common code for Intel/AMD MCA

To: Frank van der Linden <Frank.Vanderlinden@xxxxxxx>, "xen-devel@xxxxxxxxxxxxxxxxxxx" <xen-devel@xxxxxxxxxxxxxxxxxxx>

From: "Ke, Liping" <liping.ke@xxxxxxxxx>

Date: Tue, 17 Mar 2009 14:59:20 +0800

Accept-language: en-US

Cc:

Delivery-date: Tue, 17 Mar 2009 00:00:44 -0700

List-id: Xen developer discussion <xen-devel.lists.xensource.com>

Thread-index: AcmmjuuLNBJ4wd8/QmiEPfqXHcdbYwAOEIywAAFyffA=

Thread-topic: [Xen-devel] [PATCH] re-work MCA telemetry internals; use common code for Intel/AMD MCA

Hi Frank,

We ran a small test here and found that the CMCI problem is caused by the way the "mce_banks_owned" bitmap parameter is passed. When a CMCI occurs, the bitmap value we print in smp_cmci_interrupt is correct (16c for CPU0). Once it has been passed into mcheck_mca_logout, it has turned into "0xFFFF~~FFF", which is wrong.

Only for your info -:)

Still, I suggest splitting the patch, since it is really big -:)

Thanks a lot for your help!
Criping

-----Original Message-----
From: Ke, Liping
Sent: 17 March 2009 14:26
To: 'Frank van der Linden'; xen-devel@xxxxxxxxxxxxxxxxxxx
Subject: RE: [Xen-devel] [PATCH] re-work MCA telemetry internals; use common code for Intel/AMD MCA

Hi Frank,

I am now testing this patch on the latest Intel platform, because CMCI needs an owner check: only the CPU that owns a bank should report an error from it.

Without the patch, when a CMCI occurs, only CPU0 reports the error during the check, because CPU0 is the owner of bank 8. Below is the correct log:

(XEN) CMCI: cmci_intr happen on CPU3
[root@lke-ep inject]# (XEN) CMCI: cmci_intr happen on CPU2
(XEN) CMCI: cmci_intr happen on CPU0
(XEN) CMCI: cmci_intr happen on CPU1
(XEN) mcheck_poll: bank8 CPU0 status[cc0000800001009f]
(XEN) mcheck_poll: CPU0, SOCKET0, CORE0, APICID[0], thread[0]
(XEN) MCE: The hardware reports a non fatal, correctable incident occured on CPU 0.

After applying your patch, I found that all CPUs report the error. Below is the log:

(XEN) MCE: The hardware reports a non fatal, correctable incident occured on CPU 0.
(XEN) MCE: The hardware reports a non fatal, correctable incident occured on CPU 2.
(XEN) MCE: The hardware reports a non fatal, correctable incident occured on CPU 3.
(XEN) MCE: The hardware reports a non fatal, correctable incident occured on CPU 1.
(XEN) Bank 8: cc0000c00001009f<1>Bank 8: 8c0000400001009f<1>Bank 8: cc0001c00001009f<1>MCE: The hardware reports a non fatal, correctable incident occured on CPU 0.

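Editorial sketch (not part of the original message): the two logs above are what one would expect if the logout path only scans banks whose bit is set in the mask it receives. The C below is a toy model under stated assumptions, not Xen code. The names banks_logged and cpus_reporting, the bank count, the idea of one status table visible to every CPU, and the empty masks for CPU1 to CPU3 are all hypothetical. Only the 0x16c mask for CPU0, the bank 8 status value, and the all-ones mask come from the report.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define NR_CPUS    4
#define NR_BANKS   9               /* hypothetical: banks 0..8 */
#define STATUS_VAL (1ULL << 63)    /* valid bit of an MCi_STATUS value */

/* Hypothetical model: one status table seen by every CPU, with a
 * corrected error pending in bank 8 (value taken from the log above). */
static const uint64_t bank_status[NR_BANKS] = {
    [8] = 0xcc0000800001009fULL,
};

/* Count the banks a CPU would log, scanning only the banks whose bit
 * is set in the ownership mask handed to it. */
static int banks_logged(uint32_t owned_mask)
{
    int logged = 0;

    for (unsigned int bank = 0; bank < NR_BANKS; bank++) {
        if (!(owned_mask & (1u << bank)))
            continue;               /* not owned: leave it to the owner */
        if (bank_status[bank] & STATUS_VAL)
            logged++;
    }
    return logged;
}

/* Count how many CPUs report at least one error for the given masks. */
static int cpus_reporting(const uint32_t owned[NR_CPUS])
{
    int reporting = 0;

    assert(owned != NULL);
    for (int cpu = 0; cpu < NR_CPUS; cpu++)
        if (banks_logged(owned[cpu]) > 0)
            reporting++;
    return reporting;
}
```

With CPU0 holding mask 0x16c (bit 8 set) and the other CPUs holding empty masks, only one CPU reports in this model. If every CPU instead sees an all-ones mask, all four report, which matches the symptom described above. The sketch says nothing about why the mask changes between smp_cmci_interrupt and mcheck_mca_logout.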
I noticed that your patch passes in the cmci_owner mask, but I can't see the reason yet, since this is really a big patch. I need some time to figure it out. We also found that the polling mechanism has changed.

My feeling is that this patch is really too big. We can't easily work out its impact on our checked-in code right now. I just wonder whether you could split this big patch into two parts :-)

part1: the MCE log telemetry mechanism and the required mce_intel interface changes, so that we can easily verify whether the new interfaces work for our CMCI as well as for non-fatal polling. I guess this should not be much work; could you just adapt machine_check_poll to the new telem interfaces?

part2: the common handler part (including both the CMCI parts and the non-fatal polling parts).

What do you think about it? :-)

Thanks a lot for your help!
Criping

-----Original Message-----
From: xen-devel-bounces@xxxxxxxxxxxxxxxxxxx [mailto:xen-devel-bounces@xxxxxxxxxxxxxxxxxxx] On Behalf Of Frank van der Linden
Sent: 17 March 2009 7:28
To: xen-devel@xxxxxxxxxxxxxxxxxxx
Subject: [Xen-devel] [PATCH] re-work MCA telemetry internals; use common code for Intel/AMD MCA

The following patch reworks the MCA error telemetry handling inside Xen, and shares code between the Intel and AMD implementations as much as possible.

I've had this patch sitting around for a while, but it wasn't ported to -unstable yet. I finished porting and testing it, and am submitting it now, because the Intel folks want to go ahead and submit their new changes, so we agreed that I should push our changes first.

Brief explanation of the telemetry part: previously, the telemetry was accessed in a global array, with index variables used to access it. There were some issues with that: race conditions with regard to new machine checks (or CMCIs) coming in while handling the telemetry, and interaction with domains having been notified or not, which was a bit hairy.
Our changes (I should say: Gavin Maltby's changes, as he did the bulk of this work for our 3.1 based tree, I merely ported/extended it to 3.3 and beyond) make telemetry access transactional (think of a database). Also, the internal database updates are atomic, since the final commit is done by a pointer swap. There is a brief explanation of the mechanism in mctelem.h.

This patch also removes dom0->domU notification, which is ok, since Intel's upcoming changes will replace domU notification with a vMCE mechanism anyway.

The common code part is pretty much what it says. It defines a common MCE handler, with a few hooks for the special needs of the specific CPUs. I've been told that Intel's upcoming patch will need to make some parts of the common code specific to the Intel CPU again, but we'll work together to use as much common code as possible.

- Frank
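Editorial sketch (not part of the original message): the "transactional access, commit by pointer swap" idea described above can be illustrated with a small C11 example. This is a minimal sketch under stated assumptions and is not the mctelem implementation from the patch. The struct telem_entry layout and the functions telem_reserve, telem_commit and telem_consume_all are hypothetical names. The reservation pool is deliberately naive and not safe for concurrent use; only the commit and consume steps use atomic pointer operations.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical telemetry record; the real layout lives in the patch. */
struct telem_entry {
    uint64_t status;
    unsigned int bank;
    struct telem_entry *next;
};

#define POOL_SIZE 4
static struct telem_entry pool[POOL_SIZE];
static unsigned int pool_used;      /* sketch only: not concurrency-safe */

/* Head of the committed list; consumers only ever see committed entries. */
static _Atomic(struct telem_entry *) committed_head;

/* Begin a transaction: hand out a private entry nobody else can see yet. */
static struct telem_entry *telem_reserve(void)
{
    if (pool_used == POOL_SIZE)
        return NULL;
    return &pool[pool_used++];
}

/* Commit: publish the filled-in entry with a single atomic pointer swap. */
static void telem_commit(struct telem_entry *e)
{
    struct telem_entry *old = atomic_load(&committed_head);

    assert(e != NULL);
    do {
        e->next = old;              /* link in front of the current head */
    } while (!atomic_compare_exchange_weak(&committed_head, &old, e));
}

/* Consume: detach the whole committed list with one atomic exchange. */
static struct telem_entry *telem_consume_all(void)
{
    return atomic_exchange(&committed_head, (struct telem_entry *)NULL);
}
```

In this model a handler would call telem_reserve, fill in the entry, and then call telem_commit. A reader calling telem_consume_all before the commit sees nothing, and after the commit sees the complete entry, never a half-written one.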

_______________________________________________
Xen-devel mailing list
Xen-devel@xxxxxxxxxxxxxxxxxxx
http://lists.xensource.com/xen-devel
