Xen project Mailing List

Re: [Xen-devel] [PATCH 2/2] x86/vMCE: save/restore MCA capabilities

To: "Luck, Tony" <tony.luck@xxxxxxxxx>, Jan Beulich <JBeulich@xxxxxxxx>, "xen-devel@xxxxxxxxxxxxxxxxxxx" <xen-devel@xxxxxxxxxxxxxxxxxxx>

From: "Liu, Jinsong" <jinsong.liu@xxxxxxxxx>

Date: Tue, 6 Mar 2012 03:49:10 +0000

Accept-language: en-US

Cc: Olaf Hering <olaf@xxxxxxxxx>, "Jiang, Yunhong" <yunhong.jiang@xxxxxxxxx>, "Dugger, Donald D" <donald.d.dugger@xxxxxxxxx>

Delivery-date: Tue, 06 Mar 2012 03:50:53 +0000

List-id: Xen developer discussion <xen-devel.lists.xen.org>

Thread-topic: [Xen-devel] [PATCH 2/2] x86/vMCE: save/restore MCA capabilities

Luck, Tony wrote:
> Meaning of MCi_CTL registers. A number of different h/w structures
> and error events within those structures may be reported in a single
> bank. The MCi_CTL register for that bank has a bitmask that allows
> each of the errors to be enabled individually. The detail of which
> bits in the register enable which errors is model specific.
> Architecturally the SDM recommends writing 0xffffff...fffff to them
> to enable all errors. Sometimes if there is a problem on a cpu we
> might disable some bits (usually handled by BIOS or microcode quietly
> clearing, or not ever setting some of the bits).
>
> MCG_CTL does a similar thing - but at a global scope - for this cpu
> ... each logical cpu has its own copy of the MCG_* registers - so the
> 'G' for 'Global' isn't all the way 'global' - just locally global :-)
> - I think that the names were given before multi-thread/multi-core.
> (i.e. it can affect whether an error is reported in any of the banks
> that are associated with it). Again the exact meaning of the bits
> is model specific, and the architectural recommendation is to enable
> everything with a write of 0xffffff....fffff

Thanks Tony! So MCG_CTL/MCi_CTL are model specific and are mainly used
for debugging. For the kernel MCA logic, software sets all 1's by
default and does not change them afterwards, right?

>
> I'm not a virtualization expert - so I'm not sure what the tradeoffs
> of providing almost pass-through access to the host banks are. I'd
> have thought that most useful functionality could be provided with
> a wholly virtualized approach:
> 1) The guests really don't need to see any corrected errors at all.
> If there are any predictive actions to be taken (e.g. too many
> corrected errors from a memory location -> stop using a page) these
> can be taken at the hypervisor level.

For guest corrected errors, Xen currently delivers a CMCI to the guest
and lets the guest decide for itself what to do.

> 2) Recoverable errors - hypervisor could remap these from whatever
> bank they occur in to a virtual bank. It already needs to convert
> real physical addresses to guest physical - so this shouldn't be much
> extra effort
> 3) Fatal errors - whole system (with all guests) is going down -
> really not much value in trying to tell all the guests exactly why.

Yes, that is exactly what Xen does for recoverable and fatal errors.

Thanks,
Jinsong

>
> But perhaps I'm missing some subtlety.
>
> -Tony
>
> -----Original Message-----
> From: Liu, Jinsong
> Sent: Monday, March 05, 2012 12:19 PM
> To: Luck, Tony; Jan Beulich; xen-devel@xxxxxxxxxxxxxxxxxxx
> Cc: Olaf Hering; Dugger, Donald D
> Subject: RE: [Xen-devel] [PATCH 2/2] x86/vMCE: save/restore MCA
> capabilities
>
> Jan Beulich wrote:
>> This allows migration to a host with fewer MCA banks than the source
>> host had, while without this patch accesses to the excess banks' MSRs
>> caused #GP-s in the guest after migration (and it depended on the
>> guest kernel whether this would be fatal).
>>
>> A fundamental question is whether we should also save/restore MCG_CTL
>> and MCi_CTL, as the HVM save record should ideally be defined to cover
>> the complete state that needs saving from the beginning (I'm unsure
>> whether the save/restore logic allows for future extension of an
>> existing record).
>
> Not sure about this point. I have always been confused about the
> meaning of MCG_CTL/MCi_CTL, and their definition in the SDM looks
> ambiguous to me.
> ASK TONY FOR HELP: what is the real h/w meaning of MCG_CTL/MCi_CTL?
> It seems the mce logic seldom relies on them, especially on the
> individual bits of MCi_CTL.
>
> Another question: why does the patch define mcg_cap per vcpu while
> the others (mcg_ctl/ mcg_status/ mci_ctl) are defined per domain?
> Semantically that looks somewhat odd.
>
> Thanks,
> Jinsong
>
>>
>> Of course, this change is expected to make migration from new to
>> older Xen impossible (again I'm unsure what the save/restore logic
>> does with records it doesn't even know about).
>>
>> The (trivial) tools side change may seem unrelated, but the code
>> should have been that way from the beginning to allow the hypervisor
>> to look at currently unused ext_vcpucontext fields without risking
>> reading garbage when those fields get a meaning assigned in the
>> future. This isn't being enforced here - should it be? (Obviously,
>> for backwards compatibility, the hypervisor must assume these fields
>> to be clear only when the extended context's size exceeds the old
>> original one.)
>>
>> A future addition to this change might be to allow configuration of
>> the number of banks and other MCA capabilities for a guest before it
>> starts (i.e. to not inherit the values seen on the first host it
>> runs on).
>>
>> Signed-off-by: Jan Beulich <jbeulich@xxxxxxxx>
>>

_______________________________________________
Xen-devel mailing list
Xen-devel@xxxxxxxxxxxxx
http://lists.xen.org/xen-devel
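
For readers following the bank-count problem in Jan's patch description, here is a minimal sketch of why a guest that migrates to a host with fewer banks can hit #GP on the excess banks' MSRs. It assumes the SDM's architectural layout: MCG_CAP bits 7:0 hold the bank count, bit 8 (MCG_CTL_P) says whether MCG_CTL exists, and bank i's CTL/STATUS/ADDR/MISC registers start at MSR 0x400 + 4*i. The function names and the sample capability values are hypothetical and are not taken from the Xen source or from the patch.

```python
# Architectural constants from the SDM's MCA register layout.
MSR_IA32_MC0_CTL = 0x400     # first per-bank MSR (bank 0 CTL)
MSRS_PER_BANK = 4            # CTL, STATUS, ADDR, MISC for each bank
MCG_CAP_COUNT_MASK = 0xFF    # MCG_CAP bits 7:0: number of banks
MCG_CAP_CTL_P = 1 << 8       # MCG_CAP bit 8: MCG_CTL register present


def bank_count(mcg_cap: int) -> int:
    """Return the number of MCA banks advertised by an MCG_CAP value."""
    return mcg_cap & MCG_CAP_COUNT_MASK


def mci_ctl_msr(bank: int) -> int:
    """Return the MSR index of MCi_CTL for the given bank."""
    return MSR_IA32_MC0_CTL + MSRS_PER_BANK * bank


def msr_in_bank_range(msr: int, nr_banks: int) -> bool:
    """True if msr is a per-bank register of one of the first nr_banks banks."""
    return MSR_IA32_MC0_CTL <= msr < MSR_IA32_MC0_CTL + MSRS_PER_BANK * nr_banks


if __name__ == "__main__":
    # Hypothetical capability values, chosen only for illustration.
    src_cap = 0x109  # source host: 9 banks, MCG_CTL present
    dst_cap = 0x106  # destination host: 6 banks, MCG_CTL present

    msr = mci_ctl_msr(7)  # guest still believes bank 7 exists
    for name, cap in (("source", src_cap), ("destination", dst_cap)):
        ok = msr_in_bank_range(msr, bank_count(cap))
        print(f"{name}: MSR {msr:#x} {'valid' if ok else 'out of range'}")
```

Saving the guest-visible MCG_CAP with the domain, as the patch proposes, means the range check after migration can keep using the bank count the guest was originally shown instead of the destination host's.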
