Xen project Mailing List

RE: [Xen-devel] RFC: MCA/MCE concept

To: "Egger, Christoph" <Christoph.Egger@xxxxxxx>, xen-devel@xxxxxxxxxxxxxxxxxxx

From: "Petersson, Mats" <Mats.Petersson@xxxxxxx>

Date: Fri, 1 Jun 2007 11:48:31 +0200

Cc: Gavin Maltby <Gavin.Maltby@xxxxxxx>, Keir Fraser <Keir.Fraser@xxxxxxxxxxxxx>

Delivery-date: Fri, 01 Jun 2007 02:46:57 -0700

List-id: Xen developer discussion <xen-devel.lists.xensource.com>

Thread-index: AcekL0+GU2TAkKZHREaw3TDSJpwpRgAAFwWA

Thread-topic: [Xen-devel] RFC: MCA/MCE concept

> -----Original Message-----
> From: xen-devel-bounces@xxxxxxxxxxxxxxxxxxx
> [mailto:xen-devel-bounces@xxxxxxxxxxxxxxxxxxx] On Behalf Of
> Christoph Egger
> Sent: 01 June 2007 10:28
> To: xen-devel@xxxxxxxxxxxxxxxxxxx
> Cc: Gavin Maltby; Keir Fraser
> Subject: Re: [Xen-devel] RFC: MCA/MCE concept
>
> On Friday 01 June 2007 10:55:28 Petersson, Mats wrote:
> > [snip]
> >
> > > For short, guests with a PV MCA driver will see a certain event
> > > (assuming the event mechanism will be used for the notification)
> > > and guests w/o a PV MCA driver will see a "General Protection Fault".
> > > Is that right?
> >
> > Not sure if GP fault is the right thing for non-"MCA PV driver" domains.
> > I think "just killing" the domain is the right thing to do.
> >
> > We can't guarantee that a GP fault is actually going to "kill" the guest.
> > Let's assume the code that ran on the guest was something along the
> > lines of:
> >
> > int some_function(...)
> > {
> >     ...
> >     try {
> >         ...
> >         /* Some code that does quite a lot of "random" processing
> >            that may cause, for example, GP fault */
> >         ...
> >     } catch(Exception e) {
> >         ...
> >         /* handles GP fault within the kernel code */
> >         ...
> >     }
> > }
> >
> > Note that Windows kernel drivers are allowed to use the kernel exception
> > handling, and ARE allowed to "allow" GP faults if they wish to do so.
> > [Don't ask me why MS allows this, but that's the case, so we have to live
> > with it].
>
> In that case, it will die sooner or later *after* consuming the data in
> error. That means, the guest continues to live for an unknown time...

Yes.
What I'm worried about is that if you have a "transient" or "few-bit" error
in a rarely used location, the guest may well live a LONG time with incorrect
data, and the error may not be detected again for quite some time (say two of
its bits are stuck at 0, and the data is then written back with the zeros
there: the next time we read it there is no error, since the data has zeros
in those positions).

Also consider the case where one cell (or small block of cells) has gone
bad, but it's only used by one single piece of code that is using this
try/catch code. I know, this is probably relatively rare, but I'm still
worried that it will "break" things...

> > I'm not sure if Linux, Solaris, *BSD, OS/2 or other OS's will allow
> > "catching" a Kernel GP fault in a non-precise fashion (I know Linux has
> > exception handling for EXACT positions in the code). But since at least
> > one kernel DOES allow this, we can't be sure that a GPF will destroy the
> > guest.
>
> When Linux and *BSD see a GPF while they are in userspace, then they kill
> the process with a SIGSEGV. If they are in kernelspace, then they panic.
>
> > Second point to note is of course that if the guest is in user-mode when
> > the GPF happens, then almost all OS's will just kill the application - and
> > there's absolutely no reason to believe that the application running is
> > necessarily where the actual memory problem is - it may be caused by
> > memory scrubbing for example.
> >
> > Whatever we do to the guest, it should be a "certain death", unless the
> > kernel has told us "I can handle MCE's".
>
> It is obvious that there is no absolute generic way to handle all sort of
> buggy guests. I vote for:
>
> If DomU has a PV MCA driver use this or inject a GPF.
> Multiplexing all the MSR's related to emulate MCA/MCE for the guests is much
> more complex than just injecting a GPF - and slower.

Emulating MCE to the guest wasn't my intended alternative suggestion.
Instead, my idea was that if the guest hasn't registered a "PV MCE handler",
we just immediately kill the domain as such - e.g. similar to
"domain_crash_synchronous()". Don't let the guest have any chance to "do
something wrong" in the process - it's already broken, and letting it run any
further will almost certainly not help matters.

This may not be the prettiest solution, but then on the other hand, a
"Windows blue-screen" or Linux "oops" saying GP fault happened at some random
place in the guest isn't exactly helping the SysAdmin understand the problem
either.

--
Mats

> Keir, what are your opinions on this thread?
>
>
> Christoph
>
> --
> AMD Saxony, Dresden, Germany
> Operating System Research Center
>
> Legal Information:
> AMD Saxony Limited Liability Company & Co. KG
> Sitz (Geschäftsanschrift):
>    Wilschdorfer Landstr. 101, 01109 Dresden, Deutschland
> Registergericht Dresden: HRA 4896
> vertretungsberechtigter Komplementär:
>    AMD Saxony LLC (Sitz Wilmington, Delaware, USA)
> Geschäftsführer der AMD Saxony LLC:
>    Dr. Hans-R. Deppe, Thomas McCoy
>
> _______________________________________________
> Xen-devel mailing list
> Xen-devel@xxxxxxxxxxxxxxxxxxx
> http://lists.xensource.com/xen-devel
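
The stuck-bit scenario in the reply above (bits stuck at 0, data written back with the zeros in place, next read reports no error) can be illustrated with a small toy model. This is only a sketch under stated assumptions: `struct cell`, `cell_write`, `cell_read` and `error_hidden_after_writeback` are hypothetical names, and the "check" field is simply a copy of the value as written, standing in for real ECC check bits. It is not how any memory controller or Xen itself implements error detection.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical toy model of one memory cell with an error check. */
struct cell {
    uint8_t data;          /* bits physically held by the cell */
    uint8_t check;         /* stand-in for ECC: copy of the value as written */
    uint8_t stuck_at_zero; /* mask of bits that always read back as 0 */
};

/* Write a value; the check is computed from the value the writer intended. */
void cell_write(struct cell *c, uint8_t value)
{
    assert(c != NULL);
    c->check = value;
    c->data  = value & (uint8_t)~c->stuck_at_zero;
}

/* Read the cell; returns true when the data agrees with the check. */
bool cell_read(const struct cell *c, uint8_t *out)
{
    uint8_t seen = c->data & (uint8_t)~c->stuck_at_zero;
    *out = seen;
    return seen == c->check;
}

/* Replays the scenario: detect once, write back, then never detect again. */
bool error_hidden_after_writeback(void)
{
    struct cell c = { 0, 0, 0 };
    uint8_t v;

    cell_write(&c, 0xFF);              /* healthy cell holds 0xFF */
    c.stuck_at_zero = 0x03;            /* two bits go bad afterwards */

    bool first_ok = cell_read(&c, &v); /* mismatch: error is reported */
    cell_write(&c, v);                 /* guest writes the bad value back */
    bool second_ok = cell_read(&c, &v);/* check now matches the bad data */

    return !first_ok && second_ok && v == 0xFC;
}
```

In this model the first read flags the error, but once a guest that survived the fault writes the corrupted value back, the check is recomputed from the corrupted data and later reads look clean.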

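The policy proposed in the message above (notify guests that registered a PV MCE handler, and immediately kill every other affected guest rather than injecting a GPF) might be sketched roughly as follows. Everything here is an assumption made for illustration: `struct guest`, `enum mce_action` and `handle_guest_mce` are hypothetical names, not Xen's data structures or API, and setting a `crashed` flag merely stands in for something like the `domain_crash_synchronous()` call mentioned in the mail.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical model of a guest domain; not Xen's struct domain. */
struct guest {
    int  id;
    bool has_pv_mce_handler; /* guest registered a PV MCE handler */
    bool crashed;            /* set once the domain has been killed */
    int  pending_mce_events; /* notifications queued for the guest */
};

enum mce_action { MCE_NOTIFIED, MCE_DOMAIN_KILLED };

/* Decide what happens to a guest affected by a machine check. */
enum mce_action handle_guest_mce(struct guest *g)
{
    assert(g != NULL);

    if (g->has_pv_mce_handler) {
        /* The guest said it can handle MCEs: queue an event for it. */
        g->pending_mce_events++;
        return MCE_NOTIFIED;
    }

    /* No handler registered: "certain death", no GPF for it to catch. */
    g->crashed = true;
    return MCE_DOMAIN_KILLED;
}
```

The point of the design is that the guest without a handler never executes another instruction, so a kernel exception handler like the try/catch example earlier in the thread gets no chance to swallow the fault.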
©2013 Xen Project, A Linux Foundation Collaborative Project. All Rights Reserved.
Linux Foundation is a registered trademark of The Linux Foundation.
Xen Project is a trademark of The Linux Foundation.