
RE: [Xen-devel] RFC: MCA/MCE concept



 

> -----Original Message-----
> From: xen-devel-bounces@xxxxxxxxxxxxxxxxxxx 
> [mailto:xen-devel-bounces@xxxxxxxxxxxxxxxxxxx] On Behalf Of 
> Christoph Egger
> Sent: 01 June 2007 10:28
> To: xen-devel@xxxxxxxxxxxxxxxxxxx
> Cc: Gavin Maltby; Keir Fraser
> Subject: Re: [Xen-devel] RFC: MCA/MCE concept
> 
> On Friday 01 June 2007 10:55:28 Petersson, Mats wrote:
> 
> [snip]
> 
> > > For short, guests with a PV MCA driver will see a certain event
> > > (assuming the event mechanism will be used for the notification)
> > > and guests w/o a PV MCA driver will see a "General Protection Fault".
> > > Is that right?
> >
> > Not sure if GP fault is the right thing for non-"MCA PV driver" domains. I
> > think "just killing" the domain is the right thing to do.
> >
> > We can't guarantee that a GP fault is actually going to "kill" the guest.
> > Let's assume the code that ran on the guest was something along the lines
> > of:
> >
> >
> > int some_function(...)
> > {
> >    ...
> >
> >    try {
> >       ...
> >       /* Some code that does quite a lot of "random" processing that may
> >          cause, for example, a GP fault */
> >       ...
> >    } catch(Exception e)
> >    {
> >     ...
> >     /* handles GP fault within the kernel code */
> >     ...
> >    }
> > }
> >
> >
> > Note that Windows kernel drivers are allowed to use the kernel exception
> > handling, and ARE allowed to "allow" GP faults if they wish to do so.
> > [Don't ask me why MS allows this, but that's the case, so we have to live
> > with it].
> 
> In that case, it will die sooner or later *after* consuming the data in
> error. That means, the guest continues to live for an unknown time...

Yes. What I'm worried about is that if you have a "transient" or "few-bit" 
error in a rarely used memory location, the guest may well live a LONG time 
with incorrect data, and the error may not be detected again for quite some 
time. Say two bits have stuck at 0 and the data is then written back with the 
zeros in place: the next time we read it there is no error, since the stored 
data now has zeros in those positions. 
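
As a toy illustration of that write-back scenario (a sketch only: the cell 
model, the check value and all names below are made up for this mail, and real 
ECC hardware works differently): 

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical model of one memory cell: the stored byte plus a check value
 * recorded at write time (a stand-in for real parity/ECC bits). */
struct cell {
    uint8_t stored;      /* bits actually held by the hardware */
    uint8_t check;       /* check value computed from the data as written */
    uint8_t stuck_zero;  /* mask of bits that always read back as 0 */
};

/* Toy check code: simply a copy of the data. Real ECC is more involved. */
static uint8_t check_of(uint8_t data)
{
    return data;
}

/* Write data; the check reflects what the writer intended to store. */
static void cell_write(struct cell *c, uint8_t data)
{
    assert(c != NULL);
    c->stored = data & (uint8_t)~c->stuck_zero;
    c->check = check_of(data);
}

/* Read data into *out; returns true if no error is reported. */
static bool cell_read(const struct cell *c, uint8_t *out)
{
    assert(c != NULL && out != NULL);
    *out = c->stored & (uint8_t)~c->stuck_zero;
    return check_of(*out) == c->check;
}
```

With the low two bits stuck (stuck_zero = 0x03), writing 0xFF and reading it 
back gives 0xFC and a reported error. If the guest ignores that and writes the 
0xFC back, the next read reports no error at all, even though the data is 
still wrong. 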

Also consider the case where one cell (or a small block of cells) has gone 
bad, but it is only used by a single piece of code that sits inside this kind 
of try/catch block. I know this is probably relatively rare, but I'm still 
worried that it will "break" things... 

> 
> > I'm not sure if Linux, Solaris, *BSD, OS/2 or other OS's will allow
> > "catching" a Kernel GP fault in a non-precise fashion (I know Linux has
> > exception handling for EXACT positions in the code). But since at least one
> > kernel DOES allow this, we can't be sure that a GPF will destroy the guest.
> 
> When Linux and *BSD see a GPF while they are in userspace, then they kill
> the process with a SIGSEGV. If they are in kernelspace, then they panic.
> 
> > Second point to note is of course that if the guest is in user-mode when
> > the GPF happens, then almost all OS's will just kill the application - and
> > there's absolutely no reason to believe that the application running is
> > necessarily where the actual memory problem is - it may be caused by memory
> > scrubbing for example.
> >
> > Whatever we do to the guest, it should be a "certain death", unless the
> > kernel has told us "I can handle MCE's".
> 
> It is obvious that there is no absolute generic way to handle all sort of
> buggy guests. I vote for:
> 
> If DomU has a PV MCA driver use this or inject a GPF.
> Multiplexing all the MSR's related to emulate MCA/MCE for the guests is much
> more complex than just injecting a GPF - and slower.

Emulating MCE for the guest wasn't my intended alternative suggestion. Instead, 
my idea was that if the guest hasn't registered a "PV MCE handler", we just 
kill the domain immediately, e.g. in a similar way to 
"domain_crash_synchronous()". Don't give the guest any chance to "do 
something wrong" in the process - it's already broken, and letting it run any 
further will almost certainly not help matters. This may not be the prettiest 
solution, but on the other hand, a "Windows blue-screen" or Linux "oops" 
saying that a GP fault happened at some random place in the guest isn't exactly 
helping the SysAdmin understand the problem either. 
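
Roughly, the policy I have in mind looks like the sketch below. The struct, 
the enum and report_mce() are hypothetical names for illustration only; they 
are not Xen's real "struct domain" or any existing hypervisor interface, and 
setting the "crashed" flag merely stands in for whatever actually tears the 
domain down. 

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical per-guest state; not Xen's real struct domain. */
struct guest {
    bool has_pv_mce_handler;   /* guest registered a PV MCE handler */
    bool crashed;              /* domain has been killed */
    unsigned pending_events;   /* MCE notifications queued for the guest */
};

enum mce_action { MCE_NOTIFIED, MCE_DOMAIN_KILLED };

/* Decide what to do with a guest affected by a machine check. */
static enum mce_action report_mce(struct guest *g)
{
    assert(g != NULL);

    if (g->has_pv_mce_handler) {
        /* The guest said "I can handle MCE's": notify it via an event. */
        g->pending_events++;
        return MCE_NOTIFIED;
    }

    /* No handler registered: certain death, never an injected GP fault
     * that the guest kernel might catch and carry on from. */
    g->crashed = true;
    return MCE_DOMAIN_KILLED;
}
```

The point of the second branch is that the guest never gets to execute another 
instruction after the error is seen, so there is nothing for a try/catch style 
handler to swallow. 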

--
Mats
> 
> Keir, what are your opinions on this thread?
> 
> 
> Christoph
> 
> -- 
> AMD Saxony, Dresden, Germany
> Operating System Research Center
> 
> Legal Information:
> AMD Saxony Limited Liability Company & Co. KG
> Sitz (Geschäftsanschrift):
>    Wilschdorfer Landstr. 101, 01109 Dresden, Deutschland
> Registergericht Dresden: HRA 4896
> vertretungsberechtigter Komplementär:
>    AMD Saxony LLC (Sitz Wilmington, Delaware, USA)
> Geschäftsführer der AMD Saxony LLC:
>    Dr. Hans-R. Deppe, Thomas McCoy
> 
> 
> 
> _______________________________________________
> Xen-devel mailing list
> Xen-devel@xxxxxxxxxxxxxxxxxxx
> http://lists.xensource.com/xen-devel
> 
> 
> 





 

