
Re: [Xen-devel] [PATCH] x86: machine check exception handling



On Thursday 21 June 2007 16:15:36 Keir Fraser wrote:
> On 19/6/07 11:06, "Jan Beulich" <jbeulich@xxxxxxxxxx> wrote:
> > Properly handle MCE (connecting the existing, but so far unused vendor
> > specific handlers). HVM guests don't own CR4.MCE (and hence can't
> > suppress the exception) anymore, preventing silent machine shutdown.
> >
> > This patch won't apply or work without the patch removing i386's NMI
> > deferral.
>
> Applied with the following changes:
>  1. Pulled out the common parts of the NMI/MCE asm handlers into a common
> subroutine (like all other exception handlers jump to handle_exception to
> do the hard work).
>  2. Kept do_machine_check() as analog of do_nmi(), which can hide
> machine_check_vector definition (and hence I removed all changes inside
> arch/x86/cpu/mcheck). I'd like to keep do_machine_check(), even if it
> remains no more than a direct call to machine_check_vector(). We could
> clean up machine_check_vector() as a separate patch -- not sure if it's
> worth it right now, and maybe we're better off keeping close to original
> Linux files?

That's not possible. The #MC handler and the polling handler (in non-fatal.c)
do, or are going to do, something completely different from what any OS will
ever do. See the discussion with the subject "MCA/MCE concept" for more
information.

>  3. Most contentious, I'm sure: removed VMX changes that would keep
> interrupts disabled across NMI/MCE. The reason is simply that SVM does not
> bother with this. If there is a requirement that NMI/MCE be called with
> particular constraints on EFLAGS, then we should make that clear and fix up
> both VMX and SVM in a separate patch. The pain of this is that it would
> probably require extra checks on critical vmexit paths. Is it *really* that
> bad for #MC to get interrupted?

Unlike the polling handler, the #MC handler must not be interrupted: that is
*very* bad. A #MC always means that the hardware has detected an
uncorrectable ECC error. First you have to figure out who is affected: Xen,
Dom0 or a DomU?
For Xen and Dom0, you can only either recover using hardware correction
features or crash. For a DomU, you can kill that DomU in the worst case
and keep the rest running.
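
The triage described above can be sketched roughly as follows. This is an
illustration only, not code from the patch or from Xen: every identifier is
hypothetical, and the single hw_can_correct flag is a simplifying assumption
(including the assumption that hardware correction is also tried first for a
DomU).

```c
#include <stdbool.h>

/* Hypothetical names for illustration; none of these exist in Xen. */
enum mc_victim { MC_VICTIM_XEN, MC_VICTIM_DOM0, MC_VICTIM_DOMU };

enum mc_action {
    MC_ACTION_HW_RECOVER,   /* rely on hardware correction features */
    MC_ACTION_KILL_DOMAIN,  /* destroy only the affected guest */
    MC_ACTION_CRASH         /* bring the whole machine down */
};

/*
 * Decide what to do once the #MC handler knows who was hit.
 * hw_can_correct is an assumed input: whether a hardware correction
 * feature is able to deal with this particular error.
 */
enum mc_action mc_decide(enum mc_victim victim, bool hw_can_correct)
{
    if (hw_can_correct)
        return MC_ACTION_HW_RECOVER;

    switch (victim) {
    case MC_VICTIM_DOMU:
        /* Worst case for a guest: kill it, keep everything else running. */
        return MC_ACTION_KILL_DOMAIN;
    case MC_VICTIM_XEN:
    case MC_VICTIM_DOM0:
        /* Nothing left to try for the hypervisor or Dom0. */
        return MC_ACTION_CRASH;
    }
    return MC_ACTION_CRASH; /* unknown victim: be conservative */
}
```

The point of the sketch is that the first step, identifying the victim,
decides everything that follows, which is why the handler has to run to
completion without being interrupted.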

Again, see the discussion with the subject "MCA/MCE concept" for more
information.

Christoph


-- 
AMD Saxony, Dresden, Germany
Operating System Research Center

Legal Information:
AMD Saxony Limited Liability Company & Co. KG
Registered office (business address):
   Wilschdorfer Landstr. 101, 01109 Dresden, Germany
Register court Dresden: HRA 4896
General partner authorized to represent the company:
   AMD Saxony LLC (registered office Wilmington, Delaware, USA)
Managing directors of AMD Saxony LLC:
   Dr. Hans-R. Deppe, Thomas McCoy
