[Date Prev][Date Next][Thread Prev][Thread Next][Date Index][Thread Index]

Re: [Xen-devel] Woes of NMIs and MCEs, and possibly how to fix



At 22:55 +0000 on 30 Nov (1354316132), Andrew Cooper wrote:
> Any spinlocking whatsoever is out, until we completely fix the
> re-entrant entry to do_{nmi,mce}(), at which point spinlocks which are
> ok so long as they are exclusively used inside their respective
> handlers.

AFAICT spinlocks are bad news even if they're only in their respective
handlers, as one CPU can get (NMI (MCE)) while the other gets (MCE (NMI)).

> >>> 2) Faults on the MCE path will re-enable NMIs, as will the iret of the
> >>> MCE itself if an MCE interrupts an NMI.
> >> The same questions apply as to #1 (just replace NMI with MCE)
> > Andrew pointed out that some MCE code uses rdmsr_safe().
> >
> > FWIW, I think that constraining MCE and NMI code not to do anything that
> > can fault is perfectly reasonable.  The MCE code has grown a lot
> > recently and probably needs an audit to check for spinlocks, faults &c.
> 
> Yes.  However, being able to deal gracefully with the case were we miss
> things on code review which touches the NMI/MCE paths is certainly
> better than crashing in subtle ways.

Agreed.

> At the moment, the clearing of the MCIP bit is quite early in a few of
> the cpu family specific MCE handlers.  As it is an architectural MSR, I
> was considering moving it outside the family specific handlers, and make
> one of the last things on the MCE path, to help reduce the race
> condition window until we properly fix reentrant MCEs.

Why not have a per-cpu mce-in-progress flag, and clear MCIP early?  That
way you get a panic instead of silently losing a CPU.

> >>> [1] In an effort to prevent a flamewar with my comment, the situation we
> >>> find outself in now is almost certainly the result of unforseen
> >>> interactions of individual features, but we are left to pick up the many
> >>> pieces in way which cant completely be solved.
> >>>
> >> Happy to have my comments completely shot down into little bits, but I'm 
> >> worrying that we're looking to solve a problem that doesn't actually 
> >> need solving - at least as long as the code in the respective handlers 
> >> are "doing the right thing", and if that happens to be broken, then we 
> >> should fix THAT, not build lots of extra code to recover from such a thing.
> > I agree.  The only things we can't fix by DTRT in do_nmi() and do_mce()
> > are:
> >  - IRET in SMM mode re-enabling NMIs; and
> >  - detecting _every_ case where we get a nested NMI/MCE (all we can 
> >    do is detect _most_ cases, but the detection is just so we can print
> >    a message before the crash).
> 
> We would need to modify the asm stubs to detect nesting.

We can detect _almost all_ nesting from C with an in-progress flag.  We
can probably expand that to cover all nesting by pushing the
flag-setting/flag-clearing out to the asm but that'd still be only a
couple of lines of change - a lot simpler than what we'd need to allow
nested MCE/NMIs to continue without crashing.

>  I think it is
> unreasonable to expect the C functions do_{nmi,mce}() to be reentrantly
> safe.

Good. :)  I'm not suggesting that.

Tim.

_______________________________________________
Xen-devel mailing list
Xen-devel@xxxxxxxxxxxxx
http://lists.xen.org/xen-devel


 


Rackspace

Lists.xenproject.org is hosted with RackSpace, monitoring our
servers 24x7x365 and backed by RackSpace's Fanatical Support®.