
Re: [Xen-devel] [PATCH V2] xen: vmx: Use an INT 2 call to process real NMI's instead of self_nmi() in VMEXIT handler

On 15/11/12 16:41, Tim Deegan wrote:
> Hi, 
> At 10:06 +0000 on 14 Nov (1352887560), Jan Beulich wrote:
>>> +            asm volatile("int $2"); /* Real NMI, vector 2: normal processing */
>> And I still don't like this use of "int $2" here: An aspect we didn't
>> consider so far is that a nested MCE would break things again
> OK, I think I understand the problem[s], but I'm going to spell it out
> slowly so you can correct me. :)
> [ tl;dr I agree that do_nmi() is better, and we should do that in this
>   patch, but maybe we need to solve the general problem too. ]
> On a PV guest, we have to use dedicated stacks for NMI and MCE in case
> either of those things happens just before SYSRET when we're on the user
> stack (no other interrupt or exception can happen at that point).
> On an AMD CPU we _don't_ have dedicated stacks for NMI or MCE when we're
> running a HVM guest, so the stack issue doesn't apply (but nested NMIs
> are still bad).
> On an Intel CPU, we _do_ use dedicated stacks for NMI and MCE in HVM
> guests.  We don't really have to but it saves time in the context switch
> not to update the IDT.  Using do_nmi() here means that the first NMI is
> handled on the normal stack instead.  It's also consistent with the way
> we call do_machine_check() for the MCE case.  But it needs an explicit
> IRET after the call to do_nmi() to make sure that NMIs get re-enabled.
> These dedicated stacks make the general problem of re-entrant MCE/NMI
> worse.  In the general case those handlers don't expect to be called in
> a reentrant way, but blatting the stack turns a possible problem into a
> definite one.

I have made a fairly simple patch which deliberately invokes a
re-entrant NMI.  The result is that a PCPU spins in the NMI handler
until the watchdog takes the host down.  It is also possible to get a
re-entrant NMI if there is a pagefault (or one of a handful of other
possible faults) when trying to execute the iret of the NMI itself: NMIs
get re-enabled by the iret of the pagefault handler, and we take a new
NMI before attempting to retry the iret from the original NMI.
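
To make the sequence concrete, here is a rough toy model of the
NMI-blocking latch.  It is not Xen code: struct cpu_model and all of
the function names are made up for illustration, and it assumes only
that NMIs are blocked on delivery, that any iret unblocks them, and
that one NMI can be held pending while blocked.

```c
#include <stdbool.h>

/* Toy model of the NMI-blocking latch; hypothetical, not Xen code. */
struct cpu_model {
    bool nmi_blocked;    /* set when an NMI is delivered, cleared by any iret */
    bool nmi_pending;    /* one NMI held back while NMIs are blocked */
    int  nmi_depth;      /* NMI handler invocations currently live */
    int  max_nmi_depth;  /* deepest nesting observed so far */
};

static void deliver_nmi(struct cpu_model *c)
{
    c->nmi_blocked = true;
    c->nmi_depth++;
    if (c->nmi_depth > c->max_nmi_depth)
        c->max_nmi_depth = c->nmi_depth;
}

/* An NMI arrives: deliver it now, or hold it if NMIs are blocked. */
void raise_nmi(struct cpu_model *c)
{
    if (c->nmi_blocked)
        c->nmi_pending = true;
    else
        deliver_nmi(c);
}

/* iret executed by a non-NMI handler, e.g. the pagefault handler. */
void iret_from_fault(struct cpu_model *c)
{
    c->nmi_blocked = false;
    if (c->nmi_pending) {
        c->nmi_pending = false;
        deliver_nmi(c);
    }
}

/* Successful iret which really does leave the NMI handler. */
void iret_from_nmi(struct cpu_model *c)
{
    c->nmi_depth--;
    c->nmi_blocked = false;
    if (c->nmi_pending) {
        c->nmi_pending = false;
        deliver_nmi(c);
    }
}

/* The NMI's own iret faults, and a second NMI arrives before the retry. */
int fault_on_nmi_iret_depth(void)
{
    struct cpu_model c = {0};

    raise_nmi(&c);        /* first NMI: handler entered, NMIs blocked */
    iret_from_fault(&c);  /* fault handler returns: NMIs unblocked early */
    raise_nmi(&c);        /* second NMI lands on the still-live first one */
    return c.max_nmi_depth;
}

/* Two NMIs with no fault: the second waits for the first to finish. */
int back_to_back_nmi_depth(void)
{
    struct cpu_model c = {0};

    raise_nmi(&c);        /* first NMI delivered */
    raise_nmi(&c);        /* second NMI held pending */
    iret_from_nmi(&c);    /* first handler leaves, pending NMI delivered */
    return c.max_nmi_depth;
}
```

In this model the faulting-iret path reaches a nesting depth of 2,
whereas two back-to-back NMIs without a fault never go beyond 1.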

> ---
> All of this would be moot except for the risk that we might take an MCE
> while in the NMI handler.  The IRET from the MCE handler re-enables NMIs
> while we're still in the NMI handler, and a second NMI arriving could
> break the NMI handler.  In the PV case, it will also clobber the NMI
> handler's stack.  In the VMX case we would need to see something like
> (NMI (MCE) (NMI (MCE) (NMI))) for that to happen, but it could.

The MCIP bit in the MCG_STATUS MSR acts as a latch for MCEs.  If a new
MCE is generated while this bit is set, the processor shuts down, as it
would on a triple fault.  An MCE handler is required to clear this bit
to indicate that it has dealt with the MCE.  However, there is a race
window between clearing this bit and leaving the MCE stack, during
which another MCE can arrive and corrupt the stack.
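
The latch and the race window described above can be sketched as a
small state model.  Again this is only an illustration: struct
mce_model, the enum and the function names are hypothetical, and the
model tracks just two facts, namely whether MCIP is set and whether
the handler is still running on the dedicated MCE stack.

```c
#include <stdbool.h>

/* What happens when an MCE arrives, in this toy model. */
enum mce_outcome {
    MCE_ENTERED,          /* handler entered normally */
    MCE_FATAL,            /* MCIP was still set: the fatal case */
    MCE_STACK_CLOBBERED   /* MCIP clear but old frame still on the stack */
};

/* Hypothetical model state; not a Xen structure. */
struct mce_model {
    bool mcip;            /* the "machine check in progress" latch */
    bool on_mce_stack;    /* handler has not yet left the MCE stack */
};

enum mce_outcome raise_mce(struct mce_model *m)
{
    if (m->mcip)
        return MCE_FATAL;
    if (m->on_mce_stack)
        return MCE_STACK_CLOBBERED;  /* new frame overwrites the old one */
    m->mcip = true;
    m->on_mce_stack = true;
    return MCE_ENTERED;
}

/* The handler declares that it has dealt with the MCE. */
void handler_clear_mcip(struct mce_model *m)
{
    m->mcip = false;
}

/* The handler finally leaves the MCE stack. */
void handler_iret(struct mce_model *m)
{
    m->on_mce_stack = false;
}
```

The window of interest is between handler_clear_mcip() and
handler_iret(): an MCE raised there is no longer fatal, but it lands on
a stack that is still in use.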

> The inverse case, taking an NMI while in the MCE handler, is not very
> interesting.  There's no masking of MCEs so that handler already has to
> deal with nested entry, and the IRET from the NMI handler has no effect.
> We could potentially solve the problem by having the MCE handler check
> whether it's returning to the NMI stack, and do a normal return in that
> case.  It's a bit of extra code but only in the MCE handler, which is
> not performance-critical. 
> If we do that, then the choice of 'int $2' vs 'do_nmi(); fake_iret()'
> is mostly one of taste.  do_nmi() saves an IDT indirection but
> unbalances the call/return stack.  I slightly prefer 'int $2' just
> because it makes the PV and non-PV cases more similar.
> But first, we should take the current fix, with do_nmi() and iret() 
> instead of 'int $2'.  The nested-MCE issue can be handled separately.
> Does that make sense?
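
For reference, the check suggested above (have the MCE handler notice
that it is returning to the NMI stack and do a normal return instead of
an iret) might look roughly like the sketch below.  The names
struct stack_range, mce_exit_path() and the enum are invented for
illustration, and treating the stack as the half-open range
[base, top) is an assumption rather than how Xen lays things out.

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical description of a dedicated stack: [base, top). */
struct stack_range {
    uintptr_t base;   /* lowest address belonging to the stack */
    uintptr_t top;    /* one past the highest address */
};

enum exit_path {
    EXIT_IRET,          /* ordinary exit: iret, which re-enables NMIs */
    EXIT_PLAIN_RETURN   /* exit without iret, leaving NMIs blocked */
};

/* Does the interrupted context's stack pointer lie on this stack? */
static bool returns_to_stack(const struct stack_range *s, uintptr_t saved_rsp)
{
    return saved_rsp >= s->base && saved_rsp < s->top;
}

/*
 * Choose how the MCE handler should leave.  If the interrupted code was
 * running on the NMI stack, avoid iret so that NMIs stay blocked until
 * the NMI handler itself finishes.
 */
enum exit_path mce_exit_path(const struct stack_range *nmi_stack,
                             uintptr_t saved_rsp)
{
    return returns_to_stack(nmi_stack, saved_rsp) ? EXIT_PLAIN_RETURN
                                                  : EXIT_IRET;
}
```

The extra comparison only runs on the MCE exit path, which is not
performance-critical.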

I have been looking at applying a fix similar to Linus's fix
to Xen, for both the NMI and MCE stacks.

Work is currently in the preliminary stages.


> Tim.
> _______________________________________________
> Xen-devel mailing list
> Xen-devel@xxxxxxxxxxxxx
> http://lists.xen.org/xen-devel

Andrew Cooper - Dom0 Kernel Engineer, Citrix XenServer
T: +44 (0)1223 225 900, http://www.citrix.com
