Re: [Xen-devel] [PATCH V2] xen: vmx: Use an INT 2 call to process real NMI's instead of self_nmi() in VMEXIT handler
On 15/11/12 16:41, Tim Deegan wrote:
> Hi,
>
> At 10:06 +0000 on 14 Nov (1352887560), Jan Beulich wrote:
>>> + asm volatile("int $2"); /* Real NMI, vector 2: normal
>>> processing */
>> And I still don't like this use of "int $2" here: An aspect we didn't
>> consider so far is that a nested MCE would break things again
> OK, I think I understand the problem[s], but I'm going to spell it out
> slowly so you can correct me. :)
>
> [ tl;dr I agree that do_nmi() is better, and we should do that in this
> patch, but maybe we need to solve the general problem too. ]
>
> On a PV guest, we have to use dedicated stacks for NMI and MCE in case
> either of those things happens just before SYSRET when we're on the user
> stack (no other interrupt or exception can happen at that point).
>
> On an AMD CPU we _don't_ have dedicated stacks for NMI or MCE when we're
> running a HVM guest, so the stack issue doesn't apply (but nested NMIs
> are still bad).
>
> On an Intel CPU, we _do_ use dedicated stacks for NMI and MCE in HVM
> guests. We don't really have to but it saves time in the context switch
> not to update the IDT. Using do_nmi() here means that the first NMI is
> handled on the normal stack instead. It's also consistent with the way
> we call do_machine_check() for the MCE case. But it needs an explicit
> IRET after the call to do_nmi() to make sure that NMIs get re-enabled.
>
> These dedicated stacks make the general problem of re-entrant MCE/NMI
> worse. In the general case those handlers don't expect to be called in
> a reentrant way, but blatting the stack turns a possible problem into a
> definite one.
I have made a fairly simple patch which deliberately invokes a
re-entrant NMI. The result is that a PCPU spins in the NMI handler
until the watchdog takes the host down. It is also possible to get a
re-entrant NMI if there is a page fault (or one of a handful of other
faults) when trying to execute the iret of the NMI itself: NMIs get
re-enabled by the iret of the page fault handler, and we take a new NMI
before attempting to retry the iret from the original NMI.
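
To make the iret case concrete, here is a toy user-space model of the
NMI latch. It is only a sketch, not Xen code: struct cpu_model and every
function name in it are made up for illustration, and it assumes nothing
beyond "NMI delivery blocks further NMIs, and any iret unblocks them".

```c
#include <assert.h>
#include <stdbool.h>

/* Toy model of the CPU's NMI blocking; hypothetical names, not Xen code. */
struct cpu_model {
    bool nmi_blocked;   /* set on NMI delivery, cleared by any iret */
    bool nmi_pending;   /* an NMI is waiting to be delivered */
    int  nmi_depth;     /* NMI handler frames currently live */
    int  max_nmi_depth; /* deepest nesting seen so far */
};

/* Deliver a pending NMI if NMIs are not blocked. */
static bool try_deliver_nmi(struct cpu_model *c)
{
    if (!c->nmi_pending || c->nmi_blocked)
        return false;
    c->nmi_pending = false;
    c->nmi_blocked = true;
    c->nmi_depth++;
    if (c->nmi_depth > c->max_nmi_depth)
        c->max_nmi_depth = c->nmi_depth;
    return true;
}

/* Any iret that completes unblocks NMIs, whichever handler issued it. */
static void iret(struct cpu_model *c)
{
    c->nmi_blocked = false;
}

/* The NMI handler's own iret completes, so its frame goes away. */
static void nmi_iret(struct cpu_model *c)
{
    assert(c->nmi_depth > 0);
    iret(c);
    c->nmi_depth--;
}

/*
 * Two NMIs arrive back to back.  If iret_faults is true, the first
 * handler's iret takes a page fault and the fault handler's iret runs
 * first.  Returns the deepest NMI nesting observed.
 */
static int nmi_nesting_depth(bool iret_faults)
{
    struct cpu_model c = { false, false, 0, 0 };

    c.nmi_pending = true;
    try_deliver_nmi(&c);   /* first NMI: NMIs now blocked */
    c.nmi_pending = true;  /* second NMI arrives and is held back */

    if (iret_faults)
        iret(&c);          /* page fault handler returns: NMIs unblocked */
    else
        nmi_iret(&c);      /* first handler returns cleanly */

    try_deliver_nmi(&c);   /* second NMI is taken here */
    return c.max_nmi_depth;
}
```

With iret_faults set, the second NMI is taken while the first handler's
frame is still live (depth 2); without it the two NMIs run one after the
other (depth 1).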
>
> ---
>
> All of this would be moot except for the risk that we might take an MCE
> while in the NMI handler. The IRET from the MCE handler re-enables NMIs
> while we're still in the NMI handler, and a second NMI arriving could
> break the NMI handler. In the PV case, it will also clobber the NMI
> handler's stack. In the VMX case we would need to see something like
> (NMI (MCE) (NMI (MCE) (NMI))) for that to happen, but it could.
There is the MCIP bit in the MCG_STATUS MSR which acts as a latch for
MCEs. If a new MCE is generated while this bit is set, the processor
enters shutdown, as it would on a triple fault. An MCE handler is
required to clear this bit to indicate that it has dealt with the MCE.
However, there is a race window between clearing this bit and leaving
the MCE stack during which another MCE can arrive and corrupt the stack.
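
The window can be shown with a small user-space model. Again this is a
sketch under stated assumptions, not Xen code: struct mce_model, the
outcome enum and the helper names are invented here, and the MSR is just
a variable. The only hardware detail relied on is that MCIP is bit 2 of
MCG_STATUS.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

#define MCG_STATUS_MCIP (1ULL << 2)  /* machine check in progress */

/* What happens to an incoming MCE in this model (hypothetical names). */
enum mce_outcome { MCE_HANDLED, MCE_FATAL, MCE_STACK_CLOBBERED };

struct mce_model {
    uint64_t mcg_status;  /* stands in for the real MSR */
    bool on_mce_stack;    /* a handler frame is live on the dedicated stack */
};

static enum mce_outcome deliver_mce(struct mce_model *m)
{
    if (m->mcg_status & MCG_STATUS_MCIP)
        return MCE_FATAL;            /* latch still set: host goes down */
    if (m->on_mce_stack)
        return MCE_STACK_CLOBBERED;  /* new frame lands on the live one */
    m->mcg_status |= MCG_STATUS_MCIP;
    m->on_mce_stack = true;
    return MCE_HANDLED;
}

static void clear_mcip(struct mce_model *m)
{
    m->mcg_status &= ~MCG_STATUS_MCIP;
}

static void leave_mce_stack(struct mce_model *m)
{
    assert(m->on_mce_stack);
    m->on_mce_stack = false;
}

/*
 * Take one MCE, then inject a second one at a chosen point:
 *   0 = before MCIP is cleared
 *   1 = after MCIP is cleared, before leaving the MCE stack
 *   2 = after leaving the MCE stack
 */
static enum mce_outcome second_mce_at(int step)
{
    struct mce_model m = { 0, false };

    deliver_mce(&m);
    if (step == 0)
        return deliver_mce(&m);
    clear_mcip(&m);
    if (step == 1)
        return deliver_mce(&m);
    leave_mce_stack(&m);
    return deliver_mce(&m);
}
```

Only step 1 is the problem case: the latch no longer protects the
handler, but its frame is still on the dedicated stack.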
>
> The inverse case, taking an NMI while in the MCE handler, is not very
> interesting. There's no masking of MCEs so that handler already has to
> deal with nested entry, and the IRET from the NMI handler has no effect.
>
> We could potentially solve the problem by having the MCE handler check
> whether it's returning to the NMI stack, and do a normal return in that
> case. It's a bit of extra code but only in the MCE handler, which is
> not performance-critical.
>
> If we do that, then the choice of 'int $2' vs 'do_nmi(); fake_iret()'
> is mostly one of taste. do_nmi() saves an IDT indirection but
> unbalances the call/return stack. I slightly prefer 'int $2' just
> because it makes the PV and non-PV cases more similar.
>
> But first, we should take the current fix, with do_nmi() and iret()
> instead of 'int $2'. The nested-MCE issue can be handled separately.
>
> Does that make sense?
I have been looking at applying a fix similar to Linus's
(http://git.kernel.org/?p=linux/kernel/git/torvalds/linux-2.6.git;a=commitdiff;h=3f3c8b8c4b2a34776c3470142a7c8baafcda6eb0)
to Xen, for both the NMI and MCE stacks.
Work is currently in the preliminary stages.
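
For reference, one general way to tolerate a nested NMI is to mark the
handler as executing, have a nested entry only record that another NMI
arrived, and let the outer invocation run the body again before it
returns. The sketch below is a single-threaded user-space model of that
latch-and-repeat idea. It is not taken from the commit above or from any
Xen patch, all names are made up, and a real implementation would need
atomic state updates and careful handling of the iret frame.

```c
#include <assert.h>

enum nmi_state { NMI_NOT_RUNNING, NMI_EXECUTING, NMI_LATCHED };

struct nmi_ctx {
    enum nmi_state state;
    int handled;        /* times the real handler body ran */
    int inject_nested;  /* test hook: nested NMIs still to inject */
};

static void nmi_entry(struct nmi_ctx *c);

/* Handler body; optionally simulates an NMI arriving while it runs. */
static void nmi_body(struct nmi_ctx *c)
{
    c->handled++;
    if (c->inject_nested > 0) {
        c->inject_nested--;
        nmi_entry(c);
    }
}

static void nmi_entry(struct nmi_ctx *c)
{
    if (c->state != NMI_NOT_RUNNING) {
        /* Nested entry: record it and return without running the body. */
        c->state = NMI_LATCHED;
        return;
    }
    do {
        c->state = NMI_EXECUTING;
        nmi_body(c);
        assert(c->state != NMI_NOT_RUNNING);
        /* If a nested NMI latched itself meanwhile, go round again. */
    } while (c->state == NMI_LATCHED);
    c->state = NMI_NOT_RUNNING;
}

/* Run one NMI with the given number of nested arrivals; return body runs. */
static int run_nmi(int nested)
{
    struct nmi_ctx c = { NMI_NOT_RUNNING, 0, nested };

    nmi_entry(&c);
    return c.handled;
}
```

In this model the body never runs re-entrantly, and each latched NMI
still gets a full pass of the handler.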
~Andrew
>
> Tim.
>
> _______________________________________________
> Xen-devel mailing list
> Xen-devel@xxxxxxxxxxxxx
> http://lists.xen.org/xen-devel
--
Andrew Cooper - Dom0 Kernel Engineer, Citrix XenServer
T: +44 (0)1223 225 900, http://www.citrix.com