[Date Prev][Date Next][Thread Prev][Thread Next][Date Index][Thread Index] Re: [Xen-devel] [PATCH V2] xen: vmx: Use an INT 2 call to process real NMI's instead of self_nmi() in VMEXIT handler
On 15/11/12 16:41, Tim Deegan wrote: > Hi, > > At 10:06 +0000 on 14 Nov (1352887560), Jan Beulich wrote: >>> + asm volatile("int $2"); /* Real NMI, vector 2: normal >>> processing */ >> And I still don't like this use of "int $2" here: An aspect we didn't >> consider so far is that a nested MCE would break things again > OK, I think I understand the problem[s], but I'm going to spell it out > slowly so you can correct me. :) > > [ tl;dr I agree that do_nmi() is better, and we should do that in this > patch, but maybe we need to solve the general problem too. ] > > On a PV guest, we have to use dedicated stacks for NMI and MCE in case > either of those things happens just before SYSRET when we're on the user > stack (no other interrupt or exception can happen at that point). > > On an AMD CPU we _don't_ have dedicated stacks for NMI or MCE when we're > running a HVM guest, so the stack issue doesn't apply (but nested NMIs > are still bad). > > On an Intel CPU, we _do_ use dedicated stacks for NMI and MCE in HVM > guests. We don't really have to but it saves time in the context switch > not to update the IDT. Using do_nmi() here means that the first NMI is > handled on the normal stack instead. It's also consistent with the way > we call do_machine_check() for the MCE case. But it needs an explicit > IRET after the call to do_nmi() to make sure that NMIs get re-enabled. > > These dedicated stacks make the general problem of re-entrant MCE/NMI > worse. In the general case those handlers don't expect to be called in > a reentrant way, but blatting the stack turns a possible problem into a > definite one. I have made a fairly simple patch which deliberately invokes a re-entrant NMI. The result is that a PCPU spins around the NMI handler until the watchdog takes the host down. It is also possible to get a reentrant NMI if there is a pagefault (or handful of other possible faults) when trying to execute the iret of the NMI itself; NMIs can get re-enabled from the iret of the pagefault, and we take a new NMI before attempting to retry the iret from the original NMI. > > --- > > All of this would be moot except for the risk that we might take an MCE > while in the NMI handler. The IRET from the MCE handler re-enables NMIs > while we're still in the NMI handler, and a second NMI arriving could > break the NMI handler. In the PV case, it will also clobber the NMI > handler's stack. In the VMX case we would need to see something like > (NMI (MCE) (NMI (MCE) (NMI))) for that to happen, but it could. There is the MCIP bit in an MCE status MSR which acts as a latch for MCEs. If a new MCE is generated while this bit is set, then a triple fault occurs. An MCE handler is required to set this bit to 0 to indicate that it has dealt with the MCE. However, there is a race condition window between setting this bit to 0 and leaving the MCE stack during which another MCE can arrive and corrupt the stack. > > The inverse case, taking an NMI while in the MCE handler, is not very > interesting. There's no masking of MCEs so that handler already has to > deal with nested entry, and the IRET from the NMI handler has no effect. > > We could potentially solve the problem by having the MCE handler check > whether it's returning to the NMI stack, and do a normal return in that > case. It's a bit of extra code but only in the MCE handler, which is > not performance-critical. > > If we do that, then the choice of 'int $2' vs 'do_nmi(); fake_iret()' > is mostly one of taste. do_nmi() saves an IDT indirection but > unbalances the call/return stack. I slightly prefer 'int $2' just > because it makes the PV and non-PV cases more similar. > > But first, we should take the current fix, with do_nmi() and iret() > instead of 'int $2'. The nested-MCE issue can be handled separately. > > Does that make sense? I have been looking at appling a similar fix to Linuses fix (http://git.kernel.org/?p=linux/kernel/git/torvalds/linux-2.6.git;a=commitdiff;h=3f3c8b8c4b2a34776c3470142a7c8baafcda6eb0) to Xen, for both the NMI and MCE stacks. Work is currently in the preliminary stages at the moment. ~Andrew > > Tim. > > _______________________________________________ > Xen-devel mailing list > Xen-devel@xxxxxxxxxxxxx > http://lists.xen.org/xen-devel -- Andrew Cooper - Dom0 Kernel Engineer, Citrix XenServer T: +44 (0)1223 225 900, http://www.citrix.com _______________________________________________ Xen-devel mailing list Xen-devel@xxxxxxxxxxxxx http://lists.xen.org/xen-devel
|
Lists.xenproject.org is hosted with RackSpace, monitoring our |