
Re: [Xen-devel] [PATCH V2] xen: vmx: Use an INT 2 call to process real NMI's instead of self_nmi() in VMEXIT handler



>>> On 15.11.12 at 17:41, Tim Deegan <tim@xxxxxxx> wrote:
> At 10:06 +0000 on 14 Nov (1352887560), Jan Beulich wrote:
>> > +            asm volatile("int $2"); /* Real NMI, vector 2: normal processing */
>> 
>> And I still don't like this use of "int $2" here: An aspect we didn't
>> consider so far is that a nested MCE would break things again
> 
> OK, I think I understand the problem[s], but I'm going to spell it out
> slowly so you can correct me. :)
> 
> [ tl;dr I agree that do_nmi() is better, and we should do that in this
>   patch, but maybe we need to solve the general problem too. ]
> 
> On a PV guest, we have to use dedicated stacks for NMI and MCE in case
> either of those things happens just before SYSRET when we're on the user
> stack (no other interrupt or exception can happen at that point).

Yes.

> On an AMD CPU we _don't_ have dedicated stacks for NMI or MCE when we're
> running a HVM guest, so the stack issue doesn't apply (but nested NMIs
> are still bad).

Yes, albeit that's a separate potential problem too (because we
don't distinguish whether the event originated in guest or
hypervisor context, i.e. an MCE due to a bad stack page would be
handled on that same stack page, i.e. shutdown).

> On an Intel CPU, we _do_ use dedicated stacks for NMI and MCE in HVM
> guests.  We don't really have to but it saves time in the context switch
> not to update the IDT.  Using do_nmi() here means that the first NMI is
> handled on the normal stack instead.  It's also consistent with the way
> we call do_machine_check() for the MCE case.  But it needs an explicit
> IRET after the call to do_nmi() to make sure that NMIs get re-enabled.

Or have the subsequent VMRESUME take care of this. There is a
sentence in "26.6.1 Interruptibility State" to that effect: "NMIs are
not blocked in VMX nonroot operation (except for ordinary blocking
for other reasons, such as by the MOV SS instruction, the
wait-for-SIPI state, etc.)", so NMIs get unblocked implicitly with
the VMRESUME.

Hence the only problem is with ending the NMI context when not
exiting back to VMX guest.

I continue to be against special-casing this in VMX code,
considering that the problem is generic (it similarly affects PV).
Either we handle the PV case the same way (special code added
to the PV code path as well), or we deal with this in a single
place, keeping NMIs masked for an extended period of time.

> These dedicated stacks make the general problem of re-entrant MCE/NMI
> worse.  In the general case those handlers don't expect to be called in
> a reentrant way, but blatting the stack turns a possible problem into a
> definite one.

Yes. I think almost everyone agrees that this is a design flaw.

> ---
> 
> All of this would be moot except for the risk that we might take an MCE
> while in the NMI handler.  The IRET from the MCE handler re-enables NMIs
> while we're still in the NMI handler, and a second NMI arriving could
> break the NMI handler.  In the PV case, it will also clobber the NMI
> handler's stack.

No - the entry code switches away from the dedicated stacks when
the origin was in guest context (see handle_ist_exception in
xen/arch/x86/x86_64/entry.S).

> In the VMX case we would need to see something like
> (NMI (MCE) (NMI (MCE) (NMI))) for that to happen, but it could.
> 
> The inverse case, taking an NMI while in the MCE handler, is not very
> interesting.  There's no masking of MCEs so that handler already has to
> deal with nested entry, and the IRET from the NMI handler has no effect.

As already pointed out by Andrew, there is an in-progress bit
for this. The thing is that we'd have to switch away from the
dedicated stack before clearing that bit (which shouldn't be too
difficult; if the MCE was caused by the normal stack, we
shouldn't be getting to the point of trying to exit from the MCE
handler anyway).

> We could potentially solve the problem by having the MCE handler check
> whether it's returning to the NMI stack, and do a normal return in that
> case.  It's a bit of extra code but only in the MCE handler, which is
> not performance-critical. 

Yes, that could solve that nesting case (again not very difficult
to implement).

> If we do that, then the choice of 'int $2' vs 'do_nmi(); fake_iret()'
> is mostly one of taste.  do_nmi() saves an IDT indirection but
> unbalances the call/return stack.  I slightly prefer 'int $2' just
> because it makes the PV and non-PV cases more similar.

Once the nesting is dealt with properly, and if we decide to do
the NMI-disabled-window-exit early, then yes, I agree (apart
from the taste aspect, which would still tell me not to use "int $xx"
in hypervisor code). And, as I think Mats said, this might then be
done for the MCE the same way for consistency.

> But first, we should take the current fix, with do_nmi() and iret() 
> instead of 'int $2'.  The nested-MCE issue can be handled separately.

Leaving the PV case unaddressed...

> Does that make sense?

Of course.

Jan

_______________________________________________
Xen-devel mailing list
Xen-devel@xxxxxxxxxxxxx
http://lists.xen.org/xen-devel


 

