
Re: [Xen-devel] [PATCH V2] xen: vmx: Use an INT 2 call to process real NMI's instead of self_nmi() in VMEXIT handler



>>> On 15.11.12 at 17:41, Tim Deegan <tim@xxxxxxx> wrote:
> At 10:06 +0000 on 14 Nov (1352887560), Jan Beulich wrote:
>> > +            asm volatile("int $2"); /* Real NMI, vector 2: normal processing */
>> 
>> And I still don't like this use of "int $2" here: An aspect we didn't
>> consider so far is that a nested MCE would break things again
> 
> OK, I think I understand the problem[s], but I'm going to spell it out
> slowly so you can correct me. :)
> 
> [ tl;dr I agree that do_nmi() is better, and we should do that in this
>   patch, but maybe we need to solve the general problem too. ]
> 
> On a PV guest, we have to use dedicated stacks for NMI and MCE in case
> either of those things happens just before SYSRET when we're on the user
> stack (no other interrupt or exception can happen at that point).

Yes.

> On an AMD CPU we _don't_ have dedicated stacks for NMI or MCE when we're
> running a HVM guest, so the stack issue doesn't apply (but nested NMIs
> are still bad).

Yes, albeit that's a separate potential problem too (because we
don't distinguish whether the event originated in guest or
hypervisor context, i.e. an MCE due to a bad stack page would be
handled on that same stack page, i.e. shutdown).

> On an Intel CPU, we _do_ use dedicated stacks for NMI and MCE in HVM
> guests.  We don't really have to but it saves time in the context switch
> not to update the IDT.  Using do_nmi() here means that the first NMI is
> handled on the normal stack instead.  It's also consistent with the way
> we call do_machine_check() for the MCE case.  But it needs an explicit
> IRET after the call to do_nmi() to make sure that NMIs get re-enabled.

Or have the subsequent VMRESUME take care of this. There is a
sentence in "26.6.1 Interruptibility State" to that effect: "NMIs are
not blocked in VMX nonroot operation (except for ordinary blocking
for other reasons, such as by the MOV SS instruction, the
wait-for-SIPI state, etc.)", so NMIs get unblocked implicitly with
the VMRESUME.

Hence the only problem is with ending the NMI context when not
exiting back to VMX guest.

I continue to be against special-casing this in VMX code,
considering that the problem is generic (it similarly affects PV).
Either we handle the PV case the same way (special code added
to the PV code path as well), or we deal with this in a single
place, keeping NMIs masked for an extended period of time.

> These dedicated stacks make the general problem of re-entrant MCE/NMI
> worse.  In the general case those handlers don't expect to be called in
> a reentrant way, but blatting the stack turns a possible problem into a
> definite one.

Yes. I think almost everyone agrees that this is a design flaw.

> ---
> 
> All of this would be moot except for the risk that we might take an MCE
> while in the NMI handler.  The IRET from the MCE handler re-enables NMIs
> while we're still in the NMI handler, and a second NMI arriving could
> break the NMI handler.  In the PV case, it will also clobber the NMI
> handler's stack.

No - the entry code switches away from the dedicated stacks when
the origin was in guest context (see handle_ist_exception in
xen/arch/x86/x86_64/entry.S).

> In the VMX case we would need to see something like
> (NMI (MCE) (NMI (MCE) (NMI))) for that to happen, but it could.
> 
> The inverse case, taking an NMI while in the MCE handler, is not very
> interesting.  There's no masking of MCEs so that handler already has to
> deal with nested entry, and the IRET from the NMI handler has no effect.

As already pointed out by Andrew, there is an in-progress bit
for this. The thing is that we'd have to switch away from the
dedicated stack before clearing that bit (which shouldn't be too
difficult; if the MCE was caused by the normal stack, we
shouldn't be getting to the point of trying to exit from the MCE
handler anyway).

> We could potentially solve the problem by having the MCE handler check
> whether it's returning to the NMI stack, and do a normal return in that
> case.  It's a bit of extra code but only in the MCE handler, which is
> not performance-critical. 

Yes, that could solve that nesting case (again not very difficult
to implement).

> If we do that, then the choice of 'int $2' vs 'do_nmi(); fake_iret()'
> is mostly one of taste.  do_nmi() saves an IDT indirection but
> unbalances the call/return stack.  I slightly prefer 'int $2' just
> because it makes the PV and non-PV cases more similar.

Once the nesting is dealt with properly, and if we decide to do
the NMI-disabled-window-exit early, then yes, I agree (apart
from the taste aspect, which would still tell me not to use "int $xx"
in hypervisor code). And, as I think Mats said, this might then be
done for the MCE the same way for consistency.

> But first, we should take the current fix, with do_nmi() and iret() 
> instead of 'int $2'.  The nested-MCE issue can be handled separately.

Leaving the PV case unaddressed...

> Does that make sense?

Of course.

Jan

_______________________________________________
Xen-devel mailing list
Xen-devel@xxxxxxxxxxxxx
http://lists.xen.org/xen-devel


 

