
Re: [Xen-devel] Woes of NMIs and MCEs, and possibly how to fix

To: <xen-devel@xxxxxxxxxxxxx>
From: Mats Petersson <mats.petersson@xxxxxxxxxx>
Date: Fri, 30 Nov 2012 19:12:11 +0000
Delivery-date: Fri, 30 Nov 2012 19:12:37 +0000
List-id: Xen developer discussion <xen-devel.lists.xen.org>

On 30/11/12 17:34, Andrew Cooper wrote:

Hello,

Yesterday, Tim and myself spent a very long time in front of a
whiteboard trying to develop a fix which covered all the problems, and
sadly it is very hard.  We managed to possibly come up with a long
solution which we think has no race conditions, but relies on very large
sections of reentrant code which can't use the stack or trash registers.
As such, it is not practical at all (assuming that any of us could
actually code it).


As a result, I thought instead that I would outline all the issues we
currently face.  We can then:
  * Decide which issues need fixing
  * Decide which issues need to at least be detected and crash gracefully
  * Decide which issues we are happy (or perhaps at least willing, if not
happy) to ignore

So, the issues are as follows.  (I have tried to list them in a logical
order, with 1 individual problem per number, but please do point out if
I have missed/misattributed entries.)

1) Faults on the NMI path will re-enable NMIs before the handler
returns, leading to reentrant behaviour.  We should audit the NMI path
to try and remove any needless cases which might fault, but getting a
fault-free path will be hard (and is not going to solve the reentrant
behaviour itself).

What sort of faults are we expecting on the NMI path? Surely the trap handler isn't paged out? Other faults would be that the code causes a GP fault or illegal instruction, divide by zero or similar - these should all cause a hypervisor panic anyways, surely? I'm sure I've missed something really important here, but I don't really see what faults we can expect to see within the NMI handler that are "recoverable".


2) Faults on the MCE path will re-enable NMIs, as will the iret of the
MCE itself if an MCE interrupts an NMI.


The same questions apply as to #1 (just replace NMI with MCE)


3) SMM mode executing an iret will re-enable NMIs.  There is nothing we
can do to prevent this, and as an SMI can interrupt NMIs and MCEs, no
way to predict if/when it may happen.  The best we can do is accept that
it might happen, and try to deal with the after effects.

SMM is a messy thing that can interfere with most things in a system. We will have to rely on the BIOS developers to not mess up here. We can't do anything else in our code (on AMD hardware, in a HVM guest you could trap SMI as a VMEXIT, and then "deal with it in a container", but that doesn't fix SMIs that happen whilst in the hypervisor, or in a PV kernel, so doesn't really help much).


4) "Fake NMIs" can be caused by hardware with access to the INTR pin
(very unlikely in modern systems with the LAPIC supporting virtual wire
mode), or by software executing an `int $0x2`.  This can cause the NMI
handler to run on the NMI stack, but without the normal hardware NMI
cessation logic being triggered.

5) "Fake MCEs" can be caused by software executing `int $0x18`, and by
any MSI/IOMMU/IOAPIC programmed to deliver vector 0x18.  Normally, this
could only be caused by a bug in Xen, although it is also possible on a
system without interrupt remapping (where the host administrator has
accepted the documented security issue, and decided still to pass through
a device to a trusted VM, and the VM in question has a buggy driver for
the passed-through hardware).

Surely both 4 & 5 are "bad guest behaviour", and whilst it's a "nice to have" to catch that, it's no different from running on bare metal doing daft things with vectors or writing code that doesn't behave at all "friendly". (4 is only available to Xen developers, who we hope are most of the time sane enough not to try these crazy things in a "live" system that matters.) 5 is only available if you have pass-through enabled. I don't think either is a particularly likely cause of real, in the field, problems.

That said, if it's a trivial fix on top of something that fixes the other problems mentioned, I'm OK with that being added.


6) Because of interrupt stack tables, real NMIs/MCEs can race with their
fake alternatives, where the real interrupt interrupts the fake one and
corrupts the exception frame of the fake one, losing the original
context to return to.  (This is one of the two core problems of
reentrancy with NMIs and MCEs.)
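
To make the frame corruption in 6) concrete, here is a minimal C model. It is not Xen code: `ist_stack`, `normal_stack` and the helper functions are all invented for illustration. The only assumption it encodes is that an IST entry always starts the handler at the same fixed stack top, so there is a single frame slot, whereas delivery on the current stack pushes a new frame below whatever is already there.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical model: an IST stack always places the exception frame at
 * the same fixed location, regardless of what is already running on it. */
struct ist_stack {
    uint64_t saved_rip;   /* return address held in the one frame slot */
};

/* Hypothetical model of an ordinary stack: frames nest below each other. */
struct normal_stack {
    uint64_t saved_rip[4];
    size_t depth;
};

/* Delivery through an IST entry: overwrite the single slot. */
static void ist_deliver(struct ist_stack *s, uint64_t interrupted_rip)
{
    s->saved_rip = interrupted_rip;
}

/* Returning from any handler on the IST stack reads that same slot. */
static uint64_t ist_iret(const struct ist_stack *s)
{
    return s->saved_rip;
}

/* Delivery on the current stack: push a new frame. */
static void normal_deliver(struct normal_stack *s, uint64_t interrupted_rip)
{
    assert(s->depth < 4);
    s->saved_rip[s->depth++] = interrupted_rip;
}

/* Return from the innermost handler on the ordinary stack: pop its frame. */
static uint64_t normal_iret(struct normal_stack *s)
{
    assert(s->depth > 0);
    return s->saved_rip[--s->depth];
}
```

If a fake NMI is taken at address 0x1000 and a real one then arrives inside its handler at 0x2000, both returns in the IST model go to 0x2000 and the original context is gone. The ordinary stack unwinds to 0x2000 and then to 0x1000.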

7) Real MCEs can race with each other.  If two real MCEs occur too close
together, the processor shuts down (we can't avoid this).  However, there
is a large race condition between the MCE handler clearing the MCIP bit of
IA32_MCG_STATUS and the handler returning during which a new MCE can
occur and the exception frame will be corrupted.

From what I understand, the purpose of this bit is really to ensure that any data needed from the MCE status registers has been fetched before the processor issues another MCE - otherwise you have a big race of "what data are we reading, and which of the multiple, in short succession, MCEs does this belong to". If you get two MCEs in such a short time that the MCE handler doesn't have time to gather the data from the status registers, it's likely that the machine isn't going to do very well for much longer anyways.

Now, if we have a common stack, we should not reset the MCIP bit until it is time to return from the MCE handler - ideally on the last instruction before that, but that may be a little difficult to achieve, seeing as at that point, no registers will be available [as we're restoring those to return back to previous context], but something close to that should make for a very minimal (but admittedly still existing) window for a race. It is questionable if the MCE logic and processor trapping mechanism will react quickly enough to the MCIP bit being cleared to deliver a new MCE before we get to the iret [or whatever instruction is ending the handler]. If it does, then we die. It is not much different from the case where a MCE happens while the MCIP bit is set, which will cause a processor shutdown - that's a reboot for anything with a "PC compatible chipset", as CPU shutdown is pretty much a useless dead state for the processor, and the chipset therefore pulls the reset pin as soon as this state is detected.
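
The window described above can be illustrated with a small C model. This is only a sketch under stated assumptions, not how Xen or the hardware is implemented: `cpu_model`, `raise_mce`, `handler_clear_mcip` and `handler_return` are hypothetical names, and `frame_corrupted` simply stands for "a second MCE was delivered onto the same fixed stack before the first handler returned". The one hardware fact used is that MCIP is bit 2 of IA32_MCG_STATUS.

```c
#include <assert.h>
#include <stdint.h>

#define MCG_STATUS_MCIP (1ULL << 2)  /* MCIP is bit 2 of IA32_MCG_STATUS */

enum mce_outcome { MCE_DELIVERED, MCE_SHUTDOWN };

/* Hypothetical, heavily simplified CPU state. */
struct cpu_model {
    uint64_t mcg_status;
    int in_handler;       /* handler has been entered and not yet returned */
    int frame_corrupted;  /* a second delivery overwrote the first frame */
};

/* A machine check is signalled. */
static enum mce_outcome raise_mce(struct cpu_model *c)
{
    if (c->mcg_status & MCG_STATUS_MCIP)
        return MCE_SHUTDOWN;        /* MCE while MCIP is set: shutdown */
    if (c->in_handler)
        c->frame_corrupted = 1;     /* MCIP clear, handler still running */
    c->mcg_status |= MCG_STATUS_MCIP;
    c->in_handler = 1;
    return MCE_DELIVERED;
}

/* The handler has read the status registers and clears MCIP. */
static void handler_clear_mcip(struct cpu_model *c)
{
    c->mcg_status &= ~MCG_STATUS_MCIP;
}

/* The handler executes its final return instruction. */
static void handler_return(struct cpu_model *c)
{
    c->in_handler = 0;
}
```

In this model the dangerous interval is exactly the time between `handler_clear_mcip()` and `handler_return()`, which is why clearing the bit as late as possible shrinks the window without closing it.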



In addition to the above issues, we have two NMI related bugs in Xen
which need fixing (which shall be part of the series which fixes the above)

8) VMEXIT reason NMI on Intel calls self_nmi() while NMIs are latched,
causing the PCPU to fall into a loop of VMEXITs until the VCPU timeslice
has expired, at which point the return-to-guest path decides to schedule
instead of resuming the guest.

The solution to this bug is to call the NMI handler either via the INT 2 instruction or via a call to do_nmi() or something similar. There are not many other options, and code to fix this was posted a couple of weeks ago. No, it's not completely "safe", but it's a whole lot better than the current non-working code. And that applies regardless of other issues with MCE and NMI handling.


9) The NMI handler when returning to ring3 will leave NMIs latched, as
it uses the sysret path.

That should also be relatively easy to fix, either by actually using an IRET at the end of the NMI handler (and using the "INT 2" solution above), or by making a fake stackframe for the "next instruction after IRET" on the stack, and then performing an IRET.
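
As a rough illustration of the "fake stackframe" idea, the C sketch below lays out the five values a 64-bit iret pops (RIP, CS, RFLAGS, RSP, SS, lowest address first) and fills them so that the iret lands on the next hypervisor instruction instead of leaving the hypervisor. The names `iret_frame` and `make_self_iret_frame` and all the parameters are hypothetical; the real thing would be a handful of pushes in assembly, and this only shows what would be pushed.

```c
#include <assert.h>
#include <stdint.h>

/* Values popped by a 64-bit iret, in increasing address order. */
struct iret_frame {
    uint64_t rip;
    uint64_t cs;
    uint64_t rflags;
    uint64_t rsp;
    uint64_t ss;
};

/*
 * Build a frame whose iret "returns" to continue_rip in the hypervisor
 * itself, keeping the hypervisor's own code/stack selectors and stack
 * pointer.  The point of executing such an iret is its side effect of
 * ending NMI blocking; control flow simply carries on afterwards.
 */
static struct iret_frame make_self_iret_frame(uint64_t continue_rip,
                                              uint64_t hv_cs,
                                              uint64_t rflags,
                                              uint64_t hv_rsp,
                                              uint64_t hv_ss)
{
    struct iret_frame f = {
        .rip    = continue_rip,
        .cs     = hv_cs,
        .rflags = rflags,
        .rsp    = hv_rsp,
        .ss     = hv_ss,
    };
    return f;
}
```

After such an iret the handler could continue on to the existing sysret exit path with NMIs no longer latched, which is the effect being asked for in 9).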



As for one possible solution which we can't use:

If it were not for the sysret stupidness[1] of requiring the hypervisor
to move to the guest stack before executing the `sysret` instruction, we
could do away with the stack tables for NMIs and MCEs altogether, and
the above craziness would be easy to fix.  However, the overhead of
always using iret to return to ring3 is not likely to be acceptable,
meaning that we cannot "fix" the problem by discarding interrupt stacks
and doing everything properly on the main hypervisor stack.


Looking at the above problems, I believe there is a solution if we are
willing to ignore the problem to do with SMM re-enabling NMIs, and if we
are happy to crash gracefully when mixes of NMIs and MCEs interrupt each
other and trash their exception frames (in situations where we could
technically fix up correctly), which is based on the Linux NMI solution.
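
The thread does not spell out what such a solution would look like. Purely as an illustration, the C sketch below shows one latch-and-repeat scheme for tolerating a nested NMI: a nested entry only records that it happened and returns, and the outer invocation reruns the handler body. It is a simplified model and is not taken from Linux or Xen; `nmi_ctx`, `nmi_entry`, `nmi_body` and the `nest_once` test hook are hypothetical, and the state updates here are plain stores where real code would need them to be atomic with respect to a further NMI.

```c
#include <assert.h>

enum nmi_state { NMI_NOT_RUNNING, NMI_EXECUTING, NMI_LATCHED };

struct nmi_ctx {
    enum nmi_state state;
    int handled;     /* number of times the real handler body ran */
    int nest_once;   /* test hook: simulate one nested NMI in the body */
};

static void nmi_entry(struct nmi_ctx *c);

/* Stand-in for the real NMI work. */
static void nmi_body(struct nmi_ctx *c)
{
    c->handled++;
    if (c->nest_once) {
        c->nest_once = 0;
        nmi_entry(c);    /* a nested NMI arrives while the body runs */
    }
}

static void nmi_entry(struct nmi_ctx *c)
{
    if (c->state != NMI_NOT_RUNNING) {
        /* Nested entry: just latch it and get out of the way. */
        c->state = NMI_LATCHED;
        return;
    }
    do {
        c->state = NMI_EXECUTING;
        nmi_body(c);
        /* If a nested NMI latched itself meanwhile, go round again. */
    } while (c->state == NMI_LATCHED);
    c->state = NMI_NOT_RUNNING;
}
```

With one nested NMI simulated, the body runs twice and the state ends up back at `NMI_NOT_RUNNING`. The sketch says nothing about protecting the exception frame itself, which is the harder half of the problem listed in 6).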

As questions to the community - have I missed, or misrepresented any
points above which might perhaps influence the design of the solution?
I think the list is complete, but would not be surprised if there is a
case still not considered yet.

~Andrew


[1] In an effort to prevent a flamewar with my comment, the situation we
find ourselves in now is almost certainly the result of unforeseen
interactions of individual features, but we are left to pick up the many
pieces in a way which can't be completely solved.

Happy to have my comments completely shot down into little bits, but I'm worried that we're looking to solve a problem that doesn't actually need solving - at least as long as the code in the respective handlers is "doing the right thing", and if that happens to be broken, then we should fix THAT, not build lots of extra code to recover from such a thing.


--
Mats
