
Re: [Xen-devel] Woes of NMIs and MCEs, and possibly how to fix



On 30/11/2012 17:56, Tim Deegan wrote:
> At 17:34 +0000 on 30 Nov (1354296851), Andrew Cooper wrote:
>> Hello,
>>
>> Yesterday, Tim and myself spent a very long time in front of a
>> whiteboard trying to develop a fix which covered all the problems, and
>> sadly it is very hard.  We did manage to come up with a lengthy
>> solution which we think has no race conditions, but it relies on very
>> large sections of reentrant code which can't use the stack or clobber
>> registers.  As such, it is not practical at all (assuming that any of
>> us could actually code it).
> For the record, we also came up with a much simpler solution, which I
> prefer:
>  - The MCE handler should never return to Xen with IRET.
>  - The NMI handler should always return with IRET.
>  - There should be no faulting code in the NMI or MCE handlers.
>
> That covers all the interesting cases except (3), (4) and (7) below, and
> a simple per-cpu {nmi,mce}-in-progress flag will be enough to detect
> (and crash) on _almost_ all cases where that bites us (the other cases
> will crash less politely from their stacks being smashed).
>
> Even if we go on to build some more bulletproof solution, I think we
> should consider implementing that now, as the baseline.
>
> Tim.

D'oh - I knew I forgot something.  Yes - the above solution does fix a
large number of the issues.

Having no faults on the NMI/MCE paths is a good thing all round, and we
should strive for it.  But that does not mean that faults won't creep
back in in the future.

What I shall do (assuming no holes are shot in that idea by this thread)
is develop this solution first, along with the VMX NMI issue fix.  It is
substantially easier than the alternative.
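
To make the baseline concrete, here is roughly what the per-cpu
in-progress flag might look like.  This is a sketch only - the names and
the polite-crash plumbing are illustrative, not the eventual patch:

    /* Sketch of the baseline: a per-cpu in-progress flag (illustrative). */
    static DEFINE_PER_CPU(bool, nmi_in_progress);

    void do_nmi(struct cpu_user_regs *regs)
    {
        bool *busy = &this_cpu(nmi_in_progress);

        if ( *busy )
            /* Re-entered: the outer handler's frame/state is suspect. */
            fatal_trap(TRAP_nmi, regs);
        *busy = true;

        /* ... existing NMI handling, which must not fault ... */

        *busy = false;
        /* Return must be via iret so hardware unlatches NMIs (see 9). */
    }

The MCE side would gain an equivalent flag, with the additional rule
that the MCE handler never returns to Xen context with iret.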

I will then work on developing the further solution (assuming no holes
are shot in that idea).  As I alluded to in the parent post, I believe a
modification to the Linux solution should allow us to detect most of the
reentrant cases and deal with them correctly, and should allow us to
detect any smashed stacks and deal with them more politely than we
otherwise would.
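
For the detection part, the idea (the helper name here is hypothetical)
is that because NMIs and MCEs are delivered on IST stacks, a nested
delivery always pushes its frame at the same fixed stack top.  We can
therefore spot nesting by checking whether the interrupted context was
itself on the relevant IST stack:

    /* Illustrative sketch of the detection, not a patch. */
    void do_nmi(struct cpu_user_regs *regs)
    {
        /* Hypothetical helper: this cpu's NMI IST stack base. */
        unsigned long stack = this_cpu_nmi_stack_base();

        if ( regs->rsp >= stack && regs->rsp < stack + PAGE_SIZE )
        {
            /*
             * The interrupted context was already on the NMI stack, so
             * our frame has overwritten the outer handler's frame.  A
             * Linux-style scheme resumes the outer handler from a saved
             * copy of its frame; failing that, crash politely.
             */
            fatal_trap(TRAP_nmi, regs);
        }

        /* ... normal handling ... */
    }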

~Andrew

>
>> As a result, I thought instead that I would outline all the issues we
>> currently face.  We can then:
>>  * Decide which issues need fixing
>>  * Decide which issues need to at least be detected and crash gracefully
>>  * Decide which issues we are happy (or perhaps at least willing, if not
>> happy) to ignore
>>
>> So, the issues are as follows.  (I have tried to list them in a logical
>> order, with one individual problem per number, but please do point out
>> if I have missed or misattributed any entries.)
>>
>> 1) Faults on the NMI path will re-enable NMIs before the handler
>> returns, leading to reentrant behaviour.  We should audit the NMI path
>> to try and remove any needless cases which might fault, but getting a
>> fault-free path will be hard (and is not going to solve the reentrant
>> behaviour by itself).
>>
>> 2) Faults on the MCE path will re-enable NMIs, as will the iret of the
>> MCE itself if an MCE interrupts an NMI.
>>
>> 3) SMM mode executing an iret will re-enable NMIs.  There is nothing we
>> can do to prevent this, and as an SMI can interrupt NMIs and MCEs, no
>> way to predict if/when it may happen.  The best we can do is accept that
>> it might happen, and try to deal with the after effects.
>>
>> 4) "Fake NMIs" can be caused by hardware with access to the INTR pin
>> (very unlikely in modern systems with the LAPIC supporting virtual wire
>> mode), or by software executing an `int $0x2`.  This can cause the NMI
>> handler to run on the NMI stack, but without the normal hardware NMI
>> cessation logic being triggered.
>>
>> 5) "Fake MCEs" can be caused by software executing `int $0x18`, and by
>> any MSI/IOMMU/IOAPIC programmed to deliver vector 0x18.  Normally, this
>> could only be caused by a bug in Xen, although it is also possible on a
>> system with out interrupt remapping. (Where the host administrator has
>> accepted the documented security issue, and decided still to pass-though
>> a device to a trusted VM, and the VM in question has a buggy driver for
>> the passed-through hardware)
>>
>> 6) Because of interrupt stack tables, real NMIs/MCEs can race with
>> their fake alternatives, where the real interrupt interrupts the fake
>> one and corrupts the exception frame of the fake one, losing the
>> original context to return to.  (This is one of the two core problems
>> of reentrancy with NMIs and MCEs.)
>>
>> 7) Real MCEs can race with each other.  If two real MCEs occur too
>> close together, the processor shuts down (we can't avoid this).
>> However, there is a large race window between the MCE handler clearing
>> the MCIP bit of IA32_MCG_STATUS and the handler returning, during which
>> a new MCE can occur and the exception frame will be corrupted.
>>
>>
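
As an aside on (7), to make the window concrete: MCIP is bit 2 of
IA32_MCG_STATUS; while it is set a second MCE shuts the processor down,
and once it is cleared a second MCE is delivered normally.  Purely
illustrative:

    /* Illustration of the race in (7), not a fix. */
    static void mce_handler_tail(void)
    {
        uint64_t status;

        rdmsrl(MSR_IA32_MCG_STATUS, status);
        wrmsrl(MSR_IA32_MCG_STATUS, status & ~MCG_STATUS_MCIP);
        /*
         * From here until the final iret, a new MCE re-enters the
         * handler on the same IST stack and corrupts this frame.
         */
    }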
>> In addition to the above issues, we have two NMI-related bugs in Xen
>> which need fixing (and which shall be fixed as part of the same series
>> as the above).
>>
>> 8) VMEXIT reason NMI on Intel calls self_nmi() while NMIs are latched,
>> causing the PCPU to fall into a loop of VMEXITs until the VCPU
>> timeslice has expired, at which point the return-to-guest path decides
>> to schedule instead of resuming the guest.
>>
>> 9) The NMI handler when returning to ring3 will leave NMIs latched, as
>> it uses the sysret path.
>>
>>
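
For (8), the shape of the fix is to invoke the handler directly on the
VMEXIT path rather than re-raising with self_nmi(): the self-IPI cannot
be delivered while NMIs are latched, which is exactly what produces the
VMEXIT loop.  Sketch only - the function name is made up, and the real
change would live in vmx_vmexit_handler():

    /* Sketch for (8), illustrative. */
    static void vmx_do_nmi_vmexit(struct cpu_user_regs *regs)
    {
        /* was: self_nmi() - latched, so it loops on VMEXITs instead. */
        do_nmi(regs);
    }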
>> As for one possible solution which we can't use:
>>
>> If it were not for the sysret stupidness[1] of requiring the hypervisor
>> to move to the guest stack before executing the `sysret` instruction,
>> we could do away with the stack tables for NMIs and MCEs altogether,
>> and the above craziness would be easy to fix.  However, the overhead of
>> always using iret to return to ring3 is not likely to be acceptable,
>> meaning that we cannot "fix" the problem by discarding interrupt stacks
>> and doing everything properly on the main hypervisor stack.
>>
>>
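
To spell out the sysret constraint: sysret neither switches stacks nor
unlatches NMIs, so the return-to-guest tail must move onto the guest
stack itself, and an NMI/MCE arriving in that window must have an IST
stack or it would run on a guest-controlled %rsp.  A simplified
rendering of the tail (the real code is assembly in entry.S):

    /* Illustrative shape of the sysret return-to-guest tail. */
    static void __attribute__((noreturn))
    return_to_guest(const struct cpu_user_regs *regs)
    {
        asm volatile (
            "movq %0, %%rsp\n\t"  /* switch to the guest stack ... */
            "sysretq"             /* ... then sysret; an NMI in this
                                   * window needs an IST stack. */
            : : "r" (regs->rsp) : "memory" );
        unreachable();
    }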
>> Looking at the above problems, I believe there is a solution, based on
>> the Linux NMI solution, if we are willing to ignore the problem of SMM
>> re-enabling NMIs, and if we are happy to crash gracefully when mixes of
>> NMIs and MCEs interrupt each other and trash their exception frames (in
>> situations where we could technically fix things up correctly).
>>
>> As questions to the community - have I missed or misrepresented any
>> points above which might influence the design of the solution?  I
>> think the list is complete, but would not be surprised if there is a
>> case not yet considered.
>>
>> ~Andrew
>>
>>
>> [1] In an effort to prevent a flamewar with my comment: the situation
>> we find ourselves in now is almost certainly the result of unforeseen
>> interactions of individual features, but we are left to pick up the
>> many pieces in a way which can't completely be solved.
>>
>> -- 
>> Andrew Cooper - Dom0 Kernel Engineer, Citrix XenServer
>> T: +44 (0)1223 225 900, http://www.citrix.com
>>

