Xen project Mailing List

Re: [Xen-devel] Woes of NMIs and MCEs, and possibly how to fix

To: Tim Deegan <tim@xxxxxxx>

From: Andrew Cooper <andrew.cooper3@xxxxxxxxxx>

Date: Fri, 30 Nov 2012 22:55:32 +0000

Cc: Mats Petersson <mats.petersson@xxxxxxxxxx>, "xen-devel@xxxxxxxxxxxxx" <xen-devel@xxxxxxxxxxxxx>

Delivery-date: Fri, 30 Nov 2012 22:55:58 +0000

List-id: Xen developer discussion <xen-devel.lists.xen.org>

On 30/11/2012 20:24, Tim Deegan wrote: > At 19:12 +0000 on 30 Nov (1354302731), Mats Petersson wrote: >>> 1) Faults on the NMI path will re-enable NMIs before the handler >>> returns, leading to reentrant behaviour. We should audit the NMI path >>> to try and remove any needless cases which might fault, but getting a >>> fault-free path will be hard (and is not going so solve the reentrant >>> behaviour itself). >> What sort of faults are we expecting on the NMI path? > None, but we have to keep it that way, and if we accidentlly introduce > one there's no immediate indication that we've messed up. Unless we add > 'am I in the NMI handler?' to all the fixup code. Any spinlocking whatsoever is out, until we completely fix the re-entrant entry to do_{nmi,mce}(), at which point spinlocks which are ok so long as they are exclusively used inside their respective handlers. This includes printk (unless preceded by a console_force_unlock(), which is only safe to use on paths where we have decided to crash). There was an audit a little while ago to move printk()s into soft tasks. WARN()s, (and with the same logic BUG()s and ASSERT()s, although these are fatal) are out, both because they directly printk, and make use of ud2 to force a trap. Use of {rd,wr}msr() are problematic, as they might fault if we have buggy code referring to MSRs which are not implemented, and the _safe() variants will still fault. > >>> 2) Faults on the MCE path will re-enable NMIs, as will the iret of the >>> MCE itself if an MCE interrupts an NMI. >> The same questions apply as to #1 (just replace NMI with MCE) > Andrew pointed out that some MCE code uses rdmsr_safe(). > > FWIW, I think that constraining MCE and NMI code not to do anything that > can fault is perfectly reasonable. The MCE code has grown a lot > recently and probably needs an audit to check for spinlocks, faults &c. Yes. However, being able to deal gracefully with the case were we miss things on code review which touches the NMI/MCE paths is certainly better than crashing in subtle ways. > >>> 3) SMM mode executing an iret will re-enable NMIs. There is nothing we >>> can do to prevent this, and as an SMI can interrupt NMIs and MCEs, no >>> way to predict if/when it may happen. The best we can do is accept that >>> it might happen, and try to deal with the after effects. >> SMM is a messy thing that can interfere with most things in a system. We >> will have to rely on the BIOS developers to not mess up here. We can't >> do anything else in our code (on AMD hardware, in a HVM guest you could >> trap SMI as a VMEXIT, and then "deal with it in a container", but that >> doesn't fix SMI that happen whilst in the hypervisor, or in a PV kernel, >> so doesn't really help much). > You can't even do that any more -- some BIOSes turn on the SMM lock, > which disables the VMEXIT hook. > > But yes, if BIOS vendors are executing IRET in SMM mode (and if 'fix > your BIOS' isn't an appropriate response), then the simple 'just don't > IRET anywhere that's rechable fom #NMI' solution can't prevent that, and > we need to handle nested NMIs. > > But I'd be happy to cross that bridge when we come to it. I am fairly sure we the ability to discover this and crash gracefully when we properly fix for the reentrant case. > >> Surely both 4 & 5 are "bad guest behaviour", and whilst it's a "nice to >> have" to catch that, it's no different from running on bare metal doing >> daft things with vectors or writing code that doesn't behave at all >> "friendly". (4 is only available to Xen developers, which we hope are >> most of the time sane enough not to try these crazy things in a "live" >> system that matters). 5 is only available if you have pass through >> enabled. I don't think either is a particularly likely cause of real, in >> the field, problems. > AFAICS both are only available to Xen, unless there's a bug in Xen's MSI > handling, so they'd only be caused by Xen bugs or bad hardware. And > both can be detected (in most cases) with the same simple flag that > would detect real nested NMI/MCEs. Or more completely by a linux-style > solution. Yes - the fake cases are exclusively buggy Xen or buggy hardware (with the exception of the pci-passthough with no interrupt-remapping case, in which one could argue that the host administrator gets to keep both halves of his broken system after they decided to insecurely allow passthrough in the first place) > >>> 7) Real MCEs can race with each other. If two real MCEs occur too close >>> together, the processor shuts down (We cant avoid this). However, there >>> is large race condition between the MCE handler clearing the MCIP bit of >>> IA32_MCG_STATUS and the handler returning during which a new MCE can >>> occur and the exception frame will be corrupted. >> From what I understand, the purpose of this bit is really to ensure >> that any data needed from the MCE status registers has been fetched >> before the processor issues another MCE - otherwise you have a big race >> of "what data are we reading, and which of the multiple, in short >> succession, MCEs does this belong to. If you get two MCEs in such a >> short time that the MCE handler doesn't have time to gather the data >> from the status registers, it's likely that the machine isn't going to >> do very well for much longer anyways. Now, if we have a common stack, we >> should not reset the MCIP bit until it is time to return from the MCE >> handler - ideally on the last instruction before that > We can do the equivalent with a 'mce-being-handled' flag, which allows > us to (try to) print a useful message intead of just losing a CPU. At the moment, the clearing of the MCIP bit is quite early in a few of the cpu family specific MCE handlers. As it is an architectural MSR, I was considering moving it outside the family specific handlers, and make one of the last things on the MCE path, to help reduce the race condition window until we properly fix reentrant MCEs. > >>> [1] In an effort to prevent a flamewar with my comment, the situation we >>> find outself in now is almost certainly the result of unforseen >>> interactions of individual features, but we are left to pick up the many >>> pieces in way which cant completely be solved. >>> >> Happy to have my comments completely shot down into little bits, but I'm >> worrying that we're looking to solve a problem that doesn't actually >> need solving - at least as long as the code in the respective handlers >> are "doing the right thing", and if that happens to be broken, then we >> should fix THAT, not build lots of extra code to recover from such a thing. > I agree. The only things we can't fix by DTRT in do_nmi() and do_mce() > are: > - IRET in SMM mode re-enabling NMIs; and > - detecting _every_ case where we get a nested NMI/MCE (all we can > do is detect _most_ cases, but the detection is just so we can print > a message before the crash). > > Cheers, > > Tim. We would need to modify the asm stubs to detect nesting. I think it is unreasonable to expect the C functions do_{nmi,mce}() to be reentrantly safe. With the proper solution, we can get the asm stubs to fixup nested NMIs/MCEs without reentrantly calling into C (or crashing if we detect the stack has been smashed) ~Andrew > > _______________________________________________ > Xen-devel mailing list > Xen-devel@xxxxxxxxxxxxx > http://lists.xen.org/xen-devel _______________________________________________ Xen-devel mailing list Xen-devel@xxxxxxxxxxxxx http://lists.xen.org/xen-devel

©2013 Xen Project, A Linux Foundation Collaborative Project. All Rights Reserved.
Linux Foundation is a registered trademark of The Linux Foundation.
Xen Project is a trademark of The Linux Foundation.