Xen project Mailing List

Re: [Xen-devel] VPMU interrupt unreliability

To: Andrew Cooper <andrew.cooper3@xxxxxxxxxx>

Date: Mon, 24 Jul 2017 07:32:47 -0700

Cc: Kevin Tian <kevin.tian@xxxxxxxxx>, Boris Ostrovsky <boris.ostrovsky@xxxxxxxxxx>, Jan Beulich <JBeulich@xxxxxxxx>, Dietmar Hahn <dietmar.hahn@xxxxxxxxxxxxxx>, xen-devel@xxxxxxxxxxxxx

Delivery-date: Mon, 24 Jul 2017 14:33:00 +0000

List-id: Xen developer discussion <xen-devel.lists.xen.org>

On Mon, Jul 24, 2017 at 7:13 AM, Andrew Cooper <andrew.cooper3@xxxxxxxxxx> wrote:
> On 22/07/17 21:16, Kyle Huey wrote:
>> Last year I reported[0] seeing occasional instability in performance
>> counter values when running rr[1], which depends on completely
>> deterministic counts of retired conditional branches of userspace
>> programs.
>>
>> I recently identified the cause of this problem. Xen's VPMU code
>> contains a workaround for an alleged Nehalem bug that was added in
>> 2010[2]. Supposedly if a hardware performance counter reaches 0
>> exactly during a PMI another PMI is generated potentially causing an
>> endless loop. The workaround is to set the counter to 1. In 2013 the
>> original bug was believed to affect more than just Nehalem and the
>> workaround was enabled for all family 6 CPUs.[3] This workaround
>> unfortunately disturbs the counter value in non-deterministic ways
>> (since the value the counter has in the irq handler depends on
>> interrupt latency), which is fatal to rr.
>>
>> I've verified that the discrepancies we see in the counted values are
>> entirely accounted for by the number of times the workaround is used
>> in any given run. Furthermore, patching Xen not to use this
>> workaround makes the discrepancies in the counts vanish. I've added
>> code[4] to rr that reliably detects this problem from guest userspace.
>>
>> Even with the workaround removed in Xen I see some additional issues
>> (but not disturbed counter values) with the PMI, such as interrupts
>> occasionally not being delivered to the guest. I haven't done much
>> work to track these down, but my working theory is that interrupts
>> that "skid" out of the guest that requested them and into Xen itself
>> or perhaps even another guest are not being delivered.
>>
>> Our current plan is to stop depending on the PMI during rr's recording
>> phase (which we use for timeslicing tracees primarily because it's
>> convenient) to enable producing correct recordings in Xen guests.
>> Accurate replay will not be possible under virtualization because of
>> the PMI issues; that will require transferring the recording to
>> another machine. But that will be sufficient to enable the use cases
>> we care about (e.g. record an automated process on a cloud computing
>> provider and have an engineer download and replay a failing recording
>> later to debug it).
>>
>> I can think of several possible ways to fix the overcount problem, including:
>> 1. Restricting the workaround to apply only to older CPUs and not all
>> family 6 Intel CPUs forever.
>> 2. Intercepting MSR loads for counters that have the workaround
>> applied and giving the guest the correct counter value.
>> 3. Or perhaps even changing the workaround to disable the PMI on that
>> counter until the guest acks via GLOBAL_OVF_CTRL, assuming that works
>> on the relevant hardware.
>>
>> Since I don't have the relevant hardware to test changes to this
>> workaround on and rr can avoid these bugs through other means I don't
>> expect to work on this myself, but I wanted to apprise you of what
>> we've learned.
>
> Thankyou for this investigation and analysis.
>
> I think the first action is to try and identify what this mysterious
> erratum is. Despite the plethora of perf errata, the best I can find is
> AAK135 "Multiple Performance Monitor Interrupts are Possible on Overflow
> of IA32_FIXED_CTR2" which still doesn't obviously match the described
> symptoms.

I think it may be BJ58 "Performance-Counter Overflow Indication May
Cause Undesired Behavior".

> CC'ing Dietmar who was the author of the original workaround. Do you
> recall any other information which might be helpful in tracking this
> down? I also don't see any similar workaround in the Linux event
> infrastructure, which makes me wonder whether the observed behaviour was
> a side effect of something else Xen specific.

Haitao Shan wrote "The issue causing interrupt loop is: It seems that
on NHM (at that time) when a PMI arrives at CPU, the counter has a
value to zero (instead of some other small value, say 3 or 5, seen on
Core 2 Duo). In this case, unmasking the PMI via APIC will trigger
immediately another PMI. This does not produce problem with native
kernel, since it typically programs the counter with another value (as
needed by making yet another sampling point) before unmasking. For
Xen, PMI handler cannot handle the counter immediately since it should
be handled by guests. It just records a virtual PMI to guests and
unmasks the PMI before return."
https://lists.xen.org/archives/html/xen-devel/2013-03/msg02615.html

> Having Xen perturb the counters behind a guests back (in a way contrary
> to architectural or errata behaviour) is obviously a bad thing, and we
> should fix that. I do have access to hardware, but am lacking vPMU
> expertise.

- Kyle

_______________________________________________
Xen-devel mailing list
Xen-devel@xxxxxxxxxxxxx
https://lists.xen.org/xen-devel
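
For illustration only, here is a minimal Python sketch of the overcount
described in the quoted report. It is a toy model and not Xen's code: the
names `handle_overflow`, `simulate` and `skids` are made up for this
example, and a single counter is assumed. Each entry in `skids` stands for
the number of events counted between the counter wrapping to 0 and the PMI
handler running (the interrupt latency mentioned above). The workaround is
modelled simply as "if the handler sees 0, write 1".

```python
def handle_overflow(counter_in_handler, workaround=True):
    """Model one PMI.

    Returns (value left in the counter for the guest, workaround applied?).
    Hypothetical helper for illustration; not taken from Xen.
    """
    if workaround and counter_in_handler == 0:
        # The workaround as described in the thread: replace 0 with 1.
        return 1, True
    return counter_in_handler, False


def simulate(skids, workaround=True):
    """Return (phantom_events, times_applied) over a run of overflows."""
    phantom = 0
    applied = 0
    for skid in skids:
        value, hit = handle_overflow(skid, workaround)
        # Any difference between what the guest will read and what was
        # really counted is an event that never happened.
        phantom += value - skid
        applied += int(hit)
    return phantom, applied


if __name__ == "__main__":
    skids = [0, 3, 0, 5, 1, 0]  # assumed, arbitrary latencies
    print(simulate(skids, workaround=True))
    print(simulate(skids, workaround=False))
```

In this model the number of phantom events always equals the number of
times the workaround fired, and it is zero with the workaround disabled.
Because the skid values depend on timing, the overcount differs from run
to run, which is the non-determinism the report says is fatal to rr.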
