
Re: [Xen-devel] VPMU interrupt unreliability

On 07/22/2017 04:16 PM, Kyle Huey wrote:
> Last year I reported[0] seeing occasional instability in performance
> counter values when running rr[1], which depends on completely
> deterministic counts of retired conditional branches of userspace
> programs.
>
> I recently identified the cause of this problem.  Xen's VPMU code
> contains a workaround, added in 2010[2], for an alleged Nehalem bug.
> Supposedly, if a hardware performance counter reaches 0 exactly
> during a PMI, another PMI is generated, potentially causing an
> endless loop.  The workaround is to set the counter to 1.  In 2013
> the original bug was believed to affect more than just Nehalem, and
> the workaround was enabled for all family 6 CPUs[3].  This workaround
> unfortunately disturbs the counter value in non-deterministic ways
> (since the value the counter holds when the irq handler runs depends
> on interrupt latency), which is fatal to rr.
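
For reference, the quirk boils down to roughly the following (a
sketch based on the description above and the patches in [2]/[3],
not the verbatim Xen code; rdmsrl/wrmsrl are Xen's MSR helpers):

    uint64_t status, cnt;

    /* In the PMI handler: did PMC0 overflow? */
    rdmsrl(MSR_CORE_PERF_GLOBAL_STATUS, status);
    if ( status & 0x1 )
    {
        rdmsrl(MSR_P6_PERFCTR0, cnt);
        /* If the counter sits at exactly 0, nudge it to 1 to dodge
         * the alleged endless-PMI loop.  This write is what makes
         * the guest-visible count non-deterministic. */
        if ( cnt == 0 )
            wrmsrl(MSR_P6_PERFCTR0, 1);
    }
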
> I've verified that the discrepancies we see in the counted values are
> entirely accounted for by the number of times the workaround is used
> in any given run.  Furthermore, patching Xen not to use this
> workaround makes the discrepancies in the counts vanish.  I've added
> code[4] to rr that reliably detects this problem from guest userspace.
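
The idea behind the detection code in [4] can be sketched in plain C
(the raw event encoding 0x5101c4 for retired conditional branches and
the exact branch count of the compiled loop are microarchitecture- and
compiler-dependent assumptions; rr's real code pins both down per CPU):

    #include <linux/perf_event.h>
    #include <stdint.h>
    #include <stdio.h>
    #include <string.h>
    #include <sys/ioctl.h>
    #include <sys/syscall.h>
    #include <unistd.h>

    int main(void)
    {
        struct perf_event_attr attr;
        memset(&attr, 0, sizeof attr);
        attr.size = sizeof attr;
        attr.type = PERF_TYPE_RAW;
        attr.config = 0x5101c4;   /* retired conditional branches on
                                     many Intel parts (assumption) */
        attr.exclude_kernel = 1;
        attr.sample_period = 250; /* force a PMI to fire mid-loop */
        attr.disabled = 1;

        int fd = syscall(SYS_perf_event_open, &attr, 0, -1, -1, 0);
        if (fd < 0) { perror("perf_event_open"); return 1; }

        volatile unsigned sum = 0, i;
        ioctl(fd, PERF_EVENT_IOC_ENABLE, 0);
        for (i = 0; i < 500; i++)  /* rr arranges exactly 500
                                      conditional branches; a plain
                                      loop only approximates that */
            sum += i;
        ioctl(fd, PERF_EVENT_IOC_DISABLE, 0);

        uint64_t count;
        read(fd, &count, sizeof count);
        /* On an affected Xen VPMU the result reads one high (501 vs.
           500) for each PMI taken while the counter was exactly 0. */
        printf("counted %llu conditional branches\n",
               (unsigned long long)count);
        return 0;
    }
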
> Even with the workaround removed in Xen I see some additional issues
> with the PMI (though not disturbed counter values), such as
> interrupts occasionally not being delivered to the guest.  I haven't
> done much work to track these down, but my working theory is that
> interrupts that "skid" out of the guest that requested them, and into
> Xen itself or perhaps even another guest, are not being delivered.
>
> Our current plan is to stop depending on the PMI during rr's recording
> phase (which we use for timeslicing tracees primarily because it's
> convenient) to enable producing correct recordings in Xen guests.
> Accurate replay will not be possible under virtualization because of
> the PMI issues; that will require transferring the recording to
> another machine.  But that will be sufficient to enable the use cases
> we care about (e.g. record an automated process on a cloud computing
> provider and have an engineer download and replay a failing recording
> later to debug it).
>
> I can think of several possible ways to fix the overcount problem, including:
> 1. Restricting the workaround to apply only to older CPUs and not all
> family 6 Intel CPUs forever.

IIRC the question of which processors this workaround is applicable to
was raised, and the Intel folks (copied here) couldn't find an answer.

One thing I noticed is that the workaround doesn't appear to be
complete: it only checks the overflow status of PMC0 and not that of
the other counters (fixed-function or general-purpose). Of course,
without knowing what the actual problem was, it's hard to say whether
this was intentional.
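
Concretely, IA32_PERF_GLOBAL_STATUS keeps the general-purpose overflow
bits at bits 0..n-1 and the fixed-counter overflow bits from bit 32
upward, so a complete check would look something like this (a sketch;
check_and_fix() and the counter-count variables are invented for
illustration):

    int i;

    for ( i = 0; i < nr_gp_counters; i++ )      /* PMC0 .. PMCn-1 */
        if ( status & (1ULL << i) )
            check_and_fix(MSR_P6_PERFCTR0 + i);

    for ( i = 0; i < nr_fixed_counters; i++ )   /* FIXED_CTR0 .. */
        if ( status & (1ULL << (32 + i)) )
            check_and_fix(MSR_CORE_PERF_FIXED_CTR0 + i);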

> 2. Intercepting MSR loads for counters that have the workaround
> applied and giving the guest the correct counter value.

We'd have to keep track of whether the counter has been reset (by the
quirk) since the last MSR write.
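
Since each quirk application inflates the running count by exactly one,
the bookkeeping could be as simple as this (a sketch; 'shadow' and
'quirk_bias' are invented state, not existing Xen fields):

    /* In the quirk, whenever it rewrites a zero counter to 1: */
    shadow->quirk_bias++;

    /* In the rdmsr intercept, report the unperturbed value: */
    rdmsrl(MSR_P6_PERFCTR0 + i, cnt);
    *msr_content = cnt - shadow->quirk_bias;

    /* In the wrmsr intercept (the guest reprograms the counter): */
    shadow->quirk_bias = 0;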

> 3. Or perhaps even changing the workaround to disable the PMI on that
> counter until the guest acks via GLOBAL_OVF_CTRL, assuming that works
> on the relevant hardware.

MSR_CORE_PERF_GLOBAL_OVF_CTRL is written immediately after the quirk
runs (in core2_vpmu_do_interrupt()) so we already do this, don't we?
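
If that ack path turns out not to be enough, my reading of the proposal
is to leave the counter value alone and instead mask the counter's
interrupt until the guest acks, along these lines (a sketch; whether
this avoids the original hardware bug would need testing on the
affected parts):

    uint64_t evtsel;

    /* In the PMI handler, instead of rewriting the counter: clear the
     * INT enable (bit 20) of the event select, so the counter keeps
     * counting but raises no further PMIs. */
    rdmsrl(MSR_P6_EVNTSEL0, evtsel);
    wrmsrl(MSR_P6_EVNTSEL0, evtsel & ~(1ULL << 20));

    /* In the intercept for the guest's MSR_CORE_PERF_GLOBAL_OVF_CTRL
     * write (the overflow ack): set INT again to re-arm the PMI. */
    rdmsrl(MSR_P6_EVNTSEL0, evtsel);
    wrmsrl(MSR_P6_EVNTSEL0, evtsel | (1ULL << 20));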

Thanks for looking into this. It would also be interesting to
see/confirm how some interrupts are (possibly) being lost.


> Since I don't have the relevant hardware to test changes to this
> workaround on, and rr can avoid these bugs through other means, I
> don't expect to work on this myself, but I wanted to apprise you of
> what we've learned.
> - Kyle
> [0] https://lists.xen.org/archives/html/xen-devel/2016-10/msg01288.html
> [1] http://rr-project.org/
> [2] 
> https://xenbits.xen.org/gitweb/?p=xen.git;a=blobdiff;f=xen/arch/x86/hvm/vmx/vpmu_core2.c;h=44aa8e3c47fc02e401f5c382d89b97eef0cd2019;hp=ce4fd2d43e04db5e9b042344dd294cfa11e1f405;hb=3ed6a063d2a5f6197306b030e8c27c36d5f31aa1;hpb=566f83823996cf9c95f9a0562488f6b1215a1052
> [3] 
> https://xenbits.xen.org/gitweb/?p=xen.git;a=blobdiff;f=xen/arch/x86/hvm/vmx/vpmu_core2.c;h=15b2036c8db1e56d8865ee34c363e7f23aa75e33;hp=9f152b48c26dfeedb6f94189a5fe4a5f7a772d83;hb=75a92f551ade530ebab73a0c3d4934dfb28149b5;hpb=71fc4da1306cec55a42787310b01a1cb52489abc
> [4] See 
> https://github.com/mozilla/rr/blob/a5d23728cd7d01c6be0c79852af26c68160d4405/src/PerfCounters.cc#L313,
> which sets up a counter and then does some pointless math in a loop to
> reach exactly 500 conditional branches.  Xen will report 501 branches
> because of this bug.
