
Re: [Xen-devel] VPMU interrupt unreliability

On 07/22/2017 04:16 PM, Kyle Huey wrote:
> Last year I reported[0] seeing occasional instability in performance
> counter values when running rr[1], which depends on completely
> deterministic counts of retired conditional branches of userspace
> programs.
>
> I recently identified the cause of this problem.  Xen's VPMU code
> contains a workaround, added in 2010[2], for an alleged Nehalem bug.
> Supposedly, if a hardware performance counter reaches 0 exactly
> during a PMI, another PMI is generated, potentially causing an
> endless loop.  The workaround is to set the counter to 1.  In 2013
> the original bug was believed to affect more than just Nehalem, and
> the workaround was enabled for all family 6 CPUs[3].  This workaround
> unfortunately disturbs the counter value in non-deterministic ways
> (since the value the counter holds when the irq handler runs depends
> on interrupt latency), which is fatal to rr.
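
For reference, the quirk boils down to roughly the following (a
sketch based on the description above and the patches in [2]/[3],
not the verbatim Xen code; rdmsrl/wrmsrl are Xen's MSR helpers):

    uint64_t status, cnt;

    /* In the PMI handler: did PMC0 overflow? */
    rdmsrl(MSR_CORE_PERF_GLOBAL_STATUS, status);
    if ( status & 0x1 )
    {
        rdmsrl(MSR_P6_PERFCTR0, cnt);
        /* If the counter sits at exactly 0, nudge it to 1 to dodge
         * the alleged endless-PMI loop.  This write is what makes
         * the guest-visible count non-deterministic. */
        if ( cnt == 0 )
            wrmsrl(MSR_P6_PERFCTR0, 1);
    }
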
> I've verified that the discrepancies we see in the counted values are
> entirely accounted for by the number of times the workaround is used
> in any given run.  Furthermore, patching Xen not to use this
> workaround makes the discrepancies in the counts vanish.  I've added
> code[4] to rr that reliably detects this problem from guest userspace.
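
The idea behind the detection code in [4] can be sketched in plain C
(the raw event encoding 0x5101c4 for retired conditional branches and
the exact branch count of the compiled loop are microarchitecture- and
compiler-dependent assumptions; rr's real code pins both down per CPU):

    #include <linux/perf_event.h>
    #include <stdint.h>
    #include <stdio.h>
    #include <string.h>
    #include <sys/ioctl.h>
    #include <sys/syscall.h>
    #include <unistd.h>

    int main(void)
    {
        struct perf_event_attr attr;
        memset(&attr, 0, sizeof attr);
        attr.size = sizeof attr;
        attr.type = PERF_TYPE_RAW;
        attr.config = 0x5101c4;   /* retired conditional branches on
                                     many Intel parts (assumption) */
        attr.exclude_kernel = 1;
        attr.sample_period = 250; /* force a PMI to fire mid-loop */
        attr.disabled = 1;

        int fd = syscall(SYS_perf_event_open, &attr, 0, -1, -1, 0);
        if (fd < 0) { perror("perf_event_open"); return 1; }

        volatile unsigned sum = 0, i;
        ioctl(fd, PERF_EVENT_IOC_ENABLE, 0);
        for (i = 0; i < 500; i++)  /* rr arranges exactly 500
                                      conditional branches; a plain
                                      loop only approximates that */
            sum += i;
        ioctl(fd, PERF_EVENT_IOC_DISABLE, 0);

        uint64_t count;
        read(fd, &count, sizeof count);
        /* On an affected Xen VPMU the result reads one high (501 vs.
           500) for each PMI taken while the counter was exactly 0. */
        printf("counted %llu conditional branches\n",
               (unsigned long long)count);
        return 0;
    }
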
> Even with the workaround removed in Xen I see some additional issues
> with the PMI (though not disturbed counter values), such as
> interrupts occasionally not being delivered to the guest.  I haven't
> done much work to track these down, but my working theory is that
> interrupts that "skid" out of the guest that requested them, and into
> Xen itself or perhaps even another guest, are not being delivered.
>
> Our current plan is to stop depending on the PMI during rr's recording
> phase (which we use for timeslicing tracees primarily because it's
> convenient) to enable producing correct recordings in Xen guests.
> Accurate replay will not be possible under virtualization because of
> the PMI issues; that will require transferring the recording to
> another machine.  But that will be sufficient to enable the use cases
> we care about (e.g. record an automated process on a cloud computing
> provider and have an engineer download and replay a failing recording
> later to debug it).
>
> I can think of several possible ways to fix the overcount problem, including:
> 1. Restricting the workaround to apply only to older CPUs and not all
> family 6 Intel CPUs forever.

IIRC the question of which processors this workaround is applicable to
was raised, and the Intel folks (copied here) couldn't find an answer.

One thing I noticed is that the workaround doesn't appear to be
complete: it only checks the overflow status of PMC0 and not that of
the other counters (fixed-function or general-purpose). Of course,
without knowing what the actual problem was, it's hard to say whether
this was intentional.
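
Concretely, IA32_PERF_GLOBAL_STATUS keeps the general-purpose overflow
bits at bits 0..n-1 and the fixed-counter overflow bits from bit 32
upward, so a complete check would look something like this (a sketch;
check_and_fix() and the counter-count variables are invented for
illustration):

    int i;

    for ( i = 0; i < nr_gp_counters; i++ )      /* PMC0 .. PMCn-1 */
        if ( status & (1ULL << i) )
            check_and_fix(MSR_P6_PERFCTR0 + i);

    for ( i = 0; i < nr_fixed_counters; i++ )   /* FIXED_CTR0 .. */
        if ( status & (1ULL << (32 + i)) )
            check_and_fix(MSR_CORE_PERF_FIXED_CTR0 + i);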

> 2. Intercepting MSR loads for counters that have the workaround
> applied and giving the guest the correct counter value.

We'd have to keep track of whether the counter has been reset (by the
quirk) since the last MSR write.
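
Since each quirk application inflates the running count by exactly one,
the bookkeeping could be as simple as this (a sketch; 'shadow' and
'quirk_bias' are invented state, not existing Xen fields):

    /* In the quirk, whenever it rewrites a zero counter to 1: */
    shadow->quirk_bias++;

    /* In the rdmsr intercept, report the unperturbed value: */
    rdmsrl(MSR_P6_PERFCTR0 + i, cnt);
    *msr_content = cnt - shadow->quirk_bias;

    /* In the wrmsr intercept (the guest reprograms the counter): */
    shadow->quirk_bias = 0;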

> 3. Or perhaps even changing the workaround to disable the PMI on that
> counter until the guest acks via GLOBAL_OVF_CTRL, assuming that works
> on the relevant hardware.

MSR_CORE_PERF_GLOBAL_OVF_CTRL is written immediately after the quirk
runs (in core2_vpmu_do_interrupt()) so we already do this, don't we?
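
If that ack path turns out not to be enough, my reading of the proposal
is to leave the counter value alone and instead mask the counter's
interrupt until the guest acks, along these lines (a sketch; whether
this avoids the original hardware bug would need testing on the
affected parts):

    uint64_t evtsel;

    /* In the PMI handler, instead of rewriting the counter: clear the
     * INT enable (bit 20) of the event select, so the counter keeps
     * counting but raises no further PMIs. */
    rdmsrl(MSR_P6_EVNTSEL0, evtsel);
    wrmsrl(MSR_P6_EVNTSEL0, evtsel & ~(1ULL << 20));

    /* In the intercept for the guest's MSR_CORE_PERF_GLOBAL_OVF_CTRL
     * write (the overflow ack): set INT again to re-arm the PMI. */
    rdmsrl(MSR_P6_EVNTSEL0, evtsel);
    wrmsrl(MSR_P6_EVNTSEL0, evtsel | (1ULL << 20));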

Thanks for looking into this. It would also be interesting to
see/confirm how some interrupts are (possibly) being lost.


> Since I don't have the relevant hardware to test changes to this
> workaround on, and rr can avoid these bugs through other means, I
> don't expect to work on this myself, but I wanted to apprise you of
> what we've learned.
> - Kyle
> [0] https://lists.xen.org/archives/html/xen-devel/2016-10/msg01288.html
> [1] http://rr-project.org/
> [2] 
> https://xenbits.xen.org/gitweb/?p=xen.git;a=blobdiff;f=xen/arch/x86/hvm/vmx/vpmu_core2.c;h=44aa8e3c47fc02e401f5c382d89b97eef0cd2019;hp=ce4fd2d43e04db5e9b042344dd294cfa11e1f405;hb=3ed6a063d2a5f6197306b030e8c27c36d5f31aa1;hpb=566f83823996cf9c95f9a0562488f6b1215a1052
> [3] 
> https://xenbits.xen.org/gitweb/?p=xen.git;a=blobdiff;f=xen/arch/x86/hvm/vmx/vpmu_core2.c;h=15b2036c8db1e56d8865ee34c363e7f23aa75e33;hp=9f152b48c26dfeedb6f94189a5fe4a5f7a772d83;hb=75a92f551ade530ebab73a0c3d4934dfb28149b5;hpb=71fc4da1306cec55a42787310b01a1cb52489abc
> [4] See 
> https://github.com/mozilla/rr/blob/a5d23728cd7d01c6be0c79852af26c68160d4405/src/PerfCounters.cc#L313,
> which sets up a counter and then does some pointless math in a loop to
> reach exactly 500 conditional branches.  Xen will report 501 branches
> because of this bug.
