[Xen-devel] VPMU interrupt unreliability

Last year I reported[0] seeing occasional instability in performance
counter values when running rr[1], which depends on completely
deterministic counts of retired conditional branches in userspace.

I recently identified the cause of this problem.  Xen's VPMU code
contains a workaround for an alleged Nehalem bug that was added in
2010[2].  Supposedly, if a hardware performance counter reaches 0
exactly during a PMI, another PMI is generated, potentially causing an
endless loop.  The workaround is to set the counter to 1.  In 2013 the
original bug was believed to affect more than just Nehalem and the
workaround was enabled for all family 6 CPUs.[3]  This workaround
unfortunately disturbs the counter value in non-deterministic ways
(since the value the counter has in the irq handler depends on
interrupt latency), which is fatal to rr.
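
To make the mechanism concrete, here is a toy model of the effect
described above.  It is only a sketch and not Xen's code: the function
name, the uniform "skid" distribution, and the way the guest
accumulates counts are all assumptions made for illustration.  The
point it shows is that each time the counter happens to read exactly 0
in the handler and gets bumped to 1, the guest sees one extra event.

```python
import random


def simulate(total_events, period, max_skid, rng, workaround=True):
    """Toy model: count `total_events` with a PMI every `period` events.

    Returns (events reported to the guest, number of workaround hits).
    All names and the skid model are hypothetical, for illustration only.
    """
    reported = 0
    hits = 0
    remaining = total_events
    while remaining > 0:
        start = -period          # guest programs the counter to overflow after `period` events
        counter = start
        # The handler runs after the overflow plus a latency-dependent
        # number of extra events (the "skid"), modelled here as random.
        run = min(remaining, period + rng.randint(0, max_skid))
        counter += run
        remaining -= run
        if counter >= 0 and workaround and counter == 0:
            counter = 1          # the workaround: never leave the counter at exactly 0
            hits += 1
        reported += counter - start   # what the guest adds to its running total
    return reported, hits


if __name__ == "__main__":
    rng = random.Random(0)
    reported, hits = simulate(100_000, 1_000, 3, rng)
    print("true count 100000, reported", reported, "workaround hits", hits)
```

In this model the reported count always exceeds the true count by
exactly the number of workaround hits, and how many hits occur depends
on the random skid, so it varies from run to run.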

I've verified that the discrepancies we see in the counted values are
entirely accounted for by the number of times the workaround is used
in any given run.  Furthermore, patching Xen not to use this
workaround makes the discrepancies in the counts vanish.  I've added
code[4] to rr that reliably detects this problem from guest userspace.

Even with the workaround removed in Xen I see some additional issues
(but not disturbed counter values) with the PMI, such as interrupts
occasionally not being delivered to the guest.  I haven't done much
work to track these down, but my working theory is that interrupts
that "skid" out of the guest that requested them and into Xen itself
or perhaps even another guest are not being delivered.

Our current plan is to stop depending on the PMI during rr's recording
phase (we use it there for timeslicing tracees, primarily because it's
convenient) so that rr can produce correct recordings in Xen guests.
Accurate replay will not be possible under virtualization because of
the PMI issues; that will require transferring the recording to
another machine.  But that will be sufficient to enable the use cases
we care about (e.g. record an automated process on a cloud computing
provider and have an engineer download and replay a failing recording
later to debug it).

I can think of several possible ways to fix the overcount problem, including:
1. Restricting the workaround to apply only to older CPUs and not all
family 6 Intel CPUs forever.
2. Intercepting MSR loads for counters that have the workaround
applied and giving the guest the correct counter value.
3. Or perhaps even changing the workaround to disable the PMI on that
counter until the guest acks via GLOBAL_OVF_CTRL, assuming that works
on the relevant hardware.
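
As a rough illustration of option 2, the sketch below models a
hypervisor that remembers how many counts the workaround has added and
subtracts them whenever the guest reads the counter.  This is a toy
model under stated assumptions, not Xen code: the class and method
names are hypothetical, a 48-bit counter width is assumed, and option 2
is read here as trapping guest reads of the counter.

```python
class VirtualCounter:
    """Toy hypervisor-side view of one guest performance counter (hypothetical)."""

    def __init__(self, bits=48):
        self.mask = (1 << bits) - 1   # assumed counter width
        self.hw = 0                   # value in the simulated hardware counter
        self.adjust = 0               # counts added by the workaround, not by real events

    def guest_write(self, value):
        # The guest reprograms the counter; any earlier adjustment is moot.
        self.hw = value & self.mask
        self.adjust = 0

    def count_events(self, n):
        # Stand-in for the hardware counting n real events.
        self.hw = (self.hw + n) & self.mask

    def on_pmi(self):
        # The workaround as described: do not leave the counter at exactly 0.
        if self.hw == 0:
            self.hw = 1
            self.adjust += 1

    def guest_read(self):
        # Intercepted read: hide the workaround's adjustment from the guest.
        return (self.hw - self.adjust) & self.mask


if __name__ == "__main__":
    vc = VirtualCounter()
    vc.guest_write(-500)      # overflow after 500 events
    vc.count_events(500)
    vc.on_pmi()
    print("hardware value:", vc.hw, "guest sees:", vc.guest_read())
```

Note that in this model the hardware counter still runs one event
ahead after an adjustment, so the next overflow would arrive one event
early unless the guest reprograms the counter first.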

Since I don't have the relevant hardware on which to test changes to
this workaround, and rr can avoid these bugs through other means, I
don't expect to work on this myself, but I wanted to apprise you of
what we've learned.

- Kyle

[0] https://lists.xen.org/archives/html/xen-devel/2016-10/msg01288.html
[1] http://rr-project.org/
[4] See 
which sets up a counter and then does some pointless math in a loop to
reach exactly 500 conditional branches.  Xen will report 501 branches
because of this bug.
