
Re: [Xen-devel] VPMU interrupt unreliability

On Mon, Jul 24, 2017 at 7:08 AM, Boris Ostrovsky
<boris.ostrovsky@xxxxxxxxxx> wrote:
> On 07/22/2017 04:16 PM, Kyle Huey wrote:
>> Last year I reported[0] seeing occasional instability in performance
>> counter values when running rr[1], which depends on completely
>> deterministic counts of retired conditional branches of userspace
>> programs.
>> I recently identified the cause of this problem.  Xen's VPMU code
>> contains a workaround for an alleged Nehalem bug that was added in
>> 2010[2].  Supposedly if a hardware performance counter reaches 0
>> exactly during a PMI another PMI is generated potentially causing an
>> endless loop.  The workaround is to set the counter to 1.  In 2013 the
>> original bug was believed to affect more than just Nehalem and the
>> workaround was enabled for all family 6 CPUs.[3]  This workaround
>> unfortunately disturbs the counter value in non-deterministic ways
>> (since the value the counter has in the irq handler depends on
>> interrupt latency), which is fatal to rr.
>> I've verified that the discrepancies we see in the counted values are
>> entirely accounted for by the number of times the workaround is used
>> in any given run.  Furthermore, patching Xen not to use this
>> workaround makes the discrepancies in the counts vanish.  I've added
>> code[4] to rr that reliably detects this problem from guest userspace.
>> Even with the workaround removed in Xen I see some additional issues
>> (but not disturbed counter values) with the PMI, such as interrupts
>> occasionally not being delivered to the guest.  I haven't done much
>> work to track these down, but my working theory is that interrupts
>> that "skid" out of the guest that requested them and into Xen itself
>> or perhaps even another guest are not being delivered.
>> Our current plan is to stop depending on the PMI during rr's recording
>> phase (which we use for timeslicing tracees primarily because it's
>> convenient) to enable producing correct recordings in Xen guests.
>> Accurate replay will not be possible under virtualization because of
>> the PMI issues; that will require transferring the recording to
>> another machine.  But that will be sufficient to enable the use cases
>> we care about (e.g. record an automated process on a cloud computing
>> provider and have an engineer download and replay a failing recording
>> later to debug it).
>> I can think of several possible ways to fix the overcount problem, including:
>> 1. Restricting the workaround to apply only to older CPUs and not all
>> family 6 Intel CPUs forever.
> IIRC the question of which processors this workaround is applicable to
> was raised and Intel folks (copied here) couldn't find an answer.
> One thing I noticed is that the workaround doesn't appear to be
> complete: it is only checking PMC0 status and not other counters (fixed
> or architectural). Of course, without knowing what the actual problem
> was it's hard to say whether this was intentional.

handle_pmc_quirk appears to loop through all the counters ...
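For illustration, the shape of the workaround is roughly the following. This is a simplified model, not Xen's actual handle_pmc_quirk() code: each set bit in the (simulated) GLOBAL_STATUS value marks an overflowed counter, and the quirk rewrites every such counter to 1 so it cannot sit at exactly 0 during the PMI.

```c
#include <stdint.h>

#define NUM_COUNTERS 4

/* Simplified model of the Nehalem quirk (not Xen's actual code).
 * global_status: simulated IA32_PERF_GLOBAL_STATUS, one bit per
 * overflowed counter. counters: simulated counter registers. */
static void pmc_quirk(uint64_t global_status, int64_t counters[NUM_COUNTERS])
{
    for (int i = 0; i < NUM_COUNTERS; i++) {
        if (global_status & (1ULL << i)) {
            /* This write is what perturbs rr: the counter restarts
             * from 1 instead of its true post-overflow value, and the
             * value observed in the irq handler depends on interrupt
             * latency, so the perturbation is non-deterministic. */
            counters[i] = 1;
        }
    }
}
```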

>> 2. Intercepting MSR loads for counters that have the workaround
>> applied and giving the guest the correct counter value.
> We'd have to keep track of whether the counter has been reset (by the
> quirk) since the last MSR write.


>> 3. Or perhaps even changing the workaround to disable the PMI on that
>> counter until the guest acks via GLOBAL_OVF_CTRL, assuming that works
>> on the relevant hardware.
> MSR_CORE_PERF_GLOBAL_OVF_CTRL is written immediately after the quirk
> runs (in core2_vpmu_do_interrupt()) so we already do this, don't we?

I'm suggesting waiting until the *guest* writes to the (virtualized)
GLOBAL_OVF_CTRL before re-enabling the PMI on that counter, rather
than re-enabling it unconditionally in the hypervisor's interrupt
handler.
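A sketch of that idea, assuming it works on the relevant hardware (untested, names illustrative): on overflow, mask the counter's PMI by clearing the INT bit in its event-select register, and only re-arm it once the guest acknowledges the overflow through the virtual GLOBAL_OVF_CTRL.

```c
#include <stdint.h>

#define EVNTSEL_INT (1ULL << 20)  /* IA32_PERFEVTSELx APIC-interrupt enable */

/* Hypothetical per-counter state for option 3. */
struct vpmu_counter {
    uint64_t evntsel;   /* simulated IA32_PERFEVTSELx */
    int pmi_pending;    /* overflow seen, guest has not yet acked */
};

/* On overflow: suppress further PMIs from this counter instead of
 * rewriting its value, avoiding the non-deterministic perturbation. */
static void on_overflow(struct vpmu_counter *c)
{
    c->evntsel &= ~EVNTSEL_INT;
    c->pmi_pending = 1;
}

/* Guest write to the virtual GLOBAL_OVF_CTRL: if the guest acked this
 * counter's overflow bit (idx), re-arm its PMI. */
static void guest_ovf_ctrl_write(struct vpmu_counter *c, uint64_t mask, int idx)
{
    if (c->pmi_pending && (mask & (1ULL << idx))) {
        c->evntsel |= EVNTSEL_INT;
        c->pmi_pending = 0;
    }
}
```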

> Thanks for looking into this. Would also be interesting to see/confirm
> how some interrupts are (possibly) lost.

Indeed.  Unfortunately it's not a high priority for me at the moment.

- Kyle
