
Re: [Xen-devel] VPMU interrupt unreliability



On 10/10/2017 12:54 PM, Kyle Huey wrote:
> On Mon, Jul 24, 2017 at 9:54 AM, Kyle Huey <me@xxxxxxxxxxxx> wrote:
>> On Mon, Jul 24, 2017 at 8:07 AM, Boris Ostrovsky
>> <boris.ostrovsky@xxxxxxxxxx> wrote:
>>>>> One thing I noticed is that the workaround doesn't appear to be
>>>>> complete: it is only checking PMC0 status and not other counters (fixed
>>>>> or architectural). Of course, without knowing what the actual problem
>>>>> was it's hard to say whether this was intentional.
>>>> handle_pmc_quirk appears to loop through all the counters ...
>>> Right, I didn't notice that it is shifting MSR_CORE_PERF_GLOBAL_STATUS
>>> value one by one and so it is looking at all bits.
>>>
>>>>>> 2. Intercepting MSR loads for counters that have the workaround
>>>>>> applied and giving the guest the correct counter value.
>>>>> We'd have to keep track of whether the counter has been reset (by the
>>>>> quirk) since the last MSR write.
>>>> Yes.
>>>>
>>>>>> 3. Or perhaps even changing the workaround to disable the PMI on that
>>>>>> counter until the guest acks via GLOBAL_OVF_CTRL, assuming that works
>>>>>> on the relevant hardware.
>>>>> MSR_CORE_PERF_GLOBAL_OVF_CTRL is written immediately after the quirk
>>>>> runs (in core2_vpmu_do_interrupt()) so we already do this, don't we?
>>>> I'm suggesting waiting until the *guest* writes to the (virtualized)
>>>> GLOBAL_OVF_CTRL.
>>> Wouldn't it be better to wait until the counter is reloaded?
>> Maybe!  I haven't thought through it a lot.  It's still not clear to
>> me whether MSR_CORE_PERF_GLOBAL_OVF_CTRL actually controls the
>> interrupt in any way or whether it just resets the bits in
>> MSR_CORE_PERF_GLOBAL_STATUS and acking the interrupt on the APIC is
>> all that's required to reenable it.
>>
>> - Kyle
> I wonder if it would be reasonable to just remove the workaround
> entirely at some point.  The set of people using 1) several-year-old
> hardware, 2) an up-to-date Xen, and 3) the off-by-default performance
> counters is probably rather small.

We'd probably want to enable this only for affected processors rather
than remove it outright. But the problem is that we still don't know for
sure whether this issue affects only Nehalem (NHM), do we?

(https://lists.xenproject.org/archives/html/xen-devel/2017-07/msg02242.html
is the original message)
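
To make the two ideas above concrete, here is a minimal standalone
sketch, not the Xen code. quirk_applies() and count_overflowed() are
hypothetical names, and the model list is an assumption: it lists the
family 6 model numbers commonly associated with Nehalem, while the
thread has not established that NHM is the only affected generation.
The second function only illustrates the shift-by-one walk over a
GLOBAL_STATUS-style value discussed earlier, with general-purpose
counter bits starting at bit 0 and fixed counter bits at bit 32.

```c
#include <stdbool.h>
#include <stdint.h>

/*
 * Hypothetical gate for the workaround. The model list is an assumption
 * (family 6 models usually attributed to Nehalem); whether other
 * generations are affected is still an open question in this thread.
 */
static bool quirk_applies(unsigned int family, unsigned int model)
{
    if (family != 6)
        return false;

    switch (model) {
    case 0x1a:
    case 0x1e:
    case 0x1f:
    case 0x2e:
        return true;
    default:
        return false;
    }
}

/*
 * Walk a GLOBAL_STATUS-style value one bit at a time and count the
 * counters whose overflow bit is set. n_gp and n_fixed are the numbers
 * of general-purpose and fixed counters the caller wants examined.
 */
static unsigned int count_overflowed(uint64_t status,
                                     unsigned int n_gp,
                                     unsigned int n_fixed)
{
    unsigned int i, hits = 0;
    uint64_t val = status;

    /* General-purpose counters: overflow bits start at bit 0. */
    for (i = 0; i < n_gp; i++, val >>= 1)
        if (val & 1)
            hits++;

    /* Fixed counters: overflow bits start at bit 32. */
    val = status >> 32;
    for (i = 0; i < n_fixed; i++, val >>= 1)
        if (val & 1)
            hits++;

    return hits;
}
```

A caller would check quirk_applies() once at initialisation and skip
the per-interrupt walk entirely on processors it returns false for.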


-boris



 

