Re: [Xen-devel] VPMU interrupt unreliability



On 19/10/17 16:09, Kyle Huey wrote:
> On Wed, Oct 11, 2017 at 7:09 AM, Boris Ostrovsky
> <boris.ostrovsky@xxxxxxxxxx> wrote:
>> On 10/10/2017 12:54 PM, Kyle Huey wrote:
>>> On Mon, Jul 24, 2017 at 9:54 AM, Kyle Huey <me@xxxxxxxxxxxx> wrote:
>>>> On Mon, Jul 24, 2017 at 8:07 AM, Boris Ostrovsky
>>>> <boris.ostrovsky@xxxxxxxxxx> wrote:
>>>>>>> One thing I noticed is that the workaround doesn't appear to be
>>>>>>> complete: it is only checking PMC0 status and not other counters (fixed
>>>>>>> or architectural). Of course, without knowing what the actual problem
>>>>>>> was it's hard to say whether this was intentional.
>>>>>> handle_pmc_quirk appears to loop through all the counters ...
>>>>> Right, I didn't notice that it is shifting MSR_CORE_PERF_GLOBAL_STATUS
>>>>> value one by one and so it is looking at all bits.
>>>>>
>>>>>>>> 2. Intercepting MSR loads for counters that have the workaround
>>>>>>>> applied and giving the guest the correct counter value.
>>>>>>> We'd have to keep track of whether the counter has been reset (by the
>>>>>>> quirk) since the last MSR write.
>>>>>> Yes.
>>>>>>
>>>>>>>> 3. Or perhaps even changing the workaround to disable the PMI on that
>>>>>>>> counter until the guest acks via GLOBAL_OVF_CTRL, assuming that works
>>>>>>>> on the relevant hardware.
>>>>>>> MSR_CORE_PERF_GLOBAL_OVF_CTRL is written immediately after the quirk
>>>>>>> runs (in core2_vpmu_do_interrupt()) so we already do this, don't we?
>>>>>> I'm suggesting waiting until the *guest* writes to the (virtualized)
>>>>>> GLOBAL_OVF_CTRL.
>>>>> Wouldn't it be better to wait until the counter is reloaded?
>>>> Maybe!  I haven't thought through it a lot.  It's still not clear to
>>>> me whether MSR_CORE_PERF_GLOBAL_OVF_CTRL actually controls the
>>>> interrupt in any way or whether it just resets the bits in
>>>> MSR_CORE_PERF_GLOBAL_STATUS and acking the interrupt on the APIC is
>>>> all that's required to reenable it.
>>>>
>>>> - Kyle
>>> I wonder if it would be reasonable to just remove the workaround
>>> entirely at some point.  The set of people using 1) several year old
>>> hardware, 2) an up to date Xen, and 3) the off-by-default performance
>>> counters is probably rather small.
>> We'd probably want to only enable this for affected processors, not
>> remove it outright. But the problem is that we still don't know for sure
>> whether this issue affects NHM only, do we?
>>
>> (https://lists.xenproject.org/archives/html/xen-devel/2017-07/msg02242.html
>> is the original message)
> Yes, the basic problem is that we don't know where to draw the line.

vPMU is disabled by default for security reasons, and also broken, in a
way which demonstrates that vPMU isn't getting much real-world use.

As far as I'm concerned, all options (including rm -rf and start from
scratch) are acceptable, especially if this ends up giving us a better
overall subsystem.

Do we know how other hypervisors work around this issue?

I'm tempted to suggest just ripping it straight out.  NHM is ancient
these days, and if someone does manage to get a repro, we stand a better
chance of being able to debug it properly.

~Andrew
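
For readers following the quoted exchange about handle_pmc_quirk, the idea of shifting the MSR_CORE_PERF_GLOBAL_STATUS value one bit at a time so that every counter is examined can be sketched roughly as below. This is an illustrative sketch, not Xen's actual code: the names find_overflowed, scan_bits, struct overflowed and FIXED_STATUS_SHIFT are hypothetical, and the bit layout (general-purpose counter overflow bits starting at bit 0, fixed-counter overflow bits starting at bit 32) is an assumption taken from the Intel SDM description of the global status register.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Assumed layout: general-purpose overflow bits start at bit 0,
 * fixed-counter overflow bits start at bit 32. */
#define FIXED_STATUS_SHIFT 32

/* Hypothetical descriptor for one counter whose overflow bit is set. */
struct overflowed {
    int is_fixed;        /* 0 = general-purpose, 1 = fixed */
    unsigned int index;  /* index within its class */
};

/* Walk the low `nr` bits of `bits`, shifting right by one each step,
 * and append an entry to `out` (starting at slot `n`) per set bit.
 * Returns the new number of entries in `out`. */
static size_t scan_bits(uint64_t bits, unsigned int nr, int is_fixed,
                        struct overflowed *out, size_t n)
{
    for (unsigned int i = 0; i < nr; i++, bits >>= 1) {
        if (bits & 1) {
            out[n].is_fixed = is_fixed;
            out[n].index = i;
            n++;
        }
    }
    return n;
}

/* Collect every overflowed counter from a global status value.
 * `out` must have room for nr_gp + nr_fixed entries. */
static size_t find_overflowed(uint64_t status, unsigned int nr_gp,
                              unsigned int nr_fixed, struct overflowed *out)
{
    size_t n;

    assert(nr_gp <= 32 && nr_fixed <= 32);

    n = scan_bits(status, nr_gp, 0, out, 0);
    return scan_bits(status >> FIXED_STATUS_SHIFT, nr_fixed, 1, out, n);
}
```

With four general-purpose and three fixed counters, a status value with bits 0, 2 and 33 set would report general-purpose counters 0 and 2 and fixed counter 1. A quirk handler built this way looks at every counter rather than only PMC0, which is the point made in the quoted reply.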
