Xen project Mailing List

Re: [Xen-devel] VPMU interrupt unreliability

To: Kyle Huey <me@xxxxxxxxxxxx>, Boris Ostrovsky <boris.ostrovsky@xxxxxxxxxx>

From: Andrew Cooper <andrew.cooper3@xxxxxxxxxx>

Date: Thu, 19 Oct 2017 16:40:43 +0100

Cc: "Tian, Kevin" <kevin.tian@xxxxxxxxx>, Dietmar Hahn <dietmar.hahn@xxxxxxxxxxxxxx>, Robert O'Callahan <robert@xxxxxxxxxxxxx>, Jun Nakajima <jun.nakajima@xxxxxxxxx>, xen-devel@xxxxxxxxxxxxx

Delivery-date: Thu, 19 Oct 2017 15:41:00 +0000

List-id: Xen developer discussion <xen-devel.lists.xen.org>

On 19/10/17 16:09, Kyle Huey wrote:
> On Wed, Oct 11, 2017 at 7:09 AM, Boris Ostrovsky
> <boris.ostrovsky@xxxxxxxxxx> wrote:
>> On 10/10/2017 12:54 PM, Kyle Huey wrote:
>>> On Mon, Jul 24, 2017 at 9:54 AM, Kyle Huey <me@xxxxxxxxxxxx> wrote:
>>>> On Mon, Jul 24, 2017 at 8:07 AM, Boris Ostrovsky
>>>> <boris.ostrovsky@xxxxxxxxxx> wrote:
>>>>>>> One thing I noticed is that the workaround doesn't appear to be
>>>>>>> complete: it is only checking PMC0 status and not other counters (fixed
>>>>>>> or architectural). Of course, without knowing what the actual problem
>>>>>>> was it's hard to say whether this was intentional.
>>>>>> handle_pmc_quirk appears to loop through all the counters ...
>>>>> Right, I didn't notice that it is shifting MSR_CORE_PERF_GLOBAL_STATUS
>>>>> value one by one and so it is looking at all bits.
>>>>>
>>>>>>>> 2. Intercepting MSR loads for counters that have the workaround
>>>>>>>> applied and giving the guest the correct counter value.
>>>>>>> We'd have to keep track of whether the counter has been reset (by the
>>>>>>> quirk) since the last MSR write.
>>>>>> Yes.
>>>>>>
>>>>>>>> 3. Or perhaps even changing the workaround to disable the PMI on that
>>>>>>>> counter until the guest acks via GLOBAL_OVF_CTRL, assuming that works
>>>>>>>> on the relevant hardware.
>>>>>>> MSR_CORE_PERF_GLOBAL_OVF_CTRL is written immediately after the quirk
>>>>>>> runs (in core2_vpmu_do_interrupt()) so we already do this, don't we?
>>>>>> I'm suggesting waiting until the *guest* writes to the (virtualized)
>>>>>> GLOBAL_OVF_CTRL.
>>>>> Wouldn't it be better to wait until the counter is reloaded?
>>>> Maybe! I haven't thought through it a lot. It's still not clear to
>>>> me whether MSR_CORE_PERF_GLOBAL_OVF_CTRL actually controls the
>>>> interrupt in any way or whether it just resets the bits in
>>>> MSR_CORE_PERF_GLOBAL_STATUS and acking the interrupt on the APIC is
>>>> all that's required to reenable it.
>>>>
>>>> - Kyle
>>> I wonder if it would be reasonable to just remove the workaround
>>> entirely at some point. The set of people using 1) several year old
>>> hardware, 2) an up to date Xen, and 3) the off-by-default performance
>>> counters is probably rather small.
>> We'd probably want to only enable this for affected processors, not
>> remove it outright. But the problem is that we still don't know for sure
>> whether this issue affects NHM only, do we?
>>
>> (https://lists.xenproject.org/archives/html/xen-devel/2017-07/msg02242.html
>> is the original message)
> Yes, the basic problem is that we don't know where to draw the line.

vPMU is disabled by default for security reasons, and also broken, in a
way which demonstrates that vPMU isn't getting much real-world use. As
far as I'm concerned, all options (including rm -rf and start from
scratch) are acceptable, especially if this ends up giving us a better
overall subsystem.

Do we know how other hypervisors work around this issue?

I'm tempted to suggest just ripping it straight out. NHM is ancient
these days, and if someone does manage to get a repro, we stand a better
chance of being able to debug it properly.

~Andrew

_______________________________________________
Xen-devel mailing list
Xen-devel@xxxxxxxxxxxxx
https://lists.xen.org/xen-devel
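
For readers following the quoted exchange about handle_pmc_quirk, here is a minimal sketch of how shifting a GLOBAL_STATUS-style value one bit at a time visits every counter rather than only PMC0. It is not the Xen source. The counter counts, the simulated register arrays, and the function name scan_overflow_status are all hypothetical. It also assumes the workaround rewrites an overflowed counter that reads 0 to the value 1, a detail this message does not state. The one layout fact relied on is that general-purpose overflow bits start at bit 0 and fixed-counter overflow bits start at bit 32.

```c
#include <stdint.h>

/* Hypothetical counter counts for this sketch; real code queries them via CPUID. */
#define NUM_ARCH_PMCS      4
#define NUM_FIXED_PMCS     3
#define FIXED_STATUS_SHIFT 32   /* fixed-counter overflow bits start at bit 32 */

/* Simulated counter registers standing in for rdmsr/wrmsr accesses. */
static uint64_t arch_pmc[NUM_ARCH_PMCS];
static uint64_t fixed_pmc[NUM_FIXED_PMCS];

/*
 * Walk every overflow bit in a GLOBAL_STATUS-style value. For each counter
 * that is flagged as overflowed and currently reads 0, write 1 to it
 * (assumed workaround behaviour). Returns a bitmask, in the same layout as
 * the status value, of the counters that were rewritten.
 */
uint64_t scan_overflow_status(uint64_t status)
{
    uint64_t nudged = 0;
    uint64_t val = status;

    /* General-purpose counters: test bit 0, then shift the next bit down. */
    for (int i = 0; i < NUM_ARCH_PMCS; i++) {
        if ((val & 1) && arch_pmc[i] == 0) {
            arch_pmc[i] = 1;
            nudged |= (uint64_t)1 << i;
        }
        val >>= 1;
    }

    /* Fixed counters: same walk over the upper half of the status value. */
    val = status >> FIXED_STATUS_SHIFT;
    for (int i = 0; i < NUM_FIXED_PMCS; i++) {
        if ((val & 1) && fixed_pmc[i] == 0) {
            fixed_pmc[i] = 1;
            nudged |= (uint64_t)1 << (i + FIXED_STATUS_SHIFT);
        }
        val >>= 1;
    }

    return nudged;
}
```

The returned mask is one possible way to "keep track of whether the counter has been reset (by the quirk) since the last MSR write", as option 2 in the thread would need. A caller would clear the relevant bit whenever the guest next writes that counter.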
