Re: [PATCH] x86/vpmu: fix race-condition in vpmu_load
On Mon, Sep 19, 2022 at 9:21 AM Jan Beulich <jbeulich@xxxxxxxx> wrote:
>
> On 19.09.2022 14:25, Tamas K Lengyel wrote:
> > On Mon, Sep 19, 2022 at 5:28 AM Jan Beulich <jbeulich@xxxxxxxx> wrote:
> >>
> >> On 16.09.2022 23:35, Boris Ostrovsky wrote:
> >>>
> >>> On 9/16/22 8:52 AM, Jan Beulich wrote:
> >>>> On 15.09.2022 16:01, Tamas K Lengyel wrote:
> >>>>> While experimenting with the vPMU subsystem an ASSERT failure was
> >>>>> observed in vmx_find_msr because the vcpu_runnable state was true.
> >>>>>
> >>>>> The root cause of the bug appears to be the fact that the vPMU
> >>>>> subsystem doesn't save its state on context_switch.
>
> For the further reply below - is this actually true? What is the
> vpmu_switch_from() (resolving to vpmu_save()) doing then early
> in context_switch()?
>
> >>>>> The vpmu_load function will attempt to gather the PMU state if
> >>>>> it's still loaded, in two different ways:
> >>>>> 1. if the current pcpu is not where the vcpu ran before, by doing
> >>>>>    a remote save
> >>>>> 2. if the current pcpu had another vcpu active before, by doing a
> >>>>>    local save
> >>>>>
> >>>>> However, in case the prev vcpu is being rescheduled on another
> >>>>> pcpu, its state has already changed and vcpu_runnable is returning
> >>>>> true, thus #2 will trip the ASSERT. The only way to avoid this
> >>>>> race condition is to make sure the prev vcpu is paused while being
> >>>>> checked and its context saved. Once the prev vcpu is resumed and
> >>>>> does #1, it will find its state already saved.
> >>>> While I consider this explanation plausible, I'm worried:
> >>>>
> >>>>> --- a/xen/arch/x86/cpu/vpmu.c
> >>>>> +++ b/xen/arch/x86/cpu/vpmu.c
> >>>>> @@ -419,8 +419,10 @@ int vpmu_load(struct vcpu *v, bool_t from_guest)
> >>>>>              vpmu = vcpu_vpmu(prev);
> >>>>>
> >>>>>              /* Someone ran here before us */
> >>>>> +            vcpu_pause(prev);
> >>>>>              vpmu_save_force(prev);
> >>>>>              vpmu_reset(vpmu, VPMU_CONTEXT_LOADED);
> >>>>> +            vcpu_unpause(prev);
> >>>>>
> >>>>>              vpmu = vcpu_vpmu(v);
> >>>>>          }
> >>>> We're running with IRQs off here, yet vcpu_pause() waits for the
> >>>> vcpu to actually be de-scheduled. Even with IRQs on this is already
> >>>> a relatively heavy operation (also including its impact on the
> >>>> remote side). Additionally the function is called from
> >>>> context_switch(), and I'm unsure of the usability of vcpu_pause()
> >>>> on such a path. In particular: Is there a risk of two CPUs doing
> >>>> this mutually to one another? If so, is deadlocking excluded?
> >>>>
> >>>> Hence at the very least I think the description wants extending, to
> >>>> discuss the safety of the change.
> >>>>
> >>>> Boris - any chance you could comment here? Iirc that's code you did
> >>>> introduce.
> >>>
> >>>
> >>> Does the assertion in vmx_find_msr() really need to be for a
> >>> runnable vcpu, or can it be a check on whether the vcpu is actually
> >>> running (e.g. RUNSTATE_running)?
> >>
> >> You cannot safely check for "running", as "runnable" may transition
> >> to/from "running" behind your back.
> >
> > The more I look at this the more I think the only sensible solution
> > is to have the vPMU state be saved on vmexit for all vCPUs.
>
> Do you really mean vmexit? It would suffice if state was reliably
> saved during context-switch-out, wouldn't it? At that point the
> vCPU can't be resumed on another pCPU, yet.
>
> > That way all this having to figure out where and when a context
> > needs saving during scheduling goes away.
> > Yes, it adds a bit of overhead for cases where the vCPU will resume
> > on the same pCPU and that context save could have been skipped,
>
> If you really mean vmexit, then I'm inclined to say that's more
> than just "a bit of overhead". I'd agree if you really meant
> context-switch-out, but as said further up it looks to me as if
> that was already occurring. Apparently I'm overlooking something
> crucial ...

Yes, the current setup is doing exactly that: saving the vPMU context
on context-switch-out. And that's where the ASSERT failure occurs,
because the vCPU whose context needs to be saved is already runnable
on another pCPU.

Tamas