Re: [PATCH] x86/vpmu: fix race-condition in vpmu_load
On 20.09.2022 00:42, Boris Ostrovsky wrote:
>
> On 9/19/22 10:56 AM, Jan Beulich wrote:
>> On 19.09.2022 16:11, Tamas K Lengyel wrote:
>>> On Mon, Sep 19, 2022 at 9:58 AM Jan Beulich <jbeulich@xxxxxxxx> wrote:
>>>
>>>> On 19.09.2022 15:24, Tamas K Lengyel wrote:
>>>>> On Mon, Sep 19, 2022 at 9:21 AM Jan Beulich <jbeulich@xxxxxxxx> wrote:
>>>>>>
>>>>>> On 19.09.2022 14:25, Tamas K Lengyel wrote:
>>>>>>> On Mon, Sep 19, 2022 at 5:28 AM Jan Beulich <jbeulich@xxxxxxxx> wrote:
>>>>>>>>
>>>>>>>> On 16.09.2022 23:35, Boris Ostrovsky wrote:
>>>>>>>>>
>>>>>>>>> On 9/16/22 8:52 AM, Jan Beulich wrote:
>>>>>>>>>> On 15.09.2022 16:01, Tamas K Lengyel wrote:
>>>>>>>>>>> While experimenting with the vPMU subsystem an ASSERT failure was
>>>>>>>>>>> observed in vmx_find_msr because the vcpu_runnable state was true.
>>>>>>>>>>>
>>>>>>>>>>> The root cause of the bug appears to be the fact that the vPMU
>>>>>>>>>>> subsystem doesn't save its state on context_switch.
>>>>>>
>>>>>> For the further reply below - is this actually true? What is the
>>>>>> vpmu_switch_from() (resolving to vpmu_save()) doing then early
>>>>>> in context_switch()?
>>>>>>
>>>>>>>>>>> The vpmu_load function will attempt
>>>>>>>>>>> to gather the PMU state if it's still loaded, two different ways:
>>>>>>>>>>> 1. if the current pcpu is not where the vcpu ran before, doing a remote save
>>>>>>>>>>> 2. if the current pcpu had another vcpu active before, doing a local save
>>>>>>>>>>>
>>>>>>>>>>> However, in case the prev vcpu is being rescheduled on another pcpu its state
>>>>>>>>>>> has already changed and vcpu_runnable is returning true, thus #2 will trip the
>>>>>>>>>>> ASSERT. The only way to avoid this race condition is to make sure the
>>>>>>>>>>> prev vcpu is paused while being checked and its context saved. Once the prev
>>>>>>>>>>> vcpu is resumed and does #1 it will find its state already saved.
>>>>>>>>>> While I consider this explanation plausible, I'm worried:
>>>>>>>>>>
>>>>>>>>>>> --- a/xen/arch/x86/cpu/vpmu.c
>>>>>>>>>>> +++ b/xen/arch/x86/cpu/vpmu.c
>>>>>>>>>>> @@ -419,8 +419,10 @@ int vpmu_load(struct vcpu *v, bool_t from_guest)
>>>>>>>>>>>          vpmu = vcpu_vpmu(prev);
>>>>>>>>>>>
>>>>>>>>>>>          /* Someone ran here before us */
>>>>>>>>>>> +        vcpu_pause(prev);
>>>>>>>>>>>          vpmu_save_force(prev);
>>>>>>>>>>>          vpmu_reset(vpmu, VPMU_CONTEXT_LOADED);
>>>>>>>>>>> +        vcpu_unpause(prev);
>>>>>>>>>>>
>>>>>>>>>>>          vpmu = vcpu_vpmu(v);
>>>>>>>>>>>      }
>>>>>>>>>> We're running with IRQs off here, yet vcpu_pause() waits for the vcpu
>>>>>>>>>> to actually be de-scheduled. Even with IRQs on this is already a
>>>>>>>>>> relatively heavy operation (also including its impact on the remote
>>>>>>>>>> side). Additionally the function is called from context_switch(), and
>>>>>>>>>> I'm unsure of the usability of vcpu_pause() on such a path. In
>>>>>>>>>> particular: Is there a risk of two CPUs doing this mutually to one
>>>>>>>>>> another? If so, is deadlocking excluded?
>>>>>>>>>>
>>>>>>>>>> Hence at the very least I think the description wants extending, to
>>>>>>>>>> discuss the safety of the change.
>>>>>>>>>>
>>>>>>>>>> Boris - any chance you could comment here? Iirc that's code you did
>>>>>>>>>> introduce.
>>>>>>>>>
>>>>>>>>>
>>>>>>>>> Does the assertion in vmx_find_msr() really need to be for a runnable
>>>>>>>>> vcpu, or can it be a check on whether the vcpu is actually running
>>>>>>>>> (e.g. RUNSTATE_running)?
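
For readers following the exchange, the check being debated above and in the
replies below has roughly the following shape. This is a hedged paraphrase,
not the literal vmx_find_msr() source; vcpu_runnable(), RUNSTATE_running and
the vcpu runstate field are the Xen constructs named in this thread, but
their exact use here is illustrative only.

    /* Rough paraphrase of the assertion under discussion -- not copied
     * from xen/arch/x86/hvm/vmx/vmcs.c. */

    /* Current shape: the vCPU whose MSR list is being walked must either
     * be the vCPU running on this pCPU, or one the scheduler cannot pick
     * up anywhere right now. */
    ASSERT(v == current || !vcpu_runnable(v));

    /* Boris's alternative: only require that the vCPU is not actually
     * running at this moment (v == current still has to be allowed,
     * since the current vCPU is by definition running here). */
    ASSERT(v == current || v->runstate.state != RUNSTATE_running);
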
>>>>>>>>
>>>>>>>> You cannot safely check for "running", as "runnable" may transition
>>>>>>>> to/from "running" behind your back.
>>>>>>>
>>>>>>> The more I look at this the more I think the only sensible solution is
>>>>>>> to have the vPMU state be saved on vmexit for all vCPUs.
>>>>>>
>>>>>> Do you really mean vmexit? It would suffice if state was reliably
>>>>>> saved during context-switch-out, wouldn't it? At that point the
>>>>>> vCPU can't be resumed on another pCPU, yet.
>>>>>>
>>>>>>> That way all
>>>>>>> this having to figure out where and when a context needs saving during
>>>>>>> scheduling goes away. Yes, it adds a bit of overhead for cases where
>>>>>>> the vCPU will resume on the same pCPU and that context save could
>>>>>>> have been skipped,
>>>>>>
>>>>>> If you really mean vmexit, then I'm inclined to say that's more
>>>>>> than just "a bit of overhead". I'd agree if you really meant
>>>>>> context-switch-out, but as said further up it looks to me as if
>>>>>> that was already occurring. Apparently I'm overlooking something
>>>>>> crucial ...
>>>>>
>>>>> Yes, the current setup is doing exactly that, saving the vPMU context
>>>>> on context-switch-out, and that's where the ASSERT failure occurs
>>>>> because the vCPU it needs to save the context for is already runnable
>>>>> on another pCPU.
>>>>
>>>> Well, if that's the scenario (sorry for not understanding it that
>>>> way earlier on), then the assertion is too strict: While in context
>>>> switch, the vCPU may be runnable, but certainly won't actually run
>>>> anywhere. Therefore I'd be inclined to suggest to relax the
>>>> condition just enough to cover this case, without actually going to
>>>> checking for "running".
>>>>
>>>
>>> What ensures the vCPU won't actually run anywhere if it's in the runnable
>>> state?
>>
>> The fact that the vCPU is the subject of context_switch().
>>
>>> And how do I relax the condition just enough to cover just this case?
>>
>> That's the more difficult question. The immediate solution, passing a
>> boolean or flag to vpmu_switch_from(), may not be practical, as it
>> would likely mean passing this through many layers.
>>
>> Utilizing that I have Jürgen sitting next to me, I've discussed this
>> with him, and he suggested to simply check for v == current. And
>> indeed - set_current() in context_switch() happens a few lines after
>> vpmu_switch_from().
>
>
> It is saving vpmu data from the current pcpu's MSRs for a remote vcpu, so @v
> in vmx_find_msr() is not @current:
>
> vpmu_load()
>   ...
>   prev = per_cpu(last_vcpu, pcpu);
>   vpmu_save_force(prev)
>     core2_vpmu_save()
>       __core2_vpmu_save()
>         vmx_read_guest_msr()
>           vmx_find_msr()
>
>
> The call to vmx_find_msr() was introduced by 755087eb9b10c. I wonder though
> whether this call is needed when the code path above is executed (i.e. when
> we are saving a remote vcpu).

How could it not be needed? We need to obtain the guest value.

The thing I don't understand is why this forced saving is necessary, when
context_switch() unconditionally calls vpmu_switch_from().

Jan
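
To make the failing path easier to follow, here is an abridged sketch of
vpmu_load()'s "someone ran here before us" branch with the proposed pause
applied. It is reconstructed from the hunk and the call chain quoted above,
not copied from xen/arch/x86/cpu/vpmu.c; the exact guard and surrounding
code may differ.

    /* Abridged, reconstructed sketch -- not the verbatim Xen source. */
    int vpmu_load(struct vcpu *v, bool_t from_guest)
    {
        struct vpmu_struct *vpmu = vcpu_vpmu(v);
        int pcpu = smp_processor_id();
        struct vcpu *prev;

        /* ... IRQs are off on this path (the basis of the vcpu_pause()
         * concern voiced above) ... */

        prev = per_cpu(last_vcpu, pcpu);

        if ( prev && prev != v )
        {
            vpmu = vcpu_vpmu(prev);

            /* Someone ran here before us */
            vcpu_pause(prev);        /* proposed: keep prev off other pCPUs */
            vpmu_save_force(prev);   /* -> core2_vpmu_save()
                                      * -> __core2_vpmu_save()
                                      * -> vmx_read_guest_msr()
                                      * -> vmx_find_msr(prev, ...) */
            vpmu_reset(vpmu, VPMU_CONTEXT_LOADED);
            vcpu_unpause(prev);      /* proposed */

            vpmu = vcpu_vpmu(v);
        }

        /* ... load v's own PMU context ... */
        return 0;
    }

On this path the @v that reaches vmx_find_msr() is prev, not current, which
is why the v == current relaxation discussed for the context_switch() case
does not cover it.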