Re: [PATCH] x86/vpmu: fix race-condition in vpmu_load
On 19.09.2022 16:11, Tamas K Lengyel wrote:
> On Mon, Sep 19, 2022 at 9:58 AM Jan Beulich <jbeulich@xxxxxxxx> wrote:
>> On 19.09.2022 15:24, Tamas K Lengyel wrote:
>>> On Mon, Sep 19, 2022 at 9:21 AM Jan Beulich <jbeulich@xxxxxxxx> wrote:
>>>> On 19.09.2022 14:25, Tamas K Lengyel wrote:
>>>>> On Mon, Sep 19, 2022 at 5:28 AM Jan Beulich <jbeulich@xxxxxxxx> wrote:
>>>>>> On 16.09.2022 23:35, Boris Ostrovsky wrote:
>>>>>>> On 9/16/22 8:52 AM, Jan Beulich wrote:
>>>>>>>> On 15.09.2022 16:01, Tamas K Lengyel wrote:
>>>>>>>>> While experimenting with the vPMU subsystem an ASSERT failure was
>>>>>>>>> observed in vmx_find_msr because the vcpu_runnable state was true.
>>>>>>>>>
>>>>>>>>> The root cause of the bug appears to be the fact that the vPMU
>>>>>>>>> subsystem doesn't save its state on context_switch.
>>>>
>>>> For the further reply below - is this actually true? What is the
>>>> vpmu_switch_from() (resolving to vpmu_save()) doing then early
>>>> in context_switch()?
>>>>
>>>>>>>>> The vpmu_load function will attempt to gather the PMU state if
>>>>>>>>> it's still loaded, in two different ways:
>>>>>>>>>  1. if the current pcpu is not where the vcpu ran before, by
>>>>>>>>>     doing a remote save
>>>>>>>>>  2. if the current pcpu had another vcpu active before, by
>>>>>>>>>     doing a local save
>>>>>>>>>
>>>>>>>>> However, if the prev vcpu is being rescheduled on another pcpu,
>>>>>>>>> its state has already changed and vcpu_runnable is returning
>>>>>>>>> true, thus #2 will trip the ASSERT. The only way to avoid this
>>>>>>>>> race condition is to make sure the prev vcpu is paused while
>>>>>>>>> being checked and its context saved. Once the prev vcpu is
>>>>>>>>> resumed and does #1 it will find its state already saved.
>>>>>>>>
>>>>>>>> While I consider this explanation plausible, I'm worried:
>>>>>>>>
>>>>>>>>> --- a/xen/arch/x86/cpu/vpmu.c
>>>>>>>>> +++ b/xen/arch/x86/cpu/vpmu.c
>>>>>>>>> @@ -419,8 +419,10 @@ int vpmu_load(struct vcpu *v, bool_t from_guest)
>>>>>>>>>          vpmu = vcpu_vpmu(prev);
>>>>>>>>>
>>>>>>>>>          /* Someone ran here before us */
>>>>>>>>> +        vcpu_pause(prev);
>>>>>>>>>          vpmu_save_force(prev);
>>>>>>>>>          vpmu_reset(vpmu, VPMU_CONTEXT_LOADED);
>>>>>>>>> +        vcpu_unpause(prev);
>>>>>>>>>
>>>>>>>>>          vpmu = vcpu_vpmu(v);
>>>>>>>>>      }
>>>>>>>>
>>>>>>>> We're running with IRQs off here, yet vcpu_pause() waits for the
>>>>>>>> vcpu to actually be de-scheduled. Even with IRQs on this is
>>>>>>>> already a relatively heavy operation (also including its impact
>>>>>>>> on the remote side). Additionally the function is called from
>>>>>>>> context_switch(), and I'm unsure of the usability of
>>>>>>>> vcpu_pause() on such a path. In particular: Is there a risk of
>>>>>>>> two CPUs doing this mutually to one another? If so, is
>>>>>>>> deadlocking excluded?
>>>>>>>>
>>>>>>>> Hence at the very least I think the description wants extending,
>>>>>>>> to discuss the safety of the change.
>>>>>>>>
>>>>>>>> Boris - any chance you could comment here? Iirc that's code you
>>>>>>>> did introduce.
>>>>>>>
>>>>>>> Does the assertion in vmx_find_msr() really need to be for a
>>>>>>> runnable vcpu, or can it be a check on whether the vcpu is
>>>>>>> actually running (e.g. RUNSTATE_running)?
>>>>>>
>>>>>> You cannot safely check for "running", as "runnable" may
>>>>>> transition to/from "running" behind your back.
>>>>>
>>>>> The more I look at this the more I think the only sensible solution
>>>>> is to have the vPMU state be saved on vmexit for all vCPUs.
>>>>
>>>> Do you really mean vmexit? It would suffice if state was reliably
>>>> saved during context-switch-out, wouldn't it? At that point the
>>>> vCPU can't be resumed on another pCPU, yet.
>>>>
>>>>> That way all this having to figure out where and when a context
>>>>> needs saving during scheduling goes away. Yes, it adds a bit of
>>>>> overhead for cases where the vCPU will resume on the same pCPU and
>>>>> that context save could have been skipped,
>>>>
>>>> If you really mean vmexit, then I'm inclined to say that's more
>>>> than just "a bit of overhead". I'd agree if you really meant
>>>> context-switch-out, but as said further up it looks to me as if
>>>> that was already occurring. Apparently I'm overlooking something
>>>> crucial ...
>>>
>>> Yes, the current setup is doing exactly that, saving the vPMU context
>>> on context-switch-out, and that's where the ASSERT failure occurs,
>>> because the vCPU it needs to save the context for is already runnable
>>> on another pCPU.
>>
>> Well, if that's the scenario (sorry for not understanding it that way
>> earlier on), then the assertion is too strict: While in context
>> switch, the vCPU may be runnable, but certainly won't actually run
>> anywhere. Therefore I'd be inclined to suggest relaxing the condition
>> just enough to cover this case, without actually going to checking
>> for "running".
>
> What ensures the vCPU won't actually run anywhere if it's in the
> runnable state?

The fact that the vCPU is the subject of context_switch().

> And how do I relax the condition just enough to cover just this case?

That's the more difficult question. The immediate solution, passing a
boolean or flag to vpmu_switch_from(), may not be practical, as it
would likely mean passing this through many layers. Taking advantage of
having Jürgen sitting next to me, I've discussed this with him, and he
suggested simply checking for v == current. And indeed - set_current()
in context_switch() happens a few lines after vpmu_switch_from().

However, going back to vmx_find_msr() I find that the v == current case
is already included there. Which makes me wonder again - what exactly
is the scenario in which you're observing the assertion triggering?
Would you mind spelling out the call chain, perhaps by way of the call
stack from the assertion?

Jan
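
For readers skimming the archive, here is a commented restatement of
the hunk under discussion: a minimal sketch of the patched remote-save
path in vpmu_load(), not the full function. It uses only the Xen
internals already named in the thread (vcpu_pause(), vcpu_unpause(),
vpmu_save_force(), vpmu_reset(), vcpu_vpmu()); the surrounding function
body is elided.

    /*
     * Inside vpmu_load(): another vCPU ("prev") last ran on this pCPU
     * and its PMU state is still loaded in the hardware counters.
     */
    vpmu = vcpu_vpmu(prev);

    /*
     * Someone ran here before us.  Pausing prev guarantees it is fully
     * de-scheduled before its state is saved; without the pause, prev
     * may already be runnable on another pCPU while the save runs,
     * which is exactly the window in which the vcpu_runnable() ASSERT
     * in vmx_find_msr() fires.
     */
    vcpu_pause(prev);
    vpmu_save_force(prev);
    vpmu_reset(vpmu, VPMU_CONTEXT_LOADED);
    vcpu_unpause(prev);

    /* prev's state is now saved; proceed with loading v's own context. */
    vpmu = vcpu_vpmu(v);

The alternative Jan raises, relaxing the assertion rather than pausing,
would presumably take the shape ASSERT(v == current || !vcpu_runnable(v)),
which is safe during context-switch-out because set_current() runs only
after vpmu_switch_from(); his closing question stems from vmx_find_msr()
seemingly covering the v == current case already.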