Re: [PATCH] x86/vpmu: fix race-condition in vpmu_load
On 19.09.2022 16:11, Tamas K Lengyel wrote:
> On Mon, Sep 19, 2022 at 9:58 AM Jan Beulich <jbeulich@xxxxxxxx> wrote:
>> On 19.09.2022 15:24, Tamas K Lengyel wrote:
>>> On Mon, Sep 19, 2022 at 9:21 AM Jan Beulich <jbeulich@xxxxxxxx> wrote:
>>>> On 19.09.2022 14:25, Tamas K Lengyel wrote:
>>>>> On Mon, Sep 19, 2022 at 5:28 AM Jan Beulich <jbeulich@xxxxxxxx> wrote:
>>>>>> On 16.09.2022 23:35, Boris Ostrovsky wrote:
>>>>>>> On 9/16/22 8:52 AM, Jan Beulich wrote:
>>>>>>>> On 15.09.2022 16:01, Tamas K Lengyel wrote:
>>>>>>>>> While experimenting with the vPMU subsystem an ASSERT failure was
>>>>>>>>> observed in vmx_find_msr because the vcpu_runnable state was true.
>>>>>>>>>
>>>>>>>>> The root cause of the bug appears to be the fact that the vPMU
>>>>>>>>> subsystem doesn't save its state on context_switch.
>>>>
>>>> For the further reply below - is this actually true? What is the
>>>> vpmu_switch_from() (resolving to vpmu_save()) doing then early
>>>> in context_switch()?
>>>>
>>>>>>>>> The vpmu_load function will attempt to gather the PMU state if
>>>>>>>>> it's still loaded, in two different ways:
>>>>>>>>>  1. if the current pcpu is not where the vcpu ran before, by
>>>>>>>>>     doing a remote save
>>>>>>>>>  2. if the current pcpu had another vcpu active before, by
>>>>>>>>>     doing a local save
>>>>>>>>>
>>>>>>>>> However, if the prev vcpu is being rescheduled on another pcpu,
>>>>>>>>> its state has already changed and vcpu_runnable is returning
>>>>>>>>> true, thus #2 will trip the ASSERT. The only way to avoid this
>>>>>>>>> race condition is to make sure the prev vcpu is paused while
>>>>>>>>> being checked and its context saved. Once the prev vcpu is
>>>>>>>>> resumed and does #1 it will find its state already saved.
>>>>>>>>
>>>>>>>> While I consider this explanation plausible, I'm worried:
>>>>>>>>
>>>>>>>>> --- a/xen/arch/x86/cpu/vpmu.c
>>>>>>>>> +++ b/xen/arch/x86/cpu/vpmu.c
>>>>>>>>> @@ -419,8 +419,10 @@ int vpmu_load(struct vcpu *v, bool_t from_guest)
>>>>>>>>>          vpmu = vcpu_vpmu(prev);
>>>>>>>>>
>>>>>>>>>          /* Someone ran here before us */
>>>>>>>>> +        vcpu_pause(prev);
>>>>>>>>>          vpmu_save_force(prev);
>>>>>>>>>          vpmu_reset(vpmu, VPMU_CONTEXT_LOADED);
>>>>>>>>> +        vcpu_unpause(prev);
>>>>>>>>>
>>>>>>>>>          vpmu = vcpu_vpmu(v);
>>>>>>>>>      }
>>>>>>>>
>>>>>>>> We're running with IRQs off here, yet vcpu_pause() waits for the
>>>>>>>> vcpu to actually be de-scheduled. Even with IRQs on this is
>>>>>>>> already a relatively heavy operation (also including its impact
>>>>>>>> on the remote side). Additionally the function is called from
>>>>>>>> context_switch(), and I'm unsure of the usability of
>>>>>>>> vcpu_pause() on such a path. In particular: Is there a risk of
>>>>>>>> two CPUs doing this mutually to one another? If so, is
>>>>>>>> deadlocking excluded?
>>>>>>>>
>>>>>>>> Hence at the very least I think the description wants extending,
>>>>>>>> to discuss the safety of the change.
>>>>>>>>
>>>>>>>> Boris - any chance you could comment here? Iirc that's code you
>>>>>>>> did introduce.
>>>>>>>
>>>>>>> Does the assertion in vmx_find_msr() really need to be for a
>>>>>>> runnable vcpu, or can it be a check on whether the vcpu is
>>>>>>> actually running (e.g. RUNSTATE_running)?
>>>>>>
>>>>>> You cannot safely check for "running", as "runnable" may
>>>>>> transition to/from "running" behind your back.
>>>>>
>>>>> The more I look at this the more I think the only sensible solution
>>>>> is to have the vPMU state be saved on vmexit for all vCPUs.
>>>>
>>>> Do you really mean vmexit? It would suffice if state was reliably
>>>> saved during context-switch-out, wouldn't it? At that point the
>>>> vCPU can't be resumed on another pCPU, yet.
>>>>
>>>>> That way all this having to figure out where and when a context
>>>>> needs saving during scheduling goes away. Yes, it adds a bit of
>>>>> overhead for cases where the vCPU will resume on the same pCPU and
>>>>> that context save could have been skipped,
>>>>
>>>> If you really mean vmexit, then I'm inclined to say that's more
>>>> than just "a bit of overhead". I'd agree if you really meant
>>>> context-switch-out, but as said further up it looks to me as if
>>>> that was already occurring. Apparently I'm overlooking something
>>>> crucial ...
>>>
>>> Yes, the current setup is doing exactly that, saving the vPMU context
>>> on context-switch-out, and that's where the ASSERT failure occurs,
>>> because the vCPU it needs to save the context for is already runnable
>>> on another pCPU.
>>
>> Well, if that's the scenario (sorry for not understanding it that way
>> earlier on), then the assertion is too strict: While in context
>> switch, the vCPU may be runnable, but certainly won't actually run
>> anywhere. Therefore I'd be inclined to suggest relaxing the condition
>> just enough to cover this case, without actually going to checking
>> for "running".
>
> What ensures the vCPU won't actually run anywhere if it's in the
> runnable state?

The fact that the vCPU is the subject of context_switch().

> And how do I relax the condition just enough to cover just this case?

That's the more difficult question. The immediate solution, passing a
boolean or flag to vpmu_switch_from(), may not be practical, as it
would likely mean passing this through many layers. Taking advantage of
having Jürgen sitting next to me, I've discussed this with him, and he
suggested simply checking for v == current. And indeed - set_current()
in context_switch() happens a few lines after vpmu_switch_from().

However, going back to vmx_find_msr() I find that the v == current case
is already included there. Which makes me wonder again - what exactly
is the scenario in which you're observing the assertion triggering?
Would you mind spelling out the call chain, perhaps by way of the call
stack from the assertion?

Jan
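
For readers skimming the archive, here is a commented restatement of
the hunk under discussion: a minimal sketch of the patched remote-save
path in vpmu_load(), not the full function. It uses only the Xen
internals already named in the thread (vcpu_pause(), vcpu_unpause(),
vpmu_save_force(), vpmu_reset(), vcpu_vpmu()); the surrounding function
body is elided.

    /*
     * Inside vpmu_load(): another vCPU ("prev") last ran on this pCPU
     * and its PMU state is still loaded in the hardware counters.
     */
    vpmu = vcpu_vpmu(prev);

    /*
     * Someone ran here before us.  Pausing prev guarantees it is fully
     * de-scheduled before its state is saved; without the pause, prev
     * may already be runnable on another pCPU while the save runs,
     * which is exactly the window in which the vcpu_runnable() ASSERT
     * in vmx_find_msr() fires.
     */
    vcpu_pause(prev);
    vpmu_save_force(prev);
    vpmu_reset(vpmu, VPMU_CONTEXT_LOADED);
    vcpu_unpause(prev);

    /* prev's state is now saved; proceed with loading v's own context. */
    vpmu = vcpu_vpmu(v);

The alternative Jan raises, relaxing the assertion rather than pausing,
would presumably take the shape ASSERT(v == current || !vcpu_runnable(v)),
which is safe during context-switch-out because set_current() runs only
after vpmu_switch_from(); his closing question stems from vmx_find_msr()
seemingly covering the v == current case already.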