Re: [PATCH v3] xen/arm: Convert runstate address during hypcall
> On 1 Aug 2020, at 00:03, Stefano Stabellini <sstabellini@xxxxxxxxxx> wrote:
> 
> On Fri, 31 Jul 2020, Bertrand Marquis wrote:
>> Sorry, missed some points in my previous answer.
>> 
>>> On 30 Jul 2020, at 22:50, Julien Grall <julien@xxxxxxx> wrote:
>>> 
>>> Hi Bertrand,
>>> 
>>> To avoid extra work on your side, I would recommend to wait a bit before
>>> sending a new version. It would be good to at least settle the conversation
>>> in v2 regarding the approach taken.
>>> 
>>> On 30/07/2020 11:24, Bertrand Marquis wrote:
>>>> At the moment on Arm, a Linux guest running with KPTI enabled will
>>>> cause the following error when a context switch happens in user mode:
>>>> (XEN) p2m.c:1890: d1v0: Failed to walk page-table va 0xffffff837ebe0cd0
>>>> The error is caused by the virtual address for the runstate area
>>>> registered by the guest only being accessible when the guest is running
>>>> in kernel space when KPTI is enabled.
>>>> To solve this issue, this patch does the translation from virtual
>>>> address to physical address during the hypercall and maps the
>>>> required pages using vmap. This removes the conversion from virtual
>>>> to physical address during the context switch, which solves the
>>>> problem with KPTI.
>>> 
>>> To echo what Jan said on the previous version, this is a change in a stable
>>> ABI and therefore may break existing guests. FAOD, I agree in principle with
>>> the idea. However, we want to explain why breaking the ABI is the *only*
>>> viable solution.
>>> 
>>> From my understanding, it is not possible to fix without an ABI breakage
>>> because the hypervisor doesn't know when the guest will switch back from
>>> userspace to kernel space. The risk is the information provided by the
>>> runstate wouldn't contain accurate information and could affect how the
>>> guest handles stolen time.
>>> 
>>> Additionally, there are a few issues with the current interface:
>>> 1) It assumes the virtual address cannot be re-used by userspace.
>>> Thankfully Linux has a split address space. But this may change with KPTI
>>> in place.
>>> 2) When updating the page-tables, the guest has to go through an invalid
>>> mapping. So the translation may fail at any point.
>>> 
>>> IOW, the existing interface can lead to random memory corruption and
>>> inaccuracy of the stolen time.
>>> 
>>>> This is done only on the arm architecture; the behaviour on x86 is not
>>>> modified by this patch and the address conversion is done as before
>>>> during each context switch.
>>>> This introduces several limitations in comparison to the previous
>>>> behaviour (on arm only):
>>>> - if the guest remaps the area at a different physical address, Xen
>>>> will continue to update the area at the previous physical address. As
>>>> the area is in kernel space and usually defined as a global variable, this
>>>> is something which is believed not to happen. If this is required by a
>>>> guest, it will have to call the hypercall with the new area (even if it
>>>> is at the same virtual address).
>>>> - the area needs to be mapped during the hypercall. For the same reasons
>>>> as for the previous case, even if the area is registered for a different
>>>> vcpu. It is believed that registering an area using an unmapped virtual
>>>> address is not something that is done.
>>> 
>>> It is not clear whether the virtual address refers to the current vCPU or
>>> the vCPU you register the runstate for. From the past discussion, I think
>>> you refer to the former. It would be good to clarify.
>>> 
>>> Additionally, all the new restrictions should be documented in the public
>>> interface. So an OS developer can find the differences between the
>>> architectures.
>>> 
>>> To answer Jan's concern, we certainly don't know all the existing guest
>>> OSes; however, we also need to balance the benefit for a large majority
>>> of the users.
>>> 
>>> From previous discussion, the current approach was deemed to be acceptable
>>> on Arm and, AFAICT, also x86 (see [1]).
>>> 
>>> TBH, I would rather see the approach be common. For that, we would need an
>>> agreement from Andrew and Jan on the approach here. Meanwhile, I think this
>>> is the best approach to address the concern from Arm users.
>>> 
>>>> Inline functions in headers could not be used as the architecture
>>>> domain.h is included before the global domain.h, making it impossible
>>>> to use the struct vcpu inside the architecture header.
>>>> This should not have any performance impact as the hypercall is only
>>>> called once per vcpu usually.
>>>> Signed-off-by: Bertrand Marquis <bertrand.marquis@xxxxxxx>
>>>> ---
>>>> Changes in v2
>>>>   - use vmap to map the pages during the hypercall.
>>>>   - reintroduce initial copy during hypercall.
>>>> Changes in v3
>>>>   - Fix Coding style
>>>>   - Fix vaddr printing on arm32
>>>>   - use write_atomic to modify state_entry_time update bit (only
>>>>     in guest structure as the bit is not used inside Xen copy)
>>>> ---
>>>>  xen/arch/arm/domain.c        | 161 ++++++++++++++++++++++++++++++-----
>>>>  xen/arch/x86/domain.c        |  29 ++++++-
>>>>  xen/arch/x86/x86_64/domain.c |   4 +-
>>>>  xen/common/domain.c          |  19 ++---
>>>>  xen/include/asm-arm/domain.h |   9 ++
>>>>  xen/include/asm-x86/domain.h |  16 ++++
>>>>  xen/include/xen/domain.h     |   5 ++
>>>>  xen/include/xen/sched.h      |  16 +---
>>>>  8 files changed, 206 insertions(+), 53 deletions(-)
>>>> diff --git a/xen/arch/arm/domain.c b/xen/arch/arm/domain.c
>>>> index 31169326b2..8b36946017 100644
>>>> --- a/xen/arch/arm/domain.c
>>>> +++ b/xen/arch/arm/domain.c
>>>> @@ -19,6 +19,7 @@
>>>>  #include <xen/sched.h>
>>>>  #include <xen/softirq.h>
>>>>  #include <xen/wait.h>
>>>> +#include <xen/vmap.h>
>>>>  #include <asm/alternative.h>
>>>>  #include <asm/cpuerrata.h>
>>>> @@ -275,36 +276,156 @@ static void ctxt_switch_to(struct vcpu *n)
>>>>      virt_timer_restore(n);
>>>>  }
>>>> -/* Update per-VCPU guest runstate shared memory area (if registered). */
>>>> -static void update_runstate_area(struct vcpu *v)
>>>> +static void cleanup_runstate_vcpu_locked(struct vcpu *v)
>>>>  {
>>>> -    void __user *guest_handle = NULL;
>>>> +    if ( v->arch.runstate_guest )
>>>> +    {
>>>> +        vunmap((void *)((unsigned long)v->arch.runstate_guest & PAGE_MASK));
>>>> +
>>>> +        put_page(v->arch.runstate_guest_page[0]);
>>>> +
>>>> +        if ( v->arch.runstate_guest_page[1] )
>>>> +            put_page(v->arch.runstate_guest_page[1]);
>>>> +
>>>> +        v->arch.runstate_guest = NULL;
>>>> +    }
>>>> +}
>>>> +
>>>> +void arch_vcpu_cleanup_runstate(struct vcpu *v)
>>>> +{
>>>> +    spin_lock(&v->arch.runstate_guest_lock);
>>>> +
>>>> +    cleanup_runstate_vcpu_locked(v);
>>>> +
>>>> +    spin_unlock(&v->arch.runstate_guest_lock);
>>>> +}
>>>> +
>>>> +static int setup_runstate_vcpu_locked(struct vcpu *v, vaddr_t vaddr)
>>>> +{
>>>> +    unsigned int offset;
>>>> +    mfn_t mfn[2];
>>>> +    struct page_info *page;
>>>> +    unsigned int numpages;
>>>>      struct vcpu_runstate_info runstate;
>>>> +    void *p;
>>>> -    if ( guest_handle_is_null(runstate_guest(v)) )
>>>> -        return;
>>>> +    /* user can pass a NULL address to unregister a previous area */
>>>> +    if ( vaddr == 0 )
>>>> +        return 0;
>>>> +
>>>> +    offset = vaddr & ~PAGE_MASK;
>>>> +
>>>> +    /* provided address must be aligned to a 64bit */
>>>> +    if ( offset % alignof(struct vcpu_runstate_info) )
>>> 
>>> This new restriction wants to be explained in the commit message and public
>>> header.
>> 
>> ok
>> 
>>> 
>>>> +    {
>>>> +        gprintk(XENLOG_WARNING, "Cannot map runstate pointer at 0x%"PRIvaddr
>>>> +                ": Invalid offset\n", vaddr);
>>> 
>>> We usually enforce 80 characters per line except for format strings. So it
>>> is easier to grep them in the code.
>> 
>> Ok, I will fix this one and the following ones.
>> But here PRIvaddr would break any attempt to grep something in fact.
>> 
>>> 
>>>> +        return -EINVAL;
>>>> +    }
>>>> +
>>>> +    page = get_page_from_gva(v, vaddr, GV2M_WRITE);
>>>> +    if ( !page )
>>>> +    {
>>>> +        gprintk(XENLOG_WARNING, "Cannot map runstate pointer at 0x%"PRIvaddr
>>>> +                ": Page is not mapped\n", vaddr);
>>>> +        return -EINVAL;
>>>> +    }
>>>> +
>>>> +    mfn[0] = page_to_mfn(page);
>>>> +    v->arch.runstate_guest_page[0] = page;
>>>> +
>>>> +    if ( offset > (PAGE_SIZE - sizeof(struct vcpu_runstate_info)) )
>>>> +    {
>>>> +        /* guest area is crossing pages */
>>>> +        page = get_page_from_gva(v, vaddr + PAGE_SIZE, GV2M_WRITE);
>>>> +        if ( !page )
>>>> +        {
>>>> +            put_page(v->arch.runstate_guest_page[0]);
>>>> +            gprintk(XENLOG_WARNING,
>>>> +                    "Cannot map runstate pointer at 0x%"PRIvaddr
>>>> +                    ": 2nd Page is not mapped\n", vaddr);
>>>> +            return -EINVAL;
>>>> +        }
>>>> +        mfn[1] = page_to_mfn(page);
>>>> +        v->arch.runstate_guest_page[1] = page;
>>>> +        numpages = 2;
>>>> +    }
>>>> +    else
>>>> +    {
>>>> +        v->arch.runstate_guest_page[1] = NULL;
>>>> +        numpages = 1;
>>>> +    }
>>>> -    memcpy(&runstate, &v->runstate, sizeof(runstate));
>>>> +    p = vmap(mfn, numpages);
>>>> +    if ( !p )
>>>> +    {
>>>> +        put_page(v->arch.runstate_guest_page[0]);
>>>> +        if ( numpages == 2 )
>>>> +            put_page(v->arch.runstate_guest_page[1]);
>>>> -    if ( VM_ASSIST(v->domain, runstate_update_flag) )
>>>> +        gprintk(XENLOG_WARNING, "Cannot map runstate pointer at 0x%"PRIvaddr
>>>> +                ": vmap error\n", vaddr);
>>>> +        return -EINVAL;
>>>> +    }
>>>> +
>>>> +    v->arch.runstate_guest = p + offset;
>>>> +
>>>> +    if (v == current)
>>>> +        memcpy(v->arch.runstate_guest, &v->runstate, sizeof(v->runstate));
>>>> +    else
>>>>      {
>>>> -        guest_handle = &v->runstate_guest.p->state_entry_time + 1;
>>>> -        guest_handle--;
>>>> -        runstate.state_entry_time |= XEN_RUNSTATE_UPDATE;
>>>> -        __raw_copy_to_guest(guest_handle,
>>>> -                            (void *)(&runstate.state_entry_time + 1) - 1, 1);
>>>> -        smp_wmb();
>>>> +        vcpu_runstate_get(v, &runstate);
>>>> +        memcpy(v->arch.runstate_guest, &runstate, sizeof(v->runstate));
>>>>      }
>>>> -    __copy_to_guest(runstate_guest(v), &runstate, 1);
>>>> +    return 0;
>>>> +}
>>>> +
>>>> +int arch_vcpu_setup_runstate(struct vcpu *v,
>>>> +                             struct vcpu_register_runstate_memory_area area)
>>>> +{
>>>> +    int rc;
>>>> +
>>>> +    spin_lock(&v->arch.runstate_guest_lock);
>>>> +
>>>> +    /* cleanup if we are recalled */
>>>> +    cleanup_runstate_vcpu_locked(v);
>>>> +
>>>> +    rc = setup_runstate_vcpu_locked(v, (vaddr_t)area.addr.v);
>>>> +
>>>> +    spin_unlock(&v->arch.runstate_guest_lock);
>>>> -    if ( guest_handle )
>>>> +    return rc;
>>>> +}
>>>> +
>>>> +
>>>> +/* Update per-VCPU guest runstate shared memory area (if registered). */
>>>> +static void update_runstate_area(struct vcpu *v)
>>>> +{
>>>> +    spin_lock(&v->arch.runstate_guest_lock);
>>>> +
>>>> +    if ( v->arch.runstate_guest )
>>>>      {
>>>> -        runstate.state_entry_time &= ~XEN_RUNSTATE_UPDATE;
>>>> -        smp_wmb();
>>>> -        __raw_copy_to_guest(guest_handle,
>>>> -                            (void *)(&runstate.state_entry_time + 1) - 1, 1);
>>>> +        if ( VM_ASSIST(v->domain, runstate_update_flag) )
>>>> +        {
>>>> +            v->runstate.state_entry_time |= XEN_RUNSTATE_UPDATE;
>>>> +            write_atomic(&(v->arch.runstate_guest->state_entry_time),
>>>> +                        v->runstate.state_entry_time);
>>> 
>>> NIT: You want to indent v-> at the same level as the argument from the
>>> first line.
>> 
>> Ok
>> 
>>> 
>>> Also, I think you are missing a smp_wmb() here.
>> 
>> The atomic operation itself would not need a barrier.
>> I do not see why you think a barrier is needed here.
>> For the internal structure?
> 
> We need to make sure the other end sees the XEN_RUNSTATE_UPDATE change
> before other changes. Otherwise, due to CPU reordering, the writes could
> be seen in reverse order. (Technically the reader would have to use a
> read-barrier but that's a separate topic.)

I will add a barrier before the atomic.

Cheers
Bertrand
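
[Editor's note] For readers following the barrier discussion above, here is a
minimal sketch of the ordering Stefano describes: the guest must observe
XEN_RUNSTATE_UPDATE set before any other runstate field changes, and must
observe the final field values before the bit is cleared again. This is an
illustration only, not the actual follow-up patch; the function name
update_runstate_area_sketch is made up, while write_atomic(), smp_wmb(),
VM_ASSIST() are existing Xen primitives and runstate_guest /
runstate_guest_lock are the fields introduced by this patch.

/* Illustrative sketch only: seqlock-like publication of the runstate area. */
static void update_runstate_area_sketch(struct vcpu *v)
{
    spin_lock(&v->arch.runstate_guest_lock);

    if ( v->arch.runstate_guest )
    {
        if ( VM_ASSIST(v->domain, runstate_update_flag) )
        {
            v->runstate.state_entry_time |= XEN_RUNSTATE_UPDATE;
            write_atomic(&v->arch.runstate_guest->state_entry_time,
                         v->runstate.state_entry_time);
            /* Publish the "update in progress" bit before touching the rest. */
            smp_wmb();
        }

        memcpy(v->arch.runstate_guest, &v->runstate, sizeof(v->runstate));

        if ( VM_ASSIST(v->domain, runstate_update_flag) )
        {
            /* Make the new contents visible before clearing the bit. */
            smp_wmb();
            v->runstate.state_entry_time &= ~XEN_RUNSTATE_UPDATE;
            write_atomic(&v->arch.runstate_guest->state_entry_time,
                         v->runstate.state_entry_time);
        }
    }

    spin_unlock(&v->arch.runstate_guest_lock);
}

As Stefano notes, a guest reading the area would pair this with read barriers
and re-check the update bit around its reads.

To illustrate the re-registration requirement mentioned in the commit message
(with this patch Xen latches the physical pages at hypercall time), here is a
rough guest-side sketch using Linux-style hypercall wrappers and headers; the
helper name and the per-CPU variable are hypothetical. The public interface
types (struct vcpu_register_runstate_memory_area, addr.v,
VCPUOP_register_runstate_memory_area) are from Xen's public vcpu.h.

/* Hypothetical guest-side registration sketch (Linux-guest assumptions). */
#include <linux/percpu.h>
#include <xen/interface/vcpu.h>
#include <asm/xen/hypercall.h>

static DEFINE_PER_CPU(struct vcpu_runstate_info, my_runstate);

static int register_runstate_area(unsigned int cpu)
{
    struct vcpu_register_runstate_memory_area area;

    /* Kernel virtual address of the per-vCPU area; must remain mapped.
     * If the guest ever re-maps it to different physical pages, it has to
     * issue this hypercall again, even for the same virtual address. */
    area.addr.v = &per_cpu(my_runstate, cpu);

    return HYPERVISOR_vcpu_op(VCPUOP_register_runstate_memory_area,
                              cpu, &area);
}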