Re: [PATCH v2 1/2] x86/mm: add API for marking only part of a MMIO page read only
On 05.05.2023 23:25, Marek Marczykowski-Górecki wrote:
> In some cases, only few registers on a page needs to be write-protected.
> Examples include USB3 console (64 bytes worth of registers) or MSI-X's
> PBA table (which doesn't need to span the whole table either), although
> in the latter case the spec forbids placing other registers on the same
> page. Current API allows only marking whole pages pages read-only,
> which sometimes may cover other registers that guest may need to
> write into.
>
> Currently, when a guest tries to write to an MMIO page on the
> mmio_ro_ranges, it's either immediately crashed on EPT violation - if
> that's HVM, or if PV, it gets #PF. In case of Linux PV, if access was
> from userspace (like, /dev/mem), it will try to fixup by updating page
> tables (that Xen again will force to read-only) and will hit that #PF
> again (looping endlessly). Both behaviors are undesirable if guest could
> actually be allowed the write.
>
> Introduce an API that allows marking part of a page read-only. Since
> sub-page permissions are not a thing in page tables (they are in EPT,
> but not granular enough), do this via emulation (or simply page fault
> handler for PV) that handles writes that are supposed to be allowed.
> The new subpage_mmio_ro_add() takes a start physical address and the
> region size in bytes. Both start address and the size need to be 8-byte

8-byte (aka qword) here, but ...

> aligned, as a practical simplification (allows using smaller bitmask,
> and a smaller granularity isn't really necessary right now).
> It will internally add relevant pages to mmio_ro_ranges, but if either
> start or end address is not page-aligned, it additionally adds that page
> to a list for sub-page R/O handling. The list holds a bitmask which
> dwords are supposed to be read-only and an address where page is mapped

... dwords here?

> for write emulation - this mapping is done only on the first access. A
> plain list is used instead of more efficient structure, because there
> isn't supposed to be many pages needing this precise r/o control.
>
> The mechanism this API is plugged in is slightly different for PV and
> HVM. For both paths, it's plugged into mmio_ro_emulated_write(). For PV,
> it's already called for #PF on read-only MMIO page. For HVM however, EPT
> violation on p2m_mmio_direct page results in a direct domain_crash().
> To reach mmio_ro_emulated_write(), change how write violations for
> p2m_mmio_direct are handled - specifically, check if they relate to such
> partially protected page via subpage_mmio_write_accept() and if so, call
> hvm_emulate_one_mmio() for them too. This decodes what guest is trying
> write and finally calls mmio_ro_emulated_write(). Note that hitting EPT
> write violation for p2m_mmio_direct page can only happen if the page was
> on mmio_ro_ranges (see ept_p2m_type_to_flags()), so there is no need for
> checking that again.

Yet that's then putting us at a certain risk wrt potential errata. You
also specifically talk about "guests", i.e. more than just hwdom. Adding
another easy access to the emulator (for HVM) comes with a certain risk
of future XSAs, too.
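For reference, going only by the description above, a consumer of the new
interface would presumably look something like the below (untested sketch;
the physical address is made up purely for illustration):

    /*
     * Untested sketch, based only on the commit message: keep a 64-byte
     * register window (e.g. a USB3 console's registers) read-only while
     * letting writes to the rest of the page through.  The address is
     * made up; both arguments need to be 8-byte (SUBPAGE_MMIO_RO_ALIGN)
     * aligned.
     */
    int rc = subpage_mmio_ro_add(0xfedc1000 + 0x480, 64);

    if ( rc )
        printk(XENLOG_WARNING
               "Cannot mark console registers read-only: %d\n", rc);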
> --- a/xen/arch/x86/hvm/hvm.c
> +++ b/xen/arch/x86/hvm/hvm.c
> @@ -1990,6 +1990,14 @@ int hvm_hap_nested_page_fault(paddr_t gpa, unsigned long gla,
>          goto out_put_gfn;
>      }
>  
> +    if ( (p2mt == p2m_mmio_direct) && npfec.write_access && npfec.present &&
> +         subpage_mmio_write_accept(mfn, gla) &&
> +         (hvm_emulate_one_mmio(mfn_x(mfn), gla) == X86EMUL_OKAY) )
> +    {
> +        rc = 1;
> +        goto out_put_gfn;
> +    }

But npfec.write_access set doesn't mean it was a write permission
violation, does it? May I ask that this be accompanied by a comment
discussing the correctness/safety?

> --- a/xen/arch/x86/include/asm/mm.h
> +++ b/xen/arch/x86/include/asm/mm.h
> @@ -522,9 +522,24 @@ extern struct rangeset *mmio_ro_ranges;
>  void memguard_guard_stack(void *p);
>  void memguard_unguard_stack(void *p);
>  
> +/*
> + * Add more precise r/o marking for a MMIO page. Bytes range specified here
> + * will still be R/O, but the rest of the page (nor marked as R/O via another
> + * call) will have writes passed through.
> + * The start address and the size must be aligned to SUBPAGE_MMIO_RO_ALIGN.

With this alignment constraint, the earlier sentence can be read as
contradictory. How about "Byte-granular ranges ..." or "Ranges (using
byte granularity) ..."? I admit even that doesn't resolve the issue
fully, though.

> @@ -4882,6 +4895,243 @@ long arch_memory_op(unsigned long cmd, XEN_GUEST_HANDLE_PARAM(void) arg)
>      return 0;
>  }
>  
> +/* This needs subpage_ro_lock already taken */
> +static int __init subpage_mmio_ro_add_page(
> +    mfn_t mfn, unsigned int offset_s, unsigned int offset_e)
> +{
> +    struct subpage_ro_range *entry = NULL, *iter;
> +    int i;

unsigned int please (as almost always for induction variables).

> +    list_for_each_entry(iter, &subpage_ro_ranges, list)
> +    {
> +        if ( mfn_eq(iter->mfn, mfn) )
> +        {
> +            entry = iter;
> +            break;
> +        }
> +    }
> +    if ( !entry )
> +    {
> +        /* iter==NULL marks it was a newly allocated entry */

Nit: Even in a comment I think it would be nice if style rules were
followed, and hence == was surrounded by blanks.

> +        iter = NULL;
> +        entry = xzalloc(struct subpage_ro_range);
> +        if ( !entry )
> +            return -ENOMEM;
> +        entry->mfn = mfn;
> +    }
> +
> +    for ( i = offset_s; i <= offset_e; i += SUBPAGE_MMIO_RO_ALIGN )
> +        set_bit(i / SUBPAGE_MMIO_RO_ALIGN, entry->ro_qwords);

You're holding a spin lock, so won't __set_bit() suffice here? And then
__clear_bit() below?

> +    if ( !iter )
> +        list_add_rcu(&entry->list, &subpage_ro_ranges);
> +
> +    return 0;
> +}

Since you mark the qwords which are to be protected, how is one to set
up safely two discontiguous ranges on the same page? I think I had asked
for v1 already why you don't do things the other way around: Initially
the entire page is protected, and then writable regions are carved out.
I guess I shouldn't further ask about overlapping r/o ranges and their
cleaning up. But at least a comment towards the restriction would be
nice. Perhaps even check upon registration that no part of the range is
already marked r/o.
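To make that last suggestion more concrete, an untested sketch of such a
registration-time check, which could go right after the lookup loop in
subpage_mmio_ro_add_page() (i.e. before anything gets allocated):

    /*
     * Untested sketch only: a freshly allocated entry has an empty bitmap,
     * so only a pre-existing entry can overlap the new range.  Checking
     * before touching the bitmap means nothing needs undoing on failure.
     */
    if ( entry )
        for ( i = offset_s; i <= offset_e; i += SUBPAGE_MMIO_RO_ALIGN )
            if ( test_bit(i / SUBPAGE_MMIO_RO_ALIGN, entry->ro_qwords) )
                return -EBUSY;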
> +static void __init subpage_mmio_ro_free(struct rcu_head *rcu)
> +{
> +    struct subpage_ro_range *entry = container_of(
> +        rcu, struct subpage_ro_range, rcu);
> +
> +    ASSERT(bitmap_empty(entry->ro_qwords, PAGE_SIZE / SUBPAGE_MMIO_RO_ALIGN));
> +
> +    if ( entry->mapped )
> +        iounmap(entry->mapped);
> +    xfree(entry);
> +}
> +
> +/* This needs subpage_ro_lock already taken */
> +static int __init subpage_mmio_ro_remove_page(
> +    mfn_t mfn,
> +    int offset_s,
> +    int offset_e)
> +{
> +    struct subpage_ro_range *entry = NULL, *iter;
> +    int rc, i;
> +
> +    list_for_each_entry_rcu(iter, &subpage_ro_ranges, list)
> +    {
> +        if ( mfn_eq(iter->mfn, mfn) )
> +        {
> +            entry = iter;
> +            break;
> +        }
> +    }
> +    if ( !entry )
> +        return -ENOENT;

Yet the sole caller doesn't care at all, not even by an assertion.

> +    for ( i = offset_s; i <= offset_e; i += SUBPAGE_MMIO_RO_ALIGN )
> +        clear_bit(i / SUBPAGE_MMIO_RO_ALIGN, entry->ro_qwords);
> +
> +    if ( !bitmap_empty(entry->ro_qwords, PAGE_SIZE / SUBPAGE_MMIO_RO_ALIGN) )
> +        return 0;
> +
> +    list_del_rcu(&entry->list);
> +    call_rcu(&entry->rcu, subpage_mmio_ro_free);

This being an __init function, what is the RCU-ness intended to guard
against?

> +    return rc;

DYM "return 0" here, or maybe even invert the if()'s condition to have
just a single return? "rc" was never written afaics, and the compiler
not spotting it is likely because the caller doesn't inspect the return
value.

> +}
> +
> +

Nit: No double blank lines please.

> +int __init subpage_mmio_ro_add(
> +    paddr_t start,
> +    size_t size)
> +{
> +    mfn_t mfn_start = maddr_to_mfn(start);
> +    paddr_t end = start + size - 1;
> +    mfn_t mfn_end = maddr_to_mfn(end);
> +    int offset_end = 0;

unsigned int again, afaics. Also this can be declared in the more narrow
scope it's used in.

> +    int rc;
> +
> +    ASSERT(IS_ALIGNED(start, SUBPAGE_MMIO_RO_ALIGN));
> +    ASSERT(IS_ALIGNED(size, SUBPAGE_MMIO_RO_ALIGN));

Not meeting the first assertion's condition (thinking of a release
build) is kind of okay, as too large a range will be protected. But for
the 2nd too small a range would be covered aiui, so this may want
dealing with in a release-build-safe way.

> +    if ( !size )
> +        return 0;
> +
> +    rc = rangeset_add_range(mmio_ro_ranges, mfn_x(mfn_start), mfn_x(mfn_end));
> +    if ( rc )
> +        return rc;
> +
> +    spin_lock(&subpage_ro_lock);
> +
> +    if ( PAGE_OFFSET(start) ||
> +         (mfn_eq(mfn_start, mfn_end) && PAGE_OFFSET(end) != PAGE_SIZE - 1) )
> +    {
> +        offset_end = mfn_eq(mfn_start, mfn_end) ?
> +                     PAGE_OFFSET(end) :
> +                     (PAGE_SIZE - 1);
> +        rc = subpage_mmio_ro_add_page(mfn_start,
> +                                      PAGE_OFFSET(start),
> +                                      offset_end);
> +        if ( rc )
> +            goto err_unlock;
> +    }
> +
> +    if ( !mfn_eq(mfn_start, mfn_end) && PAGE_OFFSET(end) != PAGE_SIZE - 1 )
> +    {
> +        rc = subpage_mmio_ro_add_page(mfn_end, 0, PAGE_OFFSET(end));
> +        if ( rc )
> +            goto err_unlock_remove;
> +    }
> +
> +    spin_unlock(&subpage_ro_lock);
> +
> +    return 0;
> +
> + err_unlock_remove:
> +    if ( offset_end )
> +        subpage_mmio_ro_remove_page(mfn_start, PAGE_OFFSET(start), offset_end);
> +
> + err_unlock:
> +    spin_unlock(&subpage_ro_lock);
> +    if ( rangeset_remove_range(mmio_ro_ranges, mfn_x(mfn_start), mfn_x(mfn_end)) )
> +        printk(XENLOG_ERR "Failed to cleanup on failed subpage_mmio_ro_add()\n");
> +    return rc;
> +}

None of the failures here is particularly likely, so perhaps all is fine
as you have it. But there would be an alternative of retaining the
mmio_ro_ranges entry/entries, allowing the caller to "ignore" the error.
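As for the release-build point on the size assertion, one possible way of
dealing with it (untested sketch, assuming the generic ROUNDUP() macro is
usable here) would be to widen rather than shrink the protected range:

    ASSERT(IS_ALIGNED(start, SUBPAGE_MMIO_RO_ALIGN));
    ASSERT(IS_ALIGNED(size, SUBPAGE_MMIO_RO_ALIGN));

    /*
     * Untested sketch: in a release build an unaligned size would otherwise
     * leave the tail of the requested range unprotected, so round it up to
     * the next SUBPAGE_MMIO_RO_ALIGN boundary, erring on the safe side.
     */
    size = ROUNDUP(size, SUBPAGE_MMIO_RO_ALIGN);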
> +static void __iomem *subpage_mmio_get_page(struct subpage_ro_range *entry)
> +{
> +    if ( entry->mapped )
> +        return entry->mapped;
> +
> +    spin_lock(&subpage_ro_lock);
> +    /* Re-check under the lock */
> +    if ( entry->mapped )
> +        goto out_unlock;
> +
> +    entry->mapped = ioremap(mfn_x(entry->mfn) << PAGE_SHIFT, PAGE_SIZE);
> +
> + out_unlock:
> +    spin_unlock(&subpage_ro_lock);
> +    return entry->mapped;
> +}

This is easy to deal with without any "goto". I'm further inclined to
request that the ioremap() occur without the lock held, followed by an
iounmap() (after dropping the lock) if in fact the mapping wasn't needed
(anymore).

> +static void subpage_mmio_write_emulate(
> +    mfn_t mfn,
> +    unsigned int offset,
> +    const void *data,
> +    unsigned int len)
> +{
> +    struct subpage_ro_range *entry;
> +    void __iomem *addr;
> +
> +    rcu_read_lock(&subpage_ro_rcu);
> +
> +    list_for_each_entry_rcu(entry, &subpage_ro_ranges, list)
> +    {
> +        if ( mfn_eq(entry->mfn, mfn) )
> +        {
> +            if ( test_bit(offset / SUBPAGE_MMIO_RO_ALIGN, entry->ro_qwords) )
> +                goto write_ignored;

I think you can get away with just a single "goto" by putting the
gprintk() (and its label) here.

> +            addr = subpage_mmio_get_page(entry);
> +            if ( !addr )
> +            {
> +                gprintk(XENLOG_ERR,
> +                        "Failed to map page for MMIO write at 0x%"PRI_mfn"%03x\n",
> +                        mfn_x(mfn), offset);
> +                goto out_unlock;
> +            }
> +
> +            switch ( len )
> +            {
> +            case 1:
> +                writeb(*(uint8_t*)data, addr);
> +                break;
> +            case 2:
> +                writew(*(uint16_t*)data, addr);
> +                break;
> +            case 4:
> +                writel(*(uint32_t*)data, addr);
> +                break;
> +            case 8:
> +                writeq(*(uint64_t*)data, addr);
> +                break;

Please avoid casting away const-ness.

> +            default:
> +                /* mmio_ro_emulated_write() already validated the size */
> +                ASSERT_UNREACHABLE();
> +                goto write_ignored;
> +            }
> +            goto out_unlock;
> +        }
> +    }
> +    /* Do not print message for pages without any writable parts. */
> +    goto out_unlock;
> +
> + write_ignored:
> +    gprintk(XENLOG_WARNING,
> +        "ignoring write to R/O MMIO 0x%"PRI_mfn"%03x len %u\n",
> +        mfn_x(mfn), offset, len);

Nit: Indentation.

> + out_unlock:
> +    rcu_read_unlock(&subpage_ro_rcu);
> +}
> +
> +bool subpage_mmio_write_accept(mfn_t mfn, unsigned long gla)
> +{
> +    unsigned int offset = PAGE_OFFSET(gla);
> +    const struct subpage_ro_range *entry;
> +
> +    rcu_read_lock(&subpage_ro_rcu);
> +
> +    list_for_each_entry_rcu(entry, &subpage_ro_ranges, list)
> +        if ( mfn_eq(entry->mfn, mfn) &&
> +             !test_bit(offset / SUBPAGE_MMIO_RO_ALIGN, entry->ro_qwords) )
> +        {
> +            /*
> +             * We don't know the write seize at this point yet, so it could be

Nit: "size" I assume.

> +             * an unalligned write, but accept it here anyway and deal with it
> +             * later.
> +             */

Have I overlooked where unaligned writes are dealt with? Also nit:
"unaligned".

Jan
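P.S. Regarding subpage_mmio_get_page() further up, what I have in mind is
roughly the below (untested sketch only, keeping the existing naming):

    static void __iomem *subpage_mmio_get_page(struct subpage_ro_range *entry)
    {
        void __iomem *mapped;

        if ( entry->mapped )
            return entry->mapped;

        /* Map with the lock dropped; ioremap() is comparatively heavyweight. */
        mapped = ioremap(mfn_x(entry->mfn) << PAGE_SHIFT, PAGE_SIZE);

        spin_lock(&subpage_ro_lock);
        if ( !entry->mapped )
        {
            entry->mapped = mapped;
            mapped = NULL;      /* Ownership transferred to the entry. */
        }
        spin_unlock(&subpage_ro_lock);

        /* If another CPU beat us to it, drop the now-redundant mapping. */
        if ( mapped )
            iounmap(mapped);

        return entry->mapped;
    }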