Xen project Mailing List

Re: [PATCH 1/2] x86/mm: add API for marking only part of a MMIO page read only

To: Roger Pau Monné <roger.pau@xxxxxxxxxx>

From: Marek Marczykowski-Górecki <marmarek@xxxxxxxxxxxxxxxxxxxxxx>

Date: Tue, 28 Mar 2023 16:49:36 +0200

Cc: xen-devel@xxxxxxxxxxxxxxxxxxxx, Paul Durrant <paul@xxxxxxx>, Jan Beulich <jbeulich@xxxxxxxx>, Andrew Cooper <andrew.cooper3@xxxxxxxxxx>, Wei Liu <wl@xxxxxxx>

Delivery-date: Tue, 28 Mar 2023 14:49:52 +0000

Feedback-id: i1568416f:Fastmail

List-id: Xen developer discussion <xen-devel.lists.xenproject.org>

On Tue, Mar 28, 2023 at 04:04:47PM +0200, Roger Pau Monné wrote: > On Mon, Mar 27, 2023 at 12:09:15PM +0200, Marek Marczykowski-Górecki wrote: > > In some cases, only few registers on a page needs to be write-protected. > > Examples include USB3 console (64 bytes worth of registers) or MSI-X's > > PBA table (which doesn't need to span the whole table either). > > Current API allows only marking whole pages pages read-only, which > > sometimes may cover other registers that guest may need to write into. > > > > Currently, when a guest tries to write to an MMIO page on the > > mmio_ro_ranges, it's either immediately crashed on EPT violation - if > > that's HVM, or if PV, it gets #PF. In case of Linux PV, if access was > > from userspace (like, /dev/mem), it will try to fixup by updating page > > tables (that Xen again will force to read-only) and will hit that #PF > > again (looping endlessly). Both behaviors are undesirable if guest could > > actually be allowed the write. > > > > Introduce an API that allows marking part of a page read-only. Since > > sub-page permissions are not a thing in page tables, do this via > > emulation (or simply page fault handler for PV) that handles writes that > > are supposed to be allowed. Those writes require the page to be mapped > > to Xen, so subpage_mmio_ro_add() function takes fixmap index of the > > page. The page needs to be added to mmio_ro_ranges, first anyway. > > Sub-page ranges are stored using rangeset for each added page, and those > > pages are stored on a plain list (as there isn't supposed to be many > > pages needing this precise r/o control). > > Since mmio_ro_ranges is x86 only, it is possible to mutate > it to track address ranges instead of page frames. The current type > is unsigned long, so that should be fine, and would avoid having to > create a per-page rangeset to just track offsets. I was thinking about it, but rangeset doesn't allow attaching extra data (fixmap index, or mapped address as you propose with ioremap()). Changing all the places where mmio_ro_ranges is used will be a bit tedious, but that isn't really a problem. > > The mechanism this API is plugged in is slightly different for PV and > > HVM. For both paths, it's plugged into mmio_ro_emulated_write(). For PV, > > it's already called for #PF on read-only MMIO page. For HVM however, EPT > > violation on p2m_mmio_direct page results in a direct domain_crash(). > > To reach mmio_ro_emulated_write(), change how write violations for > > p2m_mmio_direct are handled - specifically, treat them similar to > > p2m_ioreq_server. This makes relevant ioreq handler being called, > > that finally end up calling mmio_ro_emulated_write(). > > Both of those paths need an MFN to which guest tried to write (to check > > which part of the page is supposed to be read-only, and where > > the page is mapped for writes). This information currently isn't > > available directly in mmio_ro_emulated_write(), but in both cases it is > > already resolved somewhere higher in the call tree. Pass it down to > > mmio_ro_emulated_write() via new mmio_ro_emulate_ctxt.mfn field. > > > > Signed-off-by: Marek Marczykowski-Górecki <marmarek@xxxxxxxxxxxxxxxxxxxxxx> > > --- > > Shadow mode is not tested, but I don't expect it to work differently than > > HAP in areas related to this patch. > > The used locking should make it safe to use similar to mmio_ro_ranges, > > but frankly the only use (introduced in the next patch) could go without > > locking at all, as subpage_mmio_ro_add() is called only before any > > domain is constructed and subpage_mmio_ro_remove() is never called. > > --- > > xen/arch/x86/hvm/emulate.c | 2 +- > > xen/arch/x86/hvm/hvm.c | 3 +- > > xen/arch/x86/include/asm/mm.h | 22 ++++- > > xen/arch/x86/mm.c | 181 +++++++++++++++++++++++++++++++++- > > xen/arch/x86/pv/ro-page-fault.c | 1 +- > > 5 files changed, 207 insertions(+), 2 deletions(-) > > > > diff --git a/xen/arch/x86/hvm/emulate.c b/xen/arch/x86/hvm/emulate.c > > index 95364deb1996..311102724dea 100644 > > --- a/xen/arch/x86/hvm/emulate.c > > +++ b/xen/arch/x86/hvm/emulate.c > > @@ -2733,7 +2733,7 @@ int hvm_emulate_one_mmio(unsigned long mfn, unsigned > > long gla) > > .write = mmio_ro_emulated_write, > > .validate = hvmemul_validate, > > }; > > - struct mmio_ro_emulate_ctxt mmio_ro_ctxt = { .cr2 = gla }; > > + struct mmio_ro_emulate_ctxt mmio_ro_ctxt = { .cr2 = gla, .mfn = > > _mfn(mfn) }; > > struct hvm_emulate_ctxt ctxt; > > const struct x86_emulate_ops *ops; > > unsigned int seg, bdf; > > diff --git a/xen/arch/x86/hvm/hvm.c b/xen/arch/x86/hvm/hvm.c > > index d326fa1c0136..f1c928e3e4ee 100644 > > --- a/xen/arch/x86/hvm/hvm.c > > +++ b/xen/arch/x86/hvm/hvm.c > > @@ -1942,7 +1942,8 @@ int hvm_hap_nested_page_fault(paddr_t gpa, unsigned > > long gla, > > */ > > if ( (p2mt == p2m_mmio_dm) || > > (npfec.write_access && > > - (p2m_is_discard_write(p2mt) || (p2mt == p2m_ioreq_server))) ) > > + (p2m_is_discard_write(p2mt) || (p2mt == p2m_ioreq_server) || > > + p2mt == p2m_mmio_direct)) ) > > { > > if ( !handle_mmio_with_translation(gla, gpa >> PAGE_SHIFT, npfec) ) > > hvm_inject_hw_exception(TRAP_gp_fault, 0); > > diff --git a/xen/arch/x86/include/asm/mm.h b/xen/arch/x86/include/asm/mm.h > > index db29e3e2059f..91937d556bac 100644 > > --- a/xen/arch/x86/include/asm/mm.h > > +++ b/xen/arch/x86/include/asm/mm.h > > @@ -522,9 +522,31 @@ extern struct rangeset *mmio_ro_ranges; > > void memguard_guard_stack(void *p); > > void memguard_unguard_stack(void *p); > > > > +/* > > + * Add more precise r/o marking for a MMIO page. Bytes range specified here > > + * will still be R/O, but the rest of the page (nor marked as R/O via > > another > > + * call) will have writes passed through. The write passthrough requires > > + * providing fixmap entry by the caller. > > + * Since multiple callers can mark different areas of the same page, they > > might > > + * provide different fixmap entries (although that's very unlikely in > > + * practice). Only the one provided by the first caller will be used. > > Return value > > + * indicates whether this fixmap entry will be used, or a different one > > + * provided earlier (in which case the caller might decide to release it). > > Why not use ioremap() to map the page instead of requiring a fixmap > entry? In all the cases this feature is used (for now), I do have a fixmap anyway. So, I don't need to worry if I can call ioremap() at that boot stage (I think it's okay in console_init_postirq(), but that may not be obvious in other places). > > + * > > + * Return values: > > + * - negative: error > > + * - 0: success, fixmap entry is claimed > > + * - 1: success, fixmap entry set earlier will be used > > + */ > > +int subpage_mmio_ro_add(mfn_t mfn, unsigned long offset_s, > > + unsigned long offset_e, int fixmap_idx); > > +int subpage_mmio_ro_remove(mfn_t mfn, unsigned long offset_s, > > + unsigned long offset_e); > > + > > struct mmio_ro_emulate_ctxt { > > unsigned long cr2; > > unsigned int seg, bdf; > > + mfn_t mfn; > > }; > > > > int cf_check mmio_ro_emulated_write( > > diff --git a/xen/arch/x86/mm.c b/xen/arch/x86/mm.c > > index 0fe14faa5fa7..b50bdee40b6b 100644 > > --- a/xen/arch/x86/mm.c > > +++ b/xen/arch/x86/mm.c > > @@ -165,6 +165,19 @@ bool __read_mostly machine_to_phys_mapping_valid; > > > > struct rangeset *__read_mostly mmio_ro_ranges; > > > > +/* Handling sub-page read-only MMIO regions */ > > +struct subpage_ro_range { > > + struct list_head list; > > + mfn_t mfn; > > + int fixmap_idx; > > + struct rangeset *ro_bytes; > > + struct rcu_head rcu; > > +}; > > + > > +static LIST_HEAD(subpage_ro_ranges); > > +static DEFINE_RCU_READ_LOCK(subpage_ro_rcu); > > +static DEFINE_SPINLOCK(subpage_ro_lock); > > + > > static uint32_t base_disallow_mask; > > /* Global bit is allowed to be set on L1 PTEs. Intended for user mappings. > > */ > > #define L1_DISALLOW_MASK ((base_disallow_mask | _PAGE_GNTTAB) & > > ~_PAGE_GLOBAL) > > @@ -4893,6 +4906,172 @@ long arch_memory_op(unsigned long cmd, > > XEN_GUEST_HANDLE_PARAM(void) arg) > > return 0; > > } > > > > +int subpage_mmio_ro_add( > > + mfn_t mfn, > > + unsigned long offset_s, > > + unsigned long offset_e, > > Since those are page offset, you can likely use unsigned int rather > than long? > > I also wonder why not use [paddr_t start, paddr_t end] (or start and > size) and drop the mfn parameter? You can certainly get the mfn from > the full address, and it seems more natural that having the caller > pass an mfn and two offsets. That would work for the function parameters indeed, regardless of what's really stored. > > + int fixmap_idx) > > +{ > > + struct subpage_ro_range *entry = NULL, *iter; > > + int rc; > > + > > + ASSERT(rangeset_contains_singleton(mmio_ro_ranges, mfn_x(mfn))); > > + ASSERT(offset_s < PAGE_SIZE); > > + ASSERT(offset_e < PAGE_SIZE); > > + > > + spin_lock(&subpage_ro_lock); > > + > > + list_for_each_entry( iter, &subpage_ro_ranges, list ) > > + { > > + if ( mfn_eq(iter->mfn, mfn) ) > > + { > > + entry = iter; > > + break; > > + } > > + } > > + if ( !entry ) > > + { > > + /* iter==NULL marks it was a newly allocated entry */ > > + iter = NULL; > > + entry = xmalloc(struct subpage_ro_range); > > + rc = -ENOMEM; > > + if ( !entry ) > > + goto err_unlock; > > + entry->mfn = mfn; > > + entry->fixmap_idx = fixmap_idx; > > + entry->ro_bytes = rangeset_new(NULL, "subpage r/o mmio", > > + RANGESETF_prettyprint_hex); > > + rc = -ENOMEM; > > rc will already be -ENOMEM, albeit doing error handling this way is > tricky... > > > + if ( !entry->ro_bytes ) > > + goto err_unlock; > > + } > > + > > + rc = rangeset_add_range(entry->ro_bytes, offset_s, offset_e); > > + if ( rc < 0 ) > > + goto err_unlock; > > + > > + if ( !iter ) > > + list_add_rcu(&entry->list, &subpage_ro_ranges); > > + > > + spin_unlock(&subpage_ro_lock); > > + > > + if ( !iter || entry->fixmap_idx == fixmap_idx ) > > + return 0; > > + else > > + return 1; > > + > > +err_unlock: > > + spin_unlock(&subpage_ro_lock); > > + if ( !iter ) > > + { > > + if ( entry ) > > + { > > + if ( entry->ro_bytes ) > > + rangeset_destroy(entry->ro_bytes); > > + xfree(entry); > > + } > > + } > > + return rc; > > +} > > + > > +static void subpage_mmio_ro_free(struct rcu_head *rcu) > > +{ > > + struct subpage_ro_range *entry = container_of(rcu, struct > > subpage_ro_range, rcu); > > + > > + rangeset_destroy(entry->ro_bytes); > > + xfree(entry); > > +} > > + > > +int subpage_mmio_ro_remove( > > + mfn_t mfn, > > + unsigned long offset_s, > > + unsigned long offset_e) > > +{ > > + struct subpage_ro_range *entry = NULL, *iter; > > + int rc; > > + > > + ASSERT(offset_s < PAGE_SIZE); > > + ASSERT(offset_e < PAGE_SIZE); > > + > > + spin_lock(&subpage_ro_lock); > > + > > + list_for_each_entry_rcu( iter, &subpage_ro_ranges, list ) > > + { > > + if ( mfn_eq(iter->mfn, mfn) ) > > + { > > + entry = iter; > > + break; > > + } > > + } > > + rc = -ENOENT; > > + if ( !entry ) > > + goto out_unlock; > > + > > + rc = rangeset_remove_range(entry->ro_bytes, offset_s, offset_e); > > + if ( rc < 0 ) > > + goto out_unlock; > > + > > + rc = 0; > > You can use `if ( rc ) goto out_unlock;` and that avoids having to > explicitly set rc to 0. > > > + > > + if ( !rangeset_is_empty(entry->ro_bytes) ) > > + goto out_unlock; > > + > > + list_del_rcu(&entry->list); > > + call_rcu(&entry->rcu, subpage_mmio_ro_free); > > + > > +out_unlock: > > + spin_unlock(&subpage_ro_lock); > > + return rc; > > +} > > + > > +static void subpage_mmio_write_emulate( > > + mfn_t mfn, > > + unsigned long offset, > > + void *data, > > + unsigned int len) > > +{ > > + struct subpage_ro_range *entry; > > const. > > > + void __iomem *addr; > > Do we care about the access being aligned? I don't think Xen cares about it when page is mapped R/W to the guest, so why should it care when it's partially R/W only? > > > + > > + rcu_read_lock(&subpage_ro_rcu); > > + > > + list_for_each_entry_rcu( entry, &subpage_ro_ranges, list ) > > + { > > + if ( mfn_eq(entry->mfn, mfn) ) > > + { > > You need to take the spinlock at this point, since the contents of > entry->ro_bytes are not protected by the RCU. The RCU only provides > safe iteration over the list, but not the content of the elements on > the list. mfn is not supposed to change ever on the specific list element, and IIUC, rangeset does locking itself. Am I missing something? -- Best Regards, Marek Marczykowski-Górecki Invisible Things Lab

Attachment: signature.asc
Description: PGP signature

©2013 Xen Project, A Linux Foundation Collaborative Project. All Rights Reserved.
Linux Foundation is a registered trademark of The Linux Foundation.
Xen Project is a trademark of The Linux Foundation.