[Date Prev][Date Next][Thread Prev][Thread Next][Date Index][Thread Index]

Re: [PATCH v5 6/7] xen/riscv: page table handling



On Tue, 2024-08-27 at 17:00 +0200, Jan Beulich wrote:
> On 21.08.2024 18:06, Oleksii Kurochko wrote:
> > Implement map_pages_to_xen() which requires several
> > functions to manage page tables and entries:
> > - pt_update()
> > - pt_mapping_level()
> > - pt_update_entry()
> > - pt_next_level()
> > - pt_check_entry()
> > 
> > To support these operations, add functions for creating,
> > mapping, and unmapping Xen tables:
> > - create_table()
> > - map_table()
> > - unmap_table()
> > 
> > Introduce internal macros starting with PTE_* for convenience.
> > These macros closely resemble PTE bits, with the exception of
> > PTE_SMALL, which indicates that 4KB is needed.
> 
> What macros are you talking about here? Is this partially stale, as
> only PTE_SMALL and PTE_POPULATE (and a couple of masks) are being
> added?
I am speaking about macros connected to masks:
     #define PTE_R_MASK(x)   ((x) & PTE_READABLE)
     #define PTE_W_MASK(x)   ((x) & PTE_WRITABLE)
     #define PTE_X_MASK(x)   ((x) & PTE_EXECUTABLE)
   
     #define PTE_RWX_MASK(x) ((x) & (PTE_READABLE | PTE_WRITABLE |
   PTE_EXECUTABLE))

> 
> > --- a/xen/arch/riscv/include/asm/flushtlb.h
> > +++ b/xen/arch/riscv/include/asm/flushtlb.h
> > @@ -5,12 +5,24 @@
> >  #include <xen/bug.h>
> >  #include <xen/cpumask.h>
> >  
> > +#include <asm/sbi.h>
> > +
> >  /* Flush TLB of local processor for address va. */
> >  static inline void flush_tlb_one_local(vaddr_t va)
> >  {
> >      asm volatile ( "sfence.vma %0" :: "r" (va) : "memory" );
> >  }
> >  
> > +/*
> > + * Flush a range of VA's hypervisor mappings from the TLB of all
> > + * processors in the inner-shareable domain.
> > + */
> 
> Isn't inner-sharable an Arm term? Don't you simply mean "all" here?
Yes, this is Arm term. It should used "all" instead. Thanks.

> 
> > @@ -68,6 +111,20 @@ static inline bool pte_is_valid(pte_t p)
> >      return p.pte & PTE_VALID;
> >  }
> >  
> > +inline bool pte_is_table(const pte_t p)
> > +{
> > +    return ((p.pte & (PTE_VALID |
> > +                      PTE_READABLE |
> > +                      PTE_WRITABLE |
> > +                      PTE_EXECUTABLE)) == PTE_VALID);
> > +}
> 
> In how far is the READABLE check valid here? You (imo correctly) ...
> 
> > +static inline bool pte_is_mapping(const pte_t p)
> > +{
> > +    return (p.pte & PTE_VALID) &&
> > +           (p.pte & (PTE_WRITABLE | PTE_EXECUTABLE));
> > +}
> 
> ... don't consider this bit here.
pte_is_mapping() seems to me is correct as according to the RISC-V
privileged spec:
   4. Otherwise, the PTE is valid. If pte.r = 1 or pte.x = 1, go to step 
   5. Otherwise, this PTE is a pointer to the next level of the page   
   table.
   5. A leaf PTE has been found. ...

and regarding pte_is_table() READABLE check is valid as we have to
check only that pte.r = pte.x = 0. WRITABLE check should be dropped. Or
just use define pte_is_table() as:
   inline bool pte_is_table(const pte_t p)
   {
        return !pte_is_mapping(p);
   }


> 
> > --- a/xen/arch/riscv/include/asm/riscv_encoding.h
> > +++ b/xen/arch/riscv/include/asm/riscv_encoding.h
> > @@ -164,6 +164,7 @@
> >  #define SSTATUS_SD                 SSTATUS64_SD
> >  #define SATP_MODE                  SATP64_MODE
> >  #define SATP_MODE_SHIFT                    SATP64_MODE_SHIFT
> > +#define SATP_PPN_MASK                      _UL(0x00000FFFFFFFFFFF)
> >  
> >  #define HGATP_PPN                  HGATP64_PPN
> >  #define HGATP_VMID_SHIFT           HGATP64_VMID_SHIFT
> 
> This looks odd, padding-wise, but that's because hard tabs are being
> used here. Is that intentional?
I use tabs here because riscv_encoding.h was copied from Linux kernel
which uses hard tabs and definitions above use 3 tabs so I used 3 hard
tabs too.

> 
> > --- /dev/null
> > +++ b/xen/arch/riscv/pt.c
> > @@ -0,0 +1,420 @@
> > +#include <xen/bug.h>
> > +#include <xen/domain_page.h>
> > +#include <xen/errno.h>
> > +#include <xen/mm.h>
> > +#include <xen/mm-frame.h>
> > +#include <xen/pmap.h>
> > +#include <xen/spinlock.h>
> > +
> > +#include <asm/flushtlb.h>
> > +#include <asm/page.h>
> > +
> > +static inline const mfn_t get_root_page(void)
> > +{
> > +    paddr_t root_maddr = (csr_read(CSR_SATP) & SATP_PPN_MASK) <<
> > PAGE_SHIFT;
> > +
> > +    return maddr_to_mfn(root_maddr);
> > +}
> > +
> > +/* Sanity check of the entry. */
> > +static bool pt_check_entry(pte_t entry, mfn_t mfn, unsigned int
> > flags)
> > +{
> > +    /*
> > +     * See the comment about the possible combination of (mfn,
> > flags) in
> > +     * the comment above pt_update().
> > +     */
> > +
> > +    /* Sanity check when modifying an entry. */
> > +    if ( (flags & PTE_VALID) && mfn_eq(mfn, INVALID_MFN) )
> > +    {
> > +        /* We don't allow modifying an invalid entry. */
> > +        if ( !pte_is_valid(entry) )
> > +        {
> > +            printk("Modifying invalid entry is not allowed.\n");
> 
> Perhaps all of these printk()s should be dprintk()?
It could be dprintk() but at the same time I don't see any issue if it
will be printed once.

>  And not have a full
> stop?
By "full stop," do you mean something like panic() or BUG_ON()? The
error is propagated up to the caller, which then calls panic().
Anexample of this is:
       if ( (offset + size) > MB(2) )
       {
           rc = map_pages_to_xen(BOOT_FDT_VIRT_START + MB(2),
                                 maddr_to_mfn(base_paddr + MB(2)),
                                 MB(2) >> PAGE_SHIFT,
                                 PAGE_HYPERVISOR_RO);
           if ( rc )
               panic("Unable to map the device-tree\n");
       }
If it would be better for some reason to call panic() or BUG_ON() as
soon as pt_check_entry() returns false, I can do it that way as well.

> 
> > +            return false;
> > +        }
> > +
> > +        /* We don't allow modifying a table entry */
> > +        if ( pte_is_table(entry) )
> > +        {
> > +            printk("Modifying a table entry is not allowed.\n");
> > +            return false;
> > +        }
> > +    }
> > +    /* Sanity check when inserting a mapping */
> > +    else if ( flags & PTE_VALID )
> > +    {
> > +        /* We should be here with a valid MFN. */
> > +        ASSERT(!mfn_eq(mfn, INVALID_MFN));
> 
> This is odd to have here, considering the if() further up.
Agree, ASSERT() could be drop.

> 
> > +        /*
> > +         * We don't allow replacing any valid entry.
> > +         *
> > +         * Note that the function pt_update() relies on this
> > +         * assumption and will skip the TLB flush (when Svvptc
> > +         * extension will be ratified). The function will need
> > +         * to be updated if the check is relaxed.
> > +         */
> > +        if ( pte_is_valid(entry) )
> > +        {
> > +            if ( pte_is_mapping(entry) )
> > +                printk("Changing MFN for a valid entry is not
> > allowed (%#"PRI_mfn" -> %#"PRI_mfn").\n",
> > +                       mfn_x(mfn_from_pte(entry)), mfn_x(mfn));
> > +            else
> > +                printk("Trying to replace a table with a
> > mapping.\n");
> > +            return false;
> > +        }
> > +    }
> > +    /* Sanity check when removing a mapping. */
> > +    else if ( (flags & (PTE_VALID | PTE_POPULATE)) == 0 )
> 
> The PTE_VALID part of the check is pointless considering the earlier
> if(). I guess you may want to have it for doc purposes ...
Yes, it just helps to read the code and understand "confusing" if's()
above.

> 
> Since further up you're using "else if ( flags & PTE_VALID )" imo
> here you want to use "else if ( !(flags & ...) )".
> 
> > +    {
> > +        /* We should be here with an invalid MFN. */
> > +        ASSERT(mfn_eq(mfn, INVALID_MFN));
> > +
> > +        /* We don't allow removing a table */
> > +        if ( pte_is_table(entry) )
> > +        {
> > +            printk("Removing a table is not allowed.\n");
> > +            return false;
> > +        }
> 
> Is this restriction temporary?
Yes.

> 
> > +    }
> > +    /* Sanity check when populating the page-table. No check so
> > far. */
> > +    else
> > +    {
> > +        ASSERT(flags & PTE_POPULATE);
> 
> This again is redundant with earlier if() conditions.
> 
> > +#define XEN_TABLE_MAP_FAILED 0
> > +#define XEN_TABLE_SUPER_PAGE 1
> > +#define XEN_TABLE_NORMAL 2
> > +
> > +/*
> > + * Take the currently mapped table, find the corresponding entry,
> > + * and map the next table, if available.
> > + *
> > + * The alloc_tbl parameters indicates whether intermediate tables
> > should
> > + * be allocated when not present.
> > + *
> > + * Return values:
> > + *  XEN_TABLE_MAP_FAILED: Either alloc_only was set and the entry
> > + *  was empty, or allocating a new page failed.
> > + *  XEN_TABLE_NORMAL: next level or leaf mapped normally
> > + *  XEN_TABLE_SUPER_PAGE: The next entry points to a superpage.
> > + */
> > +static int pt_next_level(bool alloc_tbl, pte_t **table, unsigned
> > int offset)
> 
> Having the boolean first is unusual, but well - it's your choice.
> 
> > +{
> > +    pte_t *entry;
> > +    int ret;
> > +    mfn_t mfn;
> > +
> > +    entry = *table + offset;
> > +
> > +    if ( !pte_is_valid(*entry) )
> > +    {
> > +        if ( alloc_tbl )
> > +            return XEN_TABLE_MAP_FAILED;
> 
> Is this condition meant to be inverted?
if alloc_tbl = true we shouldn't allocatetable as:
     * The intermediate page table shouldn't be allocated when MFN
isn't
     * valid and we are not populating page table.
...
    */
    bool alloc_tbl = mfn_eq(mfn, INVALID_MFN) && !(flags &
PTE_POPULATE);

So if mfn = INVALID_MFN and flags.PTE_POPULATE=0 it means that this
table shouldn't be allocated and thereby pt_next_level() should return
XEN_TABLE_MAP_FAILED.

Or to invert if ( alloc_tbl )it will be needed to invert defintion of
alloc_tbl:
 bool alloc_tbl = !mfn_eq(mfn, INVALID_MFN) || (flags & PTE_POPULATE);
> 
> > +        ret = create_table(entry);
> > +        if ( ret )
> > +            return XEN_TABLE_MAP_FAILED;
> 
> You don't really use "ret". Why not omit the local variable, even
> more so that it has too wide scope?
I'll omit that, it is really useless.

> 
> > +/* Update an entry at the level @target. */
> > +static int pt_update_entry(mfn_t root, unsigned long virt,
> > +                           mfn_t mfn, unsigned int target,
> > +                           unsigned int flags)
> > +{
> > +    int rc;
> > +    unsigned int level = HYP_PT_ROOT_LEVEL;
> > +    pte_t *table;
> > +    /*
> > +     * The intermediate page table shouldn't be allocated when MFN
> > isn't
> > +     * valid and we are not populating page table.
> > +     * This means we either modify permissions or remove an entry,
> > or
> > +     * inserting brand new entry.
> > +     *
> > +     * See the comment above pt_update() for an additional
> > explanation about
> > +     * combinations of (mfn, flags).
> > +    */
> > +    bool alloc_tbl = mfn_eq(mfn, INVALID_MFN) && !(flags &
> > PTE_POPULATE);
> 
> Is this meant to be inverted, too (to actually match variable name
> and
> comment)?
Oh, you mentioned that too. I wrote the similar above. I think it would
be better to invert if we want to use alloc_tbl variable name.

> 
> > +            break;
> > +    }
> > +
> > +    if ( level != target )
> > +    {
> > +        printk("%s: Shattering superpage is not supported\n",
> > __func__);
> > +        rc = -EOPNOTSUPP;
> > +        goto out;
> > +    }
> > +
> > +    entry = table + offsets[level];
> > +
> > +    rc = -EINVAL;
> > +    if ( !pt_check_entry(*entry, mfn, flags) )
> > +        goto out;
> > +
> > +    /* We are removing the page */
> > +    if ( !(flags & PTE_VALID) )
> > +        memset(&pte, 0x00, sizeof(pte));
> > +    else
> > +    {
> > +        /* We are inserting a mapping => Create new pte. */
> > +        if ( !mfn_eq(mfn, INVALID_MFN) )
> > +            pte = pte_from_mfn(mfn, PTE_VALID);
> > +        else /* We are updating the permission => Copy the current
> > pte. */
> > +            pte = *entry;
> > +
> > +        /* update permission according to the flags */
> > +        pte.pte |= PTE_RWX_MASK(flags) | PTE_ACCESSED | PTE_DIRTY;
> 
> When updating an entry, don't you also need to clear (some of) the
> flags?
I am not sure why some flags should be cleared. Here we are taking only
necessary for pte flags such as R, W, X or other bits in flags are
ignored.

> 
> > +/* Return the level where mapping should be done */
> > +static int pt_mapping_level(unsigned long vfn, mfn_t mfn, unsigned
> > long nr,
> > +                            unsigned int flags)
> > +{
> > +    unsigned int level = 0;
> > +    unsigned long mask;
> > +    unsigned int i;
> > +
> > +    /* Use blocking mapping unless the caller requests 4K mapping
> > */
> > +    if ( unlikely(flags & PTE_SMALL) )
> > +        return level;
> > +
> > +    /*
> > +     * Don't take into account the MFN when removing mapping (i.e
> > +     * MFN_INVALID) to calculate the correct target order.
> > +     *
> > +     * `vfn` and `mfn` must be both superpage aligned.
> > +     * They are or-ed together and then checked against the size
> > of
> > +     * each level.
> > +     *
> > +     * `left` is not included and checked separately to allow
> > +     * superpage mapping even if it is not properly aligned (the
> > +     * user may have asked to map 2MB + 4k).
> 
> What is this about? There's nothing named "left" here.
It refer to "remaining" pages or "leftover" space after trying to align
a mapping to a superpage boundary.
> 
> > +     */
> > +    mask = !mfn_eq(mfn, INVALID_MFN) ? mfn_x(mfn) : 0;
> > +    mask |= vfn;
> > +
> > +    for ( i = HYP_PT_ROOT_LEVEL; i != 0; i-- )
> > +    {
> > +        if ( !(mask & (BIT(XEN_PT_LEVEL_ORDER(i), UL) - 1)) &&
> > +             (nr >= BIT(XEN_PT_LEVEL_ORDER(i), UL)) )
> > +        {
> > +            level = i;
> > +            break;
> > +        }
> > +    }
> > +
> > +    return level;
> > +}
> > +
> > +static DEFINE_SPINLOCK(xen_pt_lock);
> 
> Another largely meaningless xen_ prefix?
Thanks. I'll drop it.

> 
> > +/*
> > + * If `mfn` equals `INVALID_MFN`, it indicates that the following
> > page table
> > + * update operation might be related to either populating the
> > table (
> > + * PTE_POPULATE will be set additionaly), destroying a mapping, or
> > modifying
> > + * an existing mapping.
> 
> And the latter two are distinguished by? PTE_VALID?
inserting -> (PTE_VALID=1 + (mfn=something valid))
destroying-> ( PTE_VALID=0 )

> 
> > + * If `mfn` is valid and flags has PTE_VALID bit set then it means
> > that
> > + * inserting will be done.
> > + */
> 
> What about mfn != INVALID_MFN and PTE_VALID clear?
PTE_VALID=0 will be always considered as destroying and no matter what
is mfn value as in this case the removing is done in the way where mfn
isn't used:
        memset(&pte, 0x00, sizeof(pte));


>  Also note that "`mfn` is
> valid" isn't the same as "mfn != INVALID_MFN". You want to be precise
> here,
> to avoid confusion later on. (I say that knowing that we're still
> fighting
> especially shadow paging code on x86 not having those properly
> separated.)
If it is needed to be precise and mfn is valid isn't the same as "mfn
!= INVALID_MFN" only for the case of shadow paging?

> > 
> 
> > +    unsigned long left = nr_mfns;
> > +
> > +    const mfn_t root = get_root_page();
> > +
> > +    /*
> > +     * It is bad idea to have mapping both writeable and
> > +     * executable.
> > +     * When modifying/creating mapping (i.e PTE_VALID is set),
> > +     * prevent any update if this happen.
> > +     */
> > +    if ( (flags & PTE_VALID) && PTE_W_MASK(flags) &&
> > PTE_X_MASK(flags) )
> 
> Seeing them in use, I wonder about the naming of those PTE_?_MASK()
> macros. Along with the lhs, why not simply (flags & PTE_...)?
Hmm, good point. They can be really dropped with simplification of the
mentioned if(...).

Thanks.

~ Oleksii



 


Rackspace

Lists.xenproject.org is hosted with RackSpace, monitoring our
servers 24x7x365 and backed by RackSpace's Fanatical Support®.