Re: [PATCH v2 14/17] xen/riscv: implement p2m_next_level()
- To: Jan Beulich <jbeulich@xxxxxxxx>
- From: Oleksii Kurochko <oleksii.kurochko@xxxxxxxxx>
- Date: Wed, 16 Jul 2025 13:32:01 +0200
- Cc: Alistair Francis <alistair.francis@xxxxxxx>, Bob Eshleman <bobbyeshleman@xxxxxxxxx>, Connor Davis <connojdavis@xxxxxxxxx>, Andrew Cooper <andrew.cooper3@xxxxxxxxxx>, Anthony PERARD <anthony.perard@xxxxxxxxxx>, Michal Orzel <michal.orzel@xxxxxxx>, Julien Grall <julien@xxxxxxx>, Roger Pau Monné <roger.pau@xxxxxxxxxx>, Stefano Stabellini <sstabellini@xxxxxxxxxx>, xen-devel@xxxxxxxxxxxxxxxxxxxx
- Delivery-date: Wed, 16 Jul 2025 11:32:06 +0000
- List-id: Xen developer discussion <xen-devel.lists.xenproject.org>
On 7/2/25 10:35 AM, Jan Beulich wrote:
On 10.06.2025 15:05, Oleksii Kurochko wrote:
--- a/xen/arch/riscv/p2m.c
+++ b/xen/arch/riscv/p2m.c
@@ -387,6 +387,17 @@ static inline bool p2me_is_valid(struct p2m_domain *p2m, pte_t pte)
return p2m_type_radix_get(p2m, pte) != p2m_invalid;
}
+/*
+ * pte_is_* helpers are checking the valid bit set in the
+ * PTE but we have to check p2m_type instead (look at the comment above
+ * p2me_is_valid())
+ * Provide our own overlay to check the valid bit.
+ */
+static inline bool p2me_is_mapping(struct p2m_domain *p2m, pte_t pte)
+{
+ return p2me_is_valid(p2m, pte) && (pte.pte & PTE_ACCESS_MASK);
+}
Same question as on the earlier patch - does P2M type apply to intermediate
page tables at all? (Conceptually it shouldn't.)
It doesn't matter whether it is an intermediate page table or a leaf PTE pointing
to a page: the PTE should be valid in either case. Given that, in the current
implementation, a PTE can have PTE.v = 0 while P2M.v = 1, it is better to check
P2M.v instead of PTE.v.
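To make the contrast concrete, here is a minimal sketch (not part of the patch;
the helper names are hypothetical, and PTE_VALID is assumed to be the existing
RISC-V valid-bit mask):

/* Hardware view: only looks at PTE.v. */
static inline bool hw_view_valid(pte_t pte)
{
    return pte.pte & PTE_VALID;
}

/* P2M view: derives validity from the type stored for the entry. */
static inline bool p2m_view_valid(struct p2m_domain *p2m, pte_t pte)
{
    return p2m_type_radix_get(p2m, pte) != p2m_invalid;
}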
@@ -492,6 +503,70 @@ static pte_t p2m_entry_from_mfn(struct p2m_domain *p2m, mfn_t mfn, p2m_type_t t,
return e;
}
+/* Generate table entry with correct attributes. */
+static pte_t page_to_p2m_table(struct p2m_domain *p2m, struct page_info *page)
+{
+ /*
+ * Since this function generates a table entry, according to "Encoding
+ * of PTE R/W/X fields," the entry's r, w, and x fields must be set to 0
+ * to point to the next level of the page table.
+ * Therefore, to ensure that an entry is a page table entry,
+ * `p2m_access_n2rwx` is passed to `mfn_to_p2m_entry()` as the access value,
+ * which overrides whatever was passed as `p2m_type_t` and guarantees that
+ * the entry is a page table entry by setting r = w = x = 0.
+ */
+ return p2m_entry_from_mfn(p2m, page_to_mfn(page), p2m_ram_rw, p2m_access_n2rwx);
Similarly P2M access shouldn't apply to intermediate page tables. (Moot
with that, but (ab)using p2m_access_n2rwx would also look wrong: You did
read what it means, didn't you?)
p2m_access_n2rwx was chosen not because of the description near its declaration,
but because it sets r=w=x=0, which is what RISC-V expects for a PTE that points
to the next-level page table.
Generally, I agree that P2M access shouldn't be applied to intermediate page tables.
What I can suggest in this case is to use p2m_access_rwx instead of p2m_access_n2rwx,
so that no access restriction is applied when p2m_entry_from_mfn() is called, and
then, after that call, simply clear PTE.r/w/x.
So this function would look like:
/* Generate table entry with correct attributes. */
static pte_t page_to_p2m_table(struct p2m_domain *p2m, struct page_info *page)
{
    /*
     * p2m_ram_rw is chosen for a table entry as the p2m table should be
     * valid from both the P2M and the hardware point of view.
     *
     * p2m_access_rwx is chosen so that no access restriction is applied,
     * i.e. access permissions aren't relevant for a table entry.
     */
    pte_t pte = p2m_pte_from_mfn(p2m, page_to_mfn(page), _gfn(0), p2m_ram_rw,
                                 p2m_access_rwx);

    /*
     * Since this function generates a table entry, according to "Encoding
     * of PTE R/W/X fields," the entry's r, w, and x fields must be set to 0
     * to point to the next level of the page table.
     */
    pte.pte &= ~PTE_ACCESS_MASK;

    return pte;
}
Does this make sense? Or would it be better to keep the current version of
page_to_p2m_table() and just improve the comment explaining why p2m_access_n2rwx
is used for a table entry?
+}
+
+static struct page_info *p2m_alloc_page(struct domain *d)
+{
+ struct page_info *pg;
+
+ /*
+ * For hardware domain, there should be no limit in the number of pages that
+ * can be allocated, so that the kernel may take advantage of the extended
+ * regions. Hence, allocate p2m pages for hardware domains from heap.
+ */
+ if ( is_hardware_domain(d) )
+ {
+ pg = alloc_domheap_page(d, MEMF_no_owner);
+ if ( pg == NULL )
+ printk(XENLOG_G_ERR "Failed to allocate P2M pages for hwdom.\n");
+ }
The comment looks to have been taken verbatim from Arm. Whatever "extended
regions" are, does the same concept even exist on RISC-V?
Initially, I missed that it's used only for Arm. Since it was mentioned in
doc/misc/xen-command-line.pandoc, I assumed it applied to all architectures,
but now I see that it's Arm-specific: "### ext_regions (Arm)".
Also, special casing Dom0 like this has benefits, but also comes with a
pitfall: If the system's out of memory, allocations will fail. A pre-
populated pool would avoid that (until exhausted, of course). If special-
casing of Dom0 is needed, I wonder whether ...
+ else
+ {
+ spin_lock(&d->arch.paging.lock);
+ pg = page_list_remove_head(&d->arch.paging.p2m_freelist);
+ spin_unlock(&d->arch.paging.lock);
+ }
... going this path but with a Dom0-only fallback to general allocation
wouldn't be the better route.
IIUC, then it should be something like:

static struct page_info *p2m_alloc_page(struct domain *d)
{
    struct page_info *pg;

    spin_lock(&d->arch.paging.lock);
    pg = page_list_remove_head(&d->arch.paging.p2m_freelist);
    spin_unlock(&d->arch.paging.lock);

    if ( !pg && is_hardware_domain(d) )
    {
        /* Need to allocate more memory from domheap. */
        pg = alloc_domheap_page(d, MEMF_no_owner);
        if ( pg == NULL )
        {
            printk(XENLOG_ERR "Failed to allocate pages.\n");
            return pg;
        }

        ACCESS_ONCE(d->arch.paging.total_pages)++;
        page_list_add_tail(pg, &d->arch.paging.p2m_freelist);
    }

    return pg;
}
And basically use d->arch.paging.p2m_freelist for both dom0less domains and Dom0,
with the only difference being that, in the case of Dom0,
d->arch.paging.p2m_freelist could be extended.
Do I understand your idea correctly?
(Probably this is the reply you're referring to:
https://lore.kernel.org/xen-devel/43e89225-5e69-49a6-a8c8-bda6d120d8ff@xxxxxxxx/;
at the moment, I can't find a better one.)
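For completeness, here is a hedged sketch of how such a pool could be
pre-populated at domain creation, so that p2m_alloc_page() normally only pulls
from d->arch.paging.p2m_freelist. The field names follow what is already used in
this thread; p2m_pool_populate() itself is a hypothetical helper, not something
from the series:

static int p2m_pool_populate(struct domain *d, unsigned long nr_pages)
{
    while ( nr_pages-- )
    {
        struct page_info *pg = alloc_domheap_page(d, MEMF_no_owner);

        if ( pg == NULL )
            return -ENOMEM;

        spin_lock(&d->arch.paging.lock);
        page_list_add_tail(pg, &d->arch.paging.p2m_freelist);
        ACCESS_ONCE(d->arch.paging.total_pages)++;
        spin_unlock(&d->arch.paging.lock);
    }

    return 0;
}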
+ return pg;
+}
+
+/* Allocate a new page table page and hook it in via the given entry. */
+static int p2m_create_table(struct p2m_domain *p2m, pte_t *entry)
+{
+ struct page_info *page;
+ pte_t *p;
+
+ ASSERT(!p2me_is_valid(p2m, *entry));
+
+ page = p2m_alloc_page(p2m->domain);
+ if ( page == NULL )
+ return -ENOMEM;
+
+ page_list_add(page, &p2m->pages);
+
+ p = __map_domain_page(page);
+ clear_page(p);
+
+ unmap_domain_page(p);
clear_domain_page()? Or actually clear_and_clean_page()?
Agreed, clear_and_clean_page() would be better here.
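For reference, the corresponding part of p2m_create_table() would then shrink to
something like the following (a sketch only, assuming clear_and_clean_page() is
made available on RISC-V with an Arm-style signature taking a struct page_info *):

    page = p2m_alloc_page(p2m->domain);
    if ( page == NULL )
        return -ENOMEM;

    page_list_add(page, &p2m->pages);

    /* Replaces the explicit __map_domain_page()/clear_page()/unmap sequence. */
    clear_and_clean_page(page);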
@@ -516,9 +591,33 @@ static int p2m_next_level(struct p2m_domain *p2m, bool alloc_tbl,
unsigned int level, pte_t **table,
unsigned int offset)
{
- panic("%s: hasn't been implemented yet\n", __func__);
+ pte_t *entry;
+ int ret;
+ mfn_t mfn;
+
+ entry = *table + offset;
+
+ if ( !p2me_is_valid(p2m, *entry) )
+ {
+ if ( !alloc_tbl )
+ return GUEST_TABLE_MAP_NONE;
+
+ ret = p2m_create_table(p2m, entry);
+ if ( ret )
+ return GUEST_TABLE_MAP_NOMEM;
+ }
+
+ /* The function p2m_next_level() is never called at the last level */
+ ASSERT(level != 0);
Logically you would perhaps better do this ahead of trying to allocate a
page table. Calls here with level == 0 are invalid in all cases aiui, not
just when you make it here.
That makes sense. I will move the ASSERT() to the start of p2m_next_level().
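Roughly, as an incremental diff on top of the hunk quoted above, the reordering
would look like:

@@ static int p2m_next_level(struct p2m_domain *p2m, bool alloc_tbl,
     pte_t *entry;
     int ret;
     mfn_t mfn;

+    /* The function p2m_next_level() is never called at the last level */
+    ASSERT(level != 0);
+
     entry = *table + offset;

     if ( !p2me_is_valid(p2m, *entry) )
@@
-    /* The function p2m_next_level() is never called at the last level */
-    ASSERT(level != 0);
-
     if ( p2me_is_mapping(p2m, *entry) )
         return GUEST_TABLE_SUPER_PAGE;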
+ if ( p2me_is_mapping(p2m, *entry) )
+ return GUEST_TABLE_SUPER_PAGE;
+
+ mfn = mfn_from_pte(*entry);
+
+ unmap_domain_page(*table);
+ *table = map_domain_page(mfn);
Just to mention it (may not need taking care of right away), there's an
inefficiency here: In p2m_create_table() you map the page to clear it.
Then you tear down that mapping, just to re-establish it here.
I will add:

    /*
     * TODO: There's an inefficiency here: in p2m_create_table() the page is
     * mapped so it can be cleared, and that mapping is then torn down, only
     * to be re-established here.
     */
    *table = map_domain_page(mfn);
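Longer term, one possible way to remove that inefficiency (purely a sketch, not
part of this patch: the changed return type, the hypothetical name
p2m_create_table_mapped(), and the required caller adjustment are all
assumptions) would be to have the table-creation helper hand back the
still-mapped, already-cleared page so p2m_next_level() can reuse the mapping
instead of calling map_domain_page() again:

static pte_t *p2m_create_table_mapped(struct p2m_domain *p2m, pte_t *entry)
{
    struct page_info *page = p2m_alloc_page(p2m->domain);
    pte_t *p;

    if ( page == NULL )
        return NULL;

    page_list_add(page, &p2m->pages);

    p = __map_domain_page(page);
    clear_page(p);

    /* ... hook the new table in via @entry as before ... */

    return p;  /* Still mapped; the caller uses it as the next *table. */
}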
Thanks.
~ Oleksii