Re: [PATCH v4 10/21] AMD/IOMMU: allow use of superpage mappings
On Mon, Apr 25, 2022 at 10:38:06AM +0200, Jan Beulich wrote:
> No separate feature flags exist which would control availability of
> these; the only restriction is HATS (establishing the maximum number of
> page table levels in general), and even that has a lower bound of 4.
> Thus we can unconditionally announce 2M, 1G, and 512G mappings. (Via
> non-default page sizes the implementation in principle permits arbitrary
> size mappings, but these require multiple identical leaf PTEs to be
> written, which isn't all that different from having to write multiple
> consecutive PTEs with increasing frame numbers. IMO that's therefore
> beneficial only on hardware where suitable TLBs exist; I'm unaware of
> such hardware.)
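Just to spell out the correspondence as I read it (assuming
PTE_PER_TABLE_SHIFT is 9): order 0 -> 4k (level 1), order 9 -> 2M
(level 2), order 18 -> 1G (level 3) and order 27 -> 512G (level 4),
which matches the page_sizes addition at the bottom of the patch.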
>
> Signed-off-by: Jan Beulich <jbeulich@xxxxxxxx>
Reviewed-by: Roger Pau Monné <roger.pau@xxxxxxxxxx>
> ---
> I'm not fully sure about allowing 512G mappings: The scheduling-for-
> freeing of intermediate page tables would take quite a while when
> replacing a tree of 4k mappings by a single 512G one. Yet then again
> there's no present code path via which 512G chunks of memory could be
> allocated (and hence mapped) anyway, so this would only benefit huge
> systems where 512 1G mappings could be re-coalesced (once suitable code
> is in place) into a single L4 entry. And re-coalescing wouldn't result
> in scheduling-for-freeing of full trees of lower level pagetables.
I would think part of this should go into the commit message, as to
why enabling 512G superpages is fine.
> ---
> v4: Change type of queue_free_pt()'s 1st parameter. Re-base.
> v3: Rename queue_free_pt()'s last parameter. Replace "level > 1" checks
> where possible.
>
> --- a/xen/drivers/passthrough/amd/iommu_map.c
> +++ b/xen/drivers/passthrough/amd/iommu_map.c
> @@ -32,12 +32,13 @@ static unsigned int pfn_to_pde_idx(unsig
> }
>
> static union amd_iommu_pte clear_iommu_pte_present(unsigned long l1_mfn,
> - unsigned long dfn)
> + unsigned long dfn,
> + unsigned int level)
> {
> union amd_iommu_pte *table, *pte, old;
>
> table = map_domain_page(_mfn(l1_mfn));
> - pte = &table[pfn_to_pde_idx(dfn, 1)];
> + pte = &table[pfn_to_pde_idx(dfn, level)];
> old = *pte;
>
> write_atomic(&pte->raw, 0);
> @@ -351,11 +352,32 @@ static int iommu_pde_from_dfn(struct dom
> return 0;
> }
>
> +static void queue_free_pt(struct domain_iommu *hd, mfn_t mfn, unsigned int level)
> +{
> + if ( level > 1 )
> + {
> + union amd_iommu_pte *pt = map_domain_page(mfn);
> + unsigned int i;
> +
> + for ( i = 0; i < PTE_PER_TABLE_SIZE; ++i )
> + if ( pt[i].pr && pt[i].next_level )
> + {
> + ASSERT(pt[i].next_level < level);
> + queue_free_pt(hd, _mfn(pt[i].mfn), pt[i].next_level);
> + }
> +
> + unmap_domain_page(pt);
> + }
> +
> + iommu_queue_free_pgtable(hd, mfn_to_page(mfn));
> +}
> +
> int cf_check amd_iommu_map_page(
> struct domain *d, dfn_t dfn, mfn_t mfn, unsigned int flags,
> unsigned int *flush_flags)
> {
> struct domain_iommu *hd = dom_iommu(d);
> + unsigned int level = (IOMMUF_order(flags) / PTE_PER_TABLE_SHIFT) + 1;
> int rc;
> unsigned long pt_mfn = 0;
> union amd_iommu_pte old;
> @@ -384,7 +406,7 @@ int cf_check amd_iommu_map_page(
> return rc;
> }
>
I think it might be helpful to assert or otherwise check that the
input order is supported by the IOMMU, just to be on the safe side.
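Something along these lines maybe (just a sketch, untested; the upper
bound would need to stay in sync with whatever page_sizes ends up
advertising):

    ASSERT(IOMMUF_order(flags) <= 3 * PTE_PER_TABLE_SHIFT &&
           !(IOMMUF_order(flags) % PTE_PER_TABLE_SHIFT));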
> - if ( iommu_pde_from_dfn(d, dfn_x(dfn), 1, &pt_mfn, flush_flags, true) ||
> + if ( iommu_pde_from_dfn(d, dfn_x(dfn), level, &pt_mfn, flush_flags, true) ||
> !pt_mfn )
> {
> spin_unlock(&hd->arch.mapping_lock);
> @@ -394,8 +416,8 @@ int cf_check amd_iommu_map_page(
> return -EFAULT;
> }
>
> - /* Install 4k mapping */
> - old = set_iommu_pte_present(pt_mfn, dfn_x(dfn), mfn_x(mfn), 1,
> + /* Install mapping */
> + old = set_iommu_pte_present(pt_mfn, dfn_x(dfn), mfn_x(mfn), level,
> (flags & IOMMUF_writable),
> (flags & IOMMUF_readable));
>
> @@ -403,8 +425,13 @@ int cf_check amd_iommu_map_page(
>
> *flush_flags |= IOMMU_FLUSHF_added;
> if ( old.pr )
> + {
> *flush_flags |= IOMMU_FLUSHF_modified;
>
> + if ( IOMMUF_order(flags) && old.next_level )
> + queue_free_pt(hd, _mfn(old.mfn), old.next_level);
> + }
> +
> return 0;
> }
>
> @@ -413,6 +440,7 @@ int cf_check amd_iommu_unmap_page(
> {
> unsigned long pt_mfn = 0;
> struct domain_iommu *hd = dom_iommu(d);
> + unsigned int level = (order / PTE_PER_TABLE_SHIFT) + 1;
> union amd_iommu_pte old = {};
>
> spin_lock(&hd->arch.mapping_lock);
> @@ -423,7 +451,7 @@ int cf_check amd_iommu_unmap_page(
> return 0;
> }
>
> - if ( iommu_pde_from_dfn(d, dfn_x(dfn), 1, &pt_mfn, flush_flags, false) )
> + if ( iommu_pde_from_dfn(d, dfn_x(dfn), level, &pt_mfn, flush_flags, false) )
> {
> spin_unlock(&hd->arch.mapping_lock);
> AMD_IOMMU_ERROR("invalid IO pagetable entry dfn = %"PRI_dfn"\n",
> @@ -435,14 +463,19 @@ int cf_check amd_iommu_unmap_page(
> if ( pt_mfn )
> {
> /* Mark PTE as 'page not present'. */
> - old = clear_iommu_pte_present(pt_mfn, dfn_x(dfn));
> + old = clear_iommu_pte_present(pt_mfn, dfn_x(dfn), level);
> }
>
> spin_unlock(&hd->arch.mapping_lock);
>
> if ( old.pr )
> + {
> *flush_flags |= IOMMU_FLUSHF_modified;
>
> + if ( order && old.next_level )
> + queue_free_pt(hd, _mfn(old.mfn), old.next_level);
> + }
> +
> return 0;
> }
>
> --- a/xen/drivers/passthrough/amd/pci_amd_iommu.c
> +++ b/xen/drivers/passthrough/amd/pci_amd_iommu.c
> @@ -747,7 +747,7 @@ static void cf_check amd_dump_page_table
> }
>
> static const struct iommu_ops __initconst_cf_clobber _iommu_ops = {
> - .page_sizes = PAGE_SIZE_4K,
> + .page_sizes = PAGE_SIZE_4K | PAGE_SIZE_2M | PAGE_SIZE_1G | PAGE_SIZE_512G,
As mentioned in a previous email, I'm worried that if we ever get to
replace an entry populated with 4K pages by a single 512G superpage,
the freeing cost of the associated pagetables would be quite high.

I guess we will have to implement a more preemptive freeing behavior
if issues arise.
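(Rough numbers just to illustrate the scale, assuming a fully populated
4-level tree below a single L4 entry: replacing 4k mappings with one
512G entry would queue 1 L3 table + 512 L2 tables + 512 * 512 = 262144
L1 tables, i.e. around 262k pagetables, for freeing in one go.)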
Thanks, Roger.