
Re: [XEN RFC PATCH v5 3/5] xen/public: Introduce PV-IOMMU hypercall interface



On Tue, 4 Feb 2025, Teddy Astie wrote:
> Hello Stefano,
> 
> > On 03/02/2025 at 18:47, Stefano Stabellini wrote:
> > On Mon, 3 Feb 2025, Teddy Astie wrote:
> >> Hello Jason,
> >>
> >> On 30/01/2025 at 21:17, Jason Andryuk wrote:
> >>> Hi Teddy,
> >>>
> >>> Thanks for working on this.  I'm curious about your plans for this:
> >>>
> >>> On 2025-01-21 11:13, Teddy Astie wrote:
> >>>> +/**
> >>>> + * IOMMU_alloc_nested
> >>>> + * Create a nested IOMMU context (needs IOMMUCAP_nested).
> >>>> + *
> >>>> + * This context uses a platform-specific page table from the domain
> >>>> + * address space, specified in pgtable_gfn, and uses it for nested
> >>>> + * translations.
> >>>> + *
> >>>> + * Explicit flushes need to be submitted with IOMMU_flush_nested on
> >>>> + * modification of the nested page table to ensure coherency between
> >>>> + * the IOTLB and the nested page table.
> >>>> + *
> >>>> + * This context can be destroyed using IOMMU_free_context.
> >>>> + * This context cannot be modified using map_pages, unmap_pages.
> >>>> + */
> >>>> +struct pv_iommu_alloc_nested {
> >>>> +    /* OUT: allocated IOMMU context number */
> >>>> +    uint16_t ctx_no;
> >>>> +
> >>>> +    /* IN: guest frame number of the nested page table */
> >>>> +    uint64_aligned_t pgtable_gfn;
> >>>> +
> >>>> +    /* IN: nested mode flags */
> >>>> +    uint64_aligned_t nested_flags;
> >>>> +};
> >>>> +typedef struct pv_iommu_alloc_nested pv_iommu_alloc_nested_t;
> >>>> +DEFINE_XEN_GUEST_HANDLE(pv_iommu_alloc_nested_t);
> >>>
> >>> Is this command intended to be used for GVA -> GPA translation?  Would
> >>> you need some way to associate with another iommu context for GPA -> HPA
> >>> translation?
> >>>
> >>
> >> It's intended to be used for accelerating IOMMU emulation for the guest.
> >> In this case, the other GPA->HPA translation is the domain's p2m
> >> page table (or something similar), such that the translations made from
> >> this nested context are meaningful from the guest's point of view.
> >>
> >> The idea is to use the "remote_op" sub-command to let the device model
> >> (e.g. QEMU) alter the IOMMU behavior for the affected domain (e.g. by
> >> reattaching devices, creating new IOMMU contexts, ...).
> >>
> >> I think it can also be used for the virtio-iommu page table.
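> >>
> >> To give a rough idea of that flow (purely a hypothetical sketch: the
> >> helper names below and the exact remote_op encoding are illustrative
> >> only, not part of this series), the device-model side would
> >> conceptually do:
> >>
> >> /* Hypothetical QEMU-side sketch, driving the guest's IOMMU state
> >>    through the remote_op sub-command (helper names invented here). */
> >> static int vIOMMU_attach(domid_t guest_domid, uint32_t sbdf)
> >> {
> >>     uint16_t ctx_no;
> >>     /* create a fresh IOMMU context on behalf of the guest domain */
> >>     int rc = pv_iommu_remote_alloc_context(guest_domid, &ctx_no);
> >>
> >>     /* move the passed-through device behind that context */
> >>     if ( !rc )
> >>         rc = pv_iommu_remote_reattach_device(guest_domid, sbdf, ctx_no);
> >>
> >>     /* ... later torn down via the free_context counterpart */
> >>     return rc;
> >> }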
> >>
> >>> Maybe more broadly, what are your goals for enabling PV-IOMMU?  The
> >>> examples on your blog post cover a domain restricting device access to
> >>> specific portions of the GPA space.  Are you also interested in GVA
> >>> -> GPA?  Does VFIO require GVA -> GPA?
> >>>
> >>
> >> The current way of enabling and using PV-IOMMU is with the dedicated
> >> Linux IOMMU driver [1] that implements Linux's IOMMU subsystem on top
> >> of this proposed interface.
> >> In the PV case, this in practice replaces xen-swiotlb with dma-iommu
> >> and does all DMA through the paravirtualized IOMMU (e.g. creating DMA
> >> domains and moving devices to them).
> >>
> >> Regarding GVA->GPA, this is what this interface provides, and
> >> restricting device access to memory is one way of using it. This is a
> >> requirement for VFIO (as it is also for the Linux IOMMU subsystem), and
> >> I managed to make SPDK and DPDK work in Dom0 using VFIO and these
> >> patches [2].
> >
> > Thanks for the explanation, Teddy. It certainly seems that this work is
> > moving in the right direction. However, I have a couple of questions on
> > this point, as I admit that I have not fully understood it.
> >
> > Modern IOMMUs support two stages of translation. Using ARM terminology,
> > these are referred to as stage2 and stage1. The stage2 translation is
> > used by Xen for the domain's GPA->PA (p2m) mapping. The pagetable used
> > for stage2 could potentially be shared with the pagetable used by Xen
> > for the p2m. Stage1 is typically controlled by the guest for its own
> > address translations, mapping guest virtual addresses (GVA) to guest
> > physical addresses (GPA).
> >
> > I can see that this patch series introduces an interface that allows
> > Dom0 to modify its own stage2 mappings.
> >
> > My question is: how do we allow the domain to also set up stage1
> > mappings? I would assume that the interface would take the form of a
> > hypercall that allows a domain to pass a stage1 pagetable pointer, which
> > Xen would then use to configure the IOMMU stage1. However, I do not see
> > such a hypercall in your series. I was under the impression that
> > GVA-to-GPA translation was left as future work, so I am confused by your
> > statement that this patch series already provides it.
> >
> 
> There are 2 interfaces for the guest (and device model) to configure its
> GVA-to-GPA mappings: map/unmap subcommands, and (depending on hardware
> support) nested IOMMU contexts where the guest/device model provides the
> GPA of the stage1 pagetable (this is meant to be a way of accelerating
> IOMMU emulation through QEMU).
> 
> Here are the hypercall subcommands for map/unmap, with which you can map
> and unmap pages in a "paravirtualized IOMMU context" (making the mapped
> region visible to the devices attached to that context through
> reattach_device).
> 
> /**
>   * IOMMU_map_pages
>   * Map pages on an IOMMU context.
>   *
>   * pgsize must be supported by pgsize_mask.
>   * Fails with -EINVAL if mapping on top of another mapping.
>   * Reports the actually mapped page count in the mapped field
>   * (regardless of failure).
>   */
> struct pv_iommu_map_pages {
>      /* IN: IOMMU context number */
>      uint16_t ctx_no;
> 
>      /* IN: Guest frame number */
>      uint64_aligned_t gfn;
> 
>      /* IN: Device frame number */
>      uint64_aligned_t dfn;
> 
>      /* IN: Map flags */
>      uint32_t map_flags;
> 
>      /* IN: Size of pages to map */
>      uint32_t pgsize;
> 
>      /* IN: Number of pages to map */
>      uint32_t nr_pages;
> 
>      /* OUT: Number of pages actually mapped */
>      uint32_t mapped;
> };
> typedef struct pv_iommu_map_pages pv_iommu_map_pages_t;
> DEFINE_XEN_GUEST_HANDLE(pv_iommu_map_pages_t);
> 
> /**
>   * IOMMU_unmap_pages
>   * Unmap pages on an IOMMU context.
>   *
>   * pgsize must be supported by pgsize_mask.
>   * Reports the actually unmapped page count in the unmapped field
>   * (regardless of failure).
>   * Fails with -ENOENT when attempting to unmap a page without any mapping.
>   */
> struct pv_iommu_unmap_pages {
>      /* IN: IOMMU context number */
>      uint16_t ctx_no;
> 
>      /* IN: Device frame number */
>      uint64_aligned_t dfn;
> 
>      /* IN: Size of pages to unmap */
>      uint32_t pgsize;
> 
>      /* IN: Number of pages to unmap */
>      uint32_t nr_pages;
> 
>      /* OUT: Number of pages actually unmapped */
>      uint32_t unmapped;
> };
> typedef struct pv_iommu_unmap_pages pv_iommu_unmap_pages_t;
> DEFINE_XEN_GUEST_HANDLE(pv_iommu_unmap_pages_t);
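> 
> For illustration, a caller could use these roughly as in the sketch below
> (simplified and hypothetical: the do_iommu_op() wrapper, the subcommand
> constants and the map_flags names are placeholders for the hypercall
> plumbing, which is not shown here):
> 
> static int map_example(void)
> {
>     /* Sketch: map 16 contiguous 4K pages read/write into context 1. */
>     struct pv_iommu_map_pages map = {
>         .ctx_no    = 1,
>         .gfn       = 0x12345,      /* first guest frame to map */
>         .dfn       = 0x1000,       /* first device frame */
>         .map_flags = IOMMU_MAP_readable | IOMMU_MAP_writeable, /* placeholders */
>         .pgsize    = 1 << 12,      /* must be allowed by pgsize_mask */
>         .nr_pages  = 16,
>     };
>     int rc = do_iommu_op(IOMMU_map_pages, &map); /* hypothetical wrapper */
> 
>     if ( rc )
>     {
>         /* 'mapped' reports how many pages were actually established, so
>          * the caller can undo exactly that prefix. */
>         struct pv_iommu_unmap_pages unmap = {
>             .ctx_no   = map.ctx_no,
>             .dfn      = map.dfn,
>             .pgsize   = map.pgsize,
>             .nr_pages = map.mapped,
>         };
>         do_iommu_op(IOMMU_unmap_pages, &unmap);
>     }
> 
>     return rc;
> }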
> 
> If the hardware supports it, there is an alternative (still being
> drafted) interface that allows the guest to directly provide native
> page tables.

OK, this is the important one. I am glad you already thought about it.
If possible, I would suggest having a rough PoC in place just to prove
that the interface would work for the intended use case.

For interfaces, it is sometimes hard to spot issues during review, so
having a barebones PoC to validate the concept would be ideal.

Thanks for your work on this.



> This is exposed through the "_nested" subcommands; there is no
> implementation of this feature in this patch series yet.
> 
> /**
>   * IOMMU_alloc_nested
>   * Create a nested IOMMU context (needs IOMMUCAP_nested).
>   *
>   * This context uses a platform-specific page table from the domain
>   * address space, specified in pgtable_gfn, and uses it for nested
>   * translations.
>   *
>   * Explicit flushes need to be submitted with IOMMU_flush_nested on
>   * modification of the nested page table to ensure coherency between
>   * the IOTLB and the nested page table.
>   *
>   * This context can be destroyed using IOMMU_free_context.
>   * This context cannot be modified using map_pages, unmap_pages.
>   */
> struct pv_iommu_alloc_nested {
>      /* OUT: allocated IOMMU context number */
>      uint16_t ctx_no;
> 
>      /* IN: guest frame number of the nested page table */
>      uint64_aligned_t pgtable_gfn;
> 
>      /* IN: nested mode flags */
>      uint64_aligned_t nested_flags;
> };
> typedef struct pv_iommu_alloc_nested pv_iommu_alloc_nested_t;
> DEFINE_XEN_GUEST_HANDLE(pv_iommu_alloc_nested_t);
> 
> /**
>   * IOMMU_flush_nested (needs IOMMUCAP_nested)
>   * Flush the IOTLB for nested translation.
>   */
> struct pv_iommu_flush_nested {
>      /* TODO */
> };
> typedef struct pv_iommu_flush_nested pv_iommu_flush_nested_t;
> DEFINE_XEN_GUEST_HANDLE(pv_iommu_flush_nested_t);
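> 
> As a rough usage sketch (again hypothetical: do_iommu_op() and the
> IOMMU_alloc_nested constant are placeholders, and since
> pv_iommu_flush_nested is still TODO its use is only described in a
> comment):
> 
> static int attach_stage1(uint64_t stage1_pgtable_gfn, uint16_t *ctx_no)
> {
>     /* Hand Xen the GPA of the platform-specific stage1 page table that
>      * the guest (or the device model on its behalf) maintains. */
>     struct pv_iommu_alloc_nested nested = {
>         .pgtable_gfn  = stage1_pgtable_gfn,
>         .nested_flags = 0,
>     };
>     int rc = do_iommu_op(IOMMU_alloc_nested, &nested);
> 
>     if ( !rc )
>         *ctx_no = nested.ctx_no;
> 
>     /* Any later change to the stage1 page table must be followed by an
>      * IOMMU_flush_nested before it is guaranteed to be visible through
>      * the IOTLB; map_pages/unmap_pages cannot be used on this context. */
>     return rc;
> }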
> 
> 
> >
> >
> >
> >> [1] Originally
> >> https://lists.xen.org/archives/html/xen-devel/2024-06/msg01145.html but
> >> this patch got quite old and probably doesn't work anymore with this new
> >> Xen patch series.
> >> I have an updated patch in my xen-pviommu branch
> >> https://gitlab.com/xen-project/people/tsnake41/linux/-/commit/125d67a09fa9f66a32f9175641cfccca22dbbdb6
> >>
> >> [2] You also need to set "vfio_iommu_type1.allow_unsafe_interrupts=1" to
> >> make VFIO work for now.
> 
> Thanks
> Teddy
> 
> 
> 
> Teddy Astie | Vates XCP-ng Developer
> 
> XCP-ng & Xen Orchestra - Vates solutions
> 
> web: https://vates.tech
> 

 

