Re: [Xen-devel] Xen PV IOMMU interface draft D



> From: Malcolm Crossley [mailto:malcolm.crossley@xxxxxxxxxx]
> Sent: Wednesday, February 10, 2016 6:09 PM

As Konrad commented, it's better to add this doc as the 1st patch in your
series; then it's easier to review it together with the other patches. Also
it's always good to include such a design doc in the repo.

Other comments embedded.

[...]
> 
> Clarification of GFN and BFN fields for different guest types
> -------------------------------------------------------------
> 
[...]
> Bus Frame Numbers (BFN) refer to the address presented on the physical bus
> before being translated by the IOMMU.
> 
> The diagram below details memory accesses originating from a physical device.
> 
>     Physical Device
>           |
>         (BFN)
>           |
>          IOMMU-PT
>           |
>         (MFN)
>           |
>          RAM

Curious what IOMMU-'PT' means here?

[...]
> General principles for PV IOMMU interface
> =========================================
> 
> There are two different usage models for the BFN address space of a calling
> guest, based upon the two purposes specified in the section above.
> 
> A calling guest may use their BFN address space for only one of the purposes
> detailed above, and so the PV IOMMU interface has a subop per usage model.
> Furthermore, the IOMMU mapping of foreign domain memory is more complex than
> IOMMU mapping local domain memory, and separating the subops allows for the
> complexity to be split in the implementation.
> 
> The PV IOMMU design allows the calling domain to control its BFN memory map.
> Thus the design also assigns the responsibility of ensuring that a BFN
> address mapped for local domain memory is not reused for foreign domain
> memory mappings without an explicit unmap of the BFN address first. This
> simplifies the usage of the API, and the extra overhead for the calling
> domains should be minimal as they should already be tracking their BFN
> address space usage.

It might be clearer if you added a separate section for the BFN itself, i.e.
how it is managed/allocated in different scenarios. I know most of this info
is already provided in the text, but it is not centralized so far. :-)
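
As a data point on the overhead: the tracking can be as cheap as a bitmap
over the BFN window the guest uses. A trivial sketch (the window size, base
and all names below are mine, not from the draft):

    #include <stdint.h>

    /* Illustrative only: a bitmap allocator over a window of the guest's
     * BFN space.  BFN_WINDOW_SIZE is an arbitrary assumption. */
    #define BFN_WINDOW_SIZE (1UL << 20)          /* frames tracked */
    static uint8_t bfn_bitmap[BFN_WINDOW_SIZE / 8];

    /* Return a free BFN, or -1 if the window is exhausted. */
    static long bfn_alloc(void)
    {
        for (unsigned long i = 0; i < BFN_WINDOW_SIZE; i++)
            if (!(bfn_bitmap[i / 8] & (1u << (i % 8)))) {
                bfn_bitmap[i / 8] |= 1u << (i % 8);
                return (long)i;
            }
        return -1;
    }

    static void bfn_free(unsigned long bfn)
    {
        bfn_bitmap[bfn / 8] &= ~(1u << (bfn % 8));
    }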

> 
> 
> Emulator usage of PV IOMMU interface
> ====================================

I'd suggest moving this and the later sections to after the basic API
introduction. Otherwise there is insufficient background for the many API
references at this point.

> 
> Emulators which require bus address mapping of guest RAM must first determine
> if it's possible for the domain to control the bus addresses themselves.
> 
> An IOMMUOP_query_caps subop will return the IOMMU_QUERY_map_cap flag. If this
> flag is set then the emulator may specify the BFN address it wishes guest RAM
> to be mapped to via the IOMMUOP_map_foreign_page subop. If the flag is not
> set then the emulator must use BFN addresses supplied by Xen via the
> IOMMUOP_lookup_foreign_page subop.

IOMMU_QUERY_map_cap is a bit confusing here. The above paragraph is about
whether the emulator is allowed to allocate/specify the BFN itself. However
this capability name reads more as whether the calling domain can map foreign
pages at all, which is actually true regardless of how the BFN is allocated.
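
Either way, to check my reading of the intended emulator flow, a rough
sketch in C (the do_iommu_op() wrapper, bfn_alloc(), and the assumption that
query_caps returns its capability bits in op.flags are mine; the draft
defines no explicit query_caps payload):

    /* Sketch: emulator decides whether it may choose the BFN itself. */
    static int64_t map_guest_gfn(uint16_t domid, ioservid_t ioserver,
                                 uint64_t gfn)
    {
        struct pv_iommu_op op = { .subop_id = IOMMUOP_query_caps };

        do_iommu_op(&op, 1);

        if (op.flags & IOMMU_QUERY_map_cap) {
            /* We pick the BFN ourselves. */
            op.subop_id = IOMMUOP_map_foreign_page;
            op.flags = 0;
            op.u.map_foreign_page.bfn = bfn_alloc();
            op.u.map_foreign_page.gfn = gfn;
            op.u.map_foreign_page.domid = domid;
            op.u.map_foreign_page.ioserver = ioserver;
            do_iommu_op(&op, 1);
            return op.status ? -1 : (int64_t)op.u.map_foreign_page.bfn;
        }

        /* Xen picks the BFN; read it back after the lookup. */
        op.subop_id = IOMMUOP_lookup_foreign_page;
        op.flags = 0;
        op.u.lookup_foreign_page.gfn = gfn;
        op.u.lookup_foreign_page.domid = domid;
        op.u.lookup_foreign_page.ioserver = ioserver;
        do_iommu_op(&op, 1);
        return op.status ? -1 : (int64_t)op.u.lookup_foreign_page.bfn;
    }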

> 
> Operating systems which use the IOMMUOP_map_page subop are expected to
> provide a common interface for emulators to use. Otherwise emulators will
> not be aware of existing BFN mappings created by the operating system and
> will have subops fail due to conflicts in the BFN address space for the
> domain.

Do you mean that the emulator needs to detect whether the OS is using
IOMMUOP_map_page? If yes, then the emulator calls a common interface
provided by the OS. If not, then the emulator just invokes the raw IOMMUOP
itself. I'm not certain there is a common mechanism to detect this so far.
Could you elaborate your thought here?

> 
> Emulators should unmap unused GFN mappings as often as possible using
> IOMMUOP_unmap_foreign_page subops so that guest domains can balloon pages
> quickly and efficiently.

Following the earlier analysis, this only applies when the OS doesn't use
IOMMUOP. Otherwise the emulator needs to call the 'OS common interface',
right?

> 
> Emulators should conform to the ballooning behaviour described in the section
> "IOMMUOP_*_foreign_page interactions with guest domain ballooning" so that
> guest domains are able to effectively balloon memory out and in.
> 
> Emulators must unmap any active BFN mappings when they shutdown.
> 
> IOMMUOP_*_foreign_page interactions with guest domain ballooning
> ================================================================
> 
> Guest domains can balloon out a set of GFN mappings at any time and render the
> BFN to GFN mapping invalid.
> 
> When a BFN to GFN mapping becomes invalid, Xen will issue a buffered I/O
> request of type IOREQ_TYPE_INVALIDATE to the affected IOREQ servers with the
> now invalid BFN address in the data field. If the buffered I/O request ring
> is full then a standard (synchronous) I/O request of type
> IOREQ_TYPE_INVALIDATE will be issued to the affected IOREQ server with the
> just-invalidated BFN address in the data field.
> 
> The BFN mappings cannot simply be unmapped at the point of the balloon
> hypercall, otherwise a malicious guest could specifically balloon out a GFN
> address in use by an emulator and trigger IOMMU faults for the domains with
> BFN mappings.

Is it a real problem? Today for PCI passthrough, what happens if the guest
programs an assigned device with a bad GPA which is not mapped in the IOMMU?
I think an IOMMU fault should be fine, and we can just leverage the existing
IOMMU fault handling after the fault is triggered.
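
That said, if the invalidation protocol stays, my mental model of the
emulator side is roughly the sketch below (only IOREQ_TYPE_INVALIDATE and
the BFN-in-data-field part are from the draft; the handler shape and
bfn_free() are my assumptions):

    /* Sketch: emulator reaction to IOREQ_TYPE_INVALIDATE: drop the
     * now-invalid BFN mapping so Xen can release the underlying MFN. */
    static void handle_invalidate(uint64_t invalid_bfn, ioservid_t ioserver)
    {
        struct pv_iommu_op op = {
            .subop_id = IOMMUOP_unmap_foreign_page,
            .u.unmap_foreign_page = {
                .bfn = invalid_bfn,
                .ioserver = ioserver,
            },
        };

        do_iommu_op(&op, 1);      /* removes the M2B reference */
        bfn_free(invalid_bfn);    /* if the emulator chose the BFN itself */
    }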

> 
> For hosts with no IOMMU support: The affected emulator(s) must specifically
> issue an IOMMUOP_unmap_foreign_page subop for the now invalid BFN address so
> that the references to the underlying MFN are removed and the MFN can be
> freed back to the Xen memory allocator.
> 
> For hosts with IOMMU support:
> If the BFN was mapped without the IOMMUOP_swap_mfn flag set in the
> IOMMUOP_map_foreign_page subop then the affected emulator(s) must
> specifically issue an IOMMUOP_unmap_foreign_page subop for the now invalid
> BFN address so that the references to the underlying MFN are removed.
> 
> If the BFN was mapped with the IOMMUOP_swap_mfn flag set in the
> IOMMUOP_map_foreign_page subop by all emulators with mappings of that GFN,
> then the BFN mapping will be swapped to point at a scratch MFN page, and all
> BFN references to the invalid MFN will be removed by Xen after the BFN
> mapping has been updated to point at the scratch MFN page.

I don't understand why, for the 'swap' case, you don't need the emulator to
do an explicit unmap. You can think of 'noswap' (page-A to invalid) as a
special case of 'swap' (page-A to scratch page), since both move away from
the page-A reference. If there is a reason that the emulator needs to do
some cleanup internally before dropping the reference, does 'swap_mfn' break
that situation then?

> 
> The rationale for swapping the BFN mapping to point at scratch pages is to
> enable guest domains to balloon quickly without requiring hypercall(s) from
> emulators.
> 
> Not all BFN mappings can be swapped without potentially causing problems for
> the hardware itself (command rings etc.), so the IOMMUOP_swap_mfn flag is
> used to allow per-BFN control of Xen's ballooning behaviour.

Who will judge whether a BFN mapping can be swapped then?
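If the judgement is the emulator's, per mapping, then I'd expect something
like the line below when building the map op (is_device_visible_ring() is a
stand-in predicate of mine, not from the draft):

    /* Sketch: only pages the device does not treat as control structures
     * are marked swappable to a scratch page on balloon-out. */
    op.flags = is_device_visible_ring(gfn) ? 0 : IOMMUOP_swap_mfn;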

[...]
> Xen PV IOMMU hypercall interface
> --------------------------------
> A two argument hypercall interface (do_iommu_op).
> 
>     ret_t do_iommu_op(XEN_GUEST_HANDLE_PARAM(void) arg, unsigned int count)
> 
> First argument: guest handle pointer to an array of `struct pv_iommu_op`.
> 
> Second argument: unsigned integer count of `struct pv_iommu_op` elements in
> the array.
> 
> Definition of `struct pv_iommu_op`:
> 
>     struct pv_iommu_op {
> 
>         uint16_t subop_id;
>         uint16_t flags;
>         int32_t status;
> 
>         union {
>             struct {
>                 uint64_t bfn;
>                 uint64_t gfn;
>             } map_page;
> 
>             struct {
>                 uint64_t bfn;
>             } unmap_page;
> 
>             struct {
>                 uint64_t bfn;
>                 uint64_t gfn;
>                 uint16_t domid;
>                 ioservid_t ioserver;
>             } map_foreign_page;
> 
>             struct {
>                 uint64_t bfn;
>                 uint64_t gfn;
>                 uint16_t domid;
>                 ioservid_t ioserver;
>             } lookup_foreign_page;
> 
>             struct {
>                 uint64_t bfn;
>                 ioservid_t ioserver;
>             } unmap_foreign_page;
>         } u;
>     };

Do we really need the ioserver ID here? Could it be as simple as looping
over all ioreq servers with INVALIDATE notifications?
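
Separately, a small usage sketch of the batched calling convention as I read
it (do_iommu_op() as a C wrapper and recover_failed_map() are assumed names):

    /* Sketch: batch two local 1:1 mappings in one hypercall; 'status' is
     * per element, so each op must be checked individually. */
    struct pv_iommu_op ops[2] = {
        { .subop_id = IOMMUOP_map_page,
          .u.map_page = { .bfn = 0x1000, .gfn = 0x1000 } },
        { .subop_id = IOMMUOP_map_page,
          .u.map_page = { .bfn = 0x1001, .gfn = 0x1001 } },
    };

    do_iommu_op(ops, 2);              /* hypercall-level return code */
    for (unsigned int i = 0; i < 2; i++)
        if (ops[i].status)            /* per-op error */
            recover_failed_map(&ops[i]);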


[...]
> 
> IOMMUOP_map_page
> ----------------------
> This subop uses `struct map_page` part of the `struct pv_iommu_op`.
> 
> If IOMMU dom0-strict mode is NOT enabled then the hardware domain will be
> allowed to map all GFNs except for Xen owned MFNs; otherwise the hardware
> domain will only be allowed to map GFNs which it owns.

"map all GFNs" -> "map all MFNs" since you use "except for Xen owned MFNs"
later. Since you have a capability called IOMMU_QUERY_map_all_mfns, should
you add such condition in above description?

> 
> If IOMMU dom0-strict mode is NOT enabled then the hardware domain will be
> allowed to map all GFNs without taking a reference to the MFN backing the
> GFN, by setting the IOMMU_MAP_OP_no_ref_cnt flag.

Could you elaborate on when no_ref_cnt is required?

[...]
> 
> IOMMUOP_unmap_page
> ------------------
> This subop uses `struct unmap_page` part of the `struct pv_iommu_op`.
> 
> The subop usage of the `struct pv_iommu_op` and `struct unmap_page` fields
> are detailed below:
> 
> --------------------------------------------------------------------
> Field          Purpose
> -----          -----------------------------------------------------
> `bfn`          [in] Bus address frame number to be unmapped in DOMID_SELF
> 
> `flags`        [in] Flags for signalling page order of unmap operation
> 
> `status`       [out] Mapping status of this unmap operation,
>                0 indicates success
> --------------------------------------------------------------------
> 
> Defined bits for flags field:
> 
> Name                        Bit                Definition
> ----                       -----      ----------------------------------
> IOMMU_UNMAP_OP_remove_m2b    0        Wildcard M2B mapping removed for
>                                       lookup_foreign_page use

Is it explicitly required? Shouldn't it be implicit as long as a valid M2B
entry exists?
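
For reference, I read the local unmap path as simply the sketch below
(the exact flag encoding of the page order is not spelled out in the draft):

    /* Sketch: tear down a local BFN mapping in the calling domain. */
    struct pv_iommu_op op = {
        .subop_id = IOMMUOP_unmap_page,
        .flags = 0,               /* no flags; order encoding unspecified */
        .u.unmap_page = { .bfn = bfn },
    };

    do_iommu_op(&op, 1);
    /* op.status == 0 on success */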


[...]
> IOMMUOP_map_foreign_page
> ------------------------
> This subop uses `struct map_foreign_page` part of the `struct pv_iommu_op`.
> 
> It is not valid to use a domid representing the calling domain.

Then what's being used here to represent the calling domain?

> 
> The hypercall will only succeed if the calling domain has sufficient
> privilege over the specified domid.

How is this privilege check done? Is there an existing mechanism, or is
something new to be added?

> 
> The M2B mechanism maps an MFN to a (BFN, domid, ioserver) tuple.
> 
> Each successful subop will add to the M2B if there was not an existing
> identical M2B entry.
> 
> Every new M2B entry will take a reference to the MFN backing the GFN.
> 
> All the following conditions are required to be true for the PV IOMMU
> map_foreign subop to succeed:
> 
> 1. IOMMU detected and supported by Xen
> 2. The domain has IOMMU controlled hardware allocated to it
> 3. The domain is the hardware_domain and the following Xen IOMMU options are
>    NOT enabled: dom0-passthrough

I would add: 4. The domain has sufficient privilege over the specified domid

[...]
> 
> IOMMU_lookup_foreign_page
> -------------------------
> This subop uses `struct lookup_foreign_page` part of the `struct pv_iommu_op`.
> 
> This subop looks up a BFN mapping for an ioserver + gfn + target domid
> combination.
> 
> The hypercall will only succeed if the calling domain has sufficient
> privilege over the specified domid.
> 
> If a 1:1 mapping of BFN to MFN exists then an M2B entry is added and a
> reference is taken to the underlying MFN. If an existing mapping is present

Then when will this reference be dropped?

> then the BFN is returned and no additional references will be taken to the
> underlying MFN.
> 
> A 1:1 mapping will exist if there is no IOMMU support or if the PV hardware
> domain was booted in dom0-relaxed mode or in dom0-passthrough mode.

What about the hardware domain using IOMMUOPs in the meantime? In that
case, from your earlier description, it's the hardware domain that manages
the BFN address space, while here the 1:1 mapping is a hard assumption in
the hypervisor, so the two together may conflict. There needs to be a
mechanism such that once Xen sees any explicit BFN passed from the hardware
domain, the 1:1 mapping scheme is disabled.

> 
> If there is no IOMMU support then the MFN is returned in the BFN field (that
> is the only valid bus address for the GFN + domid combination).
> 

[...]
> 
> Linux kernel architecture
> =========================
> 
> The Linux kernel will use the PV-IOMMU hypercalls to map its PFN address
> space into the IOMMU. It will map the PFNs into the IOMMU address space
> using a 1:1 mapping; it does this by programming a BFN to GFN mapping which
> matches the PFN to GFN mapping.
> 
> The native SWIOTLB will be used to handle devices which cannot DMA to all of
> the kernel's PFN address space.
> 
> An interface shall be provided for emulator usage of IOMMUOP_*_foreign_page
> subops which will allow the Linux kernel to centrally manage that domain's BFN
> resource and ensure there are no unexpected conflicts.

One open question here: when the IOMMU is enabled, there is supposed to be
an IOVA space created in the Linux kernel. How does this BFN space interact
with that one?
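
On the mechanics of the 1:1 programming itself, I picture a boot-time loop
roughly like this (pfn_to_gfn()/max_pfn are the usual Linux names, but the
loop and batch size are just my sketch):

    /* Sketch: program BFN == PFN for every populated PFN, mirroring the
     * kernel's PFN-to-GFN mapping, in batches of BATCH ops. */
    #define BATCH 128
    static struct pv_iommu_op ops[BATCH];

    void __init pv_iommu_setup_1to1(void)
    {
        unsigned long pfn = 0;

        while (pfn < max_pfn) {
            unsigned int n;

            for (n = 0; n < BATCH && pfn < max_pfn; n++, pfn++) {
                ops[n].subop_id = IOMMUOP_map_page;
                ops[n].flags = 0;
                ops[n].u.map_page.bfn = pfn;              /* 1:1 */
                ops[n].u.map_page.gfn = pfn_to_gfn(pfn);
            }
            do_iommu_op(ops, n);
            /* Real code would check each ops[i].status here. */
        }
    }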

Thanks
Kevin