[Xen-devel] [RFC] Xen PV IOMMU interface draft B
Hi All,

Here is a design for allowing guests to control the IOMMU. This allows for the
guest GFN mapping to be programmed into the IOMMU and avoids using the SWIOTLB
bounce buffer technique in the Linux kernel (except for legacy 32 bit DMA IO
devices).

Draft B has been expanded to include Bus Address mapping/lookup for Mediated
pass-through emulators.

The pandoc markdown format of the document is provided below to allow for
easier inline comments:

% Xen PV IOMMU interface
% Malcolm Crossley <<malcolm.crossley@xxxxxxxxxx>>
  Paul Durrant <<paul.durrant@xxxxxxxxxx>>
% Draft B

Introduction
============

Revision History
----------------

--------------------------------------------------------------------
Version  Date         Changes
-------  -----------  ----------------------------------------------
Draft A  10 Apr 2014  Initial draft.

Draft B  12 Jun 2015  Second draft.
--------------------------------------------------------------------

Background
==========

Linux kernel SWIOTLB
--------------------

Xen PV guests use a Pseudophysical Frame Number (PFN) address space which is
decoupled from the host Machine Frame Number (MFN) address space. PV guest
hardware drivers are only aware of the PFN address space and assume that if
PFN addresses are contiguous then the hardware addresses will be contiguous as
well. The decoupling between PFN and MFN address spaces means PFN and MFN
addresses may not be contiguous across page boundaries and thus a buffer
allocated in PFN address space which spans a page boundary may not be
contiguous in MFN address space.

PV hardware drivers cannot tolerate this behaviour and so a special
"bounce buffer" region is used to hide this issue from the drivers.

A bounce buffer region is a special part of the PFN address space which has
been made to be contiguous in both PFN and MFN address spaces.
When a driver requests that a buffer which spans a page boundary be made
available for hardware to read, the core operating system code copies the
buffer into a temporarily reserved part of the bounce buffer region and then
returns the MFN address of the reserved part of the bounce buffer region back
to the driver. The driver then instructs the hardware to read the copy of the
buffer in the bounce buffer. Similarly, if the driver requests that a buffer be
made available for hardware to write to, a region of the bounce buffer is
reserved first, and after the hardware completes writing, the reserved region
of the bounce buffer is copied to the originally allocated buffer.

The overhead of memory copies to/from the bounce buffer region is high
and damages performance. Furthermore, there is a risk the fixed size
bounce buffer region will become exhausted and it will not be possible to
return a hardware address back to the driver. The Linux kernel drivers do not
tolerate this failure and so the kernel is forced to crash, as an
uncorrectable error has occurred.

Input/Output Memory Management Units (IOMMU) allow for an inbound address
mapping to be created from the I/O Bus address space (typically PCI) to
the machine frame number address space. IOMMUs typically use a page table
mechanism to manage the mappings and therefore can create mappings of page size
granularity or larger.

The I/O Bus address space will be referred to as the Bus Frame Number (BFN)
address space for the rest of this document.

Mediated Pass-through Emulators
-------------------------------

Mediated Pass-through emulators allow guest domains to interact with
hardware devices via emulator mediation. The emulator runs in a domain separate
from the guest domain and it is used to enforce security of guest access to the
hardware devices and isolation of different guests accessing the same hardware
device.
The emulator requires a mechanism to map guest addresses to a bus address that
the hardware devices can access.

Clarification of GFN and BFN fields for different guest types
-------------------------------------------------------------

The definition of Guest Frame Numbers (GFN) varies depending on the guest type.

The diagram below details the memory accesses originating from the CPU, per
guest type:

      HVM guest                              PV guest

         (VA)                                   (VA)
          |                                      |
         MMU                                    MMU
          |                                      |
        (GFN)                                    |
          |                                      | (GFN)
     HAP a.k.a EPT/NPT                           |
          |                                      |
        (MFN)                                  (MFN)
          |                                      |
         RAM                                    RAM

For PV guests GFN is equal to MFN for a single page but not for a contiguous
range of pages.

Bus Frame Numbers (BFN) refer to the address presented on the physical bus
before being translated by the IOMMU.

The diagram below details memory accesses originating from a physical device.

    Physical Device
          |
        (BFN)
          |
       IOMMU-PT
          |
        (MFN)
          |
         RAM

Purpose
=======

1. Allow Xen guests to create/modify/destroy IOMMU mappings for
   hardware devices that the PV guest has access to. This enables the PV guest
   to program a bus address space mapping which matches its GFN mapping. Once
   a 1-1 mapping of PFN to bus address space is created then a bounce buffer
   region is not required for the IO devices connected to the IOMMU.

2. Allow for Xen guests to lookup/create/modify/destroy IOMMU mappings for
   guest memory of domains the calling Xen guest has sufficient privilege over.
   This enables domains to provide mediated hardware acceleration to other
   guest domains.

Xen Architecture
================

The Xen architecture consists of a new hypercall interface and changes to the
grant map interface.

The existing IOMMU mappings set up at domain creation time will be preserved so
that PV domains unaware of this feature will continue to function with no
changes required.

Memory ballooning will be supported by taking an additional reference on the
MFN backing the GFN for each successful IOMMU mapping created.

An M2B tracking structure will be used to ensure all references to an MFN can
be located easily.
Xen PV IOMMU hypercall interface
--------------------------------

A two argument hypercall interface (do_iommu_op).

    ret_t do_iommu_op(XEN_GUEST_HANDLE_PARAM(void) arg, unsigned int count)

First argument, guest handle pointer to array of `struct pv_iommu_op`

Second argument, unsigned integer count of `struct pv_iommu_op` elements in
the array.

Definition of struct pv_iommu_op:

    struct pv_iommu_op {

        uint16_t subop_id;
        uint16_t flags;
        int32_t status;

        union {
            struct {
                uint64_t bfn;
                uint64_t gfn;
            } map_page;

            struct {
                uint64_t bfn;
            } unmap_page;

            struct {
                uint64_t bfn;
                uint64_t gfn;
                uint16_t domid;
                ioservid_t ioserver;
            } map_foreign_page;

            struct {
                uint64_t bfn;
                uint64_t gfn;
                uint16_t domid;
                ioservid_t ioserver;
            } lookup_foreign_page;

            struct {
                uint64_t bfn;
                ioservid_t ioserver;
            } unmap_foreign_page;
        } u;
    };

Definition of PV IOMMU subops:

    #define IOMMUOP_query_caps            1
    #define IOMMUOP_map_page              2
    #define IOMMUOP_unmap_page            3
    #define IOMMUOP_map_foreign_page      4
    #define IOMMUOP_lookup_foreign_page   5
    #define IOMMUOP_unmap_foreign_page    6

Design considerations for hypercall op
--------------------------------------

IOMMU map/unmap operations can be slow and can involve flushing the IOMMU TLB
to ensure the IO device uses the updated mappings.

The op has been designed to take an array of operations and a count as
parameters. This allows hypercall continuations to be implemented easily and
allows batches of IOMMU operations to be submitted before flushing the IOMMU
TLB.

The subop_id to be used for a particular element is encoded into the element
itself. This allows map and unmap operations to be performed in one hypercall
while the IOMMU TLB flushing optimisations are still applied.

The hypercall will ensure that the required IOMMU TLB flushes are applied before
returning to the guest via either hypercall completion or a hypercall
continuation.
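The batching described above can be illustrated from the guest side. The following is only a sketch under stated assumptions, not the proposed implementation: the struct layout and subop numbers are copied from this draft, the flag bit positions are taken from the IOMMUOP_map_page flags table later in this document, `ioservid_t` is assumed to be a 16-bit integer, and the helper names (`pv_iommu_flags`, `pv_iommu_fill_map`, `pv_iommu_fill_unmap`, `pv_iommu_build_remap_batch`) are hypothetical. The hypercall itself is not issued; the code only fills the array that would be handed to do_iommu_op.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

typedef uint16_t ioservid_t;          /* assumption: 16-bit IOREQ server id */

/* Layout copied from the draft. */
struct pv_iommu_op {
    uint16_t subop_id;
    uint16_t flags;
    int32_t status;
    union {
        struct { uint64_t bfn; uint64_t gfn; } map_page;
        struct { uint64_t bfn; } unmap_page;
        struct { uint64_t bfn; uint64_t gfn;
                 uint16_t domid; ioservid_t ioserver; } map_foreign_page;
        struct { uint64_t bfn; uint64_t gfn;
                 uint16_t domid; ioservid_t ioserver; } lookup_foreign_page;
        struct { uint64_t bfn; ioservid_t ioserver; } unmap_foreign_page;
    } u;
};

#define IOMMUOP_map_page        2
#define IOMMUOP_unmap_page      3

/* Bit positions from the map_page flags table; macro names are hypothetical. */
#define IOMMU_OP_readable       (1u << 0)
#define IOMMU_OP_writeable      (1u << 1)
#define IOMMU_PAGE_ORDER_SHIFT  10
#define IOMMU_PAGE_ORDER_MASK   0x3fu   /* bits 10-15 */

/* Combine access bits and a page order into the 16-bit flags field. */
static uint16_t pv_iommu_flags(uint16_t access, unsigned int order)
{
    assert(order <= IOMMU_PAGE_ORDER_MASK);
    return (uint16_t)(access | (order << IOMMU_PAGE_ORDER_SHIFT));
}

static void pv_iommu_fill_map(struct pv_iommu_op *op, uint64_t bfn,
                              uint64_t gfn, uint16_t flags)
{
    memset(op, 0, sizeof(*op));
    op->subop_id = IOMMUOP_map_page;
    op->flags = flags;
    op->u.map_page.bfn = bfn;
    op->u.map_page.gfn = gfn;
}

static void pv_iommu_fill_unmap(struct pv_iommu_op *op, uint64_t bfn,
                                uint16_t flags)
{
    memset(op, 0, sizeof(*op));
    op->subop_id = IOMMUOP_unmap_page;
    op->flags = flags;
    op->u.unmap_page.bfn = bfn;
}

/*
 * Build one batch that removes an old BFN mapping and creates a new
 * read/write one, so both can share a single hypercall and TLB flush.
 * Returns the element count to pass as the second hypercall argument.
 */
static unsigned int pv_iommu_build_remap_batch(struct pv_iommu_op ops[2],
                                               uint64_t old_bfn,
                                               uint64_t new_bfn,
                                               uint64_t gfn)
{
    pv_iommu_fill_unmap(&ops[0], old_bfn, pv_iommu_flags(0, 0));
    pv_iommu_fill_map(&ops[1], new_bfn, gfn,
                      pv_iommu_flags(IOMMU_OP_readable | IOMMU_OP_writeable, 0));
    return 2;
}
```

A caller would pass the filled array and the returned count to do_iommu_op and then inspect each element's `status` field, since every element reports its own result.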
IOMMUOP_query_caps
------------------

This subop queries the runtime capabilities of the PV-IOMMU interface for the
calling domain. This subop uses `struct pv_iommu_op` directly.

------------------------------------------------------------------------------
Field     Purpose
-----     ---------------------------------------------------------------
`flags`   [out] This field details the IOMMUOP capabilities.

`status`  [out] Status of this op, op specific values listed below
------------------------------------------------------------------------------

Defined bits for flags field:

------------------------------------------------------------------------------
Name                      Bit     Definition
------------------------  ------  ----------------------------------
IOMMU_QUERY_map_cap       0       IOMMUOP_map_page or
                                  IOMMUOP_map_foreign_page can be
                                  used for this domain

IOMMU_QUERY_map_all_gfns  1       IOMMUOP_map_page subop can map any
                                  MFN not used by Xen

Reserved for future use   2-9     n/a

IOMMU_page_order          10-15   Returns maximum possible page order
                                  for all other IOMMUOP subops
------------------------------------------------------------------------------

Defined values for query_caps subop status field:

Value   Reason
------  ----------------------------------------------------------
0       subop successfully returned

IOMMUOP_map_page
----------------

This subop uses the `struct map_page` part of the `struct pv_iommu_op`.

If IOMMU dom0-strict mode is NOT enabled then the hardware domain will be
allowed to map all GFNs except for Xen owned MFNs, else the hardware
domain will only be allowed to map GFNs which it owns.

If IOMMU dom0-strict mode is NOT enabled then the hardware domain will be
allowed to map all GFNs without taking a reference to the MFN backing the GFN
by setting the IOMMU_MAP_OP_no_ref_cnt flag.

Every successful pv_iommu_op will result in an additional page reference being
taken on the MFN backing the GFN except for the condition detailed above.
If the map_op flags indicate a writeable mapping is required then a writeable
page type reference will be taken, otherwise a standard page reference will be
taken.

All the following conditions are required to be true for the PV IOMMU map
subop to succeed:

1. IOMMU detected and supported by Xen
2. The domain has IOMMU controlled hardware allocated to it
3. If the domain is the hardware_domain, the following Xen IOMMU options are
   NOT enabled: dom0-passthrough

This subop's usage of the `struct pv_iommu_op` and `struct map_page` fields
is detailed below:

------------------------------------------------------------------------------
Field     Purpose
-----     ---------------------------------------------------------------
`bfn`     [in] Bus address frame number (BFN) to be mapped to the gfn
          specified below

`gfn`     [in] Guest address frame number for DOMID_SELF

`flags`   [in] Flags for signalling type of IOMMU mapping to be created.
          Flags can be combined.

`status`  [out] Mapping status of this op, op specific values listed below
------------------------------------------------------------------------------

Defined bits for flags field:

Name                     Bit    Definition
-----------------------  -----  ----------------------------------
IOMMU_OP_readable        0      Create readable IOMMU mapping
IOMMU_OP_writeable       1      Create writeable IOMMU mapping
IOMMU_MAP_OP_no_ref_cnt  2      IOMMU mapping does not take a reference to MFN backing BFN mapping
Reserved for future use  3-9    n/a
IOMMU_page_order         10-15  Page order to be used for both gfn and bfn

Defined values for map_page subop status field:

Value    Reason
-------  ----------------------------------------------------------------------
0        subop successfully returned
-EIO     IOMMU unit returned error when attempting to map BFN to GFN.
-EPERM   GFN could not be mapped because the GFN belongs to Xen.
-EPERM   Domain is not a hardware domain and GFN does not belong to domain
-EPERM   Domain is a hardware domain, IOMMU dom0-strict mode is enabled and GFN does not belong to domain
-EACCES  BFN address conflicts with RMRR regions for devices attached to DOMID_SELF
-ENOSPC  Page order is too large for either BFN, GFN or IOMMU unit

IOMMUOP_unmap_page
------------------

This subop uses the `struct unmap_page` part of the `struct pv_iommu_op`.

This subop's usage of the `struct pv_iommu_op` and `struct unmap_page` fields
is detailed below:

--------------------------------------------------------------------
Field     Purpose
-----     -----------------------------------------------------
`bfn`     [in] Bus address frame number to be unmapped in DOMID_SELF

`flags`   [in] Flags for signalling page order of unmap operation

`status`  [out] Mapping status of this unmap operation, 0 indicates
          success
--------------------------------------------------------------------

Defined bits for flags field:

Name                     Bit    Definition
-----------------------  -----  ----------------------------------
Reserved for future use  0-9    n/a
IOMMU_page_order         10-15  Page order to be used for bfn

Defined values for unmap_page subop status field:

Error code  Reason
----------  ------------------------------------------------------------
0           subop successfully returned
-EIO        IOMMU unit returned error when attempting to unmap BFN.
-ENOSPC     Page order is too large for either BFN address or IOMMU unit

IOMMUOP_map_foreign_page
------------------------

This subop uses the `struct map_foreign_page` part of the `struct pv_iommu_op`.

It is not valid to use a domid representing the calling domain.

The hypercall will only succeed if the calling domain has sufficient privilege
over the specified domid.

If there is no IOMMU support then the MFN is returned in the BFN field (that is
the only valid bus address for the GFN + domid combination).
If there is IOMMU support then the specified BFN is returned for the GFN +
domid combination.

The M2B mechanism maps an MFN to a (BFN,domid,ioserver) tuple.

Each successful subop will add to the M2B if there was not an existing identical
M2B entry.

Every new M2B entry will take a reference to the MFN backing the GFN.

All the following conditions are required to be true for the PV IOMMU
map_foreign subop to succeed:

1. IOMMU detected and supported by Xen
2. The domain has IOMMU controlled hardware allocated to it
3. The domain is a hardware_domain and the following Xen IOMMU options are
   NOT enabled: dom0-passthrough

This subop's usage of the `struct pv_iommu_op` and `struct map_foreign_page`
fields is detailed below:

--------------------------------------------------------------------
Field       Purpose
-----       -----------------------------------------------------
`domid`     [in] The domain ID for which the gfn field applies

`ioserver`  [in] IOREQ server id associated with mapping

`bfn`       [in] Bus address frame number for gfn address

`gfn`       [in] Guest address frame number

`flags`     [in] Details the status of the BFN mapping

`status`    [out] status of this subop, 0 indicates success
--------------------------------------------------------------------

Defined bits for flags field:

Name                     Bit    Definition
-----------------------  -----  ----------------------------------
IOMMUOP_readable         0      BFN IOMMU mapping is readable
IOMMUOP_writeable        1      BFN IOMMU mapping is writeable
IOMMUOP_swap_mfn         2      BFN IOMMU mapping can be safely swapped to scratch page
Reserved for future use  3-9    Reserved flag bits should be 0
IOMMU_page_order         10-15  Returns maximum possible page order for all other IOMMUOP subops

Defined values for map_foreign_page subop status field:

Error code  Reason
----------  ------------------------------------------------------------
0           subop successfully returned
-EIO        IOMMU unit returned error when attempting to map BFN to GFN.
-EPERM      Calling domain does not have sufficient privilege over domid
-EPERM      GFN could not be mapped because the GFN belongs to Xen.
-EPERM      domid maps to DOMID_SELF
-EACCES     BFN address conflicts with RMRR regions for devices attached to DOMID_SELF
-ENODEV     Provided ioserver id is not valid
-ENXIO      Provided domid is not valid
-ENXIO      Provided GFN address is not valid
-ENOSPC     Page order is too large for either BFN, GFN or IOMMU unit

IOMMUOP_lookup_foreign_page
---------------------------

This subop uses the `struct lookup_foreign_page` part of the
`struct pv_iommu_op`.

If the BFN is specified as an input parameter and there is no IOMMU support
for the calling domain then an error will be returned.

It is the calling domain's responsibility to ensure there are no conflicts.

The hypercall will only succeed if the calling domain has sufficient privilege
over the specified domid.

If there is no IOMMU support then the MFN is returned in the BFN field (that is
the only valid bus address for the GFN + domid combination).

Each successful subop will add to the M2B if there was not an existing identical
M2B entry.

Every new M2B entry will take a reference to the MFN backing the GFN.
This subop's usage of the `struct pv_iommu_op` and `struct lookup_foreign_page`
fields is detailed below:

--------------------------------------------------------------------
Field       Purpose
-----       -----------------------------------------------------
`domid`     [in] The domain ID for which the gfn field applies

`ioserver`  [in] IOREQ server id associated with mapping

`bfn`       [out] Bus address frame number for gfn address

`gfn`       [in] Guest address frame number

`flags`     [out] Details the status of the BFN mapping

`status`    [out] status of this subop, 0 indicates success
--------------------------------------------------------------------

Defined bits for flags field:

Name                     Bit    Definition
-----------------------  -----  ----------------------------------
IOMMUOP_readable         0      Returned BFN IOMMU mapping is readable
IOMMUOP_writeable        1      Returned BFN IOMMU mapping is writeable
Reserved for future use  2-9    Reserved flag bits should be 0
IOMMU_page_order         10-15  Returns maximum possible page order for all other IOMMUOP subops

Defined values for lookup_foreign_page subop status field:

Error code  Reason
----------  ------------------------------------------------------------
0           subop successfully returned
-EPERM      Calling domain does not have sufficient privilege over domid
-ENOENT     There is no available BFN for provided GFN + domid combination
-ENODEV     Provided ioserver id is not valid
-ENXIO      Provided domid is not valid
-ENXIO      Provided GFN address is not valid

IOMMUOP_unmap_foreign_page
--------------------------

This subop uses the `struct unmap_foreign_page` part of the
`struct pv_iommu_op`.

If there is no IOMMU support then the MFN is returned in the BFN field (that is
the only valid bus address for the GFN + domid combination).

If there is IOMMU support then the specified BFN is returned for the GFN +
domid combination.

Each successful subop will add to the M2B if there was not an existing identical
M2B entry. Every new M2B entry will take a reference to the MFN backing the GFN.
This subop's usage of the `struct pv_iommu_op` and `struct unmap_foreign_page`
fields is detailed below:

-----------------------------------------------------------------------
Field       Purpose
-----       --------------------------------------------------------
`ioserver`  [in] IOREQ server id associated with mapping

`bfn`       [in] Bus address frame number for gfn address

`flags`     [in] Flags for signalling page order of unmap operation

`status`    [out] status of this subop, 0 indicates success
-----------------------------------------------------------------------

Defined bits for flags field:

Name                     Bit    Definition
-----------------------  -----  ----------------------------------
Reserved for future use  0-9    n/a
IOMMU_page_order         10-15  Page order to be used for bfn

Defined values for unmap_foreign_page subop status field:

Error code  Reason
----------  ------------------------------------------------------------
0           subop successfully returned
-ENOENT     There is no mapped BFN + ioserver id combination to unmap

IOMMUOP_*_foreign_page interactions with guest domain ballooning
================================================================

Guest domains can balloon out a set of GFN mappings at any time and render the
BFN to GFN mapping invalid.

When a BFN to GFN mapping becomes invalid, Xen will issue a buffered IO request
of type IOREQ_TYPE_INVALIDATE to the affected IOREQ servers with the now invalid
BFN address in the data field. If the buffered IO request ring is full then a
standard (synchronous) IO request of type IOREQ_TYPE_INVALIDATE will be issued
to the affected IOREQ server with the just-invalidated BFN address in the data
field.

The BFN mappings cannot simply be unmapped at the point of the balloon
hypercall, otherwise a malicious guest could deliberately balloon out a GFN
address that is in use by an emulator and trigger IOMMU faults for the domains
with BFN mappings.
For hosts with no IOMMU support:

The affected emulator(s) must specifically issue an IOMMUOP_unmap_foreign_page
subop for the now invalid BFN address so that the references to the underlying
MFN are removed and the MFN can be freed back to the Xen memory allocator.

For hosts with IOMMU support:

If the BFN was mapped without the IOMMUOP_swap_mfn flag set in the
IOMMUOP_map_foreign_page subop then the affected emulator(s) must specifically
issue an IOMMUOP_unmap_foreign_page subop for the now invalid BFN address so
that the references to the underlying MFN are removed.

If the BFN was mapped with the IOMMUOP_swap_mfn flag set in the
IOMMUOP_map_foreign_page subop for all emulators with mappings of that GFN then
the BFN mapping will be swapped to point at a scratch MFN page and all BFN
references to the invalid MFN will be removed by Xen after the BFN mapping has
been updated to point at the scratch MFN page.

The rationale for swapping the BFN mapping to point at scratch pages is to
enable guest domains to balloon quickly without requiring hypercall(s) from
emulators.

Not all BFN mappings can be swapped without potentially causing problems for the
hardware itself (command rings etc.) so the IOMMUOP_swap_mfn flag is used to
allow per BFN control of Xen ballooning behaviour.

PV IOMMU interactions with self ballooning
==========================================

The guest should clear any IOMMU mappings it has of its own pages before
releasing a page back to Xen. It will need to add IOMMU mappings after
repopulating a page with the populate_physmap hypercall.

This requires that IOMMU mappings get a writeable page type reference count and
that guests clear any IOMMU mappings before pinning page table pages.
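The rules in the foreign-page ballooning section for what happens when a guest balloons out a GFN that emulators have mapped can be condensed into a small decision function. This is a sketch of one reading of those rules, not Xen code: the enum, the function name `balloon_action_for_gfn` and the idea of passing the map flags of every emulator mapping as an array are all hypothetical, and only the bit position of IOMMUOP_swap_mfn is taken from the map_foreign_page flags table.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Bit 2 in the map_foreign_page flags table. */
#define IOMMUOP_swap_mfn (1u << 2)

/* Hypothetical names for the two outcomes described in the text. */
enum balloon_action {
    BALLOON_EMULATOR_MUST_UNMAP,   /* emulator issues IOMMUOP_unmap_foreign_page */
    BALLOON_XEN_SWAPS_TO_SCRATCH   /* Xen repoints the BFN at a scratch MFN */
};

/*
 * map_flags holds the flags each emulator used when it mapped the GFN
 * that is being ballooned out; at least one mapping must exist.
 */
static enum balloon_action balloon_action_for_gfn(bool host_has_iommu,
                                                  const uint16_t *map_flags,
                                                  size_t nr_mappings)
{
    size_t i;

    assert(map_flags != NULL && nr_mappings > 0);

    /* Without an IOMMU there is nothing to swap: BFN == MFN. */
    if (!host_has_iommu)
        return BALLOON_EMULATOR_MUST_UNMAP;

    /* Swapping is only possible if every mapping opted in. */
    for (i = 0; i < nr_mappings; i++)
        if (!(map_flags[i] & IOMMUOP_swap_mfn))
            return BALLOON_EMULATOR_MUST_UNMAP;

    return BALLOON_XEN_SWAPS_TO_SCRATCH;
}
```

In the must-unmap case the MFN stays referenced until every affected emulator has reacted to the IOREQ_TYPE_INVALIDATE request, which is why the swap flag is attractive for mappings that can tolerate it.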
Security Implications of allowing domain IOMMU control
======================================================

Xen currently allows IO devices attached to the hardware domain to have direct
access to all of the MFN address space (except Xen hypervisor memory regions),
provided the Xen IOMMU option dom0-strict is not enabled.

The PV IOMMU feature provides the same level of access to the MFN address space
and the feature is not enabled when the Xen IOMMU option dom0-strict is
enabled. Therefore security is not degraded by the PV IOMMU feature.

Domains with physical device(s) assigned which are not hardware domains are only
allowed to map their own GFNs or GFNs for domain(s) they have privilege over.

PV IOMMU interactions with grant map/unmap operations
=====================================================

Grant map operations return a physical device accessible address (BFN) if the
GNTMAP_device_map flag is set. This operation currently returns the MFN for PV
guests which may conflict with the BFN address space the guest uses if PV IOMMU
map support is available to the guest.

This design proposes to allow the calling domain to control the BFN address
that a grant map operation uses.

This can be achieved by specifying that the dev_bus_addr in the
gnttab_map_grant_ref structure is used as an input parameter instead of the
output parameter it is currently.

Only PAGE_SIZE aligned addresses are allowed for the dev_bus_addr input
parameter.

The revised structure is shown below for convenience.

    struct gnttab_map_grant_ref {
        /* IN parameters. */
        uint64_t host_addr;
        uint32_t flags;               /* GNTMAP_* */
        grant_ref_t ref;
        domid_t  dom;
        /* OUT parameters. */
        int16_t  status;              /* => enum grant_status */
        grant_handle_t handle;
        /* IN/OUT parameters */
        uint64_t dev_bus_addr;
    };

The grant map operation would then behave similarly to the IOMMUOP_map_page
subop for the creation of the IOMMU mapping.
The grant unmap operation would then behave similarly to the IOMMUOP_unmap_page
subop for the removal of the IOMMU mapping.

A new grantmap flag would be used to indicate the domain is requesting that the
dev_bus_addr field is used as an input parameter.

    #define _GNTMAP_request_bfn_map      (6)
    #define GNTMAP_request_bfn_map   (1<<_GNTMAP_request_bfn_map)

Linux kernel architecture
=========================

The Linux kernel will use the PV-IOMMU hypercalls to map its PFN address
space into the IOMMU. It will map the PFNs to the IOMMU address space using
a 1:1 mapping; it does this by programming a BFN to GFN mapping which matches
the PFN to GFN mapping.

The native SWIOTLB will be used to handle devices which cannot DMA to all of
the kernel's PFN address space.

An interface shall be provided for emulator usage of IOMMUOP_*_foreign_page
subops which will allow the Linux kernel to centrally manage that domain's BFN
resource and ensure there are no unexpected conflicts.

Emulator usage of PV IOMMU interface
====================================

Emulators which require bus address mapping of guest RAM must first determine if
it's possible for the domain to control the bus addresses themselves.

An IOMMUOP_query_caps subop will return the IOMMU_QUERY_map_cap flag. If this
flag is set then the emulator may specify the BFN address it wishes guest RAM to
be mapped to via the IOMMUOP_map_foreign_page subop. If the flag is not set
then the emulator must use BFN addresses supplied by Xen via the
IOMMUOP_lookup_foreign_page subop.

Operating systems which use the IOMMUOP_map_page subop are expected to provide a
common interface for emulators.

Emulators should unmap unused GFN mappings as often as possible using
IOMMUOP_unmap_foreign_page subops so that guest domains can balloon pages
quickly and efficiently.
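The capability check that an emulator performs before choosing between IOMMUOP_map_foreign_page and IOMMUOP_lookup_foreign_page can be sketched as follows. The subop numbers and bit positions come from the definitions and the query_caps flags table earlier in this document; the function names are hypothetical, and the `caps` argument stands for the `flags` field returned by an IOMMUOP_query_caps subop that has already completed with status 0.

```c
#include <stdint.h>

#define IOMMUOP_map_foreign_page      4
#define IOMMUOP_lookup_foreign_page   5

/* Bit 0 of the query_caps flags field. */
#define IOMMU_QUERY_map_cap           (1u << 0)

/* Bits 10-15 of the query_caps flags field hold the maximum page order. */
static unsigned int query_caps_max_order(uint16_t caps)
{
    return (caps >> 10) & 0x3fu;
}

/*
 * Pick the subop an emulator should use to obtain a BFN for guest RAM:
 * choose its own BFN when the domain may program the IOMMU, otherwise
 * ask Xen which BFN to use.
 */
static uint16_t emulator_foreign_subop(uint16_t caps)
{
    return (caps & IOMMU_QUERY_map_cap) ? IOMMUOP_map_foreign_page
                                        : IOMMUOP_lookup_foreign_page;
}
```

Keeping this decision in one place fits the expectation that the operating system offers emulators a common interface and manages the domain's BFN space centrally.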
Emulators should conform to the ballooning behaviour described in section
"IOMMUOP_*_foreign_page interactions with guest domain ballooning" so that guest
domains are able to effectively balloon memory out and in.

Emulators must unmap any active BFN mappings when they shut down.