 
	
[Xen-devel] [RFC] Xen PV IOMMU interface draft B
 Hi All,
Here is a design for allowing guests to control the IOMMU. This
allows for the guest GFN mapping to be programmed into the IOMMU and
avoid using the SWIOTLB bounce buffer technique in the Linux kernel
(except for legacy 32 bit DMA IO devices).
Draft B has been expanded to include Bus Address mapping/lookup for Mediated
pass-through emulators.
The pandoc markdown format of the document is provided below to allow
for easier inline comments:
% Xen PV IOMMU interface
% Malcolm Crossley <<malcolm.crossley@xxxxxxxxxx>>
  Paul Durrant <<paul.durrant@xxxxxxxxxx>>
% Draft B
Introduction
============
Revision History
----------------
--------------------------------------------------------------------
Version  Date         Changes
-------  -----------  ----------------------------------------------
Draft A  10 Apr 2014  Initial draft.
Draft B  12 Jun 2015  Second draft.
--------------------------------------------------------------------
Background
==========
Linux kernel SWIOTLB
--------------------
Xen PV guests use a Pseudophysical Frame Number (PFN) address space which is
decoupled from the host Machine Frame Number (MFN) address space.
PV guest hardware drivers are only aware of the PFN address space and
assume that if PFN addresses are contiguous then the hardware addresses are
contiguous as well. Because the PFN and MFN address spaces are decoupled,
pages that are adjacent in PFN space may not be adjacent in MFN space, and thus a
buffer allocated in PFN address space which spans a page boundary may not be
contiguous in MFN address space.
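The contiguity problem can be illustrated with a minimal sketch. It assumes a hypothetical flat `p2m` array in which `p2m[pfn]` holds the MFN backing that PFN; the function name and the table layout are illustrative only and are not part of any Xen or Linux interface.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/*
 * Hypothetical PFN-to-MFN table: p2m[pfn] is the MFN backing that PFN.
 * Returns true when the nr_pages pages starting at first_pfn are also
 * adjacent in MFN space, i.e. a device could DMA to them as one range.
 */
bool range_is_mfn_contiguous(const uint64_t *p2m, size_t first_pfn,
                             size_t nr_pages)
{
    for (size_t i = 1; i < nr_pages; i++) {
        if (p2m[first_pfn + i] != p2m[first_pfn] + i)
            return false;
    }
    return true;
}
```

A single page is always reported as contiguous; only buffers that cross a page boundary can fail the check, which is the case the bounce buffer exists for.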
PV hardware drivers cannot tolerate this behaviour and so a special
"bounce buffer" region is used to hide this issue from the drivers.
A bounce buffer region is a special part of the PFN address space which has
been made to be contiguous in both PFN and MFN address spaces. When a driver
requests that a buffer which spans a page boundary be made available for hardware
to read, the core operating system code copies the buffer into a temporarily
reserved part of the bounce buffer region and then returns the MFN address of
the reserved part of the bounce buffer region to the driver. The
driver then instructs the hardware to read the copy of the buffer in the
bounce buffer. Similarly, if the driver requests that a buffer be made available
for hardware to write to, a region of the bounce buffer is first reserved,
and after the hardware completes writing, the reserved region of the
bounce buffer is copied to the originally allocated buffer.
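The copy-in and copy-out steps described above can be sketched as a toy model. This is not the Linux SWIOTLB code: the pool layout, the slot size, and the names `bounce_map` and `bounce_unmap` are assumptions made for illustration.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#define SLOT_SIZE 4096u  /* assumed slot size */
#define NR_SLOTS  4u     /* assumed, deliberately small fixed pool */

struct bounce_pool {
    uint8_t data[NR_SLOTS][SLOT_SIZE]; /* stands in for the contiguous region */
    uint8_t in_use[NR_SLOTS];
};

/*
 * Reserve a slot. If the device is going to read the buffer, copy the
 * driver's data into the slot first. Returns the slot index, or -1 when
 * the request is too large or the fixed-size pool is exhausted.
 */
int bounce_map(struct bounce_pool *pool, const void *buf, size_t len,
               int device_reads)
{
    if (len > SLOT_SIZE)
        return -1;
    for (unsigned int i = 0; i < NR_SLOTS; i++) {
        if (!pool->in_use[i]) {
            pool->in_use[i] = 1;
            if (device_reads)
                memcpy(pool->data[i], buf, len);
            return (int)i;
        }
    }
    return -1;
}

/* Release a slot. If the device wrote to it, copy the result back first. */
void bounce_unmap(struct bounce_pool *pool, int slot, void *buf, size_t len,
                  int device_wrote)
{
    if (device_wrote)
        memcpy(buf, pool->data[slot], len);
    pool->in_use[slot] = 0;
}
```

Every mapping costs one `memcpy` in at least one direction, and `bounce_map` has a failure path once all slots are reserved.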
The overhead of memory copies to/from the bounce buffer region is high
and damages performance. Furthermore, there is a risk the fixed size
bounce buffer region will become exhausted and it will not be possible to
return a hardware address back to the driver. The Linux kernel drivers do not
tolerate this failure and so the kernel is forced to crash, as an
uncorrectable error has occurred.
Input/Output Memory Management Units (IOMMU) allow for an inbound address
mapping to be created from the I/O Bus address space (typically PCI) to
the machine frame number address space. IOMMUs typically use a page table
mechanism to manage the mappings and therefore can create mappings of page size
granularity or larger.
The I/O Bus address space will be referred to as the Bus Frame Number (BFN)
address space for the rest of this document.
Mediated Pass-through Emulators
-------------------------------
Mediated Pass-through emulators allow guest domains to interact with
hardware devices via emulator mediation. The emulator runs in a domain separate
to the guest domain and it is used to enforce security of guest access to the
hardware devices and isolation of different guests accessing the same hardware
device.
The emulator requires a mechanism to map guest addresses to a bus address that
the hardware devices can access.
Clarification of GFN and BFN fields for different guest types
-------------------------------------------------------------
The definition of a Guest Frame Number (GFN) varies depending on the guest type.
The diagram below details the memory accesses originating from the CPU, per guest type:
      HVM guest                              PV guest
         (VA)                                   (VA)
          |                                      |
         MMU                                    MMU
          |                                      |
         (GFN)                                   |
          |                                      | (GFN)
     HAP a.k.a EPT/NPT                           |
          |                                      |
         (MFN)                                  (MFN)
          |                                      |
         RAM                                    RAM
For PV guests GFN is equal to MFN for a single page but not for a contiguous
range of pages.
Bus Frame Numbers (BFN) refer to the address presented on the physical bus
before being translated by the IOMMU.
The diagram below details memory accesses originating from a physical device.
    Physical Device
          |
        (BFN)
          |
       IOMMU-PT
          |
        (MFN)
          |
         RAM
Purpose
=======
1. Allow Xen guests to create/modify/destroy IOMMU mappings for
hardware devices that the PV guest has access to. This enables the PV guest to
program a bus address space mapping which matches its GFN mapping. Once a 1:1
mapping of PFN to bus address space is created then a bounce buffer
region is not required for the IO devices connected to the IOMMU.
2. Allow for Xen guests to lookup/create/modify/destroy IOMMU mappings for
guest memory of domains the calling Xen guest has sufficient privilege over.
This enables domains to provide mediated hardware acceleration to other
guest domains.
Xen Architecture
================
The Xen architecture consists of a new hypercall interface and changes to the
grant map interface.
The existing IOMMU mappings setup at domain creation time will be preserved so
that PV domains unaware of this feature will continue to function with no
changes required.
Memory ballooning will be supported by taking an additional reference on the
MFN backing the GFN for each successful IOMMU mapping created.
An M2B tracking structure will be used to ensure all references to an MFN can
be located easily.
Xen PV IOMMU hypercall interface
--------------------------------
The interface is a two-argument hypercall, do_iommu_op:
    ret_t do_iommu_op(XEN_GUEST_HANDLE_PARAM(void) arg, unsigned int count)
The first argument is a guest handle pointer to an array of `struct pv_iommu_op`.
The second argument is an unsigned integer count of the `struct pv_iommu_op`
elements in the array.
Definition of struct pv_iommu_op:
    struct pv_iommu_op {
        uint16_t subop_id;
        uint16_t flags;
        int32_t status;
        union {
            struct {
                uint64_t bfn;
                uint64_t gfn;
            } map_page;
            struct {
                uint64_t bfn;
            } unmap_page;
            struct {
                uint64_t bfn;
                uint64_t gfn;
                uint16_t domid;
                ioservid_t ioserver;
            } map_foreign_page;
            struct {
                uint64_t bfn;
                uint64_t gfn;
                uint16_t domid;
                ioservid_t ioserver;
            } lookup_foreign_page;
            struct {
                uint64_t bfn;
                ioservid_t ioserver;
            } unmap_foreign_page;
        } u;
    };
Definition of PV IOMMU subops:
    #define IOMMUOP_query_caps            1
    #define IOMMUOP_map_page              2
    #define IOMMUOP_unmap_page            3
    #define IOMMUOP_map_foreign_page      4
    #define IOMMUOP_lookup_foreign_page   5
    #define IOMMUOP_unmap_foreign_page    6
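As an illustration of how a caller might populate elements of the array, the sketch below fills one unmap and one map element. The structure and subop numbers are copied from the definitions above; the `ioservid_t` typedef width and the helper names `fill_map_page` and `fill_unmap_page` are assumptions, and the hypercall itself is not invoked.

```c
#include <stdint.h>
#include <string.h>

typedef uint16_t ioservid_t; /* assumed width; not defined in this document */

struct pv_iommu_op {
    uint16_t subop_id;
    uint16_t flags;
    int32_t status;
    union {
        struct { uint64_t bfn; uint64_t gfn; } map_page;
        struct { uint64_t bfn; } unmap_page;
        struct { uint64_t bfn; uint64_t gfn; uint16_t domid;
                 ioservid_t ioserver; } map_foreign_page;
        struct { uint64_t bfn; uint64_t gfn; uint16_t domid;
                 ioservid_t ioserver; } lookup_foreign_page;
        struct { uint64_t bfn; ioservid_t ioserver; } unmap_foreign_page;
    } u;
};

#define IOMMUOP_map_page    2
#define IOMMUOP_unmap_page  3

/* Hypothetical helper: prepare one IOMMUOP_map_page element. */
void fill_map_page(struct pv_iommu_op *op, uint64_t bfn, uint64_t gfn,
                   uint16_t flags)
{
    memset(op, 0, sizeof(*op));
    op->subop_id = IOMMUOP_map_page;
    op->flags = flags;
    op->u.map_page.bfn = bfn;
    op->u.map_page.gfn = gfn;
}

/* Hypothetical helper: prepare one IOMMUOP_unmap_page element. */
void fill_unmap_page(struct pv_iommu_op *op, uint64_t bfn, uint16_t flags)
{
    memset(op, 0, sizeof(*op));
    op->subop_id = IOMMUOP_unmap_page;
    op->flags = flags;
    op->u.unmap_page.bfn = bfn;
}
```

A caller would fill consecutive array elements with these helpers and pass the array and its length as the two hypercall arguments.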
Design considerations for hypercall op
-------------------------------------------
IOMMU map/unmap operations can be slow and can involve flushing the IOMMU TLB
to ensure the IO device uses the updated mappings.
The op has been designed to take an array of operations and a count as
parameters. This allows for easily implemented hypercall continuations to be
used and allows for batches of IOMMU operations to be submitted before flushing
the IOMMU TLB.
The subop_id to be used for a particular element is encoded into the element
itself. This allows for map and unmap operations to be performed in one
hypercall and for the IOMMU TLB flushing optimisations to be still applied.
The hypercall will ensure that the required IOMMU TLB flushes are applied before
returning to guest via either hypercall completion or a hypercall continuation.
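The batching idea can be sketched with a toy model of the handler side. This is not Xen code: `struct toy_op` is a reduced stand-in for `struct pv_iommu_op`, the page table is a small array, 0 is used to mean "unmapped", and -1 is a placeholder error value. Hypercall continuations are not modelled.

```c
#include <stddef.h>
#include <stdint.h>

#define OP_MAP    2   /* mirrors IOMMUOP_map_page */
#define OP_UNMAP  3   /* mirrors IOMMUOP_unmap_page */
#define NR_BFNS   16  /* assumed size of the toy bus address space */

struct toy_op {       /* reduced stand-in for struct pv_iommu_op */
    uint16_t subop_id;
    int32_t  status;
    uint64_t bfn;
    uint64_t gfn;
};

struct toy_iommu {
    uint64_t pt[NR_BFNS];      /* bfn -> gfn, 0 means unmapped here */
    unsigned int tlb_flushes;  /* counts simulated IOMMU TLB flushes */
};

/* Apply every element, record a per-element status, then flush once. */
void toy_iommu_op(struct toy_iommu *iommu, struct toy_op *ops, size_t count)
{
    int dirty = 0;

    for (size_t i = 0; i < count; i++) {
        struct toy_op *op = &ops[i];

        if (op->bfn >= NR_BFNS ||
            (op->subop_id != OP_MAP && op->subop_id != OP_UNMAP)) {
            op->status = -1;   /* placeholder error value */
            continue;
        }
        iommu->pt[op->bfn] = (op->subop_id == OP_MAP) ? op->gfn : 0;
        op->status = 0;
        dirty = 1;
    }
    if (dirty)
        iommu->tlb_flushes++;  /* one flush covers the whole batch */
}
```

Because each element carries its own subop id and status, a mixed batch of maps and unmaps costs a single flush in this model, and a failing element does not prevent later elements from being processed.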
IOMMUOP_query_caps
------------------
This subop queries the runtime capabilities of the PV-IOMMU interface for the
calling domain. This subop uses `struct pv_iommu_op` directly.
------------------------------------------------------------------------------
Field          Purpose
-----          ---------------------------------------------------------------
`flags`        [out] This field details the IOMMUOP capabilities.
`status`       [out] Status of this op, op specific values listed below
------------------------------------------------------------------------------
Defined bits for flags field:
------------------------------------------------------------------------------
Name                        Bit                Definition
----                       ------     ----------------------------------
IOMMU_QUERY_map_cap          0        IOMMUOP_map_page or
                                      IOMMUOP_map_foreign_page can be used
                                      for this domain
IOMMU_QUERY_map_all_gfns     1        IOMMUOP_map_page subop can map any MFN
                                      not used by Xen
Reserved for future use     2-9                   n/a
IOMMU_page_order           10-15      Returns maximum possible page order for
                                      all other IOMMUOP subops
------------------------------------------------------------------------------
Defined values for query_caps subop status field:
Value   Reason
------  ----------------------------------------------------------
0       subop successfully returned
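A caller could decode the returned flags field along the following lines. The bit positions come from the table above; treating the names as bit masks, and the names `struct iommu_caps` and `decode_query_caps`, are assumptions for illustration.

```c
#include <stdbool.h>
#include <stdint.h>

/* Bit positions from the table above, expressed as masks (an assumption). */
#define IOMMU_QUERY_map_cap       (1u << 0)
#define IOMMU_QUERY_map_all_gfns  (1u << 1)
#define IOMMU_page_order_shift    10     /* hypothetical helper name */
#define IOMMU_page_order_mask     0x3fu  /* bits 10-15 are six bits wide */

struct iommu_caps {
    bool can_map;
    bool can_map_all_gfns;
    unsigned int max_page_order;
};

/* Split the query_caps flags field into its documented parts. */
struct iommu_caps decode_query_caps(uint16_t flags)
{
    struct iommu_caps caps = {
        .can_map = (flags & IOMMU_QUERY_map_cap) != 0,
        .can_map_all_gfns = (flags & IOMMU_QUERY_map_all_gfns) != 0,
        .max_page_order =
            (flags >> IOMMU_page_order_shift) & IOMMU_page_order_mask,
    };
    return caps;
}
```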
IOMMUOP_map_page
----------------------
This subop uses `struct map_page` part of the `struct pv_iommu_op`.
If IOMMU dom0-strict mode is NOT enabled then the hardware domain will be
allowed to map all GFNs except for Xen owned MFNs, else the hardware
domain will only be allowed to map GFNs which it owns.
If IOMMU dom0-strict mode is NOT enabled then the hardware domain will be
allowed to map all GFNs without taking a reference to the MFN backing the GFN
by setting the IOMMU_MAP_OP_no_ref_cnt flag.
Every successful pv_iommu_op will result in an additional page reference being
taken on the MFN backing the GFN except for the condition detailed above.
If the map_op flags indicate a writeable mapping is required then a writeable
page type reference will be taken otherwise a standard page reference will be
taken.
All the following conditions are required to be true for PV IOMMU map
subop to succeed:
1. IOMMU detected and supported by Xen
2. The domain has IOMMU controlled hardware allocated to it
3. If the domain is the hardware domain, the Xen IOMMU option
   dom0-passthrough is NOT enabled
This subop's usage of the `struct pv_iommu_op` and `struct map_page` fields
is detailed below:
------------------------------------------------------------------------------
Field          Purpose
-----          ---------------------------------------------------------------
`bfn`          [in]  Bus address frame number(BFN) to be mapped to specified gfn
                     below
`gfn`          [in]  Guest address frame number for DOMID_SELF
`flags`        [in]  Flags for signalling type of IOMMU mapping to be created,
                     Flags can be combined.
`status`       [out] Mapping status of this op, op specific values listed below
------------------------------------------------------------------------------
Defined bits for flags field:
Name                        Bit                Definition
----                       -----      ----------------------------------
IOMMU_OP_readable            0        Create readable IOMMU mapping
IOMMU_OP_writeable           1        Create writeable IOMMU mapping
IOMMU_MAP_OP_no_ref_cnt      2        IOMMU mapping does not take a reference to
                                      MFN backing BFN mapping
Reserved for future use     3-9                   n/a
IOMMU_page_order            10-15     Page order to be used for both gfn and bfn
Defined values for map_page subop status field:
Value   Reason
------  ----------------------------------------------------------------------
0       subop successfully returned
-EIO    IOMMU unit returned error when attempting to map BFN to GFN.
-EPERM  GFN could not be mapped because the GFN belongs to Xen.
-EPERM  Domain is not a hardware domain and GFN does not belong to domain
-EPERM  Domain is a hardware domain, IOMMU dom0-strict mode is enabled and
        GFN does not belong to domain
-EACCES BFN address conflicts with RMRR regions for devices attached to
        DOMID_SELF
-ENOSPC Page order is too large for either BFN, GFN or IOMMU unit
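The flags field for a map request combines permission bits with a page order in bits 10-15. A minimal encoding sketch follows; treating the flag names as masks and the helper name `encode_map_flags` are assumptions, and the -1 return is a local convention rather than one of the status values above.

```c
#include <stdint.h>

/* Bit positions from the map_page flags table, expressed as masks. */
#define IOMMU_OP_readable        (1u << 0)
#define IOMMU_OP_writeable       (1u << 1)
#define IOMMU_MAP_OP_no_ref_cnt  (1u << 2)

/*
 * Build the 16-bit flags value for IOMMUOP_map_page.
 * Returns 0 on success, or -1 if perms uses bits outside 0-2 or the
 * page order does not fit in the six bits reserved for it.
 */
int encode_map_flags(unsigned int perms, unsigned int page_order,
                     uint16_t *flags)
{
    if (page_order > 0x3fu || (perms & ~0x7u) != 0)
        return -1;
    *flags = (uint16_t)(perms | (page_order << 10));
    return 0;
}
```

For example, a readable and writeable mapping of order 9 encodes as 0x2403.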
IOMMUOP_unmap_page
------------------
This subop uses `struct unmap_page` part of the `struct pv_iommu_op`.
This subop's usage of the `struct pv_iommu_op` and `struct unmap_page` fields
is detailed below:
--------------------------------------------------------------------
Field          Purpose
-----          -----------------------------------------------------
`bfn`          [in] Bus address frame number to be unmapped in DOMID_SELF
`flags`        [in] Flags for signalling page order of unmap operation
`status`       [out] Mapping status of this unmap operation, 0 indicates success
--------------------------------------------------------------------
Defined bits for flags field:
Name                        Bit                Definition
----                       -----      ----------------------------------
Reserved for future use     0-9                   n/a
IOMMU_page_order            10-15     Page order to be used for bfn
Defined values for unmap_page subop status field:
Error code  Reason
----------  ------------------------------------------------------------
0            subop successfully returned
-EIO         IOMMU unit returned error when attempting to unmap BFN.
-ENOSPC      Page order is too large for either BFN address or IOMMU unit
------------------------------------------------------------------------
IOMMUOP_map_foreign_page
----------------
This subop uses `struct map_foreign_page` part of the `struct pv_iommu_op`.
It is not valid to use a domid representing the calling domain.
The hypercall will only succeed if the calling domain has sufficient privilege over
the specified domid.
If there is no IOMMU support then the MFN is returned in the BFN field (that is
the only valid bus address for the GFN + domid combination).
If there is IOMMU support then the specified BFN is returned for the GFN + domid
combination.
The M2B mechanism maps an MFN to a (BFN, domid, ioserver) tuple.
Each successful subop will add to the M2B if there was not an existing identical
M2B entry.
Every new M2B entry will take a reference to the MFN backing the GFN.
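The add-if-not-identical rule can be sketched with a toy table. The real M2B structure is not specified in this draft, so the fixed-size array, the single reference counter, and the name `m2b_add` are all assumptions.

```c
#include <stddef.h>
#include <stdint.h>

#define M2B_MAX 8 /* assumed capacity of the toy table */

struct m2b_entry {
    uint64_t mfn;
    uint64_t bfn;
    uint16_t domid;
    uint16_t ioserver;
};

struct m2b {
    struct m2b_entry entries[M2B_MAX];
    size_t used;
    unsigned int page_refs; /* stands in for references taken on MFNs */
};

/*
 * Record an MFN -> (BFN, domid, ioserver) entry.
 * Returns 1 if a new entry was added (and a reference taken),
 * 0 if an identical entry already existed, -1 if the toy table is full.
 */
int m2b_add(struct m2b *t, uint64_t mfn, uint64_t bfn, uint16_t domid,
            uint16_t ioserver)
{
    for (size_t i = 0; i < t->used; i++) {
        const struct m2b_entry *e = &t->entries[i];

        if (e->mfn == mfn && e->bfn == bfn && e->domid == domid &&
            e->ioserver == ioserver)
            return 0;
    }
    if (t->used == M2B_MAX)
        return -1;
    t->entries[t->used++] = (struct m2b_entry){ mfn, bfn, domid, ioserver };
    t->page_refs++;
    return 1;
}
```

Repeating an identical request leaves the reference count unchanged, while a request that differs in any tuple member takes a further reference.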
All the following conditions are required to be true for PV IOMMU map_foreign
subop to succeed:
1. IOMMU detected and supported by Xen
2. The domain has IOMMU controlled hardware allocated to it
3. The domain is a hardware_domain and the following Xen IOMMU options are
   NOT enabled: dom0-passthrough
This subop's usage of the `struct pv_iommu_op` and `struct map_foreign_page`
fields is detailed below:
--------------------------------------------------------------------
Field          Purpose
-----          -----------------------------------------------------
`domid`        [in] The domain ID for which the gfn field applies
`ioserver`     [in] IOREQ server id associated with mapping
`bfn`          [in] Bus address frame number for gfn address
`gfn`          [in] Guest address frame number
`flags`        [in] Details the status of the BFN mapping
`status`       [out] status of this subop, 0 indicates success
--------------------------------------------------------------------
Defined bits for flags field:
Name                         Bit                Definition
----                        -----      ----------------------------------
IOMMUOP_readable              0        BFN IOMMU mapping is readable
IOMMUOP_writeable             1        BFN IOMMU mapping is writeable
IOMMUOP_swap_mfn              2        BFN IOMMU mapping can be safely
                                       swapped to scratch page
Reserved for future use      3-9       Reserved flag bits should be 0
IOMMU_page_order            10-15      Returns maximum possible page order for
                                       all other IOMMUOP subops
Defined values for map_foreign_page subop status field:
Error code  Reason
----------  ------------------------------------------------------------
0            subop successfully returned
-EIO         IOMMU unit returned error when attempting to map BFN to GFN.
-EPERM       Calling domain does not have sufficient privilege over domid
-EPERM       GFN could not be mapped because the GFN belongs to Xen.
-EPERM       domid maps to DOMID_SELF
-EACCES      BFN address conflicts with RMRR regions for devices attached to
             DOMID_SELF
-ENODEV      Provided ioserver id is not valid
-ENXIO       Provided domid is not valid
-ENXIO       Provided GFN address is not valid
-ENOSPC      Page order is too large for either BFN, GFN or IOMMU unit
IOMMUOP_lookup_foreign_page
----------------
This subop uses `struct lookup_foreign_page` part of the `struct pv_iommu_op`.
If the BFN is specified as an input parameter and there is no IOMMU support
for the calling domain then an error will be returned.
It is the calling domain's responsibility to ensure there are no conflicts.
The hypercall will only succeed if the calling domain has sufficient privilege over
the specified domid.
If there is no IOMMU support then the MFN is returned in the BFN field (that is
the only valid bus address for the GFN + domid combination).
Each successful subop will add to the M2B if there was not an existing identical
M2B entry.
Every new M2B entry will take a reference to the MFN backing the GFN.
This subop's usage of the `struct pv_iommu_op` and `struct lookup_foreign_page`
fields is detailed below:
--------------------------------------------------------------------
Field          Purpose
-----          -----------------------------------------------------
`domid`        [in] The domain ID for which the gfn field applies
`ioserver`     [in] IOREQ server id associated with mapping
`bfn`          [out] Bus address frame number for gfn address
`gfn`          [in] Guest address frame number
`flags`        [out] Details the status of the BFN mapping
`status`       [out] status of this subop, 0 indicates success
--------------------------------------------------------------------
Defined bits for flags field:
Name                         Bit                Definition
----                        -----      ----------------------------------
IOMMUOP_readable              0        Returned BFN IOMMU mapping is readable
IOMMUOP_writeable             1        Returned BFN IOMMU mapping is writeable
Reserved for future use      2-9       Reserved flag bits should be 0
IOMMU_page_order            10-15      Returns maximum possible page order for
                                       all other IOMMUOP subops
Defined values for lookup_foreign_page subop status field:
Error code  Reason
----------  ------------------------------------------------------------
0            subop successfully returned
-EPERM       Calling domain does not have sufficient privilege over domid
-ENOENT      There is no available BFN for provided GFN + domid combination
-ENODEV      Provided ioserver id is not valid
-ENXIO       Provided domid is not valid
-ENXIO       Provided GFN address is not valid
IOMMUOP_unmap_foreign_page
----------------
This subop uses `struct unmap_foreign_page` part of the `struct pv_iommu_op`.
If there is no IOMMU support then the MFN is returned in the BFN field (that is
the only valid bus address for the GFN + domid combination).
If there is IOMMU support then the specified BFN is returned for the GFN + domid
combination.
Each successful subop will add to the M2B if there was not an existing identical
M2B entry.
Every new M2B entry will take a reference to the MFN backing the GFN.
This subop's usage of the `struct pv_iommu_op` and `struct unmap_foreign_page`
fields is detailed below:
-----------------------------------------------------------------------
Field          Purpose
-----          --------------------------------------------------------
`ioserver`      [in] IOREQ server id associated with mapping
`bfn`          [in] Bus address frame number for gfn address
`flags`        [out] Flags for signalling page order of unmap operation
`status`       [out] status of this subop, 0 indicates success
-----------------------------------------------------------------------
Defined bits for flags field:
Name                        Bit                Definition
----                        -----      ----------------------------------
Reserved for future use     0-9                   n/a
IOMMU_page_order            10-15     Page order to be used for bfn
Defined values for unmap_foreign_page subop status field:
Error code  Reason
----------  ------------------------------------------------------------
0            subop successfully returned
-ENOENT      There is no mapped BFN + ioserver id combination to unmap
IOMMUOP_*_foreign_page interactions with guest domain ballooning
================================================================
Guest domains can balloon out a set of GFN mappings at any time and render the
BFN to GFN mapping invalid.
When a BFN to GFN mapping becomes invalid, Xen will issue a buffered IO request
of type IOREQ_TYPE_INVALIDATE to the affected IOREQ servers with the now invalid
BFN address in the data field. If the buffered IO request ring is full then a
standard (synchronous) IO request of type IOREQ_TYPE_INVALIDATE will be issued
to the affected IOREQ server with the just-invalidated BFN address in the data
field.
The BFN mappings cannot simply be unmapped at the point of the balloon hypercall,
otherwise a malicious guest could deliberately balloon out a GFN address
in use by an emulator and trigger IOMMU faults for the domains with BFN
mappings.
For hosts with no IOMMU support: The affected emulator(s) must specifically
issue an IOMMUOP_unmap_foreign_page subop for the now invalid BFN address so that
the references to the underlying MFN are removed and the MFN can be freed back
to the Xen memory allocator.
For hosts with IOMMU support:
If the BFN was mapped without the IOMMUOP_swap_mfn flag set in the
IOMMUOP_map_foreign_page subop then the affected emulator(s) must
specifically issue an IOMMUOP_unmap_foreign_page subop for the now invalid BFN
address so that the references to the underlying MFN are removed.
If the BFN was mapped with the IOMMUOP_swap_mfn flag set in the
IOMMUOP_map_foreign_page subop for all emulators with mappings of that GFN then
the BFN mapping will be swapped to point at a scratch MFN page and all BFN
references to the invalid MFN will be removed by Xen after the BFN mapping has
been updated to point at the scratch MFN page.
The rationale for swapping the BFN mapping to point at scratch pages is to
enable guest domains to balloon quickly without requiring hypercall(s) from
emulators.
Not all BFN mappings can be swapped without potentially causing problems for the
hardware itself (command rings etc.) so the IOMMUOP_swap_mfn flag is used to
allow per BFN control of Xen ballooning behaviour.
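The cases in this section reduce to a small decision, sketched below. The enum and function names are hypothetical; the second parameter stands for "every emulator with a mapping of that GFN set IOMMUOP_swap_mfn", as required above.

```c
#include <stdbool.h>

enum balloon_action {
    EMULATOR_MUST_UNMAP,   /* emulator issues IOMMUOP_unmap_foreign_page */
    XEN_SWAPS_TO_SCRATCH,  /* Xen repoints the BFN at a scratch MFN page */
};

/*
 * Decide who cleans up a BFN mapping after the guest balloons out the
 * GFN behind it, following the cases described in this section.
 */
enum balloon_action on_gfn_ballooned(bool host_has_iommu,
                                     bool all_mappings_have_swap_mfn)
{
    if (host_has_iommu && all_mappings_have_swap_mfn)
        return XEN_SWAPS_TO_SCRATCH;
    return EMULATOR_MUST_UNMAP;
}
```

In every case other than the swap path the emulator still has to issue the unmap itself before the MFN reference is released.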
PV IOMMU interactions with self ballooning
==========================================
The guest should clear any IOMMU mappings it has of its own pages before
releasing a page back to Xen. It will need to add IOMMU mappings after
repopulating a page with the populate_physmap hypercall.
This requires that IOMMU mappings get a writeable page type reference count and
that guests clear any IOMMU mappings before pinning page table pages.
Security Implications of allowing domain IOMMU control
===============================================================
Xen currently allows IO devices attached to the hardware domain to have direct
access to all of the MFN address space (except Xen hypervisor memory regions),
provided the Xen IOMMU option dom0-strict is not enabled.
The PV IOMMU feature provides the same level of access to MFN address space
and the feature is not enabled when the Xen IOMMU option dom0-strict is
enabled. Therefore security is not degraded by the PV IOMMU feature.
Domains with physical device(s) assigned which are not hardware domains are only
allowed to map their own GFNs or GFNs for domain(s) they have privilege over.
PV IOMMU interactions with grant map/unmap operations
=====================================================
Grant map operations return a Physical device accessible address (BFN) if the
GNTMAP_device_map flag is set.  This operation currently returns the MFN for PV
guests which may conflict with the BFN address space the guest uses if PV IOMMU
map support is available to the guest.
This design proposes to allow the calling domain to control the BFN address that
a grant map operation uses.
This can be achieved by specifying that the dev_bus_addr in the
gnttab_map_grant_ref structure is used as an input parameter instead of the
output parameter it is currently.
Only PAGE_SIZE aligned addresses are allowed for dev_bus_addr input parameter.
The revised structure is shown below for convenience.
    struct gnttab_map_grant_ref {
        /* IN parameters. */
        uint64_t host_addr;
        uint32_t flags;               /* GNTMAP_* */
        grant_ref_t ref;
        domid_t  dom;
        /* OUT parameters. */
        int16_t  status;              /* => enum grant_status */
        grant_handle_t handle;
        /* IN/OUT parameters */
        uint64_t dev_bus_addr;
    };
The grant map operation would then behave similarly to the IOMMUOP_map_page
subop for the creation of the IOMMU mapping.
The grant unmap operation would then behave similarly to the IOMMUOP_unmap_page
subop for the removal of the IOMMU mapping.
A new grantmap flag would be used to indicate the domain is requesting the
dev_bus_addr field is used as an input parameter.
    #define _GNTMAP_request_bfn_map      (6)
    #define GNTMAP_request_bfn_map   (1<<_GNTMAP_request_bfn_map)
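A caller requesting a specific BFN for a grant mapping might prepare the structure as in the sketch below. The structure and the new flag are taken from this section; the typedef widths, the GNTMAP_device_map value, the 4096-byte PAGE_SIZE, and the helper name `prepare_bfn_grant_map` are assumptions, and no hypercall is issued.

```c
#include <stdint.h>

/* Typedefs and GNTMAP_device_map assumed to match Xen's public headers. */
typedef uint32_t grant_ref_t;
typedef uint32_t grant_handle_t;
typedef uint16_t domid_t;

#ifndef PAGE_SIZE
#define PAGE_SIZE 4096ull /* assumed page size */
#endif

#define _GNTMAP_device_map       (0)
#define GNTMAP_device_map        (1<<_GNTMAP_device_map)
#define _GNTMAP_request_bfn_map  (6)
#define GNTMAP_request_bfn_map   (1<<_GNTMAP_request_bfn_map)

struct gnttab_map_grant_ref {
    /* IN parameters. */
    uint64_t host_addr;
    uint32_t flags;               /* GNTMAP_* */
    grant_ref_t ref;
    domid_t  dom;
    /* OUT parameters. */
    int16_t  status;              /* => enum grant_status */
    grant_handle_t handle;
    /* IN/OUT parameters */
    uint64_t dev_bus_addr;
};

/*
 * Hypothetical helper: fill a grant map request that asks for the
 * mapping to appear at bus_addr. Returns -1 if bus_addr is not
 * PAGE_SIZE aligned, 0 otherwise.
 */
int prepare_bfn_grant_map(struct gnttab_map_grant_ref *op, grant_ref_t ref,
                          domid_t dom, uint64_t bus_addr)
{
    if (bus_addr & (PAGE_SIZE - 1))
        return -1;

    op->host_addr = 0;            /* no host mapping requested in this sketch */
    op->flags = GNTMAP_device_map | GNTMAP_request_bfn_map;
    op->ref = ref;
    op->dom = dom;
    op->status = 0;
    op->handle = 0;
    op->dev_bus_addr = bus_addr;  /* IN: requested bus address */
    return 0;
}
```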
Linux kernel architecture
=========================
The Linux kernel will use the PV-IOMMU hypercalls to map its PFN address
space into the IOMMU. It will map the PFNs to the IOMMU address space using
a 1:1 mapping. It does this by programming a BFN to GFN mapping which matches
the PFN to GFN mapping.
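The 1:1 scheme amounts to using each PFN as its own BFN. A minimal sketch of building such requests follows; the flat `p2m` array, `struct map_req` (which mirrors the `map_page` member of `struct pv_iommu_op`), and the function name are assumptions for illustration.

```c
#include <stddef.h>
#include <stdint.h>

/* Mirrors the map_page member of struct pv_iommu_op. */
struct map_req {
    uint64_t bfn;
    uint64_t gfn;
};

/*
 * For each PFN in [first_pfn, first_pfn + nr), request BFN == PFN and
 * GFN == p2m[PFN], so that bus addresses match the kernel's PFN view.
 * p2m is a hypothetical flat PFN -> GFN table. Returns entries written.
 */
size_t build_identity_map(const uint64_t *p2m, uint64_t first_pfn, size_t nr,
                          struct map_req *out)
{
    for (size_t i = 0; i < nr; i++) {
        out[i].bfn = first_pfn + i;
        out[i].gfn = p2m[first_pfn + i];
    }
    return nr;
}
```

With such a mapping in place, a buffer that is contiguous in PFN space is also contiguous in BFN space, even when the backing frames are scattered.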
The native SWIOTLB will be used to handle devices which cannot DMA to all of
the kernel's PFN address space.
An interface shall be provided for emulator usage of IOMMUOP_*_foreign_page
subops which will allow the Linux kernel to centrally manage that domain's BFN
resource and ensure there are no unexpected conflicts.
Emulator usage of PV IOMMU interface
====================================
Emulators which require bus address mapping of guest RAM must first determine if
it's possible for the domain to control the bus addresses themselves.
An IOMMUOP_query_caps subop will return the IOMMU_QUERY_map_cap flag. If this
flag is set then the emulator may specify the BFN address it wishes guest RAM to
be mapped to via the IOMMUOP_map_foreign_page subop.  If the flag is not set
then the emulator must use BFN addresses supplied by Xen via the
IOMMUOP_lookup_foreign_page subop.
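The choice described above can be sketched as a single test on the query_caps flags. The subop numbers come from the definitions earlier in this document; treating IOMMU_QUERY_map_cap as a mask for bit 0 and the function name are assumptions.

```c
#include <stdint.h>

#define IOMMU_QUERY_map_cap          (1u << 0) /* bit 0, as a mask (assumed) */
#define IOMMUOP_map_foreign_page     4
#define IOMMUOP_lookup_foreign_page  5

/*
 * Pick the subop an emulator should use for guest RAM, given the flags
 * returned by IOMMUOP_query_caps.
 */
uint16_t pick_foreign_subop(uint16_t caps_flags)
{
    if (caps_flags & IOMMU_QUERY_map_cap)
        return IOMMUOP_map_foreign_page;    /* emulator chooses the BFN */
    return IOMMUOP_lookup_foreign_page;     /* Xen supplies the BFN */
}
```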
Operating systems which use the IOMMUOP_map_page subop are expected to provide a
common interface for emulators.
Emulators should unmap unused GFN mappings as often as possible using
IOMMUOP_unmap_foreign_page subops so that guest domains can balloon pages
quickly and efficiently.
Emulators should conform to the ballooning behaviour described in section
"IOMMUOP_*_foreign_page interactions with guest domain ballooning" so that guest
domains are able to effectively balloon out and in memory.
Emulators must unmap any active BFN mappings when they shut down.
_______________________________________________
Xen-devel mailing list
Xen-devel@xxxxxxxxxxxxx
http://lists.xen.org/xen-devel
 