Re: [Xen-devel] [RFC] Dom0 PV IOMMU control design (draft A)
On Fri, Apr 11, 2014 at 06:28:43PM +0100, Malcolm Crossley wrote:
> Hi,
>
> Here is a design for allowing Dom0 PV guests to control the IOMMU.
> This allows for the Dom0 GPFN mapping to be programmed into the
> IOMMU and avoids using the SWIOTLB bounce buffer technique in the
> Linux kernel (except for legacy 32 bit DMA IO devices).
>
> This feature provides two gains:
> 1. Improved performance for use cases which relied upon the bounce
>    buffer, e.g. NIC cards using jumbo frames with linear buffers.
> 2. Prevention of SWIOTLB bounce buffer region exhaustion, which can cause
>    unrecoverable Linux kernel driver errors.
>
> A PDF version of the document is available here:
>
> http://xenbits.xen.org/people/andrewcoop/pv-iommu-control-A.pdf
>
> The pandoc markdown format of the document is provided below to
> allow for easier inline comments:
>
> Introduction
> ============
>
> Background
> ----------
>
> Xen PV guests use a Guest Pseudo-physical Frame Number (GPFN) address space
> which is decoupled from the host Machine Frame Number (MFN) address space.
> PV guests which interact with hardware need to translate GPFN addresses to
> MFN addresses because hardware uses the host address space only.
> PV guest hardware drivers are only aware of the GPFN address space and
> assume that if GPFN addresses are contiguous then the hardware addresses
> will be contiguous as well. The decoupling between the GPFN and MFN address
> spaces means GPFN and MFN addresses may not be contiguous across page
> boundaries, and thus a buffer allocated in GPFN address space which spans a
> page boundary may not be contiguous in MFN address space.
>
> PV hardware drivers cannot tolerate this behaviour and so a special
> "bounce buffer" region is used to hide this issue from the drivers.
>
> A bounce buffer region is a special part of the GPFN address space which has
> been made contiguous in both GPFN and MFN address spaces. When a driver
> requests that a buffer which spans a page boundary be made available for
> hardware to read, core operating system code copies the buffer into a
> temporarily reserved part of the bounce buffer region and then returns the
> MFN address of that reserved part back to the driver. The driver then
> instructs the hardware to read the copy of the buffer in the bounce buffer.
> Similarly, if the driver requests that a buffer be made available for
> hardware to write to, a region of the bounce buffer is reserved first, and
> after the hardware completes writing, the reserved region of the bounce
> buffer is copied to the originally allocated buffer.
>
> The overhead of the memory copies to/from the bounce buffer region is high
> and damages performance. Furthermore, there is a risk that the fixed size
> bounce buffer region will become exhausted and it will not be possible to
> return a hardware address back to the driver. Linux kernel drivers do not
> tolerate this failure and so the kernel is forced to crash, as an
> uncorrectable error has occurred.
>
> Input/Output Memory Management Units (IOMMUs) allow an inbound address
> mapping to be created from the I/O bus address space (typically PCI) to
> the machine frame number address space. IOMMUs typically use a page table
> mechanism to manage the mappings and can therefore create mappings of page
> size granularity or larger.
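To make the contiguity issue above concrete: a buffer that spans a page
boundary is only safe for DMA if the MFNs behind its GPFNs happen to be
adjacent. A minimal, self-contained sketch (the toy p2m[] table and the
helper are mine, purely for illustration, not part of the draft):

#include <stdint.h>
#include <stdio.h>

/*
 * Toy guest p2m table: GPFN -> MFN.  GPFNs 0-3 are contiguous, but the
 * MFNs backing GPFNs 1 and 2 are not adjacent, so a buffer crossing the
 * GPFN 1/2 page boundary is not DMA-contiguous without bouncing or an
 * IOMMU mapping.
 */
static const uint64_t p2m[] = { 0x1000, 0x1001, 0x537a, 0x537b };

/* Return 1 if the GPFN range [gpfn, gpfn + nr) is backed by contiguous MFNs. */
static int is_machine_contiguous(uint64_t gpfn, uint64_t nr)
{
    for (uint64_t i = 1; i < nr; i++)
        if (p2m[gpfn + i] != p2m[gpfn] + i)
            return 0;
    return 1;
}

int main(void)
{
    printf("GPFN 0-1 machine contiguous: %d\n", is_machine_contiguous(0, 2)); /* 1 */
    printf("GPFN 1-2 machine contiguous: %d\n", is_machine_contiguous(1, 2)); /* 0 */
    return 0;
}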
> Purpose
> =======
>
> Allow Xen Domain 0 PV guests to create/modify/destroy IOMMU mappings for
> hardware devices that Domain 0 has access to. This enables Domain 0 to
> program a bus address space mapping which matches its GPFN mapping. Once a
> 1-1 mapping of GPFN to bus address space is created, a bounce buffer region
> is not required for the IO devices connected to the IOMMU.
>
>
> Architecture
> ============
>
> A three-argument hypercall interface (do_iommu_op), implementing two
> hypercall subops.
>
> Design considerations for hypercall subops
> -------------------------------------------
> IOMMU map/unmap operations can be slow and can involve flushing the IOMMU
> TLB to ensure the IO device uses the updated mappings.
>
> The subops have been designed to take an array of operations and a count as
> parameters. This allows easily implemented hypercall continuations to be
> used and allows batches of IOMMU operations to be submitted before flushing
> the IOMMU TLB.
>
>
> IOMMUOP_map_page
> ----------------
> First argument, pointer to array of `struct iommu_map_op`
> Second argument, integer count of `struct iommu_map_op` elements in array

Could this be 'unsigned integer' count? Is there a limit? Can I do 31415 of
them? Can I do it for the whole memory space of the guest?

>
> This subop will attempt to IOMMU map each element in the `struct iommu_map_op`
> array and record the mapping status back into the array itself. If a mapping
> fault occurs then the hypercall will return with -EFAULT.
>
> This subop will inspect the MFN address being mapped in each iommu_map_op to
> ensure it does not belong to the Xen hypervisor itself. If the MFN does belong
> to the Xen hypervisor, the subop will return -EPERM in the status field for
> that particular iommu_map_op.

Is it OK if the MFN belongs to another guest?

>
> The IOMMU TLB will only be flushed when the hypercall completes or a
> hypercall continuation is created.
>
>     struct iommu_map_op {
>         uint64_t bfn;

bus_frame ?

>         uint64_t mfn;
>         uint32_t flags;
>         int32_t status;
>     };
>
> ------------------------------------------------------------------------------
> Field       Purpose
> -----       ------------------------------------------------------------------
> `bfn`       [in] Bus address frame number to be mapped to the specified
>             mfn below

Huh? Isn't this out? If not, isn't bfn == mfn for dom0? How would dom0 know
the bus address? That usually is something only the IOMMU knows.

>
> `mfn`       [in] Machine address frame number
>

We still need to do a bit of PFN -> MFN -> hypercall -> GFN and program that
in the PCIe devices, right?

> `flags`     [in] Flags for signalling type of IOMMU mapping to be created
>
> `status`    [out] Mapping status of this map operation, 0 indicates success
> ------------------------------------------------------------------------------
>
>
> Defined bits for flags field
> ------------------------------------------------------------------------
> Name                        Bit     Definition
> ----                        -----   ----------------------------------
> IOMMU_MAP_OP_readable       0       Create readable IOMMU mapping
> IOMMU_MAP_OP_writeable      1       Create writeable IOMMU mapping

And is it OK to use both?

> Reserved for future use     2-31    n/a
> ------------------------------------------------------------------------
>
> Additional error codes specific to this hypercall:
>
> Error code     Reason
> ----------     ------------------------------------------------------------
> EPERM          PV IOMMU mode not enabled or calling domain is not domain 0

And -EFAULT? And what about success? Do you return 0 or the number of ops
that were successful?
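To sanity-check the batching interface above, here is a compilable sketch of
how a Dom0 kernel might fill and submit a map batch. The struct layout and
flag bits are copied from the draft; the IOMMUOP_map_page value and the
HYPERVISOR_iommu_op() wrapper are placeholders of my own, since the draft
does not specify the subop numbering or the guest-side binding.

#include <stdint.h>
#include <stdio.h>

/* Struct layout as given in the draft. */
struct iommu_map_op {
    uint64_t bfn;
    uint32_t flags;
    uint64_t mfn;
    int32_t  status;
};

/* Flag bits as given in the draft (bit 0 readable, bit 1 writeable). */
#define IOMMU_MAP_OP_readable  (1u << 0)
#define IOMMU_MAP_OP_writeable (1u << 1)

/* Placeholder subop number and hypercall wrapper -- NOT defined by the draft. */
#define IOMMUOP_map_page 1

static int HYPERVISOR_iommu_op(unsigned int subop, void *ops, unsigned int count)
{
    /* Stub: a real Dom0 kernel would issue the three-argument hypercall here. */
    (void)subop; (void)ops; (void)count;
    return 0;
}

/* Build a bfn == pfn batch for a run of frames and submit it in one call. */
static int map_frames(const uint64_t *pfns, const uint64_t *mfns, unsigned int nr)
{
    struct iommu_map_op ops[16];
    int rc;

    if (nr > 16)
        return -1;                 /* keep the example to one fixed-size batch */

    for (unsigned int i = 0; i < nr; i++) {
        ops[i].bfn    = pfns[i];   /* bus address == Dom0 GPFN, per the draft's 1-1 goal */
        ops[i].mfn    = mfns[i];
        ops[i].flags  = IOMMU_MAP_OP_readable | IOMMU_MAP_OP_writeable;
        ops[i].status = 0;
    }

    rc = HYPERVISOR_iommu_op(IOMMUOP_map_page, ops, nr);
    if (rc)
        return rc;                 /* e.g. -EFAULT for the whole batch */

    for (unsigned int i = 0; i < nr; i++)
        if (ops[i].status)         /* per-element failure, e.g. -EPERM for a Xen-owned MFN */
            fprintf(stderr, "map of bfn %llu failed: %d\n",
                    (unsigned long long)ops[i].bfn, (int)ops[i].status);
    return 0;
}

int main(void)
{
    uint64_t pfns[2] = { 0x1000, 0x1001 };   /* arbitrary example frame numbers */
    uint64_t mfns[2] = { 0x537a, 0x80021 };
    return map_frames(pfns, mfns, 2);
}

Batching like this is presumably why the per-element status field exists
alongside the hypercall return value: the IOMMU TLB is flushed once per
batch/continuation rather than per mapping, and the caller re-inspects the
array afterwards, which circles back to the "what does success return"
question above.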
> ------------------------------------------------------------------------
>
> IOMMUOP_unmap_page
> ------------------
> First argument, pointer to array of `struct iommu_map_op`
> Second argument, integer count of `struct iommu_map_op` elements in array

Um, 'unsigned integer' count?

>
> This subop will attempt to unmap each element in the `struct iommu_map_op`
> array and record the mapping status back into the array itself. If an
> unmapping fault occurs then the hypercall stops processing the array and
> returns with -EFAULT.
>
> The IOMMU TLB will only be flushed when the hypercall completes or a
> hypercall continuation is created.
>
>     struct iommu_map_op {
>         uint64_t bfn;
>         uint64_t mfn;
>         uint32_t flags;
>         int32_t status;
>     };
>
> --------------------------------------------------------------------
> Field       Purpose
> -----       --------------------------------------------------------
> `bfn`       [in] Bus address frame number to be unmapped

I presume this is gathered from the 'map' call?

>
> `mfn`       [in] This field is ignored for the unmap subop
>
> `flags`     [in] This field is ignored for the unmap subop
>
> `status`    [out] Mapping status of this unmap operation, 0 indicates
>             success
> --------------------------------------------------------------------
>
> Additional error codes specific to this hypercall:
>
> Error code     Reason
> ----------     ------------------------------------------------------------
> EPERM          PV IOMMU mode not enabled or calling domain is not domain 0

EFAULT too?

> ------------------------------------------------------------------------
>
>
> Conditions for which PV IOMMU hypercalls succeed
> ------------------------------------------------
> All of the following conditions are required to be true for PV IOMMU
> hypercalls to succeed:
>
> 1. IOMMU detected and supported by Xen
> 2. The following Xen IOMMU options are NOT enabled: dom0-passthrough,
>    dom0-strict
> 3. Domain 0 is making the hypercall
>
>
> Security Implications of allowing Domain 0 IOMMU control
> ========================================================
>
> Xen currently allows IO devices attached to Domain 0 to have direct access
> to all of the MFN address space (except Xen hypervisor memory regions),
> provided the Xen IOMMU option dom0-strict is not enabled.
>
> The PV IOMMU feature provides the same level of access to the MFN address
> space, and the feature is not enabled when the Xen IOMMU option dom0-strict
> is enabled. Therefore security is not affected by the PV IOMMU feature.
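For completeness, the unmap side of the interface described above could be
driven the same way. This continues the earlier sketch (same struct and
placeholder wrapper; IOMMUOP_unmap_page is again a made-up value), with mfn
and flags left as zero since the draft says they are ignored:

/* Continuing the previous sketch: unmap a batch of previously mapped bfns. */
#define IOMMUOP_unmap_page 2       /* placeholder value, not defined by the draft */

static int unmap_frames(const uint64_t *bfns, unsigned int nr)
{
    struct iommu_map_op ops[16];
    int rc;

    if (nr > 16)
        return -1;

    for (unsigned int i = 0; i < nr; i++) {
        ops[i].bfn    = bfns[i];   /* the bus frame used in the earlier map call */
        ops[i].mfn    = 0;         /* ignored for the unmap subop */
        ops[i].flags  = 0;         /* ignored for the unmap subop */
        ops[i].status = 0;
    }

    rc = HYPERVISOR_iommu_op(IOMMUOP_unmap_page, ops, nr);
    if (rc)
        return rc;                 /* -EFAULT stops processing, per the draft */

    for (unsigned int i = 0; i < nr; i++)
        if (ops[i].status)
            return ops[i].status;  /* report the first per-element failure */
    return 0;
}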
_______________________________________________
Xen-devel mailing list
Xen-devel@xxxxxxxxxxxxx
http://lists.xen.org/xen-devel