
Re: [Xen-devel] Linux grant map/unmap improvement proposal (Draft B)



On Mon, 13 Oct 2014, David Vrabel wrote:
> Grant mapping in the Linux kernel has a number of problems:
> 
> * Grant mapping from userspace is broken for many real world use
>   cases.
> 
> * Netback does not handle sending packets to network storage provided
>   by a VM on the same host.
> 
> * Using blkback with network-based storage is unsafe.
> 
> * Performance is poor, particularly with userspace grants.
> 
> A PDF version of this document is available from:
> 
>   http://xenbits.xen.org/people/dvrabel/grant-improvements-B.pdf
> 
> 
> Userspace grant maps
> --------------------
> 
> Certain types of system calls using foreign mappings require
> translating the virtual address to a page using `get_user_pages()` or
> `get_user_pages_fast()`.  These system calls include direct I/O and
> asynchronous I/O (AIO).
> 
> In the native case this translation is done by walking the userspace
> page tables and looking up the PFN in the L1 entry.  PFN to page is
> then trivial.
> 
> For a PV guest this L1 entry contains an MFN and this first needs to
> be translated into a PFN.  For a normal frame this is a simple lookup
> in the M2P.  For foreign pages, the gntdev driver maintains an
> additional hash of foreign MFNs to local PFNs called the m2p_override.
> 
> The m2p_override table has a fundamental design flaw.
> 
> A domain may grant a frame multiple times, using a different grant
> reference each time.  The backend maps each grant reference to a
> separate page.  The 1-to-many MFN-to-page mapping cannot be
> represented in the 1-to-1 m2p_override table and I/O to or from these
> mappings cannot get the correct page.
> 
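
To make the 1-to-many problem above concrete, here is a toy userspace model (not kernel code). The names `override_table`, `override_add` and `override_lookup` are hypothetical, and the real m2p_override is a hash table rather than a small array. In this model a second mapping of the same MFN overwrites the first; whichever entry a real lookup returned, the point is that it can only ever return one page per MFN.

```c
#include <assert.h>
#include <stddef.h>

#define OVERRIDE_SLOTS 8
#define NO_PFN ((unsigned long)-1)

/* Toy stand-in for the m2p_override: one local PFN per foreign MFN. */
struct override_table {
    unsigned long mfn[OVERRIDE_SLOTS];
    unsigned long pfn[OVERRIDE_SLOTS];
    size_t used;
};

/* Record mfn -> pfn. Mapping the same MFN again replaces the old PFN,
 * because the table has no way to hold two pages for one MFN. */
static void override_add(struct override_table *t,
                         unsigned long mfn, unsigned long pfn)
{
    for (size_t i = 0; i < t->used; i++) {
        if (t->mfn[i] == mfn) {
            t->pfn[i] = pfn;
            return;
        }
    }
    assert(t->used < OVERRIDE_SLOTS);
    t->mfn[t->used] = mfn;
    t->pfn[t->used] = pfn;
    t->used++;
}

/* Return the single PFN recorded for mfn, or NO_PFN if there is none. */
static unsigned long override_lookup(const struct override_table *t,
                                     unsigned long mfn)
{
    for (size_t i = 0; i < t->used; i++)
        if (t->mfn[i] == mfn)
            return t->pfn[i];
    return NO_PFN;
}
```

If the same frame is granted twice and mapped at PFNs 10 and 11, a lookup of that MFN yields only one of them, so I/O submitted against the other mapping resolves to the wrong page.
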
> Transmitting foreign pages to guests
> ------------------------------------
> 
> Netback when sending pages to the guest uses a grant copy operation to
> copy the data into the frames granted by the guest.  This grant copy
> requires either a local GFN _or_ a grant reference; it is not possible
> to grant copy to/from a foreign mapping.
> 
> In order to support VM to VM traffic, netback stores the grant
> reference for the sender VM in the socket buffer structure which may
> then be used by the receiving netback for the grant copy.
> 
> Packets with foreign pages from other sources cannot be successfully
> copied, since netback does not know the grant reference.  One such
> configuration is a VM providing an iSCSI or other network-based
> storage that presents a block device in the backend that is then used
> by another VM on the same host.
> 
> Blkback and network storage
> ---------------------------
> 
> Blkback unmaps the foreign pages in an I/O request when the request is
> completed.  If networked storage is used it is possible for requests
> to be completed while the skbs referring to those pages are still
> queued for transmit (e.g., because a retransmission was queued while
> the response to the original packet was in flight).
> 
> When the network driver attempts to send the packet with the unmapped
> page it may:
> 
> - Fault while trying to access the unmapped page.
> 
> - Transmit from a frame that is no longer granted (potentially
>   transmitting sensitive guest or Xen data).
> 
> The fault does not occur with userspace storage backends since gntdev
> replaces the foreign mapping with one to a local scratch page.  It
> uses GNTTABOP_unmap_and_replace which atomically replaces the foreign
> mapping with another (source) mapping.  However, this cannot be used
> with batched operations since it clears the source mapping and it does
> not protect against transmitting from a non-granted frame.
 
This is a very good summary of the issues we currently have with Xen
support in Linux. I would like to add one that is missing from the list
but is worth keeping in mind. To be clear, I am not asking you to do
anything about it at the moment.


dma_ops.unmap_page and dma_ops.unmap_sg only pass dma addresses as arguments
----------------------------------------------------------------------------

The Linux dma_map_ops API includes a number of functions, for example
unmap_page and unmap_sg, that receive only the dma address of the dma
request as an argument, not the struct page or the physical address.

For Xen PV guests the dma address is a machine address. If the machine
address corresponds to a foreign page (granted to the current domain),
there is no easy way for us to retrieve the corresponding struct page or
guest physical address (other than the m2p_override with all its
problems).

This is a serious limitation, in particular if we need to do any
operations on the memory region at the time one of these functions is
called:
- on x86 fortunately we don't need to do anything;
- on ARM, if the device is not dma coherent, we might have to issue cache
maintenance operations.
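
A minimal sketch of the limitation, as a toy userspace model rather than kernel code. The frame numbers, `mfn_to_local_pfn()` and `unmap_page_model()` are all hypothetical; the model only assumes that the unmap hook is handed a machine address and nothing else.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

#define PAGE_SHIFT   12
#define LOCAL_FRAMES 4
#define INVALID_PFN  UINT64_MAX

/* Toy M2P: machine frames 100..103 belong to this domain (made-up numbers). */
static const uint64_t local_mfn[LOCAL_FRAMES] = { 100, 101, 102, 103 };

/* Translate a machine frame to a local PFN; foreign frames have no entry. */
static uint64_t mfn_to_local_pfn(uint64_t mfn)
{
    for (uint64_t pfn = 0; pfn < LOCAL_FRAMES; pfn++)
        if (local_mfn[pfn] == mfn)
            return pfn;
    return INVALID_PFN;
}

/* Model of an unmap hook that is given only the dma (machine) address.
 * Returns true if the page could be found, i.e. cache maintenance would
 * be possible; returns false for a foreign frame. */
static bool unmap_page_model(uint64_t dma_addr)
{
    uint64_t pfn = mfn_to_local_pfn(dma_addr >> PAGE_SHIFT);

    if (pfn == INVALID_PFN)
        return false;
    /* Real code would clean/invalidate the cache lines of this page here. */
    return true;
}
```

For a local frame the translation succeeds; for a granted frame the hook has nothing to work with unless some extra lookup structure exists.
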

 
> Design
> ======
> 
> Map onto ballooned pages only
> -----------------------------
> 
> Grant maps will only be permitted with ballooned pages.
> 
> The original p2m entry for these pages will always be INVALID_MFN and
> thus the original MFN does not need to be saved on map and restored on
> unmap.
> 
> Grant map/unmap will no longer need to use or clobber `page->index`.
> This allows a workaround in netback to clear `page->pfmemalloc` to be
> removed (`index` and `pfmemalloc` are part of the same union).
> 
> 
> Safe grant unmap
> ----------------
> 
> Grant references will only be unmapped when they are no longer in use,
> i.e., when the page reference count is one.
> 
>     int gnttab_unmap_refs_async(struct gnttab_unmap_grant_ref *unmap_ops,
>         struct gnttab_unmap_grant_ref *kunmap_ops,
>         struct page **pages, unsigned int count,
>         void (*done)(void *data), void *data);
> 
> The `gnttab_unmap_refs_async()` function will unmap the grant
> references using the supplied unmap operations and call `done(data)`.
> The grant unmap will only be done once all pages are no longer in use.
> 
> It shall run synchronously on the first attempt (this is expected to
> be the most common case).  If any page is in use, it shall queue the
> unmap request to be tried at a later time.
> 
> Only the blkback and gntdev devices need to use asynchronous unmaps.
> 
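
The "try once synchronously, otherwise defer" behaviour described above could be modelled roughly as follows. This is a userspace sketch under stated assumptions, not the proposed kernel implementation: `toy_page`, `unmap_request`, `try_unmap()` and `count_done()` are hypothetical names, and the `pending` flag stands in for whatever retry queue the real code would use.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

struct toy_page {
    int refcount;   /* 1 means only the grant mapping holds the page */
    bool mapped;
};

struct unmap_request {
    struct toy_page **pages;
    size_t count;
    void (*done)(void *data);
    void *data;
    bool pending;   /* set when the request must be retried later */
};

/* Example completion callback: counts how many times it was called. */
static void count_done(void *data)
{
    (*(int *)data)++;
}

/* True only if no page in the request is still referenced elsewhere. */
static bool all_idle(const struct unmap_request *req)
{
    for (size_t i = 0; i < req->count; i++)
        if (req->pages[i]->refcount != 1)
            return false;
    return true;
}

/* Unmap every page and call done(data) if all pages are idle.
 * Otherwise leave everything mapped and mark the request for a retry. */
static bool try_unmap(struct unmap_request *req)
{
    if (!all_idle(req)) {
        req->pending = true;
        return false;
    }
    for (size_t i = 0; i < req->count; i++)
        req->pages[i]->mapped = false;
    req->pending = false;
    req->done(req->data);
    return true;
}
```

Note that the model unmaps all pages or none, so `done` is never called while any page of the batch is still referenced.
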
> 
> Userspace address to page translation
> -------------------------------------
> 
> The m2p_override table shall be removed.
> 
> Each VMA (struct vm_area_struct) shall contain an additional pointer to an
> optional array of pages.  This array shall be sized to cover the full
> extent of the VMA.
> 
> The gntdev driver populates this array with the relevant pages for the
> foreign mappings as they are mapped.  It shall also clear them when
> unmapping.  The gntdev driver must ensure it properly splits the page
> array when the VMA itself is split.
> 
> Since the m2p lookup will not return a local PFN, the native
> get_user_pages_fast() call will fail.  Prior to attempting to fault in
> the pages, get_user_pages() can simply look up the pages in the VMA's
> page array.
> 
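
The lookup described above amounts to indexing the array by the page offset within the VMA. A small userspace sketch, assuming 4 KiB pages; `toy_vma`, `toy_page` and `vma_lookup_page()` are hypothetical stand-ins and not the proposed kernel interface.

```c
#include <assert.h>
#include <stddef.h>

#define PAGE_SHIFT 12

struct toy_page {
    int id;
};

/* Stand-in for a VMA with the optional per-VMA page array. */
struct toy_vma {
    unsigned long vm_start;     /* inclusive */
    unsigned long vm_end;       /* exclusive */
    struct toy_page **pages;    /* NULL if the VMA has no foreign mappings */
};

/* Return the foreign page backing addr, or NULL if the address is outside
 * the VMA, the VMA has no array, or nothing is mapped at that slot. */
static struct toy_page *vma_lookup_page(const struct toy_vma *vma,
                                        unsigned long addr)
{
    if (!vma->pages || addr < vma->vm_start || addr >= vma->vm_end)
        return NULL;
    return vma->pages[(addr - vma->vm_start) >> PAGE_SHIFT];
}
```

A NULL result would be the cue to fall back to the normal fault path. Splitting a VMA means splitting the array at the same page offset, which is why the draft makes gntdev responsible for it.
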
> `page->private` will no longer need to be set to the MFN.
> 
> This is similar to the approach used in the classic kernel.

It is worth pointing out that if/when non-dma-coherent devices start
appearing in x86-land, this solution won't suffice.


> Identifying foreign pages
> -------------------------
> 
> A new page flag is introduced: PG_foreign.  This will alias PG_pinned
> so it does not require an additional bit.
> 
> If PG_foreign is set then `page->private` contains the grant reference
> and domid for this foreign page.  This information can only be packed
> into an unsigned long on 64-bit platforms.  32-bit platforms will have
> to allocate an additional structure to store the domid and gref.
> 
> The aliasing of PG_foreign and PG_pinned is safe because:
> 
> - Page table pages will never be foreign.
> - Foreign pages shall have `p2m[P] & FOREIGN_FRAME_BIT`.
> 
> The use of the private field is safe because:
> 
> - The page is allocated by the balloon driver and thus it owns the
>   private field.
> 
> - The other fields in the union (ptl, slab_cache, and first_page) will
>   not be used because the page is not used in a page table, slab or
>   compound page.
> 
> Netback can thus:
> 
> 1. Test PG_foreign.
> 2. Verify that the page is foreign via the p2m.
> 3. Extract the domid and gref from page->private.
> 
> The PG_foreign test is not strictly necessary as the p2m lookup is
> sufficient, but it should be quicker for non-foreign pages.
> 
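
For illustration, one way the domid and gref might be packed into a 64-bit private field. The bit layout (domid in bits 32..47, gref in bits 0..31) and the helper names are assumptions made for this sketch; the draft does not specify them. The type widths follow the Xen public headers, where a domid is 16 bits and a grant reference is 32 bits, which is why the pair does not fit in a 32-bit unsigned long.

```c
#include <assert.h>
#include <stdint.h>

typedef uint16_t domid_t;       /* same width as in the Xen public headers */
typedef uint32_t grant_ref_t;   /* same width as in the Xen public headers */

/* Pack domid and gref into one 64-bit value (assumed layout, see above). */
static uint64_t foreign_pack(domid_t domid, grant_ref_t gref)
{
    return ((uint64_t)domid << 32) | gref;
}

/* Extract the domid from a packed value. */
static domid_t foreign_domid(uint64_t priv)
{
    return (domid_t)(priv >> 32);
}

/* Extract the grant reference from a packed value. */
static grant_ref_t foreign_gref(uint64_t priv)
{
    return (grant_ref_t)(priv & 0xffffffffu);
}
```

Only 48 of the 64 bits are used, so the top 16 bits remain free in this layout.
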
> 
> Userspace grant performance
> ---------------------------
> 
> Since the m2p_override table will be removed, the gntdev device may
> easily batch the grant map and unmap hypercalls that update the kernel
> mappings.
> 
> The use of the scratch pages on unmap will be unnecessary and can be
> removed.
> 
> Other improvements that may be considered are:
> 
> - Batch the userspace and kernel map and unmap.
> 
> - Lazily map grants into userspace on faults.  For applications that
>   do not access the foreign frames by the userspace mappings (such as
>   block backends using direct I/O) this would avoid a set of maps and
>   unmaps. This lazy mode would have to be requested by the userspace
>   program (since faulting many pages would be much more expensive than
>   a single batched map).
> 
> _______________________________________________
> Xen-devel mailing list
> Xen-devel@xxxxxxxxxxxxx
> http://lists.xen.org/xen-devel
> 
