
Re: [Xen-devel] One question about the hypercall to translate gfn to mfn.


At 01:41 +0000 on 11 Dec (1418258504), Tian, Kevin wrote:
> > From: Tim Deegan [mailto:tim@xxxxxxx]
> > It is Xen's job to isolate VMs from each other.  As part of that, Xen
> > uses the MMU, nested paging, and IOMMUs to control access to RAM.  Any
> > software component that can pass a raw MFN to hardware breaks that
> > isolation, because Xen has no way of controlling what that component
> > can do (including taking over the hypervisor).  This is why I am
> > afraid when developers ask for GFN->MFN translation functions.
> While I absolutely agree that this is Xen's job, isolation is also required
> at different layers, depending on who controls the resource and where the
> virtualization happens. For example, in I/O virtualization, Dom0 or a driver
> domain needs to isolate backend drivers from one another, to avoid one
> backend interfering with another. Xen doesn't know about such a violation,
> since it only sees that Dom0 wants to access a VM's page.

I'm going to write a second reply to this mail in a bit, to talk about
this kind of system-level design.  In this email I'll just talk about
the practical aspects of interfaces and address spaces and IOMMUs.

> btw, curious: how much worse is exposing GFN->MFN translation compared to
> allowing mapping of another VM's GFN? If exposing GFN->MFN is under the
> same permission control as mapping, would that avoid your worry here?

I'm afraid not.  There's nothing worrying per se in a backend knowing
the MFNs of the pages -- the worry is that the backend can pass the
MFNs to hardware.  If the check happens only at lookup time, then XenGT
can (either through a bug or a security breach) just pass _any_ MFN to
the GPU for DMA.

But even without considering the security aspects, this model has bugs
that may be impossible for XenGT itself to even detect.  E.g.:
 1. Guest asks its virtual GPU to DMA to a frame of memory;
 2. XenGT looks up the GFN->MFN mapping;
 3. Guest balloons out the page;
 4. Xen allocates the page to a different guest;
 5. XenGT passes the MFN to the GPU, which DMAs to it.

Whereas if stage 2 is a _mapping_ operation, Xen can refcount the
underlying memory and make sure it doesn't get reallocated until XenGT
is finished with it.

> > When the backend component gets a GFN from the guest, it wants an
> > address that it can give to the GPU for DMA that will map the right
> > memory.  That address must be mapped in the IOMMU tables that the GPU
> > will be using, which means the IOMMU tables of the backend domain,
> > IIUC[1].  So the hypercall it needs is not "give me the MFN that matches
> > this GFN" but "please map this GFN into my IOMMU tables".
> Here "please map this GFN into my IOMMU tables" actually breaks the
> IOMMU isolation. The IOMMU is designed to serve DMA requests issued
> by a single VM, so the IOMMU page table can strictly restrict that VM's
> accesses.
> To map multiple VMs' GFNs into one IOMMU table, the first thing is to
> avoid GFN conflicts to make it functional. We thought about this approach
> previously, e.g. by reserving the highest 3 bits of the GFN as a VMID, so one
> IOMMU page table can combine multiple VMs' page tables. However,
> doing so has three limitations:
> a) it still requires write-protecting the guest GPU page table, and maintaining
> a shadow GPU page table by translating from real GFN to pseudo-GFN (plus
> VMID), which doesn't save any engineering effort in the device model part

Yes -- since there's only one IOMMU context for the whole GPU, the
XenGT backend still has to audit all GPU commands to maintain
isolation between clients.

> b) it breaks the isolation intrinsic to the IOMMU's design. In such a case,
> the IOMMU can't isolate multiple VMs by itself, since a DMA request can
> target any pseudo-GFN that is valid in the page table. We have to rely on the
> audit in the backend component in Dom0 to ensure the isolation.


> c) it introduces tricky logic in the IOMMU driver to handle such a
> non-standard multiplexed page table style.
> Without an SR-IOV implementation (so that each VF has its own IOMMU page
> table), I don't see how using the IOMMU can help isolation here.

If I've understood your argument correctly, it basically comes down
to "It would be extra work for no benefit, because XenGT still has to
do all the work of isolating GPU clients from each other".  It's true
that XenGT still has to isolate its clients, but there are other benefits.

The main one, from my point of view as a Xen maintainer, is that it
allows Xen to constrain XenGT itself, in the case where bugs or
security breaches mean that XenGT tries to access memory it shouldn't.
More about that in my other reply.  I'll talk about the rest below.

> yes, this is good feedback we didn't think about before. So far the reason
> why XenGT works is that we use the default IOMMU setting, which sets
> up a 1:1 r/w mapping for all possible RAM, so when the GPU hits an MFN
> through the shadow GPU page table, the IOMMU is essentially bypassed.
> However, like you said, if the IOMMU page table is restricted to dom0's
> memory, or is not a 1:1 identity mapping, XenGT will be broken.
> However, I don't see a good solution for this, except using the multiplexed
> IOMMU page table mentioned above, which doesn't look like
> a sane design to me.

Right.  AIUI you're talking about having a component, maybe in Xen,
that automatically makes a merged IOMMU table that contains multiple
VMs' p2m tables all at once.  I think that we can do something simpler
than that which will have the same effect and also avoid race
conditions like the one I mentioned at the top of the email.

[First some hopefully-helpful diagrams to explain my thinking.  I'll
 borrow 'BFN' from Malcolm's discussion of IOMMUs to describe the
 addresses that devices issue their DMAs in:

 Here's how the translations work for a HVM guest using HAP:

   CPU    <- Code supplied by the guest
   MMU    <- Pagetables supplied by the guest
   HAP    <- Guest's P2M, supplied by Xen

 Here's how it looks for a GPU operation using XenGT:

   GPU       <- Code supplied by Guest, audited by XenGT
  (GPU VA)
  GPU-MMU    <- GTTs supplied by XenGT (by shadowing guest ones)
  IOMMU      <- XenGT backend dom's P2M (for PVH/HVM) or IOMMU tables (for PV)

 OK, on we go...]

Somewhere in the existing XenGT code, XenGT has a guest GFN in its
hand and makes a lookup hypercall to find the MFN.  It puts that MFN
into the GTTs that it passes to the GPU.  But an MFN is not actually
what it needs here -- it needs a GPU BFN, which the IOMMU will then
turn into an MFN for it.

If we replace that lookup with a _map_ hypercall, either with Xen
choosing the BFN (as happens in the PV grant map operation) or with
the guest choosing an unused address (as happens in the HVM/PVH
grant map operation), then:
 - the only extra code in XenGT itself is that you need to unmap
   when you change the GTT;
 - Xen can track and control exactly which MFNs XenGT/the GPU can access;
 - running XenGT in a driver domain or PVH dom0 ought to work; and
 - we fix the race condition I described above.

The default policy I'm suggesting is that the XenGT backend domain
should be marked IS_PRIV_FOR (or similar) over the XenGT client VMs,
which will need a small extension in Xen since at the moment struct
domain has only one "target" field.

BTW, this is the exact analogue of how all other backend and toolstack
operations work -- they request access from Xen to specific pages and
they relinquish it when they are done.  In particular:

> for mapping and accessing another guest's memory, I don't think we
> need any new interface atop the existing ones. Just as with other backend
> drivers, we can leverage the same permission control.

I don't think that's right -- other backend drivers use the grant
table mechanism, where the guest explicitly grants access to only the
memory it needs.  AIUI you're not suggesting that you'll use that for
XenGT! :)

Right - I hope that made some sense.  I'll go get another cup of
coffee and start on that other reply...



Xen-devel mailing list