
Re: [Xen-devel] One question about the hypercall to translate gfn to mfn.

> From: Tim Deegan [mailto:tim@xxxxxxx]
> Sent: Wednesday, December 10, 2014 6:55 PM
> At 01:14 +0000 on 10 Dec (1418170461), Tian, Kevin wrote:
> > > From: Tim Deegan [mailto:tim@xxxxxxx]
> > > Sent: Tuesday, December 09, 2014 6:47 PM
> > >
> > > At 18:10 +0800 on 09 Dec (1418145055), Yu, Zhang wrote:
> > > > Hi all,
> > > >
> > > >    As you can see, we are pushing our XenGT patches to the upstream. One
> > > > feature we need in xen is to translate guests' gfn to mfn in XenGT dom0
> > > > device model.
> > > >
> > > >    Here we may have 2 similar solutions:
> > > >    1> Paul told me(and thank you, Paul :)) that there used to be a
> > > > hypercall, XENMEM_translate_gpfn_list, which was removed by Keir in
> > > > commit 2d2f7977a052e655db6748be5dabf5a58f5c5e32, because there was no
> > > > usage at that time.
> > >
> > > It's been suggested before that we should revive this hypercall, and I
> > > don't think it's a good idea.  Whenever a domain needs to know the
> > > actual MFN of another domain's memory it's usually because the
> > > security model is problematic.  In particular, finding the MFN is
> > > usually followed by a brute-force mapping from a dom0 process, or by
> > > passing the MFN to a device for unprotected DMA.
> >
> > In our case it's not because the security model is problematic. It's
> > because GPU virtualization is done in Dom0 while the memory virtualization
> > is done in hypervisor. We need a means to query GPFN->MFN so we can
> > setup shadow GPU page table in Dom0 correctly, for a VM.
> I don't think we understand each other.  Let me try to explain what I
> mean.  My apologies if this sounds patronising; I'm just trying to be
> as clear as I can.

Thanks for your explanation. This is a very helpful discussion. :-)

> It is Xen's job to isolate VMs from each other.  As part of that, Xen
> uses the MMU, nested paging, and IOMMUs to control access to RAM.  Any
> software component that can pass a raw MFN to hardware breaks that
> isolation, because Xen has no way of controlling what that component
> can do (including taking over the hypervisor).  This is why I am
> afraid when developers ask for GFN->MFN translation functions.

While I absolutely agree that is Xen's job, isolation is also required at different
layers, depending on who controls the resource and where the virtualization
happens. For example, in I/O virtualization, Dom0 or a driver domain
needs to isolate backend drivers from one another, so that one backend
cannot interfere with another. Xen cannot detect such a violation, since all
it sees is that Dom0 wants to access a VM's page.

Btw, I'm curious how much worse exposing GFN->MFN translation is than
allowing another VM's GFNs to be mapped. If exposing GFN->MFN were placed
under the same permission control as mapping, would that address your worry here?
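To make the question concrete, here is a minimal sketch of the idea (plain C, not actual Xen code; `struct domain`, `backend_is_privileged_for`, and the flat p2m array are all hypothetical stand-ins): a GFN->MFN query gated by the same per-domain privilege check that a foreign-mapping request would pass through.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

#define INVALID_MFN UINT64_MAX

/* Hypothetical stand-in for a domain; Xen's real struct is far richer. */
struct domain {
    int domid;
    int target_domid;      /* domain this backend is privileged for, or -1 */
    const uint64_t *p2m;   /* toy gfn -> mfn table */
    size_t p2m_entries;
};

/* The same check that would gate a foreign-mapping request: the backend
 * may only act on the VM the toolstack assigned to it. */
static bool backend_is_privileged_for(const struct domain *backend,
                                      const struct domain *target)
{
    return backend->target_domid == target->domid;
}

/* A translation query under identical permission control: it fails for
 * any domain the backend could not already map. */
static uint64_t translate_gfn(const struct domain *backend,
                              const struct domain *target, uint64_t gfn)
{
    if (!backend_is_privileged_for(backend, target))
        return INVALID_MFN;
    if (gfn >= target->p2m_entries)
        return INVALID_MFN;
    return target->p2m[gfn];
}
```

Under this view, translation leaks no more than mapping already does, provided the raw MFN can only reach hardware through an IOMMU context that enforces the same boundary.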

> So if the XenGT model allowed the backend component to (cause the GPU
> to) perform arbitrary DMA without IOMMU checks, then that component
> would have complete access to the system and (from a security pov)
> might as well be running in the hypervisor.  That would be very
> problematic, but AFAICT that's not what's going on.  From your reply
> on the other thread it seems like the GPU is behind the IOMMU, so
> that's OK. :)
> When the backend component gets a GFN from the guest, it wants an
> address that it can give to the GPU for DMA that will map the right
> memory.  That address must be mapped in the IOMMU tables that the GPU
> will be using, which means the IOMMU tables of the backend domain,
> IIUC[1].  So the hypercall it needs is not "give me the MFN that matches
> this GFN" but "please map this GFN into my IOMMU tables".

Here, "please map this GFN into my IOMMU tables" actually breaks the
IOMMU's isolation. The IOMMU is designed to serve DMA requests issued
by a single VM, so an IOMMU page table can restrict only that VM's DMA attempts.

To map multiple VMs' GFNs into one IOMMU table, the first requirement is to
avoid GFN conflicts. We considered this approach previously, e.g. reserving
the highest 3 bits of the GFN as a VMID, so that one IOMMU page table can
combine multiple VMs' page tables. However, doing so has three limitations:

a) it still requires write-protecting the guest GPU page table and maintaining
a shadow GPU page table that translates from real GFNs to pseudo-GFNs (plus
VMID), which saves no engineering effort in the device-model part;

b) it breaks the isolation the IOMMU is designed to provide. In such a case the
IOMMU cannot isolate multiple VMs by itself, since a DMA request can target any
pseudo-GFN that is valid in the page table. We would have to rely on auditing in
the Dom0 backend component to ensure isolation, so even with the IOMMU in use,
the isolation intent you described earlier is lost;

c) it introduces tricky logic in the IOMMU driver to handle such a non-standard,
multiplexed page-table layout.

Without an SR-IOV implementation (where each VF has its own IOMMU page table),
I don't see how the IOMMU can help with isolation here.
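For reference, the multiplexing scheme above can be sketched as follows (a toy model: the 3-bit VMID / 29-bit GFN split is illustrative only, not from any real design):

```c
#include <assert.h>
#include <stdint.h>

/* Toy encoding: the top 3 bits of a 32-bit pseudo-GFN carry a VMID, so
 * one shared IOMMU page table can hold up to 8 VMs without GFN clashes.
 * Field widths are illustrative, not taken from real hardware. */
#define VMID_BITS 3u
#define GFN_BITS  29u
#define GFN_MASK  ((1u << GFN_BITS) - 1u)

static uint32_t make_pseudo_gfn(uint32_t vmid, uint32_t gfn)
{
    return (vmid << GFN_BITS) | (gfn & GFN_MASK);
}

static uint32_t pseudo_gfn_vmid(uint32_t pgfn) { return pgfn >> GFN_BITS; }
static uint32_t pseudo_gfn_gfn(uint32_t pgfn)  { return pgfn & GFN_MASK; }
```

Limitation (b) is visible here: the hardware walk does not distinguish VMIDs, so a DMA request that forms a pseudo-GFN with the wrong VMID still hits a valid mapping; only software auditing in the backend prevents it.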

> Asking for the MFN will only work if the backend domain's IOMMU
> tables have an existing 1:1 r/w mapping of all guest RAM, which
> happens to be the case if the backend component is in dom0 _and_ dom0
> is PV _and_ we're not using strict IOMMU tables.  Restricting XenGT to
> work in only those circumstances would be short-sighted, not only
> because it would mean XenGT could never work as a driver domain, but
> also because it seems like PVH dom0 is going to be the default at some
> point.

Yes, this is good feedback we hadn't considered before. So far XenGT works
because we use the default IOMMU setting, which sets up a 1:1 r/w mapping
for all possible RAM, so when the GPU hits an MFN through the shadow GPU
page table, the IOMMU is essentially bypassed. However, as you said, if the
IOMMU page table is restricted to Dom0's memory, or is not a 1:1 identity
mapping, XenGT will break.

I don't see a good solution for this, other than the multiplexed IOMMU
page table mentioned above, which doesn't look like a sane design to me.

> If the existing hypercalls that make IOMMU mappings are not right for
> XenGT then we can absolutely consider adding some more.  But we need
> to talk about what policy Xen will enforce on the mapping requests.
> If the shared backend is allowed to map any page of any VM, then it
> can easily take control of any VM on the host (even though the IOMMU
> will prevent it from taking over the hypervisor itself).  The
> absolute minimum we should allow here is some toolstack-controlled
> list of which VMs the XenGT backend is serving, so that it can refuse
> to map other VMs' memory (like an extension of IS_PRIV_FOR, which does
> this job for Qemu).

For mapping and accessing another guest's memory, I don't think we
need any new interface on top of the existing ones. Just like other backend
drivers, we can leverage the same permission control.

Please note that the requirement to expose the p2m here is really about
setting up the GPU page table, so that a guest GPU workload can be executed
directly by the GPU.
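To illustrate why the device model needs the translation at all, here is a toy sketch (hypothetical names, not XenGT code; the flat p2m array stands in for the translation service under discussion) of populating one shadow GPU PTE from a write-protected guest PTE:

```c
#include <assert.h>
#include <stdint.h>

#define PTE_PRESENT    0x1u
#define PTE_ADDR_SHIFT 12
#define INVALID_MFN    UINT64_MAX

/* When a write to the (write-protected) guest GPU page table is trapped,
 * the backend rewrites the GFN in the guest PTE as an MFN in the shadow
 * PTE, which is the table the GPU actually walks. */
static uint64_t shadow_gpu_pte(uint64_t guest_pte,
                               const uint64_t *p2m, uint64_t p2m_entries)
{
    uint64_t gfn;

    if (!(guest_pte & PTE_PRESENT))
        return 0;                           /* nothing to shadow */
    gfn = guest_pte >> PTE_ADDR_SHIFT;
    if (gfn >= p2m_entries || p2m[gfn] == INVALID_MFN)
        return 0;                           /* no backing frame */
    /* Swap the frame number, keep the low permission bits unchanged. */
    return (p2m[gfn] << PTE_ADDR_SHIFT) | (guest_pte & 0xFFFu);
}
```

This is the step that needs a GFN->MFN answer from the hypervisor: the result must be a frame number the GPU can use directly, which is exactly what makes the surrounding IOMMU policy question important.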

> I would also strongly advise using privilege separation in the backend
> between the GPUPT shadow code (which needs mapping rights and is
> trusted to maintain isolation between the VMs that are sharing the
> GPU) and the rest of the XenGT backend (which doesn't/isn't).  But
> that's outside my remit as a hypervisor maintainer so it goes no
> further than an "I told you so". :)

We're open to suggestions that make our code better, but could you
elaborate a bit on exactly what privilege separation you mean here? :-)

> Cheers,
> Tim.
> [1] That is, AIUI this GPU doesn't context-switch which set of IOMMU
>     tables it's using for DMA, SR-IOV-style, and that's why you need a
>     software component in the first place.

Yes, there's only one IOMMU context dedicated to the GPU, and it's
impractical to switch the IOMMU page table, given concurrent access to
graphics memory from different VCPUs and different render engines within the GPU.


Xen-devel mailing list


