Xen project Mailing List

Re: [Xen-devel] One question about the hypercall to translate gfn to mfn.

To: Tim Deegan <tim@xxxxxxx>

From: "Tian, Kevin" <kevin.tian@xxxxxxxxx>

Date: Thu, 11 Dec 2014 01:41:44 +0000

Accept-language: en-US

Cc: "keir@xxxxxxx" <keir@xxxxxxx>, "Paul.Durrant@xxxxxxxxxx" <Paul.Durrant@xxxxxxxxxx>, "Yu, Zhang" <yu.c.zhang@xxxxxxxxxxxxxxx>, "JBeulich@xxxxxxxx" <JBeulich@xxxxxxxx>, "Xen-devel@xxxxxxxxxxxxx" <Xen-devel@xxxxxxxxxxxxx>

Delivery-date: Thu, 11 Dec 2014 01:42:59 +0000

List-id: Xen developer discussion <xen-devel.lists.xen.org>

Thread-index: AQHQE5ijx8eE5zCfX0WmGWg5Lg8fgJyGjX2AgAF26eCAAB3OgIABbe5Q

Thread-topic: One question about the hypercall to translate gfn to mfn.

> From: Tim Deegan [mailto:tim@xxxxxxx]
> Sent: Wednesday, December 10, 2014 6:55 PM
>
> At 01:14 +0000 on 10 Dec (1418170461), Tian, Kevin wrote:
> > > From: Tim Deegan [mailto:tim@xxxxxxx]
> > > Sent: Tuesday, December 09, 2014 6:47 PM
> > >
> > > At 18:10 +0800 on 09 Dec (1418145055), Yu, Zhang wrote:
> > > > Hi all,
> > > >
> > > > As you can see, we are pushing our XenGT patches to the upstream. One
> > > > feature we need in xen is to translate guests' gfn to mfn in XenGT dom0
> > > > device model.
> > > >
> > > > Here we may have 2 similar solutions:
> > > > 1> Paul told me(and thank you, Paul :)) that there used to be a
> > > > hypercall, XENMEM_translate_gpfn_list, which was removed by Keir in
> > > > commit 2d2f7977a052e655db6748be5dabf5a58f5c5e32, because there was no
> > > > usage at that time.
> > >
> > > It's been suggested before that we should revive this hypercall, and I
> > > don't think it's a good idea. Whenever a domain needs to know the
> > > actual MFN of another domain's memory it's usually because the
> > > security model is problematic. In particular, finding the MFN is
> > > usually followed by a brute-force mapping from a dom0 process, or by
> > > passing the MFN to a device for unprotected DMA.
> >
> > In our case it's not because the security model is problematic. It's
> > because GPU virtualization is done in Dom0 while the memory virtualization
> > is done in hypervisor. We need a means to query GPFN->MFN so we can
> > setup shadow GPU page table in Dom0 correctly, for a VM.
>
> I don't think we understand each other. Let me try to explain what I
> mean. My apologies if this sounds patronising; I'm just trying to be
> as clear as I can.

Thanks for your explanation. This is a very helpful discussion. :-)

>
> It is Xen's job to isolate VMs from each other. As part of that, Xen
> uses the MMU, nested paging, and IOMMUs to control access to RAM.
> Any software component that can pass a raw MFN to hardware breaks that
> isolation, because Xen has no way of controlling what that component
> can do (including taking over the hypervisor). This is why I am
> afraid when developers ask for GFN->MFN translation functions.

While I absolutely agree that this is Xen's job, isolation is also
required at other layers, depending on who controls the resource and
where the virtualization happens. For example, in I/O virtualization,
Dom0 or a driver domain needs to isolate its backend drivers from one
another so that one backend cannot interfere with another. Xen cannot
detect such a violation, since all it knows is that Dom0 wants to
access a VM's page.

By the way, I'm curious how much worse exposing the GFN->MFN
translation is compared with allowing the mapping of another VM's GFN.
If exposing GFN->MFN were under the same permission control as mapping,
would that address your worry here?

>
> So if the XenGT model allowed the backend component to (cause the GPU
> to) perform arbitrary DMA without IOMMU checks, then that component
> would have complete access to the system and (from a security pov)
> might as well be running in the hypervisor. That would be very
> problematic, but AFAICT that's not what's going on. From your reply
> on the other thread it seems like the GPU is behind the IOMMU, so
> that's OK. :)
>
> When the backend component gets a GFN from the guest, it wants an
> address that it can give to the GPU for DMA that will map the right
> memory. That address must be mapped in the IOMMU tables that the GPU
> will be using, which means the IOMMU tables of the backend domain,
> IIUC[1]. So the hypercall it needs is not "give me the MFN that matches
> this GFN" but "please map this GFN into my IOMMU tables".

Here "please map this GFN into my IOMMU tables" actually breaks the
IOMMU isolation. The IOMMU is designed to serve DMA requests issued by
a single VM, so that the IOMMU page table can strictly restrict that
VM's accesses.
To map multiple VMs' GFNs into one IOMMU table, the first thing is to
avoid GFN conflicts so that the scheme is functional. We thought about
this approach previously, e.g. by reserving the highest 3 bits of the
GFN as a VMID, so that one IOMMU page table can combine multiple VMs'
page tables. However, doing so has three limitations:

a) it still requires write-protecting the guest GPU page table and
maintaining a shadow GPU page table by translating from the real GFN to
a pseudo GFN (plus VMID), which doesn't save any engineering effort in
the device model part

b) it breaks the isolation that the IOMMU is designed to provide. In
such a case the IOMMU can't isolate multiple VMs by itself, since a DMA
request can target any pseudo GFN that is valid in the page table. We
have to rely on auditing in the backend component in Dom0 to ensure
isolation. So even when using the IOMMU, it loses the isolation
intention you described earlier.

c) this introduces tricky logic in the IOMMU driver to handle such a
non-standard multiplexed page table style.

Without an SR-IOV implementation (where each VF has its own IOMMU page
table), I don't see how using the IOMMU can help isolation here.

>
> Asking for the MFN will only work if the backend domain's IOMMU
> tables have an existing 1:1 r/w mapping of all guest RAM, which
> happens to be the case if the backend component is in dom0 _and_ dom0
> is PV _and_ we're not using strict IOMMU tables. Restricting XenGT to
> work in only those circumstances would be short-sighted, not only
> because it would mean XenGT could never work as a driver domain, but
> also because it seems like PVH dom0 is going to be the default at some
> point.

Yes, this is good feedback that we hadn't thought about before. So far,
XenGT works because we use the default IOMMU setting, which sets up a
1:1 r/w mapping for all possible RAM, so when the GPU hits an MFN
through the shadow GPU page table, the IOMMU is essentially bypassed.
However, as you said, if the IOMMU page table is restricted to dom0's
memory, or is not a 1:1 identity mapping, XenGT will be broken. I don't
see a good solution for this, though, other than the multiplexed IOMMU
page table mentioned above, which doesn't look like a sane design to me.

>
> If the existing hypercalls that make IOMMU mappings are not right for
> XenGT then we can absolutely consider adding some more. But we need
> to talk about what policy Xen will enforce on the mapping requests.
> If the shared backend is allowed to map any page of any VM, then it
> can easily take control of any VM on the host (even though the IOMMU
> will prevent it from taking over the hypervisor itself). The
> absolute minimum we should allow here is some toolstack-controlled
> list of which VMs the XenGT backend is serving, so that it can refuse
> to map other VMs' memory (like an extension of IS_PRIV_FOR, which does
> this job for Qemu).

For mapping and accessing other guests' memory, I don't think we need
any new interface on top of the existing ones. As with other backend
drivers, we can leverage the same permission control. Please note that
the requirement to expose the p2m here is really about setting up the
GPU page table so that a guest GPU workload can be executed directly by
the GPU.

>
> I would also strongly advise using privilege separation in the backend
> between the GPUPT shadow code (which needs mapping rights and is
> trusted to maintain isolation between the VMs that are sharing the
> GPU) and the rest of the XenGT backend (which doesn't/isn't). But
> that's outside my remit as a hypervisor maintainer so it goes no
> further than an "I told you so". :)

We're open to suggestions that make our code better, but could you
elaborate a bit on what exactly you mean by privilege separation
here? :-)

>
> Cheers,
>
> Tim.
>
> [1] That is, AIUI this GPU doesn't context-switch which set of IOMMU
> tables it's using for DMA, SR-IOV-style, and that's why you need a
> software component in the first place.

Yes, there's only one IOMMU dedicated to the GPU, and it's impractical
to switch the IOMMU page table given concurrent access to graphics
memory from different VCPUs and different render engines within the GPU.

Thanks
Kevin

_______________________________________________
Xen-devel mailing list
Xen-devel@xxxxxxxxxxxxx
http://lists.xen.org/xen-devel
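
The multiplexed page-table idea discussed in the message (reserving the
highest 3 bits of a GFN as a VMID) can be illustrated with a small
sketch. This is only a rough model, not Xen or XenGT code: the 36-bit
GFN width and the names to_pseudo_gfn, from_pseudo_gfn and audit are
assumptions made up for the example.

```python
from typing import Tuple

# Illustrative assumptions only: neither the width nor the names below
# come from Xen or XenGT.
GFN_BITS = 36                        # assumed total width of a GFN
VMID_BITS = 3                        # "highest 3 bits of GFN as VMID"
REAL_GFN_BITS = GFN_BITS - VMID_BITS
REAL_GFN_MASK = (1 << REAL_GFN_BITS) - 1


def to_pseudo_gfn(vmid: int, gfn: int) -> int:
    """Pack a VMID and a guest's real GFN into one pseudo GFN."""
    if not 0 <= vmid < (1 << VMID_BITS):
        raise ValueError("VMID does not fit in the reserved bits")
    if not 0 <= gfn <= REAL_GFN_MASK:
        raise ValueError("GFN would collide with the VMID field")
    return (vmid << REAL_GFN_BITS) | gfn


def from_pseudo_gfn(pseudo: int) -> Tuple[int, int]:
    """Split a pseudo GFN back into (vmid, real GFN)."""
    return pseudo >> REAL_GFN_BITS, pseudo & REAL_GFN_MASK


def audit(owner_vmid: int, pseudo: int) -> bool:
    """Software check: does this pseudo GFN belong to the given VM?

    A shared table accepts any valid pseudo GFN, so a check like this
    has to live in the backend rather than in the IOMMU (limitation b).
    """
    vmid, _ = from_pseudo_gfn(pseudo)
    return vmid == owner_vmid
```

With these assumed widths, to_pseudo_gfn(5, 0x1234) places VMID 5 above
bit 33, and audit(2, ...) on that value returns False. The sketch also
shows why the scheme saves no work in the device model: every entry in
the shadow GPU page table still has to be rewritten from the real GFN
to the pseudo GFN.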
