Re: Using Restricted DMA for virtio-pci
On Sun, 2025-03-30 at 17:48 -0400, Michael S. Tsirkin wrote:
> On Sun, Mar 30, 2025 at 10:27:58PM +0100, David Woodhouse wrote:
> > On 30 March 2025 18:06:47 BST, "Michael S. Tsirkin" <mst@xxxxxxxxxx> wrote:
> > > > It's basically just allowing us to expose through PCI, what I believe
> > > > we can already do for virtio in DT.
> > > 
> > > I am not saying I am against this extension.
> > > The idea to restrict DMA has a lot of merit outside pkvm.
> > > For example, with a physical device, limiting its DMA
> > > to a fixed range can be good for security at a cost of
> > > an extra data copy.
> > > 
> > > So I am not saying we have to block this specific hack.
> > > 
> > > What worries me fundamentally is I am not sure it works well
> > > e.g. for physical virtio cards.
> > 
> > Not sure why it doesn't work for physical cards. They don't need to
> > be bus-mastering; they just take data from a buffer in their own
> > RAM.
> 
> I mean, it kind of does, it is just that the CPU pulling data over the
> PCI bus stalls it so is very expensive. It is not by chance people
> switched to DMA almost exclusively.

Yes. For a physical implementation it would not be the most
high-performance option... unless DMA is somehow blocked as it is in
the pKVM+virt case.

In the case of a virtual implementation, however, the performance is
not an issue because it'll be backed by host memory anyway. (It's just
that because it's presented to the guest and the trusted part of the
hypervisor as PCI BAR space instead of main memory, it's a whole lot
more practical to deal with the fact that it's *shared* with the VMM.)

> > > Attempts to pass data between devices will now also require
> > > extra data copies.
> > 
> > Yes. I think that's acceptable, but if we really cared we could
> > perhaps extend the capability to refer to a range inside a given
> > BAR on a specific *device*? Or maybe just *function*, and allow
> > sharing of the SWIOTLB buffer within a multi-function device?
> 
> Fundamentally, this is what dmabuf does.

In software, yes. Extending it to hardware is a little harder.

In principle, it might be quite nice to offer a single SWIOTLB buffer
region (in a BAR of one device) and have multiple virtio devices share
it. Not just because of passing data between devices, as you mentioned,
but also because it'll be a more efficient use of memory than each
device having its own buffer and allocation pool.

So how would a device indicate that it can use a SWIOTLB buffer which
is in a BAR of a *different* device? Not by physical address, because
BARs get moved around. Not even by PCI bus/dev/fn/BAR#, because *buses*
get renumbered.

You could limit it to sharing within one PCI "bus", and use just
dev/fn/BAR#? Or even within one PCI device, and just fn/BAR#? The
latter could theoretically be usable by multi-function physical
devices.

The standard struct virtio_pci_cap (which I used for
VIRTIO_PCI_CAP_SWIOTLB) just contains BAR and offset/length. We could
extend it with device + function, using -1 for 'self', to allow for
such sharing?

Still not convinced it isn't overkill, but it's certainly easy enough
to add on the *spec* side. I haven't yet looked at how that sharing
would work in Linux on the guest side; thus far what I'm proposing is
intended to be almost identical to the per-device thing that should
already work with a `restricted-dma-pool` node in device-tree.
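To make that concrete, something along these lines is what I have in
mind. This is only a sketch for discussion, not something I've posted:
the name of the new structure and its fields are placeholders, and I'm
quoting struct virtio_pci_cap from memory of the current spec.

    /* Sketch only. The spec writes these as u8/le32; typedefs here
     * just make the fragment self-contained C. */
    #include <stdint.h>
    typedef uint8_t  u8;
    typedef uint32_t le32;      /* 32-bit little-endian on the wire */

    /* The existing generic capability layout. */
    struct virtio_pci_cap {
            u8 cap_vndr;        /* Generic PCI field: PCI_CAP_ID_VNDR */
            u8 cap_next;        /* Generic PCI field: next capability */
            u8 cap_len;         /* Generic PCI field: capability length */
            u8 cfg_type;        /* Identifies the structure, e.g.
                                   VIRTIO_PCI_CAP_SWIOTLB */
            u8 bar;             /* Which BAR holds the region */
            u8 id;              /* Multiple capabilities of the same type */
            u8 padding[2];
            le32 offset;        /* Offset of the region within the BAR */
            le32 length;        /* Length of the region, in bytes */
    };

    /* Hypothetical extension for VIRTIO_PCI_CAP_SWIOTLB: name the
     * function (and device) whose BAR actually holds the buffer.
     * 0xff (i.e. -1) in both fields would mean 'self', which is all
     * my current proposal needs. */
    struct virtio_pci_swiotlb_cap {
            struct virtio_pci_cap cap;
            u8 buffer_device;   /* Device# on the same bus, or 0xff for self */
            u8 buffer_function; /* Function#, or 0xff for self */
            u8 reserved[2];     /* Pad to a full dword */
    };

A 'self'-only device would just ship 0xff/0xff, so the simple
per-device case wouldn't get any more complicated than it is today.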
> > I think it's overkill though.
> > 
> > > Did you think about adding an swiotlb mode to virtio-iommu at all?
> > > Much easier than parsing page tables.
> > 
> > Often the guests which need this will have a real IOMMU for the true
> > pass-through devices.
> 
> Not sure I understand. You mean with things like stage 2 passthrough?

Yes. AMD's latest IOMMU spec documents it, for example. Exposing a
'vIOMMU' to the guest which handles just stage 1 (IOVA→GPA) while the
hypervisor controls the normal GPA→HPA translation in stage 2.

Then the guest gets an accelerated path *directly* to the hardware for
its IOTLB flushes... which means the hypervisor doesn't get to *see*
those IOTLB flushes, so it's a PITA to do device emulation as if it's
covered by that same IOMMU. (Actually I haven't checked the AMD one in
detail for that flaw; most *other* 2-stage IOMMUs I've seen do have it,
and I *bet* AMD does too.)

> > Adding a virtio-iommu into the mix (or any other
> > system-wide way of doing something different for certain devices) is
> > problematic.
> 
> OK... but the issue isn't specific to no-DMA devices, is it?

Hm? Allowing virtio devices to operate as "no-DMA devices" is a
*workaround* for the issue. The issue is that the VMM may not have full
access to the guest's memory for emulating devices.

These days, virtio covers a large proportion of emulated devices. So I
do think the issue is fairly specific to virtio devices, and suspect
that's what you meant to type above?

We pondered teaching the trusted part of the hypervisor (e.g. pKVM) to
snoop on virtqueues enough to 'know' which memory the VMM was genuinely
being *invited* to read/write... and we ran away screaming. (In order
to have sufficient trust, you end up not just snooping but implementing
quite a lot of the emulation on the trusted side. And then complex
enlightenments in the VMM and the untrusted Linux/KVM which hosts it,
to interact with that.)

Then we realised that for existing DT guests it's trivial just to add
the `restricted-dma-pool` node (an example is at the end of this mail).
And we wanted to do the same for the guests who are afflicted with
UEFI/ACPI too. So here we are, trying to add the same capability to
virtio-pci.

> > The on-device buffer keeps it nice and simple,
> 
> I am not saying it is not.
> It's just a little boutique.

Fair. Although with the advent of confidential computing and
restrictions on guest memory access, perhaps it is becoming less
boutique over time?

And it should also be fairly low-friction; it's a whole lot cleaner in
the spec than the awful VIRTIO_F_ACCESS_PLATFORM legacy, and even in
the Linux guest driver it should work fairly simply given the existing
restricted-dma support (although of course that shouldn't entirely be
our guiding motivation).

> > and even allows us to
> > do device support for operating systems like Windows where it's a lot
> > harder to do anything generic in the core OS.
> 
> Well, we do need virtio-iommu Windows support sooner or later, anyway.

Heh, good luck with that :)

And actually, doesn't that only support *DMA* remapping? So you still
wouldn't be able to boot a Windows guest with >255 vCPUs without some
further enlightenment (like Windows guests finally supporting the
15-bit MSI extension that even Hyper-V supports on the host side...)
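PS. Since I've referred to it a few times: the existing device-tree
form of this, which the PCI capability is intended to mirror, looks
roughly like the following. The addresses, sizes, node names and
interrupt specifier are purely illustrative; the binding itself is the
in-tree restricted-dma-pool one.

    /* Carve out a bounce-buffer pool and restrict one virtio
     * device's DMA to it. */
    reserved-memory {
            #address-cells = <1>;
            #size-cells = <1>;
            ranges;

            restricted_dma: restricted-dma-pool@50000000 {
                    compatible = "restricted-dma-pool";
                    reg = <0x50000000 0x400000>;    /* 4MiB pool */
            };
    };

    virtio@10000000 {
            compatible = "virtio,mmio";
            reg = <0x10000000 0x200>;
            interrupts = <42>;
            /* All DMA for this device bounces through the pool above. */
            memory-region = <&restricted_dma>;
    };

What I'm proposing is just the virtio-pci way of saying the same thing,
with the pool living in a BAR instead of in a reserved region of guest
memory.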