
Re: [RFC PATCH 2/6] xen/public: arch-arm: reserve resources for virtio-pci


  • To: Oleksandr Tyshchenko <Oleksandr_Tyshchenko@xxxxxxxx>
  • From: Stewart Hildebrand <stewart.hildebrand@xxxxxxx>
  • Date: Tue, 21 Nov 2023 14:12:43 -0500
  • Cc: Julien Grall <julien@xxxxxxx>, Sergiy Kibrik <Sergiy_Kibrik@xxxxxxxx>, "xen-devel@xxxxxxxxxxxxxxxxxxxx" <xen-devel@xxxxxxxxxxxxxxxxxxxx>, "Bertrand Marquis" <bertrand.marquis@xxxxxxx>, Michal Orzel <michal.orzel@xxxxxxx>, Volodymyr Babchuk <Volodymyr_Babchuk@xxxxxxxx>, "vikram.garhwal@xxxxxxx" <vikram.garhwal@xxxxxxx>, Stefano Stabellini <sstabellini@xxxxxxxxxx>
  • Delivery-date: Tue, 21 Nov 2023 19:13:07 +0000
  • List-id: Xen developer discussion <xen-devel.lists.xenproject.org>

On 11/17/23 03:11, Oleksandr Tyshchenko wrote:
> 
> 
> On 17.11.23 05:31, Stewart Hildebrand wrote:
> 
> Hello Stewart
> 
> [answering only for virtio-pci bits as for vPCI I am only familiar with 
> code responsible for trapping config space accesses]
> 
> [snip]
> 
>>>
>>>
>>> Let me start by saying that if we can get away with it, I think that a
>>> single PCI Root Complex in Xen would be best because it requires less
>>> complexity. Why emulate 2/3 PCI Root Complexes if we can emulate only
>>> one?
>>>
>>> Stewart, you are deep into vPCI, what's your thinking?
>>
>> First allow me explain the moving pieces in a bit more detail (skip ahead to 
>> "Back to the question: " if you don't want to be bored with the details). I 
>> played around with this series, and I passed through a PCI device (with 
>> vPCI) and enabled virtio-pci:
>>
>> virtio = [ 
>> "type=virtio,device,transport=pci,bdf=0000:00:00.0,backend_type=qemu" ]
>> device_model_args = [ "-device", "virtio-serial-pci" ]
>> pci = [ "01:00.0" ]
>>
>> Indeed we get two root complexes (2 ECAM ranges, 2 sets of interrupts, etc.) 
>> from the domU point of view:
>>
>>      pcie@10000000 {
>>          compatible = "pci-host-ecam-generic";
>>          device_type = "pci";
>>          reg = <0x00 0x10000000 0x00 0x10000000>;
>>          bus-range = <0x00 0xff>;
>>          #address-cells = <0x03>;
>>          #size-cells = <0x02>;
>>          status = "okay";
>>          ranges = <0x2000000 0x00 0x23000000 0x00 0x23000000 0x00 0x10000000 
>> 0x42000000 0x01 0x00 0x01 0x00 0x01 0x00>;
>>          #interrupt-cells = <0x01>;
>>          interrupt-map = <0x00 0x00 0x00 0x01 0xfde8 0x00 0x74 0x04>;
>>          interrupt-map-mask = <0x00 0x00 0x00 0x07>;
> 
> 
> I am wondering how you got interrupt-map here? AFAIR the upstream toolstack 
> doesn't add that property for the vPCI DT node.

You are correct. I'm airing my dirty laundry here: this is a hack to get a 
legacy interrupt passed through on a ZCU102 board (Zynq UltraScale+), which 
doesn't have GICv3/ITS. In a system with a GICv3/ITS, we would have an msi-map 
property (this is also not yet upstream, but is in a couple of downstreams).

> 
>>      };
>>
>>      pcie@33000000 {
>>          compatible = "pci-host-ecam-generic";
>>          device_type = "pci";
>>          reg = <0x00 0x33000000 0x00 0x200000>;
>>          bus-range = <0x00 0x01>;
>>          #address-cells = <0x03>;
>>          #size-cells = <0x02>;
>>          status = "okay";
>>          ranges = <0x2000000 0x00 0x34000000 0x00 0x34000000 0x00 0x800000 
>> 0x42000000 0x00 0x3a000000 0x00 0x3a000000 0x00 0x800000>;
>>          dma-coherent;
>>          #interrupt-cells = <0x01>;
>>          interrupt-map = <0x00 0x00 0x00 0x01 0xfde8 0x00 0x0c 0x04 0x00 
>> 0x00 0x00 0x02 0xfde8 0x00 0x0d 0x04 0x00 0x00 0x00 0x03 0xfde8 0x00 0x0e 
>> 0x04 0x00 0x00 0x00 0x04 0xfde8 0x00 0x0f 0x04 0x800 0x00 0x00 0x01 0xfde8 
>> 0x00 0x0d 0x04 0x800 0x00 0x00 0x02 0xfde8 0x00 0x0e 0x04 0x800 0x00 0x00 
>> 0x03 0xfde8 0x00 0x0f 0x04 0x800 0x00 0x00 0x04 0xfde8 0x00 0x0c 0x04 0x1000 
>> 0x00 0x00 0x01 0xfde8 0x00 0x0e 0x04 0x1000 0x00 0x00 0x02 0xfde8 0x00 0x0f 
>> 0x04 0x1000 0x00 0x00 0x03 0xfde8 0x00 0x0c 0x04 0x1000 0x00 0x00 0x04 
>> 0xfde8 0x00 0x0d 0x04 0x1800 0x00 0x00 0x01 0xfde8 0x00 0x0f 0x04 0x1800 
>> 0x00 0x00 0x02 0xfde8 0x00 0x0c 0x04 0x1800 0x00 0x00 0x03 0xfde8 0x00 0x0d 
>> 0x04 0x1800 0x00 0x00 0x04 0xfde8 0x00 0x0e 0x04>;
>>          interrupt-map-mask = <0x1800 0x00 0x00 0x07>;
> 
> 
> that is a correct dump.
> 
> BTW, if you added "grant_usage=1" (it is disabled by default for dom0) 
> to the virtio configuration, you would get an iommu-map property here as 
> well [1]. This is another point to think about when considering the 
> combined approach (single PCI host bridge node -> single virtual root 
> complex): I guess a usual PCI device doesn't want grant-based DMA 
> addresses, correct? If so, it shouldn't be covered by the property.

Right, a usual PCI device doesn't want/need grant-based DMA addresses. The 
iommu-map property [2] is flexible enough to make it apply only to certain 
vBDFs or ranges of vBDFs. Actually, it looks like the patch ("libxl/arm: Reuse 
generic PCI-IOMMU bindings for virtio-pci devices") already has the logic for 
doing exactly this. So it should be fine to have the iommu-map property on the 
single root complex and still do regular PCI passthrough.
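
For illustration only, a restricted iommu-map on a single root complex node 
might look something like the fragment below. The &xen_grant_dma label is 
hypothetical, and I'm assuming the IOMMU specifier cell carries the backend 
domid as in the current series; the point is just that the RID range selects 
which vBDFs the property applies to:

        /* RID = bus << 8 | devfn, so 00:02.0 is 0x0010. Only that single
         * RID is routed through the grant-DMA pseudo-IOMMU (specifier 0x1
         * = hypothetical backend domid); the passthrough device at
         * 00:00.0 matches no entry and keeps untranslated DMA. */
        iommu-map = <0x0010 &xen_grant_dma 0x1 0x1>;
        iommu-map-mask = <0xffff>;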

Presumably, we would also want msi-map [3] to apply to some vBDFs, not others. 
msi-map has the same flexible BDF matching as iommu-map (these two bindings 
actually share the same parsing function), so we should be able to use a 
similar strategy here if needed.
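
Again purely as a sketch (the &its label is made up), restricting msi-map to 
the passthrough device's RIDs could look like:

        /* Only RIDs 0x0000..0x0007 (the passthrough device at 00:00.0 and
         * its functions) get an MSI DeviceID via the ITS; the virtio-pci
         * RIDs are left unmapped, so their drivers would have to fall
         * back to legacy interrupts. */
        msi-map = <0x0000 &its 0x0000 0x0008>;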

We would also want interrupt-map [4] to apply to some vBDFs, not others. The 
BDF matching with interrupt-map behaves slightly differently, but I believe it 
is still flexible enough to achieve this. In this case, the function 
create_virtio_pci_irq_map(), added in the patch ("libxl/arm: Add basic 
virtio-pci support"), would need some re-thinking.

In summary, with a single virtual root complex, the toolstack needs to know 
which vBDFs are virtio-pci, and which vBDFs are passthrough, so it can create 
the [iommu,msi,interrupt]-map properties accordingly.

[2] 
https://www.kernel.org/doc/Documentation/devicetree/bindings/pci/pci-iommu.txt
[3] https://www.kernel.org/doc/Documentation/devicetree/bindings/pci/pci-msi.txt
[4] 
https://www.kernel.org/doc/Documentation/devicetree/bindings/pci/host-generic-pci.yaml

> 
> 
>>      };
>>
>> Xen vPCI doesn't currently expose a host bridge (i.e. a device with base 
>> class 0x06). As an aside, we may eventually want to expose a 
>> virtual/emulated host bridge in vPCI, because Linux's x86 PCI probe expects 
>> one [0].
>>
>> Qemu exposes an emulated host bridge, along with any requested emulated 
>> devices.
>>
>> Running lspci -v in the domU yields the following:
>>
>> 0000:00:00.0 Network controller: Ralink corp. RT2790 Wireless 802.11n 1T/2R 
>> PCIe
>>          Subsystem: ASUSTeK Computer Inc. RT2790 Wireless 802.11n 1T/2R PCIe
>>          Flags: bus master, fast devsel, latency 0, IRQ 13
>>          Memory at 23000000 (32-bit, non-prefetchable) [size=64K]
>>          Capabilities: [50] MSI: Enable- Count=1/128 Maskable- 64bit+
>>          Kernel driver in use: rt2800pci
>>
>> 0001:00:00.0 Host bridge: Red Hat, Inc. QEMU PCIe Host bridge
>>          Subsystem: Red Hat, Inc. QEMU PCIe Host bridge
>>          Flags: fast devsel
>>
>> 0001:00:01.0 Communication controller: Red Hat, Inc. Virtio console
>>          Subsystem: Red Hat, Inc. Virtio console
>>          Flags: bus master, fast devsel, latency 0, IRQ 14
>>          Memory at 3a000000 (64-bit, prefetchable) [size=16K]
>>          Capabilities: [84] Vendor Specific Information: VirtIO: <unknown>
>>          Capabilities: [70] Vendor Specific Information: VirtIO: Notify
>>          Capabilities: [60] Vendor Specific Information: VirtIO: DeviceCfg
>>          Capabilities: [50] Vendor Specific Information: VirtIO: ISR
>>          Capabilities: [40] Vendor Specific Information: VirtIO: CommonCfg
>>          Kernel driver in use: virtio-pci
>>
>> 0000:00:00.0 is a real passed through device (corresponding to 0000:01:00.0 
>> in dom0).
>> 0001:00:00.0 is the qemu host bridge (base class 0x06).
>> They are on different segments because they are associated with different 
>> root complexes.
> 
> 
> Glad to hear this patch series doesn't seem to break PCI passthrough in 
> your environment.
> 
> 
>>
>>
>> Back to the question: Sure, avoiding reserving more memory from the 
>> preciously small lowmem virtual memory layout is probably a good idea. With 
>> everything in a single virtual root complex (segment), it's probably 
>> possible to come up with some vBDF-picking algorithm (+ user ability to 
>> specify) that works for most use cases as discussed elsewhere. It will 
>> always be in a single fixed segment as far as I can tell.
>>
>> Some more observations assuming a single virtual root complex:
>>
>> We should probably hide the qemu host bridge(s) from the guest. In other 
>> words, hide all devices with base class 0x06, except eventually vPCI's own 
>> virtual host bridge. If we don't hide them, we would likely end up with 
>> multiple emulated host bridges on a single root complex (segment). That 
>> sounds messy and hard to manage.
>>
>> We have a need to control the vBDF exposed to the guest - can we force qemu 
>> to use particular BDFs for its emulated devices?
> 
> 
> Yes, it is possible. Maybe there is a better way, but at
> least *bus* and *addr* can be specified and Qemu indeed follows that.
> 
> device_model_args=[ '-device', 
> 'virtio-blk-pci,scsi=off,disable-legacy=on,iommu_platform=on,bus=pcie.0,addr=2,drive=image',
> '-drive', 'if=none,id=image,format=raw,file=/dev/mmcblk1p3' ]
> 
> virtio=[ "backend=Domain-0, type=virtio,device, transport=pci, 
> bdf=0000:00:02.0, grant_usage=1, backend_type=qemu" ]
> 
> root@h3ulcb-domd:~# dmesg | grep virtio
> [   0.660789] virtio-pci 0000:00:02.0: enabling device (0000 -> 0002)
> [   0.715876] virtio_blk virtio0: [vda] 4096 512-byte logical blocks 
> (2.10 MB/2.00 MiB)
> 
> root@h3ulcb-domd:~# lspci
> 00:00.0 Host bridge: Red Hat, Inc. QEMU PCIe Host bridge
> 00:02.0 SCSI storage controller: Red Hat, Inc. Virtio block device (rev 01)
> 
> Also, one thing to note for the current series: the bdf specified for a 
> virtio-pci device only matters for the iommu-map property. So 
> bdf=0000:00:02.0 in the virtio property and bus=pcie.0,addr=2 in 
> device_model_args should be kept in sync.
> 
> [1] 
> https://patchwork.kernel.org/project/xen-devel/patch/20231115112611.3865905-5-Sergiy_Kibrik@xxxxxxxx/
> 
> 
> [snip]



 

