Re: [PATCH] x86/ept: limit calls to memory_type_changed()
On Mon, Sep 26, 2022 at 06:03:24PM +0000, Andrew Cooper wrote:
> On 22/09/2022 17:05, Roger Pau Monne wrote:
> > memory_type_changed() is currently only implemented for Intel EPT, and
> > results in the invalidation of EMT attributes on all the entries in
> > the EPT page tables. Such invalidation causes EPT_MISCONFIG vmexits
> > when the guest tries to access any gfns for the first time, which
> > results in the recalculation of the EMT for the accessed page. The
> > vmexit and the recalculations are expensive, and as such should be
> > avoided when possible.
> >
> > Remove the call to memory_type_changed() from
> > XEN_DOMCTL_memory_mapping: there are no modifications of the
> > iomem_caps ranges anymore that could alter the return of
> > cache_flush_permitted() from that domctl.
> >
> > Calls to memory_type_changed() resulting from changes to the domain
> > iomem_caps or ioport_caps ranges are only relevant for EMT
> > calculations if the IOMMU is not enabled, and the call has resulted in
> > a change to the return value of cache_flush_permitted().
>
> This, and the perf problem Citrix have found, is caused by a more
> fundamental bug which I identified during XSA-402.
>
> Xen is written with the assumption that cacheability other than WB is
> dependent on having devices. While this is perhaps true of current
> configurations available, it is a layering violation, and will cease
> being true in order to support encrypted RAM (and by extension,
> encrypted VMs).

I assumed this was done as a performance improvement (or hack).

> At the moment, we know the IOMMU-ness of a domain right from the outset,
> but the cacheability permissions are dynamic, based on the non-emptiness
> of the domain's device list, ioport list, and various others.

Well, as long as there's an IOMMU assigned, cacheability will be
calculated taking the guest MTRR values into account, and won't be
forced to WB, regardless of the emptiness of the device or IO
ports/memory capability lists. Just setting `passthrough=1` in the
guest config file, without any devices actually passed through, should
prevent the forcing of WB.

The use case of allowing IO memory or ports to be assigned to an HVM
guest without an IOMMU has always confused me, but I guess if it's
there it's because someone uses it.

> All the memory_type_changed() calls here are to cover the fact that the
> original design was buggy by not having the cacheability-ness part of
> domain create in the first place.
>
> The appropriate fix, but definitely 4.18 work at this point, is to have
> a new CDF flag which permits the use of non-WB cacheability.
>
> For testing purposes alone, turning it on on an otherwise "plain VM" is
> useful (it's how I actually debugged XSA-402, and the only sane way to go
> about investigating the MTRR perf disasters for VGPU VMs[1]),

Wasn't it enough to set `passthrough=1` in order to enable cacheability
for debugging purposes? (Not that I oppose adding the flag, just curious
why enabling the IOMMU wasn't enough.)

> but for
> regular usecases, it wants cross-checking with the IOMMU flag (and
> encrypted VM flag in the future), and for all dynamic list checks to
> turn into a simple 'd->config & CDF_full_cacheability'.
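As a rough sketch of the shape being proposed (a simplified, self-contained
model rather than actual Xen code: CDF_full_cacheability is only the
proposed name, its bit value is made up, and the current dynamic check is
approximated from the iomem/ioport capability lists mentioned above):

/*
 * Simplified model of the change under discussion: today cacheability
 * is derived from per-domain lists that change at runtime; the proposal
 * is a creation-time flag checked instead.  Names and values below are
 * illustrative only.
 */
#include <stdbool.h>
#include <stdint.h>
#include <stdio.h>

#define CDF_full_cacheability (1u << 4)   /* proposed flag; bit made up */

struct domain_model {
    uint32_t config;           /* creation-time CDF_* flags */
    bool iomem_list_empty;     /* stands in for the iomem capability list */
    bool ioport_list_empty;    /* stands in for the ioport capability list */
};

/* Current shape: the answer changes as the lists fill or empty, so
 * hypercalls that touch them have to notify memory_type_changed(). */
static bool cache_flush_permitted_dynamic(const struct domain_model *d)
{
    return !d->iomem_list_empty || !d->ioport_list_empty;
}

/* Proposed shape: fixed at domain creation, nothing dynamic to re-check. */
static bool cache_flush_permitted_static(const struct domain_model *d)
{
    return d->config & CDF_full_cacheability;
}

int main(void)
{
    struct domain_model d = {
        .config = CDF_full_cacheability,
        .iomem_list_empty = true,
        .ioport_list_empty = true,
    };

    printf("dynamic: %d, static: %d\n",
           cache_flush_permitted_dynamic(&d),
           cache_flush_permitted_static(&d));
    return 0;
}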
>
> This way, we delete all calls to memory_type_changed() which are trying
> to cover the various dynamic lists becoming empty/non-empty, and we
> remove several ordering-of-hypercalls bugs where non-cacheable mappings
> can't actually be created on a VM declared to have an IOMMU until a
> device has actually been assigned to start with.

It should be possible with the current code to create non-cacheable
mappings on a VM as long as the IOMMU is enabled, even if no devices
are assigned to the VM.

> ~Andrew
>
> [1] MTRR handling is also buggy with reduced cacheability, causing some
> areas of RAM to be used UC; notably the grant table. This manifests as
> PV device perf being worse than qemu-emulated device perf, only when a
> GPU is added to a VM[2]. Instead of fixing this properly, it was hacked
> around by forcing IPAT=1 for Xenheap pages, which only "fixed" the
> problem on Intel (AMD has no equivalent mechanism), and needs reverting
> and fixing properly (i.e. get the vMTRR layout working correctly) to
> support VMs with encrypted RAM.

My understanding of the original problem was slightly different: we
place the grant table in a BAR region of the xenpci device, and hence
the gMTRRs are set to UC by the guest. The workaround in Xen is to cope
with existing Windows guest drivers not setting the gMTRR values for
the grant table frames to WB. Even if we calculated all the cache
attributes correctly, we would still need the 'hack'.

>
> [2] There's a second bug with memory_type_changed() in that it causes
> dreadful system performance during VM migration, which is something to
> do with the interaction of restoring vMTRRs for a VM that has a device
> but isn't running yet. This still needs investigating, and I suspect
> it's got a similar root cause.

XenServer seems to have a custom patch for this:

https://github.com/xenserver/xen.pg/blob/XS-8.3.x/patches/0001-x86-HVM-Avoid-cache-flush-operations-during-hvm_load.patch

But we could likely avoid the flush completely if the VM hasn't been
started yet (see the sketch appended below).

Thanks, Roger.
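A minimal sketch of the "skip the flush before the VM has started" idea
mentioned above, again as a simplified, self-contained model rather than
real Xen code (creation_finished is assumed to be the signal that the
domain has been unpaused at least once, and where exactly such a check
would live is an assumption):

#include <stdbool.h>
#include <stdio.h>

struct domain_model {
    bool creation_finished;    /* assumed: set when the domain first runs */
};

static void flush_all_emt(struct domain_model *d)
{
    /* stands in for the EPT-wide EMT invalidation and cache flush */
    (void)d;
    printf("expensive flush\n");
}

static void memory_type_changed_model(struct domain_model *d)
{
    /*
     * While the toolstack is still building the domain (vMTRR restore,
     * device assignment), the guest has not run yet and nothing can be
     * cached with stale attributes, so the invalidation can be skipped.
     */
    if ( !d->creation_finished )
        return;

    flush_all_emt(d);
}

int main(void)
{
    struct domain_model d = { .creation_finished = false };

    memory_type_changed_model(&d);    /* skipped: VM not started yet */

    d.creation_finished = true;
    memory_type_changed_model(&d);    /* now performs the flush */

    return 0;
}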