Re: [PATCH] x86/ept: simplify detection of special pages for EMT calculation
On 26.09.2022 16:27, Roger Pau Monné wrote:
> On Mon, Sep 26, 2022 at 10:38:40AM +0200, Jan Beulich wrote:
>> On 23.09.2022 12:56, Roger Pau Monne wrote:
>>> The current way to detect whether a page handed to
>>> epte_get_entry_emt() is special and needs a forced write-back cache
>>> attribute involves iterating over all the smaller 4K pages for
>>> superpages.
>>>
>>> Such a loop consumes a high amount of CPU time for 1GiB pages (order
>>> 18): on a Xeon® Silver 4216 (Cascade Lake) at 2GHz it takes 1.5ms on
>>> average. Note that this figure accounts only
>>> for the is_special_page() loop, and not the whole code of
>>> epte_get_entry_emt(). Also the resolve_misconfig() operation that
>>> calls into epte_get_entry_emt() is done while holding the p2m lock in
>>> write (exclusive) mode, which blocks concurrent EPT_MISCONFIG faults
>>> and prevents most guest hypercalls from progressing, due to the need to
>>> take the p2m lock in read mode to access any guest-provided hypercall
>>> buffers.
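(For scale: order 18 means 1ul << 18 = 262144 4K pages per 1GiB mapping,
so 1.5ms at 2GHz comes to about 3 million cycles, or roughly 11 cycles
per is_special_page() check; the loop is long rather than individually
expensive.)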
>>>
>>> Simplify the checking in epte_get_entry_emt() and remove the loop,
>>> assuming that no superpage will be only partially special.
>>>
>>> So far we have no special superpages added to the guest p2m,
>>
>> We may not be adding them as superpages, but what a guest makes of
>> the pages it is given access to for e.g. grant handling, or what Dom0
>> makes of e.g. the (per-CPU) trace buffers is unknown. And I guess
>> Dom0 ending up with a non-WB mapping of the trace buffers might
>> impact tracing quite a bit. I don't think we can rely on guests not
>> making any such pages the subject of a large-range mapping attempt, which
>> might end up suitable for a superpage mapping (recall that rather
>> sooner than later we ought to finally re-combine suitable ranges of
>> contiguous 4k mappings into 2M ones, just like we [now] do in IOMMU
>> code).
>
> Hm, don't pages used for grant handling (XENMAPSPACE_grant_table) end
> up mapped as 4K entries in the p2m page tables? The code in
> xenmem_add_to_physmap_one() seems to remove and re-add them with
> order 0. Same with the trace buffers: they are added to the p2m as
> order 0.
Indeed. I was halfway through writing the earlier response when I
recalled that aspect; I may not have succeeded in adjusting the text
to properly convey that the concern applies only to future code, not
to what we have right now.
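As a rough illustration (a sketch, not the actual code; prev_mfn and rc
are stand-in locals): the effect in xenmem_add_to_physmap_one() boils
down to the below, which is what keeps such pages out of superpage
mappings today.

    /* Sketch: drop any previous mapping and re-insert at 4K granularity. */
    guest_physmap_remove_page(d, gfn, prev_mfn, PAGE_ORDER_4K);
    rc = guest_physmap_add_page(d, gfn, mfn, PAGE_ORDER_4K);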
> Note that when coalescing we would need to be careful then to not
> coalesce special pages.
Well, no, ...
> That might not be the best model though, because I'm not sure why we
> require XENMAPSPACE_grant_table entries to not be mapped as part of a
> superpage in the guest p2m.
... as you say here, there may actually be benefits from allowing such
(re-)coalescing.
>> Since for data structures like the ones named above 2M mappings
>> might be enough (i.e. there might be little "risk" of even needing to
>> go to 1G ones), could we maybe take a "middle" approach and check all
>> pages when order == 9, but use your approach for higher orders? The
>> to-be-added re-coalescing would then need to be taught to refuse re-
>> coalescing of such ranges to larger-than-2M mappings, while still
>> at least allowing for 2M ones. (Special casing at that boundary is
>> going to be necessary also for shadow code, at the very least.) But
>> see also below as to caveats.
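A minimal sketch of that middle approach, reusing the helpers visible in
the patch below (its exact placement inside epte_get_entry_emt() is
assumed):

    /* Scan sub-pages for mappings up to 2M; trust the first 4K chunk
     * for anything larger, as suggested above. */
    if ( order <= PAGE_ORDER_2M )
    {
        unsigned long i, special_pgs = 0;

        for ( i = 0; i < (1ul << order); i++ )
            if ( is_special_page(mfn_to_page(mfn_add(mfn, i))) )
                special_pgs++;

        if ( special_pgs && special_pgs != (1ul << order) )
            return -1;                       /* mixed: force a split */

        if ( special_pgs )
        {
            *ipat = true;
            return MTRR_TYPE_WRBACK;
        }
    }
    else if ( is_special_page(mfn_to_page(mfn)) )
    {
        /* 1G mapping: assume uniformly special, as the patch does. */
        *ipat = true;
        return MTRR_TYPE_WRBACK;
    }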
>
> I guess a rangeset would be more future proof than anything else.
>
>>> and in
>>> any case the forcing of the write-back cache attribute is a courtesy
>>> to the guest to avoid such ranges being accessed as uncached when not
>>> really needed. It's not acceptable for such assistance to tax the
>>> system so badly.
>>
>> I agree we would better improve the situation, but I don't think we
>> can do so by ...
>>
>>> @@ -518,26 +517,19 @@ int epte_get_entry_emt(struct domain *d, gfn_t gfn, mfn_t mfn,
>>> return MTRR_TYPE_UNCACHABLE;
>>> }
>>>
>>> - if ( type != p2m_mmio_direct && !is_iommu_enabled(d) &&
>>> - !cache_flush_permitted(d) )
>>> + if ( (type != p2m_mmio_direct && !is_iommu_enabled(d) &&
>>> + !cache_flush_permitted(d)) ||
>>> + /*
>>> + * Assume the whole page to be special if the first 4K chunk is:
>>> +          * iterating over all possible 4K sub-pages for higher order pages is
>>> +          * too expensive.
>>> + */
>>> + is_special_page(mfn_to_page(mfn)) )
>>
>> ... building in assumptions like this one, especially since here you may
>> also produce too weak a memory type (think of a later page in the range
>> requiring a stronger-ordered memory type).
>>
>> While it may not help much, ...
>>
>>> {
>>> *ipat = true;
>>> return MTRR_TYPE_WRBACK;
>>> }
>>>
>>> - for ( special_pgs = i = 0; i < (1ul << order); i++ )
>>> - if ( is_special_page(mfn_to_page(mfn_add(mfn, i))) )
>>> - special_pgs++;
>>> -
>>> - if ( special_pgs )
>>> - {
>>> - if ( special_pgs != (1ul << order) )
>>> - return -1;
>>> -
>>> - *ipat = true;
>>> - return MTRR_TYPE_WRBACK;
>>> - }
>>
>> ... this logic could be improved to at least bail from the loop once it's
>> clear that the "-1" return path will be taken. Improvements beyond that
>> would likely involve adding some data structure (rangeset?) to track
>> special pages.
>
> For the guest I was running, the loop didn't find any special pages in
> order 18 mappings, which are the most troublesome to handle in the
> loop. I'm not sure bailing early would make that much of a difference
> in practice TBH.
As said - it may not help much.
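For reference, an early-bail variant of the removed loop might look like
this (sketch only). Note that it still scans the whole range in the
common case where no special page is found at all, which is why it may
not help much:

    bool seen_special = false, seen_normal = false;
    unsigned long i;

    for ( i = 0; i < (1ul << order); i++ )
    {
        if ( is_special_page(mfn_to_page(mfn_add(mfn, i))) )
            seen_special = true;
        else
            seen_normal = true;

        /* Mixed range: the -1 (split) outcome is already certain. */
        if ( seen_special && seen_normal )
            return -1;
    }

    if ( seen_special )
    {
        *ipat = true;
        return MTRR_TYPE_WRBACK;
    }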
> I did also consider using a rangeset, but that would use more
> per-domain memory and also require coordination to add special pages
> to it.
"Special" is a global property, so I don't think this would need to be a
per-domain data structure.
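A sketch of what such global tracking could look like, using Xen's
rangeset interface (the special_mfns variable, its initialization site
and its population wherever pages become special are all hypothetical):

    static struct rangeset *__read_mostly special_mfns;

    /* Init: a global (domain-less) rangeset. */
    special_mfns = rangeset_new(NULL, "special MFNs", 0);

    /* In epte_get_entry_emt(), replacing the per-page loop: */
    if ( rangeset_overlaps_range(special_mfns, mfn_x(mfn),
                                 mfn_x(mfn) + (1ul << order) - 1) )
    {
        if ( !rangeset_contains_range(special_mfns, mfn_x(mfn),
                                      mfn_x(mfn) + (1ul << order) - 1) )
            return -1;             /* partially special: split */

        *ipat = true;
        return MTRR_TYPE_WRBACK;
    }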
Jan