Re: [PATCH] x86/ept: simplify detection of special pages for EMT calculation
On 26.09.2022 16:27, Roger Pau Monné wrote:
> On Mon, Sep 26, 2022 at 10:38:40AM +0200, Jan Beulich wrote:
>> On 23.09.2022 12:56, Roger Pau Monne wrote:
>>> The current way to detect whether a page handed to
>>> epte_get_entry_emt() is special and needs a forced write-back cache
>>> attribute involves iterating over all the smaller 4K pages for
>>> superpages.
>>>
>>> Such a loop consumes a high amount of CPU time for 1GiB pages (order
>>> 18): on a Xeon® Silver 4216 (Cascade Lake) at 2GHz it takes 1.5ms on
>>> average. Note that this figure accounts only
>>> for the is_special_page() loop, and not the whole code of
>>> epte_get_entry_emt(). Also the resolve_misconfig() operation that
>>> calls into epte_get_entry_emt() is done while holding the p2m lock in
>>> write (exclusive) mode, which blocks concurrent EPT_MISCONFIG faults
>>> and prevents most guest hypercalls from progressing, due to the need to
>>> take the p2m lock in read mode to access any guest-provided hypercall
>>> buffers.
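(For scale: order 18 means 1ul << 18 = 262144 4K pages per 1GiB mapping,
so 1.5ms at 2GHz comes to about 3 million cycles, or roughly 11 cycles
per is_special_page() check; the loop is long rather than individually
expensive.)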
>>>
>>> Simplify the checking in epte_get_entry_emt() and remove the loop,
>>> assuming that no superpage will be only partially special.
>>>
>>> So far we have no special superpages added to the guest p2m,
>>
>> We may not be adding them as superpages, but what a guest makes of
>> the pages it is given access to for e.g. grant handling, or what Dom0
>> makes of e.g. the (per-CPU) trace buffers is unknown. And I guess
>> Dom0 ending up with a non-WB mapping of the trace buffers might
>> impact tracing quite a bit. I don't think we can rely on guests not
>> making any such pages the subject of a large-range mapping attempt, which
>> might end up suitable for a superpage mapping (recall that rather
>> sooner than later we ought to finally re-combine suitable ranges of
>> contiguous 4k mappings into 2M ones, just like we [now] do in IOMMU
>> code).
>
> Hm, don't pages used for grant handling (XENMAPSPACE_grant_table) end
> up mapped as 4K entries in the p2m page tables? The code in
> xenmem_add_to_physmap_one() seems to remove and re-add them with
> order 0. Same with the trace buffers: they are added to the p2m as
> order 0.
Indeed. I was halfway through writing the earlier response when I
recalled that aspect; I may not have succeeded in adjusting the text
to properly convey that the concern applies only to future code, not
to what we have right now.
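As a rough illustration (a sketch, not the actual code; prev_mfn and rc
are stand-in locals): the effect in xenmem_add_to_physmap_one() boils
down to the below, which is what keeps such pages out of superpage
mappings today.

    /* Sketch: drop any previous mapping and re-insert at 4K granularity. */
    guest_physmap_remove_page(d, gfn, prev_mfn, PAGE_ORDER_4K);
    rc = guest_physmap_add_page(d, gfn, mfn, PAGE_ORDER_4K);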
> Note that when coalescing we would need to be careful then to not
> coalesce special pages.
Well, no, ...
> That might not be the best model though, because I'm not sure why we
> require XENMAPSPACE_grant_table entries to not be mapped as part of a
> superpage in the guest p2m.
... as you say here, there may actually be benefits from allowing such
(re-)coalescing.
>> Since for data structures like the ones named above 2M mappings
>> might be enough (i.e. there might be little "risk" of even needing to
>> go to 1G ones), could we maybe take a "middle" approach and check all
>> pages when order == 9, but use your approach for higher orders? The
>> to-be-added re-coalescing would then need to be taught to refuse re-
>> coalescing of such ranges to larger-than-2M mappings, while still
>> at least allowing for 2M ones. (Special casing at that boundary is
>> going to be necessary also for shadow code, at the very least.) But
>> see also below as to caveats.
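A minimal sketch of that middle approach, reusing the helpers visible in
the patch below (its exact placement inside epte_get_entry_emt() is
assumed):

    /* Scan sub-pages for mappings up to 2M; trust the first 4K chunk
     * for anything larger, as suggested above. */
    if ( order <= PAGE_ORDER_2M )
    {
        unsigned long i, special_pgs = 0;

        for ( i = 0; i < (1ul << order); i++ )
            if ( is_special_page(mfn_to_page(mfn_add(mfn, i))) )
                special_pgs++;

        if ( special_pgs && special_pgs != (1ul << order) )
            return -1;                       /* mixed: force a split */

        if ( special_pgs )
        {
            *ipat = true;
            return MTRR_TYPE_WRBACK;
        }
    }
    else if ( is_special_page(mfn_to_page(mfn)) )
    {
        /* 1G mapping: assume uniformly special, as the patch does. */
        *ipat = true;
        return MTRR_TYPE_WRBACK;
    }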
>
> I guess a rangeset would be more future proof than anything else.
>
>>> and in
>>> any case the forcing of the write-back cache attribute is a courtesy
>>> to the guest to avoid such ranges being accessed as uncached when not
>>> really needed. It's not acceptable for such assistance to tax the
>>> system so badly.
>>
>> I agree we would better improve the situation, but I don't think we
>> can do so by ...
>>
>>> @@ -518,26 +517,19 @@ int epte_get_entry_emt(struct domain *d, gfn_t gfn, mfn_t mfn,
>>> return MTRR_TYPE_UNCACHABLE;
>>> }
>>>
>>> - if ( type != p2m_mmio_direct && !is_iommu_enabled(d) &&
>>> - !cache_flush_permitted(d) )
>>> + if ( (type != p2m_mmio_direct && !is_iommu_enabled(d) &&
>>> + !cache_flush_permitted(d)) ||
>>> + /*
>>> + * Assume the whole page to be special if the first 4K chunk is:
>>> +          * iterating over all possible 4K sub-pages for higher order pages is
>>> +          * too expensive.
>>> + */
>>> + is_special_page(mfn_to_page(mfn)) )
>>
>> ... building in assumptions like this one, especially since here you may
>> also produce too weak a memory type (think of a later page in the range
>> requiring a stronger-ordered memory type).
>>
>> While it may not help much, ...
>>
>>> {
>>> *ipat = true;
>>> return MTRR_TYPE_WRBACK;
>>> }
>>>
>>> - for ( special_pgs = i = 0; i < (1ul << order); i++ )
>>> - if ( is_special_page(mfn_to_page(mfn_add(mfn, i))) )
>>> - special_pgs++;
>>> -
>>> - if ( special_pgs )
>>> - {
>>> - if ( special_pgs != (1ul << order) )
>>> - return -1;
>>> -
>>> - *ipat = true;
>>> - return MTRR_TYPE_WRBACK;
>>> - }
>>
>> ... this logic could be improved to at least bail from the loop once it's
>> clear that the "-1" return path will be taken. Improvements beyond that
>> would likely involve adding some data structure (rangeset?) to track
>> special pages.
>
> For the guest I was running, the loop didn't find any special pages in
> order 18 mappings, which are the most troublesome to handle in the
> loop. I'm not sure bailing early would make that much of a difference
> in practice TBH.
As said - it may not help much.
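For reference, an early-bail variant of the removed loop might look like
this (sketch only). Note that it still scans the whole range in the
common case where no special page is found at all, which is why it may
not help much:

    bool seen_special = false, seen_normal = false;
    unsigned long i;

    for ( i = 0; i < (1ul << order); i++ )
    {
        if ( is_special_page(mfn_to_page(mfn_add(mfn, i))) )
            seen_special = true;
        else
            seen_normal = true;

        /* Mixed range: the -1 (split) outcome is already certain. */
        if ( seen_special && seen_normal )
            return -1;
    }

    if ( seen_special )
    {
        *ipat = true;
        return MTRR_TYPE_WRBACK;
    }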
> I did also consider using a rangeset, but that would use more
> per-domain memory and also require coordination to add special pages
> to it.
"Special" is a global property, so I don't think this would need to be a
per-domain data structure.
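A sketch of what such global tracking could look like, using Xen's
rangeset interface (the special_mfns variable, its initialization site
and its population wherever pages become special are all hypothetical):

    static struct rangeset *__read_mostly special_mfns;

    /* Init: a global (domain-less) rangeset. */
    special_mfns = rangeset_new(NULL, "special MFNs", 0);

    /* In epte_get_entry_emt(), replacing the per-page loop: */
    if ( rangeset_overlaps_range(special_mfns, mfn_x(mfn),
                                 mfn_x(mfn) + (1ul << order) - 1) )
    {
        if ( !rangeset_contains_range(special_mfns, mfn_x(mfn),
                                      mfn_x(mfn) + (1ul << order) - 1) )
            return -1;             /* partially special: split */

        *ipat = true;
        return MTRR_TYPE_WRBACK;
    }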
Jan