
Re: x86: memset() / clear_page() / page scrubbing

To: Jan Beulich <jbeulich@xxxxxxxx>
From: Ankur Arora <ankur.a.arora@xxxxxxxxxx>
Date: Fri, 9 Apr 2021 14:01:37 -0700
Cc: andrew.cooper3@xxxxxxxxxx, roger.pau@xxxxxxxxxx, xen-devel@xxxxxxxxxxxxxxxxxxxx
Delivery-date: Fri, 09 Apr 2021 21:02:16 +0000
List-id: Xen developer discussion <xen-devel.lists.xenproject.org>

On 2021-04-08 11:38 p.m., Jan Beulich wrote:

On 09.04.2021 08:08, Ankur Arora wrote:

I'm working on somewhat related optimizations on Linux (clear_page(),
going in the opposite direction, from REP STOSB to MOVNT) and have
some comments/questions below.


Interesting.

On 4/8/2021 6:58 AM, Jan Beulich wrote:

All,

since over the years we've been repeatedly talking of changing the
implementation of these fundamental functions, I've taken some time
to do some measurements (just for possible clear_page() alternatives
to keep things manageable). I'm not sure I want to spend as much time
subsequently on memcpy() / copy_page() (or more, because there are
yet more combinations of arguments to consider), so for the moment I
think the route we're going to pick here is going to more or less
also apply to those.

The present copy_page() is the way it is because of the desire to
avoid disturbing the cache. The effect of REP STOS on the L1 cache
(compared to the present use of MOVNTI) is more or less noticeable on
all hardware, and at least on Intel hardware more noticeable when the
cache starts out clean. For L2 the results are more mixed when
comparing cache-clean and cache-filled cases, but the difference
between MOVNTI and REP STOS remains or (at least on Zen2 and older
Intel hardware) becomes more prominent.


Could you give me any pointers on the cache-effects on this? This
obviously makes sense but I couldn't come up with any benchmarks
which would show this in a straightforward fashion.


No benchmarks in that sense, but a local debugging patch measuring
things before bringing up APs, to have a reasonably predictable
environment. I have attached it for your reference.


Thanks, that does look like a pretty good predictable test.
(Btw, there might be an oversight in the clear_page_clzero() logic.
I believe that also needs an sfence.)
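
To illustrate the fence point, here is a minimal user-space sketch of a non-temporal page clear that ends with an SFENCE. It uses MOVNTI through the SSE2 intrinsic rather than CLZERO so that it runs on any x86-64 machine. The names clear_page_movnti and PAGE_BYTES are made up for this example and are not taken from the attached patch. A CLZERO loop would presumably end with the same fence if its stores are similarly weakly ordered.

```c
#include <stddef.h>
#include <emmintrin.h>  /* _mm_stream_si64 (MOVNTI), x86-64 only */
#include <xmmintrin.h>  /* _mm_sfence */

#define PAGE_BYTES 4096  /* assumed page size for this sketch */

/* Zero one page with non-temporal stores; page must be 8-byte aligned. */
static void clear_page_movnti(void *page)
{
    long long *p = page;

    for (size_t i = 0; i < PAGE_BYTES / sizeof(*p); i++)
        _mm_stream_si64(&p[i], 0);

    /*
     * Non-temporal stores are weakly ordered. Fence so the zeroes are
     * globally visible before the page is handed to anyone else.
     */
    _mm_sfence();
}
```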

Just curious: you had commented out the local irq disable/enable clauses.
Is that because you decided that the code ran at an early enough
point that they were not required, or was there some other reason?

Otoh REP STOS, as was to be expected, in most cases has meaningfully
lower latency than MOVNTI.

Because I was curious I also included AVX (32-byte stores), AVX512
(64-byte stores), and AMD's CLZERO in my testing. While AVX is a
clear win except on the vendors' first generations implementing it
(but I've left out any playing with CR0.TS, which is what I expect
would take this out as an option), AVX512 isn't on Skylake (perhaps
newer hardware does better). CLZERO has slightly higher impact on
L1 than MOVNTI, but lower than REP STOS.


Could you elaborate on what kind of difference in L1 impact you are
talking about? Evacuation of cachelines?


Replacement of ones, yes. As you may see from that patch, I prefill
the cache, do the clearing, and then measure how much longer the
same operation takes that was used for prefilling. If the clearing
left the cache completely alone (or if the hw prefetcher was really
good), there would be no difference.


Yeah, that does sound like a good way to get an idea of how much the
clear_page_x() does perturb the cache.
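
The prefill / clear / re-measure idea described above can be sketched roughly as follows. This is a user-space approximation, not the attached patch: walk, cache_perturbation and clear_page_memset are invented names, the probe buffer size is left to the caller, and the unserialized RDTSC reads make a single run noisy, so real measurements would need repetition and a quiet system.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>
#include <x86intrin.h>  /* __rdtsc */

#define LINE_BYTES 64   /* assumed cache line size */

/* Touch one byte per cache line and return the elapsed TSC ticks. */
static uint64_t walk(const volatile unsigned char *buf, size_t len)
{
    uint64_t t0 = __rdtsc();

    for (size_t i = 0; i < len; i += LINE_BYTES)
        (void)buf[i];

    return __rdtsc() - t0;
}

/* Stand-in for a clear_page_x() implementation under test. */
static void clear_page_memset(void *page)
{
    memset(page, 0, 4096);
}

/*
 * Extra ticks the probe walk costs after clear_fn has run, relative to
 * a walk over the freshly prefilled cache. A result near zero suggests
 * the clearing left the probe's cache lines alone.
 */
static int64_t cache_perturbation(const unsigned char *probe, size_t probe_len,
                                  void (*clear_fn)(void *), void *page)
{
    uint64_t warm, after;

    walk(probe, probe_len);         /* prefill the cache */
    warm = walk(probe, probe_len);  /* baseline with the cache filled */
    clear_fn(page);                 /* operation under test */
    after = walk(probe, probe_len); /* same walk again */

    return (int64_t)(after - warm);
}
```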

Its latency is between
both when the caches are warm, and better than both when the caches
are cold.

Therefore I think that we want to distinguish page clearing (where
we care about latency) from (background) page scrubbing (where I
think the goal ought to be to avoid disturbing the caches). That
would make it
- REP STOS{L,Q} for clear_page() (perhaps also to be used for
   synchronous scrubbing),
- MOVNTI for scrub_page() (when done from idle context), unless
   CLZERO is available.
Whether in addition we should take into consideration activity of
other (logical) CPUs sharing caches I don't know - this feels like
it could get complex pretty quickly.
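
Read as code, the proposed split might look like the sketch below. The enum and function names are hypothetical. The sketch treats synchronous scrubbing like clear_page(), which the proposal only suggests as a possibility, and it ignores the activity of sibling CPUs.

```c
#include <stdbool.h>

enum clear_method  { CLEAR_REP_STOS, CLEAR_MOVNTI, CLEAR_CLZERO };
enum clear_context { CTX_CLEAR_PAGE, CTX_SCRUB_SYNC, CTX_SCRUB_IDLE };

/* Pick a clearing primitive from the calling context and CPU features. */
static enum clear_method pick_clear_method(enum clear_context ctx,
                                           bool has_clzero)
{
    /* Background scrubbing: avoid disturbing the caches. */
    if (ctx == CTX_SCRUB_IDLE)
        return has_clzero ? CLEAR_CLZERO : CLEAR_MOVNTI;

    /* Latency-sensitive clearing (and, perhaps, synchronous scrubbing). */
    return CLEAR_REP_STOS;
}
```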


The one other case might be for ~L3 (or larger) regions. In my tests,
MOVNT/CLZERO is almost always better (the one exception being Skylake)
wrt both cache and latency for larger extents.


Good to know - will keep this in mind.

In the particular cases I was looking at (mmap+MAP_POPULATE and
page-fault path), that makes the choice of always using MOVNT/CLZERO
easy for GB pages, but fuzzier for 2MB pages.
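
As a rough illustration of the extent-based choice, one could compare the size being cleared against the L3 size. This is only a sketch of the heuristic: prefer_nontemporal is a hypothetical helper, the L3 size is assumed to be supplied by the caller, and the hard cutoff at exactly one L3 is an assumption, since the 2MB case is not clear-cut.

```c
#include <stdbool.h>
#include <stddef.h>

/*
 * Prefer MOVNT/CLZERO once the extent to clear is about as large as
 * the L3 cache. Smaller extents fall back to the cacheable path.
 */
static bool prefer_nontemporal(size_t extent_bytes, size_t l3_bytes)
{
    return extent_bytes >= l3_bytes;
}
```

With a 32MB L3 (an example value), a 1GB page would take the non-temporal path and a single 4KB page would not.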

Not sure if the large-page case is interesting for you though.


Well, we never fill large pages in one go, yet the scrubbing may
touch many individual pages in close succession. But for the
(background) scrubbing my recommendation is to use MOVNT/CLZERO
anyway, irrespective of volume. While upon large page allocations
we may also end up scrubbing many pages in close succession, I'm
not sure that's worth optimizing for - we at least hope for the
pages to have got scrubbed in the background before they get
re-used. Plus we don't (currently) know up front how many of them
may still need scrubbing; this isn't difficult to at least
estimate, but may require yet another loop over the constituent
pages.


Agreed MOVNT/CLZERO do seem ideally suited for background scrubbing.
Alas, AFAICS Linux currently only does foreground cleaning. The
only reason I can think of for that "decision" is maybe that
there is one trusted user with a significant footprint -- the page
cache -- where pages can be allocated without needing to be cleared.

That said, given that background scrubbing is a fairly cheap way of
time-shifting work to idle without negatively affecting the cache
it does make sense to move towards it for at least a subset of pages.

The only potential negative could be higher power consumption
because idle is spending less time in C-states. That said, that
also seems like a wash given that this only shifts when we do
the clearing.
Would you have any intuition on whether the power consumption of
the non-temporal primitives is meaningfully different from
REP STOS and friends?

Ankur

Jan

