
Re: [PATCH v2 2/3] xen/mm: allow deferred scrub of physmap populate allocated pages


  • To: Roger Pau Monne <roger.pau@xxxxxxxxxx>
  • From: Jan Beulich <jbeulich@xxxxxxxx>
  • Date: Mon, 19 Jan 2026 14:00:49 +0100
  • Cc: Andrew Cooper <andrew.cooper3@xxxxxxxxxx>, Anthony PERARD <anthony.perard@xxxxxxxxxx>, Michal Orzel <michal.orzel@xxxxxxx>, Julien Grall <julien@xxxxxxx>, Stefano Stabellini <sstabellini@xxxxxxxxxx>, xen-devel@xxxxxxxxxxxxxxxxxxxx
  • Delivery-date: Mon, 19 Jan 2026 13:01:04 +0000
  • List-id: Xen developer discussion <xen-devel.lists.xenproject.org>

On 15.01.2026 12:18, Roger Pau Monne wrote:
> --- a/xen/common/domain.c
> +++ b/xen/common/domain.c
> @@ -722,6 +722,15 @@ static void _domain_destroy(struct domain *d)
>  
>      XVFREE(d->console);
>  
> +    if ( d->pending_scrub )
> +    {
> +        BUG_ON(d->creation_finished);
> +        free_domheap_pages(d->pending_scrub, d->pending_scrub_order);
> +        d->pending_scrub = NULL;
> +        d->pending_scrub_order = 0;
> +        d->pending_scrub_index = 0;
> +    }

Because the other fields are zeroed as well (that's not strictly needed, is
it?), it may be a little awkward to use FREE_DOMHEAP_PAGES() here. Yet I would
still have recommended against open-coding it, if only we had such a
wrapper already.

Would this be better done earlier, in domain_kill(), to avoid needlessly
holding back memory that isn't going to be used by this domain anymore?
That would require acquiring the spinlock to guard against a racing
stash_allocation(), I suppose. In fact freeing right in
domain_unpause_by_systemcontroller() might be better still (albeit without
eliminating the need to do it here or in domain_kill()).
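
To illustrate the locked variant, here is a minimal standalone sketch, not
Xen code. The types, the model_*() functions and the helper name
drop_pending_scrub() are all hypothetical stand-ins; in the hypervisor the
lock would be d->page_alloc_lock and the free would be
free_domheap_pages(). The point is only that one helper, taking the lock
itself, could be called from every teardown path.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical stand-ins for the Xen types; not the real definitions. */
struct page_info { int freed; };

struct domain {
    struct page_info *pending_scrub;
    unsigned int pending_scrub_order;
    unsigned int pending_scrub_index;
    int lock_depth;               /* models d->page_alloc_lock */
};

static void model_lock(struct domain *d)
{
    d->lock_depth++;
}

static void model_unlock(struct domain *d)
{
    assert(d->lock_depth > 0);
    d->lock_depth--;
}

/* Models free_domheap_pages(); only records that the free happened. */
static void model_free_pages(struct page_info *pg, unsigned int order)
{
    (void)order;
    pg->freed++;
}

/*
 * Hypothetical helper: release a stashed allocation, if any, under the
 * lock so that a racing stash operation cannot observe a stale pointer.
 * Safe to call more than once.
 */
static void drop_pending_scrub(struct domain *d)
{
    model_lock(d);

    if ( d->pending_scrub )
    {
        model_free_pages(d->pending_scrub, d->pending_scrub_order);
        d->pending_scrub = NULL;
        d->pending_scrub_order = 0;
        d->pending_scrub_index = 0;
    }

    model_unlock(d);
}
```

Being idempotent, such a helper could be invoked early and again at final
destruction without the second call doing any harm.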

> @@ -1678,6 +1687,14 @@ int domain_unpause_by_systemcontroller(struct domain *d)
>       */
>      if ( new == 0 && !d->creation_finished )
>      {
> +        if ( d->pending_scrub )
> +        {
> +            printk(XENLOG_ERR
> +                   "%pd: cannot be started with pending dirty pages, destroying\n",

s/dirty/unscrubbed/ to avoid ambiguity with "dirty" as in "needing writeback"?

> --- a/xen/common/memory.c
> +++ b/xen/common/memory.c
> @@ -159,6 +159,74 @@ static void increase_reservation(struct memop_args *a)
>      a->nr_done = i;
>  }
>  
> +/*
> + * Temporary storage for a domain assigned page that's not been fully scrubbed.
> + * Stored pages must be domheap ones.
> + *
> + * The stashed page can be freed at any time by Xen, the caller must pass the
> + * order and NUMA node requirement to the fetch function to ensure the
> + * currently stashed page matches it's requirements.
> + */
> +static void stash_allocation(struct domain *d, struct page_info *page,
> +                             unsigned int order, unsigned int scrub_index)
> +{
> +    BUG_ON(d->creation_finished);

Is this valid here and ...

> +    rspin_lock(&d->page_alloc_lock);
> +
> +    /*
> +     * Drop any stashed allocation to accommodated the current one.  This
> +     * interface is designed to be used for single-threaded domain creation.
> +     */
> +    if ( d->pending_scrub )
> +        free_domheap_pages(d->pending_scrub, d->pending_scrub_order);
> +
> +    d->pending_scrub_index = scrub_index;
> +    d->pending_scrub_order = order;
> +    d->pending_scrub = page;
> +
> +    rspin_unlock(&d->page_alloc_lock);
> +}
> +
> +static struct page_info *get_stashed_allocation(struct domain *d,
> +                                                unsigned int order,
> +                                                nodeid_t node,
> +                                                unsigned int *scrub_index)
> +{
> +    struct page_info *page = NULL;
> +
> +    BUG_ON(d->creation_finished && d->pending_scrub);

... here? A badly behaved toolstack could do a populate in parallel with
the initial unpause, couldn't it?

> +    rspin_lock(&d->page_alloc_lock);
> +
> +    /*
> +     * If there's a pending page to scrub check it satisfies the current
> +     * request.  If it doesn't keep it stashed and return NULL.
> +     */
> +    if ( !d->pending_scrub || d->pending_scrub_order != order ||
> +         (node != NUMA_NO_NODE && node != page_to_nid(d->pending_scrub)) )

Ah, and MEMF_exact_node is handled in the caller.

> +        goto done;
> +    else
> +    {
> +        page = d->pending_scrub;
> +        *scrub_index = d->pending_scrub_index;
> +    }
> +
> +    /*
> +     * The caller now owns the page, clear stashed information.  Prevent
> +     * concurrent usages of get_stashed_allocation() from returning the same
> +     * page to different contexts.
> +     */
> +    d->pending_scrub_index = 0;
> +    d->pending_scrub_order = 0;
> +    d->pending_scrub = NULL;
> +
> + done:
> +    rspin_unlock(&d->page_alloc_lock);
> +
> +    return page;
> +}

Hmm, you free the earlier allocation only in stash_allocation(), i.e. that
memory isn't available to fulfill the present request. (I do understand
that the freeing there can't be dropped, to deal with possible races
caused by the toolstack.)

The use of "goto" here also looks a little odd, as it would be easy to get
away without it. Failing that, I'd like to ask that the "else" be dropped.
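
For illustration, a goto-free shape could look like the standalone sketch
below. It is a model only: the struct definitions, MODEL_NO_NODE (standing
in for NUMA_NO_NODE), the nid field (standing in for page_to_nid()) and the
function name are hypothetical, and the locking is reduced to comments. The
condition is simply the negation of the one in the patch.

```c
#include <stddef.h>

#define MODEL_NO_NODE 0xffU   /* stands in for NUMA_NO_NODE */

/* Hypothetical stand-ins for the Xen types; not the real definitions. */
struct page_info { unsigned int nid; };

struct domain {
    struct page_info *pending_scrub;
    unsigned int pending_scrub_order;
    unsigned int pending_scrub_index;
};

static struct page_info *get_stashed_model(struct domain *d,
                                           unsigned int order,
                                           unsigned int node,
                                           unsigned int *scrub_index)
{
    struct page_info *page = NULL;

    /* rspin_lock(&d->page_alloc_lock) would go here. */

    /* Hand the page over only if it satisfies the current request. */
    if ( d->pending_scrub && d->pending_scrub_order == order &&
         (node == MODEL_NO_NODE || node == d->pending_scrub->nid) )
    {
        page = d->pending_scrub;
        *scrub_index = d->pending_scrub_index;

        /* The caller now owns the page, clear stashed information. */
        d->pending_scrub_index = 0;
        d->pending_scrub_order = 0;
        d->pending_scrub = NULL;
    }

    /* rspin_unlock(&d->page_alloc_lock) would go here. */

    return page;
}
```

A non-matching request leaves the stash untouched and returns NULL, as in
the patch.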

> @@ -286,6 +365,30 @@ static void populate_physmap(struct memop_args *a)
>                      goto out;
>                  }
>  
> +                if ( !d->creation_finished )
> +                {
> +                    unsigned int dirty_cnt = 0, j;

Declaring (another) j here is going to upset Eclair, I fear, as ...

> +                    /* Check if there's anything to scrub. */
> +                    for ( j = scrub_start; j < (1U << a->extent_order); j++ )
> +                    {
> +                        if ( !test_and_clear_bit(_PGC_need_scrub,
> +                                                 &page[j].count_info) )
> +                            continue;
> +
> +                        scrub_one_page(&page[j], true);
> +
> +                        if ( (j + 1) != (1U << a->extent_order) &&
> +                             !(++dirty_cnt & 0xff) &&
> +                             hypercall_preempt_check() )
> +                        {
> +                            a->preempted = 1;
> +                            stash_allocation(d, page, a->extent_order, ++j);

Better j + 1, as j's value isn't supposed to be used any further?

> +                            goto out;
> +                        }
> +                    }
> +                }
> +
>                  if ( unlikely(a->memflags & MEMF_no_tlbflush) )
>                  {
>                      for ( j = 0; j < (1U << a->extent_order); j++ )

... for this to work there must already be one available from an outer scope.
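
To make both remarks concrete, here is a small standalone model of the scrub
loop, not the actual populate_physmap() code. The flag array, NEED_SCRUB,
the batch parameter (standing in for the 0xff mask plus
hypercall_preempt_check()) and the function name are all hypothetical. It
uses a single j from the enclosing scope and hands back j + 1 as the resume
index instead of modifying j.

```c
#include <stdbool.h>

#define NEED_SCRUB 1U   /* models _PGC_need_scrub */

/*
 * Scrub flags[start..nr-1].  Returns nr when everything was processed,
 * otherwise the index to resume from, with *preempted set.  The batch
 * argument must be non-zero.
 */
static unsigned int scrub_range_model(unsigned int *flags, unsigned int nr,
                                      unsigned int start, unsigned int batch,
                                      bool *preempted)
{
    unsigned int j, dirty_cnt = 0;   /* the only j in this function */

    *preempted = false;

    for ( j = start; j < nr; j++ )
    {
        if ( !(flags[j] & NEED_SCRUB) )
            continue;

        /* Models test_and_clear_bit() followed by scrub_one_page(). */
        flags[j] &= ~NEED_SCRUB;

        /* No point preempting once the last page has been handled. */
        if ( (j + 1) != nr && !(++dirty_cnt % batch) )
        {
            *preempted = true;
            return j + 1;            /* resume index, j left unmodified */
        }
    }

    return nr;
}
```

A caller would stash the returned index together with the page and pass it
back as start on the next invocation.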

Jan



 

