
RE: [PATCH v4 03/10] xen/page_alloc: Implement NUMA-node-specific claims


  • To: Jan Beulich <jbeulich@xxxxxxxx>
  • From: Bernhard Kaindl <bernhard.kaindl@xxxxxxxxxx>
  • Date: Thu, 5 Mar 2026 13:12:42 +0000
  • Cc: Andrew Cooper <andrew.cooper@xxxxxxxxxx>, Anthony PERARD <anthony.perard@xxxxxxxxxx>, Michal Orzel <michal.orzel@xxxxxxx>, Julien Grall <julien@xxxxxxx>, Roger Pau Monne <roger.pau@xxxxxxxxxx>, Stefano Stabellini <sstabellini@xxxxxxxxxx>, Marcus Granado <Marcus.Granado@xxxxxxxxxx>, "xen-devel@xxxxxxxxxxxxxxxxxxxx" <xen-devel@xxxxxxxxxxxxxxxxxxxx>, Alejandro Vallejo <Alejandro.GarciaVallejo@xxxxxxx>
  • Delivery-date: Thu, 05 Mar 2026 13:12:58 +0000
  • List-id: Xen developer discussion <xen-devel.lists.xenproject.org>

Jan Beulich wrote:
> > +    d->claim_node = NUMA_NO_NODE;
> 
> If, as the cover letter says, the new domctl is going to allow claiming from
> multiple nodes in one go, why would this new field still be necessary?

Roger requested that the domctl API allow claiming from multiple nodes in one
go, but he also specified that we should focus on getting the implementation
of a single node-specific claim done first, before diving into the multi-node
claims code.

- Instead of adding/linking an array of claims to struct domain, we can keep
  using d->outstanding_pages for the single-node claim.

- There are already numerous comments and questions on this minimal
  implementation. If we added multi-node claims to it, the review might
  become even more complex.

- The single-node claims backend contains the infrastructure and multi-node
  claims would be an extension on top of that infrastructure.
  
There is a thread where I added links to the past discussion:
https://lists.xenproject.org/archives/html/xen-devel/2026-02/msg01403.html

> > +static unsigned long node_outstanding_claims[MAX_NUMNODES];
> 
> How come this is being added, rather than it replacing outstanding_claims?

The global outstanding_claims variable counts the host-level claimed pages:
it holds the sum of all host-level claims that are not specific to a NUMA
node, plus the sum of all node-specific claims (more on that in an answer
to another question further below).

If we were to replace it, we'd lose the outstanding_claims counter, which
would mean Xen no longer supports global claims.

If a toolstack wanted to claim more memory than a single NUMA node has, it
would have to loop over all NUMA nodes with enough memory and split the
claim into a number of per-NUMA-node claims.

- This would be less flexible than the current approach: with it, we can
  still support host-level claims without having to place those claims on
  specific NUMA nodes.

- This allows many domains to be built in parallel, where some domains are
  built with claims on specific NUMA nodes, while the other domains are
  built dynamically at allocation time from whatever memory remains
  available.
  
- There are use cases where the memory of some domains shall be spread
  across NUMA nodes, to have the memory bandwidth of those nodes available
  to the individual processes in the guest domain, while still wanting the
  assurance of host-level claims for the claimed memory when constructing
  many domains in parallel on a host.

> > +/* Return available pages after subtracting claimed pages */
> > +static inline unsigned long available_after_claims(unsigned long avail_pages,
> > +                                                   unsigned long claims)
> > +{
> > +    BUG_ON(claims > avail_pages);
> > +    return avail_pages - claims; /* Due to the BUG_ON, it cannot be negative */
> > +}
> 
> A helper for a simple subtraction?

It is about not having to repeat the BUG_ON(claims > avail_pages) everywhere.

Also, the name of the helper makes clear what the result of the expression is,
so when using it, the flow of the code is more natural to understand.

Having to repeat the BUG_ON() everywhere would make the code less readable.
The BUG_ON() also serves as a guardrail during refactoring, catching cases
where the code has been broken.

> > +/* Answer if host-level memory and claims permit this request to proceed */
> > +static inline bool host_allocatable_request(const struct domain *d,
> > +                                            unsigned int memflags,
> > +                                            unsigned long request)
> > +{
> > +    unsigned long allocatable_pages;
> > +
> > +    ASSERT(spin_is_locked(&heap_lock));
> > +
> > +    allocatable_pages = available_after_claims(total_avail_pages,
> > +                                               outstanding_claims);
> > +    if ( allocatable_pages >= request )
> > +        return true; /* The not claimed pages are enough to proceed */
> > +
> > +    if ( !d || (memflags & MEMF_no_refcount) )
> > +        return false; /* Claims are not available for this allocation */
> > +
> > +    /* The domain's claims are available, return true if sufficient */
> > +    return request <= allocatable_pages + d->outstanding_pages;
> > +}
> 
> This only uses variables which existed before, i.e. there's nothing NUMA-ish
> in here. What's the deal?

The deal is that, for taking unclaimed memory beyond the remaining claims
into account when deciding whether the host has usable memory for a domain
with a claim, the needed if-expression would be quite complicated to
understand. Writing that logic inline, without extracting it into a
function, would bloat the flow of alloc_heap_pages(), especially if one
wants to keep the comments. I'm not sure that would be a good idea.

> > +/* Answer if node-level memory and claims permit this request to proceed */
> > +static inline bool node_allocatable_request(const struct domain *d,
> > +                                            unsigned int memflags,
> > +                                            unsigned long request,
> > +                                            nodeid_t node)
> > +{
> > +    unsigned long allocatable_pages;
> > +
> > +    ASSERT(spin_is_locked(&heap_lock));
> > +    ASSERT(node < MAX_NUMNODES);
> > +
> > +    allocatable_pages = available_after_claims(node_avail_pages(node),
> > +                                               node_outstanding_claims(node));
> > +    if ( allocatable_pages >= request )
> > +        return true; /* The not claimed pages are enough to proceed */
> > +
> > +    if ( !d || (memflags & MEMF_no_refcount) || (node != d->claim_node) )
> > +        return false; /* Claims are not available for this allocation */
> > +
> > +    /* The domain's claims are available, return true if sufficient */
> > +    return request <= allocatable_pages + d->outstanding_pages;
> > +}
> 
> And this is the NUMA counterpart, almost identical in the basic logic. If

Yes, and this is intentional: The same simple check just with per-node claims.

> (for whatever reason) both are really needed, I think it should at least be

The reason is that staking claims is the easy part. Protecting claims is the
(slightly) tricky part, which is what such functions help with.

> considered to fold them (with NUMA_NO_NODE indicating the non-NUMA intent).
> In fact the node != d->claim_node would probably also apply to the non-NUMA
> variant (as d->claim_node != NUMA_NO_NODE).
 
That's not true: the sum of all node-specific claims is also part of the
global host-level outstanding_claims counter, so the host-level check of
whether an allocation may proceed with respect to claims already
incorporates whether node-specific claims would (in theory) allow a
host-level allocation to proceed.

That means that for the host-level claims protection check of a domain,
the domain's claims must be accounted for, even if they are node-specific.

Thus, the `node != d->claim_node` check does not apply to the host-level
variant. Yes, I should merge the functions, but this check is only for
node-specific claims.
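
A merged variant could look roughly like the following self-contained
sketch. Everything here (the stand-in types, the state variables, the
`allocatable_request` name) is a simplified model, not the actual patch
code, and memflags handling is omitted for brevity:

```c
#include <assert.h>
#include <stdbool.h>

#define MAX_NUMNODES 4
#define NUMA_NO_NODE 0xff
typedef unsigned char nodeid_t;

/* Simplified stand-ins for the real heap state and struct domain. */
static unsigned long total_avail_pages = 1000;
static unsigned long outstanding_claims = 600;
static unsigned long node_avail[MAX_NUMNODES] = { 400, 600 };
static unsigned long node_claims[MAX_NUMNODES] = { 300, 300 };

struct domain {
    unsigned long outstanding_pages;   /* this domain's remaining claim */
    nodeid_t claim_node;               /* NUMA_NO_NODE for host-level */
};

/*
 * Merged model of host_allocatable_request()/node_allocatable_request():
 * node == NUMA_NO_NODE checks host-level state, otherwise per-node state.
 * The domain's claim only counts on the node it was staked on, but it
 * always counts at host level, because node-specific claims are included
 * in the global outstanding_claims counter.
 */
static bool allocatable_request(const struct domain *d, unsigned long request,
                                nodeid_t node)
{
    bool host = node == NUMA_NO_NODE;
    unsigned long avail = host ? total_avail_pages : node_avail[node];
    unsigned long claims = host ? outstanding_claims : node_claims[node];
    unsigned long allocatable = avail - claims;

    if ( allocatable >= request )
        return true;            /* unclaimed pages are enough to proceed */

    if ( !d || (!host && node != d->claim_node) )
        return false;           /* the domain's claim does not apply here */

    /* The domain's claim is available; proceed if it is sufficient */
    return request <= allocatable + d->outstanding_pages;
}
```

Note how the `node != d->claim_node` rejection is guarded by `!host`, so a
node-specific claim still backs a host-level check.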

> As to the comments in both functions, personally I think
> s/not claimed/unclaimed/ would be slightly more logical to follow.
> 
> In any event, the first of these function looks like it could be split out
> in a separate, earlier patch. Then (as per above) ideally here that function
> would simply be extended to become NUMA-capable.

Yes indeed, good point, I'll do that.

> > +    if ( d->outstanding_pages == 0 )
> > +        d->claim_node = NUMA_NO_NODE; /* Clear if no outstanding pages left
> 
> I fear I don't understand this. If the domain has claims on other nodes,
> why would would it be switched back to non-NUMA claims?

( d->outstanding_pages == 0 ) means d has no claim left: none at all.
As there is none left, nothing is switched; the claims are simply reset to
the no-claim state.

> > @@ -564,14 +642,41 @@ void consume_outstanding_claims(struct domain *d, unsigned long allocation)
> >
> >      /* Of course, the domain can only release up its outstanding claims */
> >      allocation = min(allocation, d->outstanding_pages + 0UL);
> > +
> > +    if ( d->claim_node != NUMA_NO_NODE && d->claim_node != alloc_node )
> > +    {
> > +        /*
> > +         * The domain has a claim on a node, but the alloc is on a different
> > +         * node. If it would exceed the domain's max_pages, reduce the claim
> > +         * up to the excess over max_pages so we don't reduce the claim more
> > +         * than we have to to honor the max_pages limit.
> > +         */
> > +        unsigned long booked_pages = domain_tot_pages(d) + allocation +
> > +                                     d->outstanding_pages;
> > +        if ( booked_pages <= d->max_pages )
> > +            return; /* booked is within max_pages, no excess, keep the claim */
> > +
> > +        /* Excess detected, release the exceeding pages from the claimed node */
> > +        allocation = min(allocation, booked_pages - d->max_pages);
> > +    }
> >      release_outstanding_claims(d, allocation);
> 
> Please can there be another blank line above this one?
> 
> Why is the adjustment made excluded for the NUMA_NO_NODE case? That's odd in

When the domain's claim is global, the allocation just consumes from this claim.

> itself, but particularly with release_outstanding_claims() possibly switching 
> a
> domain to NUMA_NO_NODE.
 
As commented above, it's not switching when the claims hit zero; it just
resets the state.

>                         Plus the caller looks to be passing in the actual node
> memory was taken from, not what the original request said (which is
> specifically relevant when the request named no particular node).

For this check, the target of the initial allocation request does not matter.
What matters here is that when, for whatever reason, an allocation ends up not
being made on the NUMA node the domain has a node-specific claim on, and this
allocation would push the claimed + allocated memory beyond d->max_pages, we
should release and consume the amount of claims that would exceed d->max_pages.

There is a bug here that I have already fixed in my draft for a v5 series: it
needs to use the size of the full allocation to calculate the excess beyond
d->max_pages, not the min() of the allocation size and d->outstanding_pages.
Will be fixed.
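
A small worked example of the intended (fixed) excess computation, with a
hypothetical helper name and made-up numbers, not the actual patch code:

```c
#include <assert.h>

/*
 * Model of the booked_pages/excess computation discussed above: the full
 * allocation size goes into the booking, and the release of the claim is
 * capped at the domain's remaining outstanding claim.
 */
static unsigned long claim_to_release(unsigned long tot_pages,
                                      unsigned long allocation,
                                      unsigned long outstanding,
                                      unsigned long max_pages)
{
    unsigned long booked = tot_pages + allocation + outstanding;

    if ( booked <= max_pages )
        return 0;              /* within max_pages: no excess, keep the claim */

    /* Release only as much of the claim as the booking exceeds max_pages */
    unsigned long excess = booked - max_pages;

    return excess < outstanding ? excess : outstanding;
}
```

E.g. with tot_pages=100, allocation=50, outstanding=80 and max_pages=200,
the booking is 230 pages, so 30 pages of the claim are released.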

> > + /* When releasing a claim, the node must be NUMA_NO_NODE (it is not used) */
> 
> Why would this be?

I need to fix the comment: it should say "resetting". When resetting claims
to 0, which is done by libxenguest after it has completed populating the
guest memory, we do not pass a NUMA node. As this function rejects any
update of an existing claim other than resetting it using 0, the reset
always applies to all claims of the domain, independent of which claims it
has (even multi-node claims). Roger added this check to make clear that we
expect NUMA_NO_NODE with this call.

> >          if ( (memflags & MEMF_exact_node) && req_node != NUMA_NO_NODE )
> >              return NULL;
> 
> As per this, ...
> 
> > @@ -1042,13 +1164,8 @@ static struct page_info *alloc_heap_pages(
> >
> >      spin_lock(&heap_lock);
> >
> > -    /*
> > -     * Claimed memory is considered unavailable unless the request
> > -     * is made by a domain with sufficient unclaimed pages.
> > -     */
> > -    if ( (outstanding_claims + request > total_avail_pages) &&
> > -          ((memflags & MEMF_no_refcount) ||
> > -           !d || d->outstanding_pages < request) )
> > +    /* Proceed if host-level memory and claims permit this request to proceed */
> > +    if ( !host_allocatable_request(d, memflags, request) )
> 
> ... in the MEMF_exact_node case I see little reason to check the global value here.

Ack: if (memflags & MEMF_exact_node) is set, we can indeed skip the host-wide
check, which would be an optimisation as we'd not have to look at the
host-level counters.

> Jan

Thanks for this review, I'll apply those changes!
Bernhard

 

