
Re: [PATCH 2/2] x86/numa: introduce per-NUMA node flush locks



On Wed, Jun 04, 2025 at 02:35:01PM +0200, Jan Beulich wrote:
> On 22.05.2025 10:48, Roger Pau Monne wrote:
> > Contention around the global flush_lock increases as the number of physical
> > CPUs on the host grows.  Sadly this doesn't scale on big boxes.  However
> > most of the time Xen doesn't require broadcasting flushes to all CPUs on the
> > system, and hence finer grained locks (ie: covering fewer CPUs) could be
> > used.
> > 
> > NUMA nodes are a natural boundary to use when splitting the locks.  Most
> > domains will be limited to running on a single node, specifically the one
> > the domain memory has been allocated from.  Flushes related to domains are
> > hence most likely limited to a single NUMA node, and so being able to
> > execute per-node flushes reduces the contention around the global
> > flush_lock, while also allowing concurrent flushes on different nodes.
> > 
> > This however doesn't come for free.  A new vector must be allocated to be
> > used for the per-NUMA flushes, and more logic is required in the flush
> > dispatcher to figure out whether a flush is limited to a single node.
> > 
> > The figures below are from a 2-node NUMA system, after having run the same
> > XenRT boot storm workload for 90 minutes.
> > 
> > Without the per-NUMA node flush:
> > 
> > Global flush_lock: addr=ffff82d040837340, lockval=d8ded8de, not locked
> >   lock:21878876(98178042228), block:1603338(6043805110)
> > 
> > So a total block time of ~6s, and average block time of 3.7us.
> > 
> > With the per-node locks:
> > 
> > Global flush_lock: addr=ffff82d040837360, lockval=78e678e6, not locked
> >   lock:6781028(41032945811), block:583712(2025657239)
> > NUMA node 1 flush_lock: addr=ffff832fd085b110, lockval=5cd65cd6, not locked
> >   lock:220374(766500536), block:4091(9933129)
> > NUMA node 0 flush_lock: addr=ffff8310336a7110, lockval=5c715c71, not locked
> >   lock:547953(1658170241), block:23856(51266297)
> > 
> > The total block time goes down to ~2s, and the average block time is 3.4us.
> > The total block time of the per-node locks is much lower than that of the
> > global flush_lock: 9ms and 51ms respectively.
> > 
> > Note that a 2 node system is possibly the case where such split locks make
> > the least difference, but there's still a non-trivial difference between
> > the block times.  On a 4 or 9 node system the figures should likely be even
> > better.
> > 
> > Signed-off-by: Roger Pau Monné <roger.pau@xxxxxxxxxx>
> 
> The savings must be from TLB flushes only? Cache flushes, as we have discussed
> recently, need to be global anyway, and hence there are no savings to be had.
> This may want reflecting in the subject and/or description.

Yeah, it's only TLB flushes.

> > @@ -126,3 +129,95 @@ int __init arch_get_ram_range(unsigned int idx, paddr_t *start, paddr_t *end)
> >  
> >      return 0;
> >  }
> > +
> > +static struct arch_numa_node {
> > +    const void *flush_va;
> > +    unsigned int flush_flags;
> > +    cpumask_t flush_mask;
> > +    spinlock_t flush_lock;
> > +    struct lock_profile_qhead profile_head;
> > +} *node_info[MAX_NUMNODES];
> 
> __ro_after_init or at least __read_mostly?

__ro_after_init would be OK I think.
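
Something along these lines, as a rough sketch (the annotation would only
cover the pointer array itself; the structures it points to are of course
still written at run time):

    /* Pointers populated from arch_numa_init() only, read-only afterwards. */
    static struct arch_numa_node {
        const void *flush_va;
        unsigned int flush_flags;
        cpumask_t flush_mask;
        spinlock_t flush_lock;
        struct lock_profile_qhead profile_head;
    } *node_info[MAX_NUMNODES] __ro_after_init;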

> > +static int __init cf_check arch_numa_init(void)
> > +{
> > +    unsigned int i;
> > +
> > +    if ( num_online_nodes() == 1 )
> > +        return 0;
> > +
> > +    for_each_online_node ( i )
> > +    {
> > +        struct arch_numa_node *node =
> > +            alloc_xenheap_pages(get_order_from_bytes(sizeof(*node)),
> > +                                                     MEMF_node(i));
> 
> A full page for what may cover just a cacheline or two? I realize ...
> 
> > +        if ( node )
> > +            clear_page(node);
> > +        else
> > +            node = xvzalloc(typeof(*node));
> 
> ... this (sadly) still has no NUMA-capable counterpart, but I'd have expected
> at least a brief comment to justify.

Oh, yes, it should indeed have a comment.
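
For instance something along these lines (exact wording to be adjusted in
the next version):

    /*
     * Prefer a node-local allocation, so that the per-node data lives on
     * the node it describes.  There's no NUMA-aware xvzalloc() counterpart,
     * hence fall back to a plain xvzalloc() if the node-local page
     * allocation fails.
     */
    struct arch_numa_node *node =
        alloc_xenheap_pages(get_order_from_bytes(sizeof(*node)),
                            MEMF_node(i));

    if ( node )
        clear_page(node);
    else
        node = xvzalloc(typeof(*node));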

> > +void cf_check invalidate_tbl_numa(void)
> 
> Nit: s/tbl/tlb/ ?
> 
> > +bool flush_numa_node(const cpumask_t *mask, const void *va, unsigned int flags)
> > +{
> > +    nodeid_t node = num_online_nodes() > 1 ? cpumask_to_node(mask)
> > +                                           : NUMA_NO_NODE;
> > +    struct arch_numa_node *info;
> > +
> > +    if ( node == NUMA_NO_NODE )
> > +        return false;
> > +
> > +    info = node_info[node];
> > +
> > +    if ( !info )
> > +        return false;
> > +
> > +    spin_lock(&info->flush_lock);
> > +    cpumask_and(&info->flush_mask, mask, &cpu_online_map);
> > +    cpumask_clear_cpu(smp_processor_id(), &info->flush_mask);
> 
> IOW this function is strictly a helper of flush_area_mask(), relying on
> the local CPU to have had its flushing done already. Might it not be
> better if this was put as a static in smp.c then? Or, to keep struct
> arch_numa_node local to this file, split it into two parts?
> 
> > +    info->flush_va = va;
> > +    info->flush_flags = flags;
> > +    send_IPI_mask(&info->flush_mask, INVALIDATE_NUMA_VECTOR);
> 
> This one similarly depends on flush_area_mask()'s behavior of not calling
> here when the mask has solely the local CPU set.
> 
> > --- a/xen/common/numa.c
> > +++ b/xen/common/numa.c
> > @@ -689,6 +689,29 @@ static int __init cf_check numa_setup(const char *opt)
> >  }
> >  custom_param("numa", numa_setup);
> >  
> > +/*
> > + * Return the NUMA node index if all CPUs in the mask belong to the same node,
> > + * otherwise return NUMA_NO_NODE.
> > + */
> > +nodeid_t cpumask_to_node(const cpumask_t *mask)
> > +{
> > +    unsigned int cpu;
> > +    nodeid_t node = NUMA_NO_NODE;
> > +
> > +    if ( num_online_nodes() == 1 )
> > +        return cpu_to_node[0];
> 
> The sole caller checks this already, so strictly speaking Misra would consider
> this dead / unreachable code.
> 
> Similarly Misra may not like this on Arm, where it's unused right now.  Then
> again it looks like the NUMA work there was abandoned, so for now all ought to
> be fine here.

Thanks.  Given the feedback, I take it you consider this an improvement
worth pursuing?

Regards, Roger.