Re: [PATCH 2/2] x86/numa: introduce per-NUMA node flush locks
On Wed, Jun 04, 2025 at 02:35:01PM +0200, Jan Beulich wrote:
> On 22.05.2025 10:48, Roger Pau Monne wrote:
> > Contention around the global flush_lock increases as the number of physical
> > CPUs on the host grows; sadly a single lock doesn't scale on big boxes.
> > However most of the time Xen doesn't require broadcasting flushes to all
> > CPUs on the system, and hence finer grained locks (ie: covering fewer CPUs)
> > could be used.
> >
> > A natural boundary to use when splitting the locks is the NUMA node. Most
> > domains will be limited to running on a single node, specifically the one
> > their memory has been allocated from, and hence flushes related to a
> > domain are most likely limited to a single NUMA node. Being able to
> > execute per-node flushes reduces the contention around the global
> > flush_lock, while also allowing concurrent flushes on different nodes.
> >
> > This however doesn't come for free. A new vector must be allocated for
> > the per-node flushes, and more logic is required in the flush dispatcher
> > to figure out whether a flush is limited to a single node.
> >
> > The figures below are from a 2-node NUMA system, after running the same
> > XenRT boot storm workload for 90 minutes.
> >
> > Without the per-NUMA node flush:
> >
> > Global flush_lock: addr=ffff82d040837340, lockval=d8ded8de, not locked
> > lock:21878876(98178042228), block:1603338(6043805110)
> >
> > The figures are count(total ns), so that's a total block time of ~6s, and
> > an average block time of ~3.7us.
> >
> > With the per-node locks:
> >
> > Global flush_lock: addr=ffff82d040837360, lockval=78e678e6, not locked
> > lock:6781028(41032945811), block:583712(2025657239)
> > NUMA node 1 flush_lock: addr=ffff832fd085b110, lockval=5cd65cd6, not locked
> > lock:220374(766500536), block:4091(9933129)
> > NUMA node 0 flush_lock: addr=ffff8310336a7110, lockval=5c715c71, not locked
> > lock:547953(1658170241), block:23856(51266297)
> >
> > The total block time goes down to ~2s, and the average block time is
> > ~3.4us. The total block times of the per-node locks are much lower than
> > that of the global flush_lock: 9ms and 51ms respectively.
> >
> > Note this is possibly the kind of system where such split locks make the
> > least difference, being a 2 node host, yet there's still a non-trivial
> > difference between the block times. On a 4 or 9 node system the figures
> > should likely be even better.
> >
> > Signed-off-by: Roger Pau Monné <roger.pau@xxxxxxxxxx>
>
> The savings must be from TLB flushes only? Cache flushes, as we have discussed
> recently, need to be global anyway, and hence there are no savings to be had
> there. This may want reflecting in the subject and/or description.
Yeah, it's only TLB flushes.
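
The per-node path is hence only attempted for TLB flushes. As a rough
sketch (not the exact hunk), the call site in flush_area_mask() is shaped
like:

    /* The local CPU, if part of the mask, has already been flushed. */
    if ( !(flags & FLUSH_CACHE) && flush_numa_node(mask, va, flags) )
        return;                         /* served under a per-node lock */

    /* Cache flushes and cross-node masks take the global flush_lock. */
    spin_lock(&flush_lock);
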
> > @@ -126,3 +129,95 @@ int __init arch_get_ram_range(unsigned int idx, paddr_t *start, paddr_t *end)
> >
> >      return 0;
> >  }
> > +
> > +static struct arch_numa_node {
> > +    const void *flush_va;
> > +    unsigned int flush_flags;
> > +    cpumask_t flush_mask;
> > +    spinlock_t flush_lock;
> > +    struct lock_profile_qhead profile_head;
> > +} *node_info[MAX_NUMNODES];
>
> __ro_after_init or at least __read_mostly?
__ro_after_init would be OK I think.
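
I.e. the declaration becomes:

    } *node_info[MAX_NUMNODES] __ro_after_init;

The pointers are only written from the initcall below, so __ro_after_init
should be fine.
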
> > +static int __init cf_check arch_numa_init(void)
> > +{
> > +    unsigned int i;
> > +
> > +    if ( num_online_nodes() == 1 )
> > +        return 0;
> > +
> > +    for_each_online_node ( i )
> > +    {
> > +        struct arch_numa_node *node =
> > +            alloc_xenheap_pages(get_order_from_bytes(sizeof(*node)),
> > +                                MEMF_node(i));
>
> A full page for what may cover just a cacheline or two? I realize ...
>
> > +        if ( node )
> > +            clear_page(node);
> > +        else
> > +            node = xvzalloc(typeof(*node));
>
> ... this (sadly) still has no NUMA-capable counterpart, but I'd have expected
> at least a brief comment to justify.
Oh, yes, it had better have a comment.
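
Something along these lines (wording to be polished):

    if ( node )
        clear_page(node);
    else
        /*
         * There's no NUMA-aware counterpart to xvzalloc(), so if the
         * node-local page allocation fails fall back to a non-local
         * allocation rather than leaving the node without per-node
         * flush support.
         */
        node = xvzalloc(typeof(*node));
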
> > +void cf_check invalidate_tbl_numa(void)
>
> Nit: s/tbl/tlb/ ?
>
> > +bool flush_numa_node(const cpumask_t *mask, const void *va, unsigned int flags)
> > +{
> > +    nodeid_t node = num_online_nodes() > 1 ? cpumask_to_node(mask)
> > +                                           : NUMA_NO_NODE;
> > +    struct arch_numa_node *info;
> > +
> > +    if ( node == NUMA_NO_NODE )
> > +        return false;
> > +
> > +    info = node_info[node];
> > +
> > +    if ( !info )
> > +        return false;
> > +
> > +    spin_lock(&info->flush_lock);
> > +    cpumask_and(&info->flush_mask, mask, &cpu_online_map);
> > +    cpumask_clear_cpu(smp_processor_id(), &info->flush_mask);
>
> IOW this function is strictly a helper of flush_area_mask(), relying on
> the local CPU to have had its flushing done already. Might it not be
> better if this was put as a static in smp.c then? Or, to keep struct
> arch_numa_node local to this file, split it into two parts?
>
> > +    info->flush_va = va;
> > +    info->flush_flags = flags;
> > +    send_IPI_mask(&info->flush_mask, INVALIDATE_NUMA_VECTOR);
>
> This one similarly depends on flush_area_mask() behavior, namely that it
> doesn't call here when the mask has solely the local CPU set.
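
For reference, the receiving side mirrors the existing
invalidate_interrupt() handler, just using the per-node state. Roughly (a
sketch with the tbl/tlb typo fixed; the flag handling is assumed to match
the global path):

void cf_check invalidate_tlb_numa(void)
{
    struct arch_numa_node *info = node_info[cpu_to_node[smp_processor_id()]];
    unsigned int flags = info->flush_flags;

    ack_APIC_irq();

    perfc_incr(ipis);

    /* A lazy execstate sync already flushes the TLB; avoid doing it twice. */
    if ( __sync_local_execstate() )
        flags &= ~FLUSH_TLB;

    flush_area_local(info->flush_va, flags);

    /* Signal completion to the initiator spinning on flush_mask. */
    cpumask_clear_cpu(smp_processor_id(), &info->flush_mask);
}

The initiator is then expected to wait for flush_mask to become empty before
dropping the per-node lock, as the global flush_lock path does with
flush_cpumask.
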
>
> > --- a/xen/common/numa.c
> > +++ b/xen/common/numa.c
> > @@ -689,6 +689,29 @@ static int __init cf_check numa_setup(const char *opt)
> > }
> > custom_param("numa", numa_setup);
> >
> > +/*
> > + * Return the NUMA node index if all CPUs in the mask belong to the same node,
> > + * otherwise return NUMA_NO_NODE.
> > + */
> > +nodeid_t cpumask_to_node(const cpumask_t *mask)
> > +{
> > +    unsigned int cpu;
> > +    nodeid_t node = NUMA_NO_NODE;
> > +
> > +    if ( num_online_nodes() == 1 )
> > +        return cpu_to_node[0];
>
> The sole caller checks this already, so strictly speaking Misra would consider
> this dead / unreachable code.
>
> Similarly, Misra may not like this on Arm, where it's unused right now. Then
> again, it looks like the NUMA work there was abandoned, so for now all ought
> to be fine here.
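
FWIW, with the single-node early exit dropped the helper boils down to
something like this (a sketch; the loop only approximates the rest of the
hunk, which isn't quoted above):

nodeid_t cpumask_to_node(const cpumask_t *mask)
{
    unsigned int cpu;
    nodeid_t node = NUMA_NO_NODE;

    for_each_cpu ( cpu, mask )
    {
        nodeid_t cpu_node = cpu_to_node[cpu];

        if ( node == NUMA_NO_NODE )
            node = cpu_node;
        else if ( cpu_node != node )
            return NUMA_NO_NODE;
    }

    return node;
}
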
Thanks. Given the feedback, I take it you consider this an improvement worth
pursuing then?
Regards, Roger.