Re: [PATCH 2/2] x86/numa: introduce per-NUMA node flush locks
On Wed, Jun 04, 2025 at 02:35:01PM +0200, Jan Beulich wrote:
> On 22.05.2025 10:48, Roger Pau Monne wrote:
> > Contention around the global flush_lock increases as the number of
> > physical CPUs on the host increases. Sadly this doesn't scale on big
> > boxes. However most of the time Xen doesn't require broadcasting
> > flushes to all CPUs on the system, and hence more fine grained (ie:
> > covering fewer CPUs) locks could be used.
> >
> > A natural boundary to use when splitting the locks is NUMA nodes.
> > Most domains will be limited to running on a single node, specifically
> > the one where the domain memory has been allocated from. Flushes
> > related to domains are hence most likely to be limited to a single
> > NUMA node, and being able to execute per-node flushes reduces the
> > contention around the global flush_lock, while also allowing
> > concurrent flushes on different nodes.
> >
> > This however doesn't come for free. A new vector must be allocated to
> > be used for the per-NUMA flushes, and more logic is required in the
> > flush dispatcher to figure out whether a flush is limited to a single
> > node.
> >
> > The figures on a 2-node NUMA system are as follows, after having run
> > the same XenRT boot storm workload for 90 minutes.
> >
> > Without the per-NUMA node flush:
> >
> > Global flush_lock: addr=ffff82d040837340, lockval=d8ded8de, not locked
> > lock:21878876(98178042228), block:1603338(6043805110)
> >
> > So a total block time of ~6s, and an average block time of 3.7us.
> >
> > With the per-node locks:
> >
> > Global flush_lock: addr=ffff82d040837360, lockval=78e678e6, not locked
> > lock:6781028(41032945811), block:583712(2025657239)
> > NUMA node 1 flush_lock: addr=ffff832fd085b110, lockval=5cd65cd6, not locked
> > lock:220374(766500536), block:4091(9933129)
> > NUMA node 0 flush_lock: addr=ffff8310336a7110, lockval=5c715c71, not locked
> > lock:547953(1658170241), block:23856(51266297)
> >
> > The total block time of the global lock goes down to ~2s, with an
> > average block time of 3.4us. The total block times of the per-node
> > locks are much lower than that of the global flush_lock: 9ms and 51ms
> > respectively.
> >
> > Note the example here is possibly a system where such split locks
> > don't make a lot of difference, being a 2-node system, but still
> > there's a non-trivial difference between the block times. On a 4 or 9
> > node system the figures should likely be even better.
> >
> > Signed-off-by: Roger Pau Monné <roger.pau@xxxxxxxxxx>
>
> The savings must be from TLB flushes only? Cache flushes, as we have
> discussed recently, need to be global anyway, and hence there are no
> savings to be had. This may want reflecting in the subject and/or
> description.

Yeah, it's only TLB flushes.

> > @@ -126,3 +129,95 @@ int __init arch_get_ram_range(unsigned int idx, paddr_t *start, paddr_t *end)
> >
> >      return 0;
> >  }
> > +
> > +static struct arch_numa_node {
> > +    const void *flush_va;
> > +    unsigned int flush_flags;
> > +    cpumask_t flush_mask;
> > +    spinlock_t flush_lock;
> > +    struct lock_profile_qhead profile_head;
> > +} *node_info[MAX_NUMNODES];
>
> __ro_after_init or at least __read_mostly?

__ro_after_init would be OK I think.
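For illustration, a minimal sketch of what the annotated declaration
might look like (assuming the struct layout from the quoted hunk; only
the array of pointers becomes read-only after boot, while the
heap-allocated per-node structures stay writable, which the flush path
relies on):

    /*
     * Sketch only: same layout as in the quoted hunk, with the pointer
     * array made read-only once boot completes.  The structs the
     * pointers refer to live on the heap and remain writable:
     * flush_va, flush_flags and flush_mask are updated on every
     * per-node flush under flush_lock.
     */
    static struct arch_numa_node {
        const void *flush_va;
        unsigned int flush_flags;
        cpumask_t flush_mask;
        spinlock_t flush_lock;
        struct lock_profile_qhead profile_head;
    } *node_info[MAX_NUMNODES] __ro_after_init;

This works because arch_numa_init() populates node_info[] exactly once
during boot, before the annotation takes effect.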
> > +static int __init cf_check arch_numa_init(void)
> > +{
> > +    unsigned int i;
> > +
> > +    if ( num_online_nodes() == 1 )
> > +        return 0;
> > +
> > +    for_each_online_node ( i )
> > +    {
> > +        struct arch_numa_node *node =
> > +            alloc_xenheap_pages(get_order_from_bytes(sizeof(*node)),
> > +                                MEMF_node(i));
>
> A full page for what may cover just a cacheline or two? I realize ...
>
> > +        if ( node )
> > +            clear_page(node);
> > +        else
> > +            node = xvzalloc(typeof(*node));
>
> ... this (sadly) still has no NUMA-capable counterpart, but I'd have
> expected at least a brief comment to justify.

Oh, yes, it should have a comment.

> > +void cf_check invalidate_tbl_numa(void)
>
> Nit: s/tbl/tlb/ ?
>
> > +bool flush_numa_node(const cpumask_t *mask, const void *va,
> > +                     unsigned int flags)
> > +{
> > +    nodeid_t node = num_online_nodes() > 1 ? cpumask_to_node(mask)
> > +                                            : NUMA_NO_NODE;
> > +    struct arch_numa_node *info;
> > +
> > +    if ( node == NUMA_NO_NODE )
> > +        return false;
> > +
> > +    info = node_info[node];
> > +
> > +    if ( !info )
> > +        return false;
> > +
> > +    spin_lock(&info->flush_lock);
> > +    cpumask_and(&info->flush_mask, mask, &cpu_online_map);
> > +    cpumask_clear_cpu(smp_processor_id(), &info->flush_mask);
>
> IOW this function is strictly a helper of flush_area_mask(), relying on
> the local CPU to have had its flushing done already. Might it not be
> better if this was put as a static in smp.c then? Or, to keep struct
> arch_numa_node local to this file, split it into two parts?
>
> > +    info->flush_va = va;
> > +    info->flush_flags = flags;
> > +    send_IPI_mask(&info->flush_mask, INVALIDATE_NUMA_VECTOR);
>
> This one similarly depends on flush_area_mask() behavior, not calling
> here when mask has solely the local CPU set.
>
> > --- a/xen/common/numa.c
> > +++ b/xen/common/numa.c
> > @@ -689,6 +689,29 @@ static int __init cf_check numa_setup(const char *opt)
> >  }
> >  custom_param("numa", numa_setup);
> >
> > +/*
> > + * Return the NUMA node index if all CPUs in the mask belong to the
> > + * same node, otherwise return NUMA_NO_NODE.
> > + */
> > +nodeid_t cpumask_to_node(const cpumask_t *mask)
> > +{
> > +    unsigned int cpu;
> > +    nodeid_t node = NUMA_NO_NODE;
> > +
> > +    if ( num_online_nodes() == 1 )
> > +        return cpu_to_node[0];
>
> The sole caller checks this already, so strictly speaking Misra would
> consider this dead / unreachable code.
>
> Similarly Misra may not like this on Arm, where it's unused right now.
> Then again it looks like NUMA work there was abandoned, so for now all
> ought to be fine here.

Thanks, given the feedback I think you consider this an improvement
worth pursuing then?

Regards, Roger.