[Date Prev][Date Next][Thread Prev][Thread Next][Date Index][Thread Index] Re: [PATCH 2/2] x86/numa: introduce per-NUMA node flush locks
On 22.05.2025 10:48, Roger Pau Monne wrote: > Contention around the global flush_lock increases as the amount of physical > CPUs on the host also increases. Sadly this doesn't scale on big boxes. > However most of the time Xen doesn't require broadcasting flushes to all > CPUs on the system, and hence more fine grained (ie: covering less CPUs) > locks could be used. > > A natural boundary to use when splitting the locks are NUMA nodes. Most > domains will be limited to running on a single node, specifically the one > where the domain memory has been allocated from. Flushes related to > domains are most likely to be limited to a single NUMA node, and hence > being able to execute per-node flushes allows to reduce the contention > around the global flush_lock, while also allowing to perform concurrent > flushes on different nodes. > > This however doesn't come for free. A new vector must be allocated to be > used for the per-NUMA flushes, and more logic is required in the flush > dispatcher to figure out whether a flush is limited to a single node. > > The figures on a 2-node NUMA system are as follows, after having been > running the same XenRT boot storm workload for 90 minutes. > > Without the per-NUMA node flush: > > Global flush_lock: addr=ffff82d040837340, lockval=d8ded8de, not locked > lock:21878876(98178042228), block:1603338(6043805110) > > So a total block time of ~6s, and average block time of 3.7us. > > With the per-node locks: > > Global flush_lock: addr=ffff82d040837360, lockval=78e678e6, not locked > lock:6781028(41032945811), block:583712(2025657239) > NUMA node 1 flush_lock: addr=ffff832fd085b110, lockval=5cd65cd6, not locked > lock:220374(766500536), block:4091(9933129) > NUMA node 0 flush_lock: addr=ffff8310336a7110, lockval=5c715c71, not locked > lock:547953(1658170241), block:23856(51266297) > > The total block time goes down to ~2s, and the average block time is 3.4us. > The total block time of the per-node locks is much lower compared to the > global flush_lock, 9ms and 51ms respectively. > > Note the example here is possibly the system where such split locks don't > make a lot of difference, being a 2 node system, but still there's a > non-trivial difference between the block times. On a 4 or 9 node system > the figures should likely be even better. > > Signed-off-by: Roger Pau Monné <roger.pau@xxxxxxxxxx> The savings must be from TLB flushes only? Cache flushes, as we have discussed recently, need to be global anyway, and hence there are no savings to be had. This may want reflecting the subject and/or description. > @@ -126,3 +129,95 @@ int __init arch_get_ram_range(unsigned int idx, paddr_t > *start, paddr_t *end) > > return 0; > } > + > +static struct arch_numa_node { > + const void *flush_va; > + unsigned int flush_flags; > + cpumask_t flush_mask; > + spinlock_t flush_lock; > + struct lock_profile_qhead profile_head; > +} *node_info[MAX_NUMNODES]; __ro_after_init or at least __read_mostly? > +static int __init cf_check arch_numa_init(void) > +{ > + unsigned int i; > + > + if ( num_online_nodes() == 1 ) > + return 0; > + > + for_each_online_node ( i ) > + { > + struct arch_numa_node *node = > + alloc_xenheap_pages(get_order_from_bytes(sizeof(*node)), > + MEMF_node(i)); A full page for what may cover just a cacheline or two? I realize ... > + if ( node ) > + clear_page(node); > + else > + node = xvzalloc(typeof(*node)); ... this (sadly) still has no NUMA-capable counterpart, but I'd have expected at least a brief comment to justify. > +void cf_check invalidate_tbl_numa(void) Nit: s/tbl/tlb/ ? > +bool flush_numa_node(const cpumask_t *mask, const void *va, unsigned int > flags) > +{ > + nodeid_t node = num_online_nodes() > 1 ? cpumask_to_node(mask) > + : NUMA_NO_NODE; > + struct arch_numa_node *info; > + > + if ( node == NUMA_NO_NODE ) > + return false; > + > + info = node_info[node]; > + > + if ( !info ) > + return false; > + > + spin_lock(&info->flush_lock); > + cpumask_and(&info->flush_mask, mask, &cpu_online_map); > + cpumask_clear_cpu(smp_processor_id(), &info->flush_mask); IOW this function is strictly a helper of flush_area_mask(), relying on the local CPU to have had its flushing done already. Might it not be better if this was put as a static in smp.c then? Or, to keep struct arch_numa_node local to this file, split it into two parts? > + info->flush_va = va; > + info->flush_flags = flags; > + send_IPI_mask(&info->flush_mask, INVALIDATE_NUMA_VECTOR); This one similarly depends on flush_area_mask() behavior, not calling here when mask has solely the local CPU set. > --- a/xen/common/numa.c > +++ b/xen/common/numa.c > @@ -689,6 +689,29 @@ static int __init cf_check numa_setup(const char *opt) > } > custom_param("numa", numa_setup); > > +/* > + * Return the NUMA node index if all CPUs in the mask belong to the same > node, > + * otherwise return NUMA_NO_NODE. > + */ > +nodeid_t cpumask_to_node(const cpumask_t *mask) > +{ > + unsigned int cpu; > + nodeid_t node = NUMA_NO_NODE; > + > + if ( num_online_nodes() == 1 ) > + return cpu_to_node[0]; The sole caller checks this already, so strictly speaking Misra would consider this dead / unreachable code. Similarly Misra may not like this on Arm, where it's unused right now. Then again looks like NUMA work there was abandoned, so for now all ought to be fine here. Jan
|
![]() |
Lists.xenproject.org is hosted with RackSpace, monitoring our |