Re: [PATCH 2/2] x86/numa: introduce per-NUMA node flush locks
On 22.05.2025 10:48, Roger Pau Monne wrote:
> Contention around the global flush_lock increases as the amount of
> physical CPUs on the host also increases. Sadly this doesn't scale on
> big boxes. However most of the time Xen doesn't require broadcasting
> flushes to all CPUs on the system, and hence more fine grained (ie:
> covering less CPUs) locks could be used.
>
> A natural boundary to use when splitting the locks are NUMA nodes. Most
> domains will be limited to running on a single node, specifically the
> one where the domain memory has been allocated from. Flushes related to
> domains are most likely to be limited to a single NUMA node, and hence
> being able to execute per-node flushes allows to reduce the contention
> around the global flush_lock, while also allowing to perform concurrent
> flushes on different nodes.
>
> This however doesn't come for free. A new vector must be allocated to
> be used for the per-NUMA flushes, and more logic is required in the
> flush dispatcher to figure out whether a flush is limited to a single
> node.
>
> The figures on a 2-node NUMA system are as follows, after having been
> running the same XenRT boot storm workload for 90 minutes.
>
> Without the per-NUMA node flush:
>
> Global flush_lock: addr=ffff82d040837340, lockval=d8ded8de, not locked
>   lock:21878876(98178042228), block:1603338(6043805110)
>
> So a total block time of ~6s, and average block time of 3.7us.
>
> With the per-node locks:
>
> Global flush_lock: addr=ffff82d040837360, lockval=78e678e6, not locked
>   lock:6781028(41032945811), block:583712(2025657239)
> NUMA node 1 flush_lock: addr=ffff832fd085b110, lockval=5cd65cd6, not locked
>   lock:220374(766500536), block:4091(9933129)
> NUMA node 0 flush_lock: addr=ffff8310336a7110, lockval=5c715c71, not locked
>   lock:547953(1658170241), block:23856(51266297)
>
> The total block time goes down to ~2s, and the average block time is
> 3.4us. The total block time of the per-node locks is much lower
> compared to the global flush_lock, 9ms and 51ms respectively.
>
> Note the example here is possibly the system where such split locks
> don't make a lot of difference, being a 2 node system, but still
> there's a non-trivial difference between the block times. On a 4 or 9
> node system the figures should likely be even better.
>
> Signed-off-by: Roger Pau Monné <roger.pau@xxxxxxxxxx>

The savings must be from TLB flushes only? Cache flushes, as we have
discussed recently, need to be global anyway, and hence there are no
savings to be had. This may want reflecting in the subject and/or
description.

> @@ -126,3 +129,95 @@ int __init arch_get_ram_range(unsigned int idx, paddr_t *start, paddr_t *end)
>
>      return 0;
>  }
> +
> +static struct arch_numa_node {
> +    const void *flush_va;
> +    unsigned int flush_flags;
> +    cpumask_t flush_mask;
> +    spinlock_t flush_lock;
> +    struct lock_profile_qhead profile_head;
> +} *node_info[MAX_NUMNODES];

__ro_after_init or at least __read_mostly?

> +static int __init cf_check arch_numa_init(void)
> +{
> +    unsigned int i;
> +
> +    if ( num_online_nodes() == 1 )
> +        return 0;
> +
> +    for_each_online_node ( i )
> +    {
> +        struct arch_numa_node *node =
> +            alloc_xenheap_pages(get_order_from_bytes(sizeof(*node)),
> +                                MEMF_node(i));

A full page for what may cover just a cacheline or two? I realize ...

> +        if ( node )
> +            clear_page(node);
> +        else
> +            node = xvzalloc(typeof(*node));

... this (sadly) still has no NUMA-capable counterpart, but I'd have
expected at least a brief comment to justify.
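To spell out what I'd consider sufficient for both points (purely a
sketch; the comment wording and the choice of __ro_after_init over
__read_mostly are mine):

static struct arch_numa_node {
    const void *flush_va;
    unsigned int flush_flags;
    cpumask_t flush_mask;
    spinlock_t flush_lock;
    struct lock_profile_qhead profile_head;
} *node_info[MAX_NUMNODES] __ro_after_init;

...

        /*
         * There is no NUMA-aware counterpart to xvzalloc(), so buy node
         * locality at the price of over-allocating: take a full page
         * from the node itself, and fall back to a node-agnostic
         * allocation only if that fails.
         */
        struct arch_numa_node *node =
            alloc_xenheap_pages(get_order_from_bytes(sizeof(*node)),
                                MEMF_node(i));

The array is only ever written by an __init function, so the stronger
annotation ought to be usable.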
> +void cf_check invalidate_tbl_numa(void)

Nit: s/tbl/tlb/ ?

> +bool flush_numa_node(const cpumask_t *mask, const void *va, unsigned int flags)
> +{
> +    nodeid_t node = num_online_nodes() > 1 ? cpumask_to_node(mask)
> +                                           : NUMA_NO_NODE;
> +    struct arch_numa_node *info;
> +
> +    if ( node == NUMA_NO_NODE )
> +        return false;
> +
> +    info = node_info[node];
> +
> +    if ( !info )
> +        return false;
> +
> +    spin_lock(&info->flush_lock);
> +    cpumask_and(&info->flush_mask, mask, &cpu_online_map);
> +    cpumask_clear_cpu(smp_processor_id(), &info->flush_mask);

IOW this function is strictly a helper of flush_area_mask(), relying on
the local CPU to have had its flushing done already. Might it not be
better if this was put as a static in smp.c then? Or, to keep struct
arch_numa_node local to this file, split it into two parts?
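To illustrate the splitting variant (no more than a sketch;
numa_flush_lock() and numa_flush_dispatch() are invented names for the
two halves that would stay in this file, keeping struct arch_numa_node
private to it):

/* smp.c: */
static bool flush_numa_node(const cpumask_t *mask, const void *va,
                            unsigned int flags)
{
    nodeid_t node = num_online_nodes() > 1 ? cpumask_to_node(mask)
                                           : NUMA_NO_NODE;
    cpumask_t *flush_mask;

    if ( node == NUMA_NO_NODE )
        return false;

    /* Take the per-node lock; NULL when there's no info for the node. */
    flush_mask = numa_flush_lock(node);
    if ( !flush_mask )
        return false;

    cpumask_and(flush_mask, mask, &cpu_online_map);
    cpumask_clear_cpu(smp_processor_id(), flush_mask);

    /* Record va/flags, send INVALIDATE_NUMA_VECTOR, wait, and unlock. */
    numa_flush_dispatch(node, va, flags);

    return true;
}

That way the flush_area_mask()-specific assumptions (local CPU already
flushed, mask not covering solely the local CPU) would at least live
next to their only consumer.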
> +    info->flush_va = va;
> +    info->flush_flags = flags;
> +    send_IPI_mask(&info->flush_mask, INVALIDATE_NUMA_VECTOR);

This one similarly depends on flush_area_mask() behavior, not calling
here when mask has solely the local CPU set.

> --- a/xen/common/numa.c
> +++ b/xen/common/numa.c
> @@ -689,6 +689,29 @@ static int __init cf_check numa_setup(const char *opt)
>  }
>  custom_param("numa", numa_setup);
>
> +/*
> + * Return the NUMA node index if all CPUs in the mask belong to the same
> + * node, otherwise return NUMA_NO_NODE.
> + */
> +nodeid_t cpumask_to_node(const cpumask_t *mask)
> +{
> +    unsigned int cpu;
> +    nodeid_t node = NUMA_NO_NODE;
> +
> +    if ( num_online_nodes() == 1 )
> +        return cpu_to_node[0];

The sole caller checks this already, so strictly speaking Misra would
consider this dead / unreachable code. Similarly Misra may not like this
on Arm, where it's unused right now. Then again looks like NUMA work
there was abandoned, so for now all ought to be fine here.
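Dropping the early-out, I'd expect the function to reduce to just the
loop, something like this (the loop body isn't quoted above, so its
exact shape is my guess):

nodeid_t cpumask_to_node(const cpumask_t *mask)
{
    nodeid_t node = NUMA_NO_NODE;
    unsigned int cpu;

    for_each_cpu ( cpu, mask )
    {
        /* Latch the first node seen; bail on any mismatch. */
        if ( node == NUMA_NO_NODE )
            node = cpu_to_node[cpu];
        else if ( cpu_to_node[cpu] != node )
            return NUMA_NO_NODE;
    }

    return node;
}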
Jan