[Date Prev][Date Next][Thread Prev][Thread Next][Date Index][Thread Index]

Re: [PATCH v5 2/6] xen/x86: move generically usable NUMA code from x86 to common


  • To: Jan Beulich <jbeulich@xxxxxxxx>
  • From: Wei Chen <Wei.Chen@xxxxxxx>
  • Date: Thu, 29 Sep 2022 15:43:30 +0800
  • Arc-authentication-results: i=2; mx.microsoft.com 1; spf=pass (sender ip is 63.35.35.123) smtp.rcpttodomain=lists.xenproject.org smtp.mailfrom=arm.com; dmarc=pass (p=none sp=none pct=100) action=none header.from=arm.com; dkim=pass (signature was verified) header.d=armh.onmicrosoft.com; arc=pass (0 oda=1 ltdi=1 spf=[1,1,smtp.mailfrom=arm.com] dkim=[1,1,header.d=arm.com] dmarc=[1,1,header.from=arm.com])
  • Arc-authentication-results: i=1; mx.microsoft.com 1; spf=pass smtp.mailfrom=arm.com; dmarc=pass action=none header.from=arm.com; dkim=pass header.d=arm.com; arc=none
  • Arc-message-signature: i=2; a=rsa-sha256; c=relaxed/relaxed; d=microsoft.com; s=arcselector9901; h=From:Date:Subject:Message-ID:Content-Type:MIME-Version:X-MS-Exchange-AntiSpam-MessageData-ChunkCount:X-MS-Exchange-AntiSpam-MessageData-0:X-MS-Exchange-AntiSpam-MessageData-1; bh=j8IsjiY1jpjUkiqZ7otyvOXn4y+HtgkY331kkTtVr94=; b=dZjlJT/PRJys08Z+1moNhFFURMT3ZVj2ctsawKm6JhBbt33p0w28YaTouCRoF9CMl/Y09pFDfpp42eoo7JSw2YcQsio9tU5Artlcjnf7FyYf80115Nw+PG6bEoo/M7vghToeubrI5xdYTtwefqu3Z2Augz+7Sr/ihgBoTfI0EGQQTWUPbi/VpJVljWmNyr69I0FvRmle4ssgGKa5+9CRWyoHRwkpyFHKVb6iTHASzsUcAQXR3zRHgNvMi5tIqwpsxkaLOUjTHkwoGocW1JqO+H18Nx+7iG/zS4aPyGcwsLMbcD44Sn9BaqX6pxedijZGm8d1YYLZnNmUvQLsmPsXQA==
  • Arc-message-signature: i=1; a=rsa-sha256; c=relaxed/relaxed; d=microsoft.com; s=arcselector9901; h=From:Date:Subject:Message-ID:Content-Type:MIME-Version:X-MS-Exchange-AntiSpam-MessageData-ChunkCount:X-MS-Exchange-AntiSpam-MessageData-0:X-MS-Exchange-AntiSpam-MessageData-1; bh=j8IsjiY1jpjUkiqZ7otyvOXn4y+HtgkY331kkTtVr94=; b=GK4tXkcaTQRhxewDPPVFRPtpSchLNnFM1kOMUvWVSTwJ4SlAdrVug2AVM4AlGc/8w88SW4REGEaGUPhDobjZOgVpPQ+ttv6Cx3Y3SQZ/FZ1w2jo+T0ae2uwo96bbcRsA9i/oqHFWtMBDS8TWTkSnp43B1qHQJH83d3wF7Vc9NhI4vPq0zdhhLQFgYqj8tDXZhltN+cYLn6aC5fYZrYDQxrMRM/VKCKA5ALEXi+Sp+9Hxhet5yaI0zfHG/V/jDy0g4E1FCEMlHq0EHGp+mh1NDw9hEW3aAGtuZTEMfMY1u04xlUcogIILaFw+gD7cDjj9HtC3J2P+Gj88j+65JzANKw==
  • Arc-seal: i=2; a=rsa-sha256; s=arcselector9901; d=microsoft.com; cv=pass; b=BzfxZfLWoX2YaoWKHDs5SonuZjOaz0FmF90jikMGAXh+wPsmj2cxgSEWQtu3+ZYeB3j0KXxzscZgJVVlDKYKtHoGf6LKVt7pdCwX0UrFQZ6iP2fLPATeKb/2m1TdpMCMXaV+cKFNcqFzOqyYRD+X8DMMXpPUulk16x2xbdbrmb7ywSP68Rb1DakLUBl5bS0JqDb2stu7cN92fp1kWkzIQ9SDNKkOtvu9yTrIAkLDcyCWIRzv/f7plr8Rjeld38oNg6Z89l1J0OC3ABSVyhRvLIe7M4SW3PxROHsRdWzL8JGen0tvZB4SXMZtB9UCFEH/h+d0o3Fez17pam8Vpo9+Sg==
  • Arc-seal: i=1; a=rsa-sha256; s=arcselector9901; d=microsoft.com; cv=none; b=OtJh0/zViZ360IlmV79+2Sf6qK+3CIjqfEvleqfAN0sFiAIQkhKOazDLBChEg5COVkyrECmn4HqzpWl6NIzmDB4DTa+twjeamArXCT7wMfO4z0ZmkjMi/vguOlCtVTnzOfERZOWKZRUXCbbg+yAp9FQYMwkz1DUDFGS2L2dmYcUw+QRHPEmSq7c60zxqggvQmi3CYHVZ9iu0S+H8G1JMyMIgaL2z55+VX+wuIfX0RJGkTO9DP+PMxDM82DNAgepWwcT8OYMdGLDYOQvv7EC2DNBpjCJEw4SYpNF4h8nzucrvgLKtw4UGhhEstsmZaiuhCG4x6jsss690Mi8waQF0nQ==
  • Authentication-results-original: dkim=none (message not signed) header.d=none;dmarc=none action=none header.from=arm.com;
  • Cc: nd@xxxxxxx, Andrew Cooper <andrew.cooper3@xxxxxxxxxx>, Roger Pau Monné <roger.pau@xxxxxxxxxx>, Wei Liu <wl@xxxxxxx>, George Dunlap <george.dunlap@xxxxxxxxxx>, Julien Grall <julien@xxxxxxx>, Stefano Stabellini <sstabellini@xxxxxxxxxx>, xen-devel@xxxxxxxxxxxxxxxxxxxx
  • Delivery-date: Thu, 29 Sep 2022 07:44:16 +0000
  • List-id: Xen developer discussion <xen-devel.lists.xenproject.org>
  • Nodisclaimer: true
  • Original-authentication-results: dkim=none (message not signed) header.d=none;dmarc=none action=none header.from=arm.com;

Hi Jan,

On 2022/9/27 16:19, Jan Beulich wrote:
On 20.09.2022 11:12, Wei Chen wrote:
There are some codes in x86/numa.c can be shared by common
architectures to implememnt NUMA support. Just like some
variables and functions to check and store NUMA memory map.
And some variables and functions to do NUMA initialization.

In this patch, we move them to common/numa.c and xen/numa.h
and use the CONFIG_NUMA to gate them for non-NUMA supported
architectures. As the target header file is Xen-style, so
we trim some spaces and replace tabs for the codes that has
been moved to xen/numa.h at the same time.

As acpi_scan_nodes has been used in a common function, it
doesn't make sense to use acpi_xxx in common code, so we
rename it to numa_scan_nodes in this patch too. After that

numa_process_nodes() now?

Oh, yes, I will update it.


if we still use CONFIG_ACPI_NUMA in to gate numa_scan_nodes
in numa_initmem_init, that doesn't make sense. As CONFIG_NUMA
will be selected by CONFIG_ACPI_NUMA for x86. So, we replace
CONFIG_ACPI_NUMA by CONFIG_NUMA to gate numa_scan_nodes.

As arch_numa_disabled has been implememnted for ACPI NUMA,
we can rename srat_disabled to numa_disabled and move it
to common code as well.

Please update the description: arch_numa_disabled() appears in patch 5
only. Of course if you follow the comments to patch 2, the wording here
would be correct again.


I will update the description.

+static unsigned int __init extract_lsb_from_nodes(const struct node *nodes,
+                                                  nodeid_t numnodes)
+{
+    unsigned int i, nodes_used = 0;
+    unsigned long spdx, epdx;
+    unsigned long bitfield = 0, memtop = 0;
+
+    for ( i = 0; i < numnodes; i++ )
+    {
+        spdx = paddr_to_pdx(nodes[i].start);
+        epdx = paddr_to_pdx(nodes[i].end - 1) + 1;
+
+        if ( spdx >= epdx )
+            continue;
+
+        bitfield |= spdx;

Perhaps to be taken care of in a separate patch: We accumulate only
the bits from the start addresses here, contrary to what the comment
ahead of the function says (and I think it is the comment which is
putting things correctly).

If one node has non-zero memory, the bit of end will >= the bit of start. As we use this function to calculate LSB, I don't think only
accumulate bits of start addresses will be a problem. Instead I think
we should modify this function comment to say why we only need to accumulate bits of start addresses.


+        nodes_used++;
+        if ( epdx > memtop )
+            memtop = epdx;
+    }
+
+    if ( nodes_used <= 1 )
+        i = BITS_PER_LONG - 1;

Is this actually going to be correct for all architectures? Aiui
Arm64 has only up to 48 physical address bits, but what about an
architecture allowing the use of all 64 bits? I think at the very
least we want BUILD_BUG_ON(PADDR_BITS >= BITS_PER_LONG) here.


Ok I will add above BUILD_BUG_ON. And I also have question why can't
we use PADDR_BITS here directly?

+    else
+        i = find_first_bit(&bitfield, sizeof(unsigned long) * 8);
+
+    memnodemapsize = (memtop >> i) + 1;

Again perhaps the subject of a separate patch: Isn't there an off-by-1
mistake here? memtop is the maximum of all epdx-es, which are
calculated to be the first PDX following the region. Hence I'd expect

     memnodemapsize = ((memtop - 1) >> i) + 1;

here. I guess I'll make patches for both issues, which you may then
need to re-base over.


Thanks, I will wait your patches.

+static int __init cf_check numa_setup(const char *opt)
+{
+    if ( !strncmp(opt, "off", 3) )
+        numa_off = true;
+    else if ( !strncmp(opt, "on", 2) )
+        numa_off = false;
+#ifdef CONFIG_NUMA_EMU
+    else if ( !strncmp(opt, "fake=", 5) )
+    {
+        numa_off = false;
+        numa_fake = simple_strtoul(opt + 5, NULL, 0);
+        if ( numa_fake >= MAX_NUMNODES )
+            numa_fake = MAX_NUMNODES;
+    }
+#endif
+    else
+        return arch_numa_setup(opt);
+
+    return 0;
+}
+custom_param("numa", numa_setup);

Note that with this moved here at some point during your work (when
allowing NUMA=y for Arm) you'll need to update the command line doc.


I have prepared a patch for this doc in part#3 Arm part code.

+static void cf_check dump_numa(unsigned char key)
+{
+    s_time_t now = NOW();
+    unsigned int i, j, n;
+    struct domain *d;
+    const struct page_info *page;
+    unsigned int page_num_node[MAX_NUMNODES];
+    const struct vnuma_info *vnuma;
+
+    printk("'%c' pressed -> dumping numa info (now = %"PRI_stime")\n", key,
+           now);
+
+    for_each_online_node ( i )
+    {
+        paddr_t pa = pfn_to_paddr(node_start_pfn(i) + 1);
+
+        printk("NODE%u start->%lu size->%lu free->%lu\n",
+               i, node_start_pfn(i), node_spanned_pages(i),
+               avail_node_heap_pages(i));
+        /* Sanity check phys_to_nid() */
+        if ( phys_to_nid(pa) != i )
+            printk("phys_to_nid(%"PRIpaddr") -> %d should be %u\n",
+                   pa, phys_to_nid(pa), i);
+    }
+
+    j = cpumask_first(&cpu_online_map);
+    n = 0;
+    for_each_online_cpu ( i )
+    {
+        if ( i != j + n || cpu_to_node[j] != cpu_to_node[i] )
+        {
+            if ( n > 1 )
+                printk("CPU%u...%u -> NODE%d\n", j, j + n - 1, cpu_to_node[j]);
+            else
+                printk("CPU%u -> NODE%d\n", j, cpu_to_node[j]);
+            j = i;
+            n = 1;
+        }
+        else
+            ++n;
+    }
+    if ( n > 1 )
+        printk("CPU%u...%u -> NODE%d\n", j, j + n - 1, cpu_to_node[j]);
+    else
+        printk("CPU%u -> NODE%d\n", j, cpu_to_node[j]);
+
+    rcu_read_lock(&domlist_read_lock);
+
+    printk("Memory location of each domain:\n");
+    for_each_domain ( d )
+    {
+        process_pending_softirqs();
+
+        printk("Domain %u (total: %u):\n", d->domain_id, domain_tot_pages(d));
+
+        for_each_online_node ( i )
+            page_num_node[i] = 0;

I'd be inclined to suggest to use memset() here, but I won't insist
on you doing this "on the fly". Along with this would likely go the
request to limit the scope of page_num_node[] (and then perhaps also
vnuma and page).


memset for page_num_node makes sense, I will do it before for_each_domain ( d ).
About limit the scope, did you mean, we should move:

"const struct page_info *page;
unsigned int page_num_node[MAX_NUMNODES];
const struct vnuma_info *vnuma;"

to the block of for_each_domain ( d )?


+        spin_lock(&d->page_alloc_lock);
+        page_list_for_each ( page, &d->page_list )
+        {
+            i = phys_to_nid(page_to_maddr(page));
+            page_num_node[i]++;
+        }
+        spin_unlock(&d->page_alloc_lock);
+
+        for_each_online_node ( i )
+            printk("    Node %u: %u\n", i, page_num_node[i]);
+
+        if ( !read_trylock(&d->vnuma_rwlock) )
+            continue;
+
+        if ( !d->vnuma )
+        {
+            read_unlock(&d->vnuma_rwlock);
+            continue;
+        }
+
+        vnuma = d->vnuma;
+        printk("     %u vnodes, %u vcpus, guest physical layout:\n",
+               vnuma->nr_vnodes, d->max_vcpus);
+        for ( i = 0; i < vnuma->nr_vnodes; i++ )
+        {
+            unsigned int start_cpu = ~0U;
+
+            if ( vnuma->vnode_to_pnode[i] == NUMA_NO_NODE )
+                printk("       %3u: pnode ???,", i);
+            else
+                printk("       %3u: pnode %3u,", i, vnuma->vnode_to_pnode[i]);
+
+            printk(" vcpus ");
+
+            for ( j = 0; j < d->max_vcpus; j++ )
+            {
+                if ( !(j & 0x3f) )
+                    process_pending_softirqs();
+
+                if ( vnuma->vcpu_to_vnode[j] == i )
+                {
+                    if ( start_cpu == ~0U )
+                    {
+                        printk("%d", j);

j being "unsigned int", would you mind switching to %u here and below?


Ok, I will do it and below.

--- a/xen/include/xen/numa.h
+++ b/xen/include/xen/numa.h
@@ -18,4 +18,71 @@
    (((d)->vcpu != NULL && (d)->vcpu[0] != NULL) \
     ? vcpu_to_node((d)->vcpu[0]) : NUMA_NO_NODE)
+/* The following content can be used when NUMA feature is enabled */
+#ifdef CONFIG_NUMA
+
+extern nodeid_t      cpu_to_node[NR_CPUS];
+extern cpumask_t     node_to_cpumask[];
+
+#define cpu_to_node(cpu)        cpu_to_node[cpu]
+#define parent_node(node)       (node)
+#define node_to_first_cpu(node) __ffs(node_to_cpumask[node])

I can't spot a use of this - perhaps better drop than generalize (if
done right here then along with mentioning this in the description)?


Yes, there is no code using this macro anymore, I will delete it and mention it in the commit log.

Cheers,
Wei Chen


Jan



 


Rackspace

Lists.xenproject.org is hosted with RackSpace, monitoring our
servers 24x7x365 and backed by RackSpace's Fanatical Support®.