Re: [PATCH v5 2/6] xen/x86: move generically usable NUMA code from x86 to common
Hi Jan,

On 2022/9/27 16:19, Jan Beulich wrote:
> On 20.09.2022 11:12, Wei Chen wrote:
>> There is some code in x86/numa.c that can be shared by other
>> architectures to implement NUMA support, such as the variables and
>> functions used to check and store the NUMA memory map, and those
>> used for NUMA initialization. In this patch, we move them to
>> common/numa.c and xen/numa.h and use CONFIG_NUMA to gate them for
>> architectures without NUMA support. As the target header file is
>> Xen-style, we also trim spaces and replace tabs in the code that has
>> been moved to xen/numa.h. As acpi_scan_nodes is now used in a common
>> function, it doesn't make sense to keep an acpi_ prefix in common
>> code, so we rename it to numa_scan_nodes in this patch too. After
>> that,
>
> numa_process_nodes() now?

Oh, yes, I will update it.

>> if we still use CONFIG_ACPI_NUMA to gate numa_scan_nodes in
>> numa_initmem_init, that doesn't make sense, as CONFIG_NUMA will be
>> selected by CONFIG_ACPI_NUMA for x86. So we replace CONFIG_ACPI_NUMA
>> with CONFIG_NUMA to gate numa_scan_nodes. As arch_numa_disabled has
>> been implemented for ACPI NUMA, we can rename srat_disabled to
>> numa_disabled and move it to common code as well.
>
> Please update the description: arch_numa_disabled() appears in patch
> 5 only. Of course if you follow the comments to patch 2, the wording
> here would be correct again.

I will update the description.

>> +static unsigned int __init extract_lsb_from_nodes(const struct node *nodes,
>> +                                                  nodeid_t numnodes)
>> +{
>> +    unsigned int i, nodes_used = 0;
>> +    unsigned long spdx, epdx;
>> +    unsigned long bitfield = 0, memtop = 0;
>> +
>> +    for ( i = 0; i < numnodes; i++ )
>> +    {
>> +        spdx = paddr_to_pdx(nodes[i].start);
>> +        epdx = paddr_to_pdx(nodes[i].end - 1) + 1;
>> +
>> +        if ( spdx >= epdx )
>> +            continue;
>> +
>> +        bitfield |= spdx;
>
> Perhaps to be taken care of in a separate patch: We accumulate only
> the bits from the start addresses here, contrary to what the comment
> ahead of the function says (and I think it is the comment which is
> putting things correctly).

If a node has non-zero memory, the LSB of its end address will be >=
the LSB of its start address. As we use this function to calculate the
LSB, I don't think accumulating only the bits of the start addresses
will be a problem. Instead, I think we should modify the function's
comment to explain why only the start-address bits need to be
accumulated.

>> +        nodes_used++;
>> +        if ( epdx > memtop )
>> +            memtop = epdx;
>> +    }
>> +
>> +    if ( nodes_used <= 1 )
>> +        i = BITS_PER_LONG - 1;
>
> Is this actually going to be correct for all architectures? Aiui
> Arm64 has only up to 48 physical address bits, but what about an
> architecture allowing the use of all 64 bits? I think at the very
> least we want BUILD_BUG_ON(PADDR_BITS >= BITS_PER_LONG) here.

OK, I will add the BUILD_BUG_ON above. I also have a question: why
can't we use PADDR_BITS here directly?

>> +    else
>> +        i = find_first_bit(&bitfield, sizeof(unsigned long) * 8);
>> +
>> +    memnodemapsize = (memtop >> i) + 1;
>
> Again perhaps the subject of a separate patch: Isn't there an
> off-by-1 mistake here? memtop is the maximum of all epdx-es, which
> are calculated to be the first PDX following the region. Hence I'd
> expect memnodemapsize = ((memtop - 1) >> i) + 1; here. I guess I'll
> make patches for both issues, which you may then need to re-base
> over.

Thanks, I will wait for your patches.
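For my own understanding, combining both suggestions would give
roughly the following (just a sketch, with comments of my own; your
actual patches may of course look different):

    /*
     * The single/no-node fallback below shifts by BITS_PER_LONG - 1,
     * which is only meaningful while physical addresses (and hence
     * PDXes) never need all BITS_PER_LONG bits.
     */
    BUILD_BUG_ON(PADDR_BITS >= BITS_PER_LONG);

    if ( nodes_used <= 1 )
        i = BITS_PER_LONG - 1;
    else
        i = find_first_bit(&bitfield, sizeof(unsigned long) * 8);

    /*
     * memtop is the first PDX past the highest region, so the last
     * valid PDX is memtop - 1; size the map accordingly.
     */
    memnodemapsize = ((memtop - 1) >> i) + 1;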
>> +static int __init cf_check numa_setup(const char *opt)
>> +{
>> +    if ( !strncmp(opt, "off", 3) )
>> +        numa_off = true;
>> +    else if ( !strncmp(opt, "on", 2) )
>> +        numa_off = false;
>> +#ifdef CONFIG_NUMA_EMU
>> +    else if ( !strncmp(opt, "fake=", 5) )
>> +    {
>> +        numa_off = false;
>> +        numa_fake = simple_strtoul(opt + 5, NULL, 0);
>> +        if ( numa_fake >= MAX_NUMNODES )
>> +            numa_fake = MAX_NUMNODES;
>> +    }
>> +#endif
>> +    else
>> +        return arch_numa_setup(opt);
>> +
>> +    return 0;
>> +}
>> +custom_param("numa", numa_setup);
>
> Note that with this moved here at some point during your work (when
> allowing NUMA=y for Arm) you'll need to update the command line doc.

I have prepared a patch for this doc in part 3, the Arm-specific code.

>> +static void cf_check dump_numa(unsigned char key)
>> +{
>> +    s_time_t now = NOW();
>> +    unsigned int i, j, n;
>> +    struct domain *d;
>> +    const struct page_info *page;
>> +    unsigned int page_num_node[MAX_NUMNODES];
>> +    const struct vnuma_info *vnuma;
>> +
>> +    printk("'%c' pressed -> dumping numa info (now = %"PRI_stime")\n", key,
>> +           now);
>> +
>> +    for_each_online_node ( i )
>> +    {
>> +        paddr_t pa = pfn_to_paddr(node_start_pfn(i) + 1);
>> +
>> +        printk("NODE%u start->%lu size->%lu free->%lu\n",
>> +               i, node_start_pfn(i), node_spanned_pages(i),
>> +               avail_node_heap_pages(i));
>> +        /* Sanity check phys_to_nid() */
>> +        if ( phys_to_nid(pa) != i )
>> +            printk("phys_to_nid(%"PRIpaddr") -> %d should be %u\n",
>> +                   pa, phys_to_nid(pa), i);
>> +    }
>> +
>> +    j = cpumask_first(&cpu_online_map);
>> +    n = 0;
>> +    for_each_online_cpu ( i )
>> +    {
>> +        if ( i != j + n || cpu_to_node[j] != cpu_to_node[i] )
>> +        {
>> +            if ( n > 1 )
>> +                printk("CPU%u...%u -> NODE%d\n", j, j + n - 1, cpu_to_node[j]);
>> +            else
>> +                printk("CPU%u -> NODE%d\n", j, cpu_to_node[j]);
>> +            j = i;
>> +            n = 1;
>> +        }
>> +        else
>> +            ++n;
>> +    }
>> +    if ( n > 1 )
>> +        printk("CPU%u...%u -> NODE%d\n", j, j + n - 1, cpu_to_node[j]);
>> +    else
>> +        printk("CPU%u -> NODE%d\n", j, cpu_to_node[j]);
>> +
>> +    rcu_read_lock(&domlist_read_lock);
>> +
>> +    printk("Memory location of each domain:\n");
>> +    for_each_domain ( d )
>> +    {
>> +        process_pending_softirqs();
>> +
>> +        printk("Domain %u (total: %u):\n", d->domain_id, domain_tot_pages(d));
>> +
>> +        for_each_online_node ( i )
>> +            page_num_node[i] = 0;
>
> I'd be inclined to suggest to use memset() here, but I won't insist
> on you doing this "on the fly". Along with this would likely go the
> request to limit the scope of page_num_node[] (and then perhaps also
> vnuma and page).

Using memset() for page_num_node makes sense; I will do it before
for_each_domain ( d ). About limiting the scope, did you mean we
should move

    const struct page_info *page;
    unsigned int page_num_node[MAX_NUMNODES];
    const struct vnuma_info *vnuma;

into the block of for_each_domain ( d )?
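That is, something like the following, if I read your suggestion
correctly (a sketch only; the rest of the loop body would stay as is):

    for_each_domain ( d )
    {
        const struct page_info *page;
        unsigned int page_num_node[MAX_NUMNODES];
        const struct vnuma_info *vnuma;

        process_pending_softirqs();

        printk("Domain %u (total: %u):\n", d->domain_id,
               domain_tot_pages(d));

        /* One memset() instead of iterating over all online nodes. */
        memset(page_num_node, 0, sizeof(page_num_node));
        ...
    }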
>> +        spin_lock(&d->page_alloc_lock);
>> +        page_list_for_each ( page, &d->page_list )
>> +        {
>> +            i = phys_to_nid(page_to_maddr(page));
>> +            page_num_node[i]++;
>> +        }
>> +        spin_unlock(&d->page_alloc_lock);
>> +
>> +        for_each_online_node ( i )
>> +            printk("    Node %u: %u\n", i, page_num_node[i]);
>> +
>> +        if ( !read_trylock(&d->vnuma_rwlock) )
>> +            continue;
>> +
>> +        if ( !d->vnuma )
>> +        {
>> +            read_unlock(&d->vnuma_rwlock);
>> +            continue;
>> +        }
>> +
>> +        vnuma = d->vnuma;
>> +        printk("     %u vnodes, %u vcpus, guest physical layout:\n",
>> +               vnuma->nr_vnodes, d->max_vcpus);
>> +        for ( i = 0; i < vnuma->nr_vnodes; i++ )
>> +        {
>> +            unsigned int start_cpu = ~0U;
>> +
>> +            if ( vnuma->vnode_to_pnode[i] == NUMA_NO_NODE )
>> +                printk("       %3u: pnode ???,", i);
>> +            else
>> +                printk("       %3u: pnode %3u,", i, vnuma->vnode_to_pnode[i]);
>> +
>> +            printk(" vcpus ");
>> +
>> +            for ( j = 0; j < d->max_vcpus; j++ )
>> +            {
>> +                if ( !(j & 0x3f) )
>> +                    process_pending_softirqs();
>> +
>> +                if ( vnuma->vcpu_to_vnode[j] == i )
>> +                {
>> +                    if ( start_cpu == ~0U )
>> +                    {
>> +                        printk("%d", j);
>
> j being "unsigned int", would you mind switching to %u here and
> below?

OK, I will do it here and below.

>> --- a/xen/include/xen/numa.h
>> +++ b/xen/include/xen/numa.h
>> @@ -18,4 +18,71 @@
>>      (((d)->vcpu != NULL && (d)->vcpu[0] != NULL) \
>>       ? vcpu_to_node((d)->vcpu[0]) : NUMA_NO_NODE)
>>
>> +/* The following content can be used when NUMA feature is enabled */
>> +#ifdef CONFIG_NUMA
>> +
>> +extern nodeid_t cpu_to_node[NR_CPUS];
>> +extern cpumask_t node_to_cpumask[];
>> +
>> +#define cpu_to_node(cpu)        cpu_to_node[cpu]
>> +#define parent_node(node)       (node)
>> +#define node_to_first_cpu(node) __ffs(node_to_cpumask[node])
>
> I can't spot a use of this - perhaps better drop than generalize (if
> done right here then along with mentioning this in the description)?
>
> Jan

Yes, there is no code using this macro anymore. I will delete it and
mention it in the commit log.

Cheers,
Wei Chen
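P.S. For clarity, I understand the drop to amount to just this one
line (a sketch, shown as a delta against the code as it would be after
this patch, with nothing else in the block changing):

 #define cpu_to_node(cpu)        cpu_to_node[cpu]
 #define parent_node(node)       (node)
-#define node_to_first_cpu(node) __ffs(node_to_cpumask[node])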