|
[Date Prev][Date Next][Thread Prev][Thread Next][Date Index][Thread Index] Re: [PATCH] x86/NUMA: improve memnode_shift calculation for multi node system
On Tue, Sep 27, 2022 at 06:20:35PM +0200, Jan Beulich wrote:
> SRAT may describe individual nodes using multiple ranges. When they're
> adjacent (with or without a gap in between), only the start of the first
> such range actually needs accounting for. Furthermore the very first
> range doesn't need considering of its start address at all, as it's fine
> to associate all lower addresses (with no memory) with that same node.
> For this to work, the array of ranges needs to be sorted by address -
> adjust logic accordingly in acpi_numa_memory_affinity_init().
Speaking for myself (due to the lack of knowledge of the NUMA stuff) I
would benefit from a bit of context on why and how memnode_shift is
used.
> Signed-off-by: Jan Beulich <jbeulich@xxxxxxxx>
> ---
> On my Dinar and Rome systems this changes memnodemapsize to a single
> page. Originally they used 9 / 130 pages (with shifts going from 8 / 6
> to 15 / 16) respectively, resulting from lowmem gaps [A0,FF] / [A0,BF].
>
> This goes on top of "x86/NUMA: correct memnode_shift calculation for
> single node system".
>
> --- a/xen/arch/x86/numa.c
> +++ b/xen/arch/x86/numa.c
> @@ -127,7 +127,8 @@ static int __init extract_lsb_from_nodes
> epdx = paddr_to_pdx(nodes[i].end - 1) + 1;
> if ( spdx >= epdx )
> continue;
> - bitfield |= spdx;
> + if ( i && (!nodeids || nodeids[i - 1] != nodeids[i]) )
> + bitfield |= spdx;
> if ( !i || !nodeids || nodeids[i - 1] != nodeids[i] )
> nodes_used++;
> if ( epdx > memtop )
> --- a/xen/arch/x86/srat.c
> +++ b/xen/arch/x86/srat.c
> @@ -312,6 +312,7 @@ acpi_numa_memory_affinity_init(const str
> unsigned pxm;
> nodeid_t node;
> unsigned int i;
> + bool next = false;
>
> if (srat_disabled())
> return;
> @@ -413,14 +414,37 @@ acpi_numa_memory_affinity_init(const str
> node, pxm, start, end - 1,
> ma->flags & ACPI_SRAT_MEM_HOT_PLUGGABLE ? " (hotplug)" : "");
>
> - node_memblk_range[num_node_memblks].start = start;
> - node_memblk_range[num_node_memblks].end = end;
> - memblk_nodeid[num_node_memblks] = node;
> + /* Keep node_memblk_range[] sorted by address. */
> + for (i = 0; i < num_node_memblks; ++i)
> + if (node_memblk_range[i].start > start ||
> + (node_memblk_range[i].start == start &&
Maybe I'm confused, but won't .start == start means we have
overlapping ranges?
> + node_memblk_range[i].end > end))
> + break;
> +
> + memmove(&node_memblk_range[i + 1], &node_memblk_range[i],
> + (num_node_memblks - i) * sizeof(*node_memblk_range));
> + node_memblk_range[i].start = start;
> + node_memblk_range[i].end = end;
> +
> + memmove(&memblk_nodeid[i + 1], &memblk_nodeid[i],
> + (num_node_memblks - i) * sizeof(*memblk_nodeid));
> + memblk_nodeid[i] = node;
> +
> if (ma->flags & ACPI_SRAT_MEM_HOT_PLUGGABLE) {
> - __set_bit(num_node_memblks, memblk_hotplug);
> + next = true;
> if (end > mem_hotplug)
> mem_hotplug = end;
> }
> + for (; i <= num_node_memblks; ++i) {
> + bool prev = next;
> +
> + next = test_bit(i, memblk_hotplug);
> + if (prev)
> + __set_bit(i, memblk_hotplug);
> + else
> + __clear_bit(i, memblk_hotplug);
Nit: I think you could avoid doing the clear for the last bit, ie:
else if (i != num_node_memblks) __clear_bit(...);
But I'm not sure it's worth adding the logic, just makes it more
complicated to follow.
Thanks, Roger.
|
![]() |
Lists.xenproject.org is hosted with RackSpace, monitoring our |