Re: [PATCH] x86/NUMA: improve memnode_shift calculation for multi node system
On 30.09.2022 10:25, Roger Pau Monné wrote:
> On Tue, Sep 27, 2022 at 06:20:35PM +0200, Jan Beulich wrote:
>> SRAT may describe individual nodes using multiple ranges. When they're
>> adjacent (with or without a gap in between), only the start of the first
>> such range actually needs accounting for. Furthermore the very first
>> range doesn't need its start address considered at all, as it's fine
>> to associate all lower addresses (with no memory) with that same node.
>> For this to work, the array of ranges needs to be sorted by address -
>> adjust logic accordingly in acpi_numa_memory_affinity_init().
>
> Speaking for myself (due to the lack of knowledge of the NUMA stuff) I
> would benefit from a bit of context on why and how memnode_shift is
> used.

It's used solely to shift PDXes right before indexing memnodemap[].
Hence a larger shift allows for a smaller array (less memory and,
perhaps more importantly, less cache footprint). Personally I don't
think this needs mentioning in a patch, but I know others think
differently (George, for example, seems to believe that the present
situation should always be described). In the case here a simple grep
for memnode_shift would tell you, and even if I described this here it
wouldn't really help review, I think: you'd still want to verify that
what I claim actually matches reality.

>> @@ -413,14 +414,37 @@ acpi_numa_memory_affinity_init(const str
>>  	       node, pxm, start, end - 1,
>>  	       ma->flags & ACPI_SRAT_MEM_HOT_PLUGGABLE ? " (hotplug)" : "");
>>
>> -	node_memblk_range[num_node_memblks].start = start;
>> -	node_memblk_range[num_node_memblks].end = end;
>> -	memblk_nodeid[num_node_memblks] = node;
>> +	/* Keep node_memblk_range[] sorted by address. */
>> +	for (i = 0; i < num_node_memblks; ++i)
>> +		if (node_memblk_range[i].start > start ||
>> +		    (node_memblk_range[i].start == start &&
>
> Maybe I'm confused, but won't .start == start mean we have
> overlapping ranges?

Yes, except when a range is empty.

>> +		     node_memblk_range[i].end > end))
>> +			break;
>> +
>> +	memmove(&node_memblk_range[i + 1], &node_memblk_range[i],
>> +		(num_node_memblks - i) * sizeof(*node_memblk_range));
>> +	node_memblk_range[i].start = start;
>> +	node_memblk_range[i].end = end;
>> +
>> +	memmove(&memblk_nodeid[i + 1], &memblk_nodeid[i],
>> +		(num_node_memblks - i) * sizeof(*memblk_nodeid));
>> +	memblk_nodeid[i] = node;
>> +
>>  	if (ma->flags & ACPI_SRAT_MEM_HOT_PLUGGABLE) {
>> -		__set_bit(num_node_memblks, memblk_hotplug);
>> +		next = true;
>>  		if (end > mem_hotplug)
>>  			mem_hotplug = end;
>>  	}
>> +	for (; i <= num_node_memblks; ++i) {
>> +		bool prev = next;
>> +
>> +		next = test_bit(i, memblk_hotplug);
>> +		if (prev)
>> +			__set_bit(i, memblk_hotplug);
>> +		else
>> +			__clear_bit(i, memblk_hotplug);
>
> Nit: I think you could avoid doing the clear for the last bit, ie:
>
> else if (i != num_node_memblks)
>     __clear_bit(...);
>
> But I'm not sure it's worth adding the logic, just makes it more
> complicated to follow.

Indeed. I did consider both this and using a single __change_bit()
wrapped in a suitable if(). Both looked to me like they would hamper
proving the code is doing what it ought to do.

Jan
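
To illustrate the memnode_shift point for readers less familiar with the
NUMA code: below is a minimal standalone C sketch of the shift-then-index
lookup described above. The names pdx_to_nid and the variable layout are
illustrative stand-ins, not Xen's actual declarations; only the idea of
shifting a PDX right before indexing memnodemap[] comes from the thread.

#include <stdint.h>

typedef uint8_t nodeid_t;

/*
 * memnode_shift is the number of low PDX bits that never make a
 * difference when deciding which node a page belongs to.  Every extra
 * bit of shift halves the number of memnodemap[] entries, i.e. less
 * memory and, more importantly, less cache footprint for the table.
 */
static unsigned int memnode_shift;
static nodeid_t *memnodemap;   /* one entry per (1 << memnode_shift) PDXes */

/* Node lookup is just a shift plus a table index. */
static nodeid_t pdx_to_nid(unsigned long pdx)
{
    return memnodemap[pdx >> memnode_shift];
}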
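
The prev/next loop in the hunk is effectively a one-position bit
insertion into the memblk_hotplug bitmap. Here is a standalone sketch of
that idiom, using plain-C re-implementations of the bit helpers (the real
test_bit()/__set_bit()/__clear_bit() are Xen's; these stand-ins exist
only to make the example compilable):

#include <stdbool.h>
#include <limits.h>

#define BITS_PER_LONG (sizeof(unsigned long) * CHAR_BIT)

static bool test_bit(unsigned int nr, const unsigned long *map)
{
    return map[nr / BITS_PER_LONG] & (1UL << (nr % BITS_PER_LONG));
}

static void set_bit(unsigned int nr, unsigned long *map)
{
    map[nr / BITS_PER_LONG] |= 1UL << (nr % BITS_PER_LONG);
}

static void clear_bit(unsigned int nr, unsigned long *map)
{
    map[nr / BITS_PER_LONG] &= ~(1UL << (nr % BITS_PER_LONG));
}

/*
 * Insert bit value "next" at position i of a bitmap currently holding
 * n valid bits: bits below i stay put, bit i becomes the new value,
 * and every old bit from i upward is carried one position higher.
 */
static void insert_bit(unsigned long *map, unsigned int n,
                       unsigned int i, bool next)
{
    for (; i <= n; ++i) {
        bool prev = next;

        next = test_bit(i, map);
        if (prev)
            set_bit(i, map);
        else
            clear_bit(i, map);
    }
}

Writing bit i unconditionally on every iteration is what makes the loop
easy to reason about: after iteration i, bits 0..i hold their final
values, which is the provability Jan refers to above.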