| 
    
 [Date Prev][Date Next][Thread Prev][Thread Next][Date Index][Thread Index] Re: [PATCH] x86/NUMA: improve memnode_shift calculation for multi node system
 On 30.09.2022 10:25, Roger Pau Monné wrote:
> On Tue, Sep 27, 2022 at 06:20:35PM +0200, Jan Beulich wrote:
>> SRAT may describe individual nodes using multiple ranges. When they're
>> adjacent (with or without a gap in between), only the start of the first
>> such range actually needs accounting for. Furthermore the very first
>> range doesn't need considering of its start address at all, as it's fine
>> to associate all lower addresses (with no memory) with that same node.
>> For this to work, the array of ranges needs to be sorted by address -
>> adjust logic accordingly in acpi_numa_memory_affinity_init().
> 
> Speaking for myself (due to the lack of knowledge of the NUMA stuff) I
> would benefit from a bit of context on why and how memnode_shift is
> used.
It's used solely to shift PDXes right before indexing memnodemap[].
Hence a larger shift allows for a smaller array (less memory and,
perhaps more importantly, less cache footprint). Personally I don't
think such needs mentioning in a patch, but I know others think
differently (George for example looks to believe that the present
situation should always be described). In the case here a simple
grep for memnode_shift would tell you, and even if I described this
here it wouldn't really help review I think: You'd still want to
verify then that what I claim actually matches reality.
>> @@ -413,14 +414,37 @@ acpi_numa_memory_affinity_init(const str
>>             node, pxm, start, end - 1,
>>             ma->flags & ACPI_SRAT_MEM_HOT_PLUGGABLE ? " (hotplug)" : "");
>>  
>> -    node_memblk_range[num_node_memblks].start = start;
>> -    node_memblk_range[num_node_memblks].end = end;
>> -    memblk_nodeid[num_node_memblks] = node;
>> +    /* Keep node_memblk_range[] sorted by address. */
>> +    for (i = 0; i < num_node_memblks; ++i)
>> +            if (node_memblk_range[i].start > start ||
>> +                (node_memblk_range[i].start == start &&
> 
> Maybe I'm confused, but won't .start == start means we have
> overlapping ranges?
Yes, except when a range is empty.
>> +                 node_memblk_range[i].end > end))
>> +                    break;
>> +
>> +    memmove(&node_memblk_range[i + 1], &node_memblk_range[i],
>> +            (num_node_memblks - i) * sizeof(*node_memblk_range));
>> +    node_memblk_range[i].start = start;
>> +    node_memblk_range[i].end = end;
>> +
>> +    memmove(&memblk_nodeid[i + 1], &memblk_nodeid[i],
>> +            (num_node_memblks - i) * sizeof(*memblk_nodeid));
>> +    memblk_nodeid[i] = node;
>> +
>>      if (ma->flags & ACPI_SRAT_MEM_HOT_PLUGGABLE) {
>> -            __set_bit(num_node_memblks, memblk_hotplug);
>> +            next = true;
>>              if (end > mem_hotplug)
>>                      mem_hotplug = end;
>>      }
>> +    for (; i <= num_node_memblks; ++i) {
>> +            bool prev = next;
>> +
>> +            next = test_bit(i, memblk_hotplug);
>> +            if (prev)
>> +                    __set_bit(i, memblk_hotplug);
>> +            else
>> +                    __clear_bit(i, memblk_hotplug);
> 
> Nit: I think you could avoid doing the clear for the last bit, ie:
> else if (i != num_node_memblks) __clear_bit(...);
> 
> But I'm not sure it's worth adding the logic, just makes it more
> complicated to follow.
Indeed. I did consider both this and using a single __change_bit()
wrapped in a suitable if(). Both looked to me like they would hamper
proving the code is doing what it ought to do.
Jan
 
 
  | 
  
![]()  | 
            
         Lists.xenproject.org is hosted with RackSpace, monitoring our  |