
Re: [RFC PATCH] SYSCTL_numainfo.memsize: Switch spanned to present memory



On Mon Dec 9, 2024 at 8:23 AM GMT, Jan Beulich wrote:
> On 05.12.2024 11:55, Bernhard Kaindl wrote:
> > On 03/12/2024 12:37, Jan Beulich wrote:
> >> On 03.12.2024 12:12, Bernhard Kaindl wrote:
> >>> This is the 2nd part of my submission to fix the NUMA node memsize
> >>> returned in xen_sysctl_meminfo[].memsize by the XEN_SYSCTL_numainfo
> >>> hypercall to not count MMIO memory holes etc. but only memory pages.
> >>>
> >>> For this, we introduced NODE_DATA->node_present_pages as a prereq.
> >>> With the prereq merged in master, I send this 2nd part for review:
> >>>
> >>> This RFC is for changing the value of xen_sysctl_meminfo[]->memsize
> >>> from NODE_DATA->node_spanned_pages << PAGE_SHIFT
> >>>    to NODE_DATA->node_present_pages << PAGE_SHIFT
> >>> for returning total present NUMA node memory instead of spanned range.
> >>>
> >>> Sample of struct xen_sysctl_meminfo[].* as presented by xl info -n:
> >>>
> >>> xl info -n:
> >>> [...]
> >>> node:    memsize    memfree    distances
> >>>     0:  -> 67584 <-   60672      10,21
> >>>     1:     65536      60958      21,10
> >>>
> >>> The -> memsize <- marked here is the value that we'd like to fix:
> >>> The current value based on node_spanned_pages is often 2TB too large.
> >>>
> >>> We're currently not using these often false memsize values in XenServer
> >>> according to my code review, and Andrew seemed to confirm this as well.
> >>>
> >>> I think that the same is likely true for other Xen toolstacks, but of course
> >>> to review this change or propose an alternative is the purpose of this RFC.
> >>>
> >>> Thanks,
> >>> Bernhard
> >>
> >> All of the above reads like a cover letter. What's missing is a patch
> >> description, part of which would be to clarify whether the field is
> >> indeed unused except for display purposes, or why respective users would
> >> at least not regress from this change. What's also unclear is what
> >> comments you're actually after (i.e. what question(s) you want to have
> >> answered), seeing this is tagged RFC.
> > [...]
> >> Jan
> > 
> > Hi Jan!
> > 
> > The answer I'm looking for is which users to check, or to check with.
> > 
> > For example, I know that Xapi can use xen_sysctl_meminfo[].memfree to
> > get a preference about the NUMA node to use when creating a domain
> > (when the new mode `numa_affinity_policy.best_effort` is enabled):
> > https://xapi-project.github.io/new-docs/toolstack/features/NUMA/
> > 
> > A potential use of xen_sysctl_meminfo.memsize in Xen toolstacks is
> > less clear to me:
> > 
> > The only potential use would be if some Xen toolstack would not like
> > to solely rely on [nid].memfree for NUMA placement.
> > 
> > The question is if there are other NUMA aware toolstacks besides Xapi,
> > that would try to use it for e.g. planning the placement of domains.
> > 
> > My search in the Xapi and Xen repos only turned up a debug printf() in
> > xen-api's xen-api/xenopsd and, in xen, only the output of xl info -n.
> > 
> > It seems questionable to me that any other toolstacks would rely on it,
> > especially as the value it currently returns is off by even 2GB on some
> > machines. I'd expect that this bug would have affected code using it.
> > 
> > The answers I am looking for are acknowledgements of that or references
> > to users that might use .memsize currently (and that could be affected).
>
> IOW all questions to respective toolstack people.
>
> > Alternatively, I'd hope to get an idea what would be the method to 
> > create a new revision of the numainfo hypercall:
> > 
> > I guess it would be to add a new #define XEN_SYSCTL_numainfo_v2,
> > and if v2 is called, return [].memsize using [nid].node_present_pages 
> > instead?
>
> That's a last resort, yes. Since sysctls aren't stable (yet), changing
> existing interfaces generally is an option. We merely want to figure out
> how careful we need to be. It may be fine to do the change "silently",
> as you do now. A middle option might be to rename the field which has
> its meaning changed, such that anyone using the field will notice that
> they need to update their code, hopefully resulting in them checking
> what changed and hence what they may need to change.
>
> Jan

The biggest unknown is libvirt, I think. They use libxl and do use
libxl_get_numainfo(). I suspect they rely on this new semantic rather than the
old one, but didn't dig hard enough to find out. It may very well be purely
informative, as with `xl info`.

  https://gitlab.com/libvirt/libvirt

If they don't rely on memsize meaning "node span" then this patch is best as it
is now. Might be worth either checking their code or pinging them in their
mailing list or IRC to confirm.
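
For anyone checking a consumer's code, the difference between the two
semantics can be sketched roughly as below. This is only an illustration:
`demo_range`, the `demo_*` helpers and the 4 KiB page shift are assumptions
made for the example, not code from Xen's NUMA implementation.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define DEMO_PAGE_SHIFT 12 /* assumption: 4 KiB pages */

/* Hypothetical stand-in for one populated RAM range of a node. */
struct demo_range {
    uint64_t start_pfn; /* first page frame of the range */
    uint64_t end_pfn;   /* one past the last page frame of the range */
};

/* "Spanned": lowest start to highest end, holes in between included. */
static uint64_t demo_spanned_pages(const struct demo_range *r, size_t n)
{
    uint64_t lo = UINT64_MAX, hi = 0;

    for (size_t i = 0; i < n; i++) {
        assert(r[i].start_pfn <= r[i].end_pfn);
        if (r[i].start_pfn < lo)
            lo = r[i].start_pfn;
        if (r[i].end_pfn > hi)
            hi = r[i].end_pfn;
    }
    return n ? hi - lo : 0;
}

/* "Present": sum of the range sizes (assumes non-overlapping ranges). */
static uint64_t demo_present_pages(const struct demo_range *r, size_t n)
{
    uint64_t sum = 0;

    for (size_t i = 0; i < n; i++) {
        assert(r[i].start_pfn <= r[i].end_pfn);
        sum += r[i].end_pfn - r[i].start_pfn;
    }
    return sum;
}

/* Convert a page count to bytes, as memsize is reported in bytes. */
static uint64_t demo_pages_to_bytes(uint64_t pages)
{
    return pages << DEMO_PAGE_SHIFT;
}
```

With two 2 GiB ranges separated by a 2 GiB hole, the spanned value comes out
at 6 GiB while the present value is 4 GiB. A consumer that only prints the
number does not care; one that sizes something from it does.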

If they do rely on the current semantics, adding a new field to
xen_sysctl_meminfo ought to be fine as long as you bump the interface version
on top of sysctl.h as well.
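
A rough sketch of what the "new field" route could look like follows. The
struct layouts mirror the two fields discussed in this thread, but every
`demo_`/`DEMO_` name, the `mempresent` field name and the version numbers are
made up for illustration; the real definitions are in sysctl.h.

```c
#include <stdint.h>

/* Hypothetical version numbers, not the real interface version values. */
#define DEMO_SYSCTL_INTERFACE_VERSION_OLD 0x00000001u
#define DEMO_SYSCTL_INTERFACE_VERSION_NEW 0x00000002u

/* Layout as discussed in the thread: two 64-bit fields per node. */
struct demo_meminfo_old {
    uint64_t memsize; /* spanned size in bytes */
    uint64_t memfree; /* free memory in bytes */
};

/* Same struct with a hypothetical extra field for the present size. */
struct demo_meminfo_new {
    uint64_t memsize;    /* unchanged meaning: spanned size in bytes */
    uint64_t memfree;    /* free memory in bytes */
    uint64_t mempresent; /* hypothetical: present size in bytes */
};

/*
 * The array element size changes, so a caller built against the old layout
 * would misread every entry after the first. Checking the version lets such
 * callers be rejected instead. Returns 1 on match, 0 otherwise.
 */
static int demo_version_matches(uint32_t caller_version)
{
    return caller_version == DEMO_SYSCTL_INTERFACE_VERSION_NEW;
}
```

The point of the bump is only that tools compiled against the old header fail
loudly rather than silently parsing a differently sized array.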

Cheers,
Alejandro

P.S: If going down the "new field" route, I'd also like to suggest replacing
     uint64_t with uint64_aligned_t in xen_sysctl_meminfo while at it.
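
     To illustrate the reasoning: an explicitly aligned 64-bit type pins the
     struct's alignment to 8 bytes regardless of the ABI, whereas a plain
     uint64_t member is only 4-byte aligned inside structs on 32-bit x86.
     The typedef below is a local stand-in written for this sketch (GCC/Clang
     attribute syntax assumed), not the definition from the Xen headers.

```c
#include <stdint.h>

/* Local stand-in for an explicitly 8-byte-aligned 64-bit type. */
typedef uint64_t __attribute__((aligned(8))) demo_uint64_aligned_t;

/* Hypothetical variant of the meminfo struct using the aligned type. */
struct demo_meminfo_aligned {
    demo_uint64_aligned_t memsize;
    demo_uint64_aligned_t memfree;
};

/* Holds on both 32-bit and 64-bit builds because of the attribute. */
_Static_assert(_Alignof(struct demo_meminfo_aligned) == 8,
               "struct alignment must not depend on the ABI");
```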



 

