Re: [Xen-devel] [PATCH 08 of 10 v3] libxl: enable automatic placement of guests on NUMA nodes
On Wed, 2012-07-04 at 17:18 +0100, Dario Faggioli wrote:
> # HG changeset patch
> # User Dario Faggioli <raistlin@xxxxxxxx>
> # Date 1341416324 -7200
> # Node ID 7087d3622ee2051654c9e78fe4829da10c2d46f1
> # Parent  6fd693e7f3bc8b4d9bd20befff2c13de5591a7c5
> libxl: enable automatic placement of guests on NUMA nodes
>
> If a domain does not have a VCPU affinity, try to pin it automatically to
> some PCPUs. This is done taking into account the NUMA characteristics of
> the host. In fact, we look for a combination of host's NUMA nodes with
> enough free memory and number of PCPUs for the new domain, and pin it to
> the VCPUs of those nodes.
>
> Once we know which ones, among all the possible combinations, represents
> valid placement candidates for a domain, use some heuistics for deciding
> which is the best. For instance, smaller candidates are considered to be
> better, both from the domain's point of view (fewer memory spreading among
> nodes) and from the system as a whole point of view (fewer memoy
> fragmentation). In case of candidates of equal sizes (i.e., with the same
> number of nodes), the amount of free memory and the number of domain
> already assigned to their nodes are considered. Very often, candidates
> with greater amount of memory are the one we wants, as this is also good
> for keeping memory fragmentation under control. However, if the
> difference in how much free memory two candidates have, the number of
> assigned domains might be what decides which candidate wins.

I can't parse this last sentence. Are there some words missing after "how
much free memory two candidates have"?

If you want to post the corrected text I think we can fold it in while
applying rather than reposting (assuming no other reason to repost crops
up).

> This all happens internally to libxl, and no API for driving the
> mechanism is provided for now. This matches what xend already does.
>
> Signed-off-by: Dario Faggioli <dario.faggioli@xxxxxxxxxx>
> Acked-by: George Dunlap <george.dunlap@xxxxxxxxxxxxx>
>
> ---
> Changes from v2:
>  * lots of typos.
>  * Clayfied some comments, as requested during (ijc's) review.
>  * Added some more information/reference for the combination generation
>    algorithm.
>  * nodemap_to_nodes_cpus() function renamed to nodemap_to_nr_cpus().
>  * libxl_bitmap_init() used to make sure we do not try to free random
>    memory on failure paths of functions that allocates a libxl_bitmap.
>  * Always invoke libxl__sort_numa_candidates(), even if we get there
>    with just 1 candidate, as requested during review.
>  * Simplified the if-s that check for input parameter consistency in
>    libxl__get_numa_candidates() as requested during (gwd's) review.
>  * Comparison function for candidates changed so that it now provides
>    total ordering, as requested during review. It is still using FP
>    arithmetic, though. Also I think that just putting the difference
>    between the amount of free memory and between the number of assigned
>    domains of two candidates in a single formula (after normalizing and
>    weighting them) is both clear and effective enough.
>  * Function definitions moved to a numa specific source file
>    (libxl_numa.c), as suggested during review.
>
> Changes from v1:
>  * This patches incorporates the changes from both "libxl, xl: enable
>    automatic placement of guests on NUMA nodes" and "libxl, xl:
>    heuristics for reordering NUMA placement candidates" from v1.
>  * The logic of the algorithm is basically the same as in v1, but the
>    splitting of it in the various functions has been completely
>    redesigned from scratch.
>  * No public API for placement or candidate generation is now exposed,
>    everything happens within libxl, as agreed during v1 review.
>  * The relevant documentation have been moved near the actual functions
>    and features. Also, the amount and (hopefully!) the quality of the
>    documentation has been improved a lot, as requested.
>  * All the comments about using the proper libxl facilities and helpers
>    for allocations, etc., have been considered and applied.
>  * This patch still bails out from NUMA optimizations if it find out
>    cpupools are being utilized. It is next patch that makes the two
>    things interact properly, as suggested during review.
>
> diff --git a/docs/man/xl.cfg.pod.5 b/docs/man/xl.cfg.pod.5
> --- a/docs/man/xl.cfg.pod.5
> +++ b/docs/man/xl.cfg.pod.5
> @@ -111,8 +111,8 @@ created online and the remainder will be
>
>  =item B<cpus="CPU-LIST">
>
> -List of which cpus the guest is allowed to use. Default behavior is
> -`all cpus`. A C<CPU-LIST> may be specified as follows:
> +List of which cpus the guest is allowed to use. By default xl will (via
> +libxl) pick some cpus (see below). A C<CPU-LIST> may be specified as follows:
>
>  =over 4
>
> @@ -132,6 +132,12 @@ run on cpu #3 of the host.
>
>  =back
>
> +If this option is not specified, libxl automatically tries to place the new
> +domain on the host's NUMA nodes (provided the host has more than one NUMA
> +node) by pinning it to the cpus of those nodes. A heuristic approach is
> +utilized with the goals of maximizing performance for the domain and, at
> +the same time, achieving efficient utilization of the host's CPUs and RAM.
> +
>  =item B<cpu_weight=WEIGHT>
>
>  A domain with a weight of 512 will get twice as much CPU as a domain
> diff --git a/tools/libxl/Makefile b/tools/libxl/Makefile
> --- a/tools/libxl/Makefile
> +++ b/tools/libxl/Makefile
> @@ -66,7 +66,7 @@ LIBXL_LIBS += -lyajl -lm
>  LIBXL_OBJS = flexarray.o libxl.o libxl_create.o libxl_dm.o libxl_pci.o \
>  			libxl_dom.o libxl_exec.o libxl_xshelp.o libxl_device.o \
>  			libxl_internal.o libxl_utils.o libxl_uuid.o \
> -			libxl_json.o libxl_aoutils.o \
> +			libxl_json.o libxl_aoutils.o libxl_numa.o \
>  			libxl_save_callout.o _libxl_save_msgs_callout.o \
>  			libxl_qmp.o libxl_event.o libxl_fork.o $(LIBXL_OBJS-y)
>  LIBXL_OBJS += _libxl_types.o libxl_flask.o _libxl_types_internal.o
> diff --git a/tools/libxl/libxl_dom.c b/tools/libxl/libxl_dom.c
> --- a/tools/libxl/libxl_dom.c
> +++ b/tools/libxl/libxl_dom.c
> @@ -98,6 +98,106 @@ out:
>      return sched;
>  }
>
> +/* Subtract two values and translate the result in [0, 1] */
> +static double normalized_diff(double a, double b)
> +{
> +#define max(a, b) (a > b ? a : b)
> +    if (!a && a == b)
> +        return 0.0;
> +    return (a - b) / max(a, b);
> +}

I think this actually returns a result in [-1,1] rather than [0,1] like the
comment says. Considering the a==1 b==2 case ->

    ( 1 - 2 ) / max(1, 2)  =>  -1 / 2  =>  -0.5

Is it the comment or the code which is wrong? (From the following numa_cmpf
I think the comment)
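A quick standalone sketch (illustrative only -- it just reproduces the
helper outside of libxl) makes the range easy to check:

    /* Standalone reproduction of normalized_diff(), for illustration only. */
    #include <stdio.h>

    #define max(a, b) ((a) > (b) ? (a) : (b))

    static double normalized_diff(double a, double b)
    {
        if (!a && a == b)
            return 0.0;
        return (a - b) / max(a, b);
    }

    int main(void)
    {
        printf("%.2f\n", normalized_diff(1, 2)); /* -0.50: a < b -> negative */
        printf("%.2f\n", normalized_diff(2, 1)); /*  0.50: a > b -> positive */
        printf("%.2f\n", normalized_diff(0, 4)); /* -1.00: lower bound */
        printf("%.2f\n", normalized_diff(4, 0)); /*  1.00: upper bound */
        return 0;
    }

so the values really do span [-1, 1], and it looks like the comment rather
than the code is what needs fixing.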
> +
> +/*
> + * The NUMA placement candidates are reordered according to the following
> + * heuristics:
> + *  - candidates involving fewer nodes come first. In case two (or
> + *    more) candidates span the same number of nodes,
> + *  - the amount of free memory and the number of domains assigned to the
> + *    candidates are considered. In doing that, candidates with greater
> + *    amount of free memory and fewer domains assigned to them are preferred,
> + *    with free memory "weighting" three times as much as number of domains.
> + */
> +static int numa_cmpf(const void *v1, const void *v2)
> +{
> +    const libxl__numa_candidate *c1 = v1;
> +    const libxl__numa_candidate *c2 = v2;
> +#define sign(a) a > 0 ? 1 : a < 0 ? -1 : 0

Does the caller of numa_cmpf rely on the result being specifically -1, 0 or
+1? Usually such functions only care about -ve, 0 or +ve.

Or maybe this is a double->int conversion thing? I guess (int)0.1 == 0
rather than +ve like you'd want? Yes, I suppose that's it, nevermind.

I might have written the floating point zeroes as 0.0 to make it clearer
rather than relying on automatic promotion.
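For the double -> int point, a minimal sketch (again standalone and
illustrative only; the 0.1 and 0.2 inputs are made up) shows what a bare
cast would lose and why only the sign matters to a qsort()-style caller:

    /* Why a comparator over FP scores wants sign() rather than a plain cast.
     * Standalone sketch, not libxl code; weights and inputs are invented. */
    #include <stdio.h>

    /* Same idea as the patch's macro, with the zeroes spelled 0.0 explicitly */
    #define sign(a) ((a) > 0.0 ? 1 : (a) < 0.0 ? -1 : 0)

    int main(void)
    {
        double score = 3 * 0.1 - 0.2;   /* hypothetical weighted score, ~0.1 */

        /* A cast truncates toward zero, so a small positive score reads as a tie */
        printf("(int)score  = %d\n", (int)score);    /* prints 0 */

        /* sign() keeps the -ve/0/+ve information, which is all qsort() needs */
        printf("sign(score) = %d\n", sign(score));   /* prints 1 */
        return 0;
    }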
> +    double freememkb_diff = normalized_diff(c2->free_memkb, c1->free_memkb);
> +    double nrdomains_diff = normalized_diff(c1->nr_domains, c2->nr_domains);
> +
> +    if (c1->nr_nodes != c2->nr_nodes)
> +        return c1->nr_nodes - c2->nr_nodes;
> +
> +    return sign(3*freememkb_diff + nrdomains_diff);
> +}
> +
> +/* The actual automatic NUMA placement routine */
> +static int numa_place_domain(libxl__gc *gc, libxl_domain_build_info *info)

I didn't review the rest -- George already acked it.

Ian.

_______________________________________________
Xen-devel mailing list
Xen-devel@xxxxxxxxxxxxx
http://lists.xen.org/xen-devel