Re: [Xen-devel] [PATCH 08 of 10 v2] libxl: enable automatic placement of guests on NUMA nodes
On Fri, 2012-06-15 at 18:04 +0100, Dario Faggioli wrote:
> If a domain does not have a VCPU affinity, try to pin it automatically to some
> PCPUs. This is done taking into account the NUMA characteristics of the host.
> In fact, we look for a combination of host's NUMA nodes with enough free
> memory
> and number of PCPUs for the new domain, and pin it to the VCPUs of those
> nodes.
>
> Once we know which ones, among all the possible combinations, represents valid
> placement candidates for a domain, use some heuistics for deciding which is
> the
heuristics
> best. For instance, smaller candidates are considered to be better, both from
> the domain's point of view (fewer memory spreading among nodes) and from the
> system as a whole point of view (fewer memoy fragmentation). In case of
memory
> candidates of equal sizes (i.e., with the same number of nodes), the one with
> the greater amount of memory wins, as this is also good for keeping memory
> fragmentation under control.
>
> This all happens internally to libxl, and no API for driving the mechanism is
> provided for now. This matches what xend already does.
>
> Signed-off-by: Dario Faggioli <dario.faggioli@xxxxxxxxxx>
>
> Changes from v1:
> * This patches incorporates the changes from both "libxl, xl: enable
> automatic
> placement of guests on NUMA nodes" and "libxl, xl: heuristics for
> reordering
> NUMA placement candidates" from v1.
> * The logic of the algorithm is basically the same as in v1, but the
> splitting
> of it in the various functions has been completely redesigned from scratch.
> * No public API for placement or candidate generation is now exposed,
> everything happens within libxl, as agreed during v1 review.
> * The relevant documentation have been moved near the actual functions and
> features. Also, the amount and (hopefully!) the quality of the
> documentation
> has been improved a lot, as requested.
> * All the comments about using the proper libxl facilities and helpers for
> allocations, etc., have been considered and applied.
> * This patch still bails out from NUMA optimizations if it find out cpupools
> are being utilized. It is next patch that makes the two things interact
> properly, as suggested during review.
>
> diff --git a/docs/man/xl.cfg.pod.5 b/docs/man/xl.cfg.pod.5
> --- a/docs/man/xl.cfg.pod.5
> +++ b/docs/man/xl.cfg.pod.5
> @@ -111,8 +111,8 @@ created online and the remainder will be
>
> =item B<cpus="CPU-LIST">
>
> -List of which cpus the guest is allowed to use. Default behavior is
There is (according to grep) a slight preference for British spelling,
but I imagine we are totally inconsistent all over the place so let's not
worry too much ;-)
> -`all cpus`. A C<CPU-LIST> may be specified as follows:
> +List of which cpus the guest is allowed to use. By default xl will (via
> +libxl) pick some cpus (see below). A C<CPU-LIST> may be specified as follows:
>
> =over 4
>
> @@ -132,6 +132,12 @@ run on cpu #3 of the host.
>
> =back
>
> +If this option is not specified, libxl automatically tries to place the new
> +domain on the host's NUMA nodes (provided the host has more than one NUMA
> +node) by pinning it to the cpus of those nodes. An heuristical approach is
I don't think heuristical is a word, I think "A heuristic approach..."
is fine
(not sure about "A heuristic" vs "An heuristic", but "a" sounds more
natural to me; generally the rule is "an" before vowel sounds, so I guess
it depends on how you pronounce the h in heuristic, which varies even
depending on which bit of England you are from ;-))
> +utilized with the goals of maximizing performance for the domain and, at
> +the same time, achieving efficient utilization of the host's CPUs and RAM.
> +
> =item B<cpu_weight=WEIGHT>
>
> A domain with a weight of 512 will get twice as much CPU as a domain
> diff --git a/tools/libxl/libxl_dom.c b/tools/libxl/libxl_dom.c
> --- a/tools/libxl/libxl_dom.c
> +++ b/tools/libxl/libxl_dom.c
I think it would be quite reasonable for you to create a new
libxl_numa(_placement).c if you wanted to. Likewise a libxl_numa.h
header. Up to you.
> @@ -95,6 +95,459 @@ out:
> return sched;
> }
>
> +/*
> + * What follows are helpers for generating all the k-combinations
> + * without repetitions of a set S with n elements in it. Formally
> + * speaking, they are subsets of k distinct elements of S and, if
> + * S is n elements big, the number of k-combinations is equal to
> + * the binomial coefficient (n k), which means we get exactly
> + * n!/(k! * (n - k)!) subsets, all of them with k elements.
> + *
> + * The various subset are geneated one after the other by calling
generated
> + * comb_init() first, and, after that, comb_next()
> + * (n k)-1 times. An iterator is used to store the curent status
current
> + * of the whole generation operation (i.e., basically, the last
> + * combination that has been generated). As soon as all
> + * combinations have been generated, comb_next() will
> + * start returning 0 instead of 1. It is of course impotant that
important
> + * the same instance of the iterator and the same values for
> + * n and k are used for each call. If that doesn't happen, the
> + * result is unspecified.
> + *
> + * The algorithm is a well known one,
Worth a link/reference?
> and it produces the
> + * combinations in such a way that they (well, more precisely,
> + * their indexes it the array/map representing the set) come in
> + * lexicogaphic order.
lexicographic
> + *
> + * For example, with n = 5 and k = 3, calling comb_init()
> + * will generate { 0, 1, 2 }, while subsequent valid calls to
> + * comb_next() will produce the following:
> + * { { 0, 1, 2 }, { 0, 1, 3 }, { 0, 1, 4 },
^ strictly speaking this 0,1,2 isn't produced by a call
to comb_next since it came from the initial comb_init,
right? IOW the first comb_next() call won't duplicate
it...
I'm not going to try very hard to review the algorithm here, I trust you
and I've got a head cold ;-)
> + * { 0, 2, 3 }, { 0, 2, 4 }, { 0, 3, 4 },
> + * { 1, 2, 3 }, { 1, 2, 4 }, { 1, 3, 4 },
> + * { 2, 3, 4 } }
> + *
> + * This is used by the automatic NUMA placement logic below.
> + */
> +typedef int* comb_iter_t;
> +
> +static int comb_init(libxl__gc *gc, comb_iter_t *it, int n, int k)
> +{
> + comb_iter_t new_iter;
> + int i;
> +
> + if (n < k)
> + return 0;
> +
> + /* First set is always { 0, 1, 2, ..., k-1 } */
> + GCNEW_ARRAY(new_iter, k);
> + for (i = 0; i < k; i++)
> + new_iter[i] = i;
> +
> + *it = new_iter;
> + return 1;
> +}
> +
> +static int comb_next(comb_iter_t it, int n, int k)
> +{
> + int i;
> +
> + /*
> + * The idea here is to find the leftmost element from where
> + * we should start incrementing the indexes of the iterator.
> + * This means looking for the highest index that can be increased
> + * while still producing value smaller than n-1. In the example
> + * above, when dealing with { 0, 1, 4 }, such an element is the
> + * second one, as the third is already equal to 4 (which actually
> + * is n-1).
> + * Once we found from where to start, we increment that element
> + * and overide the right-hand rest of the iterator with its
override
> + * successors, thus achieving lexicographic ordering.
> + *
> + * Regarding the termination of the generation process, when we
> + * manage in bringing n-k at the very first position of the iterator,
> + * we know that is the last valid combination ( { 2, 3, 4 }, with
> + * n - k = 5 - 2 = 2, in the example above), and thus we start
> + * returning 0 as soon as we cross that border.
> + */
> + for (i = k - 1; it[i] == n - k + i; i--) {
> + if (i <= 0)
> + return 0;
> + }
> + for (it[i]++, i++; i < k; i++)
> + it[i] = it[i - 1] + 1;
> + return 1;
> +}
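Not a review point, but for my own sanity here is how I understand the
iterator is meant to be driven -- a throwaway fragment mirroring the
loop in libxl__get_numa_candidates below, with the printf purely for
illustration:

    comb_iter_t it;
    int ok, i;

    /* all 3-subsets of { 0, ..., 4 }: comb_init() hands back { 0, 1, 2 },
     * each successful comb_next() advances to the next combination */
    for (ok = comb_init(gc, &it, 5, 3); ok; ok = comb_next(it, 5, 3)) {
        for (i = 0; i < 3; i++)
            printf("%d ", it[i]);
        printf("\n");
    }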
> +
> +/* NUMA automatic placement (see libxl_internal.h for details) */
> +
> +/*
> + * This function turns a k-combination iterator into a node map.
> + * This means the bits in the node map corresponding to the indexes
> + * of the given combination are the ones that will be set.
> + * For example, if the iterator represents the combination { 0, 2, 4},
> + * the node map will have bits #0, #2 and #4 set to one and i thus
> + * will point to node 0, node 2 and and node 4.
What is "i" in this last bit ("i thus will point to")? I don't think you
are referring to the loop iterator variable.
If you meant "it" then I don't think it needs saying that a node map
with bits 0, 2 and 4 set refers to nodes 0, 2 and 4.
> + */
> +static void comb_get_nodemap(comb_iter_t it, libxl_bitmap *nodemap, int k)
> +{
> + int i;
> +
> + libxl_bitmap_set_none(nodemap);
> + for (i = 0; i < k; i++)
> + libxl_bitmap_set(nodemap, it[i]);
> +}
> +
> +/* Retrieve the number of cpus that the nodes that are part of the nodemap
> + * span. */
> +static int nodemap_to_nodes_cpus(libxl_cputopology *tinfo, int nr_cpus,
> + const libxl_bitmap *nodemap)
nodes here == node's? But you can't have an apostrophe in a function
name, which makes it read weird to me. Perhaps "nodemap_count_cpus"?
> +{
> + int i, nodes_cpus = 0;
> +
> + for (i = 0; i < nr_cpus; i++) {
> + if (libxl_bitmap_test(nodemap, tinfo[i].node))
> + nodes_cpus++;
> + }
> + return nodes_cpus;
> +}
> +
[...]
> +/* Retrieve the number of domains that can potentially run on the cpus
> + * the nodes that are part of the nodemap. */
> +static int nodemap_to_nr_domains(libxl__gc *gc, libxl_cputopology *tinfo,
> + const libxl_bitmap *nodemap)
> +{
> + libxl_dominfo *dinfo = NULL;
> + libxl_bitmap dom_nodemap;
> + int nr_doms, nr_cpus;
> + int nr_domains = 0;
> + int i, j, k;
> +
> + dinfo = libxl_list_domain(CTX, &nr_doms);
> + if (dinfo == NULL)
> + return ERROR_FAIL;
> +
> + if (libxl_node_bitmap_alloc(CTX, &dom_nodemap) < 0) {
> + nr_domains = ERROR_FAIL;
> + goto out;
> + }
> +
> + for (i = 0; i < nr_doms; i++) {
> + libxl_vcpuinfo *vinfo;
> + int nr_vcpus;
> +
> + vinfo = libxl_list_vcpu(CTX, dinfo[i].domid, &nr_vcpus, &nr_cpus);
> + if (vinfo == NULL)
> + continue;
> +
> + libxl_bitmap_set_none(&dom_nodemap);
> + for (j = 0; j < nr_vcpus; j++) {
> + libxl_for_each_set_bit(k, vinfo[j].cpumap)
> + libxl_bitmap_set(&dom_nodemap, tinfo[k].node);
> + }
> +
> + libxl_for_each_set_bit(j, dom_nodemap) {
> + if (libxl_bitmap_test(nodemap, j)) {
> + nr_domains++;
> + goto found;
AKA break?
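i.e. (sketch, assuming libxl_for_each_set_bit expands to a plain for
loop so that break does what you'd expect):

    libxl_for_each_set_bit(j, dom_nodemap) {
        if (libxl_bitmap_test(nodemap, j)) {
            nr_domains++;
            break;              /* no label needed */
        }
    }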
> + }
> + }
> + found:
> + libxl_vcpuinfo_list_free(vinfo, nr_vcpus);
> + }
> +
> + out:
> + libxl_bitmap_dispose(&dom_nodemap);
You can come to out if libxl_node_bitmap_alloc fails -- in which case
dom_nodemap is (potentially) uninitialised.
You could just move out: down a line.
Otherwise we'd need libxl_bitmap_init which zeroes everything such that
calling dispose is safe even if you don't call alloc. I don't mind
which.
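Something as dumb as this would do for the latter (hypothetical sketch,
name included):

    /* make a libxl_bitmap safe to pass to libxl_bitmap_dispose()
     * even if the corresponding _alloc() never ran or failed */
    void libxl_bitmap_init(libxl_bitmap *map)
    {
        memset(map, 0, sizeof(*map));
    }

since free(NULL) is a no-op, dispose on a zeroed map would then be
harmless.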
> + libxl_dominfo_list_free(dinfo, nr_doms);
> + return nr_domains;
> +}
> +
> +/*
> + * This function tries to figure out if the host has a consistent number
> + * of cpus along all its NUMA nodes. In fact, if that is the case, we can
> + * calculate the minimum number of nodes needed for a domain by just
> + * dividing its total number of vcpus by this value computed here.
> + * However, we are not allowed to assume that all the nodes have the
> + * same number of cpus. Therefore, in case discrepancies among different
> + * nodes are found, this function just returns 0 and the caller needs
> + * to sort out things in some other way.
If the caller has to deal with this anyway why bother with this check
and/or the other algorithm?
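(Answering myself a bit: with, say, 4 cpus on every node an 8 vcpu
guest gets min_nodes = (8 + 4 - 1) / 4 = 2 as a starting point below,
whereas the fallback just starts generating from 1 node and relies on
the per-candidate cpu check to discard the too-small ones. So it looks
like a shortcut rather than a correctness thing.)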
> + */
> +static int cpus_per_node_count(libxl_cputopology *tinfo, int nr_cpus,
> + libxl_numainfo *ninfo, int nr_nodes)
> +{
> + int cpus_per_node = 0;
> + int j, i;
> +
> + /* This makes sense iff # of PCPUs is the same for all nodes */
> + for (j = 0; j < nr_nodes; j++) {
> + int curr_cpus = 0;
> +
> + for (i = 0; i < nr_cpus; i++) {
> + if (tinfo[i].node == j)
> + curr_cpus++;
> + }
> + /* So, if the above does not hold, turn the whole thing off! */
> + cpus_per_node = cpus_per_node == 0 ? curr_cpus : cpus_per_node;
> + if (cpus_per_node != curr_cpus)
> + return 0;
> + }
> + return cpus_per_node;
> +}
> +
> +/* Get all the placement candidates satisfying some specific conditions */
> +int libxl__get_numa_candidates(libxl__gc *gc,
> + uint32_t min_free_memkb, int min_cpus,
> + int min_nodes, int max_nodes,
> + libxl__numa_candidate *cndts[], int *nr_cndts)
> +{
> + libxl__numa_candidate *new_cndts = NULL;
> + libxl_cputopology *tinfo = NULL;
> + libxl_numainfo *ninfo = NULL;
> + libxl_bitmap nodemap;
> + int nr_nodes, nr_cpus;
> + int array_size, rc;
> +
> + /* Get platform info and prepare the map for testing the combinations */
> + ninfo = libxl_get_numainfo(CTX, &nr_nodes);
> + if (ninfo == NULL)
> + return ERROR_FAIL;
> + /* If we don't have at least 2 nodes, it is useless to proceed */
You don't want to return whichever of those 2 nodes meets the
constraints?
> + if (nr_nodes < 2) {
> + rc = 0;
> + goto out;
> + }
> +
> + tinfo = libxl_get_cpu_topology(CTX, &nr_cpus);
> + if (tinfo == NULL) {
> + rc = ERROR_FAIL;
> + goto out;
> + }
> +
> + rc = libxl_node_bitmap_alloc(CTX, &nodemap);
> + if (rc)
> + goto out;
> +
> + /*
> + * Round up and down some of the constraints. For instance, the minimum
> + * number of cpus a candidate should have must at least be non-negative.
> + * Regarding the minimum number of NUMA nodes, if not explicitly
> specified
> + * (i.e., min_nodes <= 0), we try to figure out a sensible number of
> nodes
> + * from where to start generating candidates, if possible (or just start
> + * from 1 otherwise). The maximum number of nodes should not exceed the
> + * number of existent NUMA nodes on the host, or the candidate genaration
generation
> + * won't work properly.
> + */
> + min_cpus = min_cpus <= 0 ? 0 : min_cpus;
> + if (min_nodes <= 0) {
> + int cpus_per_node;
> +
> + cpus_per_node = cpus_per_node_count(tinfo, nr_cpus, ninfo, nr_nodes);
> + if (cpus_per_node == 0)
> + min_nodes = 1;
> + else
> + min_nodes = (min_cpus + cpus_per_node - 1) / cpus_per_node;
> + }
> + min_nodes = min_nodes > nr_nodes ? nr_nodes : min_nodes;
> + if (max_nodes <= 0)
> + max_nodes = nr_nodes;
> + else
> + max_nodes = max_nodes > nr_nodes ? nr_nodes : max_nodes;
> + if (min_nodes > max_nodes) {
> + rc = ERROR_INVAL;
> + goto out;
> + }
> +
> + /* Initialize the local storage for the combinations */
> + *nr_cndts = 0;
> + array_size = nr_nodes;
> + GCNEW_ARRAY(new_cndts, array_size);
> +
> + /* Generate all the combinations of any size from min_nodes to
> + * max_nodes (see comb_init() and comb_next()). */
> + while (min_nodes <= max_nodes) {
> + comb_iter_t comb_iter;
> + int comb_ok;
> +
> + /*
> + * And here it is. Each step of this cycle generates a combination of
> + * nodes as big as min_nodes mandates. Each of these combinations is
> + * checked against the constraints provided by the caller (namely,
> + * amount of free memory and number of cpus) and it becomes an actual
> + * placement candidate iff it passes the check.
> + */
> + for (comb_ok = comb_init(gc, &comb_iter, nr_nodes, min_nodes); comb_ok;
> + comb_ok = comb_next(comb_iter, nr_nodes, min_nodes)) {
> + uint32_t nodes_free_memkb;
> + int nodes_cpus;
> +
> + comb_get_nodemap(comb_iter, &nodemap, min_nodes);
> +
> + /* If there is not enough memoy in this combination, skip it
memory
> + * and go generating the next one... */
> + nodes_free_memkb = nodemap_to_free_memkb(ninfo, &nodemap);
> + if (min_free_memkb > 0 && nodes_free_memkb < min_free_memkb)
> + continue;
> +
> + /* And the same applies if this combination is short in cpus */
> + nodes_cpus = nodemap_to_nodes_cpus(tinfo, nr_cpus, &nodemap);
> + if (min_cpus > 0 && nodes_cpus < min_cpus)
> + continue;
> +
> + /*
> + * Conditions are met, we can add this combination to the
> + * NUMA placement candidates list. We first make sure there
> + * is enough space in there, and then we initialize the new
> + * candidate element with the node map corresponding to the
> + * combination we are dealing with. Memory allocation for
> + * expanding the array that hosts the list happens in chunks
> + * equal to the number of NUMA nodes in the system (to
> + * avoid allocating memory each and every time we find a
> + * new candidate).
> + */
> + if (*nr_cndts == array_size)
> + array_size += nr_nodes;
> + GCREALLOC_ARRAY(new_cndts, array_size);
> +
> + libxl__numa_candidate_alloc(gc, &new_cndts[*nr_cndts]);
> + libxl__numa_candidate_put_nodemap(gc, &new_cndts[*nr_cndts],
> + &nodemap);
> + new_cndts[*nr_cndts].nr_domains =
> + nodemap_to_nr_domains(gc, tinfo, &nodemap);
> + new_cndts[*nr_cndts].free_memkb = nodes_free_memkb;
> + new_cndts[*nr_cndts].nr_nodes = min_nodes;
> + new_cndts[*nr_cndts].nr_cpus = nodes_cpus;
> +
> + LOG(DEBUG, "NUMA placement candidate #%d found: nr_nodes=%d, "
> + "nr_cpus=%d, free_memkb=%"PRIu32"", *nr_cndts,
> + min_nodes, new_cndts[*nr_cndts].nr_cpus,
> + new_cndts[*nr_cndts].free_memkb / 1024);
> +
> + (*nr_cndts)++;
> + }
> + min_nodes++;
> + }
> +
> + *cndts = new_cndts;
> + out:
> + libxl_bitmap_dispose(&nodemap);
nodemap might not have been initialised?
> + libxl_cputopology_list_free(tinfo, nr_cpus);
> + libxl_numainfo_list_free(ninfo, nr_nodes);
It looks like the nr_* may also not have been initialised, but maybe
list_free is safe in that case since the associated array must
necessarily be NULL?
> + return rc;
> +}
> +
> +void libxl__sort_numa_candidates(libxl__numa_candidate cndts[], int nr_cndts,
> + libxl__numa_candidate_cmpf cmpf)
> +{
> + /* Reorder candidates (see the comparison function for
> + * the details on the heuristics) */
> + qsort(cndts, nr_cndts, sizeof(cndts[0]), cmpf);
> +}
> +
> +/*
> + * The NUMA placement candidates are reordered according to the following
> + * heuristics:
> + * - candidates involving fewer nodes come first. In case two (or
> + * more) candidates span the same number of nodes,
> + * - candidates with greater amount of free memory come first. In
> + * case two (or more) candidates differ in their amount of free
> + * memory by less than 10%,
> + * - candidates with fewer domains insisting on them at the time of
> + * this call come first.
> + */
> +static int numa_cmpf(const void *v1, const void *v2)
> +{
> + const libxl__numa_candidate *c1 = (const libxl__numa_candidate*) v1;
> + const libxl__numa_candidate *c2 = (const libxl__numa_candidate*) v2;
Are these casts necessary?
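(I don't think so -- in C a void * converts implicitly to any object
pointer type, so plain assignment should be fine.)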
> + double mem_diff = labs(c1->free_memkb - c2->free_memkb);
> + double mem_avg = (c1->free_memkb + c2->free_memkb) / 2.0;
> +
> + if (c1->nr_nodes != c2->nr_nodes)
> + return c1->nr_nodes - c2->nr_nodes;
> +
> + if ((mem_diff / mem_avg) * 100.0 < 10.0 &&
Is all this FP math necessary? Using integers might have some rounding
but is it significant or do we care if it gets it slightly wrong?
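Something along these lines would keep it in integers (sketch, and I
think it also dodges the unsigned wraparound which the
labs(c1->free_memkb - c2->free_memkb) above can hit, free_memkb being
a uint32_t):

    uint64_t diff = c1->free_memkb > c2->free_memkb ?
                    c1->free_memkb - c2->free_memkb :
                    c2->free_memkb - c1->free_memkb;
    uint64_t sum  = (uint64_t)c1->free_memkb + c2->free_memkb;

    /* diff / (sum / 2) < 10%  <=>  20 * diff < sum */
    if (20 * diff < sum && c1->nr_domains != c2->nr_domains)
        return c1->nr_domains - c2->nr_domains;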
> + c1->nr_domains != c2->nr_domains)
> + return c1->nr_domains - c2->nr_domains;
> +
> + return c2->free_memkb - c1->free_memkb;
> +}
> +
> +/* The actual automatic NUMA placement routine */
> +static int numa_place_domain(libxl__gc *gc, libxl_domain_build_info *info)
> +{
> + int nr_candidates = 0;
> + libxl__numa_candidate *candidates = NULL;
> + libxl_bitmap candidate_nodemap;
> + libxl_cpupoolinfo *pinfo;
> + int nr_pools, rc = 0;
> + uint32_t memkb;
> +
> + /* First of all, if cpupools are in use, better not to mess with them */
> + pinfo = libxl_list_cpupool(CTX, &nr_pools);
> + if (!pinfo)
> + return ERROR_FAIL;
> + if (nr_pools > 1) {
> + LOG(NOTICE, "skipping NUMA placement as cpupools are in use");
> + goto out;
> + }
> +
> + rc = libxl_domain_need_memory(CTX, info, &memkb);
> + if (rc)
> + goto out;
> + if (libxl_node_bitmap_alloc(CTX, &candidate_nodemap)) {
> + rc = ERROR_FAIL;
> + goto out;
> + }
> +
> + /* Find all the candidates with enough free memory and at least
> + * as much pcpus as the domain has vcpus. */
> + rc = libxl__get_numa_candidates(gc, memkb, info->max_vcpus, 0, 0,
> + &candidates, &nr_candidates);
> + if (rc)
> + goto out;
> +
> + LOG(DETAIL, "%d NUMA placement candidates found", nr_candidates);
> +
> + /* No suitable placement candidates. We just return without touching the
> + * domain's info->cpumap. It will have affinity with all nodes/cpus. */
> + if (nr_candidates == 0)
> + goto out;
> +
> + /* Bring the best candidate in front of the list --> candidates[0] */
> + if (nr_candidates > 1)
> + libxl__sort_numa_candidates(candidates, nr_candidates, numa_cmpf);
Is the start up cost of libxl__sort_numa_candidates significant enough
to make that if worthwhile?
> +
> + /*
> + * At this point, the first candidate in the array is the one we want.
> + * Go for it by mapping its node map to the domain's info->cpumap.
> + */
> + libxl__numa_candidate_get_nodemap(gc, &candidates[0], &candidate_nodemap);
> + rc = libxl_nodemap_to_cpumap(CTX, &candidate_nodemap, &info->cpumap);
> + if (rc)
> + goto out;
> +
> + LOG(DETAIL, "NUMA placement candidate with %d nodes, %d cpus and "
> + "%"PRIu32" KB free selected", candidates[0].nr_nodes,
> + candidates[0].nr_cpus, candidates[0].free_memkb / 1024);
> +
> + out:
> + libxl_bitmap_dispose(&candidate_nodemap);
> + libxl_cpupoolinfo_list_free(pinfo, nr_pools);
> + return rc;
> +}
> +
> int libxl__build_pre(libxl__gc *gc, uint32_t domid,
> libxl_domain_build_info *info, libxl__domain_build_state *state)
> {
> @@ -104,7 +557,22 @@ int libxl__build_pre(libxl__gc *gc, uint
> uint32_t rtc_timeoffset;
>
> xc_domain_max_vcpus(ctx->xch, domid, info->max_vcpus);
> +
> + /*
> + * Check if the domain has any CPU affinity. If not, try to build up one.
> + * In case numa_place_domain() find at least a suitable candidate, it
> will
> + * affect info->cpumap accordingly; if it does not, it just leaves it
> + * as it is. This means (unless some weird error manifests) the
> subsequent
> + * call to libxl_set_vcpuaffinity_all() will do the actual placement,
> + * whatever that turns out to be.
> + */
> + if (libxl_bitmap_is_full(&info->cpumap)) {
> + int rc = numa_place_domain(gc, info);
> + if (rc)
> + return rc;
I'm not sure about this: if numa_place_domain fails for any reason, would
we not be better off logging and continuing without placement? We
already do that explicitly in a couple of cases.
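i.e. roughly (sketch, log message invented):

    if (libxl_bitmap_is_full(&info->cpumap)) {
        int rc = numa_place_domain(gc, info);
        if (rc)
            LOG(WARN, "NUMA placement failed, "
                      "the domain will keep affinity with all cpus");
    }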
> + }
> libxl_set_vcpuaffinity_all(ctx, domid, info->max_vcpus, &info->cpumap);
> +
> xc_domain_setmaxmem(ctx->xch, domid, info->target_memkb +
> LIBXL_MAXMEM_CONSTANT);
> if (info->type == LIBXL_DOMAIN_TYPE_PV)
> xc_domain_set_memmap_limit(ctx->xch, domid,
> diff --git a/tools/libxl/libxl_internal.h b/tools/libxl/libxl_internal.h
> --- a/tools/libxl/libxl_internal.h
> +++ b/tools/libxl/libxl_internal.h
> @@ -2021,6 +2021,134 @@ static inline void libxl__ctx_unlock(lib
> #define CTX_LOCK (libxl__ctx_lock(CTX))
> #define CTX_UNLOCK (libxl__ctx_unlock(CTX))
>
> +/*
> + * Automatic NUMA placement
> + *
> + * These functions and data structures deal with the initial placement of a
> + * domain onto the host NUMA nodes.
> + *
> + * The key concept here is the one of "numa placement candidate" (or jusr
> "numa
just
> + * candidate", "placement candidate" or even "candidate"), which basically
> is a
Do we really need that level of detail about what we might call it?
There's no minimum word limit you know ;-)
> + * set of nodes whose characteristics have been successfully checked against
> + * some specific requirements. More pecisely, a candidate is the nodemap
precisely
> + * associated with one of the possible subset of the host NUMA nodes
> providing
> + * a certain amount of free memory, or a given number of cpus, or even both
> + * (depending in what the caller wants). For convenience of use, some of this
> + * information are stored within the candidate itselfs, instead of always
> being
itself
> + * dynamically computed. A candidate can consist of just one, up to all the
> + * available nodes on the host (the latter being, of course, not ideal).
I can't parse this sentence.
> For
> + * instance, looking for a numa candidates with 2GB of free memoy means we
> want
memory
> + * all the possible subsets of the host NUMA nodes with, cumulatively, at
> least
> + * 2GB of free memory. That could be possible by just using one particular
> + * node, or may require more nodes, depending on the characteristics of the
> + * host, on how many domains have been created already, on how big they are,
> + * etc.
> + *
> + * The intended usage is as follows:
> + * 1. by, fist of all, calling libxl__get_numa_candidates(), and specifying
> + * the proper constraints to it (e.g., the amount of memory a domain need
> + * as the minimum amount of free memory fo the candidates) one can build
for
> + * up a whole set of suitable placing alternatives for a domain;
> + * 2. after that, one specific candidate should be chosen. That can happen
> + * by looking at their various characteristics;
> + * 3. the chosen candidate's nodemap should be utilized for computing the
> + * actual affinity of the domain which, given the curent NUMA support
current
> + * in the hypervisor, is what determines the placement of the domain's
> + * vcpus and memory.
> + *
> + * To make phase 2 even easier, a sorting helper function for the list of
> + * candidates is povided in the form of libxl__sort_numa_candidates(). The
> only
provided
> + * that is needed is defining a comparison function, containing the chriteria
criteria
> + * for deciding, given two candidates, which one is 'better'. Depending on
> how
> + * the comparison function is defined, the best candidate (where, of course,
> + * best is defined with respect to the heuristics implemented in the
> comparison
> + * function itself, libxl__numa_candidate_cmpf()) could become the first o
> the
or
> + * last element of the list.
> + *
> + * Summarizing, achieving automatic NUMA placement is just a matter of
> + * obtaining the list of suitable placement candidates, perhaps asking for
> each
> + * of them to provide at least the amount of memory the domain needs. After
> + * that just implement a comparison function by means of the various helpers
> + * retrieving the relevant information about the candidates themselves.
> + * Finally, call the sorting helper function and use the candidate that
> became
> + * (typically) the first element of the list fo determining the domain's
for
> + * affinity.
> + */
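(FWIW, compressed from numa_place_domain() in libxl_dom.c, the flow
described above boils down to something like this -- sketch, with
error handling and bitmap allocation omitted:

    libxl__numa_candidate *cndts = NULL;
    int nr_cndts = 0, rc;

    /* 1. generate the candidates with enough memory and cpus */
    rc = libxl__get_numa_candidates(gc, memkb, min_cpus, 0, 0,
                                    &cndts, &nr_cndts);
    /* 2. sort: the heuristically best candidate ends up in cndts[0] */
    libxl__sort_numa_candidates(cndts, nr_cndts, numa_cmpf);
    /* 3. use its nodemap to derive the domain's cpu affinity */
    libxl__numa_candidate_get_nodemap(gc, &cndts[0], &nodemap);
    rc = libxl_nodemap_to_cpumap(CTX, &nodemap, &info->cpumap);

which reads naturally enough, so the interface seems to hang together.)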
> +
> +typedef struct {
> + int nr_cpus, nr_nodes;
> + int nr_domains;
> + uint32_t free_memkb;
> + libxl_bitmap nodemap;
> +} libxl__numa_candidate;
> +
> +/*
> + * This generates the list of NUMA placement candidates satisfying some
> + * specific conditions. If min_nodes and/or max_nodes are not 0, their value
> is
> + * used to determine the minimum and maximum number of nodes that are allow
> to
> + * be present in each candidate. If min_nodes and/or max_nodes are 0, the
> + * minimum and maximum number of nodes to be used are automatically selected
> by
> + * the implementation (and that will likely be just 1 node for the minimum
> and
> + * the total number of existent nodes for the maximum). Re min_free_memkb and
> + * min_cpu, if not 0, they specify the caller only wants candidates with at
> + * least that amount of free memory and that number of cpus, respectively. If
> + * min_free_memkb and/or min_cpus are 0, the candidates' free memory and
> number
> + * of cpus won't be checked at all, which means a candidate will always be
> + * considered suitable wrt the specific constraint. cndts is where the list
> of
> + * exactly nr_cndts candidates is returned. Note that, in case no candidates
> + * are found at all, the function returns successfully, but with nr_cndts
> equal
> + * to zero.
> + */
> +_hidden int libxl__get_numa_candidates(libxl__gc *gc,
> + uint32_t min_free_memkb, int min_cpus,
> + int min_nodes, int max_nodes,
> + libxl__numa_candidate *cndts[], int *nr_cndts);
> +
> +/* allocation and deallocation for placement candidates */
> +static inline int libxl__numa_candidate_alloc(libxl__gc *gc,
> + libxl__numa_candidate *cndt)
> +{
> + return libxl_node_bitmap_alloc(CTX, &cndt->nodemap);
> +}
> +static inline void libxl__numa_candidate_dispose(libxl__numa_candidate *cndt)
> +{
> + libxl_bitmap_dispose(&cndt->nodemap);
> +}
> +static inline void libxl__numacandidate_list_free(libxl__numa_candidate *cndts,
> + int nr_cndts)
> +{
> + int i;
> +
> + for (i = 0; i < nr_cndts; i++)
> + libxl__numa_candidate_dispose(&cndts[i]);
> + free(cndts);
> +}
> +
> +/* retrieve (in nodemap) the node map associated to placement candidate cndt
> */
> +static inline
> +void libxl__numa_candidate_get_nodemap(libxl__gc *gc,
> + const libxl__numa_candidate *cndt,
> + libxl_bitmap *nodemap)
> +{
> + libxl_bitmap_copy(CTX, nodemap, &cndt->nodemap);
> +}
> +/* set the node map of placement candidate cndt to match nodemap */
> +static inline
> +void libxl__numa_candidate_put_nodemap(libxl__gc *gc,
> + libxl__numa_candidate *cndt,
> + const libxl_bitmap *nodemap)
> +{
> + libxl_bitmap_copy(CTX, &cndt->nodemap, nodemap);
> +}
> +
> +/* signature for the comparison function between two candidates c1 and c2
> + * (the thid parameter is provided to enable thread safety). */
third
But there isn't actually a third parameter so it's a bit moot ;-)
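(If it ever does become needed, glibc's qsort_r(3) is the variant which
passes such an extra context pointer, taking an
int (*)(const void *, const void *, void *) comparator -- though BSD's
qsort_r has a different argument order, annoyingly.)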
> +typedef int (*libxl__numa_candidate_cmpf)(const void *v1, const void *v2);
> +/* sort the list of candidates in cndts (an array with nr_cndts elements in
> + * it) using cmpf for comparing two candidates. Uses libc's qsort(). */
> +_hidden void libxl__sort_numa_candidates(libxl__numa_candidate cndts[],
> + int nr_cndts,
> + libxl__numa_candidate_cmpf cmpf);
>
> /*
> * Inserts "elm_new" into the sorted list "head".