[Date Prev][Date Next][Thread Prev][Thread Next][Date Index][Thread Index]

[Xen-devel] [PATCH 08 of 10 v2] libxl: enable automatic placement of guests on NUMA nodes



If a domain does not have a VCPU affinity, try to pin it automatically to some
PCPUs. This is done taking into account the NUMA characteristics of the host.
In fact, we look for a combination of host's NUMA nodes with enough free memory
and number of PCPUs for the new domain, and pin it to the VCPUs of those nodes.

Once we know which ones, among all the possible combinations, represents valid
placement candidates for a domain, use some heuistics for deciding which is the
best. For instance, smaller candidates are considered to be better, both from
the domain's point of view (fewer memory spreading among nodes) and from the
system as a whole point of view (fewer memoy fragmentation).  In case of
candidates of equal sizes (i.e., with the same number of nodes), the one with
the greater amount of memory wins, as this is also good for keeping memory
fragmentation under control.

This all happens internally to libxl, and no API for driving the mechanism is
provided for now. This matches what xend already does.

Signed-off-by: Dario Faggioli <dario.faggioli@xxxxxxxxxx>

Changes from v1:
 * This patches incorporates the changes from both "libxl, xl: enable automatic
   placement of guests on NUMA nodes" and "libxl, xl: heuristics for reordering
   NUMA placement candidates" from v1.
 * The logic of the algorithm is basically the same as in v1, but the splitting
   of it in the various functions has been completely redesigned from scratch.
 * No public API for placement or candidate generation is now exposed,
   everything happens within libxl, as agreed during v1 review.
 * The relevant documentation have been moved near the actual functions and
   features. Also, the amount and (hopefully!) the quality of the documentation
   has been improved a lot, as requested.
 * All the comments about using the proper libxl facilities and helpers for
   allocations, etc., have been considered and applied.
 * This patch still bails out from NUMA optimizations if it find out cpupools
   are being utilized. It is next patch that makes the two things interact
   properly, as suggested during review.

diff --git a/docs/man/xl.cfg.pod.5 b/docs/man/xl.cfg.pod.5
--- a/docs/man/xl.cfg.pod.5
+++ b/docs/man/xl.cfg.pod.5
@@ -111,8 +111,8 @@ created online and the remainder will be
 
 =item B<cpus="CPU-LIST">
 
-List of which cpus the guest is allowed to use. Default behavior is
-`all cpus`. A C<CPU-LIST> may be specified as follows:
+List of which cpus the guest is allowed to use. By default xl will (via
+libxl) pick some cpus (see below). A C<CPU-LIST> may be specified as follows:
 
 =over 4
 
@@ -132,6 +132,12 @@ run on cpu #3 of the host.
 
 =back
 
+If this option is not specified, libxl automatically tries to place the new
+domain on the host's NUMA nodes (provided the host has more than one NUMA
+node) by pinning it to the cpus of those nodes. An heuristical approach is
+utilized with the goals of maximizing performance for the domain and, at
+the same time, achieving efficient utilization of the host's CPUs and RAM.
+
 =item B<cpu_weight=WEIGHT>
 
 A domain with a weight of 512 will get twice as much CPU as a domain
diff --git a/tools/libxl/libxl_dom.c b/tools/libxl/libxl_dom.c
--- a/tools/libxl/libxl_dom.c
+++ b/tools/libxl/libxl_dom.c
@@ -95,6 +95,459 @@ out:
     return sched;
 }
 
+/*
+ * What follows are helpers for generating all the k-combinations
+ * without repetitions of a set S with n elements in it. Formally
+ * speaking, they are subsets of k distinct elements of S and, if
+ * S is n elements big, the number of k-combinations is equal to
+ * the binomial coefficient (n k), which means we get exactly
+ * n!/(k! * (n - k)!) subsets, all of them with k elements.
+ *
+ * The various subset are geneated one after the other by calling
+ * comb_init() first, and, after that, comb_next()
+ * (n k)-1 times. An iterator is used to store the curent status
+ * of the whole generation operation (i.e., basically, the last
+ * combination that has been generated). As soon as all
+ * combinations have been generated, comb_next() will
+ * start returning 0 instead of 1. It is of course impotant that
+ * the same instance of the iterator and the same values for
+ * n and k are used for each call. If that doesn't happen, the
+ * result is unspecified.
+ *
+ * The algorithm is a well known one, and it produces the
+ * combinations in such a way that they (well, more precisely,
+ * their indexes it the array/map representing the set) come in
+ * lexicogaphic order.
+ *
+ * For example, with n = 5 and k = 3, calling comb_init()
+ * will generate { 0, 1, 2 }, while subsequent valid calls to
+ * comb_next() will produce the following:
+ * { { 0, 1, 2 }, { 0, 1, 3 }, { 0, 1, 4 },
+ *   { 0, 2, 3 }, { 0, 2, 4 }, { 0, 3, 4 },
+ *   { 1, 2, 3 }, { 1, 2, 4 }, { 1, 3, 4 },
+ *   { 2, 3, 4 } }
+ *
+ * This is used by the automatic NUMA placement logic below.
+ */
+typedef int* comb_iter_t;
+
+static int comb_init(libxl__gc *gc, comb_iter_t *it, int n, int k)
+{
+    comb_iter_t new_iter;
+    int i;
+
+    if (n < k)
+        return 0;
+
+    /* First set is always { 0, 1, 2, ..., k-1 } */
+    GCNEW_ARRAY(new_iter, k);
+    for (i = 0; i < k; i++)
+        new_iter[i] = i;
+
+    *it = new_iter;
+    return 1;
+}
+
+static int comb_next(comb_iter_t it, int n, int k)
+{
+    int i;
+
+    /*
+     * The idea here is to find the leftmost element from where
+     * we should start incrementing the indexes of the iterator.
+     * This means looking for the highest index that can be increased
+     * while still producing value smaller than n-1. In the example
+     * above, when dealing with { 0, 1, 4 }, such an element is the
+     * second one, as the third is already equal to 4 (which actually
+     * is n-1).
+     * Once we found from where to start, we increment that element
+     * and overide the right-hand rest of the iterator with its
+     * successors, thus achieving lexicographic ordering.
+     *
+     * Regarding the termination of the generation process, when we
+     * manage in bringing n-k at the very first position of the iterator,
+     * we know that is the last valid combination ( { 2, 3, 4 }, with
+     * n - k = 5 - 2 = 2, in the example above), and thus we start
+     * returning 0 as soon as we cross that border.
+     */
+    for (i = k - 1; it[i] == n - k + i; i--) {
+        if (i <= 0)
+            return 0;
+    }
+    for (it[i]++, i++; i < k; i++)
+        it[i] = it[i - 1] + 1;
+    return 1;
+}
+
+/* NUMA automatic placement (see libxl_internal.h for details) */
+
+/*
+ * This function turns a k-combination iterator into a node map.
+ * This means the bits in the node map corresponding to the indexes
+ * of the given combination are the ones that will be set.
+ * For example, if the iterator represents the combination { 0, 2, 4},
+ * the node map will have bits #0, #2 and #4 set to one and i thus
+ * will point to node 0, node 2 and and node 4.
+ */
+static void comb_get_nodemap(comb_iter_t it, libxl_bitmap *nodemap, int k)
+{
+    int i;
+
+    libxl_bitmap_set_none(nodemap);
+    for (i = 0; i < k; i++)
+        libxl_bitmap_set(nodemap, it[i]);
+}
+
+/* Retrieve the number of cpus that the nodes that are part of the nodemap
+ * span. */
+static int nodemap_to_nodes_cpus(libxl_cputopology *tinfo, int nr_cpus,
+                                 const libxl_bitmap *nodemap)
+{
+    int i, nodes_cpus = 0;
+
+    for (i = 0; i < nr_cpus; i++) {
+        if (libxl_bitmap_test(nodemap, tinfo[i].node))
+            nodes_cpus++;
+    }
+    return nodes_cpus;
+}
+
+/* Retrieve the amount of free memory within the nodemap */
+static uint32_t nodemap_to_free_memkb(libxl_numainfo *ninfo,
+                                      libxl_bitmap *nodemap)
+{
+    uint32_t free_memkb = 0;
+    int i;
+
+    libxl_for_each_set_bit(i, *nodemap)
+        free_memkb += ninfo[i].free / 1024;
+
+    return free_memkb;
+}
+
+/* Retrieve the number of domains that can potentially run on the cpus
+ * the nodes that are part of the nodemap. */
+static int nodemap_to_nr_domains(libxl__gc *gc, libxl_cputopology *tinfo,
+                                 const libxl_bitmap *nodemap)
+{
+    libxl_dominfo *dinfo = NULL;
+    libxl_bitmap dom_nodemap;
+    int nr_doms, nr_cpus;
+    int nr_domains = 0;
+    int i, j, k;
+
+    dinfo = libxl_list_domain(CTX, &nr_doms);
+    if (dinfo == NULL)
+        return ERROR_FAIL;
+
+    if (libxl_node_bitmap_alloc(CTX, &dom_nodemap) < 0) {
+        nr_domains = ERROR_FAIL;
+        goto out;
+    }
+
+    for (i = 0; i < nr_doms; i++) {
+        libxl_vcpuinfo *vinfo;
+        int nr_vcpus;
+
+        vinfo = libxl_list_vcpu(CTX, dinfo[i].domid, &nr_vcpus, &nr_cpus);
+        if (vinfo == NULL)
+            continue;
+
+        libxl_bitmap_set_none(&dom_nodemap);
+        for (j = 0; j < nr_vcpus; j++) {
+            libxl_for_each_set_bit(k, vinfo[j].cpumap)
+                libxl_bitmap_set(&dom_nodemap, tinfo[k].node);
+        }
+
+        libxl_for_each_set_bit(j, dom_nodemap) {
+            if (libxl_bitmap_test(nodemap, j)) {
+                nr_domains++;
+                goto found;
+            }
+        }
+ found:
+        libxl_vcpuinfo_list_free(vinfo, nr_vcpus);
+    }
+
+ out:
+    libxl_bitmap_dispose(&dom_nodemap);
+    libxl_dominfo_list_free(dinfo, nr_doms);
+    return nr_domains;
+}
+
+/*
+ * This function tries to figure out if the host has a consistent number
+ * of cpus along all its NUMA nodes. In fact, if that is the case, we can
+ * calculate the minimum number of nodes needed for a domain by just
+ * dividing its total number of vcpus by this value computed here.
+ * However, we are not allowed to assume that all the nodes have the
+ * same number of cpus. Therefore, in case discrepancies among different
+ * nodes are found, this function just returns 0 and the caller needs
+ * to sort out things in some other way.
+ */
+static int cpus_per_node_count(libxl_cputopology *tinfo, int nr_cpus,
+                               libxl_numainfo *ninfo, int nr_nodes)
+{
+    int cpus_per_node = 0;
+    int j, i;
+
+    /* This makes sense iff # of PCPUs is the same for all nodes */
+    for (j = 0; j < nr_nodes; j++) {
+        int curr_cpus = 0;
+
+        for (i = 0; i < nr_cpus; i++) {
+            if (tinfo[i].node == j)
+                curr_cpus++;
+        }
+        /* So, if the above does not hold, turn the whole thing off! */
+        cpus_per_node = cpus_per_node == 0 ? curr_cpus : cpus_per_node;
+        if (cpus_per_node != curr_cpus)
+            return 0;
+    }
+    return cpus_per_node;
+}
+
+/* Get all the placement candidates satisfying some specific conditions */
+int libxl__get_numa_candidates(libxl__gc *gc,
+                               uint32_t min_free_memkb, int min_cpus,
+                               int min_nodes, int max_nodes,
+                               libxl__numa_candidate *cndts[], int *nr_cndts)
+{
+    libxl__numa_candidate *new_cndts = NULL;
+    libxl_cputopology *tinfo = NULL;
+    libxl_numainfo *ninfo = NULL;
+    libxl_bitmap nodemap;
+    int nr_nodes, nr_cpus;
+    int array_size, rc;
+
+    /* Get platform info and prepare the map for testing the combinations */
+    ninfo = libxl_get_numainfo(CTX, &nr_nodes);
+    if (ninfo == NULL)
+        return ERROR_FAIL;
+    /* If we don't have at least 2 nodes, it is useless to proceed */
+    if (nr_nodes < 2) {
+        rc = 0;
+        goto out;
+    }
+
+    tinfo = libxl_get_cpu_topology(CTX, &nr_cpus);
+    if (tinfo == NULL) {
+        rc = ERROR_FAIL;
+        goto out;
+    }
+
+    rc = libxl_node_bitmap_alloc(CTX, &nodemap);
+    if (rc)
+        goto out;
+
+    /*
+     * Round up and down some of the constraints. For instance, the minimum
+     * number of cpus a candidate should have must at least be non-negative.
+     * Regarding the minimum number of NUMA nodes, if not explicitly specified
+     * (i.e., min_nodes <= 0), we try to figure out a sensible number of nodes
+     * from where to start generating candidates, if possible (or just start
+     * from 1 otherwise). The maximum number of nodes should not exceed the
+     * number of existent NUMA nodes on the host, or the candidate genaration
+     * won't work properly.
+     */
+    min_cpus = min_cpus <= 0 ? 0 : min_cpus;
+    if (min_nodes <= 0) {
+        int cpus_per_node;
+
+        cpus_per_node = cpus_per_node_count(tinfo, nr_cpus, ninfo, nr_nodes);
+        if (cpus_per_node == 0)
+            min_nodes = 1;
+        else
+            min_nodes = (min_cpus + cpus_per_node - 1) / cpus_per_node;
+    }
+    min_nodes = min_nodes > nr_nodes ? nr_nodes : min_nodes;
+    if (max_nodes <= 0)
+        max_nodes = nr_nodes;
+    else
+        max_nodes = max_nodes > nr_nodes ? nr_nodes : max_nodes;
+    if (min_nodes > max_nodes) {
+        rc = ERROR_INVAL;
+        goto out;
+    }
+
+    /* Initialize the local storage for the combinations */
+    *nr_cndts = 0;
+    array_size = nr_nodes;
+    GCNEW_ARRAY(new_cndts, array_size);
+
+    /* Generate all the combinations of any size from min_nodes to
+     * max_nodes (see comb_init() and comb_next()). */
+    while (min_nodes <= max_nodes) {
+        comb_iter_t comb_iter;
+        int comb_ok;
+
+        /*
+        * And here it is. Each step of this cycle generates a combination of
+        * nodes as big as min_nodes mandates.  Each of these combinations is
+        * checked against the constraints provided by the caller (namely,
+        * amount of free memory and number of cpus) and it becomes an actual
+        * placement candidate iff it passes the check.
+         */
+        for (comb_ok = comb_init(gc, &comb_iter, nr_nodes, min_nodes); comb_ok;
+             comb_ok = comb_next(comb_iter, nr_nodes, min_nodes)) {
+            uint32_t nodes_free_memkb;
+            int nodes_cpus;
+
+            comb_get_nodemap(comb_iter, &nodemap, min_nodes);
+
+            /* If there is not enough memoy in this combination, skip it
+             * and go generating the next one... */
+            nodes_free_memkb = nodemap_to_free_memkb(ninfo, &nodemap);
+            if (min_free_memkb > 0 && nodes_free_memkb < min_free_memkb)
+                continue;
+
+            /* And the same applies if this combination is short in cpus */
+            nodes_cpus = nodemap_to_nodes_cpus(tinfo, nr_cpus, &nodemap);
+            if (min_cpus > 0 && nodes_cpus < min_cpus)
+                continue;
+
+            /*
+             * Conditions are met, we can add this combination to the
+             * NUMA placement candidates list. We first make sure there
+             * is enough space in there, and then we initialize the new
+             * candidate element with the node map corresponding to the
+             * combination we are dealing with. Memory allocation for
+             * expanding the array that hosts the list happens in chunks
+             * equal to the number of NUMA nodes in the system (to
+             * avoid allocating memory each and every time we find a
+             * new candidate).
+             */
+            if (*nr_cndts == array_size)
+                array_size += nr_nodes;
+            GCREALLOC_ARRAY(new_cndts, array_size);
+
+            libxl__numa_candidate_alloc(gc, &new_cndts[*nr_cndts]);
+            libxl__numa_candidate_put_nodemap(gc, &new_cndts[*nr_cndts],
+                                              &nodemap);
+            new_cndts[*nr_cndts].nr_domains =
+                                    nodemap_to_nr_domains(gc, tinfo, &nodemap);
+            new_cndts[*nr_cndts].free_memkb = nodes_free_memkb;
+            new_cndts[*nr_cndts].nr_nodes = min_nodes;
+            new_cndts[*nr_cndts].nr_cpus = nodes_cpus;
+
+            LOG(DEBUG, "NUMA placement candidate #%d found: nr_nodes=%d, "
+                       "nr_cpus=%d, free_memkb=%"PRIu32"", *nr_cndts,
+                       min_nodes, new_cndts[*nr_cndts].nr_cpus,
+                       new_cndts[*nr_cndts].free_memkb / 1024);
+
+            (*nr_cndts)++;
+        }
+        min_nodes++;
+    }
+
+    *cndts = new_cndts;
+ out:
+    libxl_bitmap_dispose(&nodemap);
+    libxl_cputopology_list_free(tinfo, nr_cpus);
+    libxl_numainfo_list_free(ninfo, nr_nodes);
+    return rc;
+}
+
+void libxl__sort_numa_candidates(libxl__numa_candidate cndts[], int nr_cndts,
+                                 libxl__numa_candidate_cmpf cmpf)
+{
+    /* Reorder candidates (see the comparison function for
+     * the details on the heuristics) */
+    qsort(cndts, nr_cndts, sizeof(cndts[0]), cmpf);
+}
+
+/*
+ * The NUMA placement candidates are reordered according to the following
+ * heuristics:
+ *  - candidates involving fewer nodes come first. In case two (or
+ *    more) candidates span the same number of nodes,
+ *  - candidates with greater amount of free memory come first. In
+ *    case two (or more) candidates differ in their amount of free
+ *    memory by less than 10%,
+ *  - candidates with fewer domains insisting on them at the time of
+ *    this call come first.
+ */
+static int numa_cmpf(const void *v1, const void *v2)
+{
+    const libxl__numa_candidate *c1 = (const libxl__numa_candidate*) v1;
+    const libxl__numa_candidate *c2 = (const libxl__numa_candidate*) v2;
+    double mem_diff = labs(c1->free_memkb - c2->free_memkb);
+    double mem_avg = (c1->free_memkb + c2->free_memkb) / 2.0;
+
+    if (c1->nr_nodes != c2->nr_nodes)
+        return c1->nr_nodes - c2->nr_nodes;
+
+    if ((mem_diff / mem_avg) * 100.0 < 10.0 &&
+        c1->nr_domains != c2->nr_domains)
+        return c1->nr_domains - c2->nr_domains;
+
+    return c2->free_memkb - c1->free_memkb;
+}
+
+/* The actual automatic NUMA placement routine */
+static int numa_place_domain(libxl__gc *gc, libxl_domain_build_info *info)
+{
+    int nr_candidates = 0;
+    libxl__numa_candidate *candidates = NULL;
+    libxl_bitmap candidate_nodemap;
+    libxl_cpupoolinfo *pinfo;
+    int nr_pools, rc = 0;
+    uint32_t memkb;
+
+    /* First of all, if cpupools are in use, better not to mess with them */
+    pinfo = libxl_list_cpupool(CTX, &nr_pools);
+    if (!pinfo)
+        return ERROR_FAIL;
+    if (nr_pools > 1) {
+        LOG(NOTICE, "skipping NUMA placement as cpupools are in use");
+        goto out;
+    }
+
+    rc = libxl_domain_need_memory(CTX, info, &memkb);
+    if (rc)
+        goto out;
+    if (libxl_node_bitmap_alloc(CTX, &candidate_nodemap)) {
+        rc = ERROR_FAIL;
+        goto out;
+    }
+
+    /* Find all the candidates with enough free memory and at least
+     * as much pcpus as the domain has vcpus.  */
+    rc = libxl__get_numa_candidates(gc, memkb, info->max_vcpus, 0, 0,
+                                    &candidates, &nr_candidates);
+    if (rc)
+        goto out;
+
+    LOG(DETAIL, "%d NUMA placement candidates found", nr_candidates);
+
+    /* No suitable placement candidates. We just return without touching the
+     * domain's info->cpumap. It will have affinity with all nodes/cpus. */
+    if (nr_candidates == 0)
+        goto out;
+
+    /* Bring the best candidate in front of the list --> candidates[0] */
+    if (nr_candidates > 1)
+        libxl__sort_numa_candidates(candidates, nr_candidates, numa_cmpf);
+
+    /*
+     * At this point, the first candidate in the array is the one we want.
+     * Go for it by mapping its node map to the domain's info->cpumap.
+     */
+    libxl__numa_candidate_get_nodemap(gc, &candidates[0], &candidate_nodemap);
+    rc = libxl_nodemap_to_cpumap(CTX, &candidate_nodemap, &info->cpumap);
+    if (rc)
+        goto out;
+
+    LOG(DETAIL, "NUMA placement candidate with %d nodes, %d cpus and "
+                "%"PRIu32" KB free selected", candidates[0].nr_nodes,
+                candidates[0].nr_cpus, candidates[0].free_memkb / 1024);
+
+ out:
+    libxl_bitmap_dispose(&candidate_nodemap);
+    libxl_cpupoolinfo_list_free(pinfo, nr_pools);
+    return rc;
+}
+
 int libxl__build_pre(libxl__gc *gc, uint32_t domid,
               libxl_domain_build_info *info, libxl__domain_build_state *state)
 {
@@ -104,7 +557,22 @@ int libxl__build_pre(libxl__gc *gc, uint
     uint32_t rtc_timeoffset;
 
     xc_domain_max_vcpus(ctx->xch, domid, info->max_vcpus);
+
+    /*
+     * Check if the domain has any CPU affinity. If not, try to build up one.
+     * In case numa_place_domain() find at least a suitable candidate, it will
+     * affect info->cpumap accordingly; if it does not, it just leaves it
+     * as it is. This means (unless some weird error manifests) the subsequent
+     * call to libxl_set_vcpuaffinity_all() will do the actual placement,
+     * whatever that turns out to be.
+     */
+    if (libxl_bitmap_is_full(&info->cpumap)) {
+        int rc = numa_place_domain(gc, info);
+        if (rc)
+            return rc;
+    }
     libxl_set_vcpuaffinity_all(ctx, domid, info->max_vcpus, &info->cpumap);
+
     xc_domain_setmaxmem(ctx->xch, domid, info->target_memkb + 
LIBXL_MAXMEM_CONSTANT);
     if (info->type == LIBXL_DOMAIN_TYPE_PV)
         xc_domain_set_memmap_limit(ctx->xch, domid,
diff --git a/tools/libxl/libxl_internal.h b/tools/libxl/libxl_internal.h
--- a/tools/libxl/libxl_internal.h
+++ b/tools/libxl/libxl_internal.h
@@ -2021,6 +2021,134 @@ static inline void libxl__ctx_unlock(lib
 #define CTX_LOCK (libxl__ctx_lock(CTX))
 #define CTX_UNLOCK (libxl__ctx_unlock(CTX))
 
+/*
+ * Automatic NUMA placement
+ *
+ * These functions and data structures deal with the initial placement of a
+ * domain onto the host NUMA nodes.
+ *
+ * The key concept here is the one of "numa placement candidate" (or jusr "numa
+ * candidate", "placement candidate" or even "candidate"), which basically is a
+ * set of nodes whose characteristics have been successfully checked against
+ * some specific requirements. More pecisely, a candidate is the nodemap
+ * associated with one of the possible subset of the host NUMA nodes providing
+ * a certain amount of free memory, or a given number of cpus, or even both
+ * (depending in what the caller wants). For convenience of use, some of this
+ * information are stored within the candidate itselfs, instead of always being
+ * dynamically computed. A candidate can consist of just one, up to all the
+ * available nodes on the host (the latter being, of course, not ideal).  For
+ * instance, looking for a numa candidates with 2GB of free memoy means we want
+ * all the possible subsets of the host NUMA nodes with, cumulatively, at least
+ * 2GB of free memory. That could be possible by just using one particular
+ * node, or may require more nodes, depending on the characteristics of the
+ * host, on how many domains have been created already, on how big they are,
+ * etc.
+ *
+ * The intended usage is as follows:
+ *  1. by, fist of all, calling libxl__get_numa_candidates(), and specifying
+ *     the proper constraints to it (e.g., the amount of memory a domain need
+ *     as the minimum amount of free memory fo the candidates) one can build
+ *     up a whole set of suitable placing alternatives for a domain;
+ *  2. after that, one specific candidate should be chosen. That can happen
+ *     by looking at their various characteristics;
+ *  3. the chosen candidate's nodemap should be utilized for computing the
+ *     actual affinity of the domain which, given the curent NUMA support
+ *     in the hypervisor, is what determines the placement of the domain's
+ *     vcpus and memory.
+ *
+ * To make phase 2 even easier, a sorting helper function for the list of
+ * candidates is povided in the form of libxl__sort_numa_candidates(). The only
+ * that is needed is defining a comparison function, containing the chriteria
+ * for deciding, given two candidates, which one is 'better'. Depending on how
+ * the comparison function is defined, the best candidate (where, of course,
+ * best is defined with respect to the heuristics implemented in the comparison
+ * function itself, libxl__numa_candidate_cmpf()) could become the first o the
+ * last element of the list.
+ *
+ * Summarizing, achieving automatic NUMA placement is just a matter of
+ * obtaining the list of suitable placement candidates, perhaps asking for each
+ * of them to provide at least the amount of memory the domain needs. After
+ * that just implement a comparison function by means of the various helpers
+ * retrieving the relevant information about the candidates themselves.
+ * Finally, call the sorting helper function and use the candidate that became
+ * (typically) the first element of the list fo determining the domain's
+ * affinity.
+ */
+
+typedef struct {
+    int nr_cpus, nr_nodes;
+    int nr_domains;
+    uint32_t free_memkb;
+    libxl_bitmap nodemap;
+} libxl__numa_candidate;
+
+/*
+ * This generates the list of NUMA placement candidates satisfying some
+ * specific conditions. If min_nodes and/or max_nodes are not 0, their value is
+ * used to determine the minimum and maximum number of nodes that are allow to
+ * be present in each candidate. If min_nodes and/or max_nodes are 0, the
+ * minimum and maximum number of nodes to be used are automatically selected by
+ * the implementation (and that will likely be just 1 node for the minimum and
+ * the total number of existent nodes for the maximum). Re min_free_memkb and
+ * min_cpu, if not 0, they specify the caller only wants candidates with at
+ * least that amount of free memory and that number of cpus, respectively. If
+ * min_free_memkb and/or min_cpus are 0, the candidates' free memory and number
+ * of cpus won't be checked at all, which means a candidate will always be
+ * considered suitable wrt the specific constraint.  cndts is where the list of
+ * exactly nr_cndts candidates is returned. Note that, in case no candidates
+ * are found at all, the function returns successfully, but with nr_cndts equal
+ * to zero.
+ */
+_hidden int libxl__get_numa_candidates(libxl__gc *gc,
+                                uint32_t min_free_memkb, int min_cpus,
+                                int min_nodes, int max_nodes,
+                                libxl__numa_candidate *cndts[], int *nr_cndts);
+
+/* allocation and deallocation for placement candidates */
+static inline int libxl__numa_candidate_alloc(libxl__gc *gc,
+                                              libxl__numa_candidate *cndt)
+{
+    return libxl_node_bitmap_alloc(CTX, &cndt->nodemap);
+}
+static inline void libxl__numa_candidate_dispose(libxl__numa_candidate *cndt)
+{
+    libxl_bitmap_dispose(&cndt->nodemap);
+}
+static inline void libxl__numacandidate_list_free(libxl__numa_candidate *cndts,
+                                                  int nr_cndts)
+{
+    int i;
+
+    for (i = 0; i < nr_cndts; i++)
+        libxl__numa_candidate_dispose(&cndts[i]);
+    free(cndts);
+}
+
+/* retrieve (in nodemap) the node map associated to placement candidate cndt */
+static inline
+void libxl__numa_candidate_get_nodemap(libxl__gc *gc,
+                                       const libxl__numa_candidate *cndt,
+                                       libxl_bitmap *nodemap)
+{
+    libxl_bitmap_copy(CTX, nodemap, &cndt->nodemap);
+}
+/* set the node map of placement candidate cndt to match nodemap */
+static inline
+void libxl__numa_candidate_put_nodemap(libxl__gc *gc,
+                                       libxl__numa_candidate *cndt,
+                                       const libxl_bitmap *nodemap)
+{
+    libxl_bitmap_copy(CTX, &cndt->nodemap, nodemap);
+}
+
+/* signature for the comparison function between two candidates c1 and c2
+ * (the thid parameter is provided to enable thread safety). */
+typedef int (*libxl__numa_candidate_cmpf)(const void *v1, const void *v2);
+/* sort the list of candidates in cndts (an array with nr_cndts elements in
+ * it) using cmpf for comparing two candidates. Uses libc's qsort(). */
+_hidden void libxl__sort_numa_candidates(libxl__numa_candidate cndts[],
+                                         int nr_cndts,
+                                         libxl__numa_candidate_cmpf cmpf);
 
 /*
  * Inserts "elm_new" into the sorted list "head".

_______________________________________________
Xen-devel mailing list
Xen-devel@xxxxxxxxxxxxx
http://lists.xen.org/xen-devel


 


Rackspace

Lists.xenproject.org is hosted with RackSpace, monitoring our
servers 24x7x365 and backed by RackSpace's Fanatical Support®.