[Xen-devel] [PATCH 08 of 10 [RFC]] xl: Introduce First Fit memory-wise placement of guests on nodes
Allow the user to ask for automatic placement of a new domain onto the NUMA nodes of the host. The most important consideration, with respect to this, is definitely how much free memory we have on the various nodes vs. how much memory the domain needs, and this is exactly what is used to drive the algorithm's decisions. More refinement, e.g., adding more factors, like the actual load on the CPUs, to the "equation", can easily be added later.

The placement logic is very simple and uses the First Fit algorithm. This basically means the domain is put on the first node (or, perhaps, set of nodes) with enough free memory. A form of `sanity check' is performed after the memory-wise placement, to make sure the domain has at least as many pCPUs available as it has vCPUs; if that is not the case, more nodes are added to its affinity map.

The user can ask for fully automatic placement, in which case the smallest possible number of nodes will be used, or provide some hints, such as how many nodes the domain should be affine to. The user can also ask for a more specific placement (an explicit list of nodes), and the algorithm will honour that, if possible, or just fail. Finally, if the user doesn't say anything about node affinity, but the domain has some vCPU affinity, the algorithm will use that information as a starting point for placing it.

TODO:

 * As usual, the relationship with cpupools should be better defined and (then) checked.

 * This only takes memory into account, ignoring how busy the CPUs of the various nodes are. Heuristics for taking that into account too need to be thought about, and proper support for gathering the data they need (e.g., stats about load on pCPUs?) implemented.

 * This just ignores node distances, as exporting such info via libxl will be better done once we have arrays in the IDL; however, the spots where such information should be considered are clearly identified and marked, so it should be trivial to put it in place once we have the proper backing.

 * Checking for free memory in `xl' and then allocating it to domains in Xen is prone to race conditions. Domain creation should be fine, given the current --very coarse-- locking discipline implemented in `xl' itself, but, for instance, nothing protects creation and ballooning from running into each other. This affects the present code as well (freemem() already checks for free memory in the creation process), but can get particularly nasty with these changes applied. It needs to be solved somehow... Solutions are under consideration.

 * Benchmarks, e.g., something like putting a bunch of VMs on the "auto" placement policy and checking for some kind of aggregate throughput (I'm working on this, in order to have it happen by means of the same infrastructure I used for producing the results referred to in the #0 mail).

Signed-off-by: Dario Faggioli <dario.faggioli@xxxxxxxxxx>

add-firstfit-automatic-domain-placement.patch

diff --git a/docs/man/xl.cfg.pod.5 b/docs/man/xl.cfg.pod.5
--- a/docs/man/xl.cfg.pod.5
+++ b/docs/man/xl.cfg.pod.5
@@ -122,6 +122,36 @@ vcpus on the cpus of the affine node(s).
 A list of nodes should be specified as follows: `nodes=["0", "3"]`
 for the guest to be affine with nodes 0 and 3.
 
+It is also possible to ask for automatic placement of the guest on
+host nodes by either using `nodes="auto"` or just putting the number
+of nodes the guest should span, e.g., `nodes=2`. In the former case,
+xl will try to make the guest affine with the lowest possible number
+of nodes, in the latter it will try to make it affine to the provided
+number of them.
+
+=item B<nodes_policy="NODE-POLICY">
+
+Allows for better specifying how to automatically set the guest NUMA
+affinity. Available C<NODE-POLICY>-ies are:
+
+=over 4
+
+=item B<auto>
+
+use a not better specified (xl-implementation dependent) algorithm
+to try automatically fitting the guest on the host's NUMA nodes
+
+=item B<ffit>
+
+use the First Fit algorithm to try automatically fitting the
+guest on the host's NUMA nodes
+
+=back
+
+This option interacts with `nodes=` such that, if the number of
+nodes to use is specified there (e.g., `nodes=2`), xl will use
+the specified policy to try fitting the guest on that exact
+number of NUMA nodes of the host.
 
 =item B<memory=MBYTES>
 
diff --git a/tools/libxl/libxl_utils.c b/tools/libxl/libxl_utils.c
--- a/tools/libxl/libxl_utils.c
+++ b/tools/libxl/libxl_utils.c
@@ -454,6 +454,15 @@ int libxl_map_alloc(libxl_ctx *ctx, stru
     return 0;
 }
 
+static void libxl_map_copy(/*XXX libxl_ctx *ctx, */ struct libxl_map *dptr,
+                           const struct libxl_map *sptr)
+{
+    int sz;
+
+    sz = dptr->size = sptr->size;
+    memcpy(dptr->map, sptr->map, sz * sizeof(*dptr->map));
+}
+
 int libxl_map_test(struct libxl_map *map, int elem)
 {
     if (elem >= map->size * 8)
@@ -497,6 +506,12 @@ int libxl_nodemap_alloc(libxl_ctx *ctx,
     return libxl_map_alloc(ctx, nodemap, max_nodes);
 }
 
+void libxl_nodemap_copy(/*XXX libxl_ctx *ctx, */ libxl_nodemap *dst,
+                        const libxl_nodemap *src)
+{
+    libxl_map_copy(/*XXX ctx, */ dst, src);
+}
+
 int libxl_get_max_cpus(libxl_ctx *ctx)
 {
     return xc_get_max_cpus(ctx->xch);
diff --git a/tools/libxl/libxl_utils.h b/tools/libxl/libxl_utils.h
--- a/tools/libxl/libxl_utils.h
+++ b/tools/libxl/libxl_utils.h
@@ -115,6 +115,8 @@ static inline int libxl_cpumap_cpu_valid
         if (libxl_cpumap_test(&(m), v))
 
 int libxl_nodemap_alloc(libxl_ctx *ctx, libxl_nodemap *nodemap);
+void libxl_nodemap_copy(/*XXX libxl_ctx *ctx, */ libxl_nodemap *dst,
+                        const libxl_nodemap *src);
 static inline int libxl_nodemap_test(libxl_nodemap *nodemap, int node)
 {
     return libxl_map_test(nodemap, node);
diff --git a/tools/libxl/xl_cmdimpl.c b/tools/libxl/xl_cmdimpl.c
--- a/tools/libxl/xl_cmdimpl.c
+++ b/tools/libxl/xl_cmdimpl.c
@@ -113,6 +113,55 @@ static const char *action_on_shutdown_na
     [LIBXL_ACTION_ON_SHUTDOWN_COREDUMP_RESTART] = "coredump-restart",
 };
 
+/* Automatic placement policies symbols, with the following meaning:
+ *  NONE   : no automatic placement at all;
+ *  STATIC : explicit nodes specification;
+ *  FFIT   : use the First Fit algorithm for automatic placement;
+ *  AUTO   : a not better specified automatic placement, likely
+ *           to be implemented as a short circuit to (one of)
+ *           the above(s).
+ */
+#define NODES_POLICY_NONE    0
+#define NODES_POLICY_STATIC  1
+#define NODES_POLICY_FFIT    2
+#define NODES_POLICY_AUTO    3
+
+/* Default behaviour wrt automatic domain placement on nodes,
+ * maximum number of attempts made for applying the desired
+ * placement policy and parameters, and of the same for attempts
+ * made for having enough CPUs available to the domain.
+ */
+#define NODES_POLICY_DEFAULT        NODES_POLICY_NONE
+#define NODES_POLICY_FIT_ATTEMPTS   5
+#define NODES_POLICY_VCPU_ATTEMPTS  5
+
+/* Whether, in case it is not possible to fit a domain on
+ * the requested nodes, we should give up (_NO_RETRY), try
+ * to increment (_RETRY_INC), decrement (_RETRY_DEC) or both
+ * (_RETRY_BOTH) the number of nodes.
+ *
+ * XXX Maybe the actual value of _INC and _DEC, as well as
+ *     _DEFAULT and _*_ATTEMPTS above, is something we want
+ *     to be configurable, e.g., in xl.conf or alike?
+ */
+#define NODES_POLICY_NO_RETRY   ( 0)
+#define NODES_POLICY_RETRY_INC  (+1)
+#define NODES_POLICY_RETRY_DEC  (-1)
+#define NODES_POLICY_RETRY_BOTH (INT_MAX)
+
+static const char *nodes_policy_names[] = {
+    [NODES_POLICY_NONE]   = "none",
+    [NODES_POLICY_STATIC] = "static",
+    [NODES_POLICY_FFIT]   = "ffit",
+    [NODES_POLICY_AUTO]   = "auto",
+};
+
+/* Store the policy for the domain while parsing */
+static int nodes_policy = NODES_POLICY_DEFAULT;
+
+/* Store the number of nodes to be used while parsing */
+static int num_nodes_policy = 0;
+
 /* Optional data, in order:
  *   4 bytes uint32_t  config file size
  *   n bytes           config file in Unix text file format
@@ -507,6 +556,314 @@ afp_out:
     return rc;
 }
 
+static int nodes_policy_parse(const char *pol, long int *policy)
+{
+    int i;
+    const char *n;
+
+    for (i = 0; i < sizeof(nodes_policy_names) / sizeof(nodes_policy_names[0]); i++) {
+        n = nodes_policy_names[i];
+
+        if (!n) continue;
+
+        if (strcmp(pol, n) == 0) {
+            *policy = i;
+            return 0;
+        }
+    }
+
+    return EINVAL;
+}
+
+static inline void __add_nodes_to_nodemap(libxl_nodemap *nodemap,
+                                          const libxl_numainfo *numa,
+                                          int nr_nodes, int howmany)
+{
+    int i;
+
+    /* XXX Once we have the node-distance information, maybe we can
+     * use it here in trying to set nodes closer to the ones that
+     * are already in the map. */
+    for (i = 0 ; i < nr_nodes && howmany > 0; i++) {
+        if (!libxl_nodemap_test(nodemap, i) && numa[i].size > 0) {
+            libxl_nodemap_set(nodemap, i);
+            howmany--;
+        }
+    }
+}
+
+/*
+ * First Fit automatic placement. Just scan all the nodes in the
+ * provided map (@nodemap) linearly and pick up the first @nodes
+ * that fit the memory requirements (@memb) of the domain.
+ * This also returns the actual number of used nodes (still via
+ * @nodes) and the actual nodes map to be used (still via @nodemap),
+ * as they both can change depending on the retry policy (@retry).
+ */
+static int place_domain_ffit(const libxl_numainfo *numa, uint64_t memb,
+                             int retry, int nr_nodes, int *nodes,
+                             libxl_nodemap *nodemap)
+{
+    uint64_t memb_node;
+    libxl_nodemap suitable_nodes;
+    int attempts = 1, rc = 0;
+    int i, found_nodes;
+
+    /* XXX This is one of the nastiest pieces of code I've
+     * ever written... Mhuahahahah!! */
+    if (retry == NODES_POLICY_RETRY_BOTH) {
+        int dec_nodes;
+        /* Order (_DEC before _INC) should stay like this,
+         * as NODES_POLICY_RETRY_INC can broaden the node map. */
+        retry = NODES_POLICY_RETRY_INC;
+        dec_nodes = *nodes;
+        if (!place_domain_ffit(numa, memb, NODES_POLICY_RETRY_DEC,
+                               nr_nodes, &dec_nodes, nodemap)) {
+            *nodes = dec_nodes;
+            return 0;
+        }
+    }
+
+    if (libxl_nodemap_alloc(ctx, &suitable_nodes) < 0) {
+        fprintf(stderr, "libxl_nodemap_alloc failed\n");
+        return ENOMEM;
+    }
+
+    do {
+        rc = ENOSPC;
+
+        /* Avoid trying to look for a silly number of nodes */
+        if (*nodes <= 0 || *nodes > nr_nodes)
+            break;
+
+        /* Determine how much free memory we want on each of the nodes
+         * that will end up in suitable_nodes. Either adding or ignoring
+         * the modulus of the integer division should be fine (the total
+         * number of nodes should be in the order of tens of them), so
+         * let's add it as it should be safer.
+         */
+        memb_node = (memb / (*nodes)) + (memb % (*nodes));
+
+        /* Let's see if there are enough (valid!) nodes in the map
+         * with such an amount of memory. */
+        found_nodes = 0;
+        libxl_nodemap_set_none(&suitable_nodes);
+        libxl_for_each_set_node(i, *nodemap) {
+            if (numa[i].size > 0 && numa[i].free >= memb_node) {
+                libxl_nodemap_set(&suitable_nodes, i);
+                if (++found_nodes >= *nodes)
+                    break;
+            }
+        }
+        if (found_nodes == (*nodes)) {
+            /* We're done, let's commit the resulting nodemap */
+            libxl_nodemap_copy(nodemap, &suitable_nodes);
+            rc = 0;
+            break;
+        }
+
+        /* Apply the asked retry policy for the next round. Notice
+         * that it would be pointless to increase nodes without also
+         * adding some nodes to the map! */
+        *nodes += retry;
+        if (retry == NODES_POLICY_RETRY_INC)
+            __add_nodes_to_nodemap(nodemap, numa, nr_nodes, retry);
+
+        attempts++;
+    } while (retry != NODES_POLICY_NO_RETRY &&
+             attempts < NODES_POLICY_FIT_ATTEMPTS);
+
+    libxl_nodemap_dispose(&suitable_nodes);
+
+    return rc;
+}
+
+/* Try to achieve optimal node-affinity for the domain */
+static int place_domain(libxl_domain_build_info *b_info)
+{
+    uint32_t need_memkb;
+    libxl_numainfo *numa;
+    libxl_cputopology *info;
+    libxl_nodemap new_nodemap;
+    int nr_nodes, nr_cpus;
+    int use_nodes = 0, use_cpus;
+    int attempts = 1, rc = 0;
+    int i, retry_policy;
+
+    if (nodes_policy == NODES_POLICY_NONE ||
+        nodes_policy == NODES_POLICY_STATIC)
+        /* Nothing to be done: if we're here no policy is being set, either
+         * because the domain doesn't want one, or it asked for specific nodes. */
+        return 0;
+
+    rc = libxl_domain_need_memory(ctx, b_info, &need_memkb);
+    if (rc < 0) {
+        fprintf(stderr, "libxl_domain_need_memory failed.\n");
+        goto out;
+    }
+
+    /* We need to know both how much free memory we have on the
+     * nodes, and to what node the various CPUs belong. */
+    numa = libxl_get_numainfo(ctx, &nr_nodes);
+    if (numa == NULL) {
+        fprintf(stderr, "libxl_get_numainfo failed.\n");
+        rc = ENOMEM;
+        goto out;
+    }
+    info = libxl_get_cpu_topology(ctx, &nr_cpus);
+    if (info == NULL) {
+        fprintf(stderr, "libxl_get_topologyinfo failed.\n");
+        rc = ENOMEM;
+        goto out_numa;
+    }
+
+    /* In order not to alter the actual b_info->nodemap during the
+     * process (what if something goes wrong in the middle?) use
+     * this copy of it and only commit the result at the end. */
+    if (libxl_nodemap_alloc(ctx, &new_nodemap) < 0) {
+        fprintf(stderr, "libxl_nodemap_alloc failed\n");
+        goto out_topology;
+    }
+
+    /* If a nodemap has been specified, just comply with that.
+     * OTOH, if there's no nodemap, rely on the cpumap (if any), and
+     * fall back to all nodes if even that is empty. Also consider
+     * all the nodes to be valid if only the number (i.e., _how_
+     * _many_ of them instead of _which_ of them) of nodes the
+     * domain wants is provided.
+     *
+     * Concerning retries, if a specific set of nodes is specified,
+     * just try that and give up if we fail. If, instead, a set of
+     * CPUs onto which to pin the domain is specified, try first
+     * the exact nodes those CPUs belong to, then also try both
+     * a smaller and a bigger number of nodes. The same happens if
+     * we just have the number of nodes to be used. Finally, if
+     * there is no node-affinity, no cpu-pinning and no number of nodes,
+     * start trying with one node and increase at the configured
+     * rate (considering all the nodes to be suitable).
+     *
+     * XXX some sort of libxl_{cpu,node}map_weight can be
+     *     implemented and used here to both speed up and
+     *     restructure this ugly code below!
+     */
+    if (num_nodes_policy != 0) {
+        /* The number of nodes is all we have */
+        retry_policy = NODES_POLICY_RETRY_BOTH;
+        libxl_nodemap_set_any(&new_nodemap);
+        use_nodes = num_nodes_policy;
+    } else {
+        /* Is a list of suitable nodes there? */
+        retry_policy = NODES_POLICY_NO_RETRY;
+        libxl_for_each_set_node(i, b_info->nodemap) {
+            libxl_nodemap_set(&new_nodemap, i);
+            use_nodes++;
+        }
+        if (use_nodes == 0) {
+            int mask_cpus = 0;
+
+            /* No list of nodes, maybe the domain is pinned somewhere? */
+            retry_policy = NODES_POLICY_RETRY_BOTH;
+            libxl_for_each_set_cpu(i, b_info->cpumap) {
+                mask_cpus++;
+                if (!libxl_nodemap_test(&new_nodemap, info[i].node)) {
+                    libxl_nodemap_set(&new_nodemap, info[i].node);
+                    use_nodes++;
+                }
+            }
+            /* All cpus just means no affinity, so reset nodes anyway */
+            if (mask_cpus == nr_cpus)
+                use_nodes = 0;
+        }
+        if (use_nodes == 0) {
+            /* No hints about placement at all */
+            retry_policy = NODES_POLICY_RETRY_INC;
+            libxl_nodemap_set_any(&new_nodemap);
+            use_nodes = 1;
+        }
+    }
+    do {
+        rc = ESRCH;
+
+        if (use_nodes > nr_nodes)
+            break;
+
+        /* Kick off the chosen placement policy.
+         *
+         * XXX This is blind wrt what happens on the CPUs
+         * of the various nodes. Many solutions (most of which
+         * mainly constituted by some heuristics) can be found,
+         * but they all require the hypervisor to provide some
+         * information on how loaded and busy they are, and
+         * that will be material for future work. However, if
+         * there are any thoughts about that, they are welcome
+         * here. */
+        switch(nodes_policy) {
+        case NODES_POLICY_AUTO:
+        case NODES_POLICY_FFIT:
+            rc = place_domain_ffit(numa, (uint64_t) need_memkb * 1024,
+                                   retry_policy, nr_nodes, &use_nodes,
+                                   &new_nodemap);
+            break;
+        }
+
+        if (rc)
+            goto out_nodemap;
+
+        /* Check whether the (set of) node(s) selected so far is able
+         * to provide the domain with enough CPUs for its VCPUs. In
+         * case it is not, respin the whole procedure with one more
+         * node available for it to use.
+         *
+         * XXX This does not work as expected! Point is there are a lot
+         * of entries in info (as it is returned by libxl_get_cpu_topology)
+         * that do not represent an actual physical CPU, they're
+         * just free space in the array, as it ranges from 0 up to
+         * max_cpu_id (nr_cpus here), which is the maximum number of
+         * supported CPUs. For example, on my testbox, the info array has
+         * 64 elements, while I only have 16 CPUs. Moreover, all the
+         * "empty spots" are filled with 0s, including their node field.
+         * This means the cycle below will see all of them belonging to
+         * node #0, which will always have a lot of CPUs (56 here!)
+         * and will thus always (well, mostly!) be suitable for
+         * accommodating a guest, which might not be true!
+         * Of course we can change all this in many ways, I just
+         * would appreciate some feedback on which of those to
+         * go for. :-)
+         */
+        use_cpus = 0;
+        libxl_for_each_cpu(i, b_info->cpumap) {
+            /* We only want CPUs belonging to (valid!) nodes in the mask */
+            if (libxl_nodemap_test(&new_nodemap, info[i].node) &&
+                numa[info[i].node].size > 0) {
+                use_cpus++;
+            }
+        }
+
+        if (use_cpus >= b_info->max_vcpus) {
+            rc = 0;
+            break;
+        }
+        /* Add one more node and retry fitting the domain */
+        __add_nodes_to_nodemap(&new_nodemap, numa, nr_nodes, 1);
+        use_nodes += 1;
+
+        attempts++;
+    } while (attempts <= NODES_POLICY_VCPU_ATTEMPTS);
+
+    /* Things went fine. Commit the map into the domain's build info */
+    if (!rc)
+        libxl_nodemap_copy(&b_info->nodemap, &new_nodemap);
+
+out_nodemap:
+    libxl_nodemap_dispose(&new_nodemap);
+out_topology:
+    libxl_cputopology_list_free(info, nr_cpus);
+out_numa:
+    libxl_numainfo_list_free(numa, nr_nodes);
+out:
+    return rc;
+}
+
 static inline int vcpupin_parse(char *cpu, libxl_cpumap *cpumap)
 {
     return affinity_parse(cpu, cpumap, libxl_get_max_cpus(ctx));
@@ -638,7 +995,7 @@ static void parse_config_data(const char
         free(buf2);
     }
 
-    if (!xlu_cfg_get_list (config, "nodes", &nodes, 0, 0)) {
+    if (!xlu_cfg_get_list (config, "nodes", &nodes, 0, 1)) {
         int i, n_nodes = 0;
 
         if (libxl_nodemap_alloc(ctx, &b_info->nodemap)) {
@@ -646,6 +1003,7 @@ static void parse_config_data(const char
             exit(1);
         }
 
+        nodes_policy = NODES_POLICY_STATIC;
         libxl_nodemap_set_none(&b_info->nodemap);
         while ((buf = xlu_cfg_get_listitem(nodes, n_nodes)) != NULL) {
             i = atoi(buf);
@@ -660,6 +1018,37 @@ static void parse_config_data(const char
         if (n_nodes == 0)
             fprintf(stderr, "No valid node found: no affinity will be set\n");
    }
+    else if (!xlu_cfg_get_long (config, "nodes", &l, 1)) {
+        if (l <= 0) {
+            fprintf(stderr, "Only strictly positive values for \"nodes\"\n");
+            exit(1);
+        }
+
+        if (libxl_nodemap_alloc(ctx, &b_info->nodemap)) {
+            fprintf(stderr, "Unable to allocate nodemap\n");
+            exit(1);
+        }
+
+        libxl_nodemap_set_none(&b_info->nodemap);
+        nodes_policy = NODES_POLICY_AUTO;
+        num_nodes_policy = l;
+    }
+    else if (!xlu_cfg_get_string (config, "nodes", &buf, 1)) {
+        if (strcmp(buf, "auto")) {
+            fprintf(stderr, "ERROR: invalid value \"%s\" for \"nodes\""
+                    " (only \"auto\" supported here)\n", buf);
+            exit(1);
+        }
+
+        if (libxl_nodemap_alloc(ctx, &b_info->nodemap)) {
+            fprintf(stderr, "Unable to allocate nodemap\n");
+            exit(1);
+        }
+
+        /* Automatic placement will take care */
+        libxl_nodemap_set_none(&b_info->nodemap);
+        nodes_policy = NODES_POLICY_AUTO;
+    }
     else {
         /*
          * XXX What would a sane default be? I think doing nothing
@@ -671,6 +1060,32 @@ static void parse_config_data(const char
          */
    }
 
+    if (!xlu_cfg_get_string (config, "nodes_policy", &buf, 0)) {
+        fprintf(stderr, "got a nodes policy string: \"%s\"\n", buf);
+
+        if (nodes_policy_parse(buf, &l)) {
+            fprintf(stderr, "ERROR: invalid value for \"nodes_policy\"\n");
+            exit(1);
+        }
+        if (l == NODES_POLICY_STATIC || l == NODES_POLICY_NONE) {
+            fprintf(stderr, "ERROR: \"%s\" can't be used here\n", buf);
+            exit(1);
+        }
+
+        /* Actually set the policy. If there is no nodemap, build
+         * a new one for the placement to use. Otherwise, don't touch it,
+         * i.e., tell the policy to try to respect that and act on
+         * those nodes only (if possible). */
+        nodes_policy = l;
+        if (!b_info->nodemap.size) {
+            if (libxl_nodemap_alloc(ctx, &b_info->nodemap)) {
+                fprintf(stderr, "Unable to allocate nodemap\n");
+                exit(1);
+            }
+            libxl_nodemap_set_none(&b_info->nodemap);
+        }
+    }
+
     if (!xlu_cfg_get_long (config, "memory", &l, 0)) {
         b_info->max_memkb = l * 1024;
         b_info->target_memkb = b_info->max_memkb;
@@ -1703,6 +2118,16 @@ start:
         goto error_out;
     }
 
+    ret = place_domain(&d_config.b_info);
+    if (ret == ESRCH || ret == ENOSPC) {
+        fprintf(stderr, "failed to place the domain, fallback to all nodes\n");
+        libxl_nodemap_set_any(&d_config.b_info.nodemap);
+    } else if (ret < 0) {
+        fprintf(stderr, "failed to put the domain on the requested node(s)\n");
+        ret = ERROR_FAIL;
+        goto error_out;
+    }
+
     if ( dom_info->console_autoconnect ) {
         cb = autoconnect_console;
     }else{
diff --git a/xen/arch/x86/numa.c b/xen/arch/x86/numa.c
--- a/xen/arch/x86/numa.c
+++ b/xen/arch/x86/numa.c
@@ -365,10 +365,11 @@ static void dump_numa(unsigned char key)
     for_each_online_node(i) {
         paddr_t pa = (paddr_t)(NODE_DATA(i)->node_start_pfn + 1)<< PAGE_SHIFT;
-        printk("idx%d -> NODE%d start->%lu size->%lu\n",
+        printk("idx%d -> NODE%d start->%lu size->%lu free->%lu\n",
                i, NODE_DATA(i)->node_id,
                NODE_DATA(i)->node_start_pfn,
-               NODE_DATA(i)->node_spanned_pages);
+               NODE_DATA(i)->node_spanned_pages,
+               avail_node_heap_pages(i));
         /* sanity check phys_to_nid() */
         printk("phys_to_nid(%"PRIpaddr") -> %d should be %d\n",
                pa, phys_to_nid(pa), NODE_DATA(i)->node_id);
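
For reference, a minimal, purely illustrative guest configuration fragment exercising the options introduced above could look as follows (the name, memory and vcpus values are arbitrary examples, not part of the patch):

    # Ask xl to automatically place the guest on 2 NUMA nodes,
    # selecting the First Fit policy explicitly.
    name   = "numa-guest"
    memory = 4096
    vcpus  = 4
    nodes  = 2
    nodes_policy = "ffit"

    # Alternatively, nodes="auto" lets xl pick the smallest suitable
    # number of nodes itself, while an explicit list such as
    # nodes=["0", "3"] requests a static placement and disables the
    # automatic one.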