Re: [Xen-devel] [PATCH 05 of 10 [RFC]] xl: Explicit node affinity specification for guests via config file
On 11/04/12 14:17, Dario Faggioli wrote:

Let the user specify the NUMA node affinity of a guest via the
'nodes=' config switch. Explicitly listing the intended target host
nodes is required as of now. A valid setting will directly impact the
node_affinity mask of the guest, i.e., it will change the behaviour of
the low level memory allocator. On the other hand, this commit does
not affect by any means how the guest's vCPUs are scheduled on the
host's pCPUs.

TODO: * Better investigate and test interactions with cpupools.

Signed-off-by: Dario Faggioli <dario.faggioli@xxxxxxxxxx>

I would probably break this into three separate patches, starting with
the hypervisor, then libxc, then libxl, especially since they tend to
have different maintainers and committers.

diff --git a/docs/man/xl.cfg.pod.5 b/docs/man/xl.cfg.pod.5
--- a/docs/man/xl.cfg.pod.5
+++ b/docs/man/xl.cfg.pod.5
@@ -112,6 +112,15 @@ List of which cpus the guest is allowed
 (all vcpus will run on cpus 0,2,3,5), or `cpus=["2", "3"]`
 (all vcpus will run on cpus 2 and 3).
 
+=item B<nodes=[ NODE, NODE, ...]>
+
+List of the NUMA nodes of the host the guest should be considered
+`affine` with. Being affine to a (set of) NUMA node(s) mainly means
+the guest's memory is going to be allocated on those node(s).
+
+A list of nodes should be specified as follows: `nodes=["0", "3"]`
+for the guest to be affine with nodes 0 and 3.
+
 =item B<memory=MBYTES>
 
 Start the guest with MBYTES megabytes of RAM.

Hmm, I think that using "affine" here is technically correct, and is
what one would use if writing a research paper; but it's unusual to
hear the word in more common English; it would be more common to hear
someone describe a VM as "having affinity with". How about something
like this:

"The list of NUMA nodes the domain is considered to have affinity
with. Memory from the guest will be allocated from these nodes."

(Technically you're also not supposed to end a sentence with a
preposition, but I think "...with which the domain is considered to
have affinity" is just too awkward.)

Also, is there a way to specify that the affinity be to all nodes
and/or based on the vcpu mask of the vcpus?
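For concreteness, a guest config fragment using the proposed syntax
would look something like the sketch below. The commented-out "auto"
line is only the suggestion above, not something this patch parses --
the option currently takes an explicit list of node numbers only:

# illustrative config fragment (sketch only)
memory = 2048
vcpus  = 4
cpus   = ["2", "3"]      # vcpu affinity, as today
nodes  = ["0", "3"]      # guest memory allocated on host NUMA nodes 0 and 3
#nodes = "auto"          # hypothetical: let the toolstack pick the nodes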
diff --git a/tools/libxc/xc_domain.c b/tools/libxc/xc_domain.c
--- a/tools/libxc/xc_domain.c
+++ b/tools/libxc/xc_domain.c
@@ -110,6 +110,44 @@ int xc_domain_shutdown(xc_interface *xch
 }
 
+int xc_domain_node_affinity(xc_interface *xch,
+                            uint32_t domid,
+                            xc_nodemap_t node_affinity)
+{
+    DECLARE_DOMCTL;
+    DECLARE_HYPERCALL_BUFFER(uint8_t, local);
+    int ret = -1;
+    int nodesize;
+
+    nodesize = xc_get_nodemap_size(xch);
+    if (!nodesize)
+    {
+        PERROR("Could not get number of nodes");
+        goto out;
+    }
+
+    local = xc_hypercall_buffer_alloc(xch, local, nodesize);
+    if ( local == NULL )
+    {
+        PERROR("Could not allocate memory for domain_node_affinity domctl hypercall");
+        goto out;
+    }
+
+    domctl.cmd = XEN_DOMCTL_node_affinity;
+    domctl.domain = (domid_t)domid;
+
+    memcpy(local, node_affinity, nodesize);
+    set_xen_guest_handle(domctl.u.node_affinity.nodemap.bitmap, local);
+    domctl.u.node_affinity.nodemap.nr_elems = nodesize * 8;
+
+    ret = do_domctl(xch, &domctl);
+
+    xc_hypercall_buffer_free(xch, local);
+
+ out:
+    return ret;
+}
+
 int xc_vcpu_setaffinity(xc_interface *xch,
                         uint32_t domid,
                         int vcpu,
diff --git a/tools/libxc/xenctrl.h b/tools/libxc/xenctrl.h
--- a/tools/libxc/xenctrl.h
+++ b/tools/libxc/xenctrl.h
@@ -520,6 +520,19 @@ int xc_watchdog(xc_interface *xch,
                 uint32_t id,
                 uint32_t timeout);
 
+/**
+ * This function explicitly sets the affinity of a domain with the
+ * host NUMA nodes.
+ *
+ * @parm xch a handle to an open hypervisor interface.
+ * @parm domid the domain id in which vcpus are to be created.
+ * @parm node_affinity the map of the affine nodes.
+ * @return 0 on success, -1 on failure.
+ */
+int xc_domain_node_affinity(xc_interface *xch,
+                            uint32_t domind,
+                            xc_nodemap_t node_affinity);
+

Seems like it would be useful to have both a get and a set.

Hmm -- is there really no libxl_get_vcpuaffinity()?  That seems to be
a missing component...
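To make the get/set point concrete, a getter could presumably mirror
the setter almost line for line. Rough, compile-untested sketch only:
it assumes a hypothetical XEN_DOMCTL_get_node_affinity (or an OUT
direction on the existing domctl), neither of which this patch defines.

int xc_domain_node_getaffinity(xc_interface *xch,
                               uint32_t domid,
                               xc_nodemap_t node_affinity)
{
    DECLARE_DOMCTL;
    DECLARE_HYPERCALL_BUFFER(uint8_t, local);
    int ret = -1;
    int nodesize;

    nodesize = xc_get_nodemap_size(xch);
    if (!nodesize)
    {
        PERROR("Could not get number of nodes");
        goto out;
    }

    local = xc_hypercall_buffer_alloc(xch, local, nodesize);
    if ( local == NULL )
    {
        PERROR("Could not allocate memory for getnodeaffinity domctl hypercall");
        goto out;
    }

    domctl.cmd = XEN_DOMCTL_get_node_affinity;   /* hypothetical subop */
    domctl.domain = (domid_t)domid;
    set_xen_guest_handle(domctl.u.node_affinity.nodemap.bitmap, local);
    domctl.u.node_affinity.nodemap.nr_elems = nodesize * 8;

    ret = do_domctl(xch, &domctl);

    /* On success, copy the nodemap the hypervisor filled in back out. */
    if ( ret == 0 )
        memcpy(node_affinity, local, nodesize);

    xc_hypercall_buffer_free(xch, local);

 out:
    return ret;
}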
 int xc_vcpu_setaffinity(xc_interface *xch,
                         uint32_t domid,
                         int vcpu,
diff --git a/tools/libxl/gentest.py b/tools/libxl/gentest.py
--- a/tools/libxl/gentest.py
+++ b/tools/libxl/gentest.py
@@ -20,7 +20,7 @@ def randomize_case(s):
 def randomize_enum(e):
     return random.choice([v.name for v in e.values])
 
-handcoded = ["libxl_cpumap", "libxl_key_value_list",
+handcoded = ["libxl_cpumap", "libxl_nodemap", "libxl_key_value_list",
              "libxl_cpuid_policy_list", "libxl_file_reference",
              "libxl_string_list"]
 
@@ -119,6 +119,19 @@ static void libxl_cpumap_rand_init(libxl
     }
 }
 
+static void libxl_nodemap_rand_init(libxl_nodemap *nodemap)
+{
+    int i;
+    nodemap->size = rand() % 16;
+    nodemap->map = calloc(nodemap->size, sizeof(*nodemap->map));
+    libxl_for_each_node(i, *nodemap) {
+        if (rand() % 2)
+            libxl_nodemap_set(nodemap, i);
+        else
+            libxl_nodemap_reset(nodemap, i);
+    }
+}
+
 static void libxl_key_value_list_rand_init(libxl_key_value_list *pkvl)
 {
     int i, nr_kvp = rand() % 16;
diff --git a/tools/libxl/libxl.c b/tools/libxl/libxl.c
--- a/tools/libxl/libxl.c
+++ b/tools/libxl/libxl.c
@@ -3082,6 +3082,16 @@ int libxl_set_vcpuaffinity_all(libxl_ctx
     return rc;
 }
 
+int libxl_set_node_affinity(libxl_ctx *ctx, uint32_t domid,
+                            libxl_nodemap *nodemap)
+{
+    if (xc_domain_node_affinity(ctx->xch, domid, nodemap->map)) {
+        LIBXL__LOG_ERRNO(ctx, LIBXL__LOG_ERROR, "setting node affinity");
+        return ERROR_FAIL;
+    }
+    return 0;
+}
+
 int libxl_set_vcpuonline(libxl_ctx *ctx, uint32_t domid, libxl_cpumap *cpumap)
 {
     GC_INIT(ctx);
diff --git a/tools/libxl/libxl.h b/tools/libxl/libxl.h
--- a/tools/libxl/libxl.h
+++ b/tools/libxl/libxl.h
@@ -727,6 +727,8 @@ int libxl_set_vcpuaffinity(libxl_ctx *ct
                            libxl_cpumap *cpumap);
 int libxl_set_vcpuaffinity_all(libxl_ctx *ctx, uint32_t domid,
                                unsigned int max_vcpus, libxl_cpumap *cpumap);
+int libxl_set_node_affinity(libxl_ctx *ctx, uint32_t domid,
+                            libxl_nodemap *nodemap);
 int libxl_set_vcpuonline(libxl_ctx *ctx, uint32_t domid, libxl_cpumap *cpumap);
 
 libxl_scheduler libxl_get_scheduler(libxl_ctx *ctx);
diff --git a/tools/libxl/libxl_create.c b/tools/libxl/libxl_create.c
--- a/tools/libxl/libxl_create.c
+++ b/tools/libxl/libxl_create.c
@@ -121,6 +121,12 @@ int libxl__domain_build_info_setdefault(
         libxl_cpumap_set_any(&b_info->cpumap);
     }
 
+    if (!b_info->nodemap.size) {
+        if (libxl_nodemap_alloc(CTX, &b_info->nodemap))
+            return ERROR_NOMEM;
+        libxl_nodemap_set_none(&b_info->nodemap);
+    }
+
     if (b_info->max_memkb == LIBXL_MEMKB_DEFAULT)
         b_info->max_memkb = 32 * 1024;
     if (b_info->target_memkb == LIBXL_MEMKB_DEFAULT)
diff --git a/tools/libxl/libxl_dom.c b/tools/libxl/libxl_dom.c
--- a/tools/libxl/libxl_dom.c
+++ b/tools/libxl/libxl_dom.c
@@ -67,6 +67,7 @@ int libxl__build_pre(libxl__gc *gc, uint
     char *xs_domid, *con_domid;
     xc_domain_max_vcpus(ctx->xch, domid, info->max_vcpus);
     libxl_set_vcpuaffinity_all(ctx, domid, info->max_vcpus, &info->cpumap);
+    libxl_set_node_affinity(ctx, domid, &info->nodemap);
     xc_domain_setmaxmem(ctx->xch, domid, info->target_memkb +
                         LIBXL_MAXMEM_CONSTANT);
     if (info->type == LIBXL_DOMAIN_TYPE_PV)
         xc_domain_set_memmap_limit(ctx->xch, domid,
diff --git a/tools/libxl/libxl_json.c b/tools/libxl/libxl_json.c
--- a/tools/libxl/libxl_json.c
+++ b/tools/libxl/libxl_json.c
@@ -119,6 +119,26 @@ out:
     return s;
 }
 
+yajl_gen_status libxl_nodemap_gen_json(yajl_gen hand,
+                                       libxl_nodemap *nodemap)
+{
+    yajl_gen_status s;
+    int i;
+
+    s = yajl_gen_array_open(hand);
+    if (s != yajl_gen_status_ok) goto out;
+
+    libxl_for_each_node(i, *nodemap) {
+        if (libxl_nodemap_test(nodemap, i)) {
+            s = yajl_gen_integer(hand, i);
+            if (s != yajl_gen_status_ok) goto out;
+        }
+    }
+    s = yajl_gen_array_close(hand);
+out:
+    return s;
+}
+
 yajl_gen_status libxl_key_value_list_gen_json(yajl_gen hand,
                                               libxl_key_value_list *pkvl)
 {
diff --git a/tools/libxl/libxl_json.h b/tools/libxl/libxl_json.h
--- a/tools/libxl/libxl_json.h
+++ b/tools/libxl/libxl_json.h
@@ -27,6 +27,7 @@ yajl_gen_status libxl_domid_gen_json(yaj
 yajl_gen_status libxl_uuid_gen_json(yajl_gen hand, libxl_uuid *p);
 yajl_gen_status libxl_mac_gen_json(yajl_gen hand, libxl_mac *p);
 yajl_gen_status libxl_cpumap_gen_json(yajl_gen hand, libxl_cpumap *p);
+yajl_gen_status libxl_nodemap_gen_json(yajl_gen hand, libxl_nodemap *p);
 yajl_gen_status libxl_cpuid_policy_list_gen_json(yajl_gen hand,
                                                  libxl_cpuid_policy_list *p);
 yajl_gen_status libxl_string_list_gen_json(yajl_gen hand, libxl_string_list *p);
diff --git a/tools/libxl/libxl_types.idl b/tools/libxl/libxl_types.idl
--- a/tools/libxl/libxl_types.idl
+++ b/tools/libxl/libxl_types.idl
@@ -11,6 +11,7 @@ libxl_domid = Builtin("domid", json_fn =
 libxl_uuid = Builtin("uuid", passby=PASS_BY_REFERENCE)
 libxl_mac = Builtin("mac", passby=PASS_BY_REFERENCE)
 libxl_cpumap = Builtin("cpumap", dispose_fn="libxl_cpumap_dispose", passby=PASS_BY_REFERENCE)
+libxl_nodemap = Builtin("nodemap", dispose_fn="libxl_nodemap_dispose", passby=PASS_BY_REFERENCE)
 libxl_cpuid_policy_list = Builtin("cpuid_policy_list", dispose_fn="libxl_cpuid_dispose", passby=PASS_BY_REFERENCE)
 libxl_string_list = Builtin("string_list", dispose_fn="libxl_string_list_dispose", passby=PASS_BY_REFERENCE)
 
@@ -233,6 +234,7 @@ libxl_domain_build_info = Struct("domain
     ("max_vcpus",       integer),
     ("cur_vcpus",       integer),
     ("cpumap",          libxl_cpumap),
+    ("nodemap",         libxl_nodemap),
     ("tsc_mode",        libxl_tsc_mode),
     ("max_memkb",       MemKB),
     ("target_memkb",    MemKB),
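As a quick sanity check on the API shape, this is roughly how I would
expect a caller outside of domain creation to drive the new entry
point. It is only a sketch built from the functions this series adds
or uses (libxl_nodemap_alloc/_set/_set_none/_dispose and
libxl_set_node_affinity), so treat the exact names as provisional:

/* Illustrative sketch only: restrict an existing domain's memory
 * allocations to host NUMA nodes 0 and 3. */
static int pin_domain_memory_to_nodes(libxl_ctx *ctx, uint32_t domid)
{
    libxl_nodemap nodemap;
    int rc;

    if (libxl_nodemap_alloc(ctx, &nodemap))
        return ERROR_NOMEM;

    libxl_nodemap_set_none(&nodemap);
    libxl_nodemap_set(&nodemap, 0);
    libxl_nodemap_set(&nodemap, 3);

    /* Ends up issuing XEN_DOMCTL_node_affinity via xc_domain_node_affinity(). */
    rc = libxl_set_node_affinity(ctx, domid, &nodemap);

    libxl_nodemap_dispose(&nodemap);
    return rc;
}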
diff --git a/tools/libxl/xl_cmdimpl.c b/tools/libxl/xl_cmdimpl.c
--- a/tools/libxl/xl_cmdimpl.c
+++ b/tools/libxl/xl_cmdimpl.c
@@ -515,7 +515,7 @@ static void parse_config_data(const char
     const char *buf;
     long l;
     XLU_Config *config;
-    XLU_ConfigList *cpus, *vbds, *nics, *pcis, *cvfbs, *cpuids;
+    XLU_ConfigList *cpus, *nodes, *vbds, *nics, *pcis, *cvfbs, *cpuids;
     int pci_power_mgmt = 0;
     int pci_msitranslate = 1;
     int pci_permissive = 0;
@@ -628,6 +628,39 @@ static void parse_config_data(const char
         free(buf2);
     }
 
+    if (!xlu_cfg_get_list (config, "nodes", &nodes, 0, 0)) {
+        int i, n_nodes = 0;
+
+        if (libxl_nodemap_alloc(ctx, &b_info->nodemap)) {
+            fprintf(stderr, "Unable to allocate nodemap\n");
+            exit(1);
+        }
+
+        libxl_nodemap_set_none(&b_info->nodemap);
+        while ((buf = xlu_cfg_get_listitem(nodes, n_nodes)) != NULL) {
+            i = atoi(buf);
+            if (!libxl_nodemap_node_valid(&b_info->nodemap, i)) {
+                fprintf(stderr, "node %d illegal\n", i);
+                exit(1);
+            }
+            libxl_nodemap_set(&b_info->nodemap, i);
+            n_nodes++;
+        }
+
+        if (n_nodes == 0)
+            fprintf(stderr, "No valid node found: no affinity will be set\n");
+    }
+    else {
+        /*
+         * XXX What would a sane default be? I think doing nothing
+         *     (i.e., relying on cpu-affinity/cpupool ==> the current
+         *     behavior) should be just fine.
+         *     That would mean we're saying to the user "if you want us
+         *     to take care of NUMA, please tell us, maybe just putting
+         *     'nodes=auto', but tell us... otherwise we do as usual".
+         */
+    }
+
     if (!xlu_cfg_get_long (config, "memory", &l, 0)) {
         b_info->max_memkb = l * 1024;
         b_info->target_memkb = b_info->max_memkb;

I think that for this patch, doing nothing is fine (which means
removing the whole else clause).  But once we have the auto placement,
I think that "nodes=auto" should be the default.

diff --git a/tools/python/xen/lowlevel/xl/xl.c b/tools/python/xen/lowlevel/xl/xl.c
--- a/tools/python/xen/lowlevel/xl/xl.c
+++ b/tools/python/xen/lowlevel/xl/xl.c
@@ -243,6 +243,18 @@ int attrib__libxl_cpumap_set(PyObject *v
     return 0;
 }
 
+int attrib__libxl_nodemap_set(PyObject *v, libxl_nodemap *pptr)
+{
+    int i;
+    long node;
+
+    for (i = 0; i < PyList_Size(v); i++) {
+        node = PyInt_AsLong(PyList_GetItem(v, i));
+        libxl_nodemap_set(pptr, node);
+    }
+    return 0;
+}
+
 int attrib__libxl_file_reference_set(PyObject *v, libxl_file_reference *pptr)
 {
     return genwrap__string_set(v, &pptr->path);
diff --git a/xen/common/domain.c b/xen/common/domain.c
--- a/xen/common/domain.c
+++ b/xen/common/domain.c
@@ -222,6 +222,7 @@ struct domain *domain_create(
 
     spin_lock_init(&d->node_affinity_lock);
     d->node_affinity = NODE_MASK_ALL;
+    d->has_node_affinity = 0;
 
     spin_lock_init(&d->shutdown_lock);
     d->shutdown_code = -1;
@@ -337,7 +338,7 @@ void domain_update_node_affinity(struct
 {
     cpumask_var_t cpumask;
     cpumask_var_t online_affinity;
-    const cpumask_t *online;
+    const cpumask_t *online = cpupool_online_cpumask(d->cpupool);
     nodemask_t nodemask = NODE_MASK_NONE;
     struct vcpu *v;
     unsigned int node;
@@ -350,9 +351,22 @@ void domain_update_node_affinity(struct
         return;
     }
 
-    online = cpupool_online_cpumask(d->cpupool);
+    spin_lock(&d->node_affinity_lock);
 
-    spin_lock(&d->node_affinity_lock);
+    /*
+     * If someone explicitly told us our NUMA affinity, avoid messing
+     * that up. Notice, however, that vcpu affinity is still what
+     * drives all scheduling decisions.
+     *
+     * XXX I'm quite sure this is fine wrt to the various v->cpu_affinity
+     *     (at least, that was the intended design!). Could it be useful
+     *     to cross-check d->node_affinity against `online`? The basic
+     *     idea here is "Hey, if you shoot yourself in the foot... You've
+     *     shot in the foot!", but, you know...
+     */
+    if ( d->has_node_affinity )
+        goto out;
+
     for_each_vcpu ( d, v )
     {
@@ -365,6 +379,8 @@ void domain_update_node_affinity(struct
         node_set(node, nodemask);
 
     d->node_affinity = nodemask;
+
+out:
     spin_unlock(&d->node_affinity_lock);
 
     free_cpumask_var(online_affinity);
@@ -372,6 +388,31 @@ void domain_update_node_affinity(struct
 }
 
+int domain_set_node_affinity(struct domain *d, const nodemask_t *affinity)
+{
+    spin_lock(&d->node_affinity_lock);
+
+    /*
+     * Being/becoming affine to _no_ nodes is not going to work,
+     * let's take it as the `reset node affinity` command.
+     */
+    if ( nodes_empty(*affinity) )
+    {
+        d->has_node_affinity = 0;
+        spin_unlock(&d->node_affinity_lock);
+        domain_update_node_affinity(d);
+        return 0;
+    }
+
+    d->has_node_affinity = 1;
+    d->node_affinity = *affinity;
+
+    spin_unlock(&d->node_affinity_lock);
+
+    return 0;
+}
+
+
 struct domain *get_domain_by_id(domid_t dom)
 {
     struct domain *d;
diff --git a/xen/common/domctl.c b/xen/common/domctl.c
--- a/xen/common/domctl.c
+++ b/xen/common/domctl.c
@@ -621,6 +621,27 @@ long do_domctl(XEN_GUEST_HANDLE(xen_domc
     }
     break;
 
+    case XEN_DOMCTL_node_affinity:
+    {
+        domid_t dom = op->domain;
+        struct domain *d = rcu_lock_domain_by_id(dom);
+        nodemask_t new_affinity;
+
+        ret = -ESRCH;
+        if ( d == NULL )
+            break;
+
+        /* XXX We need an xsm_* for this I guess, right? */

Yes. :-)

+
+        ret = xenctl_nodemap_to_nodemask(&new_affinity,
+                                         &op->u.node_affinity.nodemap);
+        if ( !ret )
+            ret = domain_set_node_affinity(d, &new_affinity);
+
+        rcu_unlock_domain(d);
+    }
+    break;
+
     case XEN_DOMCTL_setvcpuaffinity:
     case XEN_DOMCTL_getvcpuaffinity:
     {
diff --git a/xen/common/keyhandler.c b/xen/common/keyhandler.c
--- a/xen/common/keyhandler.c
+++ b/xen/common/keyhandler.c
@@ -217,6 +217,14 @@ static void cpuset_print(char *set, int
     *set++ = '\0';
 }
 
+static void nodeset_print(char *set, int size, const nodemask_t *mask)
+{
+    *set++ = '[';
+    set += nodelist_scnprintf(set, size-2, mask);
+    *set++ = ']';
+    *set++ = '\0';
+}
+
 static void periodic_timer_print(char *str, int size, uint64_t period)
 {
     if ( period == 0 )
@@ -272,6 +280,9 @@ static void dump_domains(unsigned char k
 
         dump_pageframe_info(d);
 
+        nodeset_print(tmpstr, sizeof(tmpstr), &d->node_affinity);
+        printk("NODE affinity for domain %d: %s\n", d->domain_id, tmpstr);
+
         printk("VCPU information and callbacks for domain %u:\n",
                d->domain_id);
         for_each_vcpu ( d, v )
diff --git a/xen/include/public/domctl.h b/xen/include/public/domctl.h
--- a/xen/include/public/domctl.h
+++ b/xen/include/public/domctl.h
@@ -278,6 +278,15 @@ typedef struct xen_domctl_getvcpuinfo xe
 DEFINE_XEN_GUEST_HANDLE(xen_domctl_getvcpuinfo_t);
 
 
+/* Set the NUMA node(s) with which the guest is `affine`. */
+/* XEN_DOMCTL_node_affinity */
+struct xen_domctl_node_affinity {
+    struct xenctl_map nodemap;  /* IN */
+};
+typedef struct xen_domctl_node_affinity xen_domctl_node_affinity_t;
+DEFINE_XEN_GUEST_HANDLE(xen_domctl_node_affinity_t);
+
+
 /* Get/set which physical cpus a vcpu can execute on.
  */
 /* XEN_DOMCTL_setvcpuaffinity */
 /* XEN_DOMCTL_getvcpuaffinity */
@@ -914,6 +923,7 @@ struct xen_domctl {
 #define XEN_DOMCTL_set_access_required           64
 #define XEN_DOMCTL_audit_p2m                     65
 #define XEN_DOMCTL_set_virq_handler              66
+#define XEN_DOMCTL_node_affinity                 67
 #define XEN_DOMCTL_gdbsx_guestmemio            1000
 #define XEN_DOMCTL_gdbsx_pausevcpu             1001
 #define XEN_DOMCTL_gdbsx_unpausevcpu           1002
@@ -927,6 +937,7 @@ struct xen_domctl {
         struct xen_domctl_getpageframeinfo getpageframeinfo;
         struct xen_domctl_getpageframeinfo2 getpageframeinfo2;
         struct xen_domctl_getpageframeinfo3 getpageframeinfo3;
+        struct xen_domctl_node_affinity     node_affinity;
         struct xen_domctl_vcpuaffinity      vcpuaffinity;
         struct xen_domctl_shadow_op         shadow_op;
         struct xen_domctl_max_mem           max_mem;
diff --git a/xen/include/xen/nodemask.h b/xen/include/xen/nodemask.h
--- a/xen/include/xen/nodemask.h
+++ b/xen/include/xen/nodemask.h
@@ -8,8 +8,9 @@
  * See detailed comments in the file linux/bitmap.h describing the
  * data type on which these nodemasks are based.
  *
- * For details of nodemask_scnprintf() and nodemask_parse(),
- * see bitmap_scnprintf() and bitmap_parse() in lib/bitmap.c.
+ * For details of nodemask_scnprintf(), nodelist_scnpintf() and
+ * nodemask_parse(), see bitmap_scnprintf() and bitmap_parse()
+ * in lib/bitmap.c.
  *
  * The available nodemask operations are:
  *
@@ -48,6 +49,7 @@
  * unsigned long *nodes_addr(mask)      Array of unsigned long's in mask
  *
  * int nodemask_scnprintf(buf, len, mask) Format nodemask for printing
+ * int nodelist_scnprintf(buf, len, mask) Format nodemask as a list for printing
  * int nodemask_parse(ubuf, ulen, mask) Parse ascii string as nodemask
  *
  * for_each_node_mask(node, mask)       for-loop node over mask
@@ -280,6 +282,14 @@ static inline int __first_unset_node(con
 
 #define nodes_addr(src) ((src).bits)
 
+#define nodelist_scnprintf(buf, len, src) \
+        __nodelist_scnprintf((buf), (len), (src), MAX_NUMNODES)
+static inline int __nodelist_scnprintf(char *buf, int len,
+                                       const nodemask_t *srcp, int nbits)
+{
+    return bitmap_scnlistprintf(buf, len, srcp->bits, nbits);
+}
+
 #if 0
 #define nodemask_scnprintf(buf, len, src) \
         __nodemask_scnprintf((buf), (len), &(src), MAX_NUMNODES)
diff --git a/xen/include/xen/sched.h b/xen/include/xen/sched.h
--- a/xen/include/xen/sched.h
+++ b/xen/include/xen/sched.h
@@ -346,8 +346,12 @@ struct domain
     /* Various mem_events */
     struct mem_event_per_domain *mem_event;
 
-    /* Currently computed from union of all vcpu cpu-affinity masks. */
+    /*
+     * Can be specified by the user. If that is not the case, it is
+     * computed from the union of all the vcpu cpu-affinity masks.
+     */
     nodemask_t node_affinity;
+    int has_node_affinity;
     unsigned int last_alloc_node;
     spinlock_t node_affinity_lock;
 };

There's something that seems a bit clunky about this; but I can't
really think of a better way.  At the very least I'd change the name
of the element here to something more descriptive; perhaps
"auto_node_affinity" (which would invert the meaning)?

@@ -416,6 +420,7 @@ static inline void get_knownalive_domain
     ASSERT(!(atomic_read(&d->refcnt) & DOMAIN_DESTROYED));
 }
 
+int domain_set_node_affinity(struct domain *d, const nodemask_t *affinity);
 void domain_update_node_affinity(struct domain *d);
 
 struct domain *domain_create(

_______________________________________________
Xen-devel mailing list
Xen-devel@xxxxxxxxxxxxx
http://lists.xen.org/xen-devel