
Re: [Xen-devel] [PATCH v3 4/7] libxl: vNUMA supporting interface



On Mon, 2013-11-18 at 15:24 -0500, Elena Ufimtseva wrote:
> Signed-off-by: Elena Ufimtseva <ufimtseva@xxxxxxxxx>
> 
> Changes since v2:
> *   Added vnuma memory pfn alignment which takes into
> account e820_host option and non-contiguous e820 memory map
> in that case;
>
Wow, cool that you got this one working too! ;-P

> diff --git a/tools/libxl/libxl.c b/tools/libxl/libxl.c
> index 0de1112..2bd3653 100644
 
> +int libxl_domain_setvnuma(libxl_ctx *ctx,
> +                            uint32_t domid,
> +                            uint16_t nr_vnodes,
> +                            uint16_t nr_vcpus,
> +                            vmemrange_t *vmemrange,
> +                            unsigned int *vdistance,
> +                            unsigned int *vcpu_to_vnode,
> +                            unsigned int *vnode_to_pnode)
> +{
> +    GC_INIT(ctx);
>
Do you need this? I don't think so: you don't need a gc, and ctx appears
to be enough for you.

> +    int ret;
> +    ret = xc_domain_setvnuma(ctx->xch, domid, nr_vnodes,
> +                                nr_vcpus, vmemrange,
> +                                vdistance, 
> +                                vcpu_to_vnode,
> +                                vnode_to_pnode);
> +    GC_FREE;
> +    return ret;
> +}
> +
libxl functions should always return libxl error codes, such as
ERROR_FAIL, ERROR_INVAL, etc. It's nice to show what actually went wrong
in the xc_* call, but that's usually done like this:

    rc = xc_call();
    if (rc < 0) {
        LIBXL__LOG_ERRNO(ctx, LIBXL__LOG_ERROR, "error xxx");
        return ERROR_FAIL;
    }

On the grounds that libxc would properly set errno and return -1 on
failure.
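
So, in this case, I'd expect the wrapper to look more or less like the
below (just a sketch, without the gc, as said above, and with the error
message of course up to you):

    int libxl_domain_setvnuma(libxl_ctx *ctx, uint32_t domid,
                              uint16_t nr_vnodes, uint16_t nr_vcpus,
                              vmemrange_t *vmemrange, unsigned int *vdistance,
                              unsigned int *vcpu_to_vnode,
                              unsigned int *vnode_to_pnode)
    {
        int ret;

        ret = xc_domain_setvnuma(ctx->xch, domid, nr_vnodes, nr_vcpus,
                                 vmemrange, vdistance,
                                 vcpu_to_vnode, vnode_to_pnode);
        if (ret < 0) {
            /* libxc sets errno; log it and return a libxl error code */
            LIBXL__LOG_ERRNO(ctx, LIBXL__LOG_ERROR,
                             "xc_domain_setvnuma failed");
            return ERROR_FAIL;
        }
        return 0;
    }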


> diff --git a/tools/libxl/libxl_arch.h b/tools/libxl/libxl_arch.h
> index abe6685..b95abab 100644

>  int libxl__arch_domain_create(libxl__gc *gc, libxl_domain_config *d_config,
>                 uint32_t domid);
>  
> +int libxl__vnuma_align_mem(libxl__gc *gc,
> +                            uint32_t domid,
> +                            struct libxl_domain_build_info *b_info,
> +                            vmemrange_t *memblks); 
> +
> +unsigned long e820_memory_hole_size(unsigned long start,
> +                                    unsigned long end,
> +                                    struct e820entry e820[],
> +                                    int nr);
> +
> +unsigned int libxl__vnodemap_is_usable(libxl__gc *gc,
> +                                libxl_domain_build_info *info);
> +
>
I think we'd better have a libxl maintainer comment on this but, FWIW,
I don't think I'd put these functions in libxl_arch.h.

We do have a libxl_numa.c file to host their implementation (rather
than putting it in libxl_dom.c) and, as libxl internal functions, their
prototypes are probably fine in libxl_internal.h (unless there is some
specific reason for having them here which I'm overlooking).

As for e820_memory_hole_size(), it looks like it's only called from
inside libxl__vnuma_align_mem(), which means it can just be a static
function (still inside libxl_numa.c).
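
I.e., something like the below, in libxl_numa.c, with no prototype in
any header. (The body here is just my guess at what the function
computes -- the non-RAM portion of [start, end) -- so take it only as
an illustration of the 'static' point, not of your actual
implementation.)

    static unsigned long e820_memory_hole_size(unsigned long start,
                                               unsigned long end,
                                               struct e820entry e820[],
                                               int nr)
    {
        unsigned long ram = 0;
        int i;

        for (i = 0; i < nr; i++) {
            unsigned long s = e820[i].addr, e = e820[i].addr + e820[i].size;

            if (e820[i].type != E820_RAM || e <= start || s >= end)
                continue;
            /* account for the part of this RAM entry inside [start, end) */
            ram += (e < end ? e : end) - (s > start ? s : start);
        }
        return (end - start) - ram;
    }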

> diff --git a/tools/libxl/libxl_dom.c b/tools/libxl/libxl_dom.c
> index a1c16b0..378249e 100644
 
> +/* prepares vnode to pnode map for domain vNUMA memory allocation */
> +int libxl__init_vnode_to_pnode(libxl__gc *gc, uint32_t domid,
> +                        libxl_domain_build_info *info)
> +{
> +    int i, n, nr_nodes, rc;
>
Please, initialize nr_nodes to 0. Given how libxl_get_numainfo() works,
that's the best and easiest way to be sure we won't leak any of the
memory it deals with.

> +    uint64_t *mems;
> +    unsigned long long *claim = NULL;
> +    libxl_numainfo *ninfo = NULL;
> +
> +    rc = -1;
>
As said above, use libxl error codes.

> +    if (info->vnode_to_pnode == NULL) {
> +        info->vnode_to_pnode = calloc(info->nr_vnodes,
> +                                      sizeof(*info->vnode_to_pnode));
> +        if (info->vnode_to_pnode == NULL)                                    
> +            return rc; 
> +    }
> +
Also, in libxl, memory allocation never fails. :-) What this really means
is that this kind of failure is handled for you by the library itself,
provided you use the proper wrappers (and using them is not optional: it
is how you deal with memory in libxl).

For this specific example, look for libxl__calloc() and some usage
examples of it (you will find that there never is any error handling code).
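
E.g., the hunk above would reduce to something like the below (whether
the gc or NOGC should be used depends on who is supposed to own and
free the map, which I haven't checked):

    /* never fails: on allocation failure the library aborts for us */
    if (info->vnode_to_pnode == NULL)
        info->vnode_to_pnode = libxl__calloc(gc, info->nr_vnodes,
                                             sizeof(*info->vnode_to_pnode));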

> +    /* default setting */
> +    for (i = 0; i < info->nr_vnodes; i++)
> +        info->vnode_to_pnode[i] = VNUMA_NO_NODE;
> +
> +    /* Get NUMA info */
> +    ninfo = libxl_get_numainfo(CTX, &nr_nodes);
> +    if (ninfo == NULL || nr_nodes == 0) {
> +        LOG(DEBUG, "No HW NUMA found.\n");
> +        rc = 0;
> +        goto vnmapout;
> +    }
> +
I'm not so sure about this. What do you mean by "No HW NUMA found"?
IIRC, on a non-NUMA box, libxl_get_numainfo() returns something. I just
checked on my (non-NUMA) laptop, and I do see a node 0 with all the
memory and 10 as the distance to itself.

So, if what you get is ninfo=NULL and nr_nodes=0, it means something
went wrong, and we shouldn't return success. OTOH, if you're interested
in detecting the case where the call succeeded and there really is only
one node, you should probably check for nr_nodes == 1.
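
IOW, I'd expect something more like the below (just a sketch, reusing
the names from your patch):

    libxl_numainfo *ninfo = NULL;
    int nr_nodes = 0;  /* so the cleanup below is safe even on failure */
    int rc = 0;

    ninfo = libxl_get_numainfo(CTX, &nr_nodes);
    if (ninfo == NULL) {
        /* the call itself failed: that is an error, not "no NUMA here" */
        rc = ERROR_FAIL;
        goto out;
    }
    if (nr_nodes == 1) {
        /* genuinely non-NUMA host: nothing to map, use default allocation */
        LOG(DEBUG, "Only one physical NUMA node, using default allocation");
        goto out;
    }

    /* ... actually build the vnode to pnode map here ... */

 out:
    libxl_numainfo_list_free(ninfo, nr_nodes);
    return rc;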

> +    /* 
> +     * check if we have any hardware NUMA nodes selected,
> +     * otherwise VNUMA_NO_NODE set and used default allocation
> +     */ 
> +    if (libxl_bitmap_is_empty(&info->nodemap))
> +        return 0;
>
Why _is_empty()? I mean, under what circumstances would info->nodemap
be empty?

I'm asking because info->nodemap is initialized as full in
libxl__domain_build_info_setdefault() and can only change in
libxl__build_pre(), in a way that makes it really unlikely for it to
ever become empty... so...

> +    mems = info->vnuma_memszs;
> +    
> +    /* check if all vnodes will fit in one node */
> +    libxl_for_each_set_bit(n, info->nodemap) {
> +        if(ninfo[n].free/1024 >= info->max_memkb  && 
> +           libxl_bitmap_test(&info->nodemap, n))
> +           {
> +               /* 
> +                * all domain v-nodes will fit one p-node n, 
> +                * p-node n is a best candidate selected by automatic 
> +                * NUMA placement.
> +                */
> +               for (i = 0; i < info->nr_vnodes; i++)
> +                    info->vnode_to_pnode[i] = n;
> +               /* we can exit, as we are happy with placement */
> +               return 0;
> +           }
> +    }
>
Wait, why do we want to do that? If info->nodemap is not fully set
(and BTW, a fully set nodemap, rather than an empty one, is what
actually represents "auto", i.e. "do this for me"), it already contains
instructions that, I think, everyone in the downstream chain --be it
other parts of libxl, libxc or Xen-- should follow.

What I mean is, if info->nodemap has two bits set, that means the
domain should be "created on" the (two) nodes corresponding to those
bits. OTOH, if info->nodemap has only one bit set, then the whole
domain should stay there, and in that case there isn't much choice, is
there? (Basically, your libxl_for_each_set_bit() will only take one
step.)

Here, it looks like, no matter what we have in info->nodemap, you
look for one of the nodes that can contain the whole domain, which
doesn't look right.

Below...
> +    /* Determine the best nodes to fit vNUMA nodes */
> +    /* TODO: change algorithm. The current just fits the nodes
> +     * Will be nice to have them also sorted by size 
> +     * If no p-node found, will be set to NUMA_NO_NODE
> +     */
> +    claim = calloc(info->nr_vnodes, sizeof(*claim));
> +    if (claim == NULL)
> +        return rc;
> +     
> +    libxl_for_each_set_bit(n, info->nodemap)
> +    {
> +        for (i = 0; i < info->nr_vnodes; i++)
> +        {
> +            if (((claim[n] + (mems[i] << 20)) <= ninfo[n].free) &&
> +                 /*vnode was not set yet */
> +                 (info->vnode_to_pnode[i] == VNUMA_NO_NODE ) )
> +            {
> +                info->vnode_to_pnode[i] = n;
> +                claim[n] += (mems[i] << 20);
> +            }
> +        }
> +    }
> +
...you're trying to do something similar, i.e., pack the domain within a
subset of what you find in info->nodemap instead of just using it.

Anyway, I know that automatic placement code may look tricky... Let me
see if I can hack something together quickly, to show what I mean and
what I'd have put here instead of all this. :-)
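
Actually, here is, roughly, what I'd have put here instead (completely
untested, and reusing your claim/mems/ninfo variables): just spread the
vnodes over *all* the nodes that are set in info->nodemap, rather than
looking for a subset of them that fits everything.

    /* Respect info->nodemap as it is: assign the vnodes to the set pnodes
     * in a round-robin fashion, as long as there is free memory for them;
     * whatever does not fit anywhere stays at VNUMA_NO_NODE. */
    i = 0;
    while (i < info->nr_vnodes) {
        int placed = 0;

        libxl_for_each_set_bit(n, info->nodemap) {
            if (i >= info->nr_vnodes)
                break;
            if (claim[n] + (mems[i] << 20) <= ninfo[n].free) {
                info->vnode_to_pnode[i] = n;
                claim[n] += mems[i] << 20;
                placed = 1;
                i++;
            }
        }
        if (!placed)
            break;
    }

BTW, note that claim[] is indexed with a pnode number, so it probably
wants to be nr_nodes elements big, rather than info->nr_vnodes.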

>  int libxl__build_pre(libxl__gc *gc, uint32_t domid,
>                libxl_domain_config *d_config, libxl__domain_build_state *state)
>  {
> @@ -235,6 +318,66 @@ int libxl__build_pre(libxl__gc *gc, uint32_t domid,
>          if (rc)
>              return rc;
>      }
> +
> +    if (info->nr_vnodes > 0) {
> +        /* The memory blocks will be formed here from sizes */
> +        struct vmemrange *memrange = libxl__calloc(gc, info->nr_vnodes,
> +                                                sizeof(*memrange));
> +        if (memrange == NULL) {
> +            LOG(DETAIL, "Failed to allocate memory for memory ranges.\n");
> +            return ERROR_FAIL;
> +        }
>
AhA, here you're using the correct memory allocation wrapper, so you can
ditch the error handling.

> +        if (libxl__vnuma_align_mem(gc, domid, info, memrange) < 0) {
> +            LOG(DETAIL, "Failed to align memory map.\n");
> +            return ERROR_FAIL;
> +        }
> +
> +        /* 
> +        * If vNUMA vnode_to_pnode map defined, determine if we
> +        * can disable automatic numa placement and place vnodes
> +        * on specified pnodes.
> +        * For now, if vcpu affinity specified, we will use 
> +        * specified vnode to pnode map.
> +        */
> +        if (libxl__vnodemap_is_usable(gc, info) == 1) {
> +
Extra blank line.
> +            /* Will use user-defined vnode to pnode mask */
> +        
Here too, and with spurious spaces.

> +            libxl_defbool_set(&info->numa_placement, false);
>
Setting info->numa_placement to false only makes sense if you do it
_before_ the following block of code:

    if (libxl_defbool_val(info->numa_placement)) {

        if (!libxl_bitmap_is_full(&info->cpumap)) {
            LOG(ERROR, "Can run NUMA placement only if no vcpu "
                       "affinity is specified");
            return ERROR_INVAL;
        }

        rc = numa_place_domain(gc, domid, info);
        if (rc)
            return rc;
    }

There isn't any other place where that flag is checked and, if you get
past this, automatic placement either already happened (if
numa_placement was true and the other condition is met), or it never
will.

The right thing to do would probably be to call
libxl__vnodemap_is_usable() *before* we get here (even immediately
above is fine), and have _it_ disable info->numa_placement, if that
turns out to be the case.

At this point, most of the code below (until '***') can just live inside
the "if (libxl_defbool_val(info->numa_placement))" block above, without
the need to introduce another identical block.
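
I.e., the flow would become something like this (just a sketch; where
exactly the rest of the vNUMA setup ends up depends on the points
above):

    if (info->nr_vnodes > 0) {
        /* a usable user-provided vnode-to-pnode map takes precedence,
         * so automatic placement gets disabled _before_ it can run */
        if (libxl__vnodemap_is_usable(gc, info))
            libxl_defbool_set(&info->numa_placement, false);
    }

    if (libxl_defbool_val(info->numa_placement)) {

        if (!libxl_bitmap_is_full(&info->cpumap)) {
            LOG(ERROR, "Can run NUMA placement only if no vcpu "
                       "affinity is specified");
            return ERROR_INVAL;
        }

        rc = numa_place_domain(gc, domid, info);
        if (rc)
            return rc;
    }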

> +        }
> +        else {
> +            LOG(ERROR, "The allocation mask for vnuma nodes cannot be used.\n");
> +            if (libxl_defbool_val(info->vnuma_placement)) {
> +                
> +                LOG(DETAIL, "Switching to automatic vnuma to pnuma placement\n.");
> +                /* Construct the vnode to pnode mapping if possible */
> +                if (libxl__init_vnode_to_pnode(gc, domid, info) < 0) {
> +                    LOG(DETAIL, "Failed to call init_vnodemap\n");
> +                    /* vnuma_nodemap will not be used if nr_vnodes == 0 */
> +                    return ERROR_FAIL;
> +                }
> +            }
> +            else {
> +                LOG(ERROR, "The vnodes cannot be mapped to pnodes this way\n.");
> +                info->nr_vnodes = 0;
> +                return ERROR_FAIL;
> +            }
> +        }
> +        /* plumb domain with vNUMA topology */
> +        if (xc_domain_setvnuma(ctx->xch, domid, info->nr_vnodes,
> +                                info->max_vcpus, memrange,
> +                                info->vdistance, info->vcpu_to_vnode,
> +                                info->vnode_to_pnode) < 0) {
> +        
> +           LOG(DETAIL, "Failed to set vnuma topology for domain from\n.");
> +           info->nr_vnodes = 0;
> +           return ERROR_FAIL;
> +        }
> +    }
> +    else
> +        LOG(DEBUG, "Will not construct vNUMA topology.\n");
> +    
*** <-- until here

> +/* Checks if vnuma_nodemap defined in info can be used
> + * for allocation of vnodes on physical NUMA nodes by
> + * verifying that there is enough memory on corresponding
> + * NUMA nodes.
> + */
> +unsigned int libxl__vnodemap_is_usable(libxl__gc *gc, libxl_domain_build_info *info)
> +{
> +    unsigned int i;
> +    libxl_numainfo *ninfo = NULL;
> +    unsigned long long *claim;
> +    unsigned int node;
> +    uint64_t *mems;
> +    int rc, nr_nodes;
> +
> +    rc = nr_nodes = 0;
> +    if (info->vnode_to_pnode == NULL || info->vnuma_memszs == NULL)
> +        return rc;
> +    /*
> +     * Cannot use specified mapping if not NUMA machine 
> +     */
> +    ninfo = libxl_get_numainfo(CTX, &nr_nodes);
> +    if (ninfo == NULL) {
> +        return rc;   
> +    }
> +    mems = info->vnuma_memszs;   
> +    claim = calloc(info->nr_vnodes, sizeof(*claim));
> +    if (claim == NULL)
> +        return rc;
> +    /* Sum memory request on per pnode basis */ 
> +    for (i = 0; i < info->nr_vnodes; i++)
> +    {
> +        node = info->vnode_to_pnode[i];
> +        /* Correct pnode number? */
> +        if (node < nr_nodes)
> +            claim[node] += (mems[i] << 20);
> +        else
> +            goto vmapu;
> +   }
> +   for (i = 0; i < nr_nodes; i++) {
> +       if (claim[i] > ninfo[i].free)
> +          /* Cannot complete user request, falling to default */
> +          goto vmapu;
> +   }
> +   rc = 1;
> +
> + vmapu:
> +   if(claim) free(claim);
> +   return rc;
> +}
>
This function looks conceptually right. That said, it comes with coding
style issues, trailing whitespace, and error handling and memory
allocation/failure handling that don't follow libxl conventions. :-D :-D
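
Just to be a bit more concrete, this is roughly how I'd restructure it
(untested, keeping your logic; note that I also sized claim[] by the
number of physical nodes, since that is what indexes it):

    unsigned int libxl__vnodemap_is_usable(libxl__gc *gc,
                                           libxl_domain_build_info *info)
    {
        libxl_numainfo *ninfo = NULL;
        uint64_t *claim;
        unsigned int i, node, usable = 0;
        int nr_nodes = 0;

        if (info->vnode_to_pnode == NULL || info->vnuma_memszs == NULL)
            return 0;

        /* Cannot use the specified mapping if this is not a NUMA machine */
        ninfo = libxl_get_numainfo(CTX, &nr_nodes);
        if (ninfo == NULL)
            return 0;

        /* gc allocated: no failure handling and no explicit free needed */
        claim = libxl__calloc(gc, nr_nodes, sizeof(*claim));

        /* Sum the memory requested from each physical node... */
        for (i = 0; i < info->nr_vnodes; i++) {
            node = info->vnode_to_pnode[i];
            if (node >= (unsigned int)nr_nodes)
                goto out;
            claim[node] += info->vnuma_memszs[i] << 20;
        }
        /* ... and check it against the free memory on each of them */
        for (i = 0; i < (unsigned int)nr_nodes; i++)
            if (claim[i] > ninfo[i].free)
                goto out;

        usable = 1;

     out:
        libxl_numainfo_list_free(ninfo, nr_nodes);
        return usable;
    }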

> diff --git a/tools/libxl/libxl_x86.c b/tools/libxl/libxl_x86.c
> index 87a8110..206f5be 100644

> +
> +/*
> + * Checks for the beginnig and end of RAM in e820 map for domain
> + * and aligns start of first and end of last vNUMA memory block to
> + * that map. vnode memory size are passed here Megabytes.
> + * For PV guest e820 map has fixed hole sizes.
> + */
> +int libxl__vnuma_align_mem(libxl__gc *gc,
> +                            uint32_t domid,
> +                            libxl_domain_build_info *b_info, /* IN: mem sizes */
> +                            vmemrange_t *memblks)        /* OUT: linux numa blocks in pfn */
> +{
> +    int i, j, rc;
> +    uint64_t next_start_pfn, end_max = 0, size, mem_hole;
> +    uint32_t nr;
> +    struct e820entry map[E820MAX];
> +    
> +    if (b_info->nr_vnodes == 0)
> +        return -EINVAL;
> +    libxl_ctx *ctx = libxl__gc_owner(gc);
> +
> +    /* retreive e820 map for this host */
> +    rc = xc_get_machine_memory_map(ctx->xch, map, E820MAX);
> +
> +    if (rc < 0) {
> +        errno = rc;
> +        return -EINVAL;
> +    }
> +    nr = rc;
> +    rc = e820_sanitize(ctx, map, &nr, b_info->target_memkb,
> +                       (b_info->max_memkb - b_info->target_memkb) +
> +                       b_info->u.pv.slack_memkb);
> +    if (rc)
> +    {   
> +        errno = rc;
> +        return -EINVAL;
> +    }
> +
> +    /* max pfn for this host */
> +    for (j = nr - 1; j >= 0; j--)
> +        if (map[j].type == E820_RAM) {
> +            end_max = map[j].addr + map[j].size;
> +            break;
> +        }
> +    
> +    memset(memblks, 0, sizeof(*memblks) * b_info->nr_vnodes);
> +    next_start_pfn = 0;
> +
> +    memblks[0].start = map[0].addr;
> +
> +    for(i = 0; i < b_info->nr_vnodes; i++) {
> +        /* start can be not zero */
> +        memblks[i].start += next_start_pfn; 
> +        memblks[i].end = memblks[i].start + (b_info->vnuma_memszs[i] << 20);
> +        
> +        size = memblks[i].end - memblks[i].start;
> +        /*
> +         * For pv host with e820_host option turned on we need
> +         * to rake into account memory holes. For pv host with
> +         * e820_host disabled or unset, the map is contiguous 
> +         * RAM region.
> +         */ 
> +        if (libxl_defbool_val(b_info->u.pv.e820_host)) {
> +            while (mem_hole = e820_memory_hole_size(memblks[i].start,
> +                                                 memblks[i].end, map, nr),
> +                    memblks[i].end - memblks[i].start - mem_hole < size)
> +            {
> +                memblks[i].end += mem_hole;
> +
> +                if (memblks[i].end > end_max) {
> +                    memblks[i].end = end_max;
> +                    break;
> +                }
> +            }
> +        }
> +        next_start_pfn = memblks[i].end;
> +    }
> +    
> +    if (memblks[i-1].end > end_max)
> +        memblks[i-1].end = end_max;
> +
> +    return 0;
> +}
> +
All this code here would require someone who knows more about e820 than
me to chime in (or for me to learn more about it, which I definitely
want to do, but that will take a bit).

Perhaps you can help me at least a bit. Can you comment on/describe what
it is that you do when you find a hole?

It looks from the above that you just need to move the end of the
node(s) memory range(s)... Is that the case? For some reason, I was
assuming that dealing with holes would have required adding support for
more than one memory block per node, but that's apparently not the case,
is it?

Regards,
Dario

-- 
<<This happens because I choose it to happen!>> (Raistlin Majere)
-----------------------------------------------------------------
Dario Faggioli, Ph.D, http://about.me/dario.faggioli
Senior Software Engineer, Citrix Systems R&D Ltd., Cambridge (UK)



 

