
Re: [Xen-devel] [PATCH v3 4/7] libxl: vNUMA supporting interface



On Mon, 2013-11-18 at 15:24 -0500, Elena Ufimtseva wrote:
> Signed-off-by: Elena Ufimtseva <ufimtseva@xxxxxxxxx>
> 
> Changes since v2:
> *   Added vnuma memory pfn alignment which takes into
> account e820_host option and non-contiguous e820 memory map
> in that case;
>
Wow, cool that you got this one working too! ;-P

> diff --git a/tools/libxl/libxl.c b/tools/libxl/libxl.c
> index 0de1112..2bd3653 100644
 
> +int libxl_domain_setvnuma(libxl_ctx *ctx,
> +                            uint32_t domid,
> +                            uint16_t nr_vnodes,
> +                            uint16_t nr_vcpus,
> +                            vmemrange_t *vmemrange,
> +                            unsigned int *vdistance,
> +                            unsigned int *vcpu_to_vnode,
> +                            unsigned int *vnode_to_pnode)
> +{
> +    GC_INIT(ctx);
>
Do you need this? I don't think so: you don't need a gc, and ctx appears
to be enough for you.

> +    int ret;
> +    ret = xc_domain_setvnuma(ctx->xch, domid, nr_vnodes,
> +                                nr_vcpus, vmemrange,
> +                                vdistance, 
> +                                vcpu_to_vnode,
> +                                vnode_to_pnode);
> +    GC_FREE;
> +    return ret;
> +}
> +
libxl functions should always return libxl error codes, such as
ERROR_FAIL, ERROR_INVAL, etc. It's nice to show what actually went wrong
in the xc_* call, but that's usually done like this:

    rc = xc_call();
    if (rc < 0) {
        LIBXL__LOG_ERRNO(ctx, LIBXL__LOG_ERROR, "error xxx");
        return ERROR_FAIL;
    }

On the grounds that libxc would properly set errno and return -1 on
failure.
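
So, in this case, I'd expect the wrapper to look more or less like the
below (just a sketch, without the gc, as said above, and with the error
message of course up to you):

    int libxl_domain_setvnuma(libxl_ctx *ctx, uint32_t domid,
                              uint16_t nr_vnodes, uint16_t nr_vcpus,
                              vmemrange_t *vmemrange, unsigned int *vdistance,
                              unsigned int *vcpu_to_vnode,
                              unsigned int *vnode_to_pnode)
    {
        int ret;

        ret = xc_domain_setvnuma(ctx->xch, domid, nr_vnodes, nr_vcpus,
                                 vmemrange, vdistance,
                                 vcpu_to_vnode, vnode_to_pnode);
        if (ret < 0) {
            /* libxc sets errno; log it and return a libxl error code */
            LIBXL__LOG_ERRNO(ctx, LIBXL__LOG_ERROR,
                             "xc_domain_setvnuma failed");
            return ERROR_FAIL;
        }
        return 0;
    }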


> diff --git a/tools/libxl/libxl_arch.h b/tools/libxl/libxl_arch.h
> index abe6685..b95abab 100644

>  int libxl__arch_domain_create(libxl__gc *gc, libxl_domain_config *d_config,
>                 uint32_t domid);
>  
> +int libxl__vnuma_align_mem(libxl__gc *gc,
> +                            uint32_t domid,
> +                            struct libxl_domain_build_info *b_info,
> +                            vmemrange_t *memblks); 
> +
> +unsigned long e820_memory_hole_size(unsigned long start,
> +                                    unsigned long end,
> +                                    struct e820entry e820[],
> +                                    int nr);
> +
> +unsigned int libxl__vnodemap_is_usable(libxl__gc *gc,
> +                                libxl_domain_build_info *info);
> +
>
I think we'd better have a libxl maintainer comment on this but, FWIW,
I don't think I'd put these functions in libxl_arch.h.

We do have a libxl_numa.c file to host their implementation (rather
than putting it in libxl_dom.c) and, as libxl internal functions, their
prototypes are probably fine in libxl_internal.h (unless there is some
specific reason for having them here which I'm overlooking).

As for e820_memory_hole_size(), it looks like it's only called from
inside libxl__vnuma_align_mem(), which means it can just be a static
function (still inside libxl_numa.c).
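
I.e., something like the below, in libxl_numa.c, with no prototype in
any header. (The body here is just my guess at what the function
computes -- the non-RAM portion of [start, end) -- so take it only as
an illustration of the 'static' point, not of your actual
implementation.)

    static unsigned long e820_memory_hole_size(unsigned long start,
                                               unsigned long end,
                                               struct e820entry e820[],
                                               int nr)
    {
        unsigned long ram = 0;
        int i;

        for (i = 0; i < nr; i++) {
            unsigned long s = e820[i].addr, e = e820[i].addr + e820[i].size;

            if (e820[i].type != E820_RAM || e <= start || s >= end)
                continue;
            /* account for the part of this RAM entry inside [start, end) */
            ram += (e < end ? e : end) - (s > start ? s : start);
        }
        return (end - start) - ram;
    }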

> diff --git a/tools/libxl/libxl_dom.c b/tools/libxl/libxl_dom.c
> index a1c16b0..378249e 100644
 
> +/* prepares vnode to pnode map for domain vNUMA memory allocation */
> +int libxl__init_vnode_to_pnode(libxl__gc *gc, uint32_t domid,
> +                        libxl_domain_build_info *info)
> +{
> +    int i, n, nr_nodes, rc;
>
Please, initialize nr_nodes to 0. Given how libxl_get_numainfo() works,
that's the best and easiest way to be sure we won't leak any of the
memory it deals with.

> +    uint64_t *mems;
> +    unsigned long long *claim = NULL;
> +    libxl_numainfo *ninfo = NULL;
> +
> +    rc = -1;
>
As said above, use libxl error codes.

> +    if (info->vnode_to_pnode == NULL) {
> +        info->vnode_to_pnode = calloc(info->nr_vnodes,
> +                                      sizeof(*info->vnode_to_pnode));
> +        if (info->vnode_to_pnode == NULL)                                    
> +            return rc; 
> +    }
> +
Also, in libxl, memory allocation never fails. :-) What this really means
is that this kind of failure is handled for you by the library itself,
provided you use the proper wrappers (and using them is not optional: it
is how you deal with memory in libxl).

For this specific example, look for libxl__calloc() and some usage
examples of it (you will find that there never is any error handling code).
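
E.g., the hunk above would reduce to something like the below (whether
the gc or NOGC should be used depends on who is supposed to own and
free the map, which I haven't checked):

    /* never fails: on allocation failure the library aborts for us */
    if (info->vnode_to_pnode == NULL)
        info->vnode_to_pnode = libxl__calloc(gc, info->nr_vnodes,
                                             sizeof(*info->vnode_to_pnode));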

> +    /* default setting */
> +    for (i = 0; i < info->nr_vnodes; i++)
> +        info->vnode_to_pnode[i] = VNUMA_NO_NODE;
> +
> +    /* Get NUMA info */
> +    ninfo = libxl_get_numainfo(CTX, &nr_nodes);
> +    if (ninfo == NULL || nr_nodes == 0) {
> +        LOG(DEBUG, "No HW NUMA found.\n");
> +        rc = 0;
> +        goto vnmapout;
> +    }
> +
I'm not so sure about this. What do you mean by "No HW NUMA found"?
IIRC, on a non-NUMA box, libxl_get_numainfo() returns something. I just
checked on my (non-NUMA) laptop, and I do see a node 0 with all the
memory and 10 as the distance to itself.

So, if what you get is ninfo=NULL and nr_nodes=0, it means something
went wrong, and we shouldn't return success. OTOH, if you're interested
in detecting the case where the call succeeded and there really is only
one node, you should probably check for nr_nodes == 1.
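
IOW, I'd expect something more like the below (just a sketch, reusing
the names from your patch):

    libxl_numainfo *ninfo = NULL;
    int nr_nodes = 0;  /* so the cleanup below is safe even on failure */
    int rc = 0;

    ninfo = libxl_get_numainfo(CTX, &nr_nodes);
    if (ninfo == NULL) {
        /* the call itself failed: that is an error, not "no NUMA here" */
        rc = ERROR_FAIL;
        goto out;
    }
    if (nr_nodes == 1) {
        /* genuinely non-NUMA host: nothing to map, use default allocation */
        LOG(DEBUG, "Only one physical NUMA node, using default allocation");
        goto out;
    }

    /* ... actually build the vnode to pnode map here ... */

 out:
    libxl_numainfo_list_free(ninfo, nr_nodes);
    return rc;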

> +    /* 
> +     * check if we have any hardware NUMA nodes selected,
> +     * otherwise VNUMA_NO_NODE set and used default allocation
> +     */ 
> +    if (libxl_bitmap_is_empty(&info->nodemap))
> +        return 0;
>
Why _is_empty()? I mean, under what circumstances would info->nodemap
be empty?

I'm asking because info->nodemap is initialized as full in
libxl__domain_build_info_setdefault() and can only change in
libxl__build_pre(), in a way that makes it really unlikely for it to
ever become empty... so...

> +    mems = info->vnuma_memszs;
> +    
> +    /* check if all vnodes will fit in one node */
> +    libxl_for_each_set_bit(n, info->nodemap) {
> +        if(ninfo[n].free/1024 >= info->max_memkb  && 
> +           libxl_bitmap_test(&info->nodemap, n))
> +           {
> +               /* 
> +                * all domain v-nodes will fit one p-node n, 
> +                * p-node n is a best candidate selected by automatic 
> +                * NUMA placement.
> +                */
> +               for (i = 0; i < info->nr_vnodes; i++)
> +                    info->vnode_to_pnode[i] = n;
> +               /* we can exit, as we are happy with placement */
> +               return 0;
> +           }
> +    }
>
Wait, why do we want to do that? If info->nodemap is not fully set
(and BTW, a fully set nodemap, rather than an empty one, is what
actually represents "auto", i.e. "do this for me"), it already contains
instructions that, I think, everyone in the downstream chain --be it
other parts of libxl, libxc or Xen-- should follow.

What I mean is, if info->nodemap has two bits set, that means the
domain should be "created on" the (two) nodes corresponding to those
bits. OTOH, if info->nodemap has only one bit set, then the whole
domain should stay there, and in that case there isn't much choice, is
there? (Basically, your libxl_for_each_set_bit() will only take one
step.)

Here, it looks like, no matter what we have in info->nodemap, you
look for one of the nodes that can contain the whole domain, which
doesn't look right.

Below...
> +    /* Determine the best nodes to fit vNUMA nodes */
> +    /* TODO: change algorithm. The current just fits the nodes
> +     * Will be nice to have them also sorted by size 
> +     * If no p-node found, will be set to NUMA_NO_NODE
> +     */
> +    claim = calloc(info->nr_vnodes, sizeof(*claim));
> +    if (claim == NULL)
> +        return rc;
> +     
> +    libxl_for_each_set_bit(n, info->nodemap)
> +    {
> +        for (i = 0; i < info->nr_vnodes; i++)
> +        {
> +            if (((claim[n] + (mems[i] << 20)) <= ninfo[n].free) &&
> +                 /*vnode was not set yet */
> +                 (info->vnode_to_pnode[i] == VNUMA_NO_NODE ) )
> +            {
> +                info->vnode_to_pnode[i] = n;
> +                claim[n] += (mems[i] << 20);
> +            }
> +        }
> +    }
> +
...you're trying to do something similar, i.e., pack the domain within a
subset of what you find in info->nodemap instead of just using it.

Anyway, I know that automatic placement code may look tricky... Let me
see if I can hack something together quickly, to show what I mean and
what I'd have put here instead of all this. :-)
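
Actually, here is, roughly, what I'd have put here instead (completely
untested, and reusing your claim/mems/ninfo variables): just spread the
vnodes over *all* the nodes that are set in info->nodemap, rather than
looking for a subset of them that fits everything.

    /* Respect info->nodemap as it is: assign the vnodes to the set pnodes
     * in a round-robin fashion, as long as there is free memory for them;
     * whatever does not fit anywhere stays at VNUMA_NO_NODE. */
    i = 0;
    while (i < info->nr_vnodes) {
        int placed = 0;

        libxl_for_each_set_bit(n, info->nodemap) {
            if (i >= info->nr_vnodes)
                break;
            if (claim[n] + (mems[i] << 20) <= ninfo[n].free) {
                info->vnode_to_pnode[i] = n;
                claim[n] += mems[i] << 20;
                placed = 1;
                i++;
            }
        }
        if (!placed)
            break;
    }

BTW, note that claim[] is indexed with a pnode number, so it probably
wants to be nr_nodes elements big, rather than info->nr_vnodes.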

>  int libxl__build_pre(libxl__gc *gc, uint32_t domid,
>                libxl_domain_config *d_config, libxl__domain_build_state *state)
>  {
> @@ -235,6 +318,66 @@ int libxl__build_pre(libxl__gc *gc, uint32_t domid,
>          if (rc)
>              return rc;
>      }
> +
> +    if (info->nr_vnodes > 0) {
> +        /* The memory blocks will be formed here from sizes */
> +        struct vmemrange *memrange = libxl__calloc(gc, info->nr_vnodes,
> +                                                sizeof(*memrange));
> +        if (memrange == NULL) {
> +            LOG(DETAIL, "Failed to allocate memory for memory ranges.\n");
> +            return ERROR_FAIL;
> +        }
>
AhA, here you're using the correct memory allocation wrapper, so you can
ditch the error handling.

> +        if (libxl__vnuma_align_mem(gc, domid, info, memrange) < 0) {
> +            LOG(DETAIL, "Failed to align memory map.\n");
> +            return ERROR_FAIL;
> +        }
> +
> +        /* 
> +        * If vNUMA vnode_to_pnode map defined, determine if we
> +        * can disable automatic numa placement and place vnodes
> +        * on specified pnodes.
> +        * For now, if vcpu affinity specified, we will use 
> +        * specified vnode to pnode map.
> +        */
> +        if (libxl__vnodemap_is_usable(gc, info) == 1) {
> +
Extra blank line.
> +            /* Will use user-defined vnode to pnode mask */
> +        
Here too, and with spurious spaces.

> +            libxl_defbool_set(&info->numa_placement, false);
>
Setting info->numa_placement to false only makes sense if you do it
_before_ the following block of code:

    if (libxl_defbool_val(info->numa_placement)) {

        if (!libxl_bitmap_is_full(&info->cpumap)) {
            LOG(ERROR, "Can run NUMA placement only if no vcpu "
                       "affinity is specified");
            return ERROR_INVAL;
        }

        rc = numa_place_domain(gc, domid, info);
        if (rc)
            return rc;
    }

There isn't any other place where that flag is checked and, if you get
past this, automatic placement either already happened (if
numa_placement was true and the other condition is met), or it never
will.

The right thing to do would probably be to call
libxl__vnodemap_is_usable() *before* we get here (even immediately
above is fine), and have _it_ disable info->numa_placement, if that
turns out to be the case.

At this point, most of the code below (until '***') can just live inside
the "if (libxl_defbool_val(info->numa_placement))" block above, without
the need to introduce another identical block.
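
I.e., the flow would become something like this (just a sketch; where
exactly the rest of the vNUMA setup ends up depends on the points
above):

    if (info->nr_vnodes > 0) {
        /* a usable user-provided vnode-to-pnode map takes precedence,
         * so automatic placement gets disabled _before_ it can run */
        if (libxl__vnodemap_is_usable(gc, info))
            libxl_defbool_set(&info->numa_placement, false);
    }

    if (libxl_defbool_val(info->numa_placement)) {

        if (!libxl_bitmap_is_full(&info->cpumap)) {
            LOG(ERROR, "Can run NUMA placement only if no vcpu "
                       "affinity is specified");
            return ERROR_INVAL;
        }

        rc = numa_place_domain(gc, domid, info);
        if (rc)
            return rc;
    }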

> +        }
> +        else {
> +            LOG(ERROR, "The allocation mask for vnuma nodes cannot be used.\n");
> +            if (libxl_defbool_val(info->vnuma_placement)) {
> +                
> +                LOG(DETAIL, "Switching to automatic vnuma to pnuma placement\n.");
> +                /* Construct the vnode to pnode mapping if possible */
> +                if (libxl__init_vnode_to_pnode(gc, domid, info) < 0) {
> +                    LOG(DETAIL, "Failed to call init_vnodemap\n");
> +                    /* vnuma_nodemap will not be used if nr_vnodes == 0 */
> +                    return ERROR_FAIL;
> +                }
> +            }
> +            else {
> +                LOG(ERROR, "The vnodes cannot be mapped to pnodes this way\n.");
> +                info->nr_vnodes = 0;
> +                return ERROR_FAIL;
> +            }
> +        }
> +        /* plumb domain with vNUMA topology */
> +        if (xc_domain_setvnuma(ctx->xch, domid, info->nr_vnodes,
> +                                info->max_vcpus, memrange,
> +                                info->vdistance, info->vcpu_to_vnode,
> +                                info->vnode_to_pnode) < 0) {
> +        
> +           LOG(DETAIL, "Failed to set vnuma topology for domain from\n.");
> +           info->nr_vnodes = 0;
> +           return ERROR_FAIL;
> +        }
> +    }
> +    else
> +        LOG(DEBUG, "Will not construct vNUMA topology.\n");
> +    
*** <-- until here

> +/* Checks if vnuma_nodemap defined in info can be used
> + * for allocation of vnodes on physical NUMA nodes by
> + * verifying that there is enough memory on corresponding
> + * NUMA nodes.
> + */
> +unsigned int libxl__vnodemap_is_usable(libxl__gc *gc, libxl_domain_build_info *info)
> +{
> +    unsigned int i;
> +    libxl_numainfo *ninfo = NULL;
> +    unsigned long long *claim;
> +    unsigned int node;
> +    uint64_t *mems;
> +    int rc, nr_nodes;
> +
> +    rc = nr_nodes = 0;
> +    if (info->vnode_to_pnode == NULL || info->vnuma_memszs == NULL)
> +        return rc;
> +    /*
> +     * Cannot use specified mapping if not NUMA machine 
> +     */
> +    ninfo = libxl_get_numainfo(CTX, &nr_nodes);
> +    if (ninfo == NULL) {
> +        return rc;   
> +    }
> +    mems = info->vnuma_memszs;   
> +    claim = calloc(info->nr_vnodes, sizeof(*claim));
> +    if (claim == NULL)
> +        return rc;
> +    /* Sum memory request on per pnode basis */ 
> +    for (i = 0; i < info->nr_vnodes; i++)
> +    {
> +        node = info->vnode_to_pnode[i];
> +        /* Correct pnode number? */
> +        if (node < nr_nodes)
> +            claim[node] += (mems[i] << 20);
> +        else
> +            goto vmapu;
> +   }
> +   for (i = 0; i < nr_nodes; i++) {
> +       if (claim[i] > ninfo[i].free)
> +          /* Cannot complete user request, falling to default */
> +          goto vmapu;
> +   }
> +   rc = 1;
> +
> + vmapu:
> +   if(claim) free(claim);
> +   return rc;
> +}
>
This function looks conceptually right. That said, it comes with coding
style issues, trailing whitespace, and error handling and memory
allocation/failure handling that don't follow libxl conventions. :-D :-D
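
Just to be a bit more concrete, this is roughly how I'd restructure it
(untested, keeping your logic; note that I also sized claim[] by the
number of physical nodes, since that is what indexes it):

    unsigned int libxl__vnodemap_is_usable(libxl__gc *gc,
                                           libxl_domain_build_info *info)
    {
        libxl_numainfo *ninfo = NULL;
        uint64_t *claim;
        unsigned int i, node, usable = 0;
        int nr_nodes = 0;

        if (info->vnode_to_pnode == NULL || info->vnuma_memszs == NULL)
            return 0;

        /* Cannot use the specified mapping if this is not a NUMA machine */
        ninfo = libxl_get_numainfo(CTX, &nr_nodes);
        if (ninfo == NULL)
            return 0;

        /* gc allocated: no failure handling and no explicit free needed */
        claim = libxl__calloc(gc, nr_nodes, sizeof(*claim));

        /* Sum the memory requested from each physical node... */
        for (i = 0; i < info->nr_vnodes; i++) {
            node = info->vnode_to_pnode[i];
            if (node >= (unsigned int)nr_nodes)
                goto out;
            claim[node] += info->vnuma_memszs[i] << 20;
        }
        /* ... and check it against the free memory on each of them */
        for (i = 0; i < (unsigned int)nr_nodes; i++)
            if (claim[i] > ninfo[i].free)
                goto out;

        usable = 1;

     out:
        libxl_numainfo_list_free(ninfo, nr_nodes);
        return usable;
    }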

> diff --git a/tools/libxl/libxl_x86.c b/tools/libxl/libxl_x86.c
> index 87a8110..206f5be 100644

> +
> +/*
> + * Checks for the beginnig and end of RAM in e820 map for domain
> + * and aligns start of first and end of last vNUMA memory block to
> + * that map. vnode memory size are passed here Megabytes.
> + * For PV guest e820 map has fixed hole sizes.
> + */
> +int libxl__vnuma_align_mem(libxl__gc *gc,
> +                            uint32_t domid,
> +                            libxl_domain_build_info *b_info, /* IN: mem sizes */
> +                            vmemrange_t *memblks)        /* OUT: linux numa blocks in pfn */
> +{
> +    int i, j, rc;
> +    uint64_t next_start_pfn, end_max = 0, size, mem_hole;
> +    uint32_t nr;
> +    struct e820entry map[E820MAX];
> +    
> +    if (b_info->nr_vnodes == 0)
> +        return -EINVAL;
> +    libxl_ctx *ctx = libxl__gc_owner(gc);
> +
> +    /* retreive e820 map for this host */
> +    rc = xc_get_machine_memory_map(ctx->xch, map, E820MAX);
> +
> +    if (rc < 0) {
> +        errno = rc;
> +        return -EINVAL;
> +    }
> +    nr = rc;
> +    rc = e820_sanitize(ctx, map, &nr, b_info->target_memkb,
> +                       (b_info->max_memkb - b_info->target_memkb) +
> +                       b_info->u.pv.slack_memkb);
> +    if (rc)
> +    {   
> +        errno = rc;
> +        return -EINVAL;
> +    }
> +
> +    /* max pfn for this host */
> +    for (j = nr - 1; j >= 0; j--)
> +        if (map[j].type == E820_RAM) {
> +            end_max = map[j].addr + map[j].size;
> +            break;
> +        }
> +    
> +    memset(memblks, 0, sizeof(*memblks) * b_info->nr_vnodes);
> +    next_start_pfn = 0;
> +
> +    memblks[0].start = map[0].addr;
> +
> +    for(i = 0; i < b_info->nr_vnodes; i++) {
> +        /* start can be not zero */
> +        memblks[i].start += next_start_pfn; 
> +        memblks[i].end = memblks[i].start + (b_info->vnuma_memszs[i] << 20);
> +        
> +        size = memblks[i].end - memblks[i].start;
> +        /*
> +         * For pv host with e820_host option turned on we need
> +         * to rake into account memory holes. For pv host with
> +         * e820_host disabled or unset, the map is contiguous 
> +         * RAM region.
> +         */ 
> +        if (libxl_defbool_val(b_info->u.pv.e820_host)) {
> +            while (mem_hole = e820_memory_hole_size(memblks[i].start,
> +                                                 memblks[i].end, map, nr),
> +                    memblks[i].end - memblks[i].start - mem_hole < size)
> +            {
> +                memblks[i].end += mem_hole;
> +
> +                if (memblks[i].end > end_max) {
> +                    memblks[i].end = end_max;
> +                    break;
> +                }
> +            }
> +        }
> +        next_start_pfn = memblks[i].end;
> +    }
> +    
> +    if (memblks[i-1].end > end_max)
> +        memblks[i-1].end = end_max;
> +
> +    return 0;
> +}
> +
All this code here would require someone who knows more about e820 than
me to chime in (or for me to learn more about it, which I definitely
want to do, but that will take a bit).

Perhaps you can help me at least a bit. Can you comment on/describe what
it is that you do when you find a hole?

It looks from the above that you just need to move the end of the
node(s) memory range(s)... Is that the case? For some reason, I was
assuming that dealing with holes would have required adding support for
more than one memory block per node, but that's apparently not the case,
is it?

Regards,
Dario

-- 
<<This happens because I choose it to happen!>> (Raistlin Majere)
-----------------------------------------------------------------
Dario Faggioli, Ph.D, http://about.me/dario.faggioli
Senior Software Engineer, Citrix Systems R&D Ltd., Cambridge (UK)



 

