
Re: [Xen-devel] [PATCH 1 of 4 v6/leftover] libxl: enable automatic placement of guests on NUMA nodes



On Mon, 2012-07-23 at 12:38 +0100, Ian Jackson wrote:
> Dario Faggioli writes ("[PATCH 1 of 4 v6/leftover] libxl: enable automatic placement of guests on NUMA nodes"):
> > If a domain does not have a VCPU affinity, try to pin it automatically
> > to some PCPUs. This is done taking into account the NUMA characteristics
> > of the host. In fact, we look for a combination of host's NUMA nodes
> > with enough free memory and number of PCPUs for the new domain, and pin
> > it to the PCPUs of those nodes.
> 
> I'm afraid your new algorithm appears to rely on an inconsistent
> comparison function:
> 
It's not that new, at least not the comparison function: it's been
there for a couple of versions now... Anyway...

> > +static int numa_cmpf(const libxl__numa_candidate *c1,
> > +                     const libxl__numa_candidate *c2)
> > +{
> > +    double freememkb_diff = normalized_diff(c2->free_memkb, c1->free_memkb);
> > +    double nrvcpus_diff = normalized_diff(c1->nr_vcpus, c2->nr_vcpus);
> > +
> > +    return SIGN(freememkb_diff + 3*nrvcpus_diff);
> > +}
> 
> I don't think this comparison function necessarily defines a partial
> order.  Can you provide a proof that it does ?  I think you probably
> can't.  If you think it _is_ a partial order please let me know and I
> will try to construct a counterexample.
> 
Nhaa, don't bother, I'll change it into something simpler.

>
>[..]
>
>
> That clearly defines a partial order.  The important point is that the
> functions score() must each be calculated only from the node in
> question.  score() must not take the other candidate as an argument.
> Your normalized_diff function violates this.
> 
I think it should be fine, as I'm only using the other candidate's
characteristics for normalization. But again, I'm fine with turning
this into something simpler and adhering to the criterion above, which
will make things even clearer. Thanks.

> > +    /* Not even a suitable placement candidate! Let's just not touch the
> > +     * domain's info->cpumap. It will have affinity with all nodes/cpus. */
> > +    if (found == 0)
> > +        goto out;
> 
> This probably deserves a log message.
> 
It's there: it's printed by libxl__get_numa_candidate(). It's done
that way to avoid printing more than one message for the same failure,
which would be confusing, e.g., something like this:

 libxl: ERROR Too many nodes
 libxl: ERROR No placement found

Is that acceptable?


> > @@ -107,7 +203,22 @@ int libxl__build_pre(libxl__gc *gc, uint
> >      uint32_t rtc_timeoffset;
> > 
> >      xc_domain_max_vcpus(ctx->xch, domid, info->max_vcpus);
> > +
> > +    /*
> > +     * Check if the domain has any CPU affinity. If not, try to build up one.
> > +     * In case numa_place_domain() finds at least one suitable candidate, it will
> 
> This line is 78 characters, leading to wrap damage in my email reply
> to you.  The official coding style says you may use up to 80 which is
> tolerable in the odd line of code but it would be nice if wordwrapped
> comments were wrapped to a shorter distance.
> 
Ok, will take a look at this, here and everywhere else. Thanks.

> > +     * affect info->cpumap accordingly; if it does not, it just leaves it
> > +     * as it is. This means (unless some weird error manifests) the subsequent
> > +     * call to libxl_set_vcpuaffinity_all() will do the actual placement,
> > +     * whatever that turns out to be.
> > +     */
> > +    if (libxl_bitmap_is_full(&info->cpumap)) {
> > +        int rc = numa_place_domain(gc, info);
> > +        if (rc)
> > +            return rc;
> > +    }
> >      libxl_set_vcpuaffinity_all(ctx, domid, info->max_vcpus, &info->cpumap);
> > +
> 
> As discussed, this is wrong; the fix is in your 2/4 but needs to be
> folded into this patch I think.  Patches should not introduce new
> bugs even if they're about to be fixed in the next commit.
> 
Ok, I'll merge the two patches.

> > +    /*
> > +     * The good thing about this solution is that it is based on heuristics
> > +     * (implemented in numa_cmpf() ), but we at least can evaluate it on
> > +     * all the possible placement candidates. That can happen because the
> > +     * number of nodes present in current NUMA systems is quite small.
> > +     * In fact, even if a sum of binomials is involved, if the system has
> > +     * up to 8 nodes it "only" takes 256 steps. At the same time, 8 is a
> > +     * sensible value, as it is exactly the number of nodes the biggest
> > +     * NUMA systems provide at the time of this writing (and will probably
> > +     * continue to do so for a while). However, computational complexity
> > +     * would explode on systems much bigger than that. A 16 node system would
> > +     * still be fine (it will be 65536 steps), but it's really importante we
>                                                                   important
> 
> Isn't this an argument that the limit should be 16 ?  (Also 65535
> steps since placement on 0 nodes is typically not an option.)
> 
Well, it is for me, but as we agreed on 8, I went for that. What I
wanted was just to make sure we (or whoever else) know/remember that 16
could work too, in case that turns out to be useful in the future. That
being said, if you agree on raising the cap to 16 right now, I'll be
glad to do that. :-D

> > +     * avoid trying to run this on monsters with 32, 64 or more nodes (if
> > +     * they ever pop into being). Therefore, here comes a safety catch
> > +     * that disables the algorithm for the cases when it wouldn't work well.
> > +     */
> > +    if (nr_nodes > 8) {
> > +        /* Log we did nothing and return 0, as no real error occurred */
> > +        LOG(WARN, "System has %d NUMA nodes, which is too big for the "
> > +                  "placement algorithm to work effectively. Skipping it",
> > +                  nr_nodes);
> 
> It would be nice if this message suggested pinning the vcpus.
>
> > +    if (min_nodes > nr_nodes)
> > +        min_nodes = nr_nodes;
> > +    if (!max_nodes || max_nodes > nr_nodes)
> > +        max_nodes = nr_nodes;
> > +    if (min_nodes > max_nodes) {
> > +        rc = ERROR_INVAL;
> > +        goto out;
> 
> This should be accompanied by a log message.
> 
Ok to both.
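
Concretely, I'm thinking of something like the below; the exact
message wording is just a sketch:

    LOG(WARN, "System has %d NUMA nodes, which is too big for the "
              "placement algorithm to work effectively. Skipping it; "
              "consider manually pinning the domain's vcpus",
        nr_nodes);

and:

    if (min_nodes > max_nodes) {
        LOG(ERROR, "Inconsistent minimum (%d) and maximum (%d) number "
                   "of placement nodes", min_nodes, max_nodes);
        rc = ERROR_INVAL;
        goto out;
    }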

Thanks and Regards,
Dario

-- 
<<This happens because I choose it to happen!>> (Raistlin Majere)
-----------------------------------------------------------------
Dario Faggioli, Ph.D, http://retis.sssup.it/people/faggioli
Senior Software Engineer, Citrix Systems R&D Ltd., Cambridge (UK)
