Re: [Xen-devel] [PATCH 07 of 10 [RFC]] sched_credit: Let the scheduler know about `node affinity`
On 11/04/12 14:17, Dario Faggioli wrote:
> Basic idea is: cpu-affinity tells where a vcpu can run, while node-affinity
> tells where (in terms of on what NUMA nodes, which of course implies a set
> of CPUs) all the vcpus of the domain should/prefer to run... Problems start
> when you have both of them speaking at the same time!
>
> Respecting vcpu-affinity should remain the primary concern, thus what we do
> here is add some two-step logic here and there in the scheduler (i.e.,
> wherever vcpu-affinity related decisions are being undertaken). During the
> first step, we check the `&&` of the vcpu and node affinity masks, aiming
> at giving some precedence to the CPUs that are both suitable and preferable
> for the domain. However, if that fails to find a valid CPU, the
> node-affinity is just ignored and, in the second step, we fall back to
> plain vcpu-affinity, as the original behaviour was.
>
> Both the introduced overhead and the benefits this provides have to be
> investigated and compared against each other, possibly in varying
> conditions and running different workloads.
>
> Finally, although still at the RFC level, efforts have been made to write
> the code in a flexible enough fashion so that, if we ever want to introduce
> further balancing steps, it shouldn't be too much of a pain.
>
> TODO: * Better investigate and test interactions with cpupools.
>       * Test, verify and benchmark. Then test, verify and benchmark again,
>         and again and again. What I know as of now is that it does not
>         explode that easily, but whether it works properly and, more
>         importantly, brings any benefit has to be proven (and yes, I'm
>         already out running these tests and benchmarks, but do not hesitate
>         to manifest your will for helping with that :-D).
>       * Add some counters/stats, e.g., to serve the purpose outlined right
>         above.
>
> XXX I'm benchmarking this right now, and peeking at the results they don't
>     seem too bad. Numbers are mostly close to the case where the VM is
>     already pinned to a node when created. I'll post the results as soon as
>     I can, and I'll be happy to investigate any issue, if you feel like the
>     approach could be the right one.

Hey Dario,

Sorry for the long delay in reviewing this. Overall I think the approach is good. A few points:

> +/*
> + * Node Balancing
> + */
> +#define CSCHED_BALANCE_NODE_AFFINITY    1
> +#define CSCHED_BALANCE_CPU_AFFINITY     0
> +#define CSCHED_BALANCE_START_STEP       CSCHED_BALANCE_NODE_AFFINITY
> +#define CSCHED_BALANCE_END_STEP         CSCHED_BALANCE_CPU_AFFINITY

This thing of defining "START_STEP" and "END_STEP" seems a bit fragile. I think it would be better to always start at 0, and go until CSCHED_BALANCE_MAX.

> +    /*
> +     * Let's cache domain's dom->node_affinity here as an
> +     * optimization for a couple of hot paths. In fact,
> +     * knowing whether or not dom->node_affinity has changed
> +     * would allow us to avoid rebuilding node_affinity_cpumask
> +     * (below) during node balancing and/or scheduling.
> +     */
> +    nodemask_t node_affinity_cache;
> +    /* Basing on what dom->node_affinity says,
> +     * on what CPUs would we like to run most? */
> +    cpumask_t node_affinity_cpumask;

I think the comments here need to be more clear. The main points are:
 * node_affinity_cpumask is dom->node_affinity translated from a nodemask
   into a cpumask;
 * because doing the nodemask -> cpumask translation may be expensive,
   node_affinity_cache stores the last translated value, so we can avoid
   doing the translation if nothing has changed.
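Something like the following is what I have in mind for the caching
semantics. It is a completely untested, standalone toy sketch -- the names
(toy_dom, toy_nodes_to_cpus, toy_build_balance_cpumask) and the fixed
8-node / 8-cpus-per-node topology are made up for illustration, not the
Xen cpumask/nodemask API or the actual patch code:

/*
 * Toy model: remember the last nodemask we translated, and only rebuild
 * the cpumask when node_affinity has actually changed.
 */
#include <stdint.h>
#include <stdio.h>

#define CPUS_PER_NODE 8                 /* made-up topology */

typedef uint8_t  toy_nodemask_t;        /* one bit per node (8 nodes) */
typedef uint64_t toy_cpumask_t;         /* one bit per cpu (64 cpus)  */

struct toy_dom {
    toy_nodemask_t node_affinity;          /* what d->node_affinity would be */
    toy_nodemask_t node_affinity_cache;    /* last nodemask we translated    */
    toy_cpumask_t  node_affinity_cpumask;  /* cached translation result      */
};

/* The (potentially expensive) nodemask -> cpumask translation. */
static toy_cpumask_t toy_nodes_to_cpus(toy_nodemask_t nodes)
{
    toy_cpumask_t cpus = 0;

    for ( int node = 0; node < 8; node++ )
        if ( nodes & (1u << node) )
            cpus |= (toy_cpumask_t)0xff << (node * CPUS_PER_NODE);
    return cpus;
}

/* Only redo the translation if node_affinity changed since last time. */
static void toy_build_balance_cpumask(struct toy_dom *d)
{
    if ( d->node_affinity == d->node_affinity_cache )
        return;                                /* cache hit, nothing to do */
    d->node_affinity_cache   = d->node_affinity;
    d->node_affinity_cpumask = toy_nodes_to_cpus(d->node_affinity);
}

int main(void)
{
    struct toy_dom d = { .node_affinity = 0x3 };   /* nodes 0 and 1 */

    toy_build_balance_cpumask(&d);
    printf("node_affinity_cpumask = %#llx\n",
           (unsigned long long)d.node_affinity_cpumask);   /* 0xffff */
    return 0;
}

If the comments said essentially what the two bullets above say, I think a
reader would get the point straight away.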
> +#define csched_balance(__step) \
> +    for ( (__step) = CSCHED_BALANCE_START_STEP; \
> +          (__step) >= CSCHED_BALANCE_END_STEP; \
> +          (__step)-- )

I don't like this. For one, it's fragile; if for some reason you switched
NODE_AFFINITY and CPU_AFFINITY above, suddenly this loop would go the wrong
direction. And I don't think there's any reason not to just use
"for(step=0; step < CSCHED_BALANCE_MAX; step++)" -- it's not ugly, and it
means you know exactly what's going on without having to go search for the
macro.

> +/*
> + * Sort-of conversion between node-affinity and vcpu-affinity for the domain,
> + * i.e., a cpumask containing all the cpus from all the set nodes in the
> + * node-affinity mask of the domain.

This needs to be clearer -- vcpu-affinity doesn't have anything to do with
this function, and there's nothing "sort-of" about the conversion. :-)
I think you mean to say, "Create a cpumask from the node affinity mask."

> + *
> + * Notice that we completely forget about serializing this (both here and
> + * in the various sites where node_affinity_cpumask is used) against
> + * d->node_affinity_lock. That would be hard to do properly, as that lock
> + * is a non IRQ-safe one, and it would be nearly impossible to access it
> + * from the scheduling code. However, although this leaves some room for
> + * races with code paths that update d->node_affinity, it all should still
> + * be fine, considering:
> + *  (1) d->node_affinity updates are going to be quite rare;
> + *  (2) this balancing logic is kind-of "best effort" anyway;
> + *  (3) given (1), any inconsistencies are likely to be fixed by the next
> + *      call to this same function without risking going into instability.
> + *
> + * XXX Validate (via testing/benchmarking) whether this is true, as it
> + *     likely sounds to be, or it causes unexpected issues.

Ack.

> +/* Put together the cpumask we are going to use for this balancing step */
> +static int
> +csched_balance_cpumask(const struct vcpu *vc, int step, cpumask_t *mask)
> +{
> +    struct domain *d = vc->domain;
> +    struct csched_dom *sdom = CSCHED_DOM(d);
> +
> +    /*
> +     * Only two possible steps exist for now: node and vcpu affinity.
> +     * If this domain does not have a node-affinity or is affine to all the
> +     * nodes, just return the vcpu-affinity mask (for *this* vcpu) and
> +     * stop the balancing process.
> +     */
> +    if ( !d->has_node_affinity || nodes_full(d->node_affinity) ||
> +         step == CSCHED_BALANCE_CPU_AFFINITY )
> +    {
> +        cpumask_copy(mask, vc->cpu_affinity);
> +        return CSCHED_BALANCE_CPU_AFFINITY;
> +    }

Hmm -- this mechanism seems kind of fragile. It seems like returning -1 or
something, and having the caller call "continue", would be easier for
future coders (potentially you or I) to reason about.
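To make the for()-loop and the "return -1 / continue" suggestion concrete,
here is roughly the shape I have in mind. Again, this is a completely
untested, standalone toy sketch: toy_balance_cpumask, toy_pick_cpu and the
plain 64-bit masks are made-up stand-ins, not the real cpumask_t or the
code in the patch:

#include <stdint.h>
#include <stdio.h>

enum {
    BALANCE_NODE_AFFINITY = 0,
    BALANCE_CPU_AFFINITY,
    BALANCE_MAX
};

typedef uint64_t toy_cpumask_t;    /* one bit per cpu */

struct toy_vcpu {
    toy_cpumask_t cpu_affinity;           /* where the vcpu is allowed to run */
    toy_cpumask_t node_affinity_cpumask;  /* where it would prefer to run     */
    int has_node_affinity;
};

/*
 * Build the mask for this balancing step; return -1 if the step is
 * pointless (no node-affinity set), so the caller can just 'continue'.
 */
static int toy_balance_cpumask(const struct toy_vcpu *v, int step,
                               toy_cpumask_t *mask)
{
    if ( step == BALANCE_NODE_AFFINITY )
    {
        if ( !v->has_node_affinity )
            return -1;
        *mask = v->cpu_affinity & v->node_affinity_cpumask;
        return 0;
    }
    *mask = v->cpu_affinity;
    return 0;
}

static int toy_pick_cpu(const struct toy_vcpu *v, toy_cpumask_t online)
{
    for ( int step = 0; step < BALANCE_MAX; step++ )
    {
        toy_cpumask_t mask;

        if ( toy_balance_cpumask(v, step, &mask) < 0 )
            continue;                 /* skip steps that cannot help */

        mask &= online;
        if ( mask )                   /* lowest set bit = chosen cpu */
            return __builtin_ctzll(mask);   /* GCC/clang builtin */
        /* otherwise fall through to the next, less restrictive, step */
    }
    return -1;                        /* no suitable cpu at all */
}

int main(void)
{
    struct toy_vcpu v = {
        .cpu_affinity          = 0xff,   /* cpus 0-7 */
        .node_affinity_cpumask = 0xf0,   /* prefers cpus 4-7 */
        .has_node_affinity     = 1,
    };

    /* Only cpus 0-3 online: the node step finds nothing, the cpu step
     * falls back and picks cpu 0 -- the two-step logic from the changelog. */
    printf("picked cpu %d\n", toy_pick_cpu(&v, 0x0f));
    return 0;
}

The loop direction can no longer go wrong, and a reader of the caller sees
immediately that a step can simply be skipped.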
> +    /*
> +     * XXX As of now (with only two possible steps), this is easy and
> +     *     readable enough. If more steps become necessary at some point in
> +     *     time, we can keep an array of cpumask_t in dom_data and return
> +     *     the proper element via step-indexing such an array. Of course,
> +     *     we can turn it into that even right now... Thoughts?
> +     */

Beware of early optimization. :-) Just make sure to mark this down as
something to look at when profiling.

> +static inline void
> +__cpumask_tickle(cpumask_t *mask, const cpumask_t *idle_mask)
> +{
> +    CSCHED_STAT_CRANK(tickle_idlers_some);
> +    if ( opt_tickle_one_idle )
> +    {
> +        this_cpu(last_tickle_cpu) =
> +            cpumask_cycle(this_cpu(last_tickle_cpu), idle_mask);
> +        cpumask_set_cpu(this_cpu(last_tickle_cpu), mask);
> +    }
> +    else
> +        cpumask_or(mask, mask, idle_mask);
> +}

I don't see any reason to make this into a function -- it's only called
once, and it's not that long. Unless you're concerned about too many
indentations making the lines too short?

> +    csched_balance(balance_step)
> +    {

Again, I'd make this a for() loop.

>      sdom->dom = dom;
> +    /*
> +     * XXX This would be 'The Right Thing', but as it is still too
> +     *     early and d->node_affinity has not settled yet, maybe we
> +     *     can just init the two masks with something like all-nodes
> +     *     and all-cpus and rely on the first balancing call for
> +     *     having them updated?
> +     */
> +    csched_build_balance_cpumask(sdom);

We might as well do what you've got here, unless it's likely to produce
garbage. This isn't exactly a hot path. :-)
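For what it's worth, the "all-nodes / all-cpus and let the first balancing
call refresh them" variant from the XXX above would look something like
this -- once more an untested toy sketch with made-up types, not the real
csched_dom_init():

#include <stdint.h>

typedef uint8_t  toy_nodemask_t;        /* one bit per node */
typedef uint64_t toy_cpumask_t;         /* one bit per cpu  */

struct toy_csched_dom {
    toy_nodemask_t node_affinity_cache;    /* last nodemask we translated */
    toy_cpumask_t  node_affinity_cpumask;  /* cached nodemask -> cpumask  */
};

static void toy_csched_dom_init(struct toy_csched_dom *sdom)
{
    /*
     * d->node_affinity has not settled yet at this point, so start from
     * the most permissive values; if node_affinity later turns out to be
     * narrower, the first balancing call sees the mismatch against the
     * cache and rebuilds node_affinity_cpumask.
     */
    sdom->node_affinity_cache   = (toy_nodemask_t)~0u;  /* all nodes */
    sdom->node_affinity_cpumask = ~(toy_cpumask_t)0;    /* all cpus  */
}

int main(void)
{
    struct toy_csched_dom sdom;

    toy_csched_dom_init(&sdom);
    return sdom.node_affinity_cache == 0xff ? 0 : 1;
}

Either way is fine by me; as I said, this isn't a hot path.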
Other than that, looks good -- Thanks!

 -George

_______________________________________________
Xen-devel mailing list
Xen-devel@xxxxxxxxxxxxx
http://lists.xen.org/xen-devel