
Re: [Xen-devel] [PATCH 2/2] sched: credit2: consider per-vcpu soft affinity



On Sat, 2014-11-29 at 14:33 -1000, Justin T. Weaver wrote:
> when deciding which run queue to assign each vcpu to.
> 
> There are two main changes in functionality that this patch introduces.
> 
> First, in function runq_tickle, it tries to find idle pcpus in other run
> queues that the vcpu would prefer to run on (soft affinity) or can run on
> (hard affinity), in that order.
> 
I see. I don't think a function like runq_tickle() should be able to
move a vcpu to a different runqueue. That is done in other places, and I
would leave this as it is. Tickling usually happens right after a vcpu
has been put on a runqueue, which is something controlled by other parts
of the Credit2 algorithm and code, and that is by design. More
specifically, moving vcpus among runqueues is the job of the load
balancing logic.

Turning runq_tickle() into a potential "runqueue-changer" would mix the
two concepts of inter-runqueue and intra-runqueue balancing/migration up
to a point that I do not like. It also makes the code look rather
weird... E.g., look at when runq_tickle() is called in migrate(). In
there, right after going through:
 - __runq_remove();
 - __runq_deassign();
 - __runq_assign();
 - runq_insert();
we call runq_tickle() and... do it all over again!?!? :-O
Even just because of this, I'd say that, if such a change to
runq_tickle() were necessary (which I don't think it is), it should
happen differently, and should involve changing its various callers to
avoid the repetition.

OTOH, it looks like you are not touching the existing
for_each_cpu(i,&mask) loop, still inside runq_tickle() itself, while I'd
say that looking for a pcpu that the vcpu has soft affinity with, among
all the non-idle and non-tickled pcpus in the runqueue, really is
worthwhile. Actually, the same applies to idle pcpus in the runqueue,
i.e., to the code before the for_each_cpu() loop, still in
runq_tickle().

So, instead of adding to runq_tickle() the capability of moving vcpus
among runqueues, which is not its responsibility, I would add the usual
affinity balancing logic (i.e., the for_each_csched2_balance_step()
loop) to it.
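
To make this concrete, here is a rough sketch of the structure I have
in mind, reusing the helpers and variables from your patch (untested,
just to show the shape of it):

    for_each_csched2_balance_step( balance_step )
    {
        /* Skip the soft-affinity step if new has no meaningful one. */
        if ( balance_step == CSCHED2_BALANCE_SOFT_AFFINITY &&
             !__vcpu_has_soft_affinity(new->vcpu,
                                       new->vcpu->cpu_hard_affinity) )
            continue;

        csched2_balance_cpumask(new->vcpu, balance_step, &mask);

        /* Idle, non-tickled pcpus of *this* runqueue only. */
        cpumask_and(&mask, &mask, &rqd->idle);
        cpumask_andnot(&mask, &mask, &rqd->tickled);

        if ( !cpumask_empty(&mask) )
        {
            ipid = cpumask_any(&mask);
            goto tickle;
        }
    }

The existing non-idle for_each_cpu() scan, further down in the
function, would get the same treatment.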

Once we have this in place, and are able to benchmark it, we'll see
whether there is a need for making the function capable of more
advanced things (but even at that point, I doubt that changing the
runqueue of a vcpu would be one of them).

> Second, in function balance_load, if moving a vcpu with soft affinity from
> one run queue to another means moving it away from its soft affinity to hard
> affinity only, then that move is not considered for load balancing.
> 
This is a key part of the whole thing so, please, spend a few more words
explaining what your idea is and how it actually works, both here and
in the sources, with comments.

From what I've understood, I like the idea. However, why not use the
usual two-step approach? In the soft-affinity step, we won't consider
vcpus that would see their soft-affinity 'impaired' by being moved,
and, perhaps, we give a preference boost to vcpus that would see their
soft-affinity 'improved' by being moved. If we don't find any suitable
candidate, in the next (hard-affinity) step, we fall back to ignoring
soft-affinity, and only care about hard constraints.
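
In pseudo-C, what I'm thinking of looks more or less like this
(soft_aff_impaired() and candidate_found are made-up names, purely for
illustration):

    for_each_csched2_balance_step( balance_step )
    {
        list_for_each( iter, &st.lrqd->svc )
        {
            struct csched2_vcpu *svc =
                list_entry(iter, struct csched2_vcpu, rqd_elem);

            /*
             * In the soft-affinity step, don't consider vcpus whose
             * soft-affinity the move would 'impair' (more on what
             * 'impairing' could mean right below).
             */
            if ( balance_step == CSCHED2_BALANCE_SOFT_AFFINITY &&
                 soft_aff_impaired(svc, st.orqd) )
                continue;

            /* ... usual push/pull candidate evaluation ... */
        }

        /* Fall through to the hard-affinity step only if we must. */
        if ( candidate_found )
            break;
    }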

About 'impaired' and 'improved' soft-affinity: this is just an idea
popping out of my head right now, so bear with me. :-)
Affinity is to pcpus, while, in principle, here we are talking about
migration between runqueues, with each runqueue spanning a certain set
of pcpus. There seems to me to be room for defining something like the
"soft affinity of a vcpu to a runqueue", as, e.g., the number of pcpus
of that runqueue with which the vcpu has soft-affinity. This would
measure how likely the vcpus will run on a pcpu it has soft-affinity
with, if migrated to the runqueue in question. How do you like this?
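
I.e., something like this (assuming rqd->active is the mask of pcpus
belonging to the runqueue, and never mind the cpumask_t on stack, it's
just a sketch):

    static int soft_aff_to_runq(const struct vcpu *vc,
                                const struct csched2_runqueue_data *rqd)
    {
        cpumask_t mask;

        /* With how many of the runqueue's pcpus is vc soft-affine? */
        cpumask_and(&mask, vc->cpu_soft_affinity, &rqd->active);
        return cpumask_weight(&mask);
    }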

BTW, do you see now why, yesterday, I said I'd rather get hard affinity
in place before tackling soft affinity? :-P

> @@ -127,6 +127,15 @@
>  #define CSCHED2_CREDIT_RESET         0
>  /* Max timer: Maximum time a guest can be run for. */
>  #define CSCHED2_MAX_TIMER            MILLISECS(2)
> +/* Two step balancing logic; consider soft affinity first. */
> +#define CSCHED2_BALANCE_SOFT_AFFINITY 0
> +#define CSCHED2_BALANCE_HARD_AFFINITY 1
> +/* vcpu runq migration away from vcpu's soft affinity. */
> +#define CSCHED2_MIGRATE_SOFT_TO_HARD 0
> +/* vcpu runq migration to vcpu's soft affinity. */
> +#define CSCHED2_MIGRATE_HARD_TO_SOFT 1
> +/* vcpu runq migration soft to soft or vcpu has no soft affinity. */
> +#define CSCHED2_MIGRATE_ANY_TO_ANY   2
>  
> 
>  #define CSCHED2_IDLE_CREDIT                 (-(1<<30))
> @@ -176,6 +185,8 @@ integer_param("sched_credit2_migrate_resist", 
> opt_migrate_resist);
>  #define c2r(_ops, _cpu)     (CSCHED2_PRIV(_ops)->runq_map[(_cpu)])
>  /* CPU to runqueue struct macro */
>  #define RQD(_ops, _cpu)     (&CSCHED2_PRIV(_ops)->rqd[c2r(_ops, _cpu)])
> +#define for_each_csched2_balance_step(step) \
> +    for ( (step) = 0; (step) <= CSCHED2_BALANCE_HARD_AFFINITY; (step)++ )
>
This is identical (apart from the CSCHED2 prefix) to what we have in
Credit1. I wonder whether it wouldn't make sense to factor it out in a
common header.

I'm saying I wonder because I am actually not sure. As things stand, it
makes perfect sense. Even looking forward, I don't see why the two
schedulers (as well as any other scheduler wanting to support
affinities) should do things differently from each other.

The only doubt I have is not about the structure (i.e., the loop, etc.)
but about the fact that a scheduler may perhaps want to add more steps,
or things like this.

If I had to decide now, I'd say let's factor this out, and when --if
ever-- someone with different needs comes along, we'll see what to do.
Thoughts?
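
Concretely, I'd picture something like this in, say,
xen/include/xen/sched-if.h (names just a suggestion):

    #define BALANCE_SOFT_AFFINITY    0
    #define BALANCE_HARD_AFFINITY    1

    #define for_each_sched_balance_step(step) \
        for ( (step) = 0; (step) <= BALANCE_HARD_AFFINITY; (step)++ )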

> +/*
> + * A vcpu has meaningful soft affinity if...
> + * - its soft affinity mask is not full, and
> + * - the passed in mask (usually its hard affinity mask) intersects
> + *   with its soft affinity mask
> + */
> +static inline int __vcpu_has_soft_affinity(const struct vcpu *vc,
> +                                           const cpumask_t *mask)
> +{
> +    if ( cpumask_full(vc->cpu_soft_affinity) ||
> +        !cpumask_intersects(vc->cpu_soft_affinity, mask) )
> +        return 0;
> +
> +    return 1;
> +}
>
About this, I have far fewer doubts when asking you to factor it out
of sched_credit.c and use it here. :-)
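
I.e., move it (mostly verbatim) to the same common header, with a
scheduler-neutral name, e.g.:

    static inline int vcpu_has_soft_affinity(const struct vcpu *vc,
                                             const cpumask_t *mask)
    {
        return !cpumask_full(vc->cpu_soft_affinity) &&
               cpumask_intersects(vc->cpu_soft_affinity, mask);
    }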

> +static void
> +csched2_balance_cpumask(const struct vcpu *vc, int step, cpumask_t *mask)
> +{
> +    if ( step == CSCHED2_BALANCE_SOFT_AFFINITY )
> +    {
> +        cpumask_and(mask, vc->cpu_soft_affinity, vc->cpu_hard_affinity);
> +
> +        if ( unlikely(cpumask_empty(mask)) )
> +            cpumask_copy(mask, vc->cpu_hard_affinity);
> +    }
> +    else /* step == CSCHED2_BALANCE_HARD_AFFINITY */
> +        cpumask_copy(mask, vc->cpu_hard_affinity);
> +}
>
Same here.

Oh, and also have a look at how this is used, in sched_credit.c, in
conjunction with the per-csched_pcpu variable balance_mask.

As I said yesterday, having too many cpumask_t variables on the stack
should be avoided, and that is how I managed to comply with this when
working on Credit1.

There are, for sure, more than a few places where you can do the same, or
something very similar.
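
Since Credit2, unlike Credit1, has no per-pcpu struct to stash such a
mask in, a per-cpu scratch variable would do. A sketch (naming is,
again, just a suggestion):

    /* Per-pcpu scratch mask, so callers need no cpumask_t on stack. */
    static DEFINE_PER_CPU(cpumask_t, balance_mask);

    #define csched2_balance_mask \
        (&per_cpu(balance_mask, smp_processor_id()))

Then callers would do csched2_balance_cpumask(vc, step,
csched2_balance_mask), with no cpumask_t on their stacks.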

> @@ -513,6 +559,68 @@ runq_tickle(const struct scheduler *ops, unsigned int 
> cpu, struct csched2_vcpu *
>          goto tickle;
>      }
>  
> +    /* Look for idle cpus in other runqs; consider soft affinity first. */
> +    for_each_csched2_balance_step( balance_step )
> +    {
> +        cpumask_t balance_mask;
> +
> +        /* Skip the soft affinity balance step if new doesn't have any. */
> +        if ( balance_step == CSCHED2_BALANCE_SOFT_AFFINITY &&
> +            !__vcpu_has_soft_affinity(
> +                new->vcpu, new->vcpu->cpu_hard_affinity) )
> +            continue;
> +
> +        /* Skip this step if can't get a lock on the credit2 private data. */
> +        if ( !prv_lock_held || !spin_trylock(&prv->lock) )
> +            continue;
> +        prv_lock_held = 1;
> +
> +        csched2_balance_cpumask(new->vcpu, balance_step, &balance_mask);
> +
> +        for_each_cpu(i, &prv->active_queues)
> +        {
> +            struct csched2_runqueue_data *temp_rqd;
> +
> +            temp_rqd = prv->rqd + i;
> +
> +            if ( temp_rqd == rqd || !spin_trylock(&temp_rqd->lock) )
> +                continue;
> +
> +            /* Find idle cpus in the balance mask that are not tickled. */
> +            cpumask_andnot(&mask, &temp_rqd->idle, &temp_rqd->tickled);
> +            cpumask_and(&mask, &mask, &balance_mask);
> +
> +            if ( !cpumask_empty(&mask) )
> +            {
> +                /* Found an idle cpu on another run queue; move new. */
> +                s_time_t now = 0;
> +
> +                ipid = cpumask_any(&mask);
> +                new->vcpu->processor = ipid;
> +                __runq_remove(new);
> +                now = NOW();
> +                update_load(ops, rqd, new, -1, now);
> +                __runq_deassign(new);
> +                __runq_assign(new, temp_rqd);
> +                update_load(ops, temp_rqd, new, 1, now);
> +                runq_insert(ops, ipid, new);
> +                cpumask_set_cpu(ipid, &temp_rqd->tickled);
> +                cpu_raise_softirq(ipid, SCHEDULE_SOFTIRQ);
> +
> +                spin_unlock(&temp_rqd->lock);
> +                spin_unlock(&prv->lock);
> +                return;
> +            }
> +            else
> +                /* No suitable idlers found in runq temp_rqd. */
> +                spin_unlock(&temp_rqd->lock);
> +        }
> +
> +        if ( prv_lock_held && balance_step == CSCHED2_BALANCE_HARD_AFFINITY )
> +            /* No suitable other-runq idlers found; unlock private data. */
> +            spin_unlock(&prv->lock);
> +    }
> +
>
I commented about this above.

BTW, look at the code you'd have to add to make this function behave as
you want, especially the locking, and I'm sure you'll concur that all
of that does not belong here.

Where I would put something similar to the above is in choose_cpu()
below!

> @@ -1039,12 +1147,22 @@ choose_cpu(const struct scheduler *ops, struct vcpu 
> *vc)
>  {
>      struct csched2_private *prv = CSCHED2_PRIV(ops);
>      int i, min_rqi = -1, new_cpu;
> +    int max_soft_cpus = 0, max_soft_cpus_rqi = -1;
> +    bool_t consider_soft_affinity = 0;
>      struct csched2_vcpu *svc = CSCHED2_VCPU(vc);
>      s_time_t min_avgload;
> -    cpumask_t temp_mask;
> +    cpumask_t temp_mask, vc_soft_affinity;
>  
>      BUG_ON(cpumask_empty(&prv->active_queues));
>  
> +    /* Consider soft affinity in the cpu decision? */
> +    if ( __vcpu_has_soft_affinity(vc, vc->cpu_hard_affinity) )
> +    {
> +        consider_soft_affinity = 1;
> +        cpumask_and(&vc_soft_affinity, vc->cpu_soft_affinity,
> +            vc->cpu_hard_affinity);
> +    }
> +
>
Why "black and white"? Why not using for_each_csched2_balance_step() and
differentiate which one affinity to consider depending on the step,
rather than here and for all?

> @@ -1112,11 +1237,15 @@ choose_cpu(const struct scheduler *ops, struct vcpu 
> *vc)
>  
>      min_avgload = MAX_LOAD;
>  
> -    /* Find the runqueue with the lowest instantaneous load */
> +    /*
> +     * Find the run queue with the most cpus in vc's soft affinity,
> +     * or the lowest instantaneous load if not considering soft affinity.
> +     */
>
Exactly my point: if the vcpu does not have any soft-affinity then, as
usual, the soft-affinity balancing step must be skipped, and the
behaviour should fall back to finding the runq with the lowest
instantaneous load, taking only hard affinity constraints into account.

And if the vcpu does have a soft-affinity, why forget completely about
instantaneous load? I like the idea of "the most cpus in vc's soft
affinity" (it's actually really similar to what I was talking about at
the beginning of this message), but why can't we combine the two things?
More importantly, why does it have to be just an option, rather than the
first step of the soft/hard balancing loop?
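
E.g., very roughly, inside the soft-affinity step of the runqueue scan
(soft_aff_to_runq() being the hypothetical helper sketched above, and
b_avgload standing in for whatever load metric we settle on):

    int soft_cpus = soft_aff_to_runq(vc, rqd);

    /* Most soft-affine pcpus wins; break ties on instantaneous load. */
    if ( soft_cpus > max_soft_cpus ||
         (soft_cpus == max_soft_cpus && rqd->b_avgload < min_avgload) )
    {
        max_soft_cpus = soft_cpus;
        min_avgload = rqd->b_avgload;
        min_rqi = i;
    }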

Regards,
Dario

-- 
<<This happens because I choose it to happen!>> (Raistlin Majere)
-----------------------------------------------------------------
Dario Faggioli, Ph.D, http://about.me/dario.faggioli
Senior Software Engineer, Citrix Systems R&D Ltd., Cambridge (UK)
