
Re: [Xen-devel] [PATCH 2/2] sched: credit2: consider per-vcpu soft affinity



On Sat, 2014-11-29 at 14:33 -1000, Justin T. Weaver wrote:
> when deciding which run queue to assign each vcpu to.
> 
> There are two main changes in functionality that this patch introduces.
> 
> First, in function runq_tickle, it tries to find idle pcpus in other run
> queues that the vcpu would prefer to run on (soft affinity) or can run on
> (hard affinity), in that order.
> 
I see. I don't think a function like runq_tickle() should be able to
move a vcpu to a different runqueue. That is done in other places, and I
would leave this as it is. Tickling usually happens right after a vcpu
has been put on a runqueue, which is something controlled by other parts
of the Credit2 algorithm and code, and that is by design. More
specifically, moving vcpus among runqueues is the job of the load
balancing logic.

Turning runq_tickle() into a potential "runqueue-changer" would mix the
two concepts of inter-runqueue and intra-runqueue balancing/migration up
to a point that I do not like. It also makes the code look rather
weird... E.g., look at when runq_tickle() is called in migrate(). In
there, right after going through:
 - __runq_remove();
 - __runq_deassign();
 - __runq_assign();
 - runq_insert();
we call runq_tickle() and... do it all over again!?!? :-O
Even just because of this, I'd say that, if such a change to
runq_tickle() were necessary (which I don't think it is), it should
happen differently, and should involve changing its various callers to
avoid the repetition.

OTOH, it looks like you are not touching the existing
for_each_cpu(i,&mask) loop, still inside runq_tickle() itself, while I'd
say that looking for a pcpu that the vcpu has soft affinity with, among
all the non-idle and non-tickled pcpus in the runqueue, really is
worthwhile. Actually, the same applies to idle pcpus in the runqueue,
i.e., to the code before the for_each_cpu() loop, still in
runq_tickle().

So, instead of adding to runq_tickle() the capability of moving vcpus
among runqueues, which is not its responsibility, I would add the usual
affinity balancing logic (i.e., the for_each_csched2_balance_step()
loop) to it.
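
To make this concrete, here is a rough sketch of the structure I have
in mind, reusing the helpers and variables from your patch (untested,
just to show the shape of it):

    for_each_csched2_balance_step( balance_step )
    {
        /* Skip the soft-affinity step if new has no meaningful one. */
        if ( balance_step == CSCHED2_BALANCE_SOFT_AFFINITY &&
             !__vcpu_has_soft_affinity(new->vcpu,
                                       new->vcpu->cpu_hard_affinity) )
            continue;

        csched2_balance_cpumask(new->vcpu, balance_step, &mask);

        /* Idle, non-tickled pcpus of *this* runqueue only. */
        cpumask_and(&mask, &mask, &rqd->idle);
        cpumask_andnot(&mask, &mask, &rqd->tickled);

        if ( !cpumask_empty(&mask) )
        {
            ipid = cpumask_any(&mask);
            goto tickle;
        }
    }

The existing non-idle for_each_cpu() scan, further down in the
function, would get the same treatment.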

Once we have this in place, and are able to benchmark it, we'll see
whether there is a need for making the function capable of more
advanced things (but even at that point, I doubt that changing the
runqueue of a vcpu would be one of them).

> Second, in function balance_load, if moving a vcpu with soft affinity from
> one run queue to another means moving it away from its soft affinity to hard
> affinity only, then that move is not considered for load balancing.
> 
This is a key part of the whole thing so, please, spend a few more words
explaining what your idea is and how it actually works, both here and
in the sources, with comments.

From what I've understood, I like the idea. However, why not use the
usual two-step approach? In the soft-affinity step, we won't consider
vcpus that would see their soft-affinity 'impaired' by being moved,
and, perhaps, we give a preference boost to vcpus that would see their
soft-affinity 'improved' by being moved. If we don't find any suitable
candidate, in the next (hard-affinity) step, we fall back to ignoring
soft-affinity, and only care about hard constraints.
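
In pseudo-C, what I'm thinking of looks more or less like this
(soft_aff_impaired() and candidate_found are made-up names, purely for
illustration):

    for_each_csched2_balance_step( balance_step )
    {
        list_for_each( iter, &st.lrqd->svc )
        {
            struct csched2_vcpu *svc =
                list_entry(iter, struct csched2_vcpu, rqd_elem);

            /*
             * In the soft-affinity step, don't consider vcpus whose
             * soft-affinity the move would 'impair' (more on what
             * 'impairing' could mean right below).
             */
            if ( balance_step == CSCHED2_BALANCE_SOFT_AFFINITY &&
                 soft_aff_impaired(svc, st.orqd) )
                continue;

            /* ... usual push/pull candidate evaluation ... */
        }

        /* Fall through to the hard-affinity step only if we must. */
        if ( candidate_found )
            break;
    }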

About 'impaired' and 'improved' soft-affinity: this is just an idea
popping out of my head right now, so bear with me. :-)
Affinity is to pcpus, while, in principle, here we are talking about
migration between runqueues, with each runqueue spanning a certain set
of pcpus. There seems to me to be room for defining something like the
"soft affinity of a vcpu to a runqueue", as, e.g., the number of pcpus
of that runqueue with which the vcpu has soft-affinity. This would
measure how likely the vcpus will run on a pcpu it has soft-affinity
with, if migrated to the runqueue in question. How do you like this?
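
I.e., something like this (assuming rqd->active is the mask of pcpus
belonging to the runqueue, and never mind the cpumask_t on stack, it's
just a sketch):

    static int soft_aff_to_runq(const struct vcpu *vc,
                                const struct csched2_runqueue_data *rqd)
    {
        cpumask_t mask;

        /* With how many of the runqueue's pcpus is vc soft-affine? */
        cpumask_and(&mask, vc->cpu_soft_affinity, &rqd->active);
        return cpumask_weight(&mask);
    }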

BTW, do you see now why, yesterday, I said I'd rather get hard affinity
in place before tackling soft affinity? :-P

> @@ -127,6 +127,15 @@
>  #define CSCHED2_CREDIT_RESET         0
>  /* Max timer: Maximum time a guest can be run for. */
>  #define CSCHED2_MAX_TIMER            MILLISECS(2)
> +/* Two step balancing logic; consider soft affinity first. */
> +#define CSCHED2_BALANCE_SOFT_AFFINITY 0
> +#define CSCHED2_BALANCE_HARD_AFFINITY 1
> +/* vcpu runq migration away from vcpu's soft affinity. */
> +#define CSCHED2_MIGRATE_SOFT_TO_HARD 0
> +/* vcpu runq migration to vcpu's soft affinity. */
> +#define CSCHED2_MIGRATE_HARD_TO_SOFT 1
> +/* vcpu runq migration soft to soft or vcpu has no soft affinity. */
> +#define CSCHED2_MIGRATE_ANY_TO_ANY   2
>  
> 
>  #define CSCHED2_IDLE_CREDIT                 (-(1<<30))
> @@ -176,6 +185,8 @@ integer_param("sched_credit2_migrate_resist", 
> opt_migrate_resist);
>  #define c2r(_ops, _cpu)     (CSCHED2_PRIV(_ops)->runq_map[(_cpu)])
>  /* CPU to runqueue struct macro */
>  #define RQD(_ops, _cpu)     (&CSCHED2_PRIV(_ops)->rqd[c2r(_ops, _cpu)])
> +#define for_each_csched2_balance_step(step) \
> +    for ( (step) = 0; (step) <= CSCHED2_BALANCE_HARD_AFFINITY; (step)++ )
>
This is identical (apart from the CSCHED2 prefix) to what we have in
Credit1. I wonder whether it wouldn't make sense to factor it out in a
common header.

I'm saying I wonder because I am actually not sure. As things stand, it
makes perfect sense. Even looking forward, I don't see why the two
schedulers (as well as any other scheduler wanting to support
affinities) should do things differently from each other.

The only doubt I have is not about the structure (i.e., the loop, etc.)
but about the fact that a scheduler may perhaps want to add more steps,
or things like this.

If I had to decide now, I'd say let's factor this out, and when --if
ever-- someone with different needs comes along, we'll see what to do.
Thoughts?
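
Concretely, I'd picture something like this in, say,
xen/include/xen/sched-if.h (names just a suggestion):

    #define BALANCE_SOFT_AFFINITY    0
    #define BALANCE_HARD_AFFINITY    1

    #define for_each_sched_balance_step(step) \
        for ( (step) = 0; (step) <= BALANCE_HARD_AFFINITY; (step)++ )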

> +/*
> + * A vcpu has meaningful soft affinity if...
> + * - its soft affinity mask is not full, and
> + * - the passed in mask (usually its hard affinity mask) intersects
> + *   with its soft affinity mask
> + */
> +static inline int __vcpu_has_soft_affinity(const struct vcpu *vc,
> +                                           const cpumask_t *mask)
> +{
> +    if ( cpumask_full(vc->cpu_soft_affinity) ||
> +        !cpumask_intersects(vc->cpu_soft_affinity, mask) )
> +        return 0;
> +
> +    return 1;
> +}
>
About this, I have far fewer doubts when asking you to factor it out
of sched_credit.c and use it here. :-)
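
I.e., move it (mostly verbatim) to the same common header, with a
scheduler-neutral name, e.g.:

    static inline int vcpu_has_soft_affinity(const struct vcpu *vc,
                                             const cpumask_t *mask)
    {
        return !cpumask_full(vc->cpu_soft_affinity) &&
               cpumask_intersects(vc->cpu_soft_affinity, mask);
    }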

> +static void
> +csched2_balance_cpumask(const struct vcpu *vc, int step, cpumask_t *mask)
> +{
> +    if ( step == CSCHED2_BALANCE_SOFT_AFFINITY )
> +    {
> +        cpumask_and(mask, vc->cpu_soft_affinity, vc->cpu_hard_affinity);
> +
> +        if ( unlikely(cpumask_empty(mask)) )
> +            cpumask_copy(mask, vc->cpu_hard_affinity);
> +    }
> +    else /* step == CSCHED2_BALANCE_HARD_AFFINITY */
> +        cpumask_copy(mask, vc->cpu_hard_affinity);
> +}
>
Same here.

Oh, and also have a look at how this is used, in sched_credit.c, in
conjunction with the per-csched_pcpu variable balance_mask.

As I said yesterday, having too many cpumask_t variables on the stack
should be avoided, and that is how I managed to comply with this when
working on Credit1.

There are, for sure, more than a few places where you can do the same, or
something very similar.
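
Since Credit2, unlike Credit1, has no per-pcpu struct to stash such a
mask in, a per-cpu scratch variable would do. A sketch (naming is,
again, just a suggestion):

    /* Per-pcpu scratch mask, so callers need no cpumask_t on stack. */
    static DEFINE_PER_CPU(cpumask_t, balance_mask);

    #define csched2_balance_mask \
        (&per_cpu(balance_mask, smp_processor_id()))

Then callers would do csched2_balance_cpumask(vc, step,
csched2_balance_mask), with no cpumask_t on their stacks.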

> @@ -513,6 +559,68 @@ runq_tickle(const struct scheduler *ops, unsigned int 
> cpu, struct csched2_vcpu *
>          goto tickle;
>      }
>  
> +    /* Look for idle cpus in other runqs; consider soft affinity first. */
> +    for_each_csched2_balance_step( balance_step )
> +    {
> +        cpumask_t balance_mask;
> +
> +        /* Skip the soft affinity balance step if new doesn't have any. */
> +        if ( balance_step == CSCHED2_BALANCE_SOFT_AFFINITY &&
> +            !__vcpu_has_soft_affinity(
> +                new->vcpu, new->vcpu->cpu_hard_affinity) )
> +            continue;
> +
> +        /* Skip this step if can't get a lock on the credit2 private data. */
> +        if ( !prv_lock_held || !spin_trylock(&prv->lock) )
> +            continue;
> +        prv_lock_held = 1;
> +
> +        csched2_balance_cpumask(new->vcpu, balance_step, &balance_mask);
> +
> +        for_each_cpu(i, &prv->active_queues)
> +        {
> +            struct csched2_runqueue_data *temp_rqd;
> +
> +            temp_rqd = prv->rqd + i;
> +
> +            if ( temp_rqd == rqd || !spin_trylock(&temp_rqd->lock) )
> +                continue;
> +
> +            /* Find idle cpus in the balance mask that are not tickled. */
> +            cpumask_andnot(&mask, &temp_rqd->idle, &temp_rqd->tickled);
> +            cpumask_and(&mask, &mask, &balance_mask);
> +
> +            if ( !cpumask_empty(&mask) )
> +            {
> +                /* Found an idle cpu on another run queue; move new. */
> +                s_time_t now = 0;
> +
> +                ipid = cpumask_any(&mask);
> +                new->vcpu->processor = ipid;
> +                __runq_remove(new);
> +                now = NOW();
> +                update_load(ops, rqd, new, -1, now);
> +                __runq_deassign(new);
> +                __runq_assign(new, temp_rqd);
> +                update_load(ops, temp_rqd, new, 1, now);
> +                runq_insert(ops, ipid, new);
> +                cpumask_set_cpu(ipid, &temp_rqd->tickled);
> +                cpu_raise_softirq(ipid, SCHEDULE_SOFTIRQ);
> +
> +                spin_unlock(&temp_rqd->lock);
> +                spin_unlock(&prv->lock);
> +                return;
> +            }
> +            else
> +                /* No suitable idlers found in runq temp_rqd. */
> +                spin_unlock(&temp_rqd->lock);
> +        }
> +
> +        if ( prv_lock_held && balance_step == CSCHED2_BALANCE_HARD_AFFINITY )
> +            /* No suitable other-runq idlers found; unlock private data. */
> +            spin_unlock(&prv->lock);
> +    }
> +
>
I commented about this above.

BTW, look at the code you'd have to add to make this function behave as
you want, especially the locking, and I'm sure you'll concur that all
of that does not belong here.

Where I would put something similar to the above is in choose_cpu()
below!

> @@ -1039,12 +1147,22 @@ choose_cpu(const struct scheduler *ops, struct vcpu 
> *vc)
>  {
>      struct csched2_private *prv = CSCHED2_PRIV(ops);
>      int i, min_rqi = -1, new_cpu;
> +    int max_soft_cpus = 0, max_soft_cpus_rqi = -1;
> +    bool_t consider_soft_affinity = 0;
>      struct csched2_vcpu *svc = CSCHED2_VCPU(vc);
>      s_time_t min_avgload;
> -    cpumask_t temp_mask;
> +    cpumask_t temp_mask, vc_soft_affinity;
>  
>      BUG_ON(cpumask_empty(&prv->active_queues));
>  
> +    /* Consider soft affinity in the cpu decision? */
> +    if ( __vcpu_has_soft_affinity(vc, vc->cpu_hard_affinity) )
> +    {
> +        consider_soft_affinity = 1;
> +        cpumask_and(&vc_soft_affinity, vc->cpu_soft_affinity,
> +            vc->cpu_hard_affinity);
> +    }
> +
>
Why "black and white"? Why not using for_each_csched2_balance_step() and
differentiate which one affinity to consider depending on the step,
rather than here and for all?

> @@ -1112,11 +1237,15 @@ choose_cpu(const struct scheduler *ops, struct vcpu 
> *vc)
>  
>      min_avgload = MAX_LOAD;
>  
> -    /* Find the runqueue with the lowest instantaneous load */
> +    /*
> +     * Find the run queue with the most cpus in vc's soft affinity,
> +     * or the lowest instantaneous load if not considering soft affinity.
> +     */
>
Exactly my point: if the vcpu does not have any soft-affinity then, as
usual, the soft-affinity balancing step must be skipped, and the
behaviour should fall back to finding the runq with the lowest
instantaneous load, taking only hard affinity constraints into account.

And if the vcpu does have a soft-affinity, why forget completely about
instantaneous load? I like the idea of "the most cpus in vc's soft
affinity" (it's actually really similar to what I was talking about at
the beginning of this message), but why can't we combine the two things?
More importantly, why does it have to be just an option, rather than the
first step of the soft/hard balancing loop?
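
E.g., very roughly, inside the soft-affinity step of the runqueue scan
(soft_aff_to_runq() being the hypothetical helper sketched above, and
b_avgload standing in for whatever load metric we settle on):

    int soft_cpus = soft_aff_to_runq(vc, rqd);

    /* Most soft-affine pcpus wins; break ties on instantaneous load. */
    if ( soft_cpus > max_soft_cpus ||
         (soft_cpus == max_soft_cpus && rqd->b_avgload < min_avgload) )
    {
        max_soft_cpus = soft_cpus;
        min_avgload = rqd->b_avgload;
        min_rqi = i;
    }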

Regards,
Dario

-- 
<<This happens because I choose it to happen!>> (Raistlin Majere)
-----------------------------------------------------------------
Dario Faggioli, Ph.D, http://about.me/dario.faggioli
Senior Software Engineer, Citrix Systems R&D Ltd., Cambridge (UK)
