
Re: [Xen-devel] [PATCH 2/2] sched: credit2: consider per-vcpu soft affinity



Dario,

I'm working on soft affinity while you review v2 of the credit 2 hard
affinity patch (no rush, of course).

On Tue, Jan 13, 2015 at 3:06 AM, Dario Faggioli
<dario.faggioli@xxxxxxxxxx> wrote:
> On Sat, 2014-11-29 at 14:33 -1000, Justin T. Weaver wrote:
>> when deciding which run queue to assign each vcpu to.
>>
>> There are two main changes in functionality that this patch introduces.
>>
>> First, in function runq_tickle, it tries to find idle pcpus in other run
>> queues that the vcpu would prefer to run on (soft affinity) or can run on
>> (hard affinity), in that order.
>>
> I see. I don't think a function like runq_tickle() should be able to
> move a vcpu to a different runqueue. That is done in other places, and I
> would leave this as it is. Tickling usually happens right after a vcpu
> has been put on a runqueue, which is something controlled by other parts
> of the Credit2 algorithm and code, and that is by design. More
> specifically, moving vcpus among runqueues is the job of the load
> balancing logic.
>
> Turning runq_tickle() into a potential "runqueue-changer" would mix the
> two concepts of inter-runqueue and intra-runqueue balancing/migration up
> to a point that I do not like. It also makes the code look rather
> weird... E.g., look at when runq_tickle() is called in migrate(). In
> there, right after going through:
>  - __runq_remove();
>  - __runq_deassign();
>  - __runq_assign();
>  - runq_insert();
> we call runq_tickle() and... Do it all over again!?!? :-O
> Even just because of this, I'd say that, if such a change to
> runq_tickle() were necessary (which I don't think it is), it should
> happen differently, and should involve changing its various callers to
> avoid the repetition.
>
> OTOH, it looks like you are not touching the existing
> for_each_cpu(i,&mask) loop, still inside runq_tickle() itself, while I'd
> say that looking for a pcpu that the vcpu has soft affinity with, among
> all the non-idle and non-tickled pcpus in the runqueue, really is
> worthwhile. Actually, the same applies to idle pcpus in the runqueue,
> i.e., to the code before the for_each_cpu() loop, still in
> runq_tickle().
>
> So, instead of adding to runq_tickle() the capability of moving vcpus
> among runqueues, which is not its responsibility, I would add the usual
> affinity balancing logic (== the for_each_csched2_balance_step loop) to it.

What I was going for was using runq_tickle as a last resort for finding
an idle processor to run a vcpu. I was also thinking about how the
mechanism would work if, say, balance_load moves a vcpu away from its
soft affinity onto hard-affinity-only pcpus: where and when would it
get moved back to soft? But I see that this isn't the place to make
those kinds of decisions. I'll implement the balancing logic here as
you suggest.
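
Something like this is what I'm now picturing inside runq_tickle (just
a rough sketch, reusing the helpers and the existing mask/ipid/tickle
path from the patch; the on-stack mask would still need to become
per-pcpu scratch space, per your comment further down):

    int balance_step;

    for_each_csched2_balance_step( balance_step )
    {
        /* Skip the soft affinity step if new has no useful soft affinity. */
        if ( balance_step == CSCHED2_BALANCE_SOFT_AFFINITY &&
             !__vcpu_has_soft_affinity(new->vcpu,
                                       new->vcpu->cpu_hard_affinity) )
            continue;

        csched2_balance_cpumask(new->vcpu, balance_step, &mask);

        /* Only consider non-tickled idle pcpus in this runqueue. */
        cpumask_and(&mask, &mask, &rqd->idle);
        cpumask_andnot(&mask, &mask, &rqd->tickled);

        if ( !cpumask_empty(&mask) )
        {
            ipid = cpumask_any(&mask);
            goto tickle;
        }
    }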

> When we have this in place, and we are able to benchmark it, we'll see
> whether there is a need for making such a function capable of
> more advanced things (but even at that point, I doubt that changing the
> runqueue of a vcpu would be one of them).
>
>> Second, in function balance_load, if moving a vcpu with soft affinity from
>> one run queue to another means moving it away from its soft affinity to hard
>> affinity only, then that move is not considered for load balancing.
>>
> This is a key part of the whole thing so, please, spend a few more words
> explaining what your idea is and how it actually works, both here and
> in the sources, with comments.
>
> From what I've understood, I like the idea. However, why not use the
> usual two step approach? In the soft-affinity step, we won't consider
> vcpus that would see their soft-affinity 'impaired' by being moved, and,
> perhaps, we'll give a preference boost to vcpus that would see their
> soft-affinity 'improved' by being moved. If we don't find any suitable
> candidate, in the next (hard-affinity) step, we fall back to ignoring
> soft-affinity, and only care about hard constraints.

I did try a few different things here along those lines, but I kept
seeing odd behavior due to how the system currently calculates load. I
observed a vcpu moving back and forth between two run queues when it
was the only vcpu on one of them: it would move away from soft to hard
(even though it was the only vcpu associated with the run queue)
because of a load calculation, and then get moved back to soft because
I told it to prefer that. I need to get a better understanding of how
run queue load is calculated, and then see what makes the most sense
to do here. Also, with a two step approach, won't we end up with even
more looping through the vcpus, as in the hard affinity v1 patch?
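
As a first cut at classifying moves, I was thinking of a helper along
these lines (hypothetical sketch; it reuses the CSCHED2_MIGRATE_*
constants from the patch, and assumes rqd->active is the mask of pcpus
belonging to a run queue):

    /* How does moving svc from runqueue "from" to runqueue "to"
     * affect its soft affinity? */
    static int csched2_migrate_effect(const struct csched2_vcpu *svc,
                                      const struct csched2_runqueue_data *from,
                                      const struct csched2_runqueue_data *to)
    {
        const cpumask_t *soft = svc->vcpu->cpu_soft_affinity;

        if ( !__vcpu_has_soft_affinity(svc->vcpu,
                                       svc->vcpu->cpu_hard_affinity) )
            return CSCHED2_MIGRATE_ANY_TO_ANY;

        if ( cpumask_intersects(soft, &from->active) &&
             !cpumask_intersects(soft, &to->active) )
            return CSCHED2_MIGRATE_SOFT_TO_HARD;

        if ( !cpumask_intersects(soft, &from->active) &&
             cpumask_intersects(soft, &to->active) )
            return CSCHED2_MIGRATE_HARD_TO_SOFT;

        return CSCHED2_MIGRATE_ANY_TO_ANY;
    }

The soft step in balance_load could then skip any candidate whose move
classifies as CSCHED2_MIGRATE_SOFT_TO_HARD, and perhaps prefer
CSCHED2_MIGRATE_HARD_TO_SOFT ones, once I understand the load side of
it better.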

> About 'impaired' and 'improved' soft-affinity. This is just an idea
> popping out of my head right now, so bear with me. :-)
> Affinity is to pcpus, while, in principle, here we are talking about
> migration between runqueues, with each runqueue spanning a certain set
> of pcpus. There seems to me to be room for defining something like the
> "soft affinity of a vcpu to a runqueue", as, e.g., the number of pcpus
> of that runqueue with which the vcpu has soft-affinity. This would
> measure how likely the vcpu is to run on a pcpu it has soft-affinity
> with, if migrated to the runqueue in question. How do you like this?

I do like this, and I've been trying to come up with a way to maintain
this information. Run queues and per-vcpu affinity can change, so
having the proper locks to read and update this information is
something I've been thinking about. For example, in choose_cpu during
an affinity change, there's a chance that it doesn't get the private
lock, and so it wouldn't be able to compare the new affinity with each
of the run queues. But then I was thinking that maybe there could be a
state where the "soft affinity of a vcpu to a runqueue" is unknown,
until a point where the proper locks are held and it can be
calculated, making it readily available the next time it's needed.
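
For the metric itself, the simplest form I can think of is a helper
like this (hypothetical; it assumes rqd->active is the run queue's
pcpu mask, and it ignores, for now, both the locking question above
and the on-stack cpumask problem):

    /* Number of pcpus of this run queue that vc has soft affinity with. */
    static int vcpu_soft_affinity_to_runq(const struct vcpu *vc,
                                          const struct csched2_runqueue_data *rqd)
    {
        cpumask_t mask;

        cpumask_and(&mask, vc->cpu_soft_affinity, &rqd->active);
        return cpumask_weight(&mask);
    }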

> BTW, do you see now why yesterday I said I'd rather get hard affinity
> in place before tackling soft affinity? :-P

Yes, definitely.

>> @@ -127,6 +127,15 @@
>>  #define CSCHED2_CREDIT_RESET         0
>>  /* Max timer: Maximum time a guest can be run for. */
>>  #define CSCHED2_MAX_TIMER            MILLISECS(2)
>> +/* Two step balancing logic; consider soft affinity first. */
>> +#define CSCHED2_BALANCE_SOFT_AFFINITY 0
>> +#define CSCHED2_BALANCE_HARD_AFFINITY 1
>> +/* vcpu runq migration away from vcpu's soft affinity. */
>> +#define CSCHED2_MIGRATE_SOFT_TO_HARD 0
>> +/* vcpu runq migration to vcpu's soft affinity. */
>> +#define CSCHED2_MIGRATE_HARD_TO_SOFT 1
>> +/* vcpu runq migration soft to soft or vcpu has no soft affinity. */
>> +#define CSCHED2_MIGRATE_ANY_TO_ANY   2
>>
>>
>>  #define CSCHED2_IDLE_CREDIT                 (-(1<<30))
>> @@ -176,6 +185,8 @@ integer_param("sched_credit2_migrate_resist", opt_migrate_resist);
>>  #define c2r(_ops, _cpu)     (CSCHED2_PRIV(_ops)->runq_map[(_cpu)])
>>  /* CPU to runqueue struct macro */
>>  #define RQD(_ops, _cpu)     (&CSCHED2_PRIV(_ops)->rqd[c2r(_ops, _cpu)])
>> +#define for_each_csched2_balance_step(step) \
>> +    for ( (step) = 0; (step) <= CSCHED2_BALANCE_HARD_AFFINITY; (step)++ )
>>
> This is identical (apart from the CSCHED2 prefix) to what we have in
> Credit1. I wonder whether it wouldn't make sense to factor it out in a
> common header.
>
> I'm saying I wonder because I am actually not sure. As things stand, it
> makes perfect sense. Even looking forward, I don't see why the two
> schedulers (as well as any other scheduler wanting to support
> affinities) should do things differently from each other.
>
> The only doubt I have is not about the structure (i.e., the loop, etc.)
> but about the fact that a scheduler may perhaps want to add more steps,
> or things like this.
>
> If I had to decide now, I'd say let's factor this out, and when --if
> ever-- someone with different needs comes along, we'll see what to do.
> Thoughts?

More steps could be added if the new scheduler #defines them as
negative numbers, I suppose. I will factor this out, along with the
sections you mention below. I was thinking about putting them in
sched-if.h. Does that sound OK, or do you think they should go in a
new or different file?
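
Just so we're talking about the same thing, here is roughly what I'd
put in sched-if.h (the names are only a first guess):

    /* Generic two step affinity balancing; consider soft affinity first. */
    #define SCHED_BALANCE_SOFT_AFFINITY 0
    #define SCHED_BALANCE_HARD_AFFINITY 1

    #define for_each_sched_balance_step(step) \
        for ( (step) = 0; (step) <= SCHED_BALANCE_HARD_AFFINITY; (step)++ )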

>> +/*
>> + * A vcpu has meaningful soft affinity if...
>> + * - its soft affinity mask is not full, and
>> + * - the passed in mask (usually its hard affinity mask) intersects
>> + *   with its soft affinity mask
>> + */
>> +static inline int __vcpu_has_soft_affinity(const struct vcpu *vc,
>> +                                           const cpumask_t *mask)
>> +{
>> +    if ( cpumask_full(vc->cpu_soft_affinity) ||
>> +        !cpumask_intersects(vc->cpu_soft_affinity, mask) )
>> +        return 0;
>> +
>> +    return 1;
>> +}
>>
> About this, I have far fewer doubts when asking you to factor this out
> from sched_credit.c and use it here. :-)
>
>> +static void
>> +csched2_balance_cpumask(const struct vcpu *vc, int step, cpumask_t *mask)
>> +{
>> +    if ( step == CSCHED2_BALANCE_SOFT_AFFINITY )
>> +    {
>> +        cpumask_and(mask, vc->cpu_soft_affinity, vc->cpu_hard_affinity);
>> +
>> +        if ( unlikely(cpumask_empty(mask)) )
>> +            cpumask_copy(mask, vc->cpu_hard_affinity);
>> +    }
>> +    else /* step == CSCHED2_BALANCE_HARD_AFFINITY */
>> +        cpumask_copy(mask, vc->cpu_hard_affinity);
>> +}
>>
> Same here.
>
> Oh, and also have a look at how this is used, in sched_credit.c, in
> conjunction with the per-csched_pcpu variable balance_mask.
>
> As I said yesterday, too many cpumask_t variables on the stack should
> be avoided, and that is how I managed to comply with this when working
> on Credit1.
>
> For sure there are more than a few places where you can do the same, or
> something very similar.
>
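
Understood. As far as I can tell, credit2 doesn't have a per-pcpu
struct like Credit1's csched_pcpu to hang a balance_mask off of, so
I'd probably start with a plain per-cpu scratch mask, something like
this (hypothetical sketch):

    /* Per-pcpu scratch mask, mirroring Credit1's balance_mask, so the
     * balancing paths don't need a cpumask_t on the stack. */
    static DEFINE_PER_CPU(cpumask_t, csched2_balance_mask);

    /* ...and then, in the balancing code, instead of "cpumask_t mask;": */
    cpumask_t *mask = &this_cpu(csched2_balance_mask);

    csched2_balance_cpumask(vc, balance_step, mask);
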
>> @@ -513,6 +559,68 @@ runq_tickle(const struct scheduler *ops, unsigned int cpu, struct csched2_vcpu *
>>          goto tickle;
>>      }
>>
>> +    /* Look for idle cpus in other runqs; consider soft affinity first. */
>> +    for_each_csched2_balance_step( balance_step )
>> +    {
>> +        cpumask_t balance_mask;
>> +
>> +        /* Skip the soft affinity balance step if new doesn't have any. */
>> +        if ( balance_step == CSCHED2_BALANCE_SOFT_AFFINITY &&
>> +            !__vcpu_has_soft_affinity(
>> +                new->vcpu, new->vcpu->cpu_hard_affinity) )
>> +            continue;
>> +
>> +        /* Skip this step if can't get a lock on the credit2 private data. */
>> +        if ( !prv_lock_held || !spin_trylock(&prv->lock) )
>> +            continue;
>> +        prv_lock_held = 1;
>> +
>> +        csched2_balance_cpumask(new->vcpu, balance_step, &balance_mask);
>> +
>> +        for_each_cpu(i, &prv->active_queues)
>> +        {
>> +            struct csched2_runqueue_data *temp_rqd;
>> +
>> +            temp_rqd = prv->rqd + i;
>> +
>> +            if ( temp_rqd == rqd || !spin_trylock(&temp_rqd->lock) )
>> +                continue;
>> +
>> +            /* Find idle cpus in the balance mask that are not tickled. */
>> +            cpumask_andnot(&mask, &temp_rqd->idle, &temp_rqd->tickled);
>> +            cpumask_and(&mask, &mask, &balance_mask);
>> +
>> +            if ( !cpumask_empty(&mask) )
>> +            {
>> +                /* Found an idle cpu on another run queue; move new. */
>> +                s_time_t now = 0;
>> +
>> +                ipid = cpumask_any(&mask);
>> +                new->vcpu->processor = ipid;
>> +                __runq_remove(new);
>> +                now = NOW();
>> +                update_load(ops, rqd, new, -1, now);
>> +                __runq_deassign(new);
>> +                __runq_assign(new, temp_rqd);
>> +                update_load(ops, temp_rqd, new, 1, now);
>> +                runq_insert(ops, ipid, new);
>> +                cpumask_set_cpu(ipid, &temp_rqd->tickled);
>> +                cpu_raise_softirq(ipid, SCHEDULE_SOFTIRQ);
>> +
>> +                spin_unlock(&temp_rqd->lock);
>> +                spin_unlock(&prv->lock);
>> +                return;
>> +            }
>> +            else
>> +                /* No suitable idlers found in runq temp_rqd. */
>> +                spin_unlock(&temp_rqd->lock);
>> +        }
>> +
>> +        if ( prv_lock_held && balance_step == CSCHED2_BALANCE_HARD_AFFINITY )
>> +            /* No suitable other-runq idlers found; unlock private data. */
>> +            spin_unlock(&prv->lock);
>> +    }
>> +
>>
> I commented about this above.
>
> BTW, look at the code you had to add to make this function behave as
> you wanted, especially the locking, and I'm sure you'll concur that
> all of that does not belong here.
>
> Where I would put something similar to the above is in choose_cpu()
> below!

I'm not sure I understand what you mean. choose_cpu is only called in
response to a SCHED_OP call to pick_cpu from vcpu_migrate in
schedule.c. Its only job is to return to the generic scheduler a pcpu
that it recommends for the given vcpu. vcpu_migrate eventually calls
credit2's migrate() to do the actual run queue move. Please let me
know if I'm missing something here.

>> @@ -1039,12 +1147,22 @@ choose_cpu(const struct scheduler *ops, struct vcpu *vc)
>>  {
>>      struct csched2_private *prv = CSCHED2_PRIV(ops);
>>      int i, min_rqi = -1, new_cpu;
>> +    int max_soft_cpus = 0, max_soft_cpus_rqi = -1;
>> +    bool_t consider_soft_affinity = 0;
>>      struct csched2_vcpu *svc = CSCHED2_VCPU(vc);
>>      s_time_t min_avgload;
>> -    cpumask_t temp_mask;
>> +    cpumask_t temp_mask, vc_soft_affinity;
>>
>>      BUG_ON(cpumask_empty(&prv->active_queues));
>>
>> +    /* Consider soft affinity in the cpu decision? */
>> +    if ( __vcpu_has_soft_affinity(vc, vc->cpu_hard_affinity) )
>> +    {
>> +        consider_soft_affinity = 1;
>> +        cpumask_and(&vc_soft_affinity, vc->cpu_soft_affinity,
>> +            vc->cpu_hard_affinity);
>> +    }
>> +
>>
> Why "black and white"? Why not using for_each_csched2_balance_step() and
> differentiate which one affinity to consider depending on the step,
> rather than here and for all?
>
>> @@ -1112,11 +1237,15 @@ choose_cpu(const struct scheduler *ops, struct vcpu *vc)
>>
>>      min_avgload = MAX_LOAD;
>>
>> -    /* Find the runqueue with the lowest instantaneous load */
>> +    /*
>> +     * Find the run queue with the most cpus in vc's soft affinity, or
>> +     * the lowest instantaneous load if not considering soft affinity.
>> +     */
>>
> Exactly my point: if the vcpu does not have any soft-affinity, then, as
> usual, the soft-affinity balancing step must be skipped, and the
> behaviour should fall back to finding the runq with the lowest
> instantaneous load, taking only hard affinity constraints into account.
>
> And if the vcpu does have a soft-affinity, why forget completely about
> instantaneous load? I like the idea of "the most cpus in vc's soft
> affinity" (it's actually really similar to what I was talking about at
> the beginning of this message), but why can't we combine the two things?
> More importantly, why does it have to be just an option, rather than
> the first step of the soft/hard balancing loop?

Yes, I will re-work choose_cpu to use the affinity balancing steps.
I'll combine soft cpu count and instantaneous load in the soft
balancing step.
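
For the soft step, the candidate-selection loop could look something
like this (rough sketch; it assumes rqd->active and the b_avgload
field that choose_cpu already compares, breaks ties on the soft cpu
count by preferring the lower load, and the on-stack mask would become
the per-pcpu scratch mask discussed above):

    for_each_cpu(i, &prv->active_queues)
    {
        struct csched2_runqueue_data *rqd = prv->rqd + i;
        cpumask_t mask;
        int soft_cpus;

        /* Count this run queue's pcpus that are in vc's soft affinity. */
        cpumask_and(&mask, &vc_soft_affinity, &rqd->active);
        soft_cpus = cpumask_weight(&mask);

        if ( soft_cpus > max_soft_cpus ||
             (soft_cpus == max_soft_cpus && soft_cpus > 0 &&
              rqd->b_avgload < min_avgload) )
        {
            max_soft_cpus = soft_cpus;
            min_avgload = rqd->b_avgload;
            max_soft_cpus_rqi = i;
        }
    }

The hard step would then keep the existing lowest-load search, for the
case where no run queue intersects the soft affinity at all.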

Thanks for the feedback!

Justin

_______________________________________________
Xen-devel mailing list
Xen-devel@xxxxxxxxxxxxx
http://lists.xen.org/xen-devel


 

