
Re: [Xen-devel] [PATCH 03 of 10 v2] xen: sched_credit: let the scheduler know about node-affinity



On Thu, 2012-12-20 at 20:21 +0000, George Dunlap wrote: 
> > -    /*
> > -     * Pick from online CPUs in VCPU's affinity mask, giving a
> > -     * preference to its current processor if it's in there.
> > -     */
> >       online = cpupool_scheduler_cpumask(vc->domain->cpupool);
> > -    cpumask_and(&cpus, online, vc->cpu_affinity);
> > -    cpu = cpumask_test_cpu(vc->processor, &cpus)
> > -            ? vc->processor
> > -            : cpumask_cycle(vc->processor, &cpus);
> > -    ASSERT( !cpumask_empty(&cpus) && cpumask_test_cpu(cpu, &cpus) );
> > +    for_each_csched_balance_step( balance_step )
> > +    {
> > +        /* Pick an online CPU from the proper affinity mask */
> > +        ret = csched_balance_cpumask(vc, balance_step, &cpus);
> > +        cpumask_and(&cpus, &cpus, online);
> >
> > -    /*
> > -     * Try to find an idle processor within the above constraints.
> > -     * 
> > -     * In multi-core and multi-threaded CPUs, not all idle execution
> > -     * vehicles are equal!
> > -     * 
> > -     * We give preference to the idle execution vehicle with the most
> > -     * idling neighbours in its grouping. This distributes work across
> > -     * distinct cores first and guarantees we don't do something stupid
> > -     * like run two VCPUs on co-hyperthreads while there are idle cores
> > -     * or sockets.
> > -     * 
> > -     * Notice that, when computing the "idleness" of cpu, we may want to
> > -     * discount vc. That is, iff vc is the currently running and the only
> > -     * runnable vcpu on cpu, we add cpu to the idlers.
> > -     */ 
> > -    cpumask_and(&idlers, &cpu_online_map, CSCHED_PRIV(ops)->idlers);
> > -    if ( vc->processor == cpu && IS_RUNQ_IDLE(cpu) )
> > -        cpumask_set_cpu(cpu, &idlers);
> > -    cpumask_and(&cpus, &cpus, &idlers);
> > -    cpumask_clear_cpu(cpu, &cpus);
> > +        /* If present, prefer vc's current processor */
> > +        cpu = cpumask_test_cpu(vc->processor, &cpus)
> > +                ? vc->processor
> > +                : cpumask_cycle(vc->processor, &cpus); 
> > +        ASSERT( !cpumask_empty(&cpus) && cpumask_test_cpu(cpu, &cpus) );
> > 
> > -    while ( !cpumask_empty(&cpus) )
> > -    {
> > -        cpumask_t cpu_idlers;
> > -        cpumask_t nxt_idlers;
> > -        int nxt, weight_cpu, weight_nxt;
> > -        int migrate_factor;
> > +        /*
> > +         * Try to find an idle processor within the above constraints.
> > +         *
> > +         * In multi-core and multi-threaded CPUs, not all idle execution
> > +         * vehicles are equal!
> > +         *
> > +         * We give preference to the idle execution vehicle with the most
> > +         * idling neighbours in its grouping. This distributes work across
> > +         * distinct cores first and guarantees we don't do something stupid
> > +         * like run two VCPUs on co-hyperthreads while there are idle cores
> > +         * or sockets.
> > +         *
> > +         * Notice that, when computing the "idleness" of cpu, we may want to
> > +         * discount vc. That is, iff vc is the currently running and the only
> > +         * runnable vcpu on cpu, we add cpu to the idlers.
> > +         */
> > +        cpumask_and(&idlers, &cpu_online_map, CSCHED_PRIV(ops)->idlers);
> > +        if ( vc->processor == cpu && IS_RUNQ_IDLE(cpu) )
> > +            cpumask_set_cpu(cpu, &idlers);
> > +        cpumask_and(&cpus, &cpus, &idlers);
> > +        /* If there are idlers and cpu is still not among them, pick one */
> > +        if ( !cpumask_empty(&cpus) && !cpumask_test_cpu(cpu, &cpus) )
> > +            cpu = cpumask_cycle(cpu, &cpus);
> 
> This seems to be an addition to the algorithm -- particularly hidden in 
> this kind of "indent a big section that's almost exactly the same", I 
> think this at least needs to be called out in the changelog message, 
> perhaps put in a separate patch.
> 
You're right, it is an addition, although a minor enough one (at least
in terms of the amount of code; the effect of not having it was pretty
bad! :-P) that I thought it could "hide" here. :-)

But I guess I can put it in a separate patch.

> Can you comment on to why you think it's necessary?  Was there a 
> particular problem you were seeing?
> 
Yep. Suppose vc is, for some reason, running on a pcpu which is outside
its node-affinity, but some pcpus within vc's node-affinity have now
become idle. What we would like is for vc to start running there as soon
as possible, so we expect this call to _csched_pick_cpu() to figure that
out.

What happens is that we do not use vc->processor (as it is outside of
vc's node-affinity), so 'cpu' gets set to cpumask_cycle(vc->processor,
&cpus), where &cpus is the result of cpumask_and(&cpus, balance_mask,
online). This means 'cpu' is now within vc's node-affinity, but nothing
guarantees it is idle. So, let's also suppose that 'cpu' points to a
busy thread with an idle sibling, and that there aren't any other idle
pcpus (neither cores nor threads). Now, the algorithm evaluates the
idleness of 'cpu' and compares it with the idleness of all the other
pcpus, and it won't find anything better than 'cpu' itself, as all the
other pcpus except its sibling thread are busy, while its sibling thread
has the very same idleness 'cpu' has (2 threads, 1 idle, 1 busy).

The net effect is vc being moved to 'cpu', which is busy, while it
could have been moved to 'cpu''s sibling thread, which is indeed idle.

The if() I added fixes this by making sure that the reference cpu is an
idle one (if that is possible).
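
To make the scenario concrete, here is the relevant part of the new flow
again, annotated with a made-up configuration (the cpu numbers are purely
hypothetical, just for illustration): say vc->processor == 4, outside vc's
node-affinity; during the node-affinity step &cpus == {0,1}, with 0 and 1
sibling threads of the same core; cpu 0 is busy, cpu 1 is idle, and no
other pcpu is idle.

    /* vc->processor (4) is not in &cpus ({0,1}), so cycle into the mask */
    cpu = cpumask_test_cpu(vc->processor, &cpus)
            ? vc->processor
            : cpumask_cycle(vc->processor, &cpus);    /* --> cpu == 0, busy */

    cpumask_and(&idlers, &cpu_online_map, CSCHED_PRIV(ops)->idlers);
    if ( vc->processor == cpu && IS_RUNQ_IDLE(cpu) )
        cpumask_set_cpu(cpu, &idlers);
    cpumask_and(&cpus, &cpus, &idlers);               /* --> &cpus == {1}  */

    /*
     * Without the check below, the weighting loop further down compares
     * the idleness of cpu 0 with that of cpu 1, finds them equal (same
     * core, 1 idle + 1 busy thread), and vc ends up on the busy thread.
     * With it, the reference cpu becomes the idle sibling instead.
     */
    if ( !cpumask_empty(&cpus) && !cpumask_test_cpu(cpu, &cpus) )
        cpu = cpumask_cycle(cpu, &cpus);              /* --> cpu == 1, idle */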

I hope I've explained it correctly, and sorry if it is a little bit
tricky, especially to explain like this (although, believe me, it was
tricky to hunt it down too! :-P). I've seen it happen and I'm almost
sure I kept a trace somewhere, so let me know if you want to see the
"smoking gun". :-)

> > -        weight_cpu = cpumask_weight(&cpu_idlers);
> > -        weight_nxt = cpumask_weight(&nxt_idlers);
> > -        /* smt_power_savings: consolidate work rather than spreading it */
> > -        if ( sched_smt_power_savings ?
> > -             weight_cpu > weight_nxt :
> > -             weight_cpu * migrate_factor < weight_nxt )
> > -        {
> > -            cpumask_and(&nxt_idlers, &cpus, &nxt_idlers);
> > -            spc = CSCHED_PCPU(nxt);
> > -            cpu = cpumask_cycle(spc->idle_bias, &nxt_idlers);
> > -            cpumask_andnot(&cpus, &cpus, per_cpu(cpu_sibling_mask, cpu));
> > -        }
> > -        else
> > -        {
> > -            cpumask_andnot(&cpus, &cpus, &nxt_idlers);
> > -        }
> > +        /* Stop if cpu is idle (or if csched_balance_cpumask() says we can) */
> > +        if ( cpumask_test_cpu(cpu, &idlers) || ret )
> > +            break;
> 
> Right -- OK, I think everything looks good here, except the "return -1 
> from csched_balance_cpumask" thing.  I think it would be better if we 
> explicitly checked cpumask_full(...->node_affinity_cpumask) and skipped 
> the NODE step if that's the case.
> 
Yep, I will do this, or something along these lines, all over the place.
Thanks.
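
Just to be sure we mean the same thing, below is roughly what I have in
mind. It is only a sketch: I am assuming the step constant is called
CSCHED_BALANCE_NODE_AFFINITY and that the domain's mask is reachable as
CSCHED_DOM(vc->domain)->node_affinity_cpumask; the actual names may end
up slightly different in the next version.

    for_each_csched_balance_step( balance_step )
    {
        /*
         * If the domain's node-affinity spans all pcpus, the node-affinity
         * step is identical to the vcpu-affinity one, so skip it outright
         * instead of having csched_balance_cpumask() signal that via its
         * return value.
         */
        if ( balance_step == CSCHED_BALANCE_NODE_AFFINITY
             && cpumask_full(CSCHED_DOM(vc->domain)->node_affinity_cpumask) )
            continue;

        csched_balance_cpumask(vc, balance_step, &cpus);
        cpumask_and(&cpus, &cpus, online);

        /* ... rest of the step, as in the hunk above ... */
    }

With that in place, the '|| ret' part of the exit check at the bottom of
the loop can also go away.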

> Also -- and sorry to have to ask this kind of thing, but after sorting 
> through the placement algorithm my head hurts -- under what 
> circumstances would "cpumask_test_cpu(cpu, &idlers)" be false at this 
> point?  It seems like the only possibility would be if:
> ( (vc->processor was not in the original &cpus [1])
>    || !IS_RUNQ_IDLE(vc->processor) )
> && (there are no idlers in the original &cpus)
> 
> Which I suppose probably matches the time when we want to move on from 
> looking at NODE affinity and look for CPU affinity.
> 
> [1] This could happen either if the vcpu/node affinity has changed, or 
> if we're currently running outside our node affinity and we're doing the 
> NODE step.
> 
> OK -- I think I've convinced myself that this is OK as well (apart from 
> the hidden check).  I'll come back to look at your response to the load 
> balancing thing tomorrow.
> 
Mmm... sorry, I'm not sure I follow: does this mean that you figured out
and understood why I need that 'if () { break; }'? It sounds like it, but
I can't be sure (my head hurts a bit too, after having written that
thing! :-D).

If not, consider that, for example, it can be false if all the pcpus in
the mask for this step are busy; if this step is the node-affinity step,
in that case I do _not_ want to exit the balancing loop, but rather go on
and check the other pcpus in the vcpu-affinity. OTOH, if I don't put a
break somewhere, then even when an idle pcpu is found during the
node-affinity balancing step, the loop will just go on and check all the
other pcpus in the vcpu-affinity, which would defeat the purpose of
having tried to do the balancing here in the first place. Does that make
sense?
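
In other words (stripping out everything but the skeleton), the control
flow of the loop is meant to be:

    for_each_csched_balance_step( balance_step )
    {
        /* Restrict to this step's affinity mask: node-affinity first,
         * then plain vcpu-affinity. */
        csched_balance_cpumask(vc, balance_step, &cpus);
        cpumask_and(&cpus, &cpus, online);

        /* ... pick 'cpu' and evaluate idleness, as in the hunks above ... */

        /* The cpu we ended up with for this step is idle: stop here,
         * there is no point in widening the search to the next step. */
        if ( cpumask_test_cpu(cpu, &idlers) )
            break;

        /* Otherwise, fall through and retry with the wider mask. */
    }

(with the '|| ret' part of the check going away once the cpumask_full()
test discussed above is in place).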

Thanks again and Regards,
Dario

-- 
<<This happens because I choose it to happen!>> (Raistlin Majere)
-----------------------------------------------------------------
Dario Faggioli, Ph.D, http://retis.sssup.it/people/faggioli
Senior Software Engineer, Citrix Systems R&D Ltd., Cambridge (UK)



_______________________________________________
Xen-devel mailing list
Xen-devel@xxxxxxxxxxxxx
http://lists.xen.org/xen-devel
