Re: [Xen-devel] [PATCH 03 of 10 v2] xen: sched_credit: let the scheduler know about node-affinity
- To: Dario Faggioli <dario.faggioli@xxxxxxxxxx>
- From: George Dunlap <george.dunlap@xxxxxxxxxxxxx>
- Date: Thu, 20 Dec 2012 16:48:13 +0000
- Cc: Marcus Granado <Marcus.Granado@xxxxxxxxxxxxx>, Dan Magenheimer <dan.magenheimer@xxxxxxxxxx>, Ian Campbell <Ian.Campbell@xxxxxxxxxx>, Anil Madhavapeddy <anil@xxxxxxxxxx>, Andrew Cooper <Andrew.Cooper3@xxxxxxxxxx>, Juergen Gross <juergen.gross@xxxxxxxxxxxxxx>, Ian Jackson <Ian.Jackson@xxxxxxxxxxxxx>, "xen-devel@xxxxxxxxxxxxx" <xen-devel@xxxxxxxxxxxxx>, Jan Beulich <JBeulich@xxxxxxxx>, Daniel De Graaf <dgdegra@xxxxxxxxxxxxx>, Matt Wilson <msw@xxxxxxxxxx>
- Delivery-date: Thu, 20 Dec 2012 16:54:49 +0000
- List-id: Xen developer discussion <xen-devel.lists.xen.org>
On 19/12/12 19:07, Dario Faggioli wrote:
static inline int
-__csched_vcpu_is_migrateable(struct vcpu *vc, int dest_cpu)
+__csched_vcpu_is_migrateable(struct vcpu *vc, int dest_cpu, cpumask_t *mask)
{
/*
* Don't pick up work that's in the peer's scheduling tail or hot on
- * peer PCPU. Only pick up work that's allowed to run on our CPU.
+ * peer PCPU. Only pick up work that prefers and/or is allowed to run
+ * on our CPU.
*/
return !vc->is_running &&
!__csched_vcpu_is_cache_hot(vc) &&
- cpumask_test_cpu(dest_cpu, vc->cpu_affinity);
+ cpumask_test_cpu(dest_cpu, mask);
+}
+
+static inline int
+__csched_vcpu_should_migrate(int cpu, cpumask_t *mask, cpumask_t *idlers)
+{
+ /*
+ * Consent to migration if cpu is one of the idlers in the VCPU's
+ * affinity mask. In fact, if that is not the case, it just means it
+ * was some other CPU that was tickled and should hence come and pick
+ * VCPU up. Migrating it to cpu would only make things worse.
+ */
+ return cpumask_test_cpu(cpu, idlers) && cpumask_test_cpu(cpu, mask);
}
I don't get what this function is for. The only time you call it is in
csched_runq_steal(), immediately after calling
__csched_vcpu_is_migrateable(). But is_migrateable() has already
checked cpumask_test_cpu(cpu, mask). So why do we need to check it again?
We could just replace this with cpumask_test_cpu(cpu, prv->idlers). But
that clause is going to be either true or false for every single
iteration of all the loops, including the loops in
csched_load_balance(). Wouldn't it make more sense to check it once in
csched_load_balance(), rather than re-checking it on every iteration of all
those nested loops?
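To make that concrete, something like the following near the top of
csched_load_balance() is roughly what I have in mind (just a sketch, not
even compile-tested, and assuming we can simply take the existing
"nothing stolen" exit path):

    /*
     * 'cpu' never changes below, so this test is loop-invariant: it is
     * either true or false for every peer node, peer cpu and candidate
     * vcpu we would otherwise look at.  If cpu is not one of the idlers,
     * whichever CPU actually got tickled will come and pick the work up
     * itself, so there is no point scanning the runqueues here at all.
     */
    if ( !cpumask_test_cpu(cpu, prv->idlers) )
        goto out;

That way neither csched_runq_steal() nor the migrateability check needs
to know about prv->idlers at all.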
And in any case, looking at the caller of csched_load_balance(), it
explicitly says to steal work if the next thing on the runqueue of cpu
has a priority of TS_OVER. That was chosen for a reason -- if you want
to change that, you should change it there at the top (and make a
justification for doing so), not deeply nested in a function like this.
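For reference, the call site I mean is the one in csched_schedule(),
which (quoting from memory, so the exact form may differ a bit) goes
along these lines:

    /*
     * Work is only stolen if what we would otherwise run is at or below
     * TS_OVER priority -- i.e. the policy decision lives here, at the top.
     */
    if ( snext->pri > CSCHED_PRI_TS_OVER )
        __runq_remove(snext);
    else
        snext = csched_load_balance(prv, cpu, snext, ...);

If this series wants to relax or refine that condition, that is where the
change (and its justification) ought to show up.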
Or am I completely missing something?
static struct csched_vcpu *
-csched_runq_steal(int peer_cpu, int cpu, int pri)
+csched_runq_steal(int peer_cpu, int cpu, int pri, int balance_step)
{
const struct csched_pcpu * const peer_pcpu = CSCHED_PCPU(peer_cpu);
+ struct csched_private *prv = CSCHED_PRIV(per_cpu(scheduler, peer_cpu));
const struct vcpu * const peer_vcpu = curr_on_cpu(peer_cpu);
struct csched_vcpu *speer;
struct list_head *iter;
@@ -1265,11 +1395,24 @@ csched_runq_steal(int peer_cpu, int cpu,
if ( speer->pri <= pri )
break;
- /* Is this VCPU is runnable on our PCPU? */
+ /* Is this VCPU runnable on our PCPU? */
vc = speer->vcpu;
BUG_ON( is_idle_vcpu(vc) );
- if (__csched_vcpu_is_migrateable(vc, cpu))
+ /*
+ * Retrieve the correct mask for this balance_step or, if we're
+ * dealing with node-affinity and the vcpu has no node affinity
+ * at all, just skip this vcpu. That is needed if we want to
+ * check if we have any node-affine work to steal first (wrt
+ * any vcpu-affine work).
+ */
+ if ( csched_balance_cpumask(vc, balance_step,
+ &scratch_balance_mask) )
+ continue;
Again, I think for clarity the best thing to do here is:
    if ( balance_step == NODE
         && cpumask_full(speer->sdom->node_affinity_cpumask) )
        continue;

    csched_balance_cpumask();
    /* Etc. */
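Spelled out a bit more, and still only as a sketch (I'm guessing at the
spelling of the node-affinity step constant from the earlier patch, and
reusing the scratch mask from the hunk above):

    if ( balance_step == CSCHED_BALANCE_NODE_AFFINITY
         && cpumask_full(speer->sdom->node_affinity_cpumask) )
        continue;  /* no real node-affinity, nothing node-affine to steal */

    csched_balance_cpumask(vc, balance_step, &scratch_balance_mask);

    if ( __csched_vcpu_is_migrateable(vc, cpu, &scratch_balance_mask) )
    {
        /* pull the vcpu over, as before */
        ...
    }

That keeps the "does this vcpu have a meaningful node-affinity at all?"
question visible right where the decision is taken, rather than burying
it in csched_balance_cpumask()'s return value.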
@@ -1295,7 +1438,8 @@ csched_load_balance(struct csched_privat
struct csched_vcpu *speer;
cpumask_t workers;
cpumask_t *online;
- int peer_cpu;
+ int peer_cpu, peer_node, bstep;
+ int node = cpu_to_node(cpu);
BUG_ON( cpu != snext->vcpu->processor );
online = cpupool_scheduler_cpumask(per_cpu(cpupool, cpu));
@@ -1312,42 +1456,68 @@ csched_load_balance(struct csched_privat
SCHED_STAT_CRANK(load_balance_other);
/*
- * Peek at non-idling CPUs in the system, starting with our
- * immediate neighbour.
+ * Let's look around for work to steal, taking both vcpu-affinity
+ * and node-affinity into account. More specifically, we check all
+ * the non-idle CPUs' runq, looking for:
+ * 1. any node-affine work to steal first,
+ * 2. if not finding anything, any vcpu-affine work to steal.
*/
- cpumask_andnot(&workers, online, prv->idlers);
- cpumask_clear_cpu(cpu, &workers);
- peer_cpu = cpu;
+ for_each_csched_balance_step( bstep )
+ {
+ /*
+ * We peek at the non-idling CPUs in a node-wise fashion. In fact,
+ * it is more likely that we find some node-affine work on our same
+ * node, not to mention that migrating vcpus within the same node
+ * could well be expected to be cheaper than across nodes (memory
+ * stays local, there might be some node-wide cache[s], etc.).
+ */
+ peer_node = node;
+ do
+ {
+ /* Find out what the !idle are in this node */
+ cpumask_andnot(&workers, online, prv->idlers);
+ cpumask_and(&workers, &workers, &node_to_cpumask(peer_node));
+ cpumask_clear_cpu(cpu, &workers);
- while ( !cpumask_empty(&workers) )
- {
- peer_cpu = cpumask_cycle(peer_cpu, &workers);
- cpumask_clear_cpu(peer_cpu, &workers);
+ if ( cpumask_empty(&workers) )
+ goto next_node;
- /*
- * Get ahold of the scheduler lock for this peer CPU.
- *
- * Note: We don't spin on this lock but simply try it. Spinning could
- * cause a deadlock if the peer CPU is also load balancing and trying
- * to lock this CPU.
- */
- if ( !pcpu_schedule_trylock(peer_cpu) )
- {
- SCHED_STAT_CRANK(steal_trylock_failed);
- continue;
- }
+ peer_cpu = cpumask_first(&workers);
+ do
+ {
+ /*
+ * Get ahold of the scheduler lock for this peer CPU.
+ *
+ * Note: We don't spin on this lock but simply try it. Spinning
+ * could cause a deadlock if the peer CPU is also load
+ * balancing and trying to lock this CPU.
+ */
+ if ( !pcpu_schedule_trylock(peer_cpu) )
+ {
+ SCHED_STAT_CRANK(steal_trylock_failed);
+ peer_cpu = cpumask_cycle(peer_cpu, &workers);
+ continue;
+ }
- /*
- * Any work over there to steal?
- */
- speer = cpumask_test_cpu(peer_cpu, online) ?
- csched_runq_steal(peer_cpu, cpu, snext->pri) : NULL;
- pcpu_schedule_unlock(peer_cpu);
- if ( speer != NULL )
- {
- *stolen = 1;
- return speer;
- }
+ /* Any work over there to steal? */
+ speer = cpumask_test_cpu(peer_cpu, online) ?
+ csched_runq_steal(peer_cpu, cpu, snext->pri, bstep) : NULL;
+ pcpu_schedule_unlock(peer_cpu);
+
+ /* As soon as one vcpu is found, balancing ends */
+ if ( speer != NULL )
+ {
+ *stolen = 1;
+ return speer;
+ }
+
+ peer_cpu = cpumask_cycle(peer_cpu, &workers);
+
+ } while( peer_cpu != cpumask_first(&workers) );
+
+ next_node:
+ peer_node = cycle_node(peer_node, node_online_map);
+ } while( peer_node != node );
}
These changes all look right. But then, I'm a bit tired, so I'll give
it another once-over tomorrow. :-)
[To be continued]