Re: [Xen-devel] [PATCH 03 of 10 v2] xen: sched_credit: let the scheduler know about node-affinity
- To: Dario Faggioli <dario.faggioli@xxxxxxxxxx>
- From: George Dunlap <george.dunlap@xxxxxxxxxxxxx>
- Date: Thu, 20 Dec 2012 16:48:13 +0000
- Cc: Marcus Granado <Marcus.Granado@xxxxxxxxxxxxx>, Dan Magenheimer <dan.magenheimer@xxxxxxxxxx>, Ian Campbell <Ian.Campbell@xxxxxxxxxx>, Anil Madhavapeddy <anil@xxxxxxxxxx>, Andrew Cooper <Andrew.Cooper3@xxxxxxxxxx>, Juergen Gross <juergen.gross@xxxxxxxxxxxxxx>, Ian Jackson <Ian.Jackson@xxxxxxxxxxxxx>, "xen-devel@xxxxxxxxxxxxx" <xen-devel@xxxxxxxxxxxxx>, Jan Beulich <JBeulich@xxxxxxxx>, Daniel De Graaf <dgdegra@xxxxxxxxxxxxx>, Matt Wilson <msw@xxxxxxxxxx>
- Delivery-date: Thu, 20 Dec 2012 16:54:49 +0000
- List-id: Xen developer discussion <xen-devel.lists.xen.org>
On 19/12/12 19:07, Dario Faggioli wrote:
static inline int
-__csched_vcpu_is_migrateable(struct vcpu *vc, int dest_cpu)
+__csched_vcpu_is_migrateable(struct vcpu *vc, int dest_cpu, cpumask_t *mask)
{
/*
* Don't pick up work that's in the peer's scheduling tail or hot on
- * peer PCPU. Only pick up work that's allowed to run on our CPU.
+ * peer PCPU. Only pick up work that prefers and/or is allowed to run
+ * on our CPU.
*/
return !vc->is_running &&
!__csched_vcpu_is_cache_hot(vc) &&
- cpumask_test_cpu(dest_cpu, vc->cpu_affinity);
+ cpumask_test_cpu(dest_cpu, mask);
+}
+
+static inline int
+__csched_vcpu_should_migrate(int cpu, cpumask_t *mask, cpumask_t *idlers)
+{
+ /*
+ * Consent to migration if cpu is one of the idlers in the VCPU's
+ * affinity mask. In fact, if that is not the case, it just means it
+ * was some other CPU that was tickled and should hence come and pick
+ * VCPU up. Migrating it to cpu would only make things worse.
+ */
+ return cpumask_test_cpu(cpu, idlers) && cpumask_test_cpu(cpu, mask);
}
I don't get what this function is for. The only time you call it is in
csched_runq_steal(), immediately after calling
__csched_vcpu_is_migrateable(). But is_migrateable() has already
checked cpumask_test_cpu(cpu, mask). So why do we need to check it again?
We could just replace this with cpumask_test_cpu(cpu, prv->idlers). But
that clause is going to be either true or false for every single
iteration of all the loops, including the loops in
csched_load_balance(). Wouldn't it make more sense to check it once in
csched_load_balance(), rather than re-checking it on every iteration of all
those nested loops?
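To make that concrete, something like the following near the top of
csched_load_balance() is roughly what I have in mind (just a sketch, not
even compile-tested, and assuming we can simply take the existing
"nothing stolen" exit path):

    /*
     * 'cpu' never changes below, so this test is loop-invariant: it is
     * either true or false for every peer node, peer cpu and candidate
     * vcpu we would otherwise look at.  If cpu is not one of the idlers,
     * whichever CPU actually got tickled will come and pick the work up
     * itself, so there is no point scanning the runqueues here at all.
     */
    if ( !cpumask_test_cpu(cpu, prv->idlers) )
        goto out;

That way neither csched_runq_steal() nor the migrateability check needs
to know about prv->idlers at all.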
And in any case, looking at the caller of csched_load_balance(), it
explicitly says to steal work if the next thing on the runqueue of cpu
has a priority of TS_OVER. That was chosen for a reason -- if you want
to change that, you should change it there at the top (and make a
justification for doing so), not deeply nested in a function like this.
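For reference, the call site I mean is the one in csched_schedule(),
which (quoting from memory, so the exact form may differ a bit) goes
along these lines:

    /*
     * Work is only stolen if what we would otherwise run is at or below
     * TS_OVER priority -- i.e. the policy decision lives here, at the top.
     */
    if ( snext->pri > CSCHED_PRI_TS_OVER )
        __runq_remove(snext);
    else
        snext = csched_load_balance(prv, cpu, snext, ...);

If this series wants to relax or refine that condition, that is where the
change (and its justification) ought to show up.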
Or am I completely missing something?
static struct csched_vcpu *
-csched_runq_steal(int peer_cpu, int cpu, int pri)
+csched_runq_steal(int peer_cpu, int cpu, int pri, int balance_step)
{
const struct csched_pcpu * const peer_pcpu = CSCHED_PCPU(peer_cpu);
+ struct csched_private *prv = CSCHED_PRIV(per_cpu(scheduler, peer_cpu));
const struct vcpu * const peer_vcpu = curr_on_cpu(peer_cpu);
struct csched_vcpu *speer;
struct list_head *iter;
@@ -1265,11 +1395,24 @@ csched_runq_steal(int peer_cpu, int cpu,
if ( speer->pri <= pri )
break;
- /* Is this VCPU is runnable on our PCPU? */
+ /* Is this VCPU runnable on our PCPU? */
vc = speer->vcpu;
BUG_ON( is_idle_vcpu(vc) );
- if (__csched_vcpu_is_migrateable(vc, cpu))
+ /*
+ * Retrieve the correct mask for this balance_step or, if we're
+ * dealing with node-affinity and the vcpu has no node affinity
+ * at all, just skip this vcpu. That is needed if we want to
+ * check if we have any node-affine work to steal first (wrt
+ * any vcpu-affine work).
+ */
+ if ( csched_balance_cpumask(vc, balance_step,
+ &scratch_balance_mask) )
+ continue;
Again, I think for clarity the best thing to do here is:
    if ( balance_step == NODE
         && cpumask_full(speer->sdom->node_affinity_cpumask) )
        continue;

    csched_balance_cpumask();
    /* Etc. */
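Spelled out a bit more, and still only as a sketch (I'm guessing at the
spelling of the node-affinity step constant from the earlier patch, and
reusing the scratch mask from the hunk above):

    if ( balance_step == CSCHED_BALANCE_NODE_AFFINITY
         && cpumask_full(speer->sdom->node_affinity_cpumask) )
        continue;  /* no real node-affinity, nothing node-affine to steal */

    csched_balance_cpumask(vc, balance_step, &scratch_balance_mask);

    if ( __csched_vcpu_is_migrateable(vc, cpu, &scratch_balance_mask) )
    {
        /* pull the vcpu over, as before */
        ...
    }

That keeps the "does this vcpu have a meaningful node-affinity at all?"
question visible right where the decision is taken, rather than burying
it in csched_balance_cpumask()'s return value.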
@@ -1295,7 +1438,8 @@ csched_load_balance(struct csched_privat
struct csched_vcpu *speer;
cpumask_t workers;
cpumask_t *online;
- int peer_cpu;
+ int peer_cpu, peer_node, bstep;
+ int node = cpu_to_node(cpu);
BUG_ON( cpu != snext->vcpu->processor );
online = cpupool_scheduler_cpumask(per_cpu(cpupool, cpu));
@@ -1312,42 +1456,68 @@ csched_load_balance(struct csched_privat
SCHED_STAT_CRANK(load_balance_other);
/*
- * Peek at non-idling CPUs in the system, starting with our
- * immediate neighbour.
+ * Let's look around for work to steal, taking both vcpu-affinity
+ * and node-affinity into account. More specifically, we check all
+ * the non-idle CPUs' runq, looking for:
+ * 1. any node-affine work to steal first,
+ * 2. if not finding anything, any vcpu-affine work to steal.
*/
- cpumask_andnot(&workers, online, prv->idlers);
- cpumask_clear_cpu(cpu, &workers);
- peer_cpu = cpu;
+ for_each_csched_balance_step( bstep )
+ {
+ /*
+ * We peek at the non-idling CPUs in a node-wise fashion. In fact,
+ * it is more likely that we find some node-affine work on our same
+ * node, not to mention that migrating vcpus within the same node
+ * could well be expected to be cheaper than across nodes (memory
+ * stays local, there might be some node-wide cache[s], etc.).
+ */
+ peer_node = node;
+ do
+ {
+ /* Find out what the !idle are in this node */
+ cpumask_andnot(&workers, online, prv->idlers);
+ cpumask_and(&workers, &workers, &node_to_cpumask(peer_node));
+ cpumask_clear_cpu(cpu, &workers);
- while ( !cpumask_empty(&workers) )
- {
- peer_cpu = cpumask_cycle(peer_cpu, &workers);
- cpumask_clear_cpu(peer_cpu, &workers);
+ if ( cpumask_empty(&workers) )
+ goto next_node;
- /*
- * Get ahold of the scheduler lock for this peer CPU.
- *
- * Note: We don't spin on this lock but simply try it. Spinning could
- * cause a deadlock if the peer CPU is also load balancing and trying
- * to lock this CPU.
- */
- if ( !pcpu_schedule_trylock(peer_cpu) )
- {
- SCHED_STAT_CRANK(steal_trylock_failed);
- continue;
- }
+ peer_cpu = cpumask_first(&workers);
+ do
+ {
+ /*
+ * Get ahold of the scheduler lock for this peer CPU.
+ *
+ * Note: We don't spin on this lock but simply try it. Spinning
+ * could cause a deadlock if the peer CPU is also load
+ * balancing and trying to lock this CPU.
+ */
+ if ( !pcpu_schedule_trylock(peer_cpu) )
+ {
+ SCHED_STAT_CRANK(steal_trylock_failed);
+ peer_cpu = cpumask_cycle(peer_cpu, &workers);
+ continue;
+ }
- /*
- * Any work over there to steal?
- */
- speer = cpumask_test_cpu(peer_cpu, online) ?
- csched_runq_steal(peer_cpu, cpu, snext->pri) : NULL;
- pcpu_schedule_unlock(peer_cpu);
- if ( speer != NULL )
- {
- *stolen = 1;
- return speer;
- }
+ /* Any work over there to steal? */
+ speer = cpumask_test_cpu(peer_cpu, online) ?
+ csched_runq_steal(peer_cpu, cpu, snext->pri, bstep) : NULL;
+ pcpu_schedule_unlock(peer_cpu);
+
+ /* As soon as one vcpu is found, balancing ends */
+ if ( speer != NULL )
+ {
+ *stolen = 1;
+ return speer;
+ }
+
+ peer_cpu = cpumask_cycle(peer_cpu, &workers);
+
+ } while( peer_cpu != cpumask_first(&workers) );
+
+ next_node:
+ peer_node = cycle_node(peer_node, node_online_map);
+ } while( peer_node != node );
}
These changes all look right. But then, I'm a bit tired, so I'll give
it another once-over tomorrow. :-)
[To be continued]