Re: [Xen-devel] [PATCH] xen: sched_credit: filter node-affinity mask against online cpus
On 13/09/13 17:09, Dario Faggioli wrote:

in _csched_cpu_pick(), as not doing so may result in the domain's
node-affinity mask (as retrieved by csched_balance_cpumask()) and
online mask (as retrieved by cpupool_scheduler_cpumask()) having an
empty intersection.

Therefore, when attempting a node-affinity load balancing step and
running this:

    ...
    /* Pick an online CPU from the proper affinity mask */
    csched_balance_cpumask(vc, balance_step, &cpus);
    cpumask_and(&cpus, &cpus, online);
    ...

we end up with an empty cpumask (in cpus). At this point, in the
following code:

    ...
    /* If present, prefer vc's current processor */
    cpu = cpumask_test_cpu(vc->processor, &cpus)
            ? vc->processor
            : cpumask_cycle(vc->processor, &cpus);
    ...

an ASSERT (from inside cpumask_cycle()) triggers like this:

(XEN) Xen call trace:
(XEN)    [<ffff82d08011b124>] _csched_cpu_pick+0x1d2/0x652
(XEN)    [<ffff82d08011b5b2>] csched_cpu_pick+0xe/0x10
(XEN)    [<ffff82d0801232de>] vcpu_migrate+0x167/0x31e
(XEN)    [<ffff82d0801238cc>] cpu_disable_scheduler+0x1c8/0x287
(XEN)    [<ffff82d080101b3f>] cpupool_unassign_cpu_helper+0x20/0xb4
(XEN)    [<ffff82d08010544f>] continue_hypercall_tasklet_handler+0x4a/0xb1
(XEN)    [<ffff82d080127793>] do_tasklet_work+0x78/0xab
(XEN)    [<ffff82d080127a70>] do_tasklet+0x5f/0x8b
(XEN)    [<ffff82d080158985>] idle_loop+0x57/0x5e
(XEN)
(XEN)
(XEN) ****************************************
(XEN) Panic on CPU 1:
(XEN) Assertion 'cpu < nr_cpu_ids' failed at /home/dario/Sources/xen/xen/xen.git/xen/include/xe:16481

It is, for example, sufficient to have a domain with node-affinity to
NUMA node 1 running, and to issue an `xl cpupool-numa-split', for the
above to happen. That is because, by default, all the existing domains
remain assigned to the first cpupool, which, after the
cpupool-numa-split, only includes NUMA node 0.

This change prevents that by generalizing the function used for
figuring out whether a node-affinity load balancing step is legit or
not. This way we can, in _csched_cpu_pick(), figure out early enough
that the mask would end up empty, skip the step altogether and avoid
the splat.

Signed-off-by: Dario Faggioli <dario.faggioli@xxxxxxxxxx>
Cc: George Dunlap <George.Dunlap@xxxxxxxxxxxxx>
Cc: Jan Beulich <JBeulich@xxxxxxxx>
Cc: Keir Fraser <keir@xxxxxxx>
---
This was originally submitted (yesterday) as "xen: make sure the
node-affinity is always updated". After discussing it with George, I
got convinced that a fix like this is more appropriate, since it deals
with the issue in the scheduler, without making a mess out of the user
interface. This is therefore not being posted as a real v2 of that
patch, because the approach changed radically.
---
 xen/common/sched_credit.c | 41 ++++++++++++++++++++++++++++++-----------
 1 file changed, 30 insertions(+), 11 deletions(-)

diff --git a/xen/common/sched_credit.c b/xen/common/sched_credit.c
index dbe6de6..d76cd88 100644
--- a/xen/common/sched_credit.c
+++ b/xen/common/sched_credit.c
@@ -296,15 +296,25 @@ static void csched_set_node_affinity(
  * vcpu-affinity balancing is always necessary and must never be skipped.
  * OTOH, if a domain's node-affinity is said to be automatically computed
  * (or if it just spans all the nodes), we can safely avoid dealing with
- * node-affinity entirely. Ah, node-affinity is also deemed meaningless
- * in case it has empty intersection with the vcpu's vcpu-affinity, as it
- * would mean trying to schedule it on _no_ pcpu!
+ * node-affinity entirely.
+ *
+ * Node-affinity is also deemed meaningless in case it has empty
+ * intersection with mask, to cover the cases where using the node-affinity
+ * mask seems legit, but would instead lead to trying to schedule the vcpu
+ * on _no_ pcpu! Typical use cases are for mask to be equal to the vcpu's
+ * vcpu-affinity, or to the && of vcpu-affinity and the set of online cpus
+ * in the domain's cpupool.
  */
-#define __vcpu_has_node_affinity(vc) \
-    ( !(cpumask_full(CSCHED_DOM(vc->domain)->node_affinity_cpumask) \
-        || !cpumask_intersects(vc->cpu_affinity, \
-                               CSCHED_DOM(vc->domain)->node_affinity_cpumask) \
-        || vc->domain->auto_node_affinity == 1) )
+static inline int __vcpu_has_node_affinity(struct vcpu *vc, cpumask_t *mask)
+{
+    if ( vc->domain->auto_node_affinity == 1
+         || cpumask_full(CSCHED_DOM(vc->domain)->node_affinity_cpumask)
+         || !cpumask_intersects(CSCHED_DOM(vc->domain)->node_affinity_cpumask,
+                                mask) )
+        return 0;
+
+    return 1;
+}

 /*
  * Each csched-balance step uses its own cpumask. This function determines
@@ -393,7 +403,8 @@ __runq_tickle(unsigned int cpu, struct csched_vcpu *new)
         int new_idlers_empty;

         if ( balance_step == CSCHED_BALANCE_NODE_AFFINITY
-             && !__vcpu_has_node_affinity(new->vcpu) )
+             && !__vcpu_has_node_affinity(new->vcpu,
+                                          new->vcpu->cpu_affinity) )
             continue;

         /* Are there idlers suitable for new (for this balance step)? */
@@ -627,10 +638,18 @@ _csched_cpu_pick(const struct scheduler *ops, struct vcpu *vc, bool_t commit)
     int balance_step;

     online = cpupool_scheduler_cpumask(vc->domain->cpupool);
+    cpumask_and(&cpus, vc->cpu_affinity, online);
+
     for_each_csched_balance_step( balance_step )
     {
+        /*
+         * We filter the node-affinity mask against
+         * [vc->cpu_affinity && online] here to avoid problems if all
+         * the cpus in the node-affinity mask are offline (e.g.,
+         * because the domain moved to a different cpupool).
+         */

It's probably worth mentioning that the reason you do the cpumask_and
here, and not in the other place this is called, is that cpumask_cycle
is called after this one (which will choke on an empty mask), but not
in the other place it's called.

Other than that, looks good -- thanks.

 -George
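To make the failing scenario concrete, here is a minimal standalone C
sketch (not Xen code: plain 64-bit masks stand in for cpumask_t, and
has_node_affinity() merely mimics the generalized
__vcpu_has_node_affinity(vc, mask) check). It models a domain with
node-affinity to NUMA node 1 left in a cpupool that, after
`xl cpupool-numa-split', only spans node 0, and shows how checking the
intersection up front lets the node-affinity balance step be skipped
instead of handing an empty mask to cpumask_cycle():

/*
 * Standalone illustration only -- NOT Xen code. Plain 64-bit masks stand
 * in for Xen's cpumask_t, and has_node_affinity() is just an analogue of
 * the generalized __vcpu_has_node_affinity(vc, mask) check in the patch.
 */
#include <stdint.h>
#include <stdio.h>

/* Return 0 when the node-affinity balancing step should be skipped. */
static int has_node_affinity(uint64_t node_affinity, uint64_t mask,
                             int auto_node_affinity)
{
    if ( auto_node_affinity ||
         node_affinity == UINT64_MAX ||   /* analogue of cpumask_full() */
         !(node_affinity & mask) )        /* empty intersection         */
        return 0;
    return 1;
}

int main(void)
{
    /* 8 pcpus, 4 per node. The domain has node-affinity to node 1
     * (cpus 4-7), but after `xl cpupool-numa-split' its cpupool only
     * spans node 0 (cpus 0-3). */
    uint64_t node_affinity = 0xf0;
    uint64_t cpu_affinity  = 0xff;
    uint64_t online        = 0x0f;

    /* The [vc->cpu_affinity && online] filter computed up front. */
    uint64_t cpus = cpu_affinity & online;

    if ( !has_node_affinity(node_affinity, cpus, 0) )
        printf("skip node-affinity step: intersection would be empty\n");
    else
        printf("node-affinity step is worth attempting\n");

    return 0;
}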