
Re: [Xen-devel] [PATCH v2] Fix scheduler crash after s3 resume

On 24/01/13 16:36, Jan Beulich wrote:
On 24.01.13 at 15:26, Tomasz Wroblewski <tomasz.wroblewski@xxxxxxxxxx> wrote:
@@ -212,6 +213,8 @@
             BUG_ON(error == -EBUSY);
             printk("Error taking CPU%d up: %d\n", cpu, error);
+        if (system_state == SYS_STATE_resume)
+            cpumask_set_cpu(cpu, cpupool0->cpu_valid);
This can't be right: What tells you that all CPUs were in pool 0?

You're right, in my simple tests this was the case, but generally speaking it might not be. Would an approach based on storing each cpupool's mask in disable_nonboot_cpus() and restoring it in enable_nonboot_cpus() be more acceptable?
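Something along these lines is what I have in mind (a standalone sketch, not actual Xen code: the fixed-size pool array, the simplified one-word cpumask, and the `_sketch` function names are all stand-ins for illustration):

```c
#include <assert.h>

/* Simplified stand-in for Xen's cpumask: one bit per CPU. */
typedef unsigned long cpumask_t;

struct cpupool {
    int       id;
    cpumask_t cpu_valid;   /* CPUs currently assigned to this pool */
};

#define NR_POOLS 2
static struct cpupool pools[NR_POOLS] = {
    { .id = 0, .cpu_valid = 0x3 },   /* pool 0: CPUs 0-1 */
    { .id = 1, .cpu_valid = 0xc },   /* pool 1: CPUs 2-3 */
};

/* Per-pool masks captured on the way down, restored on the way up. */
static cpumask_t saved_cpu_valid[NR_POOLS];

/* On suspend: remember each pool's mask, then clear all but the boot CPU. */
static void disable_nonboot_cpus_sketch(void)
{
    for ( int i = 0; i < NR_POOLS; i++ )
    {
        saved_cpu_valid[i] = pools[i].cpu_valid;
        pools[i].cpu_valid &= 0x1;   /* only CPU0 stays online */
    }
}

/* On resume: put each CPU back into the pool it came from. */
static void enable_nonboot_cpus_sketch(void)
{
    for ( int i = 0; i < NR_POOLS; i++ )
        pools[i].cpu_valid = saved_cpu_valid[i];
}
```

That way a CPU that belonged to pool 1 before suspend ends up back in pool 1, instead of being unconditionally set in cpupool0->cpu_valid.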
Also, for the future - generating patches with -p helps quite
a bit in reviewing them.

Ok, thanks!
--- a/xen/common/schedule.c     Mon Jan 21 17:03:10 2013 +0000
+++ b/xen/common/schedule.c     Thu Jan 24 13:40:31 2013 +0000
@@ -545,7 +545,7 @@
     int    ret = 0;

     c = per_cpu(cpupool, cpu);
-    if ( (c == NULL) || (system_state == SYS_STATE_suspend) )
+    if ( c == NULL )
         return ret;

     for_each_domain_in_cpupool ( d, c )
@@ -556,7 +556,8 @@

             cpumask_and(&online_affinity, v->cpu_affinity, c->cpu_valid);
             if ( cpumask_empty(&online_affinity)&&
-                 cpumask_test_cpu(cpu, v->cpu_affinity) )
+                 cpumask_test_cpu(cpu, v->cpu_affinity)&&
+                 system_state != SYS_STATE_suspend )
                 printk("Breaking vcpu affinity for domain %d vcpu %d\n",
                         v->domain->domain_id, v->vcpu_id);
I doubt this is correct, as you don't restore any of the settings
during resume that you tear down here.

Is the objection about the affinity part, or also about the (c == NULL) bit? The cpu_disable_scheduler() function is currently part of the regular cpu-down process, and was also part of the suspend process before the "system state variable" changeset, which regressed it. So the (c == NULL) hunk mostly just returns to the previous state, where this was working a lot better (by empirical testing). But I am no expert on this, so I would be grateful for ideas on how this could be fixed in a better way!

Just to recap: the current problem boils down, I believe, to the fact that vcpu_wake() (schedule.c) occasionally keeps getting called during the S3 path for cpus whose per_cpu data has already been freed, causing a crash. The safest way of fixing it seemed to be to put the suspend-time cpu_disable_scheduler() back under the regular path; it probably isn't the best approach.
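The failure mode above can be illustrated with a tiny model (again a sketch with made-up names, not the real scheduler code): once the per-CPU scheduler data has been freed for a CPU taken down during suspend, a wakeup routed to that CPU has to bail out instead of dereferencing it:

```c
#include <assert.h>
#include <stddef.h>

#define NR_CPUS 4

/* Stand-in for the per-CPU scheduler private data; NULL once freed. */
struct sched_data { int runq_len; };
static struct sched_data *per_cpu_sched[NR_CPUS];

/* Sketch of a wakeup that tolerates CPUs whose data was torn down. */
static int vcpu_wake_sketch(int cpu)
{
    struct sched_data *sd = per_cpu_sched[cpu];

    if ( sd == NULL )
        return -1;           /* CPU's scheduler data already freed */
    sd->runq_len++;          /* safe to touch the per-CPU state */
    return 0;
}
```

Without the NULL check (or without keeping cpu_disable_scheduler() on the suspend path so such wakeups never reach a torn-down CPU), the increment dereferences freed per-CPU state, which matches the crash seen during S3.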

Xen-devel mailing list
