Re: [Xen-devel] Xen crash after S3 suspend - Xen 4.13 and newer
On 20.09.22 12:22, Marek Marczykowski-Górecki wrote:

On Mon, Aug 22, 2022 at 12:00:27PM +0200, Marek Marczykowski-Górecki wrote:

On Mon, Aug 22, 2022 at 11:53:50AM +0200, Jan Beulich wrote:

On 21.08.2022 18:14, Marek Marczykowski-Górecki wrote:

On Sat, Oct 09, 2021 at 06:28:17PM +0200, Marek Marczykowski-Górecki wrote:

On Sun, Jan 31, 2021 at 03:15:30AM +0100, Marek Marczykowski-Górecki wrote:

I'm resurrecting this thread as it was recently mentioned elsewhere. I can still reproduce the issue on the recent staging branch (9dc687f155). It fails after the first resume (not always, but frequently enough to debug it). At least one guest needs to be running - with just (PV) dom0 the crash doesn't happen (at least for the ~8 times in a row I tried). If the first resume works, the second one will (almost?) always fail, but with different symptoms - dom0 kernel lockups (of at least some of its vcpus). I haven't debugged that one at all yet. Any help will be appreciated; I can apply debug patches, change the configuration, etc.

This still happens on 4.14.3. Maybe it is related to freeing percpu areas, as that caused other issues with suspend too? Just a thought...

I have reproduced this on current staging(*). And I can reproduce it reliably. I also got (I believe) a closely related crash with the credit1 scheduler.

(*) It isn't plain staging, it's one with my xhci console patches on top, including an attempt to make it survive S3. I believe the only relevant part there is sticking set_timer() into the console resume path (or just having a timer with a rather short delay registered). The actual tree is at https://github.com/marmarek/xen/tree/master-xue2-debug, including quite a lot of debug prints and debug hacks.

Specific crash with credit2:

(XEN) Assertion 'c2rqd(sched_unit_master(unit)) == svc->rqd' failed at common/sched/credit2.c:2274
(XEN) ----[ Xen-4.17-unstable  x86_64  debug=y  Tainted:   C    ]----
(XEN) CPU:    10
(XEN) RIP:    e008:[<ffff82d040247a4d>] credit2.c#csched2_unit_wake+0x152/0x154
(XEN) RFLAGS: 0000000000010083   CONTEXT: hypervisor (d0v0)
(XEN) rax: ffff830251778230   rbx: ffff830251768cb0   rcx: 00000032111d6000
(XEN) rdx: ffff8302515c1eb0   rsi: 0000000000000006   rdi: ffff830251769000
(XEN) rbp: ffff8302515cfd90   rsp: ffff8302515cfd70   r8:  ffff830251769000
(XEN) r9:  0000000000000000   r10: 0000000000000000   r11: 0000000000000000
(XEN) r12: ffff830251768dd0   r13: ffff8302515c1d00   r14: 0000000000000006
(XEN) r15: ffff82d0405ddb40   cr0: 0000000080050033   cr4: 0000000000372660
(XEN) cr3: 000000022f2a1000   cr2: ffff8881012738e0
(XEN) fsb: 0000744bf6a0db80   gsb: ffff888255600000   gss: 0000000000000000
(XEN) ds: 0000   es: 0000   fs: 0000   gs: 0000   ss: e010   cs: e008
(XEN) Xen code around <ffff82d040247a4d> (credit2.c#csched2_unit_wake+0x152/0x154):
(XEN)  df e8 6f bf ff ff eb ad <0f> 0b f3 0f 1e fa 55 48 89 e5 41 57 41 56 41 55
(XEN) Xen stack trace from rsp=ffff8302515cfd70:
(XEN)    ffff83025174b000 ffff830251768cb0 ffff830251778270 ffff82d0405c4298
(XEN)    ffff8302515cfdd8 ffff82d04024fcb8 0000000000000202 ffff830251778270
(XEN)    ffff83025174b000 0000000000000001 ffff830251769018 0000000000000000
(XEN)    0000000000000000 ffff8302515cfe48 ffff82d04020a8c9 ffff8882556aedc0
(XEN)    0000000000000003 00001910537e623e 0000000b988f78a6 0000000059d4a716
(XEN)    00001901f30fa41e 0000000217f96af6 0000000000000000 ffff83025174b000
(XEN)    ffff830251756000 0000000000000002 0000000000000001 ffff8302515cfe70
(XEN)    ffff82d0402f7968 ffff830251756000 ffff8302515cfef8 0000000000000018
(XEN)    ffff8302515cfee8 ffff82d0402ec6de 0000000000000000 ffffffff82f157e0
(XEN)    0000000000000000 0000000000000000 ffff8302515cfef8 0000000000000000
(XEN)    0000000000000000 ffff8302515cffff ffff830251756000 0000000000000000
(XEN)    0000000000000000 0000000000000000 0000000000000000 00007cfdaea300e7
(XEN)    ffff82d0402012bd 0000000000000000 ffffffff82c51120 ffff88810036cf00
(XEN)    0000000000000002 000000000001e120 0000000000000002 0000000000000246
(XEN)    ffffffff82f157e0 0000000000000001 0000000000000000 0000000000000018
(XEN)    ffffffff81e4a30a 0000000000000000 0000000000000002 0000000000000001
(XEN)    0000010000000000 ffffffff81e4a30a 000000000000e033 0000000000000246
(XEN)    ffffc9004aef7c18 000000000000e02b fb5ee398d214b10c eb5ef398c214a10c
(XEN)    eb56f390c21ca104 ebd6f310c29ca184 0000e0100000000a ffff830251756000
(XEN)    0000003211016000 0000000000372660 0000000000000000 80000002963e1002
(XEN) Xen call trace:
(XEN)    [<ffff82d040247a4d>] R credit2.c#csched2_unit_wake+0x152/0x154
(XEN)    [<ffff82d04024fcb8>] F vcpu_wake+0xfd/0x267
(XEN)    [<ffff82d04020a8c9>] F common_vcpu_op+0x178/0x5d1
(XEN)    [<ffff82d0402f7968>] F do_vcpu_op+0x69/0x226
(XEN)    [<ffff82d0402ec6de>] F pv_hypercall+0x575/0x657
(XEN)    [<ffff82d0402012bd>] F lstar_enter+0x13d/0x150
(XEN)
(XEN)
(XEN) ****************************************
(XEN) Panic on CPU 10:
(XEN) Assertion 'c2rqd(sched_unit_master(unit)) == svc->rqd' failed at common/sched/credit2.c:2274
(XEN) ****************************************

Ok, I think I figured it out! I added a function that verifies the run queues of all the sched units, and called it from basically every other line on the resume path. The debug function (if anybody is interested):

void verify_sched_units(void)
{
    struct domain *d;
    const struct sched_unit *unit;

    for_each_domain ( d )
    {
        for_each_sched_unit ( d, unit )
        {
            if ( c2rqd(sched_unit_master(unit)) != csched2_unit(unit)->rqd )
            {
                printk(XENLOG_WARNING
                       "d%d sched unit %d: rq=%d, unit master %d, rq=%d\n",
                       d->domain_id, unit->unit_id,
                       csched2_unit(unit)->rqd ? csched2_unit(unit)->rqd->id : -1,
                       sched_unit_master(unit),
                       c2rqd(sched_unit_master(unit))->id);
                WARN_ON(1);
            }
        }
    }
}

It appears that restore_vcpu_affinity() is responsible, specifically this part:

1216         /*
1217          * Re-assign the initial processor as after resume we have no
1218          * guarantee the old processor has come back to life again.
1219          *
1220          * Therefore, here, before actually unpausing the domains, we should
1221          * set v->processor of each of their vCPUs to something that will
1222          * make sense for the scheduler of the cpupool in which they are in.
1223          */
...
1249         res = get_sched_res(cpumask_any(cpumask_scratch_cpu(cpu)));
1250         sched_set_res(unit, res);
1251
1252         spin_unlock_irq(lock);
1253
1254         /* v->processor might have changed, so reacquire the lock. */
1255         lock = unit_schedule_lock_irq(unit);
1256         res = sched_pick_resource(unit_scheduler(unit), unit);
1257         sched_set_res(unit, res);
1258         spin_unlock_irq(lock);
1259
1260         if ( old_cpu != sched_unit_master(unit) )
1261             sched_move_irqs(unit);

It calls sched_set_res() directly, which assigns sched resources, but does _not_ adjust runqueues (if the new pcpu happens to be assigned to a different runqueue than the previous pcpu's one). I have two (non-exclusive) ideas here:

1. If old_cpu is actually still available, do not move the unit at all.
2. Use sched_migrate() instead of sched_set_res().
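The mismatch is easiest to see in isolation. Below is a small self-contained toy model (plain C, not Xen code - the struct, the helpers and the cpu-to-runqueue mapping are all made up for illustration) of the invariant the credit2 assertion checks: the runqueue derived from the pcpu a unit is placed on must match the runqueue pointer cached in the per-unit scheduler data. Assigning only the resource leaves the cached pointer stale whenever the new pcpu belongs to a different runqueue; a migrate-style update keeps the two in sync.

/*
 * Toy model of the invariant behind
 * 'c2rqd(sched_unit_master(unit)) == svc->rqd'; not Xen code.
 * Assumes pcpus 0-3 sit on runqueue 0 and pcpus 4-7 on runqueue 1.
 */
#include <assert.h>
#include <stdio.h>

struct runqueue { int id; };

struct unit_model {
    int master_cpu;       /* pcpu the unit was placed on */
    struct runqueue *rqd; /* runqueue pointer cached in scheduler data */
};

static struct runqueue rq[2] = { { 0 }, { 1 } };
static struct runqueue *cpu_to_rq(int cpu) { return &rq[cpu / 4]; }

/* Mimics what sched_set_res() does here: new pcpu, runqueue data untouched. */
static void set_res_only(struct unit_model *u, int cpu) { u->master_cpu = cpu; }

/* Mimics a proper migration: new pcpu and matching runqueue data. */
static void migrate(struct unit_model *u, int cpu)
{
    u->master_cpu = cpu;
    u->rqd = cpu_to_rq(cpu);
}

int main(void)
{
    struct unit_model u = { .master_cpu = 2, .rqd = cpu_to_rq(2) };

    set_res_only(&u, 6);   /* resume picked a pcpu on another runqueue */
    printf("after set_res_only: rq(cpu)=%d cached rqd=%d\n",
           cpu_to_rq(u.master_cpu)->id, u.rqd->id);   /* 1 vs 0: mismatch */

    migrate(&u, 6);
    assert(cpu_to_rq(u.master_cpu) == u.rqd);         /* invariant holds again */
    printf("after migrate: rq(cpu)=%d cached rqd=%d\n",
           cpu_to_rq(u.master_cpu)->id, u.rqd->id);
    return 0;
}

Run on its own, the first printf reports runqueue 1 against cached runqueue 0 - the same kind of mismatch verify_sched_units() flags - and the assert only holds after the migrate-style update.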
Here is the patch that fixes it for me:

---8<---
diff --git a/xen/common/sched/core.c b/xen/common/sched/core.c
index 83455fbde1c8..dcf202d8b307 100644
--- a/xen/common/sched/core.c
+++ b/xen/common/sched/core.c
@@ -1246,19 +1246,29 @@ void restore_vcpu_affinity(struct domain *d)
             }
         }
 
-        res = get_sched_res(cpumask_any(cpumask_scratch_cpu(cpu)));
+        /* Prefer old cpu if available. */
+        if ( cpumask_test_cpu(old_cpu, cpumask_scratch_cpu(cpu)) )
+            res = get_sched_res(old_cpu);
+        else
+            res = get_sched_res(cpumask_any(cpumask_scratch_cpu(cpu)));
         sched_set_res(unit, res);
 
         spin_unlock_irq(lock);
 
-        /* v->processor might have changed, so reacquire the lock. */
-        lock = unit_schedule_lock_irq(unit);
-        res = sched_pick_resource(unit_scheduler(unit), unit);
-        sched_set_res(unit, res);
-        spin_unlock_irq(lock);
-
+        /*
+         * If different cpu was chosen, it was random, let scheduler do proper
+         * decision.
+         */
         if ( old_cpu != sched_unit_master(unit) )
+        {
+            /* v->processor might have changed, so reacquire the lock. */
+            lock = unit_schedule_lock_irq(unit);
+            res = sched_pick_resource(unit_scheduler(unit), unit);
+            sched_migrate(unit_scheduler(unit), unit, res->master_cpu);
+            spin_unlock_irq(lock);
+
             sched_move_irqs(unit);
+        }
     }
 
     rcu_read_unlock(&sched_res_rculock);
---8<---

I have several doubts here:

1. If old_cpu is available, is sched_set_res() needed at all?
2. Should both calls be changed to sched_migrate()? Currently I changed only the second one, in case the scheduler could be confused about old_cpu not being available anymore.
3. Are there any extra locking requirements for sched_migrate() at this stage? The long comment above sched_unit_migrate_start() suggests there might be, but I'm not sure if that's really the case during resume.
4. Related to the above - should thaw_domains() be modified to call restore_vcpu_affinity() for all domains first, and unpause them only later? That could reduce the locking requirements, I guess.

Looking into this in more detail, I think the fix is much easier. In case a cpu isn't coming up again after suspend, cpu_disable_scheduler() for this cpu will be called. This will do the needed vcpu migration, so we can just remove restore_vcpu_affinity() completely without any need for a replacement. I'll write a patch.

Juergen
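For reference, the placement decision encoded in Marek's patch above boils down to a couple of lines. The sketch below is another self-contained toy (plain C, not the Xen API - cpu_in_mask(), any_cpu() and place_unit() are invented stand-ins, and a plain bitmask stands in for cpumask_scratch_cpu()): keep old_cpu if it came back online, otherwise take an arbitrary online pcpu and flag that the scheduler still has to make a proper placement, which is what the sched_migrate() call in the patch is for.

#include <stdbool.h>
#include <stdio.h>

/* A plain bitmask stands in for the cpumask of usable pcpus. */
static bool cpu_in_mask(unsigned int cpu, unsigned long mask)
{
    return mask & (1UL << cpu);
}

/* Lowest set bit, as a stand-in for cpumask_any(); mask must be non-zero. */
static unsigned int any_cpu(unsigned long mask)
{
    unsigned int cpu = 0;

    while ( !(mask & (1UL << cpu)) )
        cpu++;
    return cpu;
}

/*
 * Mirror of the patch's logic: prefer old_cpu if it is usable; otherwise the
 * pick is arbitrary and the caller must let the scheduler re-place (and thus
 * properly migrate) the unit.
 */
static unsigned int place_unit(unsigned int old_cpu, unsigned long usable,
                               bool *need_migrate)
{
    if ( cpu_in_mask(old_cpu, usable) )
    {
        *need_migrate = false;
        return old_cpu;
    }
    *need_migrate = true;
    return any_cpu(usable);
}

int main(void)
{
    bool migrate;
    unsigned int cpu;

    cpu = place_unit(2, 0xffUL, &migrate);  /* old pcpu came back: keep it */
    printf("cpu=%u migrate=%d\n", cpu, migrate);

    cpu = place_unit(2, 0xf0UL, &migrate);  /* old pcpu stayed down: pick + migrate */
    printf("cpu=%u migrate=%d\n", cpu, migrate);
    return 0;
}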