Re: [Xen-devel] Xen crash after S3 suspend - Xen 4.13 and newer
On Mon, Aug 22, 2022 at 12:00:27PM +0200, Marek Marczykowski-Górecki wrote:
> On Mon, Aug 22, 2022 at 11:53:50AM +0200, Jan Beulich wrote:
> > On 21.08.2022 18:14, Marek Marczykowski-Górecki wrote:
> > > On Sat, Oct 09, 2021 at 06:28:17PM +0200, Marek Marczykowski-Górecki wrote:
> > > > On Sun, Jan 31, 2021 at 03:15:30AM +0100, Marek Marczykowski-Górecki wrote:
> > > > > I'm resurrecting this thread as it was recently mentioned elsewhere. I
> > > > > can still reproduce the issue on the recent staging branch (9dc687f155).
> > > > >
> > > > > It fails after the first resume (not always, but frequently enough to
> > > > > debug it). At least one guest needs to be running - with just (PV) dom0
> > > > > the crash doesn't happen (at least for the ~8 times in a row I tried).
> > > > > If the first resume works, the second will (almost?) always fail, but
> > > > > with different symptoms - dom0 kernel lockups (of at least some of its
> > > > > vcpus). I haven't debugged that one at all yet.
> > > > >
> > > > > Any help will be appreciated; I can apply debug patches, change
> > > > > configuration, etc.
> > > >
> > > > This still happens on 4.14.3. Maybe it is related to freeing percpu
> > > > areas, as that caused other issues with suspend too? Just a thought...
> > >
> > > I have reproduced this on current staging(*). And I can reproduce it
> > > reliably. I also got a (I believe) closely related crash with the
> > > credit1 scheduler.
> > >
> > > (*) It isn't plain staging, it's one with my xhci console patches on
> > > top, including an attempt to make it survive S3. I believe the only
> > > relevant part there is sticking set_timer() into the console resume
> > > path (or just having a timer with a rather short delay registered). The
> > > actual tree is at https://github.com/marmarek/xen/tree/master-xue2-debug,
> > > including quite a lot of debug prints and debug hacks.
>
> Specific crash with credit2:
>
> (XEN) Assertion 'c2rqd(sched_unit_master(unit)) == svc->rqd' failed at common/sched/credit2.c:2274
> (XEN) ----[ Xen-4.17-unstable  x86_64  debug=y  Tainted:   C    ]----
> (XEN) CPU:    10
> (XEN) RIP:    e008:[<ffff82d040247a4d>] credit2.c#csched2_unit_wake+0x152/0x154
> (XEN) RFLAGS: 0000000000010083   CONTEXT: hypervisor (d0v0)
> (XEN) rax: ffff830251778230   rbx: ffff830251768cb0   rcx: 00000032111d6000
> (XEN) rdx: ffff8302515c1eb0   rsi: 0000000000000006   rdi: ffff830251769000
> (XEN) rbp: ffff8302515cfd90   rsp: ffff8302515cfd70   r8:  ffff830251769000
> (XEN) r9:  0000000000000000   r10: 0000000000000000   r11: 0000000000000000
> (XEN) r12: ffff830251768dd0   r13: ffff8302515c1d00   r14: 0000000000000006
> (XEN) r15: ffff82d0405ddb40   cr0: 0000000080050033   cr4: 0000000000372660
> (XEN) cr3: 000000022f2a1000   cr2: ffff8881012738e0
> (XEN) fsb: 0000744bf6a0db80   gsb: ffff888255600000   gss: 0000000000000000
> (XEN) ds: 0000   es: 0000   fs: 0000   gs: 0000   ss: e010   cs: e008
> (XEN) Xen code around <ffff82d040247a4d> (credit2.c#csched2_unit_wake+0x152/0x154):
> (XEN)  df e8 6f bf ff ff eb ad <0f> 0b f3 0f 1e fa 55 48 89 e5 41 57 41 56 41 55
> (XEN) Xen stack trace from rsp=ffff8302515cfd70:
> (XEN)    ffff83025174b000 ffff830251768cb0 ffff830251778270 ffff82d0405c4298
> (XEN)    ffff8302515cfdd8 ffff82d04024fcb8 0000000000000202 ffff830251778270
> (XEN)    ffff83025174b000 0000000000000001 ffff830251769018 0000000000000000
> (XEN)    0000000000000000 ffff8302515cfe48 ffff82d04020a8c9 ffff8882556aedc0
> (XEN)    0000000000000003 00001910537e623e 0000000b988f78a6 0000000059d4a716
> (XEN)    00001901f30fa41e 0000000217f96af6 0000000000000000 ffff83025174b000
> (XEN)    ffff830251756000 0000000000000002 0000000000000001 ffff8302515cfe70
> (XEN)    ffff82d0402f7968 ffff830251756000 ffff8302515cfef8 0000000000000018
> (XEN)    ffff8302515cfee8 ffff82d0402ec6de 0000000000000000 ffffffff82f157e0
> (XEN)    0000000000000000 0000000000000000 ffff8302515cfef8 0000000000000000
> (XEN)    0000000000000000 ffff8302515cffff ffff830251756000 0000000000000000
> (XEN)    0000000000000000 0000000000000000 0000000000000000 00007cfdaea300e7
> (XEN)    ffff82d0402012bd 0000000000000000 ffffffff82c51120 ffff88810036cf00
> (XEN)    0000000000000002 000000000001e120 0000000000000002 0000000000000246
> (XEN)    ffffffff82f157e0 0000000000000001 0000000000000000 0000000000000018
> (XEN)    ffffffff81e4a30a 0000000000000000 0000000000000002 0000000000000001
> (XEN)    0000010000000000 ffffffff81e4a30a 000000000000e033 0000000000000246
> (XEN)    ffffc9004aef7c18 000000000000e02b fb5ee398d214b10c eb5ef398c214a10c
> (XEN)    eb56f390c21ca104 ebd6f310c29ca184 0000e0100000000a ffff830251756000
> (XEN)    0000003211016000 0000000000372660 0000000000000000 80000002963e1002
> (XEN) Xen call trace:
> (XEN)    [<ffff82d040247a4d>] R credit2.c#csched2_unit_wake+0x152/0x154
> (XEN)    [<ffff82d04024fcb8>] F vcpu_wake+0xfd/0x267
> (XEN)    [<ffff82d04020a8c9>] F common_vcpu_op+0x178/0x5d1
> (XEN)    [<ffff82d0402f7968>] F do_vcpu_op+0x69/0x226
> (XEN)    [<ffff82d0402ec6de>] F pv_hypercall+0x575/0x657
> (XEN)    [<ffff82d0402012bd>] F lstar_enter+0x13d/0x150
> (XEN)
> (XEN)
> (XEN) ****************************************
> (XEN) Panic on CPU 10:
> (XEN) Assertion 'c2rqd(sched_unit_master(unit)) == svc->rqd' failed at common/sched/credit2.c:2274
> (XEN) ****************************************

Ok, I think I figured it out! I added a function that verifies the
runqueues of all sched units, and called it basically every other line on
the resume path.
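(For reference: the assertion that fires is the consistency check in
csched2_unit_wake() - the runqueue derived from the unit's current pcpu
must match the one cached in the per-unit credit2 data. Roughly, simplified
from common/sched/credit2.c:

    /* Simplified sketch of the failing check (common/sched/credit2.c:2274);
     * svc is the credit2 per-unit data of the unit being woken. */
    const struct csched2_unit *svc = csched2_unit(unit);

    /* The runqueue of the pcpu this unit is assigned to must match the
     * runqueue cached in svc->rqd; after S3 resume they can disagree. */
    ASSERT(c2rqd(sched_unit_master(unit)) == svc->rqd);

That is exactly the invariant the debug function below checks domain-wide.)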
The debug function (if anybody is interested):

    /* Check that every sched unit's cached credit2 runqueue matches the
     * runqueue of the pcpu it is assigned to. */
    void verify_sched_units(void)
    {
        struct domain *d;
        const struct sched_unit *unit;

        for_each_domain ( d )
        {
            for_each_sched_unit ( d, unit )
            {
                if ( c2rqd(sched_unit_master(unit)) != csched2_unit(unit)->rqd )
                {
                    printk(XENLOG_WARNING
                           "d%d sched unit %d: rq=%d, unit master %d, rq=%d\n",
                           d->domain_id, unit->unit_id,
                           csched2_unit(unit)->rqd ? csched2_unit(unit)->rqd->id : -1,
                           sched_unit_master(unit),
                           c2rqd(sched_unit_master(unit))->id);
                    WARN_ON(1);
                }
            }
        }
    }

It appears that restore_vcpu_affinity() is responsible, specifically this
part (xen/common/sched/core.c):

1216         /*
1217          * Re-assign the initial processor as after resume we have no
1218          * guarantee the old processor has come back to life again.
1219          *
1220          * Therefore, here, before actually unpausing the domains, we should
1221          * set v->processor of each of their vCPUs to something that will
1222          * make sense for the scheduler of the cpupool in which they are in.
1223          */
...
1249         res = get_sched_res(cpumask_any(cpumask_scratch_cpu(cpu)));
1250         sched_set_res(unit, res);
1251
1252         spin_unlock_irq(lock);
1253
1254         /* v->processor might have changed, so reacquire the lock. */
1255         lock = unit_schedule_lock_irq(unit);
1256         res = sched_pick_resource(unit_scheduler(unit), unit);
1257         sched_set_res(unit, res);
1258         spin_unlock_irq(lock);
1259
1260         if ( old_cpu != sched_unit_master(unit) )
1261             sched_move_irqs(unit);

It calls sched_set_res() directly, which assigns sched resources, but does
_not_ adjust runqueues (if the new pcpu happens to be assigned to a
different runqueue than the one of the previous pcpu).

I have two (non-exclusive) ideas here:
1. If old_cpu is actually still available, do not move the unit at all.
2. Use sched_migrate() instead of sched_set_res().

Here is the patch that fixes it for me:

---8<---
diff --git a/xen/common/sched/core.c b/xen/common/sched/core.c
index 83455fbde1c8..dcf202d8b307 100644
--- a/xen/common/sched/core.c
+++ b/xen/common/sched/core.c
@@ -1246,19 +1246,29 @@ void restore_vcpu_affinity(struct domain *d)
             }
         }
 
-        res = get_sched_res(cpumask_any(cpumask_scratch_cpu(cpu)));
+        /* Prefer old cpu if available. */
+        if ( cpumask_test_cpu(old_cpu, cpumask_scratch_cpu(cpu)) )
+            res = get_sched_res(old_cpu);
+        else
+            res = get_sched_res(cpumask_any(cpumask_scratch_cpu(cpu)));
         sched_set_res(unit, res);
 
         spin_unlock_irq(lock);
 
-        /* v->processor might have changed, so reacquire the lock. */
-        lock = unit_schedule_lock_irq(unit);
-        res = sched_pick_resource(unit_scheduler(unit), unit);
-        sched_set_res(unit, res);
-        spin_unlock_irq(lock);
-
+        /*
+         * If a different cpu was chosen, it was random; let the scheduler
+         * make a proper decision.
+         */
         if ( old_cpu != sched_unit_master(unit) )
+        {
+            /* v->processor might have changed, so reacquire the lock. */
+            lock = unit_schedule_lock_irq(unit);
+            res = sched_pick_resource(unit_scheduler(unit), unit);
+            sched_migrate(unit_scheduler(unit), unit, res->master_cpu);
+            spin_unlock_irq(lock);
+
             sched_move_irqs(unit);
+        }
     }
 
     rcu_read_unlock(&sched_res_rculock);
---8<---
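For context on why sched_migrate() is the better fit here: as far as I can
tell, it goes through the scheduler's migrate hook when one is implemented
(credit2 implements it, and updates the cached svc->rqd there), and only
falls back to a plain resource assignment otherwise. A rough sketch, based
on my reading of xen/common/sched/private.h (simplified, not verbatim):

    /* Rough sketch of the sched_migrate() wrapper; simplified from my
     * reading of xen/common/sched/private.h, not verbatim. */
    static inline void sched_migrate(const struct scheduler *s,
                                     struct sched_unit *unit, unsigned int cpu)
    {
        if ( s->migrate )
            /* credit2 implements this hook and moves the unit between
             * runqueues, updating svc->rqd. */
            s->migrate(s, unit, cpu);
        else
            /* No hook: plain resource assignment, as sched_set_res() did. */
            sched_set_res(unit, get_sched_res(cpu));
    }

So the patch keeps the resource assignment semantics, but lets credit2 fix
up its runqueue bookkeeping.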
I have several doubts here:

1. If old_cpu is available, is sched_set_res() needed at all?

2. Should both calls be changed to sched_migrate()? Currently I changed
   only the second one, in case the scheduler could be confused by old_cpu
   not being available anymore.

3. Are there any extra locking requirements for sched_migrate() at this
   stage? The long comment above sched_unit_migrate_start() suggests there
   might be, but I'm not sure whether that really applies during resume.

4. Related to the above: should thaw_domains() be modified to call
   restore_vcpu_affinity() for all domains first, and unpause them only
   later? That could reduce the locking requirements, I guess. A sketch of
   that reordering follows below.
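To make (4) concrete, here is a minimal sketch of the reordering I have in
mind, assuming the current single-loop shape of thaw_domains() in
xen/arch/x86/acpi/power.c (untested, just to illustrate):

    /*
     * Hypothetical reordering of thaw_domains(); assumes the current
     * implementation does restore_vcpu_affinity() + domain_unpause() in a
     * single loop. Untested sketch: fix up all affinities first, and only
     * then unpause, so no vCPU starts running while other units may still
     * sit on stale runqueues.
     */
    static void thaw_domains(void)
    {
        struct domain *d;

        rcu_read_lock(&domlist_read_lock);

        /* Pass 1: put every unit of every domain on a valid resource. */
        for_each_domain ( d )
            restore_vcpu_affinity(d);

        /* Pass 2: only now let the domains run again. */
        for_each_domain ( d )
            domain_unpause(d);

        rcu_read_unlock(&domlist_read_lock);
    }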
-- 
Best Regards,
Marek Marczykowski-Górecki
Invisible Things Lab