Xen project Mailing List

Re: [Xen-devel] [PATCH] Fix scheduler crash after s3 resume

To: "Tomasz Wroblewski" <tomasz.wroblewski@xxxxxxxxxx>, "xen-devel@xxxxxxxxxxxxx" <xen-devel@xxxxxxxxxxxxx>

From: "Jan Beulich" <JBeulich@xxxxxxxx>

Date: Wed, 23 Jan 2013 16:11:15 +0000

Cc: george.dunlap@xxxxxxxxxxxxx, Juergen Gross <juergen.gross@xxxxxxxxxxxxxx>, keir@xxxxxxx

Delivery-date: Wed, 23 Jan 2013 16:11:36 +0000

List-id: Xen developer discussion <xen-devel.lists.xen.org>

>>> On 23.01.13 at 16:51, Tomasz Wroblewski <tomasz.wroblewski@xxxxxxxxxx> wrote:
> Hi all,
>
> This was also discussed earlier, for example here
> http://xen.markmail.org/thread/iqvkylp3mclmsnbw
>
> Changeset 25079:d5ccb2d1dbd1 (Introduce system_state variable) added a
> global variable, which, among other things, is used to prevent disabling
> cpu scheduler, prevent breaking vcpu affinities, prevent removing the
> cpu from cpupool on suspend. However, it missed one place where cpu is
> removed from the cpupool valid cpus mask, in smpboot.c, __cpu_disable(),
> line 840:
>
> cpumask_clear_cpu(cpu, cpupool0->cpu_valid);
>
> This causes the vcpu in the default pool to be considered inactive, and
> the following assertion is violated in sched_credit.c soon after resume
> transitions out of xen, causing a platform reboot:
>
> (XEN) Finishing wakeup from ACPI S3 state.
> (XEN) Enabling non-boot CPUs ...
> (XEN) Assertion '!cpumask_empty(&cpus) && cpumask_test_cpu(cpu, &cpus)'
> failed at sched_credit.c:507
> (XEN) ----[ Xen-4.3-unstable x86_64 debug=y Tainted: C ]----
> (XEN) CPU: 1
> (XEN) RIP: e008:[<ffff82c480119e9e>] _csched_cpu_pick+0x155/0x5fd
> (XEN) RFLAGS: 0000000000010202 CONTEXT: hypervisor
> (XEN) rax: 0000000000000001 rbx: 0000000000000008 rcx: 0000000000000008
> (XEN) rdx: 00000000000000ff rsi: 0000000000000008 rdi: 0000000000000000
> (XEN) rbp: ffff83011415fdd8 rsp: ffff83011415fcf8 r8: 0000000000000000
> (XEN) r9: 000000000000003e r10: 00000008f3de731f r11: ffffea0000063800
> (XEN) r12: ffff82c480261720 r13: ffff830137b4d950 r14: ffff830137beb010
> (XEN) r15: ffff82c480261720 cr0: 0000000080050033 cr4: 00000000000026f0
> (XEN) cr3: 000000013c17d000 cr2: ffff8800ac6ef8f0
> (XEN) ds: 0000 es: 0000 fs: 0000 gs: 0000 ss: 0000 cs: e008
> (XEN) Xen stack trace from rsp=ffff83011415fcf8:
> (XEN) 00000000000af257 0000000800000001 ffff8300ba4fd000 0000000000000000
> (XEN) 0000000000000000 0000000000000000 0000000000000002 ffff8800ac6ef8f0
> (XEN) 0000000800000000 00000001318e0025 0000000000000087 ffff83011415fd68
> (XEN) ffff82c480124f79 ffff83011415fd98 ffff83011415fda8 00007fda88d1e790
> (XEN) ffff8800ac6ef8f0 00000001318e0025 0000000000000000 0000000000000000
> (XEN) 0000000000000000 0000000000000000 0000000000000146 ffff830137b4d940
> (XEN) 0000000000000001 ffff830137b4d950 ffff830137beb010 ffff82c480261720
> (XEN) ffff83011415fe48 ffff82c48011a51b 0002000e00000007 ffffffff81009071
> (XEN) 000000000000e033 ffff83013a805360 ffff880002bb3c28 000000000000e02b
> (XEN) e4d87248e7ca5f52 ffff830102ae2200 0000000000000001 ffff82c48011a356
> (XEN) 00000008efa1f543 00007fda88d1e790 ffff83011415fe78 ffff82c48012748f
> (XEN) 0000000000000002 ffff830137beb028 ffff830102ae2200 ffff830137beb8d0
> (XEN) ffff83011415fec8 ffff82c48012758b ffff830114150000 ffff8800ac6ef8f0
> (XEN) 80100000ae86d065 ffff82c4802e0080 ffff82c4802e0000 ffff830114158000
> (XEN) ffffffffffffffff 00007fda88d1e790 ffff83011415fef8 ffff82c480124b4e
> (XEN) ffff8300ba4fd000 ffffea0000063800 00000001318e0025 ffff8800ac6ef8f0
> (XEN) ffff83011415ff08 ffff82c480124bb4 00007cfeebea00c7 ffff82c480226a71
> (XEN) 00007fda88d1e790 ffff8800ac6ef8f0 00000001318e0025 ffffea0000063800
> (XEN) ffff880002bb3c78 00000001318e0025 ffffea0000063800 0000000000000146
> (XEN) 00003ffffffff000 ffffea0002b1bbf0 0000000000000000 00000001318e0025
> (XEN) Xen call trace:
> (XEN) [<ffff82c480119e9e>] _csched_cpu_pick+0x155/0x5fd
> (XEN) [<ffff82c48011a51b>] csched_tick+0x1c5/0x342
> (XEN) [<ffff82c48012748f>] execute_timer+0x4e/0x6c
> (XEN) [<ffff82c48012758b>] timer_softirq_action+0xde/0x206
> (XEN) [<ffff82c480124b4e>] __do_softirq+0x8e/0x99
> (XEN) [<ffff82c480124bb4>] do_softirq+0x13/0x15
> (XEN)
> (XEN)
> (XEN) ****************************************
> (XEN) Panic on CPU 1:
> (XEN) Assertion '!cpumask_empty(&cpus) && cpumask_test_cpu(cpu, &cpus)'
> failed at sched_credit.c:507
> (XEN) ****************************************
> (XEN)
> (XEN) Reboot in five seconds...
>
> ^ reason for above being that "cpus" cpumask is empty as it is a logical
> "and" between cpupool's valid cpus (from which the cpu was removed) and
> cpu affinity mask.

So can you confirm that this is not a problem on a non-debug
hypervisor? I'm particularly asking because, leaving the ASSERT()
aside, such a fundamental flaw would have made it impossible for
S3 to work for anyone, and that's reportedly not the case.

> Attached patch follows the spirit of the changeset 25079:d5ccb2d1dbd1
> (which blocked removal of the cpu from the cpupool in cpupool.c) by also
> blocking its removal from the cpupool's valid cpumask. So cpu
> affinities are still preserved across suspend/resume, and scheduler
> does not need to be disabled, as per original intent (I think). Would
> welcome comments.

Looks reasonable (and consistent with the earlier change), but
I'd still like to wait for at least Keir's and Juergen's opinion.

Jan

> Signed-off-by: Tomasz Wroblewski <tomasz.wroblewski@xxxxxxxxxx>
>
> Commit message:
> Fix s3 resume regression (crash in scheduler) after c-s
> 25079:d5ccb2d1dbd1 by also blocking removal of the cpu from the
> cpupool's cpu_valid mask - in the spirit of mentioned c-s.

_______________________________________________
Xen-devel mailing list
Xen-devel@xxxxxxxxxxxxx
http://lists.xen.org/xen-devel
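
For readers following the thread, the failure described above can be sketched outside Xen as plain bit arithmetic. This is only an illustration, not the attached patch (which is not reproduced in this message) and not Xen's cpumask implementation. Every toy_* name is hypothetical, a 64-bit integer stands in for cpumask_t, and the "suspending" flag is an assumed stand-in for a check against the system_state variable introduced by c-s 25079.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical stand-in for cpumask_t: one bit per CPU, up to 64 CPUs. */
typedef uint64_t toy_cpumask_t;

static toy_cpumask_t toy_mask_of(unsigned int cpu)
{
    return (toy_cpumask_t)1 << cpu;
}

/* Candidate CPUs: logical "and" of the pool's valid CPUs and the affinity. */
static toy_cpumask_t toy_candidates(toy_cpumask_t pool_valid,
                                    toy_cpumask_t affinity)
{
    return pool_valid & affinity;
}

/* Same shape as the failing condition:
 * !cpumask_empty(&cpus) && cpumask_test_cpu(cpu, &cpus) */
static bool toy_pick_precondition(toy_cpumask_t cpus, unsigned int cpu)
{
    return cpus != 0 && (cpus & toy_mask_of(cpu)) != 0;
}

/* CPU teardown. The "suspending" flag is an assumption standing in for a
 * system_state check: when set, the CPU's bit in the valid mask is kept. */
static toy_cpumask_t toy_cpu_disable(toy_cpumask_t pool_valid,
                                     unsigned int cpu, bool suspending)
{
    if (suspending)
        return pool_valid;
    return pool_valid & ~toy_mask_of(cpu);
}

/* Toy picker: aborts, like the debug-build ASSERT(), when no candidate
 * CPU is left for the vcpu. */
static unsigned int toy_cpu_pick(toy_cpumask_t pool_valid,
                                 toy_cpumask_t affinity, unsigned int cpu)
{
    toy_cpumask_t cpus = toy_candidates(pool_valid, affinity);

    assert(toy_pick_precondition(cpus, cpu));
    return cpu;
}
```

With a vcpu pinned to CPU 1 (affinity 0x2) in a pool whose valid mask is 0x3, clearing CPU 1's bit unconditionally leaves an empty candidate mask, so the precondition fails. Keeping the bit across suspend leaves CPU 1 as a candidate, and toy_cpu_pick() returns normally.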
