
[Xen-devel] [PATCH] Fix scheduler crash after s3 resume

Hi all,

This was also discussed earlier, for example here: http://xen.markmail.org/thread/iqvkylp3mclmsnbw

Changeset 25079:d5ccb2d1dbd1 (Introduce system_state variable) added a global variable which, among other things, is used on suspend to prevent disabling the cpu scheduler, breaking vcpu affinities, and removing the cpu from its cpupool. However, it missed one place where the cpu is removed from the cpupool's valid cpus mask, in smpboot.c, __cpu_disable(), line 840:

    cpumask_clear_cpu(cpu, cpupool0->cpu_valid);

This causes the vcpu in the default pool to be considered inactive, and the following assertion is violated in sched_credit.c soon after resume transitions out of xen, causing a platform reboot:

(XEN) Finishing wakeup from ACPI S3 state.
(XEN) Enabling non-boot CPUs  ...
(XEN) Assertion '!cpumask_empty(&cpus) && cpumask_test_cpu(cpu, &cpus)' failed at sched_credit.c:507
(XEN) ----[ Xen-4.3-unstable  x86_64  debug=y  Tainted:    C ]----
(XEN) CPU:    1
(XEN) RIP:    e008:[<ffff82c480119e9e>] _csched_cpu_pick+0x155/0x5fd
(XEN) RFLAGS: 0000000000010202   CONTEXT: hypervisor
(XEN) rax: 0000000000000001   rbx: 0000000000000008   rcx: 0000000000000008
(XEN) rdx: 00000000000000ff   rsi: 0000000000000008   rdi: 0000000000000000
(XEN) rbp: ffff83011415fdd8   rsp: ffff83011415fcf8   r8:  0000000000000000
(XEN) r9:  000000000000003e   r10: 00000008f3de731f   r11: ffffea0000063800
(XEN) r12: ffff82c480261720   r13: ffff830137b4d950   r14: ffff830137beb010
(XEN) r15: ffff82c480261720   cr0: 0000000080050033   cr4: 00000000000026f0
(XEN) cr3: 000000013c17d000   cr2: ffff8800ac6ef8f0
(XEN) ds: 0000   es: 0000   fs: 0000   gs: 0000   ss: 0000   cs: e008
(XEN) Xen stack trace from rsp=ffff83011415fcf8:
(XEN)    00000000000af257 0000000800000001 ffff8300ba4fd000 0000000000000000
(XEN)    0000000000000000 0000000000000000 0000000000000002 ffff8800ac6ef8f0
(XEN)    0000000800000000 00000001318e0025 0000000000000087 ffff83011415fd68
(XEN)    ffff82c480124f79 ffff83011415fd98 ffff83011415fda8 00007fda88d1e790
(XEN)    ffff8800ac6ef8f0 00000001318e0025 0000000000000000 0000000000000000
(XEN)    0000000000000000 0000000000000000 0000000000000146 ffff830137b4d940
(XEN)    0000000000000001 ffff830137b4d950 ffff830137beb010 ffff82c480261720
(XEN)    ffff83011415fe48 ffff82c48011a51b 0002000e00000007 ffffffff81009071
(XEN)    000000000000e033 ffff83013a805360 ffff880002bb3c28 000000000000e02b
(XEN)    e4d87248e7ca5f52 ffff830102ae2200 0000000000000001 ffff82c48011a356
(XEN)    00000008efa1f543 00007fda88d1e790 ffff83011415fe78 ffff82c48012748f
(XEN)    0000000000000002 ffff830137beb028 ffff830102ae2200 ffff830137beb8d0
(XEN)    ffff83011415fec8 ffff82c48012758b ffff830114150000 ffff8800ac6ef8f0
(XEN)    80100000ae86d065 ffff82c4802e0080 ffff82c4802e0000 ffff830114158000
(XEN)    ffffffffffffffff 00007fda88d1e790 ffff83011415fef8 ffff82c480124b4e
(XEN)    ffff8300ba4fd000 ffffea0000063800 00000001318e0025 ffff8800ac6ef8f0
(XEN)    ffff83011415ff08 ffff82c480124bb4 00007cfeebea00c7 ffff82c480226a71
(XEN)    00007fda88d1e790 ffff8800ac6ef8f0 00000001318e0025 ffffea0000063800
(XEN)    ffff880002bb3c78 00000001318e0025 ffffea0000063800 0000000000000146
(XEN)    00003ffffffff000 ffffea0002b1bbf0 0000000000000000 00000001318e0025
(XEN) Xen call trace:
(XEN)    [<ffff82c480119e9e>] _csched_cpu_pick+0x155/0x5fd
(XEN)    [<ffff82c48011a51b>] csched_tick+0x1c5/0x342
(XEN)    [<ffff82c48012748f>] execute_timer+0x4e/0x6c
(XEN)    [<ffff82c48012758b>] timer_softirq_action+0xde/0x206
(XEN)    [<ffff82c480124b4e>] __do_softirq+0x8e/0x99
(XEN)    [<ffff82c480124bb4>] do_softirq+0x13/0x15
(XEN) ****************************************
(XEN) Panic on CPU 1:
(XEN) Assertion '!cpumask_empty(&cpus) && cpumask_test_cpu(cpu, &cpus)' failed at sched_credit.c:507
(XEN) ****************************************
(XEN) Reboot in five seconds...

The reason for the above is that the "cpus" cpumask is empty, as it is a logical "and" between the cpupool's valid cpus (from which the cpu was removed) and the cpu affinity mask.
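To make the empty-mask condition concrete, here is a minimal standalone sketch. It is not Xen code: toy_cpumask_t, toy_mask_of() and toy_pick_precondition() are hypothetical names, and a 64-bit integer stands in for Xen's cpumask_t. It only models the condition tested by the assertion in the log.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Toy stand-in for a cpumask: one bit per CPU (an assumption for illustration). */
typedef uint64_t toy_cpumask_t;

/* Single-bit mask for one CPU; the toy model only supports CPUs 0..63. */
static toy_cpumask_t toy_mask_of(unsigned int cpu)
{
    assert(cpu < 64);
    return (toy_cpumask_t)1 << cpu;
}

/*
 * Models the condition checked by the failing assertion: the candidate set
 * is the "and" of the pool's valid cpus and the vcpu's affinity, and it
 * must be non-empty and contain `cpu`.
 */
static bool toy_pick_precondition(toy_cpumask_t pool_valid,
                                  toy_cpumask_t affinity,
                                  unsigned int cpu)
{
    toy_cpumask_t cpus = pool_valid & affinity;

    return cpus != 0 && (cpus & toy_mask_of(cpu)) != 0;
}
```

With a pool of CPUs 0 and 1 and a vcpu whose affinity is CPU 1 only, the condition holds. Once CPU 1 is cleared from the pool's valid mask, the intersection is empty and the condition fails, which is the situation the log shows.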

The attached patch follows the spirit of changeset 25079:d5ccb2d1dbd1 (which blocked removal of the cpu from the cpupool in cpupool.c) by also blocking its removal from the cpupool's valid cpumask. So cpu affinities are still preserved across suspend/resume, and the scheduler does not need to be disabled, as per the original intent (I think). Would welcome comments.
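The attachment is not reproduced in this message, so the following is not the patch itself, only a rough standalone model of the idea described above: skip the removal from the valid mask while the system is suspending. All names here (toy_system_state, toy_cpupool, toy_cpu_disable) are hypothetical, and a 64-bit integer again stands in for the real cpumask.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical model of the global system state introduced by c-s 25079. */
enum toy_system_state {
    TOY_STATE_ACTIVE,
    TOY_STATE_SUSPEND,
    TOY_STATE_RESUME
};

/* Hypothetical model of a cpupool: only the valid-cpus mask is kept. */
struct toy_cpupool {
    uint64_t cpu_valid;
};

/*
 * Models the cpu-disable path: the cpu is dropped from the pool's valid
 * mask only when the system is not suspending, so the mask survives an
 * S3 suspend/resume cycle unchanged.
 */
static void toy_cpu_disable(struct toy_cpupool *pool, unsigned int cpu,
                            enum toy_system_state state)
{
    assert(cpu < 64);

    if (state != TOY_STATE_SUSPEND)
        pool->cpu_valid &= ~((uint64_t)1 << cpu);
}
```

In this model, disabling CPU 1 during suspend leaves a mask of 0x3 untouched, while the same call in the active state reduces it to 0x1, which matches the behaviour for a regular cpu offline.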

Signed-off-by: Tomasz Wroblewski <tomasz.wroblewski@xxxxxxxxxx>

Commit message:
Fix s3 resume regression (crash in scheduler) after c-s 25079:d5ccb2d1dbd1 by also blocking removal of the cpu from the cpupool's cpu_valid mask, in the spirit of the mentioned c-s.

Attachment: fix-suspend-cpu-valid-mask
Description: Text document
