Re: [Xen-devel] [PATCH] Fix scheduler crash after s3 resume



>>> On 23.01.13 at 16:51, Tomasz Wroblewski <tomasz.wroblewski@xxxxxxxxxx> 
>>> wrote:
> Hi all,
> 
> This was also discussed earlier, for example here 
> http://xen.markmail.org/thread/iqvkylp3mclmsnbw 
> 
> Changeset 25079:d5ccb2d1dbd1 (Introduce system_state variable) added a 
> global variable which, among other things, is used on suspend to avoid 
> disabling the cpu scheduler, breaking vcpu affinities, and removing the 
> cpu from its cpupool. However, it missed one place where the cpu is 
> removed from the cpupool's valid cpus mask: smpboot.c, __cpu_disable(), 
> line 840:
> 
>      cpumask_clear_cpu(cpu, cpupool0->cpu_valid);
> 
> This causes the vcpu in the default pool to be considered inactive, and 
> the following assertion is violated in sched_credit.c soon after resume 
> transitions out of xen, causing a platform reboot:
> 
> (XEN) Finishing wakeup from ACPI S3 state.
> (XEN) Enabling non-boot CPUs  ...
> (XEN) Assertion '!cpumask_empty(&cpus) && cpumask_test_cpu(cpu, &cpus)' failed at sched_credit.c:507
> (XEN) ----[ Xen-4.3-unstable  x86_64  debug=y  Tainted:    C ]----
> (XEN) CPU:    1
> (XEN) RIP:    e008:[<ffff82c480119e9e>] _csched_cpu_pick+0x155/0x5fd
> (XEN) RFLAGS: 0000000000010202   CONTEXT: hypervisor
> (XEN) rax: 0000000000000001   rbx: 0000000000000008   rcx: 0000000000000008
> (XEN) rdx: 00000000000000ff   rsi: 0000000000000008   rdi: 0000000000000000
> (XEN) rbp: ffff83011415fdd8   rsp: ffff83011415fcf8   r8:  0000000000000000
> (XEN) r9:  000000000000003e   r10: 00000008f3de731f   r11: ffffea0000063800
> (XEN) r12: ffff82c480261720   r13: ffff830137b4d950   r14: ffff830137beb010
> (XEN) r15: ffff82c480261720   cr0: 0000000080050033   cr4: 00000000000026f0
> (XEN) cr3: 000000013c17d000   cr2: ffff8800ac6ef8f0
> (XEN) ds: 0000   es: 0000   fs: 0000   gs: 0000   ss: 0000   cs: e008
> (XEN) Xen stack trace from rsp=ffff83011415fcf8:
> (XEN)    00000000000af257 0000000800000001 ffff8300ba4fd000 0000000000000000
> (XEN)    0000000000000000 0000000000000000 0000000000000002 ffff8800ac6ef8f0
> (XEN)    0000000800000000 00000001318e0025 0000000000000087 ffff83011415fd68
> (XEN)    ffff82c480124f79 ffff83011415fd98 ffff83011415fda8 00007fda88d1e790
> (XEN)    ffff8800ac6ef8f0 00000001318e0025 0000000000000000 0000000000000000
> (XEN)    0000000000000000 0000000000000000 0000000000000146 ffff830137b4d940
> (XEN)    0000000000000001 ffff830137b4d950 ffff830137beb010 ffff82c480261720
> (XEN)    ffff83011415fe48 ffff82c48011a51b 0002000e00000007 ffffffff81009071
> (XEN)    000000000000e033 ffff83013a805360 ffff880002bb3c28 000000000000e02b
> (XEN)    e4d87248e7ca5f52 ffff830102ae2200 0000000000000001 ffff82c48011a356
> (XEN)    00000008efa1f543 00007fda88d1e790 ffff83011415fe78 ffff82c48012748f
> (XEN)    0000000000000002 ffff830137beb028 ffff830102ae2200 ffff830137beb8d0
> (XEN)    ffff83011415fec8 ffff82c48012758b ffff830114150000 ffff8800ac6ef8f0
> (XEN)    80100000ae86d065 ffff82c4802e0080 ffff82c4802e0000 ffff830114158000
> (XEN)    ffffffffffffffff 00007fda88d1e790 ffff83011415fef8 ffff82c480124b4e
> (XEN)    ffff8300ba4fd000 ffffea0000063800 00000001318e0025 ffff8800ac6ef8f0
> (XEN)    ffff83011415ff08 ffff82c480124bb4 00007cfeebea00c7 ffff82c480226a71
> (XEN)    00007fda88d1e790 ffff8800ac6ef8f0 00000001318e0025 ffffea0000063800
> (XEN)    ffff880002bb3c78 00000001318e0025 ffffea0000063800 0000000000000146
> (XEN)    00003ffffffff000 ffffea0002b1bbf0 0000000000000000 00000001318e0025
> (XEN) Xen call trace:
> (XEN)    [<ffff82c480119e9e>] _csched_cpu_pick+0x155/0x5fd
> (XEN)    [<ffff82c48011a51b>] csched_tick+0x1c5/0x342
> (XEN)    [<ffff82c48012748f>] execute_timer+0x4e/0x6c
> (XEN)    [<ffff82c48012758b>] timer_softirq_action+0xde/0x206
> (XEN)    [<ffff82c480124b4e>] __do_softirq+0x8e/0x99
> (XEN)    [<ffff82c480124bb4>] do_softirq+0x13/0x15
> (XEN)
> (XEN)
> (XEN) ****************************************
> (XEN) Panic on CPU 1:
> (XEN) Assertion '!cpumask_empty(&cpus) && cpumask_test_cpu(cpu, &cpus)' failed at sched_credit.c:507
> (XEN) ****************************************
> (XEN)
> (XEN) Reboot in five seconds...
> 
> The reason for the above is that the "cpus" cpumask is empty: it is the 
> logical "and" of the cpupool's valid cpus (from which the cpu was 
> removed) and the cpu affinity mask.
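
To make the mask arithmetic above concrete, here is a minimal standalone sketch. It is not Xen code: toy_mask_t, toy_mask_of(), toy_usable_cpus() and toy_pick_precondition() are hypothetical names, a 64-bit integer stands in for cpumask_t, and the choice of affinity is an assumption made for illustration.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Toy cpumask: bit n set means CPU n is in the mask (not Xen's cpumask_t). */
typedef uint64_t toy_mask_t;

/* Single-CPU mask for the given CPU number. */
toy_mask_t toy_mask_of(unsigned int cpu)
{
    assert(cpu < 64);
    return (toy_mask_t)1 << cpu;
}

/* CPUs a vcpu may run on: the pool's valid CPUs restricted by its affinity. */
toy_mask_t toy_usable_cpus(toy_mask_t pool_valid, toy_mask_t affinity)
{
    return pool_valid & affinity;
}

/* Same shape as the condition quoted in the failed assertion:
 * the mask must be non-empty and must contain the current CPU. */
bool toy_pick_precondition(unsigned int cpu, toy_mask_t cpus)
{
    return cpus != 0 && (cpus & toy_mask_of(cpu)) != 0;
}
```

For example, assuming a vcpu whose affinity is CPU 1 only: with CPUs 0 and 1 valid in the pool, toy_usable_cpus() returns a mask containing CPU 1 and the precondition holds. Once CPU 1 is cleared from the pool's valid mask, the intersection is empty and the precondition fails, which matches the assertion in the log.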

So can you confirm that this is not a problem on a non-debug
hypervisor? I'm particularly asking because, leaving the ASSERT()
aside, such a fundamental flaw would have made it impossible
for S3 to work for anyone, and that's reportedly not the case.

> Attached patch follows the spirit of the changeset 25079:d5ccb2d1dbd1 
> (which blocked removal of the cpu from the cpupool in cpupool.c) by also 
> blocking its removal from the cpupool's valid cpumask. So cpu 
> affinities are still preserved across suspend/resume, and the scheduler 
> does not need to be disabled, as per the original intent (I think). 
> Comments welcome.

Looks reasonable (and consistent with the earlier change), but
I'd still like to wait for at least Keir's and Juergen's opinion.

Jan

> Signed-off-by: Tomasz Wroblewski <tomasz.wroblewski@xxxxxxxxxx>
> 
> Commit message:
> Fix s3 resume regression (crash in scheduler) after c-s 
> 25079:d5ccb2d1dbd1 by also blocking removal of the cpu from the 
> cpupool's cpu_valid mask - in the spirit of mentioned c-s.
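
The attached patch is not reproduced in this message, so the following is only a rough standalone model of the idea in the commit message, not the patch itself. The enum toy_system_state, its values, and toy_cpu_disable() are hypothetical names, and a 64-bit integer stands in for the cpupool's cpu_valid mask.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical stand-in for a global system state; not Xen's definitions. */
enum toy_system_state { TOY_STATE_ACTIVE, TOY_STATE_SUSPEND, TOY_STATE_RESUME };

/* Toy cpumask: bit n set means CPU n is valid in the pool. */
typedef uint64_t toy_mask_t;

/* Model of taking a CPU offline. Returns the pool's new valid mask.
 * During suspend the bit is left alone, on the assumption that the CPU
 * comes back on resume and the pool membership should survive. */
toy_mask_t toy_cpu_disable(toy_mask_t pool_valid, unsigned int cpu,
                           enum toy_system_state state)
{
    assert(cpu < 64);
    if (state == TOY_STATE_SUSPEND)
        return pool_valid;
    return pool_valid & ~((toy_mask_t)1 << cpu);
}
```

With a pool whose valid mask is 0x3 (CPUs 0 and 1), disabling CPU 1 in TOY_STATE_SUSPEND leaves the mask at 0x3, while the same call in TOY_STATE_ACTIVE yields 0x1, as for an ordinary hot-unplug.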


