
Re: [Xen-devel] [PATCH v2] Fix scheduler crash after s3 resume

The crash happens due to an access to the scheduler percpu area which isn't
allocated at the moment. The accessed element is the address of the scheduler
lock for this cpu. Disabling the percpu locking scheme of the scheduler while
the non-boot cpus are offline will avoid the crash.
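
A minimal sketch of that idea, for illustration only: take the per-cpu scheduler lock only while the system is fully active, and skip it otherwise. Every name here (toy_system_state, toy_cpu, toy_vcpu_wake and so on) is a hypothetical stand-in, not a real Xen definition, and the "lock" is a plain flag rather than a spinlock.

```c
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-ins; none of these are the real Xen definitions. */
enum toy_sys_state { TOY_STATE_ACTIVE, TOY_STATE_SUSPEND, TOY_STATE_RESUME };

struct toy_lock { int held; };

struct toy_cpu {
    /* NULL models a per-cpu scheduler area that is not allocated. */
    struct toy_lock *sched_lock;
};

static enum toy_sys_state toy_system_state = TOY_STATE_ACTIVE;

/* Take the per-cpu lock only when the system is fully active.
 * Returns the lock that was taken, or NULL if locking was skipped,
 * in which case cpu->sched_lock is never dereferenced. */
static struct toy_lock *toy_lock_if_active(struct toy_cpu *cpu)
{
    if (toy_system_state != TOY_STATE_ACTIVE)
        return NULL;
    cpu->sched_lock->held = 1;
    return cpu->sched_lock;
}

/* Release the lock if one was taken; a NULL lock means nothing to do. */
static void toy_unlock(struct toy_lock *lock)
{
    if (lock != NULL)
        lock->held = 0;
}

/* Returns true if the wake-up ran with the per-cpu lock held. */
static bool toy_vcpu_wake(struct toy_cpu *cpu)
{
    struct toy_lock *lock = toy_lock_if_active(cpu);
    bool locked = (lock != NULL);

    /* ... the actual wake-up work would go here ... */

    toy_unlock(lock);
    return locked;
}
```

With toy_system_state set to TOY_STATE_RESUME, toy_vcpu_wake() can be called on a toy_cpu whose sched_lock is NULL without touching the missing lock. The sketch says nothing about whether running unlocked is safe in the real scheduler.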

Ok, so I tried this approach (by making the locking in vcpu_wake conditional on system_state), and while it stopped the vcpu_wake crash, I traded it for a crash in acpi_cpufreq_target:

(XEN) ----[ Xen-4.3-unstable  x86_64  debug=y  Not tainted ]----
(XEN) CPU:    3
(XEN) RIP:    e008:[<ffff82c4801a0594>] acpi_cpufreq_target+0x165/0x33b
(XEN) RFLAGS: 0000000000010293   CONTEXT: hypervisor
(XEN) rax: 0000000000000000   rbx: ffff830137bc7300   rcx: 0000000000000000
(XEN) rdx: 0000000000000009   rsi: ffff82c480265460   rdi: ffff830137bd7d60
(XEN) rbp: ffff830137bd7db0   rsp: ffff830137bd7d30   r8:  0000000000000004
(XEN) r9:  00000000fffffffe   r10: 0000000000000009   r11: 0000000000000000
(XEN) r12: ffff830137bc7c70   r13: ffff8301025444f8   r14: ffff830137bc7c70
(XEN) r15: 0000000001b5b14c   cr0: 000000008005003b   cr4: 00000000000026f0
(XEN) cr3: 00000000ba674000   cr2: 000000000000004c
(XEN) ds: 0000   es: 0000   fs: 0000   gs: 0000   ss: 0000   cs: e008
(XEN) Xen stack trace from rsp=ffff830137bd7d30:
(XEN)    000000008017d626 0000000000000009 00000009000000fb ffff830100000001
(XEN)    ffff830137bd7d60 0000080000000199 0000000000000000 0000000000000000
(XEN)    0000000000000000 0000000000000000 ffffffff37bd7da0 00000000ffffffea
(XEN)    ffff830137bc7c70 00000000002936c8 0000000006d9e30a 0000000001b5b14c
(XEN)    ffff830137bd7df0 ffff82c4801414ee ffff830137bc7c70 0000000000000003
(XEN)    ffff830137bd7df0 0000000000000008 0000000000000008 ffff830102ae1340
(XEN)    ffff830137bd7e50 ffff82c480140815 ffff8301141624a0 002936c800000286
(XEN)    ffff82c480308dc0 ffff830137bc7c70 0000000000000003 ffff830102ae1380
(XEN)    ffff830137bebb50 ffff830137bebc00 0000000000000010 0000000000000030
(XEN)    ffff830137bd7e70 ffff82c480140a2b ffff830137bd7e70 0000001548c205b8
(XEN)    ffff830137bd7ef0 ffff82c4801a31da 0000000000000002
(XEN) Resetting with 

(call graph sadly got eaten)

which corresponds to the following lines in cpufreq.c

    freqs.old = perf->states[perf->state].core_frequency * 1000;
    freqs.new = data->freq_table[next_state].frequency;
ffff82c4801a058d:       8b 55 94                mov    -0x6c(%rbp),%edx
ffff82c4801a0590:       48 8b 43 08             mov    0x8(%rbx),%rax
ffff82c4801a0594:       8b 44 d0 04             mov    0x4(%rax,%rdx,8),%eax
ffff82c4801a0598:       89 45 8c                mov    %eax,-0x74(%rbp)
ffff82c4801a059b:       48 c7 c0 00 80 ff ff    mov    $0xffffffffffff8000,%rax
ffff82c4801a05a2:       48 21 e0                and    %rsp,%rax

which I guess crashes because either freq_table or data has been freed at this 
point (indeed, it seems the cpufreq driver has some cpu up/down logic which frees 
it). Given this is not even the first place in acpi_cpufreq_target where this is 
accessed, it looks like the cpu got torn down halfway through this function... 
I suspect there are likely to be more sites affected by this.
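
As a cross-check, the faulting address can be recomputed from the register dump. The sketch below evaluates the operand of the faulting instruction, mov 0x4(%rax,%rdx,8),%eax, using the rax and rdx values printed above. The helper names are made up for illustration. Reading %rax as data->freq_table and %rdx as next_state is an inference from the adjacent source line, not something the dump itself proves.

```c
#include <stdint.h>

/* x86-64 effective address for an operand of the form disp(base,index,scale). */
static uint64_t effective_address(uint64_t base, uint64_t index,
                                  uint64_t scale, int64_t disp)
{
    return base + index * scale + (uint64_t)disp;
}

/* Values copied from the register dump, applied to the faulting instruction
 *   mov 0x4(%rax,%rdx,8),%eax
 */
static uint64_t faulting_address_from_dump(void)
{
    const uint64_t rax = 0x0; /* set by the preceding: mov 0x8(%rbx),%rax    */
    const uint64_t rdx = 0x9; /* set by the preceding: mov -0x6c(%rbp),%edx  */

    return effective_address(rax, rdx, 8, 0x4);
}
```

The result is 0x4c, which matches the cr2 value in the dump. Under the assumption above, that would mean the pointer read from 0x8(%rbx) was already NULL when it was dereferenced, with an index of 9.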

I also tried Jan's suggestion of making do_softirq skip its job if we are 
resuming. That causes a hang in rcu_barrier(). Adding another resume 
conditional to rcu_barrier() made it progress further, but it then crashed 
elsewhere (I don't remember exactly where; this approach looked a bit like a 
dead end, so I abandoned it quickly).

So I still don't have a better solution than reverting the 
cpu_disable_schedule() hunk.
