[Date Prev][Date Next][Thread Prev][Thread Next][Date Index][Thread Index] Re: [Xen-devel] Help with: Xen BUG at timer.c:189
Am Freitag, 19. Mai 2017, 06:41:36 schrieb Jan Beulich: > >>> On 19.05.17 at 11:52, <dietmar.hahn@xxxxxxxxxxxxxx> wrote: > > I'am struggling with a hypervisor panic. The hypervisor version is 4.4.3, > > yes I know - very old ;-), but the affected code hasn't much changed. > > Well, at the very least I'd expect you to base your code on > 4.4.4 plus the about 199 backports we've added there over time > (assuming yours is based on ours), some of which are scheduler > ones. Don't you have older versions in the field? > > > The panic was never seen before and occured while booting the system, > > only dom0 was running. > > > > (XEN) Xen BUG at timer.c:189 > > (XEN) ----[ Xen-4.4.3_02-26.2_x3 x86_64 debug=n Tainted: C ]---- > > (XEN) CPU: 14 > > (XEN) RIP: e008:[<ffff82d08012c3e6>] remove_entry+0x26/0x80 > > (XEN) RFLAGS: 0000000000010086 CONTEXT: hypervisor > > (XEN) rax: 00000000000000bc rbx: ffff830839bcd5c8 rcx: ffff82d080343a80 > > (XEN) rdx: 000000000000d5f0 rsi: 0000002887fa0000 rdi: ffff830839bcd5c8 > > (XEN) rbp: 000000000000d5f0 rsp: ffff830839bbfe20 r8: 0000002887211398 > > (XEN) r9: 000000288721132f r10: 00000000bf8b2994 r11: ffff82d08011cfa0 > > (XEN) r12: ffff830839bcd5c8 r13: ffff82d080343a80 r14: 000000000000d5f0 > > (XEN) r15: ffff82d080343a80 cr0: 000000008005003b cr4: 00000000001526f0 > > (XEN) cr3: 0000000075679000 cr2: 000000000069b8ec > > (XEN) ds: 002b es: 002b fs: 0000 gs: 0000 ss: e010 cs: e008 > > (XEN) Xen stack trace from rsp=ffff830839bbfe20: > > (XEN) ffff830839bcb020 ffff82d08012d3a8 0000002887fa0000 0000000000000286 > > (XEN) ffff830839bcd5f0 ffff830839bcd9c0 0000002887c1915d 000000000000000e > > (XEN) ffff82d080343a70 0000000000000002 ffff830839bcda30 ffff82d0801affd0 > > (XEN) 000000010000000e 00000028872113f4 ffffffffffffffff ffff82d08030fc00 > > (XEN) ffff82d0802ffc00 ffff82d080189a1c 0000000000000000 0000000000000000 > > (XEN) ffffffffffffffff 0000107800076e92 ffff82d0801292aa ffff830839bb8000 > > (XEN) ffff8300754fd000 ffff82d08031abe0 0000000000000000 ffff830839bcb048 > > (XEN) 0000000000000000 ffff82d080161ad6 0000000000000001 0000000000000000 > > (XEN) 0000000000000000 0000000000000000 0000000000000000 0000000000000009 > > (XEN) ffff8803dda20010 0000000000000246 0000000000000000 00000000fffffffe > > (XEN) 0000000000000000 0000000000000000 ffffffff800033aa 0000000000000000 > > (XEN) 0000000000000000 0000000000000001 0000010000000000 ffffffff800033aa > > (XEN) 000000000000e033 0000000000000246 ffff8803dda21f08 000000000000e02b > > (XEN) 0000000000000000 0000000000000000 0000000000000000 0000000000000000 > > (XEN) 000000000000000e ffff8300754fd000 00000037b9887700 0000000000000000 > > (XEN) Xen call trace: > > (XEN) [<ffff82d08012c3e6>] remove_entry+0x26/0x80 > > (XEN) [<ffff82d08012d3a8>] set_timer+0x178/0x230 > > (XEN) [<ffff82d0801affd0>] mwait_idle+0x270/0x370 > > (XEN) [<ffff82d080189a1c>] local_time_calibration+0x19c/0x1a0 > > (XEN) [<ffff82d0801292aa>] __do_softirq+0x6a/0xa0 > > (XEN) [<ffff82d080161ad6>] idle_loop+0x16/0x50 > > (XEN) > > (XEN) > > (XEN) **************************************** > > (XEN) Panic on CPU 14: > > (XEN) Xen BUG at timer.c:189 > > (XEN) **************************************** > > (XEN) > > (XEN) Reboot in five seconds... > > (XEN) Executing kexec image on cpu14 > > (XEN) Shot down all CPUs > > > > This is the call of BUG() in remove_entry() in case timer->status is not > > TIMER_STATUS_in_heap or TIMER_STATUS_in_list. > > > > With the address of the timer from the vmcore: > > crash> struct timer ffff830839bcd5c8 > > struct timer { > > expires = 0, > > { > > heap_offset = 968676784, > > list_next = 0xffff830839bcd5b0, > > inactive = { > > next = 0xffff830839bcd5b0, > > prev = 0x200200200000050 > > } > > }, > > function = 0xffff830839bcd5e0, > > data = 0xffff830839bcd5e0, > > cpu = 54768, > > status = 188 '\274' > > } > > > > This looks like the structure was overwritten. > > > > The call path seems to be: > > mwait_idle() > > sched_tick_resume() > > csched_tick_resume() > > set_timer() > > remove_timer() > > > > In sched_tick_resume(): > > unsigned int cpu = smp_processor_id(); > > > > I extracted struct cpu_info from the stack and found: > > crash> cpu_info ffff830839bbff18 > > struct cpu_info { > > ... > > processor_id = 14, > > current_vcpu = 0xffff8300754fd000, > > per_cpu_offset = 0x37b9887700, > > ... > > > > And from the code in sched_credit.c: spc = CSCHED_PCPU(cpu); > > I calculated spc: > > > > crash> csched_pcpu 0xffff830839be2b40 > > struct csched_pcpu { > > runq = { > > next = 0xffff830839be2b40, > > prev = 0xffff830839be2b40 > > }, > > runq_sort_last = 0, > > ticker = { > > expires = 174080000000, > > { > > heap_offset = 1, > > list_next = 0x100100100000001, > > inactive = { > > next = 0x100100100000001, > > prev = 0x200200200200200 > > } > > }, > > function = 0xffff82d08011f200 <csched_tick>, > > data = 0xe, > > cpu = 14, > > status = 3 '\003' > > }, > > tick = 0, > > idle_bias = 47, > > balance_mask = 0xffff830839be2bb0 > > } > > > > And this ticker looks ok! > > So the ticker address extracted from the vmcore in remove_entry() > > is not the same as the expected one read from the code. > > Right, and looking through raw stack contents I can't seem to > find any slot holding 0xffff830839be????. If this was a hypervisor > binary we delivered, I would even have gone and checked which > slots correspond to saved copies of pointers derived from what > was supposedly passed into set_timer() (rooted at > CSCHED_PCPU(14)), but this looks to be a crash with a binary > you've produced yourself. Quite likely corruption starts elsewhere. I disassembled the affected code and found that the stack is not used that much. The only usable value is ffff830839bcd5f0. And this is &timer->cpu used (and moved to the stack) in timer_lock() in the macro timer_lock_irqsave(). Calculating back to the struct timer this is our wrong address from the panic. This would mean set_timer() was already called with the wrong address. On the other side I had a look at all 48 CSCHED_PCPU(cpu) areas and all seems to be fine. At least ticker->cpu correspondends always to the cpu used in __per_cpu_offset[cpu]. So we have a really short path where the error can happen I think: csched_tick_resume(const struct scheduler *ops, unsigned int cpu) ... spc = CSCHED_PCPU(cpu); ... set_timer(&spc->ticker, ...); As already said above cpu is from smp_processor_id() in csched_tick_resume() and looks fine. Very strange! > > Jan > Thanks. Dietmar. -- Company details: http://ts.fujitsu.com/imprint.html _______________________________________________ Xen-devel mailing list Xen-devel@xxxxxxxxxxxxx https://lists.xen.org/xen-devel
|
Lists.xenproject.org is hosted with RackSpace, monitoring our |