[Date Prev][Date Next][Thread Prev][Thread Next][Date Index][Thread Index]

Re: [Xen-devel] Help with: Xen BUG at timer.c:189



>>> On 19.05.17 at 11:52, <dietmar.hahn@xxxxxxxxxxxxxx> wrote:
> I'am struggling with a hypervisor panic. The hypervisor version is 4.4.3,
> yes I know - very old ;-), but the affected code hasn't much changed.

Well, at the very least I'd expect you to base your code on
4.4.4 plus the about 199 backports we've added there over time
(assuming yours is based on ours), some of which are scheduler
ones.

> The panic was never seen before and occured while booting the system,
> only dom0 was running.
> 
> (XEN) Xen BUG at timer.c:189
> (XEN) ----[ Xen-4.4.3_02-26.2_x3  x86_64  debug=n  Tainted:    C ]----
> (XEN) CPU:    14
> (XEN) RIP:    e008:[<ffff82d08012c3e6>] remove_entry+0x26/0x80
> (XEN) RFLAGS: 0000000000010086   CONTEXT: hypervisor
> (XEN) rax: 00000000000000bc   rbx: ffff830839bcd5c8   rcx: ffff82d080343a80
> (XEN) rdx: 000000000000d5f0   rsi: 0000002887fa0000   rdi: ffff830839bcd5c8
> (XEN) rbp: 000000000000d5f0   rsp: ffff830839bbfe20   r8:  0000002887211398
> (XEN) r9:  000000288721132f   r10: 00000000bf8b2994   r11: ffff82d08011cfa0
> (XEN) r12: ffff830839bcd5c8   r13: ffff82d080343a80   r14: 000000000000d5f0
> (XEN) r15: ffff82d080343a80   cr0: 000000008005003b   cr4: 00000000001526f0
> (XEN) cr3: 0000000075679000   cr2: 000000000069b8ec
> (XEN) ds: 002b   es: 002b   fs: 0000   gs: 0000   ss: e010   cs: e008
> (XEN) Xen stack trace from rsp=ffff830839bbfe20:
> (XEN)    ffff830839bcb020 ffff82d08012d3a8 0000002887fa0000 0000000000000286
> (XEN)    ffff830839bcd5f0 ffff830839bcd9c0 0000002887c1915d 000000000000000e
> (XEN)    ffff82d080343a70 0000000000000002 ffff830839bcda30 ffff82d0801affd0
> (XEN)    000000010000000e 00000028872113f4 ffffffffffffffff ffff82d08030fc00
> (XEN)    ffff82d0802ffc00 ffff82d080189a1c 0000000000000000 0000000000000000
> (XEN)    ffffffffffffffff 0000107800076e92 ffff82d0801292aa ffff830839bb8000
> (XEN)    ffff8300754fd000 ffff82d08031abe0 0000000000000000 ffff830839bcb048
> (XEN)    0000000000000000 ffff82d080161ad6 0000000000000001 0000000000000000
> (XEN)    0000000000000000 0000000000000000 0000000000000000 0000000000000009
> (XEN)    ffff8803dda20010 0000000000000246 0000000000000000 00000000fffffffe
> (XEN)    0000000000000000 0000000000000000 ffffffff800033aa 0000000000000000
> (XEN)    0000000000000000 0000000000000001 0000010000000000 ffffffff800033aa
> (XEN)    000000000000e033 0000000000000246 ffff8803dda21f08 000000000000e02b
> (XEN)    0000000000000000 0000000000000000 0000000000000000 0000000000000000
> (XEN)    000000000000000e ffff8300754fd000 00000037b9887700 0000000000000000
> (XEN) Xen call trace:
> (XEN)    [<ffff82d08012c3e6>] remove_entry+0x26/0x80
> (XEN)    [<ffff82d08012d3a8>] set_timer+0x178/0x230
> (XEN)    [<ffff82d0801affd0>] mwait_idle+0x270/0x370
> (XEN)    [<ffff82d080189a1c>] local_time_calibration+0x19c/0x1a0
> (XEN)    [<ffff82d0801292aa>] __do_softirq+0x6a/0xa0
> (XEN)    [<ffff82d080161ad6>] idle_loop+0x16/0x50
> (XEN) 
> (XEN) 
> (XEN) ****************************************
> (XEN) Panic on CPU 14:
> (XEN) Xen BUG at timer.c:189
> (XEN) ****************************************
> (XEN) 
> (XEN) Reboot in five seconds...
> (XEN) Executing kexec image on cpu14
> (XEN) Shot down all CPUs
> 
> This is the call of BUG() in remove_entry() in case timer->status is not
> TIMER_STATUS_in_heap or TIMER_STATUS_in_list.
> 
> With the address of the timer from the vmcore:
> crash> struct timer ffff830839bcd5c8
> struct timer {
>   expires = 0,
>   {
>     heap_offset = 968676784, 
>     list_next = 0xffff830839bcd5b0, 
>     inactive = {
>       next = 0xffff830839bcd5b0, 
>       prev = 0x200200200000050
>     }
>   }, 
>   function = 0xffff830839bcd5e0, 
>   data = 0xffff830839bcd5e0, 
>   cpu = 54768, 
>   status = 188 '\274'
> }
> 
> This looks like the structure was overwritten.
> 
> The call path seems to be:
>  mwait_idle()
>    sched_tick_resume()
>      csched_tick_resume()
>        set_timer()
>          remove_timer()
> 
> In sched_tick_resume():
>  unsigned int cpu = smp_processor_id();
> 
> I extracted struct cpu_info from the stack and found:
> crash> cpu_info ffff830839bbff18
> struct cpu_info {
>   ...
>   processor_id = 14, 
>   current_vcpu = 0xffff8300754fd000, 
>   per_cpu_offset = 0x37b9887700,
>   ...
> 
> And from the code in sched_credit.c:  spc = CSCHED_PCPU(cpu);
> I calculated spc:
> 
> crash> csched_pcpu 0xffff830839be2b40
> struct csched_pcpu {
>   runq = {
>     next = 0xffff830839be2b40, 
>     prev = 0xffff830839be2b40
>   }, 
>   runq_sort_last = 0, 
>   ticker = {
>     expires = 174080000000, 
>     {
>       heap_offset = 1, 
>       list_next = 0x100100100000001, 
>       inactive = {
>         next = 0x100100100000001, 
>         prev = 0x200200200200200
>       }
>     }, 
>     function = 0xffff82d08011f200 <csched_tick>, 
>     data = 0xe, 
>     cpu = 14, 
>     status = 3 '\003'
>   }, 
>   tick = 0, 
>   idle_bias = 47, 
>   balance_mask = 0xffff830839be2bb0
> }
> 
> And this ticker looks ok!
> So the ticker address extracted from the vmcore in remove_entry()
> is not the same as the expected one read from the code.

Right, and looking through raw stack contents I can't seem to
find any slot holding 0xffff830839be????. If this was a hypervisor
binary we delivered, I would even have gone and checked which
slots correspond to saved copies of pointers derived from what
was supposedly passed into set_timer() (rooted at
CSCHED_PCPU(14)), but this looks to be a crash with a binary
you've produced yourself. Quite likely corruption starts elsewhere.

Jan

_______________________________________________
Xen-devel mailing list
Xen-devel@xxxxxxxxxxxxx
https://lists.xen.org/xen-devel

 


Rackspace

Lists.xenproject.org is hosted with RackSpace, monitoring our
servers 24x7x365 and backed by RackSpace's Fanatical Support®.