Re: [Xen-devel] [xen-unstable test] 106580: regressions - trouble: blocked/broken/fail/pass
On 10/03/17 08:37, Jan Beulich wrote:
>>>> On 10.03.17 at 08:20, <osstest-admin@xxxxxxxxxxxxxx> wrote:
>> flight 106580 xen-unstable real [real]
>> http://logs.test-lab.xenproject.org/osstest/logs/106580/
>>
>> Regressions :-(
>>
>> Tests which did not succeed and are blocking,
>> including tests which could not be run:
>>  test-armhf-armhf-xl-arndale 3 host-install(3) broken REGR. vs. 106534
>>  test-amd64-amd64-migrupgrade 10 xen-boot/dst_host fail REGR. vs. 106534
> The NMI watchdog has hit the EOI timer waiting to be able to send
> an IPI on CPU1:
>
> Mar 10 00:09:32.745677 (XEN) Xen call trace:
> Mar 10 00:09:32.745727 (XEN) [<ffff82d080134083>] _spin_lock+0x2c/0x4f
> Mar 10 00:09:32.745779 (XEN) [<ffff82d080133e34>] on_selected_cpus+0x2c/0xc6
> Mar 10 00:09:32.753699 (XEN) [<ffff82d080177101>] irq.c#irq_guest_eoi_timer_fn+0x142/0x165
> Mar 10 00:09:32.761711 (XEN) [<ffff82d080136ddc>] timer.c#execute_timer+0x47/0x62
> Mar 10 00:09:32.769683 (XEN) [<ffff82d080136ed2>] timer.c#timer_softirq_action+0xdb/0x22c
> Mar 10 00:09:32.769744 (XEN) [<ffff82d0801337e1>] softirq.c#__do_softirq+0x7f/0x8a
> Mar 10 00:09:32.777697 (XEN) [<ffff82d080133836>] do_softirq+0x13/0x15
> Mar 10 00:09:32.785792 (XEN) [<ffff82d080255081>] entry.o#process_softirqs+0x21/0x30
>
> That lock is being held by CPU2:
>
> Mar 10 00:15:25.133639 (XEN) Xen call trace:
> Mar 10 00:15:25.133655 (XEN) [<ffff82d080102389>] __bitmap_empty+0x54/0x96
> Mar 10 00:15:25.141636 (XEN) [<ffff82d080133eb5>] on_selected_cpus+0xad/0xc6
> Mar 10 00:15:25.149635 (XEN) [<ffff82d0801ca640>] powernow.c#powernow_cpufreq_cpu_init+0x20d/0x372
> Mar 10 00:15:25.157633 (XEN) [<ffff82d08014c476>] cpufreq_add_cpu+0x1d6/0x5d3
> Mar 10 00:15:25.157654 (XEN) [<ffff82d0801ca173>] cpufreq_cpu_init+0x17/0x1a
> Mar 10 00:15:25.165658 (XEN) [<ffff82d08014cd8d>] set_px_pminfo+0x2b6/0x2f7
> Mar 10 00:15:25.165679 (XEN) [<ffff82d0801956dd>] do_platform_op+0xe69/0x1959
> Mar 10 00:15:25.173667 (XEN) [<ffff82d080251485>] pv_hypercall+0x1ef/0x42d
> Mar 10 00:15:25.181678 (XEN) [<ffff82d080254ff6>] entry.o#test_all_events+0/0x30
>
> Register state tells us that it's CPU5 not responding. The only piece
> of information we have about CPU5 is
>
> Mar 10 00:09:32.809709 (XEN) CPU5 @ e008:ffff82d080134083 (0000000000000000)
>
> which is also in _spin_lock(), but which I'm afraid is too little to
> diagnose the issue. I'm therefore wondering whether we wouldn't
> better default "async-show-all" to true in debug builds.
>
> What I'm also puzzled by is that the system is still partly alive after
> the panic: There's Dom0 output, and it is also reacting to debug
> key input. I would have expected a panic to bring down the system
> right away...

Not very surprising. We crashed because the IPI lock was unavailable,
then disabled the watchdog in machine_halt() and tried to IPI again.
CPU1 is almost certainly stuck waiting to broadcast __machine_halt().

This is the second odd corner case we have seen around machine_halt().
The last one arose because machine_halt() is unsafe to use if you
panic() from the middle of context_switch(): interrupts are re-enabled,
and a guest IRQ hits an assertion.

The solution in both cases, to make it more reliable, is to use an NMI
broadcast and leave interrupts disabled.

IMO, noreboot isn't a clever thing to be using at all. OSSTest should
be installing a crash kernel and collecting crash logs, which will be
far more useful to aid diagnosis.

~Andrew

_______________________________________________
Xen-devel mailing list
Xen-devel@xxxxxxxxxxxxx
https://lists.xen.org/xen-devel