[Date Prev][Date Next][Thread Prev][Thread Next][Date Index][Thread Index] Re: [Xen-devel] Xen-unstable: xen panic RIP: dpci_softirq
> Without #define DIFF_LIST 1: > 1) The guest still crashes (xl-dmesg-not-defined.txt) AHA! --MARK-- 0: ffff8305085ffd28 [p:ffff83054ef27e88, n:ffff83054ef27e88] 1: ffff8305085ffd28 [p:0200200200200200, n:0100100100100100] The same pirq_dpci structure is put twice on the list. That will surely make it unhappy. The reason the #define DIFF_LIST 1 would work is that it has code to deal with that odd scenario. But .. more importantly - the code should not allow you to put _two_ of the same 'pirq_dpci' structures on the list. CPU00: d16 OK-softirq 186msec ago, state:1, 3257 count, [prev:ffff82d0802e7e88, next:ffff82d0802e7e88] ffff8305094a39a8<NULL> MAPPED_SHIFT GUEST_MSI_SHIFT PIRQ:87 d16 OK-raise 236msec ago, state:1, 3257 count, [prev:0200200200200200, next:0100100100100100] ffff8305094a39a8<NULL> MAPPED_SHIFT GUEST_MSI_SHIFT PIRQ:87 CPU01: d16 OK-softirq 232msec ago, state:1, 2978 count, [prev:ffff83054ef57e70, next:ffff83054ef57e70] ffff8305094a39a8<NULL> MAPPED_SHIFT GUEST_MSI_SHIFT PIRQ:87 d16 OK-raise 281msec ago, state:1, 2978 count, [prev:0200200200200200, next:0100100100100100] ffff8305094a39a8<NULL> MAPPED_SHIFT GUEST_MSI_SHIFT PIRQ:87 CPU02: d16 OK-softirq 373msec ago, state:1, 2378 count, [prev:ffff83054ef47e70, next:ffff83054ef47e70] ffff8305094a39a8<NULL> MAPPED_SHIFT GUEST_MSI_SHIFT PIRQ:87 d16 OK-raise 423msec ago, state:1, 2378 count, [prev:0200200200200200, next:0100100100100100] ffff8305094a39a8<NULL> MAPPED_SHIFT GUEST_MSI_SHIFT PIRQ:87 CPU03: d16 OK-softirq 867msec ago, state:1, 2744 count, [prev:ffff83054ef37e70, next:ffff83054ef37e70] ffff8305094a39a8<NULL> MAPPED_SHIFT GUEST_MSI_SHIFT PIRQ:87 d16 OK-raise 916msec ago, state:1, 2744 count, [prev:0200200200200200, next:0100100100100100] ffff8305094a39a8<NULL> MAPPED_SHIFT GUEST_MSI_SHIFT PIRQ:87 CPU04: d16 OK-softirq 497msec ago, state:1, 76910 count, [prev:ffff83054ef2e3e0, next:ffff83054ef2e3e0] ffff8305085ffd28MACH_PCI_SHIFT MAPPED_SHIFT GUEST_PCI_SHIFT PIRQ:0 d16 OK-raise 489msec ago, state:1, 76916 count, [prev:ffff83054ef2e3e0, next:ffff83054ef2e3e0] ffff8305085ffd28MACH_PCI_SHIFT MAPPED_SHIFT GUEST_PCI_SHIFT PIRQ:0 d16 ERR-poison 600msec ago, state:0, 1 count, [prev:0200200200200200, next:0100100100100100] ffff8305085ffd28MACH_PCI_SHIFT MAPPED_SHIFT GUEST_PCI_SHIFT PIRQ:0 CPU05: d16 OK-softirq 852msec ago, state:1, 2207 count, [prev:ffff83054ef1fe70, next:ffff83054ef1fe70] ffff8305094a39a8<NULL> MAPPED_SHIFT GUEST_MSI_SHIFT PIRQ:87 d16 OK-raise 902msec ago, state:1, 2207 count, [prev:0200200200200200, next:0100100100100100] ffff8305094a39a8<NULL> MAPPED_SHIFT GUEST_MSI_SHIFT PIRQ:87 domain_crash called from io.c:964 Domain 16 reported crashed by domain 7 on cpu#4: > > With #define DIFF_LIST 1: > 2) Nor the guest or the host crashed (let it run for about an hour and 15 > minutes), > but the USB XHCI driver bailed out quickly after guest boot, so there were > no MSI-X interrupts anymore. > (xl-dmesg-defined-nousb.txt, dmesg-guest-defined-nousb.txt) Hm, that is odd. Can you tell me how you get your SeaBIOS to add those extra statements? I don't seem to see them with my SeaBIOS (I've an XHCI controlled hooked up to the guest too). Maybe I am missing an CONFIG in the SeaBIOS build. > 3) On another boot the USB XHCI didn't bail out, after a while the host > crashes. (serial.log) > (XEN) [2014-11-18 14:53:37.364] RIP: e008:[<ffff82d08014a4de>] > hvm_do_IRQ_dpci+0xf4/0x131 > which resolves to: > # addr2line -e xen-syms ffff82d08014a4de > /usr/src/new/xen-unstable-vanilla/xen/include/xen/list.h:67 > which is: > static inline void __list_add(struct list_head *new, > struct list_head *prev, > struct list_head *next) > { > next->prev = new; > new->next = next; > new->prev = prev; > Here -> prev->next = new; > } Looking at the stack: (XEN) [2014-11-18 14:53:37.306] 0: ffff8303faaf25a8 [p:ffff83054ef2e3e0, n:ffff83054ef2e3e0] We detected that the list was poison and were printing it, while an: ----[ Xen-4.5.0-rc x86_64 debug=y Not tainted ]---- CPU: 4 RIP: e008:[<ffff82d08014a4de>] hvm_do_IRQ_dpci+0xf4/0x131 RFLAGS: 0000000000010082 CONTEXT: hypervisor rax: ffff83054ef2e3e0 rbx: ffff8303faaf25a8 rcx: ffff8303faaf2610 rdx: 0200200200200200 rsi: 0000000000000006 rdi: 000000006ef3f123 rbp: ffff83054ef27be8 rsp: ffff83054ef27bd8 r8: ffff8303faaf2010 r9: 000000000000002f r10: 0000000000000132 r11: 0000000000000003 r12: ffff8305125d6000 r13: 0000000000000000 r14: ffff8303faaf2580 r15: 000000000000002f cr0: 000000008005003b cr4: 00000000000006f0 cr3: 00000005448c0000 cr2: ffffffffff600400 ds: 002b es: 002b fs: 0000 gs: 0000 ss: e010 cs: e008 Xen stack trace from rsp=ffff83054ef27bd8: ffff83054ef27be8 ffff8303faaf26c0 ffff83054ef27c78 ffff82d080172060 0000000000000020 ffff83054ef27cf6 0000000000000046 ffff83054ef27c70 ffff82d000000000 ffff83055d002f24 0000000000000000 0000002f4ef27c40 ffff83054ef27c48 ffff82d08014434f ffff83054ef27c60 0000000000000000 ffff82d08025efdc 0000000000000200 ffff83054ef27d78 0000000000000003 00007cfab10d8357 ffff82d080233122 0000000000000003 ffff83054ef27d78 0000000000000200 ffff82d08025efdc ffff83054ef27d60 0000000000000000 0000000000000003 0000000000000132 0000000000000003 ffff83054ef80000 0000000000000000 0000000000000000 ffff83054ef20000 000000000000000a ffff82d0802966c0 000000d100000000 ffff82d0801439ba 000000000000e008 0000000000000206 ffff83054ef27d30 000000000000e010 ffff83054ef27d60 ffff82d08030961e ffff83054ef2e378 ffff83054ef27e10 0000000000000000 ffff82d08025f37e ffff83054ef27dc0 ffff82d080143eda ffff82d0802967a0 0000000000000028 ffff83054ef27dd0 ffff83054ef27d90 ffff83054ef27da0 0000000000000000 ffff8303faaf25a8 ffff83054ef2e3e0 ffff83054ef2e3e0 000001785db75800 ffff83054ef27eb0 ffff82d0801495c0 ffff82d080233122 ffff83054ef2e378 ffff83054ef27e00 ffff83054ef2e4e0 ffff83054ef2e400 0000000300000000 ffff8305125d60b8 ffff83054ef27e70 ffff83054ef2e3e0 ffff83054ef2e3e0 ffff8303faaf25a8 ffff83054ef2e3e0 ffff83054ef2e3e0 ffff83054ef2e378 0100100100100100 0200200200200200 ffff83054ef2e378 Xen call trace: [<ffff82d08014a4de>] hvm_do_IRQ_dpci+0xf4/0x131 [<ffff82d080172060>] do_IRQ+0x49c/0x624 and interrupt came in. Said interrupt ended up calling 'raise_softirq' and added itself on the list. But the list is corrupted and it below up. Now the oddidity is that the 'prev' in this case is the per-cpu list - which should be cleared with 'list_splice_init' which INIT_LIST_HEAD() which resets the list! First of it should not even enter in the 'raise_softirq' as: [I suppose that ffff8303faaf25a8 has already been cleared but the second item in the dpci_list is also ffff8303faaf25a8]. 186 if ( test_and_set_bit(STATE_SCHED, &pirq_dpci->state) ) 187 return; is set.. [<ffff82d080233122>] common_interrupt+0x62/0x70 [<ffff82d0801439ba>] vprintk_common+0x131/0x13e [<ffff82d080143eda>] printk+0x46/0x48 [<ffff82d0801495c0>] dpci_softirq+0x199/0x3e2 [<ffff82d08012be31>] __do_softirq+0x81/0x8c [<ffff82d08012be89>] do_softirq+0x13/0x15 [<ffff82d0801633e5>] idle_loop+0x5e/0x6e OK, so it looks like a couple of things are not happening on your machine: 1) test_and_[set|clear]_bit sometimes return unexpected values. [But this might be invalid as the addition of the ffff8303faaf25a8 might be correct - as the second dpci the softirq is processing could be the MSI one] 2) INIT_LIST_HEAD operations on the same CPU are not honored. > > When i look at the combination of (2) and (3), It seems it could be an > interaction between the two passed through devices and/or different IRQ types. Could be - as in it is causing this issue to show up faster than expected. Or it is the one that triggers more than one dpci happening at the same time. > > So i will now test without #define DIFF_LIST 1 and not passing through the > USB controller, see > if that still crashes, if it doesn't i will see if i can passthrough a device > which also only uses legacy > interrupts instead of MSI / MSI-X, see if that crashes or not. Right. I believe the issue you are triggering is that two or more 'pirq_dpci' are added on the per-cpu list. Having them different is fine (so one serving MSI the other IRQ), but having them the same is a no-go. Somehow you are triggering the later. Is your machine by any chance set to 'Aggressive' memory setting? My ASUS box has that setting and while it works great with games, Xen ends up hitting some rather strange lockups that I can only blame on that. > > >> Thank you. _______________________________________________ Xen-devel mailing list Xen-devel@xxxxxxxxxxxxx http://lists.xen.org/xen-devel
|
Lists.xenproject.org is hosted with RackSpace, monitoring our |