[Date Prev][Date Next][Thread Prev][Thread Next][Date Index][Thread Index]

Re: [Xen-devel] Xen-unstable: xen panic RIP: dpci_softirq



> Without #define DIFF_LIST 1:
> 1) The guest still crashes (xl-dmesg-not-defined.txt)

AHA!

 --MARK--
  0: ffff8305085ffd28 [p:ffff83054ef27e88, n:ffff83054ef27e88]
  1: ffff8305085ffd28 [p:0200200200200200, n:0100100100100100]

The same pirq_dpci structure is put twice on the list. That
will surely make it unhappy. The reason the #define DIFF_LIST 1
would work is that it has code to deal with that odd scenario.

But .. more importantly - the code should not allow you to
put _two_ of the same 'pirq_dpci' structures on the list.

 CPU00: 
 d16 OK-softirq 186msec ago, state:1, 3257 count, [prev:ffff82d0802e7e88, 
next:ffff82d0802e7e88] ffff8305094a39a8<NULL> MAPPED_SHIFT GUEST_MSI_SHIFT  
PIRQ:87
 d16 OK-raise   236msec ago, state:1, 3257 count, [prev:0200200200200200, 
next:0100100100100100] ffff8305094a39a8<NULL> MAPPED_SHIFT GUEST_MSI_SHIFT  
PIRQ:87
 CPU01: 
 d16 OK-softirq 232msec ago, state:1, 2978 count, [prev:ffff83054ef57e70, 
next:ffff83054ef57e70] ffff8305094a39a8<NULL> MAPPED_SHIFT GUEST_MSI_SHIFT  
PIRQ:87
 d16 OK-raise   281msec ago, state:1, 2978 count, [prev:0200200200200200, 
next:0100100100100100] ffff8305094a39a8<NULL> MAPPED_SHIFT GUEST_MSI_SHIFT  
PIRQ:87
 CPU02: 
 d16 OK-softirq 373msec ago, state:1, 2378 count, [prev:ffff83054ef47e70, 
next:ffff83054ef47e70] ffff8305094a39a8<NULL> MAPPED_SHIFT GUEST_MSI_SHIFT  
PIRQ:87
 d16 OK-raise   423msec ago, state:1, 2378 count, [prev:0200200200200200, 
next:0100100100100100] ffff8305094a39a8<NULL> MAPPED_SHIFT GUEST_MSI_SHIFT  
PIRQ:87
 CPU03: 
 d16 OK-softirq 867msec ago, state:1, 2744 count, [prev:ffff83054ef37e70, 
next:ffff83054ef37e70] ffff8305094a39a8<NULL> MAPPED_SHIFT GUEST_MSI_SHIFT  
PIRQ:87
 d16 OK-raise   916msec ago, state:1, 2744 count, [prev:0200200200200200, 
next:0100100100100100] ffff8305094a39a8<NULL> MAPPED_SHIFT GUEST_MSI_SHIFT  
PIRQ:87
 CPU04: 
 d16 OK-softirq 497msec ago, state:1, 76910 count, [prev:ffff83054ef2e3e0, 
next:ffff83054ef2e3e0] ffff8305085ffd28MACH_PCI_SHIFT MAPPED_SHIFT 
GUEST_PCI_SHIFT  PIRQ:0
 d16 OK-raise   489msec ago, state:1, 76916 count, [prev:ffff83054ef2e3e0, 
next:ffff83054ef2e3e0] ffff8305085ffd28MACH_PCI_SHIFT MAPPED_SHIFT 
GUEST_PCI_SHIFT  PIRQ:0
 d16 ERR-poison 600msec ago, state:0, 1 count, [prev:0200200200200200, 
next:0100100100100100] ffff8305085ffd28MACH_PCI_SHIFT MAPPED_SHIFT 
GUEST_PCI_SHIFT  PIRQ:0
 CPU05: 
 d16 OK-softirq 852msec ago, state:1, 2207 count, [prev:ffff83054ef1fe70, 
next:ffff83054ef1fe70] ffff8305094a39a8<NULL> MAPPED_SHIFT GUEST_MSI_SHIFT  
PIRQ:87
 d16 OK-raise   902msec ago, state:1, 2207 count, [prev:0200200200200200, 
next:0100100100100100] ffff8305094a39a8<NULL> MAPPED_SHIFT GUEST_MSI_SHIFT  
PIRQ:87
 domain_crash called from io.c:964
 Domain 16 reported crashed by domain 7 on cpu#4:


> 
> With #define DIFF_LIST 1:
> 2) Nor the guest or the host crashed (let it run for about an hour and 15 
> minutes), 
>    but the USB XHCI driver bailed out quickly after guest boot, so there were 
> no MSI-X interrupts anymore.
>    (xl-dmesg-defined-nousb.txt, dmesg-guest-defined-nousb.txt)

Hm, that is odd. Can you tell me how you get your SeaBIOS to add those
extra statements? I don't seem to see them with my SeaBIOS (I've an XHCI
controlled hooked up to the guest too). Maybe I am missing an CONFIG in the
SeaBIOS build.

> 3) On another boot the USB XHCI didn't bail out, after a while the host 
> crashes. (serial.log)
>    (XEN) [2014-11-18 14:53:37.364] RIP:    e008:[<ffff82d08014a4de>] 
> hvm_do_IRQ_dpci+0xf4/0x131
>    which resolves to:
>    # addr2line -e xen-syms ffff82d08014a4de
>    /usr/src/new/xen-unstable-vanilla/xen/include/xen/list.h:67
>    which is:
>    static inline void __list_add(struct list_head *new,
>                               struct list_head *prev,
>                               struct list_head *next)
>     {
>     next->prev = new;
>     new->next = next;
>     new->prev = prev;
> Here ->    prev->next = new;
>     }

Looking at the stack:


(XEN) [2014-11-18 14:53:37.306]  0: ffff8303faaf25a8 [p:ffff83054ef2e3e0, 
n:ffff83054ef2e3e0]

We detected that the list was poison and were printing it, while an:

 ----[ Xen-4.5.0-rc  x86_64  debug=y  Not tainted ]----
 CPU:    4
 RIP:    e008:[<ffff82d08014a4de>] hvm_do_IRQ_dpci+0xf4/0x131
 RFLAGS: 0000000000010082   CONTEXT: hypervisor
 rax: ffff83054ef2e3e0   rbx: ffff8303faaf25a8   rcx: ffff8303faaf2610
 rdx: 0200200200200200   rsi: 0000000000000006   rdi: 000000006ef3f123
 rbp: ffff83054ef27be8   rsp: ffff83054ef27bd8   r8:  ffff8303faaf2010
 r9:  000000000000002f   r10: 0000000000000132   r11: 0000000000000003
 r12: ffff8305125d6000   r13: 0000000000000000   r14: ffff8303faaf2580
 r15: 000000000000002f   cr0: 000000008005003b   cr4: 00000000000006f0
 cr3: 00000005448c0000   cr2: ffffffffff600400
 ds: 002b   es: 002b   fs: 0000   gs: 0000   ss: e010   cs: e008
 Xen stack trace from rsp=ffff83054ef27bd8:
    ffff83054ef27be8 ffff8303faaf26c0 ffff83054ef27c78 ffff82d080172060
    0000000000000020 ffff83054ef27cf6 0000000000000046 ffff83054ef27c70
    ffff82d000000000 ffff83055d002f24 0000000000000000 0000002f4ef27c40
    ffff83054ef27c48 ffff82d08014434f ffff83054ef27c60 0000000000000000
    ffff82d08025efdc 0000000000000200 ffff83054ef27d78 0000000000000003
    00007cfab10d8357 ffff82d080233122 0000000000000003 ffff83054ef27d78
    0000000000000200 ffff82d08025efdc ffff83054ef27d60 0000000000000000
    0000000000000003 0000000000000132 0000000000000003 ffff83054ef80000
    0000000000000000 0000000000000000 ffff83054ef20000 000000000000000a
    ffff82d0802966c0 000000d100000000 ffff82d0801439ba 000000000000e008
    0000000000000206 ffff83054ef27d30 000000000000e010 ffff83054ef27d60
    ffff82d08030961e ffff83054ef2e378 ffff83054ef27e10 0000000000000000
    ffff82d08025f37e ffff83054ef27dc0 ffff82d080143eda ffff82d0802967a0
    0000000000000028 ffff83054ef27dd0 ffff83054ef27d90 ffff83054ef27da0
    0000000000000000 ffff8303faaf25a8 ffff83054ef2e3e0 ffff83054ef2e3e0
    000001785db75800 ffff83054ef27eb0 ffff82d0801495c0 ffff82d080233122
    ffff83054ef2e378 ffff83054ef27e00 ffff83054ef2e4e0 ffff83054ef2e400
    0000000300000000 ffff8305125d60b8 ffff83054ef27e70 ffff83054ef2e3e0
    ffff83054ef2e3e0 ffff8303faaf25a8 ffff83054ef2e3e0 ffff83054ef2e3e0
    ffff83054ef2e378 0100100100100100 0200200200200200 ffff83054ef2e378
 Xen call trace:
    [<ffff82d08014a4de>] hvm_do_IRQ_dpci+0xf4/0x131
    [<ffff82d080172060>] do_IRQ+0x49c/0x624

and interrupt came in. Said interrupt ended up calling 'raise_softirq'
and added itself on the list. But the list is corrupted and
it below up.

Now the oddidity is that the 'prev' in this case is the per-cpu list -
which should be cleared with 'list_splice_init' which INIT_LIST_HEAD() which
resets the list!

First of it should not even enter in the 'raise_softirq' as:
[I suppose that ffff8303faaf25a8 has already been cleared but the
second item in the dpci_list is also ffff8303faaf25a8].

 186     if ( test_and_set_bit(STATE_SCHED, &pirq_dpci->state) )                
     
 187         return;                                                            
     

is set..  

    [<ffff82d080233122>] common_interrupt+0x62/0x70
    [<ffff82d0801439ba>] vprintk_common+0x131/0x13e
    [<ffff82d080143eda>] printk+0x46/0x48
    [<ffff82d0801495c0>] dpci_softirq+0x199/0x3e2
    [<ffff82d08012be31>] __do_softirq+0x81/0x8c
    [<ffff82d08012be89>] do_softirq+0x13/0x15
    [<ffff82d0801633e5>] idle_loop+0x5e/0x6e

OK, so it looks like a couple of things are not happening on your
machine:

 1) test_and_[set|clear]_bit sometimes return unexpected values.
    [But this might be invalid as the addition of the ffff8303faaf25a8
     might be correct - as the second dpci the softirq is processing
     could be the MSI one]
 2) INIT_LIST_HEAD operations on the same CPU are not honored.

> 
> When i look at the combination of (2) and (3), It seems it could be an 
> interaction between the two passed through devices and/or different IRQ types.

Could be - as in it is causing this issue to show up faster than
expected. Or it is the one that triggers more than one dpci happening
at the same time.
> 
> So i will now test without #define DIFF_LIST 1 and not passing through the 
> USB controller, see
> if that still crashes, if it doesn't i will see if i can passthrough a device 
> which also only uses legacy
> interrupts instead of MSI / MSI-X, see if that crashes or not.

Right. I believe the issue you are triggering is that two or more
'pirq_dpci' are added on the per-cpu list. Having them different
is fine (so one serving MSI the other IRQ), but having them the same is a no-go.
Somehow you are triggering the later.


Is your machine by any chance set to 'Aggressive' memory setting? My ASUS
box has that setting and while it works great with games, Xen ends up hitting
some rather strange lockups that I can only blame on that.

> 
> >> Thank you.

_______________________________________________
Xen-devel mailing list
Xen-devel@xxxxxxxxxxxxx
http://lists.xen.org/xen-devel


 


Rackspace

Lists.xenproject.org is hosted with RackSpace, monitoring our
servers 24x7x365 and backed by RackSpace's Fanatical Support®.