[Date Prev][Date Next][Thread Prev][Thread Next][Date Index][Thread Index]

Re: [Xen-devel] Xen-unstable: xen panic RIP: dpci_softirq



Wednesday, November 19, 2014, 2:55:41 AM, you wrote:

> On Tue, Nov 18, 2014 at 11:12:54PM +0100, Sander Eikelenboom wrote:
>> 
>> Tuesday, November 18, 2014, 9:56:33 PM, you wrote:
>> 
>> >> 
>> >> Uhmm i thought i had these switched off (due to problems earlier and then 
>> >> forgot 
>> >> about them .. however looking at the earlier reports these lines were 
>> >> also in 
>> >> those reports).
>> >> 
>> >> The xen-syms and these last runs are all with a prestine xen tree cloned 
>> >> today (staging 
>> >> branch), so the qemu-xen and seabios defined with that were also freshly 
>> >> cloned 
>> >> and had a new default seabios config. (just to rule out anything stale in 
>> >> my tree)
>> >> 
>> >> If you don't see those messages .. perhaps your seabios and qemu trees 
>> >> (and at least the 
>> >> seabios config) are not the most recent (they don't get updated 
>> >> automatically 
>> >> when you just do a git pull on the main tree) ?
>> >> 
>> >> In /tools/firmware/seabios-dir/.config i have:
>> >> CONFIG_USB=y
>> >> CONFIG_USB_UHCI=y
>> >> CONFIG_USB_OHCI=y
>> >> CONFIG_USB_EHCI=y
>> >> CONFIG_USB_XHCI=y
>> >> CONFIG_USB_MSC=y
>> >> CONFIG_USB_UAS=y
>> >> CONFIG_USB_HUB=y
>> >> CONFIG_USB_KEYBOARD=y
>> >> CONFIG_USB_MOUSE=y
>> >> 
>> 
>> > I seem to have the same thing. Perhaps it is my XHCI controller being 
>> > wonky.
>> 
>> >> And this is all just from a:
>> >> - git clone git://xenbits.xen.org/xen.git -b staging
>> >> - make clean && ./configure && make -j6 && make -j6 install
>> 
>> > Aye. 
>> > .. snip..
>> >> >  1) test_and_[set|clear]_bit sometimes return unexpected values.
>> >> >     [But this might be invalid as the addition of the ffff8303faaf25a8
>> >> >      might be correct - as the second dpci the softirq is processing
>> >> >      could be the MSI one]
>> >> 
>> >> Would there be an easy way to stress test this function separately in 
>> >> some 
>> >> debugging function to see if it indeed is returning unexpected values ?
>> 
>> > Sadly no. But you got me looking in the right direction when you mentioned
>> > 'timeout'.
>> >> 
>> >> >  2) INIT_LIST_HEAD operations on the same CPU are not honored.
>> >> 
>> >> Just curious, have you also tested the patches on AMD hardware ?
>> 
>> > Yes. To reproduce this the first thing I did was to get an AMD box.
>> 
>> >> 
>> >>  
>> >> >> When i look at the combination of (2) and (3), It seems it could be an 
>> >> >> interaction between the two passed through devices and/or different 
>> >> >> IRQ types.
>> >> 
>> >> > Could be - as in it is causing this issue to show up faster than
>> >> > expected. Or it is the one that triggers more than one dpci happening
>> >> > at the same time.
>> >> 
>> >> Well that didn't seem to be it (see separate amendment i mailed 
>> >> previously)
>> 
>> > Right, the current theory I've is that the interrupts are not being
>> > Acked within 8 milisecond and we reset the 'state' - and at the same
>> > time we get an interrupt and schedule it - while we are still processing
>> > the same interrupt. This would explain why the 'test_and_clear_bit'
>> > got the wrong value.
>> 
>> > In regards to the list poison - following this thread of logic - with
>> > the 'state = 0' set we open the floodgates for any CPU to put the same
>> > 'struct hvm_pirq_dpci' on its list.
>> 
>> > We do reset the 'state' on _every_ GSI that is mapped to a guest - so
>> > we also reset the 'state' for the MSI one (XHCI). Anyhow in your case:
>> 
>> > CPUX:                           CPUY:
>> > pt_irq_time_out:
>> > state = 0;                      
>> > [out of timer coder, the                raise_softirq
>> >  pirq_dpci is on the dpci_list]         [adds the pirq_dpci as state == 0]
>> 
>> > softirq_dpci                            softirq_dpci:
>> >         list_del
>> >         [entries poison]
>> >                                                 list_del <= BOOM
>> >                         
>> > Is what I believe is happening.
>> 
>> > The INTX device - once I put a load on it - does not trigger
>> > any pt_irq_time_out, so that would explain why I cannot hit this.
>> 
>> > But I believe your card hits these "hiccups".   
>> 
>> 
>> Hi Konrad,
>> 
>> I just tested you 5 patches and as a result i still got an(other) host crash:
>> (complete serial log attached)
>> 
>> (XEN) [2014-11-18 21:55:41.591] ----[ Xen-4.5.0-rc  x86_64  debug=y  Not 
>> tainted ]----
>> (XEN) [2014-11-18 21:55:41.591] CPU:    0
>> (XEN) [2014-11-18 21:55:41.591] ----[ Xen-4.5.0-rc  x86_64  debug=y  Not 
>> tainted ]----
>> (XEN) [2014-11-18 21:55:41.591] RIP:    e008:[<ffff82d08012c7e7>]CPU:    2
>> (XEN) [2014-11-18 21:55:41.591] RIP:    e008:[<ffff82d08014a461>] 
>> hvm_do_IRQ_dpci+0xbd/0x13c
>> (XEN) [2014-11-18 21:55:41.591] RFLAGS: 0000000000010006    
>> _spin_unlock+0x1f/0x30CONTEXT: hypervisor

> Duh!

> Here is another patch on top of the five you have (attached and inline).

Hi Konrad,

Happy to report it has been running with this additional patch for 2 hours now 
without any problems. I think you nailed it :-)
More than happy to test the definitive patch as well.

--
Sander

> From 18008650f10199e2b83f74394f634a97086e34b8 Mon Sep 17 00:00:00 2001
> From: Konrad Rzeszutek Wilk <konrad.wilk@xxxxxxxxxx>
> Date: Tue, 18 Nov 2014 20:48:43 -0500
> Subject: [PATCH] debug: Whether we want to clear the 'dom' field or not when
>  changing the 'state' bit.

> Signed-off-by: Konrad Rzeszutek Wilk <konrad.wilk@xxxxxxxxxx>
> ---
>  xen/drivers/passthrough/io.c | 16 ++++++++++------
>  1 file changed, 10 insertions(+), 6 deletions(-)

> diff --git a/xen/drivers/passthrough/io.c b/xen/drivers/passthrough/io.c
> index aad5607..24e2bd6 100644
> --- a/xen/drivers/passthrough/io.c
> +++ b/xen/drivers/passthrough/io.c
> @@ -243,13 +243,14 @@ bool_t pt_pirq_softirq_active(struct hvm_pirq_dpci 
> *pirq_dpci)
>  }
>  
>  /*
> - * Reset the pirq_dpci->dom parameter to NULL.
> + * Reset the pirq_dpci->dom parameter to NULL if 'clear_dom' is set.
>   *
>   * This function checks the different states to make sure it can do it
>   * at the right time. If it unschedules the 'hvm_dirq_assist' from running
>   * it also refcounts (which is what the softirq would have done) properly.
>   */
> -static void pt_pirq_softirq_reset(struct hvm_pirq_dpci *pirq_dpci)
> +static void pt_pirq_softirq_reset(struct hvm_pirq_dpci *pirq_dpci,
> +                                  unsigned int clear_dom)
>  {
>      struct domain *d = pirq_dpci->dom;
>      struct __debug *debug = &__get_cpu_var(_d);
> @@ -276,7 +277,8 @@ static void pt_pirq_softirq_reset(struct hvm_pirq_dpci 
> *pirq_dpci)
>           * in local variable before it sets STATE_RUN - and therefore will 
> not
>           * dereference '->dom' which would crash.
>           */
-        pirq_dpci->>dom = NULL;
> +        if (clear_dom)
> +            pirq_dpci->dom = NULL;
>          break;
>      }
>  }
> @@ -295,7 +297,7 @@ static int pt_irq_guest_eoi(struct domain *d, struct 
> hvm_pirq_dpci *pirq_dpci,
>                                &pirq_dpci->flags) )
>      {
>          _record(&debug->clear, pirq_dpci);
> -        pt_pirq_softirq_reset(pirq_dpci);
> +        pt_pirq_softirq_reset(pirq_dpci, 0 /* leave ->dom /);
>          /* pirq_dpci->state = 0; <= OUCH! */
>          pirq_dpci->pending = 0;
>          pirq_guest_eoi(dpci_pirq(pirq_dpci));
> @@ -333,9 +335,11 @@ static void pt_irq_time_out(void *data)
>      pt_pirq_iterate(irq_map->dom, pt_irq_guest_eoi, NULL);
>  
>      spin_unlock(&irq_map->dom->event_lock);
> +#if 0
>      console_start_sync();
>      dump_debug((char)0);
>      console_end_sync();
> +#endif
>  }
>  
>  struct hvm_irq_dpci *domain_get_irq_dpci(const struct domain *d)
> @@ -446,7 +450,7 @@ int pt_irq_create_bind(
>                       * in the queue.
>                       */
>                      pirq_dpci->line = __LINE__;
> -                    pt_pirq_softirq_reset(pirq_dpci);
> +                    pt_pirq_softirq_reset(pirq_dpci, 1 /* clear dom */);
>                  }
>              }
>              if ( unlikely(rc) )
> @@ -701,7 +705,7 @@ int pt_irq_destroy_bind(
>           * call to pt_pirq_softirq_reset.
>           */
>          pirq_dpci->line = __LINE__;
> -        pt_pirq_softirq_reset(pirq_dpci);
> +        pt_pirq_softirq_reset(pirq_dpci, 1 /* clear ->dom */);
>  
>          pirq_cleanup_check(pirq, d);
>      }


_______________________________________________
Xen-devel mailing list
Xen-devel@xxxxxxxxxxxxx
http://lists.xen.org/xen-devel


 


Rackspace

Lists.xenproject.org is hosted with RackSpace, monitoring our
servers 24x7x365 and backed by RackSpace's Fanatical Support®.