|
[Date Prev][Date Next][Thread Prev][Thread Next][Date Index][Thread Index] Re: [Xen-devel] Xen-unstable: xen panic RIP: dpci_softirq
On Tue, Nov 18, 2014 at 11:12:54PM +0100, Sander Eikelenboom wrote:
>
> Tuesday, November 18, 2014, 9:56:33 PM, you wrote:
>
> >>
> >> Uhmm i thought i had these switched off (due to problems earlier and then
> >> forgot
> >> about them .. however looking at the earlier reports these lines were also
> >> in
> >> those reports).
> >>
> >> The xen-syms and these last runs are all with a prestine xen tree cloned
> >> today (staging
> >> branch), so the qemu-xen and seabios defined with that were also freshly
> >> cloned
> >> and had a new default seabios config. (just to rule out anything stale in
> >> my tree)
> >>
> >> If you don't see those messages .. perhaps your seabios and qemu trees
> >> (and at least the
> >> seabios config) are not the most recent (they don't get updated
> >> automatically
> >> when you just do a git pull on the main tree) ?
> >>
> >> In /tools/firmware/seabios-dir/.config i have:
> >> CONFIG_USB=y
> >> CONFIG_USB_UHCI=y
> >> CONFIG_USB_OHCI=y
> >> CONFIG_USB_EHCI=y
> >> CONFIG_USB_XHCI=y
> >> CONFIG_USB_MSC=y
> >> CONFIG_USB_UAS=y
> >> CONFIG_USB_HUB=y
> >> CONFIG_USB_KEYBOARD=y
> >> CONFIG_USB_MOUSE=y
> >>
>
> > I seem to have the same thing. Perhaps it is my XHCI controller being wonky.
>
> >> And this is all just from a:
> >> - git clone git://xenbits.xen.org/xen.git -b staging
> >> - make clean && ./configure && make -j6 && make -j6 install
>
> > Aye.
> > .. snip..
> >> > 1) test_and_[set|clear]_bit sometimes return unexpected values.
> >> > [But this might be invalid as the addition of the ffff8303faaf25a8
> >> > might be correct - as the second dpci the softirq is processing
> >> > could be the MSI one]
> >>
> >> Would there be an easy way to stress test this function separately in some
> >> debugging function to see if it indeed is returning unexpected values ?
>
> > Sadly no. But you got me looking in the right direction when you mentioned
> > 'timeout'.
> >>
> >> > 2) INIT_LIST_HEAD operations on the same CPU are not honored.
> >>
> >> Just curious, have you also tested the patches on AMD hardware ?
>
> > Yes. To reproduce this the first thing I did was to get an AMD box.
>
> >>
> >>
> >> >> When i look at the combination of (2) and (3), It seems it could be an
> >> >> interaction between the two passed through devices and/or different IRQ
> >> >> types.
> >>
> >> > Could be - as in it is causing this issue to show up faster than
> >> > expected. Or it is the one that triggers more than one dpci happening
> >> > at the same time.
> >>
> >> Well that didn't seem to be it (see separate amendment i mailed previously)
>
> > Right, the current theory I've is that the interrupts are not being
> > Acked within 8 milisecond and we reset the 'state' - and at the same
> > time we get an interrupt and schedule it - while we are still processing
> > the same interrupt. This would explain why the 'test_and_clear_bit'
> > got the wrong value.
>
> > In regards to the list poison - following this thread of logic - with
> > the 'state = 0' set we open the floodgates for any CPU to put the same
> > 'struct hvm_pirq_dpci' on its list.
>
> > We do reset the 'state' on _every_ GSI that is mapped to a guest - so
> > we also reset the 'state' for the MSI one (XHCI). Anyhow in your case:
>
> > CPUX: CPUY:
> > pt_irq_time_out:
> > state = 0;
> > [out of timer coder, the raise_softirq
> > pirq_dpci is on the dpci_list] [adds the pirq_dpci as state == 0]
>
> > softirq_dpci softirq_dpci:
> > list_del
> > [entries poison]
> > list_del <= BOOM
> >
> > Is what I believe is happening.
>
> > The INTX device - once I put a load on it - does not trigger
> > any pt_irq_time_out, so that would explain why I cannot hit this.
>
> > But I believe your card hits these "hiccups".
>
>
> Hi Konrad,
>
> I just tested you 5 patches and as a result i still got an(other) host crash:
> (complete serial log attached)
>
> (XEN) [2014-11-18 21:55:41.591] ----[ Xen-4.5.0-rc x86_64 debug=y Not
> tainted ]----
> (XEN) [2014-11-18 21:55:41.591] CPU: 0
> (XEN) [2014-11-18 21:55:41.591] ----[ Xen-4.5.0-rc x86_64 debug=y Not
> tainted ]----
> (XEN) [2014-11-18 21:55:41.591] RIP: e008:[<ffff82d08012c7e7>]CPU: 2
> (XEN) [2014-11-18 21:55:41.591] RIP: e008:[<ffff82d08014a461>]
> hvm_do_IRQ_dpci+0xbd/0x13c
> (XEN) [2014-11-18 21:55:41.591] RFLAGS: 0000000000010006
> _spin_unlock+0x1f/0x30CONTEXT: hypervisor
Duh!
Here is another patch on top of the five you have (attached and inline).
From 18008650f10199e2b83f74394f634a97086e34b8 Mon Sep 17 00:00:00 2001
From: Konrad Rzeszutek Wilk <konrad.wilk@xxxxxxxxxx>
Date: Tue, 18 Nov 2014 20:48:43 -0500
Subject: [PATCH] debug: Whether we want to clear the 'dom' field or not when
changing the 'state' bit.
Signed-off-by: Konrad Rzeszutek Wilk <konrad.wilk@xxxxxxxxxx>
---
xen/drivers/passthrough/io.c | 16 ++++++++++------
1 file changed, 10 insertions(+), 6 deletions(-)
diff --git a/xen/drivers/passthrough/io.c b/xen/drivers/passthrough/io.c
index aad5607..24e2bd6 100644
--- a/xen/drivers/passthrough/io.c
+++ b/xen/drivers/passthrough/io.c
@@ -243,13 +243,14 @@ bool_t pt_pirq_softirq_active(struct hvm_pirq_dpci
*pirq_dpci)
}
/*
- * Reset the pirq_dpci->dom parameter to NULL.
+ * Reset the pirq_dpci->dom parameter to NULL if 'clear_dom' is set.
*
* This function checks the different states to make sure it can do it
* at the right time. If it unschedules the 'hvm_dirq_assist' from running
* it also refcounts (which is what the softirq would have done) properly.
*/
-static void pt_pirq_softirq_reset(struct hvm_pirq_dpci *pirq_dpci)
+static void pt_pirq_softirq_reset(struct hvm_pirq_dpci *pirq_dpci,
+ unsigned int clear_dom)
{
struct domain *d = pirq_dpci->dom;
struct __debug *debug = &__get_cpu_var(_d);
@@ -276,7 +277,8 @@ static void pt_pirq_softirq_reset(struct hvm_pirq_dpci
*pirq_dpci)
* in local variable before it sets STATE_RUN - and therefore will not
* dereference '->dom' which would crash.
*/
- pirq_dpci->dom = NULL;
+ if (clear_dom)
+ pirq_dpci->dom = NULL;
break;
}
}
@@ -295,7 +297,7 @@ static int pt_irq_guest_eoi(struct domain *d, struct
hvm_pirq_dpci *pirq_dpci,
&pirq_dpci->flags) )
{
_record(&debug->clear, pirq_dpci);
- pt_pirq_softirq_reset(pirq_dpci);
+ pt_pirq_softirq_reset(pirq_dpci, 0 /* leave ->dom /);
/* pirq_dpci->state = 0; <= OUCH! */
pirq_dpci->pending = 0;
pirq_guest_eoi(dpci_pirq(pirq_dpci));
@@ -333,9 +335,11 @@ static void pt_irq_time_out(void *data)
pt_pirq_iterate(irq_map->dom, pt_irq_guest_eoi, NULL);
spin_unlock(&irq_map->dom->event_lock);
+#if 0
console_start_sync();
dump_debug((char)0);
console_end_sync();
+#endif
}
struct hvm_irq_dpci *domain_get_irq_dpci(const struct domain *d)
@@ -446,7 +450,7 @@ int pt_irq_create_bind(
* in the queue.
*/
pirq_dpci->line = __LINE__;
- pt_pirq_softirq_reset(pirq_dpci);
+ pt_pirq_softirq_reset(pirq_dpci, 1 /* clear dom */);
}
}
if ( unlikely(rc) )
@@ -701,7 +705,7 @@ int pt_irq_destroy_bind(
* call to pt_pirq_softirq_reset.
*/
pirq_dpci->line = __LINE__;
- pt_pirq_softirq_reset(pirq_dpci);
+ pt_pirq_softirq_reset(pirq_dpci, 1 /* clear ->dom */);
pirq_cleanup_check(pirq, d);
}
--
1.9.3
_______________________________________________
Xen-devel mailing list
Xen-devel@xxxxxxxxxxxxx
http://lists.xen.org/xen-devel
|
![]() |
Lists.xenproject.org is hosted with RackSpace, monitoring our |