Re: [Xen-devel] domU and dom0 hung with Xen console interrupt binding showing in-flight=1, (---M)
On 17/08/2010 18:28, "Bruce Edge" <bruce.edge@xxxxxxxxx> wrote:

> On Tue, Jun 29, 2010 at 1:42 AM, Jan Beulich <JBeulich@xxxxxxxxxx> wrote:
>>>>> On 28.06.10 at 20:22, Dante Cinco <dantecinco@xxxxxxxxx> wrote:
>>> I have an HP Proliant DL380-G6 (dual Xeon E5540 @ 2.53GHz) with Xen 4.0.0
>>> and dom0 Linux 2.6.32.12 x86_64 pvops and domU Linux kernel 2.6.30.1 x86_64.
>>> I'm using PCI passthrough (pci-stub) to pass my 4-port 8Gb PMC-Sierra Fibre
>>> Channel HBA to domU. After running I/Os for several hours, both dom0 and
>>> domU hangs and the Xen console shows the interrupt binding below where IRQ
>>> 66 shows in-flight=1 and mask set (---M). What's the best way to debug this
>>> problem?
>>
>> There are potentially two problems here: One is that the guest may
>> fail to send the EOI notification. You would want to check whether
>> pirq_guest_eoi() got run after that last occurrence of the interrupt.
>>
>> The more worrying part is that Xen should time out on a guest failing
>> to send the EOI notification, and ack the interrupt nevertheless.
>> Looking at the code I fail to see how the ack_APIC_irq() would get
>> sent in this case: non-maskable MSIs get this issued from
>> end_msi_irq(), but ->end doesn't get invoked from
>> irq_guest_eoi_timer_fn() (only ->enable does). Keir, am I missing
>> something?

I don't think that timer logic is designed to handle non-maskable MSIs, only
maskable ones. It ought to be not too hard to fix it up for non-maskable ones
too by issuing the ->end() call from the timer handler?

 -- Keir

>> Otoh I can't see how this can work reliably in the first place: Since
>> there's no other way to mask such interrupts, sending an ack to the
>> LAPIC could result in an interrupt storm. Disabling MSI on the
>> affected device isn't a good option either, as we know there are
>> devices that switch to legacy IRQ mode irreversibly in that case,
>> and hence the device becomes unusable (presumably until being
>> reset). But very likely this would still be better than hanging the
>> entire box; it probably would just need a more graceful timeout.
>>
>> Jan
>
>
> This is still happening. I have 2 identical boxes that were running a stress
> test and both hung after a few hours. They have identical hardware and
> software configs so I'll report the config for one and attach the xen dump for
> both.
>
> dom0 info:
>
> HP Proliant DL380-G6 (dual Xeon E5540 @ 2.53GHz)
>
> # cat /proc/cmdline
> root=/dev/mapper/system-dom0_0 ro earlyprintk=xen loglevel=10 debug acpi=force
> console=hvc0,115200n8
>
> # uname -a
> Linux dpm8800-09 2.6.32.16 #1 SMP Wed Aug 4 15:38:21 PDT 2010 x86_64 GNU/Linux
>
> The domU is an Ubuntu 10.04 kernel, 2.6.32.15+drm33.5 in hvm mode.
>
> # xm info
> host : dpm8800-09
> release : 2.6.32.16
> version : #1 SMP Wed Aug 4 15:38:21 PDT 2010
> machine : x86_64
> nr_cpus : 16
> nr_nodes : 2
> cores_per_socket : 4
> threads_per_core : 2
> cpu_mhz : 2533
> hw_caps :
> bfebfbff:28100800:00000000:00001b40:009ce3bd:00000000:00000001:00000000
> virt_caps : hvm hvm_directio
> total_memory : 12277
> free_memory : 11631
> node_to_cpu : node0:0,2,4,6,8,10,12,14
> node1:1,3,5,7,9,11,13,15
> node_to_memory : node0:5601
> node1:6029
> node_to_dma32_mem : node0:3506
> node1:0
> max_node_id : 1
> xen_major : 4
> xen_minor : 0
> xen_extra : .1-rc4
> xen_caps : xen-3.0-x86_64 xen-3.0-x86_32p hvm-3.0-x86_32
> hvm-3.0-x86_32p hvm-3.0-x86_64
> xen_scheduler : credit
> xen_pagesize : 4096
> platform_params : virt_start=0xffff800000000000
> xen_changeset : unavailable
> xen_commandline : dom0_mem=512M dom0_max_vcpus=1 dom0_vcpus_pin=true
> iommu=1,passthrough,no-intremap loglvl=all loglvl_guest=all loglevl=10 debug
> apic=on apic_verbosity=verbose extra_guest_irqs=80 com1=115200,8n1
> console=com1 console_to_ring xen-pciback.permissive acpi=force numa=on
> cc_compiler : gcc version 4.4.3 (Ubuntu 4.4.3-4ubuntu5)
> cc_compile_by : bedge
> cc_compile_domain : lsi.com <http://lsi.com>
> cc_compile_date : Sun Aug 1 09:44:29 PDT 2010
> xend_config_format : 4
>
> This device (as well as a few more of these) is passed through via pciback:
>
> dpm8800-09:~# lspci | grep 10:
> 10:00.0 Fibre Channel: PMC-Sierra Inc. Device 8032 (rev 08)
> 10:00.1 Fibre Channel: PMC-Sierra Inc. Device 8032 (rev 08)
> 10:00.2 Fibre Channel: PMC-Sierra Inc. Device 8032 (rev 08)
> 10:00.3 Fibre Channel: PMC-Sierra Inc. Device 8032 (rev 08) <- on both cases
> it's this device that loses the interrupt in flight
>
> 10:00.3 Fibre Channel: PMC-Sierra Inc. Device 8032 (rev 08)
> Flags: bus master, fast devsel, latency 0, IRQ 5
> I/O ports at a800 [size=256]
> I/O ports at ac00 [size=256]
> Memory at fbdc0000 (64-bit, non-prefetchable) [size=32K]
> Capabilities: [50] Power Management version 3
> Capabilities: [60] Message Signalled Interrupts: Mask- 64bit+
> Queue=0/1 Enable-
> Capabilities: [70] Express Endpoint, MSI 01
> Capabilities: [b0] MSI-X: Enable- Mask- TabSize=9
> Capabilities: [100] Advanced Error Reporting <?>
>
>
> From host dpm8800-10:
> (XEN) IRQ: 133 affinity:00000000,00000000,00000000,00000001 vec:94
> type=PCI-MSI status=00000050 in-flight=0 domain-list=2:126(----),
> (XEN) IRQ: 134 affinity:00000000,00000000,00000000,00000001 vec:d4
> type=PCI-MSI status=00000050 in-flight=1 domain-list=2:125(---M),
> (XEN) IRQ: 135 affinity:00000000,00000000,00000000,00000004 vec:9c
> type=PCI-MSI status=00000010 in-flight=0 domain-list=2:124(----),
>
> From host dpm8800-09:
> (XEN) IRQ: 131 affinity:00000000,00000000,00000000,00002000 vec:7f
> type=PCI-MSI status=00000010 in-flight=0 domain-list=1: 62(----),
> (XEN) IRQ: 132 affinity:00000000,00000000,00000000,00000001 vec:dd
> type=PCI-MSI status=00000010 in-flight=1 domain-list=2:127(---M),
> (XEN) IRQ: 133 affinity:00000000,00000000,00000000,00000001 vec:3e
> type=PCI-MSI status=00000010 in-flight=0 domain-list=2:126(----),
>
> This time both cases correspond to 10:00.3:
>
> (XEN) 10:00.3 - dom 2 - MSIs < 132 >
>
> (XEN) MSI 132 vec=dc fixed edge assert phys cpu dest=00000010
> mask=0/0/-1
>
>
> Let me know if there's anything else I can provide to assist in diagnosing
> this problem.
>
> Thanks
>
> -Bruce
>
>>
>>> (XEN) IRQ: 66 affinity:00000000,00000000,00000000,00000001 vec:b9
>>> type=PCI-MSI status=00000010 in-flight=1 domain-list=1: 79(---M),
>>> (XEN) IRQ: 67 affinity:00000000,00000000,00000000,00000004 vec:d9
>>> type=PCI-MSI status=00000010 in-flight=0 domain-list=1: 78(----),
>>> (XEN) IRQ: 68 affinity:00000000,00000000,00000000,00000010 vec:22
>>> type=PCI-MSI status=00000010 in-flight=0 domain-list=1: 77(----),
>>> (XEN) IRQ: 69 affinity:00000000,00000000,00000000,00000040 vec:2a
>>> type=PCI-MSI status=00000010 in-flight=0 domain-list=1: 76(----),
>>>
>>> (XEN) 07:00.3 - dom 1 - MSIs < 69 >
>>> (XEN) 07:00.2 - dom 1 - MSIs < 68 >
>>> (XEN) 07:00.1 - dom 1 - MSIs < 67 >
>>> (XEN) 07:00.0 - dom 1 - MSIs < 66 >
>>>
>>> (XEN) MSI 66 vec=b9 fixed edge assert phys cpu dest=00000000
>>> mask=0/0/-1
>>> (XEN) MSI 67 vec=d9 fixed edge assert phys cpu dest=00000004
>>> mask=0/0/-1
>>> (XEN) MSI 68 vec=22 fixed edge assert phys cpu dest=00000002
>>> mask=0/0/-1
>>> (XEN) MSI 69 vec=2a fixed edge assert phys cpu dest=00000006
>>> mask=0/0/-1
>>>
>>> Thanks.
>>>
>>> Dante
>>
>>
>>
>
_______________________________________________
Xen-devel mailing list
Xen-devel@xxxxxxxxxxxxxxxxxxx
http://lists.xensource.com/xen-devel
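
The symptom reported throughout this thread is an IRQ entry with in-flight=1 and an M in its flags, e.g. IRQ 134 on dpm8800-10. When the dump is long, such entries can be picked out mechanically. What follows is only a minimal Python sketch: the function name find_stuck_irqs and the regular expression are assumptions made for illustration, not Xen tooling, and the pattern is modelled solely on the dump lines quoted in this thread (it reads only the first guest in each domain-list).

```python
import re

# Pattern modelled on the IRQ dump entries quoted in this thread, e.g.
#   (XEN) IRQ: 134 affinity:... vec:d4 type=PCI-MSI status=00000050
#   in-flight=1 domain-list=2:125(---M),
# Assumption: only the first domain:pirq(flags) entry of domain-list is read.
IRQ_RE = re.compile(
    r"IRQ:\s*(?P<irq>\d+)\s+affinity:\S+\s+vec:(?P<vec>[0-9a-fA-F]+)\s+"
    r"type=(?P<type>\S+)\s+status=(?P<status>[0-9a-fA-F]+)\s+"
    r"in-flight=(?P<in_flight>\d+)\s+"
    r"domain-list=(?P<dom>\d+):\s*(?P<pirq>\d+)\((?P<flags>[^)]*)\)"
)


def find_stuck_irqs(dump_text):
    """Return the IRQ entries that show in-flight > 0 together with an M flag."""
    # Mail clients prefix lines with '>' and wrap each entry over two lines,
    # so drop leading quote markers and collapse all whitespace first.
    text = re.sub(r"^[>\s]+", "", dump_text, flags=re.M)
    text = " ".join(text.split())

    stuck = []
    for m in IRQ_RE.finditer(text):
        if int(m.group("in_flight")) > 0 and "M" in m.group("flags"):
            stuck.append({
                "irq": int(m.group("irq")),
                "vec": m.group("vec"),
                "type": m.group("type"),
                "domain": int(m.group("dom")),
                "pirq": int(m.group("pirq")),
                "flags": m.group("flags"),
            })
    return stuck
```

Feeding it the saved console output, for example find_stuck_irqs(open("xen-dump.txt").read()) with a hypothetical file name, gives the candidates to cross-check against the per-device "MSIs < ... >" lines.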