Re: [Xen-devel] Questions about GPLPV stability tests
> Today, yet another of my machines went down. To explain the setup, I'm
> running tests across a couple of similar machines. On some of the
> machines I updated to the latest 2.6.32.x and Xen 4.1.x (and I upgrade
> every machine which goes down). The machine which just went down
> wasn't using the latest versions yet, but some extra logging was
> enabled in 'dmesg'.
>
> This box still had Xen 4.1.2 and 2.6.32.37. I'm not sure yet whether the
> logging is all that useful, but it may be of some use. The log output
> is mostly from the serial console. I also looked at some log files on
> disk. Some files give clues, but some important log files didn't get
> synced to disk before bad things happened. The last timestamp I was
> able to find was '16:13:55', when some of our own software wrote a line
> to its logs.
>
> The few log files which did get synced to disk show that the VM started
> around 16:13:02, which matches the PCIe device being assigned to the VM:
>
> (XEN) [2011-12-10 16:13:03] [VT-D]iommu.c:1501: d0:PCIe: unmap bdf = 3:0.0
> (XEN) [2011-12-10 16:13:03] [VT-D]iommu.c:1374: d1109:PCIe: map bdf = 3:0.0
> (XEN) [2011-12-10 16:13:03] [VT-D]io.c:304: d1109: bind: m_gsi=55 g_gsi=32 device=4 intx=0
> (XEN) [2011-12-10 16:13:23] [VT-D]io.c:328: d1109: unbind: m_gsi=55 g_gsi=32 device=4 intx=0
> (XEN) [2011-12-10 16:13:23] [VT-D]io.c:386: d1109 unmap: m_irq=55 device=4 intx=0
> (XEN) [2011-12-10 16:13:23] [VT-D]io.c:304: d1109: bind: m_gsi=16 g_gsi=32 device=4 intx=0
>
> We are using the tapdisk driver for disk 'xvdc', which is referred to
> as 'tdc' I guess. It is a copy-on-write VHD disk which uses a raw disk
> image as its backing store. Apparently there are all sorts of read errors.

'tdc' would be what the disk is called in dom0.

Not sure the read errors themselves are that bad.

> [905276.510737] blkback: ring-ref 15658, event-channel 13, protocol 1 (unspecified, assuming native)
> [905276.512128] blkback: ring-ref 15658, event-channel 13, protocol 1 (unspecified, assuming native)
> [905277.113215] block tdc: sector-size: 512 capacity: 118785
> [905277.215872] blkback: ring-ref 15775, event-channel 14, protocol 1 (unspecified, assuming native)
> [905680.196850] end_request: I/O error, dev tdc, sector 45689
> [905680.197293] end_request: I/O error, dev tdc, sector 45689
> [905680.197624] end_request: I/O error, dev tdc, sector 45688
> [905680.197993] end_request: I/O error, dev tdc, sector 45696
>
> More blktap issues hours later, right? The real problem must have
> happened much earlier, then.

Any chance you can eliminate blktap from the picture? Can you use blkback?
> [912393.552881] blkback: ring-ref 15810, event-channel 13, protocol 1 (unspecified, assuming native)
> [912394.101667] block tdc: sector-size: 512 capacity: 118785
> [912394.384715] blkback: ring-ref 15660, event-channel 14, protocol 1 (unspecified, assuming native)
> [912402.433492] BUG: unable to handle kernel NULL pointer dereference at 00000000000002f0
> [912402.434077] IP: [<ffffffff81249ac2>] blktap_device_end_request+0x4e/0x68
> [912402.434384] PGD 384bb067 PUD 3a55f067 PMD 0
> [912402.434752] Oops: 0000 [#1] SMP
> [912402.435096] last sysfs file: /sys/devices/virtual/block/tdc/removable
> [912402.435363] CPU 2
> [912402.435653] Modules linked in: dm_snapshot dm_mod ipmi_si ipmi_devintf
> [912402.436092] Pid: 16994, comm: tapdisk2 Not tainted 2.6.32.37 #1 X8STi
> [912402.436595] RIP: e030:[<ffffffff81249ac2>] [<ffffffff81249ac2>] blktap_device_end_request+0x4e/0x68
> [912402.437117] RSP: e02b:ffff8800616f9d08 EFLAGS: 00010046
> [912402.437380] RAX: 0000000000000000 RBX: ffff880014ef29b0 RCX: 0000000000000018
> [912402.437884] RDX: 0000000000000000 RSI: 0000000000000000 RDI: ffff88007fcebd58
> [912402.438368] RBP: 0000000000000000 R08: ffffffff816a9488 R09: ffff8800123af040
> [912402.438855] R10: 00000001365d23c7 R11: ffffffff810861e1 R12: ffff88007bea4e70
> [912402.439358] R13: ffff88007b8aa800 R14: 0000000000000000 R15: ffff880014ef29b0
> [912402.439848] FS: 00007fde3bbe2730(0000) GS:ffff880028070000(0000) knlGS:0000000000000000
> [912402.440341] CS: e033 DS: 0000 ES: 0000 CR0: 000000008005003b
> [912402.440621] CR2: 00000000000002f0 CR3: 000000001f721000 CR4: 0000000000002660
> [912402.441102] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
> [912402.441583] DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7: 0000000000000400
> [912402.442083] Process tapdisk2 (pid: 16994, threadinfo ffff8800616f8000, task ffff88007bab4ec0)
> [912402.442569] Stack:
> [912402.442825] ffffea000185d038 0000000000000001 ffff88007bab4ec0 0000000000000000
> [912402.443199] <0> ffff88007b8aa800 ffffffff81248c3b 00000000000001ff 0000000000000000
> [912402.443871] <0> 00033e583f482159 000000000000e030 ffff88006f6da480 0000039f000003a0
> [912402.444793] Call Trace:
> [912402.445054] [<ffffffff81248c3b>] ? blktap_ring_ioctl+0x137/0x228
> [912402.445326] [<ffffffff810bea63>] ? core_sys_select+0x1ee/0x21e
> [912402.445599] [<ffffffff810dc64f>] ? really_put_req+0x62/0x8f
> [912402.445871] [<ffffffff8100e865>] ? xen_force_evtchn_callback+0x9/0xa
> [912402.446159] [<ffffffff8100eef2>] ? check_events+0x12/0x20
> [912402.446434] [<ffffffff810dc4e4>] ? aio_read_evt+0x26/0xe4
> [912402.446704] [<ffffffff810dd71a>] ? sys_io_getevents+0x10e/0x356
> [912402.446984] [<ffffffff810bcc47>] ? vfs_ioctl+0x56/0x6c
> [912402.447255] [<ffffffff810bd133>] ? do_vfs_ioctl+0x460/0x49e
> [912402.447543] [<ffffffff81059699>] ? getnstimeofday+0x55/0xaf
> [912402.447814] [<ffffffff810bd1c2>] ? sys_ioctl+0x51/0x70
> [912402.448083] [<ffffffff81011a02>] ? system_call_fastpath+0x16/0x1b
> [912402.448351] Code: e8 a5 f5 ff ff 49 8b 44 24 40 48 8b b8 f0 02 00 00 e8 a5 37 27 00 41 8b 54 24 60 44 89 f6 4c 89 e7 e8 85 13 f8 ff 49 8b 44 24 40 <48> 8b 80 f0 02 00 00 fe 00 ff 14 25 28 0d 6a 81 5b 5b 41 5c 41
> [912402.451398] RIP [<ffffffff81249ac2>] blktap_device_end_request+0x4e/0x68
> [912402.451733] RSP <ffff8800616f9d08>
> [912402.451997] CR2: 00000000000002f0
> [912402.452265] ---[ end trace f20c5f7e976fe78b ]---
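From the register dump, %rax is 0 and the faulting address is 0x2f0, so blktap_device_end_request appears to be reading a field at offset 0x2f0 of a pointer that is NULL by the time the request is completed. As a purely illustrative sketch (hypothetical function name, not the actual blktap source), a defensive guard of this general shape in a completion path would turn the crash into a warning and keep dom0 alive long enough to collect the rest of the logs:

/* Purely illustrative sketch - NOT the actual blktap code.
 * The oops (RAX == 0, CR2 == 0x2f0) suggests the completion path
 * dereferences a field of a request-related pointer that is NULL.
 */
#include <linux/blkdev.h>
#include <linux/kernel.h>

static void example_device_end_request(struct request *rq, int error)
{
	/* If 'rq' (or its queue) has already gone away, complain loudly
	 * and bail out instead of crashing the dom0 kernel. */
	if (WARN_ON_ONCE(!rq || !rq->q))
		return;

	__blk_end_request_all(rq, error);
}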
> Half a minute later I get tons of these:
>
> [912432.458625] ata6.00: exception Emask 0x0 SAct 0xf SErr 0x0 action 0x6 frozen
> .. frozen ..
> [912432.458893] ata6.00: failed command: READ FPDMA QUEUED
> [912432.459172] ata6.00: cmd 60/08:00:f2:13:e3/00:00:0d:00:00/40 tag 0 ncq 4096 in
>
> And tons of these:
>
> [912478.690536] sd 5:0:0:0: [sda] Unhandled error code
> [912478.690798] sd 5:0:0:0: [sda] Result: hostbyte=DID_BAD_TARGET driverbyte=DRIVER_OK
> [912478.691306] sd 5:0:0:0: [sda] CDB: Read(10): 28 00 00 e6 95 05 00 00 68 00
>
> I wonder if we are having real SATA issues or whether this is somehow
> caused by the 'real' problem, which is tapdisk?

That really looks like what causes blktap to die, when the command fails.

> Finally I also get timeouts in the network code. I think these are
> just caused by the bad things happening.
>
> [912737.470673] WARNING: at net/sched/sch_generic.c:261 dev_watchdog+0xd2/0x16e()
> [912737.471170] Hardware name: X8STi
> [912737.471426] NETDEV WATCHDOG: eth0 (e1000e): transmit queue 0 timed out

... and then this times out as well.

> [912737.471686] Modules linked in: dm_snapshot dm_mod ipmi_si ipmi_devintf
> [912737.472106] Pid: 0, comm: swapper Tainted: G D 2.6.32.37 #1
> [912737.472586] Call Trace:
> [912737.472838] <IRQ> [<ffffffff813c8e12>] ? dev_watchdog+0xd2/0x16e
> [912737.473130] [<ffffffff813c8e12>] ? dev_watchdog+0xd2/0x16e
> [912737.473391] [<ffffffff81040fb1>] ? warn_slowpath_common+0x77/0xa3
> [912737.473668] [<ffffffff813c8d40>] ? dev_watchdog+0x0/0x16e
> [912737.473928] [<ffffffff81041039
> ] ? warn_slowpk
> _dev_program_evt+ 0xe3/0x18c
> [90/0x60
> [912737.474988] [<ffffffff8100e865>] ? xen_force_evtchn_callback+0x9/0xa
> [912737.480706] [<ffffffff8100eef2>] ? check_events+0x12/0x20
> [912737.480965] [<ffffffff813c8d15>] ? netif_tx_lock+0x3d/0x68
> [912737.481222] [<ffffffff813c8d40>] ? dev_watchdog+0x0/0x16e
> [912737.481481] [<ffffffff813b83a0>] ? netdev_drivername+0x3b/0x40
> [912737.481741] [<ffffffff813c8e12>] ? dev_watchdog+0xd2/0x16e
> [912737.482015] [<ffffffff81049614>] ? run_timer_softirq+0x16c/0x1d9
> [912737.482274] [<ffffffff81078fb8>] ? handle_percpu_irq+0x4e/0x6a
> [912737.482534] [<ffffffff81045bc8>] ? __do_softirq+0x91/0x116
> [912737.482794] [<ffffffff81012b6c>] ? call_softirq+0x1c/0x30
> [912737.483052] [<ffffffff810140e7>] ? do_softirq+0x3f/0x7c
> [912737.483324] [<ffffffff81045a64>] ? irq_exit+0x36/0x76
> [912737.483584] [<ffffffff8123b6ad>] ? xen_evtchn_do_upcall+0x33/0x42
> [912737.483843] [<ffffffff81012bbe>] ? xen_do_hypervisor_callback+0x1e/0x30
> [912737.484101] <EOI> [<ffffffff8100eedf>] ? xen_restore_fl_direct_end+0x0/0x1
> [912737.484394] [<ffffffff810093aa>] ? hypercall_page+0x3aa/0x1001
> [912737.484670] [<ffffffff810093aa>] ? hypercall_page+0x3aa/0x1001
> [912737.484930] [<ffffffff8100e8e3>] ? xen_safe_halt+0xc/0x15
> [912737.485188] [<ffffffff81017f07>] ? default_idle+0x21/0x3d
> [912737.485447] [<ffffffff81010df1>] ? cpu_idle+0x59/0x91
> [912737.485704] ---[ end trace f20c5f7e976fe78c ]---
>
> Does this feel like blktap issues? Or is it more likely that something
> else happened which causes blktap and the other things to fail?
It looks like the interrupts stopped. Perhaps you can run the C code
(attached) on the serial console and see _what_ (or perhaps _when_) the
interrupts stop, and also verify that the timeouts and the 'frozen'
messages happen because no interrupts are being received.

There are a couple of bugs that were recently discovered in the interrupt
code (they had been there since 2.6.27!) that went into the stable tree -
so they should have been backported to the 2.6.32.x tree - but I honestly
don't recall the names of the patches. I can look them up tomorrow.
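The attachment itself is not reproduced in this archive. As a rough, hypothetical stand-in for the kind of check being suggested, a small userspace program along these lines could be left running on the serial console to show whether the counts in /proc/interrupts keep advancing (the 'ahci', 'eth0' and 'blkif' match strings are examples only and would need adjusting to the interrupt names on the actual machine):

/* Hypothetical stand-in for the attached program: periodically dump
 * selected lines from /proc/interrupts so that, watching the serial
 * console, you can see when a given interrupt line stops incrementing. */
#include <stdio.h>
#include <string.h>
#include <unistd.h>

int main(void)
{
	char line[1024];

	for (;;) {
		FILE *f = fopen("/proc/interrupts", "r");

		if (!f) {
			perror("fopen /proc/interrupts");
			return 1;
		}
		while (fgets(line, sizeof(line), f)) {
			/* Match strings are examples; adjust to the
			 * devices of interest on the failing box. */
			if (strstr(line, "ahci") || strstr(line, "eth0") ||
			    strstr(line, "blkif"))
				fputs(line, stdout);
		}
		fclose(f);
		puts("----");
		fflush(stdout);
		sleep(5);
	}
	return 0;
}

Comparing successive snapshots should show whether the counts for the SATA controller and the NIC stop advancing around the time the 'frozen' and watchdog messages appear.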