Re: [Xen-users] Xen 4.0.0 - tapdisk2 "hangs"
On Tue, May 4, 2010 at 2:09 PM, Heiko Wundram <modelnine@xxxxxxxxxxxxx> wrote:
> Hey all!
>
> I'm currently in the process of migrating a (Gentoo-based) Xen server to use
> Xen 4.0.0 (where I'm using the Xen ebuilds from bugs.gentoo.org), and I'm
> having severe problems with tapdisk2 (which I wish to use to do I/O
> prioritizing using CFQ on the LVM-based backing storage of a virtual
> server).
>
> It seems that after a while of heavy I/O in the virtual domain, the
> communication between the (paravirtualized) DomU and Dom0 (the
> tapdisk2 process) breaks, in that no more interrupts are delivered to Dom0
> for I/O requests from the virtual domain, and as such the virtual host
> "loses" its hard disk (but does not "break" besides not responding). The
> network front-/backend is not affected by this communication loss, AFAICT.
>
> The virtual host can be destroyed by an xm destroy, but the created blktap2
> interface does not disappear until the next reboot, and cannot be removed by
> the respective sysfs accesses (rather, echoing a 1 into "remove" blocks,
> too, and is "unkillable", i.e. stays in kernel space). After a blktap2
> device has entered this broken state, no more hosts can be created by xm
> create (that blocks, too), and the host system must be rebooted to enter a
> usable state again.
>
> I've not been able to provoke this breakage by "normal" I/O (i.e., when the
> hosts run normally), but I have been able to provoke it by using bonnie,
> which after a short period of sustained read/write I/O of +120MB/s will
> freeze the blktap2 device.
>
> The Dom0 and the DomU kernels that are being used are xen-sources-2.6.32-r1
> (which are the xen-stable 2.6.32.10 [11?] based OpenSuSE Xen-kernel sources,
> AFAIK) from the official portage tree; the kernel configuration that's in
> use is attached.
>
> I've tried iommu=off for xen (the mobo doesn't support VT-d anyway, so Xen
> never turns it on), and I've also looked for any signs of errors appearing
> when setting verbosity 9 for the blktap2 module and loglvl=all and
> guest_loglvl=all for Xen, but there are no errors that I've seen so far.
>
> Strace-ing the tapdisk2 process reveals that it's blocked on select(), and
> none of the descriptors it's polling on ever return as readable (which is
> the condition that tapdisk2 queries); rather, they always time out after 600s.
>
> Thanks in advance for any hint as to what is causing this, or if there's
> anything I might try to get things working...
>
> PS: I have to boot with acpi=off, as the mobo won't reboot when acpi is
> turned on for Dom0 (not even when disabling ACPI reboots), but booting with
> acpi enabled doesn't change the fact that blktap2 blocks.
>
> --- Heiko.
>
> _______________________________________________
> Xen-users mailing list
> Xen-users@xxxxxxxxxxxxxxxxxxx
> http://lists.xensource.com/xen-users
>

I have had exactly the same problem and ended up going back to tapdisk1.

I was able to replicate the problem using the entire SLE11-SP1 kernel source
patch set, which proves that the bug exists upstream. Unfortunately, I am very
busy on other projects at the moment, so I have not had time to debug it at
all.

The SLE11-SP1 tree has been updated since xen-sources-2.6.32-r1. I will make
an updated set of patches for you to try, but it will take me a couple of
days.

Andy
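For anyone reading the archive who wants to see what the strace symptom quoted above looks like in isolation, here is a minimal sketch in Python. It is only an illustration of a select() call that waits on a descriptor which never becomes readable and therefore returns on timeout; it is not tapdisk2's code, and the helper name wait_readable and the short timeouts are assumptions made for the example (the quoted report mentions 600s).

```python
import os
import select


def wait_readable(fd, timeout_s):
    """Return True if fd becomes readable within timeout_s seconds, else False.

    Hypothetical helper for illustration; not taken from tapdisk2.
    """
    readable, _, _ = select.select([fd], [], [], timeout_s)
    return fd in readable


if __name__ == "__main__":
    # A pipe stands in for the descriptor that the backend process polls.
    read_fd, write_fd = os.pipe()

    # Nothing has been written, so select() returns only when the timeout
    # expires. This mirrors the "never readable, always times out" symptom.
    print("readable before write:", wait_readable(read_fd, 0.2))

    # Once the other side signals (here: writes a byte), select() returns
    # right away and reports the descriptor as readable.
    os.write(write_fd, b"x")
    print("readable after write:", wait_readable(read_fd, 0.2))

    os.close(read_fd)
    os.close(write_fd)
```

In the situation described in the quote, the equivalent of the write never arrives from the guest side, so every select() call ends in a timeout.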