[Xen-devel] Making snapshot of logical volumes handling HVM domU causes OOPS and instability
I use LVM volumes for domU disks. To create backups, I create a snapshot of the volume, mount the snapshot in the dom0, mount an equally-sized backup volume from another physical storage source, run an rsync from one to the other, unmount both, then remove the snapshot. This includes creating a snapshot and mounting NTFS volumes from Windows-based HVM guests.

This practice may not be perfect, but it worked fine for me for a couple of years while I was running Xen 3.2.1 and a linux-2.6.18.8-xen dom0 (and the same kernel for domU). After udev upgrades started complaining about the kernel being too old, I thought it was well past time to transition to a newer version of Xen and a newer dom0 kernel. This transition has been a gigantic learning experience, let me tell you.

After that transition, here's the problem I've been wrestling with and can't seem to find a solution for: It seems like any time I start manipulating a volume group to add or remove a snapshot of a logical volume that's used as a disk for a running HVM guest, new calls to LVM2 and/or Xen's storage lock up and spin forever.

The first time I ran across the problem, the only indication was that any command that had anything to do with LVM would freeze and could not be signaled. In other words, no error messages, nothing in dmesg, nothing in syslog... The commands would just freeze and not return. That was with the 2.6.31.14 kernel, which is what you currently get if you check out xen-4.0-testing.hg and just do a make dist.

I have since checked out and compiled 2.6.32.18, which comes from doing git checkout -b xen/stable-2.6.32.x origin/xen/stable-2.6.32.x, as described on the Wiki page here: http://wiki.xensource.com/xenwiki/XenParavirtOps

If I run that kernel for dom0, but continue to use 2.6.31.14 for the paravirtualized domUs, everything works fine until I try to manipulate the snapshots of the HVM volumes.
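For reference, here is a minimal sketch of the per-volume backup cycle described at the top of this message, written as a dry-run Python helper that only builds the command lines and does not execute anything. The actual script is not shown in this message, so the snapshot size, snapshot name, mount points, and rsync options below are assumptions for illustration. Guest disks that carry a partition table would also need an extra mapping step before mounting, which is left out here.

```python
from typing import List


def snapshot_backup_commands(
    vg: str,
    lv: str,
    backup_dev: str,
    snap_size: str = "1G",              # assumed; must hold writes made during the copy
    snap_mount: str = "/mnt/snap",      # hypothetical mount point
    backup_mount: str = "/mnt/backup",  # hypothetical mount point
) -> List[List[str]]:
    """Return the commands for one snapshot/rsync/cleanup pass, in order."""
    snap = f"{lv}-snap"  # hypothetical snapshot naming scheme
    return [
        # 1. Create a snapshot of the logical volume backing the guest disk.
        ["lvcreate", "--snapshot", "--size", snap_size, "--name", snap, f"{vg}/{lv}"],
        # 2. Mount the snapshot read-only in dom0.
        ["mount", "-o", "ro", f"/dev/{vg}/{snap}", snap_mount],
        # 3. Mount the backup volume from the other storage source.
        ["mount", backup_dev, backup_mount],
        # 4. Copy the snapshot contents onto the backup volume.
        ["rsync", "-a", "--delete", f"{snap_mount}/", f"{backup_mount}/"],
        # 5. Unmount both volumes.
        ["umount", snap_mount],
        ["umount", backup_mount],
        # 6. Remove the snapshot.
        ["lvremove", "--force", f"{vg}/{snap}"],
    ]


if __name__ == "__main__":
    # Print the plan for one hypothetical volume instead of running it.
    for argv in snapshot_backup_commands("vg0", "guest1-disk", "/dev/backupvg/guest1-disk"):
        print(" ".join(argv))
```

Steps 1 and 6 (lvcreate and lvremove against a volume that a running HVM guest is using) are the ones after which the hangs show up.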
Today, I got this kernel OOPS:

---------------------------
[78084.004530] BUG: unable to handle kernel paging request at ffff8800267c9010
[78084.004710] IP: [<ffffffff810382ff>] xen_set_pmd+0x24/0x44
[78084.004886] PGD 1002067 PUD 1006067 PMD 217067 PTE 80100000267c9065
[78084.005065] Oops: 0003 [#1] SMP
[78084.005234] last sysfs file: /sys/devices/virtual/block/dm-32/removable
[78084.005256] CPU 1
[78084.005256] Modules linked in: tun xt_multiport fuse dm_snapshot nf_nat_tftp nf_conntrack_tftp nf_nat_pptp nf_conntrack_pptp nf_conntrack_proto_gre nf_nat_proto_gre ntfs parport_pc parport k8temp floppy forcedeth [last unloaded: scsi_wait_scan]
[78084.005256] Pid: 22814, comm: udevd Tainted: G W 2.6.32.18 #1 H8SMI
[78084.005256] RIP: e030:[<ffffffff810382ff>] [<ffffffff810382ff>] xen_set_pmd+0x24/0x44
[78084.005256] RSP: e02b:ffff88002e2e1d18 EFLAGS: 00010246
[78084.005256] RAX: 0000000000000000 RBX: ffff8800267c9010 RCX: ffff880000000000
[78084.005256] RDX: dead000000100100 RSI: 0000000000000000 RDI: 0000000000000004
[78084.005256] RBP: ffff88002e2e1d28 R08: 0000000001993000 R09: dead000000100100
[78084.005256] R10: 800000016e90e165 R11: 0000000000000000 R12: 0000000000000000
[78084.005256] R13: ffff880002d8f580 R14: 0000000000400000 R15: ffff880029248000
[78084.005256] FS: 00007fa07d87f7a0(0000) GS:ffff880002d81000(0000) knlGS:0000000000000000
[78084.005256] CS: e033 DS: 0000 ES: 0000 CR0: 000000008005003b
[78084.005256] CR2: ffff8800267c9010 CR3: 0000000001001000 CR4: 0000000000000660
[78084.005256] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
[78084.005256] DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7: 0000000000000400
[78084.005256] Process udevd (pid: 22814, threadinfo ffff88002e2e0000, task ffff880019491e80)
[78084.005256] Stack:
[78084.005256] 0000000000600000 000000000061e000 ffff88002e2e1de8 ffffffff810fb8a5
[78084.005256] <0> 00007fff13ffffff 0000000100000206 ffff880003158003 0000000000000000
[78084.005256] <0> 0000000000000000 000000000061dfff 000000000061dfff 000000000061dfff
[78084.005256] Call Trace:
[78084.005256] [<ffffffff810fb8a5>] free_pgd_range+0x27c/0x45e
[78084.005256] [<ffffffff810fbb2b>] free_pgtables+0xa4/0xc7
[78084.005256] [<ffffffff810ff1fd>] exit_mmap+0x107/0x13f
[78084.005256] [<ffffffff8107714b>] mmput+0x39/0xda
[78084.005256] [<ffffffff8107adff>] exit_mm+0xfb/0x106
[78084.005256] [<ffffffff8107c86d>] do_exit+0x1e8/0x6ff
[78084.005256] [<ffffffff815c228b>] ? do_page_fault+0x2cd/0x2fd
[78084.005256] [<ffffffff8107ce0d>] do_group_exit+0x89/0xb3
[78084.005256] [<ffffffff8107ce49>] sys_exit_group+0x12/0x16
[78084.005256] [<ffffffff8103cc82>] system_call_fastpath+0x16/0x1b
[78084.005256] Code: 48 83 c4 28 5b c9 c3 55 48 89 e5 41 54 49 89 f4 53 48 89 fb e8 fc ee ff ff 48 89 df ff 05 52 8f 9e 00 e8 78 e4 ff ff 84 c0 75 05 <4c> 89 23 eb 16 e8 e0 ee ff ff 4c 89 e6 48 89 df ff 05 37 8f 9e
[78084.005256] RIP [<ffffffff810382ff>] xen_set_pmd+0x24/0x44
[78084.005256] RSP <ffff88002e2e1d18>
[78084.005256] CR2: ffff8800267c9010
[78084.005256] ---[ end trace 4eaa2a86a8e2da24 ]---
[78084.005256] Fixing recursive fault but reboot is needed!
---------------------------

After that was printed on the console, any command that interacts with Xen (xentop, xm) would freeze and not return. After I tried to do a sane shutdown on the guests, the whole dom0 locked up completely. Even the alt-sysrq things stopped working after looking at a couple of them.

I feel it's probably necessary to mention that this is after several fairly rapid-fire creations and deletions of snapshot volumes. I have it scripted to make a snapshot, mount it, mount a backup volume, rsync it, unmount both volumes, and delete the snapshot for 19 volumes in a row. In other words, there's a lot of disk I/O going on around the time of the lockup. It always seems to coincide with when it gets to the volumes that are used by active, running Windows Server 2008 HVM guests.
That may be just coincidental, though, because those are the last ones on the list. The 15 volumes used by active, running paravirtualized Linux guests are at the top of the list.

Another issue that comes up is that if I run the 2.6.32.18 pvops kernel for my Linux domUs, after a time (usually only about an hour or so), the network interfaces stop responding. I don't know if the problem is related, but it was something else that I noticed. The only way to get the network access to come back is to reboot the domU. When I reverted the domU kernel to 2.6.31.14, this problem went away.

I'm not 100% sure, but I think this issue also causes xm console to not allow you to type on the console that you connect to. If I connect to a console, then issue an xm shutdown on the same domU from another terminal, all of the console messages that show the play-by-play of the shutdown process display, but my keyboard input doesn't seem to make it through.

Since I'm not a developer, I don't know if these questions are better suited for the xen-users list, but since it generated an OOPS with the word "BUG" in capital letters, I thought I'd post it here. If that assumption was incorrect, just give me a gentle nudge and I'll redirect the inquiry to somewhere more appropriate. :)

If you need any more information about my setup, the steps used to recreate the problem, or other debugging information, I'll be happy to accommodate. Just let me know what you need and how I can get it.

Here's some more information about my setup: http://www.pridelands.org/~simba/hurricane-server.txt

--
Scott Garron

_______________________________________________
Xen-devel mailing list
Xen-devel@xxxxxxxxxxxxxxxxxxx
http://lists.xensource.com/xen-devel