
Re: [Xen-devel] Remus blktap2 issue


  • To: Xen-devel@xxxxxxxxxxxxxxxxxxx
  • From: Jonathan Kirsch <kirsch.jonathan@xxxxxxxxx>
  • Date: Fri, 10 Sep 2010 14:16:33 -0700
  • Cc:
  • Delivery-date: Fri, 10 Sep 2010 14:17:32 -0700
  • List-id: Xen developer discussion <xen-devel.lists.xensource.com>

Hi,

Thanks for the response.  I looked at all of the error messages and didn't see anything jumping out at me.  Stumped, I decided to try upgrading to a Gigabit switch (I had been running 100Mbps).  This partially solves the problem, as described below. 

The good thing is that the problem I had been describing before (frozen backup, no disk replication) no longer happens.  I now see that the disk image on the backup is being updated, and the backup is at least not frozen.

The bad thing is that even though the backup sits in the paused state before the fail-over occurs (as expected), once the fail-over happens it thinks the VM has been shut down and reboots it.  Thus, when I log into the backup's VNC console after pulling the network cable from the primary, I see the VM booting up -- any program that I had running on the primary is (obviously) no longer running.  When I log in on the backup, I do see disk modifications reflected in the file system, which leads me to believe that disk replication now works.

I'm not sure how to go about debugging why the backup thinks the domain has been shut down.  I've checked the various log files but nothing jumps out at me.  Any ideas about why this might be happening or which logs I should be looking at?  I'd be happy to post the logs if you think that would help.
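Since the question is which log lines to look for, here is a small sketch of the kind of filter I've been running over the backup's xend.log by hand. The sample lines and exact message wording are illustrative assumptions, not real output from my hosts; the point is just to pull out lines mentioning shutdown/restart decisions:

```python
import re

# Keywords xend tends to log around domain shutdown/restart decisions.
# The exact message format is an assumption; adjust patterns to your logs.
PATTERNS = [re.compile(p, re.IGNORECASE) for p in ("shutdown", "reason=", "restart")]

def interesting_lines(lines):
    """Return only the log lines that mention shutdown/restart events."""
    return [ln for ln in lines if any(p.search(ln) for p in PATTERNS)]

# Illustrative sample; normally you would read /var/log/xen/xend.log
# on the backup host instead.
sample = [
    "[2010-09-10 14:16:33] DEBUG (XendDomainInfo) handleShutdownWatch",
    "[2010-09-10 14:16:34] INFO (XendDomainInfo) Domain has shutdown: name=vm1 reason=reboot.",
    "[2010-09-10 14:16:35] DEBUG (XendDomainInfo) destroy: domid=5",
]

if __name__ == "__main__":
    for ln in interesting_lines(sample):
        print(ln)
```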

Note that I've confirmed that live migration (still) works perfectly.  I can migrate from one to the other and back again, with state maintained as it should be and without this "rebooting" issue coming up.

Thanks for the help,
Jon

P.S. I also don't understand why simply upgrading the switch would cause the other problems to go away.  Maybe I'm wrong, but it seems like using the Gigabit switch should have made synchronization happen faster but shouldn't have fundamentally changed the equation.  Any thoughts about this?



On Wed, Sep 8, 2010 at 12:43 PM, Shriram Rajagopalan <rshriram@xxxxxxxxx> wrote:


On Wed, Sep 8, 2010 at 11:33 AM, Jonathan Kirsch <kirsch.jonathan@xxxxxxxxx> wrote:
Hi,

Thanks a lot for the patch.  Unfortunately, this did not solve the problem for me (after applying the patch on both primary and backup, rebuilding and installing xen/tools/stubdom, and then rebooting both hosts).  The backup is still unable to create the disk device when the fail-over occurs.  Thus, although I see checkpoint traffic flowing from primary to backup, the state of the backup's disk image is never modified (as judged by the image's last-modified time).  The backup does switch from "paused" to "running," but it consumes 100% CPU, and when I connect to its VNC console it is as if the VM is frozen.  So *something* is being transferred, because I do see the screen from the primary, but obviously all is not right, because I can't interact with it at all.

Are there any error messages in the backup machine's syslog (or equivalent) about the tapdisks being used for the VM?

Are there any error messages in /var/log/xen/xend.log on the backup machine?
Out of curiosity, in your working Remus deployment, which dom0 kernel are you running (and which version of Xen)?  I'm running Xen 4.0.1 and the pvops 2.6.31.14 dom0 kernel.  My understanding was that Remus supported pvops dom0 2.6.31.x. 

I am running Xen 4.0.1 with pvops 2.6.32.18, but I have not run any HVM guests under Remus on my setup yet.  So if your current setup is able to run HVM domUs (without Remus), and you are also able to "live" migrate HVM domUs between the two machines, then the issue is somewhere else, IMO.
Any other ideas regarding what this might be a symptom of?  My naive interpretation is that it is not a networking configuration problem (since state is being transferred), but that it has something to do with setting up the tapdisk via tapdisk2.
  
Thanks,
Jon 

On Wed, Sep 8, 2010 at 1:50 AM, Shriram Rajagopalan <rshriram@xxxxxxxxx> wrote:
It's not just the tap2:remus:... notation.

There is a bug lurking in tools/python/xen/remus/device.py, in the ReplicatedDisk class.  The regular expression scans the domU config for only the tap:tapdisk:remus:... or tap:remus:... disk types.  I was able to get it working by fixing that regexp.
This applies to Xen 4.0.1 only; I am not sure about xen-unstable.
Here is a patch that might be of help to you (it's rather crude, but heck, I was too lazy :) )
diff -r b536ebfba183 tools/python/xen/remus/device.py
--- a/tools/python/xen/remus/device.py  Wed Aug 25 09:22:42 2010 +0100
+++ b/tools/python/xen/remus/device.py  Fri Sep 03 08:47:13 2010 -0700
@@ -36,10 +36,13 @@
         # to request commits.
         self.ctlfd = None
 
-        if not disk.uname.startswith('tap:remus:') and not disk.uname.startswith('tap:tapdisk:remus:'):
+        if not disk.uname.startswith('tap2:remus:') and not disk.uname.startswith('tap:remus:') and not disk.uname.startswith('tap:tapdisk:remus:'):
             raise ReplicatedDiskException('Disk is not replicated: %s' %
                                         str(disk))
-        fifo = re.match("tap:.*(remus.*)\|", disk.uname).group(1).replace(':', '_')
+        if disk.uname.startswith('tap2:remus:'):           
+            fifo = re.match("tap2:.*(remus.*)\|", disk.uname).group(1).replace(':', '_')
+        else:
+            fifo = re.match("tap:.*(remus.*)\|", disk.uname).group(1).replace(':', '_')
         absfifo = os.path.join(self.FIFODIR, fifo)
         absmsgfifo = absfifo + '.msg'
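For what it's worth, the two branches in the patch above could probably be folded into a single expression that accepts tap:remus:, tap2:remus:, and tap:tapdisk:remus: prefixes. This is a standalone sketch of that idea, not a drop-in replacement for device.py (the function name fifo_name is mine; the original does this inline in ReplicatedDisk.__init__):

```python
import re

# Matches tap:remus:..., tap2:remus:..., and tap:tapdisk:remus:... unames
# and captures the "remus:host:port" part before the '|' separator,
# mirroring what the patched branches in device.py each do separately.
REMUS_UNAME = re.compile(r"tap2?:(?:tapdisk:)?(remus[^|]*)\|")

def fifo_name(uname):
    """Derive the FIFO name the way device.py does, or None if not replicated."""
    m = REMUS_UNAME.match(uname)
    if m is None:
        return None
    return m.group(1).replace(':', '_')
```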



On Tue, Sep 7, 2010 at 11:01 PM, Pasi Kärkkäinen <pasik@xxxxxx> wrote:
On Tue, Sep 07, 2010 at 03:28:32PM -0700, Jonathan Kirsch wrote:
>    Hello,
>
>    I have been playing around with Remus on Xen 4.0.1, attempting to
>    fail-over for an HVM domU.
>
>    I've run into some problems that I think could be related to tapdisk2 and
>    its interaction with how one sets up Remus disk replication in the domU
>    config file.
>
>    A few things I've noticed:
>
>    -The tap:remus:backupHostIP:port|aio:imagePath notation does not work for
>    me, although this is what is written in the Remus documentation.  However,
>    I have found the following to work (i.e., not complain when starting
>    domU), so this is what I've been using:
>
>    tap2:remus:backupHostIP:port|aio:imagePath...
>

Yeah, this stuff was changed in Xen 4.0.1:
http://wiki.xensource.com/xenwiki/blktap2
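For anyone hitting the same thing, the two notations discussed in this thread look roughly like this in the domU config (the host, port, image path, and device name below are placeholder values):

```
# Documented notation (rejected on Xen 4.0.1 as shipped):
disk = [ 'tap:remus:192.168.1.2:9002|aio:/images/vm.img,xvda,w' ]

# Working blktap2 notation:
disk = [ 'tap2:remus:192.168.1.2:9002|aio:/images/vm.img,xvda,w' ]
```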

I guess someone should update the remus wiki page.

-- Pasi


_______________________________________________
Xen-devel mailing list
Xen-devel@xxxxxxxxxxxxxxxxxxx
http://lists.xensource.com/xen-devel



--
perception is but an offspring of its own self








