
Re: [Xen-devel] Remus: Possible disk replication consistency bug

Hi dmelisso,

On Fri, Feb 10, 2012 at 11:02 AM, Dimitrios Melissovas <dimitrios.melissovas@xxxxxxx> wrote:

Short version:

1. Is there any way to get disk replication to work with the blktap2 driver
when I use two file disk images (say, a disk image and a swap image)?

It used to be possible. I have run Remus with three blktap2 disks.

I even remember fixing an issue in the blktap2 code in unstable/4.1-testing
that broke the replication (but that was almost a year ago).

Let me check what has changed in the unstable repo to break the blktap2 replication again.

2. [Possible bug] How does Remus guarantee that when, after failover, a
replicated VM boots at the backup physical machine, its memory state is
going to be consistent with its disk state?

Currently there is no synchronization mechanism. I certainly agree that it is a bug,
albeit a very rare one: I have not been able to reproduce it or observe any side effects.

I even have a fix for the DRBD version, though not in the daemonized form that you are talking about.
I haven't found time to add a similar fix to the blktap2 version, which is why I haven't sent out a fix.

Remus uses two separate
channels, one for memory updates, and the other for disk updates. The
primary decides when to send individual commit messages to each of these
channels, but there appears to be no mechanism in place at the backup site
to coordinate if and when these updates should be applied. Thus, we have
the following execution scenario:

- Backup receives commit for disk updates for epoch E
- Primary crashes before sending commit for memory updates for epoch E
- Backup resumes the execution of the guest VM using the latest available information
- The guest VM's memory state corresponds to epoch E - 1, while its disk state
  corresponds to epoch E. This is inconsistent.
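The race described above can be illustrated with a small sketch (the names here are hypothetical, not the actual Remus code): the backup applies each channel's commit as soon as it arrives, so a primary crash between the disk commit and the memory commit for the same epoch leaves the two states mismatched.

```python
# Hypothetical model of the uncoordinated two-channel commit: disk and
# memory commits are applied independently as they arrive on their
# respective channels, with no cross-channel coordination on the backup.

class Backup:
    def __init__(self):
        self.disk_epoch = 0
        self.mem_epoch = 0

    def commit_disk(self, epoch):
        # Applied the moment the disk channel delivers its commit.
        self.disk_epoch = epoch

    def commit_memory(self, epoch):
        # Applied the moment the memory channel delivers its commit.
        self.mem_epoch = epoch

    def failover(self):
        # Resume from whatever each channel last committed.
        return self.mem_epoch, self.disk_epoch

backup = Backup()
E = 5
backup.commit_memory(E - 1)   # epoch E-1 fully committed earlier
backup.commit_disk(E - 1)
backup.commit_disk(E)         # disk commit for epoch E arrives...
# ...primary crashes before sending the memory commit for epoch E.
mem, disk = backup.failover()
assert (mem, disk) == (4, 5)  # memory at E-1, disk at E: inconsistent
```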

I think it would be better/cleaner/more consistent to have some kind of
Remus server daemon running on the backup physical machine. That daemon
would coordinate when disk and memory are to be committed to the guest VM's
state (that is, once the daemon has received all checkpoint state pertaining
to a particular VM). As such, it is the daemon that should decide when
to send a Checkpoint Acknowledgement message to the primary physical machine.

Yes, I agree. Stefano posted patches on xen-devel introducing callbacks in xc_domain_restore.

My intention is to add one or more callbacks that would
(a) send explicit checkpoint acknowledgements to the primary (rather than relying on the fsync() on the primary), and
(b) only send the checkpoint ack after ensuring that all disks have received their checkpoints.
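As a rough sketch of what such restore-side coordination could look like (illustrative Python, not the actual patches): the backup buffers the per-epoch checkpoint pieces and only commits, and acks, once both the disk and memory parts of an epoch have arrived, so failover can never mix epochs.

```python
# Hedged sketch of the proposed coordination: checkpoint pieces are
# staged per epoch and committed atomically only when complete.
# Channel names and message shapes are assumptions for illustration.

class Coordinator:
    CHANNELS = {"disk", "memory"}

    def __init__(self):
        self.committed_epoch = 0
        self.pending = {}          # epoch -> set of channels received

    def receive(self, epoch, channel):
        got = self.pending.setdefault(epoch, set())
        got.add(channel)
        if got == self.CHANNELS:
            # Both halves of the checkpoint are here: commit both
            # atomically, then send the single ack to the primary.
            self.committed_epoch = epoch
            del self.pending[epoch]
            return "ack"
        return None                # incomplete checkpoint: hold off

coord = Coordinator()
assert coord.receive(5, "disk") is None      # disk alone: buffered
assert coord.receive(5, "memory") == "ack"   # both present: commit + ack
assert coord.committed_epoch == 5
```

If the primary crashes after only the disk commit arrives, `committed_epoch` still points at the last complete checkpoint, which is exactly the consistency property the daemon proposal is after.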

With the explicit ack from the xc_domain_restore callback, one could actually get rid of the disk-level
acknowledgements. The primary would only send a "barrier" or "flush" to delimit checkpoints.
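A minimal sketch of that barrier scheme (assumed message format, not the real wire protocol): updates are buffered until the delimiting barrier arrives, then committed as a unit and acknowledged with a single explicit ack.

```python
# Illustrative barrier-delimited checkpoint stream: disk writes and
# memory pages arrive interleaved; a "BARRIER" marker closes each
# checkpoint, triggering one atomic commit and one ack.

def process_stream(messages):
    """Return the number of acks sent (one per complete checkpoint)."""
    staged, acks = [], 0
    for msg in messages:
        if msg == "BARRIER":
            # Everything for this checkpoint has arrived: commit the
            # staged updates atomically, then send the single ack.
            staged.clear()
            acks += 1
        else:
            staged.append(msg)     # buffer until the delimiting barrier
    return acks

assert process_stream(["disk:w1", "mem:p1", "BARRIER",
                       "disk:w2", "BARRIER"]) == 2
```

Note that updates arriving after the last barrier are never committed, which is the desired behavior: a checkpoint that was only partially received at crash time is simply discarded.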

Xen-devel mailing list
