
[Xen-devel] Remus: Possible disk replication consistency bug


Short version:

1. Is there any way to get disk replication to work with the blktap2 driver
when I use two file disk images (say, a disk image and a swap image)?

2. [Possible bug] How does Remus guarantee that, when a replicated VM boots
on the backup physical machine after failover, its memory state is
consistent with its disk state? Remus uses two separate
channels, one for memory updates, and the other for disk updates. The
primary decides when to send individual commit messages to each of these
channels, but there appears to be no mechanism in place at the backup site
to coordinate if and when these updates should be applied. Thus, we have
the following execution scenario:

- Backup receives commit for disk updates for epoch E
- Primary crashes before sending commit for memory updates for epoch E
- Backup resumes the execution of the guest VM using the latest available information
- The guest VM's memory state corresponds to epoch E - 1 while its disk
state corresponds to epoch E. This is inconsistent.
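The scenario above can be illustrated with a small simulation. This is a minimal sketch of my own, not Remus code: each channel's commit is applied by the backup as soon as it arrives, so a crash that lands between the disk commit and the memory commit of the same epoch leaves the resumed VM in a mixed state.

```python
# Minimal sketch (not Remus code): two uncoordinated commit channels.
# The backup applies each commit as soon as it arrives, so a crash
# between the two commits of the same epoch leaves the state mixed.

class Backup:
    def __init__(self):
        self.disk_epoch = 0
        self.memory_epoch = 0

    def commit_disk(self, epoch):
        self.disk_epoch = epoch      # applied immediately on arrival

    def commit_memory(self, epoch):
        self.memory_epoch = epoch    # applied immediately on arrival

backup = Backup()
E = 5

# Epochs 1..E-1 complete normally on both channels.
for e in range(1, E):
    backup.commit_disk(e)
    backup.commit_memory(e)

# Epoch E: the disk commit arrives, then the primary crashes before
# the memory commit is sent.
backup.commit_disk(E)
# (primary crashes here)

# Failover: the resumed VM mixes epoch E disk state with epoch E-1 memory.
assert backup.disk_epoch == E and backup.memory_epoch == E - 1
```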

Long version:


In the simplified version of what I need, I have a single guest VM that has
access to two disk images, disk.img and swap.img, which are both replicated
using the blktap2 driver.

In order to achieve this, I used the following guide,
http://remusha.wikidot.com/ , which nets me the following setup
configuration (I deviated a little when creating the guest):

- Xen 4.2 (unstable), changeset 24465:5b2676ac1321
- Dom0 kernel: Linux v2.6.32.40 x86_64 (commit 2b494f184d3337d10d59226e3632af56ea66629a)
- DomU kernel: Linux 3.0.4 x86_64

I have a guest VM, named frank, whose configuration file, frank.cfg,
contains the following parameters:

disk = [

which is correct, according to the documentation guidelines posted on
http://nss.cs.ubc.ca/remus/doc.html . Notice that I have assigned
a different tapdisk remote server channel (port) to disk.img and swap.img.
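For concreteness, a disk line of this kind, with one tap2:remus entry per image and each image on its own port, looks roughly as follows (the hostname, paths and port numbers here are placeholders, not my actual values):

```
disk = [ 'tap2:remus:backup-host:9000|aio:/path/to/disk.img,xvda1,w',
         'tap2:remus:backup-host:9001|aio:/path/to/swap.img,xvda2,w' ]
```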

Assigning the same port number to both of them will not work, since both
the primary and the backup physical machines spawn one tapdisk daemon
per disk image. I suppose that both daemons on the backup try to bind to
the same port number and, thus, one of them fails. This causes the whole
procedure to hang. (In fact, the affected tapdisk instances on the primary
and the backup enter some kind of busy-polling function and consume 100%
of the CPU assigned to them.)
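The bind conflict itself is easy to reproduce outside of tapdisk. A minimal sketch with plain Python sockets, nothing tapdisk-specific: the second listener on the same port fails with EADDRINUSE, which matches one daemon failing while the other keeps running.

```python
# Minimal sketch: two listeners on the same port, as the two tapdisk
# daemons on the backup would attempt. The second bind() fails.
import errno
import socket

a = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
a.bind(("127.0.0.1", 0))          # let the OS pick a free port
port = a.getsockname()[1]
a.listen(1)

b = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
try:
    b.bind(("127.0.0.1", port))   # same port: fails with EADDRINUSE
except OSError as e:
    assert e.errno == errno.EADDRINUSE
finally:
    b.close()
    a.close()
```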

Surprisingly, however, things are not much different when I assign
different ports for replicating disk.img and swap.img. This is something
that I cannot explain myself, and it is where I ultimately gave up on
trying to get disk replication working with blktap2.

Note that if I disable access to swap.img in frank.cfg, then the whole
process works as it should, disk and memory replication and all, which
demonstrates that I have a working setup between my physical machines.


In the meantime, I have also been digging into the source code of Remus,
blktap2 and some parts of drbd, and I think I may have come across a
possible bug. If my observations are correct, then it is possible that,
after a (very unlucky) primary machine failure, the replicated VM is
resumed on the backup machine with its memory in epoch A state and its
disk in epoch B state.

- If we are using blktap2, then it can be that A = B + 1 (the disk state is
one epoch behind the memory state) or A = B - 1 (the disk state is one
epoch ahead of the memory state).
- If we are using drbd, then it can be that A = B - 1 (the disk state is
one epoch ahead of the memory state).

Remus is using two different channels of communication, one for memory
updates and one for disk updates. If I understand the code structure
correctly, the issue I describe stems from the fact that Remus is also
using these channels to send two separate commit messages; one to the
xc_restore process, for memory, and one to the server tapdisk2 daemon
(similar for drbd) for disk updates.

These messages are needed in order to trace the boundaries of checkpoint
epochs on each channel. However, what I feel is missing (or what I have
not been able to find) is a process on the backup machine that decides
when the local VM state (memory or disk) gets updated. For example, the
backup should update the state of a VM to epoch A iff we have received all
updates pertaining to epoch A (disk and memory). Then, and only then, can
the backup send a Checkpoint Acknowledgement message to the primary, at
which point the primary can release the VM's network buffers.
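The coordination rule I have in mind could be as simple as the following sketch (my own pseudocode, not existing Remus code): buffer each channel's commit on the backup, and only apply the epoch, and only then acknowledge it, once both channels have delivered it.

```python
# Sketch of a backup-side coordinator (not existing Remus code):
# commits from either channel are buffered and applied atomically
# only when BOTH channels have delivered the same epoch.

class Coordinator:
    def __init__(self):
        self.applied_epoch = 0            # last fully consistent epoch
        self.pending = {}                 # epoch -> set of channels seen
        self.acks_sent = []               # acknowledgements to the primary

    def receive(self, channel, epoch):
        seen = self.pending.setdefault(epoch, set())
        seen.add(channel)
        if seen == {"disk", "memory"}:    # epoch complete on both channels
            self.applied_epoch = epoch    # apply disk + memory together
            del self.pending[epoch]
            self.acks_sent.append(epoch)  # only now ack, so the primary
                                          # may release network buffers

c = Coordinator()
c.receive("disk", 1)
assert c.applied_epoch == 0      # disk alone is not applied
c.receive("memory", 1)
assert c.applied_epoch == 1      # both arrived: applied and acked
assert c.acks_sent == [1]

# Primary crashes after sending only the disk commit for epoch 2:
c.receive("disk", 2)
assert c.applied_epoch == 1      # backup stays at the last full epoch
```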

The files of interest to us are the following:

- assuming that we are using disk replication with blktap2

Assume that the primary machine is about to send a commit message to the
backup machine. Thus, we are at line 1982 in the xc_domain_save.c file. The
primary is about to execute the discard_file_cache() function which, as a
consequence, causes the primary to do an fsync() on the migration socket.

I am not sure about the particular mechanics involved in calling fsync() on
a connected TCP socket, but I presume that the intended behaviour is to
wait until the last byte written to that socket is acknowledged by the
receiver (this violates the end-to-end argument, but works in the common
case where the two machines are connected back to back with a crossover
cable).

Then, the primary invokes the checkpoint callback, which brings us to the
commit() function at line 166 in remus. This function invokes the
buf.commit() command for each of the available buffers. For network
buffers, this causes them to release their output, whereas for disk
buffers, this causes remus to wait for an acknowledgement response for that
particular disk from the backup, as seen in device.py. Since disk buffers
have been inserted first, remus waits for acknowledgements for all disk
buffers before it releases any network buffer.
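As I read it, the primary-side ordering amounts to the following paraphrase in Python (my own sketch, not the actual remus code): disk buffers were registered first, so their acknowledgements are awaited before any network buffer releases its output.

```python
# Paraphrase of the primary-side commit ordering (not actual remus code):
# disk buffers are committed (and acknowledged) first, the network
# buffer releases its output last.

class DiskBuffer:
    def __init__(self, name):
        self.name = name
    def commit(self, log):
        # blocks until the backup acknowledges this disk's updates
        log.append(f"wait ack: {self.name}")

class NetBuffer:
    def commit(self, log):
        # releases the buffered outbound packets of the epoch
        log.append("release network output")

log = []
buffers = [DiskBuffer("disk.img"), DiskBuffer("swap.img"), NetBuffer()]
for buf in buffers:          # disk buffers first, network buffer last
    buf.commit(log)

assert log == ["wait ack: disk.img", "wait ack: swap.img",
               "release network output"]
```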

Notice that at line 89, in function postsuspend() in device.py, remus sends
a 'flush' message to the disk control channel, which is all it takes for
the secondary machine to release the pending disk updates to its disk
state.

The bug described can occur if the primary crashes after invoking the
discard_file_cache() command in xc_domain_save.c and before having a chance
to invoke any of the buf.commit() commands in remus. If the 'flush' message
has not left the primary's socket buffer yet, at the time of the crash,
then we have the A = B + 1 case outlined above, where the memory state is
one epoch ahead of the disk state. Similarly, if the primary crashes after
sending the 'flush' message but before calling the discard_file_cache()
command, we have the A = B - 1 case, where the VM's memory state is one
epoch behind the VM's disk state.

What's even more worrying, however, is that block-remus.c and
xc_domain_save.c have two entirely disjoint heartbeat mechanisms, which can
potentially amount to an entirely new level of trouble.

- assuming that we use drbd

The setup is similar to the one described for blktap2. In this case,
however, remus forces the primary to wait in the preresume() callback until
it has received an acknowledgement for the disk updates. Unless I am
missing something, this makes little sense to me, as we are keeping the VM
suspended for a round trip's worth of time when it could have been
running. Shouldn't the waiting logic be moved to the commit() function?

In any case, we have a similar scenario. drbd finishes sending the commit
message to the backup VM and the primary crashes immediately, before
returning from the postcopy callback. Thus, the backup machine receives a
commit for the disk updates but no commit for the memory updates. Since the
two do not coordinate with each other, it will happily apply the disk
updates and ruin memory-disk consistency on the guest VM.

- Epilogue

I think it would be better/cleaner/more consistent to have some kind of
remus server daemon running on the backup physical machine. That daemon
would coordinate when disk and memory updates are committed to the guest
VM's state (namely, when the daemon has received all checkpoint state
pertaining to a particular VM). As such, it is the daemon that should
decide when to send a Checkpoint Acknowledgement message to the primary
physical machine.