
Re: [Xen-devel] [RFC PATCH 0/7] COarse-grain LOck-stepping Virtual Machines for Non-stop Service






On Wed, Apr 3, 2013 at 3:02 AM, Wen Congyang <wency@xxxxxxxxxxxxxx> wrote:
Virtual machine (VM) replication is a well-known technique for providing
application-agnostic, software-implemented hardware fault tolerance -
"non-stop service". Currently, Remus provides this functionality, but it buffers
all output packets, and the resulting latency is unacceptable.

At Xen Summit 2012, we introduced a new VM replication solution: COLO
(COarse-grain LOck-stepping virtual machines). The presentation is available at
the following URL:
http://www.slideshare.net/xen_com_mgr/colo-coarsegrain-lockstepping-virtual-machines-for-nonstop-service

Here is the summary of the solution:
From the client's point of view, as long as the client observes identical
responses from the primary and secondary VMs, according to the service
semantics, the secondary VM (SVM) is a valid replica of the primary
VM (PVM) and can successfully take over when a hardware failure of the
PVM is detected.

This patchset is an RFC and implements the framework of COLO:
1. Both the PVM and the SVM are running
2. Forward the input packets from the client to the secondary machine (the slave)
3. Forward the output packets from the SVM to the primary machine (the master)
4. Compare the output packets from the PVM and the SVM on the master side. If the
   output packets differ, do a checkpoint (sketched below)
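
In other words, step 4's comparison logic amounts to a loop like the one below.
This is a minimal Python sketch only; the callables passed in are placeholders
for the real packet-capture and checkpoint machinery, not code in this patchset:

    def colo_compare_loop(next_pvm_pkt, next_svm_pkt, release, checkpoint):
        """Compare primary/secondary output packets; checkpoint on divergence.

        next_pvm_pkt/next_svm_pkt return the next output packet captured from
        the PVM and forwarded from the SVM; release() lets a packet out to the
        client; checkpoint() resynchronizes the SVM from the PVM.
        """
        while True:
            pvm_pkt = next_pvm_pkt()
            svm_pkt = next_svm_pkt()
            if pvm_pkt == svm_pkt:
                # Identical responses: the SVM is still a valid replica,
                # so the output can be released without buffering delay.
                release(pvm_pkt)
            else:
                # Divergence detected: force a checkpoint so the SVM catches
                # up, then release the primary's packet and keep comparing.
                checkpoint()
                release(pvm_pkt)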



I skimmed through the presentation. Interesting approach. It would be nice to have the performance report
mentioned in the slides available, so that we can understand the exact setups for the benchmarks.

A few quick thoughts after looking at the presentation:

 0. I am not completely sold on the type of applications you have used to benchmark the system. They seem
     stateless and don't have much memory churn (dirty pages/epoch).
     It would be nice to benchmark your system against something more realistic, like the DVDStore benchmark
     or percona-tools' TPCC benchmark with MySQL. [The clients have to be outside the system, mind you.]
     And finally, something like SPECweb2005, where there are about 1000 dirty pages per 25ms epoch.

     I care more about how many concurrent connections were handled by the server and how frequently you
     had to synchronize between the machines.

 1. The checkpoints are going to be very costly. If you are doing coarse-grained locking and assuming
     that checkpoints are triggered every second, you would probably lose all the benefits of checkpoint compression.
     Also, your working set will have grown considerably large.  You will inevitably end up taking the
     slow path, where you suspend the VM, "synchronously" send a ton of pages over the network
     (on the order of tens to hundreds of megabytes), and then resume the VM.  Replicating this checkpoint is going to take
     a long time and will hurt performance.

     The usual fast path has a small buffer (16/32 MB): the dirty pages are copied out to the buffer, and the buffer
     is asynchronously transmitted to the backup.
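
     What I mean by the fast/slow path split, roughly, is the decision below. This is a Python sketch under my own
     assumptions; the buffer size and helper names are placeholders, not real Remus or libxc interfaces:

    BUFFER_LIMIT = 32 << 20  # e.g. a 32 MB staging buffer for the fast path

    def send_checkpoint(dirty_bytes, suspend, resume, copy_to_buffer,
                        send_async, send_sync):
        """Illustrate the fast-path/slow-path split described above.

        suspend/resume pause and unpause the VM; copy_to_buffer stages the
        dirty pages locally; send_async/send_sync transmit them to the
        backup. All of these are hypothetical stand-ins.
        """
        if dirty_bytes <= BUFFER_LIMIT:
            # Fast path: the VM is paused only long enough to copy the dirty
            # pages into the local buffer; transmission overlaps with execution.
            suspend()
            copy_to_buffer()
            resume()
            send_async()
        else:
            # Slow path: the working set outgrew the buffer, so the VM stays
            # suspended while tens to hundreds of MB go over the network.
            suspend()
            send_sync()
            resume()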


2. What's the story for the DISK? The slides show that the VMs share a SAN disk.
    And if both the primary and the secondary are operational, whose packets are you going
    to discard, in a programmatic manner?

    While you have an FTP server benchmark, it doesn't demonstrate output consistency.
    I would suggest you run something like DVDStore (from Dell) or a simple MySQL TPCC workload
    and see if the clients raise a hue and cry about data corruption. ;)

3. What happens if one VM is running faster than the other?
     Let's say the application does a bunch of dependent reads/writes to/from the SAN:
     each write depends on the output of the previous read, and the writes are non-deterministic
     (i.e. they differ between the primary and the secondary).  Won't this system end up in perpetual synchronization,
     since the outputs from the primary and the backup would be different, causing a checkpoint again and again?
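
     To illustrate the concern: if each epoch's externally visible output carries a non-deterministic value, the two
     replicas essentially never agree, so every single epoch ends in a checkpoint. Toy Python, purely illustrative,
     not a model of your implementation:

    import random

    def count_forced_checkpoints(epochs=100):
        """Count how many epochs end in a checkpoint when each epoch's output
        contains a non-deterministic component (e.g. a random value or a
        timestamp) that differs between the PVM and the SVM."""
        checkpoints = 0
        for _ in range(epochs):
            pvm_output = random.random()  # non-deterministic write on the primary
            svm_output = random.random()  # independently non-deterministic on the secondary
            if pvm_output != svm_output:  # true in virtually every epoch
                checkpoints += 1
        return checkpoints

    # With outputs like these, count_forced_checkpoints(100) returns ~100:
    # perpetual resynchronization, which is exactly the concern above.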


And I would like to see at least some *informal* guarantees of data consistency - it might sound academic, but
when you are talking about running critical customer applications like a MySQL database, an SAP server or an
e-commerce web app, "consistency" matters! It helps to convince people that this system is not some half-baked
experiment but something that is well thought out.

Once again, please CC me on the patches. Several files you have touched belong to the Remus code,
and the MAINTAINERS file has the maintainer info.

Nit: in one of your slides, you mentioned 75 ms/checkpoint, of which two-thirds was spent in suspend/resume.
That isn't an artifact of Remus, FYI. I have run Remus at a 20ms checkpoint interval, where VMs were suspended,
checkpointed and resumed in under 2ms.

With the addition of a ton of functionality -- both in the toolstack and in the guest kernel -- the suspend/resume times have
gone up considerably.  If you want to reduce that overhead, try a SUSE-based kernel that has suspend event channel support.
You may not need any of those lazy netifs/netups, etc.

Even with that, the new power management framework in the 3.x kernels seems to have made suspend/resume pretty slow.


thanks
shriram

Changelog:
  Patch 1: optimize the dirty-page transfer speed.
  Patch 2-3: allow the SVM to keep running after a checkpoint
  Patch 4-5: modifications for colo on the master side (wait for a new checkpoint,
             communicate with the slave when doing a checkpoint)
  Patch 6-7: implement colo's user interface

Wen Congyang (7):
  xc_domain_save: cache pages mapping
  xc_domain_restore: introduce restore_callbacks for colo
  colo: implement restore_callbacks
  xc_domain_save: flush cache before calling callbacks->postcopy()
  xc_domain_save: implement save_callbacks for colo
  XendCheckpoint: implement colo
  remus: implement colo mode

 tools/libxc/Makefile                              |   4 +-
 tools/libxc/ia64/xc_ia64_linux_restore.c          |   3 +-
 tools/libxc/xc_domain_restore.c                   | 256 +++++---
 tools/libxc/xc_domain_restore_colo.c              | 740 ++++++++++++++++++++++
 tools/libxc/xc_domain_save.c                      | 162 +++--
 tools/libxc/xc_save_restore_colo.h                |  44 ++
 tools/libxc/xenguest.h                            |  57 +-
 tools/libxl/libxl_dom.c                           |   2 +-
 tools/python/xen/lowlevel/checkpoint/checkpoint.c | 289 ++++++++-
 tools/python/xen/lowlevel/checkpoint/checkpoint.h |   2 +
 tools/python/xen/remus/image.py                   |   7 +-
 tools/python/xen/remus/save.py                    |   6 +-
 tools/python/xen/xend/XendCheckpoint.py           | 138 ++--
 tools/remus/remus                                 |   8 +-
 tools/xcutils/xc_restore.c                        |   3 +-
 xen/include/public/xen.h                          |   1 +
 16 files changed, 1503 insertions(+), 219 deletions(-)
 create mode 100644 tools/libxc/xc_domain_restore_colo.c
 create mode 100644 tools/libxc/xc_save_restore_colo.h

--
1.8.0


_______________________________________________
Xen-devel mailing list
Xen-devel@xxxxxxxxxxxxx
http://lists.xen.org/xen-devel
