Xen project Mailing List

Re: [Xen-devel] [PATCH RFC] remus: implement remus replicated checkpointing disk

On 03/12/2014 02:10 AM, Shriram Rajagopalan wrote: > On Tue, Feb 25, 2014 at 6:53 PM, Lai Jiangshan <laijs@xxxxxxxxxxxxxx > <mailto:laijs@xxxxxxxxxxxxxx>> wrote: > > This patch implements remus replicated checkpointing disk. > It includes two parts: > generic remus replicated checkpointing disks framework > drbd replicated checkpointing disks > They will be split into different files in next round. > > The patch is still simple due to disk-setup-teardown-script is > still under implementing. I need to use libxl_ao to implement it, > but libxl_ao is hard to use. The work sequence is needed to ugly split > to serveral callbacks like device_hotplug(). > > And becuase the remus disk script is unimplemented, the drbd_setup() code > can't check the disk now. So it just assumes the user config the disk > correctly. > > This patch is *UNTESTED*. > (there is a problem with xl&drbd(without remus) in my BOXes). > > I request *comments* as many as possible. > > Thanks, > Lai > > Signed-off-by: Lai Jiangshan <laijs@xxxxxxxxxxxxxx > <mailto:laijs@xxxxxxxxxxxxxx>> > > > > Hi > sorry for the delayed response. And thanks a lot for this initiative. Apart > from the inline feedback, > there are a few things to consider first before going down this route. > > 1. The drbd kernel module required for Remus is still out of tree, currently > hosted on a wiki page. > The drbd folks didnt want to include the changes into their code > unfortunately, as they were offering the > same functionality to one of their paid customers. This is what they told me > back in 2011 or so. > > To streamline the storage replication module installation, is there a chance > of hosting the code in > xen.org <http://xen.org>'s repos? That way, we could script the download and > installation process. Like the qemu > stuff. > > 2. The tapdisk based replication unfortunately is outdated. Please correct me > if I have got this wrong. > Haven't we decided to get rid of blktap2 and go with the qemu disk models? In > which case, the tapdisk > remus code has to be ported into some qemu disk variant. We are implementing *qemu* replicated checkpointing disk, but we can't make it public even we have done, we need to delay the publication due to we are paid to implement it by a paid customer. > > Without getting a resolution to the above two, my stance is that we shouldn't > pollute xl with functionality > that requires out-of-band modules that may prove pretty painful to install > for the majority of folks out there. I'm also concern with out-of-band modules, since remus-drbd can't be merged upstream, It will be valueless to apply remus-drbd replicated checkpointing disk to xl. What's the status of blktap3 now? (I am asking to xen community) > > Based on the experience from the last 3 years, most average users of Remus > tend to skip disk replication > altogether. They install the distro's default drbd, use the disk replication > provided with it and then complain > that Remus crashes or fails. Some have ventured into tapdisk replication but > it unfortunately seemed to > get difficult as xend/blktap2 started getting deprecated. > > > > --- > tools/libxl/Makefile | 1 + > tools/libxl/libxl_dom.c | 19 +++- > tools/libxl/libxl_internal.h | 10 ++ > tools/libxl/libxl_remus.c | 2 + > tools/libxl/libxl_remus_replicated_disk.c | 219 > +++++++++++++++++++++++++++++ > 5 files changed, 249 insertions(+), 2 deletions(-) > create mode 100644 tools/libxl/libxl_remus_replicated_disk.c > > diff --git a/tools/libxl/Makefile b/tools/libxl/Makefile > index 218f55e..dbf5dd9 100644 > --- a/tools/libxl/Makefile > +++ b/tools/libxl/Makefile > @@ -53,6 +53,7 @@ LIBXL_OBJS-y += libxl_nonetbuffer.o > endif > > LIBXL_OBJS-y += libxl_remus.o > +LIBXL_OBJS-y += libxl_remus_replicated_disk.o > > > So I think this part will also require some autoconf based stuff. > Especially, if DRBD & tapdisk are not present, then this whole > thing gets disabled. Just like the libxl_netbuffer and libxl_nonetbuffer > > Given that both of these (netbuffer and disk) are associated with Remus > and both are required for Remus to work "correctly", we might as well have > noremus.c and remus.c . Ofcourse it can be modularized a bit to have > netbuffer but no disk replication or vice versa. As long as the person > installing > or compiling this stuff is made to state explicitly that he/she does not want > Remus, but only a subset of its functionality for some other purpose. > > > diff --git a/tools/libxl/libxl_dom.c b/tools/libxl/libxl_dom.c > index a4ffdfd..858f5be 100644 > --- a/tools/libxl/libxl_dom.c > +++ b/tools/libxl/libxl_dom.c > > > > > /* The domain was suspended successfully. Start a new network > @@ -1279,7 +1284,10 @@ static int > libxl__remus_domain_resume_callback(void *data) > if (libxl__domain_resume(gc, dss->domid, /* Fast Suspend */1)) > return 0; > > - /* REMUS TODO: Deal with disk. */ > + /* Deal with disk. */ > + if (libxl__remus_disks_preresume(dss->remus_state)) > + return 0; > + > > > Bug. This should go before the resume call. Also, I would suggest changing > the comment > to something more meaningful, e.g., "commit disk changes.." > > > return 1; > } > > @@ -1326,6 +1334,13 @@ static void remus_checkpoint_dm_saved(libxl__egc > *egc, > goto out; > } > > + rc = libxl__remus_disks_commit(remus_state); > + if (rc) { > + LOG(ERROR, "Failed to commit disks state" > + " Terminating Remus.."); > + goto out; > + } > + > > > Now might be a good time to use the restore callbacks offered by the > toolstack to > get an explicit ack from the backup that it has received the memory > checkpoint too > before the network buffer is released. I think I put in a comment related to > that somewhere. > > > > if (remus_state->netbuf_state) { > rc = libxl__remus_netbuf_release_prev_epoch(gc, dss->domid, > remus_state); > diff --git a/tools/libxl/libxl_internal.h b/tools/libxl/libxl_internal.h > index 1bd2bba..8933e5f 100644 > --- a/tools/libxl/libxl_internal.h > +++ b/tools/libxl/libxl_internal.h > @@ -2309,6 +2309,10 @@ typedef struct libxl__remus_state { > void *netbuf_state; > libxl__ev_time timeout; > libxl__ev_child child; > + > + /* remus disks state */ > + uint32_t nr_disks; > + struct libxl__remus_disk **disks; > } libxl__remus_state; > > _hidden int libxl__netbuffer_enabled(libxl__gc *gc); > @@ -2336,6 +2340,12 @@ _hidden int > libxl__remus_netbuf_start_new_epoch(libxl__gc *gc, uint32_t domid, > _hidden int libxl__remus_netbuf_release_prev_epoch(libxl__gc *gc, > uint32_t domid, > libxl__remus_state > *remus_state); > > +_hidden int libxl__remus_disks_postsuspend(libxl__remus_state *state); > +_hidden int libxl__remus_disks_preresume(libxl__remus_state *state); > +_hidden int libxl__remus_disks_commit(libxl__remus_state *state); > +_hidden int libxl__remus_disks_setup(libxl__egc *egc, > libxl__domain_suspend_state *dss); > +_hidden void libxl__remus_disks_teardown(libxl__remus_state *state); > + > _hidden void libxl__remus_setup_initiate(libxl__egc *egc, > libxl__domain_suspend_state > *dss); > > diff --git a/tools/libxl/libxl_remus.c b/tools/libxl/libxl_remus.c > index cdc1c16..92eb36a 100644 > --- a/tools/libxl/libxl_remus.c > +++ b/tools/libxl/libxl_remus.c > @@ -23,6 +23,7 @@ void libxl__remus_setup_initiate(libxl__egc *egc, > libxl__domain_suspend_state *dss) > { > libxl__ev_time_init(&dss->remus_state->timeout); > + libxl__remus_disks_setup(egc, dss); > if (!dss->remus_state->netbufscript) > libxl__remus_setup_done(egc, dss, 0); > else > @@ -51,6 +52,7 @@ void libxl__remus_teardown_initiate(libxl__egc *egc, > /* stash rc somewhere before invoking teardown ops. */ > dss->remus_state->saved_rc = rc; > > + libxl__remus_disks_teardown(dss->remus_state); > if (!dss->remus_state->netbuf_state) > libxl__remus_teardown_done(egc, dss); > else > diff --git a/tools/libxl/libxl_remus_replicated_disk.c > b/tools/libxl/libxl_remus_replicated_disk.c > new file mode 100644 > index 0000000..4b16403 > --- /dev/null > +++ b/tools/libxl/libxl_remus_replicated_disk.c > @@ -0,0 +1,219 @@ > +/* > + * Copyright (C) 2013 > + * Author Lai Jiangshan <laijs@xxxxxxxxxxxxxx > <mailto:laijs@xxxxxxxxxxxxxx>> > + * > + * This program is free software; you can redistribute it and/or modify > + * it under the terms of the GNU Lesser General Public License as > published > + * by the Free Software Foundation; version 2.1 only. with the special > + * exception on linking described in file LICENSE. > + * > + * This program is distributed in the hope that it will be useful, > + * but WITHOUT ANY WARRANTY; without even the implied warranty of > + * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the > + * GNU Lesser General Public License for more details. > + */ > + > +#include "libxl_osdeps.h" /* must come before any other headers */ > + > +#include "libxl_internal.h" > + > +typedef struct libxl__remus_disk > +{ > + const struct libxl_device_disk *disk; > + const struct libxl__remus_disk_type *type; > + > + /* ao callbacks for setup & teardown script */ > + int (*setup_cb)(struct libxl__remus_disk *d); > + int (*teardown_cb)(struct libxl__remus_disk *d); > +} libxl__remus_disk; > + > +typedef struct libxl__remus_disk_type > +{ > + /* checkpointing */ > + int (*postsuspend)(libxl__remus_disk *d); > + int (*preresume)(libxl__remus_disk *d); > + int (*commit)(libxl__remus_disk *d); > + > > > I would also suggest renaming these to something else, not associated with > suspend but associated with checkpoints. start_disk_sync, finish_disk_sync, > start_new_epoch or something along those lines. > > > > + /* setup & teardown */ > + libxl__remus_disk *(*setup)(libxl__gc *gc, libxl_device_disk *disk); > + void (*teardown)(libxl__remus_disk *d); > > +} libxl__remus_disk_type; > + > + > +/*** drbd implementation ***/ > +const int DRBD_SEND_CHECKPOINT = 20; > +const int DRBD_WAIT_CHECKPOINT_ACK = 30; > +typedef struct libxl__remus_drbd_disk > +{ > + libxl__remus_disk remus_disk; > + int ctl_fd; > + int ackwait; > +} libxl__remus_drbd_disk; > + > +static int drbd_postsuspend(libxl__remus_disk *d) > +{ > + struct libxl__remus_drbd_disk *drbd = CONTAINER_OF(d, *drbd, > remus_disk); > + > + if (!drbd->ackwait) { > + if (ioctl(drbd->ctl_fd, DRBD_SEND_CHECKPOINT, 0) <= 0) > + drbd->ackwait = 1; > + } > + > + return 0; > +} > + > +static int drbd_preresume(libxl__remus_disk *d) > +{ > + struct libxl__remus_drbd_disk *drbd = CONTAINER_OF(d, *drbd, > remus_disk); > + > + if (drbd->ackwait) { > + ioctl(drbd->ctl_fd, DRBD_WAIT_CHECKPOINT_ACK, 0); > + drbd->ackwait = 0; > + } > + > + return 0; > +} > + > +static int drbd_commit(libxl__remus_disk *d) > +{ > + /* nothing to do, all work are done by DRBD's protocal-D. */ > + return 0; > +} > + > +static libxl__remus_disk *drbd_setup(libxl__gc *gc, libxl_device_disk > *disk) > +{ > + libxl__remus_drbd_disk *drbd; > + //if (!(drbd && protocal-D)) // TODO: need to run script async to > check > + // return NULL > + > > > We don't need to run any scripts for DRBD (or tapdisk for that matter). It is hard to check the status/configuration of DRBD-disk in C code, it requires drbd C-header files(with remus supported). I find no way to do it. So we try to use scripts. Thank you and thanks to your elaborate reply. Lai > > DRBD scripts will get activated when the domain boots and thats the end of it. > On the backup side, it gets activated during the initial phase of Remus, which > is same as live migration. Since xl already supports live migration with drbd > based disks, we don't need any script related code at all. > > With regard to tapdisk-remus (atleast with blktap2), you cant boot the domain > fully unless you start Remus too. This in turn forces the backup to start the > tapdisk-remus receiving end. Once again in this case, in Xend, the live > migration > infrastructure did all the script setup work. > > > > > + GCNEW(drbd); > + > + drbd->ctl_fd = open(GCSPRINTF("/dev/drbd/by-res/%s", > disk->pdev_path), O_RDONLY); > + drbd->ackwait = 0; > + > + if (drbd->ctl_fd < 0) > + return NULL; > + > + return &drbd->remus_disk; > +} > + > +static void drbd_teardown(libxl__remus_disk *d) > +{ > + struct libxl__remus_drbd_disk *drbd = CONTAINER_OF(d, *drbd, > remus_disk); > + > + close(drbd->ctl_fd); > +} > + > +static const libxl__remus_disk_type drbd_disk_type = { > + .postsuspend = drbd_postsuspend, > + .preresume = drbd_preresume, > + .commit = drbd_commit, > + .setup = drbd_setup, > + .teardown = drbd_teardown, > +}; > + > +/*** checkpoint disks states and callbacks ***/ > +static const libxl__remus_disk_type *remus_disk_types[] = > +{ > + &drbd_disk_type, > +}; > + > +int libxl__remus_disks_postsuspend(libxl__remus_state *state) > +{ > + int i; > + int rc = 0; > + > + for (i = 0; rc == 0 && i < state->nr_disks; i++) > + rc = state->disks[i]->type->postsuspend(state->disks[i]); > + > + return rc; > +} > + > +int libxl__remus_disks_preresume(libxl__remus_state *state) > +{ > + int i; > + int rc = 0; > + > + for (i = 0; rc == 0 && i < state->nr_disks; i++) > + rc = state->disks[i]->type->preresume(state->disks[i]); > + > + return rc; > +} > + > +int libxl__remus_disks_commit(libxl__remus_state *state) > +{ > + int i; > + int rc = 0; > + > + for (i = 0; rc == 0 && i < state->nr_disks; i++) > + rc = state->disks[i]->type->commit(state->disks[i]); > + > + return rc; > +} > + > +#if 0 > +/* TODO: implement disk setup/teardown script */ > +static void disk_exec_timeout_cb(libxl__egc *egc, libxl__ev_time *ev, > + const struct timeval > *requested_abs) > +{ > + libxl__remus_disks_state *state = CONTAINER_OF(ev, *aodev, timeout); > + STATE_AO_GC(state->ao); > + > + libxl__ev_time_deregister(gc, &state->timeout); > + > + assert(libxl__ev_child_inuse(&state->child)); > + if (kill(state->child.pid, SIGKILL)) { > + } > + > + return; > +} > + > +int libxl__remus_disks_exec_script(libxl__gc *gc, > + libxl__remus_disks_state *state) > +{ > +} > +#endif > + > > > I don't know if this is needed at all, given that we don't have disk script > setup issues. > > > > > +int libxl__remus_disks_setup(libxl__egc *egc, > libxl__domain_suspend_state *dss) > +{ > + libxl__remus_state *remus_state = dss->remus_state; > + int i, j, nr_disks; > + libxl_device_disk *disks; > + libxl__remus_disk *remus_disk; > + const libxl__remus_disk_type *type; > + > + STATE_AO_GC(dss->ao); > + disks = libxl_device_disk_list(CTX, dss->domid, &nr_disks); > + remus_state->nr_disks = nr_disks; > + GCNEW_ARRAY(remus_state->disks, nr_disks); > + > + for (i = 0; i < nr_disks; i++) { > + remus_disk = NULL; > + for (j = 0; j < ARRAY_SIZE(remus_disk_types); j++) { > + type = remus_disk_types[j]; > + remus_disk = type->setup(gc, &disks[i]); > + if (!remus_disk) > + break; > + > + remus_state->disks[i] = remus_disk; > + remus_disk->disk = &disks[i]; > + remus_disk->type = type; > + } > + if (!remus_disk) { > + remus_state->nr_disks = i; > + libxl__remus_disks_teardown(remus_state); > + return -1; > + } > + } > + return 0; > +} > + > +void libxl__remus_disks_teardown(libxl__remus_state *state) > +{ > + int i; > + > + for (i = 0; i < state->nr_disks; i++) > + state->disks[i]->type->teardown(state->disks[i]); > + state->nr_disks = 0; > +} > + > -- > 1.7.1 > > _______________________________________________ Xen-devel mailing list Xen-devel@xxxxxxxxxxxxx http://lists.xen.org/xen-devel

©2013 Xen Project, A Linux Foundation Collaborative Project. All Rights Reserved.
Linux Foundation is a registered trademark of The Linux Foundation.
Xen Project is a trademark of The Linux Foundation.