Xen project Mailing List

Re: [Xen-devel] [PATCH RFC 07/20] migration: defer precopy policy to libxl

To: Andrew Cooper <andrew.cooper3@xxxxxxxxxx>, xen-devel@xxxxxxxxxxxxxxxxxxxx

From: Joshua Otto <jtotto@xxxxxxxxxxxx>

Date: Thu, 30 Mar 2017 01:19:41 -0400

Cc: ian.jackson@xxxxxxxxxxxxx, hjarmstr@xxxxxxxxxxxx, wei.liu2@xxxxxxxxxx, czylin@xxxxxxxxxxxx, imhy.yang@xxxxxxxxx

Delivery-date: Thu, 30 Mar 2017 05:20:18 +0000

Dkim-filter: OpenDKIM Filter v2.11.0 minos.uwaterloo.ca v2U5JfC3028198

List-id: Xen developer discussion <xen-devel.lists.xen.org>

On Wed, Mar 29, 2017 at 09:18:10PM +0100, Andrew Cooper wrote: > On 27/03/17 10:06, Joshua Otto wrote: > > The precopy phase of the xc_domain_save() live migration algorithm has > > historically been implemented to run until either a) (almost) no pages > > are dirty or b) some fixed, hard-coded maximum number of precopy > > iterations has been exceeded. This policy and its implementation are > > less than ideal for a few reasons: > > - the logic of the policy is intertwined with the control flow of the > > mechanism of the precopy stage > > - it can't take into account facts external to the immediate > > migration context, such as interactive user input or the passage of > > wall-clock time > > - it does not permit the user to change their mind, over time, about > > what to do at the end of the precopy (they get an unconditional > > transition into the stop-and-copy phase of the migration) > > > > To permit users to implement arbitrary higher-level policies governing > > when the live migration precopy phase should end, and what should be > > done next: > > - add a precopy_policy() callback to the xc_domain_save() user-supplied > > callbacks > > - during the precopy phase of live migrations, consult this policy after > > each batch of pages transmitted and take the dictated action, which > > may be to a) abort the migration entirely, b) continue with the > > precopy, or c) proceed to the stop-and-copy phase. > > - provide an implementation of the old policy as such a callback in > > libxl and plumb it through the IPC machinery to libxc, effectively > > maintaing the old policy for now > > > > Signed-off-by: Joshua Otto <jtotto@xxxxxxxxxxxx> > > This patch should be split into two. One modifying libxc to use struct > precopy_stats, and a second to wire up the RPC call. Will do. > > --- > > tools/libxc/include/xenguest.h | 23 ++++- > > tools/libxc/xc_nomigrate.c | 3 +- > > tools/libxc/xc_sr_common.h | 7 +- > > tools/libxc/xc_sr_save.c | 194 > > ++++++++++++++++++++++++++----------- > > tools/libxl/libxl_dom_save.c | 20 ++++ > > tools/libxl/libxl_save_callout.c | 3 +- > > tools/libxl/libxl_save_helper.c | 7 +- > > tools/libxl/libxl_save_msgs_gen.pl | 4 +- > > 8 files changed, 189 insertions(+), 72 deletions(-) > > > > diff --git a/tools/libxc/include/xenguest.h b/tools/libxc/include/xenguest.h > > index aa8cc8b..30ffb6f 100644 > > --- a/tools/libxc/include/xenguest.h > > +++ b/tools/libxc/include/xenguest.h > > @@ -39,6 +39,14 @@ > > */ > > struct xenevtchn_handle; > > > > +/* For save's precopy_policy(). */ > > +struct precopy_stats > > +{ > > + unsigned iteration; > > + unsigned total_written; > > + long dirty_count; /* -1 if unknown */ > > total_written and dirty_count are liable to be equal, so having them as > different widths of integer clearly can't be correct. Hmmm, I could have sworn that I chose the width to match the type of dirty_count in the shadow op stats, but I've checked again and it's uint32_t there so I'm not sure what I was thinking. > > > +}; > > + > > /* callbacks provided by xc_domain_save */ > > struct save_callbacks { > > /* Called after expiration of checkpoint interval, > > @@ -46,6 +54,17 @@ struct save_callbacks { > > */ > > int (*suspend)(void* data); > > > > + /* Called after every batch of page data sent during the precopy phase > > of a > > + * live migration to ask the caller what to do next based on the > > current > > + * state of the precopy migration. > > + */ > > +#define XGS_POLICY_ABORT (-1) /* Abandon the migration entirely > > and > > + * tidy up. */ > > +#define XGS_POLICY_CONTINUE_PRECOPY 0 /* Remain in the precopy phase. */ > > +#define XGS_POLICY_STOP_AND_COPY 1 /* Immediately suspend and transmit > > the > > + * remaining dirty pages. */ > > + int (*precopy_policy)(struct precopy_stats stats, void *data); > > Structures shouldn't be passed by value like this, as the compiler has > to do a lot of memcpy() work to make it happen. You should pass by > const pointer, as (as far as I can tell), they are strictly read-only to > the implementation of this hook? I chose to pass by value to make the IPC plumbing easier - libxl_save_msgs_gen.pl doesn't know what to do about pointers, and (not being the strongest Perl programmer...) I didn't want to volunteer to be the one to teach it. Is the memcpy() really significant here? If this were a tight loop, sure, but every invocation of the policy callback implies both a 4MB network transfer _and_ a synchronous RPC. > > + > > /* Called after the guest's dirty pages have been > > * copied into an output buffer. > > * Callback function resumes the guest & the device model, > > @@ -100,8 +119,8 @@ typedef enum { > > * doesn't use checkpointing > > * @return 0 on success, -1 on failure > > */ > > -int xc_domain_save(xc_interface *xch, int io_fd, uint32_t dom, uint32_t > > max_iters, > > - uint32_t max_factor, uint32_t flags /* XCFLAGS_xxx */, > > +int xc_domain_save(xc_interface *xch, int io_fd, uint32_t dom, > > + uint32_t flags /* XCFLAGS_xxx */, > > struct save_callbacks* callbacks, int hvm, > > xc_migration_stream_t stream_type, int recv_fd); > > It would be cleaner for existing callers, and to extend in the future, > to encapsulate all of these parameters in a struct domain_save_params > and pass it by pointer to here. > > That way, we'd avoid the situation we currently have where some > information is passed in bitfields in a single parameter, whereas other > booleans are passed as integers. > > The hvm parameter specifically is useless, and can be removed by > rearranging the sanity checks until after the xc_domain_getinfo() call. > > > > > diff --git a/tools/libxc/xc_nomigrate.c b/tools/libxc/xc_nomigrate.c > > index 15c838f..2af64e4 100644 > > --- a/tools/libxc/xc_nomigrate.c > > +++ b/tools/libxc/xc_nomigrate.c > > @@ -20,8 +20,7 @@ > > #include <xenctrl.h> > > #include <xenguest.h> > > > > -int xc_domain_save(xc_interface *xch, int io_fd, uint32_t dom, uint32_t > > max_iters, > > - uint32_t max_factor, uint32_t flags, > > +int xc_domain_save(xc_interface *xch, int io_fd, uint32_t dom, uint32_t > > flags, > > struct save_callbacks* callbacks, int hvm, > > xc_migration_stream_t stream_type, int recv_fd) > > { > > diff --git a/tools/libxc/xc_sr_common.h b/tools/libxc/xc_sr_common.h > > index b1aa88e..a9160bd 100644 > > --- a/tools/libxc/xc_sr_common.h > > +++ b/tools/libxc/xc_sr_common.h > > @@ -198,12 +198,11 @@ struct xc_sr_context > > /* Further debugging information in the stream. */ > > bool debug; > > > > - /* Parameters for tweaking live migration. */ > > - unsigned max_iterations; > > - unsigned dirty_threshold; > > - > > unsigned long p2m_size; > > > > + struct precopy_stats stats; > > + int policy_decision; > > + > > xen_pfn_t *batch_pfns; > > unsigned nr_batch_pfns; > > unsigned long *deferred_pages; > > diff --git a/tools/libxc/xc_sr_save.c b/tools/libxc/xc_sr_save.c > > index 797aec5..eb95334 100644 > > --- a/tools/libxc/xc_sr_save.c > > +++ b/tools/libxc/xc_sr_save.c > > @@ -271,13 +271,29 @@ static int write_batch(struct xc_sr_context *ctx) > > } > > > > /* > > + * Test if the batch is full. > > + */ > > +static bool batch_full(struct xc_sr_context *ctx) > > const struct xc_sr_context *ctx > > This is a predicate, after all. > > > +{ > > + return ctx->save.nr_batch_pfns == MAX_BATCH_SIZE; > > +} > > + > > +/* > > + * Test if the batch is empty. > > + */ > > +static bool batch_empty(struct xc_sr_context *ctx) > > +{ > > + return ctx->save.nr_batch_pfns == 0; > > +} > > + > > +/* > > * Flush a batch of pfns into the stream. > > */ > > static int flush_batch(struct xc_sr_context *ctx) > > { > > int rc = 0; > > > > - if ( ctx->save.nr_batch_pfns == 0 ) > > + if ( batch_empty(ctx) ) > > return rc; > > > > rc = write_batch(ctx); > > @@ -293,19 +309,12 @@ static int flush_batch(struct xc_sr_context *ctx) > > } > > > > /* > > - * Add a single pfn to the batch, flushing the batch if full. > > + * Add a single pfn to the batch. > > */ > > -static int add_to_batch(struct xc_sr_context *ctx, xen_pfn_t pfn) > > +static void add_to_batch(struct xc_sr_context *ctx, xen_pfn_t pfn) > > { > > - int rc = 0; > > - > > - if ( ctx->save.nr_batch_pfns == MAX_BATCH_SIZE ) > > - rc = flush_batch(ctx); > > - > > - if ( rc == 0 ) > > - ctx->save.batch_pfns[ctx->save.nr_batch_pfns++] = pfn; > > - > > - return rc; > > + assert(ctx->save.nr_batch_pfns < MAX_BATCH_SIZE); > > + ctx->save.batch_pfns[ctx->save.nr_batch_pfns++] = pfn; > > } > > > > /* > > @@ -352,10 +361,15 @@ static int suspend_domain(struct xc_sr_context *ctx) > > * Send a subset of pages in the guests p2m, according to the dirty bitmap. > > * Used for each subsequent iteration of the live migration loop. > > * > > + * During the precopy stage of a live migration, test the user-supplied > > + * policy function after each batch of pages and cut off the operation > > + * early if indicated. Unless aborting, the dirty pages remaining in this > > round > > + * are transferred into the deferred_pages bitmap. > > Is this actually a sensible thing to do? On iteration 0, this is going > to be a phenomenal number of RPC calls, which are all going to make the > same decision. With the existing policy? No. However, the grand idea is to permit other policies where this does make sense. As an example, I think it would be really useful for users to be able to specify a timeout, in seconds, for the precopy phase, after which the migration advances to its next phase (I'll elaborate more on this in the other discussion thread). It's true that this means a lot of RPC. The hope is that the cost of each RPC should be negligible in comparison to a 4MB synchronous network copy. > > > + * > > * Bitmap is bounded by p2m_size. > > */ > > static int send_dirty_pages(struct xc_sr_context *ctx, > > - unsigned long entries) > > + unsigned long entries, bool precopy) > > Shouldn't this precopy boolean be some kind of state variable in ctx ? I suppose it could be. I was a bit worried that there would be objections to piling too many additional variables into the context, because each is essentially an implicit extra parameter to every function here. > > > { > > xc_interface *xch = ctx->xch; > > xen_pfn_t p; > > @@ -364,31 +378,57 @@ static int send_dirty_pages(struct xc_sr_context *ctx, > > DECLARE_HYPERCALL_BUFFER_SHADOW(unsigned long, dirty_bitmap, > > &ctx->save.dirty_bitmap_hbuf); > > > > - for ( p = 0, written = 0; p < ctx->save.p2m_size; ++p ) > > + int (*precopy_policy)(struct precopy_stats, void *) = > > + ctx->save.callbacks->precopy_policy; > > + void *data = ctx->save.callbacks->data; > > + > > + assert(batch_empty(ctx)); > > + for ( p = 0, written = 0; p < ctx->save.p2m_size; ) > > This looks suspicious without an increment. Conceptually, it might be > better as a do {} while ( decision == XGS_POLICY_CONTINUE_PRECOPY ); loop? Sure, I think that would read just fine too. > > > { > > - if ( !test_bit(p, dirty_bitmap) ) > > - continue; > > + if ( ctx->save.live && precopy ) > > + { > > + ctx->save.policy_decision = precopy_policy(ctx->save.stats, > > data); > > Newline here please. > > > + if ( ctx->save.policy_decision == XGS_POLICY_ABORT ) > > + { > > Please but a log message here indicating that abort has been requested. > Otherwise, the migration will give up with a failure and no obvious > indication why. > > > + return -1; > > + } > > + else if ( ctx->save.policy_decision != > > XGS_POLICY_CONTINUE_PRECOPY ) > > + { > > + /* Any outstanding dirty pages are now deferred until the > > next > > + * phase of the migration. */ > > /* > * The comment style for multiline comments > * is like this. > */ > > > + bitmap_or(ctx->save.deferred_pages, dirty_bitmap, > > + ctx->save.p2m_size); > > + if ( entries > written ) > > + ctx->save.nr_deferred_pages += entries - written; > > + > > + goto done; > > + } > > + } > > > > - rc = add_to_batch(ctx, p); > > + for ( ; p < ctx->save.p2m_size && !batch_full(ctx); ++p ) > > + { > > + if ( test_and_clear_bit(p, dirty_bitmap) ) > > + { > > + add_to_batch(ctx, p); > > + ++written; > > + ++ctx->save.stats.total_written; > > + } > > + } > > + > > + rc = flush_batch(ctx); > > if ( rc ) > > return rc; > > > > - /* Update progress every 4MB worth of memory sent. */ > > - if ( (written & ((1U << (22 - 12)) - 1)) == 0 ) > > - xc_report_progress_step(xch, written, entries); > > - > > - ++written; > > + /* Update progress after every batch (4MB) worth of memory sent. */ > > + xc_report_progress_step(xch, written, entries); > > } > > > > - rc = flush_batch(ctx); > > - if ( rc ) > > - return rc; > > - > > if ( written > entries ) > > DPRINTF("Bitmap contained more entries than expected..."); > > > > xc_report_progress_step(xch, entries, entries); > > > > + done: > > return ctx->save.ops.check_vm_state(ctx); > > } > > > > @@ -396,14 +436,14 @@ static int send_dirty_pages(struct xc_sr_context *ctx, > > * Send all pages in the guests p2m. Used as the first iteration of the > > live > > * migration loop, and for a non-live save. > > */ > > -static int send_all_pages(struct xc_sr_context *ctx) > > +static int send_all_pages(struct xc_sr_context *ctx, bool precopy) > > { > > DECLARE_HYPERCALL_BUFFER_SHADOW(unsigned long, dirty_bitmap, > > &ctx->save.dirty_bitmap_hbuf); > > > > bitmap_set(dirty_bitmap, ctx->save.p2m_size); > > > > - return send_dirty_pages(ctx, ctx->save.p2m_size); > > + return send_dirty_pages(ctx, ctx->save.p2m_size, precopy); > > } > > > > static int enable_logdirty(struct xc_sr_context *ctx) > > @@ -446,8 +486,7 @@ static int update_progress_string(struct xc_sr_context > > *ctx, > > xc_interface *xch = ctx->xch; > > char *new_str = NULL; > > > > - if ( asprintf(&new_str, "Frames iteration %u of %u", > > - iter, ctx->save.max_iterations) == -1 ) > > + if ( asprintf(&new_str, "Frames iteration %u", iter) == -1 ) > > { > > PERROR("Unable to allocate new progress string"); > > return -1; > > @@ -468,20 +507,47 @@ static int send_memory_live(struct xc_sr_context *ctx) > > xc_interface *xch = ctx->xch; > > xc_shadow_op_stats_t stats = { 0, ctx->save.p2m_size }; > > char *progress_str = NULL; > > - unsigned x; > > int rc; > > > > + DECLARE_HYPERCALL_BUFFER_SHADOW(unsigned long, dirty_bitmap, > > + &ctx->save.dirty_bitmap_hbuf); > > + > > + int (*precopy_policy)(struct precopy_stats, void *) = > > + ctx->save.callbacks->precopy_policy; > > + void *data = ctx->save.callbacks->data; > > + > > rc = update_progress_string(ctx, &progress_str, 0); > > if ( rc ) > > goto out; > > > > - rc = send_all_pages(ctx); > > +#define CONSULT_POLICY > > \ > > :( > > The reason this code is readable and (hopefully) easy to follow, is due > in large part to a lack of macros like this trying to hide what is > actually going on. Okay, I'll inline it. I guess I thought I might get away with it because it's never more than a screen buffer away from its callsites. > > > + do { > > \ > > + if ( ctx->save.policy_decision == XGS_POLICY_ABORT ) > > \ > > + { > > \ > > + rc = -1; > > \ > > + goto out; > > \ > > + } > > \ > > + else if ( ctx->save.policy_decision != XGS_POLICY_CONTINUE_PRECOPY > > ) \ > > + { > > \ > > + rc = 0; > > \ > > + goto out; > > \ > > + } > > \ > > + } while (0) > > + > > + ctx->save.stats = (struct precopy_stats) > > + { > > + .iteration = 0, > > + .total_written = 0, > > + .dirty_count = -1 > > + }; > > + rc = send_all_pages(ctx, /* precopy */ true); > > if ( rc ) > > goto out; > > > > - for ( x = 1; > > - ((x < ctx->save.max_iterations) && > > - (stats.dirty_count > ctx->save.dirty_threshold)); ++x ) > > + /* send_all_pages() has updated the stats */ > > + CONSULT_POLICY; > > + > > + for ( ctx->save.stats.iteration = 1; ; ++ctx->save.stats.iteration ) > > Again, without an exit condition, this looks suspicious. Sure, I'll turn this one into a do {} while() too. Thank you for the review! Josh _______________________________________________ Xen-devel mailing list Xen-devel@xxxxxxxxxxxxx https://lists.xen.org/xen-devel

©2013 Xen Project, A Linux Foundation Collaborative Project. All Rights Reserved.
Linux Foundation is a registered trademark of The Linux Foundation.
Xen Project is a trademark of The Linux Foundation.