
Re: [Xen-devel] [PATCH RFC 07/20] migration: defer precopy policy to libxl



On Wed, Mar 29, 2017 at 09:18:10PM +0100, Andrew Cooper wrote:
> On 27/03/17 10:06, Joshua Otto wrote:
> > The precopy phase of the xc_domain_save() live migration algorithm has
> > historically been implemented to run until either a) (almost) no pages
> > are dirty or b) some fixed, hard-coded maximum number of precopy
> > iterations has been exceeded.  This policy and its implementation are
> > less than ideal for a few reasons:
> > - the logic of the policy is intertwined with the control flow of the
> >   mechanism of the precopy stage
> > - it can't take into account facts external to the immediate
> >   migration context, such as interactive user input or the passage of
> >   wall-clock time
> > - it does not permit the user to change their mind, over time, about
> >   what to do at the end of the precopy (they get an unconditional
> >   transition into the stop-and-copy phase of the migration)
> >
> > To permit users to implement arbitrary higher-level policies governing
> > when the live migration precopy phase should end, and what should be
> > done next:
> > - add a precopy_policy() callback to the xc_domain_save() user-supplied
> >   callbacks
> > - during the precopy phase of live migrations, consult this policy after
> >   each batch of pages transmitted and take the dictated action, which
> >   may be to a) abort the migration entirely, b) continue with the
> >   precopy, or c) proceed to the stop-and-copy phase.
> > - provide an implementation of the old policy as such a callback in
> >   libxl and plumb it through the IPC machinery to libxc, effectively
> >   maintaining the old policy for now
> >
> > Signed-off-by: Joshua Otto <jtotto@xxxxxxxxxxxx>
> 
> This patch should be split into two.  One modifying libxc to use struct
> precopy_stats, and a second to wire up the RPC call.

Will do.

> > ---
> >  tools/libxc/include/xenguest.h     |  23 ++++-
> >  tools/libxc/xc_nomigrate.c         |   3 +-
> >  tools/libxc/xc_sr_common.h         |   7 +-
> >  tools/libxc/xc_sr_save.c           | 194 ++++++++++++++++++++++++++-----------
> >  tools/libxl/libxl_dom_save.c       |  20 ++++
> >  tools/libxl/libxl_save_callout.c   |   3 +-
> >  tools/libxl/libxl_save_helper.c    |   7 +-
> >  tools/libxl/libxl_save_msgs_gen.pl |   4 +-
> >  8 files changed, 189 insertions(+), 72 deletions(-)
> >
> > diff --git a/tools/libxc/include/xenguest.h b/tools/libxc/include/xenguest.h
> > index aa8cc8b..30ffb6f 100644
> > --- a/tools/libxc/include/xenguest.h
> > +++ b/tools/libxc/include/xenguest.h
> > @@ -39,6 +39,14 @@
> >   */
> >  struct xenevtchn_handle;
> >  
> > +/* For save's precopy_policy(). */
> > +struct precopy_stats
> > +{
> > +    unsigned iteration;
> > +    unsigned total_written;
> > +    long dirty_count; /* -1 if unknown */
> 
> total_written and dirty_count are liable to be equal, so having them as
> different widths of integer clearly can't be correct.

Hmmm, I could have sworn that I chose the width of dirty_count to match its
type in the shadow op stats, but I've checked again and it's uint32_t there,
so I'm not sure what I was thinking.
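
I'll make the two fields the same width in the next revision - a sketch of
what I have in mind (keeping -1 as the 'unknown' sentinel, so both counters
become long-sized):

    struct precopy_stats
    {
        unsigned int iteration;
        unsigned long total_written;
        long dirty_count; /* -1 if unknown */
    };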

> 
> > +};
> > +
> >  /* callbacks provided by xc_domain_save */
> >  struct save_callbacks {
> >      /* Called after expiration of checkpoint interval,
> > @@ -46,6 +54,17 @@ struct save_callbacks {
> >       */
> >      int (*suspend)(void* data);
> >  
> > +    /* Called after every batch of page data sent during the precopy phase of a
> > +     * live migration to ask the caller what to do next based on the current
> > +     * state of the precopy migration.
> > +     */
> > +#define XGS_POLICY_ABORT          (-1) /* Abandon the migration entirely and
> > +                                        * tidy up. */
> > +#define XGS_POLICY_CONTINUE_PRECOPY 0  /* Remain in the precopy phase. */
> > +#define XGS_POLICY_STOP_AND_COPY    1  /* Immediately suspend and transmit the
> > +                                        * remaining dirty pages. */
> > +    int (*precopy_policy)(struct precopy_stats stats, void *data);
> 
> Structures shouldn't be passed by value like this, as the compiler has
> to do a lot of memcpy() work to make it happen.  You should pass by
> const pointer, as (as far as I can tell), they are strictly read-only to
> the implementation of this hook?

I chose to pass by value to make the IPC plumbing easier -
libxl_save_msgs_gen.pl doesn't know what to do about pointers, and (not being
the strongest Perl programmer...) I didn't want to volunteer to be the one to
teach it.

Is the memcpy() really significant here?  If this were a tight loop, sure, but
every invocation of the policy callback implies both a 4MB network transfer
_and_ a synchronous RPC.
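
If the generator can be taught to marshal the pointed-to struct, the
signature would become something like (sketch):

    /* stats is strictly read-only to the policy implementation. */
    int (*precopy_policy)(const struct precopy_stats *stats, void *data);

with libxl_save_msgs_gen.pl copying the struct contents over the wire and
reconstructing it on the far side, so only the helper IPC layer ever deals
with the pointer.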

> > +
> >      /* Called after the guest's dirty pages have been
> >       *  copied into an output buffer.
> >       * Callback function resumes the guest & the device model,
> > @@ -100,8 +119,8 @@ typedef enum {
> >   *        doesn't use checkpointing
> >   * @return 0 on success, -1 on failure
> >   */
> > -int xc_domain_save(xc_interface *xch, int io_fd, uint32_t dom, uint32_t max_iters,
> > -                   uint32_t max_factor, uint32_t flags /* XCFLAGS_xxx */,
> > +int xc_domain_save(xc_interface *xch, int io_fd, uint32_t dom,
> > +                   uint32_t flags /* XCFLAGS_xxx */,
> >                     struct save_callbacks* callbacks, int hvm,
> >                     xc_migration_stream_t stream_type, int recv_fd);
> 
> It would be cleaner for existing callers, and to extend in the future,
> to encapsulate all of these parameters in a struct domain_save_params
> and pass it by pointer to here.
> 
> That way, we'd avoid the situation we currently have where some
> information is passed in bitfields in a single parameter, whereas other
> booleans are passed as integers.
> 
> The hvm parameter specifically is useless, and can be removed by
> rearranging the sanity checks until after the xc_domain_getinfo() call.
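
That does sound cleaner - something like this, perhaps (a sketch only, field
names illustrative)?

    struct domain_save_params {
        int io_fd;                         /* fd to write the stream to */
        int recv_fd;                       /* fd for checkpoint replies */
        uint32_t dom;                      /* domid to save */
        uint32_t flags;                    /* XCFLAGS_xxx */
        xc_migration_stream_t stream_type;
    };

    int xc_domain_save(xc_interface *xch,
                       const struct domain_save_params *params,
                       struct save_callbacks *callbacks);

With the hvm flag dropped, as you suggest, by moving the sanity checks after
the xc_domain_getinfo() call.
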
> 
> >  
> > diff --git a/tools/libxc/xc_nomigrate.c b/tools/libxc/xc_nomigrate.c
> > index 15c838f..2af64e4 100644
> > --- a/tools/libxc/xc_nomigrate.c
> > +++ b/tools/libxc/xc_nomigrate.c
> > @@ -20,8 +20,7 @@
> >  #include <xenctrl.h>
> >  #include <xenguest.h>
> >  
> > -int xc_domain_save(xc_interface *xch, int io_fd, uint32_t dom, uint32_t max_iters,
> > -                   uint32_t max_factor, uint32_t flags,
> > +int xc_domain_save(xc_interface *xch, int io_fd, uint32_t dom, uint32_t flags,
> >                     struct save_callbacks* callbacks, int hvm,
> >                     xc_migration_stream_t stream_type, int recv_fd)
> >  {
> > diff --git a/tools/libxc/xc_sr_common.h b/tools/libxc/xc_sr_common.h
> > index b1aa88e..a9160bd 100644
> > --- a/tools/libxc/xc_sr_common.h
> > +++ b/tools/libxc/xc_sr_common.h
> > @@ -198,12 +198,11 @@ struct xc_sr_context
> >              /* Further debugging information in the stream. */
> >              bool debug;
> >  
> > -            /* Parameters for tweaking live migration. */
> > -            unsigned max_iterations;
> > -            unsigned dirty_threshold;
> > -
> >              unsigned long p2m_size;
> >  
> > +            struct precopy_stats stats;
> > +            int policy_decision;
> > +
> >              xen_pfn_t *batch_pfns;
> >              unsigned nr_batch_pfns;
> >              unsigned long *deferred_pages;
> > diff --git a/tools/libxc/xc_sr_save.c b/tools/libxc/xc_sr_save.c
> > index 797aec5..eb95334 100644
> > --- a/tools/libxc/xc_sr_save.c
> > +++ b/tools/libxc/xc_sr_save.c
> > @@ -271,13 +271,29 @@ static int write_batch(struct xc_sr_context *ctx)
> >  }
> >  
> >  /*
> > + * Test if the batch is full.
> > + */
> > +static bool batch_full(struct xc_sr_context *ctx)
> 
> const struct xc_sr_context *ctx
> 
> This is a predicate, after all.
> 
> > +{
> > +    return ctx->save.nr_batch_pfns == MAX_BATCH_SIZE;
> > +}
> > +
> > +/*
> > + * Test if the batch is empty.
> > + */
> > +static bool batch_empty(struct xc_sr_context *ctx)
> > +{
> > +    return ctx->save.nr_batch_pfns == 0;
> > +}
> > +
> > +/*
> >   * Flush a batch of pfns into the stream.
> >   */
> >  static int flush_batch(struct xc_sr_context *ctx)
> >  {
> >      int rc = 0;
> >  
> > -    if ( ctx->save.nr_batch_pfns == 0 )
> > +    if ( batch_empty(ctx) )
> >          return rc;
> >  
> >      rc = write_batch(ctx);
> > @@ -293,19 +309,12 @@ static int flush_batch(struct xc_sr_context *ctx)
> >  }
> >  
> >  /*
> > - * Add a single pfn to the batch, flushing the batch if full.
> > + * Add a single pfn to the batch.
> >   */
> > -static int add_to_batch(struct xc_sr_context *ctx, xen_pfn_t pfn)
> > +static void add_to_batch(struct xc_sr_context *ctx, xen_pfn_t pfn)
> >  {
> > -    int rc = 0;
> > -
> > -    if ( ctx->save.nr_batch_pfns == MAX_BATCH_SIZE )
> > -        rc = flush_batch(ctx);
> > -
> > -    if ( rc == 0 )
> > -        ctx->save.batch_pfns[ctx->save.nr_batch_pfns++] = pfn;
> > -
> > -    return rc;
> > +    assert(ctx->save.nr_batch_pfns < MAX_BATCH_SIZE);
> > +    ctx->save.batch_pfns[ctx->save.nr_batch_pfns++] = pfn;
> >  }
> >  
> >  /*
> > @@ -352,10 +361,15 @@ static int suspend_domain(struct xc_sr_context *ctx)
> >   * Send a subset of pages in the guests p2m, according to the dirty bitmap.
> >   * Used for each subsequent iteration of the live migration loop.
> >   *
> > + * During the precopy stage of a live migration, test the user-supplied
> > + * policy function after each batch of pages and cut off the operation
> > + * early if indicated.  Unless aborting, the dirty pages remaining in this round
> > + * are transferred into the deferred_pages bitmap.
> 
> Is this actually a sensible thing to do?  On iteration 0, this is going
> to be a phenomenal number of RPC calls, which are all going to make the
> same decision.

With the existing policy?  No.  However, the grand idea is to permit other
policies where this does make sense.  As an example, I think it would be really
useful for users to be able to specify a timeout, in seconds, for the precopy
phase, after which the migration advances to its next phase (I'll elaborate more
on this in the other discussion thread).

It's true that this means a lot of RPC.  The hope is that the cost of each RPC
should be negligible in comparison to a 4MB synchronous network copy.
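
For concreteness, here's roughly the shape such a policy could take on the
libxl side (a sketch only - the function and the way the deadline reaches the
callback data are made up for illustration; timercmp() is from <sys/time.h>):

    /* Advance to stop-and-copy once the precopy deadline passes, or once
     * (almost) no pages remain dirty, whichever comes first. */
    static int timeout_precopy_policy(struct precopy_stats stats, void *user)
    {
        struct timeval now, *deadline = user;

        gettimeofday(&now, NULL);

        if ( timercmp(&now, deadline, > ) ||
             (stats.dirty_count >= 0 && stats.dirty_count < 50) )
            return XGS_POLICY_STOP_AND_COPY;

        return XGS_POLICY_CONTINUE_PRECOPY;
    }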

> 
> > + *
> >   * Bitmap is bounded by p2m_size.
> >   */
> >  static int send_dirty_pages(struct xc_sr_context *ctx,
> > -                            unsigned long entries)
> > +                            unsigned long entries, bool precopy)
> 
> Shouldn't this precopy boolean be some kind of state variable in ctx ?

I suppose it could be.  I was a bit worried that there would be objections to
piling too many additional variables into the context, because each is
essentially an implicit extra parameter to every function here.

> 
> >  {
> >      xc_interface *xch = ctx->xch;
> >      xen_pfn_t p;
> > @@ -364,31 +378,57 @@ static int send_dirty_pages(struct xc_sr_context *ctx,
> >      DECLARE_HYPERCALL_BUFFER_SHADOW(unsigned long, dirty_bitmap,
> >                                      &ctx->save.dirty_bitmap_hbuf);
> >  
> > -    for ( p = 0, written = 0; p < ctx->save.p2m_size; ++p )
> > +    int (*precopy_policy)(struct precopy_stats, void *) =
> > +        ctx->save.callbacks->precopy_policy;
> > +    void *data = ctx->save.callbacks->data;
> > +
> > +    assert(batch_empty(ctx));
> > +    for ( p = 0, written = 0; p < ctx->save.p2m_size; )
> 
> This looks suspicious without an increment.  Conceptually, it might be
> better as a do {} while ( decision == XGS_POLICY_CONTINUE_PRECOPY ); loop?

Sure, I think that would read just fine too.
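
i.e. something along the lines of (sketch):

    assert(batch_empty(ctx));
    do
    {
        /* ... consult the policy, then fill and flush one batch,
         * advancing p as we go ... */
    } while ( p < ctx->save.p2m_size &&
              ctx->save.policy_decision == XGS_POLICY_CONTINUE_PRECOPY );
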
> 
> >      {
> > -        if ( !test_bit(p, dirty_bitmap) )
> > -            continue;
> > +        if ( ctx->save.live && precopy )
> > +        {
> > +            ctx->save.policy_decision = precopy_policy(ctx->save.stats, data);
> 
> Newline here please.
> 
> > +            if ( ctx->save.policy_decision == XGS_POLICY_ABORT )
> > +            {
> 
> Please but a log message here indicating that abort has been requested. 
> Otherwise, the migration will give up with a failure and no obvious
> indication why.
> 
> > +                return -1;
> > +            }
> > +            else if ( ctx->save.policy_decision != XGS_POLICY_CONTINUE_PRECOPY )
> > +            {
> > +                /* Any outstanding dirty pages are now deferred until the next
> > +                 * phase of the migration. */
> 
> /*
>  * The comment style for multiline comments
>  * is like this.
>  */
> 
> > +                bitmap_or(ctx->save.deferred_pages, dirty_bitmap,
> > +                          ctx->save.p2m_size);
> > +                if ( entries > written )
> > +                    ctx->save.nr_deferred_pages += entries - written;
> > +
> > +                goto done;
> > +            }
> > +        }
> >  
> > -        rc = add_to_batch(ctx, p);
> > +        for ( ; p < ctx->save.p2m_size && !batch_full(ctx); ++p )
> > +        {
> > +            if ( test_and_clear_bit(p, dirty_bitmap) )
> > +            {
> > +                add_to_batch(ctx, p);
> > +                ++written;
> > +                ++ctx->save.stats.total_written;
> > +            }
> > +        }
> > +
> > +        rc = flush_batch(ctx);
> >          if ( rc )
> >              return rc;
> >  
> > -        /* Update progress every 4MB worth of memory sent. */
> > -        if ( (written & ((1U << (22 - 12)) - 1)) == 0 )
> > -            xc_report_progress_step(xch, written, entries);
> > -
> > -        ++written;
> > +        /* Update progress after every batch (4MB) worth of memory sent. */
> > +        xc_report_progress_step(xch, written, entries);
> >      }
> >  
> > -    rc = flush_batch(ctx);
> > -    if ( rc )
> > -        return rc;
> > -
> >      if ( written > entries )
> >          DPRINTF("Bitmap contained more entries than expected...");
> >  
> >      xc_report_progress_step(xch, entries, entries);
> >  
> > + done:
> >      return ctx->save.ops.check_vm_state(ctx);
> >  }
> >  
> > @@ -396,14 +436,14 @@ static int send_dirty_pages(struct xc_sr_context *ctx,
> >   * Send all pages in the guests p2m.  Used as the first iteration of the live
> >   * migration loop, and for a non-live save.
> >   */
> > -static int send_all_pages(struct xc_sr_context *ctx)
> > +static int send_all_pages(struct xc_sr_context *ctx, bool precopy)
> >  {
> >      DECLARE_HYPERCALL_BUFFER_SHADOW(unsigned long, dirty_bitmap,
> >                                      &ctx->save.dirty_bitmap_hbuf);
> >  
> >      bitmap_set(dirty_bitmap, ctx->save.p2m_size);
> >  
> > -    return send_dirty_pages(ctx, ctx->save.p2m_size);
> > +    return send_dirty_pages(ctx, ctx->save.p2m_size, precopy);
> >  }
> >  
> >  static int enable_logdirty(struct xc_sr_context *ctx)
> > @@ -446,8 +486,7 @@ static int update_progress_string(struct xc_sr_context *ctx,
> >      xc_interface *xch = ctx->xch;
> >      char *new_str = NULL;
> >  
> > -    if ( asprintf(&new_str, "Frames iteration %u of %u",
> > -                  iter, ctx->save.max_iterations) == -1 )
> > +    if ( asprintf(&new_str, "Frames iteration %u", iter) == -1 )
> >      {
> >          PERROR("Unable to allocate new progress string");
> >          return -1;
> > @@ -468,20 +507,47 @@ static int send_memory_live(struct xc_sr_context *ctx)
> >      xc_interface *xch = ctx->xch;
> >      xc_shadow_op_stats_t stats = { 0, ctx->save.p2m_size };
> >      char *progress_str = NULL;
> > -    unsigned x;
> >      int rc;
> >  
> > +    DECLARE_HYPERCALL_BUFFER_SHADOW(unsigned long, dirty_bitmap,
> > +                                    &ctx->save.dirty_bitmap_hbuf);
> > +
> > +    int (*precopy_policy)(struct precopy_stats, void *) =
> > +        ctx->save.callbacks->precopy_policy;
> > +    void *data = ctx->save.callbacks->data;
> > +
> >      rc = update_progress_string(ctx, &progress_str, 0);
> >      if ( rc )
> >          goto out;
> >  
> > -    rc = send_all_pages(ctx);
> > +#define CONSULT_POLICY                                                        \
> 
> :(
> 
> The reason this code is readable and (hopefully) easy to follow, is due
> in large part to a lack of macros like this trying to hide what is
> actually going on.

Okay, I'll inline it.  I guess I thought I might get away with it because it's
never more than a screen buffer away from its callsites.
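
Inlined, each callsite would then read something like (sketch):

    rc = send_all_pages(ctx, /* precopy */ true);
    if ( rc )
        goto out;

    if ( ctx->save.policy_decision == XGS_POLICY_ABORT )
    {
        rc = -1;
        goto out;
    }
    else if ( ctx->save.policy_decision != XGS_POLICY_CONTINUE_PRECOPY )
    {
        rc = 0;
        goto out;
    }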

> 
> > +    do {                                                                      \
> > +        if ( ctx->save.policy_decision == XGS_POLICY_ABORT )                  \
> > +        {                                                                     \
> > +            rc = -1;                                                          \
> > +            goto out;                                                         \
> > +        }                                                                     \
> > +        else if ( ctx->save.policy_decision != XGS_POLICY_CONTINUE_PRECOPY )  \
> > +        {                                                                     \
> > +            rc = 0;                                                           \
> > +            goto out;                                                         \
> > +        }                                                                     \
> > +    } while (0)
> > +
> > +    ctx->save.stats = (struct precopy_stats)
> > +        {
> > +            .iteration     = 0,
> > +            .total_written = 0,
> > +            .dirty_count   = -1
> > +        };
> > +    rc = send_all_pages(ctx, /* precopy */ true);
> >      if ( rc )
> >          goto out;
> >  
> > -    for ( x = 1;
> > -          ((x < ctx->save.max_iterations) &&
> > -           (stats.dirty_count > ctx->save.dirty_threshold)); ++x )
> > +    /* send_all_pages() has updated the stats */
> > +    CONSULT_POLICY;
> > +
> > +    for ( ctx->save.stats.iteration = 1; ; ++ctx->save.stats.iteration )
> 
> Again, without an exit condition, this looks suspicious.

Sure, I'll turn this one into a do {} while() too.
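
i.e. (sketch):

    ctx->save.stats.iteration = 1;
    do
    {
        /* ... one more round of dirty-page transmission, updating stats ... */
        ++ctx->save.stats.iteration;
    } while ( ctx->save.policy_decision == XGS_POLICY_CONTINUE_PRECOPY );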

Thank you for the review!

Josh
