Re: [Xen-devel] [PATCH] libxc: succeed silently on restore



On Thu, 2010-09-02 at 19:16 +0100, Brendan Cully wrote:
> On Thursday, 02 September 2010 at 18:01, Ian Campbell wrote:
> > So it turns out that there is a similar issue on migration:
> >         xc: Saving memory: iter 3 (last sent 37 skipped 0): 0/32768    0%
> >         xc: error: rdexact failed (select returned 0): Internal error
> >         xc: error: Error when reading batch size (110 = Connection timed out): Internal error
> >         xc: error: error when buffering batch, finishing (110 = Connection timed out): Internal error
> > 
> > I'm not so sure what can be done about this case: the way
> > xc_domain_restore is (currently) designed, it relies on the saving end
> > closing its FD when it is done in order to generate an EOF at the
> > receiving end, which signals the end of the migration.
> > 
> > The xl migration protocol has a postamble which prevents us from
> > closing the FD, so instead the sender finishes the save and then sits
> > waiting for the ACK from the receiver, while the receiver waits until
> > the Remus heartbeat timeout fires, which is what causes us to
> > continue. This isn't ideal from the downtime point of view, nor from
> > a general design POV.
> > 
> > Perhaps we should insert an explicit "done" marker into the xc save
> > protocol, which would be appended in the non-checkpoint case? Only the
> > saving end knows whether the migration is a checkpoint or not (and
> > only implicitly, via callbacks->checkpoint != NULL), but that is OK, I
> > think.
> 
> I think this can be done trivially. We can just add another
> negative-length record at the end of memory copying (like the debug
> flag, tmem, hvm extensions, etc.) if we're running the new xl
> migration protocol and expect restore to exit after receiving the
> first full checkpoint. Or, if you're not as worried about preserving
> the existing semantics, make the minus flag indicate that
> callbacks->checkpoint is not NULL, and only continue reading past the
> first complete checkpoint if you see that minus flag on the receive
> side.
> 
> Isn't that sufficient?
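
For concreteness, a minimal sketch of the first variant. This is purely
illustrative: the chunk name XC_SAVE_ID_LAST_CHECKPOINT and its value
are invented here, and write_exact() is written out locally to mirror
the libxc helper of the same name.

#include <errno.h>
#include <unistd.h>

/*
 * Invented for this sketch: negative "batch size" values already
 * select extension chunks in the save stream (debug flag, tmem, hvm
 * extensions), so a new ID could mark "no further checkpoints follow".
 */
#define XC_SAVE_ID_LAST_CHECKPOINT (-9)

/* Write the whole buffer, retrying on EINTR and short writes. */
static int write_exact(int fd, const void *data, size_t size)
{
    size_t off = 0;

    while (off < size) {
        ssize_t len = write(fd, (const char *)data + off, size - off);
        if (len == -1 && errno == EINTR)
            continue;
        if (len <= 0)
            return -1;
        off += len;
    }
    return 0;
}

/*
 * Saver side: append the marker after the final memory iteration of a
 * non-checkpointed (plain xl migration) save, instead of relying on
 * closing the FD to generate an EOF at the receiver.
 */
static int send_last_checkpoint_marker(int io_fd)
{
    int marker = XC_SAVE_ID_LAST_CHECKPOINT;
    return write_exact(io_fd, &marker, sizeof(marker));
}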

It would probably work, but isn't there a benefit to having the
receiver know that it is partaking in a multiple-checkpoint restore,
being told how many iterations there were, etc.?
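
For instance (again purely illustrative, nothing below exists in libxc
today), the marker could carry a small payload announcing the
checkpoint parameters up front, and the restore batch loop could
dispatch on it like any other negative chunk ID:

#include <errno.h>
#include <stdint.h>
#include <unistd.h>

/* Invented chunk ID: announces a multi-checkpoint stream up front. */
#define XC_SAVE_ID_CHECKPOINTED (-10)

struct checkpoint_announce {
    uint32_t flags;       /* e.g. Remus vs. plain migration */
    uint32_t iterations;  /* expected checkpoint count, 0 = unbounded */
};

/* Full-read analogue of write_exact() from the previous sketch. */
static int read_exact(int fd, void *data, size_t size)
{
    size_t off = 0;

    while (off < size) {
        ssize_t len = read(fd, (char *)data + off, size - off);
        if (len == -1 && errno == EINTR)
            continue;
        if (len <= 0)
            return -1;
        off += len;
    }
    return 0;
}

/*
 * Restore side (sketch): in the batch loop, a negative "batch size"
 * selects a chunk handler rather than a page batch.
 */
static int handle_chunk(int io_fd, int count, int *checkpointed)
{
    if (count == XC_SAVE_ID_CHECKPOINTED) {
        struct checkpoint_announce a;

        if (read_exact(io_fd, &a, sizeof(a)))
            return -1;
        *checkpointed = 1;
        /* stash a.flags / a.iterations for the restore loop */
        return 0;
    }
    /* ... existing IDs: debug flag, tmem, hvm extensions ... */
    return 0;
}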

Ian.

