[Date Prev][Date Next][Thread Prev][Thread Next][Date Index][Thread Index] Re: [Xen-devel] [Patch v2] tools/migrate: Fix regression when migrating from older version of Xen
On Mon, 2013-07-22 at 18:52 +0100, Andrew Cooper wrote: > On 18/07/13 19:47, Shriram Rajagopalan wrote: > > > On Thu, Jul 18, 2013 at 12:27 PM, Andrew Cooper > > <andrew.cooper3@xxxxxxxxxx> wrote: > > On 18/07/13 17:20, Shriram Rajagopalan wrote: > > > > > On Tue, Jul 16, 2013 at 6:52 AM, Andrew Cooper > > > <andrew.cooper3@xxxxxxxxxx> wrote: > > > and you setting ctx->last_checkpoint = 0 basically means > > > that you are > > > banking on the far end (with older version of tools) to > > > close the socket, causing > > > "get pagebuf" to fail and subsequently land on finish. > > > IIRC, this was the behavior before > > > XC_SAVE_ID_LAST_CHECKPOINT was introduced > > > by Ian Campbell, to get rid of this benign error message. > > > > > > > > > That might have been 'fine' for PV guests, but it cause > > active breakage for HVM domains where the qemu save record > > immediately follows in the migration stream. > > > > > > > > > > Just to clarify.. the code flows like this, iirc. > > > > > > loadpages: > > while (1) > > if !completed > > get pagebufs > > if 0 pages sent, break > > endif > > apply batch (pagebufs) > > endwhile > > > > > > if !completed > > get tailbuf [[this is where the QEMU record would be obtained]] > > completed = 1 > > endif > > > > > > if last_checkpoint > > goto finish > > endif > > > > > > get pagebuf or goto finish on error ---> this is where old code > > used to exit > > get tailbuf > > goto loadpages > > finish: > > apply tailbuf [tailbuf obtained inside the 'if !completed' block] > > do the rest of the restore > > (Logically joining the two divergent threads, as this is the answer to > both) > > This has nothing to do with the buffering mode, and that is not where > the Qemu record would be obtained from. > > As the code currently stands, if a XC_SAVE_ID_LAST_CHECKPOINT chunk is > not seen in the stream, we complete the loadpages: section, including > the magic pages (TSS, console info etc). > > We then read the buffer_tail() on line 171, set ctx->completed on line > 1725, but fail the ctx->last_checkpoint check on line 1758. > > What we should do is pass the last_checkpoint test, and goto finish > which then calls dump_qemu(). What actually happens is a call to > pagebuf_get() on line 1766 which raises an error because of finding a > Qemu save record rather than more pages. > > So this is very much a bug directly introduced by 00a4b65f85, and can > only be fixed with knowledge from the higher levels of the toolstack. > > As for the wording of the parameter, I still prefer the original > "last_checkpoint_unaware" over "checkpointed_stream" as it is more > accurate. > > Any migration stream started from a version of the tools after c/s > 00a4b65f85 will work, whether it is checkpointed or not (as the > XC_SAVE_ID_LAST_CHECKPOINT chunk will be sent in the correct place). > Any migration started from a version of the tools before c/s > 00a4b65f85 will fail because it has no idea that the receiving end is > expecting it to insert XC_SAVE_ID_LAST_CHECKPOINT chunk. The fault > here is with the receiving end expecting to find a > XC_SAVE_ID_LAST_CHECKPOINT chunk. > > The only fix is for newer toolstack to be aware that the migration > stream is from an older toolstack, and to set > last_checkpoint_unaware=1, which will set ctx->last_checkpoint to 1, > allowing the receiving side of the migration to correctly read the > migration stream. The toolstack cannot in general know that (i.e. how does xl know?) It does know if it is doing checkpointing or not though. There are four cases: Regular migrate from pre-00a4b65f85 to post-00a4b65f85. Receiver sets last_checkpoint = 1 at start of day. Doesn't care that sender never sends XC_SAVE_ID_LAST_CHECKPOINT. Regular migrate from post-00a4b65f85 to post-00a4b65f85. Receiver sets last_checkpoint = 1 at start of day. Doesn't care that it sees XC_SAVE_ID_LAST_CHECKPOINT again, just ends up (pointlessly) setting last_checkpoint = 1. Checkpointed migrate from post-00a4b65f85 to post-00a4b65f85. Receiver doesn't set last_checkpoint = 1 at start of day. XC_SAVE_ID_LAST_CHECKPOINT is sent at the right time and both sides agree etc. Things work. Checkpointed migrate from pre-00a4b65f85 to post-00a4b65f85. We don't support this, because checkpoints should only be supported between like versions of Xen (N->N+1 case doesn't apply). Ian. _______________________________________________ Xen-devel mailing list Xen-devel@xxxxxxxxxxxxx http://lists.xen.org/xen-devel
|
Lists.xenproject.org is hosted with RackSpace, monitoring our |