Xen project Mailing List

Re: [Xen-devel] [Patch v2] tools/migrate: Fix regression when migrating from older version of Xen

To: Ian Campbell <ian.campbell@xxxxxxxxxx>

From: Andrew Cooper <andrew.cooper3@xxxxxxxxxx>

Date: Mon, 22 Jul 2013 19:06:39 +0100

Cc: rshriram@xxxxxxxxx, Xen-devel <xen-devel@xxxxxxxxxxxxx>

Delivery-date: Mon, 22 Jul 2013 18:07:01 +0000

List-id: Xen developer discussion <xen-devel.lists.xen.org>

On 22/07/13 18:59, Ian Campbell wrote: > On Mon, 2013-07-22 at 18:52 +0100, Andrew Cooper wrote: >> On 18/07/13 19:47, Shriram Rajagopalan wrote: >> >>> On Thu, Jul 18, 2013 at 12:27 PM, Andrew Cooper >>> <andrew.cooper3@xxxxxxxxxx> wrote: >>> On 18/07/13 17:20, Shriram Rajagopalan wrote: >>> >>> > On Tue, Jul 16, 2013 at 6:52 AM, Andrew Cooper >>> > <andrew.cooper3@xxxxxxxxxx> wrote: >>> > and you setting ctx->last_checkpoint = 0 basically means >>> > that you are >>> > banking on the far end (with older version of tools) to >>> > close the socket, causing >>> > "get pagebuf" to fail and subsequently land on finish. >>> > IIRC, this was the behavior before >>> > XC_SAVE_ID_LAST_CHECKPOINT was introduced >>> > by Ian Campbell, to get rid of this benign error message. >>> > >>> >>> >>> That might have been 'fine' for PV guests, but it cause >>> active breakage for HVM domains where the qemu save record >>> immediately follows in the migration stream. >>> >>> >>> >>> >>> Just to clarify.. the code flows like this, iirc. >>> >>> >>> loadpages: >>> while (1) >>> if !completed >>> get pagebufs >>> if 0 pages sent, break >>> endif >>> apply batch (pagebufs) >>> endwhile >>> >>> >>> if !completed >>> get tailbuf [[this is where the QEMU record would be obtained]] >>> completed = 1 >>> endif >>> >>> >>> if last_checkpoint >>> goto finish >>> endif >>> >>> >>> get pagebuf or goto finish on error ---> this is where old code >>> used to exit >>> get tailbuf >>> goto loadpages >>> finish: >>> apply tailbuf [tailbuf obtained inside the 'if !completed' block] >>> do the rest of the restore >> (Logically joining the two divergent threads, as this is the answer to >> both) >> >> This has nothing to do with the buffering mode, and that is not where >> the Qemu record would be obtained from. >> >> As the code currently stands, if a XC_SAVE_ID_LAST_CHECKPOINT chunk is >> not seen in the stream, we complete the loadpages: section, including >> the magic pages (TSS, console info etc). >> >> We then read the buffer_tail() on line 171, set ctx->completed on line >> 1725, but fail the ctx->last_checkpoint check on line 1758. >> >> What we should do is pass the last_checkpoint test, and goto finish >> which then calls dump_qemu(). What actually happens is a call to >> pagebuf_get() on line 1766 which raises an error because of finding a >> Qemu save record rather than more pages. >> >> So this is very much a bug directly introduced by 00a4b65f85, and can >> only be fixed with knowledge from the higher levels of the toolstack. >> >> As for the wording of the parameter, I still prefer the original >> "last_checkpoint_unaware" over "checkpointed_stream" as it is more >> accurate. >> >> Any migration stream started from a version of the tools after c/s >> 00a4b65f85 will work, whether it is checkpointed or not (as the >> XC_SAVE_ID_LAST_CHECKPOINT chunk will be sent in the correct place). >> Any migration started from a version of the tools before c/s >> 00a4b65f85 will fail because it has no idea that the receiving end is >> expecting it to insert XC_SAVE_ID_LAST_CHECKPOINT chunk. The fault >> here is with the receiving end expecting to find a >> XC_SAVE_ID_LAST_CHECKPOINT chunk. >> >> The only fix is for newer toolstack to be aware that the migration >> stream is from an older toolstack, and to set >> last_checkpoint_unaware=1, which will set ctx->last_checkpoint to 1, >> allowing the receiving side of the migration to correctly read the >> migration stream. > The toolstack cannot in general know that (i.e. how does xl know?) > > It does know if it is doing checkpointing or not though. > > There are four cases: > > Regular migrate from pre-00a4b65f85 to post-00a4b65f85. Receiver sets > last_checkpoint = 1 at start of day. Doesn't care that sender never > sends XC_SAVE_ID_LAST_CHECKPOINT. > > Regular migrate from post-00a4b65f85 to post-00a4b65f85. Receiver sets > last_checkpoint = 1 at start of day. Doesn't care that it sees > XC_SAVE_ID_LAST_CHECKPOINT again, just ends up (pointlessly) setting > last_checkpoint = 1. > > Checkpointed migrate from post-00a4b65f85 to post-00a4b65f85. Receiver > doesn't set last_checkpoint = 1 at start of day. > XC_SAVE_ID_LAST_CHECKPOINT is sent at the right time and both sides > agree etc. Things work. > > Checkpointed migrate from pre-00a4b65f85 to post-00a4b65f85. We don't > support this, because checkpoints should only be supported between like > versions of Xen (N->N+1 case doesn't apply). > > Ian. > > Ok - that seems to make it easier. I will see about plumbing the correct information through from xend/remus. ~Andrew _______________________________________________ Xen-devel mailing list Xen-devel@xxxxxxxxxxxxx http://lists.xen.org/xen-devel

©2013 Xen Project, A Linux Foundation Collaborative Project. All Rights Reserved.
Linux Foundation is a registered trademark of The Linux Foundation.
Xen Project is a trademark of The Linux Foundation.