
Re: [Xen-devel] Migration memory corruption - PV backends need to quiesce



On Mon, 2014-06-30 at 10:46 +0100, Andrew Cooper wrote:
> On 30/06/14 10:21, Ian Campbell wrote:
> > On Mon, 2014-06-30 at 10:02 +0100, Andrew Cooper wrote:
> >> On 30/06/14 09:38, Ian Campbell wrote:
> >>> On Fri, 2014-06-27 at 17:51 +0100, Andrew Cooper wrote:
> >>>> Hello,
> >>>>
> >>>> After a long time fixing my own memory corruption bugs with migration
> >>>> v2, I have finally tracked down (what I really really hope is) the last
> >>>> of the corruption.
> >>>>
> >>>>
> >>>> There appears to be a systematic problem affecting all PV drivers,
> >>>> whereby a non-quiescent backend can cause memory corruption in the VM.
> >>>>
> >>>> Active grant mapped pages are only reflected in the dirty bitmap after
> >>>> the grant has been unmapped, as mapping the ring read-only would be
> >>>> catastrophic to performance, and remapping as read-only when logdirty is
> >>>> enabled is (as far as I understand) impossible, as Xen doesn't track the
> >>>> PTEs pointing at granted frames.
> >>>>
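
To make the above concrete, here is a rough conceptual sketch of a backend's
ring lifetime (the helpers map_grant_writable(), serve_requests() and
unmap_grant() are hypothetical, not Xen's actual grant-table API):

    /* Conceptual sketch only: writes made through the backend's own
     * mapping never touch the guest's log-dirty bitmap; the frame is
     * only marked dirty when the writable mapping is torn down. */
    static void backend_ring_lifetime(domid_t domid, grant_ref_t gref)
    {
        void *ring = map_grant_writable(domid, gref);  /* at connect */

        serve_requests(ring);   /* responses written here are invisible
                                 * to the guest's log-dirty tracking   */

        unmap_grant(ring);      /* only now does Xen mark the frame
                                 * dirty -- typically long after the
                                 * memory image has been sent          */
    }
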
> >>>> PV backend drivers hold their mappings of the rings (and persistently
> >>>> granted frames) open until the domain is destroyed, which is after the
> >>>> memory image has been sent.  Therefore, any requests which are processed
> >>>> after the migration code has sent the ring frame on its first pass will
> >>>> not be reflected in the resumed domain, as this frame will never be
> >>>> marked as dirty in Xen.
> >>>>
> >>>> Furthermore, as the migration code uses memcpy() on the frames, it is
> >>>> possible that a backend update intersects with the copy, and a corrupt
> >>>> descriptor appears on the resumed side.
> >>>>
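
The memcpy() race is easy to reproduce outside Xen entirely. The standalone
program below (illustrative only; it uses no Xen interfaces) has one thread
play the backend, rewriting a three-field "descriptor" in a shared buffer,
while the main thread plays the migration code, memcpy()ing the buffer and
counting how often it catches a half-written update:

    /* Build with: gcc -O2 -pthread torn_copy.c -o torn_copy
     * The data race is deliberate; it is the point of the demo. */
    #include <pthread.h>
    #include <stdint.h>
    #include <stdio.h>
    #include <string.h>

    struct descriptor {              /* stand-in for a ring descriptor */
        uint64_t id;
        uint64_t sector;
        uint64_t length;
    };

    static struct descriptor frame;  /* stand-in for the shared frame */
    static volatile int stop;

    static void *backend(void *arg)
    {
        (void)arg;
        /* The "backend" rewrites the whole descriptor, field by field. */
        for (uint64_t i = 1; !stop; i++) {
            frame.id = i;
            frame.sector = i;
            frame.length = i;
        }
        return NULL;
    }

    int main(void)
    {
        pthread_t t;
        struct descriptor copy;
        unsigned long torn = 0;

        pthread_create(&t, NULL, backend, NULL);

        for (int i = 0; i < 10 * 1000 * 1000; i++) {
            /* "Migration code": copy the frame while it may be mid-update. */
            memcpy(&copy, &frame, sizeof(copy));
            if (copy.id != copy.sector || copy.sector != copy.length)
                torn++;              /* caught a corrupt descriptor */
        }

        stop = 1;
        pthread_join(t, NULL);
        printf("torn copies observed: %lu\n", torn);
        return 0;
    }
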
> >>>> In addition, after the domain has been paused, the backend might still
> >>>> process requests.  The migration code expects the guest to be completely
> >>>> quiesced after it has been suspended, so will only check the dirty
> >>>> bitmap once.  Any requests which get processed and completed might still
> >>>> be missed by the migration code.
> >>>>
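
For reference, the overall shape of the send loop being described is roughly
the following (pseudo-C with hypothetical helpers, not the actual libxc
migration code):

    /* Illustrative outline of a live migration send loop. */
    extern void enable_logdirty(void);
    extern void send_all_pages(void);
    extern void send_dirty_pages(void);   /* sends frames flagged in the
                                           * most recent bitmap scan      */
    extern unsigned long read_and_clear_dirty_bitmap(void);
    extern void suspend_guest(void);

    static void send_memory(void)
    {
        unsigned long dirty;

        enable_logdirty();
        send_all_pages();                  /* first pass: every frame */

        do {
            dirty = read_and_clear_dirty_bitmap();
            send_dirty_pages();            /* memcpy() of live frames */
        } while (dirty > 50);              /* arbitrary threshold */

        suspend_guest();                   /* frontend quiesces here */

        read_and_clear_dirty_bitmap();     /* consulted exactly once more:
                                            * anything dirtied after this
                                            * final scan is never resent  */
        send_dirty_pages();
    }
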
> >>>> From a heavily instrumented Xen and migration code, I am fairly sure I
> >>>> have confirmed that all pages corrupted on migration are a result of
> >>>> still-active grant maps, grant copies which complete after domain
> >>>> suspend, or the xenstore ring, which xenstored has a magic mapping
> >>>> of and which will never be reflected in the dirty bitmap.
> >>>>
> >>>>
> >>>> Overall, it would appear that there needs to be a hook for all PV
> >>>> drivers to force quiescence.  In particular, a backend must guarantee to
> >>>> unmap all active grant maps (so the frames get properly reflected in the
> >>>> dirty bitmap), and never process subsequent requests (so no new frames
> >>>> appear dirty in the bitmap after the guest has been paused).
> >>>>
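
One possible shape for such a hook (purely hypothetical; no such interface
exists today) might be a set of callbacks the toolstack invokes after the
guest is suspended but before the final dirty-bitmap scan:

    /* Hypothetical quiescence interface sketch, not an existing Xen API. */
    struct pv_backend_quiesce_ops {
        void (*stop_new_requests)(void *backend); /* ignore further ring work   */
        void (*drain_inflight)(void *backend);    /* wait for outstanding I/O   */
        void (*unmap_all_grants)(void *backend);  /* so frames get marked dirty */
    };

    static void quiesce_backend(const struct pv_backend_quiesce_ops *ops,
                                void *backend)
    {
        ops->stop_new_requests(backend);
        ops->drain_inflight(backend);
        ops->unmap_all_grants(backend);
    }
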
> >>>> Thoughts/comments?
> >>> I thought PV drivers were already (supposed to be) handling this in the
> >>> frontend.
> >>>
> >>> For reasons of checkpoint performance I think Linux's net and blkfront
> >>> are handling this on resume rather than on suspend by tearing down on
> >>> resume and then requeueing any outstanding I/O after they reattach to
> >>> the new backend. In the blkfront case this is explicit, whereas IIRC
> >>> netfront just discards any active requests and relies on L3+
> >>> retransmission to get the job done (see netfront_resume and
> >>> blkfront_resume/blkif_recover).
> >>>
> >>> Part of the tear down and reconnect should involve invalidating any
> >>> inflight descriptors, whether or not they were partially completed or
> >>> have corrupted replies in them etc. This ought to be happening before
> >>> the new backend sees the ring at all.
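
The recover-on-resume pattern being referred to looks roughly like this
(hypothetical types and helpers; not the actual blkfront/blkif_recover
code):

    /* Simplified illustration: every request that was in flight across
     * the migration is simply reissued on the fresh ring, so whatever
     * the old backend wrote -- or half-wrote -- into the data pages is
     * overwritten with correct data by the new backend. */
    static void frontend_recover(struct frontend *fe)
    {
        reinit_shared_ring(fe);             /* fresh, empty ring state */

        /* Walk the shadow copies of everything that was outstanding
         * and queue it again against the new backend. */
        for (unsigned int i = 0; i < fe->nr_inflight; i++)
            resubmit_request(fe, &fe->shadow[i]);

        notify_backend(fe);                 /* kick via event channel  */
    }
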
> >>>
> >>> Can you give an example of an instance of the corruption which you've
> >>> seen?
> >> The intersection of requests processed by the backends and memcpy() in
> >> the migration code means that the migration code might send a frame
> >> which is actively being written by the backend at the same time.
> > The I/O will either complete before or after the migration.
> >
> > If it is before then there is no problem, I think
> 
> IO completed before vm suspend is fine.
> 
> However, there is a window of time between the vm suspend (after which
> the frontend is quiesced) and the migration completing during which IO
> is still being processed by the backend.
> 
> After the vm suspend, the migration code does not expect the contents of
> VM memory to be changing, and it only consults the dirty bitmap once more.
> 
> Grants which are active until after this time (mappings of rings,
> persistent areas) are never identified as dirty, which means that the
> contents being sent will be whatever content was present during the
> first iteration of migration.

Correct. And then that random content will be fixed by the replay of
the I/O on the resume side.

> Grant copy operations do get correctly reflected in the dirty bitmap,
> but any grant copies which happen after the final consultation of the
> dirty bitmap will still be ignored.

That's fine: there is no way to signal that the I/O has completed to
the frontend, since it is no longer running. As long as the I/O is
incomplete, the frontend will replay it.

> > , the associated grant
> > unmap will mark the page dirty (unless e.g. persistent grants or
> > something has broken one of the assumptions here?)
> 
> All testing was done without persistent grant support, but persistent
> grants will probably make the problem much more obvious, given a much
> larger range of guest memory which will be out-of-date.
> 
> >  before the reply hits
> > the ring (i.e. before the guest can be aware of it).
> >
> > If the I/O completes after the migration then it won't actually complete
> > (at least not anywhere which matters). Instead it will be reissued by
> > the frontend and the new attempt will overwrite whatever partial data
> > was there during the migration with the correct data.
> 
> Correct, which is why this problem goes mostly unnoticed.  So long as
> the backend's write to a frame does not intersect with the migration
> code copying that frame (resulting in a corrupt frame being sent), the
> replay of the ring will completely hide the issue.

It's not "mostly unnoticed"; it is working as designed.

I think you need to consider the content of a frame which is under I/O
at the time of migration to be "undefined" rather than looking for a
hard "correct" vs. "corrupt" distinction.

Ian.


_______________________________________________
Xen-devel mailing list
Xen-devel@xxxxxxxxxxxxx
http://lists.xen.org/xen-devel


 

