
Re: [Xen-devel] Migration memory corruption - PV backends need to quiesce



On Mon, 2014-06-30 at 10:46 +0100, Andrew Cooper wrote:
> On 30/06/14 10:21, Ian Campbell wrote:
> > On Mon, 2014-06-30 at 10:02 +0100, Andrew Cooper wrote:
> >> On 30/06/14 09:38, Ian Campbell wrote:
> >>> On Fri, 2014-06-27 at 17:51 +0100, Andrew Cooper wrote:
> >>>> Hello,
> >>>>
> >>>> After a long time fixing my own memory corruption bugs with migration
> >>>> v2, I have finally tracked down (what I really really hope is) the last
> >>>> of the corruption.
> >>>>
> >>>>
> >>>> There appears to be a systematic problem affecting all PV drivers,
> >>>> whereby a non-quiescent backend can cause memory corruption in the VM.
> >>>>
> >>>> Active grant mapped pages are only reflected in the dirty bitmap after
> >>>> the grant has been unmapped, as mapping the ring read-only would be
> >>>> catastrophic to performance, and remapping as read-only when logdirty is
> >>>> enabled is (as far as I understand) impossible, as Xen doesn't track the
> >>>> PTEs pointing at granted frames.
> >>>>
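
To make the above concrete, here is a rough conceptual sketch of a backend's
ring lifetime (the helpers map_grant_writable(), serve_requests() and
unmap_grant() are hypothetical, not Xen's actual grant-table API):

    /* Conceptual sketch only: writes made through the backend's own
     * mapping never touch the guest's log-dirty bitmap; the frame is
     * only marked dirty when the writable mapping is torn down. */
    static void backend_ring_lifetime(domid_t domid, grant_ref_t gref)
    {
        void *ring = map_grant_writable(domid, gref);  /* at connect */

        serve_requests(ring);   /* responses written here are invisible
                                 * to the guest's log-dirty tracking   */

        unmap_grant(ring);      /* only now does Xen mark the frame
                                 * dirty -- typically long after the
                                 * memory image has been sent          */
    }
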
> >>>> PV backend drivers hold their mappings of the rings (and persistently
> >>>> granted frames) open until the domain is destroyed, which is after the
> >>>> memory image has been sent.  Therefore, any requests which are processed
> >>>> after the migration code has sent the ring frame on its first pass will
> >>>> not be reflected in the resumed domain, as this frame will never be
> >>>> marked as dirty in Xen.
> >>>>
> >>>> Furthermore, as the migration code uses memcpy() on the frames, it is
> >>>> possible that a backend update intersects with the copy, and a corrupt
> >>>> descriptor appears on the resumed side.
> >>>>
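
The memcpy() race is easy to reproduce outside Xen entirely. The standalone
program below (illustrative only; it uses no Xen interfaces) has one thread
play the backend, rewriting a three-field "descriptor" in a shared buffer,
while the main thread plays the migration code, memcpy()ing the buffer and
counting how often it catches a half-written update:

    /* Build with: gcc -O2 -pthread torn_copy.c -o torn_copy
     * The data race is deliberate; it is the point of the demo. */
    #include <pthread.h>
    #include <stdint.h>
    #include <stdio.h>
    #include <string.h>

    struct descriptor {              /* stand-in for a ring descriptor */
        uint64_t id;
        uint64_t sector;
        uint64_t length;
    };

    static struct descriptor frame;  /* stand-in for the shared frame */
    static volatile int stop;

    static void *backend(void *arg)
    {
        (void)arg;
        /* The "backend" rewrites the whole descriptor, field by field. */
        for (uint64_t i = 1; !stop; i++) {
            frame.id = i;
            frame.sector = i;
            frame.length = i;
        }
        return NULL;
    }

    int main(void)
    {
        pthread_t t;
        struct descriptor copy;
        unsigned long torn = 0;

        pthread_create(&t, NULL, backend, NULL);

        for (int i = 0; i < 10 * 1000 * 1000; i++) {
            /* "Migration code": copy the frame while it may be mid-update. */
            memcpy(&copy, &frame, sizeof(copy));
            if (copy.id != copy.sector || copy.sector != copy.length)
                torn++;              /* caught a corrupt descriptor */
        }

        stop = 1;
        pthread_join(t, NULL);
        printf("torn copies observed: %lu\n", torn);
        return 0;
    }
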
> >>>> In addition, after the domain has been paused, the backend might still
> >>>> process requests.  The migration code expects the guest to be completely
> >>>> quiesced after it has been suspended, so will only check the dirty
> >>>> bitmap once.  Any requests which get processed and completed might still
> >>>> be missed by the migration code.
> >>>>
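
For reference, the overall shape of the send loop being described is roughly
the following (pseudo-C with hypothetical helpers, not the actual libxc
migration code):

    /* Illustrative outline of a live migration send loop. */
    extern void enable_logdirty(void);
    extern void send_all_pages(void);
    extern void send_dirty_pages(void);   /* sends frames flagged in the
                                           * most recent bitmap scan      */
    extern unsigned long read_and_clear_dirty_bitmap(void);
    extern void suspend_guest(void);

    static void send_memory(void)
    {
        unsigned long dirty;

        enable_logdirty();
        send_all_pages();                  /* first pass: every frame */

        do {
            dirty = read_and_clear_dirty_bitmap();
            send_dirty_pages();            /* memcpy() of live frames */
        } while (dirty > 50);              /* arbitrary threshold */

        suspend_guest();                   /* frontend quiesces here */

        read_and_clear_dirty_bitmap();     /* consulted exactly once more:
                                            * anything dirtied after this
                                            * final scan is never resent  */
        send_dirty_pages();
    }
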
> >>>> From a heavily instrumented Xen and migration code, I am fairly sure I
> >>>> have confirmed that all pages corrupted on migration are a result of
> >>>> still-active grant maps, grant copies which complete after domain
> >>>> suspend, or the xenstore ring, which xenstored has a magic mapping
> >>>> of and which will never be reflected in the dirty bitmap.
> >>>>
> >>>>
> >>>> Overall, it would appear that there needs to be a hook for all PV
> >>>> drivers to force quiescence.  In particular, a backend must guarantee to
> >>>> unmap all active grant maps (so the frames get properly reflected in the
> >>>> dirty bitmap), and never process subsequent requests (so no new frames
> >>>> appear dirty in the bitmap after the guest has been paused).
> >>>>
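
One possible shape for such a hook (purely hypothetical; no such interface
exists today) might be a set of callbacks the toolstack invokes after the
guest is suspended but before the final dirty-bitmap scan:

    /* Hypothetical quiescence interface sketch, not an existing Xen API. */
    struct pv_backend_quiesce_ops {
        void (*stop_new_requests)(void *backend); /* ignore further ring work   */
        void (*drain_inflight)(void *backend);    /* wait for outstanding I/O   */
        void (*unmap_all_grants)(void *backend);  /* so frames get marked dirty */
    };

    static void quiesce_backend(const struct pv_backend_quiesce_ops *ops,
                                void *backend)
    {
        ops->stop_new_requests(backend);
        ops->drain_inflight(backend);
        ops->unmap_all_grants(backend);
    }
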
> >>>> Thoughts/comments?
> >>> I thought PV drivers were already (supposed to be) handling this in the
> >>> frontend.
> >>>
> >>> For reasons of checkpoint performance I think Linux's net and blkfront
> >>> are handling this on resume rather than on suspend by tearing down on
> >>> resume and then requeueing any outstanding I/O after they reattach to
> >>> the new backend. In the blkfront case this is explicit, whereas IIRC
> >>> netfront just discards any active requests and relies on L3+
> >>> retransmission to get the job done (see netfront_resume and
> >>> blkfront_resume/blkif_recover).
> >>>
> >>> Part of the tear down and reconnect should involve invalidating any
> >>> inflight descriptors, whether or not they were partially completed or
> >>> have corrupted replies in them etc. This ought to be happening before
> >>> the new backend sees the ring at all.
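
The recover-on-resume pattern being referred to looks roughly like this
(hypothetical types and helpers; not the actual blkfront/blkif_recover
code):

    /* Simplified illustration: every request that was in flight across
     * the migration is simply reissued on the fresh ring, so whatever
     * the old backend wrote -- or half-wrote -- into the data pages is
     * overwritten with correct data by the new backend. */
    static void frontend_recover(struct frontend *fe)
    {
        reinit_shared_ring(fe);             /* fresh, empty ring state */

        /* Walk the shadow copies of everything that was outstanding
         * and queue it again against the new backend. */
        for (unsigned int i = 0; i < fe->nr_inflight; i++)
            resubmit_request(fe, &fe->shadow[i]);

        notify_backend(fe);                 /* kick via event channel  */
    }
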
> >>>
> >>> Can you give an example of an instance of the corruption which you've
> >>> seen?
> >> The intersection of requests processed by the backends and memcpy() in
> >> the migration code means that the migration code might send a frame
> >> which is actively being written by the backend at the same time.
> > The I/O will either complete before or after the migration.
> >
> > If it is before then there is no problem, I think
> 
> IO completed before vm suspend is fine.
> 
> However, there is a window of time between the vm suspend (after which
> the frontend is quiesced) and the migration completing during which IO
> is still being processed by the backend.
> 
> After the vm suspend, the migration code does not expect the contents of
> VM memory to be changing, and it only consults the dirty bitmap once more.
> 
> Grants which are active until after this time (mappings of rings,
> persistent areas) are never identified as dirty, which means that the
> contents being sent will be whatever content was present during the
> first iteration of migration.

Correct. And then that random content will be fixed by the replay of
the I/O on the resume side.

> Grant copy operations do get correctly reflected in the dirty bitmap,
> but any grant copies which happen after the final consultation of the
> dirty bitmap will still be ignored.

That's fine: there is no way to signal that the I/O has completed to
the frontend, since it is no longer running. As long as the I/O is
incomplete, the frontend will replay it.

> > , the associated grant
> > unmap will mark the page dirty (unless e.g. persistent grants or
> > something has broken one of the assumptions here?)
> 
> All testing was done without persistent grant support, but persistent
> grants will probably make the problem much more obvious, given a much
> larger range of guest memory which will be out-of-date.
> 
> >  before the reply hits
> > the ring (i.e. before the guest can be aware of it).
> >
> > If the I/O completes after the migration then it won't actually complete
> > (at least not anywhere which matters). Instead it will be reissued by
> > the frontend and the new attempt will overwrite whatever partial data
> > was there during the migration with the correct data.
> 
> Correct, which is why this problem goes mostly unnoticed.  So long as
> the backend's write to a frame does not intersect with the migration
> code copying that frame (resulting in a corrupt frame being sent), the
> replay of the ring will completely hide the issue.

It's not "mostly unnoticed"; it is working as designed.

I think you need to consider the content of a frame which is under I/O
at the time of migration to be "undefined" rather than looking for a
hard "correct" vs. "corrupt" distinction.

Ian.


_______________________________________________
Xen-devel mailing list
Xen-devel@xxxxxxxxxxxxx
http://lists.xen.org/xen-devel


 

