Xen project Mailing List

Re: [Xen-devel] [PATCH V4] X86/vMCE: handle broken page with regard to migration

To: "Liu, Jinsong" <jinsong.liu@xxxxxxxxx>

From: Ian Campbell <Ian.Campbell@xxxxxxxxxx>

Date: Thu, 29 Nov 2012 10:02:36 +0000

Cc: George Dunlap <George.Dunlap@xxxxxxxxxxxxx>, "xen-devel@xxxxxxxxxxxxxxxxxxx" <xen-devel@xxxxxxxxxxxxxxxxxxx>, Ian Jackson <Ian.Jackson@xxxxxxxxxxxxx>, Jan Beulich <JBeulich@xxxxxxxx>

Delivery-date: Thu, 29 Nov 2012 10:03:13 +0000

List-id: Xen developer discussion <xen-devel.lists.xen.org>

On Wed, 2012-11-28 at 14:37 +0000, Liu, Jinsong wrote:
> Ping?

Sorry, I've been meaning to reply but didn't manage to yet. Also you
replied to V4 saying to ignore it, so I was half waiting for V5 but I
see this should actually be labelled V5 anyway.

I'm afraid I still don't fully grok the reason for the loop that goes
with:

+    /*
+     * At the last iter, count the number of broken pages after sending,
+     * and if there are more than before sending, do one or more iter
+     * to make sure the pages are marked broken on the receiving side.
+     */

Can we go through it one more time? Sorry.

Let me outline the sequence of events and you can point out where I'm
going wrong. I'm afraid this has turned out to be rather long, again
I'm sorry for that.

First we do some number of iterations with the guest live. If an MCE
occurs during this phase then the page will be marked dirty and we will
pick this up on the next iteration and resend the page with the dirty
type etc and all is fine. This all looks good to me, so we don't need
to worry about anything at this stage.

Eventually we get to the last iteration, at which point we pause the
guest. From here on in the guest itself is not going to cause an MCE
(e.g. by touching its RAM) because it is not running, but we must still
account for the possibility of an asynchronous MCE of some sort, e.g.
triggered by the error detection mechanisms in the hardware, cosmic
rays and such like.

The final iteration proceeds roughly as follows.

     1. The domain is paused
     2. We scan the dirty bitmap and add dirty pages to the batch of
        pages to process (there may be several batches in the last
        iteration, we only need to concern ourselves with any one
        batch here).
     3. We map all of the pages in the resulting batch with
        xc_map_foreign_bulk
     4. We query the types of all the pages in the batch with
        xc_get_pfn_type_batch
     5. We iterate over the batch, checking the type of each page, in
        some cases we do some incidental processing.
     6. We send the types of the pages in the batch over the wire.
     7. We iterate over the batch again, and send any normal page (not
        broken, xtab etc) over the wire. Actually we do this as runs
        of normal pages, but the key point is we avoid touching any
        special page (including ones marked as broken by #4)

Is this sequence of events accurate?

Now let's consider the consequences of an MCE occurring at various
stages here.

Any MCE which happens before #4 is fine, we will pick that up in #4 and
the following steps will do the right thing.

Note that I am assuming that the mapping step in #3 is safe even for a
broken page, so long as we don't actually try and use the mapping (more
on that later), is this true?

If an MCE occurs after #4 then the page will be marked as dirty in the
bitmap and Xen will internally mark it as broken, but we won't see
either of those with the current algorithm. There are two cases to
think about here AFAICT,

     A. The page was not already dirty at #2. In this case we know that
        the guest hasn't dirtied the page since the previous iteration
        and therefore the target has a good copy of this page from
        that time. The page isn't even in the batch we are processing.
        So we don't particularly care about the MCE here and can, from
        the PoV of migrating this guest, ignore it.
     B. The page was already dirty (but not broken, we handled that
        case above in "Any MCE which happens before #4...") at #2,
        which means we do not have an up to date copy on the target.
        This has two subcases:
             I. The MCE occurs before (or during) #6 (sending the
                page) and therefore we do not have a good up to date
                copy of that data at either end.
            II. The MCE occurs after #6, in which case we already have
                a good copy at the target end.

To fix B you have added an 8th step to the above:

     8. Query the types of the pages again, using
        xc_get_pfn_type_batch, and if there are more pages dirty now
        than we saw at #4 (actually #5 when we scanned the array, but
        that distinction doesn't matter) then a new MCE must have
        occurred. Go back to #2 and try again.

This won't do anything for A since the page wasn't in the batch to
start with and so neither #4 nor #8 will look at its type, this is good
and proper.

So now we consider the two subcases of B. Let's consider B.II first
since it seems to be the more obvious case.

In case B.II the target end already has a good copy of the data page,
there is no need to mark the page as broken on the far end, nor to
arrange for a vMCE to be injected. I don't know if/how we arrange for
vMCEs to be injected under these circumstances, however even if a vMCE
does get injected into the guest when it eventually gets unpaused on
the target then all that will happen is that it will needlessly throw
away a good page. However this is a rare corner case which is not worth
concerning ourselves with (it's largely indistinguishable from case A).
If the MCE had happened even a single cycle earlier then this would
have been a B.I event instead of a B.II one. In any case there is no
need to return to #2 and try again, everything will be just fine if we
complete the migration at this point.

In case B.I the MCE occurs before (or while) we send the page onto the
wire. We will therefore try to read from this page because we haven't
looked at the type since #4 and have no idea that it is now broken.
Reading from the broken page will cause a fault, perhaps causing a vMCE
to be delivered to dom0, which causes the kernel to kill the process
doing the migration. Or maybe it kills dom0 or the host entirely.
Either way the idea of looping again is rather moot.

Have I missed a case which needs thinking about?

I suspect B.I is the case where you are most likely to find a flaw in
my argument. Is there something else which is done in this case which
would allow us to continue?

Ian.

_______________________________________________
Xen-devel mailing list
Xen-devel@xxxxxxxxxxxxx
http://lists.xen.org/xen-devel
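The case analysis in the message above can be restated as a small C model. This is only an illustrative sketch, not code from libxc or from the patch under review: every name in it (enum step, enum outcome, classify_mce, recheck_can_detect) is hypothetical. It assumes the step numbering of the list in the message, with page contents going onto the wire at step 7, and it treats "the MCE struck after step N completed" as the only timing input.

```c
#include <assert.h>
#include <stdbool.h>

/* Steps of the final iteration, numbered as in the message (hypothetical names). */
enum step {
    STEP_PAUSE = 1,
    STEP_SCAN_DIRTY = 2,
    STEP_MAP = 3,
    STEP_QUERY_TYPES = 4,
    STEP_CHECK_TYPES = 5,
    STEP_SEND_TYPES = 6,
    STEP_SEND_PAGES = 7,
    STEP_RECHECK_TYPES = 8
};

/* How an MCE on one page relates to the batch being processed. */
enum outcome {
    OUTCOME_SEEN_AT_QUERY, /* broken before the type query: page is skipped */
    OUTCOME_CASE_A,        /* page not in the batch: target copy is still good */
    OUTCOME_CASE_B_I,      /* in batch, broken before its contents were sent */
    OUTCOME_CASE_B_II      /* in batch, broken after its contents were sent */
};

/*
 * Classify an MCE by whether the page is in the current batch and by the
 * last step that had fully completed when the MCE struck.
 */
enum outcome classify_mce(bool in_batch, enum step last_completed)
{
    assert(last_completed >= STEP_PAUSE &&
           last_completed <= STEP_RECHECK_TYPES);

    if (!in_batch)
        return OUTCOME_CASE_A;
    if (last_completed < STEP_QUERY_TYPES)
        return OUTCOME_SEEN_AT_QUERY;
    if (last_completed < STEP_SEND_PAGES)
        return OUTCOME_CASE_B_I;
    return OUTCOME_CASE_B_II;
}

/*
 * Would a second type query (step 8) see a newly broken page? Only pages
 * in the batch that broke after the first query can show up as "new".
 */
bool recheck_can_detect(enum outcome o)
{
    return o == OUTCOME_CASE_B_I || o == OUTCOME_CASE_B_II;
}
```

In this model the step 8 re-check can only notice B.I and B.II. The message argues that B.II needs no retry, and that in B.I the sender has already read the broken page before step 8 runs, which is the point being questioned.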
