[Date Prev][Date Next][Thread Prev][Thread Next][Date Index][Thread Index]

Re: [Xen-devel] [PATCH V4] X86/vMCE: handle broken page with regard to migration



Ian Campbell wrote:
> On Wed, 2012-11-28 at 14:37 +0000, Liu, Jinsong wrote:
>> Ping?
> 
> Sorry I've been meaning to reply but didn't manage to yet. Also you
> replied to V4 saying to ignore it, so I was half waiting for V5 but I
> see this should actually be labelled V5 anyway.
> 
> I'm afraid I still don't fully grok the reason for the loop that goes
> with:
> +            /*
> +             * At the last iter, count the number of broken pages
> after sending, +             * and if there are more than before
> sending, do one or more iter +             * to make sure the pages
> are marked broken on the receiving side. +             */
> 
> Can we go through it one more time? Sorry.

Sure, and I'm very much appreciated your kindly/patient review. Your comments 
help/force me to re-think it more detailed.

> 
> Let me outline the sequence of events and you can point out where I'm
> going wrong. I'm afraid this has turned out to be rather long, again
> I'm 
> sorry for that.
> 
> First we do some number of iterations with the guest live. If an MCE
> occurs during this phase then the page will be marked dirty and we
> will 
> pick this up on the next iteration and resend the page with the dirty
> type etc and all is fine. This all looks good to me, so we don't need
> to 
> worry about anything at this stage.

Hmm, seems not so safe. If the page is good and it's in dirty bitmap (get from 
step #2), while at the iteration it broken after #4, same issue occur as your 
analysis of last iteration --> migration process will read broken page (except 
this case I agree it's OK) --> so let's merge analysis of 'migration process 
read broken page' with last iteration B.I case.

> 
> Eventually we get to the last iteration, at which point we pause the
> guest. From here on in the guest itself is not going to cause an MCE
> (e.g. by touching its RAM) because it is not running but we must still
> account for the possibility of an asynchronous MCE of some sort e.g.
> triggered by the error detection mechanisms in the hardware, cosmic
> rays 
> and such like.
> 
> The final iteration proceeds roughly as follows.
> 
>      1. The domain is paused
>      2. We scan the dirty bitmap and add dirty pages to the batch of
>         pages to process (there may be several batches in the last
>         iteration, we only need to concern ourselves with any one
>         batch here).
>      3. We map all of the pages in the resulting batch with
>         xc_map_foreign_bulk
>      4. We query the types of all the pages in the batch with
>         xc_get_pfn_type_batch
>      5. We iterate over the batch, checking the type of each page, in
>         some cases we do some incidental processing.
>      6. We send the types of the pages in the batch over the wire.
>      7. We iterate over the batch again, and send any normal page (not
>         broken, xtab etc) over the wire. Actually we do this as runs
>         of normal pages, but the key point is we avoid touching any
>         special page (including ones marked as broken by #4)
> 
> Is this sequence of events accurate?

Yes, exactly.

> 
> Now lets consider the consequences of an MCE occurring at various
> stages 
> here.
> 
> Any MCE which happens before #4 is fine, we will pick that up in #4
> and 
> the following steps will do the right thing.
> 
> Note that I am assuming that the mapping step in #3 is safe even for a
> broken page, so long as we don't actually try and use the mapping
> (more 
> on that later), is this true?

I think so. #3 mmu mapping is safe itself.

> 
> If an MCE occurs after #4 then the page will be marked as dirty in the
> bitmap and Xen will internally mark it as broken, but we won't see
> either of those with the current algorithm. There are two cases to
> think 
> about here AFAICT,
>      A. The page was not already dirty at #2. In this case we know
>         that the guest hasn't dirtied the page since the previous
>         iteration and therefore the target has a good copy of this
>         page from that time. The page isn't even in the batch we are
>         processing So we don't particularly care about the MCE here
>         and can, from the PoV of migrating this guest, ignore it.
>      B. The page was already dirty (but not broken, we handled that
>         case above in "Any MCE which happens before #4...") at #2
>         which means we have do not have an up to date copy on the
>         target. This has two subcases:
>              I. The MCE occurs before (or during) #6 (sending the
>                 page) and therefore we do not have a good up to date
>                 copy of that data at either end.
>             II. The MCE occurs after #6, in which case we already
>                 have a good copy at the target end.
> 
> To fix B you have added an 8th step to the above:
> 
>         8. Query the types of the pages again, using
>         xc_get_pfn_type_batch, and if there are more pages dirty now
>         than we say at #4 (actually #5 when we scanned the array, but
>         that distinction doesn't matter) then a new MCE must have
>         occurred. Go back to #2 and try again.
> 
> This won't do anything for A since the page wasn't in the batch to
> start 
> with and so neither #4 or #8 will look at its type, this is good and
> proper.
> 
> So now we consider the two subcases of B. Lets consider B.II first
> since 
> it seems to be the more obvious case.
> 
> In case B.II the target end already has a good copy of the data page,
> there is no need to mark the page as broken on the far end, nor to
> arrange for a vMCE to be injected. I don't know if/how we arrange for
> vMCEs to be injected under these circumstances, however even if a vMCE
> does get injected into the guest when it eventually gets unpaused on
> the 
> target then all that will happen is that it will needlessly throw
> away a 
> good page. However this is a rare corner case which is not worth
> concerning ourselves with (it's largely indistinguishable from case
> A). 
> If the MCE had happened even a single cycle earlier then this would
> have 
> been a B.I event instead of a B.II one. In any case there is no need
> to 
> return to #2 and try again, everything will be just fine if we
> complete 
> the migration at this point.
> 
> In case B.I the MCE occurs before (or while) we send the page onto the
> wire. We will therefore try to read from this page because we haven't
> looked at the type since #4 and have no idea that it is now broken.

Right.

> Reading from the broken page will cause a fault, perhaps causing a
> vMCE 
> to be delivered to dom0, which causes the kernel to kill the process
> doing the migration. Or maybe it kills dom0 or the host entirely.

Not exactly I think. Reading the broken page will trigger a serious SRAR error. 
Under such case hypervisor will inject a vMCE to the guest which was migrating, 
not dom0. The reason of this injection is, guest is best one to handle it, w/ 
sufficient clue/status/information (other component like hypervisor/dom0 are 
not proper). For xl migration process, after return from MCE context, it 
*again* read the broken page ... this will kill system entirely --> so we 
definitely not care migration any more.

> Either 
> way the idea of looping again is rather moot.

Agree the conclusion, though a little different opinion about its reason. So 
let's remove step #8?

Thanks,
Jinsong

> 
> Have I missed a case which needs thinking about?
> 
> I suspect B.I is the case where you are most likely to find a flaw in
> my argument. Is there something else which is done in this case which
> would 
> allow us to continue?
> 
> Ian.



_______________________________________________
Xen-devel mailing list
Xen-devel@xxxxxxxxxxxxx
http://lists.xen.org/xen-devel


 


Rackspace

Lists.xenproject.org is hosted with RackSpace, monitoring our
servers 24x7x365 and backed by RackSpace's Fanatical Support®.