[Date Prev][Date Next][Thread Prev][Thread Next][Date Index][Thread Index]

Re: [Xen-devel] [PATCH V4] X86/vMCE: handle broken page with regard to migration

On 29/11/12 10:02, Ian Campbell wrote:
On Wed, 2012-11-28 at 14:37 +0000, Liu, Jinsong wrote:
Sorry I've been meaning to reply but didn't manage to yet. Also you
replied to V4 saying to ignore it, so I was half waiting for V5 but I
see this should actually be labelled V5 anyway.

I'm afraid I still don't fully grok the reason for the loop that goes
+            /*
+             * At the last iter, count the number of broken pages after 
+             * and if there are more than before sending, do one or more iter
+             * to make sure the pages are marked broken on the receiving side.
+             */

Can we go through it one more time? Sorry.

Let me outline the sequence of events and you can point out where I'm
going wrong. I'm afraid this has turned out to be rather long, again I'm
sorry for that.

First we do some number of iterations with the guest live. If an MCE
occurs during this phase then the page will be marked dirty and we will
pick this up on the next iteration and resend the page with the dirty
type etc and all is fine. This all looks good to me, so we don't need to
worry about anything at this stage.

Eventually we get to the last iteration, at which point we pause the
guest. From here on in the guest itself is not going to cause an MCE
(e.g. by touching its RAM) because it is not running but we must still
account for the possibility of an asynchronous MCE of some sort e.g.
triggered by the error detection mechanisms in the hardware, cosmic rays
and such like.

The final iteration proceeds roughly as follows.

      1. The domain is paused
      2. We scan the dirty bitmap and add dirty pages to the batch of
         pages to process (there may be several batches in the last
         iteration, we only need to concern ourselves with any one batch
      3. We map all of the pages in the resulting batch with
      4. We query the types of all the pages in the batch with
      5. We iterate over the batch, checking the type of each page, in
         some cases we do some incidental processing.
      6. We send the types of the pages in the batch over the wire.
      7. We iterate over the batch again, and send any normal page (not
         broken, xtab etc) over the wire. Actually we do this as runs of
         normal pages, but the key point is we avoid touching any special
         page (including ones marked as broken by #4)

Is this sequence of events accurate?

Now lets consider the consequences of an MCE occurring at various stages

Any MCE which happens before #4 is fine, we will pick that up in #4 and
the following steps will do the right thing.

Note that I am assuming that the mapping step in #3 is safe even for a
broken page, so long as we don't actually try and use the mapping (more
on that later), is this true?

If an MCE occurs after #4 then the page will be marked as dirty in the
bitmap and Xen will internally mark it as broken, but we won't see
either of those with the current algorithm. There are two cases to think
about here AFAICT,
      A. The page was not already dirty at #2. In this case we know that
         the guest hasn't dirtied the page since the previous iteration
         and therefore the target has a good copy of this page from that
         time. The page isn't even in the batch we are processing So we
         don't particularly care about the MCE here and can, from the PoV
         of migrating this guest, ignore it.
      B. The page was already dirty (but not broken, we handled that case
         above in "Any MCE which happens before #4...") at #2 which means
         we have do not have an up to date copy on the target. This has
         two subcases:
              I. The MCE occurs before (or during) #6 (sending the page)
                 and therefore we do not have a good up to date copy of
                 that data at either end.
             II. The MCE occurs after #6, in which case we already have a
                 good copy at the target end.

To fix B you have added an 8th step to the above:

         8. Query the types of the pages again, using
         xc_get_pfn_type_batch, and if there are more pages dirty now
         than we say at #4 (actually #5 when we scanned the array, but
         that distinction doesn't matter) then a new MCE must have
         occurred. Go back to #2 and try again.

This won't do anything for A since the page wasn't in the batch to start
with and so neither #4 or #8 will look at its type, this is good and

So now we consider the two subcases of B. Lets consider B.II first since
it seems to be the more obvious case.

In case B.II the target end already has a good copy of the data page,
there is no need to mark the page as broken on the far end, nor to
arrange for a vMCE to be injected. I don't know if/how we arrange for
vMCEs to be injected under these circumstances, however even if a vMCE
does get injected into the guest when it eventually gets unpaused on the
target then all that will happen is that it will needlessly throw away a
good page. However this is a rare corner case which is not worth
concerning ourselves with (it's largely indistinguishable from case A).
If the MCE had happened even a single cycle earlier then this would have
been a B.I event instead of a B.II one. In any case there is no need to
return to #2 and try again, everything will be just fine if we complete
the migration at this point.

In case B.I the MCE occurs before (or while) we send the page onto the
wire. We will therefore try to read from this page because we haven't
looked at the type since #4 and have no idea that it is now broken.
Reading from the broken page will cause a fault, perhaps causing a vMCE
to be delivered to dom0, which causes the kernel to kill the process
doing the migration. Or maybe it kills dom0 or the host entirely. Either
way the idea of looping again is rather moot.

Have I missed a case which needs thinking about?

I suspect B.I is the case where you are most likely to find a flaw in my
argument. Is there something else which is done in this case which would
allow us to continue?

I think your analysis is correct -- the only question is whether B.I will 100% cause the migration to crash, or whether there's a chance of not crashing on the read. I had tried to ask that question before, and understood Jinsong's response to be that it's not 100% sure that the read would cause an error. However, looking back at the thread, I think I may have understood something that was not there. :-)

So I guess the question for Jinsong is this:

The only time this extra loop could help is if there is a page broken after being paused but before being sent the last time (B.I in Ian's analysis) -- in which case, the migration code is 100% guaranteed to read a now-broken page. What are the chances that this read of a broken page will *not* cause a fault which will kill at least the migration process, if not dom0? If the chances are "slim-to-none", then there is no point for the extra check.


Xen-devel mailing list



Lists.xenproject.org is hosted with RackSpace, monitoring our
servers 24x7x365 and backed by RackSpace's Fanatical Support®.