
Re: [Xen-devel] [PATCH V4] X86/vMCE: handle broken page with regard to migration



George Dunlap wrote:
> On 29/11/12 10:02, Ian Campbell wrote:
>> On Wed, 2012-11-28 at 14:37 +0000, Liu, Jinsong wrote:
>>> Ping?
>> Sorry I've been meaning to reply but didn't manage to yet. Also you
>> replied to V4 saying to ignore it, so I was half waiting for V5 but I
>> see this should actually be labelled V5 anyway.
>> 
>> I'm afraid I still don't fully grok the reason for the loop that goes
>> with:
>> +            /*
>> +             * At the last iter, count the number of broken pages after sending,
>> +             * and if there are more than before sending, do one or more iter
>> +             * to make sure the pages are marked broken on the receiving side.
>> +             */
>> 
>> Can we go through it one more time? Sorry.
>> 
>> Let me outline the sequence of events and you can point out where I'm
>> going wrong. I'm afraid this has turned out to be rather long, again
>> I'm sorry for that.
>> 
>> First we do some number of iterations with the guest live. If an MCE
>> occurs during this phase then the page will be marked dirty and we will
>> pick this up on the next iteration and resend the page with the dirty
>> type etc and all is fine. This all looks good to me, so we don't need
>> to worry about anything at this stage.
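
(A side note on mechanics, not part of Ian's argument: during the live phase the saver repeatedly turns the log-dirty bitmap into batches of pfns to resend, roughly as in the sketch below. This is a simplified illustration rather than the actual xc_domain_save() loop; the bitmap itself comes from the log-dirty hypercall, which is omitted here, and MAX_BATCH mirrors the batch size used by the real code.)

#include <limits.h>
#include <xenctrl.h>

#define MAX_BATCH 1024   /* pages handled per batch, as in libxc */

/* Test one bit in a log-dirty bitmap laid out as an array of longs. */
static int bit_set(const unsigned long *bitmap, unsigned long bit)
{
    unsigned long bits_per_long = sizeof(unsigned long) * CHAR_BIT;

    return (bitmap[bit / bits_per_long] >> (bit % bits_per_long)) & 1;
}

/*
 * Collect up to MAX_BATCH dirty pfns starting from *next_pfn.  A page
 * dirtied (or marked dirty because it broke) since the previous pass is
 * simply picked up here and resent on the next iteration.
 */
static unsigned int build_batch(const unsigned long *dirty_bitmap,
                                unsigned long p2m_size,
                                unsigned long *next_pfn,
                                xen_pfn_t *batch_pfns)
{
    unsigned int nr = 0;
    unsigned long pfn;

    for ( pfn = *next_pfn; pfn < p2m_size && nr < MAX_BATCH; pfn++ )
        if ( bit_set(dirty_bitmap, pfn) )
            batch_pfns[nr++] = pfn;

    *next_pfn = pfn;
    return nr;
}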
>> 
>> Eventually we get to the last iteration, at which point we pause the
>> guest. From here on in the guest itself is not going to cause an MCE
>> (e.g. by touching its RAM) because it is not running but we must still
>> account for the possibility of an asynchronous MCE of some sort e.g.
>> triggered by the error detection mechanisms in the hardware, cosmic
>> rays and such like.
>> 
>> The final iteration proceeds roughly as follows.
>> 
>>       1. The domain is paused.
>>       2. We scan the dirty bitmap and add dirty pages to the batch of
>>          pages to process (there may be several batches in the last
>>          iteration, we only need to concern ourselves with any one
>>          batch here).
>>       3. We map all of the pages in the resulting batch with
>>          xc_map_foreign_bulk.
>>       4. We query the types of all the pages in the batch with
>>          xc_get_pfn_type_batch.
>>       5. We iterate over the batch, checking the type of each page,
>>          in some cases we do some incidental processing.
>>       6. We send the types of the pages in the batch over the wire.
>>       7. We iterate over the batch again, and send any normal page
>>          (not broken, xtab etc) over the wire. Actually we do this
>>          as runs of normal pages, but the key point is we avoid
>>          touching any special page (including ones marked as broken
>>          by #4).
>> 
>> Is this sequence of events accurate?
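
To make the later discussion concrete, here is a rough sketch of how steps 3-7 hang together on the save side. It is illustrative only, not the actual xc_domain_save() code: write_exact(), PAGE_SIZE and xc_get_pfn_type_batch() are libxc-internal helpers used approximately, and XEN_DOMCTL_PFINFO_BROKEN is the page type this vMCE series uses for broken pages.

#include <string.h>
#include <sys/mman.h>
#include <xenctrl.h>

/* Sketch of one last-iteration batch (steps 3-7); nr is at most 1024. */
static int send_last_iter_batch(xc_interface *xch, uint32_t dom, int io_fd,
                                const xen_pfn_t *batch_pfns, unsigned int nr)
{
    xen_pfn_t types[1024];
    int map_err[1024];
    unsigned int i;
    char *region;
    int rc = -1;

    if ( nr > 1024 )
        return -1;

    /* Step 3: map the whole batch.  Mapping a broken page is assumed to
     * be safe so long as the mapping is never actually read. */
    region = xc_map_foreign_bulk(xch, dom, PROT_READ, batch_pfns, map_err, nr);
    if ( region == NULL )
        return -1;

    /* Step 4: query the current type of every page in the batch. */
    memcpy(types, batch_pfns, nr * sizeof(*types));
    if ( xc_get_pfn_type_batch(xch, dom, nr, types) )
        goto out;

    /* Step 6: send the pfn/type array over the wire. */
    if ( write_exact(io_fd, types, nr * sizeof(*types)) )
        goto out;

    /* Step 7: send page data, skipping special pages (broken, xtab, ...). */
    for ( i = 0; i < nr; i++ )
    {
        unsigned long type = types[i] & XEN_DOMCTL_PFINFO_LTAB_MASK;

        if ( type == XEN_DOMCTL_PFINFO_BROKEN ||
             type == XEN_DOMCTL_PFINFO_XTAB )
            continue;          /* never touch the mapping of a special page */

        /* This is the read that faults if the page broke after step 4
         * (case B.I below). */
        if ( write_exact(io_fd, region + (size_t)i * PAGE_SIZE, PAGE_SIZE) )
            goto out;
    }

    rc = 0;

 out:
    munmap(region, (size_t)nr * PAGE_SIZE);
    return rc;
}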
>> 
>> Now let's consider the consequences of an MCE occurring at various
>> stages here.
>> 
>> Any MCE which happens before #4 is fine, we will pick that up in #4
>> and the following steps will do the right thing.
>> 
>> Note that I am assuming that the mapping step in #3 is safe even for a
>> broken page, so long as we don't actually try and use the mapping (more
>> on that later). Is this true?
>> 
>> If an MCE occurs after #4 then the page will be marked as dirty in the
>> bitmap and Xen will internally mark it as broken, but we won't see
>> either of those with the current algorithm. There are two cases to
>> think about here AFAICT:
>>       A. The page was not already dirty at #2. In this case we know
>>          that the guest hasn't dirtied the page since the previous
>>          iteration and therefore the target has a good copy of this
>>          page from that time. The page isn't even in the batch we
>>          are processing, so we don't particularly care about the MCE
>>          here and can, from the PoV of migrating this guest, ignore it.
>>       B. The page was already dirty (but not broken, we handled
>>          that case above in "Any MCE which happens before #4...") at
>>          #2, which means we do not have an up to date copy on the
>>          target. This has two subcases:
>>             I. The MCE occurs before (or during) #6 (sending the
>>                page) and therefore we do not have a good up to date
>>                copy of that data at either end.
>>            II. The MCE occurs after #6, in which case we already
>>                have a good copy at the target end.
>> 
>> To fix B you have added an 8th step to the above:
>> 
>>       8. Query the types of the pages again, using
>>          xc_get_pfn_type_batch, and if there are more pages broken now
>>          than we saw at #4 (actually #5 when we scanned the array,
>>          but that distinction doesn't matter) then a new MCE must
>>          have occurred. Go back to #2 and try again.
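
For concreteness, the extra step could be sketched roughly as below. This is only an illustration of the idea, not the code in the patch; xc_get_pfn_type_batch() is the libxc-internal helper already used at #4, and XEN_DOMCTL_PFINFO_BROKEN is the broken-page type this series relies on.

#include <string.h>
#include <xenctrl.h>

/* Count the pages marked broken in a type array filled in by
 * xc_get_pfn_type_batch(). */
static unsigned int count_broken(const xen_pfn_t *types, unsigned int nr)
{
    unsigned int i, n = 0;

    for ( i = 0; i < nr; i++ )
        if ( (types[i] & XEN_DOMCTL_PFINFO_LTAB_MASK) ==
             XEN_DOMCTL_PFINFO_BROKEN )
            n++;

    return n;
}

/* Step 8 (sketch): after the batch has been sent, re-query the types and
 * report whether any page in the batch broke in the meantime; the caller
 * would then redo the last iteration. */
static int broke_since_send(xc_interface *xch, uint32_t dom,
                            const xen_pfn_t *batch_pfns,
                            const xen_pfn_t *types_at_step4,
                            xen_pfn_t *types_now, unsigned int nr)
{
    memcpy(types_now, batch_pfns, nr * sizeof(*types_now));
    if ( xc_get_pfn_type_batch(xch, dom, nr, types_now) )
        return -1;

    return count_broken(types_now, nr) > count_broken(types_at_step4, nr);
}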
>> 
>> This won't do anything for A since the page wasn't in the batch to
>> start with and so neither #4 nor #8 will look at its type; this is
>> good and proper.
>> 
>> So now we consider the two subcases of B. Let's consider B.II first
>> since it seems to be the more obvious case.
>> 
>> In case B.II the target end already has a good copy of the data page,
>> so there is no need to mark the page as broken on the far end, nor to
>> arrange for a vMCE to be injected. I don't know if/how we arrange for
>> vMCEs to be injected under these circumstances; however, even if a vMCE
>> does get injected into the guest when it eventually gets unpaused on
>> the target then all that will happen is that it will needlessly throw
>> away a good page. However this is a rare corner case which is not worth
>> concerning ourselves with (it's largely indistinguishable from case A).
>> If the MCE had happened even a single cycle earlier then this would
>> have been a B.I event instead of a B.II one. In any case there is no
>> need to return to #2 and try again; everything will be just fine if we
>> complete the migration at this point.
>> 
>> In case B.I the MCE occurs before (or while) we send the page onto the
>> wire. We will therefore try to read from this page because we haven't
>> looked at the type since #4 and have no idea that it is now broken.
>> Reading from the broken page will cause a fault, perhaps causing a vMCE
>> to be delivered to dom0, which causes the kernel to kill the process
>> doing the migration. Or maybe it kills dom0 or the host entirely.
>> Either way the idea of looping again is rather moot.
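
To spell out what "kill the process doing the migration" means in practice: on a Linux dom0 the poison is consumed in the context of the saving process, and in the recoverable case the kernel delivers SIGBUS with si_code BUS_MCEERR_AR to that process, which is fatal unless a handler is installed (and even then the page contents are gone). This is generic Linux hwpoison behaviour, not anything specific to the migration tools; a minimal handler sketch:

#include <signal.h>
#include <stdlib.h>
#include <string.h>
#include <unistd.h>

#ifndef BUS_MCEERR_AR
# define BUS_MCEERR_AR 4   /* Linux ABI value, in case libc headers lack it */
#endif

/* Invoked when the process consumes poisoned memory, e.g. while copying a
 * guest page that broke after its type was checked. */
static void sigbus_handler(int sig, siginfo_t *si, void *ctx)
{
    (void)sig;
    (void)ctx;

    if ( si->si_code == BUS_MCEERR_AR )
    {
        static const char msg[] = "hardware memory error while copying page\n";
        ssize_t rc = write(STDERR_FILENO, msg, sizeof(msg) - 1);
        (void)rc;
    }

    /* The data is gone; there is nothing useful the saver can do here. */
    _exit(EXIT_FAILURE);
}

static void install_sigbus_handler(void)
{
    struct sigaction sa;

    memset(&sa, 0, sizeof(sa));
    sa.sa_sigaction = sigbus_handler;
    sa.sa_flags = SA_SIGINFO;
    sigaction(SIGBUS, &sa, NULL);
}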
>> 
>> Have I missed a case which needs thinking about?
>> 
>> I suspect B.I is the case where you are most likely to find a flaw
>> in my argument. Is there something else which is done in this case
>> which would allow us to continue?
> 
> I think your analysis is correct -- the only question is whether B.I
> will 100% cause the migration to crash, or whether there's a chance of
> not crashing on the read.  I had tried to ask that question before,
> and understood Jinsong's response to be that it's not 100% sure that
> the read would cause an error.  However, looking back at the thread,
> I think I may have understood something that was not there. :-)
> 
> So I guess the question for Jinsong is this:
> 
> The only time this extra loop could help is if there is a page broken
> after being paused but before being sent the last time (B.I in Ian's
> analysis) -- in which case, the migration code is 100% guaranteed to
> read a now-broken page.  What are the chances that this read of a
> broken page will *not* cause a fault which will kill at least the
> migration process, if not dom0?  If the chances are "slim-to-none",
> then there is no point for the extra check.
> 
>   -George

Yes, the extra check is pointless; xl migration reading a broken page will
eventually kill the system. See my reply to IanC's email :-)

Thanks,
Jinsong
_______________________________________________
Xen-devel mailing list
Xen-devel@xxxxxxxxxxxxx
http://lists.xen.org/xen-devel


 

