
Re: [Xen-devel] [PATCH V4] X86/vMCE: handle broken page with regard to migration



Liu, Jinsong wrote:
> Ian Campbell wrote:
>> On Wed, 2012-11-28 at 14:37 +0000, Liu, Jinsong wrote:
>>> Ping?
>> 
>> Sorry I've been meaning to reply but didn't manage to yet. Also you
>> replied to V4 saying to ignore it, so I was half waiting for V5 but I
>> see this should actually be labelled V5 anyway.
>> 
>> I'm afraid I still don't fully grok the reason for the loop that
>> goes with:
>>
>> +            /*
>> +             * At the last iter, count the number of broken pages after sending,
>> +             * and if there are more than before sending, do one or more iter
>> +             * to make sure the pages are marked broken on the receiving side.
>> +             */
>> 
>> Can we go through it one more time? Sorry.
> 
> Sure, and I very much appreciate your kind and patient review. Your
> comments have helped/forced me to re-think it in more detail.
> 
>> 
>> Let me outline the sequence of events and you can point out where I'm
>> going wrong. I'm afraid this has turned out to be rather long, again
>> I'm sorry for that.
>> 
>> First we do some number of iterations with the guest live. If an MCE
>> occurs during this phase then the page will be marked dirty and we
>> will pick this up on the next iteration and resend the page with the
>> dirty type etc and all is fine. This all looks good to me, so we
>> don't need to worry about anything at this stage.
> 
> Hmm, it seems not so safe. If the page is good and it is in the dirty
> bitmap (obtained in step #2), but during that iteration it becomes
> broken after #4, the same issue occurs as in your analysis of the last
> iteration --> the migration process will read the broken page (apart
> from this case, I agree it's OK) --> so let's merge the analysis of
> 'migration process reads a broken page' with the last-iteration B.I
> case.
> 
>> 
>> Eventually we get to the last iteration, at which point we pause the
>> guest. From here on in the guest itself is not going to cause an MCE
>> (e.g. by touching its RAM) because it is not running but we must
>> still account for the possibility of an asynchronous MCE of some
>> sort e.g. triggered by the error detection mechanisms in the
>> hardware, cosmic rays and such like.
>> 
>> The final iteration proceeds roughly as follows.
>> 
>>      1. The domain is paused.
>>      2. We scan the dirty bitmap and add dirty pages to the batch of
>>         pages to process (there may be several batches in the last
>>         iteration; we only need to concern ourselves with any one
>>         batch here).
>>      3. We map all of the pages in the resulting batch with
>>         xc_map_foreign_bulk.
>>      4. We query the types of all the pages in the batch with
>>         xc_get_pfn_type_batch.
>>      5. We iterate over the batch, checking the type of each page; in
>>         some cases we do some incidental processing.
>>      6. We send the types of the pages in the batch over the wire.
>>      7. We iterate over the batch again, and send any normal page
>>         (not broken, xtab, etc.) over the wire. Actually we do this as
>>         runs of normal pages, but the key point is we avoid touching
>>         any special page (including ones marked as broken by #4).
>> 
>> Is this sequence of events accurate?
> 
> Yes, exactly.
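
[For clarity, the batch handling in steps #2-#7 above can be sketched as
the following simplified, self-contained simulation. The real code lives
in libxc's save path and uses xc_map_foreign_bulk()/xc_get_pfn_type_batch();
the types and names below (page_type, query_types, send_batch) are
illustrative stubs, not the actual API.]

```c
#include <assert.h>
#include <stddef.h>

/* Illustrative page types; the real ones are XEN_DOMCTL_PFINFO_* values. */
enum page_type { PAGE_NORMAL, PAGE_XTAB, PAGE_BROKEN };

/* Stub for xc_get_pfn_type_batch() (step #4): fill types[] per pfn. */
static void query_types(const enum page_type *machine_state,
                        const unsigned long *pfns, size_t n,
                        enum page_type *types)
{
    for (size_t i = 0; i < n; i++)
        types[i] = machine_state[pfns[i]];
}

/*
 * Steps #5-#7 in miniature: the types are sent first, then only normal
 * pages are "sent"; special pages (broken, xtab) are never read.
 * Returns the number of pages whose data was sent.
 */
static size_t send_batch(const enum page_type *types, size_t n,
                         int *sent /* out: 1 if page i's data was sent */)
{
    size_t count = 0;
    for (size_t i = 0; i < n; i++) {
        if (types[i] == PAGE_NORMAL) {   /* step #7: normal pages only */
            sent[i] = 1;
            count++;
        } else {
            sent[i] = 0;                 /* broken/xtab: data never touched */
        }
    }
    return count;
}
```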
> 
>> 
>> Now lets consider the consequences of an MCE occurring at various
>> stages here.
>> 
>> Any MCE which happens before #4 is fine, we will pick that up in #4
>> and the following steps will do the right thing.
>> 
>> Note that I am assuming that the mapping step in #3 is safe even for
>> a broken page, so long as we don't actually try and use the mapping
>> (more on that later), is this true?
> 
> I think so. #3 mmu mapping is safe itself.
> 
>> 
>> If an MCE occurs after #4 then the page will be marked as dirty in
>> the bitmap and Xen will internally mark it as broken, but we won't
>> see either of those with the current algorithm. There are two cases
>> to think about here AFAICT,
>>      A. The page was not already dirty at #2. In this case we know
>>         that the guest hasn't dirtied the page since the previous
>>         iteration and therefore the target has a good copy of this
>>         page from that time. The page isn't even in the batch we are
>>         processing, so we don't particularly care about the MCE here
>>         and can, from the PoV of migrating this guest, ignore it.
>>      B. The page was already dirty (but not broken, we handled that
>>         case above in "Any MCE which happens before #4...") at #2
>>         which means we do not have an up-to-date copy on the
>>         target. This has two subcases:
>>              I. The MCE occurs before (or during) #6 (sending the
>>                 page) and therefore we do not have a good up to date
>>                 copy of that data at either end.
>>             II. The MCE occurs after #6, in which case we already
>>                 have a good copy at the target end.
>> 
>> To fix B you have added an 8th step to the above:
>> 
>>         8. Query the types of the pages again, using
>>         xc_get_pfn_type_batch, and if there are more pages dirty now
>>         than we saw at #4 (actually #5 when we scanned the array, but
>>         that distinction doesn't matter) then a new MCE must have
>>         occurred. Go back to #2 and try again.
>> 
>> This won't do anything for A since the page wasn't in the batch to
>> start with and so neither #4 or #8 will look at its type, this is
>> good and proper. 
>> 
>> So now we consider the two subcases of B. Let's consider B.II first
>> since it seems to be the more obvious case.
>> 
>> In case B.II the target end already has a good copy of the data page,
>> there is no need to mark the page as broken on the far end, nor to
>> arrange for a vMCE to be injected. I don't know if/how we arrange for
>> vMCEs to be injected under these circumstances, however even if a
>> vMCE does get injected into the guest when it eventually gets
>> unpaused on the target then all that will happen is that it will
>> needlessly throw away a good page. However this is a rare corner
>> case which is not worth concerning ourselves with (it's largely
>> indistinguishable from case A). If the MCE had happened even a
>> single cycle earlier then this would have been a B.I event instead
>> of a B.II one. In any case there is no need to return to #2 and try
>> again, everything will be just fine if we complete the migration at
>> this point. 
>> 
>> In case B.I the MCE occurs before (or while) we send the page onto
>> the wire. We will therefore try to read from this page because we
>> haven't looked at the type since #4 and have no idea that it is now
>> broken. 
> 
> Right.
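
[As an aside, the step #8 recheck under discussion boils down to
something like the check below -- a hedged sketch with illustrative
names (count_broken, need_retry), not the actual patch code. The point
of contention is whether this check can ever usefully run in the B.I
case at all.]

```c
#include <assert.h>
#include <stddef.h>

/*
 * Stand-in for rescanning the result of a second xc_get_pfn_type_batch()
 * call (step #8): count how many pages in the batch are now broken.
 */
static size_t count_broken(const int *is_broken, size_t n)
{
    size_t count = 0;
    for (size_t i = 0; i < n; i++)
        count += is_broken[i] ? 1 : 0;
    return count;
}

/*
 * Step #8's decision: an MCE landing between #4 and #8 shows up as a
 * newly broken page, so more broken pages after sending than before
 * would mean going back to #2 for another iteration.
 */
static int need_retry(size_t broken_before, size_t broken_after)
{
    return broken_after > broken_before;
}
```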
> 
>> Reading from the broken page will cause a fault, perhaps causing a
>> vMCE to be delivered to dom0, which causes the kernel to kill the
>> process doing the migration. Or maybe it kills dom0 or the host
>> entirely. 
> 
> Not exactly, I think. Reading the broken page will trigger a serious
> SRAR error. In that case the hypervisor will inject a vMCE into the
> guest being migrated, not into dom0. The reason for this injection is
> that the guest is the best one to handle it, with sufficient
> clue/status/information (other components like the hypervisor and dom0
> are not appropriate). As for the xl migration process, after returning
> from the MCE context it will *again* read the broken page ... this
> will kill the system entirely --> so we definitely don't care about
> the migration any more.
> 
>> Either
>> way the idea of looping again is rather moot.
> 
> I agree with the conclusion, though I have a slightly different
> opinion about the reason. So let's remove step #8?
> 
> Thanks,
> Jinsong

More simply, we can remove not only the step #8 last-iteration check, but
also the action of 'marking a broken page in the dirty bitmap':
In any iteration there is one key point, #4. If a vMCE occurs before #4,
the proper pfn_type/pfn_number will be transferred and (xl) migration will
not access the broken page (if the guest accesses the broken page again it
will be killed by the hypervisor); if a vMCE occurs after #4, the system
will crash and we no longer need to care about the migration.

So we can go back to the original patch, which handled 'vMCE occurs before
migration', and we don't need to add any specific code to handle 'vMCE
occurs during migration', since the original patch in fact handles both
cases (and is simpler).
Thoughts?
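
[To illustrate: under this simplified scheme the only broken-page handling
on the sending side is the per-page decision at step #5 -- send the type,
never the data. A rough self-contained sketch; the pfn_type values here
are illustrative, the real patch works in terms of the
XEN_DOMCTL_PFINFO_* page types.]

```c
#include <assert.h>

/* Illustrative stand-ins for the XEN_DOMCTL_PFINFO_* page types. */
enum pfn_type { PFN_NORMAL, PFN_XTAB, PFN_BROKEN };

/*
 * Per-page decision at step #5: a page whose type came back broken at
 * #4 still has its type sent over the wire (so the receiver can mark it
 * broken), but its contents are never read. Returns 1 if the page data
 * should be copied, 0 if only the type is sent.
 */
static int should_send_data(enum pfn_type t)
{
    switch (t) {
    case PFN_NORMAL:
        return 1;           /* safe to map and copy */
    case PFN_XTAB:
    case PFN_BROKEN:
        return 0;           /* type only; touching the data could fault */
    }
    return 0;
}
```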

Thanks,
Jinsong
_______________________________________________
Xen-devel mailing list
Xen-devel@xxxxxxxxxxxxx
http://lists.xen.org/xen-devel


 

