Re: [Xen-devel] [PATCH V4] X86/vMCE: handle broken page with regard to migration
Ian Campbell wrote:
> On Wed, 2012-11-28 at 14:37 +0000, Liu, Jinsong wrote:
>> Ping?
>
> Sorry I've been meaning to reply but didn't manage to yet. Also you
> replied to V4 saying to ignore it, so I was half waiting for V5 but I
> see this should actually be labelled V5 anyway.
>
> I'm afraid I still don't fully grok the reason for the loop that goes
> with:
> +        /*
> +         * At the last iter, count the number of broken pages after sending,
> +         * and if there are more than before sending, do one or more iter
> +         * to make sure the pages are marked broken on the receiving side.
> +         */
>
> Can we go through it one more time? Sorry.

Sure, and I very much appreciate your kind and patient review. Your comments have helped (and forced) me to think it through in more detail.

> Let me outline the sequence of events and you can point out where I'm
> going wrong. I'm afraid this has turned out to be rather long, again
> I'm sorry for that.
>
> First we do some number of iterations with the guest live. If an MCE
> occurs during this phase then the page will be marked dirty and we
> will pick this up on the next iteration and resend the page with the
> dirty type etc and all is fine. This all looks good to me, so we don't
> need to worry about anything at this stage.

Hmm, it is not quite that safe. If a page is good and in the dirty bitmap (obtained at step #2 below), but then breaks after step #4 within the same iteration, the same issue arises as in your analysis of the last iteration: the migration process will read a broken page. (Apart from that case, I agree this stage is fine.) So let's merge the analysis of "migration process reads a broken page" into the last-iteration case B.I below.

> Eventually we get to the last iteration, at which point we pause the
> guest. From here on in the guest itself is not going to cause an MCE
> (e.g. by touching its RAM) because it is not running but we must still
> account for the possibility of an asynchronous MCE of some sort e.g.
> triggered by the error detection mechanisms in the hardware, cosmic
> rays and such like.
>
> The final iteration proceeds roughly as follows.
>
>      1. The domain is paused
>      2. We scan the dirty bitmap and add dirty pages to the batch of
>         pages to process (there may be several batches in the last
>         iteration, we only need to concern ourselves with any one
>         batch here).
>      3. We map all of the pages in the resulting batch with
>         xc_map_foreign_bulk
>      4. We query the types of all the pages in the batch with
>         xc_get_pfn_type_batch
>      5. We iterate over the batch, checking the type of each page, in
>         some cases we do some incidental processing.
>      6. We send the types of the pages in the batch over the wire.
>      7. We iterate over the batch again, and send any normal page (not
>         broken, xtab etc) over the wire. Actually we do this as runs
>         of normal pages, but the key point is we avoid touching any
>         special page (including ones marked as broken by #4)
>
> Is this sequence of events accurate?

Yes, exactly.

> Now lets consider the consequences of an MCE occurring at various
> stages here.
>
> Any MCE which happens before #4 is fine, we will pick that up in #4
> and the following steps will do the right thing.
>
> Note that I am assuming that the mapping step in #3 is safe even for a
> broken page, so long as we don't actually try and use the mapping
> (more on that later), is this true?

I think so. The MMU mapping in #3 is safe in itself.

> If an MCE occurs after #4 then the page will be marked as dirty in the
> bitmap and Xen will internally mark it as broken, but we won't see
> either of those with the current algorithm. There are two cases to
> think about here AFAICT,
>      A. The page was not already dirty at #2. In this case we know
>         that the guest hasn't dirtied the page since the previous
>         iteration and therefore the target has a good copy of this
>         page from that time. The page isn't even in the batch we are
>         processing so we don't particularly care about the MCE here
>         and can, from the PoV of migrating this guest, ignore it.
>      B. The page was already dirty (but not broken, we handled that
>         case above in "Any MCE which happens before #4...") at #2
>         which means we do not have an up to date copy on the
>         target. This has two subcases:
>              I. The MCE occurs before (or during) #6 (sending the
>                 page) and therefore we do not have a good up to date
>                 copy of that data at either end.
>             II. The MCE occurs after #6, in which case we already
>                 have a good copy at the target end.
>
> To fix B you have added an 8th step to the above:
>
>      8. Query the types of the pages again, using
>         xc_get_pfn_type_batch, and if there are more pages dirty now
>         than we saw at #4 (actually #5 when we scanned the array, but
>         that distinction doesn't matter) then a new MCE must have
>         occurred. Go back to #2 and try again.
>
> This won't do anything for A since the page wasn't in the batch to
> start with and so neither #4 nor #8 will look at its type, this is
> good and proper.
>
> So now we consider the two subcases of B. Lets consider B.II first
> since it seems to be the more obvious case.
>
> In case B.II the target end already has a good copy of the data page,
> there is no need to mark the page as broken on the far end, nor to
> arrange for a vMCE to be injected. I don't know if/how we arrange for
> vMCEs to be injected under these circumstances, however even if a vMCE
> does get injected into the guest when it eventually gets unpaused on
> the target then all that will happen is that it will needlessly throw
> away a good page. However this is a rare corner case which is not
> worth concerning ourselves with (it's largely indistinguishable from
> case A). If the MCE had happened even a single cycle earlier then this
> would have been a B.I event instead of a B.II one. In any case there
> is no need to return to #2 and try again, everything will be just fine
> if we complete the migration at this point.
>
> In case B.I the MCE occurs before (or while) we send the page onto the
> wire. We will therefore try to read from this page because we haven't
> looked at the type since #4 and have no idea that it is now broken.

Right.

> Reading from the broken page will cause a fault, perhaps causing a
> vMCE to be delivered to dom0, which causes the kernel to kill the
> process doing the migration. Or maybe it kills dom0 or the host
> entirely.

Not exactly, I think. Reading the broken page will trigger a serious SRAR error. In that case the hypervisor will inject a vMCE into the guest that is being migrated, not into dom0. The reason for injecting into the guest is that the guest is the best party to handle it, having sufficient clue/status/information (other components such as the hypervisor or dom0 are not appropriate). As for the xl migration process, after returning from the MCE context it will read the broken page *again* ... this will kill the system entirely, so at that point we definitely no longer care about the migration.

> Either way the idea of looping again is rather moot.

I agree with the conclusion, though I have a slightly different opinion about the reason. So let's remove step #8?

Thanks,
Jinsong

> Have I missed a case which needs thinking about?
>
> I suspect B.I is the case where you are most likely to find a flaw in
> my argument. Is there something else which is done in this case which
> would allow us to continue?
>
> Ian.

_______________________________________________
Xen-devel mailing list
Xen-devel@xxxxxxxxxxxxx
http://lists.xen.org/xen-devel
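[Editorial illustration: the step-#8 check debated above (re-query page types after sending, and retry if more pages are broken than were seen at step #4) can be sketched as a small simulation. All names below are hypothetical stand-ins, not the real libxc API.]

```python
# Illustrative simulation of the proposed step #8 in the last migration
# iteration. Hypothetical names; the real code queries page types with
# xc_get_pfn_type_batch in C.

NORMAL, BROKEN, XTAB = "normal", "broken", "xtab"

def count_broken(types):
    """Stand-in for scanning the types returned for a batch of pages."""
    return sum(1 for t in types if t == BROKEN)

def need_retry(types_at_step4, types_at_step8):
    """The step-#8 test: if more pages are broken after sending than
    were seen at step #4, an MCE hit during the final iteration and the
    sender would go back to step #2 and try again."""
    return count_broken(types_at_step8) > count_broken(types_at_step4)

# No new breakage observed in the batch (cases A and B.II): no retry.
assert not need_retry([NORMAL, BROKEN, XTAB], [NORMAL, BROKEN, XTAB])

# One page broke asynchronously after step #4 (case B): retry detected.
assert need_retry([NORMAL, BROKEN, XTAB], [NORMAL, BROKEN, BROKEN])
```

As the discussion concludes, this check is moot in practice: in case B.II the target already holds a good copy, and in case B.I the sender reads the broken page before ever reaching step #8.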