[Date Prev][Date Next][Thread Prev][Thread Next][Date Index][Thread Index]

Re: [Xen-devel] [PATCH 4/5] Xen/MCE: Abort live migration when vMCE occur

Ian Jackson wrote:
> Liu, Jinsong writes ("[Xen-devel] [PATCH 4/5] Xen/MCE: Abort live
> migration when vMCE occur"): 
>> This patch monitor the critical area of live migration (from vMCE
>> point of view, the copypages stage of migration is the critical area
>> while other areas are not).
> Sorry for the delay reviewing this.
> Just to be clear, can you explain what a vMCE is ?  I think I know but
> I'm not entirely sure and it would be helpful if you'd confirm, as I
> seem to have missed the background here.  I couldn't easily find the
> 0/5 posting of your series (in part because the tool you're using to
> send your series doesn't link the messages together in the same
> thread).

vMCE is a virtual MCE interface to guest. Its general purpose is to emulate a 
well defined interface to guest, so that when MCE occur in the range of guest, 
hypervisor can filter/expose necessary MCE error information to guest which 
would continue handle it.

These vMCE series patches is used to replace old xen vMCE implement, since old 
vMCE has some issues, including
1). old vMCE bound to host MCE, which would bring troubles like non-arch issue, 
save/restore issue, etc;
2). weird per-domain msr semantic
3). questionable vMCE injection method

I don't know if I have introduced it clear, but we have the Xen vMCE design doc 
as attached, including many vMCE details.

>> If a vMCE occur at the critical area of live migration, there is
>> risk that error data may be copied to the target. Currently we don't
>> have convenient way to handle this case, so for the sake of safe, we
>> abort it and try migration later (at that time broken page would not
>> be mapped and copied to the target).
> The "error data" that you refer to is erroneous page contents, or
> something else ?

Yes, it's erroneous page contents.

> When you say "we abort it and try migration later", that's not
> actually implemented in your patch, is it ?  What actually happens is
> that the migration is aborted and the user is expected to retry later.

Yes, and to make it more accurate how about update as "... we abort it. User 
can try migration later (at that time the broken page would not be mapped and 
copied to the target)"?

> I think this situation deserves a specific error code at the very
> least.  That specific error code should be plumbed up to libxl.
> Wouldn't it be better to actually restart the migration somehow ?

Seems libxl save/restore changed greatly recently, and I know almost nothing 
about the new save helper mechanism (I spend some time to study it but still 
not quite clear).
Maybe to achieve 'restart migration' is some overkilled/complicated for vMCE 
itself? after all mce during migration rarely occur in real life, and the main 
target of this patch is only for the sake of safe, so 'abort migration' is an 
acceptable option?

> I have some more minor comments about the implementation:
>> @@ -1109,6 +1111,12 @@
>>          goto out;
>>      }
>> +    if ( xc_domain_vmce_monitor_start(xch, dom) )
>> +    {
>> +        PERROR("Error when start vmce monitor\n");
> "Error starting vmc monitor" would be better English.  Messages sent
> with PERROR should not have a \n.
>> +    vmce_while_monitor = xc_domain_vmce_monitor_end(xch, dom);
>> +    if ( vmce_while_monitor < 0 )
>> +    {
>> +        PERROR("Error when end vmce monitor\n");
> Grammar and \n again.
>> +    else if ( vmce_while_monitor > 0 )
>> +    {
>> +        fprintf(stderr, "vMCE occurred, abort this time and try
>> later.\n"); +        goto out;
> This message should be sent with one of the logging macros, not
> fprintf.  ERROR, probably.
> Ian.

Will update accordingly.


Attachment: xen vMCE design (v0 2).pdf
Description: xen vMCE design (v0 2).pdf

Xen-devel mailing list



Lists.xenproject.org is hosted with RackSpace, monitoring our
servers 24x7x365 and backed by RackSpace's Fanatical Support®.