
Re: [Xen-devel] [PATCH] Xen vMCE bugfix: inject vMCE# to all vcpus



Jan Beulich wrote:
>>>> On 13.06.12 at 12:49, "Liu, Jinsong" <jinsong.liu@xxxxxxxxx> wrote:
>> Jan Beulich wrote:
>>>>>> On 13.06.12 at 10:05, "Liu, Jinsong" <jinsong.liu@xxxxxxxxx>
>>>>>> wrote: 
>>>> Xen vMCE bugfix: inject vMCE# to all vcpus
>>>> 
>>>> In our testing of Win8 guest MCE, we found a bug: no matter what
>>>> SRAO/SRAR error Xen injects into the Win8 guest, the guest always
>>>> reboots.
>>>> 
>>>> The root cause is that the current Xen vMCE logic injects vMCE#
>>>> only to vcpu0, which is not correct for Intel MCE (under the Intel
>>>> architecture, hardware generates MCE# to all CPUs).
>>>> 
>>>> This patch fixes the vMCE injection bug by injecting vMCE# to all
>>>> vcpus.
>>> 
>>> I see no correlation between the fix (and its description) and the
>>> problem at hand: Why would Win8 reboot if it doesn't receive a
>>> particular MCE on all CPUs? Isn't that model specific behavior?
>> 
>> It's not model specific. For Intel processors, MCE# is broadcast to
>> all logical processors on systems on which UCR errors are supported
>> (refer to Intel SDM 15.9.3.1 & 15.9.3.2). So for vMCE injection
>> under the Intel architecture, it's a bug to inject vMCE# only to
>> vcpu0.
> 
> This is for a certain group of errors, but I can't find any statement
> that this is general architectural behavior.

Xen MCE only injects SRAO/SRAR errors into the guest, both of which are UCR 
errors. For other errors like CE and UCNA, the hypervisor just raises a virq 
to dom0's mcelog for logging.
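
To illustrate the routing (a sketch only: the severity constants and the
inject_vmce() call are stand-ins, not necessarily the real handler's names;
only the SRAO/SRAR-vs-CE/UCNA split and the dom0 virq come from the
description above):

    /* Hypothetical severity dispatch mirroring the routing described:
     * UCR errors (SRAO/SRAR) are injected into the guest as vMCE#,
     * while CE and UCNA are merely forwarded to dom0's mcelog. */
    switch ( severity )
    {
    case MCE_SRAO:
    case MCE_SRAR:                    /* UCR: the guest must be told */
        ret = inject_vmce(d);         /* stand-in helper name */
        break;
    case MCE_CE:
    case MCE_UCNA:                    /* survivable: log only */
        send_global_virq(VIRQ_MCA);   /* notify dom0's mcelog */
        break;
    }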

> 
>>> Furthermore I doubt that an MCE on one socket indeed causes
>>> MCE-s on all other sockets, not to speak of distinct NUMA nodes
>>> (it would already surprise me if MCE-s got broadcast across cores
>>> within a socket, unless they are caused by a resource shared
>>> across cores).
>> 
>> To some extent I agree. But at least per the current Intel SDM, it
>> is architecturally broadcast.
> 
> I can't even see how this would work reliably across packages or
> nodes: Suppose two entities each encounter a UCR - they would
> each signal the other one, and the handling of this signal would
> need to be synchronized with the handling of the local UCR
> (irrespective of the order in which the events arrive).

I don't think the hardware story is quite that, though I admittedly don't have 
all the details.

Basically, a processor is just a 'reporting point' (via its banks) indicating 
what error occurred. For example, an error is (mostly) generated at memory and 
travels along the path memory --> memory controller --> QPI --> cache 
controller --> ... --> cpu core. It's not one processor signaling the other 
CPUs. The MCA mechanism monitors the different units, decides when and which 
errors get reported, and notifies all processors (that's what broadcast means).
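
To illustrate the 'reporting point' idea: when MCE# arrives, each notified
CPU simply scans its own banks for a valid status, regardless of where the
error originated. A rough sketch (MSR numbers per the Intel SDM; rdmsr()
and handle_bank_error() are placeholders for whatever the handler uses):

    /* Every CPU that receives MCE# scans its MCA banks.  The banks are
     * just reporting registers; the error itself may have originated in
     * memory, the memory controller, QPI, a cache, and so on. */
    #define MSR_IA32_MCG_CAP    0x179
    #define MSR_IA32_MC0_STATUS 0x401
    #define MCi_STATUS_VAL      (1ULL << 63)

    unsigned int i, nr_banks = rdmsr(MSR_IA32_MCG_CAP) & 0xff;

    for ( i = 0; i < nr_banks; i++ )
    {
        uint64_t status = rdmsr(MSR_IA32_MC0_STATUS + 4 * i);
        if ( status & MCi_STATUS_VAL )    /* bank i reported an error */
            handle_bank_error(i, status);
    }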

> 
>>>> +        if ( !test_and_set_bool(v->mce_pending) &&
>>>> +            ((d->is_hvm) ? 1 :
>>>> +            guest_has_trap_callback(d, v->vcpu_id, TRAP_machine_check)) )
>>> 
>>> Quite strange a way to say
>>> 
>>>             (d->is_hvm ||
>>>              guest_has_trap_callback(d, v->vcpu_id, TRAP_machine_check))
>> 
>> For HVM it's OK to just set v->mce_pending, while for PV it also
>> needs to check whether the guest callback has been registered.
> 
> Sure. I was just asking why you use a conditional operator when
> the simpler || would do.

Ah, I see. That's better.
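
With that simplification, the injection loop would read roughly as follows
(a sketch assembled from the patch hunks quoted in this thread; the
enclosing function and error handling are abridged):

    /* Broadcast vMCE#: mark every vcpu of the domain pending and kick
     * it, rather than injecting only into vcpu0. */
    struct vcpu *v;

    for_each_vcpu ( d, v )
    {
        if ( !test_and_set_bool(v->mce_pending) &&
             (d->is_hvm ||
              guest_has_trap_callback(d, v->vcpu_id, TRAP_machine_check)) )
        {
            mce_printk(MCE_VERBOSE, "MCE: inject vMCE to dom%d vcpu%d\n",
                       d->domain_id, v->vcpu_id);
            vcpu_kick(v);
        }
    }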

> 
>>>> +        {
>>>> +            mce_printk(MCE_VERBOSE, "MCE: inject vMCE to dom%d vcpu%d\n",
>>>> +                       d->domain_id, v->vcpu_id);
>>>> +            vcpu_kick(v);
>>>> +        }
>>>> +        else
>>>> +        {
>>>> +            mce_printk(MCE_QUIET, "Fail to inject vMCE to dom%d vcpu%d\n",
>>>> +                       d->domain_id, v->vcpu_id);
>>>> +            return -1;
>>> 
>>> Why do you bail here? This is particularly bad if v->mce_pending
>>> was already set on some vCPU (as that could simply mean the guest
>>> just didn't get around to handle the vMCE yet).
>> 
>> The case where v->mce_pending is already set when a new vMCE comes
>> in will result in a domain crash. I don't think the guest can still
>> handle this case; e.g. on bare metal, if a 2nd MCE occurs while the
>> 1st MCE has not yet been handled, the OS reboots directly.
> 
> Sorry, no. This goes back to the questionable nature of the
> broadcasting above - if a CPU encounters a UCR itself and gets
> signaled of a remote CPU having encountered one, it should be
> able to cope. Otherwise the risk of (pointless!) system death
> would increase with the number of CPUs.
> 
>>> Also, how does this whole change interact with vmce_{rd,wr}msr()?
>>> The struct bank_entry instances live on a per-domain list, so the
>>> vMCE being delivered to all vCPU-s means they will all race for the
>>> single entry (and might erroneously access others, particularly in
>>> vmce_wrmsr()'s MCG_STATUS handling).
>> 
>> Yes, I agree.
>> However, we have recently done vMCE re-design work (currently under
>> internal review) and will present it soon. In the new approach, MSRs
>> are per-vcpu instead of per-domain, so the MSR race issue (and the
>> effort to make it clean) becomes moot. That's why this bugfix patch
>> only touches vMCE# injection itself.
> 
> I understand that you want the simpler version as bug fix, but
> you didn't answer whether it was at all verified that the single
> per-domain entry logic would actually cope with the broadcast
> delivery logic you're adding here. I'm afraid the per-vCPU entry
> logic is a prerequisite to this change (in whatever form it may end
> up going in).
> 
> Jan

Agreed, let's hold it. After we present the new vMCE approach, we can do it at 
that time, with no race issue then.
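
Roughly, the direction is as follows (type and field names here are
illustrative only, not the final interface):

    /* Illustrative per-vcpu vMCE state (names invented for this sketch).
     * With MCG_STATUS and the bank registers held per vcpu, concurrent
     * vmce_rdmsr()/vmce_wrmsr() calls from different vcpus no longer
     * race on a shared per-domain list of bank entries. */
    struct vmce_bank {
        uint64_t mci_status;
        uint64_t mci_addr;
        uint64_t mci_misc;
    };

    struct vmce {
        uint64_t mcg_status;          /* per vcpu, not per domain */
        struct vmce_bank bank[2];     /* e.g. two virtual banks */
    };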

Thanks,
Jinsong
_______________________________________________
Xen-devel mailing list
Xen-devel@xxxxxxxxxxxxx
http://lists.xen.org/xen-devel