[Date Prev][Date Next][Thread Prev][Thread Next][Date Index][Thread Index] Re: [Xen-devel] [PATCH 2/2] Xen/vMCE: bugfix to remove problematic is_vmce_ready check
Christoph Egger wrote: > On 06.05.13 18:00, Liu, Jinsong wrote: >> Christoph Egger wrote: >>> On 06.05.13 17:14, Liu, Jinsong wrote: >>>> Egger, Christoph wrote: >>>>> On 06.05.13 17:00, Liu, Jinsong wrote: >>>>>> Christoph Egger wrote: >>>>>>> On 06.05.13 11:50, Liu, Jinsong wrote: >>>>>>>> Christoph Egger wrote: >>>>>>>>> On 06.05.13 11:24, Liu, Jinsong wrote: >>>>>>>>>> Jan Beulich wrote: >>>>>>>>>>>>>> On 06.05.13 at 10:54, Christoph Egger <chegger@xxxxxxxxx> >>>>>>>>>>>>>> wrote: >>>>>>>>>>>> On 03.05.13 17:51, Liu, Jinsong wrote: >>>>>>>>>>>>> Jan Beulich wrote: >>>>>>>>>>>>>>>>> On 03.05.13 at 16:16, "Liu, Jinsong" >>>>>>>>>>>>>>>>> <jinsong.liu@xxxxxxxxx> wrote: >>>>>>>>>>>>>>> Jan Beulich wrote: >>>>>>>>>>>>>>>>>>> On 03.05.13 at 10:41, "Liu, Jinsong" >>>>>>>>>>>>>>>>>>> <jinsong.liu@xxxxxxxxx> wrote: >>>>>>>>>>>>>>>>> Jan Beulich wrote: >>>>>>>>>>>>>>>>>>>>> On 27.04.13 at 10:38, "Liu, Jinsong" >>>>>>>>>>>>>>>>>>>>> <jinsong.liu@xxxxxxxxx> wrote: >>>>>>>>>>>>>>>>>>> From 9098666db640183f894b9aec09599dd32dddb7fa Mon >>>>>>>>>>>>>>>>>>> Sep 17 00:00:00 2001 From: Liu Jinsong >>>>>>>>>>>>>>>>>>> <jinsong.liu@xxxxxxxxx> Date: Sat, 27 Apr 2013 >>>>>>>>>>>>>>>>>>> 22:37:35 +0800 >>>>>>>>>>>>>>>>>>> Subject: [PATCH 2/2] Xen/vMCE: bugfix to remove >>>>>>>>>>>>>>>>>>> problematic is_vmce_ready check >>>>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>>>> is_vmce_ready() is problematic: >>>>>>>>>>>>>>>>>>> * For dom0, it checks if virq bind to dom0 mcelog >>>>>>>>>>>>>>>>>>> driver. If not, it results dom0 crash. However, it's >>>>>>>>>>>>>>>>>>> problematic and overkilled since mcelog as a dom0 >>>>>>>>>>>>>>>>>>> feature could be enabled/disabled per dom0 option: >>>>>>>>>>>>>>>>>>> (XEN) MCE: This error page is ownded by DOM 0 (XEN) >>>>>>>>>>>>>>>>>>> DOM0 not ready for vMCE (XEN) domain_crash called >>>>>>>>>>>>>>>>>>> from mcaction.c:133 (XEN) Domain 0 reported crashed >>>>>>>>>>>>>>>>>>> by domain 32767 on cpu#31: (XEN) Domain 0 crashed: >>>>>>>>>>>>>>>>>>> rebooting machine in 5 seconds. (XEN) Resetting >>>>>>>>>>>>>>>>>>> with ACPI MEMORY or I/O RESET_REG. >>>>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>>>> * For dom0, if really need check, it should check >>>>>>>>>>>>>>>>>>> whether vMCE injection for dom0 ready (say, >>>>>>>>>>>>>>>>>>> exception trap bounce check, which has been done at >>>>>>>>>>>>>>>>>>> inject_vmce()), not check dom0 mcelog ready (which >>>>>>>>>>>>>>>>>>> has been done at mce_softirq() before send global >>>>>>>>>>>>>>>>>>> virq to dom0). >>>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>>> Following the argumentation above, I wonder which of >>>>>>>>>>>>>>>>>> the other "goto vmce_failed" are really appropriate, >>>>>>>>>>>>>>>>>> i.e. whether the patch shouldn't be extended (at >>>>>>>>>>>>>>>>>> least for the Dom0 case). >>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>> You mean other 'goto vmce_failed' are also not >>>>>>>>>>>>>>>>> appropriate (I'm not quite clear your point)? >>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>> Yes. >>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>> Would you please point out which point you think not >>>>>>>>>>>>>>>>> appropriate? >>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>> I question whether it is correct/necessary to crash the >>>>>>>>>>>>>>>> domain in any of those failure cases. Perhaps when we >>>>>>>>>>>>>>>> fail to unmap the page it is, but failure of >>>>>>>>>>>>>>>> fill_vmsr_data() and inject_vmce() don't appear to be >>>>>>>>>>>>>>>> valid reasons once the is_vmce_ready() path is being >>>>>>>>>>>>>>>> dropped. >>>>>>>>>>>>>>> >>>>>>>>>>>>>>> For fill_vmsr_data(), it failed only when >>>>>>>>>>>>>>> MCG_STATUS_MCIP bit still set when next vMCE# occur, >>>>>>>>>>>>>>> means the 2nd vMCE# occur when the 1st vMCE# not >>>>>>>>>>>>>>> handled yet. Per SDM it should shutdown. >>>>>>>>>>>>>>> >>>>>>>>>>>>>>> For inject_vmce(), it failed when >>>>>>>>>>>>>>> 1). vcpu is still mce_pending, or >>>>>>>>>>>>>>> 2). pv not register trap callback >>>>>>>>>>>>>>> Maybe it's some overkilled for dom0 (for other guest, >>>>>>>>>>>>>>> it's ok to kill them), but any graceful way to quit? >>>>>>>>>>>>>> >>>>>>>>>>>>>> Just exit and do nothing (except perhaps log a rate >>>>>>>>>>>>>> limited message)? >>>>>>>>>>>>>> >>>>>>>>>>>>> >>>>>>>>>>>>> One concern of quiet exit is, the error will be totally >>>>>>>>>>>>> ignored by guest --> it >>>>>>>>>>>> didn't get preperly handled, and may recursively occur to >>>>>>>>>>>> make worse error --> it's better to kill guest under such >>>>>>>>>>>> case. >>>>>>>>>>>>> >>>>>>>>>>>>>>> or, considering it rarely happens, how about keep >>>>>>>>>>>>>>> current way (kill guest no matter dom0 or not)? >>>>>>>>>>>>>> >>>>>>>>>>>>>> Possibly - I was merely asking why this one condition was >>>>>>>>>>>>>> found to be too strict, while the others are being left >>>>>>>>>>>>>> as is. >>>>>>>>>>>>>> >>>>>>>>>>>>>> Jan >>>>>>>>>>>>> >>>>>>>>>>>>> Ah, the reason of removing is_vmce_ready check is, it's >>>>>>>>>>>>> problematic (check mcelog driver, not vmce tap callback), >>>>>>>>>>>>> and overkilled (since defaultly dom0 will not start mcelog >>>>>>>>>>>>> driver, under which case system will crash whenever vmce >>>>>>>>>>>>> inject to dom0) >>>>>>>>>>>>> >>>>>>>>>>>>> --> So patch 2/2 is not too strict for dom0. >>>>>>>>>>>>> >>>>>>>>>>>> >>>>>>>>>>>> Please keep in mind the mcelog userland/kernel interface is >>>>>>>>>>>> not designed with xen in mind. mcelog cannot report which >>>>>>>>>>>> guest is impacted for example, although xen reports that >>>>>>>>>>>> to dom0. I object 'fixing' the hypervisor to come over >>>>>>>>>>>> with mcelog drawbacks. I prefer fixing Dom0 instead. >>>>>>>>>>>> >>>>>>>>>> >>>>>>>>>> Sure, xen mcelog driver in linux is implemented by me :-) >>>>>>>>>> This patch does not intend to 'fix' hypervisor but just avoid >>>>>>>>>> overkilled system (when xen mcelog driver in dom0 not loaded >>>>>>>>>> as default). >>>>>>>>> >>>>>>>>> I assume dom0 w/o xen mcelog driver active means dom0 is not >>>>>>>>> capable to deal with machine check errors. Is this correct? >>>>>>>>> >>>>>>>> >>>>>>>> No, w/o xen mcelog driver active, only user daemon 'mcelog' was >>>>>>>> affected. Dom0 is still capable of handling vmce as long as it >>>>>>>> registered trap callback (which is checked at hypervisor >>>>>>>> inject_vmce()). >>>>>>> >>>>>>> Oh, I thought the xen mcelog *driver* registers the trap >>>>>>> callback and in a Dom0 it additionally registers the logging >>>>>>> callback. What registers the logging callback and what registers >>>>>>> the trap callback? >>>>>>> >>>>>>> Christoph >>>>>>> >>>>>> >>> [...] >>>>> >>>>> Ok, that's how I understand the code, too. But you say above that >>>>> w/o this driver Dom0 is still capable to handle vmce. That >>>>> puzzle's me. >>>>> >>>>> Christoph >>>> >>>> Ah, what I meant is, w/o mcelog dirver, kernel still successfully >>>> register its mce handler as trap callback to hypervisor, hence the >>>> injection of vmce from hypervisor to dom0 is OK, and surely got >>>> properly handled at dom0. mcelog driver is just for error logging, >>>> getting error info from hypervisor and provide interface for user >>>> daemon tools 'mcelog'. >>> >>> Ah, so the trap handler is *always* installed in a Linux Dom0? >> >> Hypervisor didn't have such assumption, it's agnostic to guest. >> >>> >>> NetBSD Dom0/DomU still has no machine check support at all, so in >>> that case Dom0/DomU is not vmce ready. That means the vmce ready >>> check is needed. >>> >>> Christoph >> >> Hypervisor indeed check vmce ready or not at inject_vmce(). >> Point of patch 2/2 is, is_vmce_ready itself is problematic. > > ok. If I understand the problem right, xen sends a machine check > error for logging purpose via virq to dom0 regardless whether > it binds the virq or not. In the latter case dom0 crashes. Is this > correct? No, in latter case dom0 is not affected, only fails to receive error log (dom0 itself doesn't care error log info, it's just for user tools 'mcelog'). > > In this case xen should just log a rate limited message > (which a sysadmin should be able to understand) > and do nothing else. I think, this is what Jan already suggested. > No, that's another story --> Jan concern why other 'goto vmce_failed' are not overkilled. As for is_vmce_ready itself, patch 2/2 is necessary. Thanks, Jinsong _______________________________________________ Xen-devel mailing list Xen-devel@xxxxxxxxxxxxx http://lists.xen.org/xen-devel
|
Lists.xenproject.org is hosted with RackSpace, monitoring our |