RE: [Xen-devel] [PATCH] re-work MCA telemetry internals; use common code for Intel/AMD MCA
Christoph Egger <mailto:Christoph.Egger@xxxxxxx> wrote:
> On Tuesday 17 March 2009 04:24:35 Jiang, Yunhong wrote:
>> xen-devel-bounces@xxxxxxxxxxxxxxxxxxx <> wrote:
>>> The following patch reworks the MCA error telemetry handling inside Xen,
>>> and shares code between the Intel and AMD implementations as much as
>>> possible.
>>>
>>> I've had this patch sitting around for a while, but it wasn't ported to
>>> -unstable yet. I finished porting and testing it, and am submitting it
>>> now, because the Intel folks want to go ahead and submit their new
>>> changes, so we agreed that I should push our changes first.
>>>
>>> Brief explanation of the telemetry part: previously, the telemetry was
>>> accessed in a global array, with index variables used to access it.
>>> There were some issues with that: race conditions with regard to new
>>> machine checks (or CMCIs) coming in while handling the telemetry, and
>>> interaction with domains having been notified or not, which was a bit
>>> hairy. Our changes (I should say: Gavin Maltby's changes, as he did the
>>> bulk of this work for our 3.1 based tree, I merely ported/extended it to
>>> 3.3 and beyond) make telemetry access transactional (think of a
>>> database). Also, the internal database updates are atomic, since the
>>> final commit is done by a pointer swap. There is a brief explanation of
>>> the mechanism in mctelem.h. This patch also removes dom0->domU
>>> notification, which is ok, since Intel's upcoming changes will replace
>>> domU notification with a vMCE mechanism anyway.
>>>
>>> The common code part is pretty much what it says. It defines a common
>>> MCE handler, with a few hooks for the special needs of the specific CPUs.
>>>
>>> I've been told that Intel's upcoming patch will need to make some parts
>>> of the common code specific to the Intel CPU again, but we'll work
>>> together to use as much common code as possible.
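To illustrate the transactional idea described above (fill an entry privately, then publish it with a single pointer swap), here is a minimal user-space sketch in C11. It is not the code from mctelem.h or from the patch: the names telem_entry, telem_reserve, telem_dismiss, telem_commit and telem_consume_all, and the fixed-size pool, are assumptions made purely for illustration.

```c
#include <stdatomic.h>
#include <stddef.h>

/* Hypothetical telemetry entry; not the layout used by the patch. */
struct telem_entry {
    struct telem_entry *next;
    unsigned int bank;
    unsigned long long status;
};

#define TELEM_POOL_SIZE 4

static struct telem_entry pool[TELEM_POOL_SIZE];
static _Atomic int pool_used[TELEM_POOL_SIZE];      /* 0 = free, 1 = claimed */
static _Atomic(struct telem_entry *) committed;     /* head of published list */

/* Begin a transaction: claim a free entry that nobody else can see yet. */
struct telem_entry *telem_reserve(void)
{
    for (size_t i = 0; i < TELEM_POOL_SIZE; i++) {
        if (!atomic_exchange(&pool_used[i], 1))
            return &pool[i];
    }
    return NULL; /* pool exhausted: the caller has to drop this event */
}

/* Abort a transaction: hand the unpublished entry back to the pool. */
void telem_dismiss(struct telem_entry *e)
{
    atomic_store(&pool_used[e - pool], 0);
}

/* Commit: publish the filled entry with one compare-and-swap on the head. */
void telem_commit(struct telem_entry *e)
{
    struct telem_entry *head = atomic_load(&committed);
    do {
        e->next = head; /* head is refreshed by a failed compare-exchange */
    } while (!atomic_compare_exchange_weak(&committed, &head, e));
}

/* Consumer: detach the whole committed list with one pointer swap. */
struct telem_entry *telem_consume_all(void)
{
    return atomic_exchange(&committed, (struct telem_entry *)NULL);
}
```

A producer would call telem_reserve(), fill in bank and status, and then either telem_commit() or telem_dismiss(). A consumer never sees a half-filled entry, because an entry only becomes reachable through the swapped head pointer. Returning consumed entries to the pool is left out of this sketch.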
>>
>> Yes, as shown in our previous patch, we do change the current MCA handler.
>> The main changes are as follows:
>>
>> 1) Most importantly, we implement a softIRQ mechanism for the post-MCE
>> handler. The reason is that #MC can happen at any time. Firstly, that
>> makes the handler spin-lock unsafe: code like vcpu_schedule_lock_irq(v)
>> in the current MCA handler is sure to hang if that lock is already held
>> by an ISR. Secondly, the execution context is uncertain: the "current"
>> value in the current MCA handler may be incorrect (if set_current is
>> interrupted by #MC), the page ownership may be wrong (if it is still
>> changing under heap_lock protection), etc. So our patch handles #MC in
>> two steps. First, the MCA handler, which depends on the execution context
>> when the MCA happens (like checking whether it is in Xen context) and,
>> in particular, brings all CPUs to softIRQ context. Second, the softIRQ
>> handler (i.e. the post handler), which is spin-lock safe and runs with
>> all CPUs rendezvoused, so it can take more actions.
>>
>> 2) We implement a mechanism to handle the shared MSR resources, similar to
>> what we have done in the CMCI handler. As the Intel SDM states, some MC
>> resources are shared by multiple logical CPUs, so we implement an
>> ownership check.
>>
>> 3) As stated in the Linux MCA handler, on Intel platforms machine check
>> exceptions are always broadcast to all CPUs; we add support for that too.
>>
>> We have no idea how the issues in items 2 and 3 are handled on other
>> platforms, so we have no idea how to write a common handler for them. We
>> hope Christoph can provide more suggestions, or we can just keep them
>> different for different platforms.
>>
>> But I think item 1 is software related, so it can be an enhancement to
>> the common handler. The only thing I'm not sure about is whether we need
>> to bring all CPUs to softIRQ context on all platforms; maybe Christoph
>> can give more ideas.
>
> The featureset of AMD Athlon K7 and Intel Pentium III are the common
> denominator on x86.
> This is what can go into the common code.
> In order to utilize features from newer CPUs, allow registering function
> pointers and call them from the common code. Look into amd_k8.c
> and amd_f10.c for example code. I register a function pointer to read
> the new MSRs. It can easily be extended to utilize features of coming CPUs.

I suspect the difference between mce_intel.c and amd_xxx.c will be far
greater than the difference between amd_k8.c and amd_f10.c. For example,
issues 2 and 3 listed above (do they exist on your side?). And how about
the softIRQ mechanism? What do you think about applying it to both sides?

Thanks
Yunhong Jiang

>
>
>> Since we have reached consensus on most of the high-level idea of the MCA
>> handler (i.e. Xen takes action instead of dom0, use vMCE for guest MCA,
>> etc.; check the discussion with subject " [RFC] RAS(Part II)--MCA enalbing
>> in XEN"; the only thing left is the detailed method of how to pass the
>> recovery action information to dom0), maybe we can turn to this
>> second-level discussion of how to enhance the (common) MCA handler.
>>
>> Thanks
>> -- Yunhong Jiang
>>
>>> - Frank
>
>
> --
> ---to satisfy European Law for business letters:
> Advanced Micro Devices GmbH
> Karl-Hammerschmidt-Str. 34, 85609 Dornach b. Muenchen
> Geschaeftsfuehrer: Jochen Polster, Thomas M. McCoy, Giuliano Meroni
> Sitz: Dornach, Gemeinde Aschheim, Landkreis Muenchen
> Registergericht Muenchen, HRB Nr. 43632

_______________________________________________
Xen-devel mailing list
Xen-devel@xxxxxxxxxxxxxxxxxxx
http://lists.xensource.com/xen-devel
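The hook approach suggested in this thread (a common handler that calls optional, CPU-specific function pointers) could look roughly like the sketch below. This is not the interface from amd_k8.c, amd_f10.c or the patch: struct mca_hooks, mca_register_hooks, mca_common_collect, the record layout and the fake MSR reader are all hypothetical names and stand-ins chosen for illustration.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical per-CPU-family hook table; the real interface may differ. */
struct mca_hooks {
    /* Optional: read extra, model-specific telemetry for one bank. */
    uint64_t (*read_extended)(unsigned int bank);
};

/* Hypothetical record filled in by the common code. */
struct mca_record {
    unsigned int bank;
    uint64_t status;
    uint64_t extended;
    int has_extended;
};

static const struct mca_hooks *active_hooks; /* NULL until a family registers */

/* Called once by CPU-specific init code to plug in its extras. */
void mca_register_hooks(const struct mca_hooks *hooks)
{
    active_hooks = hooks;
}

/* Stand-in for reading a bank status MSR; a real handler would use rdmsr. */
uint64_t fake_read_status(unsigned int bank)
{
    return 0x8000000000000000ULL | bank;
}

/* Common path: baseline telemetry first, then the optional hook. */
void mca_common_collect(unsigned int bank, struct mca_record *rec)
{
    rec->bank = bank;
    rec->status = fake_read_status(bank);
    rec->extended = 0;
    rec->has_extended = 0;

    if (active_hooks != NULL && active_hooks->read_extended != NULL) {
        rec->extended = active_hooks->read_extended(bank);
        rec->has_extended = 1;
    }
}

/* Example of what a newer-CPU file might register (values are made up). */
uint64_t newer_cpu_read_extended(unsigned int bank)
{
    return 0x1000u + bank;
}

const struct mca_hooks newer_cpu_hooks = {
    .read_extended = newer_cpu_read_extended,
};
```

With no hooks registered, the common path collects only the baseline fields. After mca_register_hooks(&newer_cpu_hooks), the same common path also fills in the extended value, without any family-specific branches in the common code.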
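The two-step #MC handling discussed in this thread (a lock-free first step in exception context, followed by a deferred post handler where taking locks is safe) can be modelled in user space as follows. This is only a sketch under stated assumptions: it is not the Intel patch, the pending bitmask stands in for raising a softIRQ, the CPU rendezvous is omitted, and the names mce_exception_step, mce_post_handler, slots and handled_status are hypothetical.

```c
#include <stdatomic.h>
#include <stdint.h>

#define NR_CPUS 4

/* Per-CPU slot written only by step 1 on that CPU (hypothetical layout). */
struct mce_slot {
    uint64_t status;
};

static struct mce_slot slots[NR_CPUS];
static _Atomic unsigned int pending_mask;   /* bit n set: CPU n has work */
uint64_t handled_status[NR_CPUS];           /* filled in by the post handler */

/*
 * Step 1: runs in #MC context. It takes no locks and does not rely on
 * scheduler state; it only records the raw data and marks the work pending.
 */
void mce_exception_step(unsigned int cpu, uint64_t status)
{
    slots[cpu].status = status;
    /* Stand-in for raising a softIRQ; also publishes the slot write. */
    atomic_fetch_or(&pending_mask, 1u << cpu);
}

/*
 * Step 2: runs later, outside exception context, where spin locks could be
 * taken safely. Returns the number of CPUs whose events were processed.
 */
unsigned int mce_post_handler(void)
{
    unsigned int mask = atomic_exchange(&pending_mask, 0);
    unsigned int handled = 0;

    for (unsigned int cpu = 0; cpu < NR_CPUS; cpu++) {
        if (mask & (1u << cpu)) {
            /* A real post handler could lock and act on the event here. */
            handled_status[cpu] = slots[cpu].status;
            handled++;
        }
    }
    return handled;
}
```

The point of the split is that everything needing locks or a trustworthy execution context moves into the second function, while the first function stays small enough to be safe no matter what was interrupted.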