Xen project Mailing List

Re: [Xen-devel] RFC: MCA/MCE concept

To: Christoph Egger <Christoph.Egger@xxxxxxx>

From: Gavin Maltby <Gavin.Maltby@xxxxxxx>

Date: Wed, 06 Jun 2007 13:25:26 +0100

Cc: xen-devel@xxxxxxxxxxxxxxxxxxx, Keir Fraser <keir@xxxxxxxxxxxxx>

Delivery-date: Wed, 06 Jun 2007 05:24:32 -0700

List-id: Xen developer discussion <xen-devel.lists.xensource.com>

Hi,

On 06/06/07 12:57, Christoph Egger wrote:

For the first I've assumed so far that an event channel notification
of the MCA event will suffice;  as long as the hypervisor only polls
for correctable MCA errors at a low-frequency rate (currently 15s
interval) there is no danger of spamming that single notification.

Why polling?

Polling for correctable errors, but #MC as usual for others.  Setting
MCi_CTL bits for correctable errors does not produce a machine check,
so polling is the only approach unless one sets additional (and
undocumented, certainly for AMD chips) config bits.  What I was getting
at here is that polling at largish intervals for correctables is
the correct approach - trapping for them or polling at a high frequency
is bad because in cases where you have some form of solid correctable
error (say a single bad pin in a dimm socket affecting one or two ranks
of that dimm but never able to produce a UE) the trap handling and
diagnosis software consume the machine and things make little useful
forward progress.
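As a rough illustration of that polling pass, here is a minimal sketch in C. It assumes the architectural layout of the machine-check banks (MCi_STATUS at MSR 0x401 + 4*i, VAL in bit 63, UC in bit 61). The names `rdmsr_sim`, `wrmsr_sim`, `fake_status`, `NR_BANKS` and `poll_correctable` are hypothetical and stand in for real MSR access so the example can run in user space; this is not the Xen implementation.

```c
#include <stddef.h>
#include <stdint.h>

#define MSR_MC0_STATUS  0x401u          /* MCi_STATUS lives at 0x401 + 4*i */
#define MCi_STATUS_VAL  (1ULL << 63)    /* bank holds a valid error record */
#define MCi_STATUS_UC   (1ULL << 61)    /* error was not corrected */
#define POLL_INTERVAL_S 15              /* the low-frequency interval from the text */
#define NR_BANKS        4               /* hypothetical bank count */

/* Hypothetical stand-ins for rdmsr/wrmsr, backed by an array. */
static uint64_t fake_status[NR_BANKS];

static uint64_t rdmsr_sim(uint32_t msr)
{
    return fake_status[(msr - MSR_MC0_STATUS) / 4];
}

static void wrmsr_sim(uint32_t msr, uint64_t val)
{
    fake_status[(msr - MSR_MC0_STATUS) / 4] = val;
}

/*
 * One polling pass, meant to run once every POLL_INTERVAL_S seconds.
 * Logs and clears valid correctable errors; banks flagged UC are left
 * untouched here on the assumption that the #MC handler owns them.
 * Returns the number of correctable errors found.
 */
static unsigned poll_correctable(void (*log)(unsigned bank, uint64_t status))
{
    unsigned found = 0;

    for (unsigned bank = 0; bank < NR_BANKS; bank++) {
        uint32_t msr = MSR_MC0_STATUS + 4 * bank;
        uint64_t status = rdmsr_sim(msr);

        if (!(status & MCi_STATUS_VAL))
            continue;               /* nothing logged in this bank */
        if (status & MCi_STATUS_UC)
            continue;               /* uncorrectable: not the poller's job */

        if (log != NULL)
            log(bank, status);
        wrmsr_sim(msr, 0);          /* clear so the bank can log again */
        found++;
    }
    return found;
}
```

With a solid correctable error the bank refills between passes, so the interval bounds how much duplicate telemetry is produced per unit time.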


I still don't see why #MC for all kinds of errors is bad.

I'm talking about whether the hypervisor takes a machine check for an error or polls for it. We do not want #MC for correctable errors stopping the hypervisor from making progress. And if the hypervisor poll interval were too small, a solid error would again keep the hypervisor busy producing (mostly/all duplicate) error telemetry, and the diagnosis code in dom0 would burn cpu cycles, too.

How errors observed by the hypervisor, be they from #MC or from a poll, are propagated to the domains is unimportant from this point of view - e.g., if we decide to take error telemetry discovered via a poll in the hypervisor and propagate it to the domain, pretending it is indistinguishable from a machine check, that will not hurt or limit the domain processing.

An untested design I had in mind, unashamedly influenced by what we do in Solaris, was to have some common memory shared between hypervisor and domain, into which the hypervisor produces error telemetry and from which the domain consumes that telemetry. Producing and consuming is lockless using compare-and-swap. There are two queues in this shared memory - one for uncorrectable error telemetry and one for correctable error telemetry.

When the domain gets whatever event notifies it of telemetry for processing, it processes the queues; the event would be synchronous for uncorrectable errors (i.e., the domain must process the telemetry right now) or asynchronous in the case of correctable errors (process when convenient). The separation of CE and UE queues stops CEs from flooding the more important UE events (you can always drop CEs if there is no more space, but you can never drop UEs).

[cut]
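To make the two-queue idea above a little more concrete, here is a minimal sketch in C11. It is only one possible shape for it, not the Solaris or Xen code: it uses a bounded ring with per-slot sequence numbers (a well-known lock-free queue layout) where slots are claimed by compare-and-swap. All names (`mc_record`, `mc_queue`, `mc_shared`, `MCQ_SLOTS`, `mc_post_ce`, `mc_post_ue`) and the record layout are assumptions made for illustration.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

#define MCQ_SLOTS 8u    /* hypothetical ring size; must be a power of two */

struct mc_record {      /* hypothetical telemetry record */
    uint32_t bank;
    uint64_t status;
    uint64_t addr;
};

struct mc_slot {
    atomic_size_t seq;  /* tells producers/consumers whose turn it is */
    struct mc_record rec;
};

struct mc_queue {
    struct mc_slot slot[MCQ_SLOTS];
    atomic_size_t head;     /* next position to produce into */
    atomic_size_t tail;     /* next position to consume from */
    atomic_size_t dropped;  /* records discarded because the ring was full */
};

struct mc_shared {          /* what would live in the shared memory */
    struct mc_queue ue;     /* uncorrectable errors: never dropped */
    struct mc_queue ce;     /* correctable errors: dropped when full */
};

static void mcq_init(struct mc_queue *q)
{
    for (size_t i = 0; i < MCQ_SLOTS; i++)
        atomic_init(&q->slot[i].seq, i);
    atomic_init(&q->head, 0);
    atomic_init(&q->tail, 0);
    atomic_init(&q->dropped, 0);
}

static void mc_shared_init(struct mc_shared *sh)
{
    mcq_init(&sh->ue);
    mcq_init(&sh->ce);
}

/* Producer side (hypervisor). Returns false if the ring is full. */
static bool mcq_produce(struct mc_queue *q, const struct mc_record *rec)
{
    assert(rec != NULL);
    size_t pos = atomic_load(&q->head);
    for (;;) {
        struct mc_slot *s = &q->slot[pos & (MCQ_SLOTS - 1)];
        intptr_t diff = (intptr_t)atomic_load(&s->seq) - (intptr_t)pos;
        if (diff == 0) {
            /* Slot is free: claim it with compare-and-swap on head. */
            if (atomic_compare_exchange_weak(&q->head, &pos, pos + 1)) {
                s->rec = *rec;
                atomic_store(&s->seq, pos + 1);   /* publish to consumer */
                return true;
            }
            /* CAS failure reloaded pos; retry with the new position. */
        } else if (diff < 0) {
            return false;                         /* ring is full */
        } else {
            pos = atomic_load(&q->head);          /* lost a race; reload */
        }
    }
}

/* Consumer side (domain). Returns false if the ring is empty. */
static bool mcq_consume(struct mc_queue *q, struct mc_record *out)
{
    size_t pos = atomic_load(&q->tail);
    for (;;) {
        struct mc_slot *s = &q->slot[pos & (MCQ_SLOTS - 1)];
        intptr_t diff = (intptr_t)atomic_load(&s->seq) - (intptr_t)(pos + 1);
        if (diff == 0) {
            if (atomic_compare_exchange_weak(&q->tail, &pos, pos + 1)) {
                *out = s->rec;
                atomic_store(&s->seq, pos + MCQ_SLOTS);  /* free the slot */
                return true;
            }
        } else if (diff < 0) {
            return false;                         /* ring is empty */
        } else {
            pos = atomic_load(&q->tail);
        }
    }
}

/* CE policy: if there is no space, count the drop and carry on. */
static bool mc_post_ce(struct mc_shared *sh, const struct mc_record *rec)
{
    if (mcq_produce(&sh->ce, rec))
        return true;
    atomic_fetch_add(&sh->ce.dropped, 1);
    return false;
}

/* UE policy: no silent drop; a false return must be handled by the caller. */
static bool mc_post_ue(struct mc_shared *sh, const struct mc_record *rec)
{
    return mcq_produce(&sh->ue, rec);
}
```

Because the rings are separate, a flood of correctable errors can only fill the CE ring. What the hypervisor should do if the UE ring itself is ever full is left open here, as it is in the description above.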

After some code reading I found a nmi_pending, nmi_masked and nmi_addr in

[cut]

Still chewing on that ...

Cheers

Gavin

_______________________________________________
Xen-devel mailing list
Xen-devel@xxxxxxxxxxxxxxxxxxx
http://lists.xensource.com/xen-devel
