Re: [Xen-devel] RFC: MCA/MCE concept
Hi,

On 06/06/07 12:57, Christoph Egger wrote:
>>>> For the first I've assumed so far that an event channel notification of the MCA event will suffice; as long as the hypervisor only polls for correctable MCA errors at a low-frequency rate (currently 15s interval) there is no danger of spamming that single notification.
>>>
>>> Why polling?
>>
>> Polling for correctable errors, but #MC as usual for others. Setting MCi_CTL bits for correctable errors does not produce a machine check, so polling is the only approach unless one sets additional (and undocumented, certainly for AMD chips) config bits. What I was getting at here is that polling at largish intervals for correctables is the correct approach. Trapping for them or polling at a high frequency is bad because in cases where you have some form of solid correctable error (say a single bad pin in a DIMM socket affecting one or two ranks of that DIMM but never able to produce a UE) the trap handling and diagnosis software consume the machine and things make little useful forward progress.
>
> I still don't see why #MC for all kinds of errors is bad.

I'm talking about whether the hypervisor takes a machine check for an error or polls for it. We do not want #MC for correctable errors stopping the hypervisor from making progress. And if the hypervisor poll interval were too small, a solid error would again keep the hypervisor busy producing (mostly or entirely duplicate) error telemetry, and the diagnosis code in dom0 would burn CPU cycles, too.

How errors observed by the hypervisor, be they from #MC or from a poll, are propagated to the domains is unimportant from this point of view. For example, if we decide to take error telemetry discovered via a poll in the hypervisor and propagate it to the domain, pretending it is indistinguishable from a machine check, that will not hurt or limit the domain processing.
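To make the polling argument a bit more concrete, here is a rough, untested sketch in C of a low-frequency poll for correctable errors. It is not Xen code: `struct ce_poller`, `ce_poll_due` and `ce_poll_banks` are names made up for illustration, and the bank status registers are modelled as a plain array rather than read from MSRs. The only hardware facts it relies on are the architectural MCi_STATUS bits VAL (bit 63) and UC (bit 61); the 15-second interval is the figure quoted above.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Architectural MCi_STATUS bits. */
#define MCI_STATUS_VAL (1ULL << 63)   /* register holds a valid error */
#define MCI_STATUS_UC  (1ULL << 61)   /* error was NOT corrected */

/* Assumption: poll period in seconds, the 15s figure from the thread. */
#define CE_POLL_INTERVAL_SEC 15

/* Hypothetical poller state; not an existing Xen structure. */
struct ce_poller {
    uint64_t last_poll_sec;           /* time of the last completed poll */
};

/* Nonzero once a full interval has elapsed since the last poll. */
static int ce_poll_due(const struct ce_poller *p, uint64_t now_sec)
{
    return now_sec - p->last_poll_sec >= CE_POLL_INTERVAL_SEC;
}

/*
 * Scan snapshots of the bank status registers and count valid, corrected
 * errors. Uncorrected errors are left alone because they are expected to
 * arrive via #MC. Zeroing a slot models clearing MCi_STATUS after the
 * telemetry has been captured.
 */
static size_t ce_poll_banks(struct ce_poller *p, uint64_t now_sec,
                            uint64_t *status, size_t nbanks)
{
    size_t found = 0;

    assert(status != NULL || nbanks == 0);
    if (!ce_poll_due(p, now_sec))
        return 0;                     /* too early: do no work at all */

    for (size_t i = 0; i < nbanks; i++) {
        if ((status[i] & MCI_STATUS_VAL) && !(status[i] & MCI_STATUS_UC)) {
            found++;
            status[i] = 0;
        }
    }
    p->last_poll_sec = now_sec;
    return found;
}
```

With a solid correctable error the bank refills almost immediately after being cleared, so the cost of this path is bounded by the interval rather than by the error rate, which is the point of polling at largish intervals.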
An untested design I had in mind, unashamedly influenced by what we do in Solaris, was to have some common memory shared between hypervisor and domain into which the hypervisor produces error telemetry and the domain consumes that telemetry. Producing and consuming is lockless using compare-and-swap. There are two queues in this shared memory: one for uncorrectable error telemetry and one for correctable error telemetry.

When the domain gets whatever event to notify it of telemetry for processing, it processes the queues; the event would be synchronous for uncorrectable errors (i.e., the domain must process the telemetry right now) or asynchronous in the case of correctable errors (process when convenient). The separation of CE and UE queues stops CEs from flooding the more important UE events (you can always drop CEs if there is no more space, but you can never drop UEs).

[cut]

> After some code reading I found a nmi_pending, nmi_masked and nmi_addr in
> [cut]

Still chewing on that ...

Cheers

Gavin