[Date Prev][Date Next][Thread Prev][Thread Next][Date Index][Thread Index]

Re: [Xen-devel] [PATCH] re-work MCA telemetry internals; use common code for Intel/AMD MCA



Ke, Liping wrote:
Hi, Frank

We did some small test here, found the CMCI problem is caused by the 
"mce_banks_owned" bitmap param passing.
When CMCI happened, we print the bitmap (in smp_cmci_interrupt) value, it's correct (For cpu0, 16c). When passing into mcheck_mca_logout, it turned to be "0xFFFF~~FFF" which is wrong. Only for your info -:)

Still, I suggest split the patch since this patch is realy big  -:)

Thanks a lot for your help!
Criping

Ok, since the patch is already in (thanks Keir!), let's work out any issues you may see with it.

When you say that the mce_banks_owned value is wrong in mcheck_mca_logout, do you mean that the parameter passing somehow went wrong? Is the value ok right before the call, but not in mcheck_mca_logout itself? That would be strange.

You mentioned the polling code: it didn't actually change much, in fact, the general polling code now is pretty much the same as the polling code that was originally in mce_intel.c. That code is very similar. Some AMD code contained AMD-specific things, and has not been touched (it still has its own poll function).

In general, the change is:

* machine_check_poll (going over the banks to look for non-fatal errors) and the loops inside the mcheck handlers that also walk the MC banks, have been combined in one function, which is mcheck_mca_logout. mcheck_mca_logout still takes a context argument, as machine_check_poll did.

* The mcheck handlers themselves mostly just call mcheck_cmn_handler, which calls mcheck_mca_logout, looks at the context, and then decides what to do.

* The mctelem structure as returned by mcheck_mca_logout, contains the relevant telemetry for the current error. If it should be put in the "database" for dom0 retrieval, use mctelem_commit to put it there. Otherwise, the usual course of action is to print the values and uses mctelem_dismiss to discard the information.

There is one thing that needs to be addressed: in your Intel-specific code, you do not reset the per-bank status in the case of a fatal error, so that the BIOS may look at the MSRs. mcheck_mca_logout does always reset the values. The best way to fix this is probably to pass in a flag to mcheck_mca_logout to tell it whether it should reset the status MSRs or not. The caller can then decide if they should be reset later.

Let me know if you have any more issues, I'll be happy to help merge your code and test it.

- Frank



_______________________________________________
Xen-devel mailing list
Xen-devel@xxxxxxxxxxxxxxxxxxx
http://lists.xensource.com/xen-devel


 


Rackspace

Lists.xenproject.org is hosted with RackSpace, monitoring our
servers 24x7x365 and backed by RackSpace's Fanatical Support®.