[Date Prev][Date Next][Thread Prev][Thread Next][Date Index][Thread Index] Re: [Xen-devel] Re: [RFC] RAS(Part II)--MCA enalbing in XEN
On Thursday 05 March 2009 16:19:40 Jiang, Yunhong wrote: > Christoph Egger <mailto:Christoph.Egger@xxxxxxx> wrote: > > MC_ACT_CACHE_SHIRNK <-- typo. should be MC_ACT_CACHE_SHRINK > > Ahh, yes, I will fix it. > > > The L3 cache index disable feature works like this: > > > > You read the bits 17:6 from the MSR 0xC0000408 (which is MC4_MISC1) > > and write it into the index field. This MSR does not belong to > > the standard > > mc bank data and is therefore provided by mcinfo_extended. > > The index field are the bits 11:0 of the PCI function 3 register "L3 > > Cache Index Disable". > > So what's the offset of "L3 Cache Index Disable"? Is it in 256 byte or 4K > byte? Sorry, which offset do you mean ? > > For the PCI access, I'd prefer to have xen to control all these, i.e. even > if dom0 want to disable the L3 cache, it is done through a hypercall. The > reason is, Xen control the CPU, so keep it in Xen will make things simpler. > > Of course, it is ok for me too, if you want to keep Xen for #MC handler and > Dom0 for CE handler. We still need to define the rules to prevent interferes and clarify how to deal with Dom0/DomU going wild and breaking the rules. > > Why is the recover action bound to the bank ? > > I would like to see a struct mcinfo_recover rather extending > > struct mcinfo_bank. That gives us flexibility. > > I'd get input from Frank or Gavin. Place mcinfo_recover in mcinfo_back has > advantage of keep connection of the error source and the action, but it do > make the mcinfo_bank more complex. Or we can keep the cpu/bank information > in the mcinfo_recover also, so that we keep the flexibility and don't lose > the connection. From your suggestions I prefer the last one, but is still limited due to the assumption that each struct mcinfo_bank and each struct mcinfo_extended stands for exactly one error. This assumption doesn't cover follow-up errors which may be needed to determine the real root cause. Some of them may even be ignored depending on what is going on. Christoph > > Thanks > Yunhong Jiang > > > Christoph > > > > On Thursday 05 March 2009 09:31:27 Jiang, Yunhong wrote: > >> Christoph/Frank, Followed is the interface definition, please have a > >> look. > >> > >> Thanks > >> Yunhong Jiang > >> > >> 1) Interface between Xen/dom0 for passing xen's recovery action > >> information to dom0. Usage model: After offlining broken page, Xen might > >> pass its page-offline recovery action result information to dom0. Dom0 > >> will save the information in non-volatile memory for further proactive > >> actions, such as offlining the easy-broken page early when doing next > >> reboot. > >> > >> > >> struct page_offline_action > >> { > >> /* Params for passing the offlined page number to DOM0 */ > >> uint64_t mfn; uint64_t status; /* Similar to page offline hypercall */ > >> }; > >> > >> struct cpu_offline_action > >> { > >> /* Params for passing the identity of the offlined CPU to DOM0 */ > >> uint32_t mc_socketid; uint16_t mc_coreid; > >> uint16_t mc_core_threadid; > >> }; > >> > >> struct cache_shrink_action > >> { > >> /* TBD, Christoph, please fill it */ > >> }; > >> > >> /* Recover action flags, giving recovery result information to guest */ > >> /* Recovery successfully after taking certain recovery actions below */ > >> #define REC_ACT_RECOVERED (0x1 << 0) > >> /* For solaris's usage that dom0 will take ownership when crash */ > >> #define REC_ACT_RESET (0x1 << 2) > >> /* No action is performed by XEN */ > >> #define REC_ACT_INFO (0x1 << 3) > >> > >> /* Recover action type definition, valid only when flags & > >> REC_ACT_RECOVERED */ #define MC_ACT_PAGE_OFFLINE 1 > >> #define MC_ACT_CPU_OFFLINE 2 > >> #define MC_ACT_CACHE_SHIRNK 3 > >> > >> struct recovery_action > >> { > >> uint8_t flags; > >> uint8_t action_type; > >> union > >> { > >> struct page_offline_action page_retire; > >> struct cpu_offline_action cpu_offline; > >> struct cache_shrink_action cache_shrink; > >> uint8_t pad[MAX_ACTION_SIZE]; > >> } action_info; > >> } > >> > >> struct mcinfo_bank { > >> struct mcinfo_common common; > >> > >> uint16_t mc_bank; /* bank nr */ > >> uint16_t mc_domid; /* Usecase 5: domain referenced by mc_addr on > >> dom0 * and if mc_addr is valid. Never valid on DomU. */ uint64_t > >> mc_status; /* bank status */ > >> uint64_t mc_addr; /* bank address, only valid > >> * if addr bit is set in mc_status */ > >> uint64_t mc_misc; uint64_t mc_ctrl2; > >> uint64_t mc_tsc; > >> /* Recovery action is performed per bank */ > >> struct recovery_action action; > >> }; > >> > >> 2) Below two interfaces are for MCA processing internal use. > >> a. pre_handler will be called earlier in MCA ISR context, mainly for > >> early need_reset detection for avoiding log missing (flag MCA_RESET). > >> Also, pre_handler might be able to find the impacted domain if possible. > >> b. mca_error_handler is actually a (error_action_index, > >> recovery_handler pointer) pair. The defined recovery_handler function > >> performs the actual recovery operations in softIrq context after the > >> per_bank MCA error matching the corresponding mca_code index. If > >> pre_handler can't judge the impacted domain, recovery_handler must > >> figure it out. > >> > >> /* Error has been recovered successfully */ > >> #define MCA_RECOVERD 0 > >> /* Error impact one guest as stated in owner field */ #define MCA_OWNER > >> 1 /* Error can't be recovered and need reboot system */ #define > >> MCA_RESET 2 /* Error should be handled in softIRQ context */ > >> #define MCA_MORE_ACTION 3 > >> > >> struct mca_handle_result > >> { > >> uint32_t flags; > >> /* Valid only when flags & MCA_OWNER */ > >> domid_d owner; > >> /* valid only when flags & MCA_RECOVERD */ > >> struct recovery_action *action; > >> }; > >> > >> struct mca_error_handler > >> { > >> /* > >> * Assume we will need only architecture defined code. If the index > >> can't be setup by * mca_code, we will add a function to do the (index, > >> recovery_handler) mapping check. * This mca_code represents the recovery > >> handler pointer index for identifying this * particular error's > >> corresponding recover action */ > >> uint16_t mca_code; > >> > >> /* Handler to be called in softIRQ handler context */ > >> int recovery_handler(struct mcinfo_bank *bank, > >> struct mcinfo_global *global, > >> struct mcinfo_extended *extention, > >> struct mca_handle_result *result); > >> > >> }; > >> > >> struct mca_error_handler intel_mca_handler[] = > >> { > >> .... > >> }; > >> > >> struct mca_error_handler amd_mca_handler[] = > >> { > >> .... > >> }; > >> > >> > >> /* HandlVer to be called in MCA ISR in MCA context */ > >> int intel_mca_pre_handler(struct cpu_user_regs *regs, > >> struct mca_handle_result *result); > >> > >> int amd_mca_pre_handler(struct cpu_user_regs *regs, > >> struct mca_handle_result *result); > >> > >> Frank.Vanderlinden@xxxxxxx <mailto:Frank.Vanderlinden@xxxxxxx> wrote: > >>> Jiang, Yunhong wrote: > >>>> Frank/Christopher, can you please give more comments for it, or you > >>>> are OK with this? For the action reporting mechanism, we will send out > >>>> a proposal for review soon. > >>> > >>> I'm ok with this. We need a little more information on the AMD > >>> mechanism, but it seems to me that we can fit this in. > >>> > >>> Sometime this week, I'll also send out the last of our changes that > >>> haven't been sent upstream to xen-unstable yet. Maybe we can combine > >>> some things in to one patch, like the telemetry handling changes that > >>> Gavin did. The other changes are error injection (for debugging) and > >>> panic crash dump support for our FMA tools, but those are probably only > >>> interesting to us. > >>> > >>> - Frank > > > > -- > > ---to satisfy European Law for business letters: > > Advanced Micro Devices GmbH > > Karl-Hammerschmidt-Str. 34, 85609 Dornach b. Muenchen > > Geschaeftsfuehrer: Jochen Polster, Thomas M. McCoy, Giuliano Meroni > > Sitz: Dornach, Gemeinde Aschheim, Landkreis Muenchen > > Registergericht Muenchen, HRB Nr. 43632 -- ---to satisfy European Law for business letters: Advanced Micro Devices GmbH Karl-Hammerschmidt-Str. 34, 85609 Dornach b. Muenchen Geschaeftsfuehrer: Jochen Polster, Thomas M. McCoy, Giuliano Meroni Sitz: Dornach, Gemeinde Aschheim, Landkreis Muenchen Registergericht Muenchen, HRB Nr. 43632 _______________________________________________ Xen-devel mailing list Xen-devel@xxxxxxxxxxxxxxxxxxx http://lists.xensource.com/xen-devel
|
Lists.xenproject.org is hosted with RackSpace, monitoring our |