[Date Prev][Date Next][Thread Prev][Thread Next][Date Index][Thread Index]

Re: [Xen-devel] Re: [RFC] RAS(Part II)--MCA enalbing in XEN



On Thursday 05 March 2009 16:19:40 Jiang, Yunhong wrote:
> Christoph Egger <mailto:Christoph.Egger@xxxxxxx> wrote:
> > MC_ACT_CACHE_SHIRNK  <-- typo. should be MC_ACT_CACHE_SHRINK
>
> Ahh, yes, I will fix it.
>
> > The L3 cache index disable feature works like this:
> >
> > You read the bits 17:6  from the MSR 0xC0000408 (which is MC4_MISC1)
> > and write it into the index field. This MSR does not belong to
> > the standard
> > mc bank data and is therefore provided by mcinfo_extended.
> > The index field are the bits 11:0 of the PCI function 3 register "L3
> > Cache Index Disable".
>
> So what's the offset of "L3 Cache Index Disable"? Is it in 256 byte or 4K
> byte?

Sorry, which offset do you mean ?

>
> For the PCI access, I'd prefer to have xen to control all these, i.e. even
> if dom0 want to disable the L3 cache, it is done through a hypercall. The
> reason is, Xen control the CPU, so keep it in Xen will make things simpler.
>
> Of course, it is ok for me too, if you want to keep Xen for #MC handler and
> Dom0 for CE handler.

We still need to define the rules to prevent interferes and clarify how to
deal with Dom0/DomU going wild and breaking the rules.

> > Why is the recover action bound to the bank ?
> > I would like to see a struct mcinfo_recover  rather extending
> > struct mcinfo_bank.  That gives us flexibility.
>
> I'd get input from Frank or Gavin. Place mcinfo_recover in mcinfo_back has
> advantage of keep connection of the error source and the action, but it do
> make the mcinfo_bank more complex. Or we can keep the cpu/bank information
> in the mcinfo_recover also, so that we keep the flexibility and don't lose
> the connection.

From your suggestions I prefer the last one, but is still limited due
to the assumption that each struct mcinfo_bank and each struct mcinfo_extended
stands for exactly one error.

This assumption doesn't cover follow-up errors which may be needed to 
determine the real root cause. Some of them may even be ignored
depending on what is going on.

Christoph

>
> Thanks
> Yunhong Jiang
>
> > Christoph
> >
> > On Thursday 05 March 2009 09:31:27 Jiang, Yunhong wrote:
> >> Christoph/Frank, Followed is the interface definition, please have a
> >> look.
> >>
> >> Thanks
> >> Yunhong Jiang
> >>
> >> 1) Interface between Xen/dom0 for passing xen's recovery action
> >> information to dom0. Usage model: After offlining broken page, Xen might
> >> pass its page-offline recovery action result information to dom0. Dom0
> >> will save the information in non-volatile memory for further proactive
> >> actions, such as offlining the easy-broken page early when doing next
> >> reboot.
> >>
> >>
> >> struct page_offline_action
> >> {
> >>     /* Params for passing the offlined page number to DOM0 */    
> >> uint64_t mfn; uint64_t status; /* Similar to page offline hypercall */
> >> };
> >>
> >> struct cpu_offline_action
> >> {
> >>     /* Params for passing the identity of the offlined CPU to DOM0 */
> >>     uint32_t mc_socketid; uint16_t mc_coreid;
> >>     uint16_t mc_core_threadid;
> >> };
> >>
> >> struct cache_shrink_action
> >> {
> >>     /* TBD, Christoph, please fill it */
> >> };
> >>
> >> /* Recover action flags, giving recovery result information to guest */
> >> /* Recovery successfully after taking certain recovery actions below */
> >> #define REC_ACT_RECOVERED      (0x1 << 0)
> >> /* For solaris's usage that dom0 will take ownership when crash */
> >> #define REC_ACT_RESET          (0x1 << 2)
> >> /* No action is performed by XEN */
> >> #define REC_ACT_INFO           (0x1 << 3)
> >>
> >> /* Recover action type definition, valid only when flags &
> >> REC_ACT_RECOVERED */ #define MC_ACT_PAGE_OFFLINE 1
> >> #define MC_ACT_CPU_OFFLINE   2
> >> #define MC_ACT_CACHE_SHIRNK 3
> >>
> >> struct recovery_action
> >> {
> >>     uint8_t flags;
> >>     uint8_t action_type;
> >>     union
> >>     {
> >>         struct page_offline_action page_retire;
> >>         struct cpu_offline_action cpu_offline;
> >>         struct cache_shrink_action cache_shrink;
> >>         uint8_t pad[MAX_ACTION_SIZE];
> >>     } action_info;
> >> }
> >>
> >> struct mcinfo_bank {
> >>     struct mcinfo_common common;
> >>
> >>     uint16_t mc_bank; /* bank nr */
> >>     uint16_t mc_domid; /* Usecase 5: domain referenced by mc_addr on
> >> dom0 * and if mc_addr is valid. Never valid on DomU. */ uint64_t
> >> mc_status; /* bank status */
> >>     uint64_t mc_addr;   /* bank address, only valid
> >>                          * if addr bit is set in mc_status */    
> >> uint64_t mc_misc; uint64_t mc_ctrl2;
> >>     uint64_t mc_tsc;
> >>     /* Recovery action is performed per bank */
> >>     struct recovery_action action;
> >> };
> >>
> >> 2) Below two interfaces are for MCA processing internal use.
> >>     a. pre_handler will be called earlier in MCA ISR context, mainly for
> >> early need_reset detection for avoiding log missing (flag MCA_RESET).
> >> Also, pre_handler might be able to find the impacted domain if possible.
> >>     b. mca_error_handler is actually a (error_action_index,
> >> recovery_handler pointer) pair. The defined recovery_handler function
> >> performs the actual recovery operations in softIrq context after the
> >> per_bank MCA error matching the corresponding mca_code index. If
> >> pre_handler can't judge the impacted domain, recovery_handler must
> >> figure it out.
> >>
> >> /* Error has been recovered successfully */
> >> #define MCA_RECOVERD    0
> >> /* Error impact one guest as stated in owner field */ #define MCA_OWNER
> >> 1 /* Error can't be recovered and need reboot system */ #define
> >> MCA_RESET 2 /* Error should be handled in softIRQ context */
> >> #define MCA_MORE_ACTION 3
> >>
> >> struct mca_handle_result
> >> {
> >>     uint32_t flags;
> >>     /* Valid only when flags & MCA_OWNER */
> >>     domid_d owner;
> >>     /* valid only when flags & MCA_RECOVERD */
> >>     struct  recovery_action *action;
> >> };
> >>
> >> struct mca_error_handler
> >> {
> >>     /*
> >>      * Assume we will need only architecture defined code. If the index
> >> can't be setup by * mca_code, we will add a function to do the (index,
> >> recovery_handler) mapping check. * This mca_code represents the recovery
> >> handler pointer index for identifying this * particular error's
> >>     corresponding recover action */
> >>     uint16_t mca_code;
> >>
> >>     /* Handler to be called in softIRQ handler context */
> >>     int recovery_handler(struct mcinfo_bank *bank,
> >>                      struct mcinfo_global *global,
> >>                      struct mcinfo_extended *extention,
> >>                      struct mca_handle_result *result);
> >>
> >> };
> >>
> >> struct mca_error_handler intel_mca_handler[] =
> >> {
> >>     ....
> >> };
> >>
> >> struct mca_error_handler amd_mca_handler[] =
> >> {
> >>     ....
> >> };
> >>
> >>
> >> /* HandlVer to be called in MCA ISR in MCA context */
> >> int intel_mca_pre_handler(struct cpu_user_regs *regs,
> >>                                 struct mca_handle_result *result);
> >>
> >> int amd_mca_pre_handler(struct cpu_user_regs *regs,
> >>                             struct mca_handle_result *result);
> >>
> >> Frank.Vanderlinden@xxxxxxx <mailto:Frank.Vanderlinden@xxxxxxx> wrote:
> >>> Jiang, Yunhong wrote:
> >>>> Frank/Christopher, can you please give more comments for it, or you
> >>>> are OK with this? For the action reporting mechanism, we will send out
> >>>> a proposal for review soon.
> >>>
> >>> I'm ok with this. We need a little more information on the AMD
> >>> mechanism, but it seems to me that we can fit this in.
> >>>
> >>> Sometime this week, I'll also send out the last of our changes that
> >>> haven't been sent upstream to xen-unstable yet. Maybe we can combine
> >>> some things in to one patch, like the telemetry handling changes that
> >>> Gavin did. The other changes are error injection (for debugging) and
> >>> panic crash dump support for our FMA tools, but those are probably only
> >>> interesting to us.
> >>>
> >>> - Frank
> >
> > --
> > ---to satisfy European Law for business letters:
> > Advanced Micro Devices GmbH
> > Karl-Hammerschmidt-Str. 34, 85609 Dornach b. Muenchen
> > Geschaeftsfuehrer: Jochen Polster, Thomas M. McCoy, Giuliano Meroni
> > Sitz: Dornach, Gemeinde Aschheim, Landkreis Muenchen
> > Registergericht Muenchen, HRB Nr. 43632



-- 
---to satisfy European Law for business letters:
Advanced Micro Devices GmbH
Karl-Hammerschmidt-Str. 34, 85609 Dornach b. Muenchen
Geschaeftsfuehrer: Jochen Polster, Thomas M. McCoy, Giuliano Meroni
Sitz: Dornach, Gemeinde Aschheim, Landkreis Muenchen
Registergericht Muenchen, HRB Nr. 43632


_______________________________________________
Xen-devel mailing list
Xen-devel@xxxxxxxxxxxxxxxxxxx
http://lists.xensource.com/xen-devel


 


Rackspace

Lists.xenproject.org is hosted with RackSpace, monitoring our
servers 24x7x365 and backed by RackSpace's Fanatical Support®.