
Re: [Xen-devel] Regression, host crash with 4.5rc1



>>> On 27.11.14 at 06:29, <sflist@xxxxxxxxx> wrote:
> On 11/25/2014 03:00 AM, Jan Beulich wrote:
>> Okay, so it's not really the mwait-idle driver causing the regression,
>> but it is C-state related. Hence we're now down to seeing whether all
>> or just the deeper C states are affected, i.e. I now need to ask you
>> to play with "max_cstate=". For that you'll have to remember that the
>> option's effect differs between the ACPI and the MWAIT idle drivers.
>> In the spirit of bisection I'd suggest using "max_cstate=2" first no
>> matter which of the two scenarios you pick. If that still hangs,
>> "max_cstate=1" obviously is the only other thing to try. Should that
>> not hang (and you left out "mwait-idle=0"), trying "max_cstate=3"
>> in that same scenario would be the other case to check.
>>
>> No need for 'd' and 'a' output for the time being, but 'c' output would
>> be much appreciated for all cases where you observe hangs.
>>
> 
> Okay, working through that now. I tried max_cstate=2 and got no hangs, 
> whether with or without mwait-idle=0. However, I was puzzled by this:
> 
> (XEN) 'c' pressed -> printing ACPI Cx structures
> (XEN) ==cpu0==
> (XEN) active state:             C0
> (XEN) max_cstate:               C2
> (XEN) states:
> (XEN)     C1:   type[C1] latency[003] usage[12219860] method[  FFH] duration[1190961948551]
> (XEN)     C2:   type[C1] latency[010] usage[10205554] method[  FFH] duration[2015393965907]
> (XEN)     C3:   type[C2] latency[020] usage[50926286] method[  FFH] duration[30527997858148]
> (XEN)    *C0:   usage[73351700] duration[9974627547595]
> (XEN) max=0 pwr=0 urg=0 nxt=0
> (XEN) PC2[0] PC3[8589642315848] PC6[0] PC7[0]
> (XEN) CC3[28794734145697] CC6[0] CC7[0]
> (XEN) ==cpu1==
> (XEN) active state:             C3
> (XEN) max_cstate:               C2
> (XEN) states:
> (XEN)     C1:   type[C1] latency[003] usage[10699950] method[  FFH] duration[1141422044112]
> (XEN)     C2:   type[C1] latency[010] usage[06382904] method[  FFH] duration[1329739264322]
> (XEN)    *C3:   type[C2] latency[020] usage[44630764] method[  FFH] duration[31676618425954]
> (XEN)     C0:   usage[61713618] duration[9561201640320]
> (XEN) max=0 pwr=0 urg=0 nxt=0
> (XEN) PC2[0] PC3[8589642315848] PC6[0] PC7[0]
> (XEN) CC3[30066495105056] CC6[0] CC7[0]
>[...]
> 
> Why would some of the cores be in C3 even though they list max_cstate as C2?

This is precisely why I told you that the numbering differs (it is
confusing and has nothing to do with actual C state numbers): in the
mwait-idle driver, max_cstate refers to what is listed above as
type[Cx], i.e. the state at index 1 is C1, the one at index 2 is C1E,
and the one at index 3 is C2. Even those don't line up with the
numbering the CPU documentation uses; they are rather meant to refer
to the ACPI numbering (but probably don't fully match that either).
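
To illustrate the mapping described above, here is a minimal Python
sketch. It is not Xen code: the STATES table is hand-copied from the
cpu1 dump quoted above, and the function name and the limit_by_type
flag are hypothetical. It assumes, per the explanation above, that the
mwait-idle driver compares max_cstate against the type[Cx] value
rather than against the index printed at the start of each line. The
index-based comparison is shown only for contrast and is not
attributed to any driver.

```python
# Hypothetical model of the cpu1 'c' dump: (index, type) pairs.
# "index" is the Cn label at the start of each dump line,
# "type" is the number inside type[Cn] on that line.
STATES = [
    (1, 1),  # C1: type[C1]
    (2, 1),  # C2: type[C1]
    (3, 2),  # C3: type[C2]
]


def deepest_allowed_index(states, max_cstate, limit_by_type):
    """Return the largest state index permitted by max_cstate.

    limit_by_type=True  compares max_cstate against type[Cx]
                        (assumed mwait-idle behaviour, see lead-in).
    limit_by_type=False compares max_cstate against the index
                        (naive reading, for contrast only).
    A return value of 0 means no idle state is permitted.
    """
    allowed = [
        idx
        for idx, typ in states
        if (typ if limit_by_type else idx) <= max_cstate
    ]
    return max(allowed, default=0)


if __name__ == "__main__":
    # Compared by type, max_cstate=2 still admits the state at index 3.
    print(deepest_allowed_index(STATES, 2, limit_by_type=True))
    # Compared by index, the same limit would stop at index 2.
    print(deepest_allowed_index(STATES, 2, limit_by_type=False))
```

Under that assumption, a CPU showing "active state: C3" together with
"max_cstate: C2" is consistent, because the state at index 3 has
type[C2].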

So max_cstate=2 working suggests a problem with what the CPU
calls C6, which presumably isn't all that surprising considering the
many errata (BD35, BD38, BD40, BD59, BD87, and BD104). I'm not
sure how to proceed from here; I suppose you have already made
sure you are running the latest available BIOS. With 6 errata
documented, it's not all that unlikely that there's a 7th one
affecting MONITOR/MWAIT behavior. The commit you bisected to (and
which you verified to be the culprit by forcing
arch_skip_send_event_check() to always return false) can
reasonably be assumed to be broken only if using MWAIT for all
C states doesn't work.

Don, Jun - is there anything known but not yet publicly
documented for Family 6 Model 44 Xeons?

Jan


_______________________________________________
Xen-devel mailing list
Xen-devel@xxxxxxxxxxxxx
http://lists.xen.org/xen-devel


 

