Re: [Xen-devel] Regression, host crash with 4.5rc1
>>> On 27.11.14 at 06:29, <sflist@xxxxxxxxx> wrote:
> On 11/25/2014 03:00 AM, Jan Beulich wrote:
>> Okay, so it's not really the mwait-idle driver causing the regression,
>> but it is C-state related. Hence we're now down to seeing whether all
>> or just the deeper C states are affected, i.e. I now need to ask you
>> to play with "max_cstate=". For that you'll have to remember that the
>> option's effect differs between the ACPI and the MWAIT idle drivers.
>> In the spirit of bisection I'd suggest using "max_cstate=2" first no
>> matter which of the two scenarios you pick. If that still hangs,
>> "max_cstate=1" obviously is the only other thing to try. Should that
>> not hang (and you left out "mwait-idle=0"), trying "max_cstate=3"
>> in that same scenario would be the other case to check.
>>
>> No need for 'd' and 'a' output for the time being, but 'c' output would
>> be much appreciated for all cases where you observe hangs.
>>
>
> Okay, working through that now. I tried max_cstate=2 and got no hangs,
> whether with or without mwait-idle=0.
> However, I was puzzled by this:
>
> (XEN) 'c' pressed -> printing ACPI Cx structures
> (XEN) ==cpu0==
> (XEN) active state: C0
> (XEN) max_cstate: C2
> (XEN) states:
> (XEN)     C1: type[C1] latency[003] usage[12219860] method[ FFH] duration[1190961948551]
> (XEN)     C2: type[C1] latency[010] usage[10205554] method[ FFH] duration[2015393965907]
> (XEN)     C3: type[C2] latency[020] usage[50926286] method[ FFH] duration[30527997858148]
> (XEN)    *C0: usage[73351700] duration[9974627547595]
> (XEN) max=0 pwr=0 urg=0 nxt=0
> (XEN) PC2[0] PC3[8589642315848] PC6[0] PC7[0]
> (XEN) CC3[28794734145697] CC6[0] CC7[0]
> (XEN) ==cpu1==
> (XEN) active state: C3
> (XEN) max_cstate: C2
> (XEN) states:
> (XEN)     C1: type[C1] latency[003] usage[10699950] method[ FFH] duration[1141422044112]
> (XEN)     C2: type[C1] latency[010] usage[06382904] method[ FFH] duration[1329739264322]
> (XEN)    *C3: type[C2] latency[020] usage[44630764] method[ FFH] duration[31676618425954]
> (XEN)     C0: usage[61713618] duration[9561201640320]
> (XEN) max=0 pwr=0 urg=0 nxt=0
> (XEN) PC2[0] PC3[8589642315848] PC6[0] PC7[0]
> (XEN) CC3[30066495105056] CC6[0] CC7[0]
>[...]
>
> Why would some of the cores be in C3 even though they list max_cstate as C2?

This was precisely the reason why I told you that the numbering
differs (and is confusing and has nothing to do with actual C state
numbers): What max_cstate refers to in the mwait-idle driver is
what above is listed as type[Cx], i.e. the state at index 1 is C1, at
2 we've got C1E, and at 3 we've got C2. And those still aren't in
line with the numbering the CPU documentation uses, it's rather
kind of meant to refer to the ACPI numbering (but probably also
not fully matching up).

So max_cstate=2 working suggests a problem with what the CPU
calls C6, which presumably isn't all that surprising considering the
many errata (BD35, BD38, BD40, BD59, BD87, and BD104). Not sure
how to proceed from here - I suppose you already made sure you
run with the latest available BIOS.
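To make the numbering point concrete, here is a minimal Python sketch (not Xen code) that reads the state lines of a 'c' dump and compares the type[Cx] value, rather than the index in the listing, against max_cstate. The function name allowed_states, the regular expression, and the "type <= max_cstate" comparison are assumptions made for illustration, based only on the description above; the real driver logic may differ in detail.

```python
import re

# Sample taken from the cpu1 section of the dump quoted above
# (wrapped "duration[...]" fields rejoined onto their state lines).
SAMPLE = """\
(XEN) ==cpu1==
(XEN) active state: C3
(XEN) max_cstate: C2
(XEN) states:
(XEN)     C1: type[C1] latency[003] usage[10699950] method[ FFH] duration[1141422044112]
(XEN)     C2: type[C1] latency[010] usage[06382904] method[ FFH] duration[1329739264322]
(XEN)    *C3: type[C2] latency[020] usage[44630764] method[ FFH] duration[31676618425954]
(XEN)     C0: usage[61713618] duration[9561201640320]
"""

# Matches only state lines that carry a type[Cx] field; the C0 line,
# "active state" and "max_cstate" lines do not match.
STATE_RE = re.compile(r"\(XEN\)\s+\*?C(\d+):\s+type\[C(\d+)\]")


def allowed_states(dump: str, max_cstate: int):
    """Return (listing_index, type, allowed) for each state line.

    Assumption: a state is usable when its type[Cx] number does not
    exceed max_cstate; the listing index plays no role in the check.
    """
    result = []
    for match in STATE_RE.finditer(dump):
        index, ctype = int(match.group(1)), int(match.group(2))
        result.append((index, ctype, ctype <= max_cstate))
    return result


if __name__ == "__main__":
    for limit in (1, 2):
        print(f"max_cstate={limit}:", allowed_states(SAMPLE, limit))
```

Under that assumption, the entry listed as "C3" has type[C2] and therefore stays usable with max_cstate=2, which matches cpu1 reporting "active state: C3" alongside "max_cstate: C2". With max_cstate=1 only the two type[C1] entries would remain.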
And with 6 errata documented it's not all that unlikely
that there's a 7th one with MONITOR/MWAIT behavior. The
commit you bisected to (and which you had verified to be the
culprit by just forcing arch_skip_send_event_check() to always
return false) could be reasonably assumed to be broken only
when MWAIT use for all C states didn't work.

Don, Jun - is there anything known but not yet publicly documented
for Family 6 Model 44 Xeons?

Jan

_______________________________________________
Xen-devel mailing list
Xen-devel@xxxxxxxxxxxxx
http://lists.xen.org/xen-devel