
Re: [Xen-devel] High CPU temp, suspend problem - xen 4.1.5-pre, linux 3.7.x

On Fri, Mar 22, 2013 at 04:34:11PM +0100, Marek Marczykowski wrote:
> On 15.03.2013 14:02, Konrad Rzeszutek Wilk wrote:
> > On Wed, Mar 13, 2013 at 09:50:39PM +0100, Marek Marczykowski wrote:
> >> Hi,
> >>
> >> I still have problems with ACPI(?) on Xen. After some system startups or
> >> resumes the CPU temperature goes high although all domUs (and dom0) are idle. On
> >> a "good" system startup it is about 50-55C, on a "bad" one - above 67C (most of the
> >> time above 70C). I've noticed a difference in the C-states reported by Xen (attached
> >> files). On "bad" startups, in addition, suspend doesn't work - the system restarts
> >> during suspend (I still haven't managed to get console messages - I don't have a
> >> serial port on this system). Note that sometimes the system boots fine ("good"
> >> state), but the problem occurs after some suspend/resume cycles. Some time ago
> >> I got other symptoms: only CPU0 was used - for all VCPUs (according to
> >> xl vcpu-list). Maybe it is related?
> >>
> >> Hardware: Dell Latitude E6420
> >> CPU: Intel i5-2520M
> >>
> >> Software:
> >> xen stable-4.1 as of 15.02 (last commit: "xen: sched_credit: improve picking
> >> up the idle CPU for a VCPU"), with reverted commit "Introduce system_state
> >> variable."
> >> But the same problem on vanilla xen 4.1.2.
> >>
> >> Linux 3.7.6 - happens on almost every boot. On Linux 3.7.4 it happens much more
> >> rarely (but still occurs).
> >> Kernel config:
> >> http://git.qubes-os.org/gitweb/?p=marmarek/kernel.git;a=blob;f=config-pvops;h=a6e953f71cdc84556571b592b8af87a5a4f9a8d0;hb=HEAD
> >> I've tried bisecting from 3.7.4 to 3.7.6, but without success because the
> >> problem isn't 100% reproducible.
> >>
> >> Any ideas?
> > 
> > That C-states difference is important. The SYSIO part on your box means that the
> > CPU ends up doing an MWAIT. A HALT on the other hand is not so power-saving
> > friendly.
> > 
> > Looking at this:
> >> (XEN) no cpu_id for acpi_id 5
> >> (XEN) no cpu_id for acpi_id 6
> >> (XEN) no cpu_id for acpi_id 7
> >> (XEN) no cpu_id for acpi_id 8
> > 
> > .. means that xen-acpi-processor was trying to probe for the ACPI IDs of the
> > other CPUs that the machine theoretically can support. That means it got
> > the ACPI information for the first four CPUs (which is good).
> > 
> > As the first step in trying to figure this out, you can add #define DEBUG 1
> > in xen-acpi-processor.c right before any of the #includes, and also boot
> > Xen with 'cpufreq=verbose'. That should tell you what kind of C-states
> > xen-acpi-processor uploaded (and whether it did so for all of the vCPUs).
> > 
> > If both bootups show that we do upload the C-states for all the CPUs but they
> > vary, that means digging a bit deeper in the ACPI code. Specifically in
> > acpi_processor_get_power_info_cst and seeing if it hits any of the 'continue'
> > statements.
> > 
> > Then I would say also take the DSDT for both bootups and compare them. It might
> > be that the BIOS is using a scratch register at reboot to construct the C-states
> > and somehow it ends up being corrupted. Which means that on the next warm reboot
> > the C-states have bogus data. This does show up in the field :-(
> Finally I've found some time to debug this further. And it looks like
> some deeper ACPI code problem...
> I've switched to 3.8.4, on which the problem is much easier to reproduce (almost
> every startup).
> On a bad bootup, xen-acpi-processor didn't find any C-states: for each CPU
> _pr->flags.power and _pr->power.count were 0 (but flags.power_setup_done=1). In
> this case suspend (or shutdown) always ends up with a reset.

Is this when you boot the machine from a cold state or a warm one?

There are some BIOSes out there that I know that use the scratchpad registers in
IOH (so depending on the platform that can be 0:0e.1 , Reg 0x84). If Xen or 
touch it then the P-states and C-states that the BIOS generates are buggy.

But that is not the case here - you are saying that after disassembling the
tables (so cat /sys/firmware/acpi/tables/DSDT, or SSDT*, and then iasl -d on
them), the _PSD, _PSS, and _PCT look the same?

You could also look at the FACP table and see if they are different.
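
To make that comparison less error-prone, something like the following could
be used. It is only a minimal sketch: it assumes the raw tables from
/sys/firmware/acpi/tables were copied (as root) into one directory per boot,
and the function names and directory names are made up for illustration. It
only says which tables differ byte-for-byte; the differing ones would still
need iasl -d and a manual look.

```python
import hashlib
from pathlib import Path


def table_digests(dump_dir):
    """Map each regular file name in dump_dir to the SHA-256 of its contents."""
    digests = {}
    for path in sorted(Path(dump_dir).iterdir()):
        # Skip subdirectories (e.g. "dynamic"); only top-level tables are hashed.
        if path.is_file():
            digests[path.name] = hashlib.sha256(path.read_bytes()).hexdigest()
    return digests


def differing_tables(good_dir, bad_dir):
    """Return sorted names of tables that differ or exist in only one dump."""
    good = table_digests(good_dir)
    bad = table_digests(bad_dir)
    names = good.keys() | bad.keys()
    return sorted(name for name in names if good.get(name) != bad.get(name))


if __name__ == "__main__":
    # Hypothetical directory names, one dump per boot.
    print(differing_tables("tables-good", "tables-bad"))
```

An empty list would mean the firmware handed out identical tables on both
boots, which points away from the BIOS and towards the kernel/Xen side.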
> On a good one xen-acpi-processor got C1-C3 states for each CPU, then suspend
> succeeded, but after resume CPU0 had C1-C3, while the others had only C1. Reloading
> xen-acpi-processor (rmmod -f...) fixes this (according to xl debug-key c), but
> the temperature still stays high. Regardless of xen-acpi-processor reloading, the
> next suspend always fails.

If you reload, and look at the runqueues, are all of them using the ACPI
idler or the default one?
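
The asymmetry you describe after resume (CPU0 with C1-C3, the others with
only C1) is easy to miss when reading the dump by eye. A rough sketch of a
check is below. It deliberately does not parse the 'xl debug-key c' output;
it assumes you have already collected, by hand or otherwise, a mapping from
CPU number to the C-state names listed for that CPU. The function name and
the input layout are hypothetical.

```python
def cpus_missing_cstates(cstates_by_cpu, reference_cpu=0):
    """Return {cpu: sorted missing states} for CPUs lacking states the reference CPU has."""
    reference = set(cstates_by_cpu[reference_cpu])
    missing = {}
    for cpu, states in cstates_by_cpu.items():
        lacking = reference - set(states)
        if lacking:
            missing[cpu] = sorted(lacking)
    return missing


if __name__ == "__main__":
    # Values modelled on the post-resume situation described in this thread.
    after_resume = {
        0: ["C1", "C2", "C3"],
        1: ["C1"],
        2: ["C1"],
        3: ["C1"],
    }
    print(cpus_missing_cstates(after_resume))
```

Running the same check before suspend, after resume, and after reloading
xen-acpi-processor would show at which point the other CPUs lose C2/C3.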

> Not sure how C-states can be related to S3 suspend, but perhaps something more
> general with ACPI is wrong?

This reminds me of something. I recall a long long time ago seeing something
like this....
Completely forgot about this until now. The difference was whether Xen
was running a) the acpi_idle (so using the different C-states), or b) the
default one (so just using HLT).

With the b), during resume it would get half-way through
(http://darnok.org/xen/devel.acpi-s3.v1.serial.log) while with a) it would 
continue on - http://darnok.org/xen/devel.acpi-s3.v0.serial.log

This was on some MSI MS-7680/H61M-P23 (MS-7680) motherboard.

Oh look: http://lists.xen.org/archives/html/xen-devel/2011-06/msg02059.html

And it looks like Kevin's recommendation was to use the a) case with
max_cstates=1 to narrow it down.

> Each time the DSDT (taken from /sys/firmware/acpi/tables) is exactly the same.
> -- 
> Best Regards / Pozdrawiam,
> Marek Marczykowski
> Invisible Things Lab
