
Re: [Xen-devel] High CPU temp, suspend problem - xen 4.1.5-pre, linux 3.7.x



On Mon, Apr 15, 2013 at 11:09 PM, Marek Marczykowski
<marmarek@xxxxxxxxxxxxxxxxxxxxxx> wrote:
> On 02.04.2013 03:13, Marek Marczykowski wrote:
>> On 01.04.2013 15:53, Ben Guthro wrote:
>>> On Thu, Mar 28, 2013 at 3:03 PM, Marek Marczykowski
>>> <marmarek@xxxxxxxxxxxxxxxxxxxxxx> wrote:
>>>> (XEN) Restoring affinity for d2v3
>>>> (XEN) Assertion '!cpus_empty(cpus) && cpu_isset(cpu, cpus)' failed at
>>>> sched_credit.c:481
>>>
>>>
>>> I think the "fix-suspend-scheduler-*" patches posted in this thread are
>>> applicable here:
>>> http://markmail.org/message/llj3oyhgjzvw3t23
>>>
>>>
>>> Specifically, I think you need this bit:
>>>
>>> diff --git a/xen/common/cpu.c b/xen/common/cpu.c
>>> index 630881e..e20868c 100644
>>> --- a/xen/common/cpu.c
>>> +++ b/xen/common/cpu.c
>>> @@ -5,6 +5,7 @@
>>>  #include <xen/init.h>
>>>  #include <xen/sched.h>
>>>  #include <xen/stop_machine.h>
>>> +#include <xen/sched-if.h>
>>>
>>>  unsigned int __read_mostly nr_cpu_ids = NR_CPUS;
>>>  #ifndef nr_cpumask_bits
>>> @@ -212,6 +213,8 @@ void enable_nonboot_cpus(void)
>>>              BUG_ON(error == -EBUSY);
>>>              printk("Error taking CPU%d up: %d\n", cpu, error);
>>>          }
>>> +        if (system_state == SYS_STATE_resume)
>>> +            cpumask_set_cpu(cpu, cpupool0->cpu_valid);
>>>      }
>>>
>>>      cpumask_clear(&frozen_cpus);
>>>
>>
>> Indeed, this makes things better, but still not ideal.
>> Now after resume all CPUs are in Pool-0, which is good. But CPU0 is strongly
>> preferred over the others (xl vcpu-list). For example, if I start 4 busy
>> loops in dom0, I get the following (even after some time):
>> [user@dom0 ~]$ xl vcpu-list
>> Name                                ID  VCPU   CPU State   Time(s) CPU Affinity
>> dom0                                 0     0    0   r--      98.5  any cpu
>> dom0                                 0     1    0   ---     181.3  any cpu
>> dom0                                 0     2    2   r--     262.4  any cpu
>> dom0                                 0     3    3   r--     230.8  any cpu
>> netvm                                1     0    0   -b-      18.4  any cpu
>> netvm                                1     1    0   -b-       9.1  any cpu
>> netvm                                1     2    0   -b-       7.1  any cpu
>> netvm                                1     3    0   -b-       5.4  any cpu
>> firewallvm                           2     0    0   -b-      10.7  any cpu
>> firewallvm                           2     1    0   -b-       3.0  any cpu
>> firewallvm                           2     2    0   -b-       2.5  any cpu
>> firewallvm                           2     3    3   -b-       3.6  any cpu
>>
>> If I remove a CPU from Pool-0 and re-add it, things go back to normal for
>> that particular CPU (so I get two equally used CPUs); to fully restore the
>> system I must remove all CPUs but CPU0 from Pool-0 and add them back again.
>>
>> Also, only CPU0 still has all C-states (C0-C3); all the others have only
>> C0-C1. This could probably be fixed by your "xen: Re-upload processor PM
>> data to hypervisor after S3 resume" patch (reloading the xen-acpi-processor
>> module helps here), but I don't think that is the right way. It isn't
>> necessary on other systems (with somewhat older hardware), so something must
>> be missing on the resume path. The question is what...
>>
>> Perhaps someone needs to go through enable_nonboot_cpus() (__cpu_up?) and
>> check whether it restores everything disabled in disable_nonboot_cpus()
>> (__cpu_disable?). Unfortunately I don't know the x86 details well enough to
>> follow that code...
>
> To summarize the ACPI S3 issues:
>
> I. Fixed issues:
>
> 1. IRQ problem fixed by "x86: irq_move_cleanup_interrupt() must ignore legacy
> vectors" commit
> 2. Assertion failure on resume when vcpu affinity is used, fixed by the
> "x86/S3: Restore broken vcpu affinity on resume" commit
>
>
> II. Not (fully) fixed issues:
>
> 1. CPU Pool-0 contains only CPU0 after resume - patch quoted above fixes the
> issue, but it isn't applied to xen-unstable
> 2. After resume the scheduler chooses (almost) only CPU0 (see the listing
> quoted above). Removing and re-adding all CPUs to Pool-0 solves the problem.
> Perhaps some timers are not restarted after resume?

Marek,
Please try the patch from this thread to see if it solves your two issues above:
http://markmail.org/thread/35ecqimv7bwq3k6d

This patch was NAK'ed due to cpupool breakage...but in my testing, it
solved both of these problems.

I don't know how to properly solve it in a cpupool compatible way...
but I also haven't put much additional effort into doing so.
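In the meantime, the interim workaround Marek describes (cycling every CPU
except CPU0 out of Pool-0 and back in) can be scripted roughly as below. This
is only a sketch, not from the thread: the 4-CPU count and the dry-run wrapper
are assumptions, and it would need to run in dom0 with the xl toolstack
available. Pass "run" instead of "dry" to actually execute the xl commands.

```shell
#!/bin/sh
# Sketch of the workaround: remove each CPU except CPU0 from Pool-0 and
# immediately re-add it, which reportedly restores normal scheduling after
# resume. Assumes a 4-CPU box (CPUs 1-3 get cycled).
cycle_pool0_cpus() {
    mode="$1"
    for cpu in $(seq 1 3); do
        for cmd in "xl cpupool-cpu-remove Pool-0 $cpu" \
                   "xl cpupool-cpu-add Pool-0 $cpu"; do
            if [ "$mode" = run ]; then
                $cmd                # actually invoke xl
            else
                echo "$cmd"         # dry run: just print what would be done
            fi
        done
    done
}

cycle_pool0_cpus dry
```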


> 3. ACPI C-states are only present for CPU0 (after resume, of course), fixed
> by Ben's "xen: Re-upload processor PM data to hypervisor after S3" patch, but
> it isn't in upstream Linux (nor in Konrad's acpi-s3 branches).

I don't recall seeing any ACK / NAK from Konrad on this.

Original post:
https://patchwork.kernel.org/patch/2033981/

Konrad - do you have any thoughts about incorporating this into a
future merge window?

Ben

_______________________________________________
Xen-devel mailing list
Xen-devel@xxxxxxxxxxxxx
http://lists.xen.org/xen-devel


 

