
Re: [Xen-devel] ACPI suspend/resume on Dell Inspirons 1464/1564/1764



On 19/05/2010 15:30, "Roger Cruz" <roger.cruz@xxxxxxxxxxxxxxxxxxx> wrote:

> 2) The way I narrowed down the problem to these lines of code was by inserting a
> "while(1);" loop at different points in the code.  When it didn't reboot, I
> knew it had gotten to my while loop.  I just kept moving the while loop until
> I found the lines I highlighted in my previous msg.  Below is what my debug
> code looks like:

Your system seems to hobble along just fine if you remove the BUG_ON()s, so
why not convert them into printk() warnings? Or if it's too early for
printk, stash some info in memory and printk() it at the very end of S3
resume.
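
A minimal sketch of the stash-then-print idea, assuming the checks run too
early for printk(). Every name here (resume_check, resume_log_flush,
RESUME_LOG_SLOTS) is hypothetical and not part of Xen, and printk is mapped to
printf only so the sketch builds as an ordinary user-space C file:

```c
#include <stdio.h>

/* Assumption: stand-in for Xen's printk() so this compiles outside Xen. */
#define printk printf

/* Hypothetical fixed-size buffer; no allocation is needed this early. */
#define RESUME_LOG_SLOTS 16

struct resume_log_entry {
    const char *what;       /* name of the value that was compared */
    unsigned int expected;  /* value stored when the globals were set up */
    unsigned int actual;    /* value observed during resume */
};

static struct resume_log_entry resume_log[RESUME_LOG_SLOTS];
static unsigned int resume_log_count;

/*
 * Use in place of BUG_ON(expected != actual): remember the mismatch and
 * carry on. Returns 1 on mismatch, 0 otherwise. Mismatches beyond
 * RESUME_LOG_SLOTS are still reported via the return value but not stored.
 */
int resume_check(const char *what, unsigned int expected, unsigned int actual)
{
    if (expected == actual)
        return 0;

    if (resume_log_count < RESUME_LOG_SLOTS) {
        resume_log[resume_log_count].what = what;
        resume_log[resume_log_count].expected = expected;
        resume_log[resume_log_count].actual = actual;
        resume_log_count++;
    }
    return 1;
}

/*
 * Call at the very end of S3 resume, once the console works again.
 * Prints every stored mismatch, empties the buffer, and returns how
 * many entries were printed.
 */
unsigned int resume_log_flush(void)
{
    unsigned int i, n = resume_log_count;

    for (i = 0; i < n; i++)
        printk("S3 resume: %s mismatch: expected %#x, got %#x\n",
               resume_log[i].what, resume_log[i].expected,
               resume_log[i].actual);

    resume_log_count = 0;
    return n;
}
```

Each BUG_ON(a != b) in the re-check block would become a resume_check("a", a, b)
call, with resume_log_flush() called once after resume has finished.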

> 3) You can see above that the vmx_vmexit_control check was the point at which
> the crash/reboot was being triggered.  However, if I commented out just that
> line, I would still see a reboot.  Only when I commented the whole block out
> did it finally work.   Is something overwriting the location of these
> variables such that when I commented out a line of code, it moved the data
> segment causing a different variable to be overwritten?    I need to be able
> to explain this behavior.  So I will be working towards that today.

I would assume that more than one of the BUG_ON()s is triggering. So if you
just comment out the first offending one that you find, you instead fall
foul of a second one.
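
To see every failing check in one pass rather than only the first, the
comparisons could be driven from a table. This is a rough sketch, not the Xen
code: struct vmx_check and report_mismatches are hypothetical names, and printf
stands in for printk():

```c
#include <stddef.h>
#include <stdio.h>

/* One stored-versus-probed comparison (hypothetical helper type). */
struct vmx_check {
    const char *name;     /* label printed when the pair differs */
    unsigned int stored;  /* global recorded when the first CPU came up */
    unsigned int probed;  /* value just computed on the current CPU */
};

/*
 * Walk the whole table, print each differing pair, and return the number
 * of mismatches. Unlike a chain of BUG_ON()s, this does not stop at the
 * first failure.
 */
unsigned int report_mismatches(const struct vmx_check *checks, size_t n)
{
    unsigned int bad = 0;
    size_t i;

    for (i = 0; i < n; i++) {
        if (checks[i].stored != checks[i].probed) {
            printf("VMX check %s: stored %#x, probed %#x\n",
                   checks[i].name, checks[i].stored, checks[i].probed);
            bad++;
        }
    }
    return bad;
}
```

In vmcs.c, each table entry would pair a global such as vmx_vmexit_control with
its freshly computed counterpart _vmx_vmexit_control; a non-zero return value
says how many of the checks disagree.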

> 4) My initial thoughts were that the BIOS was overwriting some of these
> locations, so I performed an experiment that I believe rules out the BIOS.  I
> commented out the code in power.c that puts the CPU into the sleep mode.  This
> had the effect of going through most of the sleep and wakeup code in power.c
> (though it does not go through all of the wakeup.S initialization).  When I did
> this, it still failed to resume from sleep as long as an HVM domain was
> present.  Here is the diff on power.c

Yep, that patch should do the expected thing and do everything except the
actual BIOS S3 transition.

Well, overall this does sound like a memory corruption issue, not a BIOS or
platform issue. You need to printk out the contents of variables
contributing to your failing BUG_ON()s and see what's written there, I
think.

 -- Keir

> 5) The problem occurs even when Xen is run in uni-processor mode.  I achieved
> this by adding "nosmp=1 maxcpus=1" to the grub command line that boots xen.  I
> confirmed that Xen only reported one physical CPU, namely CPU0.  This should
> have avoided any issues with waking up other non-boot processors.
> 
> 6) Finally, I narrowed down the type and condition of the domain that exhibits
> the problem by using Python to create a domain whose definition I control.
> If I set "flags" to 0, the problem does not show up.  If I set it to "1"
> (hvm) and do NOT execute the
> "xc.domain_max_vcpus" call, the problem does not show up.  However, once I add
> one VCPU to this domain, the problem occurs.
> 
> #! /usr/bin/python
> import sys
> sys.path.append('/usr/lib/python2.6/site-packages')
> import xen.lowlevel.xc
> from xen.xend import uuid
> xc = xen.lowlevel.xc.xc()
> domid=xc.domain_create(domid=0,ssidref=0,
>     handle=uuid.fromString("bad0beef-dead-beef-dead-beefdeadbeef"), flags=1)
> 
> print domid
> xc.domain_max_vcpus(domid, 1)
> 
> 
> Roger R. Cruz
> 
> 
> 
> -----Original Message-----
> From: Keir Fraser [mailto:keir.fraser@xxxxxxxxxxxxx]
> Sent: Wed 5/19/2010 3:25 AM
> To: Roger Cruz; xen-devel@xxxxxxxxxxxxxxxxxxx
> Subject: Re: [Xen-devel] ACPI suspend/resume on Dell Inspirons 1464/1564/1764
> 
> On 18/05/2010 23:34, "Roger Cruz" <roger.cruz@xxxxxxxxxxxxxxxxxxx> wrote:
> 
>> A little more info.  I am now able to wake up the Dell Inspiron 1764 after I
>> put it to sleep.  I found that the code commented out below would cause the
>> problems in my system.  I have yet to understand why these variables don't
>> end up with the expected values.  If anyone has any thoughts that they would
>> like to share on how this code works and why it is comparing to stored
>> variables, I would very much like to hear them.
> 
> The BUG_ONs are to detect VMX versioning inconsistencies between processors.
> The weird thing here is that you presumably brought all CPUs online during
> initial system boot with no problem. So somehow something has changed only
> after resume from S3. I think you will need to add tracing to discover which
> BUG_ON is failing, and why.
> 
> Incidentally, in my CPU hotplug cleanup I will be making it so that CPUs
> that fail the checks will fail to come online, rather than crash the system.
> Which is a bit of an improvement, but obviously something is buggy
> underlying this (possibly in BIOS code).
> 
>  -- Keir
> 
>> Thank you
>> Roger R. Cruz
>> 
>> 
>> diff -r 6b2b1470f009 xen-3.4.2/xen/arch/x86/hvm/vmx/vmcs.c
>> --- a/xen-3.4.2/xen/arch/x86/hvm/vmx/vmcs.c
>> +++ b/xen-3.4.2/xen/arch/x86/hvm/vmx/vmcs.c
>> 
>> @@ -191,19 +192,25 @@
>>          cpu_has_vmx_ins_outs_instr_info = !!(vmx_basic_msr_high & (1U<<22));
>>          vmx_display_features();
>>      }
>> +#if 0
>>      else
>>      {
>>          /* Globals are already initialised: re-check them. */
>>          BUG_ON(vmcs_revision_id != vmx_basic_msr_low);
>>          BUG_ON(vmx_pin_based_exec_control != _vmx_pin_based_exec_control);
>>          BUG_ON(vmx_cpu_based_exec_control != _vmx_cpu_based_exec_control);
>>          BUG_ON(vmx_secondary_exec_control != _vmx_secondary_exec_control);
>>          BUG_ON(vmx_vmexit_control != _vmx_vmexit_control);
>>          BUG_ON(vmx_vmentry_control != _vmx_vmentry_control);
>>          BUG_ON(cpu_has_vmx_ins_outs_instr_info !=
>>                 !!(vmx_basic_msr_high & (1U<<22)));
>>      }
>> 
>> +#endif
>>      /* IA-32 SDM Vol 3B: VMCS size is never greater than 4kB. */
>>      BUG_ON((vmx_basic_msr_high & 0x1fff) > PAGE_SIZE);
>> 
>> 
>> -----Original Message-----
>> From: Roger Cruz
>> Sent: Wed 5/12/2010 2:38 PM
>> To: Roger Cruz; xen-devel@xxxxxxxxxxxxxxxxxxx
>> Subject: RE: [Xen-devel] ACPI suspend/resume on Dell Inspirons 1464/1564/1764
>> 
>> 
>> We have made some progress in getting the Inspiron laptops to work under Xen.
>> We tried xen-unstable and xen-4.0.0 and discovered that xen-unstable can resume
>> whereas xen-4.0.0 cannot.  Through trial and error, we have been able to
>> narrow down the actual changes that allowed it to work.  It looks like moving
>> the trampoline code down from its 0x8c000 location allowed it to resume.
>> 
>> So we took the change below and applied it to our 3.4.2 tree.  However, we
>> still have a problem in our 3.4.2 tree with this patch applied.  If an HVM
>> guest is running, the resume will fail with the exact same behavior as
>> before.  Due to our environment setup, we have not been able to test
>> xen-unstable with an HVM guest, so we can't say if this problem is fixed in
>> xen-unstable or not.  Can someone familiar with these changes provide a clue
>> as to what is going on?  How does having an HVM guest running affect the
>> resume functionality?  Running PV Linux guests does not affect resume, only
>> HVM guests do.
>> 
>> 
>> --- old/xen-3.4.2/xen/include/asm-x86/config.h  2010-05-12 11:44:35.243564976 -0400
>> +++ new/xen-3.4.2/xen/include/asm-x86/config.h  2010-05-12 11:44:35.026578602 -0400
>> @@ -96,7 +96,7 @@
>>  /* Primary stack is restricted to 8kB by guard pages. */
>>  #define PRIMARY_STACK_SIZE 8192
>> 
>> -#define BOOT_TRAMPOLINE 0x8c000
>> +#define BOOT_TRAMPOLINE 0x7c000
>>  #define bootsym_phys(sym)                                 \
>>      (((unsigned long)&(sym)-(unsigned long)&trampoline_start)+BOOT_TRAMPOLINE)
>>  #define bootsym(sym)                                      \
>> 
>> -------
>> 
>> Hello fellow Xen developers,
>> 
>> I'm about to start debugging why Dell Inspirons running Xen 3.4.2 fail to
>> resume after a suspend operation.  A colleague has also found that the
>> problem exists on bare-metal Linux
>> (https://bugs.launchpad.net/ubuntu/+source/linux/+bug/571422) and an upstream
>> patch has been created
>> (http://kernel.ubuntu.com/git?p=ubuntu/ubuntu-lucid.git;a=commitdiff;h=29c60ccc1a408371885d79d8f8c081fbcb9b10be).
>> 
>> I would like to find out if anyone in the Xen community has encountered this
>> problem and if a fix is in the works.  Otherwise, I will attempt to provide a
>> similar solution to Linux's patch.
>> 
>> thanks
>> Roger
>> 
>> 
>> 
> 
> 
> 
> 



_______________________________________________
Xen-devel mailing list
Xen-devel@xxxxxxxxxxxxxxxxxxx
http://lists.xensource.com/xen-devel


 

