
Re: [Xen-devel] Xen4.2 S3 regression?



Attached are the dumps from a new run, covering:
new boot (pre-s3)
first suspend / resume cycle (s3-first)
second (failing) suspend / resume cycle (s3-second)



To go into greater detail on the kernel used:

It is a 3.2.23 kernel based on the Ubuntu 12.04 git tree at
http://kernel.ubuntu.com/git?p=ubuntu/ubuntu-precise.git;a=summary

To that, I have added some of Konrad's branches, specifically:
/devel/ioperm
/devel/acpi-s3.v7
/stable/misc  (mostly for the microcode fixes)
/stable/for-linus-fixes-3.3
/stable/for-linus-3.3
/devel/ttm.dma_pool.v2.9
/stable/for-x86

On top of that are some more patches specific to our operations. They are
not terribly interesting here, but I can provide them if necessary.


The 3.5 tree I tested with has a similar makeup, with fewer of Konrad's
branches.


On Wed, Aug 8, 2012 at 6:39 AM, Ben Guthro <ben@xxxxxxxxxx> wrote:
> Thanks for taking the time to reply.
>
> I'm out of the office today, so don't have direct access to the
> machine in question until tomorrow... but I'll do my best to answer
> (inline below) and I'll follow up tomorrow with concrete answers.
>
> On Wed, Aug 8, 2012 at 4:35 AM, Jan Beulich <JBeulich@xxxxxxxx> wrote:
>>>>> On 07.08.12 at 22:14, Ben Guthro <ben@xxxxxxxxxx> wrote:
>>> Any suggestions on how best to chase this down?
>>>
>>> The first S3 suspend/resume cycle works, but the second does not.
>>>
>>> On the second try, I never get any interrupts delivered to ahci.
>>> (at least according to /proc/interrupts)
>>>
>>>
>>> syslog traces from the first (good) and the second (bad) are attached,
>>> as well as the output from the "*" debug Ctrl+a handler in both cases.
>>
>> You should have provided this also for the state before the
>> first suspend. The state after the first resume already looks
>> corrupted (presumably just not as badly):
>
> I'll be able to send this tomorrow.
>
>>
>> (XEN) PCI-MSI interrupt information:
>> (XEN)  MSI    26 vec=71 lowest  edge   assert  log lowest dest=00000001 mask=0/1/-1
>> (XEN)  MSI    27 vec=00  fixed  edge deassert phys lowest dest=00000001 mask=0/1/-1
>>                      ^^
>> (XEN)  MSI    28 vec=29 lowest  edge   assert  log lowest dest=00000001 mask=0/1/-1
>> (XEN)  MSI    29 vec=79 lowest  edge   assert  log lowest dest=00000001 mask=0/1/-1
>> (XEN)  MSI    30 vec=81 lowest  edge   assert  log lowest dest=00000001 mask=0/1/-1
>> (XEN)  MSI    31 vec=99 lowest  edge   assert  log lowest dest=00000001 mask=0/1/-1
>>
>> so this is likely the reason for things falling apart on the second
>> iteration:
>>
>> (XEN)   Interrupt Remapping: supported and enabled.
>> (XEN)   Interrupt remapping table (nr_entry=0x10000. Only dump P=1 entries here):
>> (XEN)        SVT  SQ   SID      DST  V  AVL DLM TM RH DM FPD P
>> (XEN)   0000:  1   0  f0f8 00000001 38    0   1  0  1  1   0 1
>> ...
>> (XEN)   0014:  1   0  00d8 00000001 a1    0   1  0  1  1   0 1
>> (XEN)   0015:  1   0  00fa 00000001 00    0   0  0  0  0   0 1
>>                                               ^     ^  ^
>> (XEN)   0016:  1   0  f0f8 00000001 31    0   1  1  1  1   0 1
>> (XEN)   0017:  1   0  00a0 00000001 a9    0   1  0  1  1   0 1
>> (XEN)   0018:  1   0  0200 00000001 b1    0   1  0  1  1   0 1
>> (XEN)   0019:  1   0  00c8 00000001 c9    0   1  0  1  1   0 1
>>
>> Surprisingly in both cases we get (with the other vector fields varying
>> accordingly)
>>
>> (XEN)    IRQ:  26 affinity:0001 vec:71 type=PCI-MSI         status=00000010 in-flight=0 domain-list=0:279(-S--),
>> (XEN)    IRQ:  27 affinity:0001 vec:21 type=PCI-MSI         status=00000010 in-flight=0 domain-list=0:278(-S--),
>>                                     ^^
>> (XEN)    IRQ:  28 affinity:0001 vec:29 type=PCI-MSI         status=00000010 in-flight=0 domain-list=0:277(-S--),
>> (XEN)    IRQ:  29 affinity:0001 vec:79 type=PCI-MSI         status=00000010 in-flight=0 domain-list=0:276(-S--),
>> (XEN)    IRQ:  30 affinity:0001 vec:81 type=PCI-MSI         status=00000010 in-flight=0 domain-list=0:275(PS--),
>> (XEN)    IRQ:  31 affinity:0001 vec:99 type=PCI-MSI         status=00000010 in-flight=0 domain-list=0:274(PS--),
>>
>> The interrupt in question belongs to 0000:00:1f.2, i.e. the
>> AHCI controller.
>
> This would be consistent with what I've observed.
>
>>
>> Unfortunately I can't make sense of the kernel side config space
>> restore messages - an offset of 1 gets reported for the device in
>> question (and various other odd offsets exist), yet 3.5's
>> drivers/pci/pci.c:pci_restore_config_space_range() calls
>> pci_restore_config_dword() with an offset that's always divisible
>> by 4. Could you clarify which kernel version you were using here?
>> We first need to determine whether the kernel corrupts something
>> (after all, config space isn't protected from Dom0 modifications) -
>> if that's the case, we may need to understand why older Xen was
>> immune against that. If that's not the case, adding some extra
>> logging to Xen's pci_restore_msi_state() would seem the best
>> first step, plus (maybe) logging of Dom0 post-resume config space
>> accesses to the device in question.
>
> This particular failure is using linux-3.2.23 plus some of Konrad's
> branches that haven't been merged into mainline (the s3 branches are
> probably the most relevant here)
>
>>
>> The most likely thing happening (though unclear where) is that
>> the corresponding struct msi_msg instance gets cleared in the
>> course of the first resume (but after the corresponding interrupt
>> remapping entry already got restored).
>>
>> Jan
>>
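
When comparing the attached dumps, the discrepancy pointed out above (the
MSI message for IRQ 27 showing vec=00 while the IRQ descriptor still shows
vec:21) can be looked for mechanically. The following is only a minimal
sketch in Python, assuming the dump lines have the same layout as the
excerpts quoted in this mail. The function name find_vector_mismatches and
the SAMPLE text are made up for illustration and are not part of Xen or any
Xen tool.

```python
import re

# Patterns derived from the log excerpts quoted in this thread (an assumption:
# other Xen versions may format these debug-key lines differently).
MSI_RE = re.compile(r"\(XEN\)\s+MSI\s+(\d+)\s+vec=([0-9a-fA-F]{2})")
IRQ_RE = re.compile(r"\(XEN\)\s+IRQ:\s+(\d+)\s+affinity:\S+\s+vec:([0-9a-fA-F]{2})")


def find_vector_mismatches(dump_text):
    """Return (irq, msi_vec, irq_vec) for IRQs whose two vector fields differ."""
    msi = {int(m.group(1)): int(m.group(2), 16) for m in MSI_RE.finditer(dump_text)}
    irq = {int(m.group(1)): int(m.group(2), 16) for m in IRQ_RE.finditer(dump_text)}
    # Only IRQs that appear in both sections can be compared.
    common = sorted(msi.keys() & irq.keys())
    return [(n, msi[n], irq[n]) for n in common if msi[n] != irq[n]]


# Shortened sample built from lines quoted above (IRQ 26 is fine, IRQ 27 is not).
SAMPLE = """\
(XEN)  MSI    26 vec=71 lowest  edge   assert  log lowest dest=00000001 mask=0/1/-1
(XEN)  MSI    27 vec=00  fixed  edge deassert phys lowest dest=00000001 mask=0/1/-1
(XEN)    IRQ:  26 affinity:0001 vec:71 type=PCI-MSI         status=00000010
(XEN)    IRQ:  27 affinity:0001 vec:21 type=PCI-MSI         status=00000010
"""

mismatches = find_vector_mismatches(SAMPLE)
for irq_no, msi_vec, irq_vec in mismatches:
    print(f"IRQ {irq_no}: MSI message vec={msi_vec:02x}, IRQ descriptor vec={irq_vec:02x}")
```

Both patterns only look at the start of each line, so they should also work
on dumps where the mail client has wrapped the tail of the line. Running the
same check on the pre-s3, s3-first and s3-second dumps would show at which
point the two vector fields start to disagree.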

Attachment: xen-dump-s3-second.txt
Description: Text document

Attachment: xen-dump-s3-first.txt
Description: Text document

Attachment: xen-dump-pre-s3.txt
Description: Text document
