
Re: [Xen-devel] High CPU temp, suspend problem - xen 4.1.5-pre, linux 3.7.x

On 28.03.2013 18:50, Andrew Cooper wrote:
> On 28/03/2013 17:44, Marek Marczykowski wrote:
>> On 28.03.2013 18:41, Andrew Cooper wrote:
>>> On 27/03/2013 17:15, Marek Marczykowski wrote:
>>>> On 27.03.2013 17:56, Andrew Cooper wrote:
>>>>> On 27/03/2013 15:47, Konrad Rzeszutek Wilk wrote:
>>>>>> On Wed, Mar 27, 2013 at 02:52:14PM +0000, Andrew Cooper wrote:
>>>>>>> On 27/03/2013 14:46, Andrew Cooper wrote:
>>>>>>>> On 27/03/2013 14:31, Marek Marczykowski wrote:
>>>>>>>>> On 27.03.2013 09:52, Jan Beulich wrote:
>>>>>>>>>>>>> On 26.03.13 at 19:50, Andrew Cooper <andrew.cooper3@xxxxxxxxxx> wrote:
>>>>>>>>>>> So vector e9 doesn't appear to be programmed in anywhere.
>>>>>>>>>> Quite obviously, as it's the 8259A vector for IRQ 9. The question
>>>>>>>>>> really is why an IRQ appears on that vector in the first place. The
>>>>>>>>>> 8259A resume code _should_ leave all IRQs masked on a fully
>>>>>>>>>> IO-APIC system (see my question raised yesterday).
>>>>>>>>>> And that's also why I suggested, for an experiment, to fiddle with
>>>>>>>>>> the loop exit condition to exclude legacy vectors (which wouldn't
>>>>>>>>>> be a final solution, but would at least tell us whether the direction
>>>>>>>>>> is the right one). In the end, besides understanding why an
>>>>>>>>>> interrupt on vector E9 gets raised at all, we may also need to
>>>>>>>>>> tweak the IRQ migration logic to not do anything on legacy IRQs,
>>>>>>>>>> but that would need to happen earlier than in
>>>>>>>>>> smp_irq_move_cleanup_interrupt(). Considering that 4.3
>>>>>>>>>> apparently doesn't have this problem, we may need to go hunt for
>>>>>>>>>> a change that isn't directly connected to this, yet deals with the
>>>>>>>>>> problem as a side effect (at least I don't recall any particular fix
>>>>>>>>>> since 4.2). One aspect here is the double mapping of legacy IRQs
>>>>>>>>>> (once to their IO-APIC vector, and once to their legacy vector,
>>>>>>>>>> i.e. vector_irq[] having two entries pointing to the same IRQ).
>>>>>>>>> So I tried changing the loop condition to LAST_DYNAMIC_VECTOR and it no
>>>>>>>>> longer hits that BUG/ASSERT. But it still doesn't work - only CPU0 is used
>>>>>>>>> by the scheduler, and there are also some errors from the dom0 kernel and
>>>>>>>>> errors about PCI devices used by domU(1).
>>>>>>>>> Messages from resume (different tries):
>>>>>>>>> http://duch.mimuw.edu.pl/~marmarek/qubes/xen-4.1-last-dynamic-vector.log
>>>>>>>>> http://duch.mimuw.edu.pl/~marmarek/qubes/xen-4.1-last-dynamic-vector2.log
>>>>>>>>> Also, one time I got a fatal page fault error earlier in resume (it isn't
>>>>>>>>> deterministic):
>>>>>>>>> http://duch.mimuw.edu.pl/~marmarek/qubes/xen-4.1-resume-page-fault.log
>>>>>>>> This pagefault is a Null structure pointer dereference, likely the
>>>>>>>> scheduling data.  At a first glance, it looks related to the assertion
>>>>>>>> failures I have been seeing sporadically in testing, but unable to
>>>>>>>> reproduce reliably.  There seems to be something quite dodgy with
>>>>>>>> interaction of vcpu_wake and scheduling loops.
>>>>>>>> The other logs indicate that dom0 appears to have a domain id of 1,
>>>>>>>> which is sure to cause problems.
>>>>>>> Actually - ignore this
>>>>>>> From the log,
>>>>>>> (XEN) physdev.c:153: dom0: can't create irq for msi!
>>>>>>> [  113.637037] xhci_hcd 0000:03:00.0: xen map irq failed -22 for 32752 domain
>>>>>>> (XEN) physdev.c:153: dom0: can't create irq for msi!
>>>>>>> [  113.657911] xhci_hcd 0000:03:00.0: xen map irq failed -22 for 32752 domain
>>>>>>> and later
>>>>>>> (XEN) physdev.c:153: dom1: can't create irq for msi!
>>>>>>> [  121.909814] pciback 0000:00:19.0: xen map irq failed -22 for 1 domain
>>>>>>> [  121.954080] error enable msi for guest 1 status ffffffea
>>>>>>> (XEN) physdev.c:153: dom1: can't create irq for msi!
>>>>>>> [  122.035355] pciback 0000:00:19.0: xen map irq failed -22 for 1 domain
>>>>>>> [  122.044421] error enable msi for guest 1 status ffffffea
>>>>>>> I think that there is a separate bug where mapped irqs are not unmapped
>>>>>>> on the suspend path.
>>>>>> You thinking this is a Linux (xen irq machinery) issue? Meaning it should
>>>>>> end up calling PHYSDEV_unmap_pirq as part of the suspend process?
>>>>> I am not sure.  Without looking at the code, I am only speculating.
>>>>> Beyond that, the main question is about the expected behaviour.  Do we
>>>>> expect dom0/U to unmap its irqs and remap them after resume?  What do we
>>>>> expect from domains which are unaware of the host sleep action?
>>>> BTW this is the case: domain 1 isn't fully aware of sleep. It has some PCI
>>>> devices assigned. The only action taken there before suspend is shutting
>>>> down the network interfaces (without this the system hung during suspend).
>>> What do you mean here by shutting down the network interfaces? Are the
>>> devices being assigned back to dom0?  
>> No, just a simple ip link set eth0 down. That seems to be enough for suspend
>> to succeed, at least on most hardware...
> In which case repeat map_pirq hypercalls will fail with -EINVAL because
> the pirq is already set up.  It is probably worth putting a printk in
> map_pirq and unmap_pirq to see exactly what is happening across the
> sleep/resume cycle.

No unmap/map is done for that domain during the sleep/resume cycle (it has two
mapped pirqs). Even for dom0 I see only one unmap/map during suspend/resume.
For most devices this doesn't break anything. A few exceptions need a module
reload after resume (e.g. sky2), but I'm not sure of the reason (no additional
logs, simply no link detected).
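
To make the failure mode discussed above concrete, here is a toy model (Python,
not Xen code) of a per-domain pirq table in which a second map of an already
mapped pirq is rejected with -EINVAL, as Andrew suggests. The class name, the
method names and pirq number 16 are made up for illustration; only the -22
return value, the device address and the "status ffffffea" string come from
the logs quoted above. It also shows that ffffffea is simply -22 printed as an
unsigned 32-bit value.

```python
import errno


class PirqTable:
    """Toy per-domain pirq table. Hypothetical; not Xen's actual structure."""

    def __init__(self):
        self.mapped = {}  # pirq number -> device label

    def map_pirq(self, pirq, device):
        # Assumption being modelled: mapping a pirq that is still mapped
        # (e.g. never unmapped before suspend) is rejected with -EINVAL.
        if pirq in self.mapped:
            return -errno.EINVAL
        self.mapped[pirq] = device
        return 0

    def unmap_pirq(self, pirq):
        # Unmapping something that is not mapped is also treated as invalid.
        if pirq not in self.mapped:
            return -errno.EINVAL
        del self.mapped[pirq]
        return 0


def as_u32_hex(value):
    """Format a (possibly negative) status as unsigned 32-bit hex."""
    return format(value & 0xFFFFFFFF, "08x")


table = PirqTable()
first = table.map_pirq(16, "0000:00:19.0")   # initial setup, before suspend
second = table.map_pirq(16, "0000:00:19.0")  # repeated after resume, no unmap

print("first map:", first)
print("second map:", second, "status", as_u32_hex(second))
```

In this model, calling unmap_pirq before the second map_pirq makes the remap
succeed, which is the behaviour a printk in both paths would help confirm or
rule out on the real system.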

Best Regards / Pozdrawiam,
Marek Marczykowski
Invisible Things Lab
