Xen project Mailing List

Re: [Xen-devel] Re: DomU lockups after resume from S3 on Core i5 processors

To: Joanna Rutkowska <joanna@xxxxxxxxxxxxxxxxxxxxxx>

From: Jeremy Fitzhardinge <jeremy@xxxxxxxx>

Date: Mon, 05 Jul 2010 16:17:15 -0700

Delivery-date: Mon, 05 Jul 2010 16:18:09 -0700

List-id: Xen developer discussion <xen-devel.lists.xensource.com>

On 07/05/2010 03:52 PM, Joanna Rutkowska wrote:
> On 07/06/10 00:43, Jeremy Fitzhardinge wrote:
>
>> On 07/05/2010 03:07 PM, Joanna Rutkowska wrote:
>>
>>> On 07/05/10 23:28, Joanna Rutkowska wrote:
>>>
>>>> On 07/05/10 12:38, Joanna Rutkowska wrote:
>>>>
>>>>> I'm experiencing very reproducible DomU lockups that occur after I
>>>>> resume the system from an S3 sleep. Strangely this seem to happen only
>>>>> on my Core i5 systems (tested on two different machines), but not on
>>>>> older Core 2 Duo systems.
>>>>>
>>>>> Usually this causes the apps (e.g. Firefox) running in DomUs to become
>>>>> unresponsive, but sometimes I see that some very limited functionality
>>>>> of the app is still available (e.g. I can open/close Tabs in Firefox,
>>>>> but cannot do much anything more). Also, when I log in to the DomU via
>>>>> xm console, I usually can see the login prompt, can enter the username,
>>>>> but then the console hangs.
>>>>>
>>>>> I tried to attach to such a hanged DomU using gdbserver-xen, but when I
>>>>> subsequently try to attach to the server from gdb (via the target
>>>>> 127.0.0.1:9999 command), my gdb segfaults (how funny!).
>>>>>
>>>>> I'm running Xen 3.4.3, and fairly recent pvops0 kernel in DomU. In Dom0
>>>>> I run 2.6.34-xenlinux kernel (opensuse patches), but I doubt it is
>>>>> relevant in any way.
>>>>>
>>>>> This seems like a scheduling problem, and, because it seems to affect
>>>>> Core i5 processors, but not Core 2 Duos, it might have something to do
>>>>> with Hyperthreading perhaps?
>>>>>
>>>> Ok, finally got the gdbsever working. This is the backtrace I get when
>>>> attaching to a lockedup DomU after resume:
>>>>
>>>> #0 0xffffffff810093aa in ?? ()
>>>> #1 0xffffffff8168be18 in ?? ()
>>>> #2 0xffff880003a21600 in ?? ()
>>>> #3 0xffffffff8100ee63 in HYPERVISOR_sched_op ()
>>>> at
>>>> /usr/src/debug/kernel-2.6.32/linux-2.6.32.x86_64/arch/x86/include/asm/xen/hypercall.h:292
>>>> #4 xen_safe_halt () at arch/x86/xen/irq.c:104
>>>> #5 0xffffffff8100c33e in raw_safe_halt () at
>>>> /usr/src/debug/kernel-2.6.32/linux-2.6.32.x86_64/arch/x86/include/asm/paravirt.h:110
>>>> #6 xen_idle () at arch/x86/xen/setup.c:193
>>>> #7 0xffffffff81011cdd in cpu_idle () at arch/x86/kernel/process_64.c:143
>>>> #8 0xffffffff8144b997 in rest_init () at init/main.c:445
>>>> #9 0xffffffff81824ddc in start_kernel () at init/main.c:695
>>>> #10 0xffffffff818242c1 in x86_64_start_reservations
>>>> (real_mode_data=<value optimized out>) at arch/x86/kernel/head64.c:123
>>>> #11 0xffffffff81828160 in xen_start_kernel () at
>>>> arch/x86/xen/enlighten.c:1300
>>>> #12 0xffffffff838f3000 in ?? ()
>>>> #13 0xffffffff838f4000 in ?? ()
>>>> #14 0xffffffff838f5000 in ?? ()
>>>>
>>>> Any ideas?
>>>>
>>> ... and when I disabled Hyperthreading in BIOS, the problem seems to
>>> gone. Obviously this is not a desired solution...
>>>
>> HT has historically been very good at flushing out race conditions which
>> would normally be tricky to hit on SMP systems. I assume your domain is
>> single CPU?
>>
> Actually no. It used to be indeed, but then I thought it might be the
> issue and assigned 2 vcpus to it, but it still they were locking up.
>

Does the other cpu have the same backtrace into idle?

>> Do you know what's going on it in that it might be waiting
>> for?
>>
> No idea. I might be guessing that it would be different kernel
> subsystems each time -- e.g. when I'm lucky and when the apps got only
> "partially" locked up, I can e.g. open new tabs in Google Chrome, I can
> see some thumbnails of my popular websites, but without their contents.
> This would suggest the networking subsystem is dead, but at the same
> time Chrome is apparently communicating fine with the X server in the
> DomU (and which in turn talks fine with Dom0 over Xen shared
> memory/evtchanl).
>
> I experienced the above behavior also when had only one VCPU er DomU.
>

I've seen similar things with just normal domain save/restore, where the
timer interrupt seems to be failing. Can you ssh into the domain? I
found that I couldn't do an interactive ssh (hung at the prompt), but a
non-interactive command would work, so I could cat /proc/interrupts.
This was on my non-HT i7 box, and it affected both pvops domUs, and
CentOS 5 ones.

>> Is it not longer getting timer events or something? Does the Xen
>> 'q' debug-key make it do anything?
>>
> Ah, that's some secret option I've never heard of... Is in the gdb when
> using with gdbserver-xen?
>

No, on the xen console: type ^A^A^A to switch input to Xen, then press q
(h gets a list of other magic keys). ^A^A^A switches the console back
to dom0. You can also trigger it with "xm debug-key q" and look at "xm
dmesg" to see the results if you can't get to the Xen console.

    J

_______________________________________________
Xen-devel mailing list
Xen-devel@xxxxxxxxxxxxxxxxxxx
http://lists.xensource.com/xen-devel
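
One way to act on the /proc/interrupts suggestion in the message above is to capture the file twice, a few seconds apart, and see which timer lines have stopped counting. The following is only a minimal sketch of that comparison, not something from the thread: the function names, the IRQ numbers, and the sample text (including labels such as "timer0") are made up for illustration, and real output from a domU will look different.

```python
def parse_interrupts(text):
    """Map each IRQ label to (total count across CPUs, description)."""
    lines = text.strip().splitlines()
    ncpus = len(lines[0].split())  # header row: CPU0 CPU1 ...
    table = {}
    for line in lines[1:]:
        label, _, rest = line.partition(":")
        fields = rest.split()
        counts = []
        for field in fields[:ncpus]:
            if not field.isdigit():
                break  # summary rows may carry fewer counters than CPUs
            counts.append(int(field))
        desc = " ".join(fields[len(counts):])
        table[label.strip()] = (sum(counts), desc)
    return table


def stalled(before, after, keyword="timer"):
    """Return labels whose description contains `keyword` and whose
    total count did not change between the two snapshots."""
    old, new = parse_interrupts(before), parse_interrupts(after)
    return [label for label, (count, desc) in new.items()
            if keyword in desc and label in old and count == old[label][0]]


# Made-up sample snapshots (hypothetical IRQ numbers and names).
BEFORE = """
           CPU0       CPU1
 16:       1000          0   xen-percpu-virq      timer0
 22:          0        900   xen-percpu-virq      timer1
 17:         50         40   xen-dyn-event        eth0
"""

AFTER = """
           CPU0       CPU1
 16:       1500          0   xen-percpu-virq      timer0
 22:          0        900   xen-percpu-virq      timer1
 17:         75         60   xen-dyn-event        eth0
"""

if __name__ == "__main__":
    print(stalled(BEFORE, AFTER))
```

The two snapshots could be collected with a non-interactive command, for example running "cat /proc/interrupts" over ssh twice, since an interactive session may hang at the prompt. A timer line whose count stays flat while the domain is supposedly running would be consistent with the failing timer interrupt described above, although it does not by itself prove the cause.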
