
Re: [Xen-users] Dom0 Locked up for 4 hours "BUG: soft lockup - CPU#3 stuck for 61s!"


  • To: Todd Deshane <todd.deshane@xxxxxxx>
  • From: Javier Frias <jfrias@xxxxxxxxx>
  • Date: Tue, 29 Mar 2011 09:35:25 -0400
  • Cc: xen-users@xxxxxxxxxxxxxxxxxxx
  • Delivery-date: Tue, 29 Mar 2011 06:36:36 -0700
  • Domainkey-signature: a=rsa-sha1; c=nofws; d=gmail.com; s=gamma; h=mime-version:in-reply-to:references:date:message-id:subject:from:to :cc:content-type; b=PkUyhgCdfXimC/JFXJP0z/7oPkwJB6LBRUyIe8dB4hztN+SzZuVxZYA2qOoQC4GbiI PgiR6oUL5jf48sa3gq+qNMvl2YYmgi6gXy5gQd15DHnVvY8CBrfwAA0hnWT5eRqVZVyp cg7quohhuyXy5f2Qrj/ULbA6R7udzRDAbMPpw=
  • List-id: Xen user discussion <xen-users.lists.xensource.com>

I never saw this reply; sorry for the delay. Answers inline. (Still
seeing the issue.)

On Sat, Feb 26, 2011 at 2:50 PM, Todd Deshane <todd.deshane@xxxxxxx> wrote:
> On Sat, Feb 26, 2011 at 12:11 AM, Javier Frias <jfrias@xxxxxxxxx> wrote:
>> I posted a bug about this, but figured I'd ask the mailing list to see
>> if someone had seen this.
>> Bugzilla: http://bugzilla.xensource.com/bugzilla/show_bug.cgi?id=1746
>>
>> Basically, I had a dom0, after 57 days with no issues, lock up for 4
>> hours, completely unresponsive, and then recover. The domUs were
>> unaffected, except that I could not shut them down (since dom0 was
>> unresponsive). I was, however, able to gain access via xapi/xencenter,
>> and at least had some access (console, status, etc. all worked via
>> xapi).
>>
>
> Could you clarify this explanation a bit. What access was not
> available for 4 hours?
>

The dom0 was so loaded that ssh and the other services running on it
(snmp, for one) were simply unavailable. It was swapping and thoroughly
overloaded. I think this was due to high I/O from one of the guests,
since I was able to log in to the host as one of the events happened
and saw the following via top.

Tasks: 228 total,   2 running, 226 sleeping,   0 stopped,   0 zombie
Cpu(s):  0.4%us,  0.0%sy,  0.0%ni, 98.8%id,  0.0%wa,  0.0%hi,  0.0%si,  0.8%st
Mem:    771328k total,   747572k used,    23756k free,   139952k buffers
Swap:   524280k total,     5440k used,   518840k free,   342188k cached

  PID USER      PR  NI  VIRT  RES  SHR S  %CPU %MEM    TIME+  COMMAND
15715 root      20   0  3796 2388 1868 S 9293.8  0.3 686893:40 tapdisk2
24367 root      20   0  4128 2720 1896 S 8004.8  0.4 553094:39 tapdisk2
 3133 root      20   0  3928 2520 1868 S 5264.2  0.3 695773:20 tapdisk2
26586 root      20   0  4924 3516 1868 S 1370.3  0.5 450796:40 tapdisk2
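(Not from the original thread, but for anyone triaging the same symptom: a
quick sketch of how you can flag runaway processes in a batch-mode top
capture, e.g. from `top -b -n 1`. The sample rows are the tapdisk2 entries
above, and the 1000% threshold is an arbitrary illustrative cutoff.)

```python
# Hedged sketch: parse `top -b -n 1` process rows and flag entries whose
# %CPU exceeds a threshold. Field layout matches the capture above:
#   PID USER PR NI VIRT RES SHR S %CPU %MEM TIME+ COMMAND
SAMPLE = """\
15715 root 20 0 3796 2388 1868 S 9293.8 0.3 686893:40 tapdisk2
24367 root 20 0 4128 2720 1896 S 8004.8 0.4 553094:39 tapdisk2
3133 root 20 0 3928 2520 1868 S 5264.2 0.3 695773:20 tapdisk2
26586 root 20 0 4924 3516 1868 S 1370.3 0.5 450796:40 tapdisk2
"""

def runaway_processes(top_lines, cpu_threshold=1000.0):
    """Return (pid, %cpu, command) tuples for rows over the threshold."""
    hits = []
    for line in top_lines.strip().splitlines():
        fields = line.split()
        pid, cpu, command = int(fields[0]), float(fields[8]), fields[11]
        if cpu >= cpu_threshold:
            hits.append((pid, cpu, command))
    return hits

for pid, cpu, cmd in runaway_processes(SAMPLE):
    print(f"PID {pid}: {cmd} at {cpu}% CPU")
```

Note that %CPU figures in the thousands are pathological even on a
multi-core box with 4 dom0 vcpus, which itself suggests something is badly
wrong with those tapdisk2 processes.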


Everything I've read says tapdisk2 is far more CPU-intensive than any
other driver. Is there a way to use raw LVM in XCP? In our case I think
that would be the best choice, since we have a beefy disk subsystem.
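(For what it's worth, a hedged sketch of creating a local LVM-backed SR
with the xe CLI; /dev/sdb and the name label are placeholders, not values
from this thread. Note that on recent XCP/XenServer releases the "lvm" SR
type stores disks as VHD-on-LVM, so check whether it actually takes
tapdisk out of the data path on your version before relying on it.)

```shell
# Illustrative only: create a local LVM SR on this host.
# Substitute your own device; this will claim the whole disk.
HOST_UUID=$(xe host-list --minimal)
xe sr-create host-uuid="$HOST_UUID" \
    name-label="Local LVM SR" \
    type=lvm content-type=user \
    device-config:device=/dev/sdb
```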


> You say you could access via xapi/xencenter was this after the 4 hours
> or during?
>
Oddly enough, during, which was puzzling since every other service was
affected by the high load and swapping on the host. Things like shutting
down a VM did not work, though; it seemed only read-only operations
(like viewing VM running state and parameters) worked via XenCenter or
by hitting the API directly.

> Did you happen to look at the guest performance during those times?
> Was one of the guest doing a lot of disk I/O? Could you give some more
> information as to how the guests access their virtual disks (local,
> NFS, iSCSI, etc.) and any other information about your setup that
> could give us hints as to what might have caused this.

Yes, absolutely: two VMs on the host that locked up have what would be
considered high-I/O workloads (one does lots of small-file I/O, and the
other just appends to large files).

My hardware looks like the following (I use no shared storage):

Dell R710
72GB RAM
2 x X5650 @ 2.67GHz (12 physical cores, 12 additional threads)
6 x 600GB 15K disks in RAID 10
Dell H700 RAID controller (512MB version)

So the hardware should have no problem handling the I/O the VMs are doing.

The dom0 has the default CPU and RAM allocation (768MB and 4 vcpus).
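(An aside, not from the thread: 768MB of dom0 memory is easy to exhaust
under heavy blktap load, and the swapping described above suggests dom0
is memory-starved. On XCP 1.x the dom0 allocation is set on the Xen
command line in the bootloader config; a hedged sketch of the relevant
line, where 2048M is an illustrative figure, not a recommendation:)

```shell
# Excerpt of /boot/extlinux.conf on XCP 1.x (illustrative; keep your
# existing flags and only add/adjust dom0_mem on the xen.gz line):
append /boot/xen.gz dom0_mem=2048M,max:2048M [existing Xen flags] \
  --- /boot/vmlinuz-2.6-xen [existing kernel flags] \
  --- /boot/initrd-2.6-xen.img
```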

Any help is greatly appreciated.

Also, here's a kernel message from a VM as it went nuts (seems related):

===dmesg====

[6954775.046768] BUG: soft lockup - CPU#2 stuck for 61s! [apache2:20139]
[6954775.046776] Modules linked in: xenfs lp parport
[6954775.046784] CPU 2
[6954775.046786] Modules linked in: xenfs lp parport
[6954775.046793]
[6954775.046796] Pid: 20139, comm: apache2 Tainted: G      D
2.6.35-22-virtual #34~lucid1-Ubuntu /
[6954775.046802] RIP: e030:[<ffffffff812526a5>]  [<ffffffff812526a5>]
sys_semtimedop+0x625/0x690
[6954775.046811] RSP: e02b:ffff8800fb0fbcf8  EFLAGS: 00000246
[6954775.046815] RAX: 0000000000000001 RBX: 0000000000430000 RCX:
ffff8800fb0fbfd8
[6954775.046820] RDX: 0000000000000000 RSI: ffff8800eeb744a0 RDI:
00000000ffffffff
[6954775.046825] RBP: ffff8800fb0fbf68 R08: 0000000000000000 R09:
0000000000000000
[6954775.046830] R10: 0000000000000000 R11: 0000000000000001 R12:
0000000000000001
[6954775.046835] R13: 0000000000000000 R14: 0000000000000001 R15:
ffff8800fae5ee50
[6954775.046843] FS:  00007f3943fd2740(0000) GS:ffff880003e76000(0000)
knlGS:0000000000000000
[6954775.046848] CS:  e033 DS: 0000 ES: 0000 CR0: 000000008005003b
[6954775.046852] CR2: 00007f9da4c3b000 CR3: 00000000fa0b3000 CR4:
0000000000002660
[6954775.046857] DR0: 0000000000000000 DR1: 0000000000000000 DR2:
0000000000000000
[6954775.046863] DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7:
0000000000000400
[6954775.046868] Process apache2 (pid: 20139, threadinfo
ffff8800fb0fa000, task ffff8800fa9496e0)
[6954775.046874] Stack:
[6954775.046876]  ffff8800ffc39400 ffff8800fb0fbf28 ffff8800fa9496e0
ffffffff81a514a8
[6954775.046884] <0> ffff8800fa4ec060 0000000000000000
00000001810072d2 ffff8800fb0fbd48
[6954775.046893] <0> ffff8800fae5ee50 ffff8800fb0fbd48
ffff1000ffff0000 ffff8800fa9402e0
[6954775.046904] Call Trace:
[6954775.046909]  [<ffffffff810072bf>] ? xen_restore_fl_direct_end+0x0/0x1
[6954775.046916]  [<ffffffff8100611d>] ? xen_flush_tlb_single+0x9d/0xb0
[6954775.046921]  [<ffffffff8100527f>] ? xen_set_pte_at+0x6f/0xf0
[6954775.046927]  [<ffffffff81006b3d>] ? xen_force_evtchn_callback+0xd/0x10
[6954775.046932]  [<ffffffff810072d2>] ? check_events+0x12/0x20
[6954775.046938]  [<ffffffff81006b3d>] ? xen_force_evtchn_callback+0xd/0x10
[6954775.046943]  [<ffffffff810072d2>] ? check_events+0x12/0x20
[6954775.046949]  [<ffffffff810072bf>] ? xen_restore_fl_direct_end+0x0/0x1
[6954775.046954]  [<ffffffff810041a1>] ? xen_clts+0x71/0x80
[6954775.046959]  [<ffffffff8101407c>] ? restore_i387_xstate+0xcc/0x1c0
[6954775.046965]  [<ffffffff81252720>] sys_semop+0x10/0x20
[6954775.046970]  [<ffffffff8100a0f2>] system_call_fastpath+0x16/0x1b
[6954775.046974] Code: 57 48 45 85 f6 74 65 48 8b 4a 10 48 89 42 10 48
83 c2 08 48 89 95 60 ff ff ff 48 89 8d 68 ff ff ff 48 89 01 e9 29 fe
ff ff f3 90 <e9> 63 fe ff ff 48 8b 95 60 ff ff ff 48 8b 85 68 ff ff ff
49 b8
[6954775.047036] Call Trace:
[6954775.047040]  [<ffffffff810072bf>] ? xen_restore_fl_direct_end+0x0/0x1
[6954775.047045]  [<ffffffff8100611d>] ? xen_flush_tlb_single+0x9d/0xb0
[6954775.047050]  [<ffffffff8100527f>] ? xen_set_pte_at+0x6f/0xf0
[6954775.047055]  [<ffffffff81006b3d>] ? xen_force_evtchn_callback+0xd/0x10
[6954775.047061]  [<ffffffff810072d2>] ? check_events+0x12/0x20
[6954775.047066]  [<ffffffff81006b3d>] ? xen_force_evtchn_callback+0xd/0x10
[6954775.047071]  [<ffffffff810072d2>] ? check_events+0x12/0x20
[6954775.047077]  [<ffffffff810072bf>] ? xen_restore_fl_direct_end+0x0/0x1
[6954775.047082]  [<ffffffff810041a1>] ? xen_clts+0x71/0x80
[6954775.047087]  [<ffffffff8101407c>] ? restore_i387_xstate+0xcc/0x1c0
[6954775.047092]  [<ffffffff81252720>] sys_semop+0x10/0x20
[6954775.047097]  [<ffffffff8100a0f2>] system_call_fastpath+0x16/0x1b
[6954777.197935] BUG: soft lockup - CPU#3 stuck for 61s! [apache2:20145]
[6954777.197949] Modules linked in: xenfs lp parport
[6954777.197959] CPU 3
[6954777.197961] Modules linked in: xenfs lp parport
[6954777.197969]
[6954777.197973] Pid: 20145, comm: apache2 Tainted: G      D
2.6.35-22-virtual #34~lucid1-Ubuntu /
[6954777.197979] RIP: e030:[<ffffffff812526a5>]  [<ffffffff812526a5>]
sys_semtimedop+0x625/0x690
[6954777.197993] RSP: e02b:ffff880048ed3cf8  EFLAGS: 00000246
[6954777.197997] RAX: 0000000000000001 RBX: 0000000000430000 RCX:
ffff880048ed3fd8
[6954777.198002] RDX: 0000000000000000 RSI: ffff8800032d16e0 RDI:
00000000ffffffff
[6954777.198007] RBP: ffff880048ed3f68 R08: 0000000000000000 R09:
0000000000000000
[6954777.198012] R10: 0000000000000000 R11: 0000000000000001 R12:
0000000000000001
[6954777.198017] R13: 0000000000000000 R14: 0000000000000001 R15:
ffff8800fae5ee50
[6954777.198027] FS:  00007f3943fd2740(0000) GS:ffff880003e94000(0000)
knlGS:0000000000000000
[6954777.198032] CS:  e033 DS: 0000 ES: 0000 CR0: 000000008005003b
[6954777.198036] CR2: 00007f393dc39030 CR3: 00000000faf56000 CR4:
0000000000002660
[6954777.198042] DR0: 0000000000000000 DR1: 0000000000000000 DR2:
0000000000000000
[6954777.198047] DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7:
0000000000000400
[6954777.198052] Process apache2 (pid: 20145, threadinfo
ffff880048ed2000, task ffff8800fb16c4a0)
[6954777.198057] Stack:
[6954777.198060]  0000000000000293 ffff880048ed3f28 ffff8800fb16c4a0
ffffffff81a514a8
[6954777.198068] <0> ffff8800fa4ec8a0 0000000000000000
0000000148ed3dd8 ffff880048ed3d48
[6954777.198077] <0> ffff8800fae5ee50 ffff880048ed3d48
ffff1000ffff0000 ffff8800fb2d4480
[6954777.198088] Call Trace:
[6954777.198097]  [<ffffffff81006b3d>] ? xen_force_evtchn_callback+0xd/0x10
[6954777.198104]  [<ffffffff810072d2>] ? check_events+0x12/0x20
[6954777.198109]  [<ffffffff81006b3d>] ? xen_force_evtchn_callback+0xd/0x10
[6954777.198115]  [<ffffffff810072d2>] ? check_events+0x12/0x20
[6954777.198123]  [<ffffffff81036e88>] ? pvclock_clocksource_read+0x58/0xd0
[6954777.198129]  [<ffffffff81007161>] ? xen_clocksource_read+0x21/0x30
[6954777.198137]  [<ffffffff8108931a>] ? do_gettimeofday+0x1a/0x50
[6954777.198142]  [<ffffffff81252720>] sys_semop+0x10/0x20
[6954777.198148]  [<ffffffff8100a0f2>] system_call_fastpath+0x16/0x1b
[6954777.198152] Code: 57 48 45 85 f6 74 65 48 8b 4a 10 48 89 42 10 48
83 c2 08 48 89 95 60 ff ff ff 48 89 8d 68 ff ff ff 48 89 01 e9 29 fe
ff ff f3 90 <e9> 63 fe ff ff 48 8b 95 60 ff ff ff 48 8b 85 68 ff ff ff
49 b8
[6954777.198218] Call Trace:
[6954777.198223]  [<ffffffff81006b3d>] ? xen_force_evtchn_callback+0xd/0x10
[6954777.198229]  [<ffffffff810072d2>] ? check_events+0x12/0x20
[6954777.198234]  [<ffffffff81006b3d>] ? xen_force_evtchn_callback+0xd/0x10
[6954777.198239]  [<ffffffff810072d2>] ? check_events+0x12/0x20
[6954777.198245]  [<ffffffff81036e88>] ? pvclock_clocksource_read+0x58/0xd0
[6954777.198251]  [<ffffffff81007161>] ? xen_clocksource_read+0x21/0x30
[6954777.198256]  [<ffffffff8108931a>] ? do_gettimeofday+0x1a/0x50
[6954777.198261]  [<ffffffff81252720>] sys_semop+0x10/0x20
[6954777.198267]  [<ffffffff8100a0f2>] system_call_fastpath+0x16/0x1b

_______________________________________________
Xen-users mailing list
Xen-users@xxxxxxxxxxxxxxxxxxx
http://lists.xensource.com/xen-users


 

