Re: [Xen-users] Dom0 Locked up for 4 hours "BUG: soft lockup - CPU#3 stuck for 61s!"
Never saw this reply, sorry for the delay. Answers inline (still seeing the issue).

On Sat, Feb 26, 2011 at 2:50 PM, Todd Deshane <todd.deshane@xxxxxxx> wrote:
> On Sat, Feb 26, 2011 at 12:11 AM, Javier Frias <jfrias@xxxxxxxxx> wrote:
>> I posted a bug about this, but figured I'd ask the mailing list to see
>> if someone had seen this.
>> Bugzilla: http://bugzilla.xensource.com/bugzilla/show_bug.cgi?id=1746
>>
>> Basically, I had a dom0, after 57 days of non issues, lock up for 4
>> hours, completely unresponsive, and then recovered. The domU's were
>> unaffected except for the fact that I could not shut them down. (
>> since dom0 was unresponsive ). Although I was able to gain access via
>> xapi/xencenter, and I atleast had some access ( console, status, etc,
>> all worked via xapi).
>>
>
> Could you clarify this explanation a bit. What access was not
> available for 4 hours?
>

The dom0 was so loaded that ssh and any other services running on it (snmp for one) were unavailable. It was swapping and thoroughly overloaded. I think this was due to the high I/O being done by one of the guests, since I was able to log in to the host as one of the events happened and saw this via top:

Tasks: 228 total, 2 running, 226 sleeping, 0 stopped, 0 zombie
Cpu(s): 0.4%us, 0.0%sy, 0.0%ni, 98.8%id, 0.0%wa, 0.0%hi, 0.0%si, 0.8%st
Mem: 771328k total, 747572k used, 23756k free, 139952k buffers
Swap: 524280k total, 5440k used, 518840k free, 342188k cached

  PID USER PR NI VIRT  RES  SHR S   %CPU %MEM     TIME+ COMMAND
15715 root 20  0 3796 2388 1868 S 9293.8  0.3 686893:40 tapdisk2
24367 root 20  0 4128 2720 1896 S 8004.8  0.4 553094:39 tapdisk2
 3133 root 20  0 3928 2520 1868 S 5264.2  0.3 695773:20 tapdisk2
26586 root 20  0 4924 3516 1868 S 1370.3  0.5 450796:40 tapdisk2

Everywhere I read, they say tapdisk2 is way more CPU intensive than any other driver. Is there a way to use raw LVM in XCP? In our case, I think that would be the best choice, since we have a beefy subsystem.

> You say you could access via xapi/xencenter was this after the 4 hours
> or during?
>

Oddly enough, during. Which was puzzling, since every other service was affected by the high load and swapping going on in the host. Things like shutting down a host did not work though; it seemed that only read-only operations worked (like verifying VM running state and params via xencenter or hitting the API directly).

> Did you happen to look at the guest performance during those times?
> Was one of the guest doing a lot of disk I/O? Could you give some more
> information as to how the guests access their virtual disks (local,
> NFS, iSCSI, etc.) and any other information about your setup that
> could give us hints as to what might have caused this.

Yes, absolutely. Two VMs on this host that locked up have what would be considered high I/O characteristics (one does lots of small-file I/O, and the other just has large files being appended to).

My hardware looks like the following (I use no shared storage):

Dell R710
72GB RAM
2 x X5650 @ 2.67GHz (12 physical cores, 12 additional threads)
6 x 600GB 15K disks in RAID 10
Dell H700 RAID controller (512MB version)

So the hardware should handle the I/O that's being done by the VM with no problem. The dom0 has the default CPU and RAM allocation (768MB and 4 vcpus).

Any help greatly appreciated.

Also, here's a kernel message from a new VM as it went nuts ... (seems related)

===dmesg====
[6954775.046768] BUG: soft lockup - CPU#2 stuck for 61s!
[apache2:20139]
[6954775.046776] Modules linked in: xenfs lp parport
[6954775.046784] CPU 2
[6954775.046786] Modules linked in: xenfs lp parport
[6954775.046793]
[6954775.046796] Pid: 20139, comm: apache2 Tainted: G D 2.6.35-22-virtual #34~lucid1-Ubuntu /
[6954775.046802] RIP: e030:[<ffffffff812526a5>] [<ffffffff812526a5>] sys_semtimedop+0x625/0x690
[6954775.046811] RSP: e02b:ffff8800fb0fbcf8 EFLAGS: 00000246
[6954775.046815] RAX: 0000000000000001 RBX: 0000000000430000 RCX: ffff8800fb0fbfd8
[6954775.046820] RDX: 0000000000000000 RSI: ffff8800eeb744a0 RDI: 00000000ffffffff
[6954775.046825] RBP: ffff8800fb0fbf68 R08: 0000000000000000 R09: 0000000000000000
[6954775.046830] R10: 0000000000000000 R11: 0000000000000001 R12: 0000000000000001
[6954775.046835] R13: 0000000000000000 R14: 0000000000000001 R15: ffff8800fae5ee50
[6954775.046843] FS: 00007f3943fd2740(0000) GS:ffff880003e76000(0000) knlGS:0000000000000000
[6954775.046848] CS: e033 DS: 0000 ES: 0000 CR0: 000000008005003b
[6954775.046852] CR2: 00007f9da4c3b000 CR3: 00000000fa0b3000 CR4: 0000000000002660
[6954775.046857] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
[6954775.046863] DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7: 0000000000000400
[6954775.046868] Process apache2 (pid: 20139, threadinfo ffff8800fb0fa000, task ffff8800fa9496e0)
[6954775.046874] Stack:
[6954775.046876] ffff8800ffc39400 ffff8800fb0fbf28 ffff8800fa9496e0 ffffffff81a514a8
[6954775.046884] <0> ffff8800fa4ec060 0000000000000000 00000001810072d2 ffff8800fb0fbd48
[6954775.046893] <0> ffff8800fae5ee50 ffff8800fb0fbd48 ffff1000ffff0000 ffff8800fa9402e0
[6954775.046904] Call Trace:
[6954775.046909] [<ffffffff810072bf>] ? xen_restore_fl_direct_end+0x0/0x1
[6954775.046916] [<ffffffff8100611d>] ? xen_flush_tlb_single+0x9d/0xb0
[6954775.046921] [<ffffffff8100527f>] ? xen_set_pte_at+0x6f/0xf0
[6954775.046927] [<ffffffff81006b3d>] ? xen_force_evtchn_callback+0xd/0x10
[6954775.046932] [<ffffffff810072d2>] ? check_events+0x12/0x20
[6954775.046938] [<ffffffff81006b3d>] ? xen_force_evtchn_callback+0xd/0x10
[6954775.046943] [<ffffffff810072d2>] ? check_events+0x12/0x20
[6954775.046949] [<ffffffff810072bf>] ? xen_restore_fl_direct_end+0x0/0x1
[6954775.046954] [<ffffffff810041a1>] ? xen_clts+0x71/0x80
[6954775.046959] [<ffffffff8101407c>] ? restore_i387_xstate+0xcc/0x1c0
[6954775.046965] [<ffffffff81252720>] sys_semop+0x10/0x20
[6954775.046970] [<ffffffff8100a0f2>] system_call_fastpath+0x16/0x1b
[6954775.046974] Code: 57 48 45 85 f6 74 65 48 8b 4a 10 48 89 42 10 48 83 c2 08 48 89 95 60 ff ff ff 48 89 8d 68 ff ff ff 48 89 01 e9 29 fe ff ff f3 90 <e9> 63 fe ff ff 48 8b 95 60 ff ff ff 48 8b 85 68 ff ff ff 49 b8
[6954775.047036] Call Trace:
[6954775.047040] [<ffffffff810072bf>] ? xen_restore_fl_direct_end+0x0/0x1
[6954775.047045] [<ffffffff8100611d>] ? xen_flush_tlb_single+0x9d/0xb0
[6954775.047050] [<ffffffff8100527f>] ? xen_set_pte_at+0x6f/0xf0
[6954775.047055] [<ffffffff81006b3d>] ? xen_force_evtchn_callback+0xd/0x10
[6954775.047061] [<ffffffff810072d2>] ? check_events+0x12/0x20
[6954775.047066] [<ffffffff81006b3d>] ? xen_force_evtchn_callback+0xd/0x10
[6954775.047071] [<ffffffff810072d2>] ? check_events+0x12/0x20
[6954775.047077] [<ffffffff810072bf>] ? xen_restore_fl_direct_end+0x0/0x1
[6954775.047082] [<ffffffff810041a1>] ? xen_clts+0x71/0x80
[6954775.047087] [<ffffffff8101407c>] ? restore_i387_xstate+0xcc/0x1c0
[6954775.047092] [<ffffffff81252720>] sys_semop+0x10/0x20
[6954775.047097] [<ffffffff8100a0f2>] system_call_fastpath+0x16/0x1b
[6954777.197935] BUG: soft lockup - CPU#3 stuck for 61s! [apache2:20145]
[6954777.197949] Modules linked in: xenfs lp parport
[6954777.197959] CPU 3
[6954777.197961] Modules linked in: xenfs lp parport
[6954777.197969]
[6954777.197973] Pid: 20145, comm: apache2 Tainted: G D 2.6.35-22-virtual #34~lucid1-Ubuntu /
[6954777.197979] RIP: e030:[<ffffffff812526a5>] [<ffffffff812526a5>] sys_semtimedop+0x625/0x690
[6954777.197993] RSP: e02b:ffff880048ed3cf8 EFLAGS: 00000246
[6954777.197997] RAX: 0000000000000001 RBX: 0000000000430000 RCX: ffff880048ed3fd8
[6954777.198002] RDX: 0000000000000000 RSI: ffff8800032d16e0 RDI: 00000000ffffffff
[6954777.198007] RBP: ffff880048ed3f68 R08: 0000000000000000 R09: 0000000000000000
[6954777.198012] R10: 0000000000000000 R11: 0000000000000001 R12: 0000000000000001
[6954777.198017] R13: 0000000000000000 R14: 0000000000000001 R15: ffff8800fae5ee50
[6954777.198027] FS: 00007f3943fd2740(0000) GS:ffff880003e94000(0000) knlGS:0000000000000000
[6954777.198032] CS: e033 DS: 0000 ES: 0000 CR0: 000000008005003b
[6954777.198036] CR2: 00007f393dc39030 CR3: 00000000faf56000 CR4: 0000000000002660
[6954777.198042] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
[6954777.198047] DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7: 0000000000000400
[6954777.198052] Process apache2 (pid: 20145, threadinfo ffff880048ed2000, task ffff8800fb16c4a0)
[6954777.198057] Stack:
[6954777.198060] 0000000000000293 ffff880048ed3f28 ffff8800fb16c4a0 ffffffff81a514a8
[6954777.198068] <0> ffff8800fa4ec8a0 0000000000000000 0000000148ed3dd8 ffff880048ed3d48
[6954777.198077] <0> ffff8800fae5ee50 ffff880048ed3d48 ffff1000ffff0000 ffff8800fb2d4480
[6954777.198088] Call Trace:
[6954777.198097] [<ffffffff81006b3d>] ? xen_force_evtchn_callback+0xd/0x10
[6954777.198104] [<ffffffff810072d2>] ? check_events+0x12/0x20
[6954777.198109] [<ffffffff81006b3d>] ? xen_force_evtchn_callback+0xd/0x10
[6954777.198115] [<ffffffff810072d2>] ? check_events+0x12/0x20
[6954777.198123] [<ffffffff81036e88>] ? pvclock_clocksource_read+0x58/0xd0
[6954777.198129] [<ffffffff81007161>] ? xen_clocksource_read+0x21/0x30
[6954777.198137] [<ffffffff8108931a>] ? do_gettimeofday+0x1a/0x50
[6954777.198142] [<ffffffff81252720>] sys_semop+0x10/0x20
[6954777.198148] [<ffffffff8100a0f2>] system_call_fastpath+0x16/0x1b
[6954777.198152] Code: 57 48 45 85 f6 74 65 48 8b 4a 10 48 89 42 10 48 83 c2 08 48 89 95 60 ff ff ff 48 89 8d 68 ff ff ff 48 89 01 e9 29 fe ff ff f3 90 <e9> 63 fe ff ff 48 8b 95 60 ff ff ff 48 8b 85 68 ff ff ff 49 b8
[6954777.198218] Call Trace:
[6954777.198223] [<ffffffff81006b3d>] ? xen_force_evtchn_callback+0xd/0x10
[6954777.198229] [<ffffffff810072d2>] ? check_events+0x12/0x20
[6954777.198234] [<ffffffff81006b3d>] ? xen_force_evtchn_callback+0xd/0x10
[6954777.198239] [<ffffffff810072d2>] ? check_events+0x12/0x20
[6954777.198245] [<ffffffff81036e88>] ? pvclock_clocksource_read+0x58/0xd0
[6954777.198251] [<ffffffff81007161>] ? xen_clocksource_read+0x21/0x30
[6954777.198256] [<ffffffff8108931a>] ? do_gettimeofday+0x1a/0x50
[6954777.198261] [<ffffffff81252720>] sys_semop+0x10/0x20
[6954777.198267] [<ffffffff8100a0f2>] system_call_fastpath+0x16/0x1b

_______________________________________________
Xen-users mailing list
Xen-users@xxxxxxxxxxxxxxxxxxx
http://lists.xensource.com/xen-users
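
For anyone triaging a similar log, here is a minimal sketch (not part of the original post) that pulls the "BUG: soft lockup" headers out of saved dmesg text and tallies them per process name. It assumes the message format seen in the trace above, where a "[comm:pid]" tag follows "stuck for Ns!", possibly on the next line. The names summarize_soft_lockups and SAMPLE are hypothetical, chosen only for illustration.

```python
import re
from collections import Counter

# Assumed format, taken from the trace above:
#   BUG: soft lockup - CPU#2 stuck for 61s! [apache2:20139]
# \s* tolerates a line break between "61s!" and the [comm:pid] tag.
SOFT_LOCKUP = re.compile(
    r"BUG: soft lockup - CPU#(?P<cpu>\d+) stuck for (?P<secs>\d+)s!\s*"
    r"\[(?P<comm>[^\]]+?):(?P<pid>\d+)\]"
)


def summarize_soft_lockups(dmesg_text):
    """Return (events, per-process counts) for soft lockup reports in the text."""
    events = [
        {
            "cpu": int(m["cpu"]),
            "secs": int(m["secs"]),
            "comm": m["comm"],
            "pid": int(m["pid"]),
        }
        for m in SOFT_LOCKUP.finditer(dmesg_text)
    ]
    by_comm = Counter(e["comm"] for e in events)
    return events, by_comm


# Two header lines copied from the trace in this message.
SAMPLE = """\
[6954775.046768] BUG: soft lockup - CPU#2 stuck for 61s! [apache2:20139]
[6954777.197935] BUG: soft lockup - CPU#3 stuck for 61s! [apache2:20145]
"""

if __name__ == "__main__":
    events, by_comm = summarize_soft_lockups(SAMPLE)
    for e in events:
        print(f"CPU#{e['cpu']} stuck {e['secs']}s in {e['comm']} (pid {e['pid']})")
    for comm, count in by_comm.most_common():
        print(f"{comm}: {count} report(s)")
```

To run it against a live guest, the text could be fed in from a saved copy of the dmesg output instead of SAMPLE. A tally like this only shows which processes were on-CPU when the watchdog fired; it does not by itself identify the cause.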