
RE: [Xen-devel] BUG: soft lockup - CPU#1 stuck for 61s! [udevd:4865]


  • To: "Jeremy Fitzhardinge" <jeremy@xxxxxxxx>
  • From: "James Harper" <james.harper@xxxxxxxxxxxxxxxx>
  • Date: Mon, 6 Sep 2010 09:44:23 +1000
  • Cc: xen-devel@xxxxxxxxxxxxxxxxxxx
  • Delivery-date: Sun, 05 Sep 2010 16:45:30 -0700
  • List-id: Xen developer discussion <xen-devel.lists.xensource.com>
  • Thread-index: ActK/q882zvhES+FSZ6IH+PETbgLzAAAAuywAJSVS+A=
  • Thread-topic: [Xen-devel] BUG: soft lockup - CPU#1 stuck for 61s! [udevd:4865]

This is still happening with 2.6.32.18-ge6b9b2c, but I'm now not so sure
it's a problem with blkback after all. My script does this:

# log the snapshot source and destination so I can see how far it got
echo source=$source >>/tmp/attach-snapshot.log
echo dest=$dest >>/tmp/attach-snapshot.log
# create a 1GB copy-on-write snapshot of the source LV
lvcreate --snapshot --name=$dest --size=1G $source
# record the LV list; if this is missing, lvcreate never returned
lvscan >>/tmp/attach-snapshot.log
# attach the snapshot to the DomU as a writable block device
xm block-attach $domname phy:$dest $destdev w

/tmp/attach-snapshot.log shows "source=" and "dest=" but no output from
"lvscan", which indicates that it's the "lvcreate" that's hung, not the
"xm block-attach". Disk writes are still working, so I know it's not
just a missed buffer or anything like that. That would also explain why
udevd is hanging - lvcreate creates a new device node while "xm
block-attach" doesn't. I think I'm barking up the wrong tree on
xen-devel, unless anyone else has suggestions?
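
In case anyone wants to narrow this down further, here's a minimal
sketch of how I'd instrument the script to confirm which step blocks
(the timestamp lines are hypothetical additions, not part of my actual
script; it assumes $source, $dest, $domname and $destdev are set as
above):

log=/tmp/attach-snapshot.log
echo "$(date '+%T') start: source=$source dest=$dest" >>$log
# if the next timestamp never appears, lvcreate is the command that hung
lvcreate --snapshot --name=$dest --size=1G $source >>$log 2>&1
echo "$(date '+%T') lvcreate exited with $?" >>$log
lvscan >>$log 2>&1
echo "$(date '+%T') lvscan exited with $?" >>$log
xm block-attach $domname phy:$dest $destdev w
echo "$(date '+%T') block-attach exited with $?" >>$log

Running "udevadm monitor" in another terminal over the backup window
should also show which device events udevd is chewing on when it wedges.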

Thanks

James

> -----Original Message-----
> From: xen-devel-bounces@xxxxxxxxxxxxxxxxxxx [mailto:xen-devel-
> bounces@xxxxxxxxxxxxxxxxxxx] On Behalf Of James Harper
> Sent: Friday, 3 September 2010 10:36
> To: Jeremy Fitzhardinge
> Cc: xen-devel@xxxxxxxxxxxxxxxxxxx
> Subject: RE: [Xen-devel] BUG: soft lockup - CPU#1 stuck for 61s!
> [udevd:4865]
> 
> >
> >  On 09/02/2010 05:20 PM, James Harper wrote:
> > > I've had "BUG: soft lockup - CPU#1 stuck for 61s! [udevd:4865]"
> happen
> > > twice now and the server fairly quickly (over the course of 5-10
> > > minutes) becomes unusable after that happens. The call trace
> > > appears to be:
> > >
> > > Call Trace:
> > >  [<ffffffff8100e8e9>] ? xen_force_evtchn_callback+0x9/0xa
> > >  [<ffffffff8100ef72>] ? check_events+0x12/0x20
> > >  [<ffffffff8100ef19>] ? xen_irq_enable_direct_end+0x0/0x7
> > >  [<ffffffff8102fcc1>] ? do_page_fault+0x9e/0x27b
> > >  [<ffffffff812b39a5>] ? page_fault+0x25/0x30
> > >  [<ffffffff8116f28d>] ? __put_user_4+0x1d/0x30
> > >  [<ffffffff81045265>] ? schedule_tail+0x92/0x96
> > >  [<ffffffff81011983>] ? ret_from_fork+0x13/0x80
> > >
> > > That was under 2.6.32.17-g69a73fa. I've since upgraded to
> > > 2.6.32.18-ge6b9b2c but the changelogs don't show anything obviously
> > > related to the crash (but as usual there are a lot of them so maybe
> > > I've missed something?)
> > >
> > > Is this a known problem under 2.6.32.17-g69a73fa that has since been
> > > fixed?
> >
> > Hard to know; nothing springs to mind right now.  What else is going
> > on
> > at the time?  What's the full softlockup message?  Is there any more
> > context?
> 
> I just noticed that it coincided pretty much exactly with a backup run.
> I take a snapshot in Dom0 then block-attach it to the linux DomU so it
> can be backed up from the DomU, which makes restores easier, rather
> than doing the backup in Dom0. I hadn't noticed before because it's the
> Windows DomU that people complain about first, not the linux DomU.
> Block-attach/detach would explain why udevd is involved too, which
> makes more sense. The block-attach happens every hour on the hour. The
> full cut&paste of the first hang is:
> 
> Sep  3 09:00:42 keechsvr2 kernel: [235893.079566] blkback: ring-ref
> 1442, event-channel 13, protocol 1 (x86_64-abi)
> Sep  3 09:02:13 keechsvr2 kernel: [235983.760505] BUG: soft lockup -
> CPU#1 stuck for 61s! [udevd:4865]
> Sep  3 09:02:13 keechsvr2 kernel: [235983.760505] Modules linked in:
> ipt_MASQUERADE xt_physdev pppoe pppox ipmi_devintf autofs4 tun
> act_police sch_ingress cls_u32 sch_sfq sch_cbq ppp_generic slhc ipt_ULOG
> xt_limit ipt_REJECT xt_state xt_dscp iptable_filter ipt_REDIRECT
> xt_conntrack iptable_nat xt_TCPMSS xt_tcpmss xt_TCPOPTSTRIP xt_tcpudp
> iptable_mangle ip_tables x_tables ipv6 8021q garp bridge stp llc
> nf_nat_ftp nf_conntrack_ftp ip_gre nf_nat_pptp nf_conntrack_pptp
> nf_conntrack_proto_gre nf_nat_proto_gre nf_nat nf_conntrack_ipv4
> nf_conntrack nf_defrag_ipv4 xen_gntdev xen_evtchn xen_netback
> xen_blkback
> blkback_pagemap loop snd_pcm_oss snd_mixer_oss snd_pcm snd_timer snd
> tpm_tis tpm sd_mod soundcore ipmi_si tpm_bios hpilo snd_page_alloc
> ipmi_msghandler pcspkr psmouse serio_raw rng_core i5000_edac edac_core
> i5k_amb button evdev xfs exportfs dm_mirror dm_region_hash dm_log
> dm_snapshot dm_mod ide_cd_mod cdrom ata_piix ata_generic libata
> usb_storage scsi_mod usbhid hid piix ide_pci_generic bnx2 ehci_hcd
> ide_core uhci_hc
> Sep  3 09:02:13 keechsvr2 kernel:  cciss thermal fan [last unloaded:
> scsi_wait_scan]
> Sep  3 09:02:13 keechsvr2 kernel: [235983.760505] CPU 1:
> Sep  3 09:02:13 keechsvr2 kernel: [235983.760505] Modules linked in:
> ipt_MASQUERADE xt_physdev pppoe pppox ipmi_devintf autofs4 tun
> act_police sch_ingress cls_u32 sch_sfq sch_cbq ppp_generic slhc ipt_ULOG
> xt_limit ipt_REJECT xt_state xt_dscp iptable_filter ipt_REDIRECT
> xt_conntrack iptable_nat xt_TCPMSS xt_tcpmss xt_TCPOPTSTRIP xt_tcpudp
> iptable_mangle ip_tables x_tables ipv6 8021q garp bridge stp llc
> nf_nat_ftp nf_conntrack_ftp ip_gre nf_nat_pptp nf_conntrack_pptp
> nf_conntrack_proto_gre nf_nat_proto_gre nf_nat nf_conntrack_ipv4
> nf_conntrack nf_defrag_ipv4 xen_gntdev xen_evtchn xen_netback
> xen_blkback
> blkback_pagemap loop snd_pcm_oss snd_mixer_oss snd_pcm snd_timer snd
> tpm_tis tpm sd_mod soundcore ipmi_si tpm_bios hpilo snd_page_alloc
> ipmi_msghandler pcspkr psmouse serio_raw rng_core i5000_edac edac_core
> i5k_amb button evdev xfs exportfs dm_mirror dm_region_hash dm_log
> dm_snapshot dm_mod ide_cd_mod cdrom ata_piix ata_generic libata
> usb_storage scsi_mod usbhid hid piix ide_pci_generic bnx2 ehci_hcd
> ide_core uhci_hc
> Sep  3 09:02:13 keechsvr2 kernel:  cciss thermal fan [last unloaded:
> scsi_wait_scan]
> Sep  3 09:02:13 keechsvr2 kernel: [235983.760505] Pid: 4865, comm: udevd
> Not tainted 2.6.32.17-g69a73fa #1 ProLiant ML370 G5
> Sep  3 09:02:13 keechsvr2 kernel: [235983.760505] RIP:
> e030:[<ffffffff8100922a>]  [<ffffffff8100922a>]
> hypercall_page+0x22a/0x1001
> Sep  3 09:02:13 keechsvr2 kernel: [235983.760505] RSP:
> e02b:ffff8800114d5dd0  EFLAGS: 00000246
> Sep  3 09:02:13 keechsvr2 kernel: [235983.760505] RAX: 0000000000040000
> RBX: 0000000000000001 RCX: ffffffff8100922a
> keechsvr2:/usr/local/src/linux-2.6-xen# head -100 /var/log/kern.log
> Sep  3 07:00:44 keechsvr2 kernel: [228694.767691] blkback: ring-ref 311,
> event-channel 13, protocol 1 (x86_64-abi)
> Sep  3 08:00:42 keechsvr2 kernel: [232293.072248] blkback: ring-ref
> 1334, event-channel 13, protocol 1 (x86_64-abi)
> Sep  3 09:00:42 keechsvr2 kernel: [235893.079566] blkback: ring-ref
> 1442, event-channel 13, protocol 1 (x86_64-abi)
> Sep  3 09:02:13 keechsvr2 kernel: [235983.760505] BUG: soft lockup -
> CPU#1 stuck for 61s! [udevd:4865]
> Sep  3 09:02:13 keechsvr2 kernel: [235983.760505] Modules linked in:
> ipt_MASQUERADE xt_physdev pppoe pppox ipmi_devintf autofs4 tun
> act_police sch_ingress cls_u32 sch_sfq sch_cbq ppp_generic slhc ipt_ULOG
> xt_limit ipt_REJECT xt_state xt_dscp iptable_filter ipt_REDIRECT
> xt_conntrack iptable_nat xt_TCPMSS xt_tcpmss xt_TCPOPTSTRIP xt_tcpudp
> iptable_mangle ip_tables x_tables ipv6 8021q garp bridge stp llc
> nf_nat_ftp nf_conntrack_ftp ip_gre nf_nat_pptp nf_conntrack_pptp
> nf_conntrack_proto_gre nf_nat_proto_gre nf_nat nf_conntrack_ipv4
> nf_conntrack nf_defrag_ipv4 xen_gntdev xen_evtchn xen_netback
> xen_blkback blkback_pagemap loop snd_pcm_oss snd_mixer_oss snd_pcm
> snd_timer snd tpm_tis tpm sd_mod soundcore ipmi_si tpm_bios hpilo
> snd_page_alloc ipmi_msghandler pcspkr psmouse serio_raw rng_core
> i5000_edac edac_core i5k_amb button evdev xfs exportfs dm_mirror
> dm_region_hash dm_log dm_snapshot dm_mod ide_cd_mod cdrom ata_piix
> ata_generic libata usb_storage scsi_mod usbhid hid piix ide_pci_generic
> bnx2 ehci_hcd ide_core uhci_hc
> Sep  3 09:02:13 keechsvr2 kernel:  cciss thermal fan [last unloaded:
> scsi_wait_scan]
> Sep  3 09:02:13 keechsvr2 kernel: [235983.760505] CPU 1:
> Sep  3 09:02:13 keechsvr2 kernel: [235983.760505] Modules linked in:
> ipt_MASQUERADE xt_physdev pppoe pppox ipmi_devintf autofs4 tun
> act_police sch_ingress cls_u32 sch_sfq sch_cbq ppp_generic slhc ipt_ULOG
> xt_limit ipt_REJECT xt_state xt_dscp iptable_filter ipt_REDIRECT
> xt_conntrack iptable_nat xt_TCPMSS xt_tcpmss xt_TCPOPTSTRIP xt_tcpudp
> iptable_mangle ip_tables x_tables ipv6 8021q garp bridge stp llc
> nf_nat_ftp nf_conntrack_ftp ip_gre nf_nat_pptp nf_conntrack_pptp
> nf_conntrack_proto_gre nf_nat_proto_gre nf_nat nf_conntrack_ipv4
> nf_conntrack nf_defrag_ipv4 xen_gntdev xen_evtchn xen_netback
> xen_blkback blkback_pagemap loop snd_pcm_oss snd_mixer_oss snd_pcm
> snd_timer snd tpm_tis tpm sd_mod soundcore ipmi_si tpm_bios hpilo
> snd_page_alloc ipmi_msghandler pcspkr psmouse serio_raw rng_core
> i5000_edac edac_core i5k_amb button evdev xfs exportfs dm_mirror
> dm_region_hash dm_log dm_snapshot dm_mod ide_cd_mod cdrom ata_piix
> ata_generic libata usb_storage scsi_mod usbhid hid piix ide_pci_generic
> bnx2 ehci_hcd ide_core uhci_hc
> Sep  3 09:02:13 keechsvr2 kernel:  cciss thermal fan [last unloaded:
> scsi_wait_scan]
> Sep  3 09:02:13 keechsvr2 kernel: [235983.760505] Pid: 4865, comm: udevd
> Not tainted 2.6.32.17-g69a73fa #1 ProLiant ML370 G5
> Sep  3 09:02:13 keechsvr2 kernel: [235983.760505] RIP:
> e030:[<ffffffff8100922a>]  [<ffffffff8100922a>]
> hypercall_page+0x22a/0x1001
> Sep  3 09:02:13 keechsvr2 kernel: [235983.760505] RSP:
> e02b:ffff8800114d5dd0  EFLAGS: 00000246
> Sep  3 09:02:13 keechsvr2 kernel: [235983.760505] RAX: 0000000000040000
> RBX: 0000000000000001 RCX: ffffffff8100922a
> Sep  3 09:02:13 keechsvr2 kernel: [235983.760505] RDX: ffffffff814293d0
> RSI: 0000000000000000 RDI: 0000000000000000
> Sep  3 09:02:13 keechsvr2 kernel: [235983.760505] RBP: 00007fbc628a9800
> R08: 0000000000000000 R09: 0000000000000000
> Sep  3 09:02:13 keechsvr2 kernel: [235983.760505] R10: 0000000000000002
> R11: 0000000000000246 R12: 0000000000000002
> Sep  3 09:02:13 keechsvr2 kernel: [235983.760505] R13: ffff8800114d5e88
> R14: ffff880017f9cfc0 R15: ffff880017eed500
> Sep  3 09:02:13 keechsvr2 kernel: [235983.760505] FS:
> 00007fbc628a9770(0000) GS:ffff880002de8000(0000) knlGS:0000000000000000
> Sep  3 09:02:13 keechsvr2 kernel: [235983.760505] CS:  e033 DS: 0000
> ES: 0000 CR0: 000000008005003b
> Sep  3 09:02:13 keechsvr2 kernel: [235983.760505] CR2: 00007fbc628a9800
> CR3: 000000001e2d7000 CR4: 0000000000002660
> Sep  3 09:02:13 keechsvr2 kernel: [235983.760505] DR0: 0000000000000000
> DR1: 0000000000000000 DR2: 0000000000000000
> Sep  3 09:02:13 keechsvr2 kernel: [235983.760505] DR3: 0000000000000000
> DR6: 00000000ffff0ff0 DR7: 0000000000000400
> Sep  3 09:02:13 keechsvr2 kernel: [235983.760505] Call Trace:
> Sep  3 09:02:13 keechsvr2 kernel: [235983.760505]  [<ffffffff8100e8e9>]
> ? xen_force_evtchn_callback+0x9/0xa
> Sep  3 09:02:13 keechsvr2 kernel: [235983.760505]  [<ffffffff8100ef72>]
> ? check_events+0x12/0x20
> Sep  3 09:02:13 keechsvr2 kernel: [235983.760505]  [<ffffffff8100ef19>]
> ? xen_irq_enable_direct_end+0x0/0x7
> Sep  3 09:02:13 keechsvr2 kernel: [235983.760505]  [<ffffffff8102fcc1>]
> ? do_page_fault+0x9e/0x27b
> Sep  3 09:02:13 keechsvr2 kernel: [235983.760505]  [<ffffffff812b39a5>]
> ? page_fault+0x25/0x30
> Sep  3 09:02:13 keechsvr2 kernel: [235983.760505]  [<ffffffff8116f28d>]
> ? __put_user_4+0x1d/0x30
> Sep  3 09:02:13 keechsvr2 kernel: [235983.760505]  [<ffffffff81045265>]
> ? schedule_tail+0x92/0x96
> Sep  3 09:02:13 keechsvr2 kernel: [235983.760505]  [<ffffffff81011983>]
> ? ret_from_fork+0x13/0x80
> 
> There's another hang at 09:03:09 which I can post if you'd find it
> useful.
> 
> If the block-attach happened at 09:00:42 and the crash happened at
> 09:02:13 then I assume that whatever udevd started doing started at
> 09:01:12 (i.e. 61 seconds earlier), which means it's more likely the
> block-detach at the end of the backup causing the hang, not the
> block-attach. The backup runs pretty quickly.
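> 
> (As a sanity check on that arithmetic, a hypothetical GNU date
> one-liner:
> 
>     date -d '09:02:13 61 seconds ago' +%T    # prints 09:01:12
> 
> which matches the 61s the soft lockup detector reported.)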
> 
> I'm inclined to turn off the snapshot backup if this isn't a known and
> resolved problem...
> 
> Thanks
> 
> James
> 
> 
> _______________________________________________
> Xen-devel mailing list
> Xen-devel@xxxxxxxxxxxxxxxxxxx
> http://lists.xensource.com/xen-devel

_______________________________________________
Xen-devel mailing list
Xen-devel@xxxxxxxxxxxxxxxxxxx
http://lists.xensource.com/xen-devel


 

