
Re: [Xen-devel] Crashing / unable to start domUs due to high number of luns?



On Tue, Jan 31, 2012 at 01:42:23PM -0800, Nathan March wrote:
> Hi All,
> 
> We've got a Xen setup based around a Dell iSCSI device, with each Xen 
> host having 2 LUNs, and we run multipath on top of that. After adding 
> a couple of new virtual disks the other day, a couple of our stable, 
> online VMs suddenly hard locked. Attaching to the console gave me 
> nothing; it looked like they had lost their disk devices.
> 
> Attempting to restart them on the same dom0 failed with hotplug errors, 
> as did attempting to start them on a few different dom0s. After doing a 
> "multipath -F" to remove unused devices and manually bringing in just 
> the selected LUNs via "multipath diskname", I was able to successfully 
> start them. This initially made me think perhaps I was hitting some sort 
> of udev / multipath / iSCSI device LUN limit (136 LUNs, 8 paths per LUN 
> = 1088 iSCSI connections). Just to be clear, the problem occurred on 
> multiple dom0s at the same time, so it definitely seems iSCSI related.
> 
> Now, a day later, I'm debugging this further and I'm again unable to 
> start VMs, even with all extra multipath devices removed. I rebooted 
> one of the dom0s and was able to successfully migrate our production 
> VMs off a broken server, so I've now got an empty dom0 that's unable to 
> start any VMs.
> 
> Starting a VM results in the following in xend.log:
> 
> [2012-01-31 13:06:16 12353] DEBUG (DevController:144) Waiting for 0.
> [2012-01-31 13:06:16 12353] DEBUG (DevController:628) hotplugStatusCallback /local/domain/0/backend/vif/35/0/hotplug-status.
> [2012-01-31 13:07:56 12353] ERROR (SrvBase:88) Request wait_for_devices failed.
> Traceback (most recent call last):
>   File "/usr/lib64/python2.6/site-packages/xen/web/SrvBase.py", line 85, in perform
>     return op_method(op, req)
>   File "/usr/lib64/python2.6/site-packages/xen/xend/server/SrvDomain.py", line 85, in op_wait_for_devices
>     return self.dom.waitForDevices()
>   File "/usr/lib64/python2.6/site-packages/xen/xend/XendDomainInfo.py", line 1237, in waitForDevices
>     self.getDeviceController(devclass).waitForDevices()
>   File "/usr/lib64/python2.6/site-packages/xen/xend/server/DevController.py", line 140, in waitForDevices
>     return map(self.waitForDevice, self.deviceIDs())
>   File "/usr/lib64/python2.6/site-packages/xen/xend/server/DevController.py", line 155, in waitForDevice
>     (devid, self.deviceClass))
> VmError: Device 0 (vif) could not be connected. Hotplug scripts not working.


Was there anything in the kernel log (dmesg) about vifs? What does your 
/proc/interrupts look like? Can you provide the dmesg output that you get
during startup? I am mainly looking for:

NR_IRQS:16640 nr_irqs:1536 16
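
Something like this should pull those numbers off a running box (just a
sketch, nothing assumed beyond a standard dom0 kernel):

  dmesg | grep -i nr_irqs   # the NR_IRQS / nr_irqs line above
  wc -l /proc/interrupts    # rough count of IRQ lines currently in use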

How many guests are you running when this happens?

One theory is that you are running out of dom0 interrupts. Though
I *think* that was made dynamic in 3.0..
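
If you want to eyeball that yourself: on a pvops dom0 (an assumption
about your kernel config), event channels show up as "xen-dyn-event"
lines in /proc/interrupts, so a rough count is:

  grep -c xen-dyn-event /proc/interrupts   # dynamic event channels in use
  grep -c xen-percpu /proc/interrupts      # per-cpu ones, for comparison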


Though that would explain your iSCSI network going wonky in the guest -
was there anything in the dmesg when the guest started going bad?

> [2012-01-31 13:07:56 12353] DEBUG (XendDomainInfo:3071) XendDomainInfo.destroy: domid=35
> [2012-01-31 13:07:58 12353] DEBUG (XendDomainInfo:2401) Destroying device model
> 
> I tried turning up udev's log level but that didn't reveal anything. 
> Reading the xenstore for the vif doesn't show anything unusual either:
> 
> ukxen1 ~ # xenstore-ls /local/domain/0/backend/vif/35
> 0 = ""
>  bridge = "vlan91"
>  domain = "nathanxenuk1"
>  handle = "0"
>  uuid = "2128d0b7-d50f-c2ad-4243-8a42bb598b81"
>  script = "/etc/xen/scripts/vif-bridge"
>  state = "1"
>  frontend = "/local/domain/35/device/vif/0"
>  mac = "00:16:3d:03:00:44"
>  online = "1"
>  frontend-id = "35"
> 
> The bridge device (vlan91) exists, and trying a different bridge makes 
> no difference. Removing the VIF completely results in the same error 
> for the VBD. Adding debugging to the hotplug/network scripts didn't 
> reveal anything; it looks like they aren't being executed at all. 
> Nothing is logged to xen-hotplug.log.

OK, so that would imply the kernel hasn't been able to do the right
thing. Hmm.

What do you see when this happens with udevadm monitor --kernel --udev
--property ?
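
That is, leave the monitor running in one terminal and start the guest
from another - a sketch (assuming the guest's config file is named after
the domain in your log):

  udevadm monitor --kernel --udev --property > /tmp/udev-trace.log &
  xm create nathanxenuk1
  # a working vif hotplug should show KERNEL and UDEV "add" events for
  # the vif device; silence here points at the kernel side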

> 
> The only thing I can think of that this may be related to is that Gentoo 
> defaulted to a 10 MB /dev, which we filled up a few months back. We upped 
> the size to 50 MB in the mount options and everything has been completely 
> stable since (~33 days). None of the /dev filesystems on the dom0s is 
> higher than 25% usage. Aside from adding the new LUNs, no changes have 
> been made in the past month.
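
For reference, growing a tmpfs-backed /dev on the fly is just a remount -
a sketch, assuming /dev is tmpfs as on the Gentoo boxes described above:

  mount -o remount,size=50m /dev   # resize the tmpfs; contents survive
  df -h /dev                       # confirm the new size and usage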
> 
> To test whether removing some devices would solve anything, I tried 
> doing an "iscsiadm -m node --logout", and it promptly hard locked the 
> entire box. After a reboot, I was unable to reproduce the problem on 
> that particular dom0.
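
Side note: next time it may be safer to drop one session at a time
instead of logging everything out at once - a sketch with placeholder
target/portal, since I don't know the real ones:

  iscsiadm -m session -P 1                                # list active sessions
  iscsiadm -m node -T <target-iqn> -p <portal> --logout   # drop just one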
> 
> I've still got one dom0 that's exhibiting the problem - can anyone 
> suggest any further debugging steps?
> 
> - Nathan
> 
> 
> (XEN) Xen version 4.1.1 (root@) (gcc version 4.3.4 (Gentoo 4.3.4 p1.1, pie-10.1.5) ) Mon Aug 29 16:24:12 PDT 2011
> 
> ukxen1 xen # xm info
> host                   : ukxen1
> release                : 3.0.3
> version                : #4 SMP Thu Dec 22 12:44:22 PST 2011
> machine                : x86_64
> nr_cpus                : 24
> nr_nodes               : 2
> cores_per_socket       : 6
> threads_per_core       : 2
> cpu_mhz                : 2261
> hw_caps                : bfebfbff:2c100800:00000000:00003f40:029ee3ff:00000000:00000001:00000000
> virt_caps              : hvm hvm_directio
> total_memory           : 98291
> free_memory            : 91908
> free_cpus              : 0
> xen_major              : 4
> xen_minor              : 1
> xen_extra              : .1
> xen_caps               : xen-3.0-x86_64 xen-3.0-x86_32p hvm-3.0-x86_32 hvm-3.0-x86_32p hvm-3.0-x86_64
> xen_scheduler          : credit
> xen_pagesize           : 4096
> platform_params        : virt_start=0xffff800000000000
> xen_changeset          : unavailable
> xen_commandline        : console=vga dom0_mem=1024M dom0_max_vcpus=1 dom0_vcpus_pin=true
> cc_compiler            : gcc version 4.3.4 (Gentoo 4.3.4 p1.1, pie-10.1.5)
> cc_compile_by          : root
> cc_compile_domain      :
> cc_compile_date        : Mon Aug 29 16:24:12 PDT 2011
> xend_config_format     : 4

_______________________________________________
Xen-devel mailing list
Xen-devel@xxxxxxxxxxxxxxxxxxx
http://lists.xensource.com/xen-devel


 

