Re: [Xen-users] Severe megasas_raid issues when using Xen dom0 linux kernels
Have you tried to use the MegaRAID monitor to see if you can diagnose some hardware problem with the RAID? There is one you can download and run on the Linux dom0, and there should be a monitor you can get to from the BIOS as well. Those error messages look very much like an actual hardware fault on the RAID array.

I have a lot of megasas RAID controllers under both SL5 and SL6 and have used them as Xen dom0 and KVM VM hosts without problems, with several different versions of Xen.

Steve Timm

On Tue, 18 Oct 2011, David Della Vecchia wrote:

I've tried Debian stable and testing, and CentOS 5 and 6, with Xen 3.1-4.1 (about 5 different versions in between). I'm currently running the Xen 4.1.1 release on CentOS 6 with M.A.Young's centos6 xen dom0 kernel. For some reason the RAID array freaks out and swaps to read-only mode for the entire virtual device the hardware RAID array provides. I've tried both RAID 0 and RAID 1 (2 1TB SCSI drives). I've had this issue in every Xen install I've tried on this box, no matter what kernel version (I tried as new as 3.0.1 in Debian wheezy) or Xen version (I compiled and installed the unstable branch to test) I use.

The server was running stable and fine for about a week this time before this:

[root@gibson ~]# df -h
-bash: /bin/df: Input/output error
[root@gibson ~]# w
-bash: /usr/bin/w: Input/output error
[root@gibson ~]# modinfo megasas_raid
-bash: /sbin/modinfo: Input/output error

Part of /var/log/messages:

Oct 17 13:21:09 gibson kernel: megasas: [ 0]waiting for 1 commands to complete
Oct 17 13:21:10 gibson kernel: megaraid_sas: no pending cmds after reset
Oct 17 13:21:10 gibson kernel: megasas: reset successful
Oct 17 13:21:20 gibson kernel: sd 0:2:0:0: [sda] megasas: RESET -85512 cmd=0 retries=0
Oct 17 13:21:20 gibson kernel: megasas: [ 0]waiting for 1 commands to complete
Oct 17 13:21:21 gibson kernel: megaraid_sas: no pending cmds after reset
Oct 17 13:21:21 gibson kernel: megasas: reset successful
Oct 17 13:21:21 gibson kernel: sd 0:2:0:0: [sda] megasas: RESET -85512 cmd=2a retries=0
Oct 17 13:21:21 gibson kernel: megaraid_sas: no pending cmds after reset
Oct 17 13:21:21 gibson kernel: megasas: reset successful
Oct 17 13:21:41 gibson kernel: sd 0:2:0:0: [sda] megasas: RESET -85512 cmd=0 retries=0
Oct 17 13:21:41 gibson kernel: megasas: [ 0]waiting for 1 commands to complete
Oct 17 13:21:42 gibson kernel: megaraid_sas: no pending cmds after reset
Oct 17 13:21:42 gibson kernel: megasas: reset successful
Oct 17 13:21:42 gibson kernel: sd 0:2:0:0: [sda] megasas: RESET -85512 cmd=2a retries=0
Oct 17 13:21:42 gibson kernel: megaraid_sas: no pending cmds after reset
Oct 17 13:21:42 gibson kernel: megasas: reset successful
Oct 17 13:22:02 gibson kernel: sd 0:2:0:0: [sda] megasas: RESET -85512 cmd=0 retries=0
Oct 17 13:22:02 gibson kernel: megasas: [ 0]waiting for 1 commands to complete

[root@gibson ~]# ls -al /bin/
ls: cannot access /bin/ntfs-3g.secaudit: Input/output error
ls: cannot access /bin/ntfstruncate: Input/output error
ls: cannot access /bin/ntfsdump_logfile: Input/output error
ls: cannot access /bin/ntfsls: Input/output error
ls: cannot access /bin/ntfsdecrypt: Input/output error
ls: cannot access /bin/ntfs-3g.usermap: Input/output error
ls: cannot access /bin/ntfsmount: Input/output error
ls: cannot access /bin/ntfsfix: Input/output error
ls: cannot access /bin/ntfscluster: Input/output error
total 8192
dr-xr-xr-x. 2 root root 4096 Oct 15 14:49 .
drwxr-xr-x. 29 root root 4096 Oct 17 12:34 ..
-rwxr-xr-x. 1 root root 123 Nov 10 2010 alsaunmute
-rwxr-xr-x 1 root root 27808 May 30 10:55 arch
lrwxrwxrwx. 1 root root 4 Oct 13 10:36 awk -> gawk
-rwxr-xr-x 1 root root 26264 May 30 10:55 basename
-rwxr-xr-x 1 root root 943248 May 30 11:46 bash
-rwxr-xr-x 1 root root 51344 May 30 10:55 cat
-rwxr-xr-x 1 root root 12200 Jun 25 05:02 cgclassify
-rwxr-xr-x 1 root root 12352 Jun 25 05:02 cgcreate
-rwxr-xr-x 1 root root 11528 Jun 25 05:02 cgdelete
-rwsr-xr-x 1 root root 12136 Jun 25 05:02 cgexec
-rwxr-xr-x 1 root root 15760 Jun 25 05:02 cgget
-rwxr-xr-x 1 root root 13160 Jun 25 05:02 cgset
-rwxr-xr-x 1 root root 55472 May 30 10:55 chgrp
-rwxr-xr-x 1 root root 52472 May 30 10:55 chmod
-rwxr-xr-x 1 root root 57496 May 30 10:55 chown
-rwxr-xr-x 1 root root 122344 May 30 10:55 cp
-rwxr-xr-x 1 root root 136096 Nov 10 2010 cpio
lrwxrwxrwx. 1 root root 4 Oct 13 11:00 csh -> tcsh
-rwxr-xr-x 1 root root 45472 May 30 10:55 cut
-rwxr-xr-x 1 root root 109896 Aug 18 2010 dash
-rwxr-xr-x 1 root root 59552 May 30 10:55 date
-rwxr-xr-x 1 root root 12552 Jun 25 06:47 dbus-cleanup-sockets
-rwxr-xr-x. 1 root root 339048 Jun 25 06:47 dbus-daemon
-rwxr-xr-x 1 root root 18464 Jun 25 06:47 dbus-monitor
-rwxr-xr-x 1 root root 22376 Jun 25 06:47 dbus-send
-rwxr-xr-x 1 root root 10912 Jun 25 06:47 dbus-uuidgen
-rwxr-xr-x 1 root root 54040 May 30 10:55 dd
-rwxr-xr-x 1 root root 70256 May 30 10:55 df
-rwxr-xr-x 1 root root 9896 Jun 25 02:46 dmesg
lrwxrwxrwx. 1 root root 8 Oct 13 10:36 dnsdomainname -> hostname
lrwxrwxrwx. 1 root root 8 Oct 13 10:36 domainname -> hostname
-rwxr-xr-x 1 root root 81120 Nov 11 2010 dumpkeys
-rwxr-xr-x 1 root root 27648 May 30 10:55 echo
-rwxr-xr-x 2 root root 53352 Nov 11 2010 ed
-rwxr-xr-x 1 root root 106528 Aug 25 2010 egrep
-rwxr-xr-x 1 root root 26368 May 30 10:55 env
lrwxrwxrwx. 1 root root 2 Oct 13 10:59 ex -> vi
-rwxr-xr-x 1 root root 24592 May 30 10:55 false
-rwxr-xr-x 1 root root 71328 Aug 25 2010 fgrep
-rwxr-xr-x 1 root root 238640 Nov 11 2010 find
-rwxr-xr-x 1 root root 382456 Nov 11 2010 gawk
-rwxr-xr-x 1 root root 33416 Nov 11 2010 gettext
-rwxr-xr-x 1 root root 110160 Aug 25 2010 grep
lrwxrwxrwx. 1 root root 3 Oct 13 10:36 gtar -> tar
-rwxr-xr-x. 1 root root 61 Nov 11 2010 gunzip
-rwxr-xr-x 1 root root 68544 Nov 11 2010 gzip
-rwxr-xr-x 1 root root 16192 Aug 24 2010 hostname
-rwxr-xr-x 1 root root 14872 Jun 25 00:09 ipcalc
lrwxrwxrwx. 1 root root 20 Oct 13 10:36 iptables-xml -> /sbin/iptables-multi
-rwxr-xr-x 1 root root 11248 Nov 11 2010 kbd_mode
-rwxr-xr-x 1 root root 24648 Aug 22 2010 keyctl
-rwxr-xr-x 1 root root 15128 Jun 25 02:46 kill
-rwxr-xr-x 1 root root 26256 May 30 10:55 link
-rwxr-xr-x 1 root root 49568 May 30 10:55 ln
-rwxr-xr-x 1 root root 112136 Nov 11 2010 loadkeys
-rwxr-xr-x 1 root root 30992 Jun 25 02:46 login
-rwxr-xr-x 1 root root 58368 Sep 12 13:32 lowntfs-3g
-rwxr-xr-x 1 root root 111744 May 30 10:55 ls
-rwxr-xr-x 1 root root 14008 Jun 25 05:02 lscgroup
-rwxr-xr-x 1 root root 12488 Jun 25 05:02 lssubsys
lrwxrwxrwx. 1 root root 5 Oct 13 10:37 mail -> mailx
-rwxr-xr-x 1 root root 390360 Aug 22 2010 mailx
-rwxr-xr-x 1 root root 48544 May 30 10:55 mkdir
-rwxr-xr-x 1 root root 32352 May 30 10:55 mknod
-rwxr-xr-x 1 root root 37352 May 30 10:55 mktemp
-rwxr-xr-x 1 root root 41144 Jun 25 02:46 more
-rwsr-xr-x. 1 root root 74712 Jun 25 02:46 mount
-rwxr-xr-x 1 root root 9800 Aug 24 2010 mountpoint
-rwxr-xr-x 1 root root 111536 May 30 10:55 mv
-rwxr-xr-x 1 root root 177360 Nov 12 2010 nano
-rwxr-xr-x 1 root root 127816 Aug 24 2010 netstat
-rwxr-xr-x 1 root root 28816 May 30 10:55 nice
lrwxrwxrwx. 1 root root 8 Oct 13 10:36 nisdomainname -> hostname
-rwxr-xr-x 1 root root 53576 Sep 12 13:32 ntfs-3g
-rwxr-xr-x 1 root root 11016 Sep 12 13:32 ntfs-3g.probe
-?????????? ? ? ? ? ? ntfs-3g.secaudit
-?????????? ? ? ? ? ? ntfs-3g.usermap
-rwxr-xr-x 1 root root 29896 Sep 12 13:32 ntfscat
-rwxr-xr-x 1 root root 32992 Sep 12 13:32 ntfsck
-?????????? ? ? ? ? ? ntfscluster
-rwxr-xr-x 1 root root 36320 Sep 12 13:32 ntfscmp
-?????????? ? ? ? ? ? ntfsdecrypt
-?????????? ? ? ? ? ? ntfsdump_logfile
-?????????? ? ? ? ? ? ntfsfix
-rwxr-xr-x 1 root root 57240 Sep 12 13:32 ntfsinfo
-?????????? ? ? ? ? ? ntfsls
-rwxr-xr-x 1 root root 30448 Sep 12 13:32 ntfsmftalloc
l?????????? ? ? ? ? ? ntfsmount
-rwxr-xr-x 1 root root 34000 Sep 12 13:32 ntfsmove
-?????????? ? ? ? ? ? ntfstruncate
-rwxr-xr-x 1 root root 42240 Sep 12 13:32 ntfswipe
-rwsr-xr-x 1 root root 41432 Nov 11 2010 ping
-rwsr-xr-x 1 root root 36256 Nov 11 2010 ping6
-rwxr-xr-x 1 root root 35640 Oct 31 2010 plymouth
-rwxr-xr-x 1 root root 86776 Nov 11 2010 ps
-rwxr-xr-x 1 root root 31656 May 30 10:55 pwd
-rwxr-xr-x 1 root root 11528 Jun 25 02:46 raw
-rwxr-xr-x 1 root root 40056 May 30 10:55 readlink
-rwxr-xr-x 2 root root 53352 Nov 11 2010 red
-rwxr-xr-x. 1 root root 576 Apr 16 2008 redhat_lsb_init
-rwxr-xr-x 1 root root 57504 May 30 10:55 rm
-rwxr-xr-x 1 root root 40544 May 30 10:55 rmdir
lrwxrwxrwx. 1 root root 4 Oct 13 10:39 rnano -> nano
-rwxr-xr-x 1 root root 29904 Nov 11 2010 rpm
lrwxrwxrwx. 1 root root 2 Oct 13 10:59 rvi -> vi
lrwxrwxrwx. 1 root root 2 Oct 13 10:59 rview -> vi
-rwxr-xr-x 1 root root 72248 Aug 22 2010 sed
-rwxr-xr-x 1 root root 42312 Nov 11 2010 setfont
-rwxr-xr-x 1 root root 23600 Aug 22 2010 setserial
lrwxrwxrwx. 1 root root 4 Oct 13 10:36 sh -> bash
-rwxr-xr-x 1 root root 27880 May 30 10:55 sleep
-rwxr-xr-x 1 root root 99000 May 30 10:55 sort
-rwxr-xr-x 1 root root 65864 May 30 10:55 stty
-rwsr-xr-x 1 root root 36440 May 30 10:55 su
-rwxr-xr-x 1 root root 25464 May 30 10:55 sync
-rwxr-xr-x 1 root root 384920 Nov 11 2010 tar
-rwxr-xr-x 1 root root 14808 Jun 25 02:46 taskset
-rwxr-xr-x 1 root root 391288 Jun 25 02:05 tcsh
-rwxr-xr-x 1 root root 51952 May 30 10:55 touch
-rwxr-xr-x. 1 root root 11392 Nov 11 2010 tracepath
-rwxr-xr-x. 1 root root 12304 Nov 11 2010 tracepath6
-rwxr-xr-x 1 root root 57384 Nov 11 2010 traceroute
lrwxrwxrwx. 1 root root 10 Oct 13 10:39 traceroute6 -> traceroute
-rwxr-xr-x 1 root root 24592 May 30 10:55 true
-rwsr-xr-x. 1 root root 49280 Jun 25 02:46 umount
-rwxr-xr-x 1 root root 27808 May 30 10:55 uname
-rwxr-xr-x. 1 root root 2555 Nov 11 2010 unicode_start
-rwxr-xr-x. 1 root root 363 Nov 11 2010 unicode_stop
-rwxr-xr-x 1 root root 26264 May 30 10:55 unlink
-rwxr-xr-x 1 root root 10208 Jun 25 00:09 usleep
-rwxr-xr-x 1 root root 771800 Jun 25 04:43 vi
lrwxrwxrwx. 1 root root 2 Oct 13 10:59 view -> vi
lrwxrwxrwx. 1 root root 8 Oct 13 10:36 ypdomainname -> hostname
-rwxr-xr-x. 1 root root 62 Nov 11 2010 zcat

Here is the rough partition information for my main drive:

/boot primary ext3 1gb /dev/sda1
/dev/sda2 extended lvm pv 925gb
vg_gibson lvm-volumegroup 925gb
/ lv_root ext3 36gb
swap lv_swap 2gb

Server Specs:
Dell Poweredge R710
32GB ECC Unbuffered Ram
2x Intel Xeon Quad Core HT 2.3Ghz (16 "cores" total)
2x 1TB WD SCSI Drives in Raid-1

Drive Nitty Gritty:
Product ID: WDC WD1002FBYS-0
Revision: 0C06
Size: 953344MB

Here's some more information about the RAID controller, also obtained from the RAID controller config utility:
Product Name: PERC 6/i
Package: 6.2.0-0013
FW Version: 1.22.02-0612
BIOS Version: 2.04.00
CtrlR Version: 1.02-015B
Boot Block: 1.00.00.01-0011

Application & OS Specs:
CentOS 6 w/2.6.32-131 M.A.Young centos6 xen dom0 kernel

Diagnostic Attempts and Results:
I've done a consistency check on the RAID array and everything comes back as clean and optimal. I've run checks for bad blocks, partition table corruption, and MBR corruption, everything I can think of. It all comes back as clean and working fine. Because of these results I have not been able to force my dedicated hosting company to replace any of the hardware. They are upgrading the RAID controller software, as it's about 1 minor version out of date, just to see if that could be the issue. I'll report back if that mysteriously fixes it, but I'm not holding my breath.

I've read somewhere that the 2.6.x kernels have an old version of the megaraid_sas module that will cause problems, but the version included in the M.A.Young centos6 kernel is version 5.3, which is far beyond the 4.3 version that article recommends upgrading to, so I'm really at a loss. Besides the version being so new, the problem described in that article (the kernel not finding the drive at all on boot) is not the issue I'm having.
It just freaks out randomly (I'm sure it's not really random, it just appears that way), the OS swaps to read-only mode, and the only way to reboot is basically to push the button on the front of the box.

Please, if anyone can direct me towards a solution, or at least down a path I have yet to try, I would greatly appreciate it. I'm at my wits' end. I've been fighting this mysterious monster for over a month now, and it always seems to strike right before I'm about to go live with my services (the first time it happened was right after I started adding customers to the box).

Thanks in advance,
David

--
------------------------------------------------------------------
Steven C. Timm, Ph.D (630) 840-8525
timm@xxxxxxxx http://home.fnal.gov/~timm/
Fermilab Computing Division, Scientific Computing Facilities,
Grid Facilities Department, FermiGrid Services Group, Group Leader.
Lead of FermiCloud project.

_______________________________________________
Xen-users mailing list
Xen-users@xxxxxxxxxxxxxxxxxxx
http://lists.xensource.com/xen-users
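
For anyone comparing a similar log against this thread, here is a minimal Python sketch (not part of either message) that tallies the megasas RESET lines quoted from /var/log/messages above by SCSI opcode. It assumes the driver prints the opcode after cmd= in hexadecimal; under that assumption cmd=0 is TEST UNIT READY and cmd=2a is WRITE(10) in the standard SCSI command set. The names RESET_RE, OPCODES, summarize_resets and SAMPLE are hypothetical and chosen only for illustration.

```python
import re
from collections import Counter

# Assumption: the value after "cmd=" is the SCSI opcode in hexadecimal.
RESET_RE = re.compile(r"megasas: RESET \S+ cmd=([0-9a-fA-F]+) retries=")

# A few standard SCSI opcodes; anything else is reported by its number.
OPCODES = {0x00: "TEST UNIT READY", 0x28: "READ(10)", 0x2A: "WRITE(10)"}


def summarize_resets(lines):
    """Count megasas RESET log lines, grouped by the command being reset."""
    counts = Counter()
    for line in lines:
        match = RESET_RE.search(line)
        if match is None:
            continue  # not a RESET line (e.g. "reset successful")
        opcode = int(match.group(1), 16)
        counts[OPCODES.get(opcode, f"opcode 0x{opcode:02x}")] += 1
    return counts


# Three lines copied from the log excerpt in the message above.
SAMPLE = [
    "Oct 17 13:21:20 gibson kernel: sd 0:2:0:0: [sda] megasas: RESET -85512 cmd=0 retries=0",
    "Oct 17 13:21:21 gibson kernel: megasas: reset successful",
    "Oct 17 13:21:21 gibson kernel: sd 0:2:0:0: [sda] megasas: RESET -85512 cmd=2a retries=0",
]

if __name__ == "__main__":
    for name, count in summarize_resets(SAMPLE).items():
        print(f"{name}: {count}")
```

To run it against a real log, pass an open file object instead of SAMPLE, for example summarize_resets(open("/var/log/messages")). A tally dominated by write commands would only say which requests were timing out; it would not by itself separate a hardware fault from a driver problem.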