[Xen-devel] Re: blocking Xen 3.X production use: soft lockup bugs
Here are some examples of the sort of soft lockups I'm seeing -- I can't say right now if they've all been showing the same stack trace, but I'll keep an eye on that from now on. I know they haven't all been on the same CPU.

Anything else anyone needs, just let me know -- and I'd like to reaffirm my earlier offer of access to one of these machines.

I'm also starting to think a XenSource wiki page "how to report/workaround soft lockups" might be in order; I suspect many of the bug reports (including my own) haven't been detailed enough to differentiate between the various things that can cause soft lockups.

This was on an IBM x330.

Steve

n4h34:~# xm create -c /etc/xen/auto/build2.t7a.org
Using config file "/etc/xen/auto/build2.t7a.org".
Started domain build2.t7a.org
Linux version 2.6.16.13-xen (root@n4h33) (gcc version 3.3.5 (Debian 1:3.3.5-12)) #2 SMP Sun Jun 11 14:25:16 PDT 2006
BIOS-provided physical RAM map:
 Xen: 0000000000000000 - 0000000008000000 (usable)
0MB HIGHMEM available.
136MB LOWMEM available.
ACPI in unprivileged domain disabled
IRQ lockup detection disabled
Built 1 zonelists
Kernel command line: root=/dev/sda1 2
Enabling fast FPU save and restore... done.
Enabling unmasked SIMD FPU exception support... done.
Initializing CPU#0
PID hash table entries: 1024 (order: 10, 16384 bytes)
Xen reported: 1130.113 MHz processor.
Dentry cache hash table entries: 32768 (order: 5, 131072 bytes)
Inode-cache hash table entries: 16384 (order: 4, 65536 bytes)
Software IO TLB disabled
vmalloc area: c9000000-fb7fe000, maxmem 33ffe000
Memory: 114612k/139264k available (3368k kernel code, 16308k reserved, 1033k data, 196k init, 0k highmem)
Checking if this processor honours the WP bit even in supervisor mode... Ok.
Calibrating delay using timer specific routine..
2261.96 BogoMIPS (lpj=11309833)
Security Framework v1.0.0 initialized
Capability LSM initialized
Mount-cache hash table entries: 512
CPU: L1 I cache: 16K, L1 D cache: 16K
CPU: L2 cache: 512K
Checking 'hlt' instruction... OK.
Brought up 1 CPUs
migration_cost=0
checking if image is initramfs... it is
Freeing initrd memory: 9535k freed
Grant table initialized
NET: Registered protocol family 16
Brought up 1 CPUs
PCI: setting up Xen PCI frontend stub
ACPI: Subsystem revision 20060127
ACPI: Interpreter disabled.
Linux Plug and Play Support v0.97 (c) Adam Belay
xen_mem: Initialising balloon driver.
SCSI subsystem initialized
usbcore: registered new driver usbfs
usbcore: registered new driver hub
PCI: System does not support PCI
PCI: System does not support PCI
IA-32 Microcode Update Driver: v1.14-xen <tigran@xxxxxxxxxxx>
VFS: Disk quotas dquot_6.5.1
Dquot-cache hash table entries: 1024 (order 0, 4096 bytes)
JFS: nTxBlock = 1024, nTxLock = 8192
SGI XFS with ACLs, security attributes, realtime, large block numbers, no debug enabled
Initializing Cryptographic API
io scheduler noop registered
io scheduler anticipatory registered (default)
io scheduler deadline registered
io scheduler cfq registered
PNP: No PS/2 controller found. Probing ports directly.
i8042.c: No controller found.
RAMDISK driver initialized: 16 RAM disks of 16384K size 1024 blocksize
Xen virtual console successfully installed as tty1
Event-channel device installed.
blkif_init: reqs=64, pages=704, mmap_vstart=0xc7400000
netfront: Initialising virtual ethernet driver.
Uniform Multi-Platform E-IDE driver Revision: 7.00alpha2
ide: Assuming 50MHz system bus speed for PIO modes; override with idebus=xx
Registering block device major 8
ide-floppy driver 0.99.newide
Fusion MPT base driver 3.03.07
Copyright (c) 1999-2005 LSI Logic Corporation
Fusion MPT SPI Host driver 3.03.07
Fusion MPT misc device (ioctl) driver 3.03.07
mptctl: Registered with Fusion MPT base driver
mptctl: /dev/mptctl @ (major,minor=10,220)
usbmon: debugfs is not available
usbcore: registered new driver libusual
mice: PS/2 mouse device common for all mice
md: md driver 0.90.3 MAX_MD_DEVS=256, MD_SB_DISKS=27
md: bitmap version 4.39
NET: Registered protocol family 2
IP route cache hash table entries: 2048 (order: 1, 8192 bytes)
TCP established hash table entries: 8192 (order: 4, 65536 bytes)
TCP bind hash table entries: 8192 (order: 4, 65536 bytes)
TCP: Hash tables configured (established 8192 bind 8192)
TCP reno registered
Initializing IPsec netlink socket
NET: Registered protocol family 1
NET: Registered protocol family 17
NET: Registered protocol family 8
NET: Registered protocol family 20
Using IPI No-Shortcut mode
Freeing unused kernel memory: 196k freed
Loading, please wait...
Begin: Loading essential drivers... ...
tg3: no version for "struct_module" found: kernel tainted.
eepro100.c:v1.09j-t 9/29/99 Donald Becker http://www.scyld.com/network/eepro100.html
eepro100.c: $Revision: 1.36 $ 2000/11/17 Modified by Andrey V. Savochkin <saw@xxxxxxxxxxxxx> and others
Intel(R) PRO/1000 Network Driver - version 6.3.9-k4
Copyright (c) 1999-2005 Intel Corporation.
Done.
Begin: Running /scripts/init-premount ...
FATAL: Error inserting fan (/lib/modules/2.6.16.13-xen/kernel/drivers/acpi/fan.ko): No such device
FATAL: Error inserting thermal (/lib/modules/2.6.16.13-xen/kernel/drivers/acpi/thermal.ko): No such device
Done.
Begin: Mounting root file system... ...
Begin: Running /scripts/local-top ...
Done.
Begin: Running /scripts/local-premount ...
Done.
kjournald starting.
Commit interval 5 seconds
EXT3-fs: mounted filesystem with ordered data mode.
Begin: Running /scripts/log-bottom ...
Done.
Done.
Begin: Running /scripts/init-bottom ...
Done.
mount: Mounting /sys on /root/sys failed: No such file or directory
INIT: version 2.85 booting
Activating swap.
Checking root file system...
fsck 1.39 (29-May-2006)
/dev/sda1: clean, 21526/917504 files, 245920/1835007 blocks
EXT3 FS on sda1, internal journal
System time was Wed Aug 2 22:17:34 UTC 2006.
Setting the System Clock using the Hardware Clock as reference...
System Clock set. System local time is now Wed Aug 2 22:17:37 UTC 2006.
Loading device-mapper support.
Checking all file systems...
fsck 1.39 (29-May-2006)
Setting kernel variables..
Mounting local filesystems...
Adding 524280k swap on /swap00. Priority:-1 extents:134 across:533176k
Cleaning /tmp /var/run /var/lock.
Running 0dns-down to make sure resolv.conf is ok...done.
Cleaning: /etc/network/ifstate.
Setting up IP spoofing protection: rp_filter.
Configuring network interfaces...done.
Loading the saved-state of the serial devices...
/dev/ttyS0: No such file or directory
/dev/ttyS0: No such file or directory
/dev/ttyS1: No such file or directory
/dev/ttyS1: No such file or directory
Not setting System Clock
Initializing random number generator...done.
Recovering nvi editor sessions... done.
INIT: Entering runlevel: 2
Starting isconf daemonRunning isconf updateisconf: info: build2.t7a.org is on guest-1 branch
isconf: info: may reboot...
isconf: info: checking for updates
isconf: info: fetching http://10.27.4.7:65028/t7a.org/block/fb2/fb2e8177e647be52a1c64e21fcb913455c71e731-8b3a10ecde5fc43984807e34550a2ebd-1?challenge=0.911958506882
isconf: info: fetching http://10.27.4.7:65028/t7a.org/block/fb2/fb2e8177e647be52a1c64e21fcb913455c71e731-8b3a10ecde5fc43984807e34550a2ebd-1?challenge=0.999292957677
isconf: info: fetching http://10.27.4.7:65028/t7a.org/block/fb2/fb2e8177e647be52a1c64e21fcb913455c71e731-8b3a10ecde5fc43984807e34550a2ebd-1?challenge=0.239902520967
BUG: soft lockup detected on CPU#0!

Pid: 2383, comm: isconf
EIP: 0073:[<080c9763>] CPU: 0
EIP is at 0x80c9763
 ESP: 007b:bfcc962c EFLAGS: 00200282 Tainted: GF (2.6.16.13-xen #2)
EAX: 00000001 EBX: 0000003a ECX: bfcc9624 EDX: 00000000
ESI: 08137cb4 EDI: 00000001 EBP: bfcc9638 DS: 007b ES: 007b
CR0: 80050033 CR2: b7b97000 CR3: 0055e000 CR4: 00000640
isconf: info: fetching http://10.27.4.34:65028/t7a.org/block/ff1/ff1276f7811aeeade18d54a6c3578261ff36ecbb-4fb47b36cda57ae95af56372f03bb2ca-1?challenge=0.265409462016
isconf: info: updated /etc/ldap/ldap.conf
BUG: soft lockup detected on CPU#0!

Pid: 2383, comm: isconf
EIP: 0073:[<080af84d>] CPU: 0
EIP is at 0x80af84d
 ESP: 007b:bfcc96d0 EFLAGS: 00200246 Tainted: GF (2.6.16.13-xen #2)
EAX: 00000001 EBX: 082031fe ECX: 082031fe EDX: b7af1f8c
ESI: 00000000 EDI: 082030ec EBP: bfcc9838 DS: 007b ES: 007b
CR0: 80050033 CR2: b7b97000 CR3: 0055e000 CR4: 00000640
isconf: info: fetching http://10.27.4.7:65028/t7a.org/block/c0e/c0e10bc50572deb89da6e9d96ac5971a39fddc65-fc3558eaffc90497248f97f9b0e3a924-1?challenge=0.130730726051
isconf: info: updated /etc/ca-certificates.conf
isconf: info: running ['update-ca-certificates']
Updating certificates in /etc/ssl/certs....done.
isconf: info: updated /etc/ldap/ldap.conf
BUG: soft lockup detected on CPU#0!

Pid: 1, comm: init
EIP: 0061:[<c0322fe1>] CPU: 0
EIP is at netif_poll+0x101/0x810
 EFLAGS: 00000216 Tainted: GF (2.6.16.13-xen #2)
EAX: 00000037 EBX: c0945180 ECX: 0001134e EDX: c0945000
ESI: c0f48280 EDI: c0f499e8 EBP: c09451c0 DS: 007b ES: 007b
CR0: 8005003b CR2: b7d579e0 CR3: 0057e000 CR4: 00000640
 [<c03d891a>] net_rx_action+0xea/0x230
 [<c0124cb5>] __do_softirq+0xf5/0x120
 [<c0124d75>] do_softirq+0x95/0xa0
 [<c0106c0f>] do_IRQ+0x1f/0x30
 [<c0312f58>] evtchn_do_upcall+0xa8/0xf0
 [<c0105178>] hypervisor_callback+0x2c/0x34
 [<c02c2081>] __copy_user_intel+0x31/0xb0
 [<c02c2220>] __copy_to_user_ll+0x70/0x80
 [<c02c22f2>] copy_to_user+0x42/0x60
 [<c0171068>] cp_new_stat64+0xf8/0x110
 [<c01710b7>] sys_stat64+0x37/0x40
 [<c0104fb5>] syscall_call+0x7/0xb
isconf: warning: clierr: Connection reset by peer
Starting system log daemon: syslogd.
Starting kernel log daemon: klogd.
No configuration file was found for slapd at /etc/ldap/slapd.conf.
If you have moved the slapd configuration file please modify /etc/default/slapd to reflect this. If you chose to not configure slapd during installation then you need to do so prior to attempting to start slapd. An example slapd.conf is in /usr/share/slapd
Starting Heimdal KDC: heimdal-kdc.
Starting Heimdal password server: kpasswdd.
Starting internet superserver: inetd.
Starting PCMCIA services: module directory /lib/modules/2.6.16.13-xen/pcmcia not found.
Starting OpenBSD Secure Shell server: sshd.
Starting deferred execution scheduler: atd.
Starting periodic command scheduler: cron.

Debian GNU/Linux testing/unstable build2.t7a.org tty1

build2.t7a.org login:

On Wed, Aug 02, 2006 at 01:54:49PM -0700, Steve Traugott wrote:
> Hi All,
>
> I hate to say it, but it's starting to look like soft lockup bug(s)
> are turning into a serious roadblock for general production use of Xen
> 3.X, on a wide range of hardware.
> I've been using Xen since the 1.0
> days, and I have to say that this is the most serious showstopper bug
> I've ever hit -- it usually manifests itself during the first
> significant network and/or disk I/O after starting a second or third
> domU on the same box, and is the only bug I've ever hit that has
> caused permanent damage -- it tends to corrupt guest filesystems. In
> my case it's stopped a deployment dead in its tracks, and our only
> options at this point are to go back to Xen 2.X or (horrors) to native
> Linux kernels.
>
> The problem (or something that looks identical) is described in
> several tickets, status currently NEW or REOPENED, no clear
> resolution:
> http://bugzilla.xensource.com/bugzilla/show_bug.cgi?id=543
> http://bugzilla.xensource.com/bugzilla/show_bug.cgi?id=690
> http://bugzilla.xensource.com/bugzilla/show_bug.cgi?id=697
> http://bugzilla.xensource.com/bugzilla/show_bug.cgi?id=705
>
> In our own shop, we consistently hit soft lockups while running on
> both IBM x330's and older Netengines (similar to an IBM 4000R). We've
> found no workaround. We're on xen-3.0-testing, changeset 9732, kernel
> 2.6.16.13. On April 6th, Keir posted a note saying this was fixed as
> of a blkif_schedule() fix, which we already have because that was way
> back in changeset 9587...
> http://lists.xensource.com/archives/html/xen-devel/2006-04/msg00121.html.
>
> The most recent devel list traffic I've found which covers this is
> July 7th:
> http://lists.xensource.com/archives/html/xen-users/2006-07/msg00134.html
> ...this message referred back to Keir's comment as describing a fix,
> but it doesn't look true; while Keir's 9587 checkin may have fixed a
> soft lockup problem, there appear to be more out there, or else
> there's been a regression.
>
> Do we have any consensus that this bug is fixed at all in
> xen-3.0-testing, or even unstable? Is anyone who was hitting soft
> lockups in testing *not* hitting them any more on the same hardware?
> If so, what changeset are you on now?
>
> If anyone needs any more information, just let me know. As usual, if
> anyone wants login and console server access to one of these boxes to
> chase this down, I'm more than happy to provide that.
>
> Thanks,
>
> Steve
> --
> Stephen G. Traugott (KG6HDQ)
> UNIX/Linux Infrastructure Architect, TerraLuna LLC
> stevegt@xxxxxxxxxxxxx
> http://www.stevegt.com -- http://Infrastructures.Org

--
Stephen G. Traugott (KG6HDQ)
UNIX/Linux Infrastructure Architect, TerraLuna LLC
stevegt@xxxxxxxxxxxxx
http://www.stevegt.com -- http://Infrastructures.Org

_______________________________________________
Xen-devel mailing list
Xen-devel@xxxxxxxxxxxxxxxxxxx
http://lists.xensource.com/xen-devel
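
For anyone trying to tell soft lockup reports apart (same CPU or not, same code path or not), here is a rough sketch of how a captured console log could be summarized. It is not an existing Xen or kernel tool: the function names `parse_soft_lockups` and `summarize` are made up for illustration, and the regular expressions assume only the 2.6.16-style report layout visible in the log in this message ("BUG: soft lockup detected on CPU#N!", then "Pid: ..., comm: ...", then "EIP is at ..."). Other kernel versions may print these fields differently.

```python
import re
from collections import Counter

# Patterns assumed from the report format shown in this thread's log.
BANNER = re.compile(r"BUG: soft lockup detected on CPU#(\d+)!")
PID_COMM = re.compile(r"Pid:\s*(\d+),\s*comm:\s*(\S+)")
EIP_AT = re.compile(r"EIP is at (\S+)")


def parse_soft_lockups(log_text):
    """Return one dict per soft lockup report found in log_text."""
    banners = list(BANNER.finditer(log_text))
    reports = []
    for i, banner in enumerate(banners):
        # A report runs from its banner to the next banner (or end of log).
        end = banners[i + 1].start() if i + 1 < len(banners) else len(log_text)
        body = log_text[banner.end():end]
        pid_comm = PID_COMM.search(body)
        eip = EIP_AT.search(body)
        location = eip.group(1) if eip else None
        reports.append({
            "cpu": int(banner.group(1)),
            "pid": int(pid_comm.group(1)) if pid_comm else None,
            "comm": pid_comm.group(2) if pid_comm else None,
            "eip": location,
            # Drop "+offset/size" so hits in the same function group together.
            "symbol": location.split("+")[0] if location else None,
        })
    return reports


def summarize(reports):
    """Count reports per (comm, symbol) pair."""
    return Counter((r["comm"], r["symbol"]) for r in reports)


# Abbreviated excerpt of the three reports in the console log above.
SAMPLE = """\
BUG: soft lockup detected on CPU#0!
Pid: 2383, comm: isconf
EIP: 0073:[<080c9763>] CPU: 0
EIP is at 0x80c9763
BUG: soft lockup detected on CPU#0!
Pid: 2383, comm: isconf
EIP: 0073:[<080af84d>] CPU: 0
EIP is at 0x80af84d
BUG: soft lockup detected on CPU#0!
Pid: 1, comm: init
EIP: 0061:[<c0322fe1>] CPU: 0
EIP is at netif_poll+0x101/0x810
"""

if __name__ == "__main__":
    for (comm, symbol), count in summarize(parse_soft_lockups(SAMPLE)).items():
        print(f"{count} x comm={comm} at {symbol}")
```

Grouping by command name and the symbol at EIP is only one possible signature. Reports where EIP is a bare address (no symbol resolved) will each land in their own group, so for those it may make more sense to compare the first resolved frame of the stack trace instead.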