
[Xen-devel] Lockup in netback - Xen 4.1.2 (XS 6.0.2 hotfix 7)



Hi all-
I am seeing an intermittent networking lockup on my machines as soon
as I apply a network load.  In a pool of 80, the first machine generally
locks up within 15-20 minutes of the workload starting.  The symptom
is a long list of messages like the following in /var/log/messages:

Aug 16 18:32:49 localhost kernel: netback[1]: TXP193 is DMA mapped
Aug 16 18:32:49 localhost kernel: netback[1]: TXP211 is DMA mapped
Aug 16 18:32:49 localhost kernel: netback[1]: TXP232 is DMA mapped
Aug 16 18:32:49 localhost kernel: netback[1]: TXP157 is DMA mapped
Aug 16 18:32:49 localhost kernel: netback[0]: TXP44 is DMA mapped

This seems to clog the networking pipeline, which leads to a stall in
my NIC driver:

Aug 16 18:32:58 localhost kernel: ------------[ cut here ]------------
Aug 16 18:32:58 localhost kernel: WARNING: at net/sched/sch_generic.c:261 dev_watchdog+0x241/0x250()
Aug 16 18:32:58 localhost kernel: Hardware name: C51G,MCP51
Aug 16 18:32:58 localhost kernel: NETDEV WATCHDOG: eth0 (tg3): transmit queue 0 timed out
Aug 16 18:32:58 localhost kernel: Modules linked in: nfs nfs_acl auth_rpcgss sch_htb lockd sunrpc 8021q openvswitch ipt_REJECT nf_conntrack_ipv4 nf_defrag_ipv4 xt_state nf_conntrack xt_tcpudp iptable_filter ip_tables x_tables binfmt_misc nls_utf8 isofs video output sbs sbshc fan container battery ac parport_pc lp parport nvram thermal rtc_cmos processor evdev sg tg3 button thermal_sys rtc_core sata_sil24 rtc_lib serio_raw tpm_tis tpm tpm_bios i2c_nforce2 pcspkr i2c_core ide_generic dm_snapshot dm_zero dm_mirror dm_region_hash dm_log dm_mod sata_nv pata_acpi ata_generic libata sd_mod scsi_mod ext3 jbd uhci_hcd ohci_hcd ehci_hcd usbcore fbcon font tileblit bitblit softcursor
Aug 16 18:32:58 localhost kernel: Pid: 0, comm: swapper Not tainted 2.6.32.12-0.7.1.xs6.0.2.553.170674xen #1
Aug 16 18:32:58 localhost kernel: Call Trace:
Aug 16 18:32:58 localhost kernel:  [<c031a1a1>] ? dev_watchdog+0x241/0x250
Aug 16 18:32:58 localhost kernel:  [<c031a1a1>] ? dev_watchdog+0x241/0x250
Aug 16 18:32:58 localhost kernel:  [<c012e0bc>] warn_slowpath_common+0x7c/0xa0
Aug 16 18:32:58 localhost kernel:  [<c031a1a1>] ? dev_watchdog+0x241/0x250
Aug 16 18:32:58 localhost kernel:  [<c012e126>] warn_slowpath_fmt+0x26/0x30
Aug 16 18:32:58 localhost kernel:  [<c031a1a1>] dev_watchdog+0x241/0x250
Aug 16 18:32:58 localhost kernel:  [<c02188f6>] ? blk_rq_timed_out_timer+0xe6/0x110
Aug 16 18:32:58 localhost kernel:  [<c0137fe1>] run_timer_softirq+0x151/0x200
Aug 16 18:32:58 localhost kernel:  [<c0319f60>] ? dev_watchdog+0x0/0x250
Aug 16 18:32:58 localhost kernel:  [<c013359a>] __do_softirq+0xba/0x180
Aug 16 18:32:58 localhost kernel:  [<c015b657>] ? handle_IRQ_event+0x37/0x100
Aug 16 18:32:58 localhost kernel:  [<c015e774>] ? move_native_irq+0x14/0x50
Aug 16 18:32:58 localhost kernel:  [<c01336d5>] do_softirq+0x75/0x80
Aug 16 18:32:58 localhost kernel:  [<c01339bb>] irq_exit+0x2b/0x40
Aug 16 18:32:58 localhost kernel:  [<c029c7b7>] evtchn_do_upcall+0x1e7/0x330
Aug 16 18:32:58 localhost kernel:  [<c010470f>] hypervisor_callback+0x43/0x4b
Aug 16 18:32:58 localhost kernel:  [<c0107095>] ? xen_safe_halt+0xb5/0x150
Aug 16 18:32:58 localhost kernel:  [<c010adae>] xen_idle+0x1e/0x50
Aug 16 18:32:58 localhost kernel:  [<c0102a7b>] cpu_idle+0x3b/0x60
Aug 16 18:32:58 localhost kernel:  [<c0373c43>] rest_init+0x53/0x60
Aug 16 18:32:58 localhost kernel:  [<c04f5cea>] start_kernel+0x29a/0x340
Aug 16 18:32:58 localhost kernel:  [<c04f55f0>] ? unknown_bootoption+0x0/0x1f0
Aug 16 18:32:58 localhost kernel:  [<c04f507c>] i386_start_kernel+0x7c/0x90
Aug 16 18:32:58 localhost kernel: ---[ end trace 76ea5a31a8fc2f33 ]---

After the NIC driver fails, netback un-stalls itself:

Aug 16 18:33:00 localhost kernel: tg3 0000:01:00.0: tg3_stop_block timed out, ofs=1400 enable_bit=2
Aug 16 18:33:00 localhost kernel: pci 0000:00:02.0: eth0: Link is down
Aug 16 18:33:00 localhost kernel: netback[1]: DMA mapped TXP 203 released
Aug 16 18:33:00 localhost kernel: netback[1]: DMA mapped TXP 212 released
Aug 16 18:33:00 localhost kernel: netback[2]: DMA mapped TXP 94 released
Aug 16 18:33:00 localhost kernel: netback[1]: DMA mapped TXP 159 released
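
One way to see which TX slots were reported as DMA mapped and which were
later released is to tally the two message shapes shown above.  The
following is a minimal Python sketch; it assumes only those two log
formats, and the function name and the idea of keying on the number in
"netback[N]" are illustrative choices, not anything netback provides.

```python
import re
from collections import defaultdict

# The two message shapes that appear in /var/log/messages:
#   netback[1]: TXP193 is DMA mapped
#   netback[1]: DMA mapped TXP 203 released
MAPPED = re.compile(r"netback\[(\d+)\]: TXP\s*(\d+) is DMA mapped")
RELEASED = re.compile(r"netback\[(\d+)\]: DMA mapped TXP\s*(\d+) released")


def tally_dma_mapped(lines):
    """Return (stuck, released).

    Each is a dict mapping the bracketed netback number to a set of TXP
    indices.  'stuck' holds indices reported as DMA mapped and not seen
    released later in the input.
    """
    stuck = defaultdict(set)
    released = defaultdict(set)
    for line in lines:
        m = MAPPED.search(line)
        if m:
            stuck[int(m.group(1))].add(int(m.group(2)))
            continue
        m = RELEASED.search(line)
        if m:
            group, txp = int(m.group(1)), int(m.group(2))
            released[group].add(txp)
            if group in stuck:
                stuck[group].discard(txp)
    return dict(stuck), dict(released)
```

It can be fed the log file directly, for example
`stuck, released = tally_dma_mapped(open("/var/log/messages"))`.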

To get packets moving again I have to connect to the host over a serial
console, rmmod the tg3 driver, modprobe it again, bring the interface up
with ifconfig, and restart OVS.
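
The recovery sequence above could be scripted roughly as follows.  This
is a sketch in Python under stated assumptions: the interface name eth0
and driver tg3 come from the logs, but the OVS restart command is a
placeholder guess and should be replaced with whatever the host actually
uses.  It defaults to a dry run that only prints the commands; a real
run would need root and the serial console, since the network goes down
when the driver is unloaded.

```python
import subprocess


def recovery_commands(iface="eth0", driver="tg3",
                      ovs_restart=("service", "openvswitch", "restart")):
    """Build the command list for the manual recovery steps.

    ovs_restart is an ASSUMED command, not taken from the report;
    pass the host's real OVS restart command instead.
    """
    return [
        ["rmmod", driver],          # unload the stalled NIC driver
        ["modprobe", driver],       # load it again
        ["ifconfig", iface, "up"],  # bring the interface back up
        list(ovs_restart),          # restart Open vSwitch
    ]


def run_recovery(commands, dry_run=True):
    """Print each command; execute it only when dry_run is False."""
    for argv in commands:
        print(" ".join(argv))
        if not dry_run:
            # Stop at the first failing step rather than continuing.
            subprocess.check_call(argv)


if __name__ == "__main__":
    run_recovery(recovery_commands(), dry_run=True)
```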

I've tried a variety of things to debug the problem:
-Turning off all hardware acceleration on the NIC from ethtool
-Different OVS versions
-Using a single dom0 vcpu
-Turning off irqbalance and MSI
-Trying the latest stable kernel in my VMs (3.5.3)
-Trying a newer tg3 driver from the Citrix crew
(http://forums.citrix.com/thread.jspa?threadID=311744)

None of these helped.  I never see the "is DMA mapped" messages under
normal operation, so the problem seems to be whatever causes dom0 to
believe that the memory in the netback/netfront rings is DMA mapped.
If anyone has suggestions on how to approach or solve this, I am open
to ideas; I've spent a couple of weeks on and off on it with no
resolution.  I'm attaching a tarball with all the log messages from the
system in case they help.

Thanks in advance,
David

Attachment: crash_newdriver-1_logs.tgz
Description: GNU Zip compressed data

_______________________________________________
Xen-devel mailing list
Xen-devel@xxxxxxxxxxxxx
http://lists.xen.org/xen-devel

 

