
[Xen-bugs] [Bug 1486] New: dom0 crashes under heavy network load



http://bugzilla.xensource.com/bugzilla/show_bug.cgi?id=1486

           Summary: dom0 crashes under heavy network load
           Product: Xen
           Version: unstable
          Platform: x86-64
        OS/Version: Linux
            Status: NEW
          Severity: major
          Priority: P2
         Component: Hypervisor
        AssignedTo: xen-bugs@xxxxxxxxxxxxxxxxxxx
        ReportedBy: uk@xxxxxxxxxxxxx
                CC: uk@xxxxxxxxxxxxx


On a Dell PE-R710 with the bnx2 network driver (also tested with an e1000 card, which
crashes as well when the onboard bnx2 is disabled, so I do not think this is a NIC
driver issue), dom0 crashes completely under heavy, constant network and disk load
(produced in dom0 and in one domU). The crash is reproducible faster with an additional
rsync, which also generates disk I/O.
In my test scenario, 60 domUs were started, each with 6 virtual disks and 2 virtual
network interfaces, i.e. 8 backend devices per domU.
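
For reference, a minimal sketch of what one such domU configuration could look like
(xm-style config file; the guest name, LVM paths, bridges and MAC addresses are
illustrative placeholders, not the actual values used):

# illustrative only: one guest with 6 disk backends and 2 network backends
name   = "testdomu01"
memory = 512
vcpus  = 1
disk   = [ 'phy:/dev/vg0/domu01-disk1,xvda,w',
           'phy:/dev/vg0/domu01-disk2,xvdb,w',
           'phy:/dev/vg0/domu01-disk3,xvdc,w',
           'phy:/dev/vg0/domu01-disk4,xvdd,w',
           'phy:/dev/vg0/domu01-disk5,xvde,w',
           'phy:/dev/vg0/domu01-disk6,xvdf,w' ]
vif    = [ 'mac=00:16:3e:00:01:01,bridge=xenbr0',
           'mac=00:16:3e:00:01:02,bridge=xenbr1' ]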

Test scenario, using netcat to produce constant network load (just zero bytes in
this case):
my.dom0 #: nc -l -p 1234 | pv > /dev/null
external.host #: cat /dev/zero | pv | nc ip.of.my.dom0 1234
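
The same pattern sketches how the load toward the domU (mentioned above) can be
generated; ip.of.my.domu is a placeholder analogous to ip.of.my.dom0:
my.domu #: nc -l -p 1234 | pv > /dev/null
external.host #: cat /dev/zero | pv | nc ip.of.my.domu 1234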

Then I ran an additional rsync loop to produce network and disk I/O:
my.dom0 #:
for i in $(seq 1 1000); do
    echo "============== run $i ============" >> rsync-runs.txt
    rm -rfv /var/spool/test/*    # clear the target before each run
    rsync -avP --numeric-ids --password-file=/etc/rsyncd.secrets \
        user@xxxxxxxxxxxxx::source/* /var/spool/test/
done
...which copies roughly 1 GB of data per run.

The crash occurs after a few minutes or, sometimes, only after several hours; with the
e1000 card it took 84 rsync runs (I do not know exactly how long that was, as the
machine crashed overnight). I think I can crash the machine faster using the bnx2 card.

Here, the unstable kernel 2.6.27.5 from xenbits was used, but this issue also
affects older versions.

Stacktrace:
Jul  9 19:34:20 xh132 kernel: ------------[ cut here ]------------
Jul  9 19:34:20 xh132 kernel: WARNING: at net/sched/sch_generic.c:219
dev_watchdog+0x13c/0x1e9()
Jul  9 19:34:20 xh132 kernel: NETDEV WATCHDOG: eth0 (bnx2): transmit timed out
Jul  9 19:34:20 xh132 kernel: Modules linked in: iptable_filter(N) ip_tables(N)
x_tables(N) bridge(N) stp(N) llc(N) loop(N) dm_mod(N) 8021q(N) bonding(N)
dcdbas(N)
Jul  9 19:34:20 xh132 kernel: Supported: No
Jul  9 19:34:20 xh132 kernel: Pid: 0, comm: swapper Tainted: G         
2.6.27.5-xen0-he+4 #7
Jul  9 19:34:20 xh132 kernel:
Jul  9 19:34:20 xh132 kernel: Call Trace:
Jul  9 19:34:20 xh132 kernel: <IRQ>  [<ffffffff8022b3d7>]
warn_slowpath+0xb4/0xde
Jul  9 19:34:20 xh132 kernel: [<ffffffff80552b00>] __down_read+0xb6/0x110
Jul  9 19:34:20 xh132 kernel: [<ffffffff804d6999>] neigh_lookup+0xb0/0xc0
Jul  9 19:34:20 xh132 kernel: [<ffffffff804cafd2>] skb_queue_tail+0x17/0x3e
Jul  9 19:34:20 xh132 kernel: [<ffffffff8020d6de>] get_nsec_offset+0x9/0x2c
Jul  9 19:34:20 xh132 kernel: [<ffffffff8020d7ff>] local_clock+0x48/0x99
Jul  9 19:34:20 xh132 kernel: [<ffffffff8020d6de>] get_nsec_offset+0x9/0x2c
Jul  9 19:34:20 xh132 kernel: [<ffffffff8020d7ff>] local_clock+0x48/0x99
Jul  9 19:34:20 xh132 kernel: [<ffffffff8020d96f>] sched_clock+0x15/0x36
Jul  9 19:34:20 xh132 kernel: [<ffffffff80241ef5>] sched_clock_cpu+0x290/0x2b9
Jul  9 19:34:20 xh132 kernel: [<ffffffff8020dfea>] timer_interrupt+0x409/0x41d
Jul  9 19:34:20 xh132 kernel: [<ffffffff804ded1f>] dev_watchdog+0x13c/0x1e9
Jul  9 19:34:20 xh132 kernel: [<ffffffffa0038b31>] br_fdb_cleanup+0x0/0xd5
[bridge]
Jul  9 19:34:20 xh132 kernel: [<ffffffff802347c8>] __mod_timer+0xc7/0xd5
Jul  9 19:34:20 xh132 kernel: [<ffffffff804debe3>] dev_watchdog+0x0/0x1e9
Jul  9 19:34:20 xh132 kernel: [<ffffffff80234131>]
run_timer_softirq+0x16c/0x211
Jul  9 19:34:20 xh132 kernel: [<ffffffff8024f132>] handle_percpu_irq+0x53/0x6f
Jul  9 19:34:20 xh132 kernel: [<ffffffff8022fee0>] __do_softirq+0x92/0x13b
Jul  9 19:34:20 xh132 kernel: [<ffffffff8020b37c>] call_softirq+0x1c/0x28
Jul  9 19:34:20 xh132 kernel: [<ffffffff8020d1c3>] do_softirq+0x55/0xbb
Jul  9 19:34:20 xh132 kernel: [<ffffffff8020ae3e>]
do_hypervisor_callback+0x1e/0x30
Jul  9 19:34:20 xh132 kernel: <EOI>  [<ffffffff8020d6af>]
xen_safe_halt+0xb3/0xd9
Jul  9 19:34:20 xh132 kernel: [<ffffffff802105b3>] xen_idle+0x2e/0x67
Jul  9 19:34:20 xh132 kernel: [<ffffffff80208dfe>] cpu_idle+0x57/0x75
Jul  9 19:34:20 xh132 kernel:
Jul  9 19:34:20 xh132 kernel: ---[ end trace a04b8dccc5213f7d ]---
Jul  9 19:34:20 xh132 kernel: bnx2: eth0 NIC Copper Link is Down
Jul  9 19:34:20 xh132 kernel: bonding: bond0: link status down for active
interface eth0, disabling it in 200 ms.
Jul  9 19:34:20 xh132 kernel: bonding: bond0: link status definitely down for
interface eth0, disabling it
Jul  9 19:34:20 xh132 kernel: device eth0 left promiscuous mode
Jul  9 19:34:20 xh132 kernel: bonding: bond0: now running without any active
interface !

Please let me know if you need any further information; I hope you can help.

Many thanks in advance, 
best regards,
Ulf Kreutzberg

